Building a Simple Search Index
Designing an inverted index from scratch with tokenization, ranking, and query parsing, then comparing it against Postgres full-text search.
Context
I built a search index for a documentation site with 5,000 pages. The goal was to understand how search works at a fundamental level, then compare the custom implementation against Postgres full-text search to determine when building your own is justified (rarely).
Problem
Full-text search engines like Elasticsearch and Meilisearch are powerful but operationally expensive for small corpora. Postgres full-text search is good but opaque. Building a search index from scratch reveals the trade-offs that production systems make and helps diagnose search quality issues in any system.
Constraints
- Corpus: 5,000 documentation pages, average 800 words each (4M total words)
- Index storage: must fit in memory (target: under 200MB)
- Query latency: under 50ms for single-term and multi-term queries
- Ranking: relevance-based (TF-IDF), with title matches weighted higher
- Must support: prefix matching, phrase matching, and boolean AND/OR
- No external dependencies (pure TypeScript implementation)
Design
Tokenization Pipeline
Raw Text
-> Lowercase
-> Remove punctuation
-> Split on whitespace
-> Remove stop words (the, a, is, etc.)
-> Apply stemming (Porter stemmer)
-> Output: array of tokens
// Assumes STOP_WORDS (a Set<string> of common words) and stem (a Porter
// stemmer) are defined elsewhere in the module
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, ' ')              // replace punctuation with spaces
    .split(/\s+/)
    .filter(word => word.length > 1)       // drop single characters
    .filter(word => !STOP_WORDS.has(word))
    .map(word => stem(word));
}

Example:
- Input: "Implementing server-side rendering in Next.js"
- After lowercase: "implementing server-side rendering in next.js"
- After punctuation removal: "implementing server side rendering in next js"
- After stop word removal: "implementing server side rendering next js"
- After stemming: "implement server side render next js"
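The pipeline can be run end to end as a self-contained sketch. The stop-word list here is a stub, and a toy suffix-stripping stemmer stands in for the full Porter implementation:

```typescript
// Runnable sketch of the tokenization pipeline. STOP_WORDS is a stub and the
// toy stemmer only strips "ing"/"ed" suffixes; a real index would apply the
// full Porter rules.
const STOP_WORDS = new Set(['the', 'a', 'an', 'is', 'in', 'of', 'to', 'and']);

function stem(word: string): string {
  // Toy stand-in for Porter stemming: "implementing" -> "implement"
  return word.replace(/(ing|ed)$/, '');
}

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, ' ')
    .split(/\s+/)
    .filter(w => w.length > 1)
    .filter(w => !STOP_WORDS.has(w))
    .map(stem);
}

console.log(tokenize('Implementing server-side rendering in Next.js'));
// -> ['implement', 'server', 'side', 'render', 'next', 'js']
```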
Inverted Index Structure
The core data structure maps each token to the list of documents containing it, along with position information:
interface InvertedIndex {
// token -> document postings
[token: string]: Posting[];
}
interface Posting {
docId: string;
frequency: number; // how many times token appears in doc
positions: number[]; // positions of token in doc (for phrase matching)
field: 'title' | 'body'; // which field the token came from
}

Index Construction
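Construction relies on an addToIndex helper that the article doesn't show. One plausible implementation, with the Posting and InvertedIndex types repeated so the sketch compiles standalone, looks like this:

```typescript
// Repeated from above so this sketch is self-contained
interface Posting {
  docId: string;
  frequency: number;
  positions: number[];
  field: 'title' | 'body';
}

interface InvertedIndex {
  [token: string]: Posting[];
}

// Find or create the posting for (token, docId, field), then record one more
// occurrence at the given position
function addToIndex(
  index: InvertedIndex,
  token: string,
  docId: string,
  position: number,
  field: 'title' | 'body',
): void {
  const postings = index[token] ?? (index[token] = []);
  let posting = postings.find(p => p.docId === docId && p.field === field);
  if (!posting) {
    posting = { docId, frequency: 0, positions: [], field };
    postings.push(posting);
  }
  posting.frequency += 1;
  posting.positions.push(position);
}
```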
function buildIndex(documents: Document[]): { index: InvertedIndex; docLengths: Map<string, number> } {
  const index: InvertedIndex = {};
  const docLengths: Map<string, number> = new Map();
  for (const doc of documents) {
    // Index title tokens with field marker
    const titleTokens = tokenize(doc.title);
    for (let i = 0; i < titleTokens.length; i++) {
      addToIndex(index, titleTokens[i], doc.id, i, 'title');
    }
    // Index body tokens
    const bodyTokens = tokenize(doc.body);
    for (let i = 0; i < bodyTokens.length; i++) {
      addToIndex(index, bodyTokens[i], doc.id, i, 'body');
    }
    docLengths.set(doc.id, titleTokens.length + bodyTokens.length);
  }
  // Return docLengths alongside the index: the ranking function needs
  // per-document lengths (and the corpus average) for BM25 normalization
  return { index, docLengths };
}

Ranking: TF-IDF with Field Boosting
function score(
token: string,
posting: Posting,
index: InvertedIndex,
totalDocs: number,
docLength: number,
avgDocLength: number,
): number {
// Term frequency (BM25 variant)
const k1 = 1.2;
const b = 0.75;
const tf = posting.frequency;
const normalizedTf = (tf * (k1 + 1)) /
(tf + k1 * (1 - b + b * (docLength / avgDocLength)));
// Inverse document frequency
const df = index[token].length;
const idf = Math.log((totalDocs - df + 0.5) / (df + 0.5) + 1);
// Field boost
const fieldBoost = posting.field === 'title' ? 3.0 : 1.0;
return normalizedTf * idf * fieldBoost;
}

Query Processing
function search(
  query: string,
  index: InvertedIndex,
  totalDocs: number,
  docLengths: Map<string, number>,
  avgDocLength: number,
  limit: number = 10,
): SearchResult[] {
  // Split the raw query first: tokenize() strips punctuation, so a trailing
  // '*' prefix marker must be detected before tokenization
  const rawTerms = query.toLowerCase().split(/\s+/).filter(t => t.length > 0);
  const scores: Map<string, number> = new Map();
  for (const raw of rawTerms) {
    const isPrefix = raw.endsWith('*');
    // Prefix terms are matched against the dictionary unstemmed; everything
    // else goes through the normal tokenize pipeline
    const matchingTokens = isPrefix
      ? Object.keys(index).filter(t => t.startsWith(raw.slice(0, -1)))
      : tokenize(raw);
    for (const matchToken of matchingTokens) {
      const postings = index[matchToken] || [];
      for (const posting of postings) {
        const current = scores.get(posting.docId) || 0;
        const docLength = docLengths.get(posting.docId) ?? avgDocLength;
        scores.set(
          posting.docId,
          current + score(matchToken, posting, index, totalDocs, docLength, avgDocLength),
        );
      }
    }
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([docId, score]) => ({ docId, score }));
}

Phrase Matching
For phrase queries (quoted strings), verify that tokens appear in sequence using position data:
function phraseMatch(tokens: string[], index: InvertedIndex, docId: string): boolean {
const postingsPerToken = tokens.map(t =>
(index[t] || []).filter(p => p.docId === docId)
);
if (postingsPerToken.some(p => p.length === 0)) return false;
// Check for sequential positions
const firstPositions = postingsPerToken[0].flatMap(p => p.positions);
return firstPositions.some(startPos =>
tokens.every((_, i) =>
postingsPerToken[i].some(p => p.positions.includes(startPos + i))
)
);
}

Trade-offs
Custom Index vs Postgres Full-Text Search
| Metric | Custom Index | Postgres FTS |
|---|---|---|
| Index build time (5,000 docs) | 1.2s | 3.5s (GIN index) |
| Index size in memory | 85MB | N/A (disk-based) |
| Query latency (single term, p50) | 2ms | 12ms |
| Query latency (3 terms, p50) | 8ms | 25ms |
| Phrase matching | Yes (position-based) | Yes (tsquery with <->) |
| Typo tolerance | No | No (without pg_trgm) |
| Stemming quality | Porter (English only) | Snowball (multi-language) |
| Ranking quality | BM25 + field boost | ts_rank (TF-based) |
| Operational overhead | High (custom code) | Low (built-in) |
Relevance Quality (20 test queries, correct result in top 3)
| Query Type | Custom Index | Postgres FTS |
|---|---|---|
| Single keyword | 19/20 | 18/20 |
| Multi-keyword | 17/20 | 16/20 |
| Phrase query | 18/20 | 17/20 |
| Concept (e.g., "how to deploy") | 12/20 | 14/20 |
The custom index performs slightly better on keyword queries due to BM25 ranking (Postgres uses a simpler TF-based ranking by default). Postgres performs better on concept queries because its Snowball stemmer handles more word forms.
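Plugging representative numbers into the BM25 formula from the Design section shows why rare terms dominate the ranking. The values below are illustrative, not measured:

```typescript
// Worked BM25 example with illustrative numbers
const k1 = 1.2;
const b = 0.75;

const tf = 2;
const docLength = 800;      // equal to the corpus average here
const avgDocLength = 800;
const normalizedTf = (tf * (k1 + 1)) /
  (tf + k1 * (1 - b + b * (docLength / avgDocLength)));
// docLength === avgDocLength, so the length penalty is neutral:
// 4.4 / 3.2 = 1.375

const totalDocs = 5000;
const df = 10;              // rare term -> high IDF
const idf = Math.log((totalDocs - df + 0.5) / (df + 0.5) + 1);
// ≈ 6.17

const fieldBoost = 1.0;     // body match; a title match would multiply by 3
console.log(normalizedTf * idf * fieldBoost); // ≈ 8.48
```

A term appearing in half the corpus (df = 2500) would score an IDF of only about 0.7, an order of magnitude lower, which is why common words contribute little even before stop-word removal.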
Memory Usage Breakdown
| Component | Size |
|---|---|
| Token dictionary (unique tokens) | 2.1MB (180,000 tokens) |
| Posting lists | 62MB |
| Position data | 18MB |
| Document metadata | 3MB |
| Total | 85MB |
Position data (for phrase matching) accounts for 21% of the index. Removing it reduces memory to 67MB but eliminates phrase matching capability.
Failure Modes
Memory exhaustion with large corpora: At 50,000 documents, the index would consume approximately 850MB. At 500,000 documents, it would exceed available memory on most application servers. The in-memory approach has a hard ceiling.
Tokenization language mismatch: The Porter stemmer is English-only. German compound words, CJK character segmentation, and Arabic morphology require language-specific tokenizers. Searching multi-language content with an English stemmer produces poor results for non-English documents.
Index staleness after content updates: The in-memory index is built at startup. New or updated documents are not searchable until the index is rebuilt. For static documentation, this is fine (rebuild on deploy). For dynamic content, an incremental update mechanism is needed.
Prefix matching performance: The prefix matching implementation scans all tokens in the index. With 180,000 tokens, a prefix like "a*" matches 12,000 tokens, and the query takes 200ms+ instead of 2ms. Mitigation: use a trie data structure for prefix lookups, or limit prefix results to the top 100 tokens by document frequency.
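The trie mitigation can be sketched as follows. TokenTrie and withPrefix are illustrative names, not part of the index described above; the limit parameter bounds worst-case work for short prefixes like "a*":

```typescript
// Trie over dictionary tokens: a prefix query walks the tree instead of
// scanning all 180,000 keys
class TrieNode {
  children: Map<string, TrieNode> = new Map();
  isToken = false;
}

class TokenTrie {
  private root = new TrieNode();

  insert(token: string): void {
    let node = this.root;
    for (const ch of token) {
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isToken = true;
  }

  // Collect up to `limit` tokens under `prefix` (caps prefix explosion)
  withPrefix(prefix: string, limit: number = 100): string[] {
    let node = this.root;
    for (const ch of prefix) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    const results: string[] = [];
    const stack: [TrieNode, string][] = [[node, prefix]];
    while (stack.length > 0 && results.length < limit) {
      const [cur, word] = stack.pop()!;
      if (cur.isToken) results.push(word);
      for (const [ch, child] of cur.children) {
        stack.push([child, word + ch]);
      }
    }
    return results;
  }
}
```

A production version would order candidates by document frequency rather than traversal order, so the 100 returned tokens are the most informative ones.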
Scaling Considerations
- For corpora beyond 50,000 documents, use a disk-based index (B-tree for the token dictionary, memory-mapped files for posting lists).
- For real-time indexing, maintain a small in-memory delta index for recent changes and merge it periodically with the main index.
- Postgres FTS is the correct choice for most applications. It handles up to 1M documents efficiently with GIN indexes and requires no custom code.
- Beyond 1M documents, consider Elasticsearch or Meilisearch. The operational overhead is justified at that scale.
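The delta-index idea from the list above can be sketched with simplified, unranked indexes. MiniIndex, tombstones, and searchBoth are illustrative names; a real implementation would carry full postings and scores:

```typescript
// Simplified delta-index sketch: a small index absorbs recent writes, and a
// tombstone set hides stale main-index entries until the next merge.
// Ranking is omitted; each index just maps token -> docIds.
type MiniIndex = Map<string, Set<string>>;

function addDoc(index: MiniIndex, docId: string, tokens: string[]): void {
  for (const token of tokens) {
    if (!index.has(token)) index.set(token, new Set());
    index.get(token)!.add(docId);
  }
}

// Query both indexes; docs updated since the last merge are tombstoned so
// their stale main-index postings are dropped
function searchBoth(
  main: MiniIndex,
  delta: MiniIndex,
  tombstones: Set<string>,
  token: string,
): string[] {
  const hits = new Set<string>(delta.get(token) ?? []);
  for (const docId of main.get(token) ?? []) {
    if (!tombstones.has(docId)) hits.add(docId);
  }
  return Array.from(hits);
}
```

A periodic background job folds the delta into the main index and clears the tombstones, keeping the delta small enough that querying both stays cheap.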
Observability
- Log query latency, result count, and zero-result queries
- Track the most frequent search terms (helps identify content gaps)
- Monitor index build time and memory usage after each deployment
- Measure relevance through click-through rate on search results (manual review for small corpora)
- Alert on queries exceeding 100ms (indicates prefix explosion or index corruption)
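A minimal shape for the per-query log record described above might look like this; the field names and thresholds are illustrative:

```typescript
// Per-query log record; zero-result and slow queries are flagged at log time
interface QueryLogEntry {
  query: string;
  latencyMs: number;
  resultCount: number;
  timestamp: number;
}

function logQuery(entries: QueryLogEntry[], entry: QueryLogEntry): void {
  entries.push(entry);
  if (entry.resultCount === 0) {
    // Zero-result queries highlight content gaps
    console.warn(`zero results for: ${entry.query}`);
  }
  if (entry.latencyMs > 100) {
    // Matches the 100ms alert threshold: prefix explosion or index corruption
    console.warn(`slow query (${entry.latencyMs}ms): ${entry.query}`);
  }
}
```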
Key Takeaways
- An inverted index with BM25 ranking and field boosting produces competitive search quality in under 200 lines of code.
- The in-memory approach is viable for corpora under 50,000 documents. Beyond that, disk-based indexes are necessary.
- Position data enables phrase matching but costs 21% of index memory. Include it only if phrase queries are required.
- Postgres full-text search is the pragmatic default. It requires zero custom code and handles 1M+ documents.
- Build your own search index to understand the mechanics. Deploy Postgres FTS (or Meilisearch) for production.
Final Thoughts
Building a search index from scratch took 2 days and produced an 85MB in-memory index that queries in 2-8ms. The exercise was valuable for understanding tokenization, posting lists, and TF-IDF ranking. For production, I use Postgres FTS. The custom index runs in the development environment for testing search quality changes before deploying them as Postgres query modifications. Understanding the fundamentals makes debugging search relevance issues in any system significantly easier.