Building a Simple Search Index

Dhruval Dhameliya·June 14, 2025·8 min read

Designing an inverted index from scratch with tokenization, ranking, and query parsing, then comparing it against Postgres full-text search.

Context

I built a search index for a documentation site with 5,000 pages. The goal was to understand how search works at a fundamental level, then compare the custom implementation against Postgres full-text search to determine when building your own is justified (rarely).

Problem

Full-text search engines like Elasticsearch and Meilisearch are powerful but operationally expensive for small corpora. Postgres full-text search is good but opaque. Building a search index from scratch reveals the trade-offs that production systems make and helps diagnose search quality issues in any system.

Constraints

  • Corpus: 5,000 documentation pages, average 800 words each (4M total words)
  • Index storage: must fit in memory (target: under 200MB)
  • Query latency: under 50ms for single-term and multi-term queries
  • Ranking: relevance-based (TF-IDF), with title matches weighted higher
  • Must support: prefix matching, phrase matching, and boolean AND/OR
  • No external dependencies (pure TypeScript implementation)

Design

Tokenization Pipeline

Raw Text
  -> Lowercase
  -> Remove punctuation
  -> Split on whitespace
  -> Remove stop words (the, a, is, etc.)
  -> Apply stemming (Porter stemmer)
  -> Output: array of tokens

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, ' ')               // replace punctuation with spaces
    .split(/\s+/)
    .filter(word => word.length > 1)        // drop empty and single-char tokens
    .filter(word => !STOP_WORDS.has(word))  // remove stop words
    .map(word => stem(word));               // Porter stemming
}

Example:

  • Input: "Implementing server-side rendering in Next.js"
  • After lowercase: "implementing server-side rendering in next.js"
  • After punctuation removal: "implementing server side rendering in next js"
  • After stop word removal: "implementing server side rendering next js"
  • After stemming: "implement server side render next js"

See also: Implementing Server-Side Rendering Without Overhead.
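The STOP_WORDS set and stem() function are used throughout but never shown. Here is a minimal sketch of both, with a deliberately simplified suffix-stripper standing in for the real Porter algorithm (which applies dozens of ordered rules with measure conditions):

```typescript
// A small stop-word set; production lists typically have 100+ entries
const STOP_WORDS = new Set(['the', 'a', 'an', 'is', 'in', 'of', 'to', 'and']);

// Simplified suffix stripping -- a stand-in for a real Porter stemmer,
// shown only to make the pipeline example concrete
function stem(word: string): string {
  if (word.endsWith('ing') && word.length > 5) return word.slice(0, -3);
  if (word.endsWith('ed') && word.length > 4) return word.slice(0, -2);
  if (word.endsWith('s') && !word.endsWith('ss') && word.length > 3) return word.slice(0, -1);
  return word;
}
```

A production index should use a real Porter implementation; the simplified rules above mis-stem many words (irregular plurals, doubled consonants, and so on).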

Inverted Index Structure

The core data structure maps each token to the list of documents containing it, along with position information:

interface InvertedIndex {
  // token -> document postings
  [token: string]: Posting[];
}
 
interface Posting {
  docId: string;
  frequency: number;       // how many times token appears in doc
  positions: number[];     // positions of token in doc (for phrase matching)
  field: 'title' | 'body'; // which field the token came from
}
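For example, after indexing, the entry for the stemmed token "render" might look like this (docIds, counts, and positions are hypothetical):

```typescript
// Hypothetical postings for the token "render" across two documents:
// four body occurrences in one doc, one title occurrence in another
const renderPostings = [
  { docId: 'doc-12', frequency: 4, positions: [3, 41, 97, 210], field: 'body' },
  { docId: 'doc-87', frequency: 1, positions: [2], field: 'title' },
];
```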

Index Construction

function buildIndex(documents: Document[]): { index: InvertedIndex; docLengths: Map<string, number> } {
  const index: InvertedIndex = {};
  const docLengths: Map<string, number> = new Map();

  for (const doc of documents) {
    // Index title tokens with field marker
    const titleTokens = tokenize(doc.title);
    for (let i = 0; i < titleTokens.length; i++) {
      addToIndex(index, titleTokens[i], doc.id, i, 'title');
    }

    // Index body tokens
    const bodyTokens = tokenize(doc.body);
    for (let i = 0; i < bodyTokens.length; i++) {
      addToIndex(index, bodyTokens[i], doc.id, i, 'body');
    }

    // Track document length for BM25 length normalization in scoring
    docLengths.set(doc.id, titleTokens.length + bodyTokens.length);
  }

  return { index, docLengths };
}
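buildIndex relies on an addToIndex helper that is not shown above. One plausible implementation, with the earlier interfaces repeated so the snippet stands alone:

```typescript
// Types repeated from the index structure above, for completeness
interface Posting {
  docId: string;
  frequency: number;
  positions: number[];
  field: 'title' | 'body';
}
interface InvertedIndex {
  [token: string]: Posting[];
}

// Append a position to the existing (docId, field) posting, or create one
function addToIndex(
  index: InvertedIndex,
  token: string,
  docId: string,
  position: number,
  field: 'title' | 'body',
): void {
  const postings = index[token] || (index[token] = []);
  let posting = postings.find(p => p.docId === docId && p.field === field);
  if (!posting) {
    posting = { docId, frequency: 0, positions: [], field };
    postings.push(posting);
  }
  posting.frequency += 1;
  posting.positions.push(position);
}
```

Keeping one posting per (document, field) pair is what lets the scorer apply a different boost to title and body occurrences of the same token.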

Ranking: TF-IDF with Field Boosting

function score(
  token: string,
  posting: Posting,
  index: InvertedIndex,
  totalDocs: number,
  docLength: number,
  avgDocLength: number,
): number {
  // Term frequency (BM25 variant)
  const k1 = 1.2;
  const b = 0.75;
  const tf = posting.frequency;
  const normalizedTf = (tf * (k1 + 1)) /
    (tf + k1 * (1 - b + b * (docLength / avgDocLength)));
 
  // Inverse document frequency
  const df = index[token].length;
  const idf = Math.log((totalDocs - df + 0.5) / (df + 0.5) + 1);
 
  // Field boost
  const fieldBoost = posting.field === 'title' ? 3.0 : 1.0;
 
  return normalizedTf * idf * fieldBoost;
}
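Plugging illustrative numbers into the formula (a token appearing 3 times in an average-length document, present in 50 of the 5,000 documents):

```typescript
// Worked example of the scoring formula with hypothetical inputs
const k1 = 1.2, b = 0.75;
const tf = 3, docLength = 800, avgDocLength = 800;
const totalDocs = 5000, df = 50;

const normalizedTf = (tf * (k1 + 1)) /
  (tf + k1 * (1 - b + b * (docLength / avgDocLength)));  // 6.6 / 4.2 ≈ 1.571

const idf = Math.log((totalDocs - df + 0.5) / (df + 0.5) + 1);  // ≈ 4.595

console.log((normalizedTf * idf).toFixed(2));        // body match: ≈ 7.22
console.log((normalizedTf * idf * 3.0).toFixed(2));  // title match: ≈ 21.66
```

Note how the 3x title boost dominates: a single title occurrence (tf = 1 gives a normalized tf of exactly 1.0, so a score of about 13.8) outscores three body occurrences of the same token.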

Query Processing

interface IndexStats {
  totalDocs: number;
  avgDocLength: number;
  docLengths: Map<string, number>;
}

function search(
  query: string,
  index: InvertedIndex,
  stats: IndexStats,
  limit: number = 10,
): SearchResult[] {
  // Detect prefix markers before tokenization: tokenize() strips '*'
  // along with other punctuation, so it would silently drop the marker
  const rawTerms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const scores: Map<string, number> = new Map();

  for (const raw of rawTerms) {
    const isPrefix = raw.endsWith('*');
    const term = raw.replace(/[^\w]/g, '');
    if (term.length <= 1 || STOP_WORDS.has(term)) continue;

    // Prefix terms match the raw prefix against the token dictionary;
    // exact terms are stemmed, matching how the index was built
    const matchingTokens = isPrefix
      ? Object.keys(index).filter(t => t.startsWith(term))
      : [stem(term)];

    for (const matchToken of matchingTokens) {
      for (const posting of index[matchToken] || []) {
        const current = scores.get(posting.docId) || 0;
        scores.set(
          posting.docId,
          current + score(
            matchToken,
            posting,
            index,
            stats.totalDocs,
            stats.docLengths.get(posting.docId) || 0,
            stats.avgDocLength,
          ),
        );
      }
    }
  }

  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([docId, score]) => ({ docId, score }));
}

Phrase Matching

For phrase queries (quoted strings), verify that tokens appear in sequence using position data:

function phraseMatch(tokens: string[], index: InvertedIndex, docId: string): boolean {
  // Positions are numbered per field, so check title and body separately;
  // otherwise a title position could falsely chain into a body position
  for (const field of ['title', 'body'] as const) {
    const postingsPerToken = tokens.map(t =>
      (index[t] || []).filter(p => p.docId === docId && p.field === field)
    );

    if (postingsPerToken.some(p => p.length === 0)) continue;

    // A phrase starts at startPos if every token i appears at startPos + i
    const firstPositions = postingsPerToken[0].flatMap(p => p.positions);
    const found = firstPositions.some(startPos =>
      tokens.every((_, i) =>
        postingsPerToken[i].some(p => p.positions.includes(startPos + i))
      )
    );
    if (found) return true;
  }

  return false;
}

Trade-offs

Custom Index vs Postgres Full-Text Search

| Metric | Custom Index | Postgres FTS |
|---|---|---|
| Index build time (5,000 docs) | 1.2s | 3.5s (GIN index) |
| Index size in memory | 85MB | N/A (disk-based) |
| Query latency (single term, p50) | 2ms | 12ms |
| Query latency (3 terms, p50) | 8ms | 25ms |
| Phrase matching | Yes (position-based) | Yes (tsquery with <->) |
| Typo tolerance | No | No (without pg_trgm) |
| Stemming quality | Porter (English only) | Snowball (multi-language) |
| Ranking quality | BM25 + field boost | ts_rank (TF-based) |
| Operational overhead | High (custom code) | Low (built-in) |

Relevance Quality (20 test queries, correct result in top 3)

| Query Type | Custom Index | Postgres FTS |
|---|---|---|
| Single keyword | 19/20 | 18/20 |
| Multi-keyword | 17/20 | 16/20 |
| Phrase query | 18/20 | 17/20 |
| Concept (e.g., "how to deploy") | 12/20 | 14/20 |

The custom index performs slightly better on keyword queries due to BM25 ranking (Postgres uses a simpler TF-based ranking by default). Postgres performs better on concept queries because its Snowball stemmer handles more word forms.

Memory Usage Breakdown

| Component | Size |
|---|---|
| Token dictionary (unique tokens) | 2.1MB (180,000 tokens) |
| Posting lists | 62MB |
| Position data | 18MB |
| Document metadata | 3MB |
| Total | 85MB |

Position data (for phrase matching) accounts for 21% of the index. Removing it reduces memory to 67MB but eliminates phrase matching capability.

Failure Modes

Memory exhaustion with large corpora: At 50,000 documents, the index would consume approximately 850MB. At 500,000 documents, it would exceed available memory on most application servers. The in-memory approach has a hard ceiling.

Tokenization language mismatch: The Porter stemmer is English-only. German compound words, CJK character segmentation, and Arabic morphology require language-specific tokenizers. Searching multi-language content with an English stemmer produces poor results for non-English documents.

Index staleness after content updates: The in-memory index is built at startup. New or updated documents are not searchable until the index is rebuilt. For static documentation, this is fine (rebuild on deploy). For dynamic content, an incremental update mechanism is needed.

Prefix matching performance: The prefix matching implementation scans all tokens in the index. With 180,000 tokens, a prefix like "a*" matches 12,000 tokens, and the query takes 200ms+ instead of 2ms. Mitigation: use a trie data structure for prefix lookups, or limit prefix results to the top 100 tokens by document frequency.
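
The trie mitigation can be sketched as follows (names are illustrative; the limit parameter guards against prefix explosion):

```typescript
// Minimal trie over the token dictionary: a prefix lookup visits only the
// subtree under the prefix instead of scanning all 180,000 tokens
class TokenTrie {
  private children = new Map<string, TokenTrie>();
  private isToken = false;

  insert(token: string): void {
    let node: TokenTrie = this;
    for (const ch of token) {
      if (!node.children.has(ch)) node.children.set(ch, new TokenTrie());
      node = node.children.get(ch)!;
    }
    node.isToken = true;
  }

  withPrefix(prefix: string, limit: number = 100): string[] {
    // Walk down to the node for the prefix
    let node: TokenTrie = this;
    for (const ch of prefix) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    // Collect up to `limit` complete tokens below that node
    const results: string[] = [];
    const stack: [TokenTrie, string][] = [[node, prefix]];
    while (stack.length > 0 && results.length < limit) {
      const [current, word] = stack.pop()!;
      if (current.isToken) results.push(word);
      for (const [ch, child] of current.children) stack.push([child, word + ch]);
    }
    return results;
  }
}
```

Building the trie adds a little index memory, but it turns the "a*" worst case from a full dictionary scan into a bounded subtree walk.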

Scaling Considerations

  • For corpora beyond 50,000 documents, use a disk-based index (B-tree for the token dictionary, memory-mapped files for posting lists).
  • For real-time indexing, maintain a small in-memory delta index for recent changes and merge it periodically with the main index.
  • Postgres FTS is the correct choice for most applications. It handles up to 1M documents efficiently with GIN indexes and requires no custom code.
  • Beyond 1M documents, consider Elasticsearch or Meilisearch. The operational overhead is justified at that scale.

Observability

  • Log query latency, result count, and zero-result queries
  • Track the most frequent search terms (helps identify content gaps)
  • Monitor index build time and memory usage after each deployment
  • Measure relevance through click-through rate on search results (manual review for small corpora)
  • Alert on queries exceeding 100ms (indicates prefix explosion or index corruption)

Key Takeaways

  • An inverted index with BM25 ranking and field boosting produces competitive search quality in under 200 lines of code.
  • The in-memory approach is viable for corpora under 50,000 documents. Beyond that, disk-based indexes are necessary.
  • Position data enables phrase matching but costs 21% of index memory. Include it only if phrase queries are required.
  • Postgres full-text search is the pragmatic default. It requires zero custom code and handles 1M+ documents.
  • Build your own search index to understand the mechanics. Deploy Postgres FTS (or Meilisearch) for production.




Final Thoughts

Building a search index from scratch took 2 days and produced an 85MB in-memory index that queries in 2-8ms. The exercise was valuable for understanding tokenization, posting lists, and TF-IDF ranking. For production, I use Postgres FTS. The custom index runs in the development environment for testing search quality changes before deploying them as Postgres query modifications. Understanding the fundamentals makes debugging search relevance issues in any system significantly easier.
