Building a Simple Search Index

Dhruval Dhameliya·June 14, 2025·8 min read

Designing an inverted index from scratch with tokenization, ranking, and query parsing, then comparing it against Postgres full-text search.

Context

I built a search index for a documentation site with 5,000 pages. The goal was to understand how search works at a fundamental level, then compare the custom implementation against Postgres full-text search to determine when building your own is justified (rarely).

Problem

Full-text search engines like Elasticsearch and Meilisearch are powerful but operationally expensive for small corpora. Postgres full-text search is good but opaque. Building a search index from scratch reveals the trade-offs that production systems make and helps diagnose search quality issues in any system.

Constraints

  • Corpus: 5,000 documentation pages, average 800 words each (4M total words)
  • Index storage: must fit in memory (target: under 200MB)
  • Query latency: under 50ms for single-term and multi-term queries
  • Ranking: relevance-based (TF-IDF), with title matches weighted higher
  • Must support: prefix matching, phrase matching, and boolean AND/OR
  • No external dependencies (pure TypeScript implementation)

Design

Tokenization Pipeline

Raw Text
  -> Lowercase
  -> Remove punctuation
  -> Split on whitespace
  -> Remove stop words (the, a, is, etc.)
  -> Apply stemming (Porter stemmer)
  -> Output: array of tokens

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, ' ')               // replace punctuation with spaces
    .split(/\s+/)
    .filter(word => word.length > 1)        // drop empty and single-char tokens
    .filter(word => !STOP_WORDS.has(word))  // remove stop words
    .map(word => stem(word));               // Porter stemming
}

Example:

  • Input: "Implementing server-side rendering in Next.js"
  • After lowercase: "implementing server-side rendering in next.js"
  • After punctuation removal: "implementing server side rendering in next js"
  • After stop word removal: "implementing server side rendering next js"
  • After stemming: "implement server side render next js"

See also: Implementing Server-Side Rendering Without Overhead.
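The STOP_WORDS set and stem() function are used throughout but never shown. Here is a minimal sketch of both, with a deliberately simplified suffix-stripper standing in for the real Porter algorithm (which applies dozens of ordered rules with measure conditions):

```typescript
// A small stop-word set; production lists typically have 100+ entries
const STOP_WORDS = new Set(['the', 'a', 'an', 'is', 'in', 'of', 'to', 'and']);

// Simplified suffix stripping -- a stand-in for a real Porter stemmer,
// shown only to make the pipeline example concrete
function stem(word: string): string {
  if (word.endsWith('ing') && word.length > 5) return word.slice(0, -3);
  if (word.endsWith('ed') && word.length > 4) return word.slice(0, -2);
  if (word.endsWith('s') && !word.endsWith('ss') && word.length > 3) return word.slice(0, -1);
  return word;
}
```

A production index should use a real Porter implementation; the simplified rules above mis-stem many words (irregular plurals, doubled consonants, and so on).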

Inverted Index Structure

The core data structure maps each token to the list of documents containing it, along with position information:

interface InvertedIndex {
  // token -> document postings
  [token: string]: Posting[];
}
 
interface Posting {
  docId: string;
  frequency: number;       // how many times token appears in doc
  positions: number[];     // positions of token in doc (for phrase matching)
  field: 'title' | 'body'; // which field the token came from
}
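For example, after indexing, the entry for the stemmed token "render" might look like this (docIds, counts, and positions are hypothetical):

```typescript
// Hypothetical postings for the token "render" across two documents:
// four body occurrences in one doc, one title occurrence in another
const renderPostings = [
  { docId: 'doc-12', frequency: 4, positions: [3, 41, 97, 210], field: 'body' },
  { docId: 'doc-87', frequency: 1, positions: [2], field: 'title' },
];
```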

Index Construction

function buildIndex(documents: Document[]): { index: InvertedIndex; docLengths: Map<string, number> } {
  const index: InvertedIndex = {};
  const docLengths: Map<string, number> = new Map();

  for (const doc of documents) {
    // Index title tokens with field marker
    const titleTokens = tokenize(doc.title);
    for (let i = 0; i < titleTokens.length; i++) {
      addToIndex(index, titleTokens[i], doc.id, i, 'title');
    }

    // Index body tokens
    const bodyTokens = tokenize(doc.body);
    for (let i = 0; i < bodyTokens.length; i++) {
      addToIndex(index, bodyTokens[i], doc.id, i, 'body');
    }

    // Track document length for BM25 length normalization in scoring
    docLengths.set(doc.id, titleTokens.length + bodyTokens.length);
  }

  return { index, docLengths };
}
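buildIndex relies on an addToIndex helper that is not shown above. One plausible implementation, with the earlier interfaces repeated so the snippet stands alone:

```typescript
// Types repeated from the index structure above, for completeness
interface Posting {
  docId: string;
  frequency: number;
  positions: number[];
  field: 'title' | 'body';
}
interface InvertedIndex {
  [token: string]: Posting[];
}

// Append a position to the existing (docId, field) posting, or create one
function addToIndex(
  index: InvertedIndex,
  token: string,
  docId: string,
  position: number,
  field: 'title' | 'body',
): void {
  const postings = index[token] || (index[token] = []);
  let posting = postings.find(p => p.docId === docId && p.field === field);
  if (!posting) {
    posting = { docId, frequency: 0, positions: [], field };
    postings.push(posting);
  }
  posting.frequency += 1;
  posting.positions.push(position);
}
```

Keeping one posting per (document, field) pair is what lets the scorer apply a different boost to title and body occurrences of the same token.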

Ranking: TF-IDF with Field Boosting

function score(
  token: string,
  posting: Posting,
  index: InvertedIndex,
  totalDocs: number,
  docLength: number,
  avgDocLength: number,
): number {
  // Term frequency (BM25 variant)
  const k1 = 1.2;
  const b = 0.75;
  const tf = posting.frequency;
  const normalizedTf = (tf * (k1 + 1)) /
    (tf + k1 * (1 - b + b * (docLength / avgDocLength)));
 
  // Inverse document frequency
  const df = index[token].length;
  const idf = Math.log((totalDocs - df + 0.5) / (df + 0.5) + 1);
 
  // Field boost
  const fieldBoost = posting.field === 'title' ? 3.0 : 1.0;
 
  return normalizedTf * idf * fieldBoost;
}
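Plugging illustrative numbers into the formula (a token appearing 3 times in an average-length document, present in 50 of the 5,000 documents):

```typescript
// Worked example of the scoring formula with hypothetical inputs
const k1 = 1.2, b = 0.75;
const tf = 3, docLength = 800, avgDocLength = 800;
const totalDocs = 5000, df = 50;

const normalizedTf = (tf * (k1 + 1)) /
  (tf + k1 * (1 - b + b * (docLength / avgDocLength)));  // 6.6 / 4.2 ≈ 1.571

const idf = Math.log((totalDocs - df + 0.5) / (df + 0.5) + 1);  // ≈ 4.595

console.log((normalizedTf * idf).toFixed(2));        // body match: ≈ 7.22
console.log((normalizedTf * idf * 3.0).toFixed(2));  // title match: ≈ 21.66
```

Note how the 3x title boost dominates: a single title occurrence (tf = 1 gives a normalized tf of exactly 1.0, so a score of about 13.8) outscores three body occurrences of the same token.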

Query Processing

interface IndexStats {
  totalDocs: number;
  avgDocLength: number;
  docLengths: Map<string, number>;
}

function search(
  query: string,
  index: InvertedIndex,
  stats: IndexStats,
  limit: number = 10,
): SearchResult[] {
  // Detect prefix markers before tokenization: tokenize() strips '*'
  // along with other punctuation, so it would silently drop the marker
  const rawTerms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const scores: Map<string, number> = new Map();

  for (const raw of rawTerms) {
    const isPrefix = raw.endsWith('*');
    const term = raw.replace(/[^\w]/g, '');
    if (term.length <= 1 || STOP_WORDS.has(term)) continue;

    // Prefix terms match the raw prefix against the token dictionary;
    // exact terms are stemmed, matching how the index was built
    const matchingTokens = isPrefix
      ? Object.keys(index).filter(t => t.startsWith(term))
      : [stem(term)];

    for (const matchToken of matchingTokens) {
      for (const posting of index[matchToken] || []) {
        const current = scores.get(posting.docId) || 0;
        scores.set(
          posting.docId,
          current + score(
            matchToken,
            posting,
            index,
            stats.totalDocs,
            stats.docLengths.get(posting.docId) || 0,
            stats.avgDocLength,
          ),
        );
      }
    }
  }

  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([docId, score]) => ({ docId, score }));
}

Phrase Matching

For phrase queries (quoted strings), verify that tokens appear in sequence using position data:

function phraseMatch(tokens: string[], index: InvertedIndex, docId: string): boolean {
  // Positions are numbered per field, so check title and body separately;
  // otherwise a title position could falsely chain into a body position
  for (const field of ['title', 'body'] as const) {
    const postingsPerToken = tokens.map(t =>
      (index[t] || []).filter(p => p.docId === docId && p.field === field)
    );

    if (postingsPerToken.some(p => p.length === 0)) continue;

    // A phrase starts at startPos if every token i appears at startPos + i
    const firstPositions = postingsPerToken[0].flatMap(p => p.positions);
    const found = firstPositions.some(startPos =>
      tokens.every((_, i) =>
        postingsPerToken[i].some(p => p.positions.includes(startPos + i))
      )
    );
    if (found) return true;
  }

  return false;
}

Trade-offs

Custom Index vs Postgres Full-Text Search

| Metric | Custom Index | Postgres FTS |
|---|---|---|
| Index build time (5,000 docs) | 1.2s | 3.5s (GIN index) |
| Index size in memory | 85MB | N/A (disk-based) |
| Query latency (single term, p50) | 2ms | 12ms |
| Query latency (3 terms, p50) | 8ms | 25ms |
| Phrase matching | Yes (position-based) | Yes (tsquery with <->) |
| Typo tolerance | No | No (without pg_trgm) |
| Stemming quality | Porter (English only) | Snowball (multi-language) |
| Ranking quality | BM25 + field boost | ts_rank (TF-based) |
| Operational overhead | High (custom code) | Low (built-in) |

Relevance Quality (20 test queries, correct result in top 3)

| Query Type | Custom Index | Postgres FTS |
|---|---|---|
| Single keyword | 19/20 | 18/20 |
| Multi-keyword | 17/20 | 16/20 |
| Phrase query | 18/20 | 17/20 |
| Concept (e.g., "how to deploy") | 12/20 | 14/20 |

The custom index performs slightly better on keyword queries due to BM25 ranking (Postgres uses a simpler TF-based ranking by default). Postgres performs better on concept queries because its Snowball stemmer handles more word forms.

Memory Usage Breakdown

| Component | Size |
|---|---|
| Token dictionary (unique tokens) | 2.1MB (180,000 tokens) |
| Posting lists | 62MB |
| Position data | 18MB |
| Document metadata | 3MB |
| Total | 85MB |

Position data (for phrase matching) accounts for 21% of the index. Removing it reduces memory to 67MB but eliminates phrase matching capability.

Failure Modes

Memory exhaustion with large corpora: At 50,000 documents, the index would consume approximately 850MB. At 500,000 documents, it would exceed available memory on most application servers. The in-memory approach has a hard ceiling.

Tokenization language mismatch: The Porter stemmer is English-only. German compound words, CJK character segmentation, and Arabic morphology require language-specific tokenizers. Searching multi-language content with an English stemmer produces poor results for non-English documents.

Index staleness after content updates: The in-memory index is built at startup. New or updated documents are not searchable until the index is rebuilt. For static documentation, this is fine (rebuild on deploy). For dynamic content, an incremental update mechanism is needed.

Prefix matching performance: The prefix matching implementation scans all tokens in the index. With 180,000 tokens, a prefix like "a*" matches 12,000 tokens, and the query takes 200ms+ instead of 2ms. Mitigation: use a trie data structure for prefix lookups, or limit prefix results to the top 100 tokens by document frequency.
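
The trie mitigation can be sketched as follows (names are illustrative; the limit parameter guards against prefix explosion):

```typescript
// Minimal trie over the token dictionary: a prefix lookup visits only the
// subtree under the prefix instead of scanning all 180,000 tokens
class TokenTrie {
  private children = new Map<string, TokenTrie>();
  private isToken = false;

  insert(token: string): void {
    let node: TokenTrie = this;
    for (const ch of token) {
      if (!node.children.has(ch)) node.children.set(ch, new TokenTrie());
      node = node.children.get(ch)!;
    }
    node.isToken = true;
  }

  withPrefix(prefix: string, limit: number = 100): string[] {
    // Walk down to the node for the prefix
    let node: TokenTrie = this;
    for (const ch of prefix) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    // Collect up to `limit` complete tokens below that node
    const results: string[] = [];
    const stack: [TokenTrie, string][] = [[node, prefix]];
    while (stack.length > 0 && results.length < limit) {
      const [current, word] = stack.pop()!;
      if (current.isToken) results.push(word);
      for (const [ch, child] of current.children) stack.push([child, word + ch]);
    }
    return results;
  }
}
```

Building the trie adds a little index memory, but it turns the "a*" worst case from a full dictionary scan into a bounded subtree walk.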

Scaling Considerations

  • For corpora beyond 50,000 documents, use a disk-based index (B-tree for the token dictionary, memory-mapped files for posting lists).
  • For real-time indexing, maintain a small in-memory delta index for recent changes and merge it periodically with the main index.
  • Postgres FTS is the correct choice for most applications. It handles up to 1M documents efficiently with GIN indexes and requires no custom code.
  • Beyond 1M documents, consider Elasticsearch or Meilisearch. The operational overhead is justified at that scale.

Observability

  • Log query latency, result count, and zero-result queries
  • Track the most frequent search terms (helps identify content gaps)
  • Monitor index build time and memory usage after each deployment
  • Measure relevance through click-through rate on search results (manual review for small corpora)
  • Alert on queries exceeding 100ms (indicates prefix explosion or index corruption)

Key Takeaways

  • An inverted index with BM25 ranking and field boosting produces competitive search quality in under 200 lines of code.
  • The in-memory approach is viable for corpora under 50,000 documents. Beyond that, disk-based indexes are necessary.
  • Position data enables phrase matching but costs 21% of index memory. Include it only if phrase queries are required.
  • Postgres full-text search is the pragmatic default. It requires zero custom code and handles 1M+ documents.
  • Build your own search index to understand the mechanics. Deploy Postgres FTS (or Meilisearch) for production.




Final Thoughts

Building a search index from scratch took 2 days and produced an 85MB in-memory index that queries in 2-8ms. The exercise was valuable for understanding tokenization, posting lists, and TF-IDF ranking. For production, I use Postgres FTS. The custom index runs in the development environment for testing search quality changes before deploying them as Postgres query modifications. Understanding the fundamentals makes debugging search relevance issues in any system significantly easier.
