system designintermediate 12m2026-06-09

Designing a Search Engine — Crawl, Index, Query, Ranking

Session 44 of the 48-session learning series.

Why this session matters

This is Session 44 of 48 in the System Design track. "Search" looks simple until you have to build it. Inverted indices, query parsing, ranking signals, freshness — the layers are well-understood but rarely covered properly. Knowing how Google works at a sketch-level is a system-design interview classic and a real architectural skill.

Agenda

The 4 stages — crawl, index, query, rank
The inverted index — how it actually works, what it costs
Query parsing, tokenisation, normalisation
Ranking — BM25 + signals + ML re-rank
Modern hybrid search — keyword + vector together

Pre-read (skim before the session)

Deep dive

1. The 4-stage pipeline

[ CRAWL ]   →  [ INDEX ]   →   [ QUERY ]    →   [ RANK ]
fetch web      build index     parse user      sort results
respect robots tokenise, store query, expand   BM25 + signals
discover URLs  inverted index  lookup postings ML re-rank

Each stage scales independently. Crawl is the slow, expensive front. Query latency budget is < 200 ms; everything else can be slow.

2. The inverted index

A normal index: docID → content. Inverted index: term → list of (docID, position, frequency).

"transformer" → [(d1, [3, 47]), (d7, [12]), (d99, [1, 8, 33])]
"attention"   → [(d1, [4]), (d12, [2]), (d99, [2])]

Query "transformer attention" = intersect posting lists for both terms, score the intersection.

Critical for: storing efficiently (compressed integers, delta encoding), supporting fast intersection, ranking via term-frequency stats.

3. Tokenisation and normalisation

Text in → terms out. Steps:

Tokenise (split on whitespace, punctuation; language-aware).
Lowercase (or case-preserve where case matters — names, code).
Remove stop words (the, a, is) — or keep with low weight.
Stem (running → run) or lemmatise (better → good).
Normalise unicode (NFC), strip diacritics for tolerance.
Synonyms (tv ↔ television).
N-grams (bigrams for phrase support).

Every choice changes recall + precision tradeoff. Tune by language.

4. Building the index

For a small corpus, one in-memory index. For billions of docs:

Segments — many small immutable indices, merged in background (LSM-tree-like).
Sharding — index split across nodes by document hash.
Replication — each shard on N nodes for HA + read throughput.

Lucene (Elasticsearch / OpenSearch / Solr underneath) is the canonical implementation. New indices: Tantivy (Rust), Vespa (multi-modal first-class), Meilisearch (small-scale UX).

5. The query path

"best laptop for coding 2026"
    ↓
parse → ["best", "laptop", "coding", "2026"]   (stopword "for")
    ↓
synonyms → ["best", "laptop", "notebook", "coding", "programming", "2026"]
    ↓
fetch posting lists for each
    ↓
intersect (AND) or union (OR) per query operator
    ↓
score candidates (BM25 + signals)
    ↓
top-K to ranker
    ↓
return

Lucene-style engines do steps 1–6 in milliseconds for indices in the billions.

6. BM25 — the classic ranker

A scoring function that beats vanilla TF-IDF for most workloads:

score(D, Q) = Σ IDF(qi) * f(qi, D) * (k1+1) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))

Where:

f(qi, D) — term frequency.
IDF — inverse document frequency.
|D| — document length; avgdl — average doc length.
k1, b — tunable constants (k1=1.2, b=0.75 typical).

BM25 captures: rare terms matter more (IDF), term frequency helps but saturates (k1), shorter docs scoring higher per term (b).

7. Signals on top of BM25

Real ranking blends BM25 with:

Freshness — recent docs ranked higher for some queries.
Popularity — clickthrough rate, view count.
Quality — author authority, trustworthiness, spam score.
Personalisation — user history match.
Engagement — dwell time, return rate.
Geographic / language — match user locale.

Each signal is a feature. Sum/weight them, or learn via LTR (next).

8. Learning to Rank (LTR)

Modern engines (Google, Bing, e-commerce search) use ML to combine signals:

Pointwise — predict relevance score; sort.
Pairwise — train to order pairs correctly.
Listwise — optimise NDCG directly (LambdaMART).

LightGBM's LambdaRank is the default for many production search teams. Fast at serve, easy to interpret.

9. Hybrid search — keyword + vector

Pure keyword fails on semantic mismatch ("how to make my code faster" vs docs titled "optimization techniques"). Pure vector misses exact matches and rare terms.

Hybrid: run both retrievals, fuse results.

Fusion strategies:

Reciprocal Rank Fusion (RRF) — score = Σ 1/(k + rank) across retrievers. Simple, robust.
Weighted score — final = α * keyword + (1-α) * vector. Needs normalisation.
Rerank — feed top-K from both into an LLM/cross-encoder reranker.

OpenSearch, Elasticsearch, Vespa, Pinecone, Weaviate all support hybrid out of the box now.

10. Crawl politely

If you're the search engine:

Respect robots.txt.
Rate limit per host (don't DDoS).
Identify yourself in User-Agent.
Recrawl based on change frequency (popular news = hourly; static docs = weekly).
Discover URLs from sitemaps + link graph.
Deduplicate (canonical URLs, shingle-based near-duplicate detection).

A polite crawl of the web is a massive project. Most teams crawl their own data only.

11. Query understanding

Hard problems:

Spell correction — tesla cybertuck → tesla cybertruck.
Query expansion — add synonyms.
Intent classification — informational / navigational / transactional.
Entity detection — restaurants in tokyo → location filter on tokyo.
Question answering — modern hybrid: search + LLM summarises top results.

Each is a small ML model. Stacked, they 2× search quality without changing the index.

12. Observability

Latency — p50, p95, p99 query time.
Recall — known good docs in top-N for sample queries.
Click-through rate (CTR) — by position; flat curve = bad ranking.
Zero-result rate — too high = expansion / synonyms broken.
Click-to-action rate — quality at the answer level.

A search engine without click metrics is operating blind.

13. Reality check

A search engine for a SaaS product, day-1:

Postgres tsvector or SQLite FTS5 for < 100k docs.
Elasticsearch / OpenSearch for 100k–1B docs.
Vespa / proprietary for > 1B docs or multi-modal.
Hybrid (keyword + vector via embedding) for any modern UX.
Click logging from day 1.
LTR layer when you have > 6 months of click data.

The hardest part isn't the index; it's the evaluation. Build the click-log + relevance-judgement pipeline early.

Reading material

Books:

Introduction to Information Retrieval — Manning, Raghavan, Schütze (the canonical free IR textbook; required reading from Stanford CS276)
Search Patterns — Peter Morville & Jeffery Callender (the patterns book that grounds product side of search)
Designing Data-Intensive Applications — Martin Kleppmann (the search + indexing chapters that frame how Lucene fits into a data system)
Lucene in Action (2nd ed.) — Erik Hatcher, Otis Gospodnetic, Mike McCandless (the canonical Lucene book; older but still the right mental model)
Relevant Search — Doug Turnbull & John Berryman (the canonical book on building Elasticsearch / Solr search that actually ranks well)

Papers:

The Anatomy of a Large-Scale Hypertextual Web Search Engine — Brin & Page 1998 — the original Google paper; the founding document of modern search.
PageRank: Bringing Order to the Web — Page, Brin, Motwani, Winograd 1999 — the canonical link-analysis paper.
The Probabilistic Relevance Framework: BM25 — Robertson & Zaragoza 2009 — the canonical BM25 paper.
Learning to Rank for Information Retrieval — Tie-Yan Liu 2009 — the survey that named the field.
Pretrained Transformers for Text Ranking: BERT and Beyond — Lin et al. 2021 — the canonical survey of neural ranking.
Approximate Nearest Neighbor Search Algorithms (HNSW, IVF) — Malkov & Yashunin 2016 — the foundational HNSW paper behind every modern vector index.

Official docs:

Apache Lucene documentation — the canonical reference for the inverted-index engine inside Elasticsearch, Solr, OpenSearch.
Elasticsearch — Definitive Guide (open book) — old but still the right end-to-end story.
OpenSearch documentation — AWS's Elasticsearch fork; same APIs, divergent roadmap.
Vespa documentation — the Yahoo-origin tensor-aware search + ranking engine, now used by Spotify and others.
Pinecone — Hybrid search guide — the canonical primer on dense+sparse hybrid retrieval.
Weaviate documentation — modern hybrid vector + keyword DB.
Solr Reference Guide — the alternative Lucene-on-top engine.

Blog posts:

Doug Turnbull — Search relevance posts — the canonical practitioner-blog on production search relevance.
Spotify Engineering — Search & ranking — production Spotify search posts (track + podcast + audiobook ranking).
Algolia blog — the canonical search-as-a-service practitioner write-ups.
Elastic blog — Search Labs — Elastic's deep-engineering blog; the most active source on hybrid search at scale.
Daniel Tunkelang — Query understanding posts — the canonical practitioner blog on query parsing + intent.
Yelp Engineering — Search posts — production search at Yelp scale.

In-depth research material

Apache Lucene — github.com/apache/lucene — ~3k ★, the canonical inverted-index engine.
Elasticsearch — github.com/elastic/elasticsearch — ~70k ★, the most-used search engine in production.
OpenSearch — github.com/opensearch-project/OpenSearch — ~10k ★, AWS's Apache 2.0 fork.
Tantivy — github.com/quickwit-oss/tantivy — ~13k ★, the Rust full-text search engine inspired by Lucene.
MeiliSearch — github.com/meilisearch/meilisearch — ~48k ★, a Rust search engine focused on instant-search UX.
Typesense — github.com/typesense/typesense — ~22k ★, fast typo-tolerant search engine.
Vespa — github.com/vespa-engine/vespa — ~6k ★, Yahoo-origin tensor-ranking search engine (used by Spotify, Wikipedia).
Quickwit — github.com/quickwit-oss/quickwit — ~8k ★, cloud-native distributed search engine.
Pinecone Engineering Blog — practitioner posts on hybrid + vector search at scale.
Weaviate — github.com/weaviate/weaviate — ~12k ★, vector DB with built-in hybrid retrieval.
Qdrant — github.com/qdrant/qdrant — ~21k ★, Rust vector DB used in production RAG.
Spotify research papers — the published search + recsys papers from Spotify.

Videos

Design a Search Engine — System Design Interview — System Design Interview channel · 32 min — the canonical interview-frame walkthrough.
How Google Search Works — Matt Cutts (TED) — Matt Cutts · 14 min — a former Google search lead's accessible 14-minute overview.
Building Vespa: A Hands-on Tour of a Big Data Serving Engine — Jon Bratseth — Jon Bratseth (Vespa creator) · 48 min — the inside view from Yahoo's production search architect.
Inside Lucene — Adrien Grand (Elastic) — Adrien Grand (Lucene committer) · 42 min — how the inverted index, posting lists, and BM25 actually work in code.
Hybrid Search: BM25 + Dense Retrieval — Hamel Husain × Jo Bergum (Vespa) — Jo Bergum (Vespa) · 56 min — the practitioner case for hybrid retrieval in modern search + RAG.

LeetCode — Design Search Autocomplete System

Link: https://leetcode.com/problems/design-search-autocomplete-system/
Difficulty: Hard
Why this problem: Trie + frequency tracking — the in-memory data structure behind every search-as-you-type box. (We saw this earlier in S09; different lens.)
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Draw the inverted index data structure with delta encoding.
Walk through query lookup: parse → expand → intersect → rank.
Explain BM25 in plain English; state the role of k1, b.
Combine keyword + vector with RRF or weighted score fusion.
Sketch a crawl pipeline respecting robots.txt + rate limits.
Solve design-search-autocomplete-system — trie-based prefix lookup, the autocomplete engine.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Online Learning, Bandits, Counterfactual Evaluation

LLM Safety — Jailbreaks, Prompt Injection, Output Filtering, Red-Teaming