Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11.i Design a Search Engine (Google Search -- Simplified)
Problem Statement
Design a simplified web search engine that crawls the web, builds an index, ranks results, and serves search queries with autocomplete suggestions. Focus on the core infrastructure rather than advanced NLP.
1. Requirements
Functional Requirements
- Crawl and index web pages (billions of pages)
- Full-text search with relevance ranking
- Autocomplete / search suggestions
- Snippet generation (preview of matching content)
- Image and video search (basic)
- Spelling correction ("Did you mean...")
- Pagination of results
Non-Functional Requirements
- Index 15+ billion web pages
- Search latency: < 500ms (p99)
- Index freshness: popular pages updated within hours
- Support 100,000 queries per second
- High availability (99.99%)
- Relevant results on the first page
2. Capacity Estimation
Crawling
Total pages to index: 15 billion
Average page size: 100 KB (HTML)
Total raw data: 15B * 100 KB = 1.5 PB
Crawl rate: 1 billion pages/day (re-crawl schedule)
Crawl bandwidth: 1B * 100 KB / 86,400 sec ≈ 1.16 GB/sec (~100 TB/day)
(Distributed across thousands of crawlers)
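The crawl figures above are easy to sanity-check with back-of-envelope arithmetic:

```python
# Sanity check of the crawl-rate estimate: 1B pages/day at 100 KB each.
pages_per_day = 1_000_000_000
page_size_kb = 100
seconds_per_day = 86_400

total_per_day_tb = pages_per_day * page_size_kb / 1e9        # KB -> TB
bandwidth_gb_s = pages_per_day * page_size_kb / seconds_per_day / 1e6  # KB/s -> GB/s

print(f"{total_per_day_tb:.0f} TB/day, {bandwidth_gb_s:.2f} GB/s")
```

Roughly 100 TB of raw HTML per day, or about 1.16 GB/sec of sustained fetch bandwidth spread across the crawler fleet.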
Index
Unique terms: ~100 million
Average posting list: 10,000 documents per term
Index size: 100M terms * 10K * 8 bytes = 8 TB
With position data: ~40 TB
Compressed: ~10-15 TB
Serving
Queries per second: 100,000
Average query terms: 3 words
Index lookups per query: ~10 (including synonyms, corrections)
Latency budget: 500ms total
- Network: 50ms
- Index lookup: 100ms
- Ranking: 200ms
- Snippet generation: 100ms
- Overhead: 50ms
3. High-Level Architecture
+-------------------+
| Query Service |
User Query --------->| (API + Web UI) |
+--------+----------+
|
+--------------+--------------+
| | |
+--------v------+ +----v-------+ +---v-----------+
| Autocomplete | | Spell | | Query |
| Service | | Correction | | Parser |
+---------------+ +----+-------+ +---+-----------+
| |
+------v--------------v------+
| Search/Ranking Service |
+------+---------------------+
|
+-------------+-------------+
| | |
+-----v-----+ +---v-------+ +---v--------+
| Index | | Index | | Index |
| Shard 0 | | Shard 1 | | Shard N |
+------------+ +-----------+ +------------+
===== Offline Pipeline (Crawl + Index) =====
+-------------+ +---------------+ +----------------+
| URL |---->| Crawler |---->| HTML |
| Frontier | | (Distributed) | | Store |
+-------------+ +---------------+ +--------+-------+
|
+--------v-------+
| Parser / |
| Extractor |
+--------+-------+
|
+--------v-------+
| Indexer |
| (Build Inverted|
| Index) |
+--------+-------+
|
+--------v-------+
| PageRank |
| Computation |
+----------------+
4. API Design
GET /api/v1/search?q=distributed+systems&page=1&limit=10
Response 200: {
"query": "distributed systems",
"spelling_suggestion": null,
"results_count": 2340000,
"results": [
{
"url": "https://example.com/distributed-systems-guide",
"title": "Complete Guide to Distributed Systems",
"snippet": "...a <b>distributed system</b> is a collection of independent
computers that appear to users as a single coherent <b>system</b>...",
"favicon": "https://example.com/favicon.ico",
"cached_url": "https://cache.search.com/abc123",
"rank_score": 0.94
}
],
"page": 1,
"total_pages": 234000,
"search_time_ms": 230
}
GET /api/v1/autocomplete?q=distrib&limit=10
Response 200: {
"suggestions": [
"distributed systems",
"distributed computing",
"distributed database",
"distribution center near me",
"distributed ledger technology"
]
}
GET /api/v1/search/images?q=sunset+beach&page=1
Response 200: {
"images": [
{
"thumbnail_url": "...",
"full_url": "...",
"source_page": "...",
"width": 1920,
"height": 1080,
"alt_text": "Sunset over tropical beach"
}
]
}
5. Database / Storage Schema
URL Frontier (Redis + PostgreSQL)
CREATE TABLE url_frontier (
url_hash VARCHAR(64) PRIMARY KEY, -- SHA-256 of URL
url TEXT NOT NULL,
domain VARCHAR(255) NOT NULL,
priority INTEGER DEFAULT 5, -- 1 (highest) to 10 (lowest)
last_crawled TIMESTAMP,
next_crawl TIMESTAMP,
crawl_frequency INTERVAL DEFAULT '7 days',
status VARCHAR(20) DEFAULT 'pending',
http_status INTEGER,
content_hash VARCHAR(64), -- detect unchanged pages
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_frontier_next ON url_frontier(next_crawl, priority)
WHERE status = 'pending';
CREATE INDEX idx_frontier_domain ON url_frontier(domain);
Document Store (distributed object store)
Storage: S3 or HDFS
Path: /documents/{url_hash_prefix}/{url_hash}
Contents:
- Raw HTML
- Extracted text
- Parsed metadata (title, description, links)
- Crawl timestamp
Inverted Index Structure
Term -> [Posting List]
Posting List Entry:
{
doc_id: uint64, // document identifier
term_freq: uint16, // how many times term appears
positions: []uint32, // positions in document
field_flags: uint8 // which fields (title, body, URL, anchor)
}
Example:
"distributed" -> [
{ doc_id: 101, tf: 15, positions: [3,45,102,...], fields: TITLE|BODY },
{ doc_id: 205, tf: 8, positions: [12,67,...], fields: BODY },
{ doc_id: 342, tf: 22, positions: [1,5,18,...], fields: TITLE|BODY|URL },
...
]
Storage format: compressed posting lists (varint encoding)
- DocID gaps stored (delta encoding): 101, 104, 137 -> 101, 3, 33
- Positions delta-encoded within each posting
- Compression ratio: ~4x
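The delta + varint scheme above can be sketched in a few lines. This is a minimal illustration (production indexes use block-based formats like PForDelta on top of this idea):

```python
# Delta + varint compression of a sorted posting list's doc IDs,
# matching the gap example above (101, 104, 137 -> gaps 101, 3, 33).

def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a variable-length byte sequence
    (7 payload bits per byte, high bit = continuation)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def compress_doc_ids(doc_ids: list[int]) -> bytes:
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decompress_doc_ids(data: bytes) -> list[int]:
    """Inverse of compress_doc_ids: decode varints, re-accumulate gaps."""
    ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            ids.append(prev)
            current, shift = 0, 0
    return ids
```

Since gaps within a posting list are small, most gaps fit in a single byte instead of 8, which is where the ~4x compression comes from.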
PageRank Store
Key: pagerank:{url_hash}
Value: float64 (PageRank score)
Stored in: Redis for fast lookup during ranking
Computed: Offline via MapReduce/Spark (iterative algorithm)
6. Deep Dive: Web Crawler
Crawler Architecture
+-------------------+
| URL Frontier | (Priority queue of URLs to crawl)
+--------+----------+
|
| Dequeue batch (1000 URLs)
v
+--------+----------+
| URL Filter | (robots.txt check, domain rate limit, dedup)
+--------+----------+
|
| Filtered URLs
v
+--------+----------+ +------------------+
| Fetcher Pool |------>| DNS Resolver |
| (HTTP clients) | | (Cached) |
+--------+----------+ +------------------+
|
| Raw HTML
v
+--------+----------+
| Content Processor |
| - Extract text |
| - Extract links |
| - Detect language |
| - Compute hash |
+--------+----------+
|
+----+----+
| |
v v
Store Enqueue
Document new URLs
Politeness and Rate Limiting
Rules:
1. Obey robots.txt for every domain
2. Minimum 1-second delay between requests to same domain
3. Maximum 10 concurrent connections per domain
4. Identify as search bot in User-Agent header
5. Respect "noindex" and "nofollow" directives
Domain queue:
Per-domain FIFO queue ensures requests are spaced out.
Each domain has a "next_allowed_time" timestamp.
Crawl Priority and Freshness
Priority factors:
1. PageRank of the page (high PR = crawl more often)
2. Change frequency (pages that change often = crawl more)
3. Page type (news sites = hourly, static pages = weekly)
4. Explicit priority boost (trending topics)
Crawl schedule:
Tier 1 (top 1M pages): Every 1-4 hours
Tier 2 (top 100M pages): Every 1-3 days
Tier 3 (remaining): Every 1-4 weeks
Tier 4 (low quality): Monthly or on-demand
Deduplication
Problem: Same content on multiple URLs.
Detection methods:
1. URL normalization: strip tracking params, normalize case
"HTTP://Example.COM/page?utm_source=twitter"
-> "https://example.com/page"
2. Content hashing (SimHash):
- Compute fingerprint of page content
   - Similar pages (Hamming distance < 3) are near-duplicates
- Keep the canonical URL, discard duplicates
3. Exact hash: SHA-256 of extracted text body
- Detects exact copies across different URLs
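The SimHash mechanics can be sketched as below. This is a simplified illustration: real systems hash weighted shingles rather than raw whitespace tokens, but the majority-vote fingerprint construction is the same:

```python
# SimHash sketch: each token votes +1/-1 on every bit of a 64-bit
# fingerprint; similar documents produce fingerprints that are close
# in Hamming distance.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash of the token.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 iff the vote for that bit is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")
```

Changing a few tokens flips only a few votes, so near-duplicate pages land within a small Hamming distance, while unrelated pages differ in roughly half the bits.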
7. Deep Dive: Indexing and Inverted Index
Index Building Pipeline
Raw Documents --> Tokenizer --> Linguistic Processing --> Index Builder
Tokenizer:
Input: "The quick-brown FOX jumped! over 3 lazy dogs."
Output: ["the", "quick", "brown", "fox", "jumped", "over", "3", "lazy", "dogs"]
Linguistic Processing:
1. Lowercasing: "FOX" -> "fox"
2. Stop word removal: remove "the", "over", "a", "is"
3. Stemming: "jumped" -> "jump", "dogs" -> "dog"
4. Synonyms: index both "quick" and "fast"
Index Builder:
For each (term, document):
Add document to term's posting list
Record position, frequency, field
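The whole pipeline above fits in a short sketch (a toy builder without field flags or stemming; the stop-word list is illustrative):

```python
# Toy index builder: tokenize, drop stop words, then append
# (doc_id, positions) entries to each term's posting list.
from collections import defaultdict
import re

STOP_WORDS = {"the", "a", "is", "over"}

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, list[int]]]]:
    index: dict[str, list] = defaultdict(list)
    for doc_id, text in sorted(docs.items()):  # sorted -> posting lists sorted by doc_id
        positions: dict[str, list[int]] = defaultdict(list)
        for pos, token in enumerate(tokenize(text)):
            if token not in STOP_WORDS:
                positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)
```

Because documents are processed in doc_id order, each posting list comes out sorted, which is what makes the delta encoding and fast intersection described elsewhere in this section possible.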
Index Sharding Strategy
Option A: Document-based sharding
Shard 0: documents 0 - 999,999
Shard 1: documents 1,000,000 - 1,999,999
...
Query hits ALL shards, each returns top-K, merge results.
Pros: Each shard is a complete mini-index
Cons: Every query fans out to all shards
Option B: Term-based sharding
Shard 0: terms starting with a-d
Shard 1: terms starting with e-h
...
Query hits only shards for its terms (fewer shards).
Pros: Less fan-out per query
Cons: Hot terms cause hot shards (e.g., "the")
Choice: Document-based sharding (industry standard)
- More predictable load distribution
- Each shard handles any query independently
- Scatter-gather pattern with merge
Index Serving
Query: "distributed systems tutorial"
1. Parse query: ["distributed", "systems", "tutorial"]
2. Fan out to all N index shards
3. Each shard:
a. Look up posting list for each term
b. Intersect posting lists (AND semantics)
c. Score each matching document
d. Return top 100 results
4. Merge results from all shards
5. Re-rank merged results
6. Return top 10 to user
Shard 0 Shard 1 Shard 2
[doc5: 0.9] [doc12: 0.95] [doc28: 0.88]
[doc8: 0.85] [doc15: 0.82] [doc31: 0.86]
... ... ...
\ | /
+---- Merge + Re-rank --------+
| [doc12: 0.95] |
| [doc5: 0.9] |
| [doc28: 0.88] |
| ... |
+-----------------------------+
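The merge step in the diagram reduces to "global top-N across per-shard top-K lists." A minimal sketch (names illustrative, not a production API):

```python
# Scatter-gather merge: each shard returns its local top-K as
# (doc_id, score) pairs; the merger keeps the global top-N by score.
import heapq
from itertools import chain

def merge_shard_results(shard_results: list[list[tuple[int, float]]],
                        top_n: int = 10) -> list[tuple[int, float]]:
    return heapq.nlargest(top_n,
                          chain.from_iterable(shard_results),
                          key=lambda hit: hit[1])
```

With N shards each returning K results, the merge handles only N*K candidates (e.g. 15,000 * 100), never the full posting lists, which keeps the coordinator cheap.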
8. Deep Dive: Ranking (PageRank + BM25)
BM25 Scoring (Text Relevance)
BM25(D, Q) = SUM over terms qi in Q:
IDF(qi) * (tf(qi, D) * (k1 + 1)) / (tf(qi, D) + k1 * (1 - b + b * |D|/avgdl))
Where:
tf(qi, D) = term frequency of qi in document D
IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5))
N = total number of documents
n(qi) = number of documents containing qi
|D| = length of document D (in words)
avgdl = average document length
k1 = 1.2 (tuning parameter, term frequency saturation)
b = 0.75 (tuning parameter, length normalization)
Example:
Query: "distributed systems"
Document has "distributed" 5 times, "systems" 3 times
Document length: 500 words, average: 400 words
BM25 scores each document independently.
Higher TF helps, but with diminishing returns (k1 controls this).
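The formula transcribes directly to code. This is a minimal single-field scorer; production rankers layer field weights and other tweaks on top:

```python
# BM25 scorer, a direct transcription of the formula above.
import math

def bm25_score(query_terms: list[str],
               doc_tf: dict[str, int],   # term -> frequency in this document
               doc_len: int,             # |D|, document length in words
               avgdl: float,             # average document length
               doc_freq: dict[str, int], # term -> n(qi), docs containing term
               n_docs: int,              # N, total documents
               k1: float = 1.2,
               b: float = 0.75) -> float:
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n_q = doc_freq[term]
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score
```

Note the saturation: going from tf=1 to tf=5 raises the score, but going from tf=50 to tf=54 barely moves it, which is exactly the diminishing-returns behavior k1 controls.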
PageRank (Link Authority)
PageRank algorithm (simplified):
PR(A) = (1 - d) / N + d * SUM(PR(Ti) / C(Ti))
Where:
d = damping factor (0.85)
N = total number of pages
Ti = pages that link to page A
C(Ti)= number of outgoing links from page Ti
Intuition:
- A page is important if important pages link to it
- PageRank is the probability of landing on a page via random surfing
- Computed iteratively until convergence (~40-50 iterations)
Computation:
- MapReduce/Spark job over the entire web graph
- Run weekly (or as link graph changes)
- Store results in key-value store for fast lookup
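The iterative computation is short enough to sketch on a toy graph (a real run is a distributed MapReduce/Spark job over billions of edges, but the per-iteration update is the same formula):

```python
# Iterative PageRank with d = 0.85. Dangling nodes (no outlinks)
# distribute their rank uniformly across all pages.
def pagerank(graph: dict[str, list[str]],
             d: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    nodes = list(graph)
    n = len(nodes)
    pr = {node: 1.0 / n for node in nodes}       # uniform start
    for _ in range(iterations):
        new_pr = {node: (1 - d) / n for node in nodes}  # teleport term
        for node, outlinks in graph.items():
            if outlinks:
                share = pr[node] / len(outlinks)  # PR(Ti) / C(Ti)
                for target in outlinks:
                    new_pr[target] += d * share
            else:
                for target in nodes:              # dangling: spread evenly
                    new_pr[target] += d * pr[node] / n
        pr = new_pr
    return pr
```

Total rank stays at 1.0 every iteration, preserving the random-surfer probability interpretation.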
Combined Ranking Score
final_score = w1 * BM25_score // text relevance
+ w2 * pagerank_score // authority
+ w3 * freshness_score // recency
+ w4 * field_boost // title match > body match
+ w5 * click_through_rate // user behavior signal
+ w6 * domain_authority // trusted domains
+ w7 * query_doc_embedding // semantic similarity
Weights (approximate):
w1 = 0.35 (text relevance is primary)
w2 = 0.20 (authority matters)
w3 = 0.10 (freshness for time-sensitive queries)
w4 = 0.15 (title matches are strong signal)
w5 = 0.10 (CTR as engagement signal)
w6 = 0.05 (domain reputation)
w7 = 0.05 (semantic matching)
9. Deep Dive: Autocomplete
Trie-Based Autocomplete
Trie structure for search suggestions:
root
/ | \
d s ...
/ |
i y
/ |
s s
/ |
t t
/ |
r e
/ |
i m
               / \
              b   c      b: completes "distributed" (score: 85)
                         c: completes "district"    (score: 42)
Each node stores:
- Top-K completions with scores (precomputed)
- Score = search frequency * recency weight
Redis-Based Implementation
ZSET per prefix:
autocomplete:d -> {("distributed", 85), ("disney", 78), ...}
autocomplete:di -> {("distributed", 85), ("disney", 78), ...}
autocomplete:dis -> {("distributed", 85), ("disney", 78), ...}
autocomplete:dist -> {("distributed", 85), ("district", 42), ...}
Query "dist" -> ZREVRANGE autocomplete:dist 0 9
Returns top 10 suggestions by score.
Update:
When user searches for "distributed systems":
ZINCRBY autocomplete:d "distributed systems" 1
ZINCRBY autocomplete:di "distributed systems" 1
... (for each prefix)
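The prefix-scoreboard scheme can be simulated in pure Python (here plain dicts stand in for Redis ZSETs; in production the same operations map to ZINCRBY and ZREVRANGE):

```python
# Prefix-based autocomplete: one scoreboard per prefix, incremented on
# every search, queried by descending score for suggestions.
from collections import defaultdict

class Autocomplete:
    def __init__(self, max_prefix_len: int = 20):
        self.max_prefix_len = max_prefix_len
        self.prefixes: dict[str, dict[str, float]] = defaultdict(dict)

    def record_search(self, query: str, weight: float = 1.0) -> None:
        """Equivalent of ZINCRBY on every prefix of the query."""
        q = query.lower()
        for i in range(1, min(len(q), self.max_prefix_len) + 1):
            board = self.prefixes[q[:i]]
            board[q] = board.get(q, 0.0) + weight

    def suggest(self, prefix: str, limit: int = 10) -> list[str]:
        """Equivalent of ZREVRANGE prefix 0 limit-1 (top scores first)."""
        board = self.prefixes.get(prefix.lower(), {})
        return [q for q, _ in sorted(board.items(), key=lambda kv: -kv[1])[:limit]]
```

Capping the indexed prefix length bounds the write amplification per search, at the cost of not completing very long queries character-by-character past that point.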
Trending and Personalized Suggestions
Base suggestions: Global popularity (updated hourly)
Trending boost: Topics trending in last hour get 2x score
Personalization: User's recent searches weighted higher
Location: "restaurants" suggests local results
Final score = global_score * 0.5 + trending_boost * 0.2
+ personal_score * 0.2 + location_score * 0.1
10. Scaling Considerations
Index Distribution
15 billion documents / 10,000 docs per shard = 1.5 million shards
(In practice, shards are larger: ~1M docs per shard = 15,000 shards)
Each shard: ~1 GB index data
Total index: ~15 TB (compressed)
Replicated 3x: ~45 TB
Shard servers: 500 servers, each holding ~90 shard replicas
(45,000 replicas / 500 servers)
Each server: 128 GB RAM, 2 TB NVMe SSD
Most-accessed index data cached in RAM.
Query Fan-Out Optimization
Problem: Querying 15,000 shards per query = massive fan-out.
Solution: Two-level sharding
Level 1: 50 clusters (by document quality/PageRank)
- Cluster 0: Top 1% of pages (highest PageRank)
- Cluster 1: Next 5%
- ...
Level 2: Shards within each cluster
Query processing:
1. Always search Cluster 0 (top pages)
2. If enough high-quality results -> return
3. Otherwise expand to Cluster 1, 2, etc.
Result: Most queries only hit top 2-3 clusters.
Fan-out reduced from 15,000 to ~300 shards.
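The early-termination logic above can be sketched as below; search_cluster is an assumed shard-query callback supplied by the caller, not a real API:

```python
# Tiered query expansion: search the highest-quality cluster first and
# only expand to lower tiers when results are scarce.
def tiered_search(query: str,
                  clusters: list[int],           # ordered best-first
                  search_cluster,                # (cluster, query) -> [(doc, score)]
                  min_results: int = 10) -> list[tuple[str, float]]:
    results: list[tuple[str, float]] = []
    for cluster in clusters:
        results.extend(search_cluster(cluster, query))
        if len(results) >= min_results:
            break  # enough high-quality hits; skip the remaining tiers
    return sorted(results, key=lambda hit: -hit[1])[:min_results]
```

For head queries the loop exits after the first tier or two, which is what cuts the fan-out from 15,000 shards to a few hundred.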
Caching
Query result cache (Redis):
Key: query_cache:{normalized_query_hash}
Value: Serialized search results
TTL: 1 hour (5 minutes for trending queries)
Cache hit rate: ~30-40% (many queries are repeated)
"weather", "news", "stock prices" = very high cache hit
Index cache (in-memory):
Hot posting lists cached in RAM
LRU eviction, ~80 GB per index server
Cache hit rate on posting lists: ~90%
11. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Index sharding | Term-based | Document-based | Document-based |
| Crawl strategy | BFS | Priority-based | Priority |
| Ranking | BM25 only | BM25 + PageRank + ML | Combined |
| Freshness | Real-time index | Batch re-index | Hybrid |
| Autocomplete storage | Trie in memory | Redis sorted sets | Redis |
| Deduplication | Exact hash | SimHash (near-dedup) | Both |
| Snippet generation | At index time | At query time | Query time |
12. Failure Scenarios and Mitigations
Scenario Mitigation
------------------------------------------------------------------------
Index shard unavailable 3x replication; query routes to replica
Crawler blacklisted by site Rotate IPs; respect robots.txt; back off
Stale index (crawler lag) Priority crawling for popular/fresh pages
Autocomplete Redis failure Fall back to pre-computed static suggestions
Query overload Result cache + query deduplication
Spam pages in index Spam classifier filters at index time
PageRank manipulation (link farms) Detect and penalize unnatural link patterns
Key Takeaways
- The inverted index is the fundamental data structure -- understanding posting lists, TF-IDF, and BM25 is essential.
- PageRank provides authority signal independent of query terms -- it is computed offline and combined with text relevance at query time.
- Document-based sharding with scatter-gather is the practical choice -- term-based sharding creates hot spots on common terms.
- Crawl prioritization determines index freshness -- crawl important and frequently-changing pages first with limited resources.
- Autocomplete is a separate subsystem optimized for prefix matching -- it uses pre-computed suggestions, not real-time index queries.