Episode 9 — System Design / 9.11 — Real World System Design Problems

9.11.i Design a Search Engine (Google Search -- Simplified)

Problem Statement

Design a simplified web search engine that crawls the web, builds an index, ranks results, and serves search queries with autocomplete suggestions. Focus on the core infrastructure rather than advanced NLP.


1. Requirements

Functional Requirements

  • Crawl and index web pages (billions of pages)
  • Full-text search with relevance ranking
  • Autocomplete / search suggestions
  • Snippet generation (preview of matching content)
  • Image and video search (basic)
  • Spelling correction ("Did you mean...")
  • Pagination of results

Non-Functional Requirements

  • Index 15+ billion web pages
  • Search latency: < 500ms (p99)
  • Index freshness: popular pages updated within hours
  • Support 100,000 queries per second
  • High availability (99.99%)
  • Relevant results on the first page

2. Capacity Estimation

Crawling

Total pages to index:    15 billion
Average page size:       100 KB (HTML)
Total raw data:          15B * 100 KB = 1.5 PB
Crawl rate:              1 billion pages/day (re-crawl schedule)
Crawl bandwidth:         1B * 100 KB / 86,400 sec = 1.16 GB/sec
(Distributed across thousands of crawlers)

Index

Unique terms:            ~100 million
Average posting list:    10,000 documents per term
Index size:              100M terms * 10K * 8 bytes = 8 TB
With position data:      ~40 TB
Compressed:              ~10 TB

Serving

Queries per second:      100,000
Average query terms:     3 words
Index lookups per query: ~10 (including synonyms, corrections)
Latency budget:          500ms total
  - Network:             50ms
  - Index lookup:        100ms
  - Ranking:             200ms
  - Snippet generation:  100ms
  - Overhead:            50ms
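As a sanity check, these figures can be reproduced with a few lines of arithmetic (a sketch; the constants are the estimates above):

```python
# Back-of-envelope check of the capacity figures above.

PAGES_TOTAL = 15e9        # pages to index
PAGE_SIZE = 100e3         # 100 KB average HTML page
CRAWL_PER_DAY = 1e9       # re-crawl schedule

raw_storage_pb = PAGES_TOTAL * PAGE_SIZE / 1e15
crawl_gb_per_sec = CRAWL_PER_DAY * PAGE_SIZE / 86_400 / 1e9
index_tb = 100e6 * 10_000 * 8 / 1e12   # terms * avg postings * 8 bytes

print(f"raw storage:     {raw_storage_pb:.2f} PB")
print(f"crawl bandwidth: {crawl_gb_per_sec:.2f} GB/s")
print(f"base index:      {index_tb:.1f} TB")
```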

3. High-Level Architecture

                          +-------------------+
                          | Query Service     |
     User Query --------->| (API + Web UI)    |
                          +--------+----------+
                                   |
                    +--------------+--------------+
                    |              |              |
            +--------v------+ +-----v-----+ +------v--------+
            | Autocomplete  | | Spell     | | Query         |
            | Service       | | Correction| | Parser        |
            +---------------+ +----+------+ +-----+---------+
                                  |              |
                           +------v--------------v------+
                           | Search/Ranking Service     |
                           +------+---------------------+
                                  |
                    +-------------+-------------+
                    |             |             |
               +-----v-----+ +-----v-----+ +-----v-----+
               | Index     | | Index     | | Index     |
               | Shard 0   | | Shard 1   | | Shard N   |
               +-----------+ +-----------+ +-----------+

    ===== Offline Pipeline (Crawl + Index) =====

+-------------+     +---------------+     +----------------+
| URL         |---->| Crawler       |---->| HTML           |
| Frontier    |     | (Distributed) |     | Store          |
+-------------+     +---------------+     +--------+-------+
                                                   |
                                          +--------v-------+
                                          | Parser /       |
                                          | Extractor      |
                                          +--------+-------+
                                                   |
                                          +--------v-------+
                                          | Indexer        |
                                          | (Build Inverted|
                                          |  Index)        |
                                          +--------+-------+
                                                   |
                                          +--------v-------+
                                          | PageRank       |
                                          | Computation    |
                                          +----------------+

4. API Design

GET /api/v1/search?q=distributed+systems&page=1&limit=10
  Response 200: {
    "query": "distributed systems",
    "spelling_suggestion": null,
    "results_count": 2340000,
    "results": [
      {
        "url": "https://example.com/distributed-systems-guide",
        "title": "Complete Guide to Distributed Systems",
        "snippet": "...a <b>distributed system</b> is a collection of independent
                    computers that appear to users as a single coherent <b>system</b>...",
        "favicon": "https://example.com/favicon.ico",
        "cached_url": "https://cache.search.com/abc123",
        "rank_score": 0.94
      }
    ],
    "page": 1,
    "total_pages": 234000,
    "search_time_ms": 230
  }

GET /api/v1/autocomplete?q=distrib&limit=10
  Response 200: {
    "suggestions": [
      "distributed systems",
      "distributed computing",
      "distributed database",
      "distribution center near me",
      "distributed ledger technology"
    ]
  }

GET /api/v1/search/images?q=sunset+beach&page=1
  Response 200: {
    "images": [
      {
        "thumbnail_url": "...",
        "full_url": "...",
        "source_page": "...",
        "width": 1920,
        "height": 1080,
        "alt_text": "Sunset over tropical beach"
      }
    ]
  }

5. Database / Storage Schema

URL Frontier (Redis + PostgreSQL)

CREATE TABLE url_frontier (
    url_hash        VARCHAR(64) PRIMARY KEY,  -- SHA-256 of URL
    url             TEXT NOT NULL,
    domain          VARCHAR(255) NOT NULL,
    priority        INTEGER DEFAULT 5,        -- 1 (highest) to 10 (lowest)
    last_crawled    TIMESTAMP,
    next_crawl      TIMESTAMP,
    crawl_frequency INTERVAL DEFAULT '7 days',
    status          VARCHAR(20) DEFAULT 'pending',
    http_status     INTEGER,
    content_hash    VARCHAR(64),              -- detect unchanged pages
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_frontier_next ON url_frontier(next_crawl, priority)
    WHERE status = 'pending';
CREATE INDEX idx_frontier_domain ON url_frontier(domain);

Document Store (distributed object store)

Storage: S3 or HDFS
Path: /documents/{url_hash_prefix}/{url_hash}

Contents:
  - Raw HTML
  - Extracted text
  - Parsed metadata (title, description, links)
  - Crawl timestamp

Inverted Index Structure

Term -> [Posting List]

Posting List Entry:
{
  doc_id:      uint64,    // document identifier
  term_freq:   uint16,    // how many times term appears
  positions:   []uint32,  // positions in document
  field_flags: uint8      // which fields (title, body, URL, anchor)
}

Example:
  "distributed" -> [
    { doc_id: 101, tf: 15, positions: [3,45,102,...], fields: TITLE|BODY },
    { doc_id: 205, tf: 8,  positions: [12,67,...],    fields: BODY },
    { doc_id: 342, tf: 22, positions: [1,5,18,...],   fields: TITLE|BODY|URL },
    ...
  ]

Storage format: compressed posting lists (varint encoding)
  - DocID gaps stored (delta encoding): 101, 104, 137 -> 101, 3, 33
  - Positions delta-encoded within each posting
  - Compression ratio: ~4x
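A minimal sketch of this delta-plus-varint scheme (illustrative only, not a production codec):

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # set continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += varint_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decode_postings(data: bytes):
    """Inverse of encode_postings: varint-decode gaps, re-accumulate."""
    doc_ids, cur, shift, acc = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # last byte of this varint
            cur += acc
            doc_ids.append(cur)
            acc = shift = 0
    return doc_ids
```

The example gaps from above (101, 104, 137 -> 101, 3, 33) each fit in a single byte, so the three doc IDs compress to three bytes.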

PageRank Store

Key:    pagerank:{url_hash}
Value:  float64 (PageRank score)

Stored in: Redis for fast lookup during ranking
Computed: Offline via MapReduce/Spark (iterative algorithm)

6. Deep Dive: Web Crawler

Crawler Architecture

+-------------------+
| URL Frontier      |  (Priority queue of URLs to crawl)
+--------+----------+
         |
         | Dequeue batch (1000 URLs)
         v
+--------+----------+
| URL Filter        |  (robots.txt check, domain rate limit, dedup)
+--------+----------+
         |
         | Filtered URLs
         v
+--------+----------+       +------------------+
| Fetcher Pool      |------>| DNS Resolver     |
| (HTTP clients)    |       | (Cached)         |
+--------+----------+       +------------------+
         |
         | Raw HTML
         v
+--------+----------+
| Content Processor |
| - Extract text    |
| - Extract links   |
| - Detect language |
| - Compute hash    |
+--------+----------+
         |
    +----+----+
    |         |
    v         v
  Store    Enqueue
  Document new URLs

Politeness and Rate Limiting

Rules:
1. Obey robots.txt for every domain
2. Minimum 1-second delay between requests to same domain
3. Maximum 10 concurrent connections per domain
4. Identify as search bot in User-Agent header
5. Respect "noindex" and "nofollow" directives

Domain queue:
  Per-domain FIFO queue ensures requests are spaced out.
  Each domain has a "next_allowed_time" timestamp.
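A simplified version of this per-domain scheduler might look like the following (hypothetical class, reduced to the minimum-delay rule above):

```python
import time
from collections import defaultdict, deque

class PolitenessScheduler:
    """Per-domain FIFO queues with a minimum delay between fetches."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self.queues = defaultdict(deque)        # domain -> pending URLs
        self.next_allowed = defaultdict(float)  # domain -> earliest fetch time

    def enqueue(self, domain: str, url: str) -> None:
        self.queues[domain].append(url)

    def dequeue(self, now: float = None):
        """Return (domain, url) for a domain whose delay has elapsed."""
        now = time.time() if now is None else now
        for domain, queue in self.queues.items():
            if queue and now >= self.next_allowed[domain]:
                self.next_allowed[domain] = now + self.min_delay
                return domain, queue.popleft()
        return None   # every non-empty domain is still rate-limited
```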

Crawl Priority and Freshness

Priority factors:
  1. PageRank of the page (high PR = crawl more often)
  2. Change frequency (pages that change often = crawl more)
  3. Page type (news sites = hourly, static pages = weekly)
  4. Explicit priority boost (trending topics)

Crawl schedule:
  Tier 1 (top 1M pages):    Every 1-4 hours
  Tier 2 (top 100M pages):  Every 1-3 days
  Tier 3 (remaining):       Every 1-4 weeks
  Tier 4 (low quality):     Monthly or on-demand

Deduplication

Problem: Same content on multiple URLs.

Detection methods:
1. URL normalization: strip tracking params, normalize case
   "HTTP://Example.COM/page?utm_source=twitter"
   -> "https://example.com/page"

2. Content hashing (SimHash):
   - Compute fingerprint of page content
   - Similar pages (Hamming distance < 3) are near-duplicates
   - Keep the canonical URL, discard duplicates

3. Exact hash: SHA-256 of extracted text body
   - Detects exact copies across different URLs
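The SimHash fingerprint in method 2 can be sketched as follows (a toy version hashing whitespace tokens with MD5; production systems use tuned feature extraction and weighting):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash: similar documents yield nearby fingerprints."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    # Each bit of the fingerprint is the sign of the accumulated vector.
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages differing by one word produce fingerprints a few bits apart, while unrelated pages differ in roughly half the bits.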

7. Deep Dive: Indexing and Inverted Index

Index Building Pipeline

Raw Documents --> Tokenizer --> Linguistic Processing --> Index Builder

Tokenizer:
  Input:  "The quick-brown FOX jumped! over 3 lazy dogs."
  Output: ["the", "quick", "brown", "fox", "jumped", "over", "3", "lazy", "dogs"]

Linguistic Processing:
  1. Lowercasing:    "FOX" -> "fox"
  2. Stop word removal: remove "the", "over", "a", "is"
  3. Stemming:       "jumped" -> "jump", "dogs" -> "dog"
  4. Synonyms:       index both "quick" and "fast"

Index Builder:
  For each (term, document):
    Add document to term's posting list
    Record position, frequency, field
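The pipeline above, minus stemming and synonyms, can be sketched as (illustrative; the stop-word list is abbreviated):

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "is", "over"}

def tokenize(text: str):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs: dict):
    """Map doc_id -> text into term -> [(doc_id, term_freq, positions)]."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokenize(text)):
            if term not in STOP_WORDS:
                positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index
```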

Index Sharding Strategy

Option A: Document-based sharding
  Shard 0: documents 0 - 999,999
  Shard 1: documents 1,000,000 - 1,999,999
  ...
  
  Query hits ALL shards, each returns top-K, merge results.
  Pros: Each shard is a complete mini-index
  Cons: Every query fans out to all shards

Option B: Term-based sharding
  Shard 0: terms starting with a-d
  Shard 1: terms starting with e-h
  ...
  
  Query hits only shards for its terms (fewer shards).
  Pros: Less fan-out per query
  Cons: Hot terms cause hot shards (e.g., "the")

Choice: Document-based sharding (industry standard)
  - More predictable load distribution
  - Each shard handles any query independently
  - Scatter-gather pattern with merge

Index Serving

Query: "distributed systems tutorial"

1. Parse query: ["distributed", "systems", "tutorial"]
2. Fan out to all N index shards
3. Each shard:
   a. Look up posting list for each term
   b. Intersect posting lists (AND semantics)
   c. Score each matching document
   d. Return top 100 results
4. Merge results from all shards
5. Re-rank merged results
6. Return top 10 to user

   Shard 0          Shard 1          Shard 2
   [doc5: 0.9]      [doc12: 0.95]    [doc28: 0.88]
   [doc8: 0.85]     [doc15: 0.82]    [doc31: 0.86]
   ...               ...              ...
         \               |               /
          +---- Merge + Re-rank --------+
          |  [doc12: 0.95]              |
           |  [doc5: 0.9]                |
           |  [doc28: 0.88]              |
          |  ...                        |
          +-----------------------------+
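The per-shard intersection (step 3b) and the scatter-gather merge (step 4) can be sketched as (a toy version assuming sorted doc-ID posting lists and score-sorted shard results):

```python
import heapq

def intersect(list_a, list_b):
    """Merge-intersect two sorted doc-ID lists (AND semantics)."""
    out, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i]); i += 1; j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_shard_results(shard_results, k=10):
    """Scatter-gather merge: global top-k (doc, score) across shards.

    Each shard's list must already be sorted by descending score,
    so heapq.merge can stream them without re-sorting everything.
    """
    merged = heapq.merge(*shard_results, key=lambda r: r[1], reverse=True)
    return list(merged)[:k]
```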

8. Deep Dive: Ranking (PageRank + BM25)

BM25 Scoring (Text Relevance)

BM25(D, Q) = SUM over terms qi in Q:
  IDF(qi) * (tf(qi, D) * (k1 + 1)) / (tf(qi, D) + k1 * (1 - b + b * |D|/avgdl))

Where:
  tf(qi, D) = term frequency of qi in document D
  IDF(qi)   = log((N - n(qi) + 0.5) / (n(qi) + 0.5))
  N         = total number of documents
  n(qi)     = number of documents containing qi
  |D|       = length of document D (in words)
  avgdl     = average document length
  k1        = 1.2 (tuning parameter, term frequency saturation)
  b         = 0.75 (tuning parameter, length normalization)

Example:
  Query: "distributed systems"
  Document has "distributed" 5 times, "systems" 3 times
  Document length: 500 words, average: 400 words
  
  BM25 scores each document independently.
  Higher TF helps, but with diminishing returns (k1 controls this).
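The formula applies directly to the worked example (the document frequencies and corpus size below are assumptions added for illustration):

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 contribution of one query term, per the formula above."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Worked example: "distributed" x5, "systems" x3,
# document length 500 words vs. corpus average 400.
N = 1_000_000   # assumed corpus size
score = (bm25_term(tf=5, df=10_000, n_docs=N, doc_len=500, avg_len=400)
         + bm25_term(tf=3, df=50_000, n_docs=N, doc_len=500, avg_len=400))
```

Doubling a term frequency adds less than double the score, which is the saturation behavior k1 controls.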

PageRank (Link Authority)

PageRank algorithm (simplified):

PR(A) = (1 - d) / N + d * SUM(PR(Ti) / C(Ti))

Where:
  d    = damping factor (0.85)
  N    = total number of pages
  Ti   = pages that link to page A
  C(Ti)= number of outgoing links from page Ti

Intuition:
  - A page is important if important pages link to it
  - PageRank is the probability of landing on a page via random surfing
  - Computed iteratively until convergence (~40-50 iterations)

Computation:
  - MapReduce/Spark job over the entire web graph
  - Run weekly (or as link graph changes)
  - Store results in key-value store for fast lookup
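A toy version of the iterative computation on a three-page graph (a sketch; web-scale PageRank runs as a distributed job, as noted above):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank over {page: [outgoing links]}.

    Dangling pages (no outgoing links) redistribute their rank
    uniformly so the scores still sum to 1.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = pr[page] / len(outgoing)
                for target in outgoing:
                    new[target] += d * share
            else:                         # dangling node
                for target in pages:
                    new[target] += d * pr[page] / n
        pr = new
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Here "c" outranks "b" because two pages link to it while only one links to "b".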

Combined Ranking Score

final_score = w1 * BM25_score           // text relevance
            + w2 * pagerank_score        // authority
            + w3 * freshness_score       // recency
            + w4 * field_boost           // title match > body match
            + w5 * click_through_rate    // user behavior signal
            + w6 * domain_authority      // trusted domains
            + w7 * query_doc_embedding   // semantic similarity

Weights (approximate):
  w1 = 0.35  (text relevance is primary)
  w2 = 0.20  (authority matters)
  w3 = 0.10  (freshness for time-sensitive queries)
  w4 = 0.15  (title matches are strong signal)
  w5 = 0.10  (CTR as engagement signal)
  w6 = 0.05  (domain reputation)
  w7 = 0.05  (semantic matching)

9. Deep Dive: Autocomplete

Trie-Based Autocomplete

Trie structure for search suggestions:

                    root
                   / | \
                  d  s   ...
                 /   |
                i    y
               /     |
              s      s
             /       |
            t        t
           /         |
          r          e
         /           |
        i            m
       / \
      b   c  <- "distributed" (score: 85)
              <- "district"    (score: 42)

Each node stores:
  - Top-K completions with scores (precomputed)
  - Score = search frequency * recency weight

Redis-Based Implementation

ZSET per prefix:
  autocomplete:d        -> {("distributed", 85), ("disney", 78), ...}
  autocomplete:di       -> {("distributed", 85), ("disney", 78), ...}
  autocomplete:dis      -> {("distributed", 85), ("disney", 42), ...}
  autocomplete:dist     -> {("distributed", 85), ("district", 42), ...}

Query "dist" -> ZREVRANGE autocomplete:dist 0 9
Returns top 10 suggestions by score.

Update:
  When user searches for "distributed systems":
    ZINCRBY autocomplete:d "distributed systems" 1
    ZINCRBY autocomplete:di "distributed systems" 1
    ... (for each prefix)
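The ZSET-per-prefix scheme can be simulated in memory (a sketch; in production each score map would be a Redis sorted set updated with ZINCRBY and read with ZREVRANGE):

```python
from collections import defaultdict

class PrefixSuggester:
    """One score map per prefix, mirroring the Redis ZSET layout above."""

    def __init__(self, max_prefix_len: int = 10):
        self.max_prefix_len = max_prefix_len
        self.prefixes = defaultdict(dict)   # prefix -> {query: score}

    def record_search(self, query: str, weight: int = 1) -> None:
        """ZINCRBY equivalent on every prefix of the query."""
        for end in range(1, min(len(query), self.max_prefix_len) + 1):
            scores = self.prefixes[query[:end]]
            scores[query] = scores.get(query, 0) + weight

    def suggest(self, prefix: str, limit: int = 10):
        """ZREVRANGE equivalent: top suggestions, highest score first."""
        scores = self.prefixes.get(prefix, {})
        return sorted(scores, key=scores.get, reverse=True)[:limit]
```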

Trending and Personalized Suggestions

Base suggestions:     Global popularity (updated hourly)
Trending boost:       Topics trending in last hour get 2x score
Personalization:      User's recent searches weighted higher
Location:             "restaurants" suggests local results

Final score = global_score * 0.5 + trending_boost * 0.2 
            + personal_score * 0.2 + location_score * 0.1

10. Scaling Considerations

Index Distribution

15 billion documents / 10,000 docs per shard = 1.5 million shards
(In practice, shards are larger: ~1M docs per shard = 15,000 shards)

Each shard: ~1 GB index data
Total index: ~15 TB (compressed)
Replicated 3x: ~45 TB

Shard servers: 500 servers, each holding ~90 shard replicas (45,000 total)
Each server: 128 GB RAM, 2 TB NVMe SSD
Most-accessed index data cached in RAM.

Query Fan-Out Optimization

Problem: Querying 15,000 shards per query = massive fan-out.

Solution: Two-level sharding

Level 1: 50 clusters (by document quality/PageRank)
  - Cluster 0: Top 1% of pages (highest PageRank)
  - Cluster 1: Next 5%
  - ...
  
Level 2: Shards within each cluster

Query processing:
  1. Always search Cluster 0 (top pages)
  2. If enough high-quality results -> return
  3. Otherwise expand to Cluster 1, 2, etc.

Result: Most queries only hit top 2-3 clusters.
Fan-out reduced from 15,000 to ~300 shards.
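The expanding search over quality-ordered clusters can be sketched as (hypothetical helper; clusters are passed as search callables, and min_score is an assumed quality threshold):

```python
def tiered_search(query, clusters, min_results=10, min_score=0.5):
    """Search quality-ordered clusters, stopping early once enough
    high-scoring results have been collected."""
    results = []
    for search_cluster in clusters:
        results.extend(search_cluster(query))
        good = [r for r in results if r[1] >= min_score]
        if len(good) >= min_results:
            break                          # skip lower-quality clusters
    return sorted(results, key=lambda r: -r[1])[:min_results]
```

If the top cluster already yields a full page of good results, the lower tiers are never queried, which is what cuts the fan-out.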

Caching

Query result cache (Redis):
  Key:   query_cache:{normalized_query_hash}
  Value: Serialized search results
  TTL:   1 hour (5 minutes for trending queries)
  
  Cache hit rate: ~30-40% (many queries are repeated)
  "weather", "news", "stock prices" = very high cache hit

Index cache (in-memory):
  Hot posting lists cached in RAM
  LRU eviction, ~80 GB per index server
  Cache hit rate on posting lists: ~90%

11. Key Tradeoffs

Decision              Option A          Option B               Our Choice
--------------------------------------------------------------------------
Index sharding        Term-based        Document-based         Document-based
Crawl strategy        BFS               Priority-based         Priority-based
Ranking               BM25 only         BM25 + PageRank + ML   Combined
Freshness             Real-time index   Batch re-index         Hybrid
Autocomplete storage  Trie in memory    Redis sorted sets      Redis
Deduplication         Exact hash        SimHash (near-dedup)   Both
Snippet generation    At index time     At query time          Query time

12. Failure Scenarios and Mitigations

Scenario                          Mitigation
------------------------------------------------------------------------
Index shard unavailable           3x replication; query routes to replica
Crawler blacklisted by site       Rotate IPs; respect robots.txt; back off
Stale index (crawler lag)         Priority crawling for popular/fresh pages
Autocomplete Redis failure        Fall back to pre-computed static suggestions
Query overload                    Result cache + query deduplication
Spam pages in index               Spam classifier filters at index time
PageRank manipulation (link farms) Detect and penalize unnatural link patterns

Key Takeaways

  1. The inverted index is the fundamental data structure -- understanding posting lists, TF-IDF, and BM25 is essential.
  2. PageRank provides authority signal independent of query terms -- it is computed offline and combined with text relevance at query time.
  3. Document-based sharding with scatter-gather is the practical choice -- term-based sharding creates hot spots on common terms.
  4. Crawl prioritization determines index freshness -- crawl important and frequently-changing pages first with limited resources.
  5. Autocomplete is a separate subsystem optimized for prefix matching -- it uses pre-computed suggestions, not real-time index queries.