Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.f — Search Systems
Introduction
Search is one of the most frequently asked specialized subsystems in system design interviews. Whether you are designing Twitter search, product search for an e-commerce site, or a log search system, the core concepts are the same: inverted indexes, ranking, and distributed search architecture.
1. Full-Text Search vs Database Queries
+------------------------------------------------------------------------+
|  TRADITIONAL SQL QUERY              FULL-TEXT SEARCH ENGINE            |
|                                                                        |
|  SELECT * FROM products             Search: "running shoes for         |
|  WHERE name LIKE '%shoes%'                   flat feet"                |
|                                                                        |
|  Problems:                          Solutions:                         |
|  - No relevance ranking             - TF-IDF / BM25 ranking            |
|  - Full table scan (slow)           - Inverted index (O(1) lookup)     |
|  - No typo tolerance                - Fuzzy matching (Levenshtein)     |
|  - No synonym matching              - Synonym expansion                |
|  - Exact string match only          - Tokenization + stemming          |
|    "shoes" won't match "shoe"         "running" -> "run"               |
+------------------------------------------------------------------------+
2. Inverted Index
The fundamental data structure behind all search engines.
+------------------------------------------------------------------------+
| INVERTED INDEX |
| |
| DOCUMENTS: |
| Doc 1: "the quick brown fox" |
| Doc 2: "the quick rabbit" |
| Doc 3: "the brown cow" |
| |
| FORWARD INDEX (what we normally store): |
| Doc 1 -> [the, quick, brown, fox] |
| Doc 2 -> [the, quick, rabbit] |
| Doc 3 -> [the, brown, cow] |
| |
| INVERTED INDEX (optimized for search): |
| "brown" -> [Doc 1, Doc 3] |
| "cow" -> [Doc 3] |
| "fox" -> [Doc 1] |
| "quick" -> [Doc 1, Doc 2] |
| "rabbit" -> [Doc 2] |
| "the" -> [Doc 1, Doc 2, Doc 3] |
| |
| Search for "quick brown": |
| "quick" -> [Doc 1, Doc 2] |
| "brown" -> [Doc 1, Doc 3] |
| Intersection: [Doc 1] --> Result! |
+------------------------------------------------------------------------+
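The build-and-intersect flow in the box above can be sketched in a few lines of Python. This is a toy model, not a production index: real engines store sorted postings lists with positions and use skip pointers for fast intersection.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return documents containing ALL query terms (postings intersection)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    1: "the quick brown fox",
    2: "the quick rabbit",
    3: "the brown cow",
}
index = build_inverted_index(docs)
print(search(index, "quick brown"))  # {1}
```

Note that query time depends on postings-list sizes, not on the total number of documents — that is the whole point of inverting the index.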
Text Processing Pipeline (Indexing)
Raw Text: "The Quick Brown FOX jumped over the lazy dog!!"
|
v
1. TOKENIZE: ["The", "Quick", "Brown", "FOX", "jumped", "over",
"the", "lazy", "dog"]
|
v
2. LOWERCASE: ["the", "quick", "brown", "fox", "jumped", "over",
"the", "lazy", "dog"]
|
v
3. REMOVE STOP WORDS: ["quick", "brown", "fox", "jumped", "lazy", "dog"]
|
v
4. STEMMING: ["quick", "brown", "fox", "jump", "lazi", "dog"]
|
v
5. Store in INVERTED INDEX with document ID + position
Key operations:
| Operation | What It Does | Example |
|---|---|---|
| Tokenization | Split text into words | "New York City" -> ["New", "York", "City"] |
| Lowercasing | Normalize case | "iPhone" -> "iphone" |
| Stop word removal | Remove common words | "the", "is", "at" -> removed |
| Stemming | Reduce to root form | "running", "runs", "ran" -> "run" |
| Lemmatization | Reduce to dictionary form | "better" -> "good" (smarter than stemming) |
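The pipeline steps above can be sketched as a single `analyze` function. The stop-word list and the suffix-stripping stemmer here are deliberately crude stand-ins; a real analyzer would use a Porter or Snowball stemmer (which is why the pipeline diagram shows "lazy" -> "lazi").

```python
# Illustrative subset only; real stop-word lists have ~100+ entries.
STOP_WORDS = {"the", "a", "an", "is", "at", "over", "of"}

def stem(word):
    # Crude suffix stripping -- a real system would use Porter/Snowball.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Tokenize -> lowercase -> remove stop words -> stem."""
    tokens = [t.strip("!?.,") for t in text.split()]     # 1. tokenize
    tokens = [t.lower() for t in tokens if t]            # 2. lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop words
    return [stem(t) for t in tokens]                     # 4. stem

print(analyze("The Quick Brown FOX jumped over the lazy dog!!"))
```

The same analyzer must run at index time and at query time, or the query terms will never match the stored terms.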
3. Elasticsearch Basics
Elasticsearch is one of the most widely used search engines in production systems.
+------------------------------------------------------------------------+
| ELASTICSEARCH ARCHITECTURE |
| |
| INDEX (like a database table) |
| | |
| +-- SHARD 0 (primary) ---- SHARD 0 (replica) |
| | [Doc 1, Doc 4, Doc 7] |
| | |
| +-- SHARD 1 (primary) ---- SHARD 1 (replica) |
| | [Doc 2, Doc 5, Doc 8] |
| | |
| +-- SHARD 2 (primary) ---- SHARD 2 (replica) |
| [Doc 3, Doc 6, Doc 9] |
| |
| CLUSTER: |
| +----------+ +----------+ +----------+ |
| | Node 1 | | Node 2 | | Node 3 | |
| | Shard 0P | | Shard 1P | | Shard 2P | |
| | Shard 2R | | Shard 0R | | Shard 1R | |
| +----------+ +----------+ +----------+ |
| |
| P = Primary shard, R = Replica shard |
| Each shard is a self-contained Lucene index |
+------------------------------------------------------------------------+
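Document-to-shard placement in the diagram follows a simple routing rule: hash the routing key (the document ID by default) modulo the number of primary shards. A sketch, using `md5` for illustration where Elasticsearch actually uses murmur3:

```python
import hashlib

NUM_PRIMARY_SHARDS = 3  # fixed at index creation; changing it forces a reindex

def route(doc_id: str) -> int:
    """Pick the primary shard for a document by hashing its routing key."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NUM_PRIMARY_SHARDS

for doc in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    print(doc, "-> shard", route(doc))
```

This is why the primary shard count is fixed at index creation time: changing the modulus would re-map every existing document.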
Key Elasticsearch concepts:
| Concept | Analogy | Description |
|---|---|---|
| Index | Database table | Collection of documents with similar structure |
| Document | Row | A JSON object stored in an index |
| Shard | Partition | Horizontal split of an index across nodes |
| Replica | Read replica | Copy of a shard for HA and read throughput |
| Mapping | Schema | Defines field types (text, keyword, integer, date) |
| Analyzer | Text processor | Tokenizer + filters (lowercase, stemming, etc.) |
How a Search Query Works in ES
Client: GET /products/_search
{ "query": { "match": { "name": "running shoes" } } }
|
v
Coordinating Node (any node)
|
+-- Fan out to ALL shards containing "products" index
|
v
Shard 0: Search inverted index -> [Doc 4: score 2.3, Doc 7: score 1.8]
Shard 1: Search inverted index -> [Doc 2: score 3.1]
Shard 2: Search inverted index -> [Doc 6: score 1.5]
|
v
Coordinating Node: Merge + Sort by score
Result: [Doc 2 (3.1), Doc 4 (2.3), Doc 7 (1.8), Doc 6 (1.5)]
|
v
Fetch phase: Retrieve full documents for top N results
|
v
Return to client
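The coordinating node's merge step above is a scatter-gather top-N merge. A minimal sketch, using the same per-shard scores as the diagram:

```python
import heapq

# Per-shard hits from the query phase: (doc_id, score) pairs.
shard_results = {
    0: [(4, 2.3), (7, 1.8)],
    1: [(2, 3.1)],
    2: [(6, 1.5)],
}

def merge_top_n(shard_results, n):
    """Coordinating-node merge: combine all shard hits, keep top N by score."""
    all_hits = [hit for hits in shard_results.values() for hit in hits]
    return heapq.nlargest(n, all_hits, key=lambda hit: hit[1])

print(merge_top_n(shard_results, 3))  # [(2, 3.1), (4, 2.3), (7, 1.8)]
```

In the real query-then-fetch flow, each shard returns only IDs and scores in the query phase; full documents are fetched afterwards for just the merged top N, which keeps fan-out traffic small.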
4. Search Ranking
TF-IDF (Term Frequency - Inverse Document Frequency)
+------------------------------------------------------------------------+
| TF-IDF |
| |
| TF (Term Frequency): |
| How often does the term appear in THIS document? |
| TF("shoes", Doc1) = 3 occurrences / 100 words = 0.03 |
| |
| IDF (Inverse Document Frequency): |
| How rare is this term across ALL documents? |
|     IDF("shoes") = log10(1,000,000 total docs / 50,000 with "shoes")   |
|                  = log10(20) ≈ 1.3                                     |
| |
| TF-IDF = TF x IDF = 0.03 x 1.3 = 0.039 |
| |
| Intuition: |
| - High TF: term appears frequently in this doc (relevant) |
| - High IDF: term is rare across corpus (distinctive) |
| - "the" has high TF but very low IDF (not useful) |
| - "elasticsearch" has moderate TF and high IDF (very useful) |
+------------------------------------------------------------------------+
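The arithmetic in the box can be checked directly. A minimal sketch using base-10 log, as in the example (the log base only rescales scores, so any base works as long as it is consistent):

```python
import math

def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    """TF-IDF for one term in one document."""
    tf = term_count / doc_length                    # how often in THIS doc
    idf = math.log10(total_docs / docs_with_term)   # how rare across corpus
    return tf * idf

score = tf_idf(term_count=3, doc_length=100,
               total_docs=1_000_000, docs_with_term=50_000)
print(round(score, 3))  # 0.039, matching the box
```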
BM25 (Modern Standard)
BM25 is an improvement over TF-IDF and is the default ranking algorithm in Elasticsearch:
- Adds saturation: after a term appears N times, additional occurrences matter less
- Adds document length normalization: longer documents don't get unfair advantage
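Both improvements show up directly in the BM25 formula, sketched below with the standard defaults k1 = 1.2 (controls saturation) and b = 0.75 (controls length normalization). The corpus numbers reused from the TF-IDF example are illustrative:

```python
import math

def bm25_term(tf, doc_len, avg_doc_len, total_docs, docs_with_term,
              k1=1.2, b=0.75):
    """BM25 contribution of a single query term to a document's score."""
    idf = math.log(1 + (total_docs - docs_with_term + 0.5)
                       / (docs_with_term + 0.5))
    # Length normalization: long docs need more occurrences to score the same.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Saturation: going from 10 to 20 occurrences adds far less than 1 -> 10.
for tf in (1, 10, 20):
    print(tf, round(bm25_term(tf, doc_len=100, avg_doc_len=100,
                              total_docs=1_000_000, docs_with_term=50_000), 3))
```

Unlike raw TF-IDF, the score asymptotically approaches a ceiling of idf * (k1 + 1) no matter how many times the term repeats, so keyword stuffing stops paying off.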
Boosting and Custom Ranking
Score = text_relevance (BM25)
+ recency_boost (newer = higher)
+ popularity_boost (more likes = higher)
+ personalization (user preferences)
Example (e-commerce):
score = BM25_score * 0.5
+ (1 / days_since_listed) * 0.2
+ log(sales_count + 1) * 0.2
+ (user_brand_affinity) * 0.1
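The e-commerce formula above translates directly into a scoring function. The weights are the example's illustrative values, not tuned numbers; the `max(..., 1)` guard is an added assumption to avoid division by zero for same-day listings:

```python
import math

def final_score(bm25_score, days_since_listed, sales_count, brand_affinity):
    """Blend text relevance with recency, popularity, and personalization."""
    return (bm25_score * 0.5
            + (1 / max(days_since_listed, 1)) * 0.2   # recency decay
            + math.log(sales_count + 1) * 0.2         # popularity, dampened
            + brand_affinity * 0.1)                   # personalization

# Same text relevance: the fresh listing outranks the 90-day-old one.
print(final_score(3.0, days_since_listed=1,  sales_count=500, brand_affinity=0.8))
print(final_score(3.0, days_since_listed=90, sales_count=500, brand_affinity=0.8))
```

The `log(sales_count + 1)` term is the usual trick to keep a runaway bestseller from drowning out text relevance entirely.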
5. Autocomplete / Typeahead
+------------------------------------------------------------------------+
| AUTOCOMPLETE ARCHITECTURE |
| |
| User types: "run" |
| | |
| v |
| +------------------+ +------------------+ |
| | Client debounce | | Trie / Prefix | |
| | (200ms wait) |---->| Index | |
| +------------------+ +------------------+ |
| | |
| Matches for "run": |
| +----------------------+ |
| | running shoes (15K) | |
| | running watch (8K) | |
| | run tracker (5K) | |
| | runway fashion (2K) | |
| +----------------------+ |
| sorted by popularity |
| |
| IMPLEMENTATION OPTIONS: |
| +-------------------+----------------------------------------------+ |
| | Trie (in-memory) | Fast prefix lookup, high memory usage | |
| | Prefix queries | ES prefix query, moderate speed | |
| | Completion | ES completion suggester (FST-based, fastest) | |
| | Edge n-grams | Index "run", "runn", "runni", "runnin", etc. | |
| +-------------------+----------------------------------------------+ |
+------------------------------------------------------------------------+
Performance requirements:
- Latency: < 100ms (ideally < 50ms)
- Start after 2-3 characters typed
- Client-side debounce: 150-300ms
- Cache popular prefixes aggressively
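The in-memory trie option can be sketched as follows. To hit the latency targets above, each node precomputes its top-k completions at insert time so a lookup is just "walk the prefix, return the stored list" with no subtree traversal:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.completions = []  # top-k (query, popularity), kept sorted

class AutocompleteTrie:
    def __init__(self, k=4):
        self.root = TrieNode()
        self.k = k

    def insert(self, query, popularity):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Maintain a small popularity-sorted list at every prefix node.
            node.completions.append((query, popularity))
            node.completions.sort(key=lambda c: -c[1])
            del node.completions[self.k:]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.completions]

trie = AutocompleteTrie()
for q, pop in [("running shoes", 15000), ("running watch", 8000),
               ("run tracker", 5000), ("runway fashion", 2000)]:
    trie.insert(q, pop)
print(trie.suggest("run"))
```

The trade-off is exactly the one in the table: lookups are O(prefix length), but every query is duplicated into each of its prefix nodes, so memory grows quickly — which is why ES's FST-based completion suggester exists.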
6. Fuzzy Search (Typo Tolerance)
User types: "runnign shoes" (typo: "runnign")
|
v
Edit Distance (Levenshtein):
"runnign" -> "running" = 1 edit (swap g and n)
+--------------------------------------------------------------+
| EDIT OPERATIONS: |
| Insert: "runing" -> "running" (distance = 1) |
| Delete: "runningg" -> "running" (distance = 1) |
| Replace: "runnikg" -> "running" (distance = 1) |
| Transpose: "runnign" -> "running" (distance = 1) |
+--------------------------------------------------------------+
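Counting a transposition as one edit (as the box does, and as ES fuzziness does by default) means this is Damerau-Levenshtein distance rather than plain Levenshtein. A sketch of the standard dynamic-programming version:

```python
def edit_distance(a, b):
    """Damerau-Levenshtein (optimal string alignment): insert, delete,
    replace, and adjacent transpose each count as one edit."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[m][n]

print(edit_distance("runnign", "running"))  # 1 (transposed 'g' and 'n')
```

The DP table is O(len(a) * len(b)) per comparison, which is why search engines don't brute-force this against the whole vocabulary; they use automata or n-gram candidates to narrow the set of terms first.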
Elasticsearch fuzzy query:
{
"query": {
"match": {
"name": {
"query": "runnign shoes",
"fuzziness": "AUTO" // 0 edits for 1-2 chars
} // 1 edit for 3-5 chars
} // 2 edits for 6+ chars
}
}
7. Search Architecture (Full System)
+------------------------------------------------------------------------+
| SEARCH SYSTEM ARCHITECTURE |
| |
| DATA SOURCES INDEXING PIPELINE |
| +----------+ |
| | MySQL |--+ |
| +----------+ | +----------+ +------------+ +--------------+ |
| +----------+ +-->| Change |-->| Transform |-->| Elasticsearch| |
| | MongoDB |--+ | Data | | + Enrich | | Cluster | |
| +----------+ | | Capture | | | | | |
| +----------+ | | (CDC) | | - Tokenize | | Index A | |
| | Events |--+ | or | | - Stem | | Index B | |
| | (Kafka) | | Kafka | | - Enrich | | Index C | |
| +----------+ +----------+ +------------+ +------+-------+ |
| | |
| QUERY PATH | |
| +----------+ +----------+ +----------+ | |
| | Client |--->| Search |--->| Query |<-----------+ |
| | | | API | | Router | |
| +----------+ +-----+----+ +----------+ |
| | |
| v |
| +----------+ |
| | Result | - Re-rank |
| | Post- | - Filter (ACL, geo, price) |
| | Process | - Highlight matching text |
| +----------+ - Paginate |
+------------------------------------------------------------------------+
Key Components
| Component | Purpose |
|---|---|
| CDC / Event Stream | Capture database changes in real-time for near-instant index updates |
| Transform + Enrich | Clean text, add metadata (category, geo), compute features |
| Elasticsearch Cluster | Store inverted index, execute queries |
| Search API | Parse user query, add filters, call ES, post-process results |
| Query Router | Route to correct index/shard, fan-out for multi-index search |
Indexing Strategies
| Strategy | Latency | Use Case |
|---|---|---|
| Real-time (CDC) | Seconds | Product search (new listings appear fast) |
| Near-real-time (queue) | Minutes | News articles, social media posts |
| Batch (cron job) | Hours | Analytics dashboards, report search |
| Hybrid | Varies | Batch for full reindex + CDC for incremental |
8. Search in System Design Interviews
Example: Design Twitter Search
+------------------------------------------------------------------------+
| TWITTER SEARCH - HIGH LEVEL |
| |
| REQUIREMENTS: |
| - Search tweets by text content |
| - Return results ranked by relevance + recency |
| - Support hashtag search and user mentions |
| - Handle 500M tweets/day, search latency < 200ms |
| |
| INDEXING: |
| Tweet Created -> Kafka -> Indexer -> Elasticsearch |
| |
| INDEX DESIGN: |
| - Time-based indices: tweets-2024-03, tweets-2024-02 |
| - Hot/warm/cold: Recent = fast SSD, old = cheaper storage |
| - Sharded by time range (latest week on more shards) |
| |
| QUERY FLOW: |
| Search "election results" |
| |-> Parse query |
| |-> Search latest N time-based indices |
| |-> Merge results |
| |-> Re-rank: relevance * recency * engagement |
| |-> Return top 20 |
| |
| SPECIAL FEATURES: |
| - Hashtag search: exact match on keyword field |
| - Trending: aggregate hashtag counts in sliding window |
| - Typeahead: completion suggester on usernames + hashtags |
| - Filters: by date, from:user, has:media |
+------------------------------------------------------------------------+
Interview Checklist for Search Design
[ ] 1. What are we searching? (tweets, products, logs, docs)
[ ] 2. How big is the corpus? (millions vs billions)
[ ] 3. How fresh must results be? (real-time vs hourly)
[ ] 4. Indexing pipeline (CDC / Kafka / batch)
[ ] 5. Inverted index via Elasticsearch
[ ] 6. Ranking strategy (BM25 + custom signals)
[ ] 7. Autocomplete approach
[ ] 8. Fuzzy search / typo tolerance
[ ] 9. Sharding strategy (by document ID or time)
[ ] 10. Scaling (replicas for read throughput, shards for data volume)
Key Takeaways
- Inverted index is the core data structure. It maps terms to documents, enabling O(1) lookup by term.
- Text processing pipeline matters. Tokenize -> lowercase -> remove stop words -> stem/lemmatize.
- BM25 is the default ranking algorithm. Understand TF-IDF intuition and how BM25 improves on it.
- Elasticsearch is the de facto search engine. Know shards, replicas, and the query-then-fetch flow.
- Autocomplete needs special data structures. Tries, completion suggesters, or edge n-grams.
- Fuzzy search uses edit distance. Levenshtein distance with configurable fuzziness.
- Indexing pipeline is as important as querying. CDC or event streaming keeps the index fresh.
- Search ranking is multi-signal. Combine text relevance with recency, popularity, and personalization.
Explain-It Challenge
Design question: You are designing the search system for an e-commerce platform with 50 million products. Users search by product name, description, category, and brand.
Address:
- How do you build the inverted index?
- How do you rank results (relevance vs popularity vs recency)?
- How do you handle "red nike running shoes size 10" (multi-facet query)?
- How do you implement autocomplete that returns both products and categories?
- What is your sharding strategy for 50M documents?
Next -> 9.10-Exercise-Questions