Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.f — Search Systems
Introduction
Search is one of the most frequently asked specialized subsystems in system design interviews. Whether you are designing Twitter search, product search for an e-commerce site, or a log search system, the core concepts are the same: inverted indexes, ranking, and distributed search architecture.
1. Full-Text Search vs Database Queries
+------------------------------------------------------------------------+
|  TRADITIONAL SQL QUERY              FULL-TEXT SEARCH ENGINE            |
|                                                                        |
|  SELECT * FROM products             Search: "running shoes for         |
|  WHERE name LIKE '%shoes%'                   flat feet"                |
|                                                                        |
|  Problems:                          Solutions:                         |
|  - No relevance ranking             - TF-IDF / BM25 ranking            |
|  - Full table scan (slow)           - Inverted index (O(1) lookup)     |
|  - No typo tolerance                - Fuzzy matching (Levenshtein)     |
|  - No synonym matching              - Synonym expansion                |
|  - Exact string match only          - Tokenization + stemming          |
|    "shoes" won't match "shoe"         "running" -> "run"               |
+------------------------------------------------------------------------+
2. Inverted Index
The fundamental data structure behind all search engines.
+------------------------------------------------------------------------+
| INVERTED INDEX |
| |
| DOCUMENTS: |
| Doc 1: "the quick brown fox" |
| Doc 2: "the quick rabbit" |
| Doc 3: "the brown cow" |
| |
| FORWARD INDEX (what we normally store): |
| Doc 1 -> [the, quick, brown, fox] |
| Doc 2 -> [the, quick, rabbit] |
| Doc 3 -> [the, brown, cow] |
| |
| INVERTED INDEX (optimized for search): |
| "brown" -> [Doc 1, Doc 3] |
| "cow" -> [Doc 3] |
| "fox" -> [Doc 1] |
| "quick" -> [Doc 1, Doc 2] |
| "rabbit" -> [Doc 2] |
| "the" -> [Doc 1, Doc 2, Doc 3] |
| |
| Search for "quick brown": |
| "quick" -> [Doc 1, Doc 2] |
| "brown" -> [Doc 1, Doc 3] |
| Intersection: [Doc 1] --> Result! |
+------------------------------------------------------------------------+
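The build-and-intersect flow in the box above can be sketched in a few lines of Python. This is a toy model, not a production index: real engines store sorted postings lists with positions and use skip pointers for fast intersection.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return documents containing ALL query terms (postings intersection)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    1: "the quick brown fox",
    2: "the quick rabbit",
    3: "the brown cow",
}
index = build_inverted_index(docs)
print(search(index, "quick brown"))  # {1}
```

Note that query time depends on postings-list sizes, not on the total number of documents — that is the whole point of inverting the index.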
Text Processing Pipeline (Indexing)
Raw Text: "The Quick Brown FOX jumped over the lazy dog!!"
|
v
1. TOKENIZE: ["The", "Quick", "Brown", "FOX", "jumped", "over",
"the", "lazy", "dog"]
|
v
2. LOWERCASE: ["the", "quick", "brown", "fox", "jumped", "over",
"the", "lazy", "dog"]
|
v
3. REMOVE STOP WORDS: ["quick", "brown", "fox", "jumped", "lazy", "dog"]
|
v
4. STEMMING: ["quick", "brown", "fox", "jump", "lazi", "dog"]
|
v
5. Store in INVERTED INDEX with document ID + position
Key operations:
| Operation | What It Does | Example |
|---|---|---|
| Tokenization | Split text into words | "New York City" -> ["New", "York", "City"] |
| Lowercasing | Normalize case | "iPhone" -> "iphone" |
| Stop word removal | Remove common words | "the", "is", "at" -> removed |
| Stemming | Reduce to root form | "running", "runs", "ran" -> "run" |
| Lemmatization | Reduce to dictionary form | "better" -> "good" (smarter than stemming) |
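The pipeline steps above can be sketched as a single `analyze` function. The stop-word list and the suffix-stripping stemmer here are deliberately crude stand-ins; a real analyzer would use a Porter or Snowball stemmer (which is why the pipeline diagram shows "lazy" -> "lazi").

```python
# Illustrative subset only; real stop-word lists have ~100+ entries.
STOP_WORDS = {"the", "a", "an", "is", "at", "over", "of"}

def stem(word):
    # Crude suffix stripping -- a real system would use Porter/Snowball.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Tokenize -> lowercase -> remove stop words -> stem."""
    tokens = [t.strip("!?.,") for t in text.split()]     # 1. tokenize
    tokens = [t.lower() for t in tokens if t]            # 2. lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop words
    return [stem(t) for t in tokens]                     # 4. stem

print(analyze("The Quick Brown FOX jumped over the lazy dog!!"))
```

The same analyzer must run at index time and at query time, or the query terms will never match the stored terms.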
3. Elasticsearch Basics
Elasticsearch is one of the most widely used search engines in production systems.
+------------------------------------------------------------------------+
| ELASTICSEARCH ARCHITECTURE |
| |
| INDEX (like a database table) |
| | |
| +-- SHARD 0 (primary) ---- SHARD 0 (replica) |
| | [Doc 1, Doc 4, Doc 7] |
| | |
| +-- SHARD 1 (primary) ---- SHARD 1 (replica) |
| | [Doc 2, Doc 5, Doc 8] |
| | |
| +-- SHARD 2 (primary) ---- SHARD 2 (replica) |
| [Doc 3, Doc 6, Doc 9] |
| |
| CLUSTER: |
| +----------+ +----------+ +----------+ |
| | Node 1 | | Node 2 | | Node 3 | |
| | Shard 0P | | Shard 1P | | Shard 2P | |
| | Shard 2R | | Shard 0R | | Shard 1R | |
| +----------+ +----------+ +----------+ |
| |
| P = Primary shard, R = Replica shard |
| Each shard is a self-contained Lucene index |
+------------------------------------------------------------------------+
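Document-to-shard placement in the diagram follows a simple routing rule: hash the routing key (the document ID by default) modulo the number of primary shards. A sketch, using `md5` for illustration where Elasticsearch actually uses murmur3:

```python
import hashlib

NUM_PRIMARY_SHARDS = 3  # fixed at index creation; changing it forces a reindex

def route(doc_id: str) -> int:
    """Pick the primary shard for a document by hashing its routing key."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NUM_PRIMARY_SHARDS

for doc in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    print(doc, "-> shard", route(doc))
```

This is why the primary shard count is fixed at index creation time: changing the modulus would re-map every existing document.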
Key Elasticsearch concepts:
| Concept | Analogy | Description |
|---|---|---|
| Index | Database table | Collection of documents with similar structure |
| Document | Row | A JSON object stored in an index |
| Shard | Partition | Horizontal split of an index across nodes |
| Replica | Read replica | Copy of a shard for HA and read throughput |
| Mapping | Schema | Defines field types (text, keyword, integer, date) |
| Analyzer | Text processor | Tokenizer + filters (lowercase, stemming, etc.) |
How a Search Query Works in ES
Client: GET /products/_search
{ "query": { "match": { "name": "running shoes" } } }
|
v
Coordinating Node (any node)
|
+-- Fan out to ALL shards containing "products" index
|
v
Shard 0: Search inverted index -> [Doc 4: score 2.3, Doc 7: score 1.8]
Shard 1: Search inverted index -> [Doc 2: score 3.1]
Shard 2: Search inverted index -> [Doc 6: score 1.5]
|
v
Coordinating Node: Merge + Sort by score
Result: [Doc 2 (3.1), Doc 4 (2.3), Doc 7 (1.8), Doc 6 (1.5)]
|
v
Fetch phase: Retrieve full documents for top N results
|
v
Return to client
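The coordinating node's merge step above is a scatter-gather top-N merge. A minimal sketch, using the same per-shard scores as the diagram:

```python
import heapq

# Per-shard hits from the query phase: (doc_id, score) pairs.
shard_results = {
    0: [(4, 2.3), (7, 1.8)],
    1: [(2, 3.1)],
    2: [(6, 1.5)],
}

def merge_top_n(shard_results, n):
    """Coordinating-node merge: combine all shard hits, keep top N by score."""
    all_hits = [hit for hits in shard_results.values() for hit in hits]
    return heapq.nlargest(n, all_hits, key=lambda hit: hit[1])

print(merge_top_n(shard_results, 3))  # [(2, 3.1), (4, 2.3), (7, 1.8)]
```

In the real query-then-fetch flow, each shard returns only IDs and scores in the query phase; full documents are fetched afterwards for just the merged top N, which keeps fan-out traffic small.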
4. Search Ranking
TF-IDF (Term Frequency - Inverse Document Frequency)
+------------------------------------------------------------------------+
| TF-IDF |
| |
| TF (Term Frequency): |
| How often does the term appear in THIS document? |
| TF("shoes", Doc1) = 3 occurrences / 100 words = 0.03 |
| |
| IDF (Inverse Document Frequency): |
| How rare is this term across ALL documents? |
|     IDF("shoes") = log10(1,000,000 total docs / 50,000 with "shoes")   |
|                  = log10(20) ≈ 1.3                                     |
| |
| TF-IDF = TF x IDF = 0.03 x 1.3 = 0.039 |
| |
| Intuition: |
| - High TF: term appears frequently in this doc (relevant) |
| - High IDF: term is rare across corpus (distinctive) |
| - "the" has high TF but very low IDF (not useful) |
| - "elasticsearch" has moderate TF and high IDF (very useful) |
+------------------------------------------------------------------------+
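The arithmetic in the box can be checked directly. A minimal sketch using base-10 log, as in the example (the log base only rescales scores, so any base works as long as it is consistent):

```python
import math

def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    """TF-IDF for one term in one document."""
    tf = term_count / doc_length                    # how often in THIS doc
    idf = math.log10(total_docs / docs_with_term)   # how rare across corpus
    return tf * idf

score = tf_idf(term_count=3, doc_length=100,
               total_docs=1_000_000, docs_with_term=50_000)
print(round(score, 3))  # 0.039, matching the box
```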
BM25 (Modern Standard)
BM25 is an improvement over TF-IDF and is the default ranking algorithm in Elasticsearch:
- Adds saturation: after a term appears N times, additional occurrences matter less
- Adds document length normalization: longer documents don't get unfair advantage
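Both improvements show up directly in the BM25 formula, sketched below with the standard defaults k1 = 1.2 (controls saturation) and b = 0.75 (controls length normalization). The corpus numbers reused from the TF-IDF example are illustrative:

```python
import math

def bm25_term(tf, doc_len, avg_doc_len, total_docs, docs_with_term,
              k1=1.2, b=0.75):
    """BM25 contribution of a single query term to a document's score."""
    idf = math.log(1 + (total_docs - docs_with_term + 0.5)
                       / (docs_with_term + 0.5))
    # Length normalization: long docs need more occurrences to score the same.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Saturation: going from 10 to 20 occurrences adds far less than 1 -> 10.
for tf in (1, 10, 20):
    print(tf, round(bm25_term(tf, doc_len=100, avg_doc_len=100,
                              total_docs=1_000_000, docs_with_term=50_000), 3))
```

Unlike raw TF-IDF, the score asymptotically approaches a ceiling of idf * (k1 + 1) no matter how many times the term repeats, so keyword stuffing stops paying off.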
Boosting and Custom Ranking
Score = text_relevance (BM25)
+ recency_boost (newer = higher)
+ popularity_boost (more likes = higher)
+ personalization (user preferences)
Example (e-commerce):
score = BM25_score * 0.5
+ (1 / days_since_listed) * 0.2
+ log(sales_count + 1) * 0.2
+ (user_brand_affinity) * 0.1
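The e-commerce formula above translates directly into a scoring function. The weights are the example's illustrative values, not tuned numbers; the `max(..., 1)` guard is an added assumption to avoid division by zero for same-day listings:

```python
import math

def final_score(bm25_score, days_since_listed, sales_count, brand_affinity):
    """Blend text relevance with recency, popularity, and personalization."""
    return (bm25_score * 0.5
            + (1 / max(days_since_listed, 1)) * 0.2   # recency decay
            + math.log(sales_count + 1) * 0.2         # popularity, dampened
            + brand_affinity * 0.1)                   # personalization

# Same text relevance: the fresh listing outranks the 90-day-old one.
print(final_score(3.0, days_since_listed=1,  sales_count=500, brand_affinity=0.8))
print(final_score(3.0, days_since_listed=90, sales_count=500, brand_affinity=0.8))
```

The `log(sales_count + 1)` term is the usual trick to keep a runaway bestseller from drowning out text relevance entirely.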
5. Autocomplete / Typeahead
+------------------------------------------------------------------------+
| AUTOCOMPLETE ARCHITECTURE |
| |
| User types: "run" |
| | |
| v |
| +------------------+ +------------------+ |
| | Client debounce | | Trie / Prefix | |
| | (200ms wait) |---->| Index | |
| +------------------+ +------------------+ |
| | |
| Matches for "run": |
| +----------------------+ |
| | running shoes (15K) | |
| | running watch (8K) | |
| | run tracker (5K) | |
| | runway fashion (2K) | |
| +----------------------+ |
| sorted by popularity |
| |
| IMPLEMENTATION OPTIONS: |
| +-------------------+----------------------------------------------+ |
| | Trie (in-memory) | Fast prefix lookup, high memory usage | |
| | Prefix queries | ES prefix query, moderate speed | |
| | Completion | ES completion suggester (FST-based, fastest) | |
| | Edge n-grams | Index "run", "runn", "runni", "runnin", etc. | |
| +-------------------+----------------------------------------------+ |
+------------------------------------------------------------------------+
Performance requirements:
- Latency: < 100ms (ideally < 50ms)
- Start after 2-3 characters typed
- Client-side debounce: 150-300ms
- Cache popular prefixes aggressively
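The in-memory trie option can be sketched as follows. To hit the latency targets above, each node precomputes its top-k completions at insert time so a lookup is just "walk the prefix, return the stored list" with no subtree traversal:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.completions = []  # top-k (query, popularity), kept sorted

class AutocompleteTrie:
    def __init__(self, k=4):
        self.root = TrieNode()
        self.k = k

    def insert(self, query, popularity):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Maintain a small popularity-sorted list at every prefix node.
            node.completions.append((query, popularity))
            node.completions.sort(key=lambda c: -c[1])
            del node.completions[self.k:]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.completions]

trie = AutocompleteTrie()
for q, pop in [("running shoes", 15000), ("running watch", 8000),
               ("run tracker", 5000), ("runway fashion", 2000)]:
    trie.insert(q, pop)
print(trie.suggest("run"))
```

The trade-off is exactly the one in the table: lookups are O(prefix length), but every query is duplicated into each of its prefix nodes, so memory grows quickly — which is why ES's FST-based completion suggester exists.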
6. Fuzzy Search (Typo Tolerance)
User types: "runnign shoes" (typo: "runnign")
|
v
Edit Distance (Levenshtein):
"runnign" -> "running" = 1 edit (swap g and n)
+--------------------------------------------------------------+
| EDIT OPERATIONS: |
| Insert: "runing" -> "running" (distance = 1) |
| Delete: "runningg" -> "running" (distance = 1) |
| Replace: "runnikg" -> "running" (distance = 1) |
| Transpose: "runnign" -> "running" (distance = 1) |
+--------------------------------------------------------------+
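Counting a transposition as one edit (as the box does, and as ES fuzziness does by default) means this is Damerau-Levenshtein distance rather than plain Levenshtein. A sketch of the standard dynamic-programming version:

```python
def edit_distance(a, b):
    """Damerau-Levenshtein (optimal string alignment): insert, delete,
    replace, and adjacent transpose each count as one edit."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[m][n]

print(edit_distance("runnign", "running"))  # 1 (transposed 'g' and 'n')
```

The DP table is O(len(a) * len(b)) per comparison, which is why search engines don't brute-force this against the whole vocabulary; they use automata or n-gram candidates to narrow the set of terms first.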
Elasticsearch fuzzy query:
{
"query": {
"match": {
"name": {
"query": "runnign shoes",
"fuzziness": "AUTO" // 0 edits for 1-2 chars
} // 1 edit for 3-5 chars
} // 2 edits for 6+ chars
}
}
7. Search Architecture (Full System)
+------------------------------------------------------------------------+
| SEARCH SYSTEM ARCHITECTURE |
| |
| DATA SOURCES INDEXING PIPELINE |
| +----------+ |
| | MySQL |--+ |
| +----------+ | +----------+ +------------+ +--------------+ |
| +----------+ +-->| Change |-->| Transform |-->| Elasticsearch| |
| | MongoDB |--+ | Data | | + Enrich | | Cluster | |
| +----------+ | | Capture | | | | | |
| +----------+ | | (CDC) | | - Tokenize | | Index A | |
| | Events |--+ | or | | - Stem | | Index B | |
| | (Kafka) | | Kafka | | - Enrich | | Index C | |
| +----------+ +----------+ +------------+ +------+-------+ |
| | |
| QUERY PATH | |
| +----------+ +----------+ +----------+ | |
| | Client |--->| Search |--->| Query |<-----------+ |
| | | | API | | Router | |
| +----------+ +-----+----+ +----------+ |
| | |
| v |
| +----------+ |
| | Result | - Re-rank |
| | Post- | - Filter (ACL, geo, price) |
| | Process | - Highlight matching text |
| +----------+ - Paginate |
+------------------------------------------------------------------------+
Key Components
| Component | Purpose |
|---|---|
| CDC / Event Stream | Capture database changes in real-time for near-instant index updates |
| Transform + Enrich | Clean text, add metadata (category, geo), compute features |
| Elasticsearch Cluster | Store inverted index, execute queries |
| Search API | Parse user query, add filters, call ES, post-process results |
| Query Router | Route to correct index/shard, fan-out for multi-index search |
Indexing Strategies
| Strategy | Latency | Use Case |
|---|---|---|
| Real-time (CDC) | Seconds | Product search (new listings appear fast) |
| Near-real-time (queue) | Minutes | News articles, social media posts |
| Batch (cron job) | Hours | Analytics dashboards, report search |
| Hybrid | Varies | Batch for full reindex + CDC for incremental |
8. Search in System Design Interviews
Example: Design Twitter Search
+------------------------------------------------------------------------+
| TWITTER SEARCH - HIGH LEVEL |
| |
| REQUIREMENTS: |
| - Search tweets by text content |
| - Return results ranked by relevance + recency |
| - Support hashtag search and user mentions |
| - Handle 500M tweets/day, search latency < 200ms |
| |
| INDEXING: |
| Tweet Created -> Kafka -> Indexer -> Elasticsearch |
| |
| INDEX DESIGN: |
| - Time-based indices: tweets-2024-03, tweets-2024-02 |
| - Hot/warm/cold: Recent = fast SSD, old = cheaper storage |
| - Sharded by time range (latest week on more shards) |
| |
| QUERY FLOW: |
| Search "election results" |
| |-> Parse query |
| |-> Search latest N time-based indices |
| |-> Merge results |
| |-> Re-rank: relevance * recency * engagement |
| |-> Return top 20 |
| |
| SPECIAL FEATURES: |
| - Hashtag search: exact match on keyword field |
| - Trending: aggregate hashtag counts in sliding window |
| - Typeahead: completion suggester on usernames + hashtags |
| - Filters: by date, from:user, has:media |
+------------------------------------------------------------------------+
Interview Checklist for Search Design
[ ] 1. What are we searching? (tweets, products, logs, docs)
[ ] 2. How big is the corpus? (millions vs billions)
[ ] 3. How fresh must results be? (real-time vs hourly)
[ ] 4. Indexing pipeline (CDC / Kafka / batch)
[ ] 5. Inverted index via Elasticsearch
[ ] 6. Ranking strategy (BM25 + custom signals)
[ ] 7. Autocomplete approach
[ ] 8. Fuzzy search / typo tolerance
[ ] 9. Sharding strategy (by document ID or time)
[ ] 10. Scaling (replicas for read throughput, shards for data volume)
Key Takeaways
- Inverted index is the core data structure. It maps terms to documents, enabling O(1) lookup by term.
- Text processing pipeline matters. Tokenize -> lowercase -> remove stop words -> stem/lemmatize.
- BM25 is the default ranking algorithm. Understand TF-IDF intuition and how BM25 improves on it.
- Elasticsearch is the de facto search engine. Know shards, replicas, and the query-then-fetch flow.
- Autocomplete needs special data structures. Tries, completion suggesters, or edge n-grams.
- Fuzzy search uses edit distance. Levenshtein distance with configurable fuzziness.
- Indexing pipeline is as important as querying. CDC or event streaming keeps the index fresh.
- Search ranking is multi-signal. Combine text relevance with recency, popularity, and personalization.
Explain-It Challenge
Design question: You are designing the search system for an e-commerce platform with 50 million products. Users search by product name, description, category, and brand.
Address:
- How do you build the inverted index?
- How do you rank results (relevance vs popularity vs recency)?
- How do you handle "red nike running shoes size 10" (multi-facet query)?
- How do you implement autocomplete that returns both products and categories?
- What is your sharding strategy for 50M documents?
Next -> 9.10-Exercise-Questions