9.11 Quick Revision -- Cheat Sheet for All 12 System Designs
Use this file for last-minute review before interviews. Each design is condensed
to its essential components, key decisions, and common pitfalls.
Comparison Table: All 12 Systems at a Glance
| # | System | Read/Write | Consistency | Real-time? | Core Pattern |
|---|---|---|---|---|---|
| 9.11.a | URL Shortener | Read-heavy (100:1) | Eventual | No | Key-value lookup + cache |
| 9.11.b | Rate Limiter | Read-heavy | Strong (approx) | Yes | Atomic counters + sliding window |
| 9.11.c | Chat System | Write-heavy | Eventual | Yes | WebSocket + pub/sub |
| 9.11.d | Social Media Feed | Read-heavy | Eventual | Near-RT | Fan-out (hybrid push/pull) |
| 9.11.e | Video Streaming | Read-heavy | Eventual | Yes (live) | CDN + adaptive bitrate |
| 9.11.f | E-Commerce | Mixed | Strong (orders) | No | Microservices + SAGA |
| 9.11.g | Notification Sys. | Write-heavy | At-least-once | Yes | Priority queues + fan-out |
| 9.11.h | File Storage | Mixed | Eventual (sync) | Near-RT | Chunking + dedup + sync |
| 9.11.i | Search Engine | Read-heavy | Eventual | No | Inverted index + ranking |
| 9.11.j | Ride Sharing | Write-heavy (location) | Strong (trips) | Yes | Geospatial index + matching |
| 9.11.k | Payment System | Write-heavy | Strong (ACID) | No | Idempotency + double-entry ledger |
| 9.11.l | Monitoring System | Write-heavy | Eventual | Yes | Time-series DB + alerting |
Detailed Dimension Comparison
Data Volume and Throughput
| System | Write Rate | Read Rate | Storage (yearly) |
|---|---|---|---|
| URL Shortener | 40/sec | 4,000/sec | 3.6 TB |
| Rate Limiter | 1M checks/sec | 1M checks/sec | 2 GB (in-memory) |
| Chat System | 230K msgs/sec | Variable | ~4 TB |
| Social Media Feed | 5,800 posts/sec | 300K feed loads/sec | ~20 TB |
| Video Streaming | 500 hrs/min upload | 50M concurrent views | 400+ PB |
| E-Commerce | 115 orders/sec | 5,800 searches/sec | ~50 TB |
| Notification Sys. | 116K notifs/sec | N/A (push) | ~10 TB |
| File Storage | 1,150 uploads/sec | Variable | 50 PB |
| Search Engine | Crawl: batch | 100K queries/sec | 10 TB (index) |
| Ride Sharing | 1.25M loc/sec | 500K geo queries/sec | ~970 TB (loc.) |
| Payment System | 580 TPS | 2,900 API calls/sec | ~36 TB |
| Monitoring System | 10M points/sec | 5K dashboard q/sec | ~81 TB (compressed) |
Primary Database Choices
| System | Primary DB | Cache/Fast Store | Analytics Store |
|---|---|---|---|
| URL Shortener | DynamoDB / Cassandra | Redis | ClickHouse |
| Rate Limiter | N/A | Redis (Lua scripts) | N/A |
| Chat System | Cassandra | Redis (sessions) | N/A |
| Social Media Feed | Cassandra | Redis (feeds, graph) | N/A |
| Video Streaming | PostgreSQL + S3 | CDN | ClickHouse |
| E-Commerce | PostgreSQL | Redis (cart, catalog) | Elasticsearch |
| Notification Sys. | PostgreSQL | Redis (dedup) | N/A |
| File Storage | PostgreSQL (metadata) | Redis (sync state) | N/A |
| Search Engine | Custom (inverted index) | N/A | N/A |
| Ride Sharing | PostgreSQL (trips) | Redis GEO (location) | Druid |
| Payment System | PostgreSQL (ACID) | Redis (idempotency) | N/A |
| Monitoring System | Custom TSDB | In-memory head block | Elasticsearch (logs) |
Per-System Summaries
9.11.a URL Shortener (TinyURL)
Summary: Read-heavy key-value service with caching and analytics pipeline.
Components: Pre-Generated Key Service, Redis cache, NoSQL DB, Kafka analytics
Key Decisions:
- Pre-generated keys over hash-based generation (no collisions)
- 302 redirect over 301 (for analytics accuracy)
- Async analytics via Kafka (keeps redirect fast)
Pitfalls:
- A 301 redirect is cached by the browser, so repeat clicks never reach the server and analytics undercount
- Hash collisions under high volume without collision detection
- Not caching hot URLs wastes DB capacity on 80/20 workload
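The pre-generated key decision above can be sketched as base62 encoding of a monotonic counter; the 7-character width and batch API here are illustrative assumptions, not details from the original design:

```python
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def to_base62(n, width=7):
    """Encode a counter value as a fixed-width base62 key."""
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(BASE62[r])
    return "".join(reversed(chars)).rjust(width, "0")

def pregenerate_keys(start, count):
    """Key Generation Service sketch: mint a batch of unique keys offline
    so the write path never has to detect hash collisions."""
    return [to_base62(i) for i in range(start, start + count)]
```

Because every counter value maps to a distinct key, uniqueness is guaranteed by construction rather than checked at write time.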
9.11.b Rate Limiter
Summary: Distributed atomic counter system protecting APIs from abuse.
Components: Redis cluster, Lua scripts, API gateway integration
Key Decisions:
- Sliding window counter over fixed window (avoids 2x burst at boundary)
- Redis Lua for atomicity (check + decrement in one round trip)
- Fail OPEN on Redis failure (never block all traffic)
Pitfalls:
- Fixed window allows 2x burst at window boundary
- Fail CLOSED on limiter failure takes down the entire service
- Not using Lua scripts means race conditions between check and decrement
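The sliding window counter can be sketched in a single process; in production the same check-and-increment would live in a Redis Lua script so it stays atomic across instances (the in-memory dict below is a stand-in for Redis):

```python
from collections import defaultdict

class SlidingWindowCounter:
    """Approximate sliding window: count the current fixed window in full,
    plus the previous window weighted by how much of it still overlaps
    the sliding window -- this is what avoids the 2x boundary burst."""

    def __init__(self, limit, window_sec=60.0):
        self.limit = limit
        self.window = window_sec
        self.counts = defaultdict(int)  # (key, window index) -> count

    def allow(self, key, now):
        cur = int(now // self.window)
        overlap = 1.0 - (now % self.window) / self.window
        estimate = self.counts[(key, cur)] + self.counts[(key, cur - 1)] * overlap
        if estimate >= self.limit:
            return False
        self.counts[(key, cur)] += 1
        return True
```

A real deployment would also expire old window keys (Redis TTL) and fail open if the store is unreachable, per the design above.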
9.11.c Chat System (WhatsApp)
Summary: Real-time bidirectional messaging with presence and group support.
Components: WebSocket servers, Redis (sessions/presence), Cassandra (messages), Kafka
Key Decisions:
- WebSocket for real-time, REST for message history
- At-least-once delivery with client-side deduplication
- Lazy presence (fetch on demand, not broadcast to all contacts)
Pitfalls:
- Broadcasting presence to all contacts is prohibitively expensive
- Not handling message ordering across partitions causes out-of-order display
- Large group fan-out without pub/sub overwhelms individual servers
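At-least-once delivery with client-side dedup reduces to the receiving client keeping a set of already-seen message IDs; a minimal sketch (class and field names are illustrative):

```python
class DedupingInbox:
    """Client-side dedup: the server may redeliver a message on retry,
    so the client silently drops IDs it has already processed."""

    def __init__(self):
        self.seen = set()
        self.messages = []

    def receive(self, msg_id, body):
        if msg_id in self.seen:
            return False  # duplicate redelivery; ignore
        self.seen.add(msg_id)
        self.messages.append(body)
        return True
```

In practice the seen-set would be bounded (e.g. per-conversation, with a retention window) rather than growing forever.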
9.11.d Social Media Feed (Twitter)
Summary: Timeline generation using hybrid fan-out with ML-based ranking.
Components: Fan-out service, Feed cache (Redis), Ranking ML model, Cassandra
Key Decisions:
- Hybrid fan-out: push for regular users, pull for celebrities (> 10K followers)
- ML-ranked feed over purely chronological
- Cursor-based pagination (not offset) for real-time insertions
Pitfalls:
- Pure push fails for users with 50M+ followers (33-minute write storm)
- Offset pagination breaks when new posts are inserted during scroll
- Not handling the celebrity problem means one viral post can overwhelm writes
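A minimal sketch of the hybrid fan-out decision and the read-path merge, assuming posts carry a `ts` timestamp and using the 10K-follower threshold from the design:

```python
CELEBRITY_THRESHOLD = 10_000  # pull path for accounts above this

def fanout_on_write(follower_count):
    """Hybrid fan-out: push posts into followers' cached feeds for
    regular users; celebrity posts are merged in at read time instead."""
    return follower_count <= CELEBRITY_THRESHOLD

def build_feed(pushed_feed, followed_celebrities, recent_posts_by_author, limit=10):
    """Read path: merge the precomputed (push) feed with celebrity posts
    fetched on demand (pull), newest first."""
    pulled = [p for c in followed_celebrities
              for p in recent_posts_by_author.get(c, [])]
    return sorted(pushed_feed + pulled, key=lambda p: p["ts"], reverse=True)[:limit]
```

The merge is where ranking would plug in: replace the `ts` sort key with an ML relevance score.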
9.11.e Video Streaming (YouTube)
Summary: Upload pipeline with transcoding, CDN delivery, and adaptive bitrate.
Components: Chunked upload (S3), GPU transcoding workers, HLS/DASH, 3-tier CDN
Key Decisions:
- Eager transcoding on upload (not on first play)
- CDN absorbs 99%+ of streaming bandwidth
- Approximate view counting (batched, not per-view DB write)
Pitfalls:
- Lazy transcoding means first viewer waits minutes for encoding
- Not using adaptive bitrate causes buffering on slow connections
- Per-view DB writes at 50M concurrent streams would overwhelm any database
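Client-side adaptive bitrate selection can be sketched as a throughput-based walk down a bitrate ladder; the ladder values and 0.8 safety factor below are illustrative assumptions, not any real service's numbers:

```python
# Illustrative bitrate ladder in kbps, highest first.
LADDER = [(4500, "1080p"), (2500, "720p"), (1000, "480p"), (400, "240p")]

def pick_rendition(throughput_kbps, safety=0.8):
    """Throughput-based ABR: choose the highest rendition whose bitrate
    fits within a safety fraction of the measured bandwidth."""
    budget = throughput_kbps * safety
    for bitrate, name in LADDER:
        if bitrate <= budget:
            return name
    return LADDER[-1][1]  # serve the lowest rung rather than stall
```

Real players (HLS/DASH) also factor in buffer occupancy, but the bandwidth-with-headroom idea is the core of it.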
9.11.f E-Commerce (Amazon)
Summary: Microservices architecture with SAGA pattern for distributed transactions.
Components: Product catalog, Cart (Redis), Inventory, Order, Payment services
Key Decisions:
- SAGA over 2PC (more resilient, supports eventual consistency)
- Cart in Redis (sub-ms access, TTL-based expiration)
- Flash sale: virtual queue + Redis atomic decrement
Pitfalls:
- 2PC blocks all participants on coordinator failure
- Not using TTL on inventory reservations leaves phantom holds
- Flash sale without a virtual queue causes thundering herd to inventory DB
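The SAGA pattern reduces to "run forward steps in order; on failure, run the compensations of completed steps in reverse." A minimal orchestrated sketch (the step functions are hypothetical):

```python
def run_saga(steps):
    """Orchestrated SAGA: each step is an (action, compensation) pair.
    If an action raises, compensate every completed step in reverse
    order and report failure."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(done):
                comp()
            return False
        done.append(compensate)
    return True
```

Unlike 2PC, no participant holds locks while waiting on a coordinator; the cost is that intermediate states are visible until compensation completes.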
9.11.g Notification System
Summary: Multi-channel delivery system with priority routing and fan-out.
Components: Kafka (priority topics), Channel workers (APNS/FCM/email/SMS), templates
Key Decisions:
- Separate queues per priority per channel (12 total topics)
- Circuit breaker on providers with automatic failover
- Per-user rate limits to prevent notification fatigue
Pitfalls:
- Single queue for all priorities means marketing delays critical alerts
- No per-user rate limit causes app uninstalls
- Not deduplicating leads to sending the same notification 3 times on retry
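Two of the pitfalls above (no dedup, no per-user limit) can be guarded in one pre-send check. The TTL and hourly cap below are illustrative, and letting `critical` notifications bypass the rate limit is an assumption of this sketch:

```python
from collections import defaultdict

class NotificationGate:
    """Pre-send checks a delivery worker might run: drop duplicate
    retries, and rate-limit non-critical sends per user."""

    def __init__(self, dedup_ttl=3600.0, per_user_hourly=10):
        self.dedup_ttl = dedup_ttl
        self.per_user_hourly = per_user_hourly
        self.sent = {}                       # dedup key -> last send time
        self.user_sends = defaultdict(list)  # user -> recent send times

    def should_send(self, user, dedup_key, now, critical=False):
        last = self.sent.get(dedup_key)
        if last is not None and now - last < self.dedup_ttl:
            return False  # a retry of something already delivered
        recent = [t for t in self.user_sends[user] if now - t < 3600.0]
        if not critical and len(recent) >= self.per_user_hourly:
            return False  # cap to prevent notification fatigue
        self.sent[dedup_key] = now
        self.user_sends[user] = recent + [now]
        return True
```

In the real system this state would live in Redis (the dedup cache from the component list), not in worker memory.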
9.11.h File Storage (Dropbox)
Summary: Chunked file storage with deduplication and cross-device sync.
Components: CDC chunker (Rabin fingerprint), S3 (blob store), PostgreSQL (metadata)
Key Decisions:
- Content-Defined Chunking over fixed-size (efficient delta sync)
- Conflict copy over last-writer-wins (no data loss)
- Direct-to-S3 upload via presigned URL (bypasses app servers)
Pitfalls:
- Fixed-size chunking: 1-byte insert invalidates ALL subsequent chunks
- Last-writer-wins silently loses one user's changes
- Routing all uploads through app servers creates a bandwidth bottleneck
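Content-defined chunking can be illustrated with a toy running hash; a real implementation uses a windowed Rabin fingerprint so boundaries depend only on a local byte window, but the cut-where-the-hash's-low-bits-are-zero idea is the same:

```python
def cdc_chunks(data, min_size=64, max_size=1024, mask=0x3F):
    """Toy content-defined chunker: a running hash stands in for a Rabin
    fingerprint; cut a boundary where its low bits are zero, subject to
    min/max chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * 31 + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries are chosen by content rather than by offset, an insertion only disturbs nearby chunks, which is what makes delta sync and dedup cheap.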
9.11.i Search Engine (Google)
Summary: Crawl-index-rank pipeline with inverted index and distributed serving.
Components: Crawler (priority frontier), Inverted index, Ranking (BM25 + PageRank + ML)
Key Decisions:
- Document-based sharding over term-based (avoids hot shards)
- Tiered index: query top-quality pages first, expand if needed
- SimHash for near-duplicate detection
Pitfalls:
- Term-based sharding creates hot shards on common terms ("the")
- Not respecting robots.txt gets your crawler blocked
- Single-tier index wastes computation ranking low-quality pages
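An inverted index and a conjunctive query fit in a few lines; ranking (BM25, PageRank, ML) would then score the candidate set this returns:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Inverted index: term -> sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Conjunctive (AND) query: intersect the postings lists."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []
```

Note the document-based sharding decision above: each shard holds a full index over its slice of documents, so a query fans out to all shards and merges results, rather than one shard owning a hot term.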
9.11.j Ride Sharing (Uber)
Summary: Real-time geospatial matching with location ingestion at 1.25M/sec.
Components: Redis GEO, Matching service, Surge pricing, ETA service, WebSocket
Key Decisions:
- Redis GEO over custom quadtree (simpler, fast enough)
- Scored greedy matching over global optimization (speed > optimality)
- Per-city sharding (trips never cross city boundaries)
Pitfalls:
- Two concurrent requests matching same driver without atomic locking
- Not batching location updates overwhelms the network (5M updates/4sec)
- Stale driver locations (no TTL) match riders to drivers who left the area
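The atomic-locking pitfall comes down to a compare-and-set claim on the driver; in Redis this would be a `SET ... NX` or a Lua script, sketched here with a process-local lock standing in:

```python
import threading

class DriverRegistry:
    """Atomic driver claim: of two concurrent match attempts for the
    same driver, exactly one may win."""

    def __init__(self):
        self._lock = threading.Lock()
        self._assigned = {}  # driver_id -> trip_id

    def try_claim(self, driver_id, trip_id):
        with self._lock:
            if driver_id in self._assigned:
                return False  # already matched to another trip
            self._assigned[driver_id] = trip_id
            return True

    def release(self, driver_id):
        with self._lock:
            self._assigned.pop(driver_id, None)
```

A production version would put a TTL on the claim so a crashed matcher cannot strand a driver in the assigned state.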
9.11.k Payment System (Stripe)
Summary: Exactly-once financial processing with ledger, fraud detection, PCI.
Components: Payment state machine, Ledger (double-entry), Fraud engine, Card vault
Key Decisions:
- Idempotency keys at every boundary (client, PSP, ledger)
- Outbox pattern for atomic DB update + event publishing
- Tokenization in isolated PCI environment (merchants never see card numbers)
Pitfalls:
- Dual-write without outbox pattern loses events or creates inconsistency
- Storing idempotency keys only in Redis risks loss on restart = double charge
- Single PSP means one outage stops all payments (use multi-PSP routing)
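Idempotency and double-entry can be sketched together: a replayed idempotency key returns the prior result instead of posting a second balanced debit/credit pair (account names and the integer transaction number are illustrative):

```python
class Ledger:
    """Double-entry ledger with idempotency keys: every charge posts a
    balanced debit/credit pair, and retrying a key is a no-op that
    returns the original result."""

    def __init__(self):
        self.entries = []  # (account, amount_cents); debits negative
        self.seen = {}     # idempotency key -> transaction number

    def charge(self, idem_key, customer, merchant, amount_cents):
        if idem_key in self.seen:
            return self.seen[idem_key]  # retry: no second charge
        self.entries.append((customer, -amount_cents))
        self.entries.append((merchant, +amount_cents))
        self.seen[idem_key] = len(self.entries) // 2
        return self.seen[idem_key]

    def balance(self, account):
        return sum(amt for acct, amt in self.entries if acct == account)
```

The invariant worth stating in an interview: entries always sum to zero across accounts, so money is moved, never created, and the keys must live in the durable DB (not just Redis) per the pitfall above.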
9.11.l Monitoring System (Datadog)
Summary: High-throughput metrics ingestion with TSDB, alerting, and tracing.
Components: Agents, Kafka, custom TSDB (Gorilla compression), Alert evaluators
Key Decisions:
- Hybrid push/pull collection (push for custom, pull for Prometheus compat.)
- Gorilla compression: 12x reduction (delta-of-delta + XOR encoding)
- Downsampling: 15 days full-res, 3 months 1-min, 1 year 5-min
Pitfalls:
- High-cardinality tags (e.g. user_id as a label) create millions of time series
- No downsampling keeps all data forever at 43 TB/day
- Fixed alert thresholds cause flapping; need hysteresis + seasonal baselines
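The delta-of-delta half of Gorilla compression has a simple intuition: regular scrape intervals make the encoded stream almost all zeros, which then bit-packs very compactly. An integer-level sketch (real Gorilla works at the bit level and also XOR-encodes values):

```python
def delta_of_delta(timestamps):
    """Encode as [first delta, then deltas-of-deltas]; with a regular
    scrape interval almost every emitted value is zero."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [d2 - d1 for d1, d2 in zip(deltas, deltas[1:])]
    return deltas[:1] + dods

def reconstruct(first_ts, encoded):
    """Invert the encoding back to the original timestamps."""
    out, delta = [first_ts], 0
    for i, v in enumerate(encoded):
        delta = v if i == 0 else delta + v
        out.append(out[-1] + delta)
    return out
```

A stream of zeros costs roughly a bit per sample once bit-packed, which is where the ~12x reduction quoted above comes from.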
Cross-Problem Pattern Reference
| Pattern | Problems Where Used |
|---|---|
| Caching (Redis) | URL Shortener, Feed, E-Commerce, Search, Ride Sharing |
| Message Queue (Kafka) | Chat, Notifications, Video, Monitoring, Payment |
| Database Sharding | URL Shortener, Chat, E-Commerce, Payment, Search |
| CDN | Video Streaming, File Storage, E-Commerce |
| Fan-out | Feed (hybrid), Chat (groups), Notifications |
| Idempotency | Payment, E-Commerce, Notifications |
| Rate Limiting | Rate Limiter, Notifications, URL Shortener, Payment |
| WebSocket | Chat, Ride Sharing, File Storage (sync), Monitoring |
| SAGA Pattern | E-Commerce, Payment (refund flow) |
| Geospatial Index | Ride Sharing |
| Inverted Index | Search Engine, Monitoring (label index) |
| Time-Series Storage | Monitoring |
| Optimistic Locking | E-Commerce (inventory), Payment (state machine) |
| Circuit Breaker | Notifications (provider failover), E-Commerce |
| Pre-generated Keys | URL Shortener |
| Content-Defined Chunking | File Storage |
| Sliding Window | Rate Limiter |
| Outbox Pattern | Payment |
| Double-Entry Ledger | Payment |
| Gorilla Compression | Monitoring (TSDB) |
| Tail-based Sampling | Monitoring (traces) |
Read-Heavy vs Write-Heavy Classification
Most Read-Heavy (caching is highly effective)
1. URL Shortener -- 100:1 read-to-write ratio.
   Why caching works: small key-value lookups, 80/20 traffic distribution,
   data is immutable after creation.
2. Search Engine -- reads dominate (index built in batch).
   Why caching works: popular queries repeat frequently, results change
   slowly, the retrieval phase is expensive to compute.
3. Video Streaming -- millions of views per upload.
   Why caching works: the CDN caches video segments at the edge; popular
   content follows a power law (top 1% of videos = 90% of views).
Most Write-Heavy (append-only/WAL/queues are critical)
1. Monitoring System -- 10M data points/sec ingested.
   Why a WAL matters: data points must survive writer crashes; a WAL
   provides durability without expensive sync writes per point.
2. Ride Sharing -- 1.25M location updates/sec.
   Why queues matter: Kafka buffers location updates during spikes; Redis
   TTL ensures stale data auto-expires.
3. Chat System -- 230K messages/sec.
   Why append-only: Cassandra's log-structured storage handles high write
   throughput; messages are immutable once sent.
Consistency vs Availability Classification
Strong Consistency Required
Payment System: Cannot double-charge; ACID transactions mandatory
E-Commerce Orders: Cannot oversell inventory; optimistic locking on stock
Ride Sharing Trips: Cannot assign driver to two riders simultaneously
Eventual Consistency Acceptable
URL Shortener: Slightly stale redirect is fine (cache TTL)
Social Media Feed: 5-30 second delay before post appears acceptable
Search Engine: Index can be minutes behind real-time
Monitoring: Metrics can tolerate seconds of lag
Chat (messages): At-least-once with client dedup; order resolved client-side
Real-Time vs Batch Classification
Real-Time Systems (sub-second response required)
Chat System: Messages must arrive in < 1 second
Ride Sharing: Location updates every 4 seconds, matching in < 10 seconds
Monitoring: Alert evaluation within 30 seconds of metric arrival
Notification: Critical alerts delivered within seconds
Near-Real-Time (seconds to minutes acceptable)
Social Media Feed: 5-30 second delay for new posts
File Storage Sync: Sync within a few seconds across devices
Batch (minutes to hours acceptable)
Video Transcoding: Minutes to transcode after upload
Search Indexing: Crawl and index cycle is hours/days
Payment Settlement: T+2 day settlement batches
Reconciliation: Daily batch process
Common Interview Traps and How to Avoid Them
| Trap | How to Avoid |
|---|---|
| Jumping to database choice before understanding access patterns. | Start with requirements and access patterns first; the DB choice follows naturally. |
| Saying "use Kafka" without explaining why. | Explain WHY: decoupling, buffering, durability. Name what ordering/delivery guarantees you need. |
| Designing for Google scale on day one. | Start with realistic numbers. 100 users/sec is fine for an MVP. Show how to scale later. |
| Ignoring failure modes. | For every component, state what happens when it fails and how the system degrades gracefully. |
| "We will just add more servers." | Identify the specific bottleneck (CPU, memory, disk, network) and how sharding/replication addresses it. |
| Single point of failure in the architecture. | Every critical path component needs redundancy: load balancer, database replica, queue cluster. |
| Not stating tradeoffs. | Every decision has a cost. "I chose X because Y, at the expense of Z" shows engineering maturity. |
Final Interview Reminders
1. Always start with requirements clarification (5 minutes).
2. Do capacity estimation early -- it guides database and caching choices.
3. Draw the high-level diagram BEFORE diving into any component.
4. Name your tradeoffs: "I chose X because Y, at the cost of Z."
5. Deep dive into the 2 most interesting or challenging components.
6. Mention failure modes and how you handle them.
7. Think out loud -- the interviewer wants to see your reasoning process.
8. Time management: do not spend 20 minutes on schema and run out of time
for the deep dive (where most points are awarded).