Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11 Interview Questions with Model Answers
Interview-style system design questions with detailed model answers. Each question includes context on why interviewers ask it and what they evaluate. Questions are organized by difficulty: beginner, intermediate, and advanced.
Beginner Level
These test whether you understand fundamental system design concepts and can explain the design of a well-known system clearly.
Q1: Explain how a URL shortener works end-to-end.
Why interviewers ask this: The most common warm-up system design question. Used to gauge whether you can structure your thinking, cover the basics (API, storage, encoding), and communicate clearly before moving to harder problems.
Model Answer:
A URL shortener takes a long URL and produces a short alias. The core flow has two paths: write (create a short URL) and read (redirect).
Write path:
- Client sends a POST request with the long URL.
- The service generates a unique 7-character code. Three approaches exist: base62 encoding of an auto-increment ID, hashing the URL with MD5/SHA256 and taking a prefix, or drawing from a pre-generated key service. The pre-generated key approach (9.11.a) is preferred because it avoids collisions and has no runtime computation cost.
- The mapping (short_code -> long_url) is stored in a NoSQL database (DynamoDB or Cassandra) because the access pattern is simple key-value lookups with massive read throughput.
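The base62 option above is easy to sketch. This is a minimal illustration of encoding an auto-increment ID into a short code, not production code; the function name is illustrative.

```python
# 62 characters: 62^7 ~= 3.5 trillion distinct 7-character codes.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer (e.g., an auto-increment ID) in base62."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))
```

The same function, run in reverse, decodes a short code back to its ID; the pre-generated key service simply runs this encoding offline and hands out codes from a pool.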
Read path:
- Client sends GET /abc1234.
- Service checks a Redis cache first. Expected hit rate is 90%+ since the workload is heavily read-skewed (100:1 read-to-write ratio).
- On cache hit, return a 302 redirect to the original URL.
- On cache miss, query the database, populate the cache, then redirect.
- Asynchronously log the click event to Kafka for analytics processing.
Key design decisions:
- 302 redirect (not 301) so every click hits our server for analytics accuracy. Use 301 if reducing server load matters more.
- Pre-generated key service avoids hash collisions at scale.
- Cache the top 20% of URLs by the 80/20 rule (~41 GB of cache).
Q2: How would you design a basic notification system?
Why interviewers ask this: Tests understanding of message queues, fan-out patterns, and multi-channel delivery. Good test of whether you think about reliability (what if delivery fails?) and user experience (notification fatigue).
Model Answer:
A notification system (9.11.g) accepts events from upstream services and delivers messages to users across multiple channels.
Architecture:
- Event ingestion: Services publish notification events to Kafka. Each event includes: recipient, channel, priority, template ID, and dynamic data.
- Priority routing: A router consumes events and routes to channel-specific queues. Critical notifications (security alerts) go to a fast-track queue; marketing notifications go to a bulk queue with rate limiting.
- Channel workers: Separate worker pools per channel:
- Push: APNS (iOS) and FCM (Android)
- Email: SendGrid/SES
- SMS: Twilio
- Delivery tracking: Each attempt is logged. Failed deliveries retry with exponential backoff (up to 3 retries), then optionally fall back to an alternate channel.
Key considerations:
- Deduplication: Idempotency key per notification prevents duplicate sends.
- User preferences: Check preference center before sending.
- Rate limiting: Cap at N notifications per user per hour to prevent fatigue.
- Template rendering: Templates stored separately; dynamic data substituted at delivery time, not at ingestion time.
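The retry policy under delivery tracking can be sketched as follows. `send_fn` is a stand-in for any channel client (push, email, SMS); the jitter factor is an assumption, added to avoid synchronized retry storms.

```python
import random
import time

def send_with_retries(send_fn, payload, max_retries=3, base_delay=1.0):
    """Attempt delivery; on failure, retry with exponential backoff plus jitter.

    send_fn is a hypothetical channel client that returns True on success
    and False (or raises) on failure.
    """
    for attempt in range(max_retries + 1):
        try:
            if send_fn(payload):
                return True
        except Exception:
            pass  # treat exceptions the same as a failed delivery
        if attempt < max_retries:
            # Delays grow 1s, 2s, 4s, ...; jitter spreads out retry bursts.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    return False  # exhausted: caller can fall back to an alternate channel
```

A `False` return is the trigger for the channel fallback described above.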
Q3: Walk me through how a rate limiter works in a distributed system.
Why interviewers ask this: Rate limiting seems simple but becomes nuanced in distributed systems. Interviewers want to see the tension between accuracy and latency, and whether you can explain multiple algorithms with their tradeoffs.
Model Answer:
A rate limiter (9.11.b) restricts how many requests a client can make within a time window. In a distributed system, the challenge is maintaining accurate counts across multiple servers.
Token bucket algorithm (most common in production):
- Each user has a bucket with capacity C and refill rate R tokens/second.
- Each request costs 1 token. If tokens remain, the request proceeds. If the bucket is empty, return 429 Too Many Requests.
- Example: C=100, R=10/sec allows burst of 100 requests but sustains at 10/sec.
Distributed implementation with Redis:
- A Lua script runs atomically on Redis: calculate tokens to add since last refill, check if enough tokens remain, decrement, return result.
- This adds ~2-5ms latency per request.
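The steps the Lua script performs can be shown in-process. This sketch mirrors the same lazy-refill logic in plain Python; in production the identical steps run inside a single Redis script, which makes them atomic because Redis executes scripts single-threaded.

```python
import time

class TokenBucket:
    """In-process token bucket mirroring the Redis Lua logic: refill lazily
    based on elapsed time, then try to spend one token per request."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity          # C: maximum burst size
        self.refill_rate = refill_rate    # R: tokens per second
        self.tokens = capacity
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Lazy refill: credit tokens for the time elapsed since the last check.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds with 429 Too Many Requests
```

The `now` parameter exists only to make the sketch testable with simulated clocks; the Lua version reads Redis server time instead.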
Key tradeoffs:
- Centralized Redis: accurate counts but adds network hop and Redis is a single point of failure.
- Local counters with periodic sync: faster (no network hop) but may over-allow by up to N servers' worth of requests between syncs.
- Recommendation: Centralized Redis with fail-open behavior. If Redis goes down, allow requests rather than blocking all traffic.
Q4: What is the difference between fan-out-on-write and fan-out-on-read? When would you use each?
Why interviewers ask this: Tests understanding of a fundamental pattern used in social media feeds (9.11.d), chat systems (9.11.c), and notification systems (9.11.g). Reveals whether you understand the celebrity problem.
Model Answer:
Fan-out-on-write (push model): When a user posts content, immediately write it to every follower's timeline.
- Pros: read is fast (timeline is pre-computed).
- Cons: write amplification. If user has 1M followers, one post = 1M writes.
- Used for: regular users with < 10K followers in the social media feed (9.11.d).
Fan-out-on-read (pull model): When a user opens their timeline, fetch posts from everyone they follow in real-time and merge.
- Pros: writes are simple (just store the post once).
- Cons: reads are slow (must query N sources and merge).
- Used for: celebrities with millions of followers (9.11.d hybrid approach).
Hybrid (the real-world answer): The social media feed in 9.11.d uses push for regular users (< 10K followers) and pull for celebrities (> 10K followers). When a user opens their timeline, they get the pre-pushed regular posts merged with freshly pulled celebrity posts. This bounds write amplification while keeping reads fast.
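The merge step in the hybrid read path can be sketched with a k-way merge. `heapq.merge` stands in for the merge-and-rank stage; posts are represented as (timestamp, post_id) tuples, which is an illustrative shape, not a real schema.

```python
import heapq
from itertools import islice

def build_timeline(pushed_posts, celebrity_feeds, limit=20):
    """Merge the pre-computed (pushed) timeline with freshly pulled
    celebrity posts, newest first. Each post is a (timestamp, post_id) tuple."""
    # heapq.merge requires each input to be sorted; sort descending by time.
    feeds = [sorted(pushed_posts, reverse=True)]
    feeds += [sorted(feed, reverse=True) for feed in celebrity_feeds]
    # Lazily merge all feeds and take only the first `limit` posts.
    return list(islice(heapq.merge(*feeds, reverse=True), limit))
```

Because the merge is lazy, the cost is proportional to `limit`, not to the total number of posts across all feeds.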
Intermediate Level
These test your ability to make and defend design decisions, handle scale, and reason about failure modes.
Q5: How would you scale a chat system to support 100 million concurrent connections?
Why interviewers ask this: Tests understanding of connection management, stateful services (WebSocket servers), and fan-out for group messages. Separates candidates who have thought about production constraints.
Model Answer:
100 million concurrent WebSocket connections is a significant challenge. A single server with 64 GB RAM handles roughly 500K connections (~100 KB each with buffers).
Approach:
- Connection layer: 100M / 500K = 200 WebSocket servers. Connections are sticky; once established, a client stays on the same server.
- Connection registry: Redis maps user_id -> server_id so the system knows where each user is connected.
- Message routing:
- User A messages User B: look up B's server from the registry.
- Same server: deliver via local memory.
- Different server: publish to Redis Pub/Sub; B's server delivers locally.
- Group chats: publish to a channel topic; each server delivers to its local members.
- Server failure handling: If a server crashes, its 500K connections drop. Clients auto-reconnect to a healthy server. Messages sent during the reconnection window are buffered in a message queue and delivered once the client reconnects.
Bottleneck: The primary bottleneck is not connections but message fan-out for large groups. A 10,000-member group generates significant cross-server traffic. Solution: assign large groups a dedicated pub/sub channel with multicast within the data center.
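The routing decision above can be sketched with a plain dict standing in for the Redis registry. Function and return-value names are illustrative.

```python
registry = {}  # user_id -> server_id; Redis hash in production

def register(user_id, server_id):
    """Record which WebSocket server holds this user's connection."""
    registry[user_id] = server_id

def route(sender_server, recipient_id):
    """Decide how to deliver a message: locally if the recipient shares the
    sender's server, via pub/sub to the target server otherwise, or into an
    offline queue if the recipient has no live connection."""
    target = registry.get(recipient_id)
    if target is None:
        return ("offline_queue", recipient_id)  # buffer until reconnect
    if target == sender_server:
        return ("local", target)                # deliver via local memory
    return ("pubsub", target)                   # publish to target server's channel
```

The offline-queue branch is what absorbs the 500K dropped connections during a server crash: messages queue up until each client re-registers on a healthy server.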
Q6: Design the caching layer for an e-commerce product catalog serving 100K requests/sec.
Why interviewers ask this: Caching seems simple, but e-commerce has complex invalidation patterns. Tests whether you can design a cache that is both fast and correct enough for the business.
Model Answer:
Multi-tier cache architecture:
Browser Cache (5 min) -> CDN (1 min) -> Redis (10 min) -> Database
Cache strategy by data type (referencing 9.11.f):
| Data Type | TTL | Invalidation | Why |
|---|---|---|---|
| Product metadata | 10 min | Event-driven on update | Changes rarely |
| Product images | 24 hours | Versioned URLs | Immutable files |
| Price | 30 sec | Short TTL (near-real-time) | Legal liability |
| Inventory count | Never cached | Always query DB | Must be accurate |
| Reviews | 5 min | Background refresh | Staleness OK |
Key design decisions:
- Inventory is never cached because showing "in stock" when sold out leads to failed checkouts.
- Cache stampede prevention: When a hot product's cache expires, use a distributed lock (Redis SETNX) so only one request rebuilds the cache.
- Cache warming on deploy: Pre-warm top 10K products before routing live traffic to avoid thundering herd.
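The stampede-prevention lock can be sketched in-process. A dict guarded by a mutex simulates Redis SETNX semantics (first caller for a key wins); names and the stale-or-miss fallback policy are assumptions.

```python
import threading

_locks = {}                      # simulates Redis SETNX: key -> held
_locks_guard = threading.Lock()

def try_acquire(key):
    """Simulate SET key NX: only the first caller for a given key succeeds."""
    with _locks_guard:
        if key in _locks:
            return False
        _locks[key] = True
        return True

def release(key):
    with _locks_guard:
        _locks.pop(key, None)

def get_product(product_id, cache, db_fetch):
    """On a cache miss, only the lock winner rebuilds the entry; losers read
    whatever is in the cache rather than stampeding the database."""
    value = cache.get(product_id)
    if value is not None:
        return value
    lock_key = f"rebuild:{product_id}"
    if try_acquire(lock_key):
        try:
            value = db_fetch(product_id)   # exactly one DB query per miss
            cache[product_id] = value
            return value
        finally:
            release(lock_key)
    return cache.get(product_id)           # stale-or-miss path for lock losers
```

In the real system the SETNX key would also carry a short TTL so a crashed rebuilder cannot hold the lock forever.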
Q7: How does the matching algorithm in a ride-sharing platform handle concurrent ride requests?
Why interviewers ask this: Matching under concurrency is a real challenge. Interviewers want to see awareness of race conditions, locking strategies, and the tradeoff between optimal matching and speed.
Model Answer:
At 1,000 concurrent requests/sec in a city (9.11.j), multiple riders could be matched to the same driver without proper coordination.
Approach: Scored greedy with optimistic locking.
- Find candidates: Query Redis GEO for nearest 50 available drivers (~1ms).
- Score and rank: distance (30%), ETA (30%), rating (15%), acceptance rate (15%), vehicle match (10%).
- Atomic reservation: Redis Lua script atomically checks driver status and sets it to "reserved." If already reserved by another request, move to the next driver in the ranked list.
- Why not global optimization? A Hungarian algorithm finds the global optimum but requires batching (delay) and is O(n^3). Riders expect a match in seconds, not minutes. Greedy with good scoring produces results within 5% of optimal in practice.
- High-contention zones: In downtown, 50 riders may compete for 20 drivers. Optimistic locking means some riders retry their second or third choice, adding ~10-50ms per retry -- well within the 10-second SLA.
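The reservation step can be sketched as a walk down the ranked list with a check-and-set per driver. A dict stands in for driver status in Redis; the check-and-set is atomic in production because it runs inside a Lua script.

```python
drivers = {}  # driver_id -> "available" | "reserved"; Redis in production

def reserve_first_available(ranked_driver_ids):
    """Walk the ranked candidate list and atomically flip the first available
    driver to 'reserved'. A rider losing the race simply falls through to the
    next driver in their own ranking."""
    for driver_id in ranked_driver_ids:
        if drivers.get(driver_id) == "available":
            drivers[driver_id] = "reserved"  # atomic inside the Lua script
            return driver_id
    return None  # no candidates left: widen the search radius and retry
```

This is exactly the retry behavior described for high-contention zones: a losing rider's next loop iteration is the "second or third choice" at ~10-50ms extra cost.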
Q8: Explain how idempotency works in a payment system and why it matters.
Why interviewers ask this: Idempotency is critical in distributed systems, and payments make the consequences of getting it wrong extremely concrete. Tests whether you understand the problem (double charging) and the implementation details.
Model Answer:
Idempotency (9.11.k) guarantees that performing the same operation multiple times has the same effect as performing it once.
How it works:
- Merchant includes an Idempotency-Key header (e.g., order-12345-payment).
- On first request: insert key into idempotency_keys table with request body hash, process payment, store response.
- On retry (same key): find existing row, verify request body matches, return stored response without reprocessing.
- On concurrent duplicate: INSERT with UNIQUE constraint fails; return 409.
Why database (not Redis) for idempotency keys:
- Keys must survive restarts (durability).
- Key must be in the same transaction as the payment (atomicity).
- PostgreSQL gives both.
Edge case: What if the system crashes after charging the card but before saving the idempotency key? The payment and key save are in a single DB transaction. The PSP charge uses its own idempotency key, so retrying the entire operation does not double-charge.
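The key-handling flow can be sketched with an in-memory table standing in for the PostgreSQL idempotency_keys table (which enforces uniqueness with a UNIQUE constraint). The 422 response for a reused key with a different body is an assumption about error handling, not a spec.

```python
import hashlib
import json

idempotency_keys = {}  # key -> (request_body_hash, stored_response)

def process_payment(key, request_body, charge_fn):
    """First request charges and stores the response; retries with the same
    key replay the stored response without touching the PSP again."""
    body_hash = hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode()
    ).hexdigest()
    if key in idempotency_keys:
        stored_hash, stored_response = idempotency_keys[key]
        if stored_hash != body_hash:
            # Same key, different body: a client bug, not a retry.
            return {"status": 422, "error": "key reused with different body"}
        return stored_response            # retry: replay, do not reprocess
    response = charge_fn(request_body)    # first request: actually charge
    idempotency_keys[key] = (body_hash, response)
    return response
```

In production the insert and the payment record commit in one transaction, which is the durability/atomicity argument for PostgreSQL over Redis above.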
Advanced Level
These test deep architectural thinking, cross-system reasoning, and the ability to navigate ambiguous constraints.
Q9: Design a monitoring system that ingests 10 million metrics/sec with sub-second query latency. Walk through both the write and read paths.
Why interviewers ask this: Tests time-series database internals, compression techniques, and the tension between write throughput and read latency. Senior candidates should know about LSM trees, Gorilla compression, and downsampling.
Model Answer:
Write path (optimized for throughput, referencing 9.11.l):
- Metrics arrive at 50 ingestion gateway instances.
- Published to Kafka (128 partitions, partitioned by series_id).
- TSDB writer instances consume from Kafka (~250K points/sec each).
- Writer appends to in-memory head block. Compression: delta-of-delta for timestamps (regular intervals compress to near-zero bits) and XOR/Gorilla encoding for values (~1.37 bytes per data point on average, roughly 12x smaller than a raw 16-byte point).
- Every 2 hours, head block flushes to immutable disk block with inverted index mapping metric_name + tags to series_id.
- WAL ensures no data loss if writer crashes before flushing.
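The timestamp half of the compression scheme is simple to demonstrate. This sketch computes the delta-of-delta stream (without the final bit-packing): for a metric scraped at a fixed interval, every delta-of-delta is zero, which Gorilla-style encoders store in a single bit per point.

```python
def delta_of_delta(timestamps):
    """Transform timestamps into (first, first_delta, then delta-of-deltas).
    Regularly spaced timestamps produce all-zero delta-of-deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods
```

The value half (XOR encoding) exploits the same idea in the mantissa bits: consecutive metric values are usually close, so their XOR has long runs of zeros.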
Read path (optimized for latency):
- Query arrives: avg:cpu.usage{env:prod} by {host} [last 1h].
- Recent data (last 2 hours): in-memory head blocks -- fastest path.
- Older data: disk blocks with inverted index for series lookup and chunk index for direct seek.
- Fan out to all shards containing matching series. Each returns partial results.
- Merge and aggregate at query layer.
- Latency: ~50ms (last hour, in-memory), ~200ms (last day, 1-min rollups), ~500ms (last week, 5-min rollups).
Key insight: Separation of recent data (in-memory) from historical data (on-disk, compressed, pre-aggregated rollups) achieves both high write throughput and low read latency simultaneously.
Q10: You are designing a payment system across 3 regions. How do you handle cross-region consistency for financial data?
Why interviewers ask this: Multi-region consistency is one of the hardest distributed systems problems. For payments, the stakes are high. Tests whether you understand CAP in practice, not just in theory.
Model Answer:
Approach: Region-affinity with synchronous intra-region, async cross-region.
- Merchant assignment: Each merchant is assigned a home region by country. US merchant -> US-East, EU merchant -> EU-West.
- Within-region: PostgreSQL with synchronous replication (primary + 2 sync replicas). RPO = 0 (no data loss). Every transaction committed to 3 nodes before acknowledging.
- Cross-region: Async replication for disaster recovery. Lag: 100-500ms.
- Failover: If US-East fails, DNS routes US merchants to EU-West. Async replicas may be 500ms behind. We accept RPO = 500ms during full regional failure. In-flight transactions reconciled using PSP records as source of truth.
- Why not synchronous cross-region? Adds 80-150ms to every transaction. At 10K TPS, this is unacceptable. If the cross-region link fails, all payments stop. Async + PSP reconciliation is the industry standard.
Q11: How would you handle a celebrity with 200 million followers posting on a social media platform?
Why interviewers ask this: The "celebrity problem" is a classic advanced question testing understanding of fan-out strategies and hybrid architectures for skewed workloads.
Model Answer:
Naive push: 200M timeline writes at 1 KB each = 200 GB of writes. At 100K writes/sec to the timeline store, this takes 2,000 seconds (33 minutes) -- completely unacceptable.
Hybrid fan-out (9.11.d approach):
- Classify users: Regular (< 10K followers) = push. Celebrity (> 10K followers) = pull.
- When celebrity posts: Store the post. NO fan-out. Add to a "celebrity posts" Redis sorted set keyed by celebrity user_id.
- When follower opens feed: Fetch pre-computed timeline (regular user posts, already pushed) + pull recent posts from each celebrity they follow. Merge and rank.
- Optimization: Push only to followers active in the last 24 hours (~5% of 200M = 10M). The remaining 190M pull when they next open the app.
Performance: Celebrity post visible immediately with zero fan-out writes. Read path adds ~5-10ms per celebrity followed.
Q12: Compare data storage strategies for time-series metrics, financial ledgers, and search inverted indexes. Why does each need a different approach?
Why interviewers ask this: Meta-question testing whether you truly understand storage engines or just memorize "use X for Y." Interviewers want reasoning about access patterns, consistency needs, and compression.
Model Answer:
| Dimension | Monitoring TSDB (9.11.l) | Payment Ledger (9.11.k) | Search Index (9.11.i) |
|---|---|---|---|
| Write pattern | Append-only, time-ordered | Append-only, transactional | Batch index builds |
| Read pattern | Range scans by time | Point lookups + range | Term lookups + scoring |
| Consistency | Eventual OK | Strong ACID required | Eventual OK |
| Compression | Gorilla (12x) | Standard (2-3x) | Posting list (5-10x) |
| Update/delete | Never updated | Never deleted (audit) | Segments merged/deleted |
Why different storage:
- TSDB: Tiny data points (timestamp + value) at extreme rates. Gorilla compression exploits timestamp regularity and value similarity for 12x compression. Generic DBs waste 10x more storage at this scale.
- Ledger (PostgreSQL): ACID transactions required -- payment and ledger entries must be atomically consistent. Append-only is a business rule for auditability, not a storage optimization. Volume is manageable (100 GB/day).
- Inverted index: Maps terms to document ID lists. The access pattern (lookup word, get document list, intersect lists) does not fit a TSDB or RDBMS naturally. Lucene-style segment storage is purpose-built for this.
Principle: Storage engines optimized for specific access patterns are 10-100x more efficient than general-purpose databases for those patterns.
Q13: Your monitoring system has 100,000 alert rules. An evaluator node fails. How do you ensure no alerts are missed during failover?
Why interviewers ask this: Tests highly available stateful service design. Alerting is safety-critical -- a missed alert could mean an outage goes undetected. Looking for: heartbeat detection, work redistribution, state recovery, gap-filling.
Model Answer:
Architecture: Consistent hashing with externalized state.
- Normal operation: 100K rules distributed across 20 evaluators via consistent hashing (5K rules/worker). Each evaluates its rules every 30 seconds. Heartbeats to a coordinator (etcd) every 5 seconds.
- Failure detection: 3 missed heartbeats (15 seconds) triggers coordinator to declare the worker dead.
- Rule redistribution: Dead worker's 5K rules reassigned to surviving 19 workers via consistent hashing with virtual nodes (~263 extra rules each).
- State recovery: Alert state (OK/WARNING/CRITICAL, consecutive breach count) stored in Redis, not worker local memory. Surviving workers read state from Redis and continue evaluation without state loss.
- Gap-filling: After reassignment, each rule runs an immediate evaluation. Worst case: 1 missed evaluation cycle (30 seconds).
- Preventing false recovery: Consecutive-breach count persisted in Redis, so the new evaluator does not restart counting from zero.
Worst case gap: 15s detection + 5s reassignment + 30s next cycle = 50 seconds, within the typical 60-second alerting SLA.
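The redistribution property that makes this work can be sketched directly: with consistent hashing and virtual nodes, removing a dead worker moves only that worker's rules, spread roughly evenly across the survivors. Class and method names here are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes. Removing a worker reassigns only the
    keys that worker owned; every other key keeps its current owner."""

    def __init__(self, workers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, worker)
        for w in workers:
            self.add(w)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, worker):
        # Each worker appears `vnodes` times on the ring for even spread.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{worker}#{i}"), worker))

    def remove(self, worker):
        self.ring = [(h, w) for h, w in self.ring if w != worker]

    def owner(self, rule_id):
        # A rule belongs to the first virtual node clockwise from its hash.
        h = self._hash(rule_id)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

With 20 workers and ~100 virtual nodes each, a dead worker's 5K rules land on the 19 survivors in chunks of a few hundred, matching the ~263-rules-each figure above.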
Q14: Design a ride-sharing system for a region with unreliable mobile networks. Drivers go offline for 30-60 seconds frequently.
Why interviewers ask this: Constraint-based question testing adaptability. Pushes beyond "assume perfect network" and reveals understanding of offline-first design and graceful degradation.
Model Answer:
Adaptations to 9.11.j:
- Staleness-aware matching: Penalize drivers whose location is > 10s old. Drivers last seen > 30s ago are excluded from matching entirely.
- Parallel offers: Instead of sequential (15s timeout per driver), send the ride offer to top 3 drivers simultaneously. First to accept wins. Reduces worst-case matching from 45s to 15s.
- Client-side location buffering: Driver app buffers GPS updates locally when offline. On reconnect, sends the batch. Server updates geospatial index with latest position.
- Aggressive TTL: 30-second TTL on Redis GEO entries. Offline drivers auto-removed, re-added on reconnect.
- SMS fallback: If push notification not acknowledged in 5 seconds, send ride offer via SMS (works on 2G).
- Trip interpolation: During active trips, if driver goes offline, show the rider an interpolated position (last known + estimated speed) until the driver reconnects and reconciles.
Q15: Your search engine handles 10K queries/sec. A new requirement says results must be personalized per user. How does this change the architecture?
Why interviewers ask this: Personalization fundamentally changes caching. Tests whether you understand the tension between per-user computation and shared caching, and how to achieve both.
Model Answer:
Impact: Without personalization, "laptop" returns identical results for all users -- highly cacheable. With personalization, results differ per user. Naive per-user caching: 10M users * 100 queries = 1B cache entries -- infeasible.
Two-phase retrieval + re-ranking (extending 9.11.i):
- Phase 1: Retrieval (shared, cacheable): Inverted index returns top 1,000 candidates for "laptop" based on BM25 relevance. Identical for all users. Cache hit rate: 80%+.
- Phase 2: Personalized re-ranking (per-user): A lightweight neural network re-scores 1,000 candidates using user features: purchase history, browse history, price sensitivity, location. Returns top 20.
- User feature store: Pre-computed 50-dimension vector per user in Redis (10M users * 400 bytes = 4 GB). Updated hourly from batch pipeline.
- Caching strategy: Cache phase 1 (5-min TTL). Do NOT cache phase 2 (low reuse). Preserves 80% of caching benefit while adding personalization.
Latency: Phase 1 cached: 5ms. Feature lookup: 2ms. Re-ranking: 10ms. Total: 17ms -- well within the 200ms target.
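The two-phase split can be sketched as a pipeline. `retrieve_fn` stands in for the (cacheable) inverted-index lookup and `rerank_score_fn` for the ranking model; both names and the tuple-based scoring are illustrative.

```python
def search(query, retrieve_fn, user_features, rerank_score_fn, k=20):
    """Phase 1: shared, cacheable candidate retrieval (identical for all
    users). Phase 2: per-user re-ranking of the small candidate set."""
    candidates = retrieve_fn(query)               # phase 1: cache this
    scored = [(rerank_score_fn(doc, user_features), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]         # phase 2: never cached
```

The key property is that everything expensive and user-independent happens before the per-user work, so the cacheable surface stays large.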
Study Tips for System Design Interviews
1. Practice out loud. Explain your design to a rubber duck or a friend.
Articulation clarity is 50% of the interview grade.
2. Always start with requirements. 3 minutes clarifying scope saves 10
minutes designing the wrong system.
3. Know your numbers:
- 1 day ~= 100K seconds (86,400)
- 1 million seconds ~= 12 days
- 1 billion seconds ~= 32 years
- SSD read: ~100 microseconds
- Network (same DC): ~0.5 ms
- Network (cross-continent): ~100 ms
4. Have strong opinions, weakly held. "I chose X because of Y, but if Z
were a concern, I would consider W."
5. Address failure explicitly. For every component you draw, ask: "What
happens when this fails?" Say it out loud before the interviewer asks.
6. Prioritize depth over breadth. It is better to deep-dive 2 components
thoroughly than to superficially cover 6.
7. Reference patterns by name. Saying "I would use the outbox pattern here"
demonstrates fluency with the vocabulary of distributed systems.