9.11 Exercise Questions

Practice questions covering all 12 system designs in this section. These questions reference specific designs from 9.11.a through 9.11.l and test your ability to apply, extend, compare, and troubleshoot real-world systems.

Section A: Mini Design Challenges

Scoped-down design problems. Spend 15-20 minutes on each.

A.1 URL Shortener (9.11.a)

Q1: Design a URL shortener that supports link expiration based on click count (e.g., "this link expires after 1,000 clicks"). How does this change the redirect flow compared to the time-based expiration in 9.11.a?

Q2: Design a vanity URL service where businesses can use custom domains (e.g., go.company.com/sale). What changes to DNS, certificate management, and the URL resolution flow are needed?

Q3: The URL shortener from 9.11.a uses a Pre-Generated Key Service. Design an alternative where keys are generated using a Snowflake-style distributed ID generator. Compare failure modes between the two approaches.
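As a starting point for Q3, here is a minimal Python sketch of a Snowflake-style generator. The bit layout (timestamp in the high bits, 10-bit worker ID, 12-bit sequence), the custom epoch `EPOCH_MS`, the injectable `clock`, and the Base62 alphabet are all assumptions chosen for illustration, not details taken from 9.11.a. The clock-rollback check marks a failure mode that a pre-generated key service does not share.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, in milliseconds
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock  # injectable so the sketch can be tested
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Clock moved backwards: refuse rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: spin to the next.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )


def to_base62(n):
    """Encode a non-negative integer as a Base62 short code."""
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))
```

When comparing failure modes, consider what each design needs at request time: this generator needs a sane clock and a unique worker ID, while a key service needs the key store to be reachable.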

A.2 Rate Limiter (9.11.b)

Q4: Design a rate limiter for a GraphQL API where a single request can have variable cost (a simple query costs 1 point, a complex nested query costs 50 points). How does this change the token bucket algorithm from 9.11.b?
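One way to approach Q4 is to keep the token bucket unchanged and deduct a per-request cost instead of a fixed 1. The sketch below assumes a single in-process bucket and a query cost that has already been computed elsewhere. The class name, the `clock` parameter, and the decision to reject over-capacity queries outright are illustrative assumptions, not the design from 9.11.b.

```python
import time


class CostAwareTokenBucket:
    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable for testing
        self.tokens = float(capacity)
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the elapsed time, capped at the bucket capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now

    def try_consume(self, cost):
        """Admit the request only if the bucket holds at least `cost` tokens."""
        if cost > self.capacity:
            # A query more expensive than the whole bucket can never pass.
            raise ValueError("query cost exceeds bucket capacity")
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With a capacity of 100 points, a client can afford two 50-point nested queries or one hundred 1-point queries before it has to wait for a refill.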

Q5: Design a distributed rate limiter that works across 5 data centers with eventual consistency. What is the maximum over-limit error you would tolerate, and how would you minimize it?

A.3 Chat System (9.11.c)

Q6: Extend the chat system from 9.11.c to support message reactions (emoji reactions like Slack). Design the data model, real-time update mechanism, and how reaction counts are aggregated.

Q7: Design end-to-end encryption for the chat system. How does this change message storage, server-side search, and the server's role in message delivery?

A.4 Social Media Feed (9.11.d)

Q8: The social media feed in 9.11.d uses a hybrid push/pull model. Design a pure push system for a corporate social network with a maximum of 5,000 followers per user. What are the storage and latency implications compared to the hybrid approach?

Q9: Design a "trending topics" feature for the social media platform. How do you detect trending topics in real-time from millions of posts per hour? What time-decay function would you use to prevent old topics from staying trending forever?
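For the time-decay part of Q9, exponential decay with a fixed half-life is one common choice. The sketch below keeps a decayed mention score per topic in memory. The class name, the half-life value, and the single-process setting are assumptions made for illustration; a real system would need distributed counting on top of this.

```python
class TrendingTopics:
    def __init__(self, half_life_sec):
        self.half_life_sec = half_life_sec
        self.scores = {}  # topic -> (score, timestamp of last update)

    def score(self, topic, now):
        """Return the topic's score decayed to time `now`."""
        value, last = self.scores.get(topic, (0.0, now))
        return value * 0.5 ** ((now - last) / self.half_life_sec)

    def record(self, topic, now, weight=1.0):
        # Decay the old score to `now`, then add the new mention.
        self.scores[topic] = (self.score(topic, now) + weight, now)

    def top(self, k, now):
        ranked = sorted(
            ((self.score(topic, now), topic) for topic in self.scores),
            reverse=True,
        )
        return [topic for _, topic in ranked[:k]]
```

With a one-hour half-life, four mentions from two hours ago are worth a score of 1.0, so a topic with three fresh mentions outranks them.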

A.5 Video Streaming (9.11.e)

Q10: Design a live streaming system (like Twitch) where one broadcaster streams to 500,000 concurrent viewers. How does the architecture differ from the video-on-demand system in 9.11.e? What changes in the transcoding, delivery, and CDN strategy?

Q11: Add a "watch party" feature to the video streaming platform where multiple users watch the same video in sync. What synchronization protocol would you use, and how do you handle users with different network speeds?

A.6 E-Commerce (9.11.f)

Q12: Design the flash sale subsystem for the e-commerce platform from 9.11.f. 10,000 units of a product go on sale at exactly 12:00 PM, and 1 million users are waiting to buy. How do you handle the thundering herd problem, ensure fair inventory allocation, and prevent overselling?

Q13: Design a recommendation engine for the e-commerce platform. Compare collaborative filtering ("users who bought X also bought Y") vs content-based filtering ("products similar to X by attributes"). Which is better for cold-start users with no purchase history?

A.7 Notification System (9.11.g)

Q14: Design a notification preference center where users can configure granular settings (e.g., "email for order updates, push for promotions, no SMS ever"). How does this integrate with the notification system pipeline from 9.11.g? Where in the pipeline should preference checks occur?

Q15: The notification system needs to support timezone-aware scheduled delivery (e.g., "send this campaign at 9 AM in each user's local timezone"). Design the scheduling subsystem. How do you handle delivery across more than 24 UTC offsets (some regions use 30- or 45-minute offsets), each covering millions of users?

A.8 File Storage (9.11.h)

Q16: Add real-time collaboration (Google Docs-style simultaneous editing) to the file storage system from 9.11.h. Compare Operational Transform (OT) vs CRDT for conflict resolution. Which would you choose for a plain-text editor, and why?

Q17: Design a file versioning system that allows users to restore any previous version of a file. How does this interact with the chunking and deduplication strategy from 9.11.h? What is the storage overhead of keeping 100 versions of a file that changes 1% between versions?
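To sanity-check your answer to the storage-overhead part of Q17, the best-case arithmetic can be scripted. This sketch assumes every changed byte falls into chunks that are stored once and that unchanged chunks are fully deduplicated. Edits scattered across many chunks would dirty whole chunks and push the real figure higher. The function name is hypothetical.

```python
def versioned_storage(file_size, versions, change_fraction):
    """Return (full-copy storage, best-case deduplicated storage)."""
    full_copies = file_size * versions
    # First version stored whole; each later version adds only its changed part.
    deduplicated = file_size + (versions - 1) * file_size * change_fraction
    return full_copies, deduplicated


full, dedup = versioned_storage(file_size=1_000, versions=100, change_fraction=0.01)
print(f"full copies: {full:,} MB, deduplicated: {dedup:,.0f} MB")
```

For a 1,000 MB file this gives 100,000 MB for naive full copies versus about 1,990 MB in the best case, which is roughly 2x the original file rather than 100x.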

A.9 Search Engine (9.11.i)

Q18: Design a typeahead/autocomplete feature for the search engine from 9.11.i. The system must return suggestions within 100ms as the user types. What data structure would you use? How do you rank suggestions?
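A trie keyed by character is a common answer to the data-structure part of Q18. The sketch below is a minimal in-memory version that ranks suggestions by a query count. The count is a hypothetical popularity signal and the class names are illustrative. It walks the whole subtree on each lookup, so a system aiming for 100ms would more likely cache the top-k suggestions at each node.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # how often a query ending at this node was seen


class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def record(self, query, times=1):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += times

    def suggest(self, prefix, k=5):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every completed query below that node.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                found.append((current.count, text))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Highest count first; ties broken alphabetically.
        found.sort(key=lambda pair: (-pair[0], pair[1]))
        return [text for _, text in found[:k]]
```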

Q19: Add personalized search ranking to the search engine. When User A searches for "python," they should see programming results; when User B searches for "python," they should see snake-related results. How do you modify the ranking pipeline from 9.11.i without sacrificing cache hit rates?

A.10 Ride Sharing (9.11.j)

Q20: Design a ride-pooling feature (like UberPool) where multiple riders share a vehicle. How does the matching algorithm from 9.11.j change when you need to optimize for multiple pickups and dropoffs along a route?

Q21: Design the driver incentive system that offers bonuses during high-demand periods. How do you set incentive amounts to maximize driver supply without overspending? How does this interact with the surge pricing system?

A.11 Payment System (9.11.k)

Q22: Design a subscription billing system on top of the payment platform from 9.11.k. Handle recurring charges, dunning (retry failed payments), plan changes mid-cycle with proration, and cancellation. Draw the state machine for a subscription lifecycle.
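Before drawing the state machine for Q22, it can help to write the transitions as a lookup table. The states and event names below are hypothetical and form one possible lifecycle, not the one prescribed by 9.11.k. Proration and retry scheduling are deliberately left out.

```python
from enum import Enum


class SubState(Enum):
    TRIALING = "trialing"
    ACTIVE = "active"
    PAST_DUE = "past_due"
    CANCELED = "canceled"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (SubState.TRIALING, "charge_succeeded"): SubState.ACTIVE,
    (SubState.TRIALING, "charge_failed"): SubState.PAST_DUE,
    (SubState.TRIALING, "cancel"): SubState.CANCELED,
    (SubState.ACTIVE, "charge_succeeded"): SubState.ACTIVE,    # renewal
    (SubState.ACTIVE, "charge_failed"): SubState.PAST_DUE,     # dunning starts
    (SubState.ACTIVE, "plan_changed"): SubState.ACTIVE,        # proration elsewhere
    (SubState.ACTIVE, "cancel"): SubState.CANCELED,
    (SubState.PAST_DUE, "charge_succeeded"): SubState.ACTIVE,  # retry recovered
    (SubState.PAST_DUE, "retries_exhausted"): SubState.CANCELED,
    (SubState.PAST_DUE, "cancel"): SubState.CANCELED,
}


def apply_event(state, event):
    """Return the next state, or raise if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.value}"
        ) from None
```

Because `CANCELED` has no outgoing transitions in this table, it is terminal. Whether a canceled subscription can be reactivated is a design decision worth arguing in your answer.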

Q23: Design the dispute/chargeback handling workflow end-to-end. A cardholder calls their bank to dispute a $100 charge. Walk through every step from dispute initiation to resolution, including evidence collection, deadline management, and financial impact on the merchant.

A.12 Monitoring System (9.11.l)

Q24: Design an anomaly detection system that automatically identifies unusual metric patterns without manual threshold configuration. What ML approach would you use? How do you handle seasonality (e.g., traffic is always lower on weekends) and avoid false positives during known events (deployments, marketing campaigns)?

Q25: Design a Service Level Objective (SLO) monitoring feature. Users define SLOs like "99.9% of requests complete under 500ms over a 30-day rolling window." How do you track the error budget (how many slow or failed requests the SLO still allows within the window)? How do you alert when the error budget burn rate is too high?
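For Q25, burn rate is usually defined as the observed fraction of bad requests divided by the fraction the SLO allows. A burn rate of 1 uses up the budget exactly at the end of the window. The sketch below assumes evenly distributed traffic, and the function names are hypothetical.

```python
def burn_rate(bad_requests, total_requests, slo_target):
    """Observed bad fraction relative to the fraction the SLO allows."""
    allowed_bad_fraction = 1.0 - slo_target
    observed_bad_fraction = bad_requests / total_requests
    return observed_bad_fraction / allowed_bad_fraction


def days_until_exhausted(window_days, rate):
    """Days until a full budget is used up if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Example: 5,000 slow requests out of 1,000,000 against a 99.9% SLO.
rate = burn_rate(bad_requests=5_000, total_requests=1_000_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x, budget lasts {days_until_exhausted(30, rate):.1f} days")
```

In this example 0.5% of requests are bad against an allowance of 0.1%, so the budget burns about five times too fast and a full 30-day budget would last roughly six days.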

Section B: Capacity Estimation Drills

Spend 5 minutes on each. Show your work.

Q26: A messaging app has 500 million daily active users. Each user sends an average of 40 messages per day. Average message size is 200 bytes. Messages are stored for 5 years.

What is the write throughput in messages/sec?
What is the total storage needed (before replication)?
With 3x replication, how much raw storage?
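After attempting Q26 by hand, the arithmetic can be checked with a short script. It assumes decimal units (1 PB = 10^15 bytes), a flat 86,400-second day, 365-day years, and no per-message metadata or index overhead.

```python
SECONDS_PER_DAY = 86_400

daily_users = 500_000_000
messages_per_user = 40
message_bytes = 200
retention_days = 5 * 365  # ignores leap days
replication_factor = 3

messages_per_day = daily_users * messages_per_user
writes_per_sec = messages_per_day / SECONDS_PER_DAY
storage_bytes = messages_per_day * message_bytes * retention_days
replicated_bytes = storage_bytes * replication_factor

print(f"write throughput: {writes_per_sec:,.0f} messages/sec")
print(f"storage before replication: {storage_bytes / 1e15:.1f} PB")
print(f"storage with 3x replication: {replicated_bytes / 1e15:.1f} PB")
```

Under these assumptions the script reports about 231,481 messages/sec on average, 7.3 PB before replication, and 21.9 PB after. Peak throughput would be higher than the daily average.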

Q27: A video platform hosts 800 million videos. Average video size is 500 MB (after transcoding to multiple resolutions). The platform serves 1 billion video views per day, with average view duration of 5 minutes at 5 Mbps bitrate.

What is the total storage for all videos?
What is the peak bandwidth needed (assume 3x peak-to-average ratio)?
How many CDN edge servers (each handling 10 Gbps) are needed at peak?
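A similar check for Q27 follows. It assumes decimal units, views spread evenly across the day before the 3x peak factor is applied, and edge servers that can be driven at their full 10 Gbps with no headroom.

```python
import math

SECONDS_PER_DAY = 86_400

video_count = 800_000_000
video_bytes = 500 * 10**6          # 500 MB per video, all renditions
views_per_day = 1_000_000_000
view_seconds = 5 * 60
bitrate_bps = 5 * 10**6            # 5 Mbps
peak_factor = 3
edge_capacity_bps = 10 * 10**9     # 10 Gbps per edge server

storage_bytes = video_count * video_bytes
avg_bps = views_per_day * view_seconds * bitrate_bps / SECONDS_PER_DAY
peak_bps = avg_bps * peak_factor
edge_servers = math.ceil(peak_bps / edge_capacity_bps)

print(f"video storage: {storage_bytes / 1e15:.0f} PB")
print(f"peak bandwidth: {peak_bps / 1e12:.1f} Tbps")
print(f"edge servers at peak: {edge_servers:,}")
```

This gives 400 PB of storage, roughly 52.1 Tbps at peak, and 5,209 edge servers. A real deployment would add headroom and redundancy on top of that floor.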

Q28: A ride-sharing platform operates in 500 cities with 2 million active drivers. Each driver sends a GPS update every 4 seconds. Each update is 100 bytes. Location data is retained for 90 days.

What is the ingestion rate in updates/sec and MB/sec?
What is the total storage for 90 days (uncompressed)?
If each Redis instance handles 100K operations/sec, how many instances for the geospatial index?
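The numbers for Q28 can be checked the same way. The script assumes decimal units and counts only location writes when sizing Redis. Nearby-driver queries, replicas, and headroom would all raise the instance count.

```python
import math

SECONDS_PER_DAY = 86_400

drivers = 2_000_000
update_interval_sec = 4
update_bytes = 100
retention_days = 90
redis_ops_per_sec = 100_000

updates_per_sec = drivers // update_interval_sec
ingest_bytes_per_sec = updates_per_sec * update_bytes
storage_bytes = ingest_bytes_per_sec * SECONDS_PER_DAY * retention_days
redis_instances = math.ceil(updates_per_sec / redis_ops_per_sec)

print(f"ingestion: {updates_per_sec:,} updates/sec, {ingest_bytes_per_sec / 1e6:.0f} MB/sec")
print(f"90-day storage: {storage_bytes / 1e12:.1f} TB")
print(f"Redis instances (writes only): {redis_instances}")
```

The result is 500,000 updates/sec, 50 MB/sec, about 388.8 TB over 90 days, and a floor of 5 Redis instances for the write path alone.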

Q29: A payment system processes 50 million transactions per day. Each transaction generates 4 ledger entries (200 bytes each) and 3 webhook events (1 KB each). Transaction records are 2 KB each and retained for 7 years.

What is the daily storage for transactions + ledger?
What is the webhook delivery throughput (events/sec)?
What is the total 7-year storage for transactions only?
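For Q29, the sketch below assumes decimal units (1 KB = 1,000 bytes), 365-day years, and an even spread of transactions across the day. Webhook retries are not counted.

```python
SECONDS_PER_DAY = 86_400

tx_per_day = 50_000_000
tx_bytes = 2_000                 # 2 KB per transaction record
ledger_entries_per_tx = 4
ledger_entry_bytes = 200
webhooks_per_tx = 3
retention_days = 7 * 365         # ignores leap days

daily_bytes = tx_per_day * (tx_bytes + ledger_entries_per_tx * ledger_entry_bytes)
webhook_events_per_sec = tx_per_day * webhooks_per_tx / SECONDS_PER_DAY
seven_year_tx_bytes = tx_per_day * tx_bytes * retention_days

print(f"daily transactions + ledger: {daily_bytes / 1e9:.0f} GB")
print(f"webhook throughput: {webhook_events_per_sec:,.0f} events/sec")
print(f"7-year transaction storage: {seven_year_tx_bytes / 1e12:.1f} TB")
```

This works out to 140 GB per day for transactions plus ledger entries, about 1,736 webhook events/sec on average, and 255.5 TB of transaction records over 7 years.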

Q30: A monitoring system tracks 50 million unique time series. Each series receives a data point every 10 seconds. Data points are 50 bytes raw but compress to 4 bytes with Gorilla compression. Full-resolution data is kept for 15 days; 1-minute rollups are kept for 1 year.

What is the raw ingestion rate (points/sec and GB/sec)?
What is the compressed 15-day storage?
What is the 1-year rollup storage (at 1/6 the full resolution, compressed)?
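For Q30, the script below assumes decimal units, a 365-day year, and one 4-byte compressed value per rollup point. Rollups that store several aggregates (for example min, max, and sum) would multiply the last figure.

```python
SECONDS_PER_DAY = 86_400

series = 50_000_000
interval_sec = 10
raw_point_bytes = 50
compressed_point_bytes = 4
full_res_days = 15
rollup_interval_sec = 60
rollup_days = 365

points_per_sec = series // interval_sec
raw_bytes_per_sec = points_per_sec * raw_point_bytes
full_res_bytes = points_per_sec * compressed_point_bytes * SECONDS_PER_DAY * full_res_days
rollup_points_per_series = rollup_days * SECONDS_PER_DAY // rollup_interval_sec
rollup_bytes = series * rollup_points_per_series * compressed_point_bytes

print(f"ingestion: {points_per_sec:,} points/sec, {raw_bytes_per_sec / 1e9:.2f} GB/sec raw")
print(f"15-day compressed storage: {full_res_bytes / 1e12:.2f} TB")
print(f"1-year rollup storage: {rollup_bytes / 1e12:.2f} TB")
```

The output is 5,000,000 points/sec, 0.25 GB/sec raw, 25.92 TB for 15 days at full resolution, and 105.12 TB for a year of 1-minute rollups.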

Section C: Trade-Off Analysis Questions

For each question, argue both sides before giving your recommendation.

Q31: Consistency vs Availability. The payment system (9.11.k) chooses strong consistency, while the social media feed (9.11.d) chooses eventual consistency. For each system, argue what would happen if you swapped the consistency model. Under what specific failure scenario would the swap cause the most damage?

Q32: Push vs Pull. Compare the push model used in the notification system (9.11.g) vs the pull model option in the monitoring system (9.11.l). What properties of each system's data flow make one model more appropriate than the other? Could the notification system use pull instead?

Q33: SQL vs NoSQL. The URL shortener (9.11.a) recommends NoSQL, while the payment system (9.11.k) requires PostgreSQL. Identify the specific properties of each workload that drive this database choice. Could you build the URL shortener on PostgreSQL? What would you gain and what would you lose?

Q34: Sync vs Async Processing. The fraud detection in the payment system (9.11.k) uses synchronous ML scoring, while analytics in the URL shortener (9.11.a) uses async processing. For each: what is the cost of making it the opposite? Quantify the latency and accuracy impact.

Q35: Centralized vs Distributed. The rate limiter (9.11.b) discusses centralized (single Redis) vs distributed (per-node) approaches. The ride-sharing matching (9.11.j) uses per-city architecture. Compare the coordination overhead, accuracy, and failure blast radius of centralized vs distributed approaches in these two contexts.

Q36: Cache Strategies Compared. Compare caching across three systems:

  - URL shortener (9.11.a): simple key-value cache with LRU
  - Social media feed (9.11.d): pre-computed feed cache
  - E-commerce (9.11.f): product catalog cache with invalidation

For each: What is the cache hit ratio target? What happens on a cache miss? How is cache invalidation handled? Which system is most sensitive to serving stale data?

Section D: Failure Scenario Questions

For each scenario, describe: (1) how the failure manifests to end users, (2) how the system detects it, (3) what automated mitigation kicks in, and (4) what manual intervention may be needed.

Q37: The Redis cluster used for geospatial indexing in the ride-sharing platform (9.11.j) suffers a complete failure. 5 million active drivers are sending location updates and 1,000 ride requests per second are incoming.

Q38: The Kafka cluster in the monitoring system (9.11.l) loses 2 of 3 brokers simultaneously. Metrics are being produced at 10 million points/sec by agents across 100,000 hosts.

Q39: A merchant on the payment platform (9.11.k) accidentally sends 1 million payment requests in 5 minutes due to a retry bug in their code. Each request has a different idempotency key but the same customer and amount.

Q40: The CDN serving video content (9.11.e) has a cache purge event that invalidates 80% of cached content simultaneously. Origin servers are sized to handle only 10% of normal traffic directly.

Q41: A celebrity with 200 million followers posts on the social media platform (9.11.d) during peak hours. The fan-out system needs to update 200 million timelines.

Q42: The HSM (Hardware Security Module) in the payment system's PCI environment (9.11.k) becomes unresponsive. No new card tokenization or payment processing can occur until it recovers.

Section E: Cross-System Comparison Questions

Q43: Message Queue Usage. Compare message queue usage across five systems: Chat (9.11.c), Notification (9.11.g), Video Streaming (9.11.e), Payment (9.11.k), and Monitoring (9.11.l). For each: What data is in the queue? What ordering guarantees are needed? What happens if a message is lost? What happens if a message is processed twice?

Q44: WebSocket / Persistent Connections. Three systems use WebSocket or persistent connections: Chat (9.11.c), Ride-Sharing (9.11.j for driver locations), and Monitoring dashboards (9.11.l for real-time updates). Compare: the connection lifecycle, message frequency, payload size, and what happens when the connection drops in each system.

Q45: Database Sharding Strategies. Compare sharding across five systems:

  - URL Shortener (9.11.a): hash on short_code
  - Chat (9.11.c): shard on conversation_id
  - E-Commerce (9.11.f): shard on product_id or merchant_id
  - Payment (9.11.k): shard on merchant_id
  - Ride-Sharing (9.11.j): shard by city

For each: Why was that shard key chosen? What query patterns become expensive (cross-shard) due to the choice?

Q46: Data Retention and Lifecycle. Four systems have notable data retention concerns: File Storage (9.11.h), Monitoring (9.11.l), Payment (9.11.k), and Search Engine (9.11.i). Compare how each handles data lifecycle, from creation through archival to deletion. Which has the strictest retention requirements and why?

Q47: Read/Write Ratio Impact. Rank all 12 systems by read-to-write ratio (most read-heavy to most write-heavy). For the top 3 read-heavy systems, explain why caching is especially effective. For the top 3 write-heavy systems, explain why write-ahead logs, append-only structures, or message queues are important.

Section F: Design Extension Challenges

These require combining concepts from multiple system designs.

Q48: API Analytics Platform. Design an API analytics platform that combines the rate limiter (9.11.b), the monitoring system (9.11.l), and the notification system (9.11.g). Merchants can view their API usage, set usage alerts, and receive notifications when approaching rate limits. Sketch the architecture showing how data flows from rate limiter events to analytics to alerts.

Q49: Food Delivery Platform. Design a food delivery platform that combines ride-sharing (9.11.j) for driver dispatch, e-commerce (9.11.f) for restaurant menus and ordering, chat (9.11.c) for customer-driver communication, and payment (9.11.k) for checkout. Identify which components you can reuse directly, which need modification, and what new component (e.g., order preparation tracking) is needed that does not exist in any of the 12 designs.

Q50: Enterprise Unified Search. Design a search system for an enterprise platform that spans file storage (9.11.h), chat messages (9.11.c), and wiki documents. Users search once and see results from all three sources, ranked by relevance. How do you build a unified index across different data sources? How do you enforce access control so that users see only the files and messages they have permission to access?

Recommended Study Approach

Week 1: Complete Section A (Mini Design Challenges)
  - 3-4 problems per day (Section A has 25 problems)
  - Time-box to 20 minutes each
  - Write out your design on paper before reviewing the source files

Week 2: Complete Sections B and C (Estimation + Trade-Offs)
  - Practice estimation daily (5 min each)
  - Trade-off questions: write pros/cons for both sides first

Week 3: Complete Sections D and E (Failures + Comparisons)
  - These are common in senior-level interviews
  - Focus on depth: automated detection, mitigation, blast radius

Week 4: Complete Section F (Extensions)
  - These test combining multiple system patterns
  - Practice drawing architecture diagrams from scratch
  - Time yourself to 45 minutes per problem