Episode 9 — System Design / 9.9 — Core Infrastructure
9.9 Exercise Questions
Caching Strategies (Questions 1-8)
Q1. You have an e-commerce product catalog with 10 million products. Product details are read 1000x more often than they are updated. Which caching pattern (cache-aside, write-through, write-behind, read-through) would you choose, and why? What eviction policy would you use?
Q2. A cache entry for a viral tweet expires, and 50,000 requests hit the database simultaneously. (a) Name this problem. (b) Describe three different strategies to prevent it. (c) Which strategy would you choose for this specific scenario and why?
Q3. Compare Redis and Memcached across the following dimensions: data structures, persistence, clustering, and use cases. When would you choose Memcached over Redis?
Q4. You are using cache-aside with a TTL of 60 seconds. A user updates their profile name. Explain two different strategies to ensure the user sees their updated name immediately after the update, even though the cache has not expired.
Q5. Design a multi-level caching strategy for a news website. Identify what you would cache at each level (browser, CDN, application, distributed cache, database query cache) and what TTL you would set for each.
Q6. You have 5 Redis cache nodes sharded with client-side consistent hashing. Node 3 goes down. (a) What happens to the cached data on node 3? (b) How does consistent hashing minimize the impact? (c) What percentage of keys are affected compared to simple modular hashing?
Q7. Explain the difference between LRU and LFU eviction policies. Design a scenario where LFU performs significantly better than LRU, and a scenario where LRU performs better than LFU.
Q8. Your application uses write-behind caching. The cache node crashes with 500 uncommitted writes in the buffer. (a) What is the impact? (b) How could you mitigate this risk? (c) What types of data are appropriate for write-behind despite this risk?
CDN and Content Delivery (Questions 9-15)
Q9. Explain the difference between push CDN and pull CDN. Your startup is launching a marketing website with 50 pages and a media library of 10,000 images. Which CDN type would you use for (a) the HTML pages and (b) the images? Justify your choice.
Q10. Write the Cache-Control header for each of the following:
- (a) A JavaScript bundle with a hash in the filename (e.g., app.abc123.js)
- (b) A user's profile page that changes occasionally
- (c) A real-time stock price API response
- (d) A static image that rarely changes but may be updated
Q11. A user in Australia reports that your website loads slowly. Your servers are in US-East. Describe step-by-step how you would use a CDN to improve their experience. Include DNS resolution, cache behavior, and what happens on a cache miss.
Q12. You deploy a bug fix to your JavaScript file, but users are still getting the old version from the CDN. Describe three different strategies to force the CDN to serve the new version.
Q13. What is Edge Side Includes (ESI)? Draw an ASCII diagram showing how a CDN could serve a partially cached page where the header and footer are cached but the user greeting is dynamic.
Q14. Your CDN costs are increasing rapidly. Upon investigation, you find that cache hit ratio is only 40%. List five possible causes of a low cache hit ratio and a mitigation strategy for each.
Q15. Explain how Anycast routing works for CDNs. How does it differ from DNS-based geographic routing? What are the advantages of each approach?
Load Balancing (Questions 16-23)
Q16. You are designing a system with two types of traffic: (a) short-lived HTTP API requests and (b) long-lived WebSocket connections for real-time chat. Should you use Layer 4 or Layer 7 load balancing for each? Explain your reasoning.
Q17. You have 3 servers. Server A has 8 CPU cores, Server B has 4 cores, and Server C has 2 cores. (a) Configure weighted round robin with appropriate weights. (b) Show the request distribution for the first 14 requests.
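To check a hand-worked distribution for Q17, the following is a minimal sketch of smooth weighted round robin, one common variant that interleaves servers instead of sending each server's share back to back. The function name and the placeholder weights (`X`, `Y`) are hypothetical and are not the answer to part (a); substitute your own weights.

```python
from collections import Counter


def smooth_weighted_round_robin(weights, n_requests):
    """Return the list of servers chosen for n_requests.

    weights: dict mapping server name -> positive integer weight.
    On every request each server's running score grows by its weight,
    the highest score wins, and the winner's score drops by the total weight.
    """
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n_requests):
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)  # ties go to the first server listed
        current[chosen] -= total
        picks.append(chosen)
    return picks


if __name__ == "__main__":
    # Placeholder weights for illustration only.
    picks = smooth_weighted_round_robin({"X": 3, "Y": 1}, 8)
    print(picks)
    print(Counter(picks))
```

Over any full cycle (a number of requests equal to the sum of the weights), each server is chosen exactly as many times as its weight, which is the property to verify against your own table.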
Q18. Explain why sticky sessions are considered an anti-pattern. Describe the problems they cause and the preferred alternative. Draw an ASCII diagram showing the alternative architecture.
Q19. Server 2 in your pool begins returning 500 errors for 50% of requests but responds to TCP health checks successfully. (a) Why does the health check not catch this? (b) Design a health check that would detect this issue.
Q20. You are adding a 4th cache server to a cluster of 3. (a) With simple modular hashing (hash % N), what percentage of keys are remapped? (b) With consistent hashing, what percentage? (c) Explain why this difference matters.
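One way to sanity-check an answer to Q20 is to measure the remapping empirically. The sketch below is an illustration under assumptions: the node names (`cache-1` to `cache-4`), the synthetic keys, the use of SHA-256, and the choice of 200 virtual nodes per server are all hypothetical and not taken from the question.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic hash so results are repeatable across runs."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


def modular_owner(key: str, n_nodes: int) -> int:
    """Simple modular placement: hash % N."""
    return stable_hash(key) % n_nodes


class Ring:
    """Consistent-hash ring with virtual nodes (assumed: 200 per server)."""

    def __init__(self, nodes, vnodes=200):
        points = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.hashes = [point for point, _ in points]
        self.owners = [node for _, node in points]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.hashes, stable_hash(key)) % len(self.hashes)
        return self.owners[idx]


def remap_fraction(before, after, keys):
    """Fraction of keys whose owner differs between two placement functions."""
    moved = sum(1 for key in keys if before(key) != after(key))
    return moved / len(keys)


keys = [f"key-{i}" for i in range(10_000)]

modular = remap_fraction(
    lambda k: modular_owner(k, 3), lambda k: modular_owner(k, 4), keys
)

ring3 = Ring(["cache-1", "cache-2", "cache-3"])
ring4 = Ring(["cache-1", "cache-2", "cache-3", "cache-4"])
consistent = remap_fraction(ring3.owner, ring4.owner, keys)

if __name__ == "__main__":
    print(f"modular hashing:    {modular:.1%} of keys remapped")
    print(f"consistent hashing: {consistent:.1%} of keys remapped")
```

The measured values vary slightly with the key set and the number of virtual nodes, so compare them to your derived percentages as approximations, not exact figures.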
Q21. Design a global server load balancing (GSLB) strategy for a service with data centers in US-East, EU-West, and Asia-Pacific. How do you handle the scenario where the US-East data center goes down?
Q22. Compare AWS ALB and AWS NLB. When would you use each? Can you use both together, and if so, how?
Q23. What is the "thundering herd" problem in the context of load balancing when a server comes back online? How does a load balancer's health check configuration (interval, threshold) help mitigate it?
API Gateway (Questions 24-30)
Q24. List all the responsibilities of an API gateway. For each responsibility, explain whether it would be better handled in the gateway or in individual microservices, and why.
Q25. Explain the BFF (Backend for Frontend) pattern. You have three client types: iOS app, Android app, and web dashboard. Should you create three separate BFF gateways or one shared gateway? Discuss trade-offs.
Q26. Design a rate limiting strategy for an API with three tiers: Free (100 req/min), Pro (1000 req/min), and Enterprise (10000 req/min). (a) Which algorithm would you use? (b) Where would you store the counters? (c) How do you handle distributed rate limiting across multiple gateway instances?
Q27. Your API gateway is becoming a bottleneck. Response times have increased by 50ms across all services. (a) List five possible causes. (b) For each cause, suggest a mitigation.
Q28. Explain how an API gateway handles request aggregation. The mobile app needs a "home screen" that shows: user profile (User Service), recent orders (Order Service), and recommendations (Recommendation Service). Draw the aggregation flow with and without the gateway.
Q29. Your partner API requires API key authentication, while your mobile app uses JWT tokens. How would you configure the API gateway to handle both authentication methods? Draw the flow for each.
Q30. Compare Kong, AWS API Gateway, and Nginx as API gateways. For each, describe a scenario where it would be the best choice.
Message Queues (Questions 31-38)
Q31. You are building a food delivery app. Classify each of the following operations as synchronous or asynchronous, and explain why:
- (a) Validating the customer's payment method
- (b) Notifying the restaurant about a new order
- (c) Sending an order confirmation email to the customer
- (d) Checking if the requested items are available
- (e) Updating the analytics dashboard
Q32. Your Kafka consumer processes an order, charges the customer's card, but crashes before committing the offset. When it restarts, it processes the same message again, charging the customer twice. (a) Name this delivery semantic. (b) Describe two ways to prevent the double charge.
Q33. Design a dead letter queue strategy for an order processing system. What information should be captured in the DLQ? How would you monitor it? How would you reprocess messages after fixing a bug?
Q34. Compare the Saga pattern (choreography) with two-phase commit (2PC) for the following scenario: an order requires charging a card, reserving inventory, and creating a shipping label. Draw the flow for both approaches and explain which you would choose.
Q35. You are choosing between RabbitMQ and Kafka for each scenario. Justify your choice:
- (a) Processing 10 million clickstream events per second for analytics
- (b) Distributing tasks to 50 worker processes with complex routing rules
- (c) Building an event sourcing system where you need to replay events
- (d) Simple job queue for sending emails with retry logic
Q36. Explain backpressure in message queues. Your producer generates 10,000 messages/second but your consumer can only process 2,000 messages/second. (a) What happens without backpressure? (b) Design a backpressure strategy using auto-scaling. (c) What is your fallback if auto-scaling cannot keep up?
Q37. Draw the architecture for an event-driven e-commerce system where a single "order placed" event triggers five downstream actions. Show which services subscribe to the event and what each does with it.
Q38. Explain the Outbox Pattern. Why is publishing a message to a queue and writing to a database in the same "transaction" problematic? Draw the outbox pattern that solves this.
Microservices Architecture (Questions 39-45)
Q39. Your monolithic application has three modules: User Management, Order Processing, and Reporting. The Reporting module causes CPU spikes that degrade Order Processing performance. (a) How does microservices architecture solve this? (b) What are the new challenges you introduce? (c) Would you extract all three at once or one at a time?
Q40. Compare client-side and server-side service discovery. Draw ASCII diagrams for both. Which would you choose for a Kubernetes-based deployment and why?
Q41. You are decomposing a monolith. The Order table has a foreign key to the User table, and 15 queries join these tables. After splitting into Order Service and User Service with separate databases, how do you handle the data that was previously joined?
Q42. Design a canary deployment pipeline for a payment microservice that processes real money. What metrics would you monitor? What are your rollback criteria? How long would you run each canary phase?
Q43. Explain the Circuit Breaker pattern. Service A calls Service B, which starts returning errors. (a) Draw the circuit breaker state machine (Closed, Open, Half-Open). (b) What happens to requests when the circuit is open? (c) How does the circuit breaker decide to try again?
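When working through Q43, it can help to trace the three states in code. The following is a minimal, single-threaded sketch and not a production implementation; the class name, the default threshold of 3 failures, the 30-second reset timeout, and the injectable `clock` parameter are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast without touching the downstream service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial request through.
            self.state = "HALF_OPEN"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures, (re)opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
```

In use, Service A would wrap each call to Service B as `breaker.call(fetch_from_b, request)`, where `fetch_from_b` is a hypothetical client function, and handle `CircuitOpenError` with a fallback response.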
Q44. Your microservices architecture has 20 services. A user reports that an API call is slow. Explain how distributed tracing helps you identify the bottleneck. Include specific details about trace IDs, spans, and what tools you would use.
Q45. A startup with 5 engineers asks your advice on architecture. They currently have a monolith and are considering moving to microservices because "that is what Netflix uses." What advice would you give? Under what circumstances should they consider microservices?
Cross-Cutting / Integration (Questions 46-50)
Q46. Design the complete infrastructure for a URL shortener that handles 1 billion redirects per day. Include: caching strategy, CDN usage, load balancing, database choice, and whether you need message queues. Draw the full architecture diagram.
Q47. You are designing a notification system that sends push notifications, emails, and SMS. A single event (e.g., "order shipped") may trigger all three. Design the infrastructure using message queues, including how you handle failures for each channel independently.
Q48. A flash sale starts at noon and 500,000 users hit your e-commerce site simultaneously. Walk through how each infrastructure component (CDN, load balancer, API gateway, cache, message queue) handles the traffic spike. Identify which components absorb load and which are potential bottlenecks.
Q49. Design the infrastructure for a real-time multiplayer game backend that supports 100,000 concurrent players. Address: how players are routed to game servers (load balancing), how game state is shared (caching), how match results are processed (queues), and how the system scales.
Q50. You are migrating from a single-region to a multi-region architecture. Describe the changes needed for each infrastructure component: (a) Load balancing (b) Caching (c) Message queues (d) Databases (e) CDN. What new problems does multi-region introduce?