9.10 — Exercise Questions

How to use this material (instructions)

Attempt each question on paper or a whiteboard before looking at hints.
Time yourself: aim for 2-5 minutes per conceptual question, 10-15 minutes per design question.
Speak your answer aloud to practice interview articulation.
Mark questions you struggled with and revisit them before interviews.

Section A: Reliability and Availability (Questions 1-8)

Q1. A system has three components in series with availabilities of 99.9%, 99.95%, and 99.99%. Calculate the overall system availability.

Q2. You add a redundant replica to the 99.9% component from Q1 (so two 99.9% instances in parallel). Recalculate the overall system availability. How much did it improve?
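Hint for checking Q1 and Q2 after your own attempt: the sketch below is one way to verify the arithmetic. It assumes component failures are independent, which the questions do not state, and the helper names series and parallel are made up for illustration.

```python
from math import prod


def series(availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def parallel(availability, copies=2):
    """Availability of redundant copies where any single copy is enough.

    Assumes the copies fail independently of each other.
    """
    return 1 - (1 - availability) ** copies


# Q1: three components in series.
q1 = series([0.999, 0.9995, 0.9999])

# Q2: the 99.9% component is duplicated, then the chain is in series again.
q2 = series([parallel(0.999), 0.9995, 0.9999])

print(f"Q1: {q1:.4%}  Q2: {q2:.4%}  improvement: {q2 - q1:.4%}")
```

The same two helpers also cover Q8 by passing five 0.999 values.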

Q3. Your SLA promises 99.95% uptime. How many minutes of downtime per month does this allow? Per year?
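Hint for checking Q3: a minimal sketch of the downtime-budget arithmetic. The 30-day month and 365-day year are assumptions, since the question does not fix either, and allowed_downtime_minutes is a hypothetical helper name.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_target, days):
    """Minutes of downtime permitted over `days` at the given uptime target."""
    return (1 - uptime_target) * days * MINUTES_PER_DAY


per_month = allowed_downtime_minutes(0.9995, days=30)   # assumes a 30-day month
per_year = allowed_downtime_minutes(0.9995, days=365)   # assumes a 365-day year

print(f"{per_month:.1f} min/month, {per_year:.1f} min/year")
```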

Q4. Explain the difference between SLI, SLO, and SLA with a concrete example for a payment processing service.

Q5. Compare active-passive and active-active failover. For each of the following systems, which would you recommend and why?

a) A single-region e-commerce checkout service
b) A global content delivery system
c) A financial trading platform

Q6. Define RPO and RTO. A bank requires RPO = 0 and RTO < 30 seconds. What disaster recovery strategy is needed?

Q7. Draw a multi-region deployment architecture for a social media platform serving users in North America, Europe, and Asia. Show where data replication occurs and what happens when the EU region goes down.

Q8. A system has five components, each at 99.9% availability. Calculate the overall availability when they are:

a) All in series
b) Each with one parallel redundant copy, then all in series

Section B: Fault Tolerance (Questions 9-16)

Q9. Explain the three states of a circuit breaker. What are reasonable values for failure threshold and timeout for a payment service calling a fraud detection API?

Q10. A service has the following dependencies: Database, Cache (Redis), Email Service, Recommendation Engine. Design a bulkhead strategy that ensures the core flow (read/write data) survives even if Email and Recommendations fail completely.

Q11. Write pseudocode for an exponential backoff retry with jitter. Your base delay is 500ms, max delay is 30 seconds, and max retries is 4.
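Hint for Q11: one possible shape of the answer, written as runnable Python rather than pseudocode. It uses the "full jitter" variant (a random delay between zero and the exponential ceiling), which is only one of several jitter strategies. The names backoff_delays and call_with_retry are hypothetical, and the sleep function is injectable so the logic can be exercised without actually waiting.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, max_retries=4, rng=random.random):
    """Yield one full-jitter delay in seconds per retry."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)  # 0.5, 1, 2, 4 seconds here
        yield rng() * ceiling                    # random point in [0, ceiling)


def call_with_retry(operation, sleep=time.sleep, **backoff):
    """Call operation(); on any exception wait, then retry.

    Makes at most 1 + max_retries attempts and re-raises the last error.
    """
    delays = backoff_delays(**backoff)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the original error
            sleep(delay)
```

With these parameters the 30-second cap is never reached; it would only start to matter from the seventh retry onward (0.5 x 2^6 = 32 seconds). A real client would also retry only errors that are safe to retry, which this sketch does not distinguish.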

Q12. What is the "thundering herd" problem? How does jitter solve it? Give a concrete scenario.

Q13. Explain graceful degradation for Netflix. If the recommendation service goes down, what should happen? What about the streaming service itself?

Q14. List all single points of failure in this architecture and propose a fix for each:

  [User] -> [1 Nginx LB] -> [1 App Server] -> [1 PostgreSQL DB]

Q15. You are designing a chaos engineering experiment for a microservices system with 12 services. Describe:

What steady state looks like
Three failure scenarios to test
How you would measure success/failure of each experiment

Q16. Explain the difference between a crash failure, an omission failure, and a Byzantine failure. Which is most common in practice? Which is hardest to handle?

Section C: Observability (Questions 17-24)

Q17. Name the three pillars of observability. For each, give one tool/technology and one example metric or data point.

Q18. Convert this unstructured log line into a structured JSON log:

2024-03-15 ERROR OrderService: Failed to process order #12345 for user alice@example.com - insufficient inventory for SKU-789
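Hint for Q18: a sketch of one acceptable structure, produced by parsing the line above with a regular expression. The JSON field names (timestamp, level, service, order_id, and so on) are assumptions, since any consistent schema would do, and the pattern only handles this one message shape.

```python
import json
import re

LINE = ("2024-03-15 ERROR OrderService: Failed to process order #12345 "
        "for user alice@example.com - insufficient inventory for SKU-789")

# Pattern tailored to this single message format (illustrative only).
PATTERN = re.compile(
    r"(?P<date>\S+) (?P<level>\w+) (?P<service>\w+): "
    r"Failed to process order #(?P<order_id>\d+) for user (?P<user>\S+) - "
    r"(?P<reason>.+?) for (?P<sku>SKU-\d+)"
)


def to_structured(line):
    """Turn the free-text log line into a dict of searchable fields."""
    match = PATTERN.fullmatch(line)
    if match is None:
        raise ValueError("unrecognized log line")
    fields = match.groupdict()
    return {
        "timestamp": fields["date"],
        "level": fields["level"],
        "service": fields["service"],
        "message": "Failed to process order",
        "order_id": fields["order_id"],
        "user_email": fields["user"],
        "reason": fields["reason"].replace(" ", "_"),
        "sku": fields["sku"],
    }


print(json.dumps(to_structured(LINE), indent=2))
```

A fuller answer might also discuss whether the email address should be logged at all, or replaced with a user ID.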

Q19. A distributed system has 5 microservices. A request takes 3.2 seconds end-to-end. Describe exactly how you would use distributed tracing to identify the bottleneck. What headers are involved?

Q20. Explain the RED method and the USE method. When do you use each?

Q21. Write three PromQL queries:

a) Request rate per second over the last 5 minutes
b) 95th percentile latency
c) Error rate as a percentage

Q22. Compare the ELK stack and Prometheus + Grafana. When would you use each? Can they be used together?

Q23. Design an alerting strategy for an e-commerce checkout service. Define at least 3 alerts with severity levels, thresholds, and actions.

Q24. What is OpenTelemetry? Why is it becoming the industry standard? How does it relate to Jaeger, Prometheus, and the ELK stack?

Section D: Rate Limiting (Questions 25-32)

Q25. Explain the token bucket algorithm. A bucket has capacity 10 and a refill rate of 5 tokens/second. If 15 requests arrive simultaneously while the bucket is full, how many are allowed?
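Hint for Q25: a minimal token bucket sketch for checking your answer. It assumes the bucket starts full and that each request costs one token. The TokenBucket class and its explicit now argument are illustrative choices, not a reference implementation.

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)       # assumption: bucket starts full
        self.last = now

    def allow(self, now):
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=10, refill_rate=5)
burst = [bucket.allow(now=0.0) for _ in range(15)]   # 15 simultaneous requests
print(sum(burst), "allowed,", burst.count(False), "rejected")
```

Calling allow again one second later admits up to five more requests, which is the refill rate at work.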

Q26. What is the boundary problem with fixed window rate limiting? Draw a timeline showing how a user can send 200 requests in 60 seconds with a 100 req/min limit.
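Hint for Q26: the boundary effect can be reproduced with a small simulation. The FixedWindowLimiter class is a hypothetical minimal limiter, and the timestamps (seconds 59 and 60) are chosen only to straddle a window boundary.

```python
from collections import defaultdict


class FixedWindowLimiter:
    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)   # window index -> accepted requests

    def allow(self, timestamp):
        """Accept the request if its window has not yet reached the limit."""
        window_index = int(timestamp // self.window)
        if self.counts[window_index] >= self.limit:
            return False
        self.counts[window_index] += 1
        return True


limiter = FixedWindowLimiter()
end_of_first = sum(limiter.allow(59.0) for _ in range(100))     # window 0
start_of_second = sum(limiter.allow(60.0) for _ in range(100))  # window 1
print(end_of_first + start_of_second, "requests accepted between t=59s and t=60s")
```

Each window individually respects the 100-request limit, yet the two bursts land within about one second of each other.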

Q27. Compare token bucket, leaky bucket, and sliding window counter algorithms. For each, state: memory usage, burst tolerance, and accuracy.

Q28. You need to implement distributed rate limiting across 20 servers. Three options: Redis centralized, local counters with periodic sync, and sticky sessions. Compare the trade-offs of each approach.

Q29. Design the rate limiting strategy for a public REST API with three tiers:

Free: 60 requests/minute
Pro: 600 requests/minute
Enterprise: 6000 requests/minute

What algorithm do you use? Where in the architecture do you enforce it? What headers do you return?

Q30. A user sends a request that is rate limited. Write the complete HTTP response (status code, headers, body) that the server should return.
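Hint for Q30: a sketch that assembles one plausible response as raw HTTP text. The 429 status code and the Retry-After header are standard; the X-RateLimit-* header names are a widely used convention rather than a standard, and the JSON body fields are assumptions. rate_limited_response is a hypothetical helper.

```python
import json


def rate_limited_response(limit, retry_after_seconds):
    """Build a raw HTTP/1.1 429 response for a client that exceeded its limit."""
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"Limit of {limit} requests per minute exceeded.",
        "retry_after_seconds": retry_after_seconds,
    })
    headers = {
        "Content-Type": "application/json",
        "Content-Length": str(len(body.encode("utf-8"))),
        "Retry-After": str(retry_after_seconds),      # standard header
        "X-RateLimit-Limit": str(limit),              # conventional, non-standard
        "X-RateLimit-Remaining": "0",                 # conventional, non-standard
    }
    head = "\r\n".join(f"{name}: {value}" for name, value in headers.items())
    return f"HTTP/1.1 429 Too Many Requests\r\n{head}\r\n\r\n{body}"


print(rate_limited_response(limit=60, retry_after_seconds=30))
```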

Q31. How does DDoS protection differ from application-level rate limiting? At what network layers does each operate?

Q32. Your rate limiting Redis instance goes down. What is your fallback strategy? Is it acceptable to temporarily allow more traffic than the limit?

Section E: Auth and Authorization (Questions 33-38)

Q33. Compare session-based and token-based (JWT) authentication across five dimensions: scalability, revocation, storage, mobile friendliness, and cross-domain support.

Q34. Draw the complete JWT refresh token flow. Include what happens when:

a) The access token expires
b) The refresh token expires
c) The refresh token is stolen

Q35. Explain the OAuth 2.0 Authorization Code flow step by step. Why is the authorization code exchanged server-to-server rather than sent directly to the client?

Q36. A SaaS platform has three roles: Admin, Editor, Viewer. Design the RBAC permission matrix for the following resources: Users, Articles, Settings, Billing.

Q37. In a microservices architecture, where should JWT validation happen? Compare gateway-level auth vs per-service auth. What are the security implications of each?

Q38. Explain mTLS in a service mesh. How does it differ from regular TLS? Why is it important for zero-trust architecture?

Section F: Search Systems (Questions 39-45)

Q39. Build an inverted index for the following three documents:

Doc 1: "distributed systems are complex"
Doc 2: "search systems use inverted indexes"
Doc 3: "complex search requires distributed indexes"

Show the complete inverted index after tokenization and lowercasing.
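Hint for Q39: a sketch for checking a hand-built index. It tokenizes on whitespace and lowercases, with no stop-word removal or stemming, which is an assumption about what the question intends. build_inverted_index is an illustrative name.

```python
from collections import defaultdict

DOCS = {
    1: "distributed systems are complex",
    2: "search systems use inverted indexes",
    3: "complex search requires distributed indexes",
}


def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(DOCS)
for term, doc_ids in index.items():
    print(f"{term:12} -> {doc_ids}")
```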

Q40. Explain TF-IDF with a concrete example. Calculate the TF-IDF score for the term "distributed" in Doc 1 from Q39 (assume a corpus of 1000 documents where 50 contain "distributed"). State which logarithm base you use.
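Hint for Q40: a sketch of the basic formulation, TF = term count / document length and IDF = log(N / document frequency). TF-IDF has several variants (smoothing, different log bases), so the numbers depend on the definition chosen. tf_idf is a hypothetical helper with the logarithm passed in as a parameter.

```python
import math


def tf_idf(term_count, doc_length, corpus_size, docs_with_term, log=math.log10):
    """Unsmoothed TF-IDF: (count / length) * log(N / df)."""
    tf = term_count / doc_length
    idf = log(corpus_size / docs_with_term)
    return tf * idf


# "distributed" appears once in Doc 1, which has 4 tokens.
base10 = tf_idf(1, 4, 1000, 50)                  # log base 10
natural = tf_idf(1, 4, 1000, 50, log=math.log)   # natural log

print(f"log10: {base10:.4f}   ln: {natural:.4f}")
```

Either value is a valid answer as long as the base is stated, because relative rankings are unchanged by the choice of base.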

Q41. How does Elasticsearch handle a search query across a cluster with 5 shards? Describe the scatter-gather pattern.

Q42. Design an autocomplete system for a search box on an e-commerce site with 50 million products. Address:

Data structure choice
Latency requirements
How to rank suggestions (popularity vs relevance)
How to handle the cold-start problem for new products

Q43. Explain fuzzy search. How does Elasticsearch implement typo tolerance? What is the performance trade-off of increasing fuzziness?

Q44. Design the indexing pipeline for a news website where articles must be searchable within 30 seconds of publication. Show the data flow from article creation to search availability.

Q45. Compare these search approaches and when you would use each:

a) SQL LIKE query
b) PostgreSQL full-text search (tsvector)
c) Elasticsearch
d) Algolia (managed search)

Bonus: Cross-Cutting Design Questions (Questions 46-48)

Q46. You are designing a ride-sharing app (like Uber). For each subtopic in 9.10, describe one specific requirement:

Reliability/Availability
Fault Tolerance
Observability
Rate Limiting
Auth
Search

Q47. Rank the following from easiest to hardest to achieve and explain why:

99.9% availability
99.99% availability
99.999% availability

Q48. Design a system health dashboard that combines metrics from all six subtopics in this section. What panels would it have? What are the top 5 alerts?

Use the Quick Revision sheet to review key concepts after completing these exercises.