Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10 — Interview Questions with Model Answers
How to use this material (instructions)
- Cover the model answer and attempt each question aloud (simulate an interview).
- Time yourself: aim for 2-3 minutes per answer.
- Compare your answer with the model answer -- focus on structure and key points, not exact wording.
- Practice the weakest answers until you can deliver them fluently.
Q1: What does "five nines" availability mean, and why is it so hard to achieve?
Model Answer:
Five nines means 99.999% uptime, which translates to approximately 5.26 minutes of downtime per year. To put that in perspective, three nines (99.9%) allows 8.76 hours of downtime per year -- a massive difference.
It is hard to achieve because availability of serial components multiplies:
3 components at 99.99% each:
0.9999 x 0.9999 x 0.9999 = 99.97% (not even four nines)
To reach five nines, you need redundancy at every layer -- multiple app servers, replicated databases, redundant load balancers, multi-AZ deployment, and automated failover that completes in seconds. You also need zero-downtime deployments, because even a single 5-minute deployment window would consume your entire annual downtime budget.
The cost grows exponentially with each additional nine. Going from 99.9% to 99.99% might double your infrastructure cost; going from 99.99% to 99.999% might require multi-region active-active deployment, which can 3-5x costs.
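The arithmetic above is easy to sanity-check in code. A minimal sketch, using only the formulas already stated in the answer:

```python
def composite_availability(availabilities):
    """Availability of components in series: failures compound, so multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def downtime_minutes_per_year(availability):
    """Expected annual downtime implied by an availability fraction."""
    return (1 - availability) * 365.25 * 24 * 60

# Three 99.99% components in series fall short of four nines (~99.97%):
chain = composite_availability([0.9999] * 3)

# Five nines leaves roughly 5.26 minutes of downtime budget per year:
budget = downtime_minutes_per_year(0.99999)
```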
Q2: Explain the circuit breaker pattern. When would you use it?
Model Answer:
The circuit breaker pattern prevents failures in a downstream service from cascading upstream. It has three states:
- Closed (normal): Requests pass through. The breaker tracks failure count.
- Open (tripped): After N failures in a time window, the breaker "trips open." All requests immediately return a fallback response without contacting the downstream service. This gives the failing service time to recover.
- Half-Open (testing): After a configured timeout, the breaker allows one test request through. If it succeeds, the breaker closes (recovery confirmed). If it fails, it reopens.
I would use a circuit breaker on any call to an external dependency -- a downstream microservice, a database, a third-party API. For example, in an e-commerce system, if the recommendation service starts timing out, the circuit breaker trips and the product page returns without recommendations rather than hanging for 10 seconds per request.
The key benefit is fail-fast behavior: instead of all threads blocking on a slow service, requests fail immediately with a fallback, preserving system responsiveness.
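The three-state machine can be sketched in a few lines. This is a minimal single-threaded illustration, not a production implementation (real libraries such as Resilience4j add thread safety, sliding failure windows, and metrics):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold  # N failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds before half-open
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"   # allow one test request through
            else:
                return fallback()          # fail fast: skip the downstream call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            return fallback()
        # Success: recovery confirmed (half-open) or normal operation.
        self.state = "closed"
        self.failures = 0
        return result
```

While the breaker is open, `func` is never invoked, which is exactly the fail-fast behavior described above.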
Q3: How do logs, metrics, and traces differ? When do you use each?
Model Answer:
The three pillars serve different purposes:
Logs are discrete events with context -- "User 42 got a 500 error on /checkout at 14:32." They are high-cardinality (one per event) and best for investigating specific incidents. I use structured JSON logs so I can filter by fields like user_id, error_code, and trace_id.
Metrics are numeric measurements over time -- "p99 latency is 340ms; error rate is 2.3%." They are low-cardinality and cheap to store. I use them for dashboards, trend analysis, and alerting. The RED method (Rate, Errors, Duration) covers request-driven services.
Traces show the journey of a single request across services. A trace contains spans with timing information: "This request spent 50ms in auth, 1200ms in payment, 800ms of that in the database." I use traces when I know something is slow but I do not know where.
In practice, they work together: an alert fires on a metric (error rate spike), I search logs to find the failing requests, then I grab a trace_id from a log entry and view the trace to find the bottleneck. The correlation between all three via trace_id is critical.
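The structured-logging point can be made concrete. A minimal sketch of emitting a JSON log line that carries a trace_id for correlation (field names here are illustrative, not a specific library's schema):

```python
import json
import time

def log_event(level, message, **fields):
    """Emit one structured JSON log line. Extra fields (user_id, error_code,
    trace_id, ...) become filterable keys in the log aggregator."""
    record = {"ts": time.time(), "level": level, "msg": message, **fields}
    return json.dumps(record)

line = log_event("error", "checkout failed",
                 user_id=42, error_code=500,
                 trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
# The trace_id field is what links this log line to its distributed trace.
```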
Q4: Compare token bucket and sliding window counter for rate limiting. Which would you choose for a public API?
Model Answer:
Token bucket starts with a full bucket of N tokens, refilling at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. It allows bursts up to the bucket capacity while maintaining a steady-state rate.
Sliding window counter keeps counters for the current and previous time windows. It estimates the count in the current sliding window using a weighted formula: previous_count * overlap_percentage + current_count. Memory usage is O(1) -- just two counters per user.
| Aspect | Token Bucket | Sliding Window Counter |
|---|---|---|
| Burst handling | Allows controlled bursts | Minimal bursts |
| Memory | O(1) per key | O(1) per key |
| Accuracy | Good | Very good (~99.9%) |
| Implementation | Simpler | Slightly more complex |
For a public API, I would choose token bucket because:
- Bursts are acceptable and expected (a client loading a page makes 10 API calls at once).
- It is intuitive for developers to understand ("you get 100 tokens, refilled at 10/second").
- It is the industry standard (AWS, Stripe, and most API gateways use it).
- With Redis and Lua scripts, it scales well for distributed rate limiting.
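The token bucket itself is only a few lines of state. A single-process sketch (a distributed version would hold this state in Redis instead; the `now` parameter exists so refill can be tested deterministically):

```python
import time

class TokenBucket:
    """Single-process token bucket: bursts up to `capacity`,
    steady-state rate of `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)          # start full: bursts allowed
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Lazily refill based on elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```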
Q5: Explain OAuth 2.0 Authorization Code flow. Why not send the token directly to the client?
Model Answer:
The Authorization Code flow has these steps:
1. User clicks "Login with Google" on the client app.
2. Client redirects to Google's authorization endpoint with `client_id`, `redirect_uri`, `scope`, and `state`.
3. User authenticates with Google and consents to the requested scopes.
4. Google redirects back to the client's `redirect_uri` with an authorization code (not a token).
5. The client's backend server sends the authorization code plus the `client_secret` directly to Google's token endpoint (server-to-server).
6. Google returns an access token and refresh token to the backend.
Why not send the token directly? Because in step 4, the redirect happens through the user's browser, which means the URL is visible in browser history, server logs, and could be intercepted by browser extensions. An authorization code is:
- Short-lived (typically 60 seconds)
- Single-use (can only be exchanged once)
- Useless without the client_secret (which only the backend knows)
Even if an attacker captures the authorization code, they cannot exchange it without the client_secret. This server-to-server exchange is what makes the Authorization Code flow the most secure. For SPAs and mobile apps, PKCE (Proof Key for Code Exchange) adds a code_verifier/code_challenge mechanism to achieve similar security without a client_secret.
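The server-to-server exchange in step 5 is an ordinary form-encoded POST. A sketch of building the request body per the OAuth 2.0 spec (RFC 6749); the endpoint constant and parameter values below are illustrative:

```python
from urllib.parse import urlencode

# Illustrative; the real endpoint and credentials come from the provider's
# documentation and your app registration.
TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"

def build_token_request(code, client_id, client_secret, redirect_uri):
    """Body of the code-for-token exchange. This request is sent directly
    from the backend, so client_secret never passes through the browser."""
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
    })
```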
Q6: Design a high-level search architecture for a product catalog with 100 million items.
Model Answer:
Indexing pipeline:
- Product database (MySQL/PostgreSQL) emits changes via CDC (Change Data Capture) to Kafka.
- An indexer service consumes from Kafka, transforms the data (tokenize product name and description, extract brand/category facets), and writes to Elasticsearch.
- Full reindex runs weekly as a batch job; CDC handles incremental updates for near-real-time freshness.
Elasticsearch cluster:
- Index sharded by product category or hash of product ID (I would benchmark both).
- For 100M products, approximately 10-20 primary shards with 1 replica each.
- Hot-warm architecture: recent/popular products on SSD nodes, older products on cheaper storage.
Query path:
- Search API parses the user query, identifies facets (brand, price range, size), and constructs an Elasticsearch query.
- BM25 for text relevance, boosted by: sales velocity, user reviews, recency, and personalization signals.
- Autocomplete via a separate completion suggester index, returning within 50ms.
- Fuzzy matching with fuzziness "AUTO" for typo tolerance.
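The query-construction step can be sketched as a function that emits an Elasticsearch query body. Field names, boosts, and filters below are illustrative assumptions about the catalog schema:

```python
def build_search_query(text, brand=None, price_max=None):
    """Sketch of an Elasticsearch bool query: BM25 text relevance via
    multi_match (with typo tolerance), plus non-scoring facet filters."""
    filters = []
    if brand:
        filters.append({"term": {"brand": brand}})
    if price_max is not None:
        filters.append({"range": {"price": {"lte": price_max}}})
    return {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": text,
                        "fields": ["name^3", "description"],  # boost name matches
                        "fuzziness": "AUTO",                  # typo tolerance
                    }
                }],
                "filter": filters,  # filters don't affect relevance scoring
            }
        }
    }
```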
Scaling:
- Read throughput: add replicas (Elasticsearch can serve reads from any replica).
- Write throughput: add primary shards (but resharding is expensive -- plan capacity upfront).
- Caching: Redis cache for the top 1000 most common queries (cache hit rate typically 30-40% for search).
Q7: What is graceful degradation? Give three real-world examples.
Model Answer:
Graceful degradation means that when a component fails, the system continues to operate with reduced functionality rather than failing completely. The user experience degrades, but the core function remains available.
Example 1 -- Netflix: If the personalized recommendation engine is down, Netflix shows a generic "Trending" or "Popular" list. Users still browse and watch content; they just do not get personalized suggestions.
Example 2 -- Amazon product page: If the reviews service is unreachable, the product page still displays the product details, pricing, and "Add to Cart" button. The review section shows "Reviews temporarily unavailable." The core shopping flow is unaffected.
Example 3 -- Google Maps: If real-time traffic data is unavailable, Maps still provides directions using estimated travel times based on historical data. The experience is slightly less accurate but fully functional.
The key principle is to identify core vs non-core features. Core features (checkout, streaming, navigation) must have fallbacks. Non-core features (recommendations, reviews, real-time traffic) can be hidden or replaced with cached/default data.
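The pattern behind all three examples is the same: try the primary path, fall back to degraded data, and only then hide the feature. A minimal sketch (function names are illustrative):

```python
def with_fallback(primary, fallback, default):
    """Serve degraded data when a non-core dependency fails."""
    try:
        return primary()       # e.g. personalized recommendations
    except Exception:
        try:
            return fallback()  # e.g. cached or generic "Trending" list
        except Exception:
            return default     # last resort: hide the section entirely
```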
In system design interviews, I always identify which components can degrade gracefully and what the fallback behavior is. This shows the interviewer I think about production realities.
Q8: How would you implement distributed rate limiting across 30 servers?
Model Answer:
My primary approach would be centralized rate limiting with Redis:
Each server, before processing a request, makes a call to Redis to check and decrement the token bucket for that user. I would use a Lua script executed atomically in Redis to handle the token refill calculation and decrement in a single round trip. This ensures consistency -- the 30 servers see the same counter.
Design decisions:
- Algorithm: Token bucket (allows bursts, simple state).
- Key structure: `ratelimit:{user_id}:{endpoint}` for per-user-per-endpoint limits.
- TTL: Set a TTL on each key equal to the refill window, so keys for inactive users expire automatically.
Handling Redis failure:
- If Redis is unreachable, fall back to local in-memory rate limiting per server. With 30 servers, the effective limit becomes ~30x the per-server limit. This is acceptable for short outages -- slightly over-limit is better than either blocking all requests or having no limit at all.
Performance:
- Redis latency: ~1ms per check (sub-millisecond on local network).
- For ultra-high throughput, I could use local token buckets that sync with Redis periodically (e.g., every 100ms). This reduces Redis calls by 10x at the cost of slight inaccuracy.
Alternative considered: Sticky sessions route each user to a single server, allowing local-only rate limiting. But this creates uneven load distribution and breaks on server failure.
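The atomic check-and-refill that the Lua script would perform can be modeled in plain Python. Here `store` is a stand-in dict for Redis so the sketch stays self-contained; in production this whole function would run as one Lua script so all 30 servers share a consistent counter:

```python
import time

def take_token(store, key, capacity, refill_rate, now=None):
    """Token-bucket check as one atomic unit of work.
    `store` maps key -> (tokens, last_refill_timestamp)."""
    now = time.monotonic() if now is None else now
    tokens, last = store.get(key, (float(capacity), now))
    # Refill lazily based on elapsed time, capped at bucket capacity.
    tokens = min(capacity, tokens + (now - last) * refill_rate)
    allowed = tokens >= 1
    if allowed:
        tokens -= 1
    store[key] = (tokens, now)  # with Redis: also set TTL = refill window
    return allowed
```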
Q9: What is distributed tracing and how does it work?
Model Answer:
Distributed tracing tracks a single request as it flows through multiple microservices. It answers: "Where did this request spend its time?"
Core concepts:
- Trace: Represents the entire end-to-end request. Identified by a unique `trace_id`.
- Span: Represents one unit of work within a trace (e.g., one service call). Each span has a `span_id`, a `parent_span_id`, a start time, and a duration.
- Context propagation: When Service A calls Service B, it passes the `trace_id` and its `span_id` via HTTP headers (W3C `traceparent` header). Service B creates a new span with A's span as its parent.
How it works in practice:
- The API gateway generates a `trace_id` and creates the root span.
- Each downstream service extracts the trace context from incoming headers, creates a child span, and forwards the context to any services it calls.
- All spans are sent asynchronously to a trace collector (Jaeger, Zipkin).
- The collector assembles spans into a tree and displays a waterfall/timeline view.
Example: A request that takes 2 seconds end-to-end might show: API Gateway (200ms) -> Auth (50ms) -> Order Service (1800ms), where Order Service internally calls Payment (1200ms) and DB (800ms). The waterfall view immediately shows the bottleneck is the database query inside the payment flow.
OpenTelemetry provides vendor-neutral SDKs that auto-instrument HTTP clients, database drivers, and message queues, so you get tracing with minimal code changes.
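Context propagation can be sketched using the W3C `traceparent` format, which is `version-trace_id-span_id-flags` in lowercase hex. A minimal illustration (real tracers also handle sampling flags and malformed headers):

```python
import secrets

def new_traceparent():
    """Root span at the API gateway: mint a fresh trace_id and span_id."""
    trace_id = secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(incoming):
    """What a downstream service does: keep the trace_id, mint a new span_id.
    The incoming span_id becomes the new span's parent_span_id."""
    version, trace_id, parent_span_id, flags = incoming.split("-")
    child = f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
    return child, parent_span_id
```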
Q10: Compare RBAC and ABAC. When would you choose each?
Model Answer:
RBAC (Role-Based Access Control) assigns users to roles, and roles have permissions. Access decisions are simple lookups: "Does this user's role have the required permission?"
ABAC (Attribute-Based Access Control) evaluates policies against attributes of the user, resource, action, and environment. Access decisions are policy evaluations: "Is this user in the engineering department, accessing a non-classified document, during business hours, from a corporate IP?"
| Aspect | RBAC | ABAC |
|---|---|---|
| Decision model | User -> Role -> Permission | Policy(user attrs, resource attrs, context) |
| Granularity | Coarse | Fine-grained |
| Management | Role assignment (simple) | Policy authoring (complex) |
| Performance | Fast (table lookup) | Slower (rule evaluation) |
| Audit | Easy (who has which role) | Complex (which policies applied) |
When I choose RBAC: For the vast majority of applications -- SaaS platforms, content management systems, e-commerce admin panels. If my access control can be expressed as "admins can do X, editors can do Y, viewers can do Z," RBAC is sufficient and much simpler.
When I choose ABAC: For systems with complex, context-dependent access rules -- healthcare (doctor can access patient records only for patients in their ward, during their shift), financial systems (trader can execute trades only below their risk limit, for their assigned products), or government systems with classification levels.
In practice, many systems use RBAC as the foundation and add ABAC-like policies for specific edge cases. For example, GitHub uses RBAC (owner, admin, write, read) with ABAC-like rules for branch protection policies.
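The contrast between the two decision models can be made concrete in a few lines. Roles, policies, and attribute names below are illustrative:

```python
# RBAC: access is a constant-time lookup through the role table.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

def rbac_allows(role, permission):
    return permission in ROLE_PERMISSIONS.get(role, set())

def abac_allows(user, resource, context):
    """ABAC: a policy evaluated over attributes of user, resource, and
    environment (the ward/shift healthcare example from above)."""
    return (
        user["role"] == "doctor"
        and user["ward"] == resource["ward"]
        and context["on_shift"]
    )
```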
Q11: You are designing a global payment system. Walk through the reliability, fault tolerance, and observability requirements.
Model Answer:
Reliability requirements:
- Availability target: 99.99% (52 minutes downtime per year). Payments are revenue-critical.
- RPO = 0 (zero data loss -- every transaction must be durable).
- RTO < 60 seconds (fast failover to minimize impact).
- Active-active multi-region deployment (at least US + EU) for geographic redundancy and low latency.
Fault tolerance patterns:
- Circuit breaker on all external calls (bank APIs, fraud detection service). If a bank API is slow, trip the breaker and return "payment pending -- retry later" rather than timing out.
- Bulkhead isolation: Separate thread/connection pools for different payment processors. If Stripe is down, PayPal payments continue unaffected.
- Retry with idempotency: Payments must be idempotent. Each payment gets a unique `idempotency_key`. If a retry happens (network timeout after the payment succeeded), the system detects the duplicate and returns the original result.
- Timeouts at every boundary: Connection timeout 3s, request timeout 10s, total deadline 30s.
- Graceful degradation: If fraud scoring is down, allow payments below a threshold amount through while queuing high-value payments for manual review.
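The idempotency-key mechanism can be sketched as follows. Here `store` stands in for a durable table keyed by idempotency_key; a real implementation also records an "in progress" marker before charging to handle concurrent retries:

```python
def process_payment(store, idempotency_key, charge_fn):
    """Idempotent payment execution: a retry with the same key replays
    the original result instead of charging the customer twice."""
    if idempotency_key in store:
        return store[idempotency_key]  # duplicate retry: replay stored result
    result = charge_fn()               # the actual (one-time) charge
    store[idempotency_key] = result    # must be durable before responding
    return result
```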
Observability:
- Metrics: Payment success rate, latency p50/p95/p99, amount processed per minute, decline rate by reason code.
- Alerts: Success rate drops below 99.5% -> page on-call. Latency p99 exceeds 5 seconds -> ticket. Any single bank API error rate exceeds 20% -> alert with auto-circuit-breaker.
- Distributed tracing: Every payment request traced from API gateway through auth, fraud scoring, payment processor, and settlement. Trace ID included in all logs.
- Business-level monitoring: Daily reconciliation between our records and bank records. Alert on any discrepancy.
This design ensures that even during partial failures, money movement is reliable, no transactions are lost, and the team can quickly diagnose issues.
Review the Quick Revision sheet before your interview for a rapid refresher.