Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.b — Fault Tolerance
Introduction
Availability (9.10.a) tells you the goal. Fault tolerance tells you how to get there. A fault-tolerant system continues to operate -- possibly at reduced capacity -- when components fail. In distributed systems, failure is not a question of "if" but "when."
Guiding Principle: "Design for failure. Assume every component will fail, and make sure the system survives it."
1. Failure Modes in Distributed Systems
+------------------------------------------------------------------------+
| TYPES OF FAILURES |
| |
| +----------------+ +------------------+ +------------------------+ |
| | Crash Failure | | Omission Failure | | Byzantine Failure | |
| | Server stops | | Messages lost | | Server sends wrong | |
| | responding | | or dropped | | or malicious responses | |
| +----------------+ +------------------+ +------------------------+ |
| | | | |
| Easiest to Moderate Hardest to |
| detect & handle complexity handle |
+------------------------------------------------------------------------+
| Failure Type | Example | Detection | Handling |
|---|---|---|---|
| Crash | Server runs out of memory | Heartbeat / health check | Failover to replica |
| Omission | Network packet dropped | Timeout + retry | Retry with idempotency |
| Timing | Response too slow | Deadline / timeout | Timeout + fallback |
| Byzantine | Corrupted data, malicious node | Consensus algorithms | BFT protocols (rare in practice) |
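The "Heartbeat / health check" detection row can be sketched as a simple heartbeat monitor. This is an illustrative sketch, not a production detector; the node names and the 3-missed-beats threshold are made up for the example:

```python
import time

class HeartbeatMonitor:
    """Marks a node as failed once it misses too many heartbeat intervals."""

    def __init__(self, interval=5.0, missed_threshold=3):
        self.interval = interval                  # expected seconds between beats
        self.missed_threshold = missed_threshold  # beats missed before declaring failure
        self.last_beat = {}                       # node -> timestamp of last heartbeat

    def record_heartbeat(self, node, now=None):
        self.last_beat[node] = now if now is not None else time.time()

    def failed_nodes(self, now=None):
        now = now if now is not None else time.time()
        deadline = self.interval * self.missed_threshold
        return [n for n, t in self.last_beat.items() if now - t > deadline]

# Usage: app-2 stops sending heartbeats and is flagged after 3 missed intervals.
monitor = HeartbeatMonitor(interval=5.0, missed_threshold=3)
monitor.record_heartbeat("app-1", now=0.0)
monitor.record_heartbeat("app-2", now=0.0)
monitor.record_heartbeat("app-1", now=20.0)
print(monitor.failed_nodes(now=20.0))  # → ['app-2']
```

Note that a heartbeat monitor can only detect crash and timing failures; a Byzantine node can keep sending heartbeats while returning wrong data.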
2. Graceful Degradation
Instead of complete failure, reduce functionality progressively:
FULL SERVICE DEGRADED MINIMAL
+------------------+ +------------------+ +------------------+
| - Search works | | - Search works | | - Cached results |
| - Recommendations| | - No recs (slow) | | only |
| - Reviews load | -> | - Reviews cached | -> | - Static pages |
| - Real-time chat | | - Chat disabled | | - Maintenance |
| available | | | | banner |
+------------------+ +------------------+ +------------------+
100% capacity ~70% capacity ~30% capacity
Real-world examples:
- Netflix: If the recommendation engine is down, show a generic "Popular" list instead of personalized picks.
- Twitter: If the timeline service is overloaded, show a cached version from 30 seconds ago.
- Amazon: If the reviews service is down, still show the product page without reviews.
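The Netflix example above can be sketched as a fallback wrapper. `fetch_personalized` and `fetch_popular` are hypothetical stand-ins for the real recommendation and popularity backends:

```python
def get_recommendations(user_id, fetch_personalized, fetch_popular):
    """Return personalized picks, degrading to a generic popular list on failure."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Recommendation engine is down: degrade rather than fail the whole page.
        return fetch_popular()

# Usage: the personalized backend raises, so the generic list is served instead.
def broken_backend(user_id):
    raise ConnectionError("recommendation engine unavailable")

print(get_recommendations(42, broken_backend, lambda: ["Popular #1", "Popular #2"]))
# → ['Popular #1', 'Popular #2']
```

The key design choice is that the fallback is cheaper and more cacheable than the primary path, so it still works precisely when the system is under stress.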
3. Circuit Breaker Pattern
Prevents a failing service from being hammered with requests, giving it time to recover.
+-------------------------------------------+
| CIRCUIT BREAKER STATES |
+-------------------------------------------+
| |
| +----------+ |
| | CLOSED | (Normal -- requests pass) |
| +----+-----+ |
| | |
| Failure threshold |
| exceeded (e.g., 5 |
| failures in 10s) |
| | |
| v |
| +----------+ |
| | OPEN | (Failing -- requests |
| +----+-----+ rejected immediately) |
| | |
| After timeout |
| (e.g., 30 seconds) |
| | |
| v |
| +-----------+ |
| | HALF-OPEN | (Testing -- let ONE |
| +-----+-----+ request through) |
| | |
| +----+----+ |
| | | |
| Success Failure |
| | | |
| v v |
| CLOSED OPEN |
| |
+-------------------------------------------+
Python sketch (`forward` and `fallback_response` are placeholders for the real downstream call and fallback; the original pseudocode's bugs are fixed -- a success now resets the failure count, and a failed HALF_OPEN trial explicitly re-opens the breaker):

import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=30):
        self.state = CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout                  # seconds to stay OPEN
        self.last_failure_time = None

    def call(self, request):
        if self.state == OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = HALF_OPEN          # let one trial request through
            else:
                return fallback_response()      # fast fail
        try:
            response = forward(request)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial, or too many failures, trips (or re-trips) the breaker
            if self.state == HALF_OPEN or self.failure_count >= self.failure_threshold:
                self.state = OPEN
            return fallback_response()
        if self.state == HALF_OPEN:
            self.state = CLOSED                 # recovery confirmed
        self.failure_count = 0                  # any success resets the count
        return response
When to use: Any call to an external or downstream service (database, API, third-party).
4. Bulkhead Pattern
Isolates failures so one failing component does not take down the entire system -- inspired by ship bulkheads that prevent flooding from spreading.
WITHOUT BULKHEADS: WITH BULKHEADS:
+------------------------+ +----------+ +----------+
| Shared Thread Pool | | Pool A | | Pool B |
| (100 threads) | | (50 thr) | | (50 thr) |
| | | Service | | Service |
| Service A [overload] | | A [over- | | B [OK] |
| (uses all 100) | | loaded] | | |
| Service B [starved!] | | (stuck) | | (fine!) |
+------------------------+ +----------+ +----------+
Result: Both services Result: Only A affected,
fail together B continues normally
Implementation approaches:
| Approach | Mechanism | Example |
|---|---|---|
| Thread pool isolation | Separate thread pool per dependency | Hystrix-style per-service pools |
| Process isolation | Separate process / container | Microservices in separate pods |
| Connection pool isolation | Dedicated DB connection pool per service | Service A: 20 conns, Service B: 20 conns |
| Priority queues | Separate queues for critical vs non-critical | Payment requests get dedicated queue |
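The thread-pool-isolation row can be sketched with one executor per downstream dependency. The pool names and sizes mirror the diagram above and are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream dependency: Service A exhausting its 50 threads
# cannot starve Service B of threads (thread-pool isolation).
POOLS = {
    "service_a": ThreadPoolExecutor(max_workers=50, thread_name_prefix="svc-a"),
    "service_b": ThreadPoolExecutor(max_workers=50, thread_name_prefix="svc-b"),
}

def call_dependency(name, fn, *args):
    """Run a dependency call on that dependency's dedicated pool."""
    return POOLS[name].submit(fn, *args)

# Usage: even if every service_a call is stuck, service_b calls still get threads.
future = call_dependency("service_b", lambda x: x * 2, 21)
print(future.result())  # → 42
```

Hystrix popularized this approach; its successors (e.g., Resilience4j) often use semaphores instead of dedicated pools, trading thread overhead for weaker isolation.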
5. Retry with Exponential Backoff
Request 1: FAIL
|
+-- Wait 1 second
|
Request 2: FAIL
|
+-- Wait 2 seconds
|
Request 3: FAIL
|
+-- Wait 4 seconds (+ random jitter 0-1s)
|
Request 4: FAIL
|
+-- Wait 8 seconds (+ random jitter 0-1s)
|
Request 5: SUCCESS (or give up after max retries)
Formula:
wait_time = min(base_delay * 2^attempt + random_jitter, max_delay)
Why jitter matters:
WITHOUT JITTER: WITH JITTER:
1000 clients all retry 1000 clients retry at
at exactly 1s, 2s, 4s 1.0-1.9s, 2.0-2.9s, 4.0-4.9s
| |
Thundering herd! Spread out over time
(spike repeats) (load distributed)
Best practices:
| Practice | Why |
|---|---|
| Always add jitter | Prevents thundering herd |
| Set a max retry count | Avoid infinite loops (usually 3-5) |
| Set a max delay cap | Don't wait forever (usually 30-60s) |
| Only retry idempotent ops | Retrying a non-idempotent operation can cause duplicates |
| Use a retry budget | Limit total retries across all clients (e.g., 10% of requests) |
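The formula and the practices above combine into a short retry helper. This is a sketch assuming the operation is idempotent; the default delays match the diagram (1s, 2s, 4s, ...):

```python
import random
import time

def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, jitter=1.0):
    """wait_time = min(base_delay * 2^attempt + random_jitter, max_delay)"""
    return min(base_delay * 2 ** attempt + random.uniform(0, jitter), max_delay)

def retry(operation, max_retries=5, base_delay=1.0, jitter=1.0):
    """Retry an idempotent operation with capped exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retry budget exhausted; surface the error
            time.sleep(backoff_delay(attempt, base_delay=base_delay, jitter=jitter))

# Usage: an operation that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

print(retry(flaky))  # → ok  (after waiting ~1s, then ~2s)
```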
6. Timeout Strategies
+------------------------------------------------------------------------+
| TIMEOUT LAYERS |
| |
| Client |
| |-- Connection timeout (e.g., 3s) |
| | "How long to wait for TCP handshake?" |
| | |
| |-- Request timeout (e.g., 10s) |
| | "How long to wait for a response?" |
| | |
| |-- Total timeout / deadline (e.g., 30s) |
| "Max time for the entire operation including retries" |
| |
| Service-to-Service |
| |-- Propagate deadline from upstream |
| | "If caller's deadline is 10s and 6s passed, |
| | downstream has only 4s" |
| |
| Database |
| |-- Query timeout (e.g., 5s) |
| |-- Lock wait timeout |
+------------------------------------------------------------------------+
Common mistakes:
- Setting timeouts too high (cascading slowness)
- Setting timeouts too low (false failures under normal load)
- Not propagating deadlines across services
- Forgetting to timeout at every network boundary
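The deadline-propagation rule above can be sketched by fixing an absolute deadline at the edge and passing the remaining budget to each hop. The helper names are hypothetical; real systems carry the deadline in RPC metadata (e.g., gRPC's per-call deadline):

```python
import time

class DeadlineExceeded(Exception):
    pass

def remaining_budget(deadline):
    """Seconds left before the caller's deadline; raise if already past it."""
    remaining = deadline - time.time()
    if remaining <= 0:
        raise DeadlineExceeded("no time budget left for downstream call")
    return remaining

def handle_request(downstream_call, total_timeout=10.0):
    # The client's total deadline is fixed once at the edge...
    deadline = time.time() + total_timeout
    # ...and propagated, so each hop only waits for the time that is left.
    return downstream_call(timeout=remaining_budget(deadline))

# Usage: if 6s of a 10s deadline were already spent, downstream gets ~4s.
deadline = time.time() + 4.0  # 6s of the 10s budget already consumed
print(round(remaining_budget(deadline)))  # → 4
```

This avoids the classic mistake of giving a downstream call a fresh 10s timeout when the caller itself will give up in 4s -- work that continues past the caller's deadline is wasted capacity.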
7. Identifying and Eliminating Single Points of Failure (SPOFs)
BEFORE (SPOFs everywhere): AFTER (Redundancy at every layer):
[Client] [Client]
| |
[1 LB] <-- SPOF [LB-1] + [LB-2] (DNS failover)
| |
[1 App] <-- SPOF [App-1] + [App-2] + [App-3]
| |
[1 DB] <-- SPOF [DB Primary] + [DB Replica]
| |
[1 Disk] <-- SPOF [RAID] + [Backups] + [Cross-region]
SPOF checklist for interviews:
| Layer | SPOF Risk | Redundancy Solution |
|---|---|---|
| DNS | Single DNS provider | Multi-provider DNS (Route53 + Cloudflare) |
| Load Balancer | Single LB | LB pair with floating IP / cloud LB |
| Application | Single instance | Multiple instances + auto-scaling |
| Database | Single primary | Primary-replica + automatic failover |
| Storage | Single disk / volume | RAID, replication, cross-region backup |
| Network | Single ISP / path | Dual ISP, multi-AZ, multi-region |
| Data Center | Single DC | Multi-AZ at minimum, multi-region ideally |
8. Chaos Engineering
Deliberately inject failures to verify your fault tolerance actually works.
+------------------------------------------------------------------------+
| CHAOS ENGINEERING WORKFLOW |
| |
| 1. DEFINE steady state |
| "Normal: p99 latency < 200ms, error rate < 0.1%" |
| |
| 2. HYPOTHESIZE |
| "If we kill 1 of 3 app servers, the system stays within |
| steady state because the LB reroutes traffic" |
| |
| 3. INJECT failure |
| - Kill a server / container |
| - Add network latency (tc netem) |
| - Fill disk space |
| - Corrupt DNS responses |
| |
| 4. OBSERVE |
| "Did the system stay within steady state?" |
| |
| 5. FIX weaknesses found |
| "The LB took 45 seconds to detect failure -- reduce |
| health check interval from 30s to 5s" |
+------------------------------------------------------------------------+
Notable tools:
- Chaos Monkey (Netflix) -- randomly kills instances in production
- Gremlin -- enterprise chaos engineering platform
- Litmus -- chaos engineering for Kubernetes
- Toxiproxy -- simulate network conditions (latency, timeouts, partitions)
9. Fault Tolerance at Every Layer (Summary Diagram)
+------------------------------------------------------------------------+
| LAYER PATTERN PURPOSE |
+------------------------------------------------------------------------+
| |
| Client Retry + Backoff Recover from transient |
| Timeout failures |
| Circuit Breaker Prevent cascading failure |
| |
| Load Balancer Health Checks Route away from failures |
| Redundant LBs No LB SPOF |
| |
| Application Bulkhead Isolate failure domains |
| Graceful Degradation Partial functionality > |
| Fallback Responses total failure |
| |
| Database Replication Survive node loss |
| Automatic Failover Minimize downtime |
| Read Replicas Spread read load |
| |
| Infrastructure Multi-AZ / Region Survive AZ/region failure |
| Chaos Engineering Verify resilience |
| Auto-scaling Handle load spikes |
| |
+------------------------------------------------------------------------+
Key Takeaways
- Assume failure is inevitable. Design every component with a plan for when it fails.
- Circuit breakers prevent cascading failure. They stop a failing downstream from bringing down upstream services.
- Bulkheads isolate blast radius. One failing dependency should not consume resources for everything else.
- Always use exponential backoff with jitter. Never retry in a tight loop.
- Set timeouts at every network boundary. A missing timeout is a hidden SPOF.
- Eliminate SPOFs at every layer -- DNS, LB, app, database, storage, network.
- Chaos engineering validates your assumptions. If you have not tested a failure, you do not know if you can survive it.
- Graceful degradation > complete failure. Partial functionality preserves user trust.
Explain-It Challenge
Scenario: You have three microservices: Orders, Payments, and Inventory. The Payments service starts responding with 500 errors and 10-second latency.
Walk through what happens step by step:
- Without any fault tolerance patterns
- With circuit breaker, bulkhead, timeout, and graceful degradation in place
Hint: Show the cascading failure in the first case, and the isolated failure in the second.
Next -> 9.10.c — Observability