Episode 9 — System Design / 9.10 — Advanced Distributed Systems

9.10.b — Fault Tolerance

Introduction

Availability (9.10.a) tells you the goal. Fault tolerance tells you how to get there. A fault-tolerant system continues to operate -- possibly at reduced capacity -- when components fail. In distributed systems, failure is not a question of "if" but "when."

Guiding Principle: "Design for failure. Assume every component will fail, and make sure the system survives it."


1. Failure Modes in Distributed Systems

+------------------------------------------------------------------------+
|                   TYPES OF FAILURES                                     |
|                                                                        |
|  +----------------+  +------------------+  +------------------------+  |
|  | Crash Failure  |  | Omission Failure |  | Byzantine Failure      |  |
|  | Server stops   |  | Messages lost    |  | Server sends wrong     |  |
|  | responding     |  | or dropped       |  | or malicious responses |  |
|  +----------------+  +------------------+  +------------------------+  |
|        |                     |                       |                  |
|   Easiest to              Moderate                 Hardest to           |
|   detect & handle         complexity               handle              |
+------------------------------------------------------------------------+
  Failure Type | Example                        | Detection                | Handling
  -------------+--------------------------------+--------------------------+---------------------------------
  Crash        | Server runs out of memory      | Heartbeat / health check | Failover to replica
  Omission     | Network packet dropped         | Timeout + retry          | Retry with idempotency
  Timing       | Response too slow              | Deadline / timeout       | Timeout + fallback
  Byzantine    | Corrupted data, malicious node | Consensus algorithms     | BFT protocols (rare in practice)
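Crash failures are the easiest case because the heartbeat detection from the table is simple to build. A minimal sketch (the `HeartbeatMonitor` class and its timeout value are illustrative, not a real library): a monitor declares a node dead once no heartbeat has arrived within the timeout window.

```python
import time

class HeartbeatMonitor:
    """Marks a node as failed if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}          # node id -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.monotonic()

    def alive_nodes(self, now=None):
        now = now if now is not None else time.monotonic()
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

# Simulated clock: node "b" stops heartbeating and is declared dead.
mon = HeartbeatMonitor(timeout=3.0)
mon.heartbeat("a", now=0.0)
mon.heartbeat("b", now=0.0)
mon.heartbeat("a", now=5.0)          # only "a" keeps sending
print(mon.alive_nodes(now=5.0))      # {'a'}
```

Note the caveat this simple detector inherits from the table: a timeout cannot distinguish a crashed node from a slow network (an omission or timing failure), which is why the remaining failure types need their own handling.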

2. Graceful Degradation

Instead of complete failure, reduce functionality progressively:

  FULL SERVICE                  DEGRADED                    MINIMAL
  +------------------+    +------------------+    +------------------+
  | - Search works   |    | - Search works   |    | - Cached results |
  | - Recommendations|    | - No recs (slow) |    |   only           |
  | - Reviews load   | -> | - Reviews cached | -> | - Static pages   |
  | - Real-time chat |    | - Chat disabled  |    | - Maintenance    |
  |   available      |    |                  |    |   banner         |
  +------------------+    +------------------+    +------------------+
       100% capacity          ~70% capacity          ~30% capacity

Real-world examples:

  • Netflix: If the recommendation engine is down, show a generic "Popular" list instead of personalized picks.
  • Twitter: If the timeline service is overloaded, show a cached version from 30 seconds ago.
  • Amazon: If reviews service is down, still show the product page without reviews.
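All three examples share the same shape: wrap the risky call and substitute a cheaper, pre-computed answer on failure. A hedged sketch of the Netflix case (function and variable names are hypothetical):

```python
def get_recommendations(user_id, fetch_personalized, popular_fallback):
    """Try personalized recs; degrade to a generic 'Popular' list on any failure."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        return popular_fallback       # degraded, but the page still renders

def broken_engine(user_id):
    raise TimeoutError("recommendation engine down")

print(get_recommendations(42, broken_engine, ["Popular: Top 10"]))
# ['Popular: Top 10']
```

The key design choice is that the fallback must be cheap and already available (a static list, a cache entry), so the degraded path cannot itself fail under the same load.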

3. Circuit Breaker Pattern

Prevents a failing service from being hammered with requests, giving it time to recover.

                    +-------------------------------------------+
                    |         CIRCUIT BREAKER STATES             |
                    +-------------------------------------------+
                    |                                           |
                    |   +----------+                            |
                    |   |  CLOSED  |  (Normal -- requests pass) |
                    |   +----+-----+                            |
                    |        |                                  |
                    |   Failure threshold                       |
                    |   exceeded (e.g., 5                       |
                    |   failures in 10s)                        |
                    |        |                                  |
                    |        v                                  |
                    |   +----------+                            |
                    |   |   OPEN   |  (Failing -- requests      |
                    |   +----+-----+   rejected immediately)    |
                    |        |                                  |
                    |   After timeout                           |
                    |   (e.g., 30 seconds)                      |
                    |        |                                  |
                    |        v                                  |
                    |   +-----------+                           |
                    |   | HALF-OPEN |  (Testing -- let ONE      |
                    |   +-----+-----+   request through)        |
                    |         |                                 |
                    |    +----+----+                            |
                    |    |         |                            |
                    | Success   Failure                        |
                    |    |         |                            |
                    |    v         v                            |
                    | CLOSED     OPEN                           |
                    |                                           |
                    +-------------------------------------------+

A runnable Python sketch (simplified: HALF-OPEN here lets every request through, rather than gating a single probe):

import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    def __init__(self, forward, fallback, failure_threshold=5, timeout=30.0):
        self.forward = forward                 # call to the downstream service
        self.fallback = fallback               # degraded response when tripped
        self.failure_threshold = failure_threshold
        self.timeout = timeout                 # seconds to stay OPEN
        self.state = CLOSED
        self.failure_count = 0
        self.last_failure_time = None

    def call(self, request):
        if self.state == OPEN:
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = HALF_OPEN         # probe the downstream again
            else:
                return self.fallback()         # fast fail

        try:
            response = self.forward(request)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == HALF_OPEN or self.failure_count >= self.failure_threshold:
                self.state = OPEN              # trip (or re-trip) the breaker
            return self.fallback()

        if self.state == HALF_OPEN:
            self.state = CLOSED                # recovery confirmed
        self.failure_count = 0
        return response

When to use: Any call to an external or downstream service (database, API, third-party).


4. Bulkhead Pattern

Isolates failures so one failing component does not take down the entire system -- inspired by ship bulkheads that prevent flooding from spreading.

  WITHOUT BULKHEADS:                    WITH BULKHEADS:
  +------------------------+           +----------+  +----------+
  |  Shared Thread Pool    |           | Pool A   |  | Pool B   |
  |  (100 threads)         |           | (50 thr) |  | (50 thr) |
  |                        |           | Service  |  | Service  |
  |  Service A [overload]  |           | A [over- |  | B [OK]   |
  |  (uses all 100)        |           |  loaded] |  |          |
  |  Service B [starved!]  |           | (stuck)  |  | (fine!)  |
  +------------------------+           +----------+  +----------+
  Result: Both services                Result: Only A affected,
  fail together                        B continues normally

Implementation approaches:

  Approach                  | Mechanism                                     | Example
  --------------------------+-----------------------------------------------+------------------------------------------
  Thread pool isolation     | Separate thread pool per dependency           | Hystrix-style per-service pools
  Process isolation         | Separate process / container                  | Microservices in separate pods
  Connection pool isolation | Dedicated DB connection pool per service      | Service A: 20 conns, Service B: 20 conns
  Priority queues           | Separate queues for critical vs non-critical  | Payment requests get dedicated queue
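Thread pool isolation, the first row above, can be sketched with one bounded executor per dependency (pool names and sizes are illustrative): even when every call to Service A is stuck, Service B's pool has free workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream dependency: A's saturation cannot starve B.
pools = {
    "service_a": ThreadPoolExecutor(max_workers=2),
    "service_b": ThreadPoolExecutor(max_workers=2),
}

def call(service, fn, *args):
    """Submit work on the bulkhead belonging to `service`."""
    return pools[service].submit(fn, *args)

def hang(x):
    time.sleep(0.2)                  # pretend Service A is stuck
    return x

futures_a = [call("service_a", hang, i) for i in range(4)]   # A saturated
fast = call("service_b", lambda: "ok")                       # B unaffected
print(fast.result(timeout=1))                                # ok
```

With a single shared pool, the four hanging tasks would occupy workers that the Service B call also needs; with bulkheads, only Service A callers queue up.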

5. Retry with Exponential Backoff

  Request 1: FAIL
      |
      +-- Wait 1 second
      |
  Request 2: FAIL
      |
      +-- Wait 2 seconds
      |
  Request 3: FAIL
      |
      +-- Wait 4 seconds (+ random jitter 0-1s)
      |
  Request 4: FAIL
      |
      +-- Wait 8 seconds (+ random jitter 0-1s)
      |
  Request 5: SUCCESS (or give up after max retries)

Formula:

  wait_time = min(base_delay * 2^attempt + random_jitter, max_delay)
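The formula translates directly into code. A sketch (the constants and the `sleep` injection for testing are illustrative choices, not a fixed recipe):

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, jitter=1.0):
    """wait_time = min(base_delay * 2^attempt + random_jitter, max_delay)."""
    return min(base_delay * 2 ** attempt + random.uniform(0, jitter), max_delay)

def retry(operation, max_retries=5, sleep=None):
    """Retry `operation` with exponential backoff; re-raise after max_retries."""
    import time
    sleep = sleep if sleep is not None else time.sleep
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise                          # give up after max retries
            sleep(backoff_delay(attempt))

# Fails twice, then succeeds; delays are captured instead of slept for the demo.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "done"

delays = []
print(retry(flaky, sleep=delays.append))       # done
print(len(delays))                             # 2 waits: ~1s then ~2s
```

Injecting `sleep` keeps the demo fast and testable; in production the default `time.sleep` blocks for the computed delay.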

Why jitter matters:

  WITHOUT JITTER:                  WITH JITTER:
  1000 clients all retry           1000 clients retry at
  at exactly 1s, 2s, 4s           1.0-1.9s, 2.0-2.9s, 4.0-4.9s
        |                                |
  Thundering herd!                 Spread out over time
  (spike repeats)                  (load distributed)

Best practices:

  Practice                   | Why
  ---------------------------+----------------------------------------------------------------
  Always add jitter          | Prevents thundering herd
  Set a max retry count      | Avoid infinite loops (usually 3-5)
  Set a max delay cap        | Don't wait forever (usually 30-60s)
  Only retry idempotent ops  | Retrying a non-idempotent operation can cause duplicates
  Use a retry budget         | Limit total retries across all clients (e.g., 10% of requests)

6. Timeout Strategies

+------------------------------------------------------------------------+
|                       TIMEOUT LAYERS                                    |
|                                                                        |
|  Client                                                                |
|    |-- Connection timeout (e.g., 3s)                                   |
|    |     "How long to wait for TCP handshake?"                         |
|    |                                                                   |
|    |-- Request timeout (e.g., 10s)                                     |
|    |     "How long to wait for a response?"                            |
|    |                                                                   |
|    |-- Total timeout / deadline (e.g., 30s)                            |
|          "Max time for the entire operation including retries"          |
|                                                                        |
|  Service-to-Service                                                    |
|    |-- Propagate deadline from upstream                                |
|    |     "If caller's deadline is 10s and 6s passed,                   |
|    |      downstream has only 4s"                                      |
|                                                                        |
|  Database                                                              |
|    |-- Query timeout (e.g., 5s)                                        |
|    |-- Lock wait timeout                                               |
+------------------------------------------------------------------------+

Common mistakes:

  • Setting timeouts too high (cascading slowness)
  • Setting timeouts too low (false failures under normal load)
  • Not propagating deadlines across services
  • Forgetting to timeout at every network boundary
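Deadline propagation, the mistake most often forgotten, can be sketched as a small object handed down the call chain; each hop budgets its timeout from what remains (the `Deadline` class is illustrative, similar in spirit to gRPC deadlines):

```python
import time

class Deadline:
    """Absolute deadline shared across a call chain."""

    def __init__(self, total_seconds, now=None):
        start = now if now is not None else time.monotonic()
        self.expires_at = start + total_seconds

    def remaining(self, now=None):
        now = now if now is not None else time.monotonic()
        return max(0.0, self.expires_at - now)

# Caller starts with a 10s budget; 6s later the downstream call gets only 4s.
d = Deadline(10.0, now=0.0)
print(d.remaining(now=6.0))    # 4.0 -- use this as the downstream timeout
print(d.remaining(now=12.0))   # 0.0 -- fail fast instead of calling at all
```

A downstream service that receives 0 remaining should reject immediately rather than do work whose result the caller has already given up on.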

7. Identifying and Eliminating Single Points of Failure (SPOFs)

  BEFORE (SPOFs everywhere):          AFTER (Redundancy at every layer):

  [Client]                            [Client]
     |                                   |
  [1 LB] <-- SPOF                    [LB-1] + [LB-2]  (DNS failover)
     |                                   |
  [1 App] <-- SPOF                    [App-1] + [App-2] + [App-3]
     |                                   |
  [1 DB]  <-- SPOF                    [DB Primary] + [DB Replica]
     |                                   |
  [1 Disk] <-- SPOF                   [RAID] + [Backups] + [Cross-region]

SPOF checklist for interviews:

  Layer         | SPOF Risk             | Redundancy Solution
  --------------+-----------------------+-------------------------------------------
  DNS           | Single DNS provider   | Multi-provider DNS (Route53 + Cloudflare)
  Load Balancer | Single LB             | LB pair with floating IP / cloud LB
  Application   | Single instance       | Multiple instances + auto-scaling
  Database      | Single primary        | Primary-replica + automatic failover
  Storage       | Single disk / volume  | RAID, replication, cross-region backup
  Network       | Single ISP / path     | Dual ISP, multi-AZ, multi-region
  Data Center   | Single DC             | Multi-AZ at minimum, multi-region ideally

8. Chaos Engineering

Deliberately inject failures to verify your fault tolerance actually works.

+------------------------------------------------------------------------+
|                    CHAOS ENGINEERING WORKFLOW                            |
|                                                                        |
|  1. DEFINE steady state                                                |
|     "Normal: p99 latency < 200ms, error rate < 0.1%"                  |
|                                                                        |
|  2. HYPOTHESIZE                                                        |
|     "If we kill 1 of 3 app servers, the system stays within            |
|      steady state because the LB reroutes traffic"                     |
|                                                                        |
|  3. INJECT failure                                                     |
|     - Kill a server / container                                        |
|     - Add network latency (tc netem)                                   |
|     - Fill disk space                                                  |
|     - Corrupt DNS responses                                            |
|                                                                        |
|  4. OBSERVE                                                            |
|     "Did the system stay within steady state?"                         |
|                                                                        |
|  5. FIX weaknesses found                                               |
|     "The LB took 45 seconds to detect failure -- reduce                |
|      health check interval from 30s to 5s"                             |
+------------------------------------------------------------------------+
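Steps 1-4 can be rehearsed in a pure simulation before touching production. A toy sketch (all classes and names here are made up for illustration): three simulated app servers behind a health-checking "load balancer"; the experiment kills one and checks that the steady state, a 100% success rate, still holds.

```python
class Server:
    def __init__(self, name):
        self.name, self.alive = name, True

    def handle(self, req):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        return "ok"

def load_balance(servers, req):
    """Route to the first healthy server (a real LB would round-robin)."""
    for s in servers:
        if s.alive:                          # health check
            return s.handle(req)
    raise RuntimeError("no healthy servers")

servers = [Server("app-1"), Server("app-2"), Server("app-3")]

# 1-2. Steady state + hypothesis: killing one server keeps success at 100%.
# 3.   Inject failure.
servers[0].alive = False
# 4.   Observe.
results = [load_balance(servers, i) for i in range(100)]
print(all(r == "ok" for r in results))       # True -- hypothesis holds
```

The real-world versions of step 3 (killing instances, `tc netem` latency, disk fills) are what the tools below automate; the value of the simulation is forcing you to state the steady-state check explicitly.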

Notable tools:

  • Chaos Monkey (Netflix) -- randomly kills instances in production
  • Gremlin -- enterprise chaos engineering platform
  • Litmus -- chaos engineering for Kubernetes
  • Toxiproxy -- simulate network conditions (latency, timeouts, partitions)

9. Fault Tolerance at Every Layer (Summary Diagram)

+------------------------------------------------------------------------+
|  LAYER             PATTERN                  PURPOSE                     |
+------------------------------------------------------------------------+
|                                                                        |
|  Client            Retry + Backoff          Recover from transient     |
|                    Timeout                  failures                    |
|                    Circuit Breaker          Prevent cascading failure   |
|                                                                        |
|  Load Balancer     Health Checks            Route away from failures   |
|                    Redundant LBs            No LB SPOF                 |
|                                                                        |
|  Application       Bulkhead                 Isolate failure domains    |
|                    Graceful Degradation     Partial functionality >     |
|                    Fallback Responses        total failure              |
|                                                                        |
|  Database          Replication              Survive node loss          |
|                    Automatic Failover       Minimize downtime          |
|                    Read Replicas            Spread read load           |
|                                                                        |
|  Infrastructure    Multi-AZ / Region        Survive AZ/region failure  |
|                    Chaos Engineering        Verify resilience          |
|                    Auto-scaling             Handle load spikes         |
|                                                                        |
+------------------------------------------------------------------------+

Key Takeaways

  1. Assume failure is inevitable. Design every component with a plan for when it fails.
  2. Circuit breakers prevent cascading failure. They stop a failing downstream from bringing down upstream services.
  3. Bulkheads isolate blast radius. One failing dependency should not consume resources for everything else.
  4. Always use exponential backoff with jitter. Never retry in a tight loop.
  5. Set timeouts at every network boundary. A missing timeout is a hidden SPOF.
  6. Eliminate SPOFs at every layer -- DNS, LB, app, database, storage, network.
  7. Chaos engineering validates your assumptions. If you have not tested a failure, you do not know if you can survive it.
  8. Graceful degradation > complete failure. Partial functionality preserves user trust.

Explain-It Challenge

Scenario: You have three microservices: Orders, Payments, and Inventory. The Payments service starts responding with 500 errors and 10-second latency.

Walk through what happens step by step:

  • Without any fault tolerance patterns
  • With circuit breaker, bulkhead, timeout, and graceful degradation in place

Hint: Show the cascading failure in the first case, and the isolated failure in the second.


Next -> 9.10.c — Observability