Episode 9 — System Design / 9.10 — Advanced Distributed Systems

9.10.b — Fault Tolerance

Introduction

Availability (9.10.a) tells you the goal. Fault tolerance tells you how to get there. A fault-tolerant system continues to operate -- possibly at reduced capacity -- when components fail. In distributed systems, failure is not a question of "if" but "when."

Guiding Principle: "Design for failure. Assume every component will fail, and make sure the system survives it."


1. Failure Modes in Distributed Systems

+------------------------------------------------------------------------+
|                   TYPES OF FAILURES                                     |
|                                                                        |
|  +----------------+  +------------------+  +------------------------+  |
|  | Crash Failure  |  | Omission Failure |  | Byzantine Failure      |  |
|  | Server stops   |  | Messages lost    |  | Server sends wrong     |  |
|  | responding     |  | or dropped       |  | or malicious responses |  |
|  +----------------+  +------------------+  +------------------------+  |
|        |                     |                       |                  |
|   Easiest to              Moderate                 Hardest to           |
|   detect & handle         complexity               handle              |
+------------------------------------------------------------------------+
  Failure Type | Example                        | Detection                | Handling
  -------------+--------------------------------+--------------------------+---------------------------------
  Crash        | Server runs out of memory      | Heartbeat / health check | Failover to replica
  Omission     | Network packet dropped         | Timeout + retry          | Retry with idempotency
  Timing       | Response too slow              | Deadline / timeout       | Timeout + fallback
  Byzantine    | Corrupted data, malicious node | Consensus algorithms     | BFT protocols (rare in practice)
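Crash failures are the easiest case because the heartbeat detection from the table is simple to build. A minimal sketch (the `HeartbeatMonitor` class and its timeout value are illustrative, not a real library): a monitor declares a node dead once no heartbeat has arrived within the timeout window.

```python
import time

class HeartbeatMonitor:
    """Marks a node as failed if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}          # node id -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.monotonic()

    def alive_nodes(self, now=None):
        now = now if now is not None else time.monotonic()
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

# Simulated clock: node "b" stops heartbeating and is declared dead.
mon = HeartbeatMonitor(timeout=3.0)
mon.heartbeat("a", now=0.0)
mon.heartbeat("b", now=0.0)
mon.heartbeat("a", now=5.0)          # only "a" keeps sending
print(mon.alive_nodes(now=5.0))      # {'a'}
```

Note the caveat this simple detector inherits from the table: a timeout cannot distinguish a crashed node from a slow network (an omission or timing failure), which is why the remaining failure types need their own handling.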

2. Graceful Degradation

Instead of complete failure, reduce functionality progressively:

  FULL SERVICE                  DEGRADED                    MINIMAL
  +------------------+    +------------------+    +------------------+
  | - Search works   |    | - Search works   |    | - Cached results |
  | - Recommendations|    | - No recs (slow) |    |   only           |
  | - Reviews load   | -> | - Reviews cached | -> | - Static pages   |
  | - Real-time chat |    | - Chat disabled  |    | - Maintenance    |
  |   available      |    |                  |    |   banner         |
  +------------------+    +------------------+    +------------------+
       100% capacity          ~70% capacity          ~30% capacity

Real-world examples:

  • Netflix: If the recommendation engine is down, show a generic "Popular" list instead of personalized picks.
  • Twitter: If the timeline service is overloaded, show a cached version from 30 seconds ago.
  • Amazon: If reviews service is down, still show the product page without reviews.
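All three examples share the same shape: wrap the risky call and substitute a cheaper, pre-computed answer on failure. A hedged sketch of the Netflix case (function and variable names are hypothetical):

```python
def get_recommendations(user_id, fetch_personalized, popular_fallback):
    """Try personalized recs; degrade to a generic 'Popular' list on any failure."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        return popular_fallback       # degraded, but the page still renders

def broken_engine(user_id):
    raise TimeoutError("recommendation engine down")

print(get_recommendations(42, broken_engine, ["Popular: Top 10"]))
# ['Popular: Top 10']
```

The key design choice is that the fallback must be cheap and already available (a static list, a cache entry), so the degraded path cannot itself fail under the same load.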

3. Circuit Breaker Pattern

Prevents a failing service from being hammered with requests, giving it time to recover.

                    +-------------------------------------------+
                    |         CIRCUIT BREAKER STATES             |
                    +-------------------------------------------+
                    |                                           |
                    |   +----------+                            |
                    |   |  CLOSED  |  (Normal -- requests pass) |
                    |   +----+-----+                            |
                    |        |                                  |
                    |   Failure threshold                       |
                    |   exceeded (e.g., 5                       |
                    |   failures in 10s)                        |
                    |        |                                  |
                    |        v                                  |
                    |   +----------+                            |
                    |   |   OPEN   |  (Failing -- requests      |
                    |   +----+-----+   rejected immediately)    |
                    |        |                                  |
                    |   After timeout                           |
                    |   (e.g., 30 seconds)                      |
                    |        |                                  |
                    |        v                                  |
                    |   +-----------+                           |
                    |   | HALF-OPEN |  (Testing -- let ONE      |
                    |   +-----+-----+   request through)        |
                    |         |                                 |
                    |    +----+----+                            |
                    |    |         |                            |
                    | Success   Failure                        |
                    |    |         |                            |
                    |    v         v                            |
                    | CLOSED     OPEN                           |
                    |                                           |
                    +-------------------------------------------+

A runnable Python sketch (simplified: HALF-OPEN here lets every request through, rather than gating a single probe):

import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    def __init__(self, forward, fallback, failure_threshold=5, timeout=30.0):
        self.forward = forward                 # call to the downstream service
        self.fallback = fallback               # degraded response when tripped
        self.failure_threshold = failure_threshold
        self.timeout = timeout                 # seconds to stay OPEN
        self.state = CLOSED
        self.failure_count = 0
        self.last_failure_time = None

    def call(self, request):
        if self.state == OPEN:
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = HALF_OPEN         # probe the downstream again
            else:
                return self.fallback()         # fast fail

        try:
            response = self.forward(request)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == HALF_OPEN or self.failure_count >= self.failure_threshold:
                self.state = OPEN              # trip (or re-trip) the breaker
            return self.fallback()

        if self.state == HALF_OPEN:
            self.state = CLOSED                # recovery confirmed
        self.failure_count = 0
        return response

When to use: Any call to an external or downstream service (database, API, third-party).


4. Bulkhead Pattern

Isolates failures so one failing component does not take down the entire system -- inspired by ship bulkheads that prevent flooding from spreading.

  WITHOUT BULKHEADS:                    WITH BULKHEADS:
  +------------------------+           +----------+  +----------+
  |  Shared Thread Pool    |           | Pool A   |  | Pool B   |
  |  (100 threads)         |           | (50 thr) |  | (50 thr) |
  |                        |           | Service  |  | Service  |
  |  Service A [overload]  |           | A [over- |  | B [OK]   |
  |  (uses all 100)        |           |  loaded] |  |          |
  |  Service B [starved!]  |           | (stuck)  |  | (fine!)  |
  +------------------------+           +----------+  +----------+
  Result: Both services                Result: Only A affected,
  fail together                        B continues normally

Implementation approaches:

  Approach                  | Mechanism                                     | Example
  --------------------------+-----------------------------------------------+------------------------------------------
  Thread pool isolation     | Separate thread pool per dependency           | Hystrix-style per-service pools
  Process isolation         | Separate process / container                  | Microservices in separate pods
  Connection pool isolation | Dedicated DB connection pool per service      | Service A: 20 conns, Service B: 20 conns
  Priority queues           | Separate queues for critical vs non-critical  | Payment requests get dedicated queue
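Thread pool isolation, the first row above, can be sketched with one bounded executor per dependency (pool names and sizes are illustrative): even when every call to Service A is stuck, Service B's pool has free workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# One pool per downstream dependency: A's saturation cannot starve B.
pools = {
    "service_a": ThreadPoolExecutor(max_workers=2),
    "service_b": ThreadPoolExecutor(max_workers=2),
}

def call(service, fn, *args):
    """Submit work on the bulkhead belonging to `service`."""
    return pools[service].submit(fn, *args)

def hang(x):
    time.sleep(0.2)                  # pretend Service A is stuck
    return x

futures_a = [call("service_a", hang, i) for i in range(4)]   # A saturated
fast = call("service_b", lambda: "ok")                       # B unaffected
print(fast.result(timeout=1))                                # ok
```

With a single shared pool, the four hanging tasks would occupy workers that the Service B call also needs; with bulkheads, only Service A callers queue up.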

5. Retry with Exponential Backoff

  Request 1: FAIL
      |
      +-- Wait 1 second
      |
  Request 2: FAIL
      |
      +-- Wait 2 seconds
      |
  Request 3: FAIL
      |
      +-- Wait 4 seconds (+ random jitter 0-1s)
      |
  Request 4: FAIL
      |
      +-- Wait 8 seconds (+ random jitter 0-1s)
      |
  Request 5: SUCCESS (or give up after max retries)

Formula:

  wait_time = min(base_delay * 2^attempt + random_jitter, max_delay)
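The formula translates directly into code. A sketch (the constants and the `sleep` injection for testing are illustrative choices, not a fixed recipe):

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, jitter=1.0):
    """wait_time = min(base_delay * 2^attempt + random_jitter, max_delay)."""
    return min(base_delay * 2 ** attempt + random.uniform(0, jitter), max_delay)

def retry(operation, max_retries=5, sleep=None):
    """Retry `operation` with exponential backoff; re-raise after max_retries."""
    import time
    sleep = sleep if sleep is not None else time.sleep
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise                          # give up after max retries
            sleep(backoff_delay(attempt))

# Fails twice, then succeeds; delays are captured instead of slept for the demo.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "done"

delays = []
print(retry(flaky, sleep=delays.append))       # done
print(len(delays))                             # 2 waits: ~1s then ~2s
```

Injecting `sleep` keeps the demo fast and testable; in production the default `time.sleep` blocks for the computed delay.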

Why jitter matters:

  WITHOUT JITTER:                  WITH JITTER:
  1000 clients all retry           1000 clients retry at
  at exactly 1s, 2s, 4s           1.0-1.9s, 2.0-2.9s, 4.0-4.9s
        |                                |
  Thundering herd!                 Spread out over time
  (spike repeats)                  (load distributed)

Best practices:

  Practice                   | Why
  ---------------------------+----------------------------------------------------------------
  Always add jitter          | Prevents thundering herd
  Set a max retry count      | Avoid infinite loops (usually 3-5)
  Set a max delay cap        | Don't wait forever (usually 30-60s)
  Only retry idempotent ops  | Retrying a non-idempotent operation can cause duplicates
  Use a retry budget         | Limit total retries across all clients (e.g., 10% of requests)

6. Timeout Strategies

+------------------------------------------------------------------------+
|                       TIMEOUT LAYERS                                    |
|                                                                        |
|  Client                                                                |
|    |-- Connection timeout (e.g., 3s)                                   |
|    |     "How long to wait for TCP handshake?"                         |
|    |                                                                   |
|    |-- Request timeout (e.g., 10s)                                     |
|    |     "How long to wait for a response?"                            |
|    |                                                                   |
|    |-- Total timeout / deadline (e.g., 30s)                            |
|          "Max time for the entire operation including retries"          |
|                                                                        |
|  Service-to-Service                                                    |
|    |-- Propagate deadline from upstream                                |
|    |     "If caller's deadline is 10s and 6s passed,                   |
|    |      downstream has only 4s"                                      |
|                                                                        |
|  Database                                                              |
|    |-- Query timeout (e.g., 5s)                                        |
|    |-- Lock wait timeout                                               |
+------------------------------------------------------------------------+

Common mistakes:

  • Setting timeouts too high (cascading slowness)
  • Setting timeouts too low (false failures under normal load)
  • Not propagating deadlines across services
  • Forgetting to timeout at every network boundary
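Deadline propagation, the mistake most often forgotten, can be sketched as a small object handed down the call chain; each hop budgets its timeout from what remains (the `Deadline` class is illustrative, similar in spirit to gRPC deadlines):

```python
import time

class Deadline:
    """Absolute deadline shared across a call chain."""

    def __init__(self, total_seconds, now=None):
        start = now if now is not None else time.monotonic()
        self.expires_at = start + total_seconds

    def remaining(self, now=None):
        now = now if now is not None else time.monotonic()
        return max(0.0, self.expires_at - now)

# Caller starts with a 10s budget; 6s later the downstream call gets only 4s.
d = Deadline(10.0, now=0.0)
print(d.remaining(now=6.0))    # 4.0 -- use this as the downstream timeout
print(d.remaining(now=12.0))   # 0.0 -- fail fast instead of calling at all
```

A downstream service that receives 0 remaining should reject immediately rather than do work whose result the caller has already given up on.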

7. Identifying and Eliminating Single Points of Failure (SPOFs)

  BEFORE (SPOFs everywhere):          AFTER (Redundancy at every layer):

  [Client]                            [Client]
     |                                   |
  [1 LB] <-- SPOF                    [LB-1] + [LB-2]  (DNS failover)
     |                                   |
  [1 App] <-- SPOF                    [App-1] + [App-2] + [App-3]
     |                                   |
  [1 DB]  <-- SPOF                    [DB Primary] + [DB Replica]
     |                                   |
  [1 Disk] <-- SPOF                   [RAID] + [Backups] + [Cross-region]

SPOF checklist for interviews:

  Layer         | SPOF Risk             | Redundancy Solution
  --------------+-----------------------+-------------------------------------------
  DNS           | Single DNS provider   | Multi-provider DNS (Route53 + Cloudflare)
  Load Balancer | Single LB             | LB pair with floating IP / cloud LB
  Application   | Single instance       | Multiple instances + auto-scaling
  Database      | Single primary        | Primary-replica + automatic failover
  Storage       | Single disk / volume  | RAID, replication, cross-region backup
  Network       | Single ISP / path     | Dual ISP, multi-AZ, multi-region
  Data Center   | Single DC             | Multi-AZ at minimum, multi-region ideally

8. Chaos Engineering

Deliberately inject failures to verify your fault tolerance actually works.

+------------------------------------------------------------------------+
|                    CHAOS ENGINEERING WORKFLOW                            |
|                                                                        |
|  1. DEFINE steady state                                                |
|     "Normal: p99 latency < 200ms, error rate < 0.1%"                  |
|                                                                        |
|  2. HYPOTHESIZE                                                        |
|     "If we kill 1 of 3 app servers, the system stays within            |
|      steady state because the LB reroutes traffic"                     |
|                                                                        |
|  3. INJECT failure                                                     |
|     - Kill a server / container                                        |
|     - Add network latency (tc netem)                                   |
|     - Fill disk space                                                  |
|     - Corrupt DNS responses                                            |
|                                                                        |
|  4. OBSERVE                                                            |
|     "Did the system stay within steady state?"                         |
|                                                                        |
|  5. FIX weaknesses found                                               |
|     "The LB took 45 seconds to detect failure -- reduce                |
|      health check interval from 30s to 5s"                             |
+------------------------------------------------------------------------+
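Steps 1-4 can be rehearsed in a pure simulation before touching production. A toy sketch (all classes and names here are made up for illustration): three simulated app servers behind a health-checking "load balancer"; the experiment kills one and checks that the steady state, a 100% success rate, still holds.

```python
class Server:
    def __init__(self, name):
        self.name, self.alive = name, True

    def handle(self, req):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        return "ok"

def load_balance(servers, req):
    """Route to the first healthy server (a real LB would round-robin)."""
    for s in servers:
        if s.alive:                          # health check
            return s.handle(req)
    raise RuntimeError("no healthy servers")

servers = [Server("app-1"), Server("app-2"), Server("app-3")]

# 1-2. Steady state + hypothesis: killing one server keeps success at 100%.
# 3.   Inject failure.
servers[0].alive = False
# 4.   Observe.
results = [load_balance(servers, i) for i in range(100)]
print(all(r == "ok" for r in results))       # True -- hypothesis holds
```

The real-world versions of step 3 (killing instances, `tc netem` latency, disk fills) are what the tools below automate; the value of the simulation is forcing you to state the steady-state check explicitly.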

Notable tools:

  • Chaos Monkey (Netflix) -- randomly kills instances in production
  • Gremlin -- enterprise chaos engineering platform
  • Litmus -- chaos engineering for Kubernetes
  • Toxiproxy -- simulate network conditions (latency, timeouts, partitions)

9. Fault Tolerance at Every Layer (Summary Diagram)

+------------------------------------------------------------------------+
|  LAYER             PATTERN                  PURPOSE                     |
+------------------------------------------------------------------------+
|                                                                        |
|  Client            Retry + Backoff          Recover from transient     |
|                    Timeout                  failures                    |
|                    Circuit Breaker          Prevent cascading failure   |
|                                                                        |
|  Load Balancer     Health Checks            Route away from failures   |
|                    Redundant LBs            No LB SPOF                 |
|                                                                        |
|  Application       Bulkhead                 Isolate failure domains    |
|                    Graceful Degradation     Partial functionality >     |
|                    Fallback Responses        total failure              |
|                                                                        |
|  Database          Replication              Survive node loss          |
|                    Automatic Failover       Minimize downtime          |
|                    Read Replicas            Spread read load           |
|                                                                        |
|  Infrastructure    Multi-AZ / Region        Survive AZ/region failure  |
|                    Chaos Engineering        Verify resilience          |
|                    Auto-scaling             Handle load spikes         |
|                                                                        |
+------------------------------------------------------------------------+

Key Takeaways

  1. Assume failure is inevitable. Design every component with a plan for when it fails.
  2. Circuit breakers prevent cascading failure. They stop a failing downstream from bringing down upstream services.
  3. Bulkheads isolate blast radius. One failing dependency should not consume resources for everything else.
  4. Always use exponential backoff with jitter. Never retry in a tight loop.
  5. Set timeouts at every network boundary. A missing timeout is a hidden SPOF.
  6. Eliminate SPOFs at every layer -- DNS, LB, app, database, storage, network.
  7. Chaos engineering validates your assumptions. If you have not tested a failure, you do not know if you can survive it.
  8. Graceful degradation > complete failure. Partial functionality preserves user trust.

Explain-It Challenge

Scenario: You have three microservices: Orders, Payments, and Inventory. The Payments service starts responding with 500 errors and 10-second latency.

Walk through what happens step by step:

  • Without any fault tolerance patterns
  • With circuit breaker, bulkhead, timeout, and graceful degradation in place

Hint: Show the cascading failure in the first case, and the isolated failure in the second.


Next -> 9.10.c — Observability