Episode 9 — System Design / 9.10 — Advanced Distributed Systems

9.10.a — Reliability and Availability

Introduction

Reliability and availability are the foundational promises you make about your system. Every system design interview implicitly or explicitly asks: "How do you keep this running?" This section gives you the math, the vocabulary, and the architectural strategies to answer confidently.


1. Key Definitions

Term         | Definition                                                                                       | Example
-------------|--------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------
Reliability  | The probability that a system performs its intended function without failure for a given period | "The payment service processes transactions correctly 99.99% of the time"
Availability | The fraction of time a system is operational and accessible                                     | "The website is reachable 99.95% of the year"
Durability   | The guarantee that stored data will not be lost                                                 | "S3 provides 99.999999999% (11 nines) durability"

Key distinction: A system can be available but unreliable (it's reachable but returns wrong data). Reliability is the stronger guarantee.


2. The "Nines" of Availability

+-------+------------+------------------+---------------------+-----------------------+
| Nines | Percentage | Downtime / Year  | Downtime / Month    | Downtime / Day        |
+-------+------------+------------------+---------------------+-----------------------+
|   1   |   90%      | 36.5 days        | 72 hours            | 2.4 hours             |
|   2   |   99%      | 3.65 days        | 7.2 hours           | 14.4 minutes          |
|   3   |   99.9%    | 8.76 hours       | 43.8 minutes        | 1.44 minutes          |
|   4   |   99.99%   | 52.6 minutes     | 4.38 minutes        | 8.64 seconds          |
|   5   |   99.999%  | 5.26 minutes     | 26.3 seconds        | 0.864 seconds         |
+-------+------------+------------------+---------------------+-----------------------+

Interview tip: Memorize at least 99.9% (three nines = ~8.76 hours/year) and 99.99% (four nines = ~52 minutes/year). These are the two most commonly referenced targets.
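The rows of the table above are just arithmetic on the failure fraction. A minimal sketch (the helper name is illustrative, not from the text):

```python
# Translating "nines" into allowed downtime, as in the table above.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability in [0, 1]."""
    return (1 - availability) * MINUTES_PER_YEAR

downtime_minutes_per_year(0.999)   # ~525.6 min = 8.76 hours (three nines)
downtime_minutes_per_year(0.9999)  # ~52.6 min (four nines)
```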


3. SLA, SLO, and SLI

+---------------------------------------------------------------------+
|                                                                     |
|  SLI (Service Level INDICATOR)                                      |
|    = The actual metric you measure                                  |
|    Example: "Request latency at p99 is 200ms"                       |
|                                                                     |
|       |                                                             |
|       v                                                             |
|                                                                     |
|  SLO (Service Level OBJECTIVE)                                      |
|    = The internal target you set                                    |
|    Example: "p99 latency SHOULD be < 300ms"                         |
|                                                                     |
|       |                                                             |
|       v                                                             |
|                                                                     |
|  SLA (Service Level AGREEMENT)                                      |
|    = The contractual promise to customers                           |
|    Example: "If p99 > 500ms for >0.1% of requests,                 |
|              we credit 10% of the monthly bill"                     |
|                                                                     |
|  Relationship: SLI measures --> SLO targets --> SLA guarantees      |
+---------------------------------------------------------------------+
                      | SLI                           | SLO                         | SLA
----------------------|-------------------------------|-----------------------------|------------------------------------
Who sets it?          | Engineering (instrumentation) | Engineering (internal goal) | Business + Legal (contract)
Consequence of breach | Alert fires                   | Error budget consumed       | Financial penalty / customer churn
Example               | Uptime = 99.93%               | Uptime target >= 99.95%     | Guarantee uptime >= 99.9%
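The "error budget consumed" consequence can be made concrete with a little arithmetic. A minimal sketch (function name and traffic numbers are hypothetical):

```python
# Error budget: how much failure an SLO still allows in a given window.

def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without breaching the SLO."""
    return round(total_requests * (1 - slo))

# A 99.95% SLO over 10 million monthly requests leaves room for 5,000 failures:
budget = error_budget(0.9995, 10_000_000)
```

Once the budget is spent, teams typically pause risky releases until the window rolls over.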

4. Availability Math

4a. Components in Series (AND -- all must work)

When components are chained and ALL must be operational:

  Client --> [LB] --> [App Server] --> [Database]
              99.99%    99.99%          99.99%

  Overall = 0.9999 x 0.9999 x 0.9999 = 0.9997 = 99.97%

Formula:

A_total = A1 x A2 x A3 x ... x An

Key insight: Each additional serial component REDUCES overall availability. A chain of ten 99.9% components yields only 99.0%.
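The series formula translates directly to code; this sketch (the helper name is my own) reproduces both results above:

```python
import math

def series_availability(*components: float) -> float:
    """Overall availability when ALL components must be up (series)."""
    return math.prod(components)

series_availability(0.9999, 0.9999, 0.9999)  # ~0.9997, as in the diagram
series_availability(*[0.999] * 10)           # ~0.990: ten three-nines hops
```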

4b. Components in Parallel (OR -- at least one must work)

When you add redundancy (any one working is sufficient):

              +--[ Server A  99.9% ]--+
  Client ---->|                       |----> DB
              +--[ Server B  99.9% ]--+

  P(both fail) = 0.001 x 0.001 = 0.000001
  A_total      = 1 - 0.000001  = 99.9999%

Formula:

A_total = 1 - (1 - A1) x (1 - A2) x ... x (1 - An)

Key insight: Parallel redundancy DRAMATICALLY improves availability.
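The parallel formula, sketched the same way (helper name is illustrative):

```python
import math

def parallel_availability(*components: float) -> float:
    """Overall availability when ANY ONE component suffices (parallel)."""
    return 1 - math.prod(1 - a for a in components)

parallel_availability(0.999, 0.999)  # ~0.999999: two redundant servers
```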

4c. Combined Example (Interview-Style)

                    +--[ App A 99.9% ]--+        +--[ DB Primary 99.95% ]--+
  [LB 99.99%] ----> |                   | -----> |                         |
                    +--[ App B 99.9% ]--+        +--[ DB Replica 99.95% ]--+

  Step 1: App layer (parallel)
    A_app = 1 - (1 - 0.999)^2 = 1 - 0.000001 = 0.999999

  Step 2: DB layer (parallel)
    A_db = 1 - (1 - 0.9995)^2 = 1 - 0.00000025 = 0.99999975

  Step 3: Whole system (series: LB -> App -> DB)
    A_total = 0.9999 x 0.999999 x 0.99999975
            = 0.99989875
            ~ 99.99% (just under a strict four-nines target)
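The three steps above can be checked by composing the formulas from 4a and 4b (helper names are illustrative):

```python
import math

def series(*a: float) -> float:
    """Series: all components must be up."""
    return math.prod(a)

def parallel(*a: float) -> float:
    """Parallel: any one component suffices."""
    return 1 - math.prod(1 - x for x in a)

a_app = parallel(0.999, 0.999)         # Step 1: app layer
a_db = parallel(0.9995, 0.9995)        # Step 2: DB layer
a_total = series(0.9999, a_app, a_db)  # Step 3: LB -> App -> DB
print(f"{a_total:.8f}")                # ~0.99989875
```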

5. Redundancy Strategies

Active-Passive (Hot Standby)

  +-------------------+       +-------------------+
  |   PRIMARY (Active)|       |  STANDBY (Passive)|
  |   Handles ALL     | ----> |  Receives data    |
  |   traffic         |  sync |  via replication  |
  +-------------------+       +-------------------+
          |                           |
     [Heartbeat] <----- monitors --- [Health Check]
          |
     If primary fails:
          |
          v
  +-------------------+       +-------------------+
  |   FAILED          |       |  NOW ACTIVE       |
  |   (being repaired)|       |  Takes over       |
  +-------------------+       +-------------------+
Aspect        | Detail
--------------|-------------------------------------------------------------
How it works  | Standby replicates data from primary; takes over on failure
Failover time | Seconds to minutes (depends on detection + switchover)
Data risk     | Possible data loss if replication is async
Cost          | Moderate -- standby resources are partially idle
Use case      | Traditional databases, single-leader setups
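The heartbeat-and-promote loop in the diagram above can be sketched in a few lines. This is a minimal illustration, assuming a boolean health probe and a promotion callback; the function names and the three-missed-beats threshold are hypothetical:

```python
import time

MISSED_BEATS_LIMIT = 3  # consecutive failed checks before failover

def monitor(check_primary, promote_standby, interval_s=1.0):
    """Promote the standby once the primary misses consecutive health checks."""
    missed = 0
    while True:
        if check_primary():          # e.g. an HTTP /health probe
            missed = 0               # healthy response resets the counter
        else:
            missed += 1
            if missed >= MISSED_BEATS_LIMIT:
                promote_standby()    # standby becomes the new active node
                return
        time.sleep(interval_s)
```

Requiring several consecutive misses trades failover speed for resistance to transient blips, which is the detection-time component of the "seconds to minutes" figure above.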

Active-Active

  +-------------------+       +-------------------+
  |   NODE A (Active) | <---> |   NODE B (Active) |
  |   Handles traffic | sync  |   Handles traffic |
  +-------------------+       +-------------------+
          ^                           ^
          |                           |
     +----+---------------------------+----+
     |          LOAD BALANCER              |
     +-------------------------------------+
                     ^
                     |
                 [Clients]
Aspect        | Detail
--------------|------------------------------------------------------------
How it works  | All nodes serve traffic simultaneously
Failover time | Near-zero -- traffic is already distributed
Data risk     | Conflict resolution needed (last-write-wins, CRDTs, etc.)
Cost          | Higher -- all resources fully utilized
Use case      | Global services, CDNs, DNS, multi-leader databases

Comparison Table

                     | Active-Passive          | Active-Active
---------------------|-------------------------|------------------------------
Resource utilization | ~50% (standby idle)     | ~100%
Failover speed       | Seconds-minutes         | Near-instant
Complexity           | Lower                   | Higher (conflict resolution)
Write scalability    | Single writer           | Multiple writers
Data consistency     | Easier (single source)  | Harder (conflicts possible)

6. Failover Strategies

Strategy     | How It Works                                 | Speed           | Use Case
-------------|----------------------------------------------|-----------------|------------------------------------
Cold standby | Backup server is OFF; started on failure     | Minutes-hours   | Cost-sensitive, non-critical
Warm standby | Backup is running but not serving traffic    | Seconds-minutes | Databases, internal services
Hot standby  | Backup receives replicated data in real time | Seconds         | Payment systems, critical services
Multi-active | All nodes active; load balanced              | Near-zero       | Global web apps, CDNs

7. Disaster Recovery

+------------------------------------------------------------------------+
|                     DISASTER RECOVERY METRICS                           |
|                                                                        |
|  RPO (Recovery Point Objective)     RTO (Recovery Time Objective)       |
|  = How much data can we LOSE?       = How fast must we RECOVER?         |
|                                                                        |
|  Data Loss <--|-- RPO --|-- Disaster --|-- RTO --|-->  Recovery         |
|              last       event          system     service               |
|              backup     occurs         restored   resumed               |
|                                                                        |
|  RPO = 0  --> No data loss (sync replication)                          |
|  RPO = 1h --> Up to 1 hour of data may be lost                         |
|  RTO = 0  --> Instant recovery (active-active)                         |
|  RTO = 4h --> Service restored within 4 hours                          |
+------------------------------------------------------------------------+

Disaster Recovery Tiers

Tier | Strategy                        | RPO             | RTO        | Cost
-----|---------------------------------|-----------------|------------|------
1    | Backup & restore                | Hours-days      | Hours-days | $
2    | Pilot light (minimal standby)   | Minutes         | 10-30 min  | $$
3    | Warm standby (scaled-down copy) | Seconds-minutes | Minutes    | $$$
4    | Multi-region active-active      | Near-zero       | Near-zero  | $$$$
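One way to apply the tier table: given a required RPO and RTO, pick the cheapest tier whose typical numbers satisfy both. The thresholds below are an illustrative reading of the table, not a standard:

```python
def dr_tier(rpo_min: float, rto_min: float) -> int:
    """Cheapest DR tier (1-4) meeting an RPO/RTO requirement in minutes.
    Thresholds are illustrative approximations of the tier table."""
    if rpo_min >= 24 * 60 and rto_min >= 24 * 60:
        return 1  # backup & restore
    if rpo_min >= 60 and rto_min >= 30:
        return 2  # pilot light
    if rpo_min >= 5 and rto_min >= 5:
        return 3  # warm standby
    return 4      # multi-region active-active
```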

8. Multi-Region Deployment

                        +-------------------+
                        |   Global DNS /    |
                        |   GeoDNS / CDN    |
                        +---------+---------+
                                  |
                   +--------------+---------------+
                   |                              |
          +--------v--------+           +---------v---------+
          |  US-EAST Region |           |  EU-WEST Region   |
          |                 |           |                   |
          |  [LB]           |           |  [LB]             |
          |    |            |           |    |              |
          |  [App x3]       |           |  [App x3]         |
          |    |            |           |    |              |
          |  [DB Primary] --+--->sync<--+-- [DB Primary]    |
          |  [DB Replica]   |           |  [DB Replica]     |
          |  [Cache]        |           |  [Cache]          |
          +-----------------+           +-------------------+

Key decisions in multi-region:

Decision            | Options                           | Trade-off
--------------------|-----------------------------------|--------------------------------
Data replication    | Sync vs async                     | Consistency vs latency
Conflict resolution | Last-write-wins, CRDTs, app-level | Simplicity vs correctness
Traffic routing     | GeoDNS, anycast, latency-based    | User experience vs complexity
Data residency      | Regional data isolation           | Compliance vs engineering cost
Failover scope      | Region-level vs AZ-level          | Recovery speed vs cost

9. Availability in System Design Interviews

When an interviewer says "design X with high availability," follow this checklist:

+-----------------------------------------------------------------------+
|              AVAILABILITY CHECKLIST FOR INTERVIEWS                      |
|                                                                        |
|  [ ] 1. State the availability target (e.g., 99.99%)                   |
|  [ ] 2. Calculate error budget (52.6 min/year downtime)                |
|  [ ] 3. Eliminate single points of failure (SPOF)                      |
|         - Replicate databases (primary + replica)                      |
|         - Multiple app server instances behind LB                      |
|         - Redundant load balancers                                     |
|  [ ] 4. Choose failover strategy (active-passive or active-active)     |
|  [ ] 5. Specify health checks and failure detection                    |
|  [ ] 6. Define RPO and RTO                                             |
|  [ ] 7. Discuss multi-region if global availability is needed          |
|  [ ] 8. Mention monitoring and alerting (tie to 9.10.c)               |
+-----------------------------------------------------------------------+

Key Takeaways

  1. Availability is measured in nines. Three nines (99.9%) allows 8.76 hours of downtime per year. Four nines (99.99%) allows only 52 minutes.
  2. Serial components multiply failure risk. Each component in a chain reduces overall availability.
  3. Parallel redundancy is the primary weapon. Two 99.9% servers in parallel give 99.9999%.
  4. Active-passive is simpler; active-active is faster. Choose based on your requirements and budget.
  5. RPO and RTO define disaster recovery. Always state both when discussing DR.
  6. SLI -> SLO -> SLA. Indicators feed objectives, and objectives back agreements; set your SLO stricter than your SLA so internal alarms fire before contractual penalties do.
  7. Multi-region is the gold standard for global high availability, but comes with data consistency challenges.

Explain-It Challenge

Explain to a non-technical stakeholder: "We need to upgrade from 99.9% to 99.99% availability."

What does this mean in practical terms? How would you justify the cost? What architectural changes are needed?

Hint: Frame it as "reducing downtime from 8.76 hours/year to 52 minutes/year" and list the redundancy investments required.


Next -> 9.10.b — Fault Tolerance