Episode 9 — System Design / 9.10 — Advanced Distributed Systems

9.10.a — Reliability and Availability

Introduction

Reliability and availability are the foundational promises you make about your system. Every system design interview implicitly or explicitly asks: "How do you keep this running?" This section gives you the math, the vocabulary, and the architectural strategies to answer confidently.


1. Key Definitions

Term         | Definition                                                                                       | Example
-------------|--------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------
Reliability  | The probability that a system performs its intended function without failure for a given period | "The payment service processes transactions correctly 99.99% of the time"
Availability | The fraction of time a system is operational and accessible                                     | "The website is reachable 99.95% of the year"
Durability   | The guarantee that stored data will not be lost                                                 | "S3 provides 99.999999999% (11 nines) durability"

Key distinction: A system can be available but unreliable (it's reachable but returns wrong data). Reliability is the stronger guarantee.


2. The "Nines" of Availability

+-------+------------+------------------+---------------------+-----------------------+
| Nines | Percentage | Downtime / Year  | Downtime / Month    | Downtime / Day        |
+-------+------------+------------------+---------------------+-----------------------+
|   1   |   90%      | 36.5 days        | 72 hours            | 2.4 hours             |
|   2   |   99%      | 3.65 days        | 7.2 hours           | 14.4 minutes          |
|   3   |   99.9%    | 8.76 hours       | 43.8 minutes        | 1.44 minutes          |
|   4   |   99.99%   | 52.6 minutes     | 4.38 minutes        | 8.64 seconds          |
|   5   |   99.999%  | 5.26 minutes     | 26.3 seconds        | 0.864 seconds         |
+-------+------------+------------------+---------------------+-----------------------+

Interview tip: Memorize at least 99.9% (three nines = ~8.76 hours/year) and 99.99% (four nines = ~52 minutes/year). These are the two most commonly referenced targets.
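The rows of the table above are just arithmetic on the failure fraction. A minimal sketch (the helper name is illustrative, not from the text):

```python
# Translating "nines" into allowed downtime, as in the table above.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability in [0, 1]."""
    return (1 - availability) * MINUTES_PER_YEAR

downtime_minutes_per_year(0.999)   # ~525.6 min = 8.76 hours (three nines)
downtime_minutes_per_year(0.9999)  # ~52.6 min (four nines)
```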


3. SLA, SLO, and SLI

+---------------------------------------------------------------------+
|                                                                     |
|  SLI (Service Level INDICATOR)                                      |
|    = The actual metric you measure                                  |
|    Example: "Request latency at p99 is 200ms"                       |
|                                                                     |
|       |                                                             |
|       v                                                             |
|                                                                     |
|  SLO (Service Level OBJECTIVE)                                      |
|    = The internal target you set                                    |
|    Example: "p99 latency SHOULD be < 300ms"                         |
|                                                                     |
|       |                                                             |
|       v                                                             |
|                                                                     |
|  SLA (Service Level AGREEMENT)                                      |
|    = The contractual promise to customers                           |
|    Example: "If p99 > 500ms for >0.1% of requests,                 |
|              we credit 10% of the monthly bill"                     |
|                                                                     |
|  Relationship: SLI measures --> SLO targets --> SLA guarantees      |
+---------------------------------------------------------------------+
                      | SLI                           | SLO                         | SLA
----------------------|-------------------------------|-----------------------------|------------------------------------
Who sets it?          | Engineering (instrumentation) | Engineering (internal goal) | Business + Legal (contract)
Consequence of breach | Alert fires                   | Error budget consumed       | Financial penalty / customer churn
Example               | Uptime = 99.93%               | Uptime target >= 99.95%     | Guarantee uptime >= 99.9%
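The "error budget consumed" consequence can be made concrete with a little arithmetic. A minimal sketch (function name and traffic numbers are hypothetical):

```python
# Error budget: how much failure an SLO still allows in a given window.

def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without breaching the SLO."""
    return round(total_requests * (1 - slo))

# A 99.95% SLO over 10 million monthly requests leaves room for 5,000 failures:
budget = error_budget(0.9995, 10_000_000)
```

Once the budget is spent, teams typically pause risky releases until the window rolls over.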

4. Availability Math

4a. Components in Series (AND -- all must work)

When components are chained and ALL must be operational:

  Client --> [LB] --> [App Server] --> [Database]
              99.99%    99.99%          99.99%

  Overall = 0.9999 x 0.9999 x 0.9999 = 0.9997 = 99.97%

Formula:

A_total = A1 x A2 x A3 x ... x An

Key insight: Each additional serial component REDUCES overall availability. A chain of ten 99.9% components yields only 99.0%.
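The series formula translates directly to code; this sketch (the helper name is my own) reproduces both results above:

```python
import math

def series_availability(*components: float) -> float:
    """Overall availability when ALL components must be up (series)."""
    return math.prod(components)

series_availability(0.9999, 0.9999, 0.9999)  # ~0.9997, as in the diagram
series_availability(*[0.999] * 10)           # ~0.990: ten three-nines hops
```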

4b. Components in Parallel (OR -- at least one must work)

When you add redundancy (any one working is sufficient):

              +--[ Server A  99.9% ]--+
  Client ---->|                       |----> DB
              +--[ Server B  99.9% ]--+

  P(both fail) = 0.001 x 0.001 = 0.000001
  A_total      = 1 - 0.000001  = 99.9999%

Formula:

A_total = 1 - (1 - A1) x (1 - A2) x ... x (1 - An)

Key insight: Parallel redundancy DRAMATICALLY improves availability.
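The parallel formula, sketched the same way (helper name is illustrative):

```python
import math

def parallel_availability(*components: float) -> float:
    """Overall availability when ANY ONE component suffices (parallel)."""
    return 1 - math.prod(1 - a for a in components)

parallel_availability(0.999, 0.999)  # ~0.999999: two redundant servers
```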

4c. Combined Example (Interview-Style)

                    +--[ App A 99.9% ]--+        +--[ DB Primary 99.95% ]--+
  [LB 99.99%] ----> |                   | -----> |                         |
                    +--[ App B 99.9% ]--+        +--[ DB Replica 99.95% ]--+

  Step 1: App layer (parallel)
    A_app = 1 - (1 - 0.999)^2 = 1 - 0.000001 = 0.999999

  Step 2: DB layer (parallel)
    A_db = 1 - (1 - 0.9995)^2 = 1 - 0.00000025 = 0.99999975

  Step 3: Whole system (series: LB -> App -> DB)
    A_total = 0.9999 x 0.999999 x 0.99999975
            = 0.99989875
            ~ 99.99% (just under a strict four-nines target)
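The three steps above can be checked by composing the formulas from 4a and 4b (helper names are illustrative):

```python
import math

def series(*a: float) -> float:
    """Series: all components must be up."""
    return math.prod(a)

def parallel(*a: float) -> float:
    """Parallel: any one component suffices."""
    return 1 - math.prod(1 - x for x in a)

a_app = parallel(0.999, 0.999)         # Step 1: app layer
a_db = parallel(0.9995, 0.9995)        # Step 2: DB layer
a_total = series(0.9999, a_app, a_db)  # Step 3: LB -> App -> DB
print(f"{a_total:.8f}")                # ~0.99989875
```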

5. Redundancy Strategies

Active-Passive (Hot Standby)

  +-------------------+       +-------------------+
  |   PRIMARY (Active)|       |  STANDBY (Passive)|
  |   Handles ALL     | ----> |  Receives data    |
  |   traffic         |  sync |  via replication  |
  +-------------------+       +-------------------+
          |                           |
     [Heartbeat] <----- monitors --- [Health Check]
          |
     If primary fails:
          |
          v
  +-------------------+       +-------------------+
  |   FAILED          |       |  NOW ACTIVE       |
  |   (being repaired)|       |  Takes over       |
  +-------------------+       +-------------------+
Aspect        | Detail
--------------|-------------------------------------------------------------
How it works  | Standby replicates data from primary; takes over on failure
Failover time | Seconds to minutes (depends on detection + switchover)
Data risk     | Possible data loss if replication is async
Cost          | Moderate -- standby resources are partially idle
Use case      | Traditional databases, single-leader setups
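The heartbeat-and-promote loop in the diagram above can be sketched in a few lines. This is a minimal illustration, assuming a boolean health probe and a promotion callback; the function names and the three-missed-beats threshold are hypothetical:

```python
import time

MISSED_BEATS_LIMIT = 3  # consecutive failed checks before failover

def monitor(check_primary, promote_standby, interval_s=1.0):
    """Promote the standby once the primary misses consecutive health checks."""
    missed = 0
    while True:
        if check_primary():          # e.g. an HTTP /health probe
            missed = 0               # healthy response resets the counter
        else:
            missed += 1
            if missed >= MISSED_BEATS_LIMIT:
                promote_standby()    # standby becomes the new active node
                return
        time.sleep(interval_s)
```

Requiring several consecutive misses trades failover speed for resistance to transient blips, which is the detection-time component of the "seconds to minutes" figure above.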

Active-Active

  +-------------------+       +-------------------+
  |   NODE A (Active) | <---> |   NODE B (Active) |
  |   Handles traffic | sync  |   Handles traffic |
  +-------------------+       +-------------------+
          ^                           ^
          |                           |
     +----+---------------------------+----+
     |          LOAD BALANCER              |
     +-------------------------------------+
                     ^
                     |
                 [Clients]
Aspect        | Detail
--------------|------------------------------------------------------------
How it works  | All nodes serve traffic simultaneously
Failover time | Near-zero -- traffic is already distributed
Data risk     | Conflict resolution needed (last-write-wins, CRDTs, etc.)
Cost          | Higher -- all resources fully utilized
Use case      | Global services, CDNs, DNS, multi-leader databases

Comparison Table

                     | Active-Passive          | Active-Active
---------------------|-------------------------|------------------------------
Resource utilization | ~50% (standby idle)     | ~100%
Failover speed       | Seconds-minutes         | Near-instant
Complexity           | Lower                   | Higher (conflict resolution)
Write scalability    | Single writer           | Multiple writers
Data consistency     | Easier (single source)  | Harder (conflicts possible)

6. Failover Strategies

Strategy     | How It Works                                 | Speed           | Use Case
-------------|----------------------------------------------|-----------------|------------------------------------
Cold standby | Backup server is OFF; started on failure     | Minutes-hours   | Cost-sensitive, non-critical
Warm standby | Backup is running but not serving traffic    | Seconds-minutes | Databases, internal services
Hot standby  | Backup receives replicated data in real time | Seconds         | Payment systems, critical services
Multi-active | All nodes active; load balanced              | Near-zero       | Global web apps, CDNs

7. Disaster Recovery

+------------------------------------------------------------------------+
|                     DISASTER RECOVERY METRICS                           |
|                                                                        |
|  RPO (Recovery Point Objective)     RTO (Recovery Time Objective)       |
|  = How much data can we LOSE?       = How fast must we RECOVER?         |
|                                                                        |
|  Data Loss <--|-- RPO --|-- Disaster --|-- RTO --|-->  Recovery         |
|              last       event          system     service               |
|              backup     occurs         restored   resumed               |
|                                                                        |
|  RPO = 0  --> No data loss (sync replication)                          |
|  RPO = 1h --> Up to 1 hour of data may be lost                         |
|  RTO = 0  --> Instant recovery (active-active)                         |
|  RTO = 4h --> Service restored within 4 hours                          |
+------------------------------------------------------------------------+

Disaster Recovery Tiers

Tier | Strategy                        | RPO             | RTO        | Cost
-----|---------------------------------|-----------------|------------|------
1    | Backup & restore                | Hours-days      | Hours-days | $
2    | Pilot light (minimal standby)   | Minutes         | 10-30 min  | $$
3    | Warm standby (scaled-down copy) | Seconds-minutes | Minutes    | $$$
4    | Multi-region active-active      | Near-zero       | Near-zero  | $$$$
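One way to apply the tier table: given a required RPO and RTO, pick the cheapest tier whose typical numbers satisfy both. The thresholds below are an illustrative reading of the table, not a standard:

```python
def dr_tier(rpo_min: float, rto_min: float) -> int:
    """Cheapest DR tier (1-4) meeting an RPO/RTO requirement in minutes.
    Thresholds are illustrative approximations of the tier table."""
    if rpo_min >= 24 * 60 and rto_min >= 24 * 60:
        return 1  # backup & restore
    if rpo_min >= 60 and rto_min >= 30:
        return 2  # pilot light
    if rpo_min >= 5 and rto_min >= 5:
        return 3  # warm standby
    return 4      # multi-region active-active
```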

8. Multi-Region Deployment

                        +-------------------+
                        |   Global DNS /    |
                        |   GeoDNS / CDN    |
                        +---------+---------+
                                  |
                   +--------------+---------------+
                   |                              |
          +--------v--------+           +---------v---------+
          |  US-EAST Region |           |  EU-WEST Region   |
          |                 |           |                   |
          |  [LB]           |           |  [LB]             |
          |    |            |           |    |              |
          |  [App x3]       |           |  [App x3]         |
          |    |            |           |    |              |
          |  [DB Primary] --+--->sync<--+-- [DB Primary]    |
          |  [DB Replica]   |           |  [DB Replica]     |
          |  [Cache]        |           |  [Cache]          |
          +-----------------+           +-------------------+

Key decisions in multi-region:

Decision            | Options                           | Trade-off
--------------------|-----------------------------------|--------------------------------
Data replication    | Sync vs async                     | Consistency vs latency
Conflict resolution | Last-write-wins, CRDTs, app-level | Simplicity vs correctness
Traffic routing     | GeoDNS, anycast, latency-based    | User experience vs complexity
Data residency      | Regional data isolation           | Compliance vs engineering cost
Failover scope      | Region-level vs AZ-level          | Recovery speed vs cost

9. Availability in System Design Interviews

When an interviewer says "design X with high availability," follow this checklist:

+-----------------------------------------------------------------------+
|              AVAILABILITY CHECKLIST FOR INTERVIEWS                      |
|                                                                        |
|  [ ] 1. State the availability target (e.g., 99.99%)                   |
|  [ ] 2. Calculate error budget (52.6 min/year downtime)                |
|  [ ] 3. Eliminate single points of failure (SPOF)                      |
|         - Replicate databases (primary + replica)                      |
|         - Multiple app server instances behind LB                      |
|         - Redundant load balancers                                     |
|  [ ] 4. Choose failover strategy (active-passive or active-active)     |
|  [ ] 5. Specify health checks and failure detection                    |
|  [ ] 6. Define RPO and RTO                                             |
|  [ ] 7. Discuss multi-region if global availability is needed          |
|  [ ] 8. Mention monitoring and alerting (tie to 9.10.c)               |
+-----------------------------------------------------------------------+

Key Takeaways

  1. Availability is measured in nines. Three nines (99.9%) allows 8.76 hours of downtime per year. Four nines (99.99%) allows only 52 minutes.
  2. Serial components multiply failure risk. Each component in a chain reduces overall availability.
  3. Parallel redundancy is the primary weapon. Two 99.9% servers in parallel give 99.9999%.
  4. Active-passive is simpler; active-active is faster. Choose based on your requirements and budget.
  5. RPO and RTO define disaster recovery. Always state both when discussing DR.
  6. SLI -> SLO -> SLA. Indicators feed objectives, and objectives back agreements; set your SLO stricter than your SLA so internal alarms fire before contractual penalties do.
  7. Multi-region is the gold standard for global high availability, but comes with data consistency challenges.

Explain-It Challenge

Explain to a non-technical stakeholder: "We need to upgrade from 99.9% to 99.99% availability."

What does this mean in practical terms? How would you justify the cost? What architectural changes are needed?

Hint: Frame it as "reducing downtime from 8.76 hours/year to 52 minutes/year" and list the redundancy investments required.


Next -> 9.10.b — Fault Tolerance