Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.a — Reliability and Availability
Introduction
Reliability and availability are the foundational promises you make about your system. Every system design interview implicitly or explicitly asks: "How do you keep this running?" This section gives you the math, the vocabulary, and the architectural strategies to answer confidently.
1. Key Definitions
| Term | Definition | Example |
|---|---|---|
| Reliability | The probability that a system performs its intended function without failure for a given period | "The payment service processes transactions correctly 99.99% of the time" |
| Availability | The fraction of time a system is operational and accessible | "The website is reachable 99.95% of the year" |
| Durability | The guarantee that stored data will not be lost | "S3 provides 99.999999999% (11 nines) durability" |
Key distinction: A system can be available but unreliable (it's reachable but returns wrong data). Reliability is the stronger guarantee.
2. The "Nines" of Availability
| Nines | Percentage | Downtime / Year | Downtime / Month | Downtime / Day |
|---|---|---|---|---|
| 1 | 90% | 36.5 days | 72 hours | 2.4 hours |
| 2 | 99% | 3.65 days | 7.2 hours | 14.4 minutes |
| 3 | 99.9% | 8.76 hours | 43.8 minutes | 1.44 minutes |
| 4 | 99.99% | 52.6 minutes | 4.38 minutes | 8.64 seconds |
| 5 | 99.999% | 5.26 minutes | 26.3 seconds | 0.864 seconds |
Interview tip: Memorize at least 99.9% (three nines = ~8.76 hours/year) and 99.99% (four nines = ~52 minutes/year). These are the two most commonly referenced targets.
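The table follows directly from the availability fraction; a minimal Python helper (names are my own) reproduces the two figures worth memorizing:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year

def downtime_per_year_s(availability: float) -> float:
    """Seconds of allowed downtime per year at a given availability (0..1)."""
    return (1 - availability) * SECONDS_PER_YEAR

# Three nines: 0.001 * 31,536,000 s = 31,536 s = 8.76 hours/year
print(round(downtime_per_year_s(0.999) / 3600, 2))   # 8.76
# Four nines: ~52.6 minutes/year
print(round(downtime_per_year_s(0.9999) / 60, 1))    # 52.6
```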
3. SLA, SLO, and SLI
SLI (Service Level INDICATOR)
  = The actual metric you measure
  Example: "Request latency at p99 is 200ms"
        |
        v
SLO (Service Level OBJECTIVE)
  = The internal target you set
  Example: "p99 latency SHOULD be < 300ms"
        |
        v
SLA (Service Level AGREEMENT)
  = The contractual promise to customers
  Example: "If p99 > 500ms for >0.1% of requests,
            we credit 10% of the monthly bill"

Relationship: SLI measures --> SLO targets --> SLA guarantees
| | SLI | SLO | SLA |
|---|---|---|---|
| Who sets it? | Engineering (instrumentation) | Engineering (internal goal) | Business + Legal (contract) |
| Consequence of breach | Alert fires | Error budget consumed | Financial penalty / customer churn |
| Example | Uptime = 99.93% | Uptime target >= 99.95% | Guarantee uptime >= 99.9% |
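The example row above (a measured 99.93% uptime against a 99.95% internal objective) can be computed from raw request counts. A sketch with hypothetical numbers:

```python
def availability_sli(good: int, total: int) -> float:
    """Availability SLI: fraction of requests served successfully."""
    return good / total

SLO = 0.9995  # internal objective, deliberately stricter than the 99.9% SLA

sli = availability_sli(good=999_300, total=1_000_000)
print(f"SLI = {sli:.2%}")  # SLI = 99.93%
print("SLO met" if sli >= SLO else "SLO missed -- error budget burning")
```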
4. Availability Math
4a. Components in Series (AND -- all must work)
When components are chained and ALL must be operational:
Client --> [LB] --> [App Server] --> [Database]
           99.99%     99.99%           99.99%
Overall = 0.9999 x 0.9999 x 0.9999 = 0.9997 = 99.97%
Formula:
A_total = A1 x A2 x A3 x ... x An
Key insight: Each additional serial component REDUCES overall availability. A chain of ten 99.9% components yields only 99.0%.
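The series formula is a one-liner; a quick sketch (function name is mine) verifies both numbers above:

```python
from math import prod

def series_availability(*components: float) -> float:
    """All components must be up simultaneously: availabilities multiply."""
    return prod(components)

print(f"{series_availability(0.9999, 0.9999, 0.9999):.4%}")  # 99.9700%
print(f"{series_availability(*[0.999] * 10):.2%}")           # ten-link chain: 99.00%
```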
4b. Components in Parallel (OR -- at least one must work)
When you add redundancy (any one working is sufficient):
            +--[ Server A 99.9% ]--+
Client ---->|                      |----> DB
            +--[ Server B 99.9% ]--+
P(both fail) = 0.001 x 0.001 = 0.000001
A_total = 1 - 0.000001 = 99.9999%
Formula:
A_total = 1 - (1 - A1) x (1 - A2) x ... x (1 - An)
Key insight: Parallel redundancy DRAMATICALLY improves availability.
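The parallel formula multiplies failure probabilities rather than availabilities; a sketch (names are mine):

```python
def parallel_availability(*replicas: float) -> float:
    """At least one replica must be up: multiply the failure probabilities."""
    p_all_fail = 1.0
    for a in replicas:
        p_all_fail *= (1 - a)
    return 1 - p_all_fail

print(f"{parallel_availability(0.999, 0.999):.4%}")           # 99.9999%
# A third replica pushes availability even higher
print(parallel_availability(0.999, 0.999, 0.999) > 0.999999)  # True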
4c. Combined Example (Interview-Style)
                  +--[ App A 99.9% ]--+
[LB 99.99%] ----> |                   | ----> [DB Primary 99.95%]
                  +--[ App B 99.9% ]--+                |
                                            [DB Replica 99.95%]
Step 1: App layer (parallel)
A_app = 1 - (1 - 0.999)^2 = 1 - 0.000001 = 0.999999
Step 2: DB layer (parallel)
A_db = 1 - (1 - 0.9995)^2 = 1 - 0.00000025 = 0.99999975
Step 3: Whole system (series: LB -> App -> DB)
A_total = 0.9999 x 0.999999 x 0.99999975
= 0.99989875
~ 99.99%
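The three steps of the worked example above can be checked in a few lines (a sketch; the helper is redefined so the snippet stands alone):

```python
def parallel(*replicas: float) -> float:
    """1 - P(all replicas fail)."""
    p_fail = 1.0
    for a in replicas:
        p_fail *= (1 - a)
    return 1 - p_fail

a_app = parallel(0.999, 0.999)      # Step 1: two app servers in parallel
a_db = parallel(0.9995, 0.9995)     # Step 2: DB primary + replica in parallel
a_total = 0.9999 * a_app * a_db     # Step 3: LB -> app tier -> DB tier in series
print(f"{a_total:.8f}")             # ~0.99989875, i.e. ~99.99%
```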
5. Redundancy Strategies
Active-Passive (Hot Standby)
+-------------------+         +-------------------+
|  PRIMARY (Active) |         | STANDBY (Passive) |
|  Handles ALL      | ------> |  Receives data    |
|  traffic          |  sync   |  via replication  |
+-------------------+         +-------------------+
          |                             |
     [Heartbeat] <----- monitors --- [Health Check]

If primary fails:
          |
          v
+-------------------+         +-------------------+
|  FAILED           |         |  NOW ACTIVE       |
|  (being repaired) |         |  Takes over       |
+-------------------+         +-------------------+
| Aspect | Detail |
|---|---|
| How it works | Standby replicates data from primary; takes over on failure |
| Failover time | Seconds to minutes (depends on detection + switchover) |
| Data risk | Possible data loss if replication is async |
| Cost | Moderate -- standby resources are partially idle |
| Use case | Traditional databases, single-leader setups |
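The heartbeat-driven failover in the diagram can be sketched as a toy state machine. All names and the missed-heartbeat threshold are illustrative; this is not a production failover implementation (which must also handle split-brain, fencing, etc.):

```python
class FailoverMonitor:
    """Promote the standby after the primary misses N consecutive heartbeats."""

    def __init__(self, max_missed: int = 3):
        self.max_missed = max_missed
        self.missed = 0
        self.active = "primary"

    def heartbeat(self) -> None:
        """Primary checked in: reset the failure counter."""
        self.missed = 0

    def interval_elapsed_without_heartbeat(self) -> str:
        """One heartbeat interval passed silently; fail over if threshold hit."""
        self.missed += 1
        if self.active == "primary" and self.missed >= self.max_missed:
            self.active = "standby"   # failover: standby takes over
        return self.active

monitor = FailoverMonitor()
monitor.heartbeat()
for _ in range(3):                    # primary goes silent for 3 intervals
    active = monitor.interval_elapsed_without_heartbeat()
print(active)  # standby
```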
Active-Active
+-------------------+         +-------------------+
|  NODE A (Active)  | <-----> |  NODE B (Active)  |
|  Handles traffic  |  sync   |  Handles traffic  |
+-------------------+         +-------------------+
          ^                             ^
          |                             |
          +--------------+--------------+
                         |
              +----------+----------+
              |    LOAD BALANCER    |
              +---------------------+
                         ^
                         |
                     [Clients]
| Aspect | Detail |
|---|---|
| How it works | All nodes serve traffic simultaneously |
| Failover time | Near-zero -- traffic is already distributed |
| Data risk | Conflict resolution needed (last-write-wins, CRDTs, etc.) |
| Cost | Higher -- all resources fully utilized |
| Use case | Global services, CDNs, DNS, multi-leader databases |
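Of the conflict-resolution options in the table, last-write-wins is the simplest to sketch, and its main drawback, silently discarding the older write, is visible right in the code:

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp: float  # assumes roughly synchronized clocks -- a real LWW caveat

def last_write_wins(a: Write, b: Write) -> Write:
    """Keep the write with the newer timestamp; the other write is lost."""
    return a if a.timestamp >= b.timestamp else b

# Nodes A and B both accepted a write for the same key while partitioned:
winner = last_write_wins(Write("cart=[book]", 1_700_000_000.0),
                         Write("cart=[book,pen]", 1_700_000_002.5))
print(winner.value)  # cart=[book,pen]
```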
Comparison Table
| | Active-Passive | Active-Active |
|---|---|---|
| Resource utilization | ~50% (standby idle) | ~100% |
| Failover speed | Seconds-minutes | Near-instant |
| Complexity | Lower | Higher (conflict resolution) |
| Write scalability | Single writer | Multiple writers |
| Data consistency | Easier (single source) | Harder (conflicts possible) |
6. Failover Strategies
| Strategy | How It Works | Speed | Use Case |
|---|---|---|---|
| Cold standby | Backup server is OFF; started on failure | Minutes-hours | Cost-sensitive, non-critical |
| Warm standby | Backup is running but not serving traffic | Seconds-minutes | Databases, internal services |
| Hot standby | Backup receives replicated data in real-time | Seconds | Payment systems, critical services |
| Multi-active | All nodes active; load balanced | Near-zero | Global web apps, CDNs |
7. Disaster Recovery
DISASTER RECOVERY METRICS

RPO (Recovery Point Objective) = How much data can we LOSE?
RTO (Recovery Time Objective)  = How fast must we RECOVER?

  Data Loss <--|--- RPO ---|-- Disaster --|--- RTO ---|--> Recovery
             last        event          system      service
            backup      occurs         restored     resumed

RPO = 0  --> No data loss (sync replication)
RPO = 1h --> Up to 1 hour of data may be lost
RTO = 0  --> Instant recovery (active-active)
RTO = 4h --> Service restored within 4 hours
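A backup schedule can be checked against RPO/RTO targets with simple arithmetic. A sketch, assuming worst-case data loss equals one full backup interval:

```python
def meets_dr_targets(backup_interval_h: float, restore_time_h: float,
                     rpo_h: float, rto_h: float) -> bool:
    """Worst case: one full backup interval of data is lost, and the
    full restore time is needed to get back online."""
    return backup_interval_h <= rpo_h and restore_time_h <= rto_h

# Hourly backups + 4h restore vs RPO = 1h, RTO = 4h:
print(meets_dr_targets(1.0, 4.0, rpo_h=1.0, rto_h=4.0))   # True
# Nightly backups cannot meet a 1-hour RPO:
print(meets_dr_targets(24.0, 4.0, rpo_h=1.0, rto_h=4.0))  # False
```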
Disaster Recovery Tiers
| Tier | Strategy | RPO | RTO | Cost |
|---|---|---|---|---|
| 1 | Backup & restore | Hours-days | Hours-days | $ |
| 2 | Pilot light (minimal standby) | Minutes | 10-30 min | $$ |
| 3 | Warm standby (scaled-down copy) | Seconds-minutes | Minutes | $$$ |
| 4 | Multi-region active-active | Near-zero | Near-zero | $$$$ |
8. Multi-Region Deployment
              +-------------------+
              |   Global DNS /    |
              |   GeoDNS / CDN    |
              +---------+---------+
                        |
         +--------------+--------------+
         |                             |
+--------v--------+           +--------v--------+
| US-EAST Region  |           | EU-WEST Region  |
|                 |           |                 |
|      [LB]       |           |      [LB]       |
|       |         |           |       |         |
|    [App x3]     |           |    [App x3]     |
|       |         |           |       |         |
| [DB Primary] ---+-- sync ---+-- [DB Primary]  |
| [DB Replica]    |           | [DB Replica]    |
|    [Cache]      |           |    [Cache]      |
+-----------------+           +-----------------+
Key decisions in multi-region:
| Decision | Options | Trade-off |
|---|---|---|
| Data replication | Sync vs async | Consistency vs latency |
| Conflict resolution | Last-write-wins, CRDTs, app-level | Simplicity vs correctness |
| Traffic routing | GeoDNS, anycast, latency-based | User experience vs complexity |
| Data residency | Regional data isolation | Compliance vs engineering cost |
| Failover scope | Region-level vs AZ-level | Recovery speed vs cost |
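The latency-based routing option from the table can be sketched as a health-aware region picker (region names and latencies are made up for illustration):

```python
def pick_region(latency_ms: dict, healthy: set) -> str:
    """Route to the lowest-latency healthy region; fail over otherwise."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

latencies = {"us-east": 20.0, "eu-west": 95.0}
print(pick_region(latencies, {"us-east", "eu-west"}))  # us-east
print(pick_region(latencies, {"eu-west"}))             # eu-west (failover)
```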
9. Availability in System Design Interviews
When an interviewer says "design X with high availability," follow this checklist:
AVAILABILITY CHECKLIST FOR INTERVIEWS

- [ ] 1. State the availability target (e.g., 99.99%)
- [ ] 2. Calculate the error budget (e.g., 52.6 min/year at 99.99%)
- [ ] 3. Eliminate single points of failure (SPOF):
  - Replicate databases (primary + replica)
  - Run multiple app server instances behind a LB
  - Use redundant load balancers
- [ ] 4. Choose a failover strategy (active-passive or active-active)
- [ ] 5. Specify health checks and failure detection
- [ ] 6. Define RPO and RTO
- [ ] 7. Discuss multi-region if global availability is needed
- [ ] 8. Mention monitoring and alerting (tie to 9.10.c)
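Checklist item 2, the error budget, is worth being able to compute on the spot (a sketch; the 30-minute incident figure is hypothetical):

```python
def error_budget_minutes(slo: float, period_days: int = 365) -> float:
    """Allowed downtime in minutes for a given SLO over a period."""
    return (1 - slo) * period_days * 24 * 60

budget = error_budget_minutes(0.9999)  # four nines: ~52.6 min/year
spent = 30.0                           # minutes of downtime so far this year
print(f"remaining error budget: {budget - spent:.1f} min")
```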
Key Takeaways
- Availability is measured in nines. Three nines (99.9%) allows 8.76 hours of downtime per year. Four nines (99.99%) allows only 52 minutes.
- Serial components multiply failure risk. Each component in a chain reduces overall availability.
- Parallel redundancy is the primary weapon. Two 99.9% servers in parallel give 99.9999%.
- Active-passive is simpler; active-active is faster. Choose based on your requirements and budget.
- RPO and RTO define disaster recovery. Always state both when discussing DR.
- SLI --> SLO --> SLA. Know the hierarchy: indicators (what you measure) feed objectives (internal targets), which back agreements (contractual promises). Keep the internal SLO stricter than the external SLA so you have warning before a breach.
- Multi-region is the gold standard for global high availability, but comes with data consistency challenges.
Explain-It Challenge
Explain to a non-technical stakeholder: "We need to upgrade from 99.9% to 99.99% availability."
What does this mean in practical terms? How would you justify the cost? What architectural changes are needed?
Hint: Frame it as "reducing downtime from 8.76 hours/year to 52 minutes/year" and list the redundancy investments required.
Next -> 9.10.b — Fault Tolerance