Episode 9 — System Design / 9.10 — Advanced Distributed Systems

9.10 — Quick Revision Cheat Sheet

How to use this material (instructions)

  1. Read through once the day before your interview.
  2. Cover the right column of each table and try to recall from the left column.
  3. Focus on diagrams -- redraw the ASCII diagrams from memory to test understanding.
  4. Spend 60 seconds on each of the six topic sections (1-6) for a complete rapid review.

1. Reliability & Availability

Nines Table (MEMORIZE)

  Nines     Uptime     Downtime/Year    Downtime/Month
  3 nines   99.9%      8.76 hours       43.8 min
  4 nines   99.99%     52.6 minutes     4.38 min
  5 nines   99.999%    5.26 minutes     26.3 sec

Availability Math

  SERIES:    A_total = A1 x A2 x A3         (all must work)
  PARALLEL:  A_total = 1 - (1-A1)(1-A2)     (any one works)
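The two formulas above can be checked with a short sketch (the component availabilities used here are illustrative, not from the cheat sheet):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def series(*components):
    """All components must work: multiply availabilities."""
    total = 1.0
    for a in components:
        total *= a
    return total

def parallel(*components):
    """Any one component suffices: 1 minus the product of failure probabilities."""
    failure = 1.0
    for a in components:
        failure *= (1 - a)
    return 1 - failure

def downtime_hours_per_year(availability):
    """Convert an availability figure into expected yearly downtime."""
    return (1 - availability) * HOURS_PER_YEAR

# Two 99.9% components in series are worse than either alone...
print(series(0.999, 0.999))              # 0.998001
# ...but the same two components in parallel are far better.
print(parallel(0.999, 0.999))            # 0.999999
print(downtime_hours_per_year(0.999))    # 8.76 (matches the nines table)
```

Note how redundancy (parallel) turns two three-nines components into a six-nines pair, while chaining them (series) makes things worse.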

SLA / SLO / SLI

  SLI (what you MEASURE) -> SLO (what you TARGET) -> SLA (what you PROMISE)

Failover Comparison

                   Active-Passive     Active-Active
  Failover speed   Seconds-minutes    Near-instant
  Resource use     ~50%               ~100%
  Complexity       Lower              Higher (conflicts)

Disaster Recovery

  RPO = How much data can we LOSE?
  RTO = How fast must we RECOVER?

2. Fault Tolerance

Pattern Quick Reference

  Pattern                Purpose                           One-Line Summary
  Circuit Breaker        Prevent cascading failure         CLOSED -> (failures) -> OPEN -> (timeout) -> HALF-OPEN
  Bulkhead               Isolate failure blast radius      Separate thread/connection pools per dependency
  Retry + Backoff        Recover from transient failures   wait = min(base * 2^attempt + jitter, max)
  Timeout                Prevent indefinite blocking       Set at EVERY network boundary
  Graceful Degradation   Partial service > no service      Disable non-critical features, serve cached data
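The retry-with-backoff formula from the table can be sketched as follows (base, cap, and attempt counts here are illustrative defaults, and the jitter term uses the common "add up to one base interval" variant):

```python
import random
import time

def backoff_delay(attempt, base=0.1, cap=10.0):
    """wait = min(base * 2^attempt + jitter, cap)."""
    return min(base * (2 ** attempt) + random.uniform(0, base), cap)

def retry(fn, max_attempts=5):
    """Call fn, retrying on any exception with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(backoff_delay(attempt))
```

The jitter term matters: without it, many clients that failed together retry together, re-creating the thundering herd the backoff was meant to prevent.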

Circuit Breaker States

  CLOSED --[failures > threshold]--> OPEN --[after timeout]--> HALF-OPEN
    ^                                  ^                           |
    |                                  +---------[failure]---------+
    +--------------------------[success]---------------------------+
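The state machine above translates into a small class. This is a minimal single-threaded sketch; the threshold and timeout values are illustrative, and a production breaker would also need locking and sliding-window failure counting:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"  # let one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe in HALF-OPEN re-opens immediately.
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

The key property: once OPEN, callers fail fast instead of piling up on a dead dependency, which is what stops the cascade.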

SPOF Elimination Checklist

  DNS       -> Multi-provider
  LB        -> Redundant pair / cloud LB
  App       -> Multiple instances + auto-scaling
  DB        -> Primary + replica + auto-failover
  Storage   -> RAID + cross-region backups
  Network   -> Multi-AZ / multi-region

3. Observability

Three Pillars

  Pillar    Data Type              Best For                      Tool
  Logs      Events (text/JSON)     Debugging specific errors     ELK Stack
  Metrics   Numbers over time      Dashboards, alerting          Prometheus + Grafana
  Traces    Spans (request flow)   Finding latency bottlenecks   Jaeger / Zipkin

RED Method (Services) & USE Method (Infrastructure)

  RED: Rate, Errors, Duration       USE: Utilization, Saturation, Errors

Distributed Tracing

  Trace (1 per request) contains Spans (1 per service hop)
  Context propagated via: traceparent HTTP header
  Visualized as: Waterfall / timeline diagram

Alerting Rule

  Alert on SYMPTOMS (error rate > 5%), not CAUSES (disk I/O high)
  Every alert must be ACTIONABLE

4. Rate Limiting

Algorithm Comparison

  Algorithm         Memory   Bursts?        Accuracy   Use When
  Token Bucket      O(1)     Yes            Good       Default choice (most APIs)
  Leaky Bucket      O(1)     No             Good       Need constant output rate
  Fixed Window      O(1)     Edge problem   Low        Simplest, acceptable for low stakes
  Sliding Log       O(n)     No             Perfect    Need exact counting
  Sliding Counter   O(1)     Partial        High       Best accuracy/memory trade-off
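The token bucket (the default choice above) can be sketched in-process like this; capacity and refill rate are illustrative. A distributed variant would hold the same two pieces of state (token count, last-refill timestamp) in Redis behind an atomic Lua script:

```python
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=5.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(capacity)    # start full: bursts allowed
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill lazily from elapsed time, capped at bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # request admitted
        return False      # request should get a 429
```

Starting full is what gives the token bucket its burst-friendliness; the refill rate is the sustained limit.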

Distributed Rate Limiting

  Standard: Redis + Lua script (atomic token bucket)
  Fallback: Local rate limiting if Redis is down

Response Headers

  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 73
  X-RateLimit-Reset: 1620000060
  Retry-After: 30              (on 429 response)

5. Auth & Authorization

AuthN vs AuthZ

  AuthN: "WHO are you?"    -> 401 on failure
  AuthZ: "WHAT can you do?" -> 403 on failure

Session vs JWT

                Session                   JWT
  State         Server-side (Redis/DB)    Client-side (token)
  Scalability   Harder (shared store)     Easier (stateless)
  Revocation    Easy (delete session)     Hard (wait for expiry)
  Best for      Traditional web apps      SPAs, mobile, microservices

JWT Structure

  HEADER.PAYLOAD.SIGNATURE   (Base64url encoded, dot-separated)
  Access token:  ~15 min TTL   (used for every request)
  Refresh token: ~7 days TTL   (used only to get new access tokens)
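The HEADER.PAYLOAD.SIGNATURE structure can be reproduced with the standard library alone. This is a sketch of HS256 signing and verification only (no registered-claim validation beyond `exp`); real services should use a maintained JWT library:

```python
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe Base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (_b64url(json.dumps(header).encode())
                     + "." + _b64url(json.dumps(payload).encode()))
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)

def verify_jwt(token: str, secret: bytes) -> dict:
    signing_input, _, sig_b64 = token.rpartition(".")
    expected = _b64url(hmac.new(secret, signing_input.encode(),
                                hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig_b64):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(signing_input.split(".")[1]))
    if payload.get("exp", float("inf")) < time.time():
        raise ValueError("expired")
    return payload
```

Note `hmac.compare_digest` for constant-time comparison, and that the payload is only *encoded*, not encrypted: never put secrets in JWT claims.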

OAuth 2.0 (Authorization Code Flow)

  1. Redirect to auth server with client_id
  2. User logs in + consents
  3. Redirect back with authorization CODE
  4. Backend exchanges code + client_secret for TOKEN (server-to-server)

RBAC vs ABAC

               RBAC                         ABAC
  Model        User -> Role -> Permission   Policy on attributes
  Complexity   Simple                       Complex
  Use case     Most apps                    Healthcare, finance, government
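The RBAC model (User -> Role -> Permission) reduces to two lookups. The users, roles, and permission strings below are made up for illustration:

```python
# role -> set of permissions
ROLE_PERMISSIONS = {
    "viewer": {"article:read"},
    "editor": {"article:read", "article:write"},
    "admin":  {"article:read", "article:write", "user:manage"},
}

# user -> set of roles
USER_ROLES = {
    "alice": {"editor"},
    "bob":   {"viewer"},
}

def can(user: str, permission: str) -> bool:
    """User -> Role -> Permission: allowed if any role grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))
```

ABAC replaces the static role tables with policies evaluated over attributes of the user, resource, and request context, which is why it is the more complex model.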

Microservices Auth Pattern

  Client -> [API Gateway: validate JWT] -> [Internal services: trust gateway headers]
  Inter-service: mTLS via service mesh (Istio/Envoy)

6. Search Systems

Inverted Index

  Forward:  Doc -> [terms]
  Inverted: Term -> [Doc IDs]     (this is what search engines use)

Text Processing Pipeline

  Raw text -> Tokenize -> Lowercase -> Remove stop words -> Stem -> Index
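The pipeline above, feeding an inverted index, can be sketched as follows. The stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins (real engines use analyzers like Porter/Snowball):

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "to"}  # illustrative subset

def analyze(text):
    """Tokenize -> lowercase -> remove stop words -> crude stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = []
    for t in tokens:
        if t in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):  # naive suffix stripping
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        terms.append(t)
    return terms

def build_index(docs):
    """Inverted index: term -> set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index
```

Queries go through the same `analyze` step as documents, so "Cats" and "cat" map to the same posting list; that symmetry is the whole point of the pipeline.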

Ranking

  TF-IDF:  score = TF (frequency in doc) x IDF (rarity across corpus)
  BM25:    Improved TF-IDF with saturation + length normalization
  Custom:  BM25 + recency + popularity + personalization
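The TF-IDF line can be made concrete. This sketch uses one common smoothed-IDF variant (the `1 +` terms avoid division by zero and negative scores); exact formulas differ between libraries:

```python
import math

def tf_idf_score(term, doc_terms, corpus):
    """score = TF (frequency in doc) x IDF (rarity across corpus).

    doc_terms: list of terms in one document
    corpus:    list of documents, each a list of terms
    """
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf
```

A term that is frequent in one document but rare in the corpus scores highest; BM25 keeps this shape but caps TF growth (saturation) and penalizes long documents.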

Elasticsearch Key Concepts

  Index = Table,  Document = Row,  Shard = Partition,  Replica = Copy
  Query flow: Fan out to all shards -> merge + sort -> fetch top N docs

Autocomplete

  Options: Trie, ES completion suggester, edge n-grams
  Requirements: < 100ms latency, debounce on client (200ms)
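Of the options above, the trie is the one worth being able to sketch from scratch. This minimal version returns completions in alphabetical order; a production suggester would rank by popularity instead:

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix, limit=10):
        # Walk down to the prefix node, then DFS for completions.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack and len(results) < limit:
            n, word = stack.pop()
            if n.is_word:
                results.append(word)
            # Push children in reverse-sorted order so pops come out sorted.
            for ch in sorted(n.children, reverse=True):
                stack.append((n.children[ch], word + ch))
        return results
```

Lookup cost is O(len(prefix)) to find the subtree plus the cost of collecting `limit` results, which is what makes the sub-100ms latency target achievable.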

Search Architecture

  DB -> CDC/Kafka -> Indexer -> Elasticsearch <- Search API <- Client

7. Interview Quick-Draw Templates

"Design X with high availability"

  1. State target (99.99%)
  2. Eliminate SPOFs (redundancy at every layer)
  3. Choose failover (active-passive or active-active)
  4. Define RPO/RTO
  5. Multi-region if global

"How would you make this fault tolerant?"

  1. Circuit breaker on all external calls
  2. Bulkhead isolation for dependencies
  3. Retry + exponential backoff + jitter
  4. Timeouts at every boundary
  5. Graceful degradation for non-critical features

"How would you monitor this system?"

  1. Structured logs with trace IDs
  2. RED metrics (rate, errors, duration) on every service
  3. Distributed tracing across service boundaries
  4. Alerts on SLO violations
  5. Dashboards in Grafana

"How would you secure this API?"

  1. JWT auth at API gateway
  2. OAuth 2.0 for third-party access
  3. RBAC for permissions
  4. Rate limiting (token bucket + Redis)
  5. HTTPS everywhere, mTLS between services

8. Numbers to Memorize

  Metric                         Value
  3 nines downtime/year          8.76 hours
  4 nines downtime/year          52.6 minutes
  5 nines downtime/year          5.26 minutes
  JWT access token TTL           ~15 minutes
  JWT refresh token TTL          ~7 days
  Autocomplete latency target    < 100ms
  Client debounce for typeahead  150-300ms
  Circuit breaker timeout        10-60 seconds
  Retry backoff formula          base * 2^attempt + jitter
  S3 durability                  11 nines (99.999999999%)

Go back to the README for the full section overview.