Episode 9 — System Design / 9.10 — Advanced Distributed Systems

9.10 — Quick Revision Cheat Sheet

How to use this material (instructions)

  1. Read through once the day before your interview.
  2. Cover the right column of each table and try to recall from the left column.
  3. Focus on diagrams -- redraw the ASCII diagrams from memory to test understanding.
  4. Spend 60 seconds on each of the six topic sections (1-6) for a complete rapid review.

1. Reliability & Availability

Nines Table (MEMORIZE)

  Nines     Uptime     Downtime/Year    Downtime/Month
  3 nines   99.9%      8.76 hours       43.8 min
  4 nines   99.99%     52.6 minutes     4.38 min
  5 nines   99.999%    5.26 minutes     26.3 sec

Availability Math

  SERIES:    A_total = A1 x A2 x A3         (all must work)
  PARALLEL:  A_total = 1 - (1-A1)(1-A2)     (any one works)
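The two formulas above can be checked with a short sketch (the component availabilities used here are illustrative, not from the cheat sheet):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def series(*components):
    """All components must work: multiply availabilities."""
    total = 1.0
    for a in components:
        total *= a
    return total

def parallel(*components):
    """Any one component suffices: 1 minus the product of failure probabilities."""
    failure = 1.0
    for a in components:
        failure *= (1 - a)
    return 1 - failure

def downtime_hours_per_year(availability):
    """Convert an availability figure into expected yearly downtime."""
    return (1 - availability) * HOURS_PER_YEAR

# Two 99.9% components in series are worse than either alone...
print(series(0.999, 0.999))              # 0.998001
# ...but the same two components in parallel are far better.
print(parallel(0.999, 0.999))            # 0.999999
print(downtime_hours_per_year(0.999))    # 8.76 (matches the nines table)
```

Note how redundancy (parallel) turns two three-nines components into a six-nines pair, while chaining them (series) makes things worse.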

SLA / SLO / SLI

  SLI (what you MEASURE) -> SLO (what you TARGET) -> SLA (what you PROMISE)

Failover Comparison

                   Active-Passive     Active-Active
  Failover speed   Seconds-minutes    Near-instant
  Resource use     ~50%               ~100%
  Complexity       Lower              Higher (conflicts)

Disaster Recovery

  RPO = How much data can we LOSE?
  RTO = How fast must we RECOVER?

2. Fault Tolerance

Pattern Quick Reference

  Pattern                Purpose                           One-Line Summary
  Circuit Breaker        Prevent cascading failure         CLOSED -> (failures) -> OPEN -> (timeout) -> HALF-OPEN
  Bulkhead               Isolate failure blast radius      Separate thread/connection pools per dependency
  Retry + Backoff        Recover from transient failures   wait = min(base * 2^attempt + jitter, max)
  Timeout                Prevent indefinite blocking       Set at EVERY network boundary
  Graceful Degradation   Partial service > no service      Disable non-critical features, serve cached data
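The retry-with-backoff formula from the table can be sketched as follows (base, cap, and attempt counts here are illustrative defaults, and the jitter term uses the common "add up to one base interval" variant):

```python
import random
import time

def backoff_delay(attempt, base=0.1, cap=10.0):
    """wait = min(base * 2^attempt + jitter, cap)."""
    return min(base * (2 ** attempt) + random.uniform(0, base), cap)

def retry(fn, max_attempts=5):
    """Call fn, retrying on any exception with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(backoff_delay(attempt))
```

The jitter term matters: without it, many clients that failed together retry together, re-creating the thundering herd the backoff was meant to prevent.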

Circuit Breaker States

  CLOSED --[failures > threshold]--> OPEN --[after timeout]--> HALF-OPEN
    ^                                  ^                           |
    |                                  +---------[failure]---------+
    +--------------------------[success]---------------------------+
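The state machine above translates into a small class. This is a minimal single-threaded sketch; the threshold and timeout values are illustrative, and a production breaker would also need locking and sliding-window failure counting:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"  # let one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe in HALF-OPEN re-opens immediately.
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

The key property: once OPEN, callers fail fast instead of piling up on a dead dependency, which is what stops the cascade.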

SPOF Elimination Checklist

  DNS       -> Multi-provider
  LB        -> Redundant pair / cloud LB
  App       -> Multiple instances + auto-scaling
  DB        -> Primary + replica + auto-failover
  Storage   -> RAID + cross-region backups
  Network   -> Multi-AZ / multi-region

3. Observability

Three Pillars

  Pillar    Data Type              Best For                      Tool
  Logs      Events (text/JSON)     Debugging specific errors     ELK Stack
  Metrics   Numbers over time      Dashboards, alerting          Prometheus + Grafana
  Traces    Spans (request flow)   Finding latency bottlenecks   Jaeger / Zipkin

RED Method (Services) & USE Method (Infrastructure)

  RED: Rate, Errors, Duration       USE: Utilization, Saturation, Errors

Distributed Tracing

  Trace (1 per request) contains Spans (1 per service hop)
  Context propagated via: traceparent HTTP header
  Visualized as: Waterfall / timeline diagram

Alerting Rule

  Alert on SYMPTOMS (error rate > 5%), not CAUSES (disk I/O high)
  Every alert must be ACTIONABLE

4. Rate Limiting

Algorithm Comparison

  Algorithm         Memory   Bursts?        Accuracy   Use When
  Token Bucket      O(1)     Yes            Good       Default choice (most APIs)
  Leaky Bucket      O(1)     No             Good       Need constant output rate
  Fixed Window      O(1)     Edge problem   Low        Simplest, acceptable for low stakes
  Sliding Log       O(n)     No             Perfect    Need exact counting
  Sliding Counter   O(1)     Partial        High       Best accuracy/memory trade-off
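The token bucket (the default choice above) can be sketched in-process like this; capacity and refill rate are illustrative. A distributed variant would hold the same two pieces of state (token count, last-refill timestamp) in Redis behind an atomic Lua script:

```python
import time

class TokenBucket:
    def __init__(self, capacity=10, refill_rate=5.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(capacity)    # start full: bursts allowed
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill lazily from elapsed time, capped at bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # request admitted
        return False      # request should get a 429
```

Starting full is what gives the token bucket its burst-friendliness; the refill rate is the sustained limit.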

Distributed Rate Limiting

  Standard: Redis + Lua script (atomic token bucket)
  Fallback: Local rate limiting if Redis is down

Response Headers

  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 73
  X-RateLimit-Reset: 1620000060
  Retry-After: 30              (on 429 response)

5. Auth & Authorization

AuthN vs AuthZ

  AuthN: "WHO are you?"    -> 401 on failure
  AuthZ: "WHAT can you do?" -> 403 on failure

Session vs JWT

                Session                   JWT
  State         Server-side (Redis/DB)    Client-side (token)
  Scalability   Harder (shared store)     Easier (stateless)
  Revocation    Easy (delete session)     Hard (wait for expiry)
  Best for      Traditional web apps      SPAs, mobile, microservices

JWT Structure

  HEADER.PAYLOAD.SIGNATURE   (Base64url encoded, dot-separated)
  Access token:  ~15 min TTL   (used for every request)
  Refresh token: ~7 days TTL   (used only to get new access tokens)
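The HEADER.PAYLOAD.SIGNATURE structure can be reproduced with the standard library alone. This is a sketch of HS256 signing and verification only (no registered-claim validation beyond `exp`); real services should use a maintained JWT library:

```python
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe Base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (_b64url(json.dumps(header).encode())
                     + "." + _b64url(json.dumps(payload).encode()))
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)

def verify_jwt(token: str, secret: bytes) -> dict:
    signing_input, _, sig_b64 = token.rpartition(".")
    expected = _b64url(hmac.new(secret, signing_input.encode(),
                                hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig_b64):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(signing_input.split(".")[1]))
    if payload.get("exp", float("inf")) < time.time():
        raise ValueError("expired")
    return payload
```

Note `hmac.compare_digest` for constant-time comparison, and that the payload is only *encoded*, not encrypted: never put secrets in JWT claims.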

OAuth 2.0 (Authorization Code Flow)

  1. Redirect to auth server with client_id
  2. User logs in + consents
  3. Redirect back with authorization CODE
  4. Backend exchanges code + client_secret for TOKEN (server-to-server)

RBAC vs ABAC

               RBAC                         ABAC
  Model        User -> Role -> Permission   Policy on attributes
  Complexity   Simple                       Complex
  Use case     Most apps                    Healthcare, finance, government
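The RBAC model (User -> Role -> Permission) reduces to two lookups. The users, roles, and permission strings below are made up for illustration:

```python
# role -> set of permissions
ROLE_PERMISSIONS = {
    "viewer": {"article:read"},
    "editor": {"article:read", "article:write"},
    "admin":  {"article:read", "article:write", "user:manage"},
}

# user -> set of roles
USER_ROLES = {
    "alice": {"editor"},
    "bob":   {"viewer"},
}

def can(user: str, permission: str) -> bool:
    """User -> Role -> Permission: allowed if any role grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))
```

ABAC replaces the static role tables with policies evaluated over attributes of the user, resource, and request context, which is why it is the more complex model.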

Microservices Auth Pattern

  Client -> [API Gateway: validate JWT] -> [Internal services: trust gateway headers]
  Inter-service: mTLS via service mesh (Istio/Envoy)

6. Search Systems

Inverted Index

  Forward:  Doc -> [terms]
  Inverted: Term -> [Doc IDs]     (this is what search engines use)

Text Processing Pipeline

  Raw text -> Tokenize -> Lowercase -> Remove stop words -> Stem -> Index
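The pipeline above, feeding an inverted index, can be sketched as follows. The stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins (real engines use analyzers like Porter/Snowball):

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "to"}  # illustrative subset

def analyze(text):
    """Tokenize -> lowercase -> remove stop words -> crude stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = []
    for t in tokens:
        if t in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):  # naive suffix stripping
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        terms.append(t)
    return terms

def build_index(docs):
    """Inverted index: term -> set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index
```

Queries go through the same `analyze` step as documents, so "Cats" and "cat" map to the same posting list; that symmetry is the whole point of the pipeline.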

Ranking

  TF-IDF:  score = TF (frequency in doc) x IDF (rarity across corpus)
  BM25:    Improved TF-IDF with saturation + length normalization
  Custom:  BM25 + recency + popularity + personalization
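The TF-IDF line can be made concrete. This sketch uses one common smoothed-IDF variant (the `1 +` terms avoid division by zero and negative scores); exact formulas differ between libraries:

```python
import math

def tf_idf_score(term, doc_terms, corpus):
    """score = TF (frequency in doc) x IDF (rarity across corpus).

    doc_terms: list of terms in one document
    corpus:    list of documents, each a list of terms
    """
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf
```

A term that is frequent in one document but rare in the corpus scores highest; BM25 keeps this shape but caps TF growth (saturation) and penalizes long documents.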

Elasticsearch Key Concepts

  Index = Table,  Document = Row,  Shard = Partition,  Replica = Copy
  Query flow: Fan out to all shards -> merge + sort -> fetch top N docs

Autocomplete

  Options: Trie, ES completion suggester, edge n-grams
  Requirements: < 100ms latency, debounce on client (200ms)
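Of the options above, the trie is the one worth being able to sketch from scratch. This minimal version returns completions in alphabetical order; a production suggester would rank by popularity instead:

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix, limit=10):
        # Walk down to the prefix node, then DFS for completions.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack and len(results) < limit:
            n, word = stack.pop()
            if n.is_word:
                results.append(word)
            # Push children in reverse-sorted order so pops come out sorted.
            for ch in sorted(n.children, reverse=True):
                stack.append((n.children[ch], word + ch))
        return results
```

Lookup cost is O(len(prefix)) to find the subtree plus the cost of collecting `limit` results, which is what makes the sub-100ms latency target achievable.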

Search Architecture

  DB -> CDC/Kafka -> Indexer -> Elasticsearch <- Search API <- Client

7. Interview Quick-Draw Templates

"Design X with high availability"

  1. State target (99.99%)
  2. Eliminate SPOFs (redundancy at every layer)
  3. Choose failover (active-passive or active-active)
  4. Define RPO/RTO
  5. Multi-region if global

"How would you make this fault tolerant?"

  1. Circuit breaker on all external calls
  2. Bulkhead isolation for dependencies
  3. Retry + exponential backoff + jitter
  4. Timeouts at every boundary
  5. Graceful degradation for non-critical features

"How would you monitor this system?"

  1. Structured logs with trace IDs
  2. RED metrics (rate, errors, duration) on every service
  3. Distributed tracing across service boundaries
  4. Alerts on SLO violations
  5. Dashboards in Grafana

"How would you secure this API?"

  1. JWT auth at API gateway
  2. OAuth 2.0 for third-party access
  3. RBAC for permissions
  4. Rate limiting (token bucket + Redis)
  5. HTTPS everywhere, mTLS between services

8. Numbers to Memorize

  Metric                         Value
  3 nines downtime/year          8.76 hours
  4 nines downtime/year          52.6 minutes
  5 nines downtime/year          5.26 minutes
  JWT access token TTL           ~15 minutes
  JWT refresh token TTL          ~7 days
  Autocomplete latency target    < 100ms
  Client debounce for typeahead  150-300ms
  Circuit breaker timeout        10-60 seconds
  Retry backoff formula          base * 2^attempt + jitter
  S3 durability                  11 nines (99.999999999%)

Go back to the README for the full section overview.