9.10 — Quick Revision Cheat Sheet
How to use this material (instructions)
- Read through once the day before your interview.
- Cover the right column of each table and try to recall from the left column.
- Focus on diagrams -- redraw the ASCII diagrams from memory to test understanding.
- Spend about 60 seconds on each of the six topic sections (1-6) for a 6-minute rapid review, then skim sections 7 and 8.
1. Reliability & Availability
Nines Table (MEMORIZE)
| Nines | Uptime | Downtime/Year | Downtime/Month |
|---|---|---|---|
| 3 nines | 99.9% | 8.76 hours | 43.8 min |
| 4 nines | 99.99% | 52.6 minutes | 4.38 min |
| 5 nines | 99.999% | 5.26 minutes | 26.3 sec |
Availability Math
SERIES: A_total = A1 x A2 x A3 (all must work)
PARALLEL: A_total = 1 - (1-A1)(1-A2) (any one works)
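As a quick check on the two formulas and on the nines table, here is a minimal Python sketch. The function names are illustrative, and it assumes component failures are independent, which real systems only approximate.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must work, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices: 1 minus the chance that all fail."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime over a 365-day year, in hours."""
    return (1 - availability) * 365 * 24


three_in_series = series(0.999, 0.999, 0.999)  # lower than any single part
redundant_pair = parallel(0.99, 0.99)          # two 99% nodes behave like ~99.99%
```

Chaining dependencies lowers availability and redundancy raises it, which is why the SPOF checklist adds a redundant copy at every layer.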
SLA / SLO / SLI
SLI (what you MEASURE) -> SLO (what you TARGET) -> SLA (what you PROMISE)
Failover Comparison
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Failover speed | Seconds-minutes | Near-instant |
| Resource use | ~50% | ~100% |
| Complexity | Lower | Higher (conflicts) |
Disaster Recovery
RPO = How much data can we LOSE?
RTO = How fast must we RECOVER?
2. Fault Tolerance
Pattern Quick Reference
| Pattern | Purpose | One-Line Summary |
|---|---|---|
| Circuit Breaker | Prevent cascading failure | CLOSED -> (failures) -> OPEN -> (timeout) -> HALF-OPEN |
| Bulkhead | Isolate failure blast radius | Separate thread/connection pools per dependency |
| Retry + Backoff | Recover from transient failures | wait = min(base * 2^attempt + jitter, max) |
| Timeout | Prevent indefinite blocking | Set at EVERY network boundary |
| Graceful Degradation | Partial service > no service | Disable non-critical features, serve cached data |
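The retry formula in the table can be sketched as follows. The names `backoff_delay` and `retry` and the default values are assumptions for illustration. Jitter here is drawn from [0, base), which is one common choice among several.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """wait = min(base * 2^attempt + jitter, max), with jitter in [0, base)."""
    jitter = rng() * base
    return min(base * 2 ** attempt + jitter, cap)


def retry(fn, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call fn until it succeeds or max_attempts is exhausted.

    Catching Exception keeps the sketch short; real code should retry
    only errors known to be transient (timeouts, 503s, and so on).
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without waiting.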
Circuit Breaker States
CLOSED --[failures > threshold]--> OPEN
OPEN --[after timeout]--> HALF-OPEN
HALF-OPEN --[success]--> CLOSED
HALF-OPEN --[failure]--> OPEN
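The state transitions above can be sketched as a small class. This is a simplified, single-threaded illustration. The class name, parameters, and defaults are assumptions, and production libraries add locking, rolling failure windows, and limits on half-open trial calls.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN before a trial
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF-OPEN"  # timeout elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call reopens immediately; otherwise open once
        # failures exceed the threshold, as in the transitions above.
        if self.state == "HALF-OPEN" or self.failures > self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

While OPEN, the breaker raises without calling the dependency at all, which is what stops a slow downstream service from tying up caller threads.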
SPOF Elimination Checklist
DNS -> Multi-provider
LB -> Redundant pair / cloud LB
App -> Multiple instances + auto-scaling
DB -> Primary + replica + auto-failover
Storage -> RAID + cross-region backups
Network -> Multi-AZ / multi-region
3. Observability
Three Pillars
| Pillar | Data Type | Best For | Tool |
|---|---|---|---|
| Logs | Events (text/JSON) | Debugging specific errors | ELK Stack |
| Metrics | Numbers over time | Dashboards, alerting | Prometheus + Grafana |
| Traces | Spans (request flow) | Finding latency bottlenecks | Jaeger / Zipkin |
RED Method (Services) & USE Method (Infrastructure)
RED (services): Rate, Errors, Duration
USE (infrastructure): Utilization, Saturation, Errors
Distributed Tracing
Trace (1 per request) contains Spans (1 per service hop)
Context propagated via: traceparent HTTP header
Visualized as: Waterfall / timeline diagram
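To make the propagation step concrete, here is a sketch that parses a W3C Trace Context `traceparent` value (`version-traceid-parentid-flags`) and builds the header for the next hop. The helper names are illustrative, and validation is simplified to the version `00` layout.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header fields as a dict, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(header):
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    ctx = parse_traceparent(header)
    if ctx is None:
        raise ValueError("invalid traceparent header")
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{ctx['trace_id']}-{span_id}-{ctx['flags']}"
```

Every hop shares the same trace ID and contributes its own span ID, which is what lets the backend stitch spans into one waterfall.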
Alerting Rule
Alert on SYMPTOMS (error rate > 5%), not CAUSES (disk I/O high)
Every alert must be ACTIONABLE
4. Rate Limiting
Algorithm Comparison
| Algorithm | Memory | Bursts? | Accuracy | Use When |
|---|---|---|---|---|
| Token Bucket | O(1) | Yes | Good | Default choice (most APIs) |
| Leaky Bucket | O(1) | No | Good | Need constant output rate |
| Fixed Window | O(1) | Edge problem | Low | Simplest, acceptable for low stakes |
| Sliding Log | O(n) | No | Perfect | Need exact counting |
| Sliding Counter | O(1) | Partial | High | Best accuracy/memory trade-off |
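A minimal in-process sketch of the token bucket, the default choice in the table. The class and parameter names are illustrative, and a distributed version would keep the same two fields (tokens, last refill time) in a shared store.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost=1.0):
        """Refill lazily based on elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Only two numbers are stored per client, which is the O(1) memory in the table. Bursts up to `capacity` pass, and sustained traffic is held to `refill_rate`.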
Distributed Rate Limiting
Standard: Redis + Lua script (atomic token bucket)
Fallback: Local rate limiting if Redis is down
Response Headers
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1620000060
Retry-After: 30 (on 429 response)
5. Auth & Authorization
AuthN vs AuthZ
AuthN: "WHO are you?" -> 401 on failure
AuthZ: "WHAT can you do?" -> 403 on failure
Session vs JWT
| Aspect | Session | JWT |
|---|---|---|
| State | Server-side (Redis/DB) | Client-side (token) |
| Scalability | Harder (shared store) | Easier (stateless) |
| Revocation | Easy (delete session) | Hard (wait for expiry) |
| Best for | Traditional web apps | SPAs, mobile, microservices |
JWT Structure
HEADER.PAYLOAD.SIGNATURE (each part Base64URL-encoded, dot-separated)
Access token: ~15 min TTL (used for every request)
Refresh token: ~7 days TTL (used only to get new access tokens)
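To show what the three dot-separated parts are, here is a standard-library sketch that signs and verifies an HS256 token. It is for understanding only: function names are illustrative, only `exp` is checked, and real services should use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64URL without padding, as used for each JWT segment."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"


def verify_jwt(token: str, secret: bytes, now=None) -> dict:
    """Return the payload if the signature and expiry check out, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, sig_b64 = parts
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    payload = json.loads(b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if payload.get("exp", float("inf")) <= current:
        raise ValueError("token expired")
    return payload
```

The payload is encoded, not encrypted, so anyone holding the token can read it. The signature only proves it was not altered.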
OAuth 2.0 (Authorization Code Flow)
1. Redirect to auth server with client_id
2. User logs in + consents
3. Redirect back with authorization CODE
4. Backend exchanges code + client_secret for TOKEN (server-to-server)
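The four steps above can be sketched with the parameters the flow uses. The endpoint URLs and client values below are hypothetical placeholders, and the sketch only builds and parses the messages without making any network calls. The `state` parameter is not in the list above but is commonly added to guard against CSRF.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoints, for illustration only.
AUTHORIZE_URL = "https://auth.example.com/authorize"
TOKEN_URL = "https://auth.example.com/token"


def build_authorize_url(client_id, redirect_uri, scope, state):
    """Step 1: the URL the browser is redirected to."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{AUTHORIZE_URL}?{urlencode(params)}"


def extract_code(callback_url, expected_state):
    """Step 3: read the authorization code from the redirect back."""
    query = parse_qs(urlparse(callback_url).query)
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: possible CSRF")
    return query["code"][0]


def build_token_request(code, client_id, client_secret, redirect_uri):
    """Step 4: form body the backend would POST to TOKEN_URL."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    }
```

The `client_secret` appears only in step 4, which runs server-to-server, so it never reaches the browser.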
RBAC vs ABAC
| Aspect | RBAC | ABAC |
|---|---|---|
| Model | User -> Role -> Permission | Policy on attributes |
| Complexity | Simple | Complex |
| Use case | Most apps | Healthcare, finance, government |
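A small sketch of the difference between the two models. The roles, users, permission strings, and the sample policy are all made up for illustration.

```python
# RBAC: User -> Role -> Permission, resolved with two lookups.
ROLE_PERMISSIONS = {
    "viewer": {"report:read"},
    "editor": {"report:read", "report:write"},
}
USER_ROLES = {"alice": {"editor"}, "bob": {"viewer"}}


def rbac_allowed(user, permission):
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


# ABAC: a policy function evaluated over attributes of subject and resource.
def abac_allowed(subject, resource, action):
    """Example policy: clinicians may read records in their own department."""
    return (
        action == "read"
        and subject.get("role") == "clinician"
        and subject.get("department") == resource.get("department")
    )
```

RBAC answers with static lookups. ABAC can express rules such as "same department" that would otherwise need one role per department.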
Microservices Auth Pattern
Client -> [API Gateway: validate JWT] -> [Internal services: trust gateway headers]
Inter-service: mTLS via service mesh (Istio/Envoy)
6. Search Systems
Inverted Index
Forward: Doc -> [terms]
Inverted: Term -> [Doc IDs] (this is what search engines use)
Text Processing Pipeline
Raw text -> Tokenize -> Lowercase -> Remove stop words -> Stem -> Index
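The pipeline and the inverted index can be sketched together. The stop-word list and the suffix-stripping `stem` function are deliberately toy stand-ins, since real engines use full stemmers or lemmatizers, and all names here are illustrative.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def stem(token):
    """Toy suffix stripper, not a real stemmer such as Porter."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize -> lowercase -> remove stop words -> stem."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [t.lower() for t in tokens]
    return [stem(t) for t in lowered if t not in STOP_WORDS]


def build_index(docs):
    """Inverted index: term -> set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND query: docs containing every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()
```

The query goes through the same `analyze` step as the documents, so "cats" at query time matches "cat" in the index.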
Ranking
TF-IDF: score = TF (frequency in doc) x IDF (rarity across corpus)
BM25: Improved TF-IDF with saturation + length normalization
Custom: BM25 + recency + popularity + personalization
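A compact sketch of BM25 scoring, to show where saturation (`k1`) and length normalization (`b`) enter. It uses the commonly cited defaults `k1=1.2`, `b=0.75` and the IDF variant with `+1` inside the logarithm. The function name and input shape are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each doc for the query. `docs` maps doc_id -> list of terms."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    # Document frequency: number of docs containing each term.
    df = Counter(term for terms in docs.values() for term in set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        length_norm = 1 - b + b * len(terms) / avg_len  # penalizes long docs
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # tf saturates: each extra occurrence adds less than the last.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores
```

A custom ranker would typically combine this score with recency or popularity signals, as in the last line above.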
Elasticsearch Key Concepts
Index = Table, Document = Row, Shard = Partition, Replica = Copy
Query flow: Fan out to all shards -> merge + sort -> fetch top N docs
Autocomplete
Options: Trie, ES completion suggester, edge n-grams
Requirements: < 100ms latency, debounce on client (~200ms; 150-300ms is typical)
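Of the options listed, the trie is the one most often drawn on a whiteboard. A minimal sketch follows, with illustrative names. It returns completions in alphabetical order, whereas a production system would rank them by popularity.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push in reverse order so the smallest character is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results
```

Walking to the prefix node costs time proportional to the prefix length, independent of how many words are stored, which helps with the latency target.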
Search Architecture
DB -> CDC/Kafka -> Indexer -> Elasticsearch <- Search API <- Client
7. Interview Quick-Draw Templates
"Design X with high availability"
1. State target (99.99%)
2. Eliminate SPOFs (redundancy at every layer)
3. Choose failover (active-passive or active-active)
4. Define RPO/RTO
5. Multi-region if global
"How would you make this fault tolerant?"
1. Circuit breaker on all external calls
2. Bulkhead isolation for dependencies
3. Retry + exponential backoff + jitter
4. Timeouts at every boundary
5. Graceful degradation for non-critical features
"How would you monitor this system?"
1. Structured logs with trace IDs
2. RED metrics (rate, errors, duration) on every service
3. Distributed tracing across service boundaries
4. Alerts on SLO violations
5. Dashboards in Grafana
"How would you secure this API?"
1. JWT auth at API gateway
2. OAuth 2.0 for third-party access
3. RBAC for permissions
4. Rate limiting (token bucket + Redis)
5. HTTPS everywhere, mTLS between services
8. Numbers to Memorize
| Metric | Value |
|---|---|
| 3 nines downtime/year | 8.76 hours |
| 4 nines downtime/year | 52.6 minutes |
| 5 nines downtime/year | 5.26 minutes |
| JWT access token TTL | ~15 minutes |
| JWT refresh token TTL | ~7 days |
| Autocomplete latency target | < 100ms |
| Client debounce for typeahead | 150-300ms |
| Circuit breaker timeout | 10-60 seconds |
| Retry backoff formula | base * 2^attempt + jitter |
| S3 durability | 11 nines (99.999999999%) |
Go back to the README for the full section overview.