9.7 — System Design Foundations: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

Skim top-to-bottom in one pass before quizzes or interviews.
If a row feels fuzzy — open the matching lesson: README.md → 9.7.a…9.7.e.
Deep practice — 9.7-Exercise-Questions.md.
Polish phrasing — 9.7-Interview-Questions.md.

HLD vs LLD (master table)

Aspect	HLD	LLD
Scope	Entire system	Single module/service
Decisions	Services, DBs, caches, protocols	Classes, methods, patterns
Diagrams	Architecture, data flow	UML class and sequence diagrams
Trade-offs	CAP, latency vs consistency	Coupling vs cohesion
Interview	"Design Twitter at scale"	"Design a Parking Lot"

Building Blocks of HLD

Block	Purpose	Examples
Services	Business logic units	Auth Service, Feed Service
Databases	Persistent storage	PostgreSQL, Cassandra, DynamoDB
Caches	Fast reads, reduce DB load	Redis, Memcached
Queues	Async processing, decoupling	Kafka, SQS, RabbitMQ
Load Balancers	Distribute traffic	Nginx, AWS ALB
CDN	Edge-cache static assets	CloudFront, Cloudflare
API Gateway	Routing, auth, rate limiting	Kong, AWS API Gateway

Requirements Checklist (5 min)

  1. What are the core features?          (3-5 bullet points)
  2. Who are the users?                   (mobile, web, API)
  3. What scale?                          (1K, 1M, 1B users)
  4. Read/write ratio?                    (100:1, 10:1, 1:1)
  5. Quality attributes?                  (latency, availability, consistency)
  6. What is out of scope?                (explicitly exclude)

CAP Theorem

  CP = Consistent + Partition-tolerant   (rejects requests during partition)
  AP = Available + Partition-tolerant    (serves stale data during partition)
  CA = Consistent + Available            (not practical in distributed systems: partitions are inevitable)

Choice	Use When	Examples
CP	Banking, inventory, leader election	HBase, ZooKeeper
AP	Social feeds, shopping carts, analytics	Cassandra, DynamoDB
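
A toy sketch of the CP vs AP choice during a partition. The `Replica` class, its `mode` labels, and the `Unavailable` exception are hypothetical names for illustration; real systems decide this through quorum and replication settings rather than a flag.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


@dataclass
class Replica:
    mode: str                  # "CP" or "AP" (toy labels for this sketch)
    value: str = "v1"          # last value this replica has seen
    partitioned: bool = False  # True when cut off from the leader/quorum

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk returning stale data.
            raise Unavailable("cannot confirm latest value during partition")
        # AP (or a healthy CP replica): answer with whatever is stored locally.
        return self.value


cp = Replica(mode="CP", partitioned=True)
ap = Replica(mode="AP", partitioned=True)

print(ap.read())  # answers with its local, possibly stale, copy
try:
    cp.read()
except Unavailable as exc:
    print("CP replica rejected the read:", exc)
```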

Availability Targets

Nines	Availability	Downtime/Year
2	99%	3.65 days
3	99.9%	8.76 hours
4	99.99%	52.6 minutes
5	99.999%	5.26 minutes
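
The downtime column can be rederived from the availability percentage. A minimal sketch, assuming a 365-day year; the function name is illustrative.

```python
def downtime_seconds_per_year(availability_percent: float,
                              days_per_year: float = 365) -> float:
    """Seconds per year the system may be down at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days_per_year * 86_400


for pct in (99, 99.9, 99.99, 99.999):
    secs = downtime_seconds_per_year(pct)
    print(f"{pct}% -> {secs / 3600:.2f} hours/year ({secs / 60:.1f} minutes)")
```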

Estimation Pipeline

  Users → DAU → QPS → Storage → Bandwidth → Cache

Quick Math Constants

Constant	Value
Seconds/day	~86,400 (~10^5)
Seconds/month	~2.6M (30 days)
Seconds/year	~31.5M (~3 x 10^7)
1M req/day	~12 QPS
1B req/day	~12K QPS
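
A quick check of where these constants come from, assuming a 30-day month and a 365-day year. The helper name `avg_qps` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60             # 86,400
SECONDS_PER_MONTH = 30 * SECONDS_PER_DAY   # 2,592,000
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY   # 31,536,000


def avg_qps(requests_per_day: float) -> float:
    """Average queries per second if traffic were spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


print(round(avg_qps(1_000_000), 1))    # 11.6
print(round(avg_qps(1_000_000_000)))   # 11574
```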

Powers of 2

Power	Value	Name
2^10	~1K	Kilo
2^20	~1M	Mega
2^30	~1G	Giga
2^40	~1T	Tera

Estimation Template

  DAU     = MAU * 0.2 (typical social app)
  QPS     = DAU * actions_per_user / 86,400
  Peak    = QPS * 3 (rule-of-thumb multiplier; adjust for spiky traffic)
  Storage = daily_writes * bytes_per_write * 365 (yearly)
  Cache   = 20% of daily unique data (80-20 rule)
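
The template above, applied to one set of made-up inputs. Every number in the call is a hypothetical assumption, and `daily_unique_bytes` is an illustrative stand-in for "daily unique data"; in an interview you would state your own guesses out loud.

```python
SECONDS_PER_DAY = 86_400


def estimate(mau, actions_per_user, daily_writes, bytes_per_write,
             daily_unique_bytes, dau_ratio=0.2, peak_factor=3, cache_ratio=0.2):
    """Apply the template step by step and return each intermediate figure."""
    dau = mau * dau_ratio
    avg_qps = dau * actions_per_user / SECONDS_PER_DAY
    return {
        "dau": dau,
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_bytes_per_year": daily_writes * bytes_per_write * 365,
        "cache_bytes": daily_unique_bytes * cache_ratio,
    }


# Hypothetical inputs: 50M MAU, 10 actions per user per day,
# 20M writes per day at ~1 KB each, ~20 GB of distinct data read per day.
result = estimate(
    mau=50_000_000,
    actions_per_user=10,
    daily_writes=20_000_000,
    bytes_per_write=1_000,
    daily_unique_bytes=20 * 10**9,
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

With these inputs the result is roughly 10M DAU, about 1.2K average QPS, about 3.5K peak QPS, 7.3 TB of new storage per year, and a 4 GB cache.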

Latency Numbers

Operation	Latency
L1 cache	~1 ns
RAM	~100 ns
SSD random read	~100 us
HDD random read	~10 ms
Same-datacenter RTT	~0.5 ms
Cross-continent RTT	~150 ms
Redis GET	~1 ms (including network round trip)
PostgreSQL query	~1-5 ms (simple indexed query)

Decomposition Heuristics

Heuristic	Meaning
Single responsibility	One service, one business capability
Data ownership	Each service owns its database
Rate of change	Fast-changing modules separated from stable ones
Scaling needs	Different scaling profiles = different services
Bounded context (DDD)	Group by business domain

Communication Patterns

Pattern	When	Example
Sync (REST/gRPC)	Need immediate answer	Auth check, data fetch
Async (Queue/Event)	Caller doesn't wait	Email, video transcode, analytics
Hybrid	Critical path sync + side effects async	Post tweet (sync) + fan-out (async)
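
A minimal sketch of the hybrid pattern: the write happens synchronously, the fan-out is only enqueued. The in-process `queue.Queue` stands in for a broker such as Kafka or SQS, and all names here (`post_tweet`, `fanout_worker`, `delivered`) are hypothetical.

```python
import queue
import threading

tweets = {}                    # stand-in for the primary database
fanout_queue = queue.Queue()   # stand-in for Kafka/SQS/RabbitMQ
delivered = []                 # records side effects done by the worker


def post_tweet(tweet_id, text):
    """Critical path: runs synchronously, the caller waits for the result."""
    tweets[tweet_id] = text
    fanout_queue.put(tweet_id)  # side effect is only enqueued, not executed
    return {"status": "created", "id": tweet_id}


def fanout_worker():
    """Background consumer: handles the slow work off the request path."""
    while True:
        tweet_id = fanout_queue.get()
        if tweet_id is None:    # sentinel: shut the worker down
            fanout_queue.task_done()
            break
        delivered.append(tweet_id)
        fanout_queue.task_done()


worker = threading.Thread(target=fanout_worker, daemon=True)
worker.start()

response = post_tweet(1, "hello")  # returns before fan-out is guaranteed done
fanout_queue.join()                # demo only: wait until the worker catches up
fanout_queue.put(None)
worker.join()
print(response, delivered)
```

The caller's latency covers only the database write and the enqueue; a slow or failing fan-out does not block the response.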

Database Selection

Need	Choose
Transactions, joins, ACID	SQL (PostgreSQL, MySQL)
High write throughput, time-series	Cassandra
Flexible schema, documents	MongoDB, DynamoDB
Fast lookups by key	Redis, DynamoDB
Full-text search	Elasticsearch
Relationships, graphs	Neo4j

Caching Strategies

Strategy	How	When
Cache-aside	App checks cache; fills on miss	Most common; general purpose
Write-through	Write to cache + DB together	Cache must stay in sync with DB
Write-behind	Write to cache; async flush to DB	High write throughput (unflushed writes are lost if the cache fails)
TTL-based	Entries expire after N seconds	Acceptable staleness
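
A minimal in-process sketch of cache-aside combined with a TTL. The `CacheAside` class and `load_user` loader are hypothetical; in practice the dictionary would typically be Redis or Memcached.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                  # cache hit
        value = self._load(key)              # miss or expired: go to the DB
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a DB write so the next read refills the entry."""
        self._entries.pop(key, None)


db = {"user:1": "Ada"}
db_reads = []


def load_user(key):
    db_reads.append(key)
    return db[key]


cache = CacheAside(load_user, ttl_seconds=60)
cache.get("user:1")    # miss: reads the DB and fills the cache
cache.get("user:1")    # hit: served from memory
print(len(db_reads))   # 1
```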

Fan-Out Models (Timeline)

Model	Writes	Reads	Best For
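Push (fan-out on write)	Expensive (write to all followers)	Cheap (pre-computed)	Accounts with modest follower counts (most users)
Pull (fan-out on read)	Cheap	Expensive (merge at read)	Accounts with very large follower counts (celebrities)
Hybrid	Push for normal accounts; skip push for celebrities	Pre-computed feed merged with celebrity posts at read time	Twitter-like systems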
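
A toy in-memory sketch of the hybrid model. The threshold, the data structures, and the use of increasing integer post IDs as a stand-in for timestamps are all assumptions for illustration.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; real systems use far larger values

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
timelines = defaultdict(list)        # user -> pre-computed feed (push model)
posts_by_author = defaultdict(list)  # author -> their own posts (pull model)


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def publish(author, post_id):
    posts_by_author[author].append(post_id)
    if not is_celebrity(author):
        # Fan-out on write: copy the post into every follower's feed.
        for user in followers[author]:
            timelines[user].append(post_id)


def read_timeline(user):
    feed = set(timelines[user])
    # Fan-out on read: merge in celebrity posts only when the feed is requested.
    for author in following[user]:
        if is_celebrity(author):
            feed.update(posts_by_author[author])
    return sorted(feed, reverse=True)  # newest (highest id) first


for fan in ("a", "b", "c"):
    follow(fan, "star")
follow("a", "bob")

publish("bob", 1)    # pushed into a's timeline
publish("star", 2)   # not pushed; merged at read time
print(read_timeline("a"))  # [2, 1]
```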

Interview 5-Phase Framework

  Phase 1: Requirements        5 min
  Phase 2: Estimation          5 min
  Phase 3: High-Level Design  15 min
  Phase 4: Deep Dive          15 min
  Phase 5: Wrap Up             5 min

Common Mistakes (Avoid These)

Mistake	Fix
Skip requirements	Always ask 5+ questions first
Over-engineer	Match solution to scale
No trade-offs	Justify every decision
Monologue	Check in with interviewer
Deep dive too early	Draw full picture first
Ignore failures	Mention redundancy, retries
Skip estimation	Show your math

Curveball Response Framework

  1. ACKNOWLEDGE    "That's a great point."
  2. STATE IMPACT   "If X happens, the impact would be..."
  3. PROPOSE FIX    "To handle this, I would..."
  4. TRADE-OFF      "The trade-off is..."

SLA / SLO / SLI

Term	What	Example
SLI	Metric you measure	P99 latency = 350ms
SLO	Target for that metric	P99 latency < 500ms
SLA	Contract with consequences	Credit if P99 > 500ms for a month
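
A small sketch of how the three terms relate in code: the SLI is computed from measurements, the SLO is the threshold it is compared against. The latency samples are made up, and nearest-rank is only one of several percentile definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


SLO_P99_MS = 500                        # target from the SLO row above

latencies_ms = [120] * 98 + [350, 900]  # made-up measurements
sli_p99 = percentile(latencies_ms, 99)  # the SLI: what was actually measured
slo_met = sli_p99 < SLO_P99_MS

print(f"P99 = {sli_p99} ms, SLO met: {slo_met}")
```

An SLA would sit on top of this as a contract: for example, a credit owed to the customer if the SLO is missed over a billing period.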

Whiteboard Layout

  ┌───────────────────┬──────────────────────┐
  │ REQUIREMENTS      │ ESTIMATION           │
  │ (top-left)        │ (top-right)          │
  ├───────────────────┴──────────────────────┤
  │                                          │
  │         ARCHITECTURE DIAGRAM             │
  │              (center)                    │
  │                                          │
  ├──────────────────┬───────────────────────┤
  │ API ENDPOINTS    │ TRADE-OFFS            │
  │ (bottom-left)    │ (bottom-right)        │
  └──────────────────┴───────────────────────┘
