Episode 9 — System Design / 9.7 — System Design Foundations
9.7.e — Interview Approach
In one sentence: A system design interview is not about memorizing solutions — it is a structured conversation where you demonstrate how you think about building large-scale systems, and the approach matters more than the answer.
Navigation: ← 9.7.d — Capacity Estimation · 9.7 Overview →
Table of Contents
- 1. The 45-Minute Framework
- 2. Phase 1: Requirements (5 Minutes)
- 3. Phase 2: Estimation (5 Minutes)
- 4. Phase 3: High-Level Design (15 Minutes)
- 5. Phase 4: Deep Dive (15 Minutes)
- 6. Phase 5: Wrap Up (5 Minutes)
- 7. How to Draw Diagrams
- 8. Handling "What About X?" Questions
- 9. Common Mistakes
- 10. Signal Checklist — What Interviewers Score
- 11. Example Walkthrough: "Design a URL Shortener"
- 12. Key Takeaways
- 13. Explain-It Challenge
1. The 45-Minute Framework
┌──────────────┬────────────┬───────────────────┬───────────────────┬─────────┐
│ Requirements │ Estimation │ High-Level Design │     Deep Dive     │ Wrap Up │
│   (5 min)    │  (5 min)   │     (15 min)      │     (15 min)      │ (5 min) │
└──────────────┴────────────┴───────────────────┴───────────────────┴─────────┘
0 min          5            10                  25                  40   45 min
| Phase | Time | What you do |
|---|---|---|
| Phase 1: Requirements | 5 min | Clarify scope, features, constraints |
| Phase 2: Estimation | 5 min | QPS, storage, bandwidth |
| Phase 3: High-Level Design | 15 min | Architecture diagram, data flow |
| Phase 4: Deep Dive | 15 min | Interviewer-guided component deep dive |
| Phase 5: Wrap Up | 5 min | Trade-offs, bottlenecks, improvements |
2. Phase 1: Requirements (5 Minutes)
What to Do
Ask clarifying questions. Do NOT start drawing yet.
Checklist
┌─────────────────────────────────────────────────────────┐
│ REQUIREMENTS SCRIPT │
│ │
│ "Before I start designing, let me clarify the scope." │
│ │
│ 1. "What are the core features we need to support?" │
│ → Write 3-5 bullet points │
│ │
│ 2. "Who are the users and how do they access the system?" │
│ → Mobile, web, API? Read-heavy or write-heavy? │
│ │
│ 3. "What scale should I design for?" │
│ → 1M users or 1B users? This changes everything. │
│ │
│ 4. "What quality attributes matter most?" │
│ → Latency? Availability? Consistency? │
│ │
│ 5. "What is explicitly out of scope?" │
│ → This focuses your 45 minutes. │
│ │
│ Write requirements on the board. Keep them visible. │
└─────────────────────────────────────────────────────────┘
Example Output (on the board)
REQUIREMENTS — Design Twitter
─────────────────────────────
Functional (P0):
• Post tweet (text only, 280 chars)
• View home timeline
• Follow / unfollow users
Non-functional:
• 500M MAU, 200M DAU
• Read-heavy (100:1)
• Timeline latency < 200ms
• 99.9% availability
• Eventual consistency acceptable
Out of scope: DMs, search, trending, ads
3. Phase 2: Estimation (5 Minutes)
What to Do
Quick back-of-envelope math. Show your work; talk through it out loud.
Template
ESTIMATION
──────────
DAU: 200M
Write QPS: 200M * 2 tweets / 86,400 ≈ 4,600 (peak at ~3x: ~14K)
Read QPS: 4,600 * 100 = 460,000 (peak at ~3x: ~1.4M)
Storage: 400M tweets * 500 bytes = 200 GB/day ≈ 73 TB/year
Bandwidth: 460K * 25 KB (timeline) ≈ 11.5 GB/s (CDN!)
Cache: 20% of daily data = 0.2 * 200 GB ≈ 40 GB in Redis
Tips
- Round aggressively. 86,400 seconds/day can be rounded to 100K for quick math.
- State your assumptions. "I am assuming 40% of MAU are daily active (200M of 500M)."
- Highlight insights. "460K read QPS means we absolutely need aggressive caching."
- Do not spend more than 5 minutes here. Precision is not the goal; order of magnitude is.
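The template arithmetic can be checked with a few lines of Python. This is a minimal sketch, not a prescribed tool: the function name `estimate`, its parameters, and the peak multiplier of 3 (inferred from the ~14K peak figure in the template) are assumptions made for illustration, and all sizes use decimal units.

```python
SECONDS_PER_DAY = 86_400

def estimate(dau, writes_per_user_per_day, read_write_ratio,
             bytes_per_write, bytes_per_read, peak_factor=3):
    """Return rough capacity numbers for a read-heavy service (decimal units)."""
    writes_per_day = dau * writes_per_user_per_day
    write_qps = writes_per_day / SECONDS_PER_DAY
    read_qps = write_qps * read_write_ratio
    storage_bytes_per_day = writes_per_day * bytes_per_write
    return {
        "write_qps": write_qps,
        "peak_write_qps": write_qps * peak_factor,   # assumed 3x peak
        "read_qps": read_qps,
        "peak_read_qps": read_qps * peak_factor,
        "storage_gb_per_day": storage_bytes_per_day / 1e9,
        "storage_tb_per_year": storage_bytes_per_day * 365 / 1e12,
        "read_bandwidth_gb_per_s": read_qps * bytes_per_read / 1e9,
    }

# Inputs taken from the Twitter template: 200M DAU, 2 tweets/user/day,
# 100:1 read/write ratio, 500-byte tweets, 25 KB timeline responses.
twitter = estimate(dau=200_000_000, writes_per_user_per_day=2,
                   read_write_ratio=100, bytes_per_write=500,
                   bytes_per_read=25_000)
for name, value in twitter.items():
    print(f"{name:>26}: {value:,.1f}")
```

In the interview itself you would still do this in your head with rounded numbers; the script is only useful for checking your practice answers afterwards.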
4. Phase 3: High-Level Design (15 Minutes)
This is the main event. Draw the architecture diagram and walk through the data flow.
Step-by-Step
Step 1: Start with the client
Draw a box on the left: "Client (Mobile / Web)"
Step 2: Add the entry point
Load Balancer → API Gateway
Step 3: Identify core services
Based on your requirements:
• Tweet Service (create, delete, get tweets)
• Timeline Service (generate and serve timelines)
• User Service (profiles, follow graph)
Step 4: Add data stores
• Database for each service
• Cache layer (Redis)
• Object storage for media (if applicable)
Step 5: Add async components
• Message queue for fan-out
• Workers for async processing
Step 6: Walk through the request path
Number each arrow: 1 → 2 → 3 → ...
Explain out loud: "When a user posts a tweet..."
Example Diagram
┌────────┐     ┌─────┐     ┌──────────┐     ┌──────────────┐
│ Client │────►│ LB  │────►│   API    │────►│    Tweet     │
│        │     │     │     │ Gateway  │     │   Service    │
└────────┘     └─────┘     └────┬─────┘     └──────┬───────┘
                                │                  │
                                │             ┌────▼─────┐
                                │             │ Tweet DB │
                                │             │(Postgres)│
                                │             └──────────┘
                                │
                                │           ┌──────────────┐
                                └──────────►│   Timeline   │
                                            │   Service    │
                                            └──────┬───────┘
                                                   │
                                     ┌─────────────┼──────────┐
                                     ▼             ▼          ▼
                                ┌──────────┐  ┌─────────┐ ┌────────┐
                                │  Redis   │  │ Fan-out │ │  User  │
                                │  Cache   │  │  Queue  │ │Service │
                                │(timeline)│  │ (Kafka) │ └────┬───┘
                                └──────────┘  └────┬────┘      │
                                                   ▼      ┌────▼───┐
                                              ┌────────┐  │User DB │
                                              │Fan-out │  └────────┘
                                              │Workers │
                                              └────────┘
Narrate as You Draw
Talk continuously while drawing. Example narrative:
"When a user posts a tweet, the request goes through the load balancer to the API Gateway, which routes it to the Tweet Service. The Tweet Service writes the tweet to the Tweet DB and publishes an event to Kafka. Fan-out workers consume the event, look up the poster's followers from the User Service, and push the tweet into each follower's timeline cache in Redis. When a user reads their timeline, the Timeline Service reads directly from Redis, which gives us sub-10ms latency."
5. Phase 4: Deep Dive (15 Minutes)
The interviewer will typically pick 1-2 components and ask you to go deeper. Be prepared for any of these.
Common Deep-Dive Topics
| Topic | What to Discuss |
|---|---|
| Database schema | Table design, primary keys, indexes, sharding key |
| Caching strategy | Cache-aside vs write-through, TTL, invalidation, consistency |
| Sharding | Shard key selection, consistent hashing, rebalancing |
| Fan-out strategy | Push vs pull vs hybrid, handling celebrity accounts |
| API design | Endpoint design, pagination, rate limiting |
| Failure handling | What happens when Service X goes down? Circuit breakers, retries |
| Consistency | How do you handle conflicts? Last-write-wins? CRDTs? |
| Rate limiting | Token bucket, sliding window, distributed rate limiting |
| Monitoring | Metrics, alerts, dashboards, tracing |
Example Deep Dive: Fan-Out Strategy
PUSH MODEL (Fan-Out on Write)
─────────────────────────────
When Alice tweets:
1. Look up Alice's followers (e.g., 10K followers)
2. Write Alice's tweet to each follower's timeline cache
3. When a follower reads their timeline, it's pre-computed
Pros: Fast reads (O(1) — just read from cache)
Cons: Slow writes for celebrities (1M followers = 1M writes)
PULL MODEL (Fan-Out on Read)
────────────────────────────
When Bob reads his timeline:
1. Look up Bob's followed accounts
2. Fetch recent tweets from each account
3. Merge and sort
Pros: No write amplification
Cons: Slow reads (many DB queries per timeline request)
HYBRID MODEL (reportedly the approach Twitter has used)
────────────────────────────────────────────────────────
• Regular users (< 10K followers, an illustrative cutoff): PUSH model
• Celebrities (≥ 10K followers): PULL model
• Timeline = pre-cached (push) + merge celebrity tweets (pull)
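The three models above can be sketched in memory to show where the push and pull paths meet. This is a toy illustration under stated assumptions: the class `HybridTimeline`, its method names, and the dictionaries standing in for the follow graph, tweet store, and Redis timeline cache are all hypothetical, and the celebrity cutoff is treated as "at or above the threshold".

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # illustrative cutoff from the text

class HybridTimeline:
    """Push for regular authors, pull for celebrity authors."""

    def __init__(self, threshold=CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list) # author -> [(ts, text)]
        self.timeline_cache = defaultdict(list)   # user -> [(ts, author, text)]

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.threshold

    def post(self, author, ts, text):
        self.tweets_by_author[author].append((ts, text))
        if not self.is_celebrity(author):
            # Push path: one cache write per follower (write amplification).
            for follower in self.followers[author]:
                self.timeline_cache[follower].append((ts, author, text))

    def read_timeline(self, user, limit=10):
        # Start from the pre-computed entries; a set drops duplicates if an
        # author crossed the threshold after some tweets were already pushed.
        entries = set(self.timeline_cache[user])
        # Pull path: fetch celebrity tweets at read time and merge.
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update((ts, author, text)
                               for ts, text in self.tweets_by_author[author])
        return sorted(entries, reverse=True)[:limit]

# Tiny threshold so the example stays readable.
tl = HybridTimeline(threshold=3)
for fan in ("bob", "carol", "dave"):
    tl.follow(fan, "celeb")
tl.follow("bob", "alice")
tl.post("alice", 1, "hello")      # pushed into bob's cache
tl.post("celeb", 2, "big news")   # stored once, merged at read time
print(tl.read_timeline("bob"))    # [(2, 'celeb', 'big news'), (1, 'alice', 'hello')]
```

The point to narrate in an interview is the cost shift: a celebrity post costs one write instead of one per follower, and each reader pays a small merge at read time.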
How to Handle the Deep Dive
- Acknowledge the question. "Great question, let me think about that."
- State the trade-offs. "There are two approaches here..."
- Pick one and justify. "I would go with X because... given our requirements."
- Acknowledge limitations. "The downside is... but we can mitigate by..."
6. Phase 5: Wrap Up (5 Minutes)
What to Cover
┌─────────────────────────────────────────────────────────┐
│ WRAP UP CHECKLIST │
│ │
│ 1. BOTTLENECKS │
│ "The main bottleneck in this design is..." │
│ (e.g., fan-out for celebrities, DB write throughput) │
│ │
│ 2. SINGLE POINTS OF FAILURE │
│ "If I had more time, I would add redundancy for..." │
│ (e.g., multi-region, DB failover) │
│ │
│ 3. FUTURE IMPROVEMENTS │
│ "To handle 10x growth, we would need..." │
│ (e.g., sharding, more cache nodes, CDN expansion) │
│ │
│ 4. MONITORING │
│ "I would monitor: latency P99, error rate, cache hit │
│ ratio, queue depth, DB connection pool usage" │
│ │
└─────────────────────────────────────────────────────────┘
7. How to Draw Diagrams
Physical Whiteboard Tips
| Tip | Why |
|---|---|
| Start in the center-left | Leave room for components on the right and below |
| Draw big | Small diagrams are hard to read and hard to modify |
| Use consistent shapes | Rectangles = services, cylinders = databases, diamonds = decisions |
| Label EVERY arrow | "HTTP," "gRPC," "Kafka event," "SQL query" |
| Number the flow | 1 → 2 → 3 makes the data path clear |
| Leave space between boxes | You WILL need to add components mid-discussion |
| Use different colors | If available: blue for services, green for data stores, red for bottlenecks |
Virtual Whiteboard Tips (Excalidraw, Miro, etc.)
| Tip | Why |
|---|---|
| Pre-create common shapes | Have service boxes and DB cylinders ready to copy-paste |
| Use a grid | Keep things aligned |
| Practice beforehand | Know the tool's keyboard shortcuts |
| Share your screen early | Let the interviewer see your drawing in real time |
Diagram Evolution During Interview
START (Phase 3, minute 10):
Client → LB → API → Service → DB
AFTER DEEP DIVE (Phase 4, minute 35):
Client → CDN (static)
  ↓
Client → LB → API Gateway
                   │
          ┌────────┼────────┐
          ▼        ▼        ▼
        Tweet  Timeline    User
       Service  Service  Service
          │        │        │
      Tweet DB   Redis   User DB
          │     (cache)
        Kafka ──→ Fan-out Workers
It is normal and expected for your diagram to evolve. Start simple, add complexity as needed.
8. Handling "What About X?" Questions
Interviewers will throw curveballs to test how you think on your feet.
Common "What About X?" Questions and How to Handle Them
| Question | How to Respond |
|---|---|
| "What if this service goes down?" | "I would add redundancy: multiple instances behind a load balancer, health checks, and circuit breakers to prevent cascading failures." |
| "What if the database can't handle the load?" | "At this QPS, I would add read replicas for reads, or shard the database by [shard key]. For writes, a queue can absorb spikes." |
| "What about data consistency?" | "Given our requirements, I chose eventual consistency. For this specific case, I would use [approach]. If strong consistency is needed, I would use [alternative]." |
| "What if a celebrity with 100M followers tweets?" | "Fan-out on write would be too slow. I would use a hybrid model: push for regular users, pull for celebrity tweets at read time." |
| "How do you handle duplicate requests?" | "I would make the API idempotent using a client-generated idempotency key. The server checks if this key was already processed before executing." |
| "What about security?" | "Authentication via JWT at the API Gateway, authorization per-service, HTTPS everywhere, rate limiting to prevent abuse, input validation to prevent injection." |
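The idempotency-key answer in the table can be made concrete with a small sketch. Everything here is hypothetical and in-memory: the class `TweetAPI`, the method `post_tweet`, and the response shape are illustrative names, and a real service would need the check-and-store step to be atomic (for example a unique constraint on the key, or Redis `SET` with the `NX` option) and would expire old keys.

```python
import uuid

class TweetAPI:
    """Toy handler that deduplicates retried requests by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.tweets = []      # stand-in for the tweet table

    def post_tweet(self, idempotency_key, user, text):
        # Replay the stored response instead of executing the write again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        tweet_id = len(self.tweets) + 1
        self.tweets.append((tweet_id, user, text))
        response = {"tweet_id": tweet_id, "status": "created"}
        self._responses[idempotency_key] = response
        return response

api = TweetAPI()
key = str(uuid.uuid4())                     # generated once by the client
first = api.post_tweet(key, "alice", "hello")
retry = api.post_tweet(key, "alice", "hello")  # e.g. after a network timeout
print(first == retry, len(api.tweets))      # True 1
```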
The Response Framework
1. ACKNOWLEDGE "That's a great point."
2. STATE IMPACT "If X happens, the impact would be..."
3. PROPOSE FIX "To handle this, I would..."
4. TRADE-OFF "The trade-off is... but given our requirements, this is acceptable."
9. Common Mistakes
Mistake 1: Jumping Into Design Without Requirements
BAD: "Let me draw the architecture..." (minute 0)
GOOD: "Before I start, let me clarify the requirements..." (minute 0)
Why it hurts: You might design for the wrong scale or the wrong features.
Mistake 2: Over-Engineering
BAD: "We'll use Kubernetes with a service mesh, CQRS with event sourcing,
a distributed consensus protocol, and a custom CDC pipeline..."
(for a system with 1K users)
GOOD: "Given the scale (1K users), a monolith with PostgreSQL is sufficient.
If we need to scale later, we can split into services."
Why it hurts: Shows you cannot match the solution to the problem.
Mistake 3: Not Discussing Trade-Offs
BAD: "I'll use NoSQL."
GOOD: "I'm choosing NoSQL (Cassandra) because we need high write throughput
and can tolerate eventual consistency. The trade-off is that we lose
ACID transactions and complex queries."
Why it hurts: Interviewers want to see your reasoning, not just your choice.
Mistake 4: Monologuing
BAD: Talks for 15 minutes without checking in with the interviewer.
GOOD: "Does this direction make sense? Would you like me to go deeper
on any component before I continue?"
Why it hurts: The interview is a conversation, not a presentation.
Mistake 5: Getting Stuck on Details Too Early
BAD: Spends 10 minutes on database schema before drawing the overall architecture.
GOOD: Draws the full high-level picture first, THEN dives into specific components.
Why it hurts: You run out of time before showing the complete system.
Mistake 6: Not Considering Failure Scenarios
BAD: Assumes everything works perfectly all the time.
GOOD: "What if the cache goes down? We fall back to the database. What if
the database goes down? We have a replica that promotes to primary."
Mistake 7: Ignoring the Numbers
BAD: "We'll add a cache." (with no justification)
GOOD: "At 460K read QPS, the database cannot handle the load. With a cache
hit ratio of 90%, we reduce DB reads to 46K QPS, which 3 replicas
can handle."
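The arithmetic behind that "GOOD" answer is a one-liner worth internalizing. The sketch below is only a check of the numbers: the function names are made up for illustration, and the 20,000 reads/sec per replica is an assumed capacity, not a measured one.

```python
import math

def db_read_qps(total_read_qps, cache_hit_ratio):
    """Reads that miss the cache and reach the database."""
    return total_read_qps * (1 - cache_hit_ratio)

def replicas_needed(db_qps, qps_per_replica):
    """Read replicas required at an assumed per-replica capacity."""
    return math.ceil(db_qps / qps_per_replica)

db_load = db_read_qps(460_000, 0.90)
print(round(db_load))                    # 46000
print(replicas_needed(db_load, 20_000))  # 3, given the assumed 20K QPS per replica
```

Note how sensitive the result is to the hit ratio: at 80% the database sees twice the load, which is exactly the kind of insight interviewers want you to say out loud.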
10. Signal Checklist — What Interviewers Score
| Signal | Strong | Weak |
|---|---|---|
| Requirements gathering | Asks 5+ clarifying questions; writes them down | Jumps straight to drawing |
| Estimation | Shows clear math; derives useful insights | Skips estimation or guesses |
| Architecture | Clean diagram with labeled arrows and data flow | Messy diagram with unnamed boxes |
| Trade-offs | Explains why they chose A over B with reasoning | Makes choices without justification |
| Scalability | Identifies bottlenecks and proposes solutions | Designs for a single server |
| Communication | Checks in with interviewer; narrates while drawing | Monologues or stays silent |
| Failure handling | Discusses redundancy, retries, circuit breakers | Ignores failures completely |
| Depth | Can go deep on any component when asked | Surface-level understanding |
11. Example Walkthrough: "Design a URL Shortener"
Phase 1: Requirements (5 min)
Functional:
• Shorten a long URL → short URL
• Redirect short URL → original long URL
• Custom aliases (optional)
• Link analytics (click count, geography)
Non-functional:
• 100M new URLs per month
• Read:Write = 100:1
• Redirect latency < 50ms
• 99.9% availability
• URLs never expire (unless deleted)
Out of scope: User accounts, paid plans, spam detection
Phase 2: Estimation (5 min)
Write QPS: 100M / (30 * 86,400) ≈ 40/sec (peak: 120/sec)
Read QPS: 40 * 100 = 4,000/sec (peak: 12,000/sec)
Storage: 100M * 250 bytes = 25 GB/month ≈ 1.5 TB over 5 years
Cache: 20% of one month's URLs (20M * 250 bytes) ≈ 5 GB (fits in one Redis)
Short URL: base62, 7 chars → 3.5T combos (sufficient for decades)
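The short-code sizing on the last line can be sanity-checked directly. This is a small sketch; `min_code_length` is a hypothetical helper name, and the 5-year horizon is the same one used in the storage estimate.

```python
def min_code_length(total_urls, alphabet_size=62):
    """Smallest code length whose keyspace covers total_urls."""
    length = 1
    while alphabet_size ** length < total_urls:
        length += 1
    return length

urls_per_month = 100_000_000
urls_in_5_years = urls_per_month * 12 * 5     # 6 billion URLs
keyspace_7 = 62 ** 7

print(min_code_length(urls_in_5_years))       # 6
print(f"{keyspace_7:,}")                      # 3,521,614,606,208
print(keyspace_7 // (urls_per_month * 12))    # 2934 whole years at this rate
```

Six characters would already cover five years of growth, so choosing seven is a deliberate headroom decision rather than a necessity, which is a useful trade-off to mention.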
Phase 3: High-Level Design (15 min)
┌────────┐     ┌─────┐     ┌───────────────┐
│ Client │────►│ LB  │────►│  URL Service  │
└────────┘     └─────┘     └───────┬───────┘
                                   │
                      ┌────────────┼────────────┐
                      ▼            ▼            ▼
                 ┌──────────┐ ┌──────────┐ ┌──────────┐
                 │  Redis   │ │    DB    │ │Analytics │
                 │  Cache   │ │(Postgres │ │  Queue   │
                 │short→long│ │    or    │ │ (Kafka)  │
                 └──────────┘ │Cassandra)│ └────┬─────┘
                              └──────────┘      ▼
                                           ┌──────────┐
                                           │Analytics │
                                           │  Worker  │
                                           └────┬─────┘
                                                ▼
                                           ┌───────────┐
                                           │ClickHouse │
                                           │(analytics)│
                                           └───────────┘
FLOW — Create Short URL:
1. Client → POST /shorten { url: "https://very-long-url.com/..." }
2. URL Service generates short code (base62 counter or hash)
3. Store mapping in DB: abc123 → https://very-long-url.com/...
4. Return: https://short.ly/abc123
FLOW — Redirect:
1. Client → GET /abc123
2. URL Service checks Redis cache
3. HIT → 301 redirect to original URL
4. MISS → query DB, populate cache, 301 redirect
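5. Publish click event to Kafka (async)
Note: browsers cache 301 responses, so repeat clicks may never reach the service and go uncounted. If click analytics must be accurate, return a 302 instead and accept the extra redirect traffic.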
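The redirect flow is a cache-aside read followed by a fire-and-forget event. The sketch below mirrors those steps with in-memory stand-ins: the class `UrlService`, the plain dictionaries used in place of Redis and the database, the list used in place of the Kafka topic, and the example URL are all assumptions for illustration.

```python
class UrlService:
    """Cache-aside redirect lookup with in-memory stand-ins."""

    def __init__(self, db):
        self.db = db            # dict standing in for the URL table
        self.cache = {}         # dict standing in for Redis
        self.click_events = []  # list standing in for the Kafka topic
        self.db_reads = 0       # counter to make cache hits observable

    def redirect(self, code):
        long_url = self.cache.get(code)           # step 2: check the cache
        if long_url is None:                      # step 4: miss
            self.db_reads += 1
            long_url = self.db.get(code)
            if long_url is None:
                return 404, None                  # unknown short code
            self.cache[code] = long_url           # populate the cache
        self.click_events.append(code)            # step 5: async in a real system
        return 301, long_url                      # status code as in the flow

svc = UrlService({"abc123": "https://example.com/some/long/path"})
print(svc.redirect("abc123"), svc.db_reads)   # miss: one DB read
print(svc.redirect("abc123"), svc.db_reads)   # hit: still one DB read
print(svc.redirect("nope"))                   # (404, None)
```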
Phase 4: Deep Dive (15 min)
Interviewer: "How do you generate the short URL? What about collisions?"
APPROACH 1: Counter-based (preferred)
─────────────────────────────────────
• Use a distributed counter (e.g., auto-increment ID; a 64-bit Snowflake ID also works but encodes to about 11 base62 chars, not 7)
• Convert to base62: 1000000 → "4c92"
• Guaranteed unique — no collisions
• But: IDs are predictable (sequential)
APPROACH 2: Hash-based
──────────────────────
• Hash the long URL: MD5("https://...") → first 7 chars of base62
• Possible collisions: check DB, if collision → append random char
• Not predictable, but collision handling adds complexity
I would go with Approach 1 (counter) because:
• No collision handling needed
• Simpler implementation
• The predictability concern can be addressed by encoding with a
random offset or shuffling the base62 alphabet
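Both approaches fit in a short sketch. The alphabet order below is an assumption (digits, then lowercase, then uppercase), chosen because it reproduces the `1000000 → "4c92"` example; the helper names are hypothetical. For Approach 2 the sketch resolves collisions by re-hashing with a salt rather than appending a random character, which is a swapped-in variant that keeps the code deterministic.

```python
import hashlib
import random

# Assumed symbol order: 0-9, a-z, A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n, alphabet=ALPHABET):
    """Approach 1: encode a non-negative counter value in base62."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return alphabet[0]
    digits = []
    while n:
        n, rem = divmod(n, len(alphabet))
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

def shuffled_alphabet(seed):
    """A fixed, secret permutation of the alphabet to hide sequential IDs.
    This is obfuscation, not cryptographic protection."""
    symbols = list(ALPHABET)
    random.Random(seed).shuffle(symbols)
    return "".join(symbols)

def hash_code(long_url, existing, length=7):
    """Approach 2: MD5 then base62, re-salting the input on collision."""
    attempt = 0
    while True:
        data = long_url if attempt == 0 else f"{long_url}#{attempt}"
        number = int.from_bytes(hashlib.md5(data.encode()).digest(), "big")
        code = base62_encode(number).rjust(length, ALPHABET[0])[:length]
        if existing.get(code) in (None, long_url):
            return code  # free slot, or this URL is already stored under it
        attempt += 1

print(base62_encode(1_000_000))                        # 4c92
print(base62_encode(1_000_000, shuffled_alphabet(7)))  # same length, permuted symbols
print(hash_code("https://example.com/a", existing={}))
```

The counter path has no loop and no database lookup, which is the simplicity argument made above; the hash path needs the `existing` lookup on every create.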
Interviewer: "How do you handle if the DB goes down?"
1. Redis cache serves reads (most traffic) — no DB needed for cached URLs
2. DB has a standby replica for failover (automated, < 30 sec)
3. Writes buffer in a queue during failover
4. After failover, replay queued writes
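Steps 3 and 4 of that answer (buffer writes, then replay) can be sketched as follows. This is a simplified illustration: `BufferedWriter` and its methods are hypothetical names, the in-process deque stands in for a durable queue, and a real system would detect failover through health checks rather than a flag.

```python
from collections import deque

class BufferedWriter:
    """Queue writes while the primary is unavailable, replay them afterwards."""

    def __init__(self, db):
        self.db = db              # dict standing in for the primary database
        self.db_available = True
        self.pending = deque()    # stand-in for a durable write queue

    def write(self, code, long_url):
        if self.db_available:
            self.db[code] = long_url
            return "stored"
        self.pending.append((code, long_url))  # step 3: buffer during failover
        return "queued"

    def on_failover_complete(self):
        """Step 4: replay queued writes in arrival order."""
        self.db_available = True
        replayed = 0
        while self.pending:
            code, long_url = self.pending.popleft()
            self.db.setdefault(code, long_url)  # replay is safe to repeat
            replayed += 1
        return replayed

db = {}
writer = BufferedWriter(db)
writer.write("abc123", "https://example.com/a")
writer.db_available = False                       # primary goes down
print(writer.write("def456", "https://example.com/b"))  # queued
print(writer.on_failover_complete(), sorted(db))        # 1 ['abc123', 'def456']
```

One consequence to call out: a client that receives "queued" cannot read its new short URL back until the replay finishes, so the create API should make that delay visible or wait for the write to land.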
Phase 5: Wrap Up (5 min)
Bottlenecks:
• DB is a potential bottleneck at very high scale, but at 40 writes/sec
and 4K reads/sec (with 90% cache hit → 400 reads/sec to DB), a single
PostgreSQL instance is sufficient for years.
Future improvements:
• DB sharding by short URL hash (if scale demands)
• Multi-region deployment for lower global latency
• Bloom filter to check URL existence without hitting DB
• Rate limiting per IP to prevent abuse
Monitoring:
• Redirect latency P99
• Cache hit ratio
• DB connection pool utilization
• Queue depth (analytics pipeline)
12. Key Takeaways
- Follow the 5-phase framework religiously: Requirements → Estimation → High-Level Design → Deep Dive → Wrap Up.
- Never skip requirements. The first 5 minutes set the direction for the entire interview.
- Draw big, label everything, number the flow. The diagram is your primary communication tool.
- Discuss trade-offs for every decision. "I chose X because... the alternative was Y, but..."
- The interview is a conversation. Check in with the interviewer. Ask "Does this direction make sense?"
- Start simple, add complexity. Draw the basic flow first, then evolve the diagram with caching, queues, and redundancy.
- Show your math. Even rough estimation demonstrates quantitative thinking.
- Handle curveballs calmly. Acknowledge → State Impact → Propose Fix → Trade-off.
13. Explain-It Challenge
Without looking back, practice these:
- Walk through the 5-phase framework and explain what you do in each phase and how long it takes.
- You are asked "Design a notification system." What are the first 5 questions you would ask?
- You are mid-design and the interviewer asks "What if the database goes down?" Walk through your response framework.
- List 5 common mistakes candidates make in system design interviews and how to avoid each.
- Draw a high-level diagram for a simple file-sharing system (like Dropbox). Narrate as you draw.