Episode 9 — System Design / 9.7 — System Design Foundations

9.7.e — Interview Approach

In one sentence: A system design interview is not about memorizing solutions — it is a structured conversation where you demonstrate how you think about building large-scale systems, and the approach matters more than the answer.

Navigation: ← 9.7.d — Capacity Estimation · 9.7 Overview →


1. The 45-Minute Framework

  0              5            10                  25                  40        45 min
  ├──────────────┼────────────┼───────────────────┼───────────────────┼─────────┤
  │ Requirements │ Estimation │ High-Level Design │ Deep Dive         │ Wrap Up │
  │ 5 min        │ 5 min      │ 15 min            │ 15 min            │ 5 min   │
  └──────────────┴────────────┴───────────────────┴───────────────────┴─────────┘

  Phase 1: REQUIREMENTS       5 min    Clarify scope, features, constraints
  Phase 2: ESTIMATION          5 min    QPS, storage, bandwidth
  Phase 3: HIGH-LEVEL DESIGN  15 min    Architecture diagram, data flow
  Phase 4: DEEP DIVE          15 min    Interviewer-guided component deep dive
  Phase 5: WRAP UP             5 min    Trade-offs, bottlenecks, improvements

2. Phase 1: Requirements (5 Minutes)

What to Do

Ask clarifying questions. Do NOT start drawing yet.

Checklist

  ┌─────────────────────────────────────────────────────────┐
  │  REQUIREMENTS SCRIPT                                      │
  │                                                            │
  │  "Before I start designing, let me clarify the scope."     │
  │                                                            │
  │  1. "What are the core features we need to support?"       │
  │     → Write 3-5 bullet points                              │
  │                                                            │
  │  2. "Who are the users and how do they access the system?" │
  │     → Mobile, web, API? Read-heavy or write-heavy?         │
  │                                                            │
  │  3. "What scale should I design for?"                      │
  │     → 1M users or 1B users? This changes everything.       │
  │                                                            │
  │  4. "What quality attributes matter most?"                 │
  │     → Latency? Availability? Consistency?                  │
  │                                                            │
  │  5. "What is explicitly out of scope?"                     │
  │     → This focuses your 45 minutes.                        │
  │                                                            │
  │  Write requirements on the board. Keep them visible.       │
  └─────────────────────────────────────────────────────────┘

Example Output (on the board)

  REQUIREMENTS — Design Twitter
  ─────────────────────────────
  Functional (P0):
  • Post tweet (text only, 280 chars)
  • View home timeline
  • Follow / unfollow users

  Non-functional:
  • 500M MAU, 200M DAU
  • Read-heavy (100:1)
  • Timeline latency < 200ms
  • 99.9% availability
  • Eventual consistency acceptable

  Out of scope: DMs, search, trending, ads

3. Phase 2: Estimation (5 Minutes)

What to Do

Quick back-of-envelope math. Show your work; talk through it out loud.

Template

  ESTIMATION
  ──────────
  DAU:         200M
  Write QPS:   200M * 2 tweets / 86,400 ≈ 4,600 (peak: ~14K)
  Read QPS:    4,600 * 100 = 460,000 (peak: ~1.4M)
  Storage:     400M tweets * 500 bytes = 200 GB/day ≈ 73 TB/year
  Bandwidth:   460K * 25 KB (timeline) ≈ 11.5 GB/s (CDN!)
  Cache:       20% of daily data (200 GB) ≈ 40 GB Redis

Tips

  • Round aggressively. 86,400 seconds/day can be rounded to 100K for quick math.
  • State your assumptions. "I am assuming 40% of MAU are daily active (200M of 500M)."
  • Highlight insights. "460K read QPS means we absolutely need aggressive caching."
  • Do not spend more than 5 minutes here. Precision is not the goal; order of magnitude is.
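
The template arithmetic can be reproduced in a few lines, which is a handy way to practice before an interview. This is an illustrative sketch only: the function name `estimate`, its parameters, and the 3x peak factor (inferred from the ~14K peak figure in the template) are assumptions, not part of the original template.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau, writes_per_user_per_day, read_write_ratio,
             bytes_per_write, bytes_per_read, peak_factor=3):
    """Return rough capacity numbers. Every input is an assumption to state aloud."""
    writes_per_day = dau * writes_per_user_per_day
    write_qps = writes_per_day / SECONDS_PER_DAY
    read_qps = write_qps * read_write_ratio
    return {
        "write_qps": write_qps,
        "peak_write_qps": write_qps * peak_factor,
        "read_qps": read_qps,
        "peak_read_qps": read_qps * peak_factor,
        "storage_gb_per_day": writes_per_day * bytes_per_write / 1e9,
        "storage_tb_per_year": writes_per_day * bytes_per_write * 365 / 1e12,
        "read_bandwidth_gb_per_s": read_qps * bytes_per_read / 1e9,
    }


# Inputs taken from the Twitter template: 200M DAU, 2 tweets/user/day,
# 100:1 read:write ratio, 500 bytes per tweet, 25 KB per timeline response.
twitter = estimate(dau=200_000_000, writes_per_user_per_day=2,
                   read_write_ratio=100, bytes_per_write=500,
                   bytes_per_read=25_000)

for name, value in twitter.items():
    print(f"{name:>24}: {value:,.1f}")
```

The exact results differ slightly from the rounded board figures (for example, about 4,630 write QPS rather than 4,600), which is the point of the "round aggressively" tip: the order of magnitude is what drives the design.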

4. Phase 3: High-Level Design (15 Minutes)

This is the main event. Draw the architecture diagram and walk through the data flow.

Step-by-Step

  Step 1: Start with the client
          Draw a box on the left: "Client (Mobile / Web)"

  Step 2: Add the entry point
          Load Balancer → API Gateway

  Step 3: Identify core services
          Based on your requirements:
          • Tweet Service (create, delete, get tweets)
          • Timeline Service (generate and serve timelines)
          • User Service (profiles, follow graph)

  Step 4: Add data stores
          • Database for each service
          • Cache layer (Redis)
          • Object storage for media (if applicable)

  Step 5: Add async components
          • Message queue for fan-out
          • Workers for async processing

  Step 6: Walk through the request path
          Number each arrow: 1 → 2 → 3 → ...
          Explain out loud: "When a user posts a tweet..."

Example Diagram

  ┌────────┐     ┌─────┐     ┌──────────┐     ┌──────────────┐
  │ Client │────►│ LB  │────►│   API    │────►│    Tweet     │
  │        │     │     │     │ Gateway  │     │   Service    │
  └────────┘     └─────┘     └────┬─────┘     └──────┬───────┘
                                  │                   │
                                  │              ┌────▼────┐
                                  │              │ Tweet DB│
                                  │              │(Postgres)│
                                  │              └─────────┘
                                  │
                                  │           ┌──────────────┐
                                  └──────────►│  Timeline    │
                                              │   Service    │
                                              └──────┬───────┘
                                                     │
                                       ┌─────────────┼──────────┐
                                       ▼             ▼          ▼
                                 ┌──────────┐  ┌─────────┐ ┌────────┐
                                 │  Redis   │  │ Fan-out │ │User    │
                                 │  Cache   │  │ Queue   │ │Service │
                                 │(timeline)│  │ (Kafka) │ └────┬───┘
                                 └──────────┘  └────┬────┘      │
                                                    ▼      ┌────▼───┐
                                               ┌────────┐  │User DB │
                                               │Fan-out │  └────────┘
                                               │Workers │
                                               └────────┘

Narrate as You Draw

Talk continuously while drawing. Example narrative:

"When a user posts a tweet, the request goes through the load balancer to the API Gateway, which routes it to the Tweet Service. The Tweet Service writes the tweet to the Tweet DB and publishes an event to Kafka. Fan-out workers consume the event, look up the poster's followers from the User Service, and push the tweet into each follower's timeline cache in Redis. When a user reads their timeline, the Timeline Service reads directly from Redis, which gives us sub-10ms latency."


5. Phase 4: Deep Dive (15 Minutes)

The interviewer will typically pick 1-2 components and ask you to go deeper. Be prepared for any of these.

Common Deep-Dive Topics

  Topic               What to Discuss
  ─────────────────────────────────────────────────────────────────────────────────────
  Database schema     Table design, primary keys, indexes, sharding key
  Caching strategy    Cache-aside vs write-through, TTL, invalidation, consistency
  Sharding            Shard key selection, consistent hashing, rebalancing
  Fan-out strategy    Push vs pull vs hybrid, handling celebrity accounts
  API design          Endpoint design, pagination, rate limiting
  Failure handling    What happens when Service X goes down? Circuit breakers, retries
  Consistency         How do you handle conflicts? Last-write-wins? CRDTs?
  Rate limiting       Token bucket, sliding window, distributed rate limiting
  Monitoring          Metrics, alerts, dashboards, tracing
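
To make one of these topics concrete, here is a minimal single-process sketch of a token bucket, one of the rate-limiting algorithms listed above. The class name, its parameters, and the injectable `clock` are assumptions made for illustration; a distributed limiter would keep the bucket state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the example is deterministic
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock: 1 token/second, burst size 2.
fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])

results = [bucket.allow(), bucket.allow(), bucket.allow()]  # three requests at t=0
fake_now[0] = 1.0                                           # one second later
results.append(bucket.allow())
print(results)  # [True, True, False, True]
```

In an interview, the useful talking points are the two knobs (sustained rate vs burst size) and where the state lives once there is more than one API server.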

Example Deep Dive: Fan-Out Strategy

  PUSH MODEL (Fan-Out on Write)
  ─────────────────────────────
  When Alice tweets:
    1. Look up Alice's followers (e.g., 10K followers)
    2. Write Alice's tweet to each follower's timeline cache
    3. When a follower reads their timeline, it's pre-computed

  Pros: Fast reads (O(1) — just read from cache)
  Cons: Slow writes for celebrities (1M followers = 1M writes)

  PULL MODEL (Fan-Out on Read)
  ────────────────────────────
  When Bob reads his timeline:
    1. Look up Bob's followed accounts
    2. Fetch recent tweets from each account
    3. Merge and sort

  Pros: No write amplification
  Cons: Slow reads (many DB queries per timeline request)

  HYBRID MODEL (reportedly Twitter's approach)
  ────────────────────────────────────────────
  • Regular users (below a follower threshold, e.g., 10K): PUSH model
  • Celebrities (above the threshold): PULL model
  • Timeline = pre-cached (push) + merge celebrity tweets (pull)
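
The hybrid model can be sketched with in-memory dictionaries standing in for the follow graph, the tweet store, and the timeline cache. Everything here (the class name, method names, and tuple layout) is hypothetical, and the celebrity threshold is a constructor parameter so the example can run with a tiny value.

```python
import heapq
from collections import defaultdict


class HybridTimelines:
    def __init__(self, celebrity_threshold=10_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)   # author -> users who follow them
        self.following = defaultdict(set)   # user -> authors they follow
        self.tweets = defaultdict(list)     # author -> [(ts, author, text)]
        self.cache = defaultdict(list)      # user -> entries pushed at write time

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) > self.celebrity_threshold

    def post(self, author, ts, text):
        """Store the tweet; return how many timeline writes the post caused."""
        entry = (ts, author, text)
        self.tweets[author].append(entry)
        if self.is_celebrity(author):
            return 0                        # pull model: merged at read time
        for follower in self.followers[author]:
            self.cache[follower].append(entry)   # push model: fan-out on write
        return len(self.followers[author])

    def timeline(self, user, limit=20):
        # A set removes duplicates if an author crossed the threshold after
        # some of their tweets had already been pushed.
        entries = set(self.cache[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update(self.tweets[author][-limit:])
        return heapq.nlargest(limit, entries)    # newest first by timestamp


timelines = HybridTimelines(celebrity_threshold=2)
for fan in ("bob", "carol", "dave"):
    timelines.follow(fan, "star")            # 3 followers > 2, so a celebrity
timelines.follow("bob", "alice")             # alice stays a regular user

writes_alice = timelines.post("alice", 1, "hello")     # pushed to 1 follower
writes_star = timelines.post("star", 2, "big news")    # no fan-out writes
print(writes_alice, writes_star)
print(timelines.timeline("bob"))
```

The return value of `post` makes the write amplification visible: the celebrity post costs zero timeline writes, and the cost moves to `timeline`, which does one extra lookup per followed celebrity.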

How to Handle the Deep Dive

  1. Acknowledge the question. "Great question, let me think about that."
  2. State the trade-offs. "There are two approaches here..."
  3. Pick one and justify. "I would go with X because... given our requirements."
  4. Acknowledge limitations. "The downside is... but we can mitigate by..."

6. Phase 5: Wrap Up (5 Minutes)

What to Cover

  ┌─────────────────────────────────────────────────────────┐
  │  WRAP UP CHECKLIST                                        │
  │                                                            │
  │  1. BOTTLENECKS                                            │
  │     "The main bottleneck in this design is..."             │
  │     (e.g., fan-out for celebrities, DB write throughput)   │
  │                                                            │
  │  2. SINGLE POINTS OF FAILURE                               │
  │     "If I had more time, I would add redundancy for..."    │
  │     (e.g., multi-region, DB failover)                      │
  │                                                            │
  │  3. FUTURE IMPROVEMENTS                                    │
  │     "To handle 10x growth, we would need..."               │
  │     (e.g., sharding, more cache nodes, CDN expansion)      │
  │                                                            │
  │  4. MONITORING                                             │
  │     "I would monitor: latency P99, error rate, cache hit   │
  │      ratio, queue depth, DB connection pool usage"          │
  │                                                            │
  └─────────────────────────────────────────────────────────┘

7. How to Draw Diagrams

Physical Whiteboard Tips

  Tip                          Why
  ──────────────────────────────────────────────────────────────────────────────────────────────────────
  Start in the center-left     Leave room for components on the right and below
  Draw big                     Small diagrams are hard to read and hard to modify
  Use consistent shapes        Rectangles = services, cylinders = databases, diamonds = decisions
  Label EVERY arrow            "HTTP," "gRPC," "Kafka event," "SQL query"
  Number the flow              1 → 2 → 3 makes the data path clear
  Leave space between boxes    You WILL need to add components mid-discussion
  Use different colors         If available: blue for services, green for data stores, red for bottlenecks

Virtual Whiteboard Tips (Excalidraw, Miro, etc.)

  Tip                          Why
  ─────────────────────────────────────────────────────────────────────────────────────
  Pre-create common shapes     Have service boxes and DB cylinders ready to copy-paste
  Use a grid                   Keep things aligned
  Practice beforehand          Know the tool's keyboard shortcuts
  Share your screen early      Let the interviewer see your drawing in real time

Diagram Evolution During Interview

  START (Phase 3, minute 10):        AFTER DEEP DIVE (Phase 4, minute 35):

  Client → LB → API → Service → DB   Client → CDN (static)
                                             ↓
                                      Client → LB → API Gateway
                                                      │
                                             ┌────────┼────────┐
                                             ▼        ▼        ▼
                                          Tweet    Timeline   User
                                          Service  Service    Service
                                             │        │        │
                                          Tweet DB  Redis    User DB
                                             │     (cache)
                                          Kafka ──→ Fan-out Workers

It is normal and expected for your diagram to evolve. Start simple, add complexity as needed.


8. Handling "What About X?" Questions

Interviewers will throw curveballs to test how you think on your feet.

Common "What About X?" Questions and How to Handle Them

  Q: "What if this service goes down?"
  A: "I would add redundancy: multiple instances behind a load balancer,
      health checks, and circuit breakers to prevent cascading failures."

  Q: "What if the database can't handle the load?"
  A: "At this QPS, I would add read replicas for reads, or shard the
      database by [shard key]. For writes, a queue can absorb spikes."

  Q: "What about data consistency?"
  A: "Given our requirements, I chose eventual consistency. For this
      specific case, I would use [approach]. If strong consistency is
      needed, I would use [alternative]."

  Q: "What if a celebrity with 100M followers tweets?"
  A: "Fan-out on write would be too slow. I would use a hybrid model:
      push for regular users, pull for celebrity tweets at read time."

  Q: "How do you handle duplicate requests?"
  A: "I would make the API idempotent using a client-generated
      idempotency key. The server checks if this key was already
      processed before executing."

  Q: "What about security?"
  A: "Authentication via JWT at the API Gateway, authorization
      per-service, HTTPS everywhere, rate limiting to prevent abuse,
      input validation to prevent injection."
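
The idempotency-key answer is easier to defend if you can sketch it. The following is a minimal in-memory illustration, assuming a hypothetical `IdempotentHandler` wrapper and a hypothetical `create_tweet` operation; a real service would keep processed keys in a shared store with an expiry rather than in a local dictionary.

```python
import threading


class IdempotentHandler:
    """Run `operation` at most once per client-generated idempotency key."""

    def __init__(self, operation):
        self._operation = operation
        self._results = {}              # idempotency key -> stored response
        self._lock = threading.Lock()

    def handle(self, idempotency_key, payload):
        # The lock makes check-then-execute atomic within this one process.
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]   # replay stored response
            result = self._operation(payload)
            self._results[idempotency_key] = result
            return result


posted = []


def create_tweet(payload):
    # Hypothetical side effect we must not repeat on a client retry.
    posted.append(payload["text"])
    return {"tweet_id": len(posted), "text": payload["text"]}


handler = IdempotentHandler(create_tweet)
first = handler.handle("key-123", {"text": "hello"})
retry = handler.handle("key-123", {"text": "hello"})   # e.g., client timed out and retried

print(first == retry, len(posted))
```

The retry returns the stored response and the tweet is created only once. Holding one lock around the operation keeps the sketch short but serializes all requests, which is a trade-off worth naming if the interviewer pushes on it.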

The Response Framework

  1. ACKNOWLEDGE    "That's a great point."
  2. STATE IMPACT   "If X happens, the impact would be..."
  3. PROPOSE FIX    "To handle this, I would..."
  4. TRADE-OFF      "The trade-off is... but given our requirements, this is acceptable."

9. Common Mistakes

Mistake 1: Jumping Into Design Without Requirements

  BAD:  "Let me draw the architecture..."  (minute 0)
  GOOD: "Before I start, let me clarify the requirements..."  (minute 0)

Why it hurts: You might design for the wrong scale or the wrong features.

Mistake 2: Over-Engineering

  BAD:  "We'll use Kubernetes with a service mesh, CQRS with event sourcing,
         a distributed consensus protocol, and a custom CDC pipeline..."
         (for a system with 1K users)

  GOOD: "Given the scale (1K users), a monolith with PostgreSQL is sufficient.
         If we need to scale later, we can split into services."

Why it hurts: Shows you cannot match the solution to the problem.

Mistake 3: Not Discussing Trade-Offs

  BAD:  "I'll use NoSQL."
  GOOD: "I'm choosing NoSQL (Cassandra) because we need high write throughput
         and can tolerate eventual consistency. The trade-off is that we lose
         ACID transactions and complex queries."

Why it hurts: Interviewers want to see your reasoning, not just your choice.

Mistake 4: Monologuing

  BAD:  Talks for 15 minutes without checking in with the interviewer.
  GOOD: "Does this direction make sense? Would you like me to go deeper
         on any component before I continue?"

Why it hurts: The interview is a conversation, not a presentation.

Mistake 5: Getting Stuck on Details Too Early

  BAD:  Spends 10 minutes on database schema before drawing the overall architecture.
  GOOD: Draws the full high-level picture first, THEN dives into specific components.

Why it hurts: You run out of time before showing the complete system.

Mistake 6: Not Considering Failure Scenarios

  BAD:  Assumes everything works perfectly all the time.
  GOOD: "What if the cache goes down? We fall back to the database. What if
         the database goes down? We have a replica that promotes to primary."

Why it hurts: Failure handling is a scored signal; ignoring failures completely reads as weak.

Mistake 7: Ignoring the Numbers

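  BAD:  "We'll add a cache." (with no justification)
  GOOD: "At 460K read QPS, the database cannot handle the load. With a cache
         hit ratio of 90%, we reduce DB reads to 46K QPS, which 3 replicas
         can handle."

Why it hurts: Without numbers, the interviewer cannot tell whether a component is necessary or a guess.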
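
The arithmetic behind the GOOD answer is short enough to check directly. This is a sketch under the stated figures (460K read QPS, 3 replicas); the function name `db_load` is an assumption, and whether a replica can actually sustain the resulting load depends on the hardware and the query mix.

```python
def db_load(read_qps, cache_hit_ratio, replicas):
    """Return (total QPS reaching the DB, QPS per replica) after caching."""
    db_qps = read_qps * (1 - cache_hit_ratio)   # only cache misses reach the DB
    return db_qps, db_qps / replicas


for hit_ratio in (0.80, 0.90, 0.99):
    total, per_replica = db_load(460_000, hit_ratio, replicas=3)
    print(f"hit ratio {hit_ratio:.0%}: {total:,.0f} QPS to DB, "
          f"{per_replica:,.0f} per replica")
```

Running it for several hit ratios shows how sensitive the database load is to cache effectiveness: moving from 90% to 99% cuts DB traffic by a factor of ten.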

10. Signal Checklist — What Interviewers Score

  Requirements gathering
    Strong: Asks 5+ clarifying questions; writes them down
    Weak:   Jumps straight to drawing

  Estimation
    Strong: Shows clear math; derives useful insights
    Weak:   Skips estimation or guesses

  Architecture
    Strong: Clean diagram with labeled arrows and data flow
    Weak:   Messy diagram with unnamed boxes

  Trade-offs
    Strong: Explains why they chose A over B with reasoning
    Weak:   Makes choices without justification

  Scalability
    Strong: Identifies bottlenecks and proposes solutions
    Weak:   Designs for a single server

  Communication
    Strong: Checks in with interviewer; narrates while drawing
    Weak:   Monologues or stays silent

  Failure handling
    Strong: Discusses redundancy, retries, circuit breakers
    Weak:   Ignores failures completely

  Depth
    Strong: Can go deep on any component when asked
    Weak:   Surface-level understanding

11. Example Walkthrough: "Design a URL Shortener"

Phase 1: Requirements (5 min)

  Functional:
  • Shorten a long URL → short URL
  • Redirect short URL → original long URL
  • Custom aliases (optional)
  • Link analytics (click count, geography)

  Non-functional:
  • 100M new URLs per month
  • Read:Write = 100:1
  • Redirect latency < 50ms
  • 99.9% availability
  • URLs never expire (unless deleted)

  Out of scope: User accounts, paid plans, spam detection

Phase 2: Estimation (5 min)

  Write QPS:   100M / (30 * 86,400) ≈ 40/sec (peak: 120/sec)
  Read QPS:    40 * 100 = 4,000/sec (peak: 12,000/sec)
  Storage:     100M URLs/month * 250 bytes = 25 GB/month ≈ 1.5 TB over 5 years
  Cache:       20% of one month's URLs (25 GB) ≈ 5 GB (fits in one Redis)
  Short URL:   base62, 7 chars → 3.5T combos (sufficient for decades)

Phase 3: High-Level Design (15 min)

  ┌────────┐     ┌─────┐     ┌───────────────┐
  │ Client │────►│ LB  │────►│  URL Service  │
  └────────┘     └─────┘     └───────┬───────┘
                                     │
                     ┌───────────────┼──────────────┐
                     ▼               ▼              ▼
               ┌────────────┐  ┌───────────┐  ┌───────────┐
               │   Redis    │  │    DB     │  │ Analytics │
               │   Cache    │  │ (Postgres │  │   Queue   │
               │ short→long │  │    or     │  │  (Kafka)  │
               └────────────┘  │ Cassandra)│  └─────┬─────┘
                               └───────────┘        ▼
                                              ┌───────────┐
                                              │ Analytics │
                                              │  Worker   │
                                              └─────┬─────┘
                                                    ▼
                                              ┌───────────┐
                                              │ClickHouse │
                                              │(analytics)│
                                              └───────────┘

  FLOW — Create Short URL:
  1. Client → POST /shorten { url: "https://very-long-url.com/..." }
  2. URL Service generates short code (base62 counter or hash)
  3. Store mapping in DB: abc123 → https://very-long-url.com/...
  4. Return: https://short.ly/abc123

  FLOW — Redirect:
  1. Client → GET /abc123
  2. URL Service checks Redis cache
  3. HIT → 301 redirect to original URL
  4. MISS → query DB, populate cache, 301 redirect
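  5. Publish click event to Kafka (async)
     Note: browsers cache 301 responses, so repeat clicks may never reach
     the service. Use 302 if every click must be counted.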
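
The redirect flow is a cache-aside read, which can be sketched as follows. Plain dictionaries stand in for Redis and the database, and a callback stands in for the Kafka producer; the class name `UrlService`, its parameters, and the sample URL are assumptions made for illustration.

```python
class UrlService:
    def __init__(self, db, cache, publish_click, redirect_status=301):
        self.db = db                        # dict stand-in for the URL database
        self.cache = cache                  # dict stand-in for Redis
        self.publish_click = publish_click  # stand-in for an async Kafka publish
        self.redirect_status = redirect_status

    def redirect(self, code):
        """Return (HTTP status, target URL) for a short code."""
        long_url = self.cache.get(code)
        if long_url is None:                # cache MISS: fall back to the DB
            long_url = self.db.get(code)
            if long_url is None:
                return 404, None            # unknown short code
            self.cache[code] = long_url     # populate the cache for next time
        self.publish_click(code)            # real systems would not block on this
        return self.redirect_status, long_url


db = {"abc123": "https://example.com/a/very/long/path"}   # sample data
cache = {}
clicks = []
service = UrlService(db, cache, publish_click=clicks.append)

first = service.redirect("abc123")    # miss: reads the DB and fills the cache
second = service.redirect("abc123")   # hit: served from the cache
missing = service.redirect("nope")

print(first, second, missing, clicks)
```

Only successful redirects publish a click event in this sketch, and the cache is filled lazily on the first miss, which matches steps 2 to 5 of the flow.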

Phase 4: Deep Dive (15 min)

Interviewer: "How do you generate the short URL? What about collisions?"

  APPROACH 1: Counter-based (preferred)
  ─────────────────────────────────────
  • Use a distributed counter (e.g., auto-increment ID or Snowflake ID)
  • Convert to base62: 1000000 → "4c92"
  • Guaranteed unique — no collisions
  • But: IDs are predictable (sequential)

  APPROACH 2: Hash-based
  ──────────────────────
  • Hash the long URL: MD5("https://...") → first 7 chars of base62
  • Possible collisions: check DB, if collision → append random char
  • Not predictable, but collision handling adds complexity

  I would go with Approach 1 (counter) because:
  • No collision handling needed
  • Simpler implementation
  • The predictability concern can be addressed by encoding with a
    random offset or shuffling the base62 alphabet
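
The counter-to-base62 conversion can be sketched in a few lines. The alphabet order (digits, then lowercase, then uppercase) is an assumption; it is the ordering under which 1000000 encodes to "4c92", and shuffling it is one way to make codes less predictable, as noted above.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode(n):
    """Convert a non-negative integer ID to its base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))    # most significant digit first


def decode(code):
    """Convert a base62 string back to the integer ID."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


print(encode(1_000_000))          # 4c92
print(decode("4c92"))             # 1000000
print(f"{BASE ** 7:,}")           # 3,521,614,606,208 seven-character codes
```

Because each counter value maps to exactly one string and back, uniqueness comes from the counter itself, which is why this approach needs no collision check.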

Interviewer: "How do you handle if the DB goes down?"

  1. Redis cache serves reads (most traffic) — no DB needed for cached URLs
  2. DB has a standby replica for failover (automated, < 30 sec)
  3. Writes buffer in a queue during failover
  4. After failover, replay queued writes

Phase 5: Wrap Up (5 min)

  Bottlenecks:
  • DB is a potential bottleneck at very high scale, but at 40 writes/sec
    and 4K reads/sec (with 90% cache hit → 400 reads/sec to DB), a single
    PostgreSQL instance is sufficient for years.

  Future improvements:
  • DB sharding by short URL hash (if scale demands)
  • Multi-region deployment for lower global latency
  • Bloom filter to check URL existence without hitting DB
  • Rate limiting per IP to prevent abuse

  Monitoring:
  • Redirect latency P99
  • Cache hit ratio
  • DB connection pool utilization
  • Queue depth (analytics pipeline)

12. Key Takeaways

  1. Follow the 5-phase framework religiously: Requirements → Estimation → High-Level Design → Deep Dive → Wrap Up.
  2. Never skip requirements. The first 5 minutes set the direction for the entire interview.
  3. Draw big, label everything, number the flow. The diagram is your primary communication tool.
  4. Discuss trade-offs for every decision. "I chose X because... the alternative was Y, but..."
  5. The interview is a conversation. Check in with the interviewer. Ask "Does this direction make sense?"
  6. Start simple, add complexity. Draw the basic flow first, then evolve the diagram with caching, queues, and redundancy.
  7. Show your math. Even rough estimation demonstrates quantitative thinking.
  8. Handle curveballs calmly. Acknowledge → State Impact → Propose Fix → Trade-off.

13. Explain-It Challenge

Without looking back, practice these:

  1. Walk through the 5-phase framework and explain what you do in each phase and how long it takes.
  2. You are asked "Design a notification system." What are the first 5 questions you would ask?
  3. You are mid-design and the interviewer asks "What if the database goes down?" Walk through your response framework.
  4. List 5 common mistakes candidates make in system design interviews and how to avoid each.
  5. Draw a high-level diagram for a simple file-sharing system (like Dropbox). Narrate as you draw.
