Episode 9 — System Design / 9.7 — System Design Foundations
9.7.b — Requirements Analysis
In one sentence: Before drawing a single box on the whiteboard, clarify what the system needs to do (functional requirements) and how well it needs to do it (non-functional requirements); skipping this step is one of the most common mistakes in system design interviews.
Navigation: ← 9.7.a — What Is HLD · 9.7.c — Breaking Into Components →
Table of Contents
- 1. Why Requirements Come First
- 2. Functional Requirements (FRs)
- 3. Non-Functional Requirements (NFRs)
- 4. Identifying Core Features
- 5. Scalability, Availability, and Consistency
- 6. Back-of-Envelope Estimation
- 7. SLAs (Service Level Agreements)
- 8. Interview Approach: Always Clarify First
- 9. Worked Example: "Design Twitter"
- 10. Key Takeaways
- 11. Explain-It Challenge
1. Why Requirements Come First
WITHOUT REQUIREMENTS:
Interviewer: "Design Twitter"
Candidate: *starts drawing boxes immediately*
Result:
• Builds the wrong thing
• Misses key constraints
• Runs out of time
• Cannot justify trade-offs

WITH REQUIREMENTS:
Interviewer: "Design Twitter"
Candidate: "Let me clarify scope. Should I focus on tweeting and timelines, or also DMs, search, and trending? What scale: 100K or 500M users?"
Result:
• Focused design
• Clear priorities
• Better trade-off discussions
Rule: Spend the first 5 minutes of any system design interview gathering requirements. It is never wasted time.
2. Functional Requirements (FRs)
Functional requirements describe what the system does — the features and behaviors visible to the user.
How to Identify FRs
Ask yourself (or the interviewer): "What can a user DO with this system?"
| System | Example Functional Requirements |
|---|---|
| URL Shortener | Create short URL, redirect to original, custom aliases, link analytics |
| Twitter | Post tweet, view home timeline, follow/unfollow, like, retweet, search |
| YouTube | Upload video, watch video, search, like/comment, subscribe, recommendations |
| Messaging app | Send message (1:1), group chat, read receipts, media sharing, online status |
| Uber | Request ride, match with driver, real-time tracking, payment, rating |
FR Prioritization
Not all features are equal. In an interview, you typically have about 45 minutes. Prioritize ruthlessly.
┌──────────────────────────────────────────────────────────┐
│                  FEATURE PRIORITIZATION                  │
│                                                          │
│  MUST HAVE (P0)      NICE TO HAVE (P1)    OUT OF SCOPE   │
│  ──────────────      ─────────────────    ────────────   │
│  • Post tweet        • Search tweets      • Ads          │
│  • View timeline     • Trending topics    • DMs          │
│  • Follow users      • Analytics          • Spaces       │
│                      • Notifications                     │
│                                                          │
│  Design these FIRST  Mention if time      Explicitly     │
│                      allows               exclude        │
└──────────────────────────────────────────────────────────┘
3. Non-Functional Requirements (NFRs)
Non-functional requirements describe how well the system performs — the quality attributes.
| NFR | Question It Answers | Typical Target |
|---|---|---|
| Scalability | How many users / requests can it handle? | "Support 500M MAU" |
| Availability | What percentage of uptime? | 99.9% (~8.8h downtime/year) |
| Latency | How fast does it respond? | "Timeline loads in < 200ms" |
| Consistency | Do all users see the same data at the same time? | "Eventual consistency is OK for timelines" |
| Durability | Can data survive failures? | "Zero data loss for user data" |
| Throughput | How many operations per second? | "Handle 10K writes/sec" |
| Security | Authentication, authorization, encryption | "All data encrypted at rest and in transit" |
| Fault tolerance | What happens when a server/region goes down? | "No single point of failure" |
The CAP Theorem Trade-off
In a distributed system, you can guarantee only two of these three properties at the same time:
         Consistency
             /\
            /  \
           /    \
          / Pick \
         /  Two   \
        /          \
       /____________\
Availability ──── Partition
                  Tolerance

CP: Consistent + Partition-tolerant (sacrifice availability during partitions)
    Examples: HBase, MongoDB (default), ZooKeeper
AP: Available + Partition-tolerant (sacrifice consistency during partitions)
    Examples: Cassandra, DynamoDB, CouchDB
CA: Consistent + Available (only possible if there are no network partitions,
    which is not realistic in distributed systems)
In practice: Network partitions WILL happen. So the real choice is CP vs AP.
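To make the CP vs AP choice concrete, here is a toy sketch of a two-replica store. It is not how HBase, Cassandra, or any real database works; the class `TwoReplicaStore`, its `mode` labels, and the "replica A wins" reconciliation are all hypothetical simplifications chosen for illustration.

```python
class TwoReplicaStore:
    """Toy register replicated on two nodes, 'a' and 'b' (illustrative only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": None, "b": None}
        self.partitioned = False  # True while 'a' cannot reach 'b'

    def write(self, value):
        """Write through replica 'a'. Returns True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: both replicas are updated together.
            self.replicas["a"] = value
            self.replicas["b"] = value
            return True
        if self.mode == "CP":
            # CP: refuse the write so the replicas never disagree.
            return False
        # AP: stay available, accept locally, and let 'b' go stale.
        self.replicas["a"] = value
        return True

    def heal(self):
        """End the partition; naive reconciliation where replica 'a' wins."""
        self.partitioned = False
        self.replicas["b"] = self.replicas["a"]


cp = TwoReplicaStore("CP")
ap = TwoReplicaStore("AP")
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True

print(cp.write("v2"), cp.replicas)  # write rejected, replicas still agree
print(ap.write("v2"), ap.replicas)  # write accepted, replicas diverge
```

During the partition the CP store gives up availability (the write fails) and the AP store gives up consistency (a read from `b` would return the old value until `heal()` runs).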
4. Identifying Core Features
The "What / Who / How Much" Framework
Use these three questions to nail down scope quickly:
┌──────────────────────────────────────────────────────┐
│ │
│ 1. WHAT does the system do? │
│ → List 3-5 core features (functional requirements) │
│ │
│ 2. WHO uses it and HOW? │
│ → User types, read vs write patterns, access │
│ methods (mobile, web, API) │
│ │
│ 3. HOW MUCH? │
│ → Number of users, data volume, request rate │
│ → This drives ALL scaling decisions │
│ │
└──────────────────────────────────────────────────────┘
Read-Heavy vs Write-Heavy
Understanding the read/write ratio is critical because it drives architecture decisions.
| Pattern | Read:Write | Example | Architecture Impact |
|---|---|---|---|
| Read-heavy | 100:1 or more | Twitter timeline, Wikipedia | Caching, read replicas, CDN |
| Write-heavy | 1:1 or writes dominate | Logging system, IoT sensors | Write-optimized DB, append-only logs, queues |
| Balanced | ~10:1 | E-commerce | Mix of caching and write optimization |
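The table can be turned into a quick rule of thumb in code. This is a minimal sketch: the function name `classify_workload` is hypothetical, and the cut-offs (100:1 and 1:1) are taken from the table rows, with everything in between treated as "balanced" by assumption.

```python
def classify_workload(reads_per_sec, writes_per_sec):
    """Label a workload from its read:write ratio (thresholds are assumptions)."""
    if reads_per_sec < 0 or writes_per_sec < 0:
        raise ValueError("rates must be non-negative")
    if writes_per_sec == 0:
        # No writes at all: trivially read-heavy.
        return "read-heavy"
    ratio = reads_per_sec / writes_per_sec
    if ratio >= 100:
        return "read-heavy"    # favors caching, read replicas, CDN
    if ratio <= 1:
        return "write-heavy"   # favors append-only logs, queues
    return "balanced"          # mix of caching and write optimization


print(classify_workload(460_000, 4_600))  # read-heavy
print(classify_workload(100, 500))        # write-heavy
print(classify_workload(1_000, 100))      # balanced
```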
5. Scalability, Availability, and Consistency
Scalability Requirements
| Scale | Users (MAU) | Typical QPS | Architecture Complexity |
|---|---|---|---|
| Small | < 10K | < 100 | Single server, single DB |
| Medium | 10K - 1M | 100 - 10K | Load balancer, read replicas, cache |
| Large | 1M - 100M | 10K - 1M | Microservices, sharding, CDN, queues |
| Massive | 100M+ | 1M+ | Global distribution, multi-region, custom solutions |
Availability Targets
| Availability | Downtime/Year | Downtime/Month | Typical Use |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | Most SaaS products |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | Payment systems, databases |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | Infrastructure (AWS, GCP core) |
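The downtime figures in this table follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year and an average month of one twelfth of a year (the helper name `allowed_downtime_seconds` is made up for this example):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600          # 31,536,000
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12   # average month, about 30.4 days


def allowed_downtime_seconds(availability_percent, period_seconds):
    """Seconds of downtime permitted in a period at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * period_seconds


for target in (99, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(target, SECONDS_PER_YEAR)
    per_month = allowed_downtime_seconds(target, SECONDS_PER_MONTH)
    print(f"{target}%: {per_year / 3600:.2f} h/year, {per_month / 60:.2f} min/month")
```

Each extra nine divides the downtime budget by ten, which is why moving from three to four nines usually requires redundancy rather than just faster incident response.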
Consistency Models
| Model | Description | Use When |
|---|---|---|
| Strong consistency | Every read returns the latest write | Banking, inventory counts |
| Eventual consistency | Reads may return stale data; eventually converge | Social feeds, analytics, recommendations |
| Causal consistency | Causally related operations are seen in order | Chat messages (A replies to B, order matters) |
6. Back-of-Envelope Estimation
Quick math to sanity-check your design. You don't need exact numbers — order of magnitude matters.
Useful Numbers to Memorize
| Quantity | Value |
|---|---|
| Seconds in a day | ~86,400 (round to ~100K) |
| Seconds in a month | ~2.6 million (30 days) |
| Seconds in a year | ~31.5 million (round to ~30M) |
| 1 million seconds | ~12 days |
| 1 billion seconds | ~31.7 years |
Storage Units
| Unit | Bytes | Approximate |
|---|---|---|
| 1 KB | 1,000 | A short text message |
| 1 MB | 1,000,000 | A high-res photo |
| 1 GB | 1,000,000,000 | A short HD movie |
| 1 TB | 10^12 | A large database |
| 1 PB | 10^15 | A data warehouse |
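When doing estimates it helps to convert raw byte counts into these decimal units automatically. A minimal sketch, using powers of 1,000 as in the table; `human_bytes` is a hypothetical helper name, not a standard library function.

```python
def human_bytes(num_bytes):
    """Format a byte count using decimal units (1 KB = 1,000 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


print(human_bytes(500))              # 500.0 B
print(human_bytes(250_000_000_000))  # 250.0 GB
```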
Quick Estimation Template
Given: 500M monthly active users

Step 1: Daily Active Users (DAU)
  DAU = MAU * daily_active_ratio
  DAU = 500M * 0.20 = 100M

Step 2: Queries Per Second (QPS)
  Daily queries = DAU * actions_per_user
  Daily queries = 100M * 10 = 1B
  QPS = 1B / 86,400 ≈ 11,600, rounded to ~12,000 QPS
  Peak QPS = QPS * 2 ≈ 24,000 QPS (assuming peak is 2x average)

Step 3: Storage
  Data per action = 250 bytes (e.g., a tweet)
  Daily storage = 1B * 250 bytes = 250 GB/day (upper bound: assumes every action writes data)
  Yearly storage = 250 GB * 365 ≈ 91 TB/year

Step 4: Bandwidth
  Outgoing data = QPS * response_size
                = 12,000 * 10KB = 120 MB/s
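The four steps of the template can be captured in one reusable function. This is a sketch under the template's own assumptions (20% daily-active ratio, peak at 2x average, every action stores data); the names `Estimate` and `estimate` are hypothetical. Because it does not round the average QPS up to 12,000, the bandwidth comes out slightly lower than the hand calculation.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    dau: float
    avg_qps: float
    peak_qps: float
    daily_storage_bytes: float
    yearly_storage_bytes: float
    bandwidth_bytes_per_sec: float


def estimate(mau, daily_active_ratio, actions_per_user,
             bytes_per_action, response_bytes, peak_factor=2.0):
    """Back-of-envelope numbers; order of magnitude is what matters."""
    dau = mau * daily_active_ratio                       # Step 1
    daily_actions = dau * actions_per_user               # Step 2
    avg_qps = daily_actions / SECONDS_PER_DAY
    daily_storage = daily_actions * bytes_per_action     # Step 3
    return Estimate(
        dau=dau,
        avg_qps=avg_qps,
        peak_qps=avg_qps * peak_factor,
        daily_storage_bytes=daily_storage,
        yearly_storage_bytes=daily_storage * 365,
        bandwidth_bytes_per_sec=avg_qps * response_bytes,  # Step 4
    )


e = estimate(mau=500_000_000, daily_active_ratio=0.20, actions_per_user=10,
             bytes_per_action=250, response_bytes=10_000)
print(f"DAU: {e.dau / 1e6:.0f}M")
print(f"Average QPS: {e.avg_qps:,.0f}, peak: {e.peak_qps:,.0f}")
print(f"Storage: {e.daily_storage_bytes / 1e9:.0f} GB/day, "
      f"{e.yearly_storage_bytes / 1e12:.1f} TB/year")
print(f"Bandwidth out: {e.bandwidth_bytes_per_sec / 1e6:.0f} MB/s")
```

Changing a single input, such as `actions_per_user`, shows immediately which downstream numbers move, which is useful when the interviewer adjusts the scale mid-discussion.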
7. SLAs (Service Level Agreements)
An SLA is a contract between a service provider and a customer that defines measurable targets.
| SLA Component | Example |
|---|---|
| Availability | 99.9% uptime per month |
| Latency (P99) | 99% of requests complete in < 500ms |
| Throughput | System handles 10,000 requests/sec |
| Durability | 99.999999999% (11 nines) for stored data |
| Error rate | Less than 0.1% of requests return 5xx errors |
SLA vs SLO vs SLI
| Term | Meaning | Example |
|---|---|---|
| SLI (Service Level Indicator) | The metric you measure | "P99 latency is 350ms" |
| SLO (Service Level Objective) | The target for that metric | "P99 latency should be < 500ms" |
| SLA (Service Level Agreement) | The contract with consequences | "If P99 > 500ms for a month, customer gets a credit" |
SLI (what you measure) → SLO (what you target) → SLA (what you promise)
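The SLI to SLO relationship can be illustrated with a few lines of code: measure P99 latency from samples (the SLI) and compare it with a target (the SLO). This sketch uses the nearest-rank percentile method, which is one of several common definitions; `percentile` and `meets_slo` are hypothetical helper names, and the sample data is invented.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # round() guards against floating-point noise such as 999.0000000000001.
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def meets_slo(latencies_ms, slo_ms, p=99):
    """True if the measured p-th percentile latency (the SLI) is under the SLO."""
    return percentile(latencies_ms, p) < slo_ms


# Invented data: 98 fast requests and 2 slow ones.
latencies = [100] * 98 + [900] * 2
print(percentile(latencies, 99))          # the SLI: measured P99 in ms
print(meets_slo(latencies, slo_ms=500))   # compared against the SLO
```

Note how two slow requests out of a hundred are enough to push P99 over the target, even though the average latency is still low. An SLA would then attach a consequence, such as a service credit, to missing that objective over a billing period.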
8. Interview Approach: Always Clarify First
The 5-Minute Requirements Checklist
When the interviewer says "Design X," run through this checklist:
┌──────────────────────────────────────────────────────────┐
│ REQUIREMENTS CHECKLIST (5 min) │
│ │
│ 1. "What are the core features?" │
│ → List 3-5, confirm with interviewer │
│ │
│ 2. "Who are the users?" │
│ → End users? Internal? Both? Mobile + web? │
│ │
│ 3. "What scale are we targeting?" │
│ → 1K users? 1M? 1B? This changes EVERYTHING │
│ │
│ 4. "What are the read/write patterns?" │
│ → Read-heavy? Write-heavy? Ratio? │
│ │
│ 5. "What quality attributes matter most?" │
│ → Latency vs consistency? Availability target? │
│ │
│ 6. "What's out of scope?" │
│ → Explicitly exclude features to focus your design │
│ │
└──────────────────────────────────────────────────────────┘
Good vs Bad Requirement Questions
| Bad (vague / obvious) | Good (specific / clarifying) |
|---|---|
| "Should it be fast?" | "Should we optimize for P99 latency under 200ms?" |
| "Will people use it?" | "Are we designing for 1M or 100M daily active users?" |
| "Should it store data?" | "Should we guarantee zero data loss, or can we tolerate losing a few seconds of recent writes?" |
| "Is security important?" | "Do we need end-to-end encryption, or is TLS in transit sufficient?" |
9. Worked Example: "Design Twitter"
Step 1: Clarify Requirements
Functional (P0):
- Users can post tweets (text, 280 chars max)
- Users can view their home timeline (tweets from people they follow)
- Users can follow/unfollow other users
Functional (P1 — if time):
- Like tweets, retweet, search, trending topics
Non-functional:
- 500M MAU, 200M DAU
- Timeline loads in < 200ms
- 99.9% availability
- Eventually consistent timelines are acceptable
- Read-heavy (100:1 read:write ratio)
Step 2: Quick Estimation
DAU: 200M
Tweets/day: 200M * 2 tweets = 400M tweets/day
Write QPS: 400M / 86,400 ≈ 4,600 writes/sec
Read QPS: 4,600 * 100 = 460,000 reads/sec (timeline reads)
Storage/tweet: 280 bytes (text) + 200 bytes (metadata) ≈ 500 bytes
Daily storage: 400M * 500 bytes = 200 GB/day
Yearly: 200 GB * 365 ≈ 73 TB/year (text only; media is separate)
Bandwidth out: 460,000 * 10KB (timeline page) ≈ 4.6 GB/s
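The same arithmetic, written out as a short script so the read/write split is explicit. All inputs come from the requirements in Step 1 and the estimates above; the variable names are illustrative. Exact division gives about 4,630 writes/sec, which the hand calculation rounds to 4,600.

```python
SECONDS_PER_DAY = 86_400

# Inputs from the clarified requirements.
dau = 200_000_000
tweets_per_user_per_day = 2
read_write_ratio = 100          # 100 timeline reads per tweet written
bytes_per_tweet = 500           # text plus metadata, rounded up
timeline_page_bytes = 10_000    # about 10 KB per timeline response

tweets_per_day = dau * tweets_per_user_per_day
write_qps = tweets_per_day / SECONDS_PER_DAY
read_qps = write_qps * read_write_ratio

daily_storage_gb = tweets_per_day * bytes_per_tweet / 1e9
yearly_storage_tb = daily_storage_gb * 365 / 1000
bandwidth_out_gb_per_sec = read_qps * timeline_page_bytes / 1e9

print(f"Tweets/day:    {tweets_per_day / 1e6:.0f}M")
print(f"Write QPS:     {write_qps:,.0f}")
print(f"Read QPS:      {read_qps:,.0f}")
print(f"Storage:       {daily_storage_gb:.0f} GB/day, {yearly_storage_tb:.0f} TB/year")
print(f"Bandwidth out: {bandwidth_out_gb_per_sec:.2f} GB/s")
```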
Step 3: Now You Are Ready to Design
With these requirements and numbers in hand, you can make informed decisions:
- 460K reads/sec means you NEED aggressive caching
- 99.9% availability means you need redundancy at every layer
- Eventual consistency means you can use a push/fan-out model for timelines
- 73 TB/year means you need to plan for sharding
10. Key Takeaways
- Always start with requirements. Spending 5 minutes clarifying scope saves 15 minutes of rework.
- Functional requirements define WHAT the system does; non-functional requirements define HOW WELL.
- Prioritize features into P0 (must have), P1 (nice to have), and out-of-scope.
- The CAP theorem means that, because network partitions will happen, your real choice in distributed systems is CP vs AP.
- Back-of-envelope estimation gives you the numbers to justify your architecture decisions (caching, sharding, replication).
- SLAs are contracts; SLOs are targets; SLIs are measurements.
- Read/write ratio is one of the most important numbers — it drives caching, replication, and database selection.
11. Explain-It Challenge
Without looking back, explain in your own words:
- What is the difference between a functional and a non-functional requirement? Give two examples of each for a URL shortener.
- Why is the read/write ratio so important for architecture decisions?
- Explain the CAP theorem in plain language. Why is "CA" not realistic?
- Walk through a back-of-envelope estimation for a system with 10M DAU where each user makes 5 requests per day.
- What is the difference between SLI, SLO, and SLA?