9.7.b — Requirements Analysis

In one sentence: Before drawing a single box on the whiteboard, clarify what the system needs to do (functional requirements) and how well it needs to do it (non-functional requirements). Skipping this step is one of the most common mistakes in system design interviews.

1. Why Requirements Come First
2. Functional Requirements (FRs)
3. Non-Functional Requirements (NFRs)
4. Identifying Core Features
5. Scalability, Availability, and Consistency
6. Back-of-Envelope Estimation
7. SLAs (Service Level Agreements)
8. Interview Approach: Always Clarify First
9. Worked Example: "Design Twitter"
10. Key Takeaways
11. Explain-It Challenge

1. Why Requirements Come First

  WITHOUT REQUIREMENTS:                WITH REQUIREMENTS:

  Interviewer: "Design Twitter"        Interviewer: "Design Twitter"

  Candidate: *starts drawing           Candidate: "Let me clarify scope.
   boxes immediately*                   Should I focus on tweeting and
                                        timelines, or also DMs, search,
  Result:                               and trending? What scale — 100K
  • Builds the wrong thing              or 500M users?"
  • Misses key constraints
  • Runs out of time                   Result:
  • Cannot justify trade-offs          • Focused design
                                        • Clear priorities
                                        • Better trade-off discussions

Rule: Spend the first 5 minutes of any system design interview gathering requirements. It is never wasted time.

2. Functional Requirements (FRs)

Functional requirements describe what the system does — the features and behaviors visible to the user.

How to Identify FRs

Ask yourself (or the interviewer): "What can a user DO with this system?"

System	Example Functional Requirements
URL Shortener	Create short URL, redirect to original, custom aliases, link analytics
Twitter	Post tweet, view home timeline, follow/unfollow, like, retweet, search
YouTube	Upload video, watch video, search, like/comment, subscribe, recommendations
WhatsApp	Send message (1:1), group chat, read receipts, media sharing, online status
Uber	Request ride, match with driver, real-time tracking, payment, rating

FR Prioritization

Not all features are equal. In an interview, you typically have about 45 minutes. Prioritize ruthlessly.

  ┌──────────────────────────────────────────────────────┐
  │              FEATURE PRIORITIZATION                    │
  │                                                        │
  │  MUST HAVE (P0)         NICE TO HAVE (P1)   OUT OF    │
  │  ──────────────         ──────────────────   SCOPE     │
  │                                              ─────     │
  │  • Post tweet           • Search tweets      • Ads     │
  │  • View timeline        • Trending topics    • DMs     │
  │  • Follow users         • Analytics          • Spaces  │
  │                         • Notifications                │
  │                                                        │
  │  Design these FIRST     Mention if time      Explicitly│
  │                         allows               exclude   │
  └──────────────────────────────────────────────────────┘

3. Non-Functional Requirements (NFRs)

Non-functional requirements describe how well the system performs — the quality attributes.

NFR	Question It Answers	Typical Target
Scalability	How many users / requests can it handle?	"Support 500M MAU"
Availability	What percentage of uptime?	99.9% (8.76h downtime/year)
Latency	How fast does it respond?	"Timeline loads in < 200ms"
Consistency	Do all users see the same data at the same time?	"Eventual consistency is OK for timelines"
Durability	Can data survive failures?	"Zero data loss for user data"
Throughput	How many operations per second?	"Handle 10K writes/sec"
Security	Authentication, authorization, encryption	"All data encrypted at rest and in transit"
Fault tolerance	What happens when a server/region goes down?	"No single point of failure"

The CAP Theorem Trade-off

In a distributed system, you can guarantee at most two of these three properties at the same time:

            Consistency
               /\
              /  \
             /    \
            / Pick \
           /  Two   \
          /          \
         /____________\
  Availability ──── Partition
                    Tolerance

  CP: Consistent + Partition-tolerant (sacrifice availability during partitions)
      Examples: HBase, MongoDB (default), Zookeeper

  AP: Available + Partition-tolerant (sacrifice consistency during partitions)
      Examples: Cassandra, DynamoDB, CouchDB

  CA: Consistent + Available (only possible if no network partitions — not realistic
      in distributed systems)

In practice: Network partitions WILL happen, so the real choice is what to give up while a partition lasts: availability (CP) or consistency (AP).
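The CP vs AP choice can be illustrated with a toy model. This is a minimal sketch, not how any real database is implemented: the `Replica` class, the `write` function, and the `partitioned` flag are all hypothetical names invented for this example, and replication is reduced to copying one value between two in-memory objects.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses a request during a partition."""


class Replica:
    """A toy replica holding a single value (hypothetical, for illustration)."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None


def write(target: Replica, peer: Replica, value: str, partitioned: bool) -> None:
    """Write to `target`, replicating to `peer` unless the link is partitioned."""
    if not partitioned:
        target.value = value
        peer.value = value  # synchronous replication keeps both copies equal
        return
    if target.mode == "CP":
        # Cannot reach the peer, so refuse rather than let the copies diverge.
        raise Unavailable("partition: write rejected to preserve consistency")
    # AP: stay available and accept the write; the peer is stale until healed.
    target.value = value


if __name__ == "__main__":
    a, b = Replica("AP"), Replica("AP")
    write(a, b, "v1", partitioned=False)
    write(a, b, "v2", partitioned=True)
    print(a.value, b.value)  # prints: v2 v1
```

In AP mode the two replicas disagree until the partition heals; in CP mode the same write raises `Unavailable` and both replicas keep the old value.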

4. Identifying Core Features

The "What / Who / How Much" Framework

Use these three questions to nail down scope quickly:

  ┌──────────────────────────────────────────────────────┐
  │                                                        │
  │  1. WHAT does the system do?                           │
  │     → List 3-5 core features (functional requirements) │
  │                                                        │
  │  2. WHO uses it and HOW?                               │
  │     → User types, read vs write patterns, access       │
  │       methods (mobile, web, API)                       │
  │                                                        │
  │  3. HOW MUCH?                                          │
  │     → Number of users, data volume, request rate       │
  │     → This drives ALL scaling decisions                │
  │                                                        │
  └──────────────────────────────────────────────────────┘

Read-Heavy vs Write-Heavy

Understanding the read/write ratio is critical because it drives architecture decisions.

Pattern	Read:Write	Example	Architecture Impact
Read-heavy	100:1 or more	Twitter timeline, Wikipedia	Caching, read replicas, CDN
Write-heavy	1:1 or writes dominate	Logging system, IoT sensors	Write-optimized DB, append-only logs, queues
Balanced	~10:1	E-commerce	Mix of caching and write optimization
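As a rough illustration, the table above can be turned into a small classifier. This is a sketch under stated assumptions: the function names are hypothetical, and treating every ratio strictly between 1:1 and 100:1 as "balanced" is a simplification made for this example (the table only gives ~10:1 as a typical balanced value).

```python
# Architecture hints copied from the table above.
ARCHITECTURE_HINTS = {
    "read-heavy": "Caching, read replicas, CDN",
    "write-heavy": "Write-optimized DB, append-only logs, queues",
    "balanced": "Mix of caching and write optimization",
}


def classify_workload(reads: int, writes: int,
                      read_heavy_at: float = 100.0,
                      write_heavy_at: float = 1.0) -> str:
    """Classify a workload by its read:write ratio.

    The two thresholds are assumptions taken from the table's boundary rows.
    """
    if reads < 0 or writes < 0 or (reads == 0 and writes == 0):
        raise ValueError("need non-negative counts and at least one operation")
    # A workload with no writes at all is treated as infinitely read-heavy.
    ratio = float("inf") if writes == 0 else reads / writes
    if ratio >= read_heavy_at:
        return "read-heavy"
    if ratio <= write_heavy_at:
        return "write-heavy"
    return "balanced"


if __name__ == "__main__":
    kind = classify_workload(reads=460_000, writes=4_600)
    print(kind, "->", ARCHITECTURE_HINTS[kind])
```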

5. Scalability, Availability, and Consistency

Scalability Requirements

Scale	Users (MAU)	Typical QPS	Architecture Complexity
Small	< 10K	< 100	Single server, single DB
Medium	10K - 1M	100 - 10K	Load balancer, read replicas, cache
Large	1M - 100M	10K - 1M	Microservices, sharding, CDN, queues
Massive	100M+	1M+	Global distribution, multi-region, custom solutions

Availability Targets

Availability	Downtime/Year	Downtime/Month	Typical Use
99% (two nines)	3.65 days	7.3 hours	Internal tools
99.9% (three nines)	8.76 hours	43.8 minutes	Most SaaS products
99.99% (four nines)	52.6 minutes	4.38 minutes	Payment systems, databases
99.999% (five nines)	5.26 minutes	26.3 seconds	Infrastructure (AWS, GCP core)
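The downtime figures in this table follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a month defined as one twelfth of a year (the function name is hypothetical):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; ignores leap years


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget, in seconds, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_seconds(target)
        per_month = per_year / 12
        print(f"{target}%: {per_year / 3600:.2f} h/year, "
              f"{per_month / 60:.2f} min/month")
```

Each extra nine divides the downtime budget by ten, which is why moving from three nines to four is usually a much larger engineering step than the numbers suggest.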

Consistency Models

Model	Description	Use When
Strong consistency	Every read returns the latest write	Banking, inventory counts
Eventual consistency	Reads may return stale data; eventually converge	Social feeds, analytics, recommendations
Causal consistency	Causally related operations are seen in order	Chat messages (A replies to B, order matters)

6. Back-of-Envelope Estimation

Quick math to sanity-check your design. You don't need exact numbers; the order of magnitude is what matters.

Useful Numbers to Memorize

Quantity	Value
Seconds in a day	~86,400 (round to ~100K)
Seconds in a month	~2.6 million (30 days)
Seconds in a year	~31.5 million (round to ~30M)
1 million seconds	~12 days
1 billion seconds	~31.7 years

Storage Units

Unit	Bytes	Approximate
1 KB	1,000	A short text message
1 MB	1,000,000	A high-res photo
1 GB	1,000,000,000	A short HD movie
1 TB	10^12	A large database
1 PB	10^15	A data warehouse

Quick Estimation Template

  Given: 500M monthly active users

  Step 1: Daily Active Users (DAU)
           DAU = MAU * daily_active_ratio
           DAU = 500M * 0.20 = 100M

  Step 2: Queries Per Second (QPS)
           Daily queries = DAU * actions_per_user
           Daily queries = 100M * 10 = 1B
           QPS = 1B / 86,400 ≈ 11,600 (round to ~12,000 QPS)
           Peak QPS = QPS * 2 ≈ 24,000 QPS (assuming peak is 2x average)

  Step 3: Storage
           Data per action = 250 bytes (e.g., a tweet)
           Daily storage = 1B * 250 bytes = 250 GB/day
           Yearly storage = 250 GB * 365 = ~91 TB/year

  Step 4: Bandwidth
           Outgoing data = QPS * response_size
           = 12,000 * 10KB = 120 MB/s
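The four steps above can be sketched as a small function. This is an illustrative helper, not a standard tool: the names `Estimate` and `estimate` and the parameter names are assumptions made for this example, and, like the template, it assumes every action both stores data and returns a response.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    dau: float
    avg_qps: float
    peak_qps: float
    daily_storage_bytes: float
    yearly_storage_bytes: float
    egress_bytes_per_sec: float


def estimate(mau: int, daily_active_ratio: float, actions_per_user: int,
             bytes_per_action: int, response_bytes: int,
             peak_factor: float = 2.0) -> Estimate:
    """Follow the template: DAU, then QPS, then storage, then bandwidth."""
    dau = mau * daily_active_ratio                    # Step 1
    daily_actions = dau * actions_per_user
    avg_qps = daily_actions / SECONDS_PER_DAY         # Step 2
    daily_storage = daily_actions * bytes_per_action  # Step 3
    return Estimate(
        dau=dau,
        avg_qps=avg_qps,
        peak_qps=avg_qps * peak_factor,
        daily_storage_bytes=daily_storage,
        yearly_storage_bytes=daily_storage * 365,
        egress_bytes_per_sec=avg_qps * response_bytes,  # Step 4
    )


if __name__ == "__main__":
    e = estimate(mau=500_000_000, daily_active_ratio=0.20,
                 actions_per_user=10, bytes_per_action=250,
                 response_bytes=10_000)
    print(f"DAU {e.dau:,.0f}, avg QPS {e.avg_qps:,.0f}, "
          f"peak QPS {e.peak_qps:,.0f}")
    print(f"storage {e.daily_storage_bytes / 1e9:,.0f} GB/day, "
          f"{e.yearly_storage_bytes / 1e12:,.1f} TB/year")
    print(f"egress {e.egress_bytes_per_sec / 1e6:,.0f} MB/s")
```

The unrounded results (about 11,600 QPS and 116 MB/s) differ slightly from the rounded figures in the template, which is fine for an order-of-magnitude estimate.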

7. SLAs (Service Level Agreements)

An SLA is a contract between a service provider and a customer that defines measurable targets.

SLA Component	Example
Availability	99.9% uptime per month
Latency (P99)	99% of requests complete in < 500ms
Throughput	System handles 10,000 requests/sec
Durability	99.999999999% (11 nines) for stored data
Error rate	Less than 0.1% of requests return 5xx errors

SLA vs SLO vs SLI

Term	Meaning	Example
SLI (Service Level Indicator)	The metric you measure	"P99 latency is 350ms"
SLO (Service Level Objective)	The target for that metric	"P99 latency should be < 500ms"
SLA (Service Level Agreement)	The contract with consequences	"If P99 > 500ms for a month, customer gets a credit"

  SLI (what you measure) → SLO (what you target) → SLA (what you promise)
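To make the chain concrete, here is a minimal sketch that computes an SLI (P99 latency) from raw samples and checks it against an SLO. The function names are hypothetical, the 500 ms default mirrors the example SLO in the table, and the nearest-rank percentile used here is one of several common definitions; monitoring systems may interpolate differently.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_slo(latencies_ms: list[float], slo_ms: float = 500.0,
              pct: float = 99.0) -> bool:
    """SLI = measured percentile latency; SLO = the target it must beat."""
    return percentile(latencies_ms, pct) < slo_ms


if __name__ == "__main__":
    # 990 fast requests and 10 slow ones: exactly 1% of requests are slow.
    samples = [100.0] * 990 + [900.0] * 10
    print("P99 (SLI):", percentile(samples, 99.0), "ms")
    print("Meets SLO:", meets_slo(samples))
```

The SLA sits outside the code: it is the agreement about what happens (for example, a service credit) when the SLO check fails over the agreed period.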

8. Interview Approach: Always Clarify First

The 5-Minute Requirements Checklist

When the interviewer says "Design X," run through this checklist:

  ┌──────────────────────────────────────────────────────────┐
  │              REQUIREMENTS CHECKLIST (5 min)                │
  │                                                            │
  │  1. "What are the core features?"                          │
  │     → List 3-5, confirm with interviewer                   │
  │                                                            │
  │  2. "Who are the users?"                                   │
  │     → End users? Internal? Both? Mobile + web?             │
  │                                                            │
  │  3. "What scale are we targeting?"                         │
  │     → 1K users? 1M? 1B? This changes EVERYTHING           │
  │                                                            │
  │  4. "What are the read/write patterns?"                    │
  │     → Read-heavy? Write-heavy? Ratio?                      │
  │                                                            │
  │  5. "What quality attributes matter most?"                 │
  │     → Latency vs consistency? Availability target?         │
  │                                                            │
  │  6. "What's out of scope?"                                 │
  │     → Explicitly exclude features to focus your design     │
  │                                                            │
  └──────────────────────────────────────────────────────────┘

Good vs Bad Requirement Questions

Bad (vague / obvious)	Good (specific / clarifying)
"Should it be fast?"	"Should we optimize for P99 latency under 200ms?"
"Will people use it?"	"Are we designing for 1M or 100M daily active users?"
"Should it store data?"	"Should we guarantee zero data loss, or is losing the last few seconds of writes acceptable?"
"Is security important?"	"Do we need end-to-end encryption, or is TLS in transit sufficient?"

9. Worked Example: "Design Twitter"

Step 1: Clarify Requirements

Functional (P0):

Users can post tweets (text, 280 chars max)
Users can view their home timeline (tweets from people they follow)
Users can follow/unfollow other users

Functional (P1 — if time):

Like tweets, retweet, search, trending topics

Non-functional:

500M MAU, 200M DAU
Timeline loads in < 200ms
99.9% availability
Eventually consistent timelines are acceptable
Read-heavy (100:1 read:write ratio)

Step 2: Quick Estimation

  DAU:           200M
  Tweets/day:    200M * 2 tweets = 400M tweets/day
  Write QPS:     400M / 86,400 ≈ 4,600 writes/sec
  Read QPS:      4,600 * 100 = 460,000 reads/sec (timeline reads)

  Storage/tweet: 280 bytes (text) + 200 bytes (metadata) ≈ 500 bytes
  Daily storage: 400M * 500 bytes = 200 GB/day
  Yearly:        200 GB * 365 ≈ 73 TB/year (text only; media is separate)

  Bandwidth out: 460,000 * 10KB (timeline page) ≈ 4.6 GB/s
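The same arithmetic can be reproduced in a few lines, which makes it easy to see how the results move when an assumption changes. This is a sketch only: the function name is hypothetical, and the default values are simply the assumptions stated in Steps 1 and 2 (2 tweets per user per day, 100:1 read:write ratio, 500 bytes per tweet, 10 KB per timeline page).

```python
SECONDS_PER_DAY = 86_400


def twitter_estimate(dau: int = 200_000_000, tweets_per_user: int = 2,
                     reads_per_write: int = 100, bytes_per_tweet: int = 500,
                     timeline_bytes: int = 10_000) -> dict[str, float]:
    """Derive write/read QPS, text storage, and egress from the inputs."""
    tweets_per_day = dau * tweets_per_user
    write_qps = tweets_per_day / SECONDS_PER_DAY
    read_qps = write_qps * reads_per_write  # scaled by the read:write ratio
    daily_storage_bytes = tweets_per_day * bytes_per_tweet
    return {
        "write_qps": write_qps,
        "read_qps": read_qps,
        "daily_storage_gb": daily_storage_bytes / 1e9,
        "yearly_storage_tb": daily_storage_bytes * 365 / 1e12,
        "egress_gb_per_sec": read_qps * timeline_bytes / 1e9,
    }


if __name__ == "__main__":
    for name, value in twitter_estimate().items():
        print(f"{name}: {value:,.1f}")
```

Halving `reads_per_write` halves both the read QPS and the outgoing bandwidth, which is why the read/write ratio is worth confirming with the interviewer before sizing the cache layer.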

Step 3: Now You Are Ready to Design

With these requirements and numbers in hand, you can make informed decisions:

460K reads/sec means you NEED aggressive caching
99.9% availability means you need redundancy at every layer
Eventual consistency means you can use a push/fan-out model for timelines, since followers may see a new tweet slightly late
73 TB/year means you need to plan for sharding

10. Key Takeaways

Always start with requirements. Spending 5 minutes clarifying scope can save far more time in rework.
Functional requirements define WHAT the system does; non-functional requirements define HOW WELL.
Prioritize features into P0 (must have), P1 (nice to have), and out-of-scope.
The CAP theorem means that, during a network partition, your real choice in distributed systems is CP vs AP.
Back-of-envelope estimation gives you the numbers to justify your architecture decisions (caching, sharding, replication).
SLAs are contracts; SLOs are targets; SLIs are measurements.
Read/write ratio is one of the most important numbers — it drives caching, replication, and database selection.

11. Explain-It Challenge

Without looking back, explain in your own words:

What is the difference between a functional and a non-functional requirement? Give two examples of each for a URL shortener.
Why is the read/write ratio so important for architecture decisions?
Explain the CAP theorem in plain language. Why is "CA" not realistic?
Walk through a back-of-envelope estimation for a system with 10M DAU where each user makes 5 requests per day.
What is the difference between SLI, SLO, and SLA?