Episode 9 — System Design / 9.7 — System Design Foundations

9.7.b — Requirements Analysis

In one sentence: Before drawing a single box on the whiteboard, clarify what the system needs to do (functional requirements) and how well it needs to do it (non-functional requirements); skipping this step is one of the most common mistakes in system design interviews.

Navigation: ← 9.7.a — What Is HLD · 9.7.c — Breaking Into Components →


1. Why Requirements Come First

  WITHOUT REQUIREMENTS:                WITH REQUIREMENTS:

  Interviewer: "Design Twitter"        Interviewer: "Design Twitter"

  Candidate: *starts drawing           Candidate: "Let me clarify scope.
   boxes immediately*                   Should I focus on tweeting and
                                        timelines, or also DMs, search,
  Result:                               and trending? What scale — 100K
  • Builds the wrong thing              or 500M users?"
  • Misses key constraints
  • Runs out of time                   Result:
  • Cannot justify trade-offs          • Focused design
                                        • Clear priorities
                                        • Better trade-off discussions

Rule: Spend the first 5 minutes of any system design interview gathering requirements. It is never wasted time.


2. Functional Requirements (FRs)

Functional requirements describe what the system does — the features and behaviors visible to the user.

How to Identify FRs

Ask yourself (or the interviewer): "What can a user DO with this system?"

  System         Example Functional Requirements
  ──────         ───────────────────────────────
  URL Shortener  Create short URL, redirect to original, custom aliases, link analytics
  Twitter        Post tweet, view home timeline, follow/unfollow, like, retweet, search
  YouTube        Upload video, watch video, search, like/comment, subscribe, recommendations
  WhatsApp       Send message (1:1), group chat, read receipts, media sharing, online status
  Uber           Request ride, match with driver, real-time tracking, payment, rating

FR Prioritization

Not all features are equal. In an interview, you have 45 minutes. Prioritize ruthlessly.

  ┌──────────────────────────────────────────────────────┐
  │              FEATURE PRIORITIZATION                    │
  │                                                        │
  │  MUST HAVE (P0)         NICE TO HAVE (P1)   OUT OF    │
  │  ──────────────         ──────────────────   SCOPE     │
  │                                              ─────     │
  │  • Post tweet           • Search tweets      • Ads     │
  │  • View timeline        • Trending topics    • DMs     │
  │  • Follow users         • Analytics          • Spaces  │
  │                         • Notifications                │
  │                                                        │
  │  Design these FIRST     Mention if time      Explicitly│
  │                         allows               exclude   │
  └──────────────────────────────────────────────────────┘

3. Non-Functional Requirements (NFRs)

Non-functional requirements describe how well the system performs — the quality attributes.

  NFR              Question It Answers                               Typical Target
  ───              ───────────────────                               ──────────────
  Scalability      How many users / requests can it handle?          "Support 500M MAU"
  Availability     What percentage of uptime?                        99.9% (8.76 h downtime/year)
  Latency          How fast does it respond?                         "Timeline loads in < 200ms"
  Consistency      Do all users see the same data at the same time?  "Eventual consistency is OK for timelines"
  Durability       Can data survive failures?                        "Zero data loss for user data"
  Throughput       How many operations per second?                   "Handle 10K writes/sec"
  Security         Who can access what, and is data encrypted?       "All data encrypted at rest and in transit"
  Fault tolerance  What happens when a server/region goes down?      "No single point of failure"

The CAP Theorem Trade-off

A distributed system can guarantee at most two of the following three properties at the same time:

            Consistency
               /\
              /  \
             /    \
            / Pick \
           /  Two   \
          /          \
         /____________\
  Availability ──── Partition
                    Tolerance

  CP: Consistent + Partition-tolerant (sacrifice availability during partitions)
      Examples: HBase, MongoDB (default), ZooKeeper

  AP: Available + Partition-tolerant (sacrifice consistency during partitions)
      Examples: Cassandra, DynamoDB, CouchDB

  CA: Consistent + Available (only possible if no network partitions — not realistic
      in distributed systems)

In practice: Network partitions WILL happen. So the real choice is CP vs AP.
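
The CP versus AP choice can be illustrated with a toy model. This is a minimal sketch, not how any of the databases listed above is implemented: the classes `Replica` and `TwoNodeStore`, the `mode` flag, and the `partitioned` switch are all hypothetical names invented for illustration, and a real CP system would typically keep serving from whichever side still holds a majority.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store. `mode` is "CP" or "AP" (hypothetical labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True: replicas a and b cannot reach each other

    def write(self, key, value):
        """A client connected to replica a writes a value."""
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable: cannot reach other replica")
            self.a.data[key] = value  # AP: accept locally, reconcile later
            return
        self.a.data[key] = value
        self.b.data[key] = value

    def read_from_b(self, key):
        """A second client connected to replica b reads a value."""
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.b.data.get(key)


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write("x", 1)        # replicated to both nodes
    store.partitioned = True   # the network splits
    try:
        store.write("x", 2)
        print(mode, "accepted the write; replica b still reads", store.read_from_b("x"))
    except RuntimeError as err:
        print(mode, "rejected the write:", err)
```

In CP mode the store stops answering during the partition; in AP mode it keeps answering, but replica b returns the stale value 1 until the replicas can sync again.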


4. Identifying Core Features

The "What / Who / How Much" Framework

Use these three questions to nail down scope quickly:

  ┌──────────────────────────────────────────────────────┐
  │                                                        │
  │  1. WHAT does the system do?                           │
  │     → List 3-5 core features (functional requirements) │
  │                                                        │
  │  2. WHO uses it and HOW?                               │
  │     → User types, read vs write patterns, access       │
  │       methods (mobile, web, API)                       │
  │                                                        │
  │  3. HOW MUCH?                                          │
  │     → Number of users, data volume, request rate       │
  │     → This drives ALL scaling decisions                │
  │                                                        │
  └──────────────────────────────────────────────────────┘

Read-Heavy vs Write-Heavy

Understanding the read/write ratio is critical because it drives architecture decisions.

  Pattern      Read:Write              Example                      Architecture Impact
  ───────      ──────────              ───────                      ───────────────────
  Read-heavy   100:1 or more           Twitter timeline, Wikipedia  Caching, read replicas, CDN
  Write-heavy  1:1 or writes dominate  Logging system, IoT sensors  Write-optimized DB, append-only logs, queues
  Balanced     ~10:1                   E-commerce                   Mix of caching and write optimization
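
A quick way to apply the table is to compute the ratio from estimated read and write rates. The sketch below is illustrative only: the function name `classify_workload` is hypothetical, and the cutoffs (100:1 and 1:1) are taken from the table, with everything in between treated as "balanced" as a simplifying assumption.

```python
def classify_workload(reads_per_sec, writes_per_sec):
    """Label a workload by its read:write ratio, using the table's cutoffs."""
    if writes_per_sec <= 0:
        raise ValueError("writes_per_sec must be positive")
    ratio = reads_per_sec / writes_per_sec
    if ratio >= 100:
        return "read-heavy"   # think caching, read replicas, CDN
    if ratio <= 1:
        return "write-heavy"  # think append-only logs, queues
    return "balanced"         # assumption: everything between 1:1 and 100:1


print(classify_workload(460_000, 4_600))  # timeline-style traffic
print(classify_workload(500, 5_000))      # sensor-ingest-style traffic
print(classify_workload(1_000, 100))      # storefront-style traffic
```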

5. Scalability, Availability, and Consistency

Scalability Requirements

  Scale    Users (MAU)  Typical QPS  Architecture Complexity
  ─────    ───────────  ───────────  ───────────────────────
  Small    < 10K        < 100        Single server, single DB
  Medium   10K - 1M     100 - 10K    Load balancer, read replicas, cache
  Large    1M - 100M    10K - 1M     Microservices, sharding, CDN, queues
  Massive  100M+        1M+          Global distribution, multi-region, custom solutions

Availability Targets

  Availability          Downtime/Year  Downtime/Month  Typical Use
  ────────────          ─────────────  ──────────────  ───────────
  99% (two nines)       3.65 days      7.3 hours       Internal tools
  99.9% (three nines)   8.76 hours     43.8 minutes    Most SaaS products
  99.99% (four nines)   52.6 minutes   4.38 minutes    Payment systems, databases
  99.999% (five nines)  5.26 minutes   26.3 seconds    Infrastructure (AWS, GCP core)
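
The downtime figures follow directly from the availability percentage, so there is no need to memorize them. A minimal sketch of the arithmetic, assuming a 365-day year and an average month of one twelfth of a year (the helper name `allowed_downtime_hours` is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24              # 8,760 hours
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours, an average month


def allowed_downtime_hours(availability_pct, period_hours):
    """Hours of downtime permitted in a period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * period_hours


for pct in (99, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year:.2f} h/year, {per_month * 60:.2f} min/month")
```

Each extra nine divides the allowed downtime by ten, which is why moving from three nines to four usually costs far more than moving from two to three.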

Consistency Models

  Model                 Description                                       Use When
  ─────                 ───────────                                       ────────
  Strong consistency    Every read returns the latest write               Banking, inventory counts
  Eventual consistency  Reads may return stale data; eventually converge  Social feeds, analytics, recommendations
  Causal consistency    Causally related operations are seen in order     Chat messages (A replies to B, order matters)

6. Back-of-Envelope Estimation

Quick math to sanity-check your design. You don't need exact numbers — order of magnitude matters.

Useful Numbers to Memorize

  Quantity            Value
  ────────            ─────
  Seconds in a day    ~86,400 (round to ~100K)
  Seconds in a month  ~2.6 million (30 days)
  Seconds in a year   ~31.5 million (round to ~30M)
  1 million seconds   ~12 days
  1 billion seconds   ~31.7 years

Storage Units

  Unit  Bytes          Approximate Example
  ────  ─────          ───────────────────
  1 KB  1,000          A short text message
  1 MB  1,000,000      A high-res photo
  1 GB  1,000,000,000  A short HD movie
  1 TB  10^12          A large database
  1 PB  10^15          A data warehouse

Quick Estimation Template

  Given: 500M monthly active users

  Step 1: Daily Active Users (DAU)
           DAU = MAU * daily_active_ratio
           DAU = 500M * 0.20 = 100M

  Step 2: Queries Per Second (QPS)
           Daily queries = DAU * actions_per_user
           Daily queries = 100M * 10 = 1B
           QPS = 1B / 86,400 ≈ 12,000 QPS
           Peak QPS = QPS * 2 ≈ 24,000 QPS (assuming peak is ~2x average)

  Step 3: Storage
           Data per action = 250 bytes (e.g., a tweet)
           Daily storage = 1B * 250 bytes = 250 GB/day
           Yearly storage = 250 GB * 365 = ~91 TB/year

  Step 4: Bandwidth
           Outgoing data = QPS * response_size
           = 12,000 * 10KB = 120 MB/s
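
The four steps of the template can be wrapped in a small helper so the inputs are easy to vary. This is a sketch under the template's own assumptions (20% daily-active ratio, 10 actions per user, 250 bytes per action, 10 KB responses, peak at twice the average); the names `Estimate` and `estimate` are hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    dau: float
    avg_qps: float
    peak_qps: float
    storage_per_day_bytes: float
    storage_per_year_bytes: float
    egress_bytes_per_sec: float


def estimate(mau, daily_active_ratio, actions_per_user,
             bytes_per_action, response_bytes, peak_factor=2.0):
    """Run the four template steps and return the unrounded results."""
    dau = mau * daily_active_ratio                      # Step 1: DAU
    daily_actions = dau * actions_per_user
    avg_qps = daily_actions / SECONDS_PER_DAY           # Step 2: QPS
    storage_per_day = daily_actions * bytes_per_action  # Step 3: storage
    return Estimate(
        dau=dau,
        avg_qps=avg_qps,
        peak_qps=avg_qps * peak_factor,
        storage_per_day_bytes=storage_per_day,
        storage_per_year_bytes=storage_per_day * 365,
        egress_bytes_per_sec=avg_qps * response_bytes,  # Step 4: bandwidth
    )


e = estimate(mau=500_000_000, daily_active_ratio=0.20, actions_per_user=10,
             bytes_per_action=250, response_bytes=10_000)
print(f"DAU {e.dau:,.0f}, avg QPS {e.avg_qps:,.0f}, peak QPS {e.peak_qps:,.0f}")
print(f"storage {e.storage_per_day_bytes / 1e9:,.0f} GB/day, "
      f"{e.storage_per_year_bytes / 1e12:,.1f} TB/year")
print(f"egress {e.egress_bytes_per_sec / 1e6:,.0f} MB/s")
```

The unrounded average comes out near 11,574 QPS and the egress near 116 MB/s; the template's 12,000 QPS and 120 MB/s are the same numbers after rounding, which is all an order-of-magnitude estimate needs.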

7. SLAs (Service Level Agreements)

An SLA is a contract between a service provider and a customer that defines measurable targets.

  SLA Component  Example
  ─────────────  ───────
  Availability   99.9% uptime per month
  Latency (P99)  99% of requests complete in < 500ms
  Throughput     System handles 10,000 requests/sec
  Durability     99.999999999% (11 nines) for stored data
  Error rate     Less than 0.1% of requests return 5xx errors

SLA vs SLO vs SLI

  Term                           Meaning                         Example
  ────                           ───────                         ───────
  SLI (Service Level Indicator)  The metric you measure          "P99 latency is 350ms"
  SLO (Service Level Objective)  The target for that metric      "P99 latency should be < 500ms"
  SLA (Service Level Agreement)  The contract with consequences  "If P99 > 500ms for a month, customer gets a credit"

  SLI (what you measure) → SLO (what you target) → SLA (what you promise)
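
To make the SLI/SLO relationship concrete, the sketch below computes a P99 latency from raw samples (the SLI) and compares it with a 500 ms target (the SLO). It uses the nearest-rank percentile definition, which is one of several in common use; the function names `percentile` and `meets_slo` and the sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms, slo_ms=500, pct=99):
    """Return (measured SLI, whether it is under the SLO target)."""
    sli = percentile(latencies_ms, pct)
    return sli, sli < slo_ms


healthy = [120] * 95 + [350] * 5   # slowest 5% of requests take 350 ms
degraded = [120] * 98 + [900] * 2  # slowest 2% of requests take 900 ms

print(meets_slo(healthy))
print(meets_slo(degraded))
```

The degraded sample fails the check even though 98% of its requests are fast, which is the reason tail percentiles are preferred over averages for latency targets.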

8. Interview Approach: Always Clarify First

The 5-Minute Requirements Checklist

When the interviewer says "Design X," run through this checklist:

  ┌──────────────────────────────────────────────────────────┐
  │              REQUIREMENTS CHECKLIST (5 min)                │
  │                                                            │
  │  1. "What are the core features?"                          │
  │     → List 3-5, confirm with interviewer                   │
  │                                                            │
  │  2. "Who are the users?"                                   │
  │     → End users? Internal? Both? Mobile + web?             │
  │                                                            │
  │  3. "What scale are we targeting?"                         │
  │     → 1K users? 1M? 1B? This changes EVERYTHING           │
  │                                                            │
  │  4. "What are the read/write patterns?"                    │
  │     → Read-heavy? Write-heavy? Ratio?                      │
  │                                                            │
  │  5. "What quality attributes matter most?"                 │
  │     → Latency vs consistency? Availability target?         │
  │                                                            │
  │  6. "What's out of scope?"                                 │
  │     → Explicitly exclude features to focus your design     │
  │                                                            │
  └──────────────────────────────────────────────────────────┘

Good vs Bad Requirement Questions

  Bad (vague / obvious)     Good (specific / clarifying)
  ─────────────────────     ────────────────────────────
  "Should it be fast?"      "Should we optimize for P99 latency under 200ms?"
  "Will people use it?"     "Are we designing for 1M or 100M daily active users?"
  "Should it store data?"   "Should we guarantee zero data loss, or can we tolerate losing the last few seconds of writes after a crash?"
  "Is security important?"  "Do we need end-to-end encryption, or is TLS in transit sufficient?"

9. Worked Example: "Design Twitter"

Step 1: Clarify Requirements

Functional (P0):

  • Users can post tweets (text, 280 chars max)
  • Users can view their home timeline (tweets from people they follow)
  • Users can follow/unfollow other users

Functional (P1 — if time):

  • Like tweets, retweet, search, trending topics

Non-functional:

  • 500M MAU, 200M DAU
  • Timeline loads in < 200ms
  • 99.9% availability
  • Eventually consistent timelines are acceptable
  • Read-heavy (100:1 read:write ratio)

Step 2: Quick Estimation

  DAU:           200M
  Tweets/day:    200M * 2 tweets = 400M tweets/day
  Write QPS:     400M / 86,400 ≈ 4,600 writes/sec
  Read QPS:      4,600 * 100 = 460,000 reads/sec (timeline reads)

  Storage/tweet: 280 bytes (text) + 200 bytes (metadata) ≈ 500 bytes
  Daily storage: 400M * 500 bytes = 200 GB/day
  Yearly:        200 GB * 365 ≈ 73 TB/year (text only; media is separate)

  Bandwidth out: 460,000 * 10KB (timeline page) ≈ 4.6 GB/s
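
The same arithmetic can be checked in a few lines. This sketch uses only the inputs stated above (200M DAU, 2 tweets per user per day, a 100:1 read:write ratio, about 500 bytes per tweet, 10 KB per timeline page); the function name `twitter_estimate` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def twitter_estimate(dau=200_000_000, tweets_per_user=2, read_write_ratio=100,
                     bytes_per_tweet=500, timeline_page_bytes=10_000):
    """Derive write/read QPS, text storage, and egress from the stated inputs."""
    tweets_per_day = dau * tweets_per_user
    write_qps = tweets_per_day / SECONDS_PER_DAY
    read_qps = write_qps * read_write_ratio  # reads derived from the ratio
    storage_per_day = tweets_per_day * bytes_per_tweet
    return {
        "tweets_per_day": tweets_per_day,
        "write_qps": write_qps,
        "read_qps": read_qps,
        "storage_gb_per_day": storage_per_day / 1e9,
        "storage_tb_per_year": storage_per_day * 365 / 1e12,
        "egress_gb_per_sec": read_qps * timeline_page_bytes / 1e9,
    }


for name, value in twitter_estimate().items():
    print(f"{name}: {value:,.1f}")
```

Unrounded, the write rate is about 4,630 per second and the read rate about 463,000 per second, within 1% of the rounded figures used in the walkthrough.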

Step 3: Now You Are Ready to Design

With these requirements and numbers in hand, you can make informed decisions:

  • 460K reads/sec means you NEED aggressive caching
  • 99.9% availability means you need redundancy at every layer
  • Eventual consistency means you can use a push/fan-out model for timelines
  • 73 TB/year means you need to plan for sharding

10. Key Takeaways

  1. Always start with requirements. Spending 5 minutes clarifying scope saves 15 minutes of rework.
  2. Functional requirements define WHAT the system does; non-functional requirements define HOW WELL.
  3. Prioritize features into P0 (must have), P1 (nice to have), and out-of-scope.
  4. The CAP theorem means your real choice in distributed systems is CP vs AP.
  5. Back-of-envelope estimation gives you the numbers to justify your architecture decisions (caching, sharding, replication).
  6. SLAs are contracts; SLOs are targets; SLIs are measurements.
  7. Read/write ratio is one of the most important numbers — it drives caching, replication, and database selection.

11. Explain-It Challenge

Without looking back, explain in your own words:

  1. What is the difference between a functional and a non-functional requirement? Give two examples of each for a URL shortener.
  2. Why is the read/write ratio so important for architecture decisions?
  3. Explain the CAP theorem in plain language. Why is "CA" not realistic?
  4. Walk through a back-of-envelope estimation for a system with 10M DAU where each user makes 5 requests per day.
  5. What is the difference between SLI, SLO, and SLA?
