Episode 9 — System Design / 9.7 — System Design Foundations

9.7.d — Capacity Estimation

In one sentence: Capacity estimation is the art of converting "we have X million users" into concrete numbers for QPS, storage, bandwidth, and memory — the math that turns hand-waving into engineering decisions.

Navigation: ← 9.7.c — Breaking Into Components · 9.7.e — Interview Approach →


Table of Contents

  1. Why Estimation Matters
  2. The Estimation Pipeline
  3. Users to QPS
  4. Storage Estimation
  5. Bandwidth Estimation
  6. Memory Estimation (Caching)
  7. Rules of Thumb
  8. Worked Example: Twitter
  9. Worked Example: YouTube
  10. Worked Example: URL Shortener
  11. Common Powers of 2
  12. Key Takeaways
  13. Explain-It Challenge

1. Why Estimation Matters

  WITHOUT ESTIMATION:                    WITH ESTIMATION:

  "We'll use a database"                "We need to handle 50K QPS reads
                                         and 5K QPS writes. A single
  "We'll add caching"                    PostgreSQL instance handles ~10K
                                         QPS. So we need read replicas.
  "It should scale"
                                         Cache hit ratio of 80% reduces
                                         DB reads to 10K QPS — one primary
                                         + 2 replicas will work."

Estimation gives you numbers to justify decisions. In an interview, it shows the interviewer you can think quantitatively about systems, not just draw boxes.


2. The Estimation Pipeline

Every estimation follows the same flow:

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  Users   │────►│   QPS    │────►│ Storage  │────►│ Bandwidth│
  │  (MAU,   │     │ (reads,  │     │ (per day,│     │ (ingress,│
  │   DAU)   │     │  writes) │     │  year)   │     │  egress) │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
       │                                                   │
       └──────────────────────────────────────────────────►│
                                                     ┌─────▼─────┐
                                                     │  Memory   │
                                                     │ (cache    │
                                                     │  sizing)  │
                                                     └───────────┘

3. Users to QPS

Step-by-Step Formula

  Step 1: Monthly Active Users (MAU)
          Given by the problem (e.g., 500M)

  Step 2: Daily Active Users (DAU)
          DAU = MAU * daily_active_fraction
          Typical: 20-50% for social apps
          Example: 500M * 0.20 = 100M DAU

  Step 3: Actions per User per Day
          Varies by app:
          • Social media: 5-20 reads, 1-3 writes
          • Messaging: 20-50 messages sent
          • E-commerce: 3-10 product views, 0.1 purchases

  Step 4: Daily Total Actions
          = DAU * actions_per_user
          = 100M * 10 reads = 1 Billion reads/day

  Step 5: QPS (Queries Per Second)
          = daily_actions / seconds_per_day
          = 1,000,000,000 / 86,400
          ≈ 11,574 QPS
          Round to: ~12K QPS

  Step 6: Peak QPS
          = QPS * peak_multiplier (typically 2x-5x)
          = 12K * 3 = ~36K peak QPS
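
The six steps above can be sketched in a few lines of Python; `users_to_peak_qps` is a hypothetical helper written for this walkthrough, with the illustrative fractions and multipliers passed in rather than fixed:

```python
SECONDS_PER_DAY = 86_400

def users_to_peak_qps(mau, daily_active_fraction, actions_per_user, peak_multiplier):
    """Walk steps 1-6: MAU -> DAU -> daily actions -> average QPS -> peak QPS."""
    dau = mau * daily_active_fraction           # Step 2
    daily_actions = dau * actions_per_user      # Step 4
    avg_qps = daily_actions / SECONDS_PER_DAY   # Step 5
    return avg_qps, avg_qps * peak_multiplier   # Step 6

# The running example: 500M MAU, 20% daily active, 10 reads/user, 3x peak
avg, peak = users_to_peak_qps(500e6, 0.20, 10, 3)
print(f"average: {avg:,.0f} QPS, peak: {peak:,.0f} QPS")
# average: 11,574 QPS, peak: 34,722 QPS
```

The worked example rounds 11,574 up to ~12K before applying the 3x multiplier, which is why it quotes ~36K peak; at this precision either answer is fine.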

QPS Reference Table

  Scale     DAU     QPS (average)   Peak QPS   Server Needs
  Startup   10K     ~1              ~5         1 server
  Growing   1M      ~100            ~500       2-5 servers
  Large     100M    ~12K            ~36K       50+ servers, caching, LB
  Massive   1B      ~120K           ~500K      Thousands of servers, global CDN

4. Storage Estimation

Formula

  Daily storage = daily_writes * data_per_write

  Yearly storage = daily_storage * 365

  5-year storage = yearly_storage * 5  (plan for growth)

Data Size Reference

  Data Type                         Typical Size
  Tweet / short text                250-500 bytes
  User profile (metadata)           1-2 KB
  Database row (typical)            100 bytes - 1 KB
  Thumbnail image                   10-50 KB
  Photo (compressed)                200 KB - 2 MB
  Short video (1 min, compressed)   5-50 MB
  Long video (1 hour, HD)           1-5 GB
  Log entry                         100-500 bytes

Example: Tweet Storage

  Given:
    400M tweets/day
    Each tweet: 280 chars * 2 bytes/char = 560 bytes text
              + 200 bytes metadata (user_id, timestamp, etc.)
              = ~750 bytes per tweet

  Daily:   400M * 750 bytes = 300 GB/day
  Monthly: 300 GB * 30 = 9 TB/month
  Yearly:  300 GB * 365 ≈ 110 TB/year
  5 years: 110 TB * 5 = 550 TB

  With replication factor of 3: 550 TB * 3 = 1.65 PB
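
The arithmetic above can be checked with a short sketch (decimal units, 1 GB = 10^9 bytes, matching how the text rounds):

```python
def storage_estimate(writes_per_day, bytes_per_write, years=5, replication=3):
    """Daily, yearly, and replicated multi-year storage, all in bytes."""
    daily = writes_per_day * bytes_per_write
    yearly = daily * 365
    return daily, yearly, yearly * years * replication

daily, yearly, total = storage_estimate(400e6, 750)
print(f"daily:   {daily / 1e9:,.0f} GB")   # 300 GB
print(f"yearly:  {yearly / 1e12:.1f} TB")  # 109.5 TB (~110 TB)
print(f"5yr, x3: {total / 1e15:.2f} PB")   # 1.64 PB
```

The text's 1.65 PB comes from rounding yearly storage to 110 TB before multiplying; again, only the order of magnitude matters.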

5. Bandwidth Estimation

Ingress (Data Coming In)

  Ingress = write_QPS * average_write_size

  Example:
    Write QPS: 5,000
    Average write size: 750 bytes (tweet)
    Ingress = 5,000 * 750 = 3.75 MB/s

Egress (Data Going Out)

  Egress = read_QPS * average_response_size

  Example:
    Read QPS: 50,000
    Average response (timeline page): 50 tweets * 750 bytes = ~37.5 KB
    Egress = 50,000 * 37.5 KB = 1.875 GB/s

  With media (images):
    Each timeline page has ~10 images * 200KB = 2 MB
    Egress_media = 50,000 * 2 MB = 100 GB/s
    (This is why CDNs are critical!)
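
Both formulas are one multiplication each; here they are applied to the three cases above as a sketch (decimal units):

```python
def bandwidth_bps(qps, avg_payload_bytes):
    """Sustained bytes per second for a given request rate and payload size."""
    return qps * avg_payload_bytes

ingress      = bandwidth_bps(5_000, 750)      # tweets being written
egress_text  = bandwidth_bps(50_000, 37_500)  # 50-tweet timeline pages
egress_media = bandwidth_bps(50_000, 2e6)     # ~2 MB of images per page

print(f"ingress:      {ingress / 1e6:.2f} MB/s")      # 3.75 MB/s
print(f"egress text:  {egress_text / 1e9:.3f} GB/s")  # 1.875 GB/s
print(f"egress media: {egress_media / 1e9:.0f} GB/s") # 100 GB/s
```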

Bandwidth Quick Reference

  Unit       Value       Example
  1 Mbps     125 KB/s    A slow API response
  100 Mbps   12.5 MB/s   A busy web server
  1 Gbps     125 MB/s    A database server
  10 Gbps    1.25 GB/s   CDN edge node
  100 Gbps   12.5 GB/s   Large-scale CDN total

6. Memory Estimation (Caching)

The 80-20 Rule (Pareto Principle)

In most systems, 20% of the data serves 80% of the requests. Cache that 20%.

Cache Sizing Formula

  Cache size = daily_reads * data_per_read * cache_fraction

  Where cache_fraction = fraction of data to keep in memory
  (typically 20% of daily unique data)

Example: Twitter Timeline Cache

  Given:
    Read QPS: 50,000
    Daily read requests: 50,000 * 86,400 = 4.3 Billion reads/day
    Average read response: 37.5 KB (50 tweets on a timeline page)

  But many reads hit the SAME timelines (popular users).
  Unique timelines accessed/day: ~100M (not 4.3B)

  Cache the hottest 20%:
    = 100M * 0.20 * 37.5 KB
    = 20M * 37.5 KB
    = 750 GB

  With overhead (metadata, hash table, fragmentation):
    ~750 GB * 1.3 ≈ ~1 TB of Redis memory

  Redis instances:
    Each Redis instance: ~64 GB usable memory
    Instances needed: 1,000 / 64 ≈ 16 Redis instances
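
The sizing steps above, sketched as one function; the 1.3 overhead factor and 64 GB of usable memory per instance are the same assumptions the example uses:

```python
import math

def redis_cache_plan(unique_items_per_day, hot_fraction, item_bytes,
                     overhead=1.3, instance_gb=64):
    """Hot-set size in GB, size with overhead, and instance count."""
    raw_gb = unique_items_per_day * hot_fraction * item_bytes / 1e9
    total_gb = raw_gb * overhead
    return raw_gb, total_gb, math.ceil(total_gb / instance_gb)

raw, total, n = redis_cache_plan(100e6, 0.20, 37_500)
print(f"hot set: {raw:.0f} GB -> with overhead: {total:.0f} GB -> {n} instances")
# hot set: 750 GB -> with overhead: 975 GB -> 16 instances
```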

Cache Hit Ratio Impact

  Cache Hit Ratio   DB Reads (at 50K total read QPS)   Impact
  0% (no cache)     50,000 QPS to DB                   DB overloaded
  50%               25,000 QPS to DB                   Still heavy
  80%               10,000 QPS to DB                   Manageable
  95%               2,500 QPS to DB                    Comfortable
  99%               500 QPS to DB                      DB barely touched
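
The whole table is one line of arithmetic: DB load equals the miss rate times total read QPS.

```python
total_read_qps = 50_000

for hit_ratio in (0.00, 0.50, 0.80, 0.95, 0.99):
    db_qps = total_read_qps * (1 - hit_ratio)   # only cache misses reach the DB
    print(f"hit ratio {hit_ratio:>4.0%} -> {db_qps:>8,.0f} QPS to DB")
```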

7. Rules of Thumb

The "1 Million Users" Rule of Thumb

  Metric                          Rough Value
  1M MAU → DAU                    ~200K - 500K (20-50%)
  DAU → QPS                       DAU * actions / 86,400
  Peak QPS                        2x - 5x average QPS
  Read:Write ratio (social)       100:1
  Read:Write ratio (messaging)    1:1 (but bursty reads)
  Read:Write ratio (e-commerce)   10:1

Hardware Rules of Thumb

  Resource        Capacity (per machine)
  Web server      1K - 10K QPS (depends on logic complexity)
  PostgreSQL      5K - 20K QPS (with connection pooling)
  MySQL           5K - 20K QPS
  Redis           100K - 500K QPS
  Elasticsearch   1K - 10K QPS (depends on query complexity)
  Kafka broker    100K - 1M messages/sec

Latency Rules of Thumb

  Operation                       Latency
  L1 cache reference              ~1 ns
  L2 cache reference              ~4 ns
  RAM reference                   ~100 ns
  SSD random read                 ~100 µs
  HDD random read                 ~10 ms
  Same datacenter round trip      ~0.5 ms
  Cross-continent round trip      ~100-150 ms
  Redis GET                       ~0.5-1 ms
  PostgreSQL simple query         ~1-5 ms
  Send 1 MB over 1 Gbps network   ~10 ms

8. Worked Example: Twitter

Given

  • 500M MAU, 200M DAU
  • Average user reads timeline 10 times/day, posts 2 tweets/day
  • Each tweet: ~500 bytes; timeline page: 50 tweets
  • 10% of tweets include a photo (~200 KB)

QPS Calculation

  READ QPS:
    Timeline reads: 200M * 10 / 86,400 = ~23,000 QPS
    Peak: 23,000 * 3 = ~70,000 QPS

  WRITE QPS:
    New tweets: 200M * 2 / 86,400 = ~4,600 QPS
    Peak: 4,600 * 3 = ~14,000 QPS

Storage

  TEXT:
    Daily tweets: 400M * 500 bytes = 200 GB/day
    Yearly: 200 GB * 365 = 73 TB/year

  MEDIA:
    Tweets with photos: 400M * 10% = 40M photos/day
    40M * 200 KB = 8 TB/day
    Yearly: 8 TB * 365 = 2.9 PB/year

    (This is why Twitter uses a CDN and object storage like S3)

Bandwidth

  EGRESS (text):
    23,000 QPS * 25 KB (50 tweets * 500 bytes) = 575 MB/s

  EGRESS (media):
    If 30% of timeline reads load images:
    7,000 QPS * 10 images * 200 KB = 14 GB/s
    (Served mostly by CDN, not origin servers)

Cache

  Cache the hottest 20% of timelines:
    Daily unique timelines: ~100M
    Hot 20%: 20M timelines * 25 KB = 500 GB
    With overhead: ~650 GB ≈ ~10 Redis instances (64 GB each)

9. Worked Example: YouTube

Given

  • 2B MAU, 500M DAU
  • Average user watches 5 videos/day
  • Average video: 10 min, 50 MB (compressed, 720p)
  • 500K new videos uploaded per day

QPS Calculation

  VIDEO VIEW QPS:
    500M * 5 / 86,400 = ~29,000 QPS (video plays started)
    Peak: 29,000 * 3 = ~87,000 QPS

  UPLOAD QPS:
    500K / 86,400 = ~6 uploads/sec (uploads are slow, heavy operations)

  SEARCH QPS:
    If 50% of users search once/day:
    250M / 86,400 = ~2,900 QPS

Storage

  VIDEO STORAGE:
    New videos/day: 500K
    Average raw upload: 500 MB (before processing)
    Daily raw: 500K * 500 MB = 250 TB/day

    Each video is transcoded to multiple resolutions:
    360p + 480p + 720p + 1080p = 4 stored renditions ≈ 4x the 50 MB base size
    Daily processed: 500K * 50 MB * 4 resolutions = 100 TB/day

    Yearly: 100 TB * 365 = 36.5 PB/year

  METADATA:
    Video metadata: ~10 KB per video
    500K * 10 KB = 5 GB/day (negligible compared to video)

Bandwidth

  EGRESS:
    Concurrent viewers (peak): ~5M users streaming simultaneously
    Average bitrate: 5 Mbps (720p)
    Total egress: 5M * 5 Mbps = 25 Tbps

    (This is why YouTube uses a massive global CDN)
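
Peak egress is just concurrent streams times per-stream bitrate; a sketch using the example's assumed 5M concurrent viewers and 5 Mbps streams:

```python
concurrent_viewers = 5e6
bitrate_mbps = 5                 # ~720p stream

egress_mbps = concurrent_viewers * bitrate_mbps
egress_tbps = egress_mbps / 1e6  # Mbps -> Tbps
print(f"peak egress: {egress_tbps:.0f} Tbps")  # 25 Tbps
```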

Cache

  Video metadata cache:
    Total videos: ~1B (historical)
    Hot 1%: 10M videos * 10 KB = 100 GB (fits in a few Redis nodes)

  Video content cache (CDN):
    Hot 10% of videos get 90% of views
    CDN caches these at edge locations worldwide

10. Worked Example: URL Shortener

Given

  • 100M new URLs shortened per month
  • Read:Write ratio = 100:1
  • Short URL: 7 characters
  • Keep URLs for 5 years

QPS Calculation

  WRITE QPS:
    100M / (30 * 86,400) = ~39 writes/sec
    Peak: 39 * 3 = ~120 writes/sec
    (Very low — a single server can handle this easily)

  READ QPS (redirects):
    39 * 100 = ~3,900 reads/sec
    Peak: 3,900 * 3 = ~12,000 reads/sec

Storage

  Per URL record:
    short_url: 7 bytes
    long_url: ~200 bytes (average)
    created_at: 8 bytes
    user_id: 8 bytes
    Total: ~250 bytes

  Monthly: 100M * 250 bytes = 25 GB/month
  Yearly: 25 GB * 12 = 300 GB/year
  5 years: 300 GB * 5 = 1.5 TB

  Total records in 5 years: 100M * 12 * 5 = 6 Billion URLs

Unique Short URL Space

  7-character short URL using [a-z, A-Z, 0-9] = 62 characters
  62^7 = 3.5 Trillion possible combinations
  We need 6 Billion over 5 years
  Keyspace utilization: 6B / 3.5T ≈ 0.17% — only a tiny fraction of the
  space is ever used, so collisions are easy to avoid
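
The keyspace math above is easy to check exactly:

```python
ALPHABET_SIZE = 62               # a-z, A-Z, 0-9

space = ALPHABET_SIZE ** 7       # all possible 7-character codes
needed = 100_000_000 * 12 * 5    # 100M URLs/month for 5 years

print(f"62^7        = {space:,}")            # 3,521,614,606,208
print(f"needed      = {needed:,}")           # 6,000,000,000
print(f"utilization = {needed / space:.2%}") # 0.17%
```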

Cache

  Apply 80-20 rule:
    20% of URLs get 80% of traffic
    Daily unique URLs accessed: ~100M (estimate)
    Hot 20%: 20M * 250 bytes = 5 GB

  This easily fits in a single Redis instance!
  Expected cache hit ratio: 90%+
  DB reads with cache: 12,000 * 10% = 1,200 QPS (one DB handles this)

11. Common Powers of 2

Memorize these for quick mental math in interviews.

  Power   Value                   Approximate             Common Use
  2^10    1,024                   ~1 Thousand (1 KB)      Kilo
  2^20    1,048,576               ~1 Million (1 MB)       Mega
  2^30    1,073,741,824           ~1 Billion (1 GB)       Giga
  2^40    1,099,511,627,776       ~1 Trillion (1 TB)      Tera
  2^50    1,125,899,906,842,624   ~1 Quadrillion (1 PB)   Peta

Quick Conversion Tricks

  1 day     ≈ 86,400 sec    ≈ 10^5 sec (use for quick division)
  1 month   ≈ 2.5M sec      ≈ 2.5 * 10^6
  1 year    ≈ 30M sec       ≈ 3 * 10^7

  1 M requests/day   ≈ 12 QPS
  10M requests/day   ≈ 120 QPS
  100M requests/day  ≈ 1,200 QPS
  1B requests/day    ≈ 12,000 QPS
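
Those shortcuts come straight from dividing by 86,400:

```python
SECONDS_PER_DAY = 86_400

for daily in (1e6, 10e6, 100e6, 1e9):
    qps = daily / SECONDS_PER_DAY
    print(f"{daily:>13,.0f} req/day ≈ {qps:>6,.0f} QPS")
```

The exact values (11.6, 116, 1,157, and 11,574 QPS) are what the table loosely rounds to 12, 120, 1,200, and 12,000 — good enough for mental math, since only the order of magnitude matters.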

12. Key Takeaways

  1. Estimation is a pipeline: Users → DAU → QPS → Storage → Bandwidth → Cache sizing.
  2. You don't need exact numbers. Order of magnitude is what matters. Is it 1 TB or 1 PB? Is it 1K QPS or 1M QPS?
  3. Peak QPS is 2-5x average QPS. Always design for peak, not average.
  4. The 80-20 rule is your friend for cache sizing: 20% of data serves 80% of traffic.
  5. Memorize key constants: seconds/day (~86K), 62^7 (~3.5T), read:write ratios for common systems.
  6. Media dominates storage and bandwidth. A system that stores images/videos needs orders of magnitude more storage than one that stores only text.
  7. Show your work. In an interview, the process matters more than the exact number.

13. Explain-It Challenge

Without looking back, work through these:

  1. A photo-sharing app has 50M DAU. Each user uploads 1 photo/day (2 MB each) and views 20 photos/day. Calculate: write QPS, read QPS, daily storage, daily egress bandwidth.
  2. A chat application has 100M DAU. Each user sends 40 messages/day (average 100 bytes each). How much storage is needed per year? How much cache for the hottest 20%?
  3. Why is peak QPS more important than average QPS when sizing infrastructure?
  4. A URL shortener receives 200M new URLs per month. What is the minimum character length needed for the short code (using base62) to last 10 years without collisions?
  5. Explain the 80-20 rule and how it applies to cache sizing with a concrete example.

Navigation: ← 9.7.c — Breaking Into Components · 9.7.e — Interview Approach →