Episode 9 — System Design / 9.7 — System Design Foundations

9.7.d — Capacity Estimation

In one sentence: Capacity estimation is the art of converting "we have X million users" into concrete numbers for QPS, storage, bandwidth, and memory — the math that turns hand-waving into engineering decisions.

Navigation: ← 9.7.c — Breaking Into Components · 9.7.e — Interview Approach →


Table of Contents

  1. Why Estimation Matters
  2. The Estimation Pipeline
  3. Users to QPS
  4. Storage Estimation
  5. Bandwidth Estimation
  6. Memory Estimation (Caching)
  7. Rules of Thumb
  8. Worked Example: Twitter
  9. Worked Example: YouTube
  10. Worked Example: URL Shortener
  11. Common Powers of 2
  12. Key Takeaways
  13. Explain-It Challenge

1. Why Estimation Matters

  WITHOUT ESTIMATION:                    WITH ESTIMATION:

  "We'll use a database"                "We need to handle 50K QPS reads
                                         and 5K QPS writes. A single
  "We'll add caching"                    PostgreSQL instance handles ~10K
                                         QPS. So we need read replicas.
  "It should scale"
                                         Cache hit ratio of 80% reduces
                                         DB reads to 10K QPS — one primary
                                         + 2 replicas will work."

Estimation gives you numbers to justify decisions. In an interview, it shows the interviewer you can think quantitatively about systems, not just draw boxes.


2. The Estimation Pipeline

Every estimation follows the same flow:

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  Users   │────►│   QPS    │────►│ Storage  │────►│ Bandwidth│
  │  (MAU,   │     │ (reads,  │     │ (per day,│     │ (ingress,│
  │   DAU)   │     │  writes) │     │  year)   │     │  egress) │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
       │                                                   │
       └──────────────────────────────────────────────────►│
                                                     ┌─────▼─────┐
                                                     │  Memory   │
                                                     │ (cache    │
                                                     │  sizing)  │
                                                     └───────────┘

3. Users to QPS

Step-by-Step Formula

  Step 1: Monthly Active Users (MAU)
          Given by the problem (e.g., 500M)

  Step 2: Daily Active Users (DAU)
          DAU = MAU * daily_active_fraction
          Typical: 20-50% for social apps
          Example: 500M * 0.20 = 100M DAU

  Step 3: Actions per User per Day
          Varies by app:
          • Social media: 5-20 reads, 1-3 writes
          • Messaging: 20-50 messages sent
          • E-commerce: 3-10 product views, 0.1 purchases

  Step 4: Daily Total Actions
          = DAU * actions_per_user
          = 100M * 10 reads = 1 Billion reads/day

  Step 5: QPS (Queries Per Second)
          = daily_actions / seconds_per_day
          = 1,000,000,000 / 86,400
          ≈ 11,574 QPS
          Round to: ~12K QPS

  Step 6: Peak QPS
          = QPS * peak_multiplier (typically 2x-5x)
          = 12K * 3 = ~36K peak QPS
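
The six steps above can be sketched in a few lines of Python; `users_to_peak_qps` is a hypothetical helper written for this walkthrough, with the illustrative fractions and multipliers passed in rather than fixed:

```python
SECONDS_PER_DAY = 86_400

def users_to_peak_qps(mau, daily_active_fraction, actions_per_user, peak_multiplier):
    """Walk steps 1-6: MAU -> DAU -> daily actions -> average QPS -> peak QPS."""
    dau = mau * daily_active_fraction           # Step 2
    daily_actions = dau * actions_per_user      # Step 4
    avg_qps = daily_actions / SECONDS_PER_DAY   # Step 5
    return avg_qps, avg_qps * peak_multiplier   # Step 6

# The running example: 500M MAU, 20% daily active, 10 reads/user, 3x peak
avg, peak = users_to_peak_qps(500e6, 0.20, 10, 3)
print(f"average: {avg:,.0f} QPS, peak: {peak:,.0f} QPS")
# average: 11,574 QPS, peak: 34,722 QPS
```

The worked example rounds 11,574 up to ~12K before applying the 3x multiplier, which is why it quotes ~36K peak; at this precision either answer is fine.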

QPS Reference Table

  Scale     DAU     QPS (average)   Peak QPS   Server Needs
  Startup   10K     ~1              ~5         1 server
  Growing   1M      ~100            ~500       2-5 servers
  Large     100M    ~12K            ~36K       50+ servers, caching, LB
  Massive   1B      ~120K           ~500K      Thousands of servers, global CDN

4. Storage Estimation

Formula

  Daily storage = daily_writes * data_per_write

  Yearly storage = daily_storage * 365

  5-year storage = yearly_storage * 5  (plan for growth)

Data Size Reference

  Data Type                         Typical Size
  Tweet / short text                250-500 bytes
  User profile (metadata)           1-2 KB
  Database row (typical)            100 bytes - 1 KB
  Thumbnail image                   10-50 KB
  Photo (compressed)                200 KB - 2 MB
  Short video (1 min, compressed)   5-50 MB
  Long video (1 hour, HD)           1-5 GB
  Log entry                         100-500 bytes

Example: Tweet Storage

  Given:
    400M tweets/day
    Each tweet: 280 chars * 2 bytes/char = 560 bytes text
              + 200 bytes metadata (user_id, timestamp, etc.)
              = ~750 bytes per tweet

  Daily:   400M * 750 bytes = 300 GB/day
  Monthly: 300 GB * 30 = 9 TB/month
  Yearly:  300 GB * 365 ≈ 110 TB/year
  5 years: 110 TB * 5 = 550 TB

  With replication factor of 3: 550 TB * 3 = 1.65 PB
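
The arithmetic above can be checked with a short sketch (decimal units, 1 GB = 10^9 bytes, matching how the text rounds):

```python
def storage_estimate(writes_per_day, bytes_per_write, years=5, replication=3):
    """Daily, yearly, and replicated multi-year storage, all in bytes."""
    daily = writes_per_day * bytes_per_write
    yearly = daily * 365
    return daily, yearly, yearly * years * replication

daily, yearly, total = storage_estimate(400e6, 750)
print(f"daily:   {daily / 1e9:,.0f} GB")   # 300 GB
print(f"yearly:  {yearly / 1e12:.1f} TB")  # 109.5 TB (~110 TB)
print(f"5yr, x3: {total / 1e15:.2f} PB")   # 1.64 PB
```

The text's 1.65 PB comes from rounding yearly storage to 110 TB before multiplying; again, only the order of magnitude matters.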

5. Bandwidth Estimation

Ingress (Data Coming In)

  Ingress = write_QPS * average_write_size

  Example:
    Write QPS: 5,000
    Average write size: 750 bytes (tweet)
    Ingress = 5,000 * 750 = 3.75 MB/s

Egress (Data Going Out)

  Egress = read_QPS * average_response_size

  Example:
    Read QPS: 50,000
    Average response (timeline page): 50 tweets * 750 bytes = ~37.5 KB
    Egress = 50,000 * 37.5 KB = 1.875 GB/s

  With media (images):
    Each timeline page has ~10 images * 200KB = 2 MB
    Egress_media = 50,000 * 2 MB = 100 GB/s
    (This is why CDNs are critical!)
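
Both formulas are one multiplication each; here they are applied to the three cases above as a sketch (decimal units):

```python
def bandwidth_bps(qps, avg_payload_bytes):
    """Sustained bytes per second for a given request rate and payload size."""
    return qps * avg_payload_bytes

ingress      = bandwidth_bps(5_000, 750)      # tweets being written
egress_text  = bandwidth_bps(50_000, 37_500)  # 50-tweet timeline pages
egress_media = bandwidth_bps(50_000, 2e6)     # ~2 MB of images per page

print(f"ingress:      {ingress / 1e6:.2f} MB/s")      # 3.75 MB/s
print(f"egress text:  {egress_text / 1e9:.3f} GB/s")  # 1.875 GB/s
print(f"egress media: {egress_media / 1e9:.0f} GB/s") # 100 GB/s
```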

Bandwidth Quick Reference

  Unit       Value       Example
  1 Mbps     125 KB/s    A slow API response
  100 Mbps   12.5 MB/s   A busy web server
  1 Gbps     125 MB/s    A database server
  10 Gbps    1.25 GB/s   CDN edge node
  100 Gbps   12.5 GB/s   Large-scale CDN total

6. Memory Estimation (Caching)

The 80-20 Rule (Pareto Principle)

In most systems, 20% of the data serves 80% of the requests. Cache that 20%.

Cache Sizing Formula

  Cache size = daily_reads * data_per_read * cache_fraction

  Where cache_fraction = fraction of data to keep in memory
  (typically 20% of daily unique data)

Example: Twitter Timeline Cache

  Given:
    Read QPS: 50,000
    Daily read requests: 50,000 * 86,400 = 4.3 Billion reads/day
    Average read response: 37.5 KB (50 tweets on a timeline page)

  But many reads hit the SAME timelines (popular users).
  Unique timelines accessed/day: ~100M (not 4.3B)

  Cache the hottest 20%:
    = 100M * 0.20 * 37.5 KB
    = 20M * 37.5 KB
    = 750 GB

  With overhead (metadata, hash table, fragmentation):
    ~750 GB * 1.3 ≈ ~1 TB of Redis memory

  Redis instances:
    Each Redis instance: ~64 GB usable memory
    Instances needed: 1,000 / 64 ≈ 16 Redis instances
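
The sizing steps above, sketched as one function; the 1.3 overhead factor and 64 GB of usable memory per instance are the same assumptions the example uses:

```python
import math

def redis_cache_plan(unique_items_per_day, hot_fraction, item_bytes,
                     overhead=1.3, instance_gb=64):
    """Hot-set size in GB, size with overhead, and instance count."""
    raw_gb = unique_items_per_day * hot_fraction * item_bytes / 1e9
    total_gb = raw_gb * overhead
    return raw_gb, total_gb, math.ceil(total_gb / instance_gb)

raw, total, n = redis_cache_plan(100e6, 0.20, 37_500)
print(f"hot set: {raw:.0f} GB -> with overhead: {total:.0f} GB -> {n} instances")
# hot set: 750 GB -> with overhead: 975 GB -> 16 instances
```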

Cache Hit Ratio Impact

  Cache Hit Ratio   DB Reads (at 50K total read QPS)   Impact
  0% (no cache)     50,000 QPS to DB                   DB overloaded
  50%               25,000 QPS to DB                   Still heavy
  80%               10,000 QPS to DB                   Manageable
  95%               2,500 QPS to DB                    Comfortable
  99%               500 QPS to DB                      DB barely touched
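
The whole table is one line of arithmetic: DB load equals the miss rate times total read QPS.

```python
total_read_qps = 50_000

for hit_ratio in (0.00, 0.50, 0.80, 0.95, 0.99):
    db_qps = total_read_qps * (1 - hit_ratio)   # only cache misses reach the DB
    print(f"hit ratio {hit_ratio:>4.0%} -> {db_qps:>8,.0f} QPS to DB")
```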

7. Rules of Thumb

The "1 Million Users" Rule of Thumb

  Metric                          Rough Value
  1M MAU → DAU                    ~200K - 500K (20-50%)
  DAU → QPS                       DAU * actions / 86,400
  Peak QPS                        2x - 5x average QPS
  Read:Write ratio (social)       100:1
  Read:Write ratio (messaging)    1:1 (but bursty reads)
  Read:Write ratio (e-commerce)   10:1

Hardware Rules of Thumb

  Resource        Capacity (per machine)
  Web server      1K - 10K QPS (depends on logic complexity)
  PostgreSQL      5K - 20K QPS (with connection pooling)
  MySQL           5K - 20K QPS
  Redis           100K - 500K QPS
  Elasticsearch   1K - 10K QPS (depends on query complexity)
  Kafka broker    100K - 1M messages/sec

Latency Rules of Thumb

  Operation                       Latency
  L1 cache reference              ~1 ns
  L2 cache reference              ~4 ns
  RAM reference                   ~100 ns
  SSD random read                 ~100 µs
  HDD random read                 ~10 ms
  Same datacenter round trip      ~0.5 ms
  Cross-continent round trip      ~100-150 ms
  Redis GET                       ~0.5-1 ms
  PostgreSQL simple query         ~1-5 ms
  Send 1 MB over 1 Gbps network   ~10 ms

8. Worked Example: Twitter

Given

  • 500M MAU, 200M DAU
  • Average user reads timeline 10 times/day, posts 2 tweets/day
  • Each tweet: ~500 bytes; timeline page: 50 tweets
  • 10% of tweets include a photo (~200 KB)

QPS Calculation

  READ QPS:
    Timeline reads: 200M * 10 / 86,400 = ~23,000 QPS
    Peak: 23,000 * 3 = ~70,000 QPS

  WRITE QPS:
    New tweets: 200M * 2 / 86,400 = ~4,600 QPS
    Peak: 4,600 * 3 = ~14,000 QPS

Storage

  TEXT:
    Daily tweets: 400M * 500 bytes = 200 GB/day
    Yearly: 200 GB * 365 = 73 TB/year

  MEDIA:
    Tweets with photos: 400M * 10% = 40M photos/day
    40M * 200 KB = 8 TB/day
    Yearly: 8 TB * 365 = 2.9 PB/year

    (This is why Twitter uses a CDN and object storage like S3)

Bandwidth

  EGRESS (text):
    23,000 QPS * 25 KB (50 tweets * 500 bytes) = 575 MB/s

  EGRESS (media):
    If 30% of timeline reads load images:
    7,000 QPS * 10 images * 200 KB = 14 GB/s
    (Served mostly by CDN, not origin servers)

Cache

  Cache the hottest 20% of timelines:
    Daily unique timelines: ~100M
    Hot 20%: 20M timelines * 25 KB = 500 GB
    With overhead: ~650 GB ≈ ~10 Redis instances (64 GB each)

9. Worked Example: YouTube

Given

  • 2B MAU, 500M DAU
  • Average user watches 5 videos/day
  • Average video: 10 min, 50 MB (compressed, 720p)
  • 500K new videos uploaded per day

QPS Calculation

  VIDEO VIEW QPS:
    500M * 5 / 86,400 = ~29,000 QPS (video plays started)
    Peak: 29,000 * 3 = ~87,000 QPS

  UPLOAD QPS:
    500K / 86,400 = ~6 uploads/sec (uploads are slow, heavy operations)

  SEARCH QPS:
    If 50% of users search once/day:
    250M / 86,400 = ~2,900 QPS

Storage

  VIDEO STORAGE:
    New videos/day: 500K
    Average raw upload: 500 MB (before processing)
    Daily raw: 500K * 500 MB = 250 TB/day

    Each video is transcoded to multiple resolutions:
    360p + 480p + 720p + 1080p = 4 stored renditions ≈ 4x the 50 MB base size
    Daily processed: 500K * 50 MB * 4 resolutions = 100 TB/day

    Yearly: 100 TB * 365 = 36.5 PB/year

  METADATA:
    Video metadata: ~10 KB per video
    500K * 10 KB = 5 GB/day (negligible compared to video)

Bandwidth

  EGRESS:
    Concurrent viewers (peak): ~5M users streaming simultaneously
    Average bitrate: 5 Mbps (720p)
    Total egress: 5M * 5 Mbps = 25 Tbps

    (This is why YouTube uses a massive global CDN)
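
Peak egress is just concurrent streams times per-stream bitrate; a sketch using the example's assumed 5M concurrent viewers and 5 Mbps streams:

```python
concurrent_viewers = 5e6
bitrate_mbps = 5                 # ~720p stream

egress_mbps = concurrent_viewers * bitrate_mbps
egress_tbps = egress_mbps / 1e6  # Mbps -> Tbps
print(f"peak egress: {egress_tbps:.0f} Tbps")  # 25 Tbps
```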

Cache

  Video metadata cache:
    Total videos: ~1B (historical)
    Hot 1%: 10M videos * 10 KB = 100 GB (fits in a few Redis nodes)

  Video content cache (CDN):
    Hot 10% of videos get 90% of views
    CDN caches these at edge locations worldwide

10. Worked Example: URL Shortener

Given

  • 100M new URLs shortened per month
  • Read:Write ratio = 100:1
  • Short URL: 7 characters
  • Keep URLs for 5 years

QPS Calculation

  WRITE QPS:
    100M / (30 * 86,400) = ~39 writes/sec
    Peak: 39 * 3 = ~120 writes/sec
    (Very low — a single server can handle this easily)

  READ QPS (redirects):
    39 * 100 = ~3,900 reads/sec
    Peak: 3,900 * 3 = ~12,000 reads/sec

Storage

  Per URL record:
    short_url: 7 bytes
    long_url: ~200 bytes (average)
    created_at: 8 bytes
    user_id: 8 bytes
    Total: ~250 bytes

  Monthly: 100M * 250 bytes = 25 GB/month
  Yearly: 25 GB * 12 = 300 GB/year
  5 years: 300 GB * 5 = 1.5 TB

  Total records in 5 years: 100M * 12 * 5 = 6 Billion URLs

Unique Short URL Space

  7-character short URL using [a-z, A-Z, 0-9] = 62 characters
  62^7 = 3.5 Trillion possible combinations
  We need 6 Billion over 5 years
  Keyspace utilization: 6B / 3.5T ≈ 0.17% — only a tiny fraction of the
  space is ever used, so collisions are easy to avoid
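
The keyspace math above is easy to check exactly:

```python
ALPHABET_SIZE = 62               # a-z, A-Z, 0-9

space = ALPHABET_SIZE ** 7       # all possible 7-character codes
needed = 100_000_000 * 12 * 5    # 100M URLs/month for 5 years

print(f"62^7        = {space:,}")            # 3,521,614,606,208
print(f"needed      = {needed:,}")           # 6,000,000,000
print(f"utilization = {needed / space:.2%}") # 0.17%
```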

Cache

  Apply 80-20 rule:
    20% of URLs get 80% of traffic
    Daily unique URLs accessed: ~100M (estimate)
    Hot 20%: 20M * 250 bytes = 5 GB

  This easily fits in a single Redis instance!
  Expected cache hit ratio: 90%+
  DB reads with cache: 12,000 * 10% = 1,200 QPS (one DB handles this)

11. Common Powers of 2

Memorize these for quick mental math in interviews.

  Power   Value                   Approximate             Common Use
  2^10    1,024                   ~1 Thousand (1 KB)      Kilo
  2^20    1,048,576               ~1 Million (1 MB)       Mega
  2^30    1,073,741,824           ~1 Billion (1 GB)       Giga
  2^40    1,099,511,627,776       ~1 Trillion (1 TB)      Tera
  2^50    1,125,899,906,842,624   ~1 Quadrillion (1 PB)   Peta

Quick Conversion Tricks

  1 day     ≈ 86,400 sec    ≈ 10^5 sec (use for quick division)
  1 month   ≈ 2.5M sec      ≈ 2.5 * 10^6
  1 year    ≈ 30M sec       ≈ 3 * 10^7

  1 M requests/day   ≈ 12 QPS
  10M requests/day   ≈ 120 QPS
  100M requests/day  ≈ 1,200 QPS
  1B requests/day    ≈ 12,000 QPS
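
Those shortcuts come straight from dividing by 86,400:

```python
SECONDS_PER_DAY = 86_400

for daily in (1e6, 10e6, 100e6, 1e9):
    qps = daily / SECONDS_PER_DAY
    print(f"{daily:>13,.0f} req/day ≈ {qps:>6,.0f} QPS")
```

The exact values (11.6, 116, 1,157, and 11,574 QPS) are what the table loosely rounds to 12, 120, 1,200, and 12,000 — good enough for mental math, since only the order of magnitude matters.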

12. Key Takeaways

  1. Estimation is a pipeline: Users → DAU → QPS → Storage → Bandwidth → Cache sizing.
  2. You don't need exact numbers. Order of magnitude is what matters. Is it 1 TB or 1 PB? Is it 1K QPS or 1M QPS?
  3. Peak QPS is 2-5x average QPS. Always design for peak, not average.
  4. The 80-20 rule is your friend for cache sizing: 20% of data serves 80% of traffic.
  5. Memorize key constants: seconds/day (~86K), 62^7 (~3.5T), read:write ratios for common systems.
  6. Media dominates storage and bandwidth. A system that stores images/videos needs orders of magnitude more storage than one that stores only text.
  7. Show your work. In an interview, the process matters more than the exact number.

13. Explain-It Challenge

Without looking back, work through these:

  1. A photo-sharing app has 50M DAU. Each user uploads 1 photo/day (2 MB each) and views 20 photos/day. Calculate: write QPS, read QPS, daily storage, daily egress bandwidth.
  2. A chat application has 100M DAU. Each user sends 40 messages/day (average 100 bytes each). How much storage is needed per year? How much cache for the hottest 20%?
  3. Why is peak QPS more important than average QPS when sizing infrastructure?
  4. A URL shortener receives 200M new URLs per month. What is the minimum character length needed for the short code (using base62) to last 10 years without collisions?
  5. Explain the 80-20 rule and how it applies to cache sizing with a concrete example.

Navigation: ← 9.7.c — Breaking Into Components · 9.7.e — Interview Approach →