Episode 9 — System Design / 9.7 — System Design Foundations
9.7.d — Capacity Estimation
In one sentence: Capacity estimation is the art of converting "we have X million users" into concrete numbers for QPS, storage, bandwidth, and memory — the math that turns hand-waving into engineering decisions.
Navigation: ← 9.7.c — Breaking Into Components · 9.7.e — Interview Approach →
Table of Contents
- 1. Why Estimation Matters
- 2. The Estimation Pipeline
- 3. Users to QPS
- 4. Storage Estimation
- 5. Bandwidth Estimation
- 6. Memory Estimation (Caching)
- 7. Rules of Thumb
- 8. Worked Example: Twitter
- 9. Worked Example: YouTube
- 10. Worked Example: URL Shortener
- 11. Common Powers of 2
- 12. Key Takeaways
- 13. Explain-It Challenge
1. Why Estimation Matters
WITHOUT ESTIMATION:
  "We'll use a database"
  "We'll add caching"
  "It should scale"

WITH ESTIMATION:
  "We need to handle 50K QPS reads and 5K QPS writes.
   A single PostgreSQL instance handles ~10K QPS,
   so we need read replicas. A cache hit ratio of 80%
   reduces DB reads to 10K QPS — one primary
   + 2 replicas will work."
Estimation gives you numbers to justify decisions. In an interview, it shows the interviewer you can think quantitatively about systems, not just draw boxes.
2. The Estimation Pipeline
Every estimation follows the same flow:
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Users   │────►│   QPS    │────►│ Storage  │────►│ Bandwidth│
│  (MAU,   │     │ (reads,  │     │ (per day,│     │ (ingress,│
│   DAU)   │     │  writes) │     │  year)   │     │  egress) │
└──────────┘     └────┬─────┘     └──────────┘     └──────────┘
                      │
                ┌─────▼─────┐
                │  Memory   │
                │  (cache   │
                │  sizing)  │
                └───────────┘
3. Users to QPS
Step-by-Step Formula
Step 1: Monthly Active Users (MAU)
Given by the problem (e.g., 500M)
Step 2: Daily Active Users (DAU)
DAU = MAU * daily_active_fraction
Typical: 20-50% for social apps
Example: 500M * 0.20 = 100M DAU
Step 3: Actions per User per Day
Varies by app:
• Social media: 5-20 reads, 1-3 writes
• Messaging: 20-50 messages sent
• E-commerce: 3-10 product views, 0.1 purchases
Step 4: Daily Total Actions
= DAU * actions_per_user
= 100M * 10 reads = 1 Billion reads/day
Step 5: QPS (Queries Per Second)
= daily_actions / seconds_per_day
= 1,000,000,000 / 86,400
≈ 11,574 QPS
Round to: ~12K QPS
Step 6: Peak QPS
= QPS * peak_multiplier (typically 2x-5x)
= 12K * 3 = ~36K peak QPS
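The six steps above can be sketched as one small Python function (the function name and the 3x default peak multiplier are illustrative choices, not from the text; the inputs are the 500M-MAU example):

```python
def estimate_qps(mau, daily_active_fraction, actions_per_user, peak_multiplier=3):
    """MAU -> DAU -> daily actions -> average QPS -> peak QPS."""
    SECONDS_PER_DAY = 86_400
    dau = mau * daily_active_fraction
    daily_actions = dau * actions_per_user
    avg_qps = daily_actions / SECONDS_PER_DAY
    return avg_qps, avg_qps * peak_multiplier

# 500M MAU, 20% daily active, 10 reads per user per day
avg, peak = estimate_qps(500_000_000, 0.20, 10)
print(f"{avg:,.0f} avg QPS, {peak:,.0f} peak QPS")  # 11,574 avg QPS, 34,722 peak QPS
```

In an interview you would round these to ~12K average and ~36K peak, as in the steps above; the code just shows the pipeline is a single chain of multiplications and one division.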
QPS Reference Table
| Scale | DAU | QPS (average) | Peak QPS | Server Needs |
|---|---|---|---|---|
| Startup | 10K | ~1 | ~5 | 1 server |
| Growing | 1M | ~100 | ~500 | 2-5 servers |
| Large | 100M | ~12K | ~36K | 50+ servers, caching, LB |
| Massive | 1B | ~120K | ~500K | Thousands of servers, global CDN |
4. Storage Estimation
Formula
Daily storage = daily_writes * data_per_write
Yearly storage = daily_storage * 365
5-year storage = yearly_storage * 5 (plan for growth)
Data Size Reference
| Data Type | Typical Size |
|---|---|
| Tweet / short text | 250-500 bytes |
| User profile (metadata) | 1-2 KB |
| Database row (typical) | 100 bytes - 1 KB |
| Thumbnail image | 10-50 KB |
| Photo (compressed) | 200 KB - 2 MB |
| Short video (1 min, compressed) | 5-50 MB |
| Long video (1 hour, HD) | 1-5 GB |
| Log entry | 100-500 bytes |
Example: Tweet Storage
Given:
400M tweets/day
Each tweet: 280 chars * 2 bytes/char = 560 bytes text
+ 200 bytes metadata (user_id, timestamp, etc.)
= ~750 bytes per tweet
Daily: 400M * 750 bytes = 300 GB/day
Monthly: 300 GB * 30 = 9 TB/month
Yearly: 300 GB * 365 ≈ 110 TB/year
5 years: 110 TB * 5 = 550 TB
With replication factor of 3: 550 TB * 3 = 1.65 PB
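A minimal sketch of the storage formula applied to the tweet example (function name and parameter defaults are illustrative; 365 days/year and the 3x replication factor come from the text):

```python
def estimate_storage_bytes(daily_writes, bytes_per_write, years=5, replication=3):
    """Daily write volume scaled out to a multi-year, replicated total."""
    daily = daily_writes * bytes_per_write
    total = daily * 365 * years * replication
    return daily, total

daily, total = estimate_storage_bytes(400_000_000, 750)  # 400M tweets at ~750 B
print(daily / 1e9)   # 300.0  (GB/day)
print(total / 1e15)  # 1.6425 (PB over 5 years, 3x replicated)
```

The exact total is 1.64 PB; the text's 1.65 PB comes from rounding 109.5 TB/year up to 110 before multiplying, which is exactly the kind of rounding that is fine at this precision.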
5. Bandwidth Estimation
Ingress (Data Coming In)
Ingress = write_QPS * average_write_size
Example:
Write QPS: 5,000
Average write size: 750 bytes (tweet)
Ingress = 5,000 * 750 = 3.75 MB/s
Egress (Data Going Out)
Egress = read_QPS * average_response_size
Example:
Read QPS: 50,000
Average response (timeline page): 50 tweets * 750 bytes = ~37.5 KB
Egress = 50,000 * 37.5 KB = 1.875 GB/s
With media (images):
Each timeline page has ~10 images * 200KB = 2 MB
Egress_media = 50,000 * 2 MB = 100 GB/s
(This is why CDNs are critical!)
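Both egress calculations above reduce to the same one-line formula; a sketch with the text's numbers (helper name is illustrative):

```python
KB, GB = 1_000, 1_000_000_000

def egress(read_qps, bytes_per_response):
    """Egress bytes/sec = read QPS * average response size."""
    return read_qps * bytes_per_response

text_egress  = egress(50_000, 50 * 750)       # 50 tweets of 750 B each
media_egress = egress(50_000, 10 * 200 * KB)  # 10 images of 200 KB each
print(text_egress / GB)   # 1.875 (GB/s of text)
print(media_egress / GB)  # 100.0 (GB/s of media -- the CDN's job)
```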
Bandwidth Quick Reference
| Unit | Value | Example |
|---|---|---|
| 1 Mbps | 125 KB/s | A slow API response |
| 100 Mbps | 12.5 MB/s | A busy web server |
| 1 Gbps | 125 MB/s | A database server |
| 10 Gbps | 1.25 GB/s | CDN edge node |
| 100 Gbps | 12.5 GB/s | Large-scale CDN total |
6. Memory Estimation (Caching)
The 80-20 Rule (Pareto Principle)
In most systems, 20% of the data serves 80% of the requests. Cache that 20%.
Cache Sizing Formula
Cache size = daily_reads * data_per_read * cache_fraction
Where cache_fraction = fraction of data to keep in memory
(typically 20% of daily unique data)
Example: Twitter Timeline Cache
Given:
Read QPS: 50,000
Daily read requests: 50,000 * 86,400 = 4.3 Billion reads/day
Average read response: 37.5 KB (50 tweets on a timeline page)
But many reads hit the SAME timelines (popular users).
Unique timelines accessed/day: ~100M (not 4.3B)
Cache the hottest 20%:
= 100M * 0.20 * 37.5 KB
= 20M * 37.5 KB
= 750 GB
With overhead (metadata, hash table, fragmentation):
~750 GB * 1.3 ≈ ~1 TB of Redis memory
Redis instances:
Each Redis instance: ~64 GB usable memory
Instances needed: 1,000 / 64 ≈ 16 Redis instances
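The timeline-cache sizing above can be sketched as a function (name, and the 20% / 1.3x / 64 GB defaults, are taken from the worked numbers in this section; treat them as tunable assumptions):

```python
import math

def cache_sizing(unique_items_per_day, bytes_per_item,
                 hot_fraction=0.20, overhead=1.3, node_bytes=64e9):
    """Hot-set cache size with memory overhead, plus nodes needed to hold it."""
    size = unique_items_per_day * hot_fraction * bytes_per_item * overhead
    return size, math.ceil(size / node_bytes)

size, nodes = cache_sizing(100_000_000, 37_500)  # 100M unique timelines, 37.5 KB each
print(size / 1e9, nodes)  # ~975 GB across 16 Redis-sized nodes
```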
Cache Hit Ratio Impact
| Cache Hit Ratio | DB Reads (if 50K QPS total) | Impact |
|---|---|---|
| 0% (no cache) | 50,000 QPS to DB | DB overloaded |
| 50% | 25,000 QPS to DB | Still heavy |
| 80% | 10,000 QPS to DB | Manageable |
| 95% | 2,500 QPS to DB | Comfortable |
| 99% | 500 QPS to DB | DB barely touched |
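The table rows all follow from one expression, which is worth internalizing (a two-line sketch; the function name is illustrative):

```python
def db_read_qps(total_read_qps, cache_hit_ratio):
    """Only cache misses fall through to the database."""
    return total_read_qps * (1 - cache_hit_ratio)

for ratio in (0.0, 0.50, 0.80, 0.95, 0.99):
    print(f"{ratio:.0%} hits -> {db_read_qps(50_000, ratio):,.0f} QPS to the DB")
```

Note how the hit ratio matters most at the top end: going from 95% to 99% cuts DB load by another 5x, which is why cache eviction policy is worth tuning.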
7. Rules of Thumb
The "1 Million Users" Rule of Thumb
| Metric | Rough Value |
|---|---|
| 1M MAU → DAU | ~200K - 500K (20-50%) |
| DAU → QPS | DAU * actions / 86,400 |
| Peak QPS | 2x - 5x average QPS |
| Read:Write ratio (social) | 100:1 |
| Read:Write ratio (messaging) | 1:1 (but bursty reads) |
| Read:Write ratio (e-commerce) | 10:1 |
Hardware Rules of Thumb
| Resource | Capacity (per machine) |
|---|---|
| Web server | 1K - 10K QPS (depends on logic complexity) |
| PostgreSQL | 5K - 20K QPS (with connection pooling) |
| MySQL | 5K - 20K QPS |
| Redis | 100K - 500K QPS |
| Elasticsearch | 1K - 10K QPS (depends on query complexity) |
| Kafka broker | 100K - 1M messages/sec |
Latency Rules of Thumb
| Operation | Latency |
|---|---|
| L1 cache reference | ~1 ns |
| L2 cache reference | ~4 ns |
| RAM reference | ~100 ns |
| SSD random read | ~100 us |
| HDD random read | ~10 ms |
| Same datacenter round trip | ~0.5 ms |
| Cross-continent round trip | ~100-150 ms |
| Redis GET | ~0.5-1 ms |
| PostgreSQL simple query | ~1-5 ms |
| Send 1 MB over 1 Gbps network | ~10 ms |
8. Worked Example: Twitter
Given
- 500M MAU, 200M DAU
- Average user reads timeline 10 times/day, posts 2 tweets/day
- Each tweet: ~500 bytes; timeline page: 50 tweets
- 10% of tweets include a photo (~200 KB)
QPS Calculation
READ QPS:
Timeline reads: 200M * 10 / 86,400 = ~23,000 QPS
Peak: 23,000 * 3 = ~70,000 QPS
WRITE QPS:
New tweets: 200M * 2 / 86,400 = ~4,600 QPS
Peak: 4,600 * 3 = ~14,000 QPS
Storage
TEXT:
Daily tweets: 400M * 500 bytes = 200 GB/day
Yearly: 200 GB * 365 = 73 TB/year
MEDIA:
Tweets with photos: 400M * 10% = 40M photos/day
40M * 200 KB = 8 TB/day
Yearly: 8 TB * 365 = 2.9 PB/year
(This is why Twitter uses a CDN and object storage like S3)
Bandwidth
EGRESS (text):
23,000 QPS * 25 KB (50 tweets * 500 bytes) = 575 MB/s
EGRESS (media):
If 30% of timeline reads load images:
7,000 QPS * 10 images * 200 KB = 14 GB/s
(Served mostly by CDN, not origin servers)
Cache
Cache the hottest 20% of timelines:
Daily unique timelines: ~100M
Hot 20%: 20M timelines * 25 KB = 500 GB
With overhead: ~650 GB ≈ ~10 Redis instances (64 GB each)
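The whole Twitter estimate above fits in a few lines of arithmetic; a sketch that reproduces the headline numbers (variable names are illustrative, inputs are the givens at the top of this section):

```python
SECONDS_PER_DAY = 86_400
dau = 200_000_000
tweets_per_day = dau * 2                 # 2 posts per user per day
photos_per_day = tweets_per_day // 10    # 10% of tweets carry a photo

read_qps  = dau * 10 / SECONDS_PER_DAY   # 10 timeline reads per user per day
write_qps = tweets_per_day / SECONDS_PER_DAY
text_gb_per_day  = tweets_per_day * 500 / 1e9        # 500 B per tweet
media_tb_per_day = photos_per_day * 200_000 / 1e12   # 200 KB per photo

print(round(read_qps), round(write_qps))   # 23148 4630
print(text_gb_per_day, media_tb_per_day)   # 200.0 8.0
```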
9. Worked Example: YouTube
Given
- 2B MAU, 500M DAU
- Average user watches 5 videos/day
- Average video: 10 min, 50 MB (compressed, 720p)
- 500K new videos uploaded per day
QPS Calculation
VIDEO VIEW QPS:
500M * 5 / 86,400 = ~29,000 QPS (video plays started)
Peak: 29,000 * 3 = ~87,000 QPS
UPLOAD QPS:
500K / 86,400 = ~6 uploads/sec (uploads are slow, heavy operations)
SEARCH QPS:
If 50% of users search once/day:
250M / 86,400 = ~2,900 QPS
Storage
VIDEO STORAGE:
New videos/day: 500K
Average raw upload: 500 MB (before processing)
Daily raw: 500K * 500 MB = 250 TB/day
Each video is transcoded to multiple resolutions:
360p + 480p + 720p + 1080p ≈ 4x the compressed size
Daily processed: 500K * 50 MB * 4 resolutions = 100 TB/day
Yearly: 100 TB * 365 = 36.5 PB/year
METADATA:
Video metadata: ~10 KB per video
500K * 10 KB = 5 GB/day (negligible compared to video)
Bandwidth
EGRESS:
Concurrent viewers (peak): ~5M users streaming simultaneously
Average bitrate: 5 Mbps (720p)
Total egress: 5M * 5 Mbps = 25 Tbps
(This is why YouTube uses a massive global CDN)
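The egress figure is concurrency times bitrate with a unit conversion; a sketch using the estimates above:

```python
concurrent_viewers = 5_000_000   # peak simultaneous streams (estimate from the text)
bitrate_mbps = 5                 # ~720p average bitrate
egress_tbps = concurrent_viewers * bitrate_mbps / 1_000_000  # Mbps -> Tbps
print(egress_tbps)  # 25.0
```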
Cache
Video metadata cache:
Total videos: ~1B (historical)
Hot 1%: 10M videos * 10 KB = 100 GB (fits in a few Redis nodes)
Video content cache (CDN):
Hot 10% of videos get 90% of views
CDN caches these at edge locations worldwide
10. Worked Example: URL Shortener
Given
- 100M new URLs shortened per month
- Read:Write ratio = 100:1
- Short URL: 7 characters
- Keep URLs for 5 years
QPS Calculation
WRITE QPS:
100M / (30 * 86,400) = ~39 writes/sec
Peak: 39 * 3 = ~120 writes/sec
(Very low — a single server can handle this easily)
READ QPS (redirects):
39 * 100 = ~3,900 reads/sec
Peak: 3,900 * 3 = ~12,000 reads/sec
Storage
Per URL record:
short_url: 7 bytes
long_url: ~200 bytes (average)
created_at: 8 bytes
user_id: 8 bytes
Total: ~250 bytes
Monthly: 100M * 250 bytes = 25 GB/month
Yearly: 25 GB * 12 = 300 GB/year
5 years: 300 GB * 5 = 1.5 TB
Total records in 5 years: 100M * 12 * 5 = 6 Billion URLs
Unique Short URL Space
7-character short URL using [a-z, A-Z, 0-9] = 62 characters
62^7 = 3.5 Trillion possible combinations
We need 6 Billion over 5 years
Keyspace used: 6B / 3.5T ≈ 0.17%, so collisions are easy to avoid
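The keyspace math is a one-line exponentiation; a sketch that verifies the numbers above (function name is illustrative):

```python
def base62_keyspace(length):
    """Distinct codes of a given length over [a-z, A-Z, 0-9] (62 symbols)."""
    return 62 ** length

urls_needed = 100_000_000 * 12 * 5            # 6 Billion over 5 years
space = base62_keyspace(7)
print(f"{space:,}")                           # 3,521,614,606,208
print(f"{urls_needed / space:.2%} of the keyspace used")  # 0.17% of the keyspace used
```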
Cache
Apply 80-20 rule:
20% of URLs get 80% of traffic
Daily unique URLs accessed: ~100M (estimate)
Hot 20%: 20M * 250 bytes = 5 GB
This easily fits in a single Redis instance!
Expected cache hit ratio: 90%+
DB reads with cache: 12,000 * 10% = 1,200 QPS (one DB handles this)
11. Common Powers of 2
Memorize these for quick mental math in interviews.
| Power | Value | Approximate | Common Use |
|---|---|---|---|
| 2^10 | 1,024 | ~1 Thousand (1 KB) | Kilo |
| 2^20 | 1,048,576 | ~1 Million (1 MB) | Mega |
| 2^30 | 1,073,741,824 | ~1 Billion (1 GB) | Giga |
| 2^40 | 1,099,511,627,776 | ~1 Trillion (1 TB) | Tera |
| 2^50 | 1,125,899,906,842,624 | ~1 Quadrillion (1 PB) | Peta |
Quick Conversion Tricks
1 day ≈ 86,400 sec ≈ 10^5 sec (use for quick division)
1 month ≈ 2.5M sec ≈ 2.5 * 10^6
1 year ≈ 30M sec ≈ 3 * 10^7
1 M requests/day ≈ 12 QPS
10M requests/day ≈ 120 QPS
100M requests/day ≈ 1,200 QPS
1B requests/day ≈ 12,000 QPS
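The daily-requests shortcut can be checked exactly (the ~10^5 mental trick slightly overestimates the divisor, so the rounded table values run a touch high):

```python
def qps_from_daily(requests_per_day):
    """Average QPS from a daily request total."""
    return requests_per_day / 86_400

for daily in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(f"{daily:>13,} req/day = {qps_from_daily(daily):>8,.0f} QPS")
```

The exact values (12, 116, 1,157, 11,574) agree with the shortcut to within rounding, which is all an order-of-magnitude estimate needs.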
12. Key Takeaways
- Estimation is a pipeline: Users → DAU → QPS → Storage → Bandwidth → Cache sizing.
- You don't need exact numbers. Order of magnitude is what matters. Is it 1 TB or 1 PB? Is it 1K QPS or 1M QPS?
- Peak QPS is 2-5x average QPS. Always design for peak, not average.
- The 80-20 rule is your friend for cache sizing: 20% of data serves 80% of traffic.
- Memorize key constants: seconds/day (~86K), 62^7 (~3.5T), read:write ratios for common systems.
- Media dominates storage and bandwidth. A system that stores images/videos needs orders of magnitude more storage than one that stores only text.
- Show your work. In an interview, the process matters more than the exact number.
13. Explain-It Challenge
Without looking back, work through these:
- A photo-sharing app has 50M DAU. Each user uploads 1 photo/day (2 MB each) and views 20 photos/day. Calculate: write QPS, read QPS, daily storage, daily egress bandwidth.
- A chat application has 100M DAU. Each user sends 40 messages/day (average 100 bytes each). How much storage is needed per year? How much cache for the hottest 20%?
- Why is peak QPS more important than average QPS when sizing infrastructure?
- A URL shortener receives 200M new URLs per month. What is the minimum character length needed for the short code (using base62) to last 10 years without collisions?
- Explain the 80-20 rule and how it applies to cache sizing with a concrete example.
Navigation: ← 9.7.c — Breaking Into Components · 9.7.e — Interview Approach →