Episode 9 — System Design / 9.8 — Communication and Data Layer
9.8.d — Database Scaling
Big Picture: A single database server can handle a few thousand requests per second. At some point, your data grows beyond what one machine can store, or your traffic exceeds what one machine can serve. Database scaling is how you break past those limits without losing reliability.
Table of Contents
- Why Databases Become Bottlenecks
- Vertical vs Horizontal Scaling
- Read Replicas
- Master-Slave Replication
- Sharding Strategies
- Shard Key Selection
- Denormalization for Performance
- Connection Pooling
- Indexing Strategies
- Key Takeaways
- Explain-It Challenge
Why Databases Become Bottlenecks
Stage 1: Single Server Stage 2: Growing Pains Stage 3: Crisis
==================== ==================== ================
App --> [DB] App --> [DB overloaded] App --> [DB DOWN]
10K rows 10M rows 100M rows
100 QPS 5,000 QPS 50,000 QPS
"Everything is fine" "Queries getting slow" "Site is down"
Common symptoms:
- Query response times increasing (P99 > 1 second)
- CPU/memory on DB server at 90%+
- Connection pool exhausted
- Disk I/O maxed out
- Replication lag growing
Vertical vs Horizontal Scaling
VERTICAL SCALING HORIZONTAL SCALING
(Scale Up) (Scale Out)
================== ==================
+--------+ +----------+ +----+ +----+ +----+
| 4 CPU | | 64 CPU | | DB | | DB | | DB |
| 16 GB | ---> | 512 GB | | 1 | | 2 | | 3 |
| 1 TB | | 10 TB | +----+ +----+ +----+
+--------+ +----------+ | | |
(buy a bigger machine) (add more machines)
| Factor | Vertical | Horizontal |
|---|---|---|
| How | Bigger CPU, RAM, SSD | More machines |
| Limit | Hardware ceiling (~128 cores, 4TB RAM) | Virtually unlimited |
| Downtime | Usually requires restart | Add nodes with zero downtime |
| Cost | Exponentially expensive at top | Linear cost growth |
| Complexity | Simple (no code changes) | Complex (sharding, routing) |
| Data model | No changes | May need to partition data |
| Best for | SQL databases (PostgreSQL, MySQL) | NoSQL (Cassandra, DynamoDB) |
Rule of thumb: Scale vertically first (it is simpler). Switch to horizontal when you hit hardware limits or need fault tolerance across zones.
Read Replicas
Most applications are read-heavy (often 90% or more of queries are reads). Read replicas distribute read traffic across multiple copies of the database.
BEFORE: Single DB handles everything
=====================================
App Server ---- READ + WRITE ----> [Primary DB]
App Server ---- READ + WRITE ----> [Primary DB]
App Server ---- READ + WRITE ----> [Primary DB]
(bottleneck)
AFTER: Primary handles writes, replicas handle reads
=====================================================
WRITE
App Server ---- WRITE ----> [Primary DB]
|
Replication (async)
/ | \
v v v
[Replica 1] [Replica 2] [Replica 3]
^ ^ ^
App Server ---- READ --|---------|---------|
App Server ---- READ --|---------|---------|
App Server ---- READ --|---------|---------|
How It Works
| Step | What Happens |
|---|---|
| 1. Write arrives | Goes to primary (master) only |
| 2. WAL shipping | Primary writes to Write-Ahead Log (WAL) |
| 3. Replication | WAL entries sent to replicas (async or sync) |
| 4. Replica applies | Replica replays WAL entries to update its data |
| 5. Read routed | Application reads from replicas (round-robin or nearest) |
Replication Types
| Type | How It Works | Tradeoff |
|---|---|---|
| Asynchronous | Primary does not wait for replicas to confirm | Fast writes, but replicas may be slightly behind (lag) |
| Synchronous | Primary waits for at least one replica to confirm | No lost writes on failover, but slower writes |
| Semi-synchronous | Primary waits for one replica, others are async | Balance of speed and safety |
Replication Lag
Timeline:
=========
t=0 Primary: INSERT user "Alice"
t=0 Primary: committed, returns success to client
t=50ms Replica 1: receives and applies the INSERT
t=80ms Replica 2: receives and applies the INSERT
PROBLEM: At t=20ms, a read to Replica 1 returns "Alice not found"
This is "replication lag" — typically 10-100ms, but can spike to seconds
under heavy write load.
Mitigations for replication lag:
- Read-your-writes consistency — After a write, route that user's reads to primary for a few seconds
- Monotonic reads — Pin a user to a specific replica for a session
- Synchronous replication — Closes the lag window for critical reads but reduces write throughput
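The read-your-writes mitigation can be sketched as a thin routing layer in front of the database driver. This is an illustrative Python sketch, not a production router: `primary` and `replicas` stand in for whatever connection handles your driver provides, and the window should be set above your observed replication lag.

```python
import time

class ReadYourWritesRouter:
    """Route a user's reads to the primary for a short window after a write."""

    def __init__(self, primary, replicas, window_seconds=5.0):
        self.primary = primary
        self.replicas = replicas
        self.window = window_seconds
        self._last_write = {}   # user_id -> timestamp of last write
        self._rr = 0            # round-robin cursor over replicas

    def connection_for_write(self, user_id):
        self._last_write[user_id] = time.monotonic()
        return self.primary

    def connection_for_read(self, user_id):
        wrote_at = self._last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.window:
            return self.primary          # replicas may still be behind
        self._rr = (self._rr + 1) % len(self.replicas)
        return self.replicas[self._rr]   # slightly stale data is acceptable
```

A window of a few seconds is a pragmatic upper bound on typical lag; pinning a user to one replica for a session (monotonic reads) can be layered on the same idea.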
Master-Slave Replication
Single-Master (Most Common)
+------------------+
| PRIMARY |
| (Master) | <-- All WRITES go here
| |
+--------+---------+
|
Replication Stream
/ | \
v v v
+--------+ +--------+ +--------+
|Replica | |Replica | |Replica |
| (R1) | | (R2) | | (R3) | <-- All READS distributed here
+--------+ +--------+ +--------+
Multi-Master (Less Common)
+----------+ +----------+
| Master | <-----> | Master |
| (US) | sync | (EU) |
+----+-----+ +----+-----+
| |
+--+--+ +---+---+
|R1|R2| |R3 |R4 |
+--+--+ +---+---+
| Topology | Writes | Conflicts | Complexity | Use Case |
|---|---|---|---|---|
| Single-master | One node only | None | Low | Most applications |
| Multi-master | Any node | Possible (must resolve) | High | Multi-region low-latency writes |
| Masterless (Cassandra) | Any node | Possible (last-write-wins) | Medium | High availability |
Failover
NORMAL OPERATION AFTER PRIMARY FAILURE
================ =====================
[Primary] <-- writes [Primary X] (failed)
| |
[R1] [R2] [R3] [R1 -> NEW Primary] [R2] [R3]
^
promoted!
Steps:
1. Health check detects primary is down
2. Election: choose replica with least lag
3. Promote replica to primary
4. Redirect writes to new primary
5. Reconfigure other replicas to follow new primary
Risk: Some committed writes on old primary may be lost
if they were not replicated before failure.
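The election step (choose the replica with the least lag) can be illustrated with a small helper. The replica records here (`name`, `lag_ms`, `healthy`) are hypothetical fields, not the output of any real tool; orchestrators such as Patroni implement this logic with many more safeguards.

```python
def elect_new_primary(replicas):
    """Pick the healthy replica with the least replication lag to promote.

    `replicas` is a list of dicts like {"name", "lag_ms", "healthy"};
    the field names are illustrative only.
    """
    candidates = [r for r in replicas if r["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy replica to promote")
    # Least lag means the fewest committed-but-unreplicated writes are lost.
    return min(candidates, key=lambda r: r["lag_ms"])
```

For example, with R1 at 40ms lag, R2 at 10ms, and R3 unhealthy, R2 is promoted even though R3 reports less lag.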
Sharding Strategies
Sharding (horizontal partitioning) splits data across multiple database instances.
BEFORE SHARDING AFTER SHARDING
================ ================
+------------------+ +--------+ +--------+ +--------+
| ALL DATA | | Users | | Users | | Users |
| (100M users) | | A-H | | I-P | | Q-Z |
| Single DB | | Shard 1| | Shard 2| | Shard 3|
+------------------+ +--------+ +--------+ +--------+
1. Range-Based Sharding
Shard Key: user_id (integer)
Shard 1: user_id 1 - 1,000,000
Shard 2: user_id 1,000,001 - 2,000,000
Shard 3: user_id 2,000,001 - 3,000,000
+--Shard 1--+ +--Shard 2--+ +--Shard 3--+
| 1 - 1M | | 1M - 2M | | 2M - 3M |
+-----+-----+ +-----+-----+ +-----+-----+
| | |
Routing: IF user_id <= 1M -> Shard 1
IF user_id <= 2M -> Shard 2
ELSE -> Shard 3
Pros: Simple routing logic, easy range queries. Cons: Hotspots (new users all go to latest shard), uneven distribution.
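Range routing is a sorted lookup over the shard boundaries. A minimal sketch, where the boundaries follow the diagram above and shard numbering starts at 1:

```python
import bisect

# Upper bound (inclusive) of each shard's user_id range; must stay sorted.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id):
    """Return the shard number whose range contains user_id."""
    idx = bisect.bisect_left(SHARD_UPPER_BOUNDS, user_id)
    if idx == len(SHARD_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is beyond the last shard's range")
    return idx + 1   # shards numbered 1..3 as in the diagram
```

Binary search keeps routing O(log S) in the number of shards, and adding a shard is just appending a new boundary.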
2. Hash-Based Sharding
Shard Key: user_id
Routing: shard_number = hash(user_id) % num_shards
hash("alice") % 3 = 0 --> Shard 0
hash("bob") % 3 = 2 --> Shard 2
hash("charlie") % 3 = 1 --> Shard 1
+--Shard 0--+ +--Shard 1--+ +--Shard 2--+
| alice | | charlie | | bob |
| dave | | frank | | eve |
+------------+ +------------+ +------------+
Distribution: roughly even (depends on hash function quality)
Pros: Even data distribution, no hotspots. Cons: Range queries require hitting all shards, resharding is expensive (consistent hashing helps).
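A minimal hash-routing sketch. One caveat worth knowing: Python's built-in `hash()` is randomized per process for strings, so routing must use a stable hash such as CRC32 or MD5:

```python
import zlib

NUM_SHARDS = 3

def hash_shard(key):
    """Map a key to a shard number with a stable, process-independent hash."""
    # zlib.crc32 always returns the same unsigned int for the same bytes,
    # unlike built-in hash(), which is salted per process for strings.
    return zlib.crc32(key.encode()) % NUM_SHARDS
```

Because of the `% NUM_SHARDS`, changing the shard count remaps nearly every key; that resharding cost is exactly what the consistent hashing section below addresses.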
3. Geographic Sharding
+-----------------+ +-----------------+ +-----------------+
| US Shard | | EU Shard | | APAC Shard |
| (us-east-1) | | (eu-west-1) | | (ap-southeast) |
| | | | | |
| US users | | EU users | | APAC users |
| US orders | | EU orders | | APAC orders |
+-----------------+ +-----------------+ +-----------------+
^ ^ ^
| | |
US traffic EU traffic APAC traffic
Pros: Data locality (low latency), GDPR compliance (EU data stays in EU). Cons: Cross-region queries are expensive, and users who travel or relocate complicate routing.
4. Consistent Hashing (Advanced)
Solves the resharding problem. When you add/remove a shard, only ~1/N of data moves.
Hash Ring (0 to 2^32)
======================
Shard A (position 1000)
/
/
--------+--------+--------+--------+------>
0 1000 3000 6000 9000 2^32
| | |
Shard A Shard B Shard C
Key "alice" hashes to 2500 --> next shard clockwise --> Shard B
Key "bob" hashes to 7800 --> next shard clockwise --> Shard A (wraps)
ADD Shard D at position 4500:
- Only keys between 3000-4500 move from Shard C to Shard D
- Everything else stays put!
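A minimal ring implementation, assuming MD5 for placement and no virtual nodes (real systems add many virtual nodes per shard to even out the arcs):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: one position per shard, for clarity."""

    def __init__(self):
        self._positions = []   # sorted ring positions
        self._shards = {}      # position -> shard name

    @staticmethod
    def _hash(key):
        # Stable hash mapped onto the 0..2^32 ring from the diagram.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def add_shard(self, name):
        pos = self._hash(name)
        bisect.insort(self._positions, pos)
        self._shards[pos] = name

    def shard_for(self, key):
        pos = self._hash(key)
        # First shard at or clockwise of the key's position...
        idx = bisect.bisect_left(self._positions, pos)
        if idx == len(self._positions):
            idx = 0              # ...wrapping around the ring
        return self._shards[self._positions[idx]]
```

Adding a shard moves only the keys that fall on the new shard's arc; every other key keeps its old assignment, which is the "~1/N of data moves" property.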
Sharding Strategy Comparison
| Strategy | Distribution | Range Queries | Resharding Cost | Complexity |
|---|---|---|---|---|
| Range | Uneven (hotspots) | Efficient (single shard) | Medium | Low |
| Hash | Even | Expensive (all shards) | High (consistent hashing helps) | Medium |
| Geographic | By region | Within region: good | Medium | Medium |
| Directory-based | Configurable | Depends on mapping | Low (just update directory) | High |
Shard Key Selection
The shard key determines which shard a piece of data lives on. A bad shard key ruins your entire architecture.
Good Shard Key Criteria
| Criteria | Why |
|---|---|
| High cardinality | Many distinct values = even distribution |
| Even distribution | No hotspots (one shard getting all traffic) |
| Matches query patterns | Most queries should target a single shard |
| Immutable | Changing the shard key means moving data between shards |
Shard Key Examples
| System | Good Shard Key | Bad Shard Key | Why Bad |
|---|---|---|---|
| E-commerce orders | user_id | order_date | All today's orders hit one shard |
| Chat messages | conversation_id | sender_id | Querying a conversation needs all shards |
| IoT sensor data | device_id | sensor_type | Few types = few shards used |
| Social media posts | user_id | country | US shard is 10x larger than others |
| Analytics events | event_id (hash) | event_type | "page_view" = 80% of data on one shard |
Composite Shard Keys
Sometimes a single field is not enough. Combine fields:
DynamoDB Single-Table Design:
Partition Key Sort Key Data
================== ================ ====================
USER#alice PROFILE {name, email, bio}
USER#alice ORDER#2025-01-15 {total: 99.99}
USER#alice ORDER#2025-01-20 {total: 45.50}
PRODUCT#laptop-1 DETAILS {name, price, desc}
PRODUCT#laptop-1 REVIEW#user-bob {rating: 5, text: ...}
The partition key distributes data across shards.
The sort key enables efficient range queries within a partition.
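The partition-key/sort-key behavior can be modeled with a toy in-memory table. This is not the DynamoDB API, just a sketch of the layout: each partition keeps its items sorted by sort key, so a prefix query like "all ORDER# items for USER#alice" touches a single partition.

```python
import bisect
from collections import defaultdict

class PartitionedTable:
    """Toy partition-key + sort-key store (DynamoDB-style layout sketch)."""

    def __init__(self):
        self._partitions = defaultdict(list)   # pk -> sorted [(sk, item)]

    def put(self, pk, sk, item):
        # pk + sk is the primary key, so sort keys are unique per partition.
        bisect.insort(self._partitions[pk], (sk, item))

    def query_prefix(self, pk, sk_prefix):
        """All items in partition `pk` whose sort key starts with sk_prefix."""
        rows = self._partitions[pk]
        start = bisect.bisect_left(rows, (sk_prefix,))
        out = []
        for sk, item in rows[start:]:
            if not sk.startswith(sk_prefix):
                break                      # sorted order: no more matches
            out.append(item)
        return out
```

Because items are sorted within the partition, the range scan is a binary search plus a short sequential read, never a cross-partition fan-out.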
Denormalization for Performance
Normalized data (3NF) minimizes redundancy but requires joins. At scale, joins across shards are expensive. Denormalization trades storage for speed.
When to Denormalize
NORMALIZED (3NF) DENORMALIZED
================== ==================
POSTS table: POSTS table:
| id | user_id | text | | id | user_id | user_name | user_avatar | text |
| | | | | | | | | |
+ USERS table: No JOIN needed to display post with
| id | name | avatar | user info. Faster reads.
Requires JOIN to display But: if user changes name, must
post with user info. update ALL their posts.
Denormalization Patterns
| Pattern | How | When |
|---|---|---|
| Embed (NoSQL) | Store related data in same document | Data accessed together, rarely changes independently |
| Duplicate | Copy fields into another table | Hot read path needs data from multiple tables |
| Pre-compute | Store aggregated values | Counts, sums, averages accessed frequently |
| Materialized view | Database maintains a pre-joined view | Complex queries needed frequently |
Example: E-Commerce Product Page
NORMALIZED (requires 4 queries):
1. SELECT * FROM products WHERE id = 101
2. SELECT * FROM categories WHERE id = products.category_id
3. SELECT AVG(rating) FROM reviews WHERE product_id = 101
4. SELECT * FROM reviews WHERE product_id = 101 ORDER BY date DESC LIMIT 5
DENORMALIZED (1 query or 1 document read):
{
"id": 101,
"name": "Wireless Headphones",
"price": 79.99,
"category_name": "Electronics", // duplicated from categories
"avg_rating": 4.5, // pre-computed
"review_count": 234, // pre-computed
"recent_reviews": [ // embedded
{ "user": "Alice", "rating": 5, "text": "Great sound!" },
{ "user": "Bob", "rating": 4, "text": "Good value" }
]
}
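Pre-computed fields like `avg_rating` have to be maintained incrementally on the write path. A sketch, assuming the denormalized document shape above; in a real system the review insert and this update belong in one transaction, or are reconciled by a background job:

```python
def add_review(product, rating):
    """Update pre-computed rating fields when a new review arrives.

    `product` is a denormalized document like the example above
    (fields: avg_rating, review_count).
    """
    n = product["review_count"]
    # Incremental mean: avoids re-scanning every review on each write.
    product["avg_rating"] = (product["avg_rating"] * n + rating) / (n + 1)
    product["review_count"] = n + 1
    return product
```

For example, a product averaging 4.0 over 3 reviews moves to 4.25 over 4 reviews after a 5-star review, without ever reading the reviews table.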
Tradeoffs
| Factor | Normalized | Denormalized |
|---|---|---|
| Read speed | Slower (joins) | Faster (single read) |
| Write speed | Faster (update once) | Slower (update many copies) |
| Storage | Less | More |
| Consistency | Easy (single source of truth) | Hard (multiple copies to sync) |
| Flexibility | High (ad-hoc queries) | Low (designed for specific queries) |
Connection Pooling
Opening a database connection is expensive (TCP handshake, authentication, SSL). Connection pooling reuses connections.
WITHOUT POOLING WITH POOLING
================ ================
Request 1 --> Open Conn --> Query --> Close Conn Request 1 --+
Request 2 --> Open Conn --> Query --> Close Conn |
Request 3 --> Open Conn --> Query --> Close Conn Request 2 --+--> [Pool: 10 conns] --> DB
... |
(each open = ~5-20ms overhead) Request 3 --+
(connections reused, near-zero overhead)
Pool Configuration
+------------------------------------------+
| Connection Pool |
| |
| min_connections: 5 (always ready) |
| max_connections: 20 (upper limit) |
| idle_timeout: 300s (close unused) |
| connection_timeout: 5s (wait for conn) |
| max_lifetime: 1800s (recycle conns) |
| |
| Active: [C1] [C2] [C3] [C4] [C5] |
| Idle: [C6] [C7] |
| Available: 13 more |
+------------------------------------------+
Pool Sizing Formula
connections = ((core_count * 2) + effective_spindle_count)
Example (an SSD is conventionally counted as effective_spindle_count = 1):
- 4-core CPU, SSD: connections = (4 * 2) + 1 = 9
- 8-core CPU, SSD: connections = (8 * 2) + 1 = 17
Rule: A smaller pool (10-20) with a queue usually
outperforms a larger pool (100+) because of:
- Less context switching
- Less memory per connection (~10MB in PostgreSQL)
- Less lock contention
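The core of a pool is a blocking queue of pre-opened connections. A minimal sketch: `connect` is any zero-argument factory, and real poolers (listed below) add health checks, idle timeouts, and max-lifetime recycling on top of this idea.

```python
import queue

class ConnectionPool:
    """Minimal fixed-size connection pool built on a blocking queue."""

    def __init__(self, connect, size=10, timeout=5.0):
        self._idle = queue.Queue(maxsize=size)
        self._timeout = timeout
        for _ in range(size):
            self._idle.put(connect())   # open all connections up front

    def acquire(self):
        # Blocks up to `timeout` seconds (like connection_timeout above);
        # raises queue.Empty if the pool stays exhausted.
        return self._idle.get(timeout=self._timeout)

    def release(self, conn):
        self._idle.put(conn)
```

Callers should always pair `acquire()` with `release()` in a try/finally block, otherwise connections leak and the pool drains.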
Popular Connection Poolers
| Tool | Database | Feature |
|---|---|---|
| PgBouncer | PostgreSQL | Lightweight, transaction/session pooling |
| pgpool-II | PostgreSQL | Pooling + load balancing + replication |
| ProxySQL | MySQL | Query routing, caching, failover |
| HikariCP | Java (any DB) | Fastest JVM connection pool |
Indexing Strategies
An index is a data structure that speeds up reads at the cost of slower writes and extra storage.
How Indexes Work
WITHOUT INDEX (Full Table Scan) WITH INDEX (B-Tree Lookup)
================================ ============================
SELECT * FROM users SELECT * FROM users
WHERE email = 'alice@mail.com' WHERE email = 'alice@mail.com'
Scan: Row 1 -> No B-Tree Index on email
Row 2 -> No =======================
Row 3 -> No [M]
... / | \
Row 999,999 -> No [D] [P] [T]
Row 1,000,000 -> YES! / \ | \
[A] [G] [N] [W]
Time: O(N) = scan all rows |
1,000,000 row reads [alice@]
Time: O(log N) = ~20 reads
for 1,000,000 rows
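The ~20-reads figure is just log2 of the row count, counting binary comparisons (a real B-tree has high fan-out, so it typically touches only 3-4 pages). A quick check, using binary search over a sorted list as a stand-in for the B-tree descent:

```python
import bisect
import math

N = 1_000_000
rows = list(range(N))   # stand-in for a sorted index on one column

# Worst-case probes for each strategy on 1M rows:
full_scan_probes = N                      # O(N): check every row
btree_probes = math.ceil(math.log2(N))    # O(log N): halve the space each step

# Binary search plays the role of the B-tree descent.
idx = bisect.bisect_left(rows, 999_999)
```

Doubling the table to 2M rows adds one more probe to the index path but another million to the scan, which is why the gap keeps widening as data grows.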
Index Types
| Index Type | How It Works | Best For |
|---|---|---|
| B-Tree (default) | Balanced tree, sorted | Equality and range queries (=, <, >, BETWEEN) |
| Hash | Hash table lookup | Exact equality only (=), very fast |
| GIN (Generalized Inverted) | Inverted index | Full-text search, JSONB, arrays |
| GiST (Generalized Search Tree) | Tree for spatial data | Geospatial queries, range types |
| BRIN (Block Range) | Min/max per block range | Large, naturally ordered tables (time-series) |
Index Best Practices
-- 1. Index columns used in WHERE clauses
CREATE INDEX idx_users_email ON users(email);
-- 2. Composite index for multi-column queries
-- (Column order matters! Leftmost prefix rule)
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- This index supports:
-- WHERE user_id = 42 (YES - uses index)
-- WHERE user_id = 42 AND status = 'pending' (YES - uses full index)
-- WHERE status = 'pending' (NO - cannot skip first column)
-- 3. Covering index (includes all queried columns)
CREATE INDEX idx_orders_covering ON orders(user_id, status) INCLUDE (total, created_at);
-- Query can be answered entirely from the index, no table lookup needed
-- 4. Partial index (index a subset of rows)
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Smaller index, only indexes pending orders
-- 5. Unique index (also enforces constraint)
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);
When NOT to Index
| Situation | Why |
|---|---|
| Small tables (<1000 rows) | Full scan is faster than index lookup |
| Columns rarely queried in WHERE | Index cost without benefit |
| Columns with low cardinality | gender (M/F) - index does not help much |
| Write-heavy tables with many indexes | Each INSERT/UPDATE must update all indexes |
| Wide columns (long text) | Index becomes too large |
Index Monitoring
-- PostgreSQL: Find unused indexes
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- PostgreSQL: Find slow queries needing indexes (requires pg_stat_statements)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- (on PostgreSQL 12 and earlier, the columns are mean_time / total_time)
-- PostgreSQL: Check index size
SELECT pg_size_pretty(pg_indexes_size('users'));
The Write Cost of Indexes
INSERT into table with 0 indexes: 1 write
INSERT into table with 1 index: 2 writes (table + index)
INSERT into table with 5 indexes: 6 writes (table + 5 indexes)
INSERT into table with 10 indexes: 11 writes
Rule: Every index slows writes.
Target: 5-10 indexes per table is reasonable.
Alarm: 20+ indexes means something is wrong.
Key Takeaways
- Scale vertically first — A bigger machine with proper indexing handles more than most people expect. Do not shard prematurely.
- Read replicas are the first horizontal scaling step — Since most apps are read-heavy, replicas give you 3-5x capacity with minimal complexity.
- Replication lag is a real problem — Design your app to handle it (read-your-writes, sticky sessions, or synchronous replication for critical reads).
- Shard key selection is the most important sharding decision — A bad key creates hotspots. Choose based on query patterns, high cardinality, and even distribution.
- Hash sharding distributes evenly; range sharding enables range queries — Pick based on your access patterns.
- Denormalize on the read path, normalize on the write path — Duplicate data to speed up reads, but maintain a source of truth.
- Connection pools should be small — 10-20 connections per application server usually outperforms 100+.
- Every index speeds reads and slows writes — Be strategic. Index what you query, monitor what you do not use, and remove dead indexes.
Explain-It Challenge
Scenario: You run an e-commerce platform. Currently:
- 50 million products in a single PostgreSQL database
- 10,000 queries per second (80% reads, 20% writes)
- Average query time has gone from 10ms to 500ms over 6 months
- The database server CPU is at 85%
Walk through your scaling plan step by step:
- What quick wins would you try first (no architecture changes)?
- When would you add read replicas?
- At what point would you consider sharding?
- What shard key would you choose and why?
- How would you handle cross-shard queries (e.g., "search all products")?
Next -> 9.8.e — Data Consistency