Episode 9 — System Design / 9.8 — Communication and Data Layer
9.8.d — Database Scaling
Big Picture: A single database server can handle a few thousand requests per second. At some point, your data grows beyond what one machine can store, or your traffic exceeds what one machine can serve. Database scaling is how you break past those limits without losing reliability.
Table of Contents
- Why Databases Become Bottlenecks
- Vertical vs Horizontal Scaling
- Read Replicas
- Master-Slave Replication
- Sharding Strategies
- Shard Key Selection
- Denormalization for Performance
- Connection Pooling
- Indexing Strategies
- Key Takeaways
- Explain-It Challenge
Why Databases Become Bottlenecks
Stage 1: Single Server Stage 2: Growing Pains Stage 3: Crisis
==================== ==================== ================
App --> [DB] App --> [DB overloaded] App --> [DB DOWN]
10K rows 10M rows 100M rows
100 QPS 5,000 QPS 50,000 QPS
"Everything is fine" "Queries getting slow" "Site is down"
Common symptoms:
- Query response times increasing (P99 > 1 second)
- CPU/memory on DB server at 90%+
- Connection pool exhausted
- Disk I/O maxed out
- Replication lag growing
Vertical vs Horizontal Scaling
VERTICAL SCALING HORIZONTAL SCALING
(Scale Up) (Scale Out)
================== ==================
+--------+ +----------+ +----+ +----+ +----+
| 4 CPU | | 64 CPU | | DB | | DB | | DB |
| 16 GB | ---> | 512 GB | | 1 | | 2 | | 3 |
| 1 TB | | 10 TB | +----+ +----+ +----+
+--------+ +----------+ | | |
(buy a bigger machine) (add more machines)
| Factor | Vertical | Horizontal |
|---|---|---|
| How | Bigger CPU, RAM, SSD | More machines |
| Limit | Hardware ceiling (~128 cores, 4TB RAM) | Virtually unlimited |
| Downtime | Usually requires restart | Add nodes with zero downtime |
| Cost | Exponentially expensive at top | Linear cost growth |
| Complexity | Simple (no code changes) | Complex (sharding, routing) |
| Data model | No changes | May need to partition data |
| Best for | SQL databases (PostgreSQL, MySQL) | NoSQL (Cassandra, DynamoDB) |
Rule of thumb: Scale vertically first (it is simpler). Switch to horizontal when you hit hardware limits or need fault tolerance across zones.
Read Replicas
Most applications are read-heavy (often 90% or more of queries are reads). Read replicas distribute read traffic across multiple copies of the database.
BEFORE: Single DB handles everything
=====================================
App Server ---- READ + WRITE ----> [Primary DB]
App Server ---- READ + WRITE ----> [Primary DB]
App Server ---- READ + WRITE ----> [Primary DB]
(bottleneck)
AFTER: Primary handles writes, replicas handle reads
=====================================================
WRITE
App Server ---- WRITE ----> [Primary DB]
|
Replication (async)
/ | \
v v v
[Replica 1] [Replica 2] [Replica 3]
^ ^ ^
App Server ---- READ --|---------|---------|
App Server ---- READ --|---------|---------|
App Server ---- READ --|---------|---------|
How It Works
| Step | What Happens |
|---|---|
| 1. Write arrives | Goes to primary (master) only |
| 2. WAL shipping | Primary writes to Write-Ahead Log (WAL) |
| 3. Replication | WAL entries sent to replicas (async or sync) |
| 4. Replica applies | Replica replays WAL entries to update its data |
| 5. Read routed | Application reads from replicas (round-robin or nearest) |
Replication Types
| Type | How It Works | Tradeoff |
|---|---|---|
| Asynchronous | Primary does not wait for replicas to confirm | Fast writes, but replicas may be slightly behind (lag) |
| Synchronous | Primary waits for at least one replica to confirm | No lost writes on failover, but slower writes |
| Semi-synchronous | Primary waits for one replica, others are async | Balance of speed and safety |
Replication Lag
Timeline:
=========
t=0 Primary: INSERT user "Alice"
t=0 Primary: committed, returns success to client
t=50ms Replica 1: receives and applies the INSERT
t=80ms Replica 2: receives and applies the INSERT
PROBLEM: At t=20ms, a read to Replica 1 returns "Alice not found"
This is "replication lag" — typically 10-100ms, but can spike to seconds
under heavy write load.
Mitigations for replication lag:
- Read-your-writes consistency — After a write, route that user's reads to primary for a few seconds
- Monotonic reads — Pin a user to a specific replica for a session
- Synchronous replication — Closes the lag window for critical reads but reduces write throughput
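The read-your-writes mitigation can be sketched as a thin routing layer in front of the database driver. This is an illustrative Python sketch, not a production router: `primary` and `replicas` stand in for whatever connection handles your driver provides, and the window should be set above your observed replication lag.

```python
import time

class ReadYourWritesRouter:
    """Route a user's reads to the primary for a short window after a write."""

    def __init__(self, primary, replicas, window_seconds=5.0):
        self.primary = primary
        self.replicas = replicas
        self.window = window_seconds
        self._last_write = {}   # user_id -> timestamp of last write
        self._rr = 0            # round-robin cursor over replicas

    def connection_for_write(self, user_id):
        self._last_write[user_id] = time.monotonic()
        return self.primary

    def connection_for_read(self, user_id):
        wrote_at = self._last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.window:
            return self.primary          # replicas may still be behind
        self._rr = (self._rr + 1) % len(self.replicas)
        return self.replicas[self._rr]   # slightly stale data is acceptable
```

A window of a few seconds is a pragmatic upper bound on typical lag; pinning a user to one replica for a session (monotonic reads) can be layered on the same idea.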
Master-Slave Replication
Single-Master (Most Common)
+------------------+
| PRIMARY |
| (Master) | <-- All WRITES go here
| |
+--------+---------+
|
Replication Stream
/ | \
v v v
+--------+ +--------+ +--------+
|Replica | |Replica | |Replica |
| (R1) | | (R2) | | (R3) | <-- All READS distributed here
+--------+ +--------+ +--------+
Multi-Master (Less Common)
+----------+ +----------+
| Master | <-----> | Master |
| (US) | sync | (EU) |
+----+-----+ +----+-----+
| |
+--+--+ +---+---+
|R1|R2| |R3 |R4 |
+--+--+ +---+---+
| Topology | Writes | Conflicts | Complexity | Use Case |
|---|---|---|---|---|
| Single-master | One node only | None | Low | Most applications |
| Multi-master | Any node | Possible (must resolve) | High | Multi-region low-latency writes |
| Masterless (Cassandra) | Any node | Possible (last-write-wins) | Medium | High availability |
Failover
NORMAL OPERATION AFTER PRIMARY FAILURE
================ =====================
[Primary] <-- writes [Primary X] (failed)
| |
[R1] [R2] [R3] [R1 -> NEW Primary] [R2] [R3]
^
promoted!
Steps:
1. Health check detects primary is down
2. Election: choose replica with least lag
3. Promote replica to primary
4. Redirect writes to new primary
5. Reconfigure other replicas to follow new primary
Risk: Some committed writes on old primary may be lost
if they were not replicated before failure.
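The election step (choose the replica with the least lag) can be illustrated with a small helper. The replica records here (`name`, `lag_ms`, `healthy`) are hypothetical fields, not the output of any real tool; orchestrators such as Patroni implement this logic with many more safeguards.

```python
def elect_new_primary(replicas):
    """Pick the healthy replica with the least replication lag to promote.

    `replicas` is a list of dicts like {"name", "lag_ms", "healthy"};
    the field names are illustrative only.
    """
    candidates = [r for r in replicas if r["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy replica to promote")
    # Least lag means the fewest committed-but-unreplicated writes are lost.
    return min(candidates, key=lambda r: r["lag_ms"])
```

For example, with R1 at 40ms lag, R2 at 10ms, and R3 unhealthy, R2 is promoted even though R3 reports less lag.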
Sharding Strategies
Sharding (horizontal partitioning) splits data across multiple database instances.
BEFORE SHARDING AFTER SHARDING
================ ================
+------------------+ +--------+ +--------+ +--------+
| ALL DATA | | Users | | Users | | Users |
| (100M users) | | A-H | | I-P | | Q-Z |
| Single DB | | Shard 1| | Shard 2| | Shard 3|
+------------------+ +--------+ +--------+ +--------+
1. Range-Based Sharding
Shard Key: user_id (integer)
Shard 1: user_id 1 - 1,000,000
Shard 2: user_id 1,000,001 - 2,000,000
Shard 3: user_id 2,000,001 - 3,000,000
+--Shard 1--+ +--Shard 2--+ +--Shard 3--+
| 1 - 1M | | 1M - 2M | | 2M - 3M |
+-----+-----+ +-----+-----+ +-----+-----+
| | |
Routing: IF user_id <= 1M -> Shard 1
IF user_id <= 2M -> Shard 2
ELSE -> Shard 3
Pros: Simple routing logic, easy range queries. Cons: Hotspots (new users all go to latest shard), uneven distribution.
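Range routing is a sorted lookup over the shard boundaries. A minimal sketch, where the boundaries follow the diagram above and shard numbering starts at 1:

```python
import bisect

# Upper bound (inclusive) of each shard's user_id range; must stay sorted.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id):
    """Return the shard number whose range contains user_id."""
    idx = bisect.bisect_left(SHARD_UPPER_BOUNDS, user_id)
    if idx == len(SHARD_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is beyond the last shard's range")
    return idx + 1   # shards numbered 1..3 as in the diagram
```

Binary search keeps routing O(log S) in the number of shards, and adding a shard is just appending a new boundary.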
2. Hash-Based Sharding
Shard Key: user_id
Routing: shard_number = hash(user_id) % num_shards
hash("alice") % 3 = 0 --> Shard 0
hash("bob") % 3 = 2 --> Shard 2
hash("charlie") % 3 = 1 --> Shard 1
+--Shard 0--+ +--Shard 1--+ +--Shard 2--+
| alice | | charlie | | bob |
| dave | | frank | | eve |
+------------+ +------------+ +------------+
Distribution: roughly even (depends on hash function quality)
Pros: Even data distribution, no hotspots. Cons: Range queries require hitting all shards, resharding is expensive (consistent hashing helps).
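A minimal hash-routing sketch. One caveat worth knowing: Python's built-in `hash()` is randomized per process for strings, so routing must use a stable hash such as CRC32 or MD5:

```python
import zlib

NUM_SHARDS = 3

def hash_shard(key):
    """Map a key to a shard number with a stable, process-independent hash."""
    # zlib.crc32 always returns the same unsigned int for the same bytes,
    # unlike built-in hash(), which is salted per process for strings.
    return zlib.crc32(key.encode()) % NUM_SHARDS
```

Because of the `% NUM_SHARDS`, changing the shard count remaps nearly every key; that resharding cost is exactly what the consistent hashing section below addresses.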
3. Geographic Sharding
+-----------------+ +-----------------+ +-----------------+
| US Shard | | EU Shard | | APAC Shard |
| (us-east-1) | | (eu-west-1) | | (ap-southeast) |
| | | | | |
| US users | | EU users | | APAC users |
| US orders | | EU orders | | APAC orders |
+-----------------+ +-----------------+ +-----------------+
^ ^ ^
| | |
US traffic EU traffic APAC traffic
Pros: Data locality (low latency), GDPR compliance (EU data stays in EU). Cons: Cross-region queries are expensive, and users who travel or relocate complicate routing.
4. Consistent Hashing (Advanced)
Solves the resharding problem. When you add/remove a shard, only ~1/N of data moves.
Hash Ring (0 to 2^32)
======================
Shard A (position 1000)
/
/
--------+--------+--------+--------+------>
0 1000 3000 6000 9000 2^32
| | |
Shard A Shard B Shard C
Key "alice" hashes to 2500 --> next shard clockwise --> Shard B
Key "bob" hashes to 7800 --> next shard clockwise --> Shard A (wraps)
ADD Shard D at position 4500:
- Only keys between 3000-4500 move from Shard C to Shard D
- Everything else stays put!
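A minimal ring implementation, assuming MD5 for placement and no virtual nodes (real systems add many virtual nodes per shard to even out the arcs):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: one position per shard, for clarity."""

    def __init__(self):
        self._positions = []   # sorted ring positions
        self._shards = {}      # position -> shard name

    @staticmethod
    def _hash(key):
        # Stable hash mapped onto the 0..2^32 ring from the diagram.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def add_shard(self, name):
        pos = self._hash(name)
        bisect.insort(self._positions, pos)
        self._shards[pos] = name

    def shard_for(self, key):
        pos = self._hash(key)
        # First shard at or clockwise of the key's position...
        idx = bisect.bisect_left(self._positions, pos)
        if idx == len(self._positions):
            idx = 0              # ...wrapping around the ring
        return self._shards[self._positions[idx]]
```

Adding a shard moves only the keys that fall on the new shard's arc; every other key keeps its old assignment, which is the "~1/N of data moves" property.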
Sharding Strategy Comparison
| Strategy | Distribution | Range Queries | Resharding Cost | Complexity |
|---|---|---|---|---|
| Range | Uneven (hotspots) | Efficient (single shard) | Medium | Low |
| Hash | Even | Expensive (all shards) | High (consistent hashing helps) | Medium |
| Geographic | By region | Within region: good | Medium | Medium |
| Directory-based | Configurable | Depends on mapping | Low (just update directory) | High |
Shard Key Selection
The shard key determines which shard a piece of data lives on. A bad shard key ruins your entire architecture.
Good Shard Key Criteria
| Criteria | Why |
|---|---|
| High cardinality | Many distinct values = even distribution |
| Even distribution | No hotspots (one shard getting all traffic) |
| Matches query patterns | Most queries should target a single shard |
| Immutable | Changing the shard key means moving data between shards |
Shard Key Examples
| System | Good Shard Key | Bad Shard Key | Why Bad |
|---|---|---|---|
| E-commerce orders | user_id | order_date | All today's orders hit one shard |
| Chat messages | conversation_id | sender_id | Querying a conversation needs all shards |
| IoT sensor data | device_id | sensor_type | Few types = few shards used |
| Social media posts | user_id | country | US shard is 10x larger than others |
| Analytics events | event_id (hash) | event_type | "page_view" = 80% of data on one shard |
Composite Shard Keys
Sometimes a single field is not enough. Combine fields:
DynamoDB Single-Table Design:
Partition Key Sort Key Data
================== ================ ====================
USER#alice PROFILE {name, email, bio}
USER#alice ORDER#2025-01-15 {total: 99.99}
USER#alice ORDER#2025-01-20 {total: 45.50}
PRODUCT#laptop-1 DETAILS {name, price, desc}
PRODUCT#laptop-1 REVIEW#user-bob {rating: 5, text: ...}
The partition key distributes data across shards.
The sort key enables efficient range queries within a partition.
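The partition-key/sort-key behavior can be modeled with a toy in-memory table. This is not the DynamoDB API, just a sketch of the layout: each partition keeps its items sorted by sort key, so a prefix query like "all ORDER# items for USER#alice" touches a single partition.

```python
import bisect
from collections import defaultdict

class PartitionedTable:
    """Toy partition-key + sort-key store (DynamoDB-style layout sketch)."""

    def __init__(self):
        self._partitions = defaultdict(list)   # pk -> sorted [(sk, item)]

    def put(self, pk, sk, item):
        # pk + sk is the primary key, so sort keys are unique per partition.
        bisect.insort(self._partitions[pk], (sk, item))

    def query_prefix(self, pk, sk_prefix):
        """All items in partition `pk` whose sort key starts with sk_prefix."""
        rows = self._partitions[pk]
        start = bisect.bisect_left(rows, (sk_prefix,))
        out = []
        for sk, item in rows[start:]:
            if not sk.startswith(sk_prefix):
                break                      # sorted order: no more matches
            out.append(item)
        return out
```

Because items are sorted within the partition, the range scan is a binary search plus a short sequential read, never a cross-partition fan-out.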
Denormalization for Performance
Normalized data (3NF) minimizes redundancy but requires joins. At scale, joins across shards are expensive. Denormalization trades storage for speed.
When to Denormalize
NORMALIZED (3NF) DENORMALIZED
================== ==================
POSTS table: POSTS table:
| id | user_id | text | | id | user_id | user_name | user_avatar | text |
| | | | | | | | | |
+ USERS table: No JOIN needed to display post with
| id | name | avatar | user info. Faster reads.
Requires JOIN to display But: if user changes name, must
post with user info. update ALL their posts.
Denormalization Patterns
| Pattern | How | When |
|---|---|---|
| Embed (NoSQL) | Store related data in same document | Data accessed together, rarely changes independently |
| Duplicate | Copy fields into another table | Hot read path needs data from multiple tables |
| Pre-compute | Store aggregated values | Counts, sums, averages accessed frequently |
| Materialized view | Database maintains a pre-joined view | Complex queries needed frequently |
Example: E-Commerce Product Page
NORMALIZED (requires 4 queries):
1. SELECT * FROM products WHERE id = 101
2. SELECT * FROM categories WHERE id = products.category_id
3. SELECT AVG(rating) FROM reviews WHERE product_id = 101
4. SELECT * FROM reviews WHERE product_id = 101 ORDER BY date DESC LIMIT 5
DENORMALIZED (1 query or 1 document read):
{
"id": 101,
"name": "Wireless Headphones",
"price": 79.99,
"category_name": "Electronics", // duplicated from categories
"avg_rating": 4.5, // pre-computed
"review_count": 234, // pre-computed
"recent_reviews": [ // embedded
{ "user": "Alice", "rating": 5, "text": "Great sound!" },
{ "user": "Bob", "rating": 4, "text": "Good value" }
]
}
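Pre-computed fields like `avg_rating` have to be maintained incrementally on the write path. A sketch, assuming the denormalized document shape above; in a real system the review insert and this update belong in one transaction, or are reconciled by a background job:

```python
def add_review(product, rating):
    """Update pre-computed rating fields when a new review arrives.

    `product` is a denormalized document like the example above
    (fields: avg_rating, review_count).
    """
    n = product["review_count"]
    # Incremental mean: avoids re-scanning every review on each write.
    product["avg_rating"] = (product["avg_rating"] * n + rating) / (n + 1)
    product["review_count"] = n + 1
    return product
```

For example, a product averaging 4.0 over 3 reviews moves to 4.25 over 4 reviews after a 5-star review, without ever reading the reviews table.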
Tradeoffs
| Factor | Normalized | Denormalized |
|---|---|---|
| Read speed | Slower (joins) | Faster (single read) |
| Write speed | Faster (update once) | Slower (update many copies) |
| Storage | Less | More |
| Consistency | Easy (single source of truth) | Hard (multiple copies to sync) |
| Flexibility | High (ad-hoc queries) | Low (designed for specific queries) |
Connection Pooling
Opening a database connection is expensive (TCP handshake, authentication, SSL). Connection pooling reuses connections.
WITHOUT POOLING WITH POOLING
================ ================
Request 1 --> Open Conn --> Query --> Close Conn Request 1 --+
Request 2 --> Open Conn --> Query --> Close Conn |
Request 3 --> Open Conn --> Query --> Close Conn Request 2 --+--> [Pool: 10 conns] --> DB
... |
(each open = ~5-20ms overhead) Request 3 --+
(connections reused, near-zero overhead)
Pool Configuration
+------------------------------------------+
| Connection Pool |
| |
| min_connections: 5 (always ready) |
| max_connections: 20 (upper limit) |
| idle_timeout: 300s (close unused) |
| connection_timeout: 5s (wait for conn) |
| max_lifetime: 1800s (recycle conns) |
| |
| Active: [C1] [C2] [C3] [C4] [C5] |
| Idle: [C6] [C7] |
| Available: 13 more |
+------------------------------------------+
Pool Sizing Formula
connections = ((core_count * 2) + effective_spindle_count)
Example (an SSD is conventionally counted as effective_spindle_count = 1):
- 4-core CPU, SSD: connections = (4 * 2) + 1 = 9
- 8-core CPU, SSD: connections = (8 * 2) + 1 = 17
Rule: A smaller pool (10-20) with a queue usually
outperforms a larger pool (100+) because of:
- Less context switching
- Less memory per connection (~10MB in PostgreSQL)
- Less lock contention
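The core of a pool is a blocking queue of pre-opened connections. A minimal sketch: `connect` is any zero-argument factory, and real poolers (listed below) add health checks, idle timeouts, and max-lifetime recycling on top of this idea.

```python
import queue

class ConnectionPool:
    """Minimal fixed-size connection pool built on a blocking queue."""

    def __init__(self, connect, size=10, timeout=5.0):
        self._idle = queue.Queue(maxsize=size)
        self._timeout = timeout
        for _ in range(size):
            self._idle.put(connect())   # open all connections up front

    def acquire(self):
        # Blocks up to `timeout` seconds (like connection_timeout above);
        # raises queue.Empty if the pool stays exhausted.
        return self._idle.get(timeout=self._timeout)

    def release(self, conn):
        self._idle.put(conn)
```

Callers should always pair `acquire()` with `release()` in a try/finally block, otherwise connections leak and the pool drains.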
Popular Connection Poolers
| Tool | Database | Feature |
|---|---|---|
| PgBouncer | PostgreSQL | Lightweight, transaction/session pooling |
| pgpool-II | PostgreSQL | Pooling + load balancing + replication |
| ProxySQL | MySQL | Query routing, caching, failover |
| HikariCP | Java (any DB) | Fastest JVM connection pool |
Indexing Strategies
An index is a data structure that speeds up reads at the cost of slower writes and extra storage.
How Indexes Work
WITHOUT INDEX (Full Table Scan) WITH INDEX (B-Tree Lookup)
================================ ============================
SELECT * FROM users SELECT * FROM users
WHERE email = 'alice@mail.com' WHERE email = 'alice@mail.com'
Scan: Row 1 -> No B-Tree Index on email
Row 2 -> No =======================
Row 3 -> No [M]
... / | \
Row 999,999 -> No [D] [P] [T]
Row 1,000,000 -> YES! / \ | \
[A] [G] [N] [W]
Time: O(N) = scan all rows |
1,000,000 row reads [alice@]
Time: O(log N) = ~20 reads
for 1,000,000 rows
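The ~20-reads figure is just log2 of the row count, counting binary comparisons (a real B-tree has high fan-out, so it typically touches only 3-4 pages). A quick check, using binary search over a sorted list as a stand-in for the B-tree descent:

```python
import bisect
import math

N = 1_000_000
rows = list(range(N))   # stand-in for a sorted index on one column

# Worst-case probes for each strategy on 1M rows:
full_scan_probes = N                      # O(N): check every row
btree_probes = math.ceil(math.log2(N))    # O(log N): halve the space each step

# Binary search plays the role of the B-tree descent.
idx = bisect.bisect_left(rows, 999_999)
```

Doubling the table to 2M rows adds one more probe to the index path but another million to the scan, which is why the gap keeps widening as data grows.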
Index Types
| Index Type | How It Works | Best For |
|---|---|---|
| B-Tree (default) | Balanced tree, sorted | Equality and range queries (=, <, >, BETWEEN) |
| Hash | Hash table lookup | Exact equality only (=), very fast |
| GIN (Generalized Inverted) | Inverted index | Full-text search, JSONB, arrays |
| GiST (Generalized Search Tree) | Tree for spatial data | Geospatial queries, range types |
| BRIN (Block Range) | Min/max per block range | Large, naturally ordered tables (time-series) |
Index Best Practices
-- 1. Index columns used in WHERE clauses
CREATE INDEX idx_users_email ON users(email);
-- 2. Composite index for multi-column queries
-- (Column order matters! Leftmost prefix rule)
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- This index supports:
-- WHERE user_id = 42 (YES - uses index)
-- WHERE user_id = 42 AND status = 'pending' (YES - uses full index)
-- WHERE status = 'pending' (NO - cannot skip first column)
-- 3. Covering index (includes all queried columns)
CREATE INDEX idx_orders_covering ON orders(user_id, status) INCLUDE (total, created_at);
-- Query can be answered entirely from the index, no table lookup needed
-- 4. Partial index (index a subset of rows)
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Smaller index, only indexes pending orders
-- 5. Unique index (also enforces constraint)
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);
When NOT to Index
| Situation | Why |
|---|---|
| Small tables (<1000 rows) | Full scan is faster than index lookup |
| Columns rarely queried in WHERE | Index cost without benefit |
| Columns with low cardinality | gender (M/F) - index does not help much |
| Write-heavy tables with many indexes | Each INSERT/UPDATE must update all indexes |
| Wide columns (long text) | Index becomes too large |
Index Monitoring
-- PostgreSQL: Find unused indexes
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- PostgreSQL: Find slow queries needing indexes (requires pg_stat_statements)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- (on PostgreSQL 12 and earlier, the columns are mean_time / total_time)
-- PostgreSQL: Check index size
SELECT pg_size_pretty(pg_indexes_size('users'));
The Write Cost of Indexes
INSERT into table with 0 indexes: 1 write
INSERT into table with 1 index: 2 writes (table + index)
INSERT into table with 5 indexes: 6 writes (table + 5 indexes)
INSERT into table with 10 indexes: 11 writes
Rule: Every index slows writes.
Target: 5-10 indexes per table is reasonable.
Alarm: 20+ indexes means something is wrong.
Key Takeaways
- Scale vertically first — A bigger machine with proper indexing handles more than most people expect. Do not shard prematurely.
- Read replicas are the first horizontal scaling step — Since most apps are read-heavy, replicas give you 3-5x capacity with minimal complexity.
- Replication lag is a real problem — Design your app to handle it (read-your-writes, sticky sessions, or synchronous replication for critical reads).
- Shard key selection is the most important sharding decision — A bad key creates hotspots. Choose based on query patterns, high cardinality, and even distribution.
- Hash sharding distributes evenly; range sharding enables range queries — Pick based on your access patterns.
- Denormalize on the read path, normalize on the write path — Duplicate data to speed up reads, but maintain a source of truth.
- Connection pools should be small — 10-20 connections per application server usually outperforms 100+.
- Every index speeds reads and slows writes — Be strategic. Index what you query, monitor what you do not use, and remove dead indexes.
Explain-It Challenge
Scenario: You run an e-commerce platform. Currently:
- 50 million products in a single PostgreSQL database
- 10,000 queries per second (80% reads, 20% writes)
- Average query time has gone from 10ms to 500ms over 6 months
- The database server CPU is at 85%
Walk through your scaling plan step by step:
- What quick wins would you try first (no architecture changes)?
- When would you add read replicas?
- At what point would you consider sharding?
- What shard key would you choose and why?
- How would you handle cross-shard queries (e.g., "search all products")?
Next -> 9.8.e — Data Consistency