Episode 9 — System Design / 9.8 — Communication and Data Layer

9.8.d — Database Scaling

Big Picture: A single database server can handle a few thousand requests per second. At some point, your data grows beyond what one machine can store, or your traffic exceeds what one machine can serve. Database scaling is how you break past those limits without losing reliability.


Table of Contents

  1. Why Databases Become Bottlenecks
  2. Vertical vs Horizontal Scaling
  3. Read Replicas
  4. Master-Slave Replication
  5. Sharding Strategies
  6. Shard Key Selection
  7. Denormalization for Performance
  8. Connection Pooling
  9. Indexing Strategies
  10. Key Takeaways
  11. Explain-It Challenge

Why Databases Become Bottlenecks

  Stage 1: Single Server         Stage 2: Growing Pains         Stage 3: Crisis
  ====================           ====================           ================

  App --> [DB]                   App --> [DB overloaded]         App --> [DB DOWN]
          10K rows                       10M rows                       100M rows
          100 QPS                        5,000 QPS                      50,000 QPS
          "Everything is fine"           "Queries getting slow"         "Site is down"

Common symptoms:

  • Query response times increasing (P99 > 1 second)
  • CPU/memory on DB server at 90%+
  • Connection pool exhausted
  • Disk I/O maxed out
  • Replication lag growing

Vertical vs Horizontal Scaling

  VERTICAL SCALING                       HORIZONTAL SCALING
  (Scale Up)                             (Scale Out)
  ==================                     ==================

  +--------+      +----------+           +----+  +----+  +----+
  |  4 CPU |      |  64 CPU  |           | DB |  | DB |  | DB |
  |  16 GB | ---> | 512 GB   |           | 1  |  | 2  |  | 3  |
  |  1 TB  |      |  10 TB   |           +----+  +----+  +----+
  +--------+      +----------+             |        |        |
  (buy a bigger machine)                  (add more machines)

  Factor       Vertical                                Horizontal
  -----------  --------------------------------------  ------------------------------
  How          Bigger CPU, RAM, SSD                    More machines
  Limit        Hardware ceiling (~128 cores, 4TB RAM)  Virtually unlimited
  Downtime     Usually requires restart                Add nodes with zero downtime
  Cost         Exponentially expensive at top          Linear cost growth
  Complexity   Simple (no code changes)                Complex (sharding, routing)
  Data model   No changes                              May need to partition data
  Best for     SQL databases (PostgreSQL, MySQL)       NoSQL (Cassandra, DynamoDB)

Rule of thumb: Scale vertically first (it is simpler). Switch to horizontal when you hit hardware limits or need fault tolerance across zones.


Read Replicas

Most applications are read-heavy (90%+ reads, <10% writes). Read replicas distribute read traffic across multiple copies of the database.

  BEFORE: Single DB handles everything
  =====================================
  
  App Server ---- READ + WRITE ----> [Primary DB]
  App Server ---- READ + WRITE ----> [Primary DB]
  App Server ---- READ + WRITE ----> [Primary DB]
                                     (bottleneck)

  AFTER: Primary handles writes, replicas handle reads
  =====================================================

                          WRITE
  App Server ---- WRITE ----> [Primary DB]
                                   |
                          Replication (async)
                          /        |        \
                         v         v         v
                    [Replica 1] [Replica 2] [Replica 3]
                         ^         ^         ^
  App Server ---- READ --|---------|---------|
  App Server ---- READ --|---------|---------|
  App Server ---- READ --|---------|---------|

How It Works

  Step                What Happens
  ------------------  --------------------------------------------------------
  1. Write arrives    Goes to primary (master) only
  2. WAL shipping     Primary writes to its Write-Ahead Log (WAL)
  3. Replication      WAL entries sent to replicas (async or sync)
  4. Replica applies  Replica replays WAL entries to update its data
  5. Read routed      Application reads from replicas (round-robin or nearest)
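The read-routing step can be sketched in a few lines of Python. The connection names here are placeholders for real connection objects, and a production router would also handle replica health checks:

```python
import itertools

class ReplicaRouter:
    """Send writes to the primary; spread reads round-robin over replicas.
    The connection names are stand-ins for real connection objects."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def for_write(self):
        return self.primary            # all writes hit the primary

    def for_read(self):
        return next(self._replicas)    # rotate through the replicas

router = ReplicaRouter("primary-db", ["replica-1", "replica-2", "replica-3"])
reads = [router.for_read() for _ in range(4)]
# reads == ["replica-1", "replica-2", "replica-3", "replica-1"]
```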

Replication Types

  Type              How It Works                                       Tradeoff
  ----------------  -------------------------------------------------  ------------------------------------------------------
  Asynchronous      Primary does not wait for replicas to confirm      Fast writes, but replicas may be slightly behind (lag)
  Synchronous       Primary waits for at least one replica to confirm  No lag, but slower writes
  Semi-synchronous  Primary waits for one replica, others are async    Balance of speed and safety

Replication Lag

  Timeline:
  =========
  
  t=0    Primary: INSERT user "Alice"
  t=0    Primary: committed, returns success to client
  
  t=50ms Replica 1: receives and applies the INSERT
  t=80ms Replica 2: receives and applies the INSERT
  
  PROBLEM: At t=20ms, a read to Replica 1 returns "Alice not found"
  
  This is "replication lag" — typically 10-100ms, but can spike to seconds
  under heavy write load.

Mitigations for replication lag:

  1. Read-your-writes consistency — After a write, route that user's reads to primary for a few seconds
  2. Monotonic reads — Pin a user to a specific replica for a session
  3. Synchronous replication — Eliminates lag but reduces write throughput
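Mitigation 1 (read-your-writes) can be sketched as a router that remembers each user's last write. The 5-second window and the `primary`/`replica` labels are assumptions; in practice you tune the window to observed replication lag:

```python
import time

STICKY_SECONDS = 5.0  # assumed window; tune to observed replication lag

class LagAwareRouter:
    """Read-your-writes: after a user writes, serve that user's reads
    from the primary until replicas have had time to catch up."""

    def __init__(self):
        self._last_write = {}  # user_id -> timestamp of last write

    def record_write(self, user_id, now=None):
        self._last_write[user_id] = time.time() if now is None else now

    def choose(self, user_id, now=None):
        now = time.time() if now is None else now
        if now - self._last_write.get(user_id, float("-inf")) < STICKY_SECONDS:
            return "primary"   # their write may not be on replicas yet
        return "replica"       # safe to read a possibly-lagging copy

router = LagAwareRouter()
router.record_write("alice", now=100.0)
router.choose("alice", now=102.0)  # "primary": inside the sticky window
router.choose("alice", now=110.0)  # "replica": replicas have caught up
```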

Master-Slave Replication

Single-Master (Most Common)

  +------------------+
  |   PRIMARY        |
  |   (Master)       |  <-- All WRITES go here
  |                  |
  +--------+---------+
           |
     Replication Stream
     /         |         \
    v          v          v
  +--------+ +--------+ +--------+
  |Replica | |Replica | |Replica |
  |  (R1)  | |  (R2)  | |  (R3)  |  <-- All READS distributed here
  +--------+ +--------+ +--------+

Multi-Master (Less Common)

  +----------+         +----------+
  |  Master  | <-----> |  Master  |
  |  (US)    |  sync   |  (EU)    |
  +----+-----+         +----+-----+
       |                     |
    +--+--+              +---+---+
    |R1|R2|              |R3 |R4 |
    +--+--+              +---+---+

  Topology                Writes         Conflicts                   Complexity  Use Case
  ----------------------  -------------  --------------------------  ----------  -------------------------------
  Single-master           One node only  None                        Low         Most applications
  Multi-master            Any node       Possible (must resolve)     High        Multi-region low-latency writes
  Masterless (Cassandra)  Any node       Possible (last-write-wins)  Medium      High availability

Failover

  NORMAL OPERATION                    AFTER PRIMARY FAILURE
  ================                    =====================

  [Primary] <-- writes               [Primary X] (failed)
      |                                    |
  [R1] [R2] [R3]                     [R1 -> NEW Primary] [R2] [R3]
                                           ^
                                       promoted!
                                       
  Steps:
  1. Health check detects primary is down
  2. Election: choose replica with least lag
  3. Promote replica to primary
  4. Redirect writes to new primary
  5. Reconfigure other replicas to follow new primary
  
  Risk: Some committed writes on old primary may be lost
  if they were not replicated before failure.
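Step 2 of the failover sequence (electing the least-lagged replica) reduces to taking a minimum over lag measurements. The lag values below are purely illustrative:

```python
def elect_new_primary(replica_lags):
    """Pick the replica with the least replication lag; it holds the
    most of the failed primary's writes. Lag values are in seconds."""
    if not replica_lags:
        raise RuntimeError("no replicas available to promote")
    return min(replica_lags, key=replica_lags.get)

lags = {"R1": 0.12, "R2": 0.03, "R3": 2.50}
elect_new_primary(lags)  # "R2" loses the fewest writes when promoted
```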

Sharding Strategies

Sharding (horizontal partitioning) splits data across multiple database instances.

  BEFORE SHARDING                  AFTER SHARDING
  ================                 ================

  +------------------+            +--------+  +--------+  +--------+
  | ALL DATA         |            | Users  |  | Users  |  | Users  |
  | (100M users)     |            | A-H    |  | I-P    |  | Q-Z    |
  | Single DB        |            | Shard 1|  | Shard 2|  | Shard 3|
  +------------------+            +--------+  +--------+  +--------+

1. Range-Based Sharding

  Shard Key: user_id (integer)
  
  Shard 1: user_id 1       - 1,000,000
  Shard 2: user_id 1,000,001 - 2,000,000
  Shard 3: user_id 2,000,001 - 3,000,000
  
  +--Shard 1--+  +--Shard 2--+  +--Shard 3--+
  | 1 - 1M    |  | 1M - 2M   |  | 2M - 3M   |
  +-----+-----+  +-----+-----+  +-----+-----+
        |               |               |
  Routing: IF user_id <= 1M -> Shard 1
           IF user_id <= 2M -> Shard 2
           ELSE -> Shard 3

Pros: Simple routing logic, easy range queries. Cons: Hotspots (new users all go to latest shard), uneven distribution.
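The routing logic above maps naturally onto a binary search over the range boundaries. A sketch using the example shard ranges (shard names are placeholders):

```python
import bisect

# Inclusive upper bound of each shard's user_id range (from the example)
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARDS = ["Shard 1", "Shard 2", "Shard 3"]

def route_by_range(user_id):
    # bisect_left returns the first index whose bound is >= user_id,
    # so an id equal to a bound stays on that shard (inclusive ranges)
    i = bisect.bisect_left(UPPER_BOUNDS, user_id)
    if i == len(SHARDS):
        raise ValueError(f"user_id {user_id} is beyond all shard ranges")
    return SHARDS[i]

route_by_range(500)        # "Shard 1"
route_by_range(1_000_001)  # "Shard 2"
```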

2. Hash-Based Sharding

  Shard Key: user_id
  Routing: shard_number = hash(user_id) % num_shards
  
  hash("alice") % 3 = 0  --> Shard 0
  hash("bob")   % 3 = 2  --> Shard 2
  hash("charlie") % 3 = 1 --> Shard 1
  
  +--Shard 0--+  +--Shard 1--+  +--Shard 2--+
  | alice     |  | charlie   |  | bob       |
  | dave      |  | frank     |  | eve       |
  +-----------+  +-----------+  +-----------+
  
  Distribution: roughly even (depends on hash function quality)

Pros: Even data distribution, no hotspots. Cons: Range queries require hitting all shards, resharding is expensive (consistent hashing helps).
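A sketch of the routing formula. One caveat: Python's built-in `hash()` is randomized per process for strings, which would send the same key to different shards after a restart, so a stable hash (MD5 here, but any stable hash works) is used instead:

```python
import hashlib

NUM_SHARDS = 3

def route_by_hash(key):
    # Stable hash: the same key always maps to the same shard,
    # in every process and on every run.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

shard = route_by_hash("alice")  # deterministic across processes
```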

3. Geographic Sharding

  +-----------------+     +-----------------+     +-----------------+
  |  US Shard       |     |  EU Shard       |     |  APAC Shard     |
  |  (us-east-1)    |     |  (eu-west-1)    |     |  (ap-southeast) |
  |                 |     |                 |     |                 |
  |  US users       |     |  EU users       |     |  APAC users     |
  |  US orders      |     |  EU orders      |     |  APAC orders    |
  +-----------------+     +-----------------+     +-----------------+
         ^                       ^                       ^
         |                       |                       |
    US traffic              EU traffic              APAC traffic

Pros: Data locality (low latency), GDPR compliance (EU data stays in EU). Cons: Cross-region queries are expensive, and traveling users either get served from their distant home shard or need their data migrated.

4. Consistent Hashing (Advanced)

Solves the resharding problem. When you add/remove a shard, only ~1/N of data moves.

  Hash Ring (0 to 2^32)
  ======================
  
          Shard A (position 1000)
            /
           /
  --------+--------+--------+--------+------>
  0      1000     3000     6000     9000   2^32
           |        |        |
         Shard A  Shard B  Shard C
  
  Key "alice" hashes to 2500 --> next shard clockwise --> Shard B
  Key "bob" hashes to 7800   --> next shard clockwise --> Shard A (wraps)
  
  ADD Shard D at position 4500:
  - Only keys between 3000-4500 move from Shard C to Shard D
  - Everything else stays put!
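The ring above can be sketched in a few dozen lines. This toy version omits the virtual nodes that production rings add (many positions per shard) for smoother balance:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring without virtual nodes."""

    def __init__(self):
        self._positions = []   # sorted hash positions on the ring
        self._owner = {}       # position -> shard name

    @staticmethod
    def _hash(key):
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")  # 0 .. 2^32 - 1

    def add_shard(self, name):
        pos = self._hash(name)
        bisect.insort(self._positions, pos)
        self._owner[pos] = name

    def shard_for(self, key):
        pos = self._hash(key)
        i = bisect.bisect_right(self._positions, pos)
        if i == len(self._positions):
            i = 0  # wrap around past the top of the ring
        return self._owner[self._positions[i]]

ring = ConsistentHashRing()
for name in ("shard-A", "shard-B", "shard-C"):
    ring.add_shard(name)

keys = [f"user-{n}" for n in range(100)]
before = {k: ring.shard_for(k) for k in keys}

ring.add_shard("shard-D")  # adding a shard moves only its arc of keys
moved = [k for k in keys if ring.shard_for(k) != before[k]]
# every moved key now belongs to shard-D; all other keys stayed put
```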

Sharding Strategy Comparison

  Strategy         Distribution       Range Queries             Resharding Cost                  Complexity
  ---------------  -----------------  ------------------------  -------------------------------  ----------
  Range            Uneven (hotspots)  Efficient (single shard)  Medium                           Low
  Hash             Even               Expensive (all shards)    High (consistent hashing helps)  Medium
  Geographic       By region          Within region: good       Medium                           Medium
  Directory-based  Configurable       Depends on mapping        Low (just update directory)      High

Shard Key Selection

The shard key determines which shard a piece of data lives on. A bad shard key ruins your entire architecture.

Good Shard Key Criteria

  Criterion               Why
  ----------------------  -------------------------------------------------------
  High cardinality        Many distinct values = even distribution
  Even distribution       No hotspots (one shard getting all traffic)
  Matches query patterns  Most queries should target a single shard
  Immutable               Changing the shard key means moving data between shards

Shard Key Examples

  System              Good Shard Key   Bad Shard Key  Why Bad
  ------------------  ---------------  -------------  -----------------------------------------
  E-commerce orders   user_id          order_date     All today's orders hit one shard
  Chat messages       conversation_id  sender_id      Querying a conversation needs all shards
  IoT sensor data     device_id        sensor_type    Few types = few shards used
  Social media posts  user_id          country        US shard is 10x larger than others
  Analytics events    event_id (hash)  event_type     "page_view" = 80% of data on one shard

Composite Shard Keys

Sometimes a single field is not enough. Combine fields:

  DynamoDB Single-Table Design:
  
  Partition Key         Sort Key           Data
  ==================    ================   ====================
  USER#alice            PROFILE            {name, email, bio}
  USER#alice            ORDER#2025-01-15   {total: 99.99}
  USER#alice            ORDER#2025-01-20   {total: 45.50}
  PRODUCT#laptop-1      DETAILS            {name, price, desc}
  PRODUCT#laptop-1      REVIEW#user-bob    {rating: 5, text: ...}
  
  The partition key distributes data across shards.
  The sort key enables efficient range queries within a partition.
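Building such composite keys is plain string construction. A sketch following the example layout; the helper name is made up, and ISO dates are used because they sort correctly as strings:

```python
def user_order_key(user_id, order_date):
    """Hypothetical helper: the partition key groups all of a user's
    items on one partition; the sortable sort key (ISO date) lets a
    single-partition query fetch orders in date order."""
    return {"PK": f"USER#{user_id}", "SK": f"ORDER#{order_date}"}

user_order_key("alice", "2025-01-15")
# {"PK": "USER#alice", "SK": "ORDER#2025-01-15"}
```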

Denormalization for Performance

Normalized data (3NF) minimizes redundancy but requires joins. At scale, joins across shards are expensive. Denormalization trades storage for speed.

When to Denormalize

  NORMALIZED (3NF)                    DENORMALIZED
  ==================                  ==================

  POSTS table:                        POSTS table:
  | id | user_id | text |            | id | user_id | user_name | user_avatar | text |
  |    |         |      |            |    |         |           |             |      |
  
  + USERS table:                      No JOIN needed to display post with
  | id | name | avatar |             user info. Faster reads.
  
  Requires JOIN to display            But: if user changes name, must
  post with user info.                update ALL their posts.

Denormalization Patterns

  Pattern            How                                   When
  -----------------  ------------------------------------  ----------------------------------------------------
  Embed (NoSQL)      Store related data in same document   Data accessed together, rarely changes independently
  Duplicate          Copy fields into another table        Hot read path needs data from multiple tables
  Pre-compute        Store aggregated values               Counts, sums, averages accessed frequently
  Materialized view  Database maintains a pre-joined view  Complex queries needed frequently
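The pre-compute pattern amounts to folding each new value into a stored aggregate at write time. A sketch where in-memory dicts stand in for the products and reviews tables:

```python
# In-memory dicts stand in for the products and reviews tables.
products = {
    101: {"name": "Wireless Headphones", "review_count": 0, "avg_rating": 0.0}
}

def add_review(product_id, rating):
    """Pre-compute pattern: fold the new rating into the stored
    aggregate on the write path, so the read path never has to
    scan the reviews table."""
    p = products[product_id]
    total = p["avg_rating"] * p["review_count"] + rating
    p["review_count"] += 1
    p["avg_rating"] = total / p["review_count"]

add_review(101, 5)
add_review(101, 4)
# products[101] now holds avg_rating 4.5 and review_count 2
```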

Example: E-Commerce Product Page

  NORMALIZED (requires 4 queries):
  1. SELECT * FROM products WHERE id = 101
  2. SELECT * FROM categories WHERE id = products.category_id
  3. SELECT AVG(rating) FROM reviews WHERE product_id = 101
  4. SELECT * FROM reviews WHERE product_id = 101 ORDER BY date DESC LIMIT 5

  DENORMALIZED (1 query or 1 document read):
  {
    "id": 101,
    "name": "Wireless Headphones",
    "price": 79.99,
    "category_name": "Electronics",          // duplicated from categories
    "avg_rating": 4.5,                       // pre-computed
    "review_count": 234,                     // pre-computed
    "recent_reviews": [                      // embedded
      { "user": "Alice", "rating": 5, "text": "Great sound!" },
      { "user": "Bob", "rating": 4, "text": "Good value" }
    ]
  }

Tradeoffs

  Factor       Normalized                     Denormalized
  -----------  -----------------------------  -----------------------------------
  Read speed   Slower (joins)                 Faster (single read)
  Write speed  Faster (update once)           Slower (update many copies)
  Storage      Less                           More
  Consistency  Easy (single source of truth)  Hard (multiple copies to sync)
  Flexibility  High (ad-hoc queries)          Low (designed for specific queries)

Connection Pooling

Opening a database connection is expensive (TCP handshake, authentication, SSL). Connection pooling reuses connections.

  WITHOUT POOLING                      WITH POOLING
  ================                     ================

  Request 1 --> Open Conn --> Query --> Close Conn    Request 1 --+
  Request 2 --> Open Conn --> Query --> Close Conn                |
  Request 3 --> Open Conn --> Query --> Close Conn    Request 2 --+--> [Pool: 10 conns] --> DB
  ...                                                             |
  (each open = ~5-20ms overhead)                      Request 3 --+
                                                      
                                                      (connections reused, near-zero overhead)
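A toy pool built on a thread-safe queue shows the core mechanics. Real poolers such as PgBouncer or HikariCP add health checks, idle timeouts, and connection recycling; `make_conn` here is any zero-argument factory and the string connections are stand-ins:

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Toy pool: open all connections up front, then check them
    in and out instead of opening and closing per request."""

    def __init__(self, make_conn, size=10, checkout_timeout=5.0):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(make_conn())   # pay the open cost once
        self._timeout = checkout_timeout

    @contextmanager
    def connection(self):
        conn = self._idle.get(timeout=self._timeout)  # block if exhausted
        try:
            yield conn
        finally:
            self._idle.put(conn)          # return to the pool, never close

ids = iter(range(100))
pool = ConnectionPool(lambda: f"conn-{next(ids)}", size=3)
with pool.connection() as conn:
    pass  # run queries on conn; it is recycled on exit
```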

Pool Configuration

  +------------------------------------------+
  |           Connection Pool                |
  |                                          |
  |  min_connections: 5    (always ready)    |
  |  max_connections: 20   (upper limit)     |
  |  idle_timeout: 300s    (close unused)    |
  |  connection_timeout: 5s (wait for conn)  |
  |  max_lifetime: 1800s  (recycle conns)    |
  |                                          |
  |  Active: [C1] [C2] [C3] [C4] [C5]      |
  |  Idle:   [C6] [C7]                      |
  |  Available: 13 more                      |
  +------------------------------------------+

Pool Sizing Formula

  connections = ((core_count * 2) + effective_spindle_count)

  Example:
  - 4-core CPU, SSD (no spindles): connections = (4 * 2) + 1 = 9
  - 8-core CPU, SSD: connections = (8 * 2) + 1 = 17

  Rule: A smaller pool (10-20) with a queue usually
  outperforms a larger pool (100+) because of:
  - Less context switching
  - Less memory per connection (~10MB in PostgreSQL)
  - Less lock contention

Popular Connection Poolers

  Tool       Database       Feature
  ---------  -------------  ----------------------------------------
  PgBouncer  PostgreSQL     Lightweight, transaction/session pooling
  pgpool-II  PostgreSQL     Pooling + load balancing + replication
  ProxySQL   MySQL          Query routing, caching, failover
  HikariCP   Java (any DB)  Fastest JVM connection pool

Indexing Strategies

An index is a data structure that speeds up reads at the cost of slower writes and extra storage.

How Indexes Work

  WITHOUT INDEX (Full Table Scan)         WITH INDEX (B-Tree Lookup)
  ================================        ============================

  SELECT * FROM users                     SELECT * FROM users
  WHERE email = 'alice@mail.com'          WHERE email = 'alice@mail.com'

  Scan: Row 1 -> No                            B-Tree Index on email
        Row 2 -> No                            =======================
        Row 3 -> No                                    [M]
        ...                                           / | \
        Row 999,999 -> No                          [D] [P] [T]
        Row 1,000,000 -> YES!                     / \   |   \
                                               [A] [G] [N] [W]
  Time: O(N) = scan all rows                     |
  1,000,000 row reads                          [alice@]
                                               
                                              Time: O(log N) = ~20 reads
                                              for 1,000,000 rows
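The O(N)-vs-O(log N) gap can be checked directly on sorted data; binary search via `bisect` plays the role of the B-tree descent here, and the email format is made up for the demo:

```python
import bisect
import math

N = 1_000_000
# Zero-padded ids keep the list sorted lexicographically.
emails = [f"user{i:07d}@mail.com" for i in range(N)]
target = emails[-1]  # worst case for a full scan

scan_comparisons = N  # a linear scan may touch every row

# Binary search on sorted data is the essence of a B-tree lookup
i = bisect.bisect_left(emails, target)
found = emails[i] == target
tree_comparisons = math.ceil(math.log2(N))  # about 20 for a million rows
```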

Index Types

  Index Type                      How It Works             Best For
  ------------------------------  -----------------------  ---------------------------------------------
  B-Tree (default)                Balanced tree, sorted    Equality and range queries (=, <, >, BETWEEN)
  Hash                            Hash table lookup        Exact equality only (=), very fast
  GIN (Generalized Inverted)      Inverted index           Full-text search, JSONB, arrays
  GiST (Generalized Search Tree)  Tree for spatial data    Geospatial queries, range types
  BRIN (Block Range)              Min/max per block range  Large, naturally ordered tables (time-series)

Index Best Practices

-- 1. Index columns used in WHERE clauses
CREATE INDEX idx_users_email ON users(email);

-- 2. Composite index for multi-column queries
--    (Column order matters! Leftmost prefix rule)
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- This index supports:
--   WHERE user_id = 42                    (YES - uses index)
--   WHERE user_id = 42 AND status = 'pending'  (YES - uses full index)
--   WHERE status = 'pending'              (NO - cannot skip first column)

-- 3. Covering index (includes all queried columns)
CREATE INDEX idx_orders_covering ON orders(user_id, status) INCLUDE (total, created_at);
-- Query can be answered entirely from the index, no table lookup needed

-- 4. Partial index (index a subset of rows)
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Smaller index, only indexes pending orders

-- 5. Unique index (also enforces constraint)
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

When NOT to Index

  Situation                             Why
  ------------------------------------  -------------------------------------------
  Small tables (<1000 rows)             Full scan is faster than index lookup
  Columns rarely queried in WHERE       Index cost without benefit
  Columns with low cardinality          gender (M/F) - index does not help much
  Write-heavy tables with many indexes  Each INSERT/UPDATE must update all indexes
  Wide columns (long text)              Index becomes too large

Index Monitoring

-- PostgreSQL: Find unused indexes
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- PostgreSQL: Find slow queries needing indexes
-- (requires the pg_stat_statements extension; on versions before
--  PostgreSQL 13 the columns are mean_time / total_time)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- PostgreSQL: Check index size
SELECT pg_size_pretty(pg_indexes_size('users'));

The Write Cost of Indexes

  INSERT into table with 0 indexes:  1 write
  INSERT into table with 1 index:    2 writes (table + index)
  INSERT into table with 5 indexes:  6 writes (table + 5 indexes)
  INSERT into table with 10 indexes: 11 writes
  
  Rule: Every index slows writes.
  Target: 5-10 indexes per table is reasonable.
  Alarm: 20+ indexes means something is wrong.

Key Takeaways

  1. Scale vertically first — A bigger machine with proper indexing handles more than most people expect. Do not shard prematurely.
  2. Read replicas are the first horizontal scaling step — Since most apps are read-heavy, replicas give you 3-5x capacity with minimal complexity.
  3. Replication lag is a real problem — Design your app to handle it (read-your-writes, sticky sessions, or synchronous replication for critical reads).
  4. Shard key selection is the most important sharding decision — A bad key creates hotspots. Choose based on query patterns, high cardinality, and even distribution.
  5. Hash sharding distributes evenly; range sharding enables range queries — Pick based on your access patterns.
  6. Denormalize on the read path, normalize on the write path — Duplicate data to speed up reads, but maintain a source of truth.
  7. Connection pools should be small — 10-20 connections per application server usually outperforms 100+.
  8. Every index speeds reads and slows writes — Be strategic. Index what you query, monitor what you do not use, and remove dead indexes.

Explain-It Challenge

Scenario: You run an e-commerce platform. Currently:

  • 50 million products in a single PostgreSQL database
  • 10,000 queries per second (80% reads, 20% writes)
  • Average query time has gone from 10ms to 500ms over 6 months
  • The database server CPU is at 85%

Walk through your scaling plan step by step:

  1. What quick wins would you try first (no architecture changes)?
  2. When would you add read replicas?
  3. At what point would you consider sharding?
  4. What shard key would you choose and why?
  5. How would you handle cross-shard queries (e.g., "search all products")?

Next -> 9.8.e — Data Consistency