Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.5 — Scaling Concepts

6.5.a -- Vertical vs Horizontal Scaling

In one sentence: Vertical scaling means buying a bigger machine; horizontal scaling means buying more machines -- and the choice between them shapes your architecture, budget, and failure tolerance at every growth stage.

Navigation: <- 6.5 Overview | 6.5.b -- Load Balancers ->


1. What Is Scaling?

Scaling is the process of increasing a system's capacity to handle more load -- more requests, more data, more users. When your Express API starts returning 503 errors because CPU is pegged at 100%, you need to scale.

There are exactly two directions you can scale:

SCALING

   Scale UP (Vertical)          Scale OUT (Horizontal)
   ┌────────────────┐           ┌──────┐ ┌──────┐ ┌──────┐
   │                │           │ Box  │ │ Box  │ │ Box  │
   │   BIGGER BOX   │           │  1   │ │  2   │ │  3   │
   │                │           └──────┘ └──────┘ └──────┘
   │  More CPU      │               │        │        │
   │  More RAM      │           ┌───┴────────┴────────┴───┐
   │  More Disk     │           │      Load Balancer       │
   │                │           └──────────────────────────┘
   └────────────────┘

2. Vertical Scaling (Scale Up)

Vertical scaling means upgrading the hardware of a single machine: more CPU cores, more RAM, faster disks, better network cards.

How it works in practice

Phase 1: t3.small    (2 vCPU,   2 GB RAM)  — $15/month
Phase 2: t3.xlarge   (4 vCPU,  16 GB RAM)  — $120/month
Phase 3: m5.4xlarge  (16 vCPU, 64 GB RAM)  — $560/month
Phase 4: x1.16xlarge (64 vCPU, 976 GB RAM) — $6,700/month
Phase 5: ???         — There is no Phase 5. You hit the ceiling.

Advantages

  1. Simple -- no code changes needed. Your Express app that runs on a small box runs identically on a bigger box.
  2. No distributed systems complexity -- one server means no network partitions, no eventual consistency, no data synchronisation.
  3. Strong consistency -- one database on one machine means ACID transactions just work.
  4. Easy debugging -- one server means one set of logs, one process to monitor.

Disadvantages

  1. Hard ceiling -- the largest EC2 instance (u-24tb1.metal: 448 vCPU, 24 TB RAM) costs $218/hour. After that, there is nothing to buy.
  2. Downtime during upgrades -- changing instance types typically requires stopping the machine (AWS does support some live resizing, but it is not instant).
  3. Single point of failure -- one machine dies, everything dies. No redundancy.
  4. Non-linear cost -- doubling capacity more than doubles cost at the high end.

The cost curve problem

COST vs CAPACITY (Vertical Scaling)

Cost ($)
  │
  │                                          x  ← Diminishing returns
  │                                     x
  │                                x
  │                          x
  │                    x
  │              x
  │         x
  │     x
  │  x
  │x
  └──────────────────────────────────────────── Capacity
        Cost grows EXPONENTIALLY
        while capacity grows LINEARLY

A machine with 2x the CPU/RAM typically costs 2.5x-3x more. At the high end, this ratio gets worse. This is why vertical scaling is a short-term strategy for most growing systems.
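The ratio is easy to see by dividing the phase table's prices by core count. A quick sketch (prices are the illustrative figures from this section, not live AWS rates):

```javascript
// Cost per vCPU for the instance sizes in the phase list above.
const phases = [
  { name: 't3.small',    vcpu: 2,  monthly: 15 },
  { name: 't3.xlarge',   vcpu: 4,  monthly: 120 },
  { name: 'm5.4xlarge',  vcpu: 16, monthly: 560 },
  { name: 'x1.16xlarge', vcpu: 64, monthly: 6700 },
];

for (const p of phases) {
  console.log(`${p.name.padEnd(12)} $${(p.monthly / p.vcpu).toFixed(2)} per vCPU/month`);
}
// Each step roughly quadruples capacity, yet the price per unit of
// capacity keeps climbing — the diminishing returns in the chart.
```

Run it and the per-vCPU cost climbs from $7.50 on the smallest box to over $100 on the largest: you pay more for each additional unit of capacity, not less.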


3. Horizontal Scaling (Scale Out)

Horizontal scaling means adding more machines (instances, containers, pods) running the same application, distributing traffic across them with a load balancer.

How it works in practice

Phase 1: 1x t3.medium  (2 vCPU, 4 GB) — $30/month
Phase 2: 2x t3.medium                   — $60/month
Phase 3: 5x t3.medium                   — $150/month
Phase 4: 20x t3.medium                  — $600/month
Phase 5: 100x t3.medium                 — $3,000/month
Phase 6: 500x t3.medium                 — $15,000/month
...no ceiling except your AWS bill...

Advantages

  1. Near-linear cost scaling -- 2x capacity costs ~2x money (plus a small load balancer overhead).
  2. No hard ceiling -- you can keep adding machines as long as your architecture supports it.
  3. Fault tolerance -- if one server in a pool of 20 dies, the remaining 19 continue serving traffic. Users may not even notice.
  4. Rolling deployments -- update servers one at a time with zero downtime.
  5. Geographic distribution -- place instances in multiple AWS regions for lower latency worldwide.

Disadvantages

  1. Requires stateless application design -- if your server stores session data in memory, horizontal scaling breaks (covered in 6.5.c).
  2. Operational complexity -- you now manage a load balancer, health checks, auto-scaling rules, deployment coordination.
  3. Data synchronisation -- if instances write to the same database, you need connection pooling, read replicas, or sharding.
  4. Network overhead -- inter-service communication adds latency compared to in-process calls.
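Disadvantage 1 is the one that bites first, so it is worth a concrete illustration. A minimal sketch of the anti-pattern (hypothetical session code, not from any real app):

```javascript
// Anti-pattern: state kept in process memory. Behind a load balancer,
// each instance has its OWN copy of this Map, so a session created on
// instance 1 simply does not exist on instances 2 and 3.
const sessions = new Map();

function login(sessionId, userId) {
  sessions.set(sessionId, { userId, createdAt: Date.now() });
}

function isLoggedIn(sessionId) {
  return sessions.has(sessionId);
}

login('abc123', 42);
console.log(isLoggedIn('abc123')); // true — same process

// Simulate the user's next request landing on a different instance:
const otherInstanceSessions = new Map(); // fresh process, empty memory
console.log(otherInstanceSessions.has('abc123')); // false — "logged out"
```

The usual fix is externalising the session store (e.g. Redis, as in the startup example later in this section) -- covered in 6.5.c.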

The cost curve advantage

COST vs CAPACITY (Horizontal Scaling)

Cost ($)
  │
  │                                              x  ← Linear
  │                                         x
  │                                    x
  │                               x
  │                          x
  │                     x
  │                x
  │           x
  │      x
  │ x
  └──────────────────────────────────────────── Capacity
        Cost grows LINEARLY
        with capacity

4. Comparison Table

Factor                  | Vertical (Scale Up)                | Horizontal (Scale Out)
------------------------+------------------------------------+------------------------------------------
What changes            | Machine size                       | Number of machines
Cost curve              | Exponential (diminishing returns)  | Linear (predictable)
Hard ceiling            | Yes (largest machine available)    | No practical ceiling
Code changes needed     | None                               | Must be stateless
Downtime to scale       | Usually yes (reboot/resize)        | No (add instances behind LB)
Failure tolerance       | None (single point of failure)     | High (N-1 survive)
Operational complexity  | Low                                | Higher (LB, health checks, auto-scaling)
Data consistency        | Strong (single DB)                 | Requires distributed patterns
Best for                | Small apps, databases, quick wins  | Web servers, APIs, microservices
Typical ceiling         | ~$10K/month per machine            | Limited only by budget and architecture

5. When to Use Each

Use vertical scaling when:

  • You are starting out -- do not over-engineer. A single $50/month server can handle thousands of requests per second.
  • Your database is the bottleneck -- databases are hard to horizontally scale. A bigger RDS instance is often the right first move.
  • You need strong consistency -- financial transactions, inventory counts, anything where eventual consistency is unacceptable.
  • You are prototyping -- move fast, scale up if needed, refactor for horizontal scaling later.

Use horizontal scaling when:

  • Traffic is growing predictably -- you know you need 10x capacity in 6 months.
  • You need high availability -- zero tolerance for downtime.
  • Traffic is spiky -- auto-scaling can add instances during peaks and remove them at 3 AM.
  • Your application is stateless (or you can make it stateless) -- APIs, web servers, workers.
  • You are running microservices -- each service scales independently based on its own load.

The real-world answer: both

Most production systems use both strategies together:

Step 1: Start on a single t3.medium ($30/month)
Step 2: Scale up to t3.xlarge ($120/month) when traffic grows
Step 3: Hit the "vertical gets expensive" point
Step 4: Refactor for statelessness
Step 5: Scale out to 3x t3.medium ($90/month) — cheaper AND more resilient
Step 6: Add auto-scaling: 2-20 instances based on CPU
Step 7: Scale the database vertically (bigger RDS) while the app scales horizontally

6. Real-World Examples

Netflix

  • Application tier: Thousands of horizontally-scaled container instances behind load balancers in multiple AWS regions.
  • Database tier: Vertically-scaled Cassandra nodes (big machines) with horizontal replication across regions.
  • Lesson: The application scales out; the database scales up AND out.

Shopify

  • During flash sales: Auto-scales horizontally from a baseline of instances to thousands of instances within minutes.
  • Database: MySQL with read replicas (horizontal for reads) and vertical scaling for the primary (big machine for writes).

Early-stage startup

  • Day 1: Single $20/month DigitalOcean droplet. Express API + MongoDB on the same machine.
  • Month 6: Separate the database to managed MongoDB Atlas (vertical scaling for the DB).
  • Year 1: 3 API instances behind an Nginx load balancer. Redis for sessions.
  • Year 2: Kubernetes with auto-scaling. 5-50 pods depending on traffic.

7. Database Scaling

Databases are the hardest part of scaling because they hold state. You cannot just "add more database servers" the way you add more API servers.

Vertical scaling (the default for databases)

// Most startups start here — and it works for a long time
// AWS RDS instance sizes for PostgreSQL:

// db.t3.medium:    2 vCPU,   4 GB — $60/month    — handles ~500 connections
// db.r5.xlarge:    4 vCPU,  32 GB — $350/month   — handles ~2,000 connections
// db.r5.4xlarge:  16 vCPU, 128 GB — $1,400/month — handles ~5,000 connections
// db.r5.24xlarge: 96 vCPU, 768 GB — $8,400/month — handles ~10,000 connections

Read replicas (horizontal scaling for reads)

                 ┌──────────────┐
    Writes ────> │   Primary    │
                 │   (Master)   │
                 └──────┬───────┘
                        │ Replication
              ┌─────────┼─────────┐
              ▼         ▼         ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │ Replica 1│ │ Replica 2│ │ Replica 3│
        └──────────┘ └──────────┘ └──────────┘
              ▲         ▲         ▲
              └─────────┼─────────┘
                  Reads distributed
                  across replicas

// Using read replicas in Node.js with Sequelize
const { Sequelize } = require('sequelize');

const sequelize = new Sequelize('database', 'user', 'password', {
  replication: {
    read: [
      { host: 'replica-1.db.example.com' },
      { host: 'replica-2.db.example.com' },
      { host: 'replica-3.db.example.com' },
    ],
    write: { host: 'primary.db.example.com' },
  },
  pool: {
    max: 20,
    idle: 30000,
  },
});

// Sequelize automatically routes:
// - SELECT queries to read replicas (round-robin)
// - INSERT/UPDATE/DELETE to the primary
const users = await User.findAll(); // → goes to a replica
await User.create({ name: 'Alice' }); // → goes to primary

Important caveat: Read replicas have replication lag -- a write to the primary may take 10-100ms to appear on a replica. Design your application to tolerate this.
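One common way to tolerate the lag is read-your-own-writes: force the immediate follow-up read to the primary. A sketch building on the Sequelize setup above (Sequelize exposes a per-query `useMaster` option that routes a read to the write host when replication is configured):

```javascript
// Read-your-own-writes: after a write, read the row back from the
// primary instead of a possibly-lagging replica. Assumes the `User`
// model and the replicated `sequelize` instance configured above.
async function createAndReadBack(name) {
  const user = await User.create({ name });              // → primary
  // A replica may lag 10-100ms, so fetch the fresh row from primary:
  return User.findByPk(user.id, { useMaster: true });
}
```

Reads that can tolerate slightly stale data (feeds, dashboards, search) should still go to replicas -- reserve `useMaster` for the handful of queries where staleness is user-visible.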

Sharding (horizontal scaling for writes)

Sharding splits data across multiple database instances based on a shard key:

User ID 1-1000     → Shard A (Database 1)
User ID 1001-2000  → Shard B (Database 2)
User ID 2001-3000  → Shard C (Database 3)

OR hash-based:
User ID % 3 == 0   → Shard A
User ID % 3 == 1   → Shard B
User ID % 3 == 2   → Shard C

Sharding is complex -- cross-shard queries are expensive, re-sharding is painful, and you lose the ability to do simple JOINs across shards. Do not shard until you absolutely have to. Most applications never reach this point.


8. Scaling Node.js Specifically

Node.js executes JavaScript on a single thread, so one Node.js process cannot use more than one CPU core for application code (I/O is offloaded to a thread pool, but that does not help CPU-bound work). This is a vertical scaling problem that has horizontal solutions.

The cluster module (built-in horizontal scaling)

// cluster-server.js
const cluster = require('cluster');
const http = require('http');
const os = require('os');

const numCPUs = os.cpus().length;

if (cluster.isPrimary) {
  console.log(`Primary process ${process.pid} starting ${numCPUs} workers`);

  // Fork one worker per CPU core
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  // Replace crashed workers
  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died (${signal || code}). Restarting...`);
    cluster.fork();
  });
} else {
  // Each worker runs its own Express server
  const express = require('express');
  const app = express();

  app.get('/api/health', (req, res) => {
    res.json({
      pid: process.pid,
      uptime: process.uptime(),
      memory: process.memoryUsage(),
    });
  });

  app.get('/api/heavy', (req, res) => {
    // CPU-intensive work is distributed across workers
    let sum = 0;
    for (let i = 0; i < 1e7; i++) sum += Math.sqrt(i);
    res.json({ result: sum, handledBy: process.pid });
  });

  app.listen(3000, () => {
    console.log(`Worker ${process.pid} listening on port 3000`);
  });
}

Output on a 4-core machine:
Primary process 1234 starting 4 workers
Worker 1235 listening on port 3000
Worker 1236 listening on port 3000
Worker 1237 listening on port 3000
Worker 1238 listening on port 3000

Each request is handled by a different worker (round-robin by default on Linux).

PM2 (production process manager)

# Start with cluster mode — PM2 manages the workers for you
pm2 start app.js -i max          # Fork one worker per CPU core
pm2 start app.js -i 4            # Fork exactly 4 workers

# Zero-downtime reload
pm2 reload app.js                # Restarts workers one by one

# Monitor
pm2 monit                        # Real-time CPU/memory per worker
pm2 list                         # Status of all processes

# Ecosystem file (ecosystem.config.js)
module.exports = {
  apps: [{
    name: 'api',
    script: './app.js',
    instances: 'max',            // One per CPU
    exec_mode: 'cluster',        // Use cluster mode
    max_memory_restart: '500M',  // Restart if worker exceeds 500MB
    env: {
      NODE_ENV: 'production',
      PORT: 3000,
    },
  }],
};

Container orchestration (Kubernetes / ECS)

For true horizontal scaling across machines, you containerise your Node.js app and let an orchestrator manage replicas:

# kubernetes deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                        # 3 pods (horizontal scaling)
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"            # Vertical: request 0.25 CPU per pod
              memory: "256Mi"        # Vertical: request 256MB per pod
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # Scale out when avg CPU > 70%

9. Cost Analysis at Different Scales

Monthly Requests | Vertical Strategy | Vertical Cost | Horizontal Strategy                 | Horizontal Cost
-----------------+-------------------+---------------+-------------------------------------+----------------
100K             | 1x t3.small       | $15           | 1x t3.small                         | $15
1M               | 1x t3.xlarge      | $120          | 2x t3.medium                        | $60
10M              | 1x m5.4xlarge     | $560          | 5x t3.medium                        | $150
50M              | 1x r5.12xlarge    | $3,600        | 15x t3.medium + ALB                 | $480
100M             | 1x u-6tb1.metal   | $10,000+      | 30x t3.medium + ALB                 | $930
500M             | Not possible      | --            | 150x t3.medium + ALB + auto-scaling | $4,650

Note: These are approximate numbers to illustrate the trend. Actual costs depend on region, reserved instances, spot pricing, and workload characteristics. The load balancer (ALB) adds ~$20/month + data processing costs.


10. Auto-Scaling: The Best of Both Worlds

Auto-scaling automatically adjusts the number of instances based on metrics:

Traffic pattern over 24 hours:

Instances
  │
  │          ┌──────┐
  │        ┌─┘      └─┐         ┌──────┐
  │     ┌──┘          └──┐   ┌──┘      └──┐
  │  ┌──┘                └───┘            └──┐
  │──┘                                       └──
  │  2    5    10   10    5     8    10    3   2
  └─────────────────────────────────────────────── Time
  12am  6am  9am  12pm  3pm  6pm  9pm  11pm  12am

  Minimum: 2 instances (always running — handles baseline)
  Maximum: 20 instances (caps spending)
  Scale-out trigger: CPU > 70% for 2 minutes
  Scale-in trigger:  CPU < 30% for 5 minutes (slower — avoid flapping)

// AWS SDK — creating an auto-scaling policy (simplified)
const AWS = require('aws-sdk');
const autoscaling = new AWS.AutoScaling();

const params = {
  AutoScalingGroupName: 'api-asg',
  PolicyName: 'scale-out-on-cpu',
  PolicyType: 'TargetTrackingScaling',
  TargetTrackingConfiguration: {
    PredefinedMetricSpecification: {
      PredefinedMetricType: 'ASGAverageCPUUtilization',
    },
    TargetValue: 70.0,    // Maintain ~70% CPU across all instances
    ScaleInCooldown: 300,  // Wait 5 min before removing instances
    ScaleOutCooldown: 60,  // Wait 1 min before adding more
  },
};

// (call from inside an async function — top-level await is not
// available in CommonJS modules)
await autoscaling.putScalingPolicy(params).promise();

11. Key Takeaways

  1. Vertical scaling is simple but limited -- no code changes, but exponential cost and a hard ceiling.
  2. Horizontal scaling is powerful but requires stateless design -- near-linear cost, no ceiling, but more operational complexity.
  3. Start vertical, go horizontal -- do not over-engineer day one. Scale up until it gets expensive, then refactor for scale-out.
  4. Databases are the hard part -- read replicas for read-heavy workloads, sharding only when absolutely necessary.
  5. Node.js needs cluster mode -- a single process wastes multi-core machines. Use the cluster module, PM2, or container orchestration.
  6. Auto-scaling is the real answer -- combine horizontal scaling with metrics-driven automation to handle variable load cost-effectively.

Explain-It Challenge

  1. Your CEO asks: "Why can't we just buy a bigger server?" Explain the limits of vertical scaling using cost and availability arguments.
  2. A junior developer asks why adding more Express instances does not automatically double throughput. What is likely missing?
  3. Your database is the bottleneck, not the API servers. Should you scale the database horizontally or vertically? Walk through the decision.

Navigation: <- 6.5 Overview | 6.5.b -- Load Balancers ->