Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.5 — Scaling Concepts
6.5.a -- Vertical vs Horizontal Scaling
In one sentence: Vertical scaling means buying a bigger machine; horizontal scaling means buying more machines -- and the choice between them shapes your architecture, budget, and failure tolerance at every growth stage.
Navigation: <- 6.5 Overview | 6.5.b -- Load Balancers ->
1. What Is Scaling?
Scaling is the process of increasing a system's capacity to handle more load -- more requests, more data, more users. When your Express API starts returning 503 errors because CPU is pegged at 100%, you need to scale.
There are exactly two directions you can scale:
                         SCALING

  Scale UP (Vertical)          Scale OUT (Horizontal)

  ┌────────────────┐       ┌──────┐  ┌──────┐  ┌──────┐
  │                │       │ Box  │  │ Box  │  │ Box  │
  │   BIGGER BOX   │       │  1   │  │  2   │  │  3   │
  │                │       └──────┘  └──────┘  └──────┘
  │    More CPU    │           │         │         │
  │    More RAM    │       ┌───┴─────────┴─────────┴───┐
  │    More Disk   │       │       Load Balancer       │
  │                │       └───────────────────────────┘
  └────────────────┘
2. Vertical Scaling (Scale Up)
Vertical scaling means upgrading the hardware of a single machine: more CPU cores, more RAM, faster disks, better network cards.
How it works in practice
Phase 1: t3.small (2 vCPU, 2 GB RAM) — $15/month
Phase 2: t3.xlarge (4 vCPU, 16 GB RAM) — $120/month
Phase 3: m5.4xlarge (16 vCPU, 64 GB RAM) — $560/month
Phase 4: x1.16xlarge (64 vCPU, 976 GB RAM) — $6,700/month
Phase 5: ??? — There is no Phase 5. You hit the ceiling.
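Each phase jump above is just an instance-type change. Here is a minimal sketch of what that looks like with the AWS SDK for Node.js (v2), using a hypothetical instance ID -- note the stop/start cycle, which is exactly the downtime cost listed under the disadvantages below:

// Vertical scaling on EC2: stop, change instance type, start (sketch)
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2();
const InstanceIds = ['i-0123456789abcdef0']; // hypothetical instance ID

await ec2.stopInstances({ InstanceIds }).promise();
await ec2.waitFor('instanceStopped', { InstanceIds }).promise();
await ec2.modifyInstanceAttribute({
  InstanceId: InstanceIds[0],
  InstanceType: { Value: 'm5.4xlarge' }, // the actual "scale up"
}).promise();
await ec2.startInstances({ InstanceIds }).promise();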
Advantages
- Simple -- no code changes needed. Your Express app that runs on a small box runs identically on a bigger box.
- No distributed systems complexity -- one server means no network partitions, no eventual consistency, no data synchronisation.
- Strong consistency -- one database on one machine means ACID transactions just work.
- Easy debugging -- one server means one set of logs, one process to monitor.
Disadvantages
- Hard ceiling -- the largest EC2 instance (u-24tb1.metal: 448 vCPU, 24 TB RAM) costs $218/hour. After that, there is nothing to buy.
- Downtime during upgrades -- changing instance types typically requires stopping the machine (AWS does support some live resizing, but it is not instant).
- Single point of failure -- one machine dies, everything dies. No redundancy.
- Non-linear cost -- doubling capacity more than doubles cost at the high end.
The cost curve problem
COST vs CAPACITY (Vertical Scaling)

Cost ($)
│
│                                         x  ← Diminishing returns
│                                      x
│                                   x
│                                x
│                             x
│                          x
│                      x
│                 x
│          x
│x
└──────────────────────────────────────────── Capacity

Cost grows EXPONENTIALLY
while capacity grows LINEARLY
A machine with 2x the CPU/RAM typically costs 2.5x-3x more, and the ratio gets worse at the high end: in the phase list above, moving from m5.4xlarge to x1.16xlarge buys 4x the vCPUs for roughly 12x the price. This is why vertical scaling is a short-term strategy for most growing systems.
3. Horizontal Scaling (Scale Out)
Horizontal scaling means adding more machines (instances, containers, pods) running the same application, distributing traffic across them with a load balancer.
How it works in practice
Phase 1: 1x t3.medium (2 vCPU, 4 GB) — $30/month
Phase 2: 2x t3.medium — $60/month
Phase 3: 5x t3.medium — $150/month
Phase 4: 20x t3.medium — $600/month
Phase 5: 100x t3.medium — $3,000/month
Phase 6: 500x t3.medium — $15,000/month
...no ceiling except your AWS bill...
Advantages
- Near-linear cost scaling -- 2x capacity costs ~2x money (plus a small load balancer overhead).
- No hard ceiling -- you can keep adding machines as long as your architecture supports it.
- Fault tolerance -- if one of your 20 servers dies, the other 19 continue serving traffic. Users may not even notice.
- Rolling deployments -- update servers one at a time with zero downtime.
- Geographic distribution -- place instances in multiple AWS regions for lower latency worldwide.
Disadvantages
- Requires stateless application design -- if your server stores session data in memory, horizontal scaling breaks (covered in 6.5.c; see the sketch after this list).
- Operational complexity -- you now manage a load balancer, health checks, auto-scaling rules, deployment coordination.
- Data synchronisation -- if instances write to the same database, you need connection pooling, read replicas, or sharding.
- Network overhead -- inter-service communication adds latency compared to in-process calls.
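To make the statelessness point concrete, here is a minimal sketch (hypothetical endpoints, with ioredis assumed as the Redis client) of in-memory state that breaks behind a load balancer, and the shared-store fix:

// BROKEN when scaled out: the cart lives in THIS process's memory.
// The load balancer may route the user's next request to another
// instance, which has a different (empty) Map.
const express = require('express');
const app = express();
const carts = new Map();

app.post('/cart/:userId/items/:sku', (req, res) => {
  const items = carts.get(req.params.userId) || [];
  items.push(req.params.sku);
  carts.set(req.params.userId, items);
  res.json({ items });
});

// FIXED: keep the state in shared storage so every instance sees it
// (sketched with ioredis against a hypothetical Redis host).
const Redis = require('ioredis');
const redis = new Redis({ host: 'redis.example.com' });

app.post('/cart2/:userId/items/:sku', async (req, res) => {
  await redis.rpush(`cart:${req.params.userId}`, req.params.sku);
  const items = await redis.lrange(`cart:${req.params.userId}`, 0, -1);
  res.json({ items });
});

app.listen(3000);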
The cost curve advantage
COST vs CAPACITY (Horizontal Scaling)

Cost ($)
│
│                                      x  ← Linear
│                                  x
│                              x
│                          x
│                      x
│                  x
│              x
│          x
│      x
│  x
└──────────────────────────────────────────── Capacity

Cost grows LINEARLY
with capacity
4. Comparison Table
| Factor | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| What changes | Machine size | Number of machines |
| Cost curve | Exponential (diminishing returns) | Linear (predictable) |
| Hard ceiling | Yes (largest machine available) | No practical ceiling |
| Code changes needed | None | Must be stateless |
| Downtime to scale | Usually yes (reboot/resize) | No (add instances behind LB) |
| Failure tolerance | None (single point of failure) | High (N-1 survive) |
| Operational complexity | Low | Higher (LB, health checks, auto-scaling) |
| Data consistency | Strong (single DB) | Requires distributed patterns |
| Best for | Small apps, databases, quick wins | Web servers, APIs, microservices |
| Typical ceiling | ~$10K/month per machine | Limited only by budget and architecture |
5. When to Use Each
Use vertical scaling when:
- You are starting out -- do not over-engineer. A single $50/month server can handle thousands of requests per second.
- Your database is the bottleneck -- databases are hard to horizontally scale. A bigger RDS instance is often the right first move.
- You need strong consistency -- financial transactions, inventory counts, anything where eventual consistency is unacceptable.
- You are prototyping -- move fast, scale up if needed, refactor for horizontal scaling later.
Use horizontal scaling when:
- Traffic is growing predictably -- you know you need 10x capacity in 6 months.
- You need high availability -- zero tolerance for downtime.
- Traffic is spiky -- auto-scaling can add instances during peaks and remove them at 3 AM.
- Your application is stateless (or you can make it stateless) -- APIs, web servers, workers.
- You are running microservices -- each service scales independently based on its own load.
The real-world answer: both
Most production systems use both strategies together:
Step 1: Start on a single t3.medium ($30/month)
Step 2: Scale up to t3.xlarge ($120/month) when traffic grows
Step 3: Hit the "vertical gets expensive" point
Step 4: Refactor for statelessness
Step 5: Scale out to 3x t3.medium ($90/month) — cheaper AND more resilient
Step 6: Add auto-scaling: 2-20 instances based on CPU
Step 7: Scale the database vertically (bigger RDS) while the app scales horizontally
6. Real-World Examples
Netflix
- Application tier: Thousands of horizontally-scaled container instances behind load balancers in multiple AWS regions.
- Database tier: Vertically-scaled Cassandra nodes (big machines) with horizontal replication across regions.
- Lesson: The application scales out; the database scales up AND out.
Shopify
- During flash sales: Auto-scales horizontally from its baseline fleet to thousands of instances within minutes.
- Database: MySQL with read replicas (horizontal for reads) and vertical scaling for the primary (big machine for writes).
Early-stage startup
- Day 1: Single $20/month DigitalOcean droplet. Express API + MongoDB on the same machine.
- Month 6: Separate the database to managed MongoDB Atlas (vertical scaling for the DB).
- Year 1: 3 API instances behind an Nginx load balancer. Redis for sessions.
- Year 2: Kubernetes with auto-scaling. 5-50 pods depending on traffic.
7. Database Scaling
Databases are the hardest part of scaling because they hold state. You cannot just "add more database servers" the way you add more API servers.
Vertical scaling (the default for databases)
// Most startups start here — and it works for a long time
// AWS RDS instance sizes for PostgreSQL:
// db.t3.medium: 2 vCPU, 4 GB — $60/month — handles ~500 connections
// db.r5.xlarge: 4 vCPU, 32 GB — $350/month — handles ~2,000 connections
// db.r5.4xlarge: 16 vCPU, 128 GB — $1,400/month — handles ~5,000 connections
// db.r5.24xlarge: 96 vCPU, 768 GB — $8,400/month — handles ~10,000 connections
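Those connection counts matter because each horizontally-scaled app instance opens its own pool: total connections = instances × per-instance pool size. A quick node-postgres sketch (hypothetical host):

// Connection budgeting: 10 app instances × max 20 = 200 DB connections.
// Keep (instances × max) comfortably under the database's limit.
const { Pool } = require('pg');

const pool = new Pool({
  host: 'primary.db.example.com', // hypothetical host
  database: 'app',
  max: 20,                        // per-instance connection cap
  idleTimeoutMillis: 30000,       // release idle connections
});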
Read replicas (horizontal scaling for reads)
                   ┌──────────────┐
Writes ──────────> │   Primary    │
                   │   (Master)   │
                   └──────┬───────┘
                          │ Replication
             ┌────────────┼────────────┐
             ▼            ▼            ▼
       ┌──────────┐ ┌──────────┐ ┌──────────┐
       │ Replica 1│ │ Replica 2│ │ Replica 3│
       └──────────┘ └──────────┘ └──────────┘
             ▲            ▲            ▲
             └────────────┼────────────┘
                 Reads distributed
                  across replicas
// Using read replicas in Node.js with Sequelize
const { Sequelize, DataTypes } = require('sequelize');

const sequelize = new Sequelize('database', 'user', 'password', {
  dialect: 'postgres',
  replication: {
    read: [
      { host: 'replica-1.db.example.com' },
      { host: 'replica-2.db.example.com' },
      { host: 'replica-3.db.example.com' },
    ],
    write: { host: 'primary.db.example.com' },
  },
  pool: {
    max: 20,
    idle: 30000,
  },
});

const User = sequelize.define('User', { name: DataTypes.STRING });

// Sequelize automatically routes:
// - SELECT queries to read replicas (round-robin)
// - INSERT/UPDATE/DELETE to the primary
const users = await User.findAll();      // → goes to a replica
await User.create({ name: 'Alice' });    // → goes to the primary
Important caveat: Read replicas have replication lag -- a write to the primary may take 10-100ms to appear on a replica. Design your application to tolerate this.
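A common mitigation is read-your-own-writes: route a read that immediately follows a write to the primary. Sequelize exposes this per query via the useMaster option (continuing the sketch above):

// Read-your-own-writes: force the follow-up SELECT to the primary so
// the user sees their write even while replicas lag.
const alice = await User.create({ name: 'Alice' });               // → primary
const fresh = await User.findByPk(alice.id, { useMaster: true }); // → primary, not a replica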
Sharding (horizontal scaling for writes)
Sharding splits data across multiple database instances based on a shard key:
User ID 1-1000 → Shard A (Database 1)
User ID 1001-2000 → Shard B (Database 2)
User ID 2001-3000 → Shard C (Database 3)
OR hash-based:
User ID % 3 == 0 → Shard A
User ID % 3 == 1 → Shard B
User ID % 3 == 2 → Shard C
Sharding is complex -- cross-shard queries are expensive, re-sharding is painful, and you lose the ability to do simple JOINs across shards. Do not shard until you absolutely have to. Most applications never reach this point.
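To make the routing concrete, here is a minimal sketch of hash-based shard selection in application code, assuming three PostgreSQL databases on hypothetical hosts:

// Hash-based shard routing (sketch): the shard key (user ID) decides
// which database instance owns the row.
const { Pool } = require('pg');

const shards = [
  new Pool({ host: 'shard-a.db.example.com', database: 'app' }),
  new Pool({ host: 'shard-b.db.example.com', database: 'app' }),
  new Pool({ host: 'shard-c.db.example.com', database: 'app' }),
];

function shardFor(userId) {
  return shards[userId % shards.length]; // same user always maps to the same shard
}

async function getUser(userId) {
  const { rows } = await shardFor(userId).query(
    'SELECT * FROM users WHERE id = $1',
    [userId],
  );
  return rows[0];
}

Note what is lost: a query that spans users (a JOIN, a global count) now has to fan out to every shard and merge results in application code.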
8. Scaling Node.js Specifically
Node.js runs JavaScript on a single thread by default, so a single Node.js process cannot spread request handling across multiple CPU cores. This is a vertical scaling problem that has horizontal solutions.
The cluster module (built-in horizontal scaling)
// cluster-server.js
const cluster = require('cluster');
const os = require('os');
const numCPUs = os.cpus().length;
if (cluster.isPrimary) {
console.log(`Primary process ${process.pid} starting ${numCPUs} workers`);
// Fork one worker per CPU core
for (let i = 0; i < numCPUs; i++) {
cluster.fork();
}
// Replace crashed workers
cluster.on('exit', (worker, code, signal) => {
console.log(`Worker ${worker.process.pid} died (${signal || code}). Restarting...`);
cluster.fork();
});
} else {
// Each worker runs its own Express server
const express = require('express');
const app = express();
app.get('/api/health', (req, res) => {
res.json({
pid: process.pid,
uptime: process.uptime(),
memory: process.memoryUsage(),
});
});
app.get('/api/heavy', (req, res) => {
// CPU-intensive work is distributed across workers
let sum = 0;
for (let i = 0; i < 1e7; i++) sum += Math.sqrt(i);
res.json({ result: sum, handledBy: process.pid });
});
app.listen(3000, () => {
console.log(`Worker ${process.pid} listening on port 3000`);
});
}
Output on a 4-core machine:
Primary process 1234 starting 4 workers
Worker 1235 listening on port 3000
Worker 1236 listening on port 3000
Worker 1237 listening on port 3000
Worker 1238 listening on port 3000
Each request is handled by a different worker (round-robin by default on Linux).
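You can watch the distribution by hitting the health endpoint a few times (assuming the cluster server above is running):

# Each response should come back with a different worker pid
for i in 1 2 3 4; do curl -s localhost:3000/api/health; echo; done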
PM2 (production process manager)
# Start with cluster mode — PM2 manages the workers for you
pm2 start app.js -i max # Fork one worker per CPU core
pm2 start app.js -i 4 # Fork exactly 4 workers
# Zero-downtime reload (the process name defaults to the script name, "app")
pm2 reload app # Restarts workers one by one
# Monitor
pm2 monit # Real-time CPU/memory per worker
pm2 list # Status of all processes
# Ecosystem file (ecosystem.config.js)
module.exports = {
apps: [{
name: 'api',
script: './app.js',
instances: 'max', // One per CPU
exec_mode: 'cluster', // Use cluster mode
max_memory_restart: '500M', // Restart if worker exceeds 500MB
env: {
NODE_ENV: 'production',
PORT: 3000,
},
}],
};
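With the ecosystem file in place, PM2 drives everything from it:

# Start (or zero-downtime reload) all apps defined in the ecosystem file
pm2 start ecosystem.config.js
pm2 reload ecosystem.config.js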
Container orchestration (Kubernetes / ECS)
For true horizontal scaling across machines, you containerise your Node.js app and let an orchestrator manage replicas:
# kubernetes deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                   # 3 pods (horizontal scaling)
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"       # Vertical: request 0.25 CPU per pod
              memory: "256Mi"   # Vertical: request 256MB per pod
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale out when avg CPU > 70%
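Rolling it out and watching the autoscaler work (standard kubectl commands, assuming cluster access is configured):

# Create the Deployment and the HPA
kubectl apply -f deployment.yaml

# Watch replica counts react as average CPU crosses 70%
kubectl get hpa api-server-hpa --watch

# Manual horizontal scaling, if you ever need to override
kubectl scale deployment api-server --replicas=10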
9. Cost Analysis at Different Scales
| Monthly Requests | Vertical Strategy | Vertical Cost | Horizontal Strategy | Horizontal Cost |
|---|---|---|---|---|
| 100K | 1x t3.small | $15 | 1x t3.small | $15 |
| 1M | 1x t3.xlarge | $120 | 2x t3.medium | $60 |
| 10M | 1x m5.4xlarge | $560 | 5x t3.medium | $150 |
| 50M | 1x r5.12xlarge | $3,600 | 15x t3.medium + ALB | $480 |
| 100M | 1x u-6tb1.metal | $10,000+ | 30x t3.medium + ALB | $930 |
| 500M | Not possible | -- | 150x t3.medium + ALB + auto-scaling | $4,650 |
Note: These are approximate numbers to illustrate the trend. Actual costs depend on region, reserved instances, spot pricing, and workload characteristics. The load balancer (ALB) adds ~$20/month + data processing costs.
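The trend in the table falls out of simple arithmetic: measure what one instance can handle, size for peak rather than average, multiply. A back-of-the-envelope sketch (every number here is hypothetical -- real sizing comes from load testing):

// Back-of-the-envelope horizontal capacity planning (hypothetical numbers)
const RPS_PER_INSTANCE = 5;           // measured via load testing
const SECONDS_PER_MONTH = 30 * 24 * 3600;
const COST_PER_INSTANCE = 30;         // t3.medium, $/month

function planCapacity(monthlyRequests, peakToAvg = 4) {
  const avgRps = monthlyRequests / SECONDS_PER_MONTH;
  const peakRps = avgRps * peakToAvg; // size for peak load, not average
  const instances = Math.max(2, Math.ceil(peakRps / RPS_PER_INSTANCE)); // keep 2 for redundancy
  return { instances, monthlyCost: instances * COST_PER_INSTANCE };
}

console.log(planCapacity(100e6)); // → { instances: 31, monthlyCost: 930 } — near the 100M row above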
10. Auto-Scaling: The Best of Both Worlds
Auto-scaling automatically adjusts the number of instances based on metrics:
Traffic pattern over 24 hours:

Instances
│
│              ┌───────┐
│            ┌─┘       └─┐         ┌───┐
│         ┌──┘           └──┐   ┌──┘   └──┐
│      ┌──┘                 └───┘         └──┐
│──────┘                                     └─────
│  2     5     10     10      5    8   10   3    2
└─────────────────────────────────────────────────── Time
 12am   6am   9am    12pm    3pm  6pm  9pm 11pm 12am
Minimum: 2 instances (always running — handles baseline)
Maximum: 20 instances (caps spending)
Scale-out trigger: CPU > 70% for 2 minutes
Scale-in trigger: CPU < 30% for 5 minutes (slower — avoid flapping)
// AWS SDK — creating an auto-scaling policy (simplified)
const AWS = require('aws-sdk');
const autoscaling = new AWS.AutoScaling();

const params = {
  AutoScalingGroupName: 'api-asg',
  PolicyName: 'scale-out-on-cpu',
  PolicyType: 'TargetTrackingScaling',
  EstimatedInstanceWarmup: 60, // Seconds before a new instance's metrics count
  TargetTrackingConfiguration: {
    PredefinedMetricSpecification: {
      PredefinedMetricType: 'ASGAverageCPUUtilization',
    },
    TargetValue: 70.0, // Maintain ~70% CPU across all instances
  },
};

await autoscaling.putScalingPolicy(params).promise();
11. Key Takeaways
- Vertical scaling is simple but limited -- no code changes, but exponential cost and a hard ceiling.
- Horizontal scaling is powerful but requires stateless design -- near-linear cost, no ceiling, but more operational complexity.
- Start vertical, go horizontal -- do not over-engineer day one. Scale up until it gets expensive, then refactor for scale-out.
- Databases are the hard part -- read replicas for read-heavy workloads, sharding only when absolutely necessary.
- Node.js needs cluster mode -- a single process wastes multi-core machines. Use the cluster module, PM2, or container orchestration.
- Auto-scaling is the real answer -- combine horizontal scaling with metrics-driven automation to handle variable load cost-effectively.
Explain-It Challenge
- Your CEO asks: "Why can't we just buy a bigger server?" Explain the limits of vertical scaling using cost and availability arguments.
- A junior developer asks why adding more Express instances does not automatically double throughput. What is likely missing?
- Your database is the bottleneck, not the API servers. Should you scale the database horizontally or vertically? Walk through the decision.
Navigation: <- 6.5 Overview | 6.5.b -- Load Balancers ->