Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.4 — Distributed Observability and Scaling
6.4.a — Horizontal Scaling with ECS
In one sentence: Horizontal scaling adds or removes ECS task instances behind a load balancer to match traffic demand, enabled by stateless service architecture and driven by auto-scaling policies that react to CPU, memory, or custom metrics.
Navigation: ← 6.4 Overview · 6.4.b — Health Checks and Availability →
1. Horizontal vs Vertical Scaling
Before diving into ECS specifics, understand the two fundamental scaling approaches:
VERTICAL SCALING (Scale Up)           HORIZONTAL SCALING (Scale Out)

┌─────────────────────┐         ┌───────┐  ┌───────┐  ┌───────┐
│                     │         │Task 1 │  │Task 2 │  │Task 3 │
│   Single Big Task   │         │ 1 CPU │  │ 1 CPU │  │ 1 CPU │
│   4 CPU / 8 GB      │         │ 2 GB  │  │ 2 GB  │  │ 2 GB  │
│                     │         └───┬───┘  └───┬───┘  └───┬───┘
└─────────────────────┘             └──────────┼──────────┘
                                               │
Limits: Hardware ceiling                Load Balancer
        Single point of failure
                                 Limits: Nearly unlimited
                                         Fault tolerant
Horizontal scaling wins for microservices because:
- No hardware ceiling — add as many tasks as you need
- Fault tolerance — one task crashing does not bring down the service
- Cost efficiency — scale down during off-peak hours to save money
- Zero-downtime deployments — rolling updates replace tasks one at a time
2. ECS Scaling Fundamentals: Desired Count vs Running Count
Every ECS service has two critical numbers:
| Property | Meaning |
|---|---|
| Desired count | How many tasks the service wants running |
| Running count | How many tasks are actually running right now |
| Pending count | Tasks that are starting up but not yet ready |
ECS Service: api-service
Desired count: 4
Running count: 3
Pending count: 1 ← One task is starting up
ECS will keep trying until Running count = Desired count.
If a task crashes, ECS automatically launches a replacement.
Setting desired count manually
# Set a fixed number of tasks
aws ecs update-service \
--cluster production-cluster \
--service api-service \
--desired-count 4
# Check the current state
aws ecs describe-services \
--cluster production-cluster \
--services api-service \
--query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}'
Output:
{
"desired": 4,
"running": 3,
"pending": 1
}
When does ECS adjust running count?
| Event | What Happens |
|---|---|
| Task crashes (OOM, unhandled exception) | ECS launches a replacement immediately |
| Task fails health check | ECS stops the task and launches a new one |
| You increase desired count | ECS launches additional tasks |
| You decrease desired count | ECS drains and stops excess tasks |
| Auto-scaling policy triggers | ECS adjusts desired count automatically |
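The table above boils down to a reconciliation loop: ECS continually compares the desired count against live tasks and acts on the difference. A toy sketch of that control loop (illustrative names only, not the actual ECS API):

```javascript
// Toy ECS-style reconciliation: keep running count at desired count.
// tasks: array of { id, status } where status is 'RUNNING' | 'PENDING' | 'STOPPED'.
function reconcile({ desired, tasks }) {
  const alive = tasks.filter((t) => t.status === 'RUNNING' || t.status === 'PENDING');
  const diff = desired - alive.length;
  if (diff > 0) return { action: 'launch', count: diff }; // crashed tasks → replacements
  if (diff < 0) return { action: 'drain', count: -diff }; // excess tasks → drain and stop
  return { action: 'none', count: 0 };                    // steady state
}
```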
3. Auto-Scaling Policies
Auto-scaling adjusts the desired count automatically based on metrics. ECS uses Application Auto Scaling, which supports three approaches: target tracking, step scaling, and scheduled scaling.
3a. Target Tracking Scaling (Recommended)
Target tracking is the simplest and most effective policy. You specify a target value for a metric, and AWS automatically adjusts scaling to maintain that target.
# Step 1: Register the scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 20
# Step 2: Create a target tracking policy for CPU
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-target-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}'
How it works:
Target: 70% CPU utilization
Current: 4 tasks, average 85% CPU
AWS calculates: 4 tasks * 85% / 70% target = 4.86 → rounds up to 5 tasks
Action: Increase desired count from 4 to 5
After scale-out: 5 tasks, average 68% CPU → within target → no action
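The arithmetic above generalizes: target tracking proposes a new desired count of the current count scaled by (current metric / target), rounded up. A sketch of that calculation — the clamp to min/max mirrors the 2/20 capacity registered above:

```javascript
// Target tracking estimate: newDesired = ceil(current * metric / target),
// clamped to the registered min/max capacity of the scalable target.
function targetTrackingDesired(currentTasks, currentMetric, targetValue, min = 2, max = 20) {
  const raw = Math.ceil((currentTasks * currentMetric) / targetValue);
  return Math.min(max, Math.max(min, raw));
}

// 4 tasks at 85% CPU with a 70% target → ceil(4.86) = 5 tasks
```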
3b. Step Scaling
Step scaling lets you define different scaling increments based on how far a metric deviates from a threshold:
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-step-scaling \
--policy-type StepScaling \
--step-scaling-policy-configuration '{
"AdjustmentType": "ChangeInCapacity",
"StepAdjustments": [
{
"MetricIntervalLowerBound": 0,
"MetricIntervalUpperBound": 15,
"ScalingAdjustment": 1
},
{
"MetricIntervalLowerBound": 15,
"MetricIntervalUpperBound": 30,
"ScalingAdjustment": 3
},
{
"MetricIntervalLowerBound": 30,
"ScalingAdjustment": 5
}
],
"Cooldown": 60
}'
Step scaling in plain English:
Alarm threshold: 70% CPU
CPU at 75% (0-15 above threshold) → add 1 task
CPU at 90% (15-30 above threshold) → add 3 tasks
CPU at 100% (30+ above threshold) → add 5 tasks
This gives you proportional response to traffic spikes.
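The threshold-to-adjustment mapping can be sketched as a lookup over the same intervals as the `StepAdjustments` JSON above (field names here are simplified, not the AWS schema):

```javascript
// Step scaling: map how far the metric sits above the alarm threshold
// to a capacity adjustment. An undefined upper bound means "and above".
function stepAdjustment(metric, threshold, steps) {
  const delta = metric - threshold;
  if (delta < 0) return 0; // alarm not breached → no scaling
  for (const { lower, upper, adjustment } of steps) {
    if (delta >= lower && (upper === undefined || delta < upper)) return adjustment;
  }
  return 0;
}

// Same intervals as the policy above (alarm threshold: 70% CPU)
const steps = [
  { lower: 0, upper: 15, adjustment: 1 },
  { lower: 15, upper: 30, adjustment: 3 },
  { lower: 30, adjustment: 5 },
];
```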
3c. Scheduled Scaling
For predictable traffic patterns (e.g., marketing campaigns, business hours):
# Scale up at 8 AM UTC on weekdays
aws application-autoscaling put-scheduled-action \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--scheduled-action-name morning-scale-up \
--schedule "cron(0 8 ? * MON-FRI *)" \
--scalable-target-action MinCapacity=6,MaxCapacity=20
# Scale down at 10 PM UTC on weekdays
aws application-autoscaling put-scheduled-action \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--scheduled-action-name evening-scale-down \
--schedule "cron(0 22 ? * MON-FRI *)" \
--scalable-target-action MinCapacity=2,MaxCapacity=6
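The net effect of these two actions is a weekday capacity window: a higher floor during business hours, a lower one overnight and on weekends (which inherit Friday night's setting). A sketch of the rule they encode (dayOfWeek: 1 = Monday … 7 = Sunday; hours in UTC):

```javascript
// Capacity window implied by the two scheduled actions above.
function capacityWindow(dayOfWeek, hourUtc) {
  const weekday = dayOfWeek >= 1 && dayOfWeek <= 5;
  if (weekday && hourUtc >= 8 && hourUtc < 22) {
    return { min: 6, max: 20 }; // morning-scale-up in effect
  }
  return { min: 2, max: 6 };    // evening-scale-down / weekends
}
```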
4. CPU-Based vs Memory-Based Scaling
CPU-based scaling (most common)
Best for compute-bound services — API servers, data processing, AI inference:
# Target tracking on CPU
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
}
}'
Memory-based scaling
Best for memory-intensive services — caching layers, image processing, large data transforms:
# Target tracking on memory
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name memory-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 75.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
}
}'
Request-based scaling (ALB request count per target)
Best for request-driven services where you want to cap requests per instance:
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production-cluster/api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name request-count-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 1000.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ALBRequestCountPerTarget",
"ResourceLabel": "app/my-alb/1234567890/targetgroup/my-tg/0987654321"
}
}'
5. Scaling Out vs Scaling In
Scaling out (adding tasks) and scaling in (removing tasks) behave very differently and require different strategies:
Scaling out (adding capacity)
Trigger: Metric exceeds threshold
Action: Increase desired count
Speed: 30-90 seconds for new task to be ready
Risk: LOW — adding capacity is safe
Best practices:
- Short cooldown (60 seconds) — respond quickly to spikes
- Be aggressive — it's better to over-provision temporarily
- Monitor pending count for capacity issues
Scaling in (removing capacity)
Trigger: Metric drops below threshold
Action: Decrease desired count, drain connections from chosen task
Speed: Task drain period + deregistration delay = 60-120 seconds
Risk: MODERATE — removing capacity can drop active connections
Best practices:
- Long cooldown (300+ seconds) — avoid flapping
- Gradual — scale in 1 task at a time
- Connection draining — allow in-flight requests to complete
- Min capacity — NEVER scale to 0 (unless serverless)
Cooldown periods explained
┌────────────┐     Scale-out     ┌────────────┐
│  3 tasks   │ ──── cooldown ──→ │  4 tasks   │
│  CPU: 85%  │     (60 sec)      │  CPU: 68%  │
└────────────┘                   └────────────┘
                                       │
                               Wait 60 seconds
                              before next scale-out
                                   decision

┌────────────┐     Scale-in      ┌────────────┐
│  4 tasks   │ ──── cooldown ──→ │  3 tasks   │
│  CPU: 30%  │    (300 sec)      │  CPU: 40%  │
└────────────┘                   └────────────┘
                                       │
                               Wait 300 seconds
                              before next scale-in
                                   decision
WHY different cooldowns?
Scale-out: Respond to emergencies fast
Scale-in: Avoid removing capacity you'll need again in 2 minutes
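The asymmetric cooldowns can be enforced with a small gate: record the time of the last action in each direction and block repeats inside the window. A sketch with the 60 s / 300 s values used above (timestamps in seconds):

```javascript
// Cooldown gate: fast scale-out (60 s), slow scale-in (300 s).
// Returns a function that answers "is this scaling action allowed now?"
function makeCooldownGate(outCooldown = 60, inCooldown = 300) {
  let lastOut = -Infinity;
  let lastIn = -Infinity;
  return function allow(direction, now) {
    if (direction === 'out' && now - lastOut >= outCooldown) {
      lastOut = now;
      return true;
    }
    if (direction === 'in' && now - lastIn >= inCooldown) {
      lastIn = now;
      return true;
    }
    return false; // still inside the cooldown window
  };
}
```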
6. Stateless Service Architecture
Horizontal scaling only works if your services are stateless. A stateless service stores no client-specific data between requests.
Why statelessness enables scaling
STATEFUL (broken):
Request 1 → Task A stores session data in memory
Request 2 → Load balancer sends to Task B
Result: Task B has no session data → ERROR
STATELESS (works):
Request 1 → Task A reads session from Redis/JWT
Request 2 → Task B reads session from Redis/JWT
Result: Both tasks produce the same result → SUCCESS
What makes a service stateful (and how to fix it)
| Stateful Pattern | Stateless Fix |
|---|---|
| In-memory sessions | External session store (Redis, DynamoDB) |
| Local file uploads | S3 or shared file system (EFS) |
| In-memory caches | External cache (ElastiCache Redis) |
| WebSocket connections | Sticky sessions or pub/sub (Redis pub/sub) |
| Local database | Managed database (RDS, DocumentDB) |
| Application-level counters | Atomic counters in Redis/DynamoDB |
Express.js stateless session example
// BAD: In-memory session (stateful - breaks with multiple tasks)
const sessions = {};
app.post('/login', (req, res) => {
const sessionId = generateId();
sessions[sessionId] = { userId: req.body.userId, loginTime: Date.now() };
res.cookie('sessionId', sessionId);
res.json({ success: true });
});
app.get('/profile', (req, res) => {
const session = sessions[req.cookies.sessionId]; // FAILS if different task
if (!session) return res.status(401).json({ error: 'Not authenticated' });
res.json({ userId: session.userId });
});
// GOOD: JWT-based authentication (stateless - works with any number of tasks)
const jwt = require('jsonwebtoken');
const bcrypt = require('bcrypt');
app.post('/login', async (req, res) => {
const user = await db.users.findOne({ email: req.body.email });
if (!user || !await bcrypt.compare(req.body.password, user.passwordHash)) {
return res.status(401).json({ error: 'Invalid credentials' });
}
const token = jwt.sign(
{ userId: user._id, email: user.email },
process.env.JWT_SECRET,
{ expiresIn: '24h' }
);
res.json({ token });
});
app.get('/profile', authenticateJWT, async (req, res) => {
// req.user is decoded from the JWT — no server-side state needed
const user = await db.users.findById(req.user.userId);
res.json({ userId: user._id, email: user.email });
});
function authenticateJWT(req, res, next) {
const token = req.headers.authorization?.split(' ')[1];
if (!token) return res.status(401).json({ error: 'No token provided' });
try {
req.user = jwt.verify(token, process.env.JWT_SECRET);
next();
} catch (err) {
res.status(403).json({ error: 'Invalid token' });
}
}
// GOOD: External session store with Redis (stateless tasks, shared state)
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');
const redisClient = createClient({
url: process.env.REDIS_URL // ElastiCache endpoint
});
redisClient.connect().catch(console.error); // node-redis v4+ requires an explicit connect
app.use(session({
store: new RedisStore({ client: redisClient }),
secret: process.env.SESSION_SECRET,
resave: false,
saveUninitialized: false,
cookie: { secure: true, maxAge: 86400000 } // 24 hours
}));
7. Session Handling Without State
Three approaches for managing user sessions across horizontally scaled tasks:
Approach 1: JWT (JSON Web Tokens)
Client sends JWT in Authorization header
│
▼
Any task can verify the JWT independently
No shared state needed
No database lookup needed (claims are in the token)
Pros: Fastest, no external dependency
Cons: Can't revoke individual tokens (without a blocklist)
Best for: API authentication, microservices
Approach 2: External Session Store (Redis)
Client sends session cookie
│
▼
Any task looks up session in Redis
Redis is shared across all tasks
Pros: Can revoke sessions instantly; can store complex session data
Cons: Redis becomes a critical dependency (mitigate with replication), and every request adds a network round trip
Best for: Web applications, complex session data
Approach 3: Sticky Sessions (ALB)
ALB routes ALL requests from one client to the SAME task
│
▼
Session data stays in memory on that task
Pros: Simple, no external store needed
Cons: Uneven load distribution, task failure loses sessions
Best for: Legacy applications during migration
8. Warm-Up Time Considerations
New tasks take time to become fully productive. This warm-up period affects scaling decisions.
Task Lifecycle:
0s - ECS launches new task
5s - Container image pulled (if not cached)
10s - Container starts
15s - Application starts, opens DB connections
20s - Health check passes
30s - ALB registers task as healthy
45s - Task receives first request
60s - JIT optimizations kick in, caches warm
Total warm-up: 30-90 seconds before a new task handles traffic at full efficiency
Why warm-up matters for auto-scaling
Problem: Spike arrives → auto-scaling triggers → new tasks take 60s
During those 60s, existing tasks are overloaded
Solutions:
1. Predictive scaling — scale up BEFORE the spike (scheduled scaling)
2. Over-provision slightly — keep min capacity above bare minimum
3. Request queuing — buffer requests during scaling events
4. Scale-out cooldown — short (60s) so you can add tasks quickly
Node.js warm-up optimization
// server.js — pre-warm connections before accepting traffic
const express = require('express');
const mongoose = require('mongoose');
const Redis = require('ioredis');
const app = express();
let isReady = false;
async function warmUp() {
console.log('[WARM-UP] Starting pre-warm sequence...');
// 1. Connect to database
await mongoose.connect(process.env.MONGODB_URI);
console.log('[WARM-UP] MongoDB connected');
// 2. Connect to Redis
const redis = new Redis(process.env.REDIS_URL);
await redis.ping();
console.log('[WARM-UP] Redis connected');
// 3. Pre-load frequently accessed data
const config = await mongoose.connection.db.collection('config').findOne({ key: 'app-settings' });
app.locals.config = config;
console.log('[WARM-UP] Config cached');
// 4. Warm up any ML models or heavy initializations
// await loadModel();
isReady = true;
console.log('[WARM-UP] Service ready to accept traffic');
}
// Health check respects warm-up state
app.get('/health', (req, res) => {
if (!isReady) {
return res.status(503).json({ status: 'warming-up' });
}
res.json({ status: 'healthy' });
});
// Start server only after warm-up
warmUp()
.then(() => {
app.listen(process.env.PORT || 3000, () => {
console.log(`Server listening on port ${process.env.PORT || 3000}`);
});
})
.catch((err) => {
console.error('[WARM-UP] Failed:', err);
process.exit(1);
});
9. Complete Auto-Scaling Configuration Example
Here is a production-ready auto-scaling setup combining multiple strategies:
#!/bin/bash
# auto-scaling-setup.sh — Complete ECS auto-scaling configuration
CLUSTER="production-cluster"
SERVICE="api-service"
RESOURCE_ID="service/${CLUSTER}/${SERVICE}"
echo "=== Step 1: Register scalable target ==="
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id "$RESOURCE_ID" \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 20
echo "=== Step 2: CPU target tracking (primary) ==="
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id "$RESOURCE_ID" \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-target-70 \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300,
"DisableScaleIn": false
}'
echo "=== Step 3: Memory target tracking (secondary) ==="
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id "$RESOURCE_ID" \
--scalable-dimension ecs:service:DesiredCount \
--policy-name memory-target-75 \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 75.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}'
echo "=== Step 4: Scheduled scaling for business hours ==="
aws application-autoscaling put-scheduled-action \
--service-namespace ecs \
--resource-id "$RESOURCE_ID" \
--scalable-dimension ecs:service:DesiredCount \
--scheduled-action-name weekday-scale-up \
--schedule "cron(0 8 ? * MON-FRI *)" \
--scalable-target-action MinCapacity=4,MaxCapacity=20
aws application-autoscaling put-scheduled-action \
--service-namespace ecs \
--resource-id "$RESOURCE_ID" \
--scalable-dimension ecs:service:DesiredCount \
--scheduled-action-name weeknight-scale-down \
--schedule "cron(0 22 ? * MON-FRI *)" \
--scalable-target-action MinCapacity=2,MaxCapacity=8
echo "=== Step 5: Verify configuration ==="
aws application-autoscaling describe-scaling-policies \
--service-namespace ecs \
--resource-id "$RESOURCE_ID" \
--query 'ScalingPolicies[].{Name:PolicyName,Type:PolicyType}'
CloudFormation equivalent (Terraform's aws_appautoscaling_target and aws_appautoscaling_policy resources are analogous)
# CloudFormation snippet
AutoScalingTarget:
Type: AWS::ApplicationAutoScaling::ScalableTarget
Properties:
MaxCapacity: 20
MinCapacity: 2
ResourceId: !Sub service/${ECSCluster}/${ECSService.Name}
ScalableDimension: ecs:service:DesiredCount
ServiceNamespace: ecs
CPUScalingPolicy:
Type: AWS::ApplicationAutoScaling::ScalingPolicy
Properties:
PolicyName: cpu-target-tracking
PolicyType: TargetTrackingScaling
ScalingTargetId: !Ref AutoScalingTarget
TargetTrackingScalingPolicyConfiguration:
TargetValue: 70.0
PredefinedMetricSpecification:
PredefinedMetricType: ECSServiceAverageCPUUtilization
ScaleOutCooldown: 60
ScaleInCooldown: 300
10. Key Takeaways
- Horizontal scaling adds task instances behind a load balancer — prefer this over vertical scaling for microservices.
- Target tracking is the recommended auto-scaling policy — set a target (e.g., 70% CPU) and AWS handles the math.
- Stateless services are mandatory for horizontal scaling — externalize all state to databases, Redis, or JWTs.
- Scale-out fast, scale-in slow — use short cooldowns for adding capacity and long cooldowns for removing it.
- Warm-up time means new tasks are not immediately productive — account for this in your scaling strategy by maintaining a healthy minimum capacity.
Explain-It Challenge
- Your ECS service has 3 tasks at 90% CPU. Target tracking is set to 70%. How many tasks will AWS scale to? Walk through the math.
- A colleague wants to store user sessions in a global variable. Explain why this breaks when you have 4 ECS tasks behind a load balancer.
- Traffic spikes every Monday at 9 AM. The auto-scaling takes 60 seconds to add new tasks. How would you prevent the 60-second gap?
Navigation: ← 6.4 Overview · 6.4.b — Health Checks and Availability →