Episode 6 — Scaling Reliability Microservices Web3 / 6.4 — Distributed Observability and Scaling

6.4.a — Horizontal Scaling with ECS

In one sentence: Horizontal scaling adds or removes ECS task instances behind a load balancer to match traffic demand, enabled by stateless service architecture and driven by auto-scaling policies that react to CPU, memory, or custom metrics.

Navigation: ← 6.4 Overview · 6.4.b — Health Checks and Availability →


1. Horizontal vs Vertical Scaling

Before diving into ECS specifics, understand the two fundamental scaling approaches:

VERTICAL SCALING (Scale Up)             HORIZONTAL SCALING (Scale Out)
┌─────────────────────┐                 ┌───────┐ ┌───────┐ ┌───────┐
│                     │                 │Task 1 │ │Task 2 │ │Task 3 │
│   Single Big Task   │                 │ 1 CPU │ │ 1 CPU │ │ 1 CPU │
│   4 CPU / 8 GB      │                 │ 2 GB  │ │ 2 GB  │ │ 2 GB  │
│                     │                 └───────┘ └───────┘ └───────┘
└─────────────────────┘                     │         │         │
                                            └─────────┼─────────┘
Limits: Hardware ceiling                              │
Single point of failure                          Load Balancer
                                            Limits: Nearly unlimited
                                            Fault tolerant

Horizontal scaling wins for microservices because:

  1. No hardware ceiling — add as many tasks as you need
  2. Fault tolerance — one task crashing does not bring down the service
  3. Cost efficiency — scale down during off-peak hours to save money
  4. Zero-downtime deployments — rolling updates replace tasks one at a time

2. ECS Scaling Fundamentals: Desired Count vs Running Count

Every ECS service has three critical numbers:

Property        Meaning
Desired count   How many tasks the service wants running
Running count   How many tasks are actually running right now
Pending count   Tasks that are starting up but not yet ready
ECS Service: api-service
  Desired count:  4
  Running count:  3
  Pending count:  1     ← One task is starting up

ECS will keep trying until Running count = Desired count.
If a task crashes, ECS automatically launches a replacement.
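The reconciliation ECS performs can be sketched as a small pure function (the helper name is hypothetical, not part of the ECS API):

```javascript
// Sketch of the control loop ECS effectively runs for a service:
// compare desired vs running counts and compute the corrective action.
function reconcile({ desired, running }) {
  if (running < desired) {
    return { action: 'launch', count: desired - running };
  }
  if (running > desired) {
    return { action: 'drainAndStop', count: running - desired };
  }
  return { action: 'none', count: 0 };
}

// A crashed task drops the running count, so the next pass launches a replacement:
console.log(reconcile({ desired: 4, running: 3 })); // { action: 'launch', count: 1 }
```

The same loop handles both manual changes to desired count and auto-scaling adjustments — ECS only ever converges running toward desired.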

Setting desired count manually

# Set a fixed number of tasks
aws ecs update-service \
  --cluster production-cluster \
  --service api-service \
  --desired-count 4

# Check the current state
aws ecs describe-services \
  --cluster production-cluster \
  --services api-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}'

Output:

{
  "desired": 4,
  "running": 3,
  "pending": 1
}

When does ECS adjust running count?

Event                                      What Happens
Task crashes (OOM, unhandled exception)    ECS launches a replacement immediately
Task fails health check                    ECS stops the task and launches a new one
You increase desired count                 ECS launches additional tasks
You decrease desired count                 ECS drains and stops excess tasks
Auto-scaling policy triggers               ECS adjusts desired count automatically

3. Auto-Scaling Policies

Auto-scaling automatically adjusts the desired count based on metrics. ECS integrates with Application Auto Scaling, which supports three policy types:

3a. Target Tracking Scaling (Recommended)

Target tracking is the simplest and most effective policy. You specify a target value for a metric, and AWS automatically adjusts scaling to maintain that target.

# Step 1: Register the scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

# Step 2: Create a target tracking policy for CPU
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

How it works:

Target: 70% CPU utilization
Current: 4 tasks, average 85% CPU

AWS calculates: 4 tasks * 85% / 70% target = 4.86 → rounds up to 5 tasks
Action: Increase desired count from 4 to 5

After scale-out: 5 tasks, average 68% CPU → within target → no action
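The calculation above can be reproduced in a few lines — a sketch of the target-tracking arithmetic, with clamping to the registered min/max capacity added as an assumption:

```javascript
// Target tracking: required tasks = ceil(currentTasks * currentMetric / target),
// clamped to the min/max capacity registered on the scalable target.
function targetTrackingDesired(currentTasks, currentMetric, target, min, max) {
  const required = Math.ceil(currentTasks * (currentMetric / target));
  return Math.min(max, Math.max(min, required));
}

console.log(targetTrackingDesired(4, 85, 70, 2, 20)); // 5  (4 * 85/70 = 4.86 → 5)
```

Note the clamp: even a huge spike never pushes desired count past max capacity, which is your cost guardrail.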

3b. Step Scaling

Step scaling lets you define different scaling increments based on how far a metric deviates from a threshold. Unlike target tracking, a step policy does not create its own alarms — you attach it as the action of a CloudWatch alarm you create separately, and the bounds below are offsets from that alarm's threshold:

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-step-scaling \
  --policy-type StepScaling \
  --step-scaling-policy-configuration '{
    "AdjustmentType": "ChangeInCapacity",
    "StepAdjustments": [
      {
        "MetricIntervalLowerBound": 0,
        "MetricIntervalUpperBound": 15,
        "ScalingAdjustment": 1
      },
      {
        "MetricIntervalLowerBound": 15,
        "MetricIntervalUpperBound": 30,
        "ScalingAdjustment": 3
      },
      {
        "MetricIntervalLowerBound": 30,
        "ScalingAdjustment": 5
      }
    ],
    "Cooldown": 60
  }'

Step scaling in plain English:

Alarm threshold: 70% CPU

CPU at 75% (0-15 above threshold)  → add 1 task
CPU at 90% (15-30 above threshold) → add 3 tasks
CPU at 100% (30+ above threshold)  → add 5 tasks

This gives you proportional response to traffic spikes.
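The bands can be expressed directly in code — a sketch mirroring the StepAdjustments above, with the 70% alarm threshold carried over from the example as an assumption:

```javascript
// Step scaling: pick an adjustment by how far the metric sits above the
// alarm threshold. Bounds mirror the policy JSON: lower inclusive, upper exclusive.
const THRESHOLD = 70;
const STEPS = [
  { lower: 0,  upper: 15,       adjustment: 1 },
  { lower: 15, upper: 30,       adjustment: 3 },
  { lower: 30, upper: Infinity, adjustment: 5 },
];

function stepAdjustment(cpu) {
  const deviation = cpu - THRESHOLD;
  if (deviation < 0) return 0; // alarm not breached — no scale-out
  const step = STEPS.find(s => deviation >= s.lower && deviation < s.upper);
  return step ? step.adjustment : 0;
}

console.log(stepAdjustment(75));  // 1
console.log(stepAdjustment(90));  // 3
console.log(stepAdjustment(100)); // 5
```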

3c. Scheduled Scaling

For predictable traffic patterns (e.g., marketing campaigns, business hours):

# Scale up at 8 AM UTC on weekdays
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name morning-scale-up \
  --schedule "cron(0 8 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=6,MaxCapacity=20

# Scale down at 10 PM UTC on weekdays
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name evening-scale-down \
  --schedule "cron(0 22 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=2,MaxCapacity=6

4. CPU-Based vs Memory-Based Scaling

CPU-based scaling (most common)

Best for compute-bound services — API servers, data processing, AI inference:

# Target tracking on CPU
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }'

Memory-based scaling

Best for memory-intensive services — caching layers, image processing, large data transforms:

# Target tracking on memory
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name memory-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 75.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
    }
  }'

Custom metric scaling (ALB request count per target)

Best for request-driven services where you want to cap requests per instance:

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name request-count-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/my-alb/1234567890/targetgroup/my-tg/0987654321"
    }
  }'
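A quick way to sanity-check the target value is back-of-envelope sizing — a hypothetical helper, reusing the 1,000 requests-per-target figure from the policy above:

```javascript
// Sizing for ALBRequestCountPerTarget tracking:
// tasks needed = ceil(total request rate / target per task), clamped to min/max.
function tasksForRequestRate(totalRequests, targetPerTask, min, max) {
  const needed = Math.ceil(totalRequests / targetPerTask);
  return Math.min(max, Math.max(min, needed));
}

// 4,500 requests/min at a target of 1,000 per task → 5 tasks
console.log(tasksForRequestRate(4500, 1000, 2, 20)); // 5
```

This metric is often a better scaling signal than CPU for I/O-bound APIs, because request rate rises before CPU does.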

5. Scaling Out vs Scaling In

Scaling out (adding tasks) and scaling in (removing tasks) behave very differently and require different strategies:

Scaling out (adding capacity)

Trigger: Metric exceeds threshold
Action:  Increase desired count
Speed:   30-90 seconds for new task to be ready
Risk:    LOW — adding capacity is safe

Best practices:
  - Short cooldown (60 seconds) — respond quickly to spikes
  - Be aggressive — it's better to over-provision temporarily
  - Monitor pending count for capacity issues

Scaling in (removing capacity)

Trigger: Metric drops below threshold
Action:  Decrease desired count, drain connections from chosen task
Speed:   Task drain period + deregistration delay = 60-120 seconds
Risk:    MODERATE — removing capacity can drop active connections

Best practices:
  - Long cooldown (300+ seconds) — avoid flapping
  - Gradual — scale in 1 task at a time
  - Connection draining — allow in-flight requests to complete
  - Min capacity — NEVER scale to 0 (unless serverless)

Cooldown periods explained

┌────────────┐     Scale-out     ┌────────────┐
│ 3 tasks    │ ─── cooldown ──── │ 4 tasks    │
│ CPU: 85%   │     (60 sec)      │ CPU: 68%   │
└────────────┘                   └────────────┘
                                       │
                                  Wait 60 seconds
                                  before next scale-out
                                  decision

┌────────────┐     Scale-in      ┌────────────┐
│ 4 tasks    │ ─── cooldown ──── │ 3 tasks    │
│ CPU: 30%   │     (300 sec)     │ CPU: 40%   │
└────────────┘                   └────────────┘
                                       │
                                  Wait 300 seconds
                                  before next scale-in
                                  decision

WHY different cooldowns?
  Scale-out: Respond to emergencies fast
  Scale-in:  Avoid removing capacity you'll need again in 2 minutes
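The asymmetric cooldowns can be modeled as a small gate — a hypothetical helper, with times as plain seconds:

```javascript
// Cooldown gate: a scaling action is allowed only if enough time has passed
// since the last action in the same direction.
function makeCooldownGate({ scaleOutSec, scaleInSec }) {
  const last = { out: -Infinity, in: -Infinity };
  return function allow(direction, nowSec) {
    const wait = direction === 'out' ? scaleOutSec : scaleInSec;
    if (nowSec - last[direction] < wait) return false;
    last[direction] = nowSec;
    return true;
  };
}

const allow = makeCooldownGate({ scaleOutSec: 60, scaleInSec: 300 });
console.log(allow('out', 0));  // true  — first scale-out goes through
console.log(allow('out', 30)); // false — still inside the 60s cooldown
console.log(allow('out', 90)); // true  — cooldown elapsed
```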

6. Stateless Service Architecture

Horizontal scaling only works if your services are stateless. A stateless service stores no client-specific data between requests.

Why statelessness enables scaling

STATEFUL (broken):
  Request 1 → Task A stores session data in memory
  Request 2 → Load balancer sends to Task B
  Result:     Task B has no session data → ERROR

STATELESS (works):
  Request 1 → Task A reads session from Redis/JWT
  Request 2 → Task B reads session from Redis/JWT
  Result:     Both tasks produce the same result → SUCCESS

What makes a service stateful (and how to fix it)

Stateful Pattern             Stateless Fix
In-memory sessions           External session store (Redis, DynamoDB)
Local file uploads           S3 or shared file system (EFS)
In-memory caches             External cache (ElastiCache Redis)
WebSocket connections        Sticky sessions or pub/sub (Redis pub/sub)
Local database               Managed database (RDS, DocumentDB)
Application-level counters   Atomic counters in Redis/DynamoDB

Express.js stateless session example

// BAD: In-memory session (stateful - breaks with multiple tasks)
const sessions = {};

app.post('/login', (req, res) => {
  const sessionId = generateId();
  sessions[sessionId] = { userId: req.body.userId, loginTime: Date.now() };
  res.cookie('sessionId', sessionId);
  res.json({ success: true });
});

app.get('/profile', (req, res) => {
  const session = sessions[req.cookies.sessionId]; // FAILS if different task
  if (!session) return res.status(401).json({ error: 'Not authenticated' });
  res.json({ userId: session.userId });
});

// GOOD: JWT-based authentication (stateless - works with any number of tasks)
const jwt = require('jsonwebtoken');
const bcrypt = require('bcrypt');

app.post('/login', async (req, res) => {
  const user = await db.users.findOne({ email: req.body.email });
  if (!user || !await bcrypt.compare(req.body.password, user.passwordHash)) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }
  
  const token = jwt.sign(
    { userId: user._id, email: user.email },
    process.env.JWT_SECRET,
    { expiresIn: '24h' }
  );
  
  res.json({ token });
});

app.get('/profile', authenticateJWT, async (req, res) => {
  // req.user is decoded from the JWT — no server-side state needed
  const user = await db.users.findById(req.user.userId);
  res.json({ userId: user._id, email: user.email });
});

function authenticateJWT(req, res, next) {
  const token = req.headers.authorization?.split(' ')[1];
  if (!token) return res.status(401).json({ error: 'No token provided' });
  
  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET);
    next();
  } catch (err) {
    res.status(403).json({ error: 'Invalid token' });
  }
}

// GOOD: External session store with Redis (stateless tasks, shared state)
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const redisClient = createClient({
  url: process.env.REDIS_URL  // ElastiCache endpoint
});
redisClient.connect().catch(console.error); // node-redis v4+ requires an explicit connect

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
  cookie: { secure: true, maxAge: 86400000 } // 24 hours
}));

7. Session Handling Without State

Three approaches for managing user sessions across horizontally scaled tasks:

Approach 1: JWT (JSON Web Tokens)

Client sends JWT in Authorization header
     │
     ▼
Any task can verify the JWT independently
No shared state needed
No database lookup needed (claims are in the token)

Pros: Fastest, no external dependency
Cons: Can't revoke individual tokens (without a blocklist)
Best for: API authentication, microservices

Approach 2: External Session Store (Redis)

Client sends session cookie
     │
     ▼
Any task looks up session in Redis
Redis is shared across all tasks

Pros: Can revoke sessions instantly, store complex session data
Cons: Redis becomes a single point of failure unless replicated; adds a lookup per request
Best for: Web applications, complex session data

Approach 3: Sticky Sessions (ALB)

ALB routes ALL requests from one client to the SAME task
     │
     ▼
Session data stays in memory on that task

Pros: Simple, no external store needed
Cons: Uneven load distribution, task failure loses sessions
Best for: Legacy applications during migration

8. Warm-Up Time Considerations

New tasks take time to become fully productive. This warm-up period affects scaling decisions.

Task Lifecycle:
  0s   - ECS launches new task
  5s   - Container image pulled (if not cached)
  10s  - Container starts
  15s  - Application starts, opens DB connections
  20s  - Health check passes
  30s  - ALB registers task as healthy
  45s  - Task receives first request
  60s  - JIT optimizations kick in, caches warm

Total warm-up: 30-90 seconds before a new task handles traffic at full efficiency

Why warm-up matters for auto-scaling

Problem: Spike arrives → auto-scaling triggers → new tasks take 60s
         During those 60s, existing tasks are overloaded

Solutions:
  1. Predictive scaling — scale up BEFORE the spike (scheduled scaling)
  2. Over-provision slightly — keep min capacity above bare minimum
  3. Request queuing — buffer requests during scaling events
  4. Scale-out cooldown — short (60s) so you can add tasks quickly

Node.js warm-up optimization

// server.js — pre-warm connections before accepting traffic

const express = require('express');
const mongoose = require('mongoose');
const Redis = require('ioredis');

const app = express();
let isReady = false;

async function warmUp() {
  console.log('[WARM-UP] Starting pre-warm sequence...');
  
  // 1. Connect to database
  await mongoose.connect(process.env.MONGODB_URI);
  console.log('[WARM-UP] MongoDB connected');
  
  // 2. Connect to Redis
  const redis = new Redis(process.env.REDIS_URL);
  await redis.ping();
  console.log('[WARM-UP] Redis connected');
  
  // 3. Pre-load frequently accessed data
  const config = await mongoose.connection.db
    .collection('config')
    .findOne({ key: 'app-settings' });
  app.locals.config = config;
  console.log('[WARM-UP] Config cached');
  
  // 4. Warm up any ML models or heavy initializations
  // await loadModel();
  
  isReady = true;
  console.log('[WARM-UP] Service ready to accept traffic');
}

// Health check respects warm-up state
app.get('/health', (req, res) => {
  if (!isReady) {
    return res.status(503).json({ status: 'warming-up' });
  }
  res.json({ status: 'healthy' });
});

// Start server only after warm-up
warmUp()
  .then(() => {
    app.listen(process.env.PORT || 3000, () => {
      console.log(`Server listening on port ${process.env.PORT || 3000}`);
    });
  })
  .catch((err) => {
    console.error('[WARM-UP] Failed:', err);
    process.exit(1);
  });

9. Complete Auto-Scaling Configuration Example

Here is a production-ready auto-scaling setup combining multiple strategies:

#!/bin/bash
# auto-scaling-setup.sh — Complete ECS auto-scaling configuration

CLUSTER="production-cluster"
SERVICE="api-service"
RESOURCE_ID="service/${CLUSTER}/${SERVICE}"

echo "=== Step 1: Register scalable target ==="
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

echo "=== Step 2: CPU target tracking (primary) ==="
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-70 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300,
    "DisableScaleIn": false
  }'

echo "=== Step 3: Memory target tracking (secondary) ==="
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name memory-target-75 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 75.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

echo "=== Step 4: Scheduled scaling for business hours ==="
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name weekday-scale-up \
  --schedule "cron(0 8 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=4,MaxCapacity=20

aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name weeknight-scale-down \
  --schedule "cron(0 22 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=2,MaxCapacity=8

echo "=== Step 5: Verify configuration ==="
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --query 'ScalingPolicies[].{Name:PolicyName,Type:PolicyType}'

CloudFormation equivalent

# CloudFormation snippet
AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 20
    MinCapacity: 2
    ResourceId: !Sub service/${ECSCluster}/${ECSService.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs

CPUScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref AutoScalingTarget
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 70.0
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      ScaleOutCooldown: 60
      ScaleInCooldown: 300

10. Key Takeaways

  1. Horizontal scaling adds task instances behind a load balancer — prefer this over vertical scaling for microservices.
  2. Target tracking is the recommended auto-scaling policy — set a target (e.g., 70% CPU) and AWS handles the math.
  3. Stateless services are mandatory for horizontal scaling — externalize all state to databases, Redis, or JWTs.
  4. Scale-out fast, scale-in slow — use short cooldowns for adding capacity and long cooldowns for removing it.
  5. Warm-up time means new tasks are not immediately productive — account for this in your scaling strategy by maintaining a healthy minimum capacity.

Explain-It Challenge

  1. Your ECS service has 3 tasks at 90% CPU. Target tracking is set to 70%. How many tasks will AWS scale to? Walk through the math.
  2. A colleague wants to store user sessions in a global variable. Explain why this breaks when you have 4 ECS tasks behind a load balancer.
  3. Traffic spikes every Monday at 9 AM. The auto-scaling takes 60 seconds to add new tasks. How would you prevent the 60-second gap?

Navigation: ← 6.4 Overview · 6.4.b — Health Checks and Availability →