Episode 6 — Scaling Reliability Microservices Web3 / 6.4 — Distributed Observability and Scaling

6.4.a — Horizontal Scaling with ECS

In one sentence: Horizontal scaling adds or removes ECS task instances behind a load balancer to match traffic demand, enabled by stateless service architecture and driven by auto-scaling policies that react to CPU, memory, or custom metrics.

Navigation: ← 6.4 Overview · 6.4.b — Health Checks and Availability →


1. Horizontal vs Vertical Scaling

Before diving into ECS specifics, understand the two fundamental scaling approaches:

VERTICAL SCALING (Scale Up)             HORIZONTAL SCALING (Scale Out)
┌─────────────────────┐                 ┌───────┐ ┌───────┐ ┌───────┐
│                     │                 │Task 1 │ │Task 2 │ │Task 3 │
│   Single Big Task   │                 │ 1 CPU │ │ 1 CPU │ │ 1 CPU │
│   4 CPU / 8 GB      │                 │ 2 GB  │ │ 2 GB  │ │ 2 GB  │
│                     │                 └───────┘ └───────┘ └───────┘
└─────────────────────┘                     │         │         │
                                            └─────────┼─────────┘
Limits: Hardware ceiling                              │
Single point of failure                          Load Balancer
                                            Limits: Nearly unlimited
                                            Fault tolerant

Horizontal scaling wins for microservices because:

  1. No hardware ceiling — add as many tasks as you need
  2. Fault tolerance — one task crashing does not bring down the service
  3. Cost efficiency — scale down during off-peak hours to save money
  4. Zero-downtime deployments — rolling updates replace tasks one at a time

2. ECS Scaling Fundamentals: Desired Count vs Running Count

Every ECS service has three critical numbers:

Property        Meaning
Desired count   How many tasks the service wants running
Running count   How many tasks are actually running right now
Pending count   Tasks that are starting up but not yet ready
ECS Service: api-service
  Desired count:  4
  Running count:  3
  Pending count:  1     ← One task is starting up

ECS will keep trying until Running count = Desired count.
If a task crashes, ECS automatically launches a replacement.
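The reconciliation ECS performs can be sketched as a small pure function (the helper name is hypothetical, not part of the ECS API):

```javascript
// Sketch of the control loop ECS effectively runs for a service:
// compare desired vs running counts and compute the corrective action.
function reconcile({ desired, running }) {
  if (running < desired) {
    return { action: 'launch', count: desired - running };
  }
  if (running > desired) {
    return { action: 'drainAndStop', count: running - desired };
  }
  return { action: 'none', count: 0 };
}

// A crashed task drops the running count, so the next pass launches a replacement:
console.log(reconcile({ desired: 4, running: 3 })); // { action: 'launch', count: 1 }
```

The same loop handles both manual changes to desired count and auto-scaling adjustments — ECS only ever converges running toward desired.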

Setting desired count manually

# Set a fixed number of tasks
aws ecs update-service \
  --cluster production-cluster \
  --service api-service \
  --desired-count 4

# Check the current state
aws ecs describe-services \
  --cluster production-cluster \
  --services api-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}'

Output:

{
  "desired": 4,
  "running": 3,
  "pending": 1
}

When does ECS adjust running count?

Event                                      What Happens
Task crashes (OOM, unhandled exception)    ECS launches a replacement immediately
Task fails health check                    ECS stops the task and launches a new one
You increase desired count                 ECS launches additional tasks
You decrease desired count                 ECS drains and stops excess tasks
Auto-scaling policy triggers               ECS adjusts desired count automatically

3. Auto-Scaling Policies

Auto-scaling automatically adjusts the desired count based on metrics. ECS integrates with Application Auto Scaling, which supports three policy types:

3a. Target Tracking Scaling (Recommended)

Target tracking is the simplest and most effective policy. You specify a target value for a metric, and AWS automatically adjusts scaling to maintain that target.

# Step 1: Register the scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

# Step 2: Create a target tracking policy for CPU
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

How it works:

Target: 70% CPU utilization
Current: 4 tasks, average 85% CPU

AWS calculates: 4 tasks * 85% / 70% target = 4.86 → rounds up to 5 tasks
Action: Increase desired count from 4 to 5

After scale-out: 5 tasks, average 68% CPU → within target → no action
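The calculation above can be reproduced in a few lines — a sketch of the target-tracking arithmetic, with clamping to the registered min/max capacity added as an assumption:

```javascript
// Target tracking: required tasks = ceil(currentTasks * currentMetric / target),
// clamped to the min/max capacity registered on the scalable target.
function targetTrackingDesired(currentTasks, currentMetric, target, min, max) {
  const required = Math.ceil(currentTasks * (currentMetric / target));
  return Math.min(max, Math.max(min, required));
}

console.log(targetTrackingDesired(4, 85, 70, 2, 20)); // 5  (4 * 85/70 = 4.86 → 5)
```

Note the clamp: even a huge spike never pushes desired count past max capacity, which is your cost guardrail.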

3b. Step Scaling

Step scaling lets you define different scaling increments based on how far a metric deviates from a threshold. Unlike target tracking, a step policy does not create its own alarms — you attach it as the action of a CloudWatch alarm you create separately, and the bounds below are offsets from that alarm's threshold:

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-step-scaling \
  --policy-type StepScaling \
  --step-scaling-policy-configuration '{
    "AdjustmentType": "ChangeInCapacity",
    "StepAdjustments": [
      {
        "MetricIntervalLowerBound": 0,
        "MetricIntervalUpperBound": 15,
        "ScalingAdjustment": 1
      },
      {
        "MetricIntervalLowerBound": 15,
        "MetricIntervalUpperBound": 30,
        "ScalingAdjustment": 3
      },
      {
        "MetricIntervalLowerBound": 30,
        "ScalingAdjustment": 5
      }
    ],
    "Cooldown": 60
  }'

Step scaling in plain English:

Alarm threshold: 70% CPU

CPU at 75% (0-15 above threshold)  → add 1 task
CPU at 90% (15-30 above threshold) → add 3 tasks
CPU at 100% (30+ above threshold)  → add 5 tasks

This gives you proportional response to traffic spikes.
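The bands can be expressed directly in code — a sketch mirroring the StepAdjustments above, with the 70% alarm threshold carried over from the example as an assumption:

```javascript
// Step scaling: pick an adjustment by how far the metric sits above the
// alarm threshold. Bounds mirror the policy JSON: lower inclusive, upper exclusive.
const THRESHOLD = 70;
const STEPS = [
  { lower: 0,  upper: 15,       adjustment: 1 },
  { lower: 15, upper: 30,       adjustment: 3 },
  { lower: 30, upper: Infinity, adjustment: 5 },
];

function stepAdjustment(cpu) {
  const deviation = cpu - THRESHOLD;
  if (deviation < 0) return 0; // alarm not breached — no scale-out
  const step = STEPS.find(s => deviation >= s.lower && deviation < s.upper);
  return step ? step.adjustment : 0;
}

console.log(stepAdjustment(75));  // 1
console.log(stepAdjustment(90));  // 3
console.log(stepAdjustment(100)); // 5
```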

3c. Scheduled Scaling

For predictable traffic patterns (e.g., marketing campaigns, business hours):

# Scale up at 8 AM UTC on weekdays
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name morning-scale-up \
  --schedule "cron(0 8 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=6,MaxCapacity=20

# Scale down at 10 PM UTC on weekdays
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name evening-scale-down \
  --schedule "cron(0 22 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=2,MaxCapacity=6

4. CPU-Based vs Memory-Based Scaling

CPU-based scaling (most common)

Best for compute-bound services — API servers, data processing, AI inference:

# Target tracking on CPU
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }'

Memory-based scaling

Best for memory-intensive services — caching layers, image processing, large data transforms:

# Target tracking on memory
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name memory-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 75.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
    }
  }'

Custom metric scaling (ALB request count per target)

Best for request-driven services where you want to cap requests per instance:

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production-cluster/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name request-count-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/my-alb/1234567890/targetgroup/my-tg/0987654321"
    }
  }'
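A quick way to sanity-check the target value is back-of-envelope sizing — a hypothetical helper, reusing the 1,000 requests-per-target figure from the policy above:

```javascript
// Sizing for ALBRequestCountPerTarget tracking:
// tasks needed = ceil(total request rate / target per task), clamped to min/max.
function tasksForRequestRate(totalRequests, targetPerTask, min, max) {
  const needed = Math.ceil(totalRequests / targetPerTask);
  return Math.min(max, Math.max(min, needed));
}

// 4,500 requests/min at a target of 1,000 per task → 5 tasks
console.log(tasksForRequestRate(4500, 1000, 2, 20)); // 5
```

This metric is often a better scaling signal than CPU for I/O-bound APIs, because request rate rises before CPU does.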

5. Scaling Out vs Scaling In

Scaling out (adding tasks) and scaling in (removing tasks) behave very differently and require different strategies:

Scaling out (adding capacity)

Trigger: Metric exceeds threshold
Action:  Increase desired count
Speed:   30-90 seconds for new task to be ready
Risk:    LOW — adding capacity is safe

Best practices:
  - Short cooldown (60 seconds) — respond quickly to spikes
  - Be aggressive — it's better to over-provision temporarily
  - Monitor pending count for capacity issues

Scaling in (removing capacity)

Trigger: Metric drops below threshold
Action:  Decrease desired count, drain connections from chosen task
Speed:   Task drain period + deregistration delay = 60-120 seconds
Risk:    MODERATE — removing capacity can drop active connections

Best practices:
  - Long cooldown (300+ seconds) — avoid flapping
  - Gradual — scale in 1 task at a time
  - Connection draining — allow in-flight requests to complete
  - Min capacity — NEVER scale to 0 (unless serverless)

Cooldown periods explained

┌────────────┐     Scale-out     ┌────────────┐
│ 3 tasks    │ ─── cooldown ──── │ 4 tasks    │
│ CPU: 85%   │     (60 sec)      │ CPU: 68%   │
└────────────┘                   └────────────┘
                                       │
                                  Wait 60 seconds
                                  before next scale-out
                                  decision

┌────────────┐     Scale-in      ┌────────────┐
│ 4 tasks    │ ─── cooldown ──── │ 3 tasks    │
│ CPU: 30%   │     (300 sec)     │ CPU: 40%   │
└────────────┘                   └────────────┘
                                       │
                                  Wait 300 seconds
                                  before next scale-in
                                  decision

WHY different cooldowns?
  Scale-out: Respond to emergencies fast
  Scale-in:  Avoid removing capacity you'll need again in 2 minutes
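The asymmetric cooldowns can be modeled as a small gate — a hypothetical helper, with times as plain seconds:

```javascript
// Cooldown gate: a scaling action is allowed only if enough time has passed
// since the last action in the same direction.
function makeCooldownGate({ scaleOutSec, scaleInSec }) {
  const last = { out: -Infinity, in: -Infinity };
  return function allow(direction, nowSec) {
    const wait = direction === 'out' ? scaleOutSec : scaleInSec;
    if (nowSec - last[direction] < wait) return false;
    last[direction] = nowSec;
    return true;
  };
}

const allow = makeCooldownGate({ scaleOutSec: 60, scaleInSec: 300 });
console.log(allow('out', 0));  // true  — first scale-out goes through
console.log(allow('out', 30)); // false — still inside the 60s cooldown
console.log(allow('out', 90)); // true  — cooldown elapsed
```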

6. Stateless Service Architecture

Horizontal scaling only works if your services are stateless. A stateless service stores no client-specific data between requests.

Why statelessness enables scaling

STATEFUL (broken):
  Request 1 → Task A stores session data in memory
  Request 2 → Load balancer sends to Task B
  Result:     Task B has no session data → ERROR

STATELESS (works):
  Request 1 → Task A reads session from Redis/JWT
  Request 2 → Task B reads session from Redis/JWT
  Result:     Both tasks produce the same result → SUCCESS

What makes a service stateful (and how to fix it)

Stateful Pattern             Stateless Fix
In-memory sessions           External session store (Redis, DynamoDB)
Local file uploads           S3 or shared file system (EFS)
In-memory caches             External cache (ElastiCache Redis)
WebSocket connections        Sticky sessions or pub/sub (Redis pub/sub)
Local database               Managed database (RDS, DocumentDB)
Application-level counters   Atomic counters in Redis/DynamoDB

Express.js stateless session example

// BAD: In-memory session (stateful - breaks with multiple tasks)
const sessions = {};

app.post('/login', (req, res) => {
  const sessionId = generateId();
  sessions[sessionId] = { userId: req.body.userId, loginTime: Date.now() };
  res.cookie('sessionId', sessionId);
  res.json({ success: true });
});

app.get('/profile', (req, res) => {
  const session = sessions[req.cookies.sessionId]; // FAILS if different task
  if (!session) return res.status(401).json({ error: 'Not authenticated' });
  res.json({ userId: session.userId });
});

// GOOD: JWT-based authentication (stateless - works with any number of tasks)
const jwt = require('jsonwebtoken');
const bcrypt = require('bcrypt');

app.post('/login', async (req, res) => {
  const user = await db.users.findOne({ email: req.body.email });
  if (!user || !await bcrypt.compare(req.body.password, user.passwordHash)) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }
  
  const token = jwt.sign(
    { userId: user._id, email: user.email },
    process.env.JWT_SECRET,
    { expiresIn: '24h' }
  );
  
  res.json({ token });
});

app.get('/profile', authenticateJWT, async (req, res) => {
  // req.user is decoded from the JWT — no server-side state needed
  const user = await db.users.findById(req.user.userId);
  res.json({ userId: user._id, email: user.email });
});

function authenticateJWT(req, res, next) {
  const token = req.headers.authorization?.split(' ')[1];
  if (!token) return res.status(401).json({ error: 'No token provided' });
  
  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET);
    next();
  } catch (err) {
    res.status(403).json({ error: 'Invalid token' });
  }
}

// GOOD: External session store with Redis (stateless tasks, shared state)
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const redisClient = createClient({
  url: process.env.REDIS_URL  // ElastiCache endpoint
});
redisClient.connect().catch(console.error); // node-redis v4+ requires an explicit connect

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
  cookie: { secure: true, maxAge: 86400000 } // 24 hours
}));

7. Session Handling Without State

Three approaches for managing user sessions across horizontally scaled tasks:

Approach 1: JWT (JSON Web Tokens)

Client sends JWT in Authorization header
     │
     ▼
Any task can verify the JWT independently
No shared state needed
No database lookup needed (claims are in the token)

Pros: Fastest, no external dependency
Cons: Can't revoke individual tokens (without a blocklist)
Best for: API authentication, microservices

Approach 2: External Session Store (Redis)

Client sends session cookie
     │
     ▼
Any task looks up session in Redis
Redis is shared across all tasks

Pros: Can revoke sessions instantly, store complex session data
Cons: Redis becomes a single point of failure unless replicated; adds a lookup per request
Best for: Web applications, complex session data

Approach 3: Sticky Sessions (ALB)

ALB routes ALL requests from one client to the SAME task
     │
     ▼
Session data stays in memory on that task

Pros: Simple, no external store needed
Cons: Uneven load distribution, task failure loses sessions
Best for: Legacy applications during migration

8. Warm-Up Time Considerations

New tasks take time to become fully productive. This warm-up period affects scaling decisions.

Task Lifecycle:
  0s   - ECS launches new task
  5s   - Container image pulled (if not cached)
  10s  - Container starts
  15s  - Application starts, opens DB connections
  20s  - Health check passes
  30s  - ALB registers task as healthy
  45s  - Task receives first request
  60s  - JIT optimizations kick in, caches warm

Total warm-up: 30-90 seconds before a new task handles traffic at full efficiency

Why warm-up matters for auto-scaling

Problem: Spike arrives → auto-scaling triggers → new tasks take 60s
         During those 60s, existing tasks are overloaded

Solutions:
  1. Predictive scaling — scale up BEFORE the spike (scheduled scaling)
  2. Over-provision slightly — keep min capacity above bare minimum
  3. Request queuing — buffer requests during scaling events
  4. Scale-out cooldown — short (60s) so you can add tasks quickly

Node.js warm-up optimization

// server.js — pre-warm connections before accepting traffic

const express = require('express');
const mongoose = require('mongoose');
const Redis = require('ioredis');

const app = express();
let isReady = false;

async function warmUp() {
  console.log('[WARM-UP] Starting pre-warm sequence...');
  
  // 1. Connect to database
  await mongoose.connect(process.env.MONGODB_URI);
  console.log('[WARM-UP] MongoDB connected');
  
  // 2. Connect to Redis
  const redis = new Redis(process.env.REDIS_URL);
  await redis.ping();
  console.log('[WARM-UP] Redis connected');
  
  // 3. Pre-load frequently accessed data
  const config = await mongoose.connection.db
    .collection('config')
    .findOne({ key: 'app-settings' });
  app.locals.config = config;
  console.log('[WARM-UP] Config cached');
  
  // 4. Warm up any ML models or heavy initializations
  // await loadModel();
  
  isReady = true;
  console.log('[WARM-UP] Service ready to accept traffic');
}

// Health check respects warm-up state
app.get('/health', (req, res) => {
  if (!isReady) {
    return res.status(503).json({ status: 'warming-up' });
  }
  res.json({ status: 'healthy' });
});

// Start server only after warm-up
warmUp()
  .then(() => {
    app.listen(process.env.PORT || 3000, () => {
      console.log(`Server listening on port ${process.env.PORT || 3000}`);
    });
  })
  .catch((err) => {
    console.error('[WARM-UP] Failed:', err);
    process.exit(1);
  });

9. Complete Auto-Scaling Configuration Example

Here is a production-ready auto-scaling setup combining multiple strategies:

#!/bin/bash
# auto-scaling-setup.sh — Complete ECS auto-scaling configuration

CLUSTER="production-cluster"
SERVICE="api-service"
RESOURCE_ID="service/${CLUSTER}/${SERVICE}"

echo "=== Step 1: Register scalable target ==="
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

echo "=== Step 2: CPU target tracking (primary) ==="
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-70 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300,
    "DisableScaleIn": false
  }'

echo "=== Step 3: Memory target tracking (secondary) ==="
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name memory-target-75 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 75.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

echo "=== Step 4: Scheduled scaling for business hours ==="
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name weekday-scale-up \
  --schedule "cron(0 8 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=4,MaxCapacity=20

aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name weeknight-scale-down \
  --schedule "cron(0 22 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=2,MaxCapacity=8

echo "=== Step 5: Verify configuration ==="
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id "$RESOURCE_ID" \
  --query 'ScalingPolicies[].{Name:PolicyName,Type:PolicyType}'

CloudFormation equivalent

# CloudFormation snippet
AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 20
    MinCapacity: 2
    ResourceId: !Sub service/${ECSCluster}/${ECSService.Name}
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs

CPUScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref AutoScalingTarget
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 70.0
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      ScaleOutCooldown: 60
      ScaleInCooldown: 300

10. Key Takeaways

  1. Horizontal scaling adds task instances behind a load balancer — prefer this over vertical scaling for microservices.
  2. Target tracking is the recommended auto-scaling policy — set a target (e.g., 70% CPU) and AWS handles the math.
  3. Stateless services are mandatory for horizontal scaling — externalize all state to databases, Redis, or JWTs.
  4. Scale-out fast, scale-in slow — use short cooldowns for adding capacity and long cooldowns for removing it.
  5. Warm-up time means new tasks are not immediately productive — account for this in your scaling strategy by maintaining a healthy minimum capacity.

Explain-It Challenge

  1. Your ECS service has 3 tasks at 90% CPU. Target tracking is set to 70%. How many tasks will AWS scale to? Walk through the math.
  2. A colleague wants to store user sessions in a global variable. Explain why this breaks when you have 4 ECS tasks behind a load balancer.
  3. Traffic spikes every Monday at 9 AM. The auto-scaling takes 60 seconds to add new tasks. How would you prevent the 60-second gap?

Navigation: ← 6.4 Overview · 6.4.b — Health Checks and Availability →