Episode 6 — Scaling & Reliability (Microservices & Web3) / 6.4 — Distributed Observability and Scaling

6.4.b — Health Checks and Availability

In one sentence: Health checks are automated probes that continuously verify whether each ECS task can receive traffic, enabling the system to replace unhealthy tasks, prevent cascading failures, and maintain high availability across distributed services.

Navigation: ← 6.4.a Horizontal Scaling · 6.4.c — CloudWatch Monitoring →


1. Why Health Checks Matter in Distributed Systems

In a monolithic application, if the server is running, it is usually serving requests. In a distributed system with multiple tasks, the situation is far more complex:

Without health checks:
  Task 1: Running, healthy         ← Receives traffic (good)
  Task 2: Running, DB disconnected ← Receives traffic (returns 500 errors!)
  Task 3: Running, out of memory   ← Receives traffic (crashes on every request!)

  Result: 2 out of 3 requests fail. Users see intermittent errors.

With health checks:
  Task 1: Running, healthy         ← Receives traffic (good)
  Task 2: Running, DB disconnected ← MARKED UNHEALTHY → removed from load balancer
  Task 3: Running, out of memory   ← MARKED UNHEALTHY → replaced by ECS

  Result: 100% of routed requests go to healthy tasks.

Health checks solve three critical problems:

| Problem | How Health Checks Help |
| --- | --- |
| Zombie processes | Detect tasks that are running but not functioning |
| Dependency failures | Detect when a task loses its database or cache connection |
| Graceful degradation | Remove broken tasks from rotation before users are affected |

2. ECS Health Checks vs ALB Health Checks

There are two independent layers of health checking in an ECS + ALB setup, and understanding the difference is essential:

┌─────────────────────────────────────────────────────────────────┐
│                    TWO LAYERS OF HEALTH CHECKS                   │
│                                                                  │
│  Layer 1: ALB Health Check                                       │
│  ┌─────────┐      GET /health        ┌──────────┐               │
│  │   ALB   │ ──────────────────────▶ │  Task    │               │
│  │         │ ◀────── 200 OK ──────── │          │               │
│  └─────────┘                         └──────────┘               │
│  Purpose: Decides if task receives TRAFFIC                       │
│  Failure: ALB stops sending requests to this task                │
│                                                                  │
│  Layer 2: ECS Container Health Check                             │
│  ┌─────────┐    docker healthcheck   ┌──────────┐               │
│  │  ECS    │ ──────────────────────▶ │Container │               │
│  │  Agent  │ ◀────── exit code ───── │          │               │
│  └─────────┘                         └──────────┘               │
│  Purpose: Decides if container is ALIVE                          │
│  Failure: ECS stops and replaces the entire task                 │
└─────────────────────────────────────────────────────────────────┘

Comparison table

| Aspect | ALB Health Check | ECS Container Health Check |
| --- | --- | --- |
| Who checks | Application Load Balancer | ECS agent (Docker HEALTHCHECK) |
| Protocol | HTTP/HTTPS request | Shell command inside container |
| Endpoint | Your app's /health route | CMD defined in task definition |
| On failure | Removes task from target group (no traffic) | Replaces the entire task |
| Configurable | Interval, threshold, timeout, path | Interval, timeout, retries, start period |
| Scope | Network-level + application-level | Container process-level |

ALB health check configuration

# Create a target group with health check settings
aws elbv2 create-target-group \
  --name api-targets \
  --protocol HTTP \
  --port 3000 \
  --vpc-id vpc-12345 \
  --health-check-protocol HTTP \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher '{"HttpCode": "200"}'

What these settings mean:

Every 30 seconds:
  ALB sends GET /health to each task
  Task must respond within 5 seconds
  Response must be HTTP 200

  If 3 consecutive checks fail → task marked UNHEALTHY → no more traffic
  If 2 consecutive checks pass → task marked HEALTHY → receives traffic again
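Under these settings, the time from a task's first failed probe until the ALB marks it unhealthy can be estimated with a simplified model. This is a sketch, not an AWS API: the parameter names are illustrative, and real ALB timing also depends on when the last passing probe ran.

```javascript
// Simplified model: failures are counted once per interval, and the final
// failing probe may take up to the timeout before it is counted as a failure.
function detectionSecondsFromFirstFailure({ intervalSeconds, timeoutSeconds, unhealthyThreshold }) {
  return (unhealthyThreshold - 1) * intervalSeconds + timeoutSeconds;
}

// Settings from the target group above: 30s interval, 5s timeout, 3 failures
console.log(detectionSecondsFromFirstFailure({
  intervalSeconds: 30,
  timeoutSeconds: 5,
  unhealthyThreshold: 3
})); // 65
```

Shortening the interval or the unhealthy threshold speeds up detection, at the cost of more probe traffic and a higher risk of false positives.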

ECS container health check in task definition

{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "portMappings": [
        { "containerPort": 3000, "protocol": "tcp" }
      ]
    }
  ]
}

3. Shallow vs Deep Health Checks

Not all health checks are equal. The depth of the check determines what failures you can detect:

Shallow health check (liveness probe)

Tests: "Is the process alive and accepting HTTP requests?"

// Shallow: Only confirms Express is running
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});
Detects: Process crashed, port not listening, container frozen
Misses:  Database disconnected, Redis down, disk full, memory leak

Deep health check (readiness probe)

Tests: "Can this task actually serve real requests end-to-end?"

// Deep: Verifies all critical dependencies
app.get('/health', async (req, res) => {
  const checks = {
    server: 'ok',
    database: 'unknown',
    redis: 'unknown',
    memory: 'unknown'
  };

  try {
    // Check MongoDB connection
    const dbStart = Date.now();
    await mongoose.connection.db.admin().ping();
    const dbLatency = Date.now() - dbStart;
    checks.database = dbLatency < 1000 ? 'ok' : 'degraded';
  } catch (err) {
    checks.database = 'error';
  }

  try {
    // Check Redis connection
    const redisStart = Date.now();
    await redisClient.ping();
    const redisLatency = Date.now() - redisStart;
    checks.redis = redisLatency < 500 ? 'ok' : 'degraded';
  } catch (err) {
    checks.redis = 'error';
  }

  // Check memory usage
  const memUsage = process.memoryUsage();
  const heapUsedMB = memUsage.heapUsed / 1024 / 1024;
  const heapTotalMB = memUsage.heapTotal / 1024 / 1024;
  const memPercent = (heapUsedMB / heapTotalMB) * 100;
  checks.memory = memPercent < 90 ? 'ok' : 'warning';

  // Determine overall status
  const hasError = Object.values(checks).includes('error');
  const hasDegraded = Object.values(checks).includes('degraded');

  // 200 healthy, 207 degraded, 503 unhealthy. Note: if the ALB matcher only
  // accepts 200, a 207 response counts as a failed check; widen the matcher
  // (e.g. "200-299") if degraded tasks should stay in rotation.
  const status = hasError ? 503 : hasDegraded ? 207 : 200;

  res.status(status).json({
    status: hasError ? 'unhealthy' : hasDegraded ? 'degraded' : 'healthy',
    checks,
    uptime: process.uptime(),
    timestamp: new Date().toISOString()
  });
});

When to use each

| Check Type | Use For | Risk |
| --- | --- | --- |
| Shallow | ALB health check (fast, low overhead) | May route to tasks with broken dependencies |
| Deep | Readiness determination, monitoring dashboards | Expensive check may itself cause issues |
| Hybrid | ALB uses shallow; separate /health/deep for monitoring | Best of both worlds |

4. Liveness vs Readiness Probes

While ECS does not natively separate liveness and readiness probes (unlike Kubernetes), the concepts still apply and can be implemented:

LIVENESS PROBE: "Is this process alive?"
  - If NO: Kill and restart the task
  - Checks: Process responding, not deadlocked
  - ECS equivalent: Container health check (CMD-SHELL)

READINESS PROBE: "Can this task serve traffic?"
  - If NO: Stop sending traffic, but DON'T kill the task
  - Checks: Dependencies connected, warm-up complete
  - ECS equivalent: ALB health check

Implementing both in Express.js

let isReady = false;  // Set to true after warm-up

// LIVENESS: Is the process alive? (for ECS container health check)
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive', pid: process.pid });
});

// READINESS: Can this task serve real requests? (for ALB health check)
app.get('/health/ready', async (req, res) => {
  if (!isReady) {
    return res.status(503).json({
      status: 'not-ready',
      reason: 'Service is still warming up'
    });
  }

  try {
    await mongoose.connection.db.admin().ping();
    await redisClient.ping();
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({
      status: 'not-ready',
      reason: err.message
    });
  }
});

// Warm-up sequence
async function warmUp() {
  await mongoose.connect(process.env.MONGODB_URI);
  await redisClient.connect();
  // Pre-load caches, etc.
  isReady = true;
  console.log('Service is ready to accept traffic');
}

Task definition using liveness for container check, readiness for ALB:

{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:3000/health/live || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}

ALB target group configured with /health/ready as the health check path.


5. Grace Periods and Start Periods

New tasks need time to start up. Without grace periods, health checks would immediately fail and create an infinite restart loop:

WITHOUT grace period:
  0s  - Task starts
  5s  - Health check runs → app not ready → FAIL
  10s - Health check runs → app not ready → FAIL
  15s - Health check runs → app not ready → FAIL (3 retries)
  16s - ECS marks unhealthy → kills task → starts new one
  INFINITE LOOP — task never has time to start

WITH grace period (startPeriod: 60):
  0s  - Task starts
  5s  - Health check runs → IGNORED (within start period)
  30s - Health check runs → IGNORED (within start period)
  55s - App finishes warm-up
  60s - Grace period ends
  65s - Health check runs → 200 OK → HEALTHY

Configuring grace periods

ECS container health check start period:

{
  "healthCheck": {
    "startPeriod": 60,
    "interval": 30,
    "timeout": 5,
    "retries": 3
  }
}

ECS service health check grace period (for ALB checks):

aws ecs create-service \
  --cluster production-cluster \
  --service-name api-service \
  --task-definition api:5 \
  --desired-count 3 \
  --health-check-grace-period-seconds 120 \
  --load-balancers targetGroupArn=arn:aws:...,containerName=api,containerPort=3000

Setting the right grace period:

Grace period = Application startup time + Safety margin

Example:
  Container pull:    5s
  App initialization: 10s
  DB connection:     5s
  Cache warm-up:     10s
  Total startup:     30s
  Safety margin:     30s (2x)
  Grace period:      60s

If your app takes 45s to start, set grace period to 90-120s.
Too short: Tasks killed before ready (restart loops)
Too long:  Broken tasks stay in rotation longer than necessary
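The arithmetic above can be captured in a small helper. This is an illustrative sketch (not an AWS API); the stage list and the 2x safety factor mirror the worked example.

```javascript
// Sketch: derive a grace period from measured startup stages.
function recommendedGracePeriodSeconds(startupStages, safetyFactor = 2) {
  const startupSeconds = startupStages.reduce((sum, stage) => sum + stage.seconds, 0);
  return startupSeconds * safetyFactor;
}

// Stages from the example above: 30s total startup, 2x margin
const stages = [
  { name: 'container pull', seconds: 5 },
  { name: 'app initialization', seconds: 10 },
  { name: 'db connection', seconds: 5 },
  { name: 'cache warm-up', seconds: 10 },
];

console.log(recommendedGracePeriodSeconds(stages)); // 60
```

Measure startup in a staging environment rather than guessing; startup time often grows as caches and dependency lists grow.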

6. Unhealthy Task Replacement

When a task fails health checks, ECS follows a specific replacement sequence:

1. ALB detects unhealthy task (3 consecutive failures)
   └── ALB stops routing traffic to that task

2. ECS detects container health check failure
   └── ECS marks task as UNHEALTHY

3. ECS stops the unhealthy task
   └── Container receives SIGTERM
   └── 30-second drain period (configurable)
   └── Container receives SIGKILL if still running

4. ECS launches replacement task
   └── New container pulled and started
   └── Health check grace period begins
   └── Once healthy, ALB routes traffic to it

Timeline:
  0s   - Failure detected
  90s  - Task deregistered from ALB (deregistration delay)
  120s - Old task stopped
  150s - New task started
  210s - New task passes health check
  210s - Service fully recovered
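As a sanity check, the recovery timeline is just the sum of the individual stages. A sketch, with the illustrative durations from the timeline above (each is configurable in your own setup, e.g. deregistration delay and stop timeout):

```javascript
// Rough end-to-end recovery time for replacing one unhealthy task.
// All durations are illustrative, not fixed ECS values.
const stages = [
  { name: 'deregistration delay', seconds: 90 },
  { name: 'stop old task (SIGTERM, then SIGKILL)', seconds: 30 },
  { name: 'start replacement task', seconds: 30 },
  { name: 'pass health checks (interval x healthy threshold)', seconds: 60 },
];

const totalSeconds = stages.reduce((sum, stage) => sum + stage.seconds, 0);
console.log(`Full recovery in ~${totalSeconds}s`); // ~210s
```

Tuning any stage (a shorter deregistration delay, faster image pulls, a tighter healthy threshold) shortens the whole recovery window.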

Handling graceful shutdown in Node.js

const server = app.listen(3000);

// Graceful shutdown on SIGTERM (sent by ECS before stopping)
process.on('SIGTERM', async () => {
  console.log('SIGTERM received. Starting graceful shutdown...');

  // 1. Mark as not ready so the ALB health check fails and traffic drains away
  isReady = false;

  // 2. Stop accepting new connections; in-flight requests are allowed to finish
  server.close(() => {
    console.log('HTTP server closed');
  });

  // 3. Wait for in-flight requests to complete (here: up to 10 seconds,
  //    well inside ECS's default 30-second stop timeout)
  await new Promise(resolve => setTimeout(resolve, 10000));

  // 4. Close database connections
  await mongoose.connection.close();
  await redisClient.quit();

  console.log('Graceful shutdown complete');
  process.exit(0);
});

// Also handle SIGINT for local development
process.on('SIGINT', () => {
  console.log('SIGINT received. Shutting down...');
  process.exit(0);
});

7. Comprehensive Health Check Endpoint

Here is a production-grade health check implementation that covers all scenarios:

const express = require('express');
const mongoose = require('mongoose');
const Redis = require('ioredis');

const app = express();
const redis = new Redis(process.env.REDIS_URL);

let isReady = false;
const startTime = Date.now();

// Dependency health checker
async function checkDependency(name, checkFn, timeoutMs = 3000) {
  const start = Date.now();
  try {
    await Promise.race([
      checkFn(),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('Timeout')), timeoutMs)
      )
    ]);
    return {
      name,
      status: 'ok',
      latencyMs: Date.now() - start
    };
  } catch (err) {
    return {
      name,
      status: 'error',
      latencyMs: Date.now() - start,
      error: err.message
    };
  }
}

// SHALLOW: Liveness (for ECS container health check / Docker HEALTHCHECK)
app.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    uptime: Math.floor((Date.now() - startTime) / 1000)
  });
});

// DEEP: Readiness (for ALB health check)
app.get('/health/ready', async (req, res) => {
  if (!isReady) {
    return res.status(503).json({ status: 'warming-up' });
  }
  
  const checks = await Promise.all([
    checkDependency('mongodb', () => mongoose.connection.db.admin().ping()),
    checkDependency('redis', () => redis.ping()),
  ]);

  const allOk = checks.every(c => c.status === 'ok');
  
  res.status(allOk ? 200 : 503).json({
    status: allOk ? 'healthy' : 'unhealthy',
    checks
  });
});

// DETAILED: Full diagnostics (for dashboards and debugging, NOT for ALB)
app.get('/health/detail', async (req, res) => {
  const checks = await Promise.all([
    checkDependency('mongodb', () => mongoose.connection.db.admin().ping()),
    checkDependency('redis', () => redis.ping()),
  ]);

  const mem = process.memoryUsage();

  res.status(200).json({
    status: isReady ? 'ready' : 'warming-up',
    version: process.env.APP_VERSION || 'unknown',
    uptime: Math.floor((Date.now() - startTime) / 1000),
    dependencies: checks,
    memory: {
      heapUsedMB: Math.round(mem.heapUsed / 1024 / 1024),
      heapTotalMB: Math.round(mem.heapTotal / 1024 / 1024),
      rssMB: Math.round(mem.rss / 1024 / 1024),
      externalMB: Math.round(mem.external / 1024 / 1024)
    },
    system: {
      nodeVersion: process.version,
      platform: process.platform,
      pid: process.pid,
      cpuUsage: process.cpuUsage()
    }
  });
});

8. Health Check Best Practices

| Practice | Why |
| --- | --- |
| Keep ALB health checks fast (< 2 seconds) | Slow checks cause false positives and delay traffic routing |
| Separate liveness from readiness | Prevents restart loops; lets ALB handle traffic routing |
| Use appropriate intervals | 30 seconds is standard; too frequent wastes resources, too infrequent delays detection |
| Set grace periods correctly | Must exceed your application's startup time with margin |
| Don't check non-critical dependencies | If a logging service is down, the task can still serve requests |
| Return structured JSON | Debugging is much easier with detailed health responses |
| Rate-limit deep checks | Prevent health checks from overloading dependent services |
| Monitor health check latency | Increasing latency is an early warning of problems |
| Test health check failures | Deliberately break dependencies and verify the system recovers |
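The "rate-limit deep checks" practice can be sketched as a small memoizing wrapper around an expensive check. This is an illustrative helper, not a library API; caching the in-flight promise (rather than the resolved value) also prevents a burst of simultaneous probes from triggering parallel checks.

```javascript
// Wrap an async check so it runs at most once per ttlMs.
function cachedCheck(checkFn, ttlMs) {
  let cached = null;   // promise of the most recent check
  let cachedAt = 0;
  return () => {
    const now = Date.now();
    if (cached && now - cachedAt < ttlMs) return cached;
    cachedAt = now;
    cached = checkFn().catch(err => {
      cached = null;   // never cache a failure; retry on the next probe
      throw err;
    });
    return cached;
  };
}

// Example: an expensive dependency check probed at most every 10 seconds
let probes = 0;
const checkDb = cachedCheck(async () => { probes++; return 'ok'; }, 10000);
checkDb();
checkDb();
console.log(probes); // 1
```

A health endpoint can then call `checkDb()` on every probe while the real dependency sees at most one request per TTL window.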

9. Cascading Failure Prevention

A cascading failure occurs when one failing component triggers failures in other components. Health checks are a key defense:

CASCADING FAILURE SCENARIO:

  Database becomes slow (high latency)
       │
       ▼
  Task health checks start timing out
       │
       ▼
  ALB marks all tasks unhealthy
       │
       ▼
  ALL requests fail with 5xx errors or time out
       │
       ▼
  Upstream services that depend on this service also fail
       │
       ▼
  ENTIRE SYSTEM DOWN

Prevention strategies

// Strategy 1: Circuit breaker pattern
// If database is slow, fail fast instead of timing out
const CircuitBreaker = require('opossum');

const dbBreaker = new CircuitBreaker(
  async (query) => mongoose.connection.db.collection('users').findOne(query),
  {
    timeout: 3000,       // If DB call takes > 3s, trip the breaker
    errorThresholdPercentage: 50,  // Trip if 50% of calls fail
    resetTimeout: 30000  // Try again after 30 seconds
  }
);

dbBreaker.on('open', () => console.log('Circuit OPEN — DB calls will fail fast'));
dbBreaker.on('halfOpen', () => console.log('Circuit HALF-OPEN — testing DB'));
dbBreaker.on('close', () => console.log('Circuit CLOSED — DB healthy'));

app.get('/users/:id', async (req, res) => {
  try {
    const user = await dbBreaker.fire({ _id: req.params.id });
    res.json(user);
  } catch (err) {
    if (err.message === 'Breaker is open') {
      res.status(503).json({ error: 'Service temporarily unavailable' });
    } else {
      res.status(500).json({ error: 'Internal error' });
    }
  }
});

// Strategy 2: Health check with dependency tolerance
// Mark healthy even if non-critical dependencies are down
app.get('/health/ready', async (req, res) => {
  const critical = await Promise.all([
    checkDependency('mongodb', () => mongoose.connection.db.admin().ping()),
  ]);
  
  const nonCritical = await Promise.all([
    checkDependency('redis', () => redis.ping()),
    checkDependency('email-service', () => fetch('http://email-svc/health')),
  ]);

  // Only fail on CRITICAL dependency failures
  const criticalOk = critical.every(c => c.status === 'ok');
  
  res.status(criticalOk ? 200 : 503).json({
    status: criticalOk ? 'healthy' : 'unhealthy',
    critical,
    nonCritical  // Report but don't fail
  });
});

// Strategy 3: Bulkhead pattern — isolate dependency failures
// Use separate connection pools with timeouts
const mongoPool = mongoose.createConnection(process.env.MONGODB_URI, {
  maxPoolSize: 10,        // Limit connections
  serverSelectionTimeoutMS: 5000,  // Fail fast
  socketTimeoutMS: 10000  // Don't hang forever
});

const redisClient = new Redis(process.env.REDIS_URL, {
  maxRetriesPerRequest: 3,
  connectTimeout: 5000,
  commandTimeout: 3000
});

10. Key Takeaways

  1. Two layers of health checks — ALB checks control traffic routing; ECS container checks control task lifecycle. Both are needed.
  2. Shallow checks for speed, deep checks for accuracy — use shallow for ALB, deep for monitoring dashboards.
  3. Grace periods prevent restart loops — always set them longer than your application's startup time.
  4. Health checks should be fast — a slow health check can itself become a reliability problem.
  5. Separate critical from non-critical dependencies — failing a health check because a logging service is down is worse than the original problem.

Explain-It Challenge

  1. Your ECS service shows 3 running tasks but the ALB says only 1 is healthy. What could be happening and how would you debug it?
  2. A new deployment causes all tasks to cycle indefinitely — starting, failing health checks, getting replaced. What is the most likely cause and how do you fix it?
  3. Explain why making your health check endpoint call a slow external API is a bad idea, even though it seems like a "deeper" check.

Navigation: ← 6.4.a Horizontal Scaling · 6.4.c — CloudWatch Monitoring →