Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.4 — Distributed Observability and Scaling
6.4.c — CloudWatch Monitoring
In one sentence: Amazon CloudWatch is the central observability platform for AWS services, collecting metrics (numbers over time), logs (event records), and alarms (automated alerts) that give you real-time visibility into your ECS services, application behavior, and system health.
Navigation: ← 6.4.b Health Checks · 6.4 Overview →
1. CloudWatch Fundamentals
CloudWatch has four core building blocks. Every observability strategy starts here:
┌──────────────────────────────────────────────────────────┐
│                CLOUDWATCH BUILDING BLOCKS                │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │   METRICS    │  │     LOGS     │  │    ALARMS    │    │
│  │              │  │              │  │              │    │
│  │  Numbers     │  │ Text events  │  │  Threshold   │    │
│  │  over time   │  │ structured   │  │  watchers    │    │
│  │              │  │ or free-form │  │              │    │
│  │  CPU: 72%    │  │ {"level":    │  │ IF CPU>80%   │    │
│  │ Memory: 45%  │  │  "error",    │  │ THEN notify  │    │
│  │ Requests: 1k │  │  "msg":...}  │  │  + scale     │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘    │
│         │                 │                 │            │
│         └─────────────────┼─────────────────┘            │
│                           ▼                              │
│                   ┌──────────────┐                       │
│                   │  DASHBOARDS  │                       │
│                   │              │                       │
│                   │   Visual     │                       │
│                   │  real-time   │                       │
│                   │   overview   │                       │
│                   └──────────────┘                       │
└──────────────────────────────────────────────────────────┘
| Component | What It Does | Example |
|---|---|---|
| Metrics | Time-series numerical data points | CPU utilization at 72% at 14:30:00 |
| Logs | Text records from applications and services | {"level":"error","message":"DB connection failed"} |
| Alarms | Watch a metric and trigger actions when thresholds are breached | Send SNS notification when CPU > 80% for 5 minutes |
| Dashboards | Visual panels combining metrics and logs | Real-time view of all ECS services |
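Each building block also has a direct CLI entry point, which is a quick way to poke at an account before building anything. A minimal tour (resource names below are placeholders, not from this guide's examples):

```shell
# Metrics: push one custom datapoint into a test namespace
aws cloudwatch put-metric-data \
  --namespace MyApp/Test --metric-name Heartbeat --value 1

# Logs: follow a log group in real time
aws logs tail /ecs/api-service --follow

# Alarms: list everything currently in the ALARM state
aws cloudwatch describe-alarms --state-value ALARM

# Dashboards: list what exists in the account
aws cloudwatch list-dashboards
```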
2. ECS Metrics in CloudWatch
ECS automatically publishes metrics to CloudWatch. These are the metrics you will use daily:
Service-level metrics (namespace: AWS/ECS)
| Metric | Description | Typical Alert Threshold |
|---|---|---|
| CPUUtilization | Percentage of reserved CPU that is in use | > 80% |
| MemoryUtilization | Percentage of reserved memory that is in use | > 85% |
Viewing ECS metrics via CLI
# Get CPU utilization for an ECS service over the last hour
# (the -v-1H flag is BSD/macOS date; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum
# Get memory utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name MemoryUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum
ALB-related metrics (namespace: AWS/ApplicationELB)
| Metric | Description | What It Tells You |
|---|---|---|
| RequestCount | Total requests received | Traffic volume |
| TargetResponseTime | Time for target to respond | Application latency |
| HTTPCode_Target_2XX_Count | Successful responses | Happy-path volume |
| HTTPCode_Target_5XX_Count | Server error responses | Error rate |
| HealthyHostCount | Number of healthy targets | Availability |
| UnHealthyHostCount | Number of unhealthy targets | Problem detection |
# Get 5XX error count over the last hour
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Sum
3. Custom Metrics from Node.js Applications
Built-in ECS metrics tell you about infrastructure. Custom metrics tell you about your application:
Publishing custom metrics with the AWS SDK
const { CloudWatchClient, PutMetricDataCommand } = require('@aws-sdk/client-cloudwatch');
const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

// Helper to publish a custom metric
async function publishMetric(metricName, value, unit = 'Count', dimensions = []) {
  const command = new PutMetricDataCommand({
    Namespace: 'MyApp/API',
    MetricData: [
      {
        MetricName: metricName,
        Value: value,
        Unit: unit,
        Timestamp: new Date(),
        Dimensions: [
          { Name: 'Environment', Value: process.env.NODE_ENV || 'development' },
          { Name: 'ServiceName', Value: 'api-service' },
          ...dimensions
        ]
      }
    ]
  });
  try {
    await cloudwatch.send(command);
  } catch (err) {
    console.error('Failed to publish metric:', err.message);
    // Don't throw — metric publishing should never break the request
  }
}
// Track API response times
app.use(async (req, res, next) => {
  const start = Date.now();
  res.on('finish', async () => {
    const duration = Date.now() - start;
    const route = req.route?.path || req.path;
    await publishMetric('ResponseTime', duration, 'Milliseconds', [
      { Name: 'Route', Value: route },
      { Name: 'Method', Value: req.method },
      { Name: 'StatusCode', Value: String(res.statusCode) }
    ]);
    // Track error rates
    if (res.statusCode >= 500) {
      await publishMetric('ServerErrors', 1, 'Count', [
        { Name: 'Route', Value: route }
      ]);
    }
  });
  next();
});
Tracking AI usage metrics
// Track OpenAI / LLM API usage
async function callLLM(prompt, model = 'gpt-4o') {
  const start = Date.now();
  try {
    const response = await openai.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
      temperature: 0.7
    });
    const duration = Date.now() - start;
    const inputTokens = response.usage.prompt_tokens;
    const outputTokens = response.usage.completion_tokens;

    // Publish AI-specific metrics
    await Promise.all([
      publishMetric('LLM_Latency', duration, 'Milliseconds', [
        { Name: 'Model', Value: model }
      ]),
      publishMetric('LLM_InputTokens', inputTokens, 'Count', [
        { Name: 'Model', Value: model }
      ]),
      publishMetric('LLM_OutputTokens', outputTokens, 'Count', [
        { Name: 'Model', Value: model }
      ]),
      publishMetric('LLM_TotalCost', calculateCost(model, inputTokens, outputTokens), 'None', [
        { Name: 'Model', Value: model }
      ])
    ]);

    return response.choices[0].message.content;
  } catch (err) {
    await publishMetric('LLM_Errors', 1, 'Count', [
      { Name: 'Model', Value: model },
      { Name: 'ErrorType', Value: err.code || 'unknown' }
    ]);
    throw err;
  }
}

function calculateCost(model, inputTokens, outputTokens) {
  // USD per token (illustrative prices; check current provider pricing)
  const pricing = {
    'gpt-4o': { input: 2.50 / 1_000_000, output: 10.00 / 1_000_000 },
    'gpt-4o-mini': { input: 0.15 / 1_000_000, output: 0.60 / 1_000_000 }
  };
  const p = pricing[model] || pricing['gpt-4o'];
  return (inputTokens * p.input) + (outputTokens * p.output);
}
Batching metrics for efficiency
// Batch metrics to reduce API calls (publish every 60 seconds)
class MetricBatcher {
  constructor(namespace, flushIntervalMs = 60000) {
    this.namespace = namespace;
    this.buffer = [];
    this.cloudwatch = new CloudWatchClient({ region: 'us-east-1' });
    // unref() so the flush timer doesn't keep the process alive at shutdown
    setInterval(() => this.flush(), flushIntervalMs).unref();
  }

  add(metricName, value, unit = 'Count', dimensions = []) {
    this.buffer.push({
      MetricName: metricName,
      Value: value,
      Unit: unit,
      Timestamp: new Date(),
      Dimensions: dimensions
    });
  }

  async flush() {
    if (this.buffer.length === 0) return;
    // CloudWatch accepts at most 1000 metric data points per call
    const batches = [];
    for (let i = 0; i < this.buffer.length; i += 1000) {
      batches.push(this.buffer.slice(i, i + 1000));
    }
    this.buffer = [];
    for (const batch of batches) {
      try {
        await this.cloudwatch.send(new PutMetricDataCommand({
          Namespace: this.namespace,
          MetricData: batch
        }));
      } catch (err) {
        console.error('Failed to flush metrics:', err.message);
      }
    }
  }
}

const metrics = new MetricBatcher('MyApp/API');

// Usage in middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    metrics.add('ResponseTime', Date.now() - start, 'Milliseconds');
    metrics.add('RequestCount', 1);
    if (res.statusCode >= 500) metrics.add('5xxErrors', 1);
  });
  next();
});
4. Log Groups and Log Streams
CloudWatch Logs organize application output into a hierarchy:
CloudWatch Logs
└── Log Group: /ecs/api-service ← One per service/application
├── Log Stream: ecs/api/task-id-1 ← One per task/container
├── Log Stream: ecs/api/task-id-2
└── Log Stream: ecs/api/task-id-3
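This hierarchy can be explored from the CLI (the group name matches the example above):

```shell
# List log groups under the /ecs/ prefix
aws logs describe-log-groups --log-group-name-prefix /ecs/

# List a group's streams, most recently active first
aws logs describe-log-streams \
  --log-group-name /ecs/api-service \
  --order-by LastEventTime \
  --descending \
  --max-items 5
```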
ECS task definition log configuration
{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}
Querying logs via CLI
# View recent logs for a log group
aws logs tail /ecs/api-service --follow
# Search for errors in the last hour
# (GNU/Linux date; on macOS use: date -u -v-1H +%s000)
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--start-time $(date -d '1 hour ago' +%s000) \
--filter-pattern "ERROR"
# Search for specific request ID
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.requestId = "abc-123" }'
# Set log retention (default is forever — gets expensive)
aws logs put-retention-policy \
--log-group-name /ecs/api-service \
--retention-in-days 30
5. Structured Logging to CloudWatch
Structured logging means emitting logs as JSON objects instead of plain text. This makes them searchable, filterable, and parseable:
// BAD: Unstructured logging
console.log('User login failed for email: john@example.com');
console.log('Error: Invalid password');
// How do you search for all failed logins? All errors for a specific user?
// GOOD: Structured logging
const logger = {
  info: (message, data = {}) => {
    console.log(JSON.stringify({
      level: 'info',
      message,
      timestamp: new Date().toISOString(),
      service: process.env.SERVICE_NAME || 'api-service',
      taskId: process.env.ECS_TASK_ID || 'local',
      ...data
    }));
  },
  error: (message, error, data = {}) => {
    console.log(JSON.stringify({
      level: 'error',
      message,
      timestamp: new Date().toISOString(),
      service: process.env.SERVICE_NAME || 'api-service',
      taskId: process.env.ECS_TASK_ID || 'local',
      error: {
        name: error?.name,
        message: error?.message,
        stack: error?.stack
      },
      ...data
    }));
  },
  warn: (message, data = {}) => {
    console.log(JSON.stringify({
      level: 'warn',
      message,
      timestamp: new Date().toISOString(),
      service: process.env.SERVICE_NAME || 'api-service',
      taskId: process.env.ECS_TASK_ID || 'local',
      ...data
    }));
  }
};
// Usage
logger.info('User logged in', { userId: 'user-123', email: 'john@example.com' });
logger.error('Login failed', new Error('Invalid password'), { email: 'john@example.com' });
Request-scoped logging with correlation IDs
const { v4: uuidv4 } = require('uuid');
// Middleware: Attach request ID to every request
app.use((req, res, next) => {
  req.requestId = req.headers['x-request-id'] || uuidv4();
  res.setHeader('x-request-id', req.requestId);

  // Create a request-scoped logger
  req.log = {
    info: (message, data = {}) => logger.info(message, { requestId: req.requestId, ...data }),
    error: (message, err, data = {}) => logger.error(message, err, { requestId: req.requestId, ...data }),
    warn: (message, data = {}) => logger.warn(message, { requestId: req.requestId, ...data }),
  };

  // Log every incoming request
  req.log.info('Request received', {
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
    ip: req.ip
  });

  next();
});
// Usage in route handlers
app.get('/users/:id', async (req, res) => {
  req.log.info('Fetching user', { userId: req.params.id });
  try {
    const user = await db.users.findById(req.params.id);
    if (!user) {
      req.log.warn('User not found', { userId: req.params.id });
      return res.status(404).json({ error: 'User not found' });
    }
    req.log.info('User fetched successfully', { userId: req.params.id });
    res.json(user);
  } catch (err) {
    req.log.error('Failed to fetch user', err, { userId: req.params.id });
    res.status(500).json({ error: 'Internal server error' });
  }
});
Searching structured logs in CloudWatch
# Find all errors
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.level = "error" }'
# Find errors for a specific user
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.level = "error" && $.userId = "user-123" }'
# Find slow requests (> 1000ms)
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.responseTime > 1000 }'
# Trace a single request across all log streams
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.requestId = "abc-123-def" }'
6. Setting Up CloudWatch Alarms
Alarms watch a metric and trigger actions when thresholds are breached:
CPU alarm
# Alert when average CPU exceeds 80% for 5 minutes
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-high-cpu" \
--alarm-description "CPU utilization above 80% for 5 minutes" \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--statistic Average \
--period 60 \
--evaluation-periods 5 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789:ops-alerts \
--treat-missing-data missing
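The `--period 60 --evaluation-periods 5` pair means five consecutive one-minute datapoints must breach before the alarm fires. A minimal sketch of that default N-of-N evaluation (illustrative only: `evaluateAlarm` is not an AWS API, and the real service also supports M-out-of-N evaluation via `DatapointsToAlarm`):

```javascript
// Illustrative model of CloudWatch alarm evaluation: the alarm goes to
// ALARM only when the last `evaluationPeriods` datapoints ALL breach.
function evaluateAlarm(datapoints, threshold, evaluationPeriods) {
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.length === evaluationPeriods &&
         recent.every((v) => v > threshold);
}

console.log(evaluateAlarm([85, 90, 88, 92, 84], 80, 5)); // true: all 5 breach
console.log(evaluateAlarm([85, 90, 70, 92, 84], 80, 5)); // false: 70 does not
```

A single sub-threshold datapoint resets the streak, which is why brief CPU spikes don't page anyone.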
Error rate alarm
# Alert when 5XX errors exceed 5 in 1 minute
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-5xx-spike" \
--alarm-description "More than 5 server errors in 1 minute" \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
--statistic Sum \
--period 60 \
--evaluation-periods 1 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts \
--treat-missing-data notBreaching
Memory alarm
# Alert when memory exceeds 85%
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-high-memory" \
--alarm-description "Memory utilization above 85%" \
--namespace AWS/ECS \
--metric-name MemoryUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 85 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
Healthy host count alarm
# Alert when fewer than 2 healthy tasks
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-low-healthy-hosts" \
--alarm-description "Fewer than 2 healthy tasks" \
--namespace AWS/ApplicationELB \
--metric-name HealthyHostCount \
--dimensions Name=TargetGroup,Value=targetgroup/api-targets/0987654321 Name=LoadBalancer,Value=app/my-alb/1234567890 \
--statistic Minimum \
--period 60 \
--evaluation-periods 2 \
--threshold 2 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
Custom metric alarm (LLM error rate)
# Alert when LLM errors exceed 10 in 5 minutes
# NOTE: alarm dimensions must exactly match the published metric's dimensions;
# a datapoint published with extra dimensions (e.g. Model) is a separate time series
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-llm-errors" \
--alarm-description "LLM API errors spiking" \
--namespace MyApp/API \
--metric-name LLM_Errors \
--dimensions Name=Environment,Value=production Name=ServiceName,Value=api-service \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
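Before trusting any of these alarms, you can force one into the ALARM state to verify the SNS wiring end to end; CloudWatch flips it back on the next metric evaluation:

```shell
# Temporarily force an alarm into ALARM to test notification delivery
aws cloudwatch set-alarm-state \
  --alarm-name "api-service-high-cpu" \
  --state-value ALARM \
  --state-reason "Testing alert delivery"

# Confirm the state change (and the reason) took effect
aws cloudwatch describe-alarms \
  --alarm-names "api-service-high-cpu" \
  --query 'MetricAlarms[].[AlarmName,StateValue,StateReason]' \
  --output table
```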
7. Dashboard Creation
Dashboards provide a single-pane-of-glass view of your system:
# Create a comprehensive monitoring dashboard
aws cloudwatch put-dashboard \
--dashboard-name "api-service-overview" \
--dashboard-body '{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "CPU & Memory Utilization",
        "metrics": [
          ["AWS/ECS", "CPUUtilization", "ClusterName", "production-cluster", "ServiceName", "api-service", { "stat": "Average", "label": "CPU %" }],
          ["AWS/ECS", "MemoryUtilization", "ClusterName", "production-cluster", "ServiceName", "api-service", { "stat": "Average", "label": "Memory %" }]
        ],
        "period": 60,
        "yAxis": { "left": { "min": 0, "max": 100 } },
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Request Count & Errors",
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Sum", "label": "Requests" }],
          ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Sum", "label": "5XX Errors", "color": "#d62728" }]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "x": 0, "y": 6, "width": 12, "height": 6,
      "properties": {
        "title": "Healthy Hosts",
        "metrics": [
          ["AWS/ApplicationELB", "HealthyHostCount", "TargetGroup", "targetgroup/api-targets/0987654321", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Average", "label": "Healthy" }],
          ["AWS/ApplicationELB", "UnHealthyHostCount", "TargetGroup", "targetgroup/api-targets/0987654321", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Average", "label": "Unhealthy", "color": "#d62728" }]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "x": 12, "y": 6, "width": 12, "height": 6,
      "properties": {
        "title": "Response Time (p50, p90, p99)",
        "metrics": [
          ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234567890", { "stat": "p50", "label": "p50" }],
          ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234567890", { "stat": "p90", "label": "p90" }],
          ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234567890", { "stat": "p99", "label": "p99", "color": "#d62728" }]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "log",
      "x": 0, "y": 12, "width": 24, "height": 6,
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE \u0027/ecs/api-service\u0027 | fields @timestamp, @message | filter @message like /error/i | sort @timestamp desc | limit 20",
        "region": "us-east-1",
        "view": "table"
      }
    }
  ]
}'
8. Tracking AI Usage and System Events
When your microservices use AI/LLM APIs, you need specialized observability:
// AI usage event logging
function logAIEvent(event) {
  const logEntry = {
    level: 'info',
    category: 'ai-usage',
    timestamp: new Date().toISOString(),
    service: 'ai-service',
    ...event
  };
  console.log(JSON.stringify(logEntry));
}
// Example: Track every LLM call
app.post('/api/generate', async (req, res) => {
  const startTime = Date.now();
  const requestId = req.requestId;
  logAIEvent({
    event: 'llm_request_started',
    requestId,
    model: 'gpt-4o',
    promptLength: req.body.prompt.length,
    userId: req.user?.id
  });
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: req.body.prompt }]
    });
    const duration = Date.now() - startTime;
    logAIEvent({
      event: 'llm_request_completed',
      requestId,
      model: 'gpt-4o',
      durationMs: duration,
      inputTokens: response.usage.prompt_tokens,
      outputTokens: response.usage.completion_tokens,
      totalTokens: response.usage.total_tokens,
      estimatedCost: calculateCost('gpt-4o', response.usage.prompt_tokens, response.usage.completion_tokens),
      finishReason: response.choices[0].finish_reason
    });
    res.json({ result: response.choices[0].message.content });
  } catch (err) {
    logAIEvent({
      event: 'llm_request_failed',
      requestId,
      model: 'gpt-4o',
      durationMs: Date.now() - startTime,
      errorType: err.code || err.constructor.name,
      errorMessage: err.message,
      retryable: err.status === 429 || err.status >= 500
    });
    res.status(500).json({ error: 'AI generation failed' });
  }
});
CloudWatch Logs Insights queries for AI tracking
# Total AI spend in the last 24 hours
SOURCE '/ecs/ai-service'
| filter category = "ai-usage" and event = "llm_request_completed"
| stats sum(estimatedCost) as totalCost, count() as totalCalls

# Average latency by model
SOURCE '/ecs/ai-service'
| filter category = "ai-usage" and event = "llm_request_completed"
| stats avg(durationMs) as avgLatency, pct(durationMs, 99) as p99Latency, count() as calls by model

# Error rate in 5-minute buckets
SOURCE '/ecs/ai-service'
| filter category = "ai-usage"
| stats count() as total,
        sum(strcontains(event, "llm_request_failed")) as errors
  by bin(5m)

# Top users by AI spend
SOURCE '/ecs/ai-service'
| filter category = "ai-usage" and event = "llm_request_completed"
| stats sum(estimatedCost) as spend, count() as calls by userId
| sort spend desc
| limit 20
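The same queries can run headless from the CLI: `start-query` returns a query ID and `get-query-results` is polled until the status is `Complete`. Note the `SOURCE` line is dropped here, because the log group is passed as a flag:

```shell
# Kick off a Logs Insights query over the last 24 hours and capture the query ID
QUERY_ID=$(aws logs start-query \
  --log-group-name /ecs/ai-service \
  --start-time $(date -u -d '24 hours ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'filter category = "ai-usage" and event = "llm_request_completed" | stats sum(estimatedCost) as totalCost, count() as totalCalls' \
  --output text --query queryId)

# Poll until "status" is Complete, then read the result rows
aws logs get-query-results --query-id "$QUERY_ID"
```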
9. Simulating Service Failures and Observing Resilience
Testing your monitoring is as important as setting it up. Here are techniques to simulate failures:
Simulate high CPU
// Endpoint to artificially spike CPU (ONLY in staging/test)
if (process.env.NODE_ENV !== 'production') {
  app.post('/debug/stress-cpu', (req, res) => {
    const durationMs = req.body.durationMs || 30000;
    const end = Date.now() + durationMs;
    // Spin the CPU in short synchronous bursts
    const interval = setInterval(() => {
      if (Date.now() >= end) {
        clearInterval(interval);
        return;
      }
      // CPU-intensive work
      let x = 0;
      for (let i = 0; i < 1e7; i++) x += Math.random();
    }, 10);
    res.json({ message: `CPU stress for ${durationMs}ms` });
  });
}
Simulate dependency failure
// Temporarily block database connections (ONLY in staging/test)
if (process.env.NODE_ENV !== 'production') {
  let simulateDbFailure = false;

  app.post('/debug/simulate-db-failure', (req, res) => {
    simulateDbFailure = req.body.enabled;
    res.json({ simulateDbFailure });
  });

  // Middleware: short-circuit requests as if the database were down
  app.use((req, res, next) => {
    if (simulateDbFailure && !req.path.startsWith('/debug')) {
      return res.status(503).json({ error: 'Database unavailable (simulated)' });
    }
    next();
  });
}
Simulate memory leak
// Gradually consume memory (ONLY in staging/test)
if (process.env.NODE_ENV !== 'production') {
  const leakedArrays = [];
  let totalLeakedMB = 0;

  app.post('/debug/leak-memory', (req, res) => {
    const sizeMB = req.body.sizeMB || 50;
    // ~8 bytes per array element, so this allocates roughly sizeMB megabytes
    leakedArrays.push(new Array(sizeMB * 1024 * 1024 / 8).fill(Math.random()));
    totalLeakedMB += sizeMB;
    const mem = process.memoryUsage();
    res.json({
      leaked: `${sizeMB}MB`,
      totalLeaked: `${totalLeakedMB}MB`,
      heapUsed: `${Math.round(mem.heapUsed / 1024 / 1024)}MB`
    });
  });

  app.post('/debug/clear-leaks', (req, res) => {
    leakedArrays.length = 0;
    totalLeakedMB = 0;
    global.gc?.(); // only runs if Node was started with --expose-gc
    res.json({ message: 'Leaks cleared' });
  });
}
What to observe during chaos tests
1. Do alarms fire within expected time?
- CPU alarm: Should fire within 5 minutes of CPU > 80%
- Error alarm: Should fire within 1 minute of error spike
2. Does auto-scaling respond?
- CPU spike: New tasks should launch within 2 minutes
- CPU drop: Tasks should scale in after cooldown period
3. Do health checks detect the failure?
- DB failure: ALB should deregister unhealthy tasks within 90 seconds
- Memory spike: ECS should replace OOM-killed tasks
4. Do logs capture the right information?
- Error logs with stack traces and context
- Request IDs for tracing
- Timestamps for timeline reconstruction
5. Does the dashboard show the incident clearly?
- CPU/memory spike visible
- Error rate increase visible
- Healthy host count drop visible
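To watch the checklist above play out in real time, a small shell loop (using the alarm names created in section 6) prints state transitions while the chaos test runs:

```shell
# Watch alarm states every 30 seconds during a chaos test (Ctrl-C to stop)
while true; do
  date -u
  aws cloudwatch describe-alarms \
    --alarm-names "api-service-high-cpu" "api-service-5xx-spike" "api-service-low-healthy-hosts" \
    --query 'MetricAlarms[].[AlarmName,StateValue,StateUpdatedTimestamp]' \
    --output table
  sleep 30
done
```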
10. AWS X-Ray for Distributed Tracing (Introduction)
When a single user request passes through multiple microservices, logs alone are not enough. You need distributed tracing to follow the request across service boundaries:
User Request → API Gateway → Auth Service → User Service → DB
→ AI Service → OpenAI
→ Notification Service → SES
Without tracing: "Something is slow" — but WHICH service?
With X-Ray: Visual map showing each hop and its latency
Basic X-Ray setup in Express.js
const AWSXRay = require('aws-xray-sdk-core');
const xrayExpress = require('aws-xray-sdk-express');

// Capture all AWS SDK v2 calls. (For SDK v3 clients, wrap each client instead:
// const client = AWSXRay.captureAWSv3Client(new CloudWatchClient({ region: 'us-east-1' }));)
AWSXRay.captureAWS(require('aws-sdk'));

// Capture outbound HTTP calls
AWSXRay.captureHTTPsGlobal(require('http'));
AWSXRay.captureHTTPsGlobal(require('https'));

// Open a segment for each incoming request
app.use(xrayExpress.openSegment('api-service'));

// Your routes here...
app.get('/users/:id', async (req, res) => {
  // X-Ray automatically traces this DB call
  const user = await db.users.findById(req.params.id);

  // Add a custom subsegment with an annotation
  const segment = AWSXRay.getSegment();
  const subsegment = segment.addNewSubsegment('process-user');
  subsegment.addAnnotation('userId', req.params.id);
  subsegment.close();

  res.json(user);
});

// Close the segment after the response
app.use(xrayExpress.closeSegment());
X-Ray in ECS task definition
{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "environment": [
        { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "localhost:2000" }
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [
        { "containerPort": 2000, "protocol": "udp" }
      ],
      "cpu": 32,
      "memoryReservation": 256
    }
  ]
}
11. Key Takeaways
- CloudWatch is the foundation — metrics, logs, alarms, and dashboards give you complete visibility into ECS services.
- Custom metrics tell you what infrastructure metrics cannot — track application-specific values like AI cost, response quality, and business events.
- Structured logging (JSON) makes logs searchable and parseable — always include timestamps, request IDs, and log levels.
- Alarms must have actions — an alarm that fires but notifies nobody is useless. Connect alarms to SNS, auto-scaling, or Lambda.
- Test your monitoring — simulate failures in staging to verify that alarms fire, dashboards show the incident, and auto-scaling responds.
- X-Ray for tracing — when requests cross service boundaries, distributed tracing is the only way to diagnose latency and errors.
Explain-It Challenge
- Your dashboard shows CPU at 30% but users report slow responses. What CloudWatch metrics would you check next, and why might CPU be misleading?
- Design a structured log format for an e-commerce checkout service that would let you trace any order from payment to fulfillment across 4 microservices.
- An alarm fires at 3 AM for high error rates. Walk through the exact steps you would take using CloudWatch to diagnose the issue.
Navigation: ← 6.4.b Health Checks · 6.4 Overview →