Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.4 — Distributed Observability and Scaling
6.4.c — CloudWatch Monitoring
In one sentence: Amazon CloudWatch is the central observability platform for AWS services, collecting metrics (numbers over time), logs (event records), and alarms (automated alerts) that give you real-time visibility into your ECS services, application behavior, and system health.
Navigation: ← 6.4.b Health Checks · 6.4 Overview →
1. CloudWatch Fundamentals
CloudWatch has four core building blocks. Every observability strategy starts here:
┌──────────────────────────────────────────────────────────┐
│                CLOUDWATCH BUILDING BLOCKS                │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │   METRICS    │  │     LOGS     │  │    ALARMS    │    │
│  │              │  │              │  │              │    │
│  │  Numbers     │  │ Text events  │  │  Threshold   │    │
│  │  over time   │  │ structured   │  │  watchers    │    │
│  │              │  │ or free-form │  │              │    │
│  │  CPU: 72%    │  │ {"level":    │  │ IF CPU>80%   │    │
│  │ Memory: 45%  │  │  "error",    │  │ THEN notify  │    │
│  │ Requests: 1k │  │  "msg":...}  │  │  + scale     │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘    │
│         │                 │                 │            │
│         └─────────────────┼─────────────────┘            │
│                           ▼                              │
│                   ┌──────────────┐                       │
│                   │  DASHBOARDS  │                       │
│                   │              │                       │
│                   │   Visual     │                       │
│                   │  real-time   │                       │
│                   │   overview   │                       │
│                   └──────────────┘                       │
└──────────────────────────────────────────────────────────┘
| Component | What It Does | Example |
|---|---|---|
| Metrics | Time-series numerical data points | CPU utilization at 72% at 14:30:00 |
| Logs | Text records from applications and services | {"level":"error","message":"DB connection failed"} |
| Alarms | Watch a metric and trigger actions when thresholds are breached | Send SNS notification when CPU > 80% for 5 minutes |
| Dashboards | Visual panels combining metrics and logs | Real-time view of all ECS services |
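Each building block also has a direct CLI entry point, which is a quick way to poke at an account before building anything. A minimal tour (resource names below are placeholders, not from this guide's examples):

```shell
# Metrics: push one custom datapoint into a test namespace
aws cloudwatch put-metric-data \
  --namespace MyApp/Test --metric-name Heartbeat --value 1

# Logs: follow a log group in real time
aws logs tail /ecs/api-service --follow

# Alarms: list everything currently in the ALARM state
aws cloudwatch describe-alarms --state-value ALARM

# Dashboards: list what exists in the account
aws cloudwatch list-dashboards
```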
2. ECS Metrics in CloudWatch
ECS automatically publishes metrics to CloudWatch. These are the metrics you will use daily:
Service-level metrics (namespace: AWS/ECS)
| Metric | Description | Typical Alert Threshold |
|---|---|---|
| CPUUtilization | Percentage of reserved CPU that is in use | > 80% |
| MemoryUtilization | Percentage of reserved memory that is in use | > 85% |
Viewing ECS metrics via CLI
# Get CPU utilization for an ECS service over the last hour
# (the -v-1H flag is BSD/macOS date; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum
# Get memory utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name MemoryUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum
ALB-related metrics (namespace: AWS/ApplicationELB)
| Metric | Description | What It Tells You |
|---|---|---|
| RequestCount | Total requests received | Traffic volume |
| TargetResponseTime | Time for target to respond | Application latency |
| HTTPCode_Target_2XX_Count | Successful responses | Happy-path volume |
| HTTPCode_Target_5XX_Count | Server error responses | Error rate |
| HealthyHostCount | Number of healthy targets | Availability |
| UnHealthyHostCount | Number of unhealthy targets | Problem detection |
# Get 5XX error count over the last hour
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Sum
3. Custom Metrics from Node.js Applications
Built-in ECS metrics tell you about infrastructure. Custom metrics tell you about your application:
Publishing custom metrics with the AWS SDK
const { CloudWatchClient, PutMetricDataCommand } = require('@aws-sdk/client-cloudwatch');
const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

// Helper to publish a custom metric
async function publishMetric(metricName, value, unit = 'Count', dimensions = []) {
  const command = new PutMetricDataCommand({
    Namespace: 'MyApp/API',
    MetricData: [
      {
        MetricName: metricName,
        Value: value,
        Unit: unit,
        Timestamp: new Date(),
        Dimensions: [
          { Name: 'Environment', Value: process.env.NODE_ENV || 'development' },
          { Name: 'ServiceName', Value: 'api-service' },
          ...dimensions
        ]
      }
    ]
  });
  try {
    await cloudwatch.send(command);
  } catch (err) {
    console.error('Failed to publish metric:', err.message);
    // Don't throw — metric publishing should never break the request
  }
}
// Track API response times
app.use(async (req, res, next) => {
  const start = Date.now();
  res.on('finish', async () => {
    const duration = Date.now() - start;
    const route = req.route?.path || req.path;
    await publishMetric('ResponseTime', duration, 'Milliseconds', [
      { Name: 'Route', Value: route },
      { Name: 'Method', Value: req.method },
      { Name: 'StatusCode', Value: String(res.statusCode) }
    ]);
    // Track error rates
    if (res.statusCode >= 500) {
      await publishMetric('ServerErrors', 1, 'Count', [
        { Name: 'Route', Value: route }
      ]);
    }
  });
  next();
});
Tracking AI usage metrics
// Track OpenAI / LLM API usage
async function callLLM(prompt, model = 'gpt-4o') {
  const start = Date.now();
  try {
    const response = await openai.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
      temperature: 0.7
    });
    const duration = Date.now() - start;
    const inputTokens = response.usage.prompt_tokens;
    const outputTokens = response.usage.completion_tokens;

    // Publish AI-specific metrics
    await Promise.all([
      publishMetric('LLM_Latency', duration, 'Milliseconds', [
        { Name: 'Model', Value: model }
      ]),
      publishMetric('LLM_InputTokens', inputTokens, 'Count', [
        { Name: 'Model', Value: model }
      ]),
      publishMetric('LLM_OutputTokens', outputTokens, 'Count', [
        { Name: 'Model', Value: model }
      ]),
      publishMetric('LLM_TotalCost', calculateCost(model, inputTokens, outputTokens), 'None', [
        { Name: 'Model', Value: model }
      ])
    ]);

    return response.choices[0].message.content;
  } catch (err) {
    await publishMetric('LLM_Errors', 1, 'Count', [
      { Name: 'Model', Value: model },
      { Name: 'ErrorType', Value: err.code || 'unknown' }
    ]);
    throw err;
  }
}

function calculateCost(model, inputTokens, outputTokens) {
  // USD per token (illustrative prices; check current provider pricing)
  const pricing = {
    'gpt-4o': { input: 2.50 / 1_000_000, output: 10.00 / 1_000_000 },
    'gpt-4o-mini': { input: 0.15 / 1_000_000, output: 0.60 / 1_000_000 }
  };
  const p = pricing[model] || pricing['gpt-4o'];
  return (inputTokens * p.input) + (outputTokens * p.output);
}
Batching metrics for efficiency
// Batch metrics to reduce API calls (publish every 60 seconds)
class MetricBatcher {
  constructor(namespace, flushIntervalMs = 60000) {
    this.namespace = namespace;
    this.buffer = [];
    this.cloudwatch = new CloudWatchClient({ region: 'us-east-1' });
    // unref() so the flush timer doesn't keep the process alive at shutdown
    setInterval(() => this.flush(), flushIntervalMs).unref();
  }

  add(metricName, value, unit = 'Count', dimensions = []) {
    this.buffer.push({
      MetricName: metricName,
      Value: value,
      Unit: unit,
      Timestamp: new Date(),
      Dimensions: dimensions
    });
  }

  async flush() {
    if (this.buffer.length === 0) return;
    // CloudWatch accepts at most 1000 metric data points per call
    const batches = [];
    for (let i = 0; i < this.buffer.length; i += 1000) {
      batches.push(this.buffer.slice(i, i + 1000));
    }
    this.buffer = [];
    for (const batch of batches) {
      try {
        await this.cloudwatch.send(new PutMetricDataCommand({
          Namespace: this.namespace,
          MetricData: batch
        }));
      } catch (err) {
        console.error('Failed to flush metrics:', err.message);
      }
    }
  }
}

const metrics = new MetricBatcher('MyApp/API');

// Usage in middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    metrics.add('ResponseTime', Date.now() - start, 'Milliseconds');
    metrics.add('RequestCount', 1);
    if (res.statusCode >= 500) metrics.add('5xxErrors', 1);
  });
  next();
});
4. Log Groups and Log Streams
CloudWatch Logs organize application output into a hierarchy:
CloudWatch Logs
└── Log Group: /ecs/api-service ← One per service/application
├── Log Stream: ecs/api/task-id-1 ← One per task/container
├── Log Stream: ecs/api/task-id-2
└── Log Stream: ecs/api/task-id-3
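This hierarchy can be explored from the CLI (the group name matches the example above):

```shell
# List log groups under the /ecs/ prefix
aws logs describe-log-groups --log-group-name-prefix /ecs/

# List a group's streams, most recently active first
aws logs describe-log-streams \
  --log-group-name /ecs/api-service \
  --order-by LastEventTime \
  --descending \
  --max-items 5
```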
ECS task definition log configuration
{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}
Querying logs via CLI
# View recent logs for a log group
aws logs tail /ecs/api-service --follow
# Search for errors in the last hour
# (GNU/Linux date; on macOS use: date -u -v-1H +%s000)
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--start-time $(date -d '1 hour ago' +%s000) \
--filter-pattern "ERROR"
# Search for specific request ID
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.requestId = "abc-123" }'
# Set log retention (default is forever — gets expensive)
aws logs put-retention-policy \
--log-group-name /ecs/api-service \
--retention-in-days 30
5. Structured Logging to CloudWatch
Structured logging means emitting logs as JSON objects instead of plain text. This makes them searchable, filterable, and parseable:
// BAD: Unstructured logging
console.log('User login failed for email: john@example.com');
console.log('Error: Invalid password');
// How do you search for all failed logins? All errors for a specific user?
// GOOD: Structured logging
const logger = {
  info: (message, data = {}) => {
    console.log(JSON.stringify({
      level: 'info',
      message,
      timestamp: new Date().toISOString(),
      service: process.env.SERVICE_NAME || 'api-service',
      taskId: process.env.ECS_TASK_ID || 'local',
      ...data
    }));
  },
  error: (message, error, data = {}) => {
    console.log(JSON.stringify({
      level: 'error',
      message,
      timestamp: new Date().toISOString(),
      service: process.env.SERVICE_NAME || 'api-service',
      taskId: process.env.ECS_TASK_ID || 'local',
      error: {
        name: error?.name,
        message: error?.message,
        stack: error?.stack
      },
      ...data
    }));
  },
  warn: (message, data = {}) => {
    console.log(JSON.stringify({
      level: 'warn',
      message,
      timestamp: new Date().toISOString(),
      service: process.env.SERVICE_NAME || 'api-service',
      taskId: process.env.ECS_TASK_ID || 'local',
      ...data
    }));
  }
};
// Usage
logger.info('User logged in', { userId: 'user-123', email: 'john@example.com' });
logger.error('Login failed', new Error('Invalid password'), { email: 'john@example.com' });
Request-scoped logging with correlation IDs
const { v4: uuidv4 } = require('uuid');
// Middleware: Attach request ID to every request
app.use((req, res, next) => {
  req.requestId = req.headers['x-request-id'] || uuidv4();
  res.setHeader('x-request-id', req.requestId);

  // Create a request-scoped logger
  req.log = {
    info: (message, data = {}) => logger.info(message, { requestId: req.requestId, ...data }),
    error: (message, err, data = {}) => logger.error(message, err, { requestId: req.requestId, ...data }),
    warn: (message, data = {}) => logger.warn(message, { requestId: req.requestId, ...data }),
  };

  // Log every incoming request
  req.log.info('Request received', {
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
    ip: req.ip
  });

  next();
});
// Usage in route handlers
app.get('/users/:id', async (req, res) => {
  req.log.info('Fetching user', { userId: req.params.id });
  try {
    const user = await db.users.findById(req.params.id);
    if (!user) {
      req.log.warn('User not found', { userId: req.params.id });
      return res.status(404).json({ error: 'User not found' });
    }
    req.log.info('User fetched successfully', { userId: req.params.id });
    res.json(user);
  } catch (err) {
    req.log.error('Failed to fetch user', err, { userId: req.params.id });
    res.status(500).json({ error: 'Internal server error' });
  }
});
Searching structured logs in CloudWatch
# Find all errors
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.level = "error" }'
# Find errors for a specific user
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.level = "error" && $.userId = "user-123" }'
# Find slow requests (> 1000ms)
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.responseTime > 1000 }'
# Trace a single request across all log streams
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.requestId = "abc-123-def" }'
6. Setting Up CloudWatch Alarms
Alarms watch a metric and trigger actions when thresholds are breached:
CPU alarm
# Alert when average CPU exceeds 80% for 5 minutes
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-high-cpu" \
--alarm-description "CPU utilization above 80% for 5 minutes" \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--statistic Average \
--period 60 \
--evaluation-periods 5 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789:ops-alerts \
--treat-missing-data missing
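The `--period 60 --evaluation-periods 5` pair means five consecutive one-minute datapoints must breach before the alarm fires. A minimal sketch of that default N-of-N evaluation (illustrative only: `evaluateAlarm` is not an AWS API, and the real service also supports M-out-of-N evaluation via `DatapointsToAlarm`):

```javascript
// Illustrative model of CloudWatch alarm evaluation: the alarm goes to
// ALARM only when the last `evaluationPeriods` datapoints ALL breach.
function evaluateAlarm(datapoints, threshold, evaluationPeriods) {
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.length === evaluationPeriods &&
         recent.every((v) => v > threshold);
}

console.log(evaluateAlarm([85, 90, 88, 92, 84], 80, 5)); // true: all 5 breach
console.log(evaluateAlarm([85, 90, 70, 92, 84], 80, 5)); // false: 70 does not
```

A single sub-threshold datapoint resets the streak, which is why brief CPU spikes don't page anyone.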
Error rate alarm
# Alert when 5XX errors exceed 5 in 1 minute
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-5xx-spike" \
--alarm-description "More than 5 server errors in 1 minute" \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/my-alb/1234567890 \
--statistic Sum \
--period 60 \
--evaluation-periods 1 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts \
--treat-missing-data notBreaching
Memory alarm
# Alert when memory exceeds 85%
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-high-memory" \
--alarm-description "Memory utilization above 85%" \
--namespace AWS/ECS \
--metric-name MemoryUtilization \
--dimensions Name=ClusterName,Value=production-cluster Name=ServiceName,Value=api-service \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 85 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
Healthy host count alarm
# Alert when fewer than 2 healthy tasks
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-low-healthy-hosts" \
--alarm-description "Fewer than 2 healthy tasks" \
--namespace AWS/ApplicationELB \
--metric-name HealthyHostCount \
--dimensions Name=TargetGroup,Value=targetgroup/api-targets/0987654321 Name=LoadBalancer,Value=app/my-alb/1234567890 \
--statistic Minimum \
--period 60 \
--evaluation-periods 2 \
--threshold 2 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
Custom metric alarm (LLM error rate)
# Alert when LLM errors exceed 10 in 5 minutes
# NOTE: alarm dimensions must exactly match the published metric's dimensions;
# a datapoint published with extra dimensions (e.g. Model) is a separate time series
aws cloudwatch put-metric-alarm \
--alarm-name "api-service-llm-errors" \
--alarm-description "LLM API errors spiking" \
--namespace MyApp/API \
--metric-name LLM_Errors \
--dimensions Name=Environment,Value=production Name=ServiceName,Value=api-service \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:ops-alerts
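Before trusting any of these alarms, you can force one into the ALARM state to verify the SNS wiring end to end; CloudWatch flips it back on the next metric evaluation:

```shell
# Temporarily force an alarm into ALARM to test notification delivery
aws cloudwatch set-alarm-state \
  --alarm-name "api-service-high-cpu" \
  --state-value ALARM \
  --state-reason "Testing alert delivery"

# Confirm the state change (and the reason) took effect
aws cloudwatch describe-alarms \
  --alarm-names "api-service-high-cpu" \
  --query 'MetricAlarms[].[AlarmName,StateValue,StateReason]' \
  --output table
```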
7. Dashboard Creation
Dashboards provide a single-pane-of-glass view of your system:
# Create a comprehensive monitoring dashboard
aws cloudwatch put-dashboard \
--dashboard-name "api-service-overview" \
--dashboard-body '{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "CPU & Memory Utilization",
        "metrics": [
          ["AWS/ECS", "CPUUtilization", "ClusterName", "production-cluster", "ServiceName", "api-service", { "stat": "Average", "label": "CPU %" }],
          ["AWS/ECS", "MemoryUtilization", "ClusterName", "production-cluster", "ServiceName", "api-service", { "stat": "Average", "label": "Memory %" }]
        ],
        "period": 60,
        "yAxis": { "left": { "min": 0, "max": 100 } },
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "Request Count & Errors",
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Sum", "label": "Requests" }],
          ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Sum", "label": "5XX Errors", "color": "#d62728" }]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "x": 0, "y": 6, "width": 12, "height": 6,
      "properties": {
        "title": "Healthy Hosts",
        "metrics": [
          ["AWS/ApplicationELB", "HealthyHostCount", "TargetGroup", "targetgroup/api-targets/0987654321", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Average", "label": "Healthy" }],
          ["AWS/ApplicationELB", "UnHealthyHostCount", "TargetGroup", "targetgroup/api-targets/0987654321", "LoadBalancer", "app/my-alb/1234567890", { "stat": "Average", "label": "Unhealthy", "color": "#d62728" }]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "metric",
      "x": 12, "y": 6, "width": 12, "height": 6,
      "properties": {
        "title": "Response Time (p50, p90, p99)",
        "metrics": [
          ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234567890", { "stat": "p50", "label": "p50" }],
          ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234567890", { "stat": "p90", "label": "p90" }],
          ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/1234567890", { "stat": "p99", "label": "p99", "color": "#d62728" }]
        ],
        "period": 60,
        "view": "timeSeries"
      }
    },
    {
      "type": "log",
      "x": 0, "y": 12, "width": 24, "height": 6,
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE \u0027/ecs/api-service\u0027 | fields @timestamp, @message | filter @message like /error/i | sort @timestamp desc | limit 20",
        "region": "us-east-1",
        "view": "table"
      }
    }
  ]
}'
8. Tracking AI Usage and System Events
When your microservices use AI/LLM APIs, you need specialized observability:
// AI usage event logging
function logAIEvent(event) {
  const logEntry = {
    level: 'info',
    category: 'ai-usage',
    timestamp: new Date().toISOString(),
    service: 'ai-service',
    ...event
  };
  console.log(JSON.stringify(logEntry));
}
// Example: Track every LLM call
app.post('/api/generate', async (req, res) => {
  const startTime = Date.now();
  const requestId = req.requestId;
  logAIEvent({
    event: 'llm_request_started',
    requestId,
    model: 'gpt-4o',
    promptLength: req.body.prompt.length,
    userId: req.user?.id
  });
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: req.body.prompt }]
    });
    const duration = Date.now() - startTime;
    logAIEvent({
      event: 'llm_request_completed',
      requestId,
      model: 'gpt-4o',
      durationMs: duration,
      inputTokens: response.usage.prompt_tokens,
      outputTokens: response.usage.completion_tokens,
      totalTokens: response.usage.total_tokens,
      estimatedCost: calculateCost('gpt-4o', response.usage.prompt_tokens, response.usage.completion_tokens),
      finishReason: response.choices[0].finish_reason
    });
    res.json({ result: response.choices[0].message.content });
  } catch (err) {
    logAIEvent({
      event: 'llm_request_failed',
      requestId,
      model: 'gpt-4o',
      durationMs: Date.now() - startTime,
      errorType: err.code || err.constructor.name,
      errorMessage: err.message,
      retryable: err.status === 429 || err.status >= 500
    });
    res.status(500).json({ error: 'AI generation failed' });
  }
});
CloudWatch Logs Insights queries for AI tracking
# Total AI spend in the last 24 hours
SOURCE '/ecs/ai-service'
| filter category = "ai-usage" and event = "llm_request_completed"
| stats sum(estimatedCost) as totalCost, count() as totalCalls

# Average latency by model
SOURCE '/ecs/ai-service'
| filter category = "ai-usage" and event = "llm_request_completed"
| stats avg(durationMs) as avgLatency, pct(durationMs, 99) as p99Latency, count() as calls by model

# Error rate in 5-minute buckets
SOURCE '/ecs/ai-service'
| filter category = "ai-usage"
| stats count() as total,
        sum(strcontains(event, "llm_request_failed")) as errors
  by bin(5m)

# Top users by AI spend
SOURCE '/ecs/ai-service'
| filter category = "ai-usage" and event = "llm_request_completed"
| stats sum(estimatedCost) as spend, count() as calls by userId
| sort spend desc
| limit 20
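The same queries can run headless from the CLI: `start-query` returns a query ID and `get-query-results` is polled until the status is `Complete`. Note the `SOURCE` line is dropped here, because the log group is passed as a flag:

```shell
# Kick off a Logs Insights query over the last 24 hours and capture the query ID
QUERY_ID=$(aws logs start-query \
  --log-group-name /ecs/ai-service \
  --start-time $(date -u -d '24 hours ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'filter category = "ai-usage" and event = "llm_request_completed" | stats sum(estimatedCost) as totalCost, count() as totalCalls' \
  --output text --query queryId)

# Poll until "status" is Complete, then read the result rows
aws logs get-query-results --query-id "$QUERY_ID"
```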
9. Simulating Service Failures and Observing Resilience
Testing your monitoring is as important as setting it up. Here are techniques to simulate failures:
Simulate high CPU
// Endpoint to artificially spike CPU (ONLY in staging/test)
if (process.env.NODE_ENV !== 'production') {
  app.post('/debug/stress-cpu', (req, res) => {
    const durationMs = req.body.durationMs || 30000;
    const end = Date.now() + durationMs;
    // Spin the CPU in short synchronous bursts
    const interval = setInterval(() => {
      if (Date.now() >= end) {
        clearInterval(interval);
        return;
      }
      // CPU-intensive work
      let x = 0;
      for (let i = 0; i < 1e7; i++) x += Math.random();
    }, 10);
    res.json({ message: `CPU stress for ${durationMs}ms` });
  });
}
Simulate dependency failure
// Temporarily block database connections (ONLY in staging/test)
if (process.env.NODE_ENV !== 'production') {
  let simulateDbFailure = false;

  app.post('/debug/simulate-db-failure', (req, res) => {
    simulateDbFailure = req.body.enabled;
    res.json({ simulateDbFailure });
  });

  // Middleware: short-circuit requests as if the database were down
  app.use((req, res, next) => {
    if (simulateDbFailure && !req.path.startsWith('/debug')) {
      return res.status(503).json({ error: 'Database unavailable (simulated)' });
    }
    next();
  });
}
Simulate memory leak
// Gradually consume memory (ONLY in staging/test)
if (process.env.NODE_ENV !== 'production') {
  const leakedArrays = [];
  let totalLeakedMB = 0;

  app.post('/debug/leak-memory', (req, res) => {
    const sizeMB = req.body.sizeMB || 50;
    // ~8 bytes per array element, so this allocates roughly sizeMB megabytes
    leakedArrays.push(new Array(sizeMB * 1024 * 1024 / 8).fill(Math.random()));
    totalLeakedMB += sizeMB;
    const mem = process.memoryUsage();
    res.json({
      leaked: `${sizeMB}MB`,
      totalLeaked: `${totalLeakedMB}MB`,
      heapUsed: `${Math.round(mem.heapUsed / 1024 / 1024)}MB`
    });
  });

  app.post('/debug/clear-leaks', (req, res) => {
    leakedArrays.length = 0;
    totalLeakedMB = 0;
    global.gc?.(); // only runs if Node was started with --expose-gc
    res.json({ message: 'Leaks cleared' });
  });
}
What to observe during chaos tests
1. Do alarms fire within expected time?
- CPU alarm: Should fire within 5 minutes of CPU > 80%
- Error alarm: Should fire within 1 minute of error spike
2. Does auto-scaling respond?
- CPU spike: New tasks should launch within 2 minutes
- CPU drop: Tasks should scale in after cooldown period
3. Do health checks detect the failure?
- DB failure: ALB should deregister unhealthy tasks within 90 seconds
- Memory spike: ECS should replace OOM-killed tasks
4. Do logs capture the right information?
- Error logs with stack traces and context
- Request IDs for tracing
- Timestamps for timeline reconstruction
5. Does the dashboard show the incident clearly?
- CPU/memory spike visible
- Error rate increase visible
- Healthy host count drop visible
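To watch the checklist above play out in real time, a small shell loop (using the alarm names created in section 6) prints state transitions while the chaos test runs:

```shell
# Watch alarm states every 30 seconds during a chaos test (Ctrl-C to stop)
while true; do
  date -u
  aws cloudwatch describe-alarms \
    --alarm-names "api-service-high-cpu" "api-service-5xx-spike" "api-service-low-healthy-hosts" \
    --query 'MetricAlarms[].[AlarmName,StateValue,StateUpdatedTimestamp]' \
    --output table
  sleep 30
done
```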
10. AWS X-Ray for Distributed Tracing (Introduction)
When a single user request passes through multiple microservices, logs alone are not enough. You need distributed tracing to follow the request across service boundaries:
User Request → API Gateway → Auth Service → User Service → DB
→ AI Service → OpenAI
→ Notification Service → SES
Without tracing: "Something is slow" — but WHICH service?
With X-Ray: Visual map showing each hop and its latency
Basic X-Ray setup in Express.js
const AWSXRay = require('aws-xray-sdk-core');
const xrayExpress = require('aws-xray-sdk-express');

// Capture all AWS SDK v2 calls. (For SDK v3 clients, wrap each client instead:
// const client = AWSXRay.captureAWSv3Client(new CloudWatchClient({ region: 'us-east-1' }));)
AWSXRay.captureAWS(require('aws-sdk'));

// Capture outbound HTTP calls
AWSXRay.captureHTTPsGlobal(require('http'));
AWSXRay.captureHTTPsGlobal(require('https'));

// Open a segment for each incoming request
app.use(xrayExpress.openSegment('api-service'));

// Your routes here...
app.get('/users/:id', async (req, res) => {
  // X-Ray automatically traces this DB call
  const user = await db.users.findById(req.params.id);

  // Add a custom subsegment with an annotation
  const segment = AWSXRay.getSegment();
  const subsegment = segment.addNewSubsegment('process-user');
  subsegment.addAnnotation('userId', req.params.id);
  subsegment.close();

  res.json(user);
});

// Close the segment after the response
app.use(xrayExpress.closeSegment());
X-Ray in ECS task definition
{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "environment": [
        { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "localhost:2000" }
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [
        { "containerPort": 2000, "protocol": "udp" }
      ],
      "cpu": 32,
      "memoryReservation": 256
    }
  ]
}
11. Key Takeaways
- CloudWatch is the foundation — metrics, logs, alarms, and dashboards give you complete visibility into ECS services.
- Custom metrics tell you what infrastructure metrics cannot — track application-specific values like AI cost, response quality, and business events.
- Structured logging (JSON) makes logs searchable and parseable — always include timestamps, request IDs, and log levels.
- Alarms must have actions — an alarm that fires but notifies nobody is useless. Connect alarms to SNS, auto-scaling, or Lambda.
- Test your monitoring — simulate failures in staging to verify that alarms fire, dashboards show the incident, and auto-scaling responds.
- X-Ray for tracing — when requests cross service boundaries, distributed tracing is the only way to diagnose latency and errors.
Explain-It Challenge
- Your dashboard shows CPU at 30% but users report slow responses. What CloudWatch metrics would you check next, and why might CPU be misleading?
- Design a structured log format for an e-commerce checkout service that would let you trace any order from payment to fulfillment across 4 microservices.
- An alarm fires at 3 AM for high error rates. Walk through the exact steps you would take using CloudWatch to diagnose the issue.
Navigation: ← 6.4.b Health Checks · 6.4 Overview →