Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.4 — Distributed Observability and Scaling

6.4 — Distributed Observability & Scaling: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps → reopen README.md (sections 6.4.a–6.4.c).
  3. Practice → 6.4-Exercise-Questions.md.
  4. Polish answers → 6.4-Interview-Questions.md.

Core vocabulary

Term               | One-liner
-------------------|------------------------------------------------------------
Horizontal scaling | Adding more task instances to handle load (scale out/in)
Vertical scaling   | Making a single instance bigger (scale up/down)
Desired count      | Number of tasks ECS wants running
Running count      | Number of tasks actually running right now
Target tracking    | Auto-scaling policy that maintains a target metric value (e.g., 70% CPU)
Step scaling       | Auto-scaling policy with graduated responses to metric breaches
Cooldown           | Wait period between scaling actions to prevent flapping
Stateless service  | Service that stores no client data between requests
Health check       | Automated probe verifying a task can serve requests
Liveness probe     | "Is the process alive?" (ECS container health check)
Readiness probe    | "Can this task serve traffic?" (ALB health check)
Grace period       | Time before health checks count after task start
Circuit breaker    | Pattern that fails fast when a dependency is down
CloudWatch metric  | Time-series numerical data point
CloudWatch alarm   | Threshold watcher that triggers actions
Structured logging | JSON-formatted log entries (searchable, parseable)
Correlation ID     | UUID propagated across services to trace one request
X-Ray              | AWS distributed tracing for cross-service request flows

Scaling strategies

HORIZONTAL (preferred for microservices):
  Add/remove task instances behind load balancer
  Requires: Stateless architecture
  Limit: Nearly unlimited

VERTICAL (last resort):
  Increase CPU/memory of single instance
  Requires: Nothing special
  Limit: Hardware ceiling, single point of failure

Auto-scaling policy cheat sheet

Target Tracking (recommended):
  - Set target value (e.g., 70% CPU)
  - AWS handles the math
  - Best for: Steady, gradual load changes

Step Scaling:
  - Define graduated steps (add 1/3/5 tasks at different thresholds)
  - You control the proportional response
  - Best for: Sharp spikes, variable severity

Scheduled Scaling:
  - Pre-defined capacity at specific times
  - cron expressions for recurring patterns
  - Best for: Predictable traffic (business hours, campaigns)

Scale-out vs scale-in

Scale OUT:                     Scale IN:
  Cooldown: 60s (short)          Cooldown: 300s (long)
  Risk: LOW                      Risk: MODERATE
  Speed: Fast response           Speed: Gradual
  Strategy: Be aggressive        Strategy: Be conservative

Target tracking math

New desired count = ceil(current tasks * current metric / target metric)

Example:
  5 tasks * 90% CPU / 70% target = 6.43 → round up → 7 tasks
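The rounding-and-clamping rule above can be sketched as a small helper (the function name and min/max defaults are illustrative, not an AWS API):

```javascript
// Target tracking: compute the new desired count from the current
// metric value. The result is rounded UP so the target is not
// overshot, then clamped to the registered min/max capacity.
function desiredCount(currentTasks, currentMetric, targetMetric, min = 1, max = Infinity) {
  const raw = currentTasks * (currentMetric / targetMetric);
  return Math.min(max, Math.max(min, Math.ceil(raw)));
}

console.log(desiredCount(5, 90, 70)); // 5 * 90 / 70 = 6.43 → 7
```

With a max capacity of 6, the same inputs would clamp to 6 instead of 7.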

Health check types

SHALLOW (Liveness):
  GET /health → 200 OK
  Checks: Process alive, port listening
  Speed: < 10ms
  Use for: ALB health check, container health check

DEEP (Readiness):
  GET /health/ready → checks DB, Redis, memory
  Checks: All critical dependencies
  Speed: < 2000ms (with timeouts)
  Use for: Monitoring dashboards, deployment gates

DETAILED (Diagnostics):
  GET /health/detail → full system report
  Checks: Everything + non-critical services
  Use for: Debugging, never for ALB

ALB vs ECS health check

ALB Health Check:
  Who: Load Balancer
  How: HTTP request to /health
  On failure: Remove from target group (no traffic)
  Config: path, interval, timeout, thresholds

ECS Container Health Check:
  Who: ECS Agent (Docker)
  How: Shell command inside container
  On failure: Stop and replace the task
  Config: command, interval, timeout, retries, startPeriod
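The ECS-side check is wired in the task definition; a fragment sketch using the doc's example values (the port and `/health` path are assumptions from earlier examples, and `curl` must exist in the image):

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}
```

`startPeriod` here plays the grace-period role from the next section: failures during the first 60s do not count against `retries`.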

Grace period guide

Grace period = App startup time * 2

App starts in 30s → grace period = 60s
App starts in 60s → grace period = 120s

Too short: Restart loops (tasks killed before ready)
Too long:  Broken tasks stay in rotation

CloudWatch components

METRICS:
  Built-in:  CPUUtilization, MemoryUtilization (namespace: AWS/ECS)
  ALB:       RequestCount, HTTPCode_Target_5XX_Count, TargetResponseTime (namespace: AWS/ApplicationELB)
  Custom:    ResponseTime, LLM_Latency, BusinessEvents (namespace: MyApp/API)

LOGS:
  Structure: Log Group → Log Streams
  Driver:    awslogs in task definition
  Format:    JSON (structured) > plain text
  Retention: Set explicitly (default = forever = expensive)
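The awslogs driver above is configured per container in the task definition; a fragment sketch (the group name matches the CLI examples later in this sheet, the region and prefix are placeholders):

```json
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/api-service",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "api"
    }
  }
}
```

Retention is not set here — apply it separately with `put-retention-policy` (see the CLI commands section), or the group keeps logs forever.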

ALARMS:
  States:    OK, ALARM, INSUFFICIENT_DATA (three independent states, not a sequence)
  Actions:   SNS notification, auto-scaling, Lambda
  Config:    metric, threshold, period, evaluation periods, comparison operator

DASHBOARDS:
  Widgets:   Line charts (metrics), tables (logs), numbers (single stat)
  Scope:     One overview + one per service

Auto-scaling configuration checklist

# 1. Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/CLUSTER/SERVICE \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 20

# 2. Create CPU target tracking policy
aws application-autoscaling put-scaling-policy \
  --policy-name cpu-target-tracking \
  --service-namespace ecs \
  --resource-id service/CLUSTER/SERVICE \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

# 3. Verify
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs

Structured logging template

// Minimum viable structured log
console.log(JSON.stringify({
  level: 'info|warn|error',
  message: 'What happened',
  timestamp: new Date().toISOString(),
  service: 'service-name',
  requestId: 'uuid',
  taskId: process.env.ECS_TASK_ID // not injected by ECS itself; set it yourself or read the task metadata endpoint
}));

Monitoring checklist (day-one setup)

Alarm             | Metric                     | Threshold        | Action
------------------|----------------------------|------------------|------------------
High CPU          | CPUUtilization             | > 80% for 5 min  | SNS + auto-scale
High Memory       | MemoryUtilization          | > 85% for 5 min  | SNS
Error spike       | HTTPCode_Target_5XX_Count  | > 5/min          | SNS (page on-call)
Low healthy hosts | HealthyHostCount           | < 2              | SNS (page on-call)
High latency      | TargetResponseTime p99     | > 3 sec          | SNS
LLM errors        | Custom LLM_Errors          | > 10/5 min       | SNS

Essential CloudWatch CLI commands

# Tail logs in real-time
aws logs tail /ecs/api-service --follow

# Search for errors
aws logs filter-log-events \
  --log-group-name /ecs/api-service \
  --filter-pattern '{ $.level = "error" }'

# Trace a request across all streams
aws logs filter-log-events \
  --log-group-name /ecs/api-service \
  --filter-pattern '{ $.requestId = "abc-123" }'

# Get CPU stats
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=api \
  --period 300 --statistics Average Maximum \
  --start-time ... --end-time ...

# Set log retention
aws logs put-retention-policy \
  --log-group-name /ecs/api-service \
  --retention-in-days 30

Common gotchas

Gotcha                              | Why
------------------------------------|-----------------------------------------------------------
In-memory sessions break scaling    | Load balancer sends requests to different tasks
Health check grace period too short | Tasks killed before they finish starting
Deep health check on ALB            | Slow dependency = ALL tasks marked unhealthy = full outage
No correlation IDs                  | Impossible to trace requests across microservices
Default log retention (forever)     | CloudWatch Logs costs grow unbounded
CPU-based scaling for I/O workloads | AI/API calls use no CPU while waiting — misleading metric
Scale-in cooldown too short         | Flapping: add task, remove task, add task repeatedly
No min capacity                     | Auto-scaling can scale to 0 during brief low-traffic periods
Unstructured logs                   | Cannot search or filter effectively in CloudWatch
Alarms without actions              | Alert fires but nobody is notified

Quick decision tree

Q: Which scaling policy?
  Steady gradual load → Target Tracking
  Sharp unpredictable spikes → Step Scaling
  Predictable patterns → Scheduled + Target Tracking

Q: Which health check depth?
  For ALB traffic routing → Shallow (fast, no dependencies)
  For monitoring dashboard → Deep (check all dependencies)
  For ECS container check → Liveness (is process alive?)

Q: Which metric to scale on?
  Compute-bound service → CPU utilization
  Memory-intensive service → Memory utilization
  Request-driven service → ALB request count per target
  AI/I/O-bound service → Custom concurrent request count

Q: Log retention period?
  Development → 7 days
  Staging → 14 days
  Production → 30-90 days
  Compliance requirements → As mandated (may be years)
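For the I/O-bound case, the custom metric is just an in-flight request counter kept in the app and published on an interval. A sketch of the counter itself (class and namespace names are illustrative; the CloudWatch publish call is left as a comment to stay self-contained):

```javascript
// In-flight request counter for I/O-bound services, where CPU stays
// low while tasks wait on LLM/API responses and is a misleading
// scaling signal.
class InFlightCounter {
  constructor() { this.count = 0; }
  enter() { this.count += 1; }
  exit() { this.count = Math.max(0, this.count - 1); }
  value() { return this.count; }
}

const inFlight = new InFlightCounter();

// Wrap each request handler so the counter always decrements,
// even when the handler throws.
async function tracked(handler) {
  inFlight.enter();
  try { return await handler(); }
  finally { inFlight.exit(); }
}

// On a ~60s interval, publish inFlight.value() as a CloudWatch
// custom metric (e.g. namespace MyApp/API, metric ConcurrentRequests)
// via the AWS SDK's PutMetricData, then target-track on it.
```

A target-tracking policy on this gauge scales tasks by concurrent work rather than by the near-zero CPU the gotcha table warns about.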

End of 6.4 quick revision.