Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.4 — Distributed Observability and Scaling

6.4 — Distributed Observability & Scaling: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps → reopen README.md (sections 6.4.a–6.4.c).
  3. Practice → 6.4-Exercise-Questions.md.
  4. Polish answers → 6.4-Interview-Questions.md.

Core vocabulary

Term               | One-liner
-------------------|------------------------------------------------------------
Horizontal scaling | Adding more task instances to handle load (scale out/in)
Vertical scaling   | Making a single instance bigger (scale up/down)
Desired count      | Number of tasks ECS wants running
Running count      | Number of tasks actually running right now
Target tracking    | Auto-scaling policy that maintains a target metric value (e.g., 70% CPU)
Step scaling       | Auto-scaling policy with graduated responses to metric breaches
Cooldown           | Wait period between scaling actions to prevent flapping
Stateless service  | Service that stores no client data between requests
Health check       | Automated probe verifying a task can serve requests
Liveness probe     | "Is the process alive?" (ECS container health check)
Readiness probe    | "Can this task serve traffic?" (ALB health check)
Grace period       | Time before health checks count after task start
Circuit breaker    | Pattern that fails fast when a dependency is down
CloudWatch metric  | Time-series numerical data point
CloudWatch alarm   | Threshold watcher that triggers actions
Structured logging | JSON-formatted log entries (searchable, parseable)
Correlation ID     | UUID propagated across services to trace one request
X-Ray              | AWS distributed tracing for cross-service request flows

Scaling strategies

HORIZONTAL (preferred for microservices):
  Add/remove task instances behind load balancer
  Requires: Stateless architecture
  Limit: Nearly unlimited

VERTICAL (last resort):
  Increase CPU/memory of single instance
  Requires: Nothing special
  Limit: Hardware ceiling, single point of failure

Auto-scaling policy cheat sheet

Target Tracking (recommended):
  - Set target value (e.g., 70% CPU)
  - AWS handles the math
  - Best for: Steady, gradual load changes

Step Scaling:
  - Define graduated steps (add 1/3/5 tasks at different thresholds)
  - You control the proportional response
  - Best for: Sharp spikes, variable severity

Scheduled Scaling:
  - Pre-defined capacity at specific times
  - cron expressions for recurring patterns
  - Best for: Predictable traffic (business hours, campaigns)

Scale-out vs scale-in

Scale OUT:                     Scale IN:
  Cooldown: 60s (short)          Cooldown: 300s (long)
  Risk: LOW                      Risk: MODERATE
  Speed: Fast response           Speed: Gradual
  Strategy: Be aggressive        Strategy: Be conservative

Target tracking math

New desired count = ceil(current tasks * current metric / target metric)

Example:
  5 tasks * 90% CPU / 70% target = 6.43 → round up → 7 tasks
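The rounding-and-clamping rule above can be sketched as a small helper (the function name and min/max defaults are illustrative, not an AWS API):

```javascript
// Target tracking: compute the new desired count from the current
// metric value. The result is rounded UP so the target is not
// overshot, then clamped to the registered min/max capacity.
function desiredCount(currentTasks, currentMetric, targetMetric, min = 1, max = Infinity) {
  const raw = currentTasks * (currentMetric / targetMetric);
  return Math.min(max, Math.max(min, Math.ceil(raw)));
}

console.log(desiredCount(5, 90, 70)); // 5 * 90 / 70 = 6.43 → 7
```

With a max capacity of 6, the same inputs would clamp to 6 instead of 7.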

Health check types

SHALLOW (Liveness):
  GET /health → 200 OK
  Checks: Process alive, port listening
  Speed: < 10ms
  Use for: ALB health check, container health check

DEEP (Readiness):
  GET /health/ready → checks DB, Redis, memory
  Checks: All critical dependencies
  Speed: < 2000ms (with timeouts)
  Use for: Monitoring dashboards, deployment gates

DETAILED (Diagnostics):
  GET /health/detail → full system report
  Checks: Everything + non-critical services
  Use for: Debugging, never for ALB

ALB vs ECS health check

ALB Health Check:
  Who: Load Balancer
  How: HTTP request to /health
  On failure: Remove from target group (no traffic)
  Config: path, interval, timeout, thresholds

ECS Container Health Check:
  Who: ECS Agent (Docker)
  How: Shell command inside container
  On failure: Stop and replace the task
  Config: command, interval, timeout, retries, startPeriod
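The ECS-side check is wired in the task definition; a fragment sketch using the doc's example values (the port and `/health` path are assumptions from earlier examples, and `curl` must exist in the image):

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}
```

`startPeriod` here plays the grace-period role from the next section: failures during the first 60s do not count against `retries`.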

Grace period guide

Grace period = App startup time * 2

App starts in 30s → grace period = 60s
App starts in 60s → grace period = 120s

Too short: Restart loops (tasks killed before ready)
Too long:  Broken tasks stay in rotation

CloudWatch components

METRICS:
  Built-in:  CPUUtilization, MemoryUtilization (namespace: AWS/ECS)
  ALB:       RequestCount, HTTPCode_Target_5XX_Count, TargetResponseTime (namespace: AWS/ApplicationELB)
  Custom:    ResponseTime, LLM_Latency, BusinessEvents (namespace: MyApp/API)

LOGS:
  Structure: Log Group → Log Streams
  Driver:    awslogs in task definition
  Format:    JSON (structured) > plain text
  Retention: Set explicitly (default = forever = expensive)
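The awslogs driver above is configured per container in the task definition; a fragment sketch (the group name matches the CLI examples later in this sheet, the region and prefix are placeholders):

```json
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/api-service",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "api"
    }
  }
}
```

Retention is not set here — apply it separately with `put-retention-policy` (see the CLI commands section), or the group keeps logs forever.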

ALARMS:
  States:    OK, ALARM, INSUFFICIENT_DATA (three independent states, not a sequence)
  Actions:   SNS notification, auto-scaling, Lambda
  Config:    metric, threshold, period, evaluation periods, comparison operator

DASHBOARDS:
  Widgets:   Line charts (metrics), tables (logs), numbers (single stat)
  Scope:     One overview + one per service

Auto-scaling configuration checklist

# 1. Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/CLUSTER/SERVICE \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 --max-capacity 20

# 2. Create CPU target tracking policy
aws application-autoscaling put-scaling-policy \
  --policy-name cpu-target-tracking \
  --service-namespace ecs \
  --resource-id service/CLUSTER/SERVICE \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

# 3. Verify
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs

Structured logging template

// Minimum viable structured log
console.log(JSON.stringify({
  level: 'info|warn|error',
  message: 'What happened',
  timestamp: new Date().toISOString(),
  service: 'service-name',
  requestId: 'uuid',
  taskId: process.env.ECS_TASK_ID // not injected by ECS itself; set it yourself or read the task metadata endpoint
}));

Monitoring checklist (day-one setup)

Alarm             | Metric                     | Threshold        | Action
------------------|----------------------------|------------------|------------------
High CPU          | CPUUtilization             | > 80% for 5 min  | SNS + auto-scale
High Memory       | MemoryUtilization          | > 85% for 5 min  | SNS
Error spike       | HTTPCode_Target_5XX_Count  | > 5/min          | SNS (page on-call)
Low healthy hosts | HealthyHostCount           | < 2              | SNS (page on-call)
High latency      | TargetResponseTime p99     | > 3 sec          | SNS
LLM errors        | Custom LLM_Errors          | > 10/5 min       | SNS

Essential CloudWatch CLI commands

# Tail logs in real-time
aws logs tail /ecs/api-service --follow

# Search for errors
aws logs filter-log-events \
  --log-group-name /ecs/api-service \
  --filter-pattern '{ $.level = "error" }'

# Trace a request across all streams
aws logs filter-log-events \
  --log-group-name /ecs/api-service \
  --filter-pattern '{ $.requestId = "abc-123" }'

# Get CPU stats
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=api \
  --period 300 --statistics Average Maximum \
  --start-time ... --end-time ...

# Set log retention
aws logs put-retention-policy \
  --log-group-name /ecs/api-service \
  --retention-in-days 30

Common gotchas

Gotcha                              | Why
------------------------------------|-----------------------------------------------------------
In-memory sessions break scaling    | Load balancer sends requests to different tasks
Health check grace period too short | Tasks killed before they finish starting
Deep health check on ALB            | Slow dependency = ALL tasks marked unhealthy = full outage
No correlation IDs                  | Impossible to trace requests across microservices
Default log retention (forever)     | CloudWatch Logs costs grow unbounded
CPU-based scaling for I/O workloads | AI/API calls use no CPU while waiting — misleading metric
Scale-in cooldown too short         | Flapping: add task, remove task, add task repeatedly
No min capacity                     | Auto-scaling can scale to 0 during brief low-traffic periods
Unstructured logs                   | Cannot search or filter effectively in CloudWatch
Alarms without actions              | Alert fires but nobody is notified

Quick decision tree

Q: Which scaling policy?
  Steady gradual load → Target Tracking
  Sharp unpredictable spikes → Step Scaling
  Predictable patterns → Scheduled + Target Tracking

Q: Which health check depth?
  For ALB traffic routing → Shallow (fast, no dependencies)
  For monitoring dashboard → Deep (check all dependencies)
  For ECS container check → Liveness (is process alive?)

Q: Which metric to scale on?
  Compute-bound service → CPU utilization
  Memory-intensive service → Memory utilization
  Request-driven service → ALB request count per target
  AI/I/O-bound service → Custom concurrent request count

Q: Log retention period?
  Development → 7 days
  Staging → 14 days
  Production → 30-90 days
  Compliance requirements → As mandated (may be years)
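For the I/O-bound case, the custom metric is just an in-flight request counter kept in the app and published on an interval. A sketch of the counter itself (class and namespace names are illustrative; the CloudWatch publish call is left as a comment to stay self-contained):

```javascript
// In-flight request counter for I/O-bound services, where CPU stays
// low while tasks wait on LLM/API responses and is a misleading
// scaling signal.
class InFlightCounter {
  constructor() { this.count = 0; }
  enter() { this.count += 1; }
  exit() { this.count = Math.max(0, this.count - 1); }
  value() { return this.count; }
}

const inFlight = new InFlightCounter();

// Wrap each request handler so the counter always decrements,
// even when the handler throws.
async function tracked(handler) {
  inFlight.enter();
  try { return await handler(); }
  finally { inFlight.exit(); }
}

// On a ~60s interval, publish inFlight.value() as a CloudWatch
// custom metric (e.g. namespace MyApp/API, metric ConcurrentRequests)
// via the AWS SDK's PutMetricData, then target-track on it.
```

A target-tracking policy on this gauge scales tasks by concurrent work rather than by the near-zero CPU the gotcha table warns about.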

End of 6.4 quick revision.