Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.4 — Distributed Observability and Scaling
6.4 — Distributed Observability & Scaling: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps — reopen README.md → 6.4.a…6.4.c.
- Practice — 6.4-Exercise-Questions.md.
- Polish answers — 6.4-Interview-Questions.md.
Core vocabulary
| Term | One-liner |
|---|---|
| Horizontal scaling | Adding more task instances to handle load (scale out/in) |
| Vertical scaling | Making a single instance bigger (scale up/down) |
| Desired count | Number of tasks ECS wants running |
| Running count | Number of tasks actually running right now |
| Target tracking | Auto-scaling policy that maintains a target metric value (e.g., 70% CPU) |
| Step scaling | Auto-scaling policy with graduated responses to metric breaches |
| Cooldown | Wait period between scaling actions to prevent flapping |
| Stateless service | Service that stores no client data between requests |
| Health check | Automated probe verifying a task can serve requests |
| Liveness probe | "Is the process alive?" (ECS container health check) |
| Readiness probe | "Can this task serve traffic?" (ALB health check) |
| Grace period | Time before health checks count after task start |
| Circuit breaker | Pattern that fails fast when a dependency is down |
| CloudWatch metric | Time-series numerical data point |
| CloudWatch alarm | Threshold watcher that triggers actions |
| Structured logging | JSON-formatted log entries (searchable, parseable) |
| Correlation ID | UUID propagated across services to trace one request |
| X-Ray | AWS distributed tracing for cross-service request flows |
Scaling strategies
HORIZONTAL (preferred for microservices):
Add/remove task instances behind load balancer
Requires: Stateless architecture
Limit: Nearly unlimited
VERTICAL (last resort):
Increase CPU/memory of single instance
Requires: Nothing special
Limit: Hardware ceiling, single point of failure
Auto-scaling policy cheat sheet
Target Tracking (recommended):
- Set target value (e.g., 70% CPU)
- AWS handles the math
- Best for: Steady, gradual load changes
Step Scaling:
- Define graduated steps (add 1/3/5 tasks at different thresholds)
- You control the proportional response
- Best for: Sharp spikes, variable severity
Scheduled Scaling:
- Pre-defined capacity at specific times
- cron expressions for recurring patterns
- Best for: Predictable traffic (business hours, campaigns)
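A scheduled policy pairs a cron expression with capacity bounds. A sketch using the same CLUSTER/SERVICE placeholders as the checklist below; the action name and capacities are illustrative:

```shell
# Weekday 08:00 UTC: raise the floor before business hours
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --scheduled-action-name business-hours-scale-up \
  --resource-id service/CLUSTER/SERVICE \
  --scalable-dimension ecs:service:DesiredCount \
  --schedule "cron(0 8 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=5,MaxCapacity=20
```

A matching evening action lowers MinCapacity again; target tracking still handles variation inside those bounds.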
Scale-out vs scale-in
| | Scale OUT | Scale IN |
|---|---|---|
| Cooldown | 60s (short) | 300s (long) |
| Risk | LOW | MODERATE |
| Speed | Fast response | Gradual |
| Strategy | Be aggressive | Be conservative |
Target tracking math
New desired count = ceil(current tasks * current metric / target metric)
Example:
5 tasks * 90% CPU / 70% target = 6.43 → round up → 7 tasks
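The rounding rule can be checked with a one-liner (function name is illustrative):

```javascript
// Target tracking: new desired count = ceil(current * metric / target).
// ECS rounds up on scale-out so the target value is not exceeded for long.
function newDesiredCount(currentTasks, currentMetric, targetMetric) {
  return Math.ceil(currentTasks * currentMetric / targetMetric);
}

console.log(newDesiredCount(5, 90, 70)); // 7 — matches the worked example
```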
Health check types
SHALLOW (Liveness):
GET /health → 200 OK
Checks: Process alive, port listening
Speed: < 10ms
Use for: ALB health check, container health check
DEEP (Readiness):
GET /health/ready → checks DB, Redis, memory
Checks: All critical dependencies
Speed: < 2000ms (with timeouts)
Use for: Monitoring dashboards, deployment gates
DETAILED (Diagnostics):
GET /health/detail → full system report
Checks: Everything + non-critical services
Use for: Debugging, never for ALB
ALB vs ECS health check
ALB Health Check:
Who: Load Balancer
How: HTTP request to /health
On failure: Remove from target group (no traffic)
Config: path, interval, timeout, thresholds
ECS Container Health Check:
Who: ECS Agent (Docker)
How: Shell command inside container
On failure: Stop and replace the task
Config: command, interval, timeout, retries, startPeriod
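In the task definition, those config fields map onto the container healthCheck block (the curl path and port are assumptions about the container):

```json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}
```

startPeriod here plays the grace-period role covered next: failures during the first 60 seconds don't count against retries.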
Grace period guide
Grace period = App startup time * 2
App starts in 30s → grace period = 60s
App starts in 60s → grace period = 120s
Too short: Restart loops (tasks killed before ready)
Too long: Broken tasks stay in rotation
CloudWatch components
METRICS:
Built-in: CPUUtilization, MemoryUtilization (namespace: AWS/ECS)
ALB: RequestCount, HTTPCode_Target_5XX_Count, TargetResponseTime (namespace: AWS/ApplicationELB)
Custom: ResponseTime, LLM_Latency, BusinessEvents (namespace: MyApp/API)
LOGS:
Structure: Log Group → Log Streams
Driver: awslogs in task definition
Format: JSON (structured) > plain text
Retention: Set explicitly (default = forever = expensive)
ALARMS:
States: OK, ALARM, INSUFFICIENT_DATA (three distinct states, not a progression)
Actions: SNS notification, auto-scaling, Lambda
Config: metric, threshold, period, evaluation periods, comparison operator
DASHBOARDS:
Widgets: Line charts (metrics), tables (logs), numbers (single stat)
Scope: One overview + one per service
Auto-scaling configuration checklist
# 1. Register scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/CLUSTER/SERVICE \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 --max-capacity 20
# 2. Create CPU target tracking policy
aws application-autoscaling put-scaling-policy \
--policy-name cpu-target-tracking \
--service-namespace ecs \
--resource-id service/CLUSTER/SERVICE \
--scalable-dimension ecs:service:DesiredCount \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}'
# 3. Verify
aws application-autoscaling describe-scaling-policies \
--service-namespace ecs
Structured logging template
// Minimum viable structured log
console.log(JSON.stringify({
level: 'info|warn|error',
message: 'What happened',
timestamp: new Date().toISOString(),
service: 'service-name',
requestId: 'uuid',
taskId: process.env.ECS_TASK_ID
}));
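The template scales better wrapped in a small factory so the shared fields are set once per request. A sketch; the service name and request ID are illustrative:

```javascript
// Logger factory that stamps every entry with the shared fields from the
// template above, so call sites only supply message and extras.
function makeLogger(service, requestId) {
  const emit = (level) => (message, extra = {}) => {
    const entry = {
      level,
      message,
      timestamp: new Date().toISOString(),
      service,
      requestId,
      taskId: process.env.ECS_TASK_ID, // same env var as the template
      ...extra,
    };
    console.log(JSON.stringify(entry));
    return entry; // returned for tests and middleware
  };
  return { info: emit('info'), warn: emit('warn'), error: emit('error') };
}

// One logger per request, created where the correlation ID first arrives.
const log = makeLogger('api-service', 'abc-123');
log.info('request received', { path: '/users' });
```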
Monitoring checklist (day-one setup)
| Alarm | Metric | Threshold | Action |
|---|---|---|---|
| High CPU | CPUUtilization | > 80% for 5 min | SNS + auto-scale |
| High Memory | MemoryUtilization | > 85% for 5 min | SNS |
| Error spike | HTTPCode_Target_5XX_Count | > 5/min | SNS (page on-call) |
| Low healthy hosts | HealthyHostCount | < 2 | SNS (page on-call) |
| High latency | TargetResponseTime p99 | > 3 sec | SNS |
| LLM errors | Custom LLM_Errors | > 10/5 min | SNS |
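The first row of the table translates to a put-metric-alarm call like this (cluster/service names and the SNS topic ARN are placeholders):

```shell
# High CPU: > 80% average over one 5-minute period
aws cloudwatch put-metric-alarm \
  --alarm-name api-high-cpu \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=api \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```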
Essential CloudWatch CLI commands
# Tail logs in real-time
aws logs tail /ecs/api-service --follow
# Search for errors
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.level = "error" }'
# Trace a request across all streams
aws logs filter-log-events \
--log-group-name /ecs/api-service \
--filter-pattern '{ $.requestId = "abc-123" }'
# Get CPU stats
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS --metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=api \
--period 300 --statistics Average Maximum \
--start-time ... --end-time ...
# Set log retention
aws logs put-retention-policy \
--log-group-name /ecs/api-service \
--retention-in-days 30
Common gotchas
| Gotcha | Why |
|---|---|
| In-memory sessions break scaling | Load balancer sends requests to different tasks |
| Health check grace period too short | Tasks killed before they finish starting |
| Deep health check on ALB | Slow dependency = ALL tasks marked unhealthy = full outage |
| No correlation IDs | Impossible to trace requests across microservices |
| Default log retention (forever) | CloudWatch Logs costs grow unbounded |
| CPU-based scaling for I/O workloads | AI/API calls use no CPU while waiting — misleading metric |
| Scale-in cooldown too short | Flapping: add task, remove task, add task repeatedly |
| No min capacity | Auto-scaling can scale to 0 during brief low-traffic periods |
| Unstructured logs | Cannot search or filter effectively in CloudWatch |
| Alarms without actions | Alert fires but nobody is notified |
Quick decision tree
Q: Which scaling policy?
Steady gradual load → Target Tracking
Sharp unpredictable spikes → Step Scaling
Predictable patterns → Scheduled + Target Tracking
Q: Which health check depth?
For ALB traffic routing → Shallow (fast, no dependencies)
For monitoring dashboard → Deep (check all dependencies)
For ECS container check → Liveness (is process alive?)
Q: Which metric to scale on?
Compute-bound service → CPU utilization
Memory-intensive service → Memory utilization
Request-driven service → ALB request count per target
AI or I/O-bound service → Custom concurrent request count
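For the AI/I/O-bound case, the app has to produce the scaling metric itself. A minimal in-flight counter sketch; the CloudWatch publish step is left as a comment since it needs the AWS SDK and credentials, and the metric name is an assumption:

```javascript
// Gauge of in-flight requests: for I/O-bound services this tracks real load,
// unlike CPU, which sits idle while awaiting AI/API responses.
let inFlight = 0;

function trackRequest(handler) {
  return async (...args) => {
    inFlight += 1;
    try {
      return await handler(...args);
    } finally {
      inFlight -= 1;
    }
  };
}

// Every 60s, publish the gauge as a custom metric — e.g. with the AWS SDK:
// cloudwatch.putMetricData({ Namespace: 'MyApp/API', MetricData: [
//   { MetricName: 'ConcurrentRequests', Value: inFlight }] })
```

A target tracking policy on that custom metric then scales on queued work rather than idle CPU.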
Q: Log retention period?
Development → 7 days
Staging → 14 days
Production → 30-90 days
Compliance requirements → As mandated (may be years)
End of 6.4 quick revision.