Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.4 — Distributed Observability and Scaling
6.4 — Exercise Questions: Distributed Observability & Scaling
Practice questions for all three subtopics in Section 6.4. Mix of conceptual, configuration, troubleshooting, and design questions.
How to use this material (instructions)
- Read lessons in order — README.md, then 6.4.a → 6.4.c.
- Answer closed-book first — then compare to the matching lesson.
- Try the AWS CLI commands — experiment in a sandbox AWS account.
- Interview prep — 6.4-Interview-Questions.md.
- Quick review — 6.4-Quick-Revision.md.
6.4.a — Horizontal Scaling with ECS (Q1–Q12)
Q1. Explain the difference between horizontal scaling and vertical scaling. Why is horizontal scaling preferred for microservices?
Q2. An ECS service has desiredCount: 4, runningCount: 3, pendingCount: 0. What is wrong and what will ECS do about it?
Q3. What is the difference between target tracking and step scaling auto-scaling policies? When would you choose each?
Q4. Write the AWS CLI command to register an ECS service as a scalable target with a minimum of 2 tasks and maximum of 15 tasks.
Q5. Your auto-scaling target tracking is set to 70% CPU. The service currently has 5 tasks averaging 90% CPU. How many tasks will AWS scale to? Show the math.
Q6. Explain why scaling in (removing tasks) should use a longer cooldown than scaling out (adding tasks). What problem does a short scale-in cooldown cause?
Q7. A developer stores user sessions in a JavaScript Map object in their Express.js server. There are 4 ECS tasks behind an ALB. What happens when the same user's requests go to different tasks? How do you fix this?
Q8. Name three approaches for managing user sessions across horizontally scaled tasks. For each, state one advantage and one disadvantage.
Q9. Your ECS tasks take 90 seconds to warm up (DB connections, cache loading, JIT optimization). How does this affect your auto-scaling strategy? Name two mitigation strategies.
Q10. Write a step scaling policy configuration (as JSON) with three steps: add 1 task when CPU is 0-10% above threshold, add 3 when 10-25% above, add 5 when 25%+ above.
Q11. What is scheduled scaling and when is it more appropriate than reactive auto-scaling? Give a real-world example.
Q12. Design exercise: Your API service handles 100 requests/second at baseline but spikes to 2000 requests/second every weekday at 9 AM. Each task handles ~200 requests/second. Design a complete auto-scaling strategy including min/max capacity, scaling policies, and scheduled actions.
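To check your answer to Q10, the step boundaries can be sanity-checked with a small helper. This is an illustrative sketch, not AWS code — the steps (0–10%, 10–25%, 25%+ above the alarm threshold) come from the question, while the function name and shape are invented for the exercise:

```javascript
// Illustrative helper for Q10: given how far CPU is above the alarm
// threshold (in percentage points), return how many tasks a step scaling
// policy with steps 0-10 → +1, 10-25 → +3, 25+ → +5 would add.
function tasksToAdd(cpuOverThreshold) {
  if (cpuOverThreshold < 0) return 0; // below threshold: no scale-out
  if (cpuOverThreshold < 10) return 1;
  if (cpuOverThreshold < 25) return 3;
  return 5;
}

console.log(tasksToAdd(5));  // 1 — small breach, gentle response
console.log(tasksToAdd(30)); // 5 — large breach, aggressive response
```

Note how the steps are contiguous and non-overlapping — AWS rejects step policies with gaps or overlaps between adjustment intervals.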
6.4.b — Health Checks and Availability (Q13–Q24)
Q13. Explain the difference between an ALB health check and an ECS container health check. What happens when each detects an unhealthy task?
Q14. What is a shallow health check and what is a deep health check? Give a code example of each.
Q15. Your health check endpoint calls mongoose.connection.db.admin().ping(). The database is under heavy load and this call takes 8 seconds. The ALB health check timeout is 5 seconds. What happens?
Q16. Explain the concept of liveness vs readiness probes. How would you implement both in an Express.js application on ECS, even though ECS does not natively separate them?
Q17. What is a health check grace period and why is it necessary? What happens if you set it too short? Too long?
Q18. After a deployment, all your ECS tasks enter a restart loop: they start, fail health checks, get replaced, start again. Name three possible causes and how to debug each.
Q19. Write an Express.js health check endpoint that checks MongoDB, Redis, and reports memory usage. It should return 200 if all critical dependencies are healthy and 503 if any critical dependency is down.
Q20. Explain the circuit breaker pattern and how it prevents cascading failures. How does it relate to health checks?
Q21. Your ECS service has 5 tasks. The database goes down. All 5 tasks fail their deep health checks. ALB marks all as unhealthy. What happens to user traffic? How would you prevent this scenario?
Q22. Write the healthCheck section of an ECS task definition JSON that uses curl to check localhost:3000/health/live, with 30-second intervals, 5-second timeout, 3 retries, and a 90-second start period.
Q23. Explain connection draining (deregistration delay). Why is it important when scaling in or replacing unhealthy tasks?
Q24. Troubleshooting: Your ECS service shows 3 running tasks, the ALB shows 3 registered targets, but only 1 is marked healthy. The other 2 show "Health checks failed with code 502". What are the possible causes?
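For Q19 and Q21, the core logic is aggregating dependency checks into a single health response. The sketch below shows only that aggregation; the dependency names and the shape of `checks` are illustrative, and a real Express handler would ping MongoDB and Redis with short timeouts before building the result:

```javascript
// Sketch for Q19/Q21: fold per-dependency check results into one
// health response. A 503 tells the ALB to stop routing to this task.
function buildHealthResponse(checks) {
  // checks: e.g. { mongodb: true, redis: false }
  const allHealthy = Object.values(checks).every(Boolean);
  return {
    statusCode: allHealthy ? 200 : 503,
    body: {
      status: allHealthy ? "healthy" : "unhealthy",
      checks,
      memoryMb: Math.round(process.memoryUsage().rss / 1024 / 1024),
    },
  };
}

console.log(buildHealthResponse({ mongodb: true, redis: true }).statusCode);  // 200
console.log(buildHealthResponse({ mongodb: false, redis: true }).statusCode); // 503
```

Keep Q21 in mind when wiring this up: if every task runs this deep check and the database goes down, the ALB marks the entire fleet unhealthy at once.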
6.4.c — CloudWatch Monitoring (Q25–Q38)
Q25. Name the four core building blocks of CloudWatch and explain what each does in one sentence.
Q26. What is the difference between a CloudWatch metric, a log, and an alarm? Give an example of each for an ECS service.
Q27. Write an AWS CLI command to get the average CPU utilization of an ECS service over the last hour, with 5-minute data points.
Q28. Explain why structured logging (JSON) is better than console.log('Error: something failed') for production monitoring. What fields should every structured log entry include?
Q29. Write a Node.js middleware function that publishes a custom CloudWatch metric for API response time on every request.
Q30. Your application uses OpenAI's API. Design a custom metrics strategy to track AI usage. What metrics would you publish, and what alarms would you set?
Q31. Write a CloudWatch alarm (AWS CLI) that fires when the 5XX error count exceeds 10 in a 5-minute window and sends a notification to an SNS topic.
Q32. What is a correlation ID (request ID)? How does it help debug issues in a distributed system with multiple microservices?
Q33. Write a CloudWatch Logs Insights query that finds the 10 slowest requests in the last hour, showing the request ID, path, and duration.
Q34. Your dashboard shows CPU at 25% but p99 response time is 5 seconds. What could explain low CPU but high latency? Name three possible causes.
Q35. Explain what AWS X-Ray does and when you would use it instead of (or in addition to) CloudWatch Logs.
Q36. Design a CloudWatch dashboard for a microservices application with 3 services (api, auth, ai-service). What widgets would you include and why?
Q37. You are setting up monitoring for a new production service. List the minimum set of alarms you would create on day one, with the metric, threshold, and action for each.
Q38. Scenario: An alarm fires at 3 AM — "5XX error rate above threshold." Walk through the exact steps you would take using CloudWatch to diagnose the root cause, from opening the console to identifying the fix.
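For Q28 and Q32, the key idea is one JSON object per log line carrying a correlation ID. A minimal sketch follows — the field names are a common convention rather than a fixed standard, and the service name is a hypothetical placeholder:

```javascript
// Sketch for Q28/Q32: emit one structured JSON log line per event, carrying
// a correlation ID so the same request can be traced across services.
function logEvent(level, message, requestId, extra = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    requestId, // correlation ID, typically propagated via an X-Request-Id header
    service: "api", // hypothetical service name
    ...extra,
  };
  console.log(JSON.stringify(entry)); // one JSON object per line parses cleanly in Logs Insights
  return entry;
}

logEvent("error", "upstream call failed", "req-123", { path: "/orders", durationMs: 5200 });
```

Because every field is machine-parseable, the Logs Insights query in Q33 can filter and sort on `durationMs` directly instead of grepping free text.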
Answer Hints
| Q | Hint |
|---|---|
| Q2 | Running + Pending < Desired means a task failed to launch. Check ECS events for placement errors or resource constraints. |
| Q5 | 5 tasks * 90% / 70% target = 6.43 → rounds up to 7 tasks |
| Q6 | Short scale-in cooldown causes "flapping" — tasks are added then removed then added repeatedly |
| Q7 | Session data is only in one task's memory. Load balancer sends request to a different task → session lost → user logged out. Fix: JWT or Redis sessions. |
| Q9 | During warm-up, new tasks cannot handle traffic. Mitigate: pre-warming, higher min capacity, predictive scaling. |
| Q15 | ALB times out after 5 seconds → marks task unhealthy → deregisters it. All tasks marked unhealthy if DB is slow for all. |
| Q18 | Common causes: (1) Health check path wrong, (2) app crashing during startup, (3) grace period too short. Debug: check ECS events, task logs, test health endpoint locally. |
| Q21 | ALB returns 502/503 to all users. Prevention: shallow health check for ALB (just check if process is alive), deep check separate. Or: circuit breaker on DB calls. |
| Q24 | Possible causes: (1) App returning errors on /health, (2) security group blocking ALB → task traffic, (3) health check path misconfigured, (4) app listening on wrong port. |
| Q34 | Low CPU + high latency = I/O bound. Causes: (1) waiting on database queries, (2) waiting on external API calls (OpenAI), (3) waiting on network I/O. CPU is idle while waiting. |
| Q37 | Minimum alarms: CPU > 80%, Memory > 85%, 5XX > threshold, HealthyHostCount < minimum, response time p99 > threshold. |
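The Q5 hint's arithmetic generalizes to a one-line formula that mirrors how target tracking converges on its target: desired = ceil(current × actual ÷ target). A small sketch, with a hypothetical function name:

```javascript
// Q5 target-tracking math: scale task count in proportion to how far
// actual utilization sits from the target, rounding up.
function desiredTaskCount(currentTasks, actualUtilization, targetUtilization) {
  return Math.ceil(currentTasks * (actualUtilization / targetUtilization));
}

console.log(desiredTaskCount(5, 90, 70)); // 5 * 90 / 70 = 6.43 → 7
```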
← Back to 6.4 — Distributed Observability & Scaling (README)