Episode 6 — Scaling & Reliability in Microservices / 6.4 — Distributed Observability and Scaling

Interview Questions: Distributed Observability & Scaling

Model answers for horizontal scaling, health checks, CloudWatch monitoring, and distributed observability in ECS-based microservices.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 6.4.a → 6.4.c.
  2. Practice out loud — definition → example → pitfall.
  3. Pair with exercises: 6.4-Exercise-Questions.md.
  4. Quick review: 6.4-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. What is horizontal scaling and why is it important?

Why interviewers ask: Tests fundamental understanding of scalability — the foundation of distributed systems.

Model answer:

Horizontal scaling means adding more instances (ECS tasks, containers, servers) to handle increased load, as opposed to vertical scaling which makes a single instance bigger. In AWS ECS, horizontal scaling adds more task copies behind a load balancer.

It matters because: (1) No ceiling — you can add as many instances as needed, while vertical scaling hits hardware limits. (2) Fault tolerance — if one task crashes, others keep serving traffic. (3) Cost efficiency — scale down during off-peak hours. (4) Zero-downtime deployments — rolling updates replace tasks one at a time while others continue serving.

The key requirement is stateless architecture — each task must be interchangeable. If a task stores session data in memory, the next request routed to a different task will fail. Solutions include JWTs for authentication, Redis for session storage, and S3 for file uploads.


Q2. What are health checks and why do distributed systems need them?

Why interviewers ask: Health checks are the difference between a system that self-heals and one that silently degrades.

Model answer:

Health checks are automated probes that verify whether a service instance can handle requests. In a distributed system, a process can be "running" but not "functioning" — the database connection may be lost, memory may be exhausted, or a dependency may be down.

There are two layers in ECS: (1) ALB health checks — the load balancer sends HTTP requests to each task's health endpoint. Unhealthy tasks are removed from the target group so they stop receiving traffic. (2) ECS container health checks — Docker-level checks that determine if the container process is alive. Failure triggers task replacement.

Health checks should be fast (under 2 seconds), reliable (no false positives), and appropriate in depth — a shallow check confirms the process is alive, while a deep check verifies critical dependencies. Best practice is using a shallow check for ALB routing and a separate deep check endpoint for monitoring dashboards.
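The "fast and reliable" requirement can be sketched as a small helper that bounds any dependency probe with a timeout — here `probe` stands in for a hypothetical call like `db.ping()`, and the probe resolves to a boolean instead of rejecting so one flaky dependency cannot crash the health handler:

```javascript
// Sketch: bound a dependency probe so the health endpoint itself can't hang.
// `probe` is any async function (e.g. a hypothetical db.ping()).
function probeWithTimeout(probe, ms = 2000) {
  const timeout = new Promise((resolve) =>
    setTimeout(() => resolve(false), ms)
  );
  // Map both errors and timeouts to `false` rather than rejecting.
  return Promise.race([probe().then(() => true, () => false), timeout]);
}
```

A deep `/health/deep` endpoint could run several of these in parallel with `Promise.all`, while the shallow ALB endpoint returns 200 without probing anything.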


Q3. What is CloudWatch and what are its core components?

Why interviewers ask: CloudWatch is the default observability platform on AWS — knowing it demonstrates operational readiness.

Model answer:

Amazon CloudWatch is AWS's monitoring and observability service with four core components:

Metrics are time-series numerical data points. ECS automatically publishes CPU and memory utilization. You can also publish custom metrics — for example, API response time, LLM token usage, or business events like order counts.

Logs are text records from applications and services. ECS tasks configured with the awslogs driver send stdout/stderr to CloudWatch Log Groups. Best practice is structured JSON logging with fields like level, timestamp, requestId, and service name.

Alarms watch a metric and trigger actions when thresholds are breached. For example, "if average CPU exceeds 80% for 5 minutes, send an SNS notification and trigger auto-scaling." Alarms have three states: OK, ALARM, and INSUFFICIENT_DATA.
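The three alarm states can be illustrated with a toy evaluator — a deliberate simplification, since real CloudWatch alarms support "M out of N" datapoint evaluation, missing-data handling, and several comparison operators:

```javascript
// Sketch: how an alarm's three states map onto recent datapoints.
// Simplified: every datapoint in the window must breach the threshold.
function alarmState(datapoints, threshold, evaluationPeriods) {
  const window = datapoints.slice(-evaluationPeriods);
  if (window.length < evaluationPeriods) return 'INSUFFICIENT_DATA';
  return window.every((v) => v > threshold) ? 'ALARM' : 'OK';
}
```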

Dashboards are visual panels combining metrics, logs, and alarms into a single view. A production dashboard typically shows CPU, memory, request count, error rate, response time percentiles, and healthy host count.


Q4. What is the difference between scaling out and scaling in?

Why interviewers ask: Many candidates only think about adding capacity — understanding scale-in shows production awareness.

Model answer:

Scaling out means adding more tasks to handle increased demand. It is relatively safe — you are adding capacity. Best practice is a short cooldown (60 seconds) so the system responds quickly to spikes.

Scaling in means removing tasks when demand decreases. It is riskier because: (1) active connections may be dropped, (2) in-flight requests may fail, and (3) premature scale-in can cause "flapping" — repeatedly adding and removing tasks.

Best practices for scale-in: Use a long cooldown (300+ seconds) to avoid flapping. Enable connection draining so in-flight requests complete before a task is removed. Set a minimum capacity (never scale to 0 for critical services). Scale in gradually — one task at a time.

The asymmetry is intentional: scaling out fast prevents outages, while scaling in slow prevents instability. A practical example: if your scale-in cooldown is 60 seconds and traffic fluctuates, you might add 3 tasks, remove 2, add 2, remove 1 — wasting resources on constant churn instead of maintaining a stable baseline.
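The asymmetric cooldowns can be sketched as a simple gate; the 60/300-second values follow the best-practice numbers above, and in real deployments they would be set on the scaling policy rather than in application code:

```javascript
// Sketch: asymmetric cooldown gate. Scale-out is allowed after a short
// cooldown, scale-in only after a long one (values assumed from the text).
const COOLDOWN_MS = { out: 60_000, in: 300_000 };

function canScale(direction, lastActionAtMs, nowMs = Date.now()) {
  return nowMs - lastActionAtMs >= COOLDOWN_MS[direction];
}
```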


Intermediate (Q5–Q8)

Q5. Explain target tracking vs step scaling policies. When do you use each?

Why interviewers ask: Tests practical knowledge of auto-scaling configuration — common in DevOps and backend roles.

Model answer:

Target tracking is the simplest and most common policy. You specify a target value (e.g., 70% CPU utilization) and AWS automatically adjusts the desired count to maintain that target. It creates and manages the underlying CloudWatch alarms automatically.

Step scaling gives you more control. You define specific steps: "if CPU is 0-10% above threshold, add 1 task; if 10-25% above, add 3; if 25%+ above, add 5." This allows proportional response to different severity levels.

When to use each:

  • Target tracking — Default choice. Works well for steady, gradual load changes. Simpler to configure and maintain. AWS handles the math.
  • Step scaling — When you need proportional responses to different severity levels. Better for workloads with sharp spikes where you want aggressive scale-out at high thresholds. More complex to tune.

In practice, most teams start with target tracking on CPU at 70% and only switch to step scaling if they have specific requirements. You can combine both with scheduled scaling for predictable patterns — for example, scaling up before a known 9 AM traffic spike.
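The step policy described above can be expressed as a small function mapping how far CPU is over the threshold onto a task delta — the step boundaries mirror the 0-10% / 10-25% / 25%+ example, and the 70% threshold is the assumed default:

```javascript
// Sketch of the step-scaling table: distance above the threshold
// determines how many tasks to add.
function stepScaleOut(cpuPercent, threshold = 70) {
  const over = cpuPercent - threshold;
  if (over <= 0) return 0; // at or below threshold: no action
  if (over < 10) return 1; // 0-10% above: add 1 task
  if (over < 25) return 3; // 10-25% above: add 3 tasks
  return 5;                // 25%+ above: add 5 tasks
}
```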


Q6. How would you implement health checks that prevent cascading failures?

Why interviewers ask: Cascading failures are a real production risk — this tests defensive engineering thinking.

Model answer:

A cascading failure occurs when one component's failure triggers failures in dependent components. A classic example: the database becomes slow, all tasks' deep health checks time out, ALB marks all tasks unhealthy, the entire service returns 502s, and upstream services that depend on it also fail.

Prevention strategies:

1. Separate critical from non-critical dependencies. Only fail the health check on truly critical dependencies (database). If Redis or a logging service is down, the task can often still serve requests with degraded functionality.

2. Circuit breaker pattern. If the database is slow, fail fast instead of waiting for timeouts. A circuit breaker tracks failure rates and "opens" (rejects calls immediately) when failures exceed a threshold. After a cooldown, it "half-opens" to test if the dependency recovered.

3. Shallow health check for ALB. Use a minimal check that confirms the process is alive. Use a separate /health/deep endpoint for monitoring that checks all dependencies but does not control traffic routing.

4. Timeouts on health check dependencies. If the health check itself calls db.ping(), use a short timeout (2 seconds). A health check that hangs for 30 seconds defeats the purpose.

5. Bulkhead pattern. Use separate connection pools with individual timeouts for each dependency. A slow database should not exhaust the connection pool needed for Redis.
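Strategy 2 can be sketched as a minimal circuit breaker. This version counts consecutive failures rather than tracking a failure rate, and the default thresholds are illustrative assumptions; production code would typically use an established library:

```javascript
// Minimal circuit-breaker sketch: opens after `maxFailures` consecutive
// failures, rejects immediately while open, half-opens after `resetMs`.
class CircuitBreaker {
  constructor(maxFailures = 5, resetMs = 30_000) {
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(fn) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error('circuit open'); // fail fast, no timeout wait
      }
      this.openedAt = null; // half-open: let one call through to test
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrapping the database client in `cb.call(...)` means a slow or dead database produces instant "circuit open" errors instead of piles of hanging requests.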


Q7. Design a structured logging strategy for a microservices application.

Why interviewers ask: Logging is the most underestimated aspect of production readiness — this separates production-experienced engineers from classroom learners.

Model answer:

Every log entry should be a JSON object with a consistent schema. Minimum required fields:

{
  "level": "info|warn|error",
  "message": "Human-readable description",
  "timestamp": "2026-04-11T14:30:00.000Z",
  "service": "api-service",
  "requestId": "uuid-for-tracing",
  "taskId": "ecs-task-id"
}

Correlation IDs (request IDs) are the most critical field. Generate a UUID at the API gateway or first service, then propagate it through all downstream service calls via HTTP headers (x-request-id). This lets you trace a single user request across all services by querying: filter requestId = "abc-123".

Log levels should be meaningful: error for things that need human attention, warn for recoverable issues, info for normal operations, debug for development only.

Context enrichment — include relevant data: user ID, route, method, response status, duration. For error logs, include the full error stack trace. For AI calls, include model, token counts, and cost.

Operational practices: Set log retention policies (30-90 days) to control costs. Use CloudWatch Logs Insights for ad-hoc queries. Create metric filters to turn log patterns into alarms (e.g., count of "database connection failed" messages).


Q8. A new deployment causes all ECS tasks to restart in a loop. How do you diagnose and fix it?

Why interviewers ask: Real-world troubleshooting scenario — tests systematic debugging under pressure.

Model answer:

Step 1: Check ECS service events.

aws ecs describe-services --cluster prod --services api --query 'services[0].events[:10]'

Look for: "service api has reached a steady state" (good) vs "task failed health checks" or "essential container exited" (bad).

Step 2: Check stopped task reasons.

aws ecs describe-tasks --cluster prod --tasks <task-arn> --query 'tasks[0].stoppedReason'

Common reasons: "Essential container in task exited" (app crash), "Task failed ELB health checks" (health check misconfigured), "OutOfMemoryError" (container exceeds memory limit).

Step 3: Check CloudWatch Logs. Look at the logs for the crashed tasks to find application-level errors — missing environment variables, failed database connections, or uncaught exceptions during startup.

Step 4: Check health check configuration. If the health check grace period is shorter than the application's startup time, the task will be killed before it is ready. Fix: increase healthCheckGracePeriodSeconds.

Step 5: Check resource limits. If the task definition allocates 512MB memory but the app needs 600MB during startup, every task will be OOM-killed. Fix: increase memory allocation in the task definition.

Immediate mitigation: Roll back to the previous task definition version while investigating:

aws ecs update-service --cluster prod --service api --task-definition api:previous-version

Advanced (Q9–Q11)

Q9. Design a complete observability strategy for a production microservices system with 5 services.

Why interviewers ask: Tests system design thinking — can you build a coherent monitoring strategy, not just individual components?

Model answer:

Layer 1: Infrastructure metrics (automatic). ECS publishes CPU, memory, and task count per service. ALB publishes request count, error rates, response times, and healthy host counts. Set alarms on all of these with appropriate thresholds.

Layer 2: Application metrics (custom). Each service publishes custom metrics to CloudWatch: request duration percentiles, error rates by route, queue depths, cache hit rates. The AI service additionally tracks: LLM latency, token usage, cost per request, and model error rates.

Layer 3: Structured logs. All services emit JSON logs with correlation IDs. A request entering the system gets a UUID that propagates to all downstream services. CloudWatch Logs Insights queries can trace any request across all 5 services.

Layer 4: Distributed tracing (X-Ray). X-Ray provides a visual service map showing dependencies and latency between services. Every inter-service call carries the trace header. This answers "which service is the bottleneck?" without manual log correlation.

Layer 5: Dashboards. One overview dashboard showing health of all 5 services. One detail dashboard per service showing its specific metrics. One business dashboard showing user-facing KPIs (request volume, success rate, p99 latency).

Layer 6: Alerting strategy. Three severity tiers: P1 (page on-call) — service down, all hosts unhealthy, error rate > 50%. P2 (Slack notification) — CPU > 80%, error rate > 5%, scaling event. P3 (daily summary) — elevated latency, non-critical dependency degraded.

Anti-patterns to avoid: Alerting on every metric (alert fatigue), no correlation IDs (impossible to debug), unstructured logs (impossible to search), no log retention policy (unlimited costs).


Q10. How would you implement zero-downtime deployments with health checks and auto-scaling?

Why interviewers ask: Combines scaling, health checks, and deployment strategy — tests holistic operational thinking.

Model answer:

Rolling deployment (default ECS strategy):

ECS replaces tasks one at a time. Configure minimumHealthyPercent: 100 and maximumPercent: 200 — meaning ECS launches a new task before stopping an old one, ensuring capacity never drops below current level.

The sequence: (1) ECS launches new task with updated code. (2) New task starts and enters health check grace period. (3) After grace period, ALB health check verifies the task is healthy. (4) ALB registers the new task and begins routing traffic. (5) ECS drains an old task (connection draining period). (6) Old task is stopped. (7) Repeat for all remaining tasks.

Critical configuration:

Health check grace period: Must exceed startup time (e.g., 120s)
Deregistration delay: Allow in-flight requests to complete (e.g., 60s)
Min healthy percent: 100% (never drop below current capacity)
Max percent: 200% (allow double capacity during deployment)
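These two percentages define a task-count band the scheduler may use mid-deployment — with 100%/200% and 4 desired tasks, ECS can run between 4 and 8 tasks. A small sketch of that arithmetic:

```javascript
// Sketch: the task-count band allowed during a rolling deployment,
// derived from minimumHealthyPercent and maximumPercent.
function deploymentBounds(desiredCount, minHealthyPercent, maxPercent) {
  return {
    min: Math.ceil((desiredCount * minHealthyPercent) / 100),
    max: Math.floor((desiredCount * maxPercent) / 100),
  };
}
```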

Auto-scaling interaction: Auto-scaling stays active during a deployment. If a deployment coincides with a traffic spike, the new tasks being launched for the deployment count toward capacity and can mask genuine load. Set deploymentConfiguration.alarms so ECS automatically rolls back if the new version causes alarm breaches.

Readiness gate: The health check endpoint should verify that the new code version's dependencies are satisfied — for example, if a new version requires a database migration, the readiness check should verify the migration has been applied before accepting traffic.

Canary alternative: For higher-risk deployments, use CodeDeploy with ECS to route a small percentage of traffic to new tasks first, monitor error rates, and automatically roll back if metrics degrade.


Q11. Your system handles AI workloads with variable latency (100ms to 30 seconds per request). How would you design scaling and monitoring differently than a typical CRUD API?

Why interviewers ask: Tests ability to adapt standard patterns to non-standard workloads — AI services have unique scaling characteristics.

Model answer:

AI workloads break several assumptions of traditional auto-scaling:

Scaling challenge: CPU utilization is misleading. A task waiting for an OpenAI API response uses almost zero CPU but cannot handle more requests. Traditional CPU-based scaling would see 10% CPU and scale in, while the tasks are actually fully occupied.

Better scaling metrics: (1) Concurrent request count — publish a custom metric tracking active requests per task. Scale when it exceeds a threshold (e.g., 5 concurrent AI requests per task). (2) ALB request count per target — scale based on how many requests each task is handling. (3) Queue depth — if using async processing, scale based on how many requests are queued.

// Custom concurrent request tracking (Express middleware sketch;
// publishMetric is a stand-in for a CloudWatch PutMetricData helper)
let activeRequests = 0;

app.use((req, res, next) => {
  activeRequests++;
  // 'close' fires even when the client aborts mid-request;
  // 'finish' does not, which would leak the counter
  res.on('close', () => {
    activeRequests--;
    publishMetric('ActiveRequests', activeRequests, 'Count');
  });
  next();
});

Health check challenge: A task processing a 30-second AI request should not fail health checks. Solution: separate the health check from request processing. The health endpoint returns 200 as long as the process is alive and connected to dependencies — it does not wait for in-flight AI requests.

Timeout configuration: ALB idle timeout must exceed the maximum expected AI response time (set to 60 seconds or higher). Connection draining timeout must also be extended.

Monitoring specifics: Track LLM-specific metrics — model latency distribution, token usage, cost per request, rate limit errors (429s), and finish reason distribution. Set alarms on LLM error rates separately from application error rates. A spike in LLM 429 errors means you are hitting rate limits and need to implement request queuing or get a higher tier.

Cost awareness: AI requests have highly variable cost. A single request might cost $0.001 (short prompt, GPT-4o-mini) or $0.50 (long context, GPT-4). Track per-request cost as a metric and set billing alarms.
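Per-request cost tracking can be sketched from token counts. The prices below are illustrative placeholders rather than current rates — a real service would load rates from configuration so they can change without a deploy:

```javascript
// Sketch: per-request LLM cost from token counts.
// Prices are assumed values in USD per 1M tokens, not current rates.
const PRICES_PER_1M = {
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
};

function requestCostUSD(model, inputTokens, outputTokens) {
  const p = PRICES_PER_1M[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

Publishing this value as a custom metric per request is what makes billing alarms and cost-per-feature dashboards possible.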


Quick-fire

| # | Question | One-line answer |
|---|----------|-----------------|
| 1 | Min capacity for a production service? | At least 2 — one task is a single point of failure |
| 2 | Target tracking vs step scaling default? | Target tracking — simpler, AWS manages the math |
| 3 | What makes a service horizontally scalable? | Statelessness — no in-memory sessions, no local files |
| 4 | ALB health check failure consequence? | Task removed from target group (no traffic) |
| 5 | ECS container health check failure consequence? | Task stopped and replaced |
| 6 | Health check grace period too short? | Restart loop — tasks killed before ready |
| 7 | Four CloudWatch components? | Metrics, Logs, Alarms, Dashboards |
| 8 | Why structured logging? | Searchable, filterable, parseable by CloudWatch Logs Insights |
| 9 | What is a correlation ID? | UUID propagated across services to trace a single request |
| 10 | X-Ray vs CloudWatch Logs? | X-Ray shows visual service map and latency; Logs show detailed event records |

← Back to 6.4 — Distributed Observability & Scaling (README)