6.3 — Exercise Questions: AWS Cloud-Native Deployment
Practice questions for all four subtopics in Section 6.3, mixing conceptual questions, CLI commands, architecture design, and troubleshooting scenarios.
How to use this material
- Read the lessons in order: README.md, then 6.3.a → 6.3.d.
- Answer closed-book first, then compare with the matching lesson.
- Try the CLI commands in an AWS sandbox account for hands-on practice.
- Interview prep: 6.3-Interview-Questions.md.
- Quick review: 6.3-Quick-Revision.md.
6.3.a — ECR and Container Images (Q1–Q10)
Q1. What is Amazon ECR? Explain in one sentence how it differs from Docker Hub.
Q2. Write the full image URI format for an ECR image. Identify each component (account ID, region, repository, tag).
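To check your Q2 answer, here is a minimal Python sketch that builds and parses the `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` format. The account ID, region, repository, and tag in the example are hypothetical placeholders. The regular expression is a simplified approximation of ECR naming rules, not AWS's official validation.

```python
import re
from typing import NamedTuple


class EcrImage(NamedTuple):
    account_id: str
    region: str
    repository: str
    tag: str

    def uri(self) -> str:
        # Registry host is per account and region; repository path and tag follow it.
        return f"{self.account_id}.dkr.ecr.{self.region}.amazonaws.com/{self.repository}:{self.tag}"


# 12-digit account ID, region, repository (may contain '/' namespaces), then ':tag'.
_ECR_URI = re.compile(
    r"^(?P<account_id>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[a-z0-9._/-]+):(?P<tag>[\w.-]+)$"
)


def parse_ecr_uri(uri: str) -> EcrImage:
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI with a tag: {uri}")
    return EcrImage(**match.groupdict())
```

For example, parsing `123456789012.dkr.ecr.us-east-1.amazonaws.com/order-service:v1.4.2` yields account `123456789012`, region `us-east-1`, repository `order-service`, and tag `v1.4.2`.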
Q3. You run docker push to ECR and get the error "no basic auth credentials". What command do you need to run first, and what does it do?
Q4. Write the complete sequence of CLI commands to: (a) create an ECR repository, (b) build a Docker image, (c) tag it for ECR, (d) authenticate, and (e) push it.
Q5. Explain multi-stage Docker builds. Why do they matter for ECR storage costs and ECS startup times? Estimate the size difference between a single-stage and multi-stage Node.js image.
Q6. Your Dockerfile starts with FROM node:20. A teammate suggests changing to FROM node:20-alpine. What are the pros and cons?
Q7. Why should production Docker images run as a non-root user? Write the Dockerfile commands to create and switch to a non-root user.
Q8. What is an ECR lifecycle policy? Write a policy that keeps only the last 5 tagged images and deletes untagged images after 24 hours.
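You can check a Q8 answer by generating the policy JSON in code. This sketch assumes release tags start with the prefix `v`, because a `tagged` rule needs a tag filter such as `tagPrefixList`. It also expresses "24 hours" as one day measured from `sinceImagePushed`. Rule priorities are evaluated from the lowest number up.

```python
import json


def ecr_lifecycle_policy(keep_tagged: int = 5, untagged_days: int = 1,
                         tag_prefixes: tuple[str, ...] = ("v",)) -> str:
    rules = [
        {
            "rulePriority": 1,  # lower numbers are evaluated first
            "description": f"Expire untagged images older than {untagged_days} day(s)",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": untagged_days,
            },
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": f"Keep only the {keep_tagged} most recent tagged images",
            "selection": {
                "tagStatus": "tagged",
                # Assumption: release tags start with these prefixes (e.g. v1.4.2).
                "tagPrefixList": list(tag_prefixes),
                "countType": "imageCountMoreThan",
                "countNumber": keep_tagged,
            },
            "action": {"type": "expire"},
        },
    ]
    return json.dumps({"rules": rules}, indent=2)
```

You could save the output as policy.json and apply it with `aws ecr put-lifecycle-policy --repository-name <repo> --lifecycle-policy-text file://policy.json`.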
Q9. What does scanOnPush=true do when creating an ECR repository? What are the severity levels for vulnerability findings?
Q10. You have 50 images in an ECR repository, each averaging 300 MB. How much does this cost per month at $0.10/GB? Now write a lifecycle policy to reduce this cost.
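The Q10 arithmetic is easy to sanity-check in a few lines. This sketch uses decimal units (1 GB = 1000 MB) to match the question's back-of-envelope framing. It assumes every image is stored in full, with no credit for layers shared between images. The "keep the 10 newest" scenario is the one used in the answer hints.

```python
def ecr_monthly_cost(image_count: int, avg_image_mb: float,
                     usd_per_gb_month: float = 0.10) -> float:
    # Decimal units; assumes no layer sharing between images.
    total_gb = image_count * avg_image_mb / 1000
    return total_gb * usd_per_gb_month


before = ecr_monthly_cost(50, 300)  # 15 GB stored
after = ecr_monthly_cost(10, 300)   # lifecycle policy keeps 10 images: 3 GB
savings = 1 - after / before        # fraction of the storage bill removed
```

Keeping 10 of 50 images removes 40/50 of the storage, so the saving is 80% ($1.50 down to $0.30 per month).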
6.3.b — ECS and Fargate (Q11–Q22)
Q11. Draw the ECS hierarchy from top to bottom: Cluster → Service → Task → Container. Explain the purpose of each level.
Q12. What is a task definition and how does it relate to a running task? Why are task definitions versioned?
Q13. Explain the difference between the executionRoleArn and taskRoleArn fields in a task definition. Give an example of what each is used for.
Q14. Write a complete task definition JSON for a Node.js service that: runs on Fargate, uses 512 CPU / 1024 MB memory, exposes port 3000, sends logs to CloudWatch, and pulls a secret from Secrets Manager.
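Here is a sketch of one possible Q14 answer, built as a Python dict and printed as JSON. Every name, ARN, and account ID is a hypothetical placeholder. Real Secrets Manager ARNs end with a random six-character suffix, so in practice you would copy the full ARN from the console.

```python
import json

# Hypothetical placeholders: replace with your own account, region, and resource names.
ACCOUNT = "123456789012"
REGION = "us-east-1"

task_definition = {
    "family": "order-service",
    "networkMode": "awsvpc",                 # required for Fargate
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "512",                            # 0.5 vCPU
    "memory": "1024",                        # 1 GB, a valid pair for 512 CPU units
    # Execution role: used by ECS to pull the image, write logs, and fetch the secret.
    "executionRoleArn": f"arn:aws:iam::{ACCOUNT}:role/ecsTaskExecutionRole",
    # Task role: used by the application code when it calls AWS APIs.
    "taskRoleArn": f"arn:aws:iam::{ACCOUNT}:role/order-service-task-role",
    "containerDefinitions": [
        {
            "name": "order-service",
            "image": f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/order-service:v1.0.0",
            "essential": True,
            "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/order-service",
                    "awslogs-region": REGION,
                    "awslogs-stream-prefix": "ecs",
                },
            },
            # Exposed to the container as the DATABASE_URL environment variable.
            "secrets": [
                {
                    "name": "DATABASE_URL",
                    # Hypothetical ARN; note the random suffix real secret ARNs carry.
                    "valueFrom": f"arn:aws:secretsmanager:{REGION}:{ACCOUNT}:secret:order-service/db-url-AbCdEf",
                }
            ],
        }
    ],
}

print(json.dumps(task_definition, indent=2))
```

The printed JSON can be saved and registered with `aws ecs register-task-definition --cli-input-json file://taskdef.json`. For the `secrets` injection to work, the execution role (not the task role) needs `secretsmanager:GetSecretValue` on that secret.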
Q15. List all valid Fargate CPU/memory combinations for 256 CPU units. What happens if you specify an invalid combination?
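A lookup table is a convenient way to remember the Fargate size rules. This sketch transcribes the Fargate task-size table for the CPU sizes from 256 to 4096 units. It deliberately leaves out the 8 and 16 vCPU sizes, so treat it as a partial reference.

```python
# Memory options (MiB) per CPU size, from the Fargate task-size table.
# The 8 and 16 vCPU sizes are left out of this sketch.
FARGATE_MEMORY_BY_CPU: dict[int, list[int]] = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: list(range(2048, 8192 + 1, 1024)),   # 2-8 GB in 1 GB steps
    2048: list(range(4096, 16384 + 1, 1024)),  # 4-16 GB in 1 GB steps
    4096: list(range(8192, 30720 + 1, 1024)),  # 8-30 GB in 1 GB steps
}


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    return memory in FARGATE_MEMORY_BY_CPU.get(cpu, [])
```

Running a check like this before registering a task definition catches a mismatch locally. ECS itself rejects an invalid pair at registration time.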
Q16. Compare Fargate and EC2 launch types across these dimensions: server management, pricing, scaling, startup time, GPU support. When would you choose each?
Q17. You create an ECS service with desired-count: 3 but only 1 task is running. List at least 4 possible causes and the CLI command to diagnose each.
Q18. Explain the rolling deployment process. If minimumHealthyPercent=100 and maximumPercent=200 with 3 desired tasks, describe the step-by-step sequence.
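To reason about Q18, you can simulate the two bounds a rolling update respects: ECS rounds minimumHealthyPercent up and maximumPercent down to whole tasks. This is a simplified model, not the ECS scheduler. It assumes new tasks become healthy instantly and ignores draining time. Its purpose is to show how the floor and ceiling shape the sequence.

```python
import math


def rolling_update(desired: int, min_healthy_pct: int, max_pct: int) -> list[tuple[str, int, int]]:
    """Return (action, task_count, total_running_after) steps for a simplified rolling update."""
    min_healthy = math.ceil(desired * min_healthy_pct / 100)  # floor on healthy tasks
    max_running = math.floor(desired * max_pct / 100)         # ceiling on running tasks
    old, new = desired, 0
    steps: list[tuple[str, int, int]] = []
    while old > 0 or new < desired:
        # Launch as many replacement tasks as the maximumPercent ceiling allows.
        launch = min(max_running - (old + new), desired - new)
        if launch > 0:
            new += launch
            steps.append(("start", launch, old + new))
        # New tasks are assumed healthy immediately; stop old tasks only while
        # the healthy count stays at or above the minimumHealthyPercent floor.
        stop = min(old, (old + new) - min_healthy)
        if stop > 0:
            old -= stop
            steps.append(("stop", stop, old + new))
        if launch <= 0 and stop <= 0:
            raise ValueError("these percentages leave no room to start or stop a task")
    return steps
```

With 3 desired tasks, 100% and 200% give `[("start", 3, 6), ("stop", 3, 3)]`: all three replacements start together, and only then are the old tasks stopped. Setting both percentages to 100 leaves no room to move, and the sketch raises an error.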
Q19. What is a deployment circuit breaker? Write the CLI command to enable it with automatic rollback.
Q20. Configure auto-scaling for an ECS service that: scales between 2 and 20 tasks, targets 70% CPU utilization, waits 60 seconds before scaling out, and waits 300 seconds before scaling in. Write all CLI commands.
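The two Application Auto Scaling inputs for Q20 can be written as data first. In this sketch the cluster and service names are hypothetical. The keys match the `RegisterScalableTarget` request and the target-tracking policy configuration. Of the two cooldowns, `ScaleOutCooldown` is the wait after one scale-out before another, and `ScaleInCooldown` is the same wait for scale-in.

```python
import json

CLUSTER = "prod-cluster"      # hypothetical names
SERVICE = "order-service"
RESOURCE_ID = f"service/{CLUSTER}/{SERVICE}"

# Input for: aws application-autoscaling register-scalable-target --cli-input-json
scalable_target = {
    "ServiceNamespace": "ecs",
    "ResourceId": RESOURCE_ID,
    "ScalableDimension": "ecs:service:DesiredCount",
    "MinCapacity": 2,
    "MaxCapacity": 20,
}

# Value for put-scaling-policy --target-tracking-scaling-policy-configuration
target_tracking = {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,   # seconds after a scale-out before the next one
    "ScaleInCooldown": 300,   # seconds after a scale-in before the next one
}

print(json.dumps(target_tracking))
```

Next, the policy is attached with `aws application-autoscaling put-scaling-policy` using `--service-namespace ecs`, `--resource-id`, `--scalable-dimension ecs:service:DesiredCount`, `--policy-name`, `--policy-type TargetTrackingScaling`, and the JSON above.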
Q21. Your ECS service is running on Fargate. A developer says "I need to SSH into the container to debug." What do you tell them? Write the commands to enable and use ECS Exec.
Q22. Explain the difference between a task (one-off) and a service (long-running). Give an example use case for each.
6.3.c — Application Load Balancer (Q23–Q32)
Q23. What OSI layer does an ALB operate at? What routing capabilities does this give it that a Network Load Balancer (Layer 4) lacks?
Q24. Define these ALB components and explain how they relate: listener, rule, target group, target.
Q25. You have three microservices: user-service, order-service, and payment-service. Design the ALB routing rules (path patterns, target groups) to serve all three from a single ALB.
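One way to test a Q25 design is to model the listener rules as a priority-ordered table. The rule with the lowest priority number that matches wins; if no rule matches, the default action applies. The target group names and the 404 default are hypothetical. `fnmatchcase` approximates ALB path patterns: `*` and `?` behave alike and matching is case-sensitive, but `fnmatch` also accepts `[...]` classes, which ALB does not support.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); hypothetical names
RULES = [
    (10, "/users/*", "user-service-tg"),
    (20, "/orders/*", "order-service-tg"),
    (30, "/payments/*", "payment-service-tg"),
]
DEFAULT_ACTION = "fixed-response-404"


def route(path: str) -> str:
    # Evaluate rules from the lowest priority number up; the first match wins.
    for _priority, pattern, target_group in sorted(RULES):
        if fnmatchcase(path, pattern):  # case-sensitive wildcard match
            return target_group
    return DEFAULT_ACTION
```

Note that `/users` alone does not match `/users/*`. If the bare path must reach the user service, add it as a second value on the same rule.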
Q26. Write the complete CLI commands to: (a) create an ALB, (b) create a target group for a Fargate service, (c) create an HTTPS listener, and (d) add a path-based routing rule.
Q27. Why must the target type be ip (not instance) when using Fargate? What would happen if you accidentally used instance?
Q28. Design a health check configuration for a Node.js API. Specify: path, interval, timeout, healthy threshold, unhealthy threshold. Justify each choice.
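When justifying Q28 choices, it helps to turn thresholds into seconds: roughly threshold x interval is how long a target takes to flip state. The values below are one hypothetical answer, not a recommendation. The validation mirrors two ALB constraints: the timeout must be shorter than the interval, and each threshold must be between 2 and 10. For comparison, the last line computes the same figure assuming the ALB defaults for an HTTP target group (30 s interval, 5 s timeout, healthy threshold 5, unhealthy threshold 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    path: str = "/health"
    interval_s: int = 15
    timeout_s: int = 5
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3

    def validate(self) -> None:
        # ALB rejects a timeout that is not shorter than the interval.
        if self.timeout_s >= self.interval_s:
            raise ValueError("timeout must be less than interval")
        for name in ("healthy_threshold", "unhealthy_threshold"):
            if not 2 <= getattr(self, name) <= 10:
                raise ValueError(f"{name} must be between 2 and 10")

    def seconds_to_healthy(self) -> int:
        # Approximate: consecutive successes needed x time between checks.
        return self.healthy_threshold * self.interval_s

    def seconds_to_unhealthy(self) -> int:
        return self.unhealthy_threshold * self.interval_s


chosen = HealthCheck()
alb_default = HealthCheck(interval_s=30, timeout_s=5, healthy_threshold=5, unhealthy_threshold=2)
print(chosen.seconds_to_healthy(), alb_default.seconds_to_healthy())
```

The chosen values put a new task into service after about 30 s, versus about 150 s with the defaults. That difference is also worth keeping in mind for the deployment symptoms in Q32.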
Q29. Your health check calls the database to verify connectivity. During a database outage, ALL tasks become unhealthy and the service goes down completely. Redesign the health check strategy to prevent this.
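The split between liveness and readiness in the answer hints can be expressed as a framework-agnostic probe handler. It is written in Python here, though a Node.js service would wire the same logic into its own HTTP framework. The function name and the `db_is_reachable` callback are hypothetical. The key property is that the path the ALB probes (`/health`) never touches the database.

```python
from typing import Callable


def handle_probe(path: str, db_is_reachable: Callable[[], bool]) -> tuple[int, str]:
    if path == "/health":
        # Liveness: the process is up and serving HTTP. No dependency calls, so a
        # database outage cannot fail every task's ALB check at the same time.
        return 200, "ok"
    if path == "/ready":
        # Readiness: checks dependencies; suited to alarms and dashboards, not the ALB.
        if db_is_reachable():
            return 200, "ready"
        return 503, "database unavailable"
    return 404, "not found"
```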
Q30. What is HTTPS termination at the ALB? Explain the traffic encryption flow from the client to the container. Why is this approach preferred over each container managing its own SSL?
Q31. What is connection draining (deregistration delay)? Why is it important during deployments? What value would you set for a REST API vs a WebSocket service?
Q32. After a deployment, your service returns intermittent 502 Bad Gateway errors for about 60 seconds, then recovers. Diagnose the likely cause related to the ALB health check timing.
6.3.d — VPC Networking and IAM (Q33–Q44)
Q33. What is a VPC? Why should you create a custom VPC for production instead of using the default VPC?
Q34. Explain the difference between public and private subnets. What determines whether a subnet is public or private? (Hint: it's about the route table.)
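The route-table rule behind Q34 fits in one function. It takes route entries shaped like the `Routes` list returned by EC2 `DescribeRouteTables` (`DestinationCidrBlock`, `GatewayId`, `NatGatewayId`). The IDs in the usage example are hypothetical. IPv6 routes are ignored in this sketch.

```python
def is_public_subnet(routes: list[dict[str, str]]) -> bool:
    """Public means the subnet's route table sends 0.0.0.0/0 to an internet gateway."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        for route in routes
    )


public_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
]
private_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0def"},
]
```

A subnet whose default route points at a NAT Gateway still has outbound internet access, but it remains private because nothing on the internet can initiate a connection into it.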
Q35. Draw a network diagram showing: VPC with 2 AZs, public subnets (with ALB and NAT Gateway), private subnets (with ECS tasks), and data subnets (with RDS). Show the traffic flow from a user to a database query.
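Before drawing the Q35 diagram, you can lay out the address plan with the standard `ipaddress` module. The VPC range, /24 subnet size, and AZ names are hypothetical choices. AWS reserves five addresses in every subnet, so each /24 leaves 251 usable IPs.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")   # hypothetical VPC range
AZS = ["us-east-1a", "us-east-1b"]          # hypothetical AZ names
TIERS = ["public", "private", "data"]       # ALB + NAT / ECS tasks / RDS


def plan_subnets() -> dict[tuple[str, str], ipaddress.IPv4Network]:
    blocks = VPC.subnets(new_prefix=24)     # consecutive /24 blocks inside the VPC
    return {(tier, az): next(blocks) for tier in TIERS for az in AZS}


for (tier, az), net in plan_subnets().items():
    print(f"{tier:8} {az}  {net}")
```

This yields public subnets 10.0.0.0/24 and 10.0.1.0/24, private subnets 10.0.2.0/24 and 10.0.3.0/24, and data subnets 10.0.4.0/24 and 10.0.5.0/24.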
Q36. What is a NAT Gateway? Why do ECS tasks in private subnets need one? Estimate the monthly cost for a NAT Gateway.
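The Q36 estimate combines an hourly charge per gateway with a per-GB processing charge. This sketch assumes the us-east-1 prices quoted in the answer hints ($0.045/hour and $0.045/GB) and a 730-hour month. It ignores separate data-transfer-out charges.

```python
HOURS_PER_MONTH = 730  # common billing approximation (8760 hours / 12)


def nat_gateway_monthly_cost(gb_processed: float, gateways: int = 1,
                             hourly_usd: float = 0.045, per_gb_usd: float = 0.045) -> float:
    # Hourly charge applies per gateway; processing charge applies per GB through any of them.
    return gateways * hourly_usd * HOURS_PER_MONTH + gb_processed * per_gb_usd


idle_single = nat_gateway_monthly_cost(0)                # one gateway, no traffic
two_az = nat_gateway_monthly_cost(100, gateways=2)       # one per AZ, 100 GB processed
```

One idle gateway costs about $32.85 per month. Running one gateway per AZ for high availability doubles the fixed charge before any data is processed.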
Q37. Write the security group rules (inbound and outbound) for: (a) the ALB security group, (b) the ECS tasks security group, (c) the RDS security group. Explain why you reference security group IDs instead of CIDR ranges.
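Here is a small model of Q37's chained inbound rules, where each tier admits traffic only from the tier in front of it. The security group IDs and ports (3000 for the app, 5432 for PostgreSQL) are hypothetical. Outbound rules are omitted; because security groups are stateful, response traffic is allowed automatically. Referencing a group ID instead of a CIDR matters with Fargate: tasks get new IPs on every deployment, but they keep the same security group.

```python
from typing import NamedTuple


class Ingress(NamedTuple):
    port: int
    source: str  # a CIDR block such as "0.0.0.0/0", or another security group ID


# Hypothetical security group IDs; each tier only trusts the tier in front of it.
INGRESS_RULES: dict[str, list[Ingress]] = {
    "sg-alb": [Ingress(443, "0.0.0.0/0")],  # HTTPS from anywhere
    "sg-ecs": [Ingress(3000, "sg-alb")],    # app port, only from the ALB's group
    "sg-rds": [Ingress(5432, "sg-ecs")],    # database port, only from ECS tasks
}


def allowed(source: str, dest_sg: str, port: int) -> bool:
    """True if source (an SG ID, or "internet") may connect to dest_sg on port.

    Simplified: a 0.0.0.0/0 rule admits any source; an SG-ID rule admits only
    resources carrying that group, whatever their current IP addresses are.
    """
    return any(
        rule.port == port and rule.source in ("0.0.0.0/0", source)
        for rule in INGRESS_RULES.get(dest_sg, [])
    )
```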
Q38. Compare security groups and Network ACLs across: stateful vs stateless, allow vs deny rules, resource level vs subnet level. When would you use a NACL in addition to security groups?
Q39. What are VPC Endpoints? List the 4-5 essential VPC endpoints for an ECS Fargate deployment and explain why each is needed.
Q40. An ECS task fails with CannotPullContainerError. The image exists in ECR. List all possible networking and IAM causes.
Q41. Explain the least privilege principle for IAM. Rewrite this overly broad policy statement to be properly scoped:
```json
{
  "Effect": "Allow",
  "Action": "s3:*",
  "Resource": "*"
}
```
The service only needs to upload user profile photos to the user-photos bucket.
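For comparison, the scoped version from the answer hints can be built as a complete policy document. The bucket name comes from the question. The sketch grants only `s3:PutObject`, and it uses the object-level ARN (with the `/*` suffix) because `PutObject` acts on objects rather than on the bucket itself.

```python
import json


def photo_upload_policy(bucket: str = "user-photos") -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],                # the single action the service needs
                "Resource": f"arn:aws:s3:::{bucket}/*",    # objects in this bucket only
            }
        ],
    }


print(json.dumps(photo_upload_policy(), indent=2))
```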
Q42. Write separate task role policies for: (a) an order-service that reads from SQS and writes to DynamoDB, and (b) an email-worker that reads from SQS and sends via SES.
Q43. Your company has a compliance requirement that no ECS task should be able to access the internet directly. Design the VPC architecture and explain how you would use VPC Endpoints to meet this requirement.
Q44. A developer accidentally attached the execution role (with ECR pull permissions) as the task role. What security risks does this create? How would you detect and prevent this?
Answer Hints
| Q | Hint |
|---|---|
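| Q3 | `aws ecr get-login-password --region <r> \| docker login --username AWS --password-stdin <account-id>.dkr.ecr.<r>.amazonaws.com` fetches a temporary auth token and passes it to docker login; the token lasts 12 hours |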
| Q5 | Single-stage ~1.5 GB vs multi-stage ~270 MB; alpine base + production deps only |
| Q10 | 50 x 300 MB = 15 GB x $0.10 = $1.50/month; a lifecycle policy keeping only the 10 newest images cuts storage to 3 GB ($0.30/month), an 80% saving |
| Q15 | 256 CPU: 512, 1024, 2048 MB only; invalid combo returns ClientException |
| Q17 | Check: stopped task reasons (describe-tasks), SG rules, subnet routes, IAM roles, resource limits |
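| Q18 | ECS may start all 3 new tasks at once (6 running, the 200% ceiling), wait for them to pass health checks, then drain and stop the 3 old tasks; healthy tasks never drop below 3 (the 100% floor) |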
| Q27 | Fargate tasks use awsvpc mode, each gets its own IP; instance target type expects EC2 instance IDs |
| Q29 | Liveness check: /health (app process alive, returns 200); Readiness check: /ready (dependencies checked); ALB uses liveness only |
| Q32 | New tasks receive traffic before the app can actually serve it (a shallow check passes early, or the ALB fails open when no target is healthy); time to healthy ≈ healthy threshold x interval, e.g. 2 x 30 s = 60 s |
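| Q36 | ~$0.045/hour per NAT Gateway (~$32.85/month in us-east-1) + $0.045/GB processed; one per AZ for high availability; needed for external API calls, and for ECR image pulls unless VPC endpoints carry that traffic |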
| Q40 | Missing execution role, missing ECR permissions, no NAT Gateway, no VPC endpoint for ECR, security group blocks outbound 443, route table misconfigured |
| Q41 | "Action": ["s3:PutObject"], "Resource": "arn:aws:s3:::user-photos/*" |