6.3 — AWS Cloud-Native Deployment: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps: reopen `README.md` → `6.3.a`…`6.3.d`.
- Practice: `6.3-Exercise-Questions.md`.
- Polish answers: `6.3-Interview-Questions.md`.
Core Vocabulary
| Term | One-liner |
|---|---|
| ECR | Managed Docker registry — private, IAM-integrated, encrypted at rest |
| ECS | Container orchestration service — runs, scales, and manages containers |
| Fargate | Serverless ECS launch type — no servers to manage |
| Cluster | Logical grouping of ECS services and tasks |
| Task Definition | JSON blueprint for how to run a container (image, CPU, memory, ports, env) |
| Task | Running instance of a task definition |
| Service | Maintains desired count of healthy tasks, handles deployments |
| ALB | Application Load Balancer — Layer 7, path-based routing, HTTPS termination |
| Target Group | Collection of targets (ECS task IPs) that receive ALB traffic |
| Health Check | ALB probe to verify a task is responsive (GET /health) |
| VPC | Virtual Private Cloud — isolated network on AWS |
| Public Subnet | Subnet with route to Internet Gateway |
| Private Subnet | Subnet with no direct internet route (NAT for outbound) |
| NAT Gateway | Allows outbound-only internet from private subnets |
| Security Group | Stateful firewall at the resource level |
| NACL | Stateless firewall at the subnet level |
| Execution Role | IAM role for ECS agent (pull images, write logs, fetch secrets) |
| Task Role | IAM role for your app code (call S3, SQS, DynamoDB, etc.) |
| VPC Endpoint | Private connection to AWS services without NAT/internet |
ECR Workflow
1. aws ecr create-repository --repository-name my-service
2. docker build -t my-service:v1.0 .
3. docker tag my-service:v1.0 <acct>.dkr.ecr.<region>.amazonaws.com/my-service:v1.0
4. aws ecr get-login-password | docker login --username AWS --password-stdin <uri>
5. docker push <ecr-uri>/my-service:v1.0
6. aws ecr describe-images --repository-name my-service
Image URI format: <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
Best practices:
- Tag with version + git SHA (never rely solely on `latest`)
- Multi-stage Dockerfile (alpine base, non-root user, .dockerignore)
- Scan on push (`scanOnPush=true`)
- Lifecycle policies (keep last 10 images, delete untagged after 1 day)
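The image URI format and the "version + git SHA" tagging rule can be sketched as a small helper. This is an illustration only: the function name, the `version-sha` tag layout, and the 7-character short SHA are assumptions, not something the workflow above prescribes.

```python
import re


def ecr_image_uri(account_id: str, region: str, repo: str,
                  version: str, git_sha: str) -> str:
    """Build <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>.

    The tag combines a version and a short git SHA (hypothetical layout)
    so every pushed image is traceable to a commit.
    """
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    if version == "latest":
        raise ValueError("use an explicit version, not the mutable 'latest' tag")
    tag = f"{version}-{git_sha[:7]}"
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"
```

For example, `ecr_image_uri("123456789012", "us-east-1", "my-service", "v1.0", "abc1234def")` yields a URI ending in `my-service:v1.0-abc1234`, which can be passed to `docker tag` and `docker push`.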
ECS Architecture
Cluster
└── Service (desired: 3, auto-scaling: 2-20)
├── Task (v1.2.3, running, 10.0.10.15:3000)
├── Task (v1.2.3, running, 10.0.20.23:3000)
└── Task (v1.2.3, running, 10.0.10.42:3000)
Task Definition (blueprint):
family: "user-service"
cpu/memory: 256/512
image: <ecr-uri>:v1.2.3
portMappings: containerPort 3000
environment: NODE_ENV=production
secrets: DATABASE_URL from SecretsManager
logConfiguration: awslogs → CloudWatch
executionRoleArn: pulls images, writes logs
taskRoleArn: app calls AWS services
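The blueprint above maps onto the JSON that `aws ecs register-task-definition --cli-input-json` expects. The sketch below builds that structure as a Python dict. The log group name `/ecs/user-service`, the `ecs` stream prefix, and the function and parameter names are assumptions made for illustration; the ARNs are supplied by the caller.

```python
import json


def build_task_definition(image_uri: str, execution_role_arn: str,
                          task_role_arn: str, database_url_secret_arn: str,
                          region: str = "us-east-1") -> dict:
    """Return a Fargate task definition matching the blueprint above."""
    return {
        "family": "user-service",
        "networkMode": "awsvpc",              # required for Fargate
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "256",                         # task-level sizes are strings
        "memory": "512",
        "executionRoleArn": execution_role_arn,  # pulls images, writes logs
        "taskRoleArn": task_role_arn,            # used by the app's AWS SDK calls
        "containerDefinitions": [{
            "name": "user-service",
            "image": image_uri,
            "essential": True,
            "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
            "environment": [{"name": "NODE_ENV", "value": "production"}],
            # Injected at start-up by the execution role, never stored in the image.
            "secrets": [{"name": "DATABASE_URL",
                         "valueFrom": database_url_secret_arn}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/user-service",  # assumed name
                    "awslogs-region": region,
                    "awslogs-stream-prefix": "ecs",        # assumed prefix
                },
            },
        }],
    }


def to_cli_json(task_definition: dict) -> str:
    """Serialize for a task-def.json file."""
    return json.dumps(task_definition, indent=2)
```

Writing the output of `to_cli_json(...)` to `task-def.json` gives a file usable with the `register-task-definition` command listed under Common CLI Commands.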
Fargate vs EC2
FARGATE (default choice):
+ No servers to manage
+ Pay per task (per-second)
+ Simpler operations
- Max 16 vCPU / 120 GB per task
- No GPU support
- Slightly higher per-unit cost
EC2:
+ Cheaper at scale (Reserved/Spot)
+ GPU support
+ Larger instances (96+ vCPU)
+ SSH access for debugging
- You manage instances (patching, scaling, AMIs)
- Must scale instances AND tasks
Fargate CPU/Memory Combos (memorize the small ones)
256 CPU → 512, 1024, 2048 MB
512 CPU → 1024, 2048, 3072, 4096 MB
1024 CPU → 2048 - 8192 MB (1024 increments)
2048 CPU → 4096 - 16384 MB
4096 CPU → 8192 - 30720 MB
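A quick way to drill these combos is a validator over the table above. This sketch covers only the five CPU sizes listed here (larger Fargate sizes exist and are omitted); the names `FARGATE_MEMORY_MB` and `is_valid_fargate_size` are hypothetical.

```python
# CPU units -> allowed memory values in MB, taken from the table above.
FARGATE_MEMORY_MB = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu_units: int, memory_mb: int) -> bool:
    """True if the pair appears in the table; unknown CPU sizes are rejected."""
    return memory_mb in FARGATE_MEMORY_MB.get(cpu_units, [])
```

Under this table, `is_valid_fargate_size(256, 4096)` is false, which is the "Invalid CPU/memory combo" error from the gotchas list.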
ALB Routing
Internet → ALB (public subnets, port 443)
│
├── /api/users/* → user-service-tg (3 tasks, port 3000)
├── /api/orders/* → order-service-tg (2 tasks, port 3000)
├── /api/payments/* → payment-service-tg (2 tasks, port 3000)
└── default → 404 fixed response
LISTENERS:
Port 80 → Redirect to HTTPS (301)
Port 443 → Rules → Target Groups
HEALTH CHECK:
Path: /health
Interval: 15-30s
Timeout: 5s
Healthy threshold: 2
Unhealthy: 3
Matcher: 200
Key facts:
- Target type must be `ip` for Fargate
- HTTPS termination at ALB (ACM cert, free + auto-renew)
- Connection draining: 30s for REST APIs, 300s for WebSocket
- Security group: accept only 443 from the internet (plus 80 if the HTTP → HTTPS redirect listener is enabled); send only 3000 to the ECS SG
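Path-based routing on the 443 listener can be modelled as "first matching rule by priority, else the default action". The sketch below is a local simulation, not an AWS API call: the priorities 1 to 3 are assumptions, and `fnmatchcase` stands in for ALB's `*` wildcard matching.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group), mirroring the diagram above.
LISTENER_RULES = [
    (1, "/api/users/*", "user-service-tg"),
    (2, "/api/orders/*", "order-service-tg"),
    (3, "/api/payments/*", "payment-service-tg"),
]

DEFAULT_ACTION = "404 fixed response"


def route(path: str) -> str:
    """Return the target group for a request path, lowest priority number first."""
    for _priority, pattern, target_group in sorted(LISTENER_RULES):
        if fnmatchcase(path, pattern):
            return target_group
    return DEFAULT_ACTION
```

Note that `/api/users` without a trailing segment does not match `/api/users/*` in this model and falls through to the default action, so a rule for the bare path may be worth adding.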
VPC / Subnet Diagram
VPC 10.0.0.0/16
┌──── AZ-a ────────────────┐ ┌──── AZ-b ────────────────┐
│ Public 10.0.1.0/24 │ │ Public 10.0.2.0/24 │
│ [ALB node] [NAT GW] │ │ [ALB node] │
│ │ │ │
│ Private 10.0.10.0/24 │ │ Private 10.0.20.0/24 │
│ [ECS Tasks] │ │ [ECS Tasks] │
│ │ │ │
│ Data 10.0.100.0/24 │ │ Data 10.0.200.0/24 │
│ [RDS] [ElastiCache] │ │ [RDS] [ElastiCache] │
└──────────────────────────┘ └──────────────────────────┘
Route Tables:
Public: 0.0.0.0/0 → Internet Gateway
Private: 0.0.0.0/0 → NAT Gateway
Data: No internet route
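The CIDR layout in the diagram can be sanity-checked with Python's standard `ipaddress` module: every subnet should sit inside the VPC range and none should overlap. The subnet names are hypothetical labels; the usable-address count subtracts the 5 addresses AWS reserves in each subnet.

```python
import ipaddress

VPC_CIDR = "10.0.0.0/16"
SUBNETS = {
    "public-a": "10.0.1.0/24", "public-b": "10.0.2.0/24",
    "private-a": "10.0.10.0/24", "private-b": "10.0.20.0/24",
    "data-a": "10.0.100.0/24", "data-b": "10.0.200.0/24",
}

AWS_RESERVED_PER_SUBNET = 5


def check_layout(vpc_cidr: str, subnets: dict) -> dict:
    """Validate the layout and return usable IPs per subnet."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside the VPC {vpc}")
    names = list(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                raise ValueError(f"{first} overlaps {second}")
    return {name: net.num_addresses - AWS_RESERVED_PER_SUBNET
            for name, net in nets.items()}
```

Each /24 here leaves 251 usable addresses. That number matters for Fargate because every task takes its own IP in the private subnets, which is the "subnet IP availability" check in the gotchas table.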
IAM Role Types
EXECUTION ROLE (ecsTaskExecutionRole):
Used by: ECS agent (infrastructure)
Timing: Before + around your app
Permissions:
- ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:BatchGetImage
- logs:CreateLogStream, logs:PutLogEvents
- secretsmanager:GetSecretValue (for task secrets)
- ssm:GetParameters (for task parameters)
Shared: Typically one role for all services
TASK ROLE (per-service):
Used by: Your application code (AWS SDK)
Timing: While your app runs
Permissions: Service-specific only
user-service: s3:GetObject on user-photos bucket
order-service: dynamodb:PutItem on orders table
email-worker: ses:SendEmail, sqs:ReceiveMessage
Scoping: Specific actions + specific resources (least privilege)
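As an illustration of least-privilege scoping, here is a sketch of the two policy documents a task role for `user-service` would need: a trust policy that lets ECS tasks assume the role, and a permissions policy limited to `s3:GetObject` on the `user-photos` bucket. The function names are hypothetical; the bucket name comes from the example above.

```python
import json


def task_role_trust_policy() -> dict:
    """Trust policy: only ECS tasks may assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def user_service_task_policy(bucket: str = "user-photos") -> dict:
    """Permissions policy: one action on the objects of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def as_json(policy: dict) -> str:
    """Serialize a policy document for use in a file or CLI argument."""
    return json.dumps(policy, indent=2)
```

The same trust policy applies to the execution role; what differs between the two roles is the permissions policy attached to each.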
Security Group Chain
sg-alb:
IN: 443 from 0.0.0.0/0 (add 80 if the HTTP redirect listener is used)
OUT: 3000 to sg-ecs
sg-ecs:
IN: 3000 from sg-alb
OUT: 443 to 0.0.0.0/0 (APIs, AWS services)
OUT: 5432 to sg-db (PostgreSQL)
OUT: 6379 to sg-cache (Redis)
sg-db:
IN: 5432 from sg-ecs
OUT: none
sg-cache:
IN: 6379 from sg-ecs
OUT: none
Rule: Reference SG IDs (not CIDRs) for internal rules — if IPs change, rules still work.
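The chain can be checked with a toy model: a connection is allowed only if the source group's outbound rules and the destination group's inbound rules both permit it. Because security groups are stateful, only connection initiation needs checking. This is a simplified local simulation using the group names from the rule references (`sg-ecs`, `sg-db`), not an AWS API.

```python
INTERNET = "0.0.0.0/0"

# group -> inbound / outbound rules as (port, peer) pairs, from the chain above.
SG_RULES = {
    "sg-alb": {"in": [(443, INTERNET)], "out": [(3000, "sg-ecs")]},
    "sg-ecs": {"in": [(3000, "sg-alb")],
               "out": [(443, INTERNET), (5432, "sg-db"), (6379, "sg-cache")]},
    "sg-db": {"in": [(5432, "sg-ecs")], "out": []},
    "sg-cache": {"in": [(6379, "sg-ecs")], "out": []},
}


def _permits(rules, port: int, peer: str) -> bool:
    """A rule matches on port and either the exact peer or 0.0.0.0/0."""
    return any(p == port and who in (peer, INTERNET) for p, who in rules)


def can_connect(src: str, dst: str, port: int) -> bool:
    """True if src may open a connection to dst on the given port."""
    egress_ok = src == INTERNET or _permits(SG_RULES[src]["out"], port, dst)
    ingress_ok = dst == INTERNET or _permits(SG_RULES[dst]["in"], port, src)
    return egress_ok and ingress_ok
```

In this model the internet reaches only the ALB, the ALB reaches only the tasks, and only the tasks reach the database and cache, which is the layering the chain is meant to enforce.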
Common CLI Commands
# ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <uri>
aws ecr create-repository --repository-name my-service --image-scanning-configuration scanOnPush=true
aws ecr describe-images --repository-name my-service
# ECS
aws ecs create-cluster --cluster-name production
aws ecs register-task-definition --cli-input-json file://task-def.json
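aws ecs create-service --cluster production --service-name my-svc --task-definition my-svc:1 --desired-count 3 --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-ecs]}"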
aws ecs update-service --cluster production --service my-svc --task-definition my-svc:2
aws ecs describe-services --cluster production --services my-svc
aws ecs list-tasks --cluster production --service-name my-svc
aws ecs execute-command --cluster production --task <id> --container my-svc --interactive --command "/bin/sh"
# ALB
aws elbv2 create-load-balancer --name my-alb --subnets subnet-1 subnet-2 --security-groups sg-alb --type application
aws elbv2 create-target-group --name my-tg --protocol HTTP --port 3000 --vpc-id vpc-xxx --target-type ip
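aws elbv2 create-listener --load-balancer-arn <arn> --protocol HTTPS --port 443 --certificates CertificateArn=<cert> --default-actions Type=forward,TargetGroupArn=<tg-arn>
aws elbv2 create-rule --listener-arn <arn> --priority 1 --conditions Field=path-pattern,Values='/api/users/*' --actions Type=forward,TargetGroupArn=<tg-arn>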
# Auto-scaling
aws application-autoscaling register-scalable-target --service-namespace ecs --resource-id service/production/my-svc --scalable-dimension ecs:service:DesiredCount --min-capacity 2 --max-capacity 20
aws application-autoscaling put-scaling-policy --policy-type TargetTrackingScaling ...
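The scaling-policy command above is elided in the source, so the following is not the author's configuration. It is a hedged sketch of what a CPU-based target-tracking configuration document can look like, for passing via `--target-tracking-scaling-policy-configuration`. The 70% target and the cooldown values are assumptions chosen for illustration.

```python
import json


def cpu_target_tracking_config(target_cpu_percent: float = 70.0,
                               scale_out_cooldown_s: int = 60,
                               scale_in_cooldown_s: int = 300) -> dict:
    """Target-tracking config keyed on average ECS service CPU utilization."""
    if not 0 < target_cpu_percent <= 100:
        raise ValueError("target must be a percentage in (0, 100]")
    return {
        "TargetValue": float(target_cpu_percent),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleOutCooldown": scale_out_cooldown_s,  # react quickly to load
        "ScaleInCooldown": scale_in_cooldown_s,    # remove capacity slowly
    }


def to_file_body(config: dict) -> str:
    """Serialize the config for a file referenced with file:// on the CLI."""
    return json.dumps(config, indent=2)
```

A shorter scale-out cooldown than scale-in cooldown is a common choice, since adding capacity late hurts more than removing it late.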
VPC Endpoints (Essential for Fargate)
com.amazonaws.<region>.ecr.api → ECR metadata (Interface)
com.amazonaws.<region>.ecr.dkr → Docker image pulls (Interface)
com.amazonaws.<region>.s3 → ECR layer storage (Gateway)
com.amazonaws.<region>.logs → CloudWatch Logs (Interface)
com.amazonaws.<region>.secretsmanager → Secrets injection (Interface)
Without these, private subnets must route through NAT Gateway (extra cost + latency).
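Since the endpoint service names differ only by region, a small helper can generate the full list for infrastructure scripts. The helper name is hypothetical; the suffixes and endpoint types are the ones listed above.

```python
# (service suffix, endpoint type) pairs from the list above.
FARGATE_ENDPOINTS = [
    ("ecr.api", "Interface"),
    ("ecr.dkr", "Interface"),
    ("s3", "Gateway"),
    ("logs", "Interface"),
    ("secretsmanager", "Interface"),
]


def endpoint_service_names(region: str) -> dict:
    """Map each full endpoint service name to its endpoint type."""
    return {f"com.amazonaws.{region}.{suffix}": kind
            for suffix, kind in FARGATE_ENDPOINTS}
```

Only the S3 entry is a Gateway endpoint, attached through route tables; the Interface endpoints are placed in subnets and need a security group that allows 443 from the tasks.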
Deployment Flow
1. docker build + docker push → ECR
2. aws ecs register-task-definition (new revision)
3. aws ecs update-service --task-definition my-svc:NEW
4. ECS rolling deployment:
- Start new tasks (up to maxPercent)
- Wait for ALB health check pass
- Drain old tasks (connection draining)
- Stop old tasks
- Repeat until all replaced
5. Circuit breaker auto-rollback if new tasks keep failing
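Step 4 can be made concrete with a small simulation of how the deployment's minimum-healthy and maximum percentages bound each round. This is a simplified model, not ECS's scheduler: it assumes every new task passes its health check immediately, and the function name is hypothetical. It rounds the maximum down and the minimum up.

```python
def rolling_deploy(desired: int, min_healthy_pct: int = 100,
                   max_pct: int = 200) -> list:
    """Return (old_tasks, new_tasks) after each round of a rolling update."""
    upper = (desired * max_pct) // 100                # most tasks allowed at once
    lower = -((-desired * min_healthy_pct) // 100)    # fewest healthy tasks allowed
    old, new = desired, 0
    rounds = []
    while old > 0 or new < desired:
        # Start as many new tasks as the ceiling permits.
        start = max(0, min(desired - new, upper - (old + new)))
        new += start
        # Drain and stop old tasks without dropping below the floor.
        stop = min(old, max(0, old + new - lower))
        old -= stop
        if start == 0 and stop == 0:
            raise ValueError("no headroom: raise max_pct or lower min_healthy_pct")
        rounds.append((old, new))
    return rounds
```

With 3 desired tasks, 100%/200% finishes in one round, 100%/150% takes three rounds (one task swapped per round), and 100%/100% cannot make progress at all because there is no room to start or stop anything.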
Common Gotchas
| Gotcha | Fix |
|---|---|
| `CannotPullContainerError` | Check: execution role, ECR permissions, NAT Gateway/VPC Endpoint, security group outbound 443 |
| Tasks start then immediately stop | Check CloudWatch Logs for crash reason; check health check timing (startPeriod too short) |
| `no basic auth credentials` on push | Run `aws ecr get-login-password` first (token expired) |
| 502 Bad Gateway after deploy | New tasks not yet healthy; check health check interval x threshold |
| Service stuck at 0 tasks | Check subnet IP availability, security group rules, task definition errors |
| Invalid CPU/memory combo | Fargate has fixed combos — 256 CPU only allows 512/1024/2048 MB |
| Target group type `instance` with Fargate | Must be `ip`: Fargate uses `awsvpc` networking, not EC2 instances |
| Execution role vs task role confusion | Execution = ECS agent (pull images); Task = your code (call AWS services) |
| NAT Gateway costs surprise | ~$32/month + data transfer; use VPC Endpoints to reduce |
| `latest` tag in production | Mutable tag, no traceability; use version + git SHA |
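The "502 Bad Gateway after deploy" row comes down to interval × threshold arithmetic. The sketch below computes rough lower bounds from the health check settings in the ALB section (interval 15-30s, healthy threshold 2, unhealthy threshold 3). Real timings also depend on when the first probe lands and on the timeout, so treat these as approximations; the function names are hypothetical.

```python
def seconds_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Approximate time before a new task starts receiving traffic."""
    return interval_s * healthy_threshold


def seconds_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time before a failing task is taken out of rotation."""
    return interval_s * unhealthy_threshold
```

At a 15s interval a new task needs roughly 30s to be marked healthy and a failing one roughly 45s to be removed; at 30s those become about 60s and 90s, which is why shorter intervals speed up both deployments and failure detection.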
End of 6.3 quick revision.