
6.3 — AWS Cloud-Native Deployment: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

Skim before labs or interviews.
Drill gaps — reopen README.md → 6.3.a…6.3.d.
Practice — 6.3-Exercise-Questions.md.
Polish answers — 6.3-Interview-Questions.md.

Core Vocabulary

Term	One-liner
ECR	Managed Docker registry — private, IAM-integrated, encrypted at rest
ECS	Container orchestration service — runs, scales, and manages containers
Fargate	Serverless ECS launch type — no servers to manage
Cluster	Logical grouping of ECS services and tasks
Task Definition	JSON blueprint for how to run a container (image, CPU, memory, ports, env)
Task	Running instance of a task definition
Service	Maintains desired count of healthy tasks, handles deployments
ALB	Application Load Balancer — Layer 7, path-based routing, HTTPS termination
Target Group	Collection of targets (ECS task IPs) that receive ALB traffic
Health Check	ALB probe to verify a task is responsive (GET /health)
VPC	Virtual Private Cloud — isolated network on AWS
Public Subnet	Subnet with route to Internet Gateway
Private Subnet	Subnet with no direct internet route (NAT for outbound)
NAT Gateway	Allows outbound-only internet from private subnets
Security Group	Stateful firewall at the resource level
NACL	Stateless firewall at the subnet level
Execution Role	IAM role for ECS agent (pull images, write logs, fetch secrets)
Task Role	IAM role for your app code (call S3, SQS, DynamoDB, etc.)
VPC Endpoint	Private connection to AWS services without NAT/internet

ECR Workflow

1. aws ecr create-repository --repository-name my-service
2. docker build -t my-service:v1.0 .
3. docker tag my-service:v1.0 <acct>.dkr.ecr.<region>.amazonaws.com/my-service:v1.0
4. aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <acct>.dkr.ecr.<region>.amazonaws.com
5. docker push <acct>.dkr.ecr.<region>.amazonaws.com/my-service:v1.0
6. aws ecr describe-images --repository-name my-service

Image URI format: <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
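As a quick illustration of the URI format above, here is a minimal Python sketch that builds and parses an image URI. The account ID, region, and tag are made-up placeholders, and the regular expression is a simplified assumption rather than the full ECR naming rules.

```python
import re

# Simplified pattern for <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[a-z0-9._/-]+):(?P<tag>[A-Za-z0-9._-]+)$"
)


def build_image_uri(account_id: str, region: str, repo: str, tag: str) -> str:
    """Assemble an ECR image URI from its four parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def parse_image_uri(uri: str) -> dict:
    """Split an ECR image URI back into account, region, repo, and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


# Placeholder values: version plus short git SHA as the tag
uri = build_image_uri("123456789012", "us-east-1", "my-service", "v1.0-3f2a9c1")
print(uri)
print(parse_image_uri(uri))
```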

Best practices:

Tag with version + git SHA (never rely solely on latest)
Multi-stage Dockerfile (alpine base, non-root user, .dockerignore)
Scan on push (scanOnPush=true)
Lifecycle policies (keep last 10 images, delete untagged after 1 day)
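The lifecycle rule in the last bullet can be expressed as an ECR lifecycle policy document. The sketch below generates one in Python; the rule descriptions and the function name are illustrative, and the thresholds simply mirror the numbers quoted above.

```python
import json


def lifecycle_policy(keep_last: int = 10, untagged_days: int = 1) -> dict:
    """Build an ECR lifecycle policy: expire old untagged images, cap total count."""
    return {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire untagged images after {untagged_days} day(s)",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": untagged_days,
                },
                "action": {"type": "expire"},
            },
            {
                # A rule with tagStatus "any" has to carry the highest priority number
                "rulePriority": 2,
                "description": f"Keep only the {keep_last} most recent images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_last,
                },
                "action": {"type": "expire"},
            },
        ]
    }


print(json.dumps(lifecycle_policy(), indent=2))
```

Saved to a file, the output could be applied with `aws ecr put-lifecycle-policy --repository-name my-service --lifecycle-policy-text file://policy.json`.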

ECS Architecture

Cluster
└── Service (desired: 3, auto-scaling: 2-20)
    ├── Task (v1.2.3, running, 10.0.10.15:3000)
    ├── Task (v1.2.3, running, 10.0.20.23:3000)
    └── Task (v1.2.3, running, 10.0.10.42:3000)

Task Definition (blueprint):
  family:           "user-service"
  cpu/memory:       256/512
  image:            <ecr-uri>:v1.2.3
  portMappings:     containerPort 3000
  environment:      NODE_ENV=production
  secrets:          DATABASE_URL from SecretsManager
  logConfiguration: awslogs → CloudWatch
  executionRoleArn: pulls images, writes logs
  taskRoleArn:      app calls AWS services
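For reference, the blueprint above maps onto a task definition JSON roughly like the one this Python sketch emits. All ARNs, the log group name, and the region are hypothetical placeholders; only the field names follow the ECS task definition schema.

```python
import json


def task_definition(image: str, execution_role_arn: str, task_role_arn: str,
                    secret_arn: str, region: str = "us-east-1") -> dict:
    """Return a Fargate task definition dict for the user-service blueprint."""
    return {
        "family": "user-service",
        "networkMode": "awsvpc",              # required for Fargate
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "256",                          # task-level sizes are strings
        "memory": "512",
        "executionRoleArn": execution_role_arn,
        "taskRoleArn": task_role_arn,
        "containerDefinitions": [{
            "name": "user-service",
            "image": image,
            "essential": True,
            "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
            "environment": [{"name": "NODE_ENV", "value": "production"}],
            "secrets": [{"name": "DATABASE_URL", "valueFrom": secret_arn}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/user-service",   # assumed log group
                    "awslogs-region": region,
                    "awslogs-stream-prefix": "ecs",
                },
            },
        }],
    }


# Placeholder identifiers, not real resources
task_def = task_definition(
    image="123456789012.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3",
    execution_role_arn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    task_role_arn="arn:aws:iam::123456789012:role/user-service-task-role",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:db-url",
)
print(json.dumps(task_def, indent=2))
```

The printed JSON is the kind of file that `aws ecs register-task-definition --cli-input-json file://task-def.json` expects.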

Fargate vs EC2

FARGATE (default choice):
  + No servers to manage
  + Pay per task (per-second)
  + Simpler operations
  - Max 16 vCPU / 120 GB per task
  - No GPU support
  - Slightly higher per-unit cost

EC2:
  + Cheaper at scale (Reserved/Spot)
  + GPU support
  + Larger instances (96+ vCPU)
  + SSH access for debugging
  - You manage instances (patching, scaling, AMIs)
  - Must scale instances AND tasks

Fargate CPU/Memory Combos (memorize the small ones)

256 CPU  →  512, 1024, 2048 MB
512 CPU  →  1024, 2048, 3072, 4096 MB
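1024 CPU →  2048 - 8192 MB (1024 increments)
2048 CPU →  4096 - 16384 MB (1024 increments)
4096 CPU →  8192 - 30720 MB (1024 increments)
8192 CPU →  16384 - 61440 MB (4096 increments)
16384 CPU → 32768 - 122880 MB (8192 increments)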
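A small lookup table makes these combinations easy to check before registering a task definition. This Python sketch covers only the 256 to 4096 CPU rows; the function name is illustrative.

```python
STEP_MB = 1024

# CPU units -> allowed task memory values in MB (256 to 4096 CPU rows only)
FARGATE_COMBOS = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, STEP_MB)),
    1024: list(range(2048, 8192 + 1, STEP_MB)),
    2048: list(range(4096, 16384 + 1, STEP_MB)),
    4096: list(range(8192, 30720 + 1, STEP_MB)),
}


def is_valid_fargate_size(cpu: int, memory_mb: int) -> bool:
    """Return True when the CPU/memory pair appears in the table above."""
    return memory_mb in FARGATE_COMBOS.get(cpu, [])


print(is_valid_fargate_size(256, 512))    # smallest valid size
print(is_valid_fargate_size(256, 4096))   # too much memory for 256 CPU
```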

ALB Routing

Internet → ALB (public subnets, port 443)
  │
  ├── /api/users/*     → user-service-tg    (3 tasks, port 3000)
  ├── /api/orders/*    → order-service-tg   (2 tasks, port 3000)
  ├── /api/payments/*  → payment-service-tg (2 tasks, port 3000)
  └── default          → 404 fixed response

LISTENERS:
  Port 80  → Redirect to HTTPS (301)
  Port 443 → Rules → Target Groups

HEALTH CHECK:
  Path:              /health
  Interval:          15-30s
  Timeout:           5s
  Healthy threshold: 2
  Unhealthy:         3
  Matcher:           200
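How long a task takes to change state follows from interval × threshold. The arithmetic below is a rough Python sketch; it ignores the probe timeout and where in the interval the first probe lands, so treat the results as approximate.

```python
def seconds_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Approximate time for a new task to be marked healthy."""
    return interval_s * healthy_threshold


def seconds_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time for a failing task to be marked unhealthy."""
    return interval_s * unhealthy_threshold


# Thresholds from the health check settings above (healthy: 2, unhealthy: 3)
for interval in (15, 30):
    up = seconds_to_healthy(interval, 2)
    down = seconds_to_unhealthy(interval, 3)
    print(f"interval={interval}s  healthy after ~{up}s  unhealthy after ~{down}s")
```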

Key facts:

Target type must be ip for Fargate
HTTPS termination at ALB (ACM cert, free + auto-renew)
Connection draining (deregistration delay): 30s for REST APIs, 300s for WebSocket
Security group: accept only 443 from the internet (plus 80 if you keep the HTTP → HTTPS redirect listener); send only 3000 to the ECS SG
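The path-based rules in the routing diagram can be mimicked locally to reason about which target group a request would reach. This Python sketch assumes priorities 1 to 3 (the diagram does not state them) and uses `fnmatch` as a stand-in for ALB wildcard matching, which is close but not identical.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); priorities are assumed for illustration
RULES = [
    (1, "/api/users/*", "user-service-tg"),
    (2, "/api/orders/*", "order-service-tg"),
    (3, "/api/payments/*", "payment-service-tg"),
]

DEFAULT_ACTION = "404 fixed response"


def route(path: str) -> str:
    """Return the first matching target group in priority order, else the default."""
    for _priority, pattern, target_group in sorted(RULES):
        if fnmatchcase(path, pattern):
            return target_group
    return DEFAULT_ACTION


for sample in ("/api/users/42", "/api/orders/7/items", "/metrics"):
    print(sample, "->", route(sample))
```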

VPC / Subnet Diagram

VPC 10.0.0.0/16
┌──── AZ-a ────────────────┐  ┌──── AZ-b ────────────────┐
│ Public  10.0.1.0/24      │  │ Public  10.0.2.0/24      │
│   [ALB node] [NAT GW]   │  │   [ALB node]             │
│                          │  │                          │
│ Private 10.0.10.0/24    │  │ Private 10.0.20.0/24    │
│   [ECS Tasks]            │  │   [ECS Tasks]            │
│                          │  │                          │
│ Data    10.0.100.0/24   │  │ Data    10.0.200.0/24   │
│   [RDS] [ElastiCache]   │  │   [RDS] [ElastiCache]   │
└──────────────────────────┘  └──────────────────────────┘

Route Tables:
  Public:  0.0.0.0/0 → Internet Gateway
  Private: 0.0.0.0/0 → NAT Gateway
  Data:    No internet route
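The CIDR plan in the diagram can be sanity-checked with Python's standard `ipaddress` module. The subnet names below are invented labels; the five-address deduction reflects the addresses AWS reserves in every subnet.

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")

# Subnet labels are illustrative; CIDRs come from the diagram above
SUBNETS = {
    "public-a": "10.0.1.0/24", "public-b": "10.0.2.0/24",
    "private-a": "10.0.10.0/24", "private-b": "10.0.20.0/24",
    "data-a": "10.0.100.0/24", "data-b": "10.0.200.0/24",
}


def check_layout(vpc, subnets: dict) -> list:
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} {net} is outside {vpc}")
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} overlaps {b}")
    return problems


def usable_ips(cidr: str) -> int:
    """Addresses available for tasks: AWS reserves 5 per subnet."""
    return ipaddress.ip_network(cidr).num_addresses - 5


print(check_layout(VPC, SUBNETS))
print(usable_ips("10.0.10.0/24"))
```

Since every Fargate task takes its own IP in awsvpc mode, the usable-address count is a practical ceiling on tasks per subnet.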

IAM Role Types

EXECUTION ROLE (ecsTaskExecutionRole):
  Used by:     ECS agent (infrastructure)
  Timing:      Before + around your app
  Permissions:
    - ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability,
      ecr:GetDownloadUrlForLayer, ecr:BatchGetImage
    - logs:CreateLogStream, logs:PutLogEvents
    - secretsmanager:GetSecretValue (for task secrets)
    - ssm:GetParameters (for task parameters)
  Shared:      Typically one role for all services

TASK ROLE (per-service):
  Used by:     Your application code (AWS SDK)
  Timing:      While your app runs
  Permissions: Service-specific only
    user-service:    s3:GetObject on user-photos bucket
    order-service:   dynamodb:PutItem on orders table
    email-worker:    ses:SendEmail, sqs:ReceiveMessage
  Scoping:     Specific actions + specific resources (least privilege)
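To make the least-privilege idea concrete, here is a hedged Python sketch that builds a task role's trust policy and a scoped permissions policy for the user-service example. The helper name is illustrative and the bucket ARN is derived from the bucket name mentioned above.

```python
import json

# Trust policy: lets ECS tasks assume the role
TASK_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}


def task_role_policy(actions, resources) -> dict:
    """Build a permissions policy limited to specific actions and resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": list(actions),
            "Resource": list(resources),
        }],
    }


# user-service: read-only access to objects in the user-photos bucket
user_service_policy = task_role_policy(
    ["s3:GetObject"], ["arn:aws:s3:::user-photos/*"]
)
print(json.dumps(user_service_policy, indent=2))
```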

Security Group Chain

sg-alb:
  IN:  443 from 0.0.0.0/0  (add 80 if the HTTP redirect listener is enabled)
  OUT: 3000 to sg-ecs

sg-ecs-tasks:
  IN:  3000 from sg-alb
  OUT: 443 to 0.0.0.0/0  (APIs, AWS services)
  OUT: 5432 to sg-db      (PostgreSQL)
  OUT: 6379 to sg-cache   (Redis)

sg-database:
  IN:  5432 from sg-ecs
  OUT: none

sg-cache:
  IN:  6379 from sg-ecs
  OUT: none

Rule: Reference SG IDs (not CIDRs) for internal rules — if IPs change, rules still work.
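The chain above can be modeled as data to check which connections it permits. This is a simplified Python sketch, not how AWS evaluates rules internally: it uses the short group names from the rule references (sg-ecs, sg-db) and only checks the initiating direction, since security groups are stateful and return traffic is allowed automatically.

```python
ANYWHERE = "0.0.0.0/0"

# Each rule is (port, peer); peers are SG names or the open CIDR
RULES = {
    "sg-alb":   {"in": [(443, ANYWHERE)], "out": [(3000, "sg-ecs")]},
    "sg-ecs":   {"in": [(3000, "sg-alb")],
                 "out": [(443, ANYWHERE), (5432, "sg-db"), (6379, "sg-cache")]},
    "sg-db":    {"in": [(5432, "sg-ecs")], "out": []},
    "sg-cache": {"in": [(6379, "sg-ecs")], "out": []},
}


def can_connect(src: str, dst: str, port: int) -> bool:
    """True when src has a matching egress rule AND dst has a matching ingress rule."""
    egress_ok = any(p == port and peer in (dst, ANYWHERE)
                    for p, peer in RULES[src]["out"])
    ingress_ok = any(p == port and peer in (src, ANYWHERE)
                     for p, peer in RULES[dst]["in"])
    return egress_ok and ingress_ok


print(can_connect("sg-alb", "sg-ecs", 3000))
print(can_connect("sg-alb", "sg-db", 5432))
```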

Common CLI Commands

# ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <uri>
aws ecr create-repository --repository-name my-service --image-scanning-configuration scanOnPush=true
aws ecr describe-images --repository-name my-service

# ECS
aws ecs create-cluster --cluster-name production
aws ecs register-task-definition --cli-input-json file://task-def.json
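aws ecs create-service --cluster production --service-name my-svc --task-definition my-svc:1 --desired-count 3 --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-ecs],assignPublicIp=DISABLED}"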
aws ecs update-service --cluster production --service my-svc --task-definition my-svc:2
aws ecs describe-services --cluster production --services my-svc
aws ecs list-tasks --cluster production --service-name my-svc
# Requires the service or task to have been started with --enable-execute-command
aws ecs execute-command --cluster production --task <id> --container my-svc --interactive --command "/bin/sh"

# ALB
aws elbv2 create-load-balancer --name my-alb --subnets subnet-1 subnet-2 --security-groups sg-alb --type application
aws elbv2 create-target-group --name my-tg --protocol HTTP --port 3000 --vpc-id vpc-xxx --target-type ip
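aws elbv2 create-listener --load-balancer-arn <arn> --protocol HTTPS --port 443 --certificates CertificateArn=<cert> --default-actions Type=forward,TargetGroupArn=<tg-arn>
aws elbv2 create-rule --listener-arn <arn> --priority 1 --conditions Field=path-pattern,Values='/api/users/*' --actions Type=forward,TargetGroupArn=<tg-arn>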

# Auto-scaling
aws application-autoscaling register-scalable-target --service-namespace ecs --resource-id service/production/my-svc --scalable-dimension ecs:service:DesiredCount --min-capacity 2 --max-capacity 20
aws application-autoscaling put-scaling-policy --policy-type TargetTrackingScaling ...
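As a rough illustration of target tracking, the Python sketch below builds a policy configuration document and estimates how the desired count might move. The 70% CPU target and the cooldowns are assumed values, not taken from this cheat sheet, and the proportional formula only approximates what the service does.

```python
import json
import math


def target_tracking_config(target_cpu_percent: float = 70.0) -> dict:
    """JSON body for --target-tracking-scaling-policy-configuration (assumed values)."""
    return {
        "TargetValue": target_cpu_percent,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,    # seconds, assumed
        "ScaleInCooldown": 300,    # seconds, assumed
    }


def estimate_desired_count(current_tasks: int, observed_cpu: float,
                           target_cpu: float = 70.0,
                           min_tasks: int = 2, max_tasks: int = 20) -> int:
    """Proportional estimate, clamped to the registered min/max capacity."""
    raw = math.ceil(current_tasks * observed_cpu / target_cpu)
    return max(min_tasks, min(max_tasks, raw))


print(json.dumps(target_tracking_config(), indent=2))
print(estimate_desired_count(3, 90))   # CPU above target: scale out
```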

VPC Endpoints (Essential for Fargate)

com.amazonaws.<region>.ecr.api         → ECR metadata (Interface)
com.amazonaws.<region>.ecr.dkr         → Docker image pulls (Interface)
com.amazonaws.<region>.s3              → ECR layer storage (Gateway)
com.amazonaws.<region>.logs            → CloudWatch Logs (Interface)
com.amazonaws.<region>.secretsmanager  → Secrets injection (Interface)

Without these, private subnets must route through NAT Gateway (extra cost + latency).
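Because the endpoint service names follow one pattern, they are easy to generate per region. A minimal Python sketch, with an illustrative function name:

```python
# Service suffixes from the list above, grouped by endpoint type
INTERFACE_SERVICES = ["ecr.api", "ecr.dkr", "logs", "secretsmanager"]
GATEWAY_SERVICES = ["s3"]


def endpoint_service_names(region: str) -> dict:
    """Map each VPC endpoint service name for the region to its endpoint type."""
    names = {}
    for service in INTERFACE_SERVICES:
        names[f"com.amazonaws.{region}.{service}"] = "Interface"
    for service in GATEWAY_SERVICES:
        names[f"com.amazonaws.{region}.{service}"] = "Gateway"
    return names


for name, kind in endpoint_service_names("us-east-1").items():
    print(f"{name:45} {kind}")
```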

Deployment Flow

1. docker build + docker push → ECR
2. aws ecs register-task-definition (new revision)
3. aws ecs update-service --task-definition my-svc:NEW
4. ECS rolling deployment:
   - Start new tasks (up to maxPercent)
   - Wait for ALB health check pass
   - Drain old tasks (connection draining)
   - Stop old tasks
   - Repeat until all replaced
5. Circuit breaker auto-rollback if new tasks keep failing (needs deploymentCircuitBreaker with enable=true and rollback=true)
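How many tasks may run during step 4 depends on the service's deployment configuration. The Python sketch below computes the bounds from minimumHealthyPercent and maximumPercent; the 100/200 defaults are an assumption, and the rounding (lower bound up, upper bound down) follows how ECS documents these settings.

```python
import math


def rolling_bounds(desired: int, minimum_healthy_percent: int = 100,
                   maximum_percent: int = 200) -> tuple:
    """Return (min running tasks, max running tasks) allowed during a rolling deploy."""
    lower = math.ceil(desired * minimum_healthy_percent / 100)
    upper = math.floor(desired * maximum_percent / 100)
    return lower, upper


print(rolling_bounds(3))             # room to start a full new set before draining
print(rolling_bounds(3, 50, 150))    # tighter settings
print(rolling_bounds(1, 100, 100))   # no headroom: lower == upper
```

When the lower and upper bounds are equal, ECS can neither start a new task nor stop an old one, so the deployment cannot make progress.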

Common Gotchas

Gotcha	Fix
`CannotPullContainerError`	Check: execution role, ECR permissions, NAT Gateway/VPC Endpoint, security group outbound 443
Tasks start then immediately stop	Check CloudWatch Logs for crash reason; check health check timing (startPeriod too short)
`no basic auth credentials` on push	Run `aws ecr get-login-password | docker login ...` again (the auth token expires after 12 hours)
502 Bad Gateway after deploy	New tasks not yet healthy; check health check interval × healthy threshold
Service stuck at 0 tasks	Check subnet IP availability, security group rules, task definition errors
Invalid CPU/memory combo	Fargate has fixed combos — 256 CPU only allows 512/1024/2048 MB
Target group type `instance` with Fargate	Must be `ip` — Fargate uses awsvpc, not EC2 instances
Execution role vs task role confusion	Execution = ECS agent (pull images); Task = your code (call AWS services)
NAT Gateway costs surprise	~$32/month + data transfer; use VPC Endpoints to reduce
`latest` tag in production	Mutable tag — no traceability; use version + git SHA
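The NAT Gateway figure in the table can be reproduced with simple arithmetic. The rates in this Python sketch are assumptions (roughly the us-east-1 list prices at the time of writing); check current pricing for your region before relying on them.

```python
HOURLY_RATE_USD = 0.045    # assumed price per NAT Gateway hour
PER_GB_RATE_USD = 0.045    # assumed data processing price per GB
HOURS_PER_MONTH = 730


def nat_monthly_cost(gb_processed: float, gateways: int = 1) -> float:
    """Estimate monthly NAT Gateway cost: fixed hourly charge plus data processing."""
    fixed = gateways * HOURLY_RATE_USD * HOURS_PER_MONTH
    return fixed + gb_processed * PER_GB_RATE_USD


print(round(nat_monthly_cost(0), 2))        # idle gateway
print(round(nat_monthly_cost(500, 2), 2))   # two gateways, 500 GB processed
```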

End of 6.3 quick revision.