Episode 6 — Scaling Reliability Microservices Web3 / 6.3 — AWS Cloud Native Deployment
6.3.b — ECS and Fargate
In one sentence: Amazon ECS (Elastic Container Service) is AWS's container orchestrator that manages running, scaling, and deploying your containers, and Fargate is the serverless launch type that lets you run containers without provisioning or managing any EC2 instances.
Navigation: ← 6.3.a — ECR and Container Images · 6.3.c — Application Load Balancer →
1. What Is Amazon ECS?
Amazon Elastic Container Service (ECS) is a container orchestration service. It answers the question: "I have Docker images in ECR — now how do I actually run them, keep them running, and scale them?"
ECS handles:
- Scheduling — deciding where and when to run containers
- Health monitoring — restarting containers that fail
- Scaling — adding/removing containers based on demand
- Deployment — rolling out new versions without downtime
- Service discovery — letting containers find each other
Think of ECS as a production-grade docker-compose that runs in the cloud, auto-heals, auto-scales, and integrates with the entire AWS ecosystem.
2. ECS Core Concepts
┌─────────────────────────────────────────────────────────────────┐
│ ECS HIERARCHY │
│ │
│ Cluster │
│ └── Service │
│ ├── Task (running instance) │
│ │ └── Container(s) │
│ ├── Task (running instance) │
│ │ └── Container(s) │
│ └── Task (running instance) │
│ └── Container(s) │
│ │
│ Task Definition (blueprint — like a docker-compose.yml) │
│ └── Container Definition(s) │
│ ├── Image, CPU, Memory │
│ ├── Port mappings │
│ ├── Environment variables │
│ └── Log configuration │
└─────────────────────────────────────────────────────────────────┘
Cluster
A cluster is a logical grouping of resources. It is the top-level container for your services. You typically have one cluster per environment (dev, staging, production) or one cluster per application.
# Create a cluster
aws ecs create-cluster --cluster-name production --region us-east-1
Task Definition
A task definition is a JSON blueprint that describes how to run your containers. It is like a docker-compose.yml for AWS. It defines:
- Which image to use
- How much CPU and memory to allocate
- Port mappings
- Environment variables
- Logging configuration
- IAM roles
Task definitions are versioned — each update creates a new revision (e.g., user-service:1, user-service:2, user-service:3).
Task
A task is a running instance of a task definition. When ECS "runs a task," it pulls the image from ECR, creates the container(s), and starts them. A task can contain one or more containers that share network and storage (similar to a Kubernetes pod).
Service
A service is a long-running, managed set of tasks. You tell ECS "I want 3 copies of the user-service task running at all times," and the service ensures exactly 3 tasks are always healthy. If one crashes, ECS launches a replacement. If you deploy a new version, the service handles rolling updates.
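That "keep N copies healthy" behavior is essentially a reconciliation loop. Here is a minimal Python sketch of the idea; the `reconcile` function and the task dictionaries are our own illustration, not an ECS API:

```python
def reconcile(desired: int, running: list) -> dict:
    """One scheduler cycle: stop unhealthy tasks, top up to the desired count."""
    healthy = [t for t in running if t["healthy"]]
    return {
        "launch": max(0, desired - len(healthy)),                 # new tasks to start
        "stop": [t["id"] for t in running if not t["healthy"]],   # tasks to replace
    }

# One task crashed: the service stops it and launches a replacement.
plan = reconcile(3, [
    {"id": "task-a", "healthy": True},
    {"id": "task-b", "healthy": True},
    {"id": "task-c", "healthy": False},
])
# plan == {"launch": 1, "stop": ["task-c"]}
```

The real scheduler also handles placement, draining, and deployment ordering, but every cycle boils down to this diff between desired and actual state.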
3. Task Definition Anatomy
Here is a complete, production-ready task definition for a Node.js microservice:
{
"family": "user-service",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/userServiceTaskRole",
"containerDefinitions": [
{
"name": "user-service",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3",
"essential": true,
"portMappings": [
{
"containerPort": 3000,
"protocol": "tcp"
}
],
"environment": [
{ "name": "NODE_ENV", "value": "production" },
{ "name": "PORT", "value": "3000" },
{ "name": "LOG_LEVEL", "value": "info" }
],
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:user-service/db-url"
},
{
"name": "JWT_SECRET",
"valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/user-service/jwt-secret"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/user-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
Field-by-field breakdown
| Field | Purpose |
|---|---|
| `family` | Name of the task definition (groups revisions together) |
| `networkMode: "awsvpc"` | Each task gets its own ENI (elastic network interface); required for Fargate |
| `requiresCompatibilities` | `["FARGATE"]` or `["EC2"]` or both |
| `cpu` / `memory` | Resource allocation (Fargate has specific valid combinations) |
| `executionRoleArn` | IAM role ECS uses to pull images and write logs (see 6.3.d) |
| `taskRoleArn` | IAM role your application code uses to call AWS services (see 6.3.d) |
| `image` | Full ECR URI with tag |
| `essential` | If true, the task stops if this container stops |
| `portMappings` | Which container port to expose |
| `environment` | Plain-text env vars |
| `secrets` | Env vars pulled from Secrets Manager or SSM Parameter Store |
| `logConfiguration` | Send logs to CloudWatch Logs |
| `healthCheck` | Container-level health check command |
Fargate CPU/Memory Valid Combinations
Fargate only allows specific CPU/memory combinations:
| CPU (vCPU) | Memory Options (MB) |
|---|---|
| 256 (0.25) | 512, 1024, 2048 |
| 512 (0.5) | 1024, 2048, 3072, 4096 |
| 1024 (1) | 2048, 3072, 4096, 5120, 6144, 7168, 8192 |
| 2048 (2) | 4096 through 16384 (in 1024 increments) |
| 4096 (4) | 8192 through 30720 (in 1024 increments) |
| 8192 (8) | 16384 through 61440 (in 4096 increments) |
| 16384 (16) | 32768 through 122880 (in 8192 increments) |
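The table above can be encoded as a simple lookup, which is handy for validating a task definition before registering it. A sketch in Python; the helper name is ours, not part of any AWS SDK:

```python
# Valid Fargate combinations: CPU units -> allowed memory values in MB.
FARGATE_COMBOS = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: list(range(2048, 8193, 1024)),
    2048: list(range(4096, 16385, 1024)),
    4096: list(range(8192, 30721, 1024)),
    8192: list(range(16384, 61441, 4096)),
    16384: list(range(32768, 122881, 8192)),
}

def is_valid_fargate_combo(cpu: int, memory: int) -> bool:
    """Return True if Fargate accepts this cpu/memory pair."""
    return memory in FARGATE_COMBOS.get(cpu, [])
```

Registering a task definition with an invalid pair (say, 256 CPU with 4096 MB) fails with a client error, so catching it early saves a round trip.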
4. Registering a Task Definition
# Register the task definition from a JSON file
aws ecs register-task-definition \
--cli-input-json file://task-definition.json \
--region us-east-1
# List all revisions of a task definition
aws ecs list-task-definitions \
--family-prefix user-service \
--region us-east-1
# Output:
# "taskDefinitionArns": [
# "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:1",
# "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:2",
# "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:3"
# ]
5. Fargate vs EC2 Launch Types
This is one of the most important decisions when using ECS. You have two ways to run your tasks:
┌──────────────────────────────────┐ ┌──────────────────────────────────┐
│ EC2 Launch Type │ │ Fargate Launch Type │
│ │ │ │
│ You manage EC2 instances │ │ AWS manages infrastructure │
│ ┌────────────────────────┐ │ │ ┌────────────────────────┐ │
│ │ EC2 Instance │ │ │ │ Fargate (serverless) │ │
│ │ ┌──────┐ ┌──────┐ │ │ │ │ ┌──────┐ ┌──────┐ │ │
│ │ │Task 1│ │Task 2│ │ │ │ │ │Task 1│ │Task 2│ │ │
│ │ └──────┘ └──────┘ │ │ │ │ └──────┘ └──────┘ │ │
│ │ OS, Docker, ECS Agent │ │ │ │ No servers to manage │ │
│ └────────────────────────┘ │ │ └────────────────────────┘ │
│ You patch, scale, monitor │ │ AWS patches, scales, monitors │
└──────────────────────────────────┘ └──────────────────────────────────┘
Comparison Table
| Dimension | EC2 Launch Type | Fargate Launch Type |
|---|---|---|
| Server management | You manage EC2 instances (patching, scaling) | No servers — AWS manages everything |
| Pricing | Pay for EC2 instances (even if underutilized) | Pay per task (per-second, CPU + memory) |
| Scaling | Must scale instances AND tasks | Only scale tasks (infrastructure auto-scales) |
| Startup time | ~30s (if instance available) | ~30-60s (provision new micro-VM) |
| Max resources | Limited by instance type (up to 96 vCPU, 384 GB) | Up to 16 vCPU, 120 GB per task |
| GPU support | Yes (P3, G4 instances) | No GPU support |
| Persistent storage | EBS volumes, instance store | Ephemeral storage (20 GB default, up to 200 GB) |
| SSH access | Yes (for debugging) | No (use ECS Exec for shell access) |
| Cost at scale | Cheaper with Reserved/Spot instances | More expensive per unit, but no waste |
| Complexity | Higher (capacity planning, AMI updates) | Lower (focus on containers, not infrastructure) |
| Best for | GPU workloads, large instances, cost optimization at scale | Most workloads, teams wanting simplicity |
When to Use Fargate
- Default choice — start with Fargate unless you have a specific reason not to
- Small to medium workloads
- Teams without dedicated DevOps/infrastructure engineers
- Variable traffic (pay only for what you use)
- Rapid prototyping and MVP
When to Use EC2
- GPU-dependent workloads (ML inference)
- Need instances larger than 16 vCPU / 120 GB
- Very high, predictable traffic (Reserved Instances save 30-60%)
- Need SSH access to the host for debugging
- Specialized kernel or OS requirements
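To make the cost trade-off concrete, Fargate bills per second for the vCPU and memory a task reserves. The sketch below uses illustrative list prices for Linux/x86 in us-east-1 (roughly $0.04048 per vCPU-hour and $0.004445 per GB-hour at the time of writing; always check current AWS pricing):

```python
# Illustrative Fargate on-demand prices (USD, Linux/x86, us-east-1).
VCPU_HOUR = 0.04048
GB_HOUR = 0.004445

def fargate_monthly_cost(vcpu: float, memory_gb: float, tasks: int,
                         hours: float = 730) -> float:
    """Estimated monthly cost of running `tasks` copies of a task 24/7."""
    per_task = (vcpu * VCPU_HOUR + memory_gb * GB_HOUR) * hours
    return round(per_task * tasks, 2)

# Three copies of the 0.25 vCPU / 512 MB user-service: roughly $27/month.
print(fargate_monthly_cost(0.25, 0.5, 3))
```

At this scale Fargate's premium over EC2 is negligible; the comparison only starts favoring EC2 with Reserved or Spot pricing on large, steady fleets.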
6. Creating a Fargate Service
Step 1: Ensure prerequisites
# Cluster exists
aws ecs create-cluster --cluster-name production
# Task definition is registered (from Section 4)
aws ecs register-task-definition --cli-input-json file://task-definition.json
# VPC, subnets, security groups exist (see 6.3.d)
# ALB and target group exist (see 6.3.c)
Step 2: Create the service
aws ecs create-service \
--cluster production \
--service-name user-service \
--task-definition user-service:3 \
--desired-count 3 \
--launch-type FARGATE \
--network-configuration '{
"awsvpcConfiguration": {
"subnets": ["subnet-0a1b2c3d4e5f", "subnet-1a2b3c4d5e6f"],
"securityGroups": ["sg-0123456789abcdef0"],
"assignPublicIp": "DISABLED"
}
}' \
--load-balancers '[
{
"targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/user-svc-tg/abc123",
"containerName": "user-service",
"containerPort": 3000
}
]' \
--deployment-configuration '{
"minimumHealthyPercent": 100,
"maximumPercent": 200
}' \
--region us-east-1
Key parameters explained
| Parameter | Purpose |
|---|---|
| `--desired-count 3` | Run 3 copies of the task at all times |
| `--launch-type FARGATE` | Use Fargate (no EC2 instances to manage) |
| `awsvpcConfiguration` | Place tasks in private subnets with a security group |
| `assignPublicIp: DISABLED` | Tasks sit in private subnets and are reached via the ALB |
| `--load-balancers` | Register tasks with the ALB target group |
| `minimumHealthyPercent: 100` | During deployment, never drop below 100% of desired count |
| `maximumPercent: 200` | During deployment, allow up to 200% temporarily (rolling update) |
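Those two percentages translate into hard task-count bounds during a deployment. A quick sketch of the arithmetic (the function name is ours):

```python
import math

def deployment_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple:
    """Task-count floor and ceiling ECS respects during a rolling deployment."""
    floor = math.ceil(desired * min_healthy_pct / 100)    # never fewer healthy tasks
    ceiling = math.floor(desired * max_pct / 100)          # never more total tasks
    return floor, ceiling

# desired=3, min 100%, max 200%: never below 3 healthy, at most 6 total.
print(deployment_bounds(3, 100, 200))  # (3, 6)
```

With a small Fargate service, 100/200 is a safe default; on EC2 with tight capacity you might use 50/100 instead, accepting reduced capacity during the rollout.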
7. Service Auto-Scaling
ECS integrates with Application Auto Scaling to adjust the number of running tasks based on demand.
Target tracking scaling (recommended)
# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/production/user-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 20
# Scale based on CPU utilization (target: 70%)
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production/user-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-target-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
# Scale based on request count per target (from ALB)
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production/user-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name request-count-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 1000.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ALBRequestCountPerTarget",
"ResourceLabel": "app/my-alb/abc123/targetgroup/user-svc-tg/def456"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
Scaling strategies comparison
| Strategy | How It Works | Best For |
|---|---|---|
| Target tracking | Maintain a target metric value (e.g., 70% CPU) | Most workloads — set and forget |
| Step scaling | Add/remove N tasks when metric crosses thresholds | Bursty traffic with known patterns |
| Scheduled scaling | Scale at specific times (e.g., 9 AM scale up) | Predictable daily/weekly patterns |
Cooldowns
- `ScaleOutCooldown: 60` → wait 60 seconds after a scale-out activity before scaling out again
- `ScaleInCooldown: 300` → wait 300 seconds after a scale-in activity before scaling in again
Why asymmetric? Scale out fast (handle the traffic), scale in slowly (avoid flapping).
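Under the hood, target tracking converges toward roughly `newDesired = ceil(currentDesired × currentMetric / targetValue)`, clamped to the registered min/max capacity. A simplified sketch of one adjustment step (the function is our approximation, not the exact CloudWatch algorithm):

```python
import math

def target_tracking_step(current_tasks: int, metric: float, target: float,
                         min_cap: int = 2, max_cap: int = 20) -> int:
    """Approximate task count the policy converges toward for a metric reading."""
    desired = math.ceil(current_tasks * metric / target)
    return max(min_cap, min(max_cap, desired))

# 4 tasks averaging 90% CPU against a 70% target -> scale out to 6 tasks.
# 10 tasks idling at 20% CPU -> scale in toward 3 tasks.
print(target_tracking_step(4, 90, 70), target_tracking_step(10, 20, 70))
```

This is why target tracking is "set and forget": you pick one number (the target) and the proportional math does the rest.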
8. Rolling Deployments
When you update a service (new image version), ECS performs a rolling deployment:
Timeline of a rolling deployment (desired: 3, min: 100%, max: 200%):
Time 0: [v1] [v1] [v1] ← 3 old tasks running
Time 1: [v1] [v1] [v1] [v2] ← 1 new task starting (max 200% = 6)
Time 2: [v1] [v1] [v1] [v2] ← v2 task passes health check
Time 3: [v1] [v1] [v2] ← 1 old task drained and stopped
Time 4: [v1] [v1] [v2] [v2] ← another new task starting
Time 5: [v1] [v2] [v2] ← another old task stopped
Time 6: [v1] [v2] [v2] [v2] ← last new task starting
Time 7: [v2] [v2] [v2] ← all old tasks stopped — deployment complete
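The timeline above can be simulated in a few lines. This sketch assumes each new task passes its health check immediately and replaces old tasks one at a time, exactly as in the diagram (real ECS may start several new tasks in parallel up to the `maximumPercent` ceiling):

```python
def rolling_deploy(desired: int = 3, max_pct: int = 200) -> list:
    """Replace v1 tasks with v2, honoring min 100% healthy and the max ceiling."""
    tasks, history = ["v1"] * desired, []
    while "v1" in tasks:
        if len(tasks) < desired * max_pct // 100:
            tasks = tasks + ["v2"]       # start a new task first (stay >= desired)
        tasks.remove("v1")               # then drain and stop one old task
        history.append(list(tasks))
    return history

# Ends at ["v2", "v2", "v2"]; the healthy count never drops below 3.
```

The key invariant: a new task must be running and healthy before an old one is stopped, which is what makes the deployment zero-downtime.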
Triggering a deployment
# Update the service to use a new task definition revision
aws ecs update-service \
--cluster production \
--service user-service \
--task-definition user-service:4 \
--region us-east-1
# Monitor the deployment
aws ecs describe-services \
--cluster production \
--services user-service \
--query 'services[0].deployments' \
--region us-east-1
Deployment circuit breaker
ECS can automatically roll back a failed deployment:
aws ecs update-service \
--cluster production \
--service user-service \
--deployment-configuration '{
"minimumHealthyPercent": 100,
"maximumPercent": 200,
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
}
}'
If the new tasks keep failing health checks, ECS automatically reverts to the previous working version.
9. ECS Exec (Debugging Running Containers)
Since Fargate has no SSH access, ECS Exec gives you a shell into a running container:
# Enable ECS Exec on the service
aws ecs update-service \
--cluster production \
--service user-service \
--enable-execute-command
# Get a task ID
TASK_ID=$(aws ecs list-tasks \
--cluster production \
--service-name user-service \
--query 'taskArns[0]' \
--output text)
# Open a shell in the container
aws ecs execute-command \
--cluster production \
--task "$TASK_ID" \
--container user-service \
--interactive \
--command "/bin/sh"
Warning: ECS Exec requires the SSM agent in the container. Fargate platform version 1.4.0+ includes it automatically. The task role must have `ssmmessages` permissions (`CreateControlChannel`, `CreateDataChannel`, `OpenControlChannel`, `OpenDataChannel`).
10. Complete Task Definitions for Microservices
Node.js API Service (lightweight)
{
"family": "api-gateway",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/apiGatewayTaskRole",
"containerDefinitions": [
{
"name": "api-gateway",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api-gateway:v2.0.0",
"essential": true,
"portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
"environment": [
{ "name": "NODE_ENV", "value": "production" },
{ "name": "USER_SERVICE_URL", "value": "http://user-service.local:3000" },
{ "name": "ORDER_SERVICE_URL", "value": "http://order-service.local:3000" }
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/api-gateway",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
}
Worker Service (no port mapping, higher memory)
{
"family": "email-worker",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/emailWorkerTaskRole",
"containerDefinitions": [
{
"name": "email-worker",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/email-worker:v1.5.0",
"essential": true,
"environment": [
{ "name": "NODE_ENV", "value": "production" },
{ "name": "SQS_QUEUE_URL", "value": "https://sqs.us-east-1.amazonaws.com/123456789012/email-queue" }
],
"secrets": [
{
"name": "SMTP_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:email-worker/smtp"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/email-worker",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
}
11. Key Takeaways
- ECS hierarchy: Cluster → Service → Task → Container(s). The task definition is the blueprint.
- Fargate is the default choice — no servers to manage, pay per second of task runtime.
- Task definitions are versioned — every update creates a new revision; roll back by pointing to an older revision.
- Services maintain desired count — if a task dies, ECS replaces it automatically.
- Rolling deployments ensure zero downtime — new tasks start before old tasks stop.
- Deployment circuit breakers auto-rollback failed deployments.
- Auto-scaling adjusts task count based on CPU, memory, or ALB request count.
- Secrets belong in Secrets Manager or SSM — never put them in the task definition `environment` block.
Explain-It Challenge
- Your ECS service has `desired-count: 3` but only 2 tasks are running. List all possible reasons and how you would diagnose each.
- Explain the difference between `executionRoleArn` and `taskRoleArn` to a teammate who has never used ECS.
- A deployment has been stuck for 20 minutes — new tasks keep starting and stopping. What is happening and how do you fix it?
Navigation: ← 6.3.a — ECR and Container Images · 6.3.c — Application Load Balancer →