Episode 6 — Scaling Reliability Microservices Web3 / 6.3 — AWS Cloud Native Deployment

6.3.b — ECS and Fargate

In one sentence: Amazon ECS (Elastic Container Service) is AWS's container orchestrator that manages running, scaling, and deploying your containers, and Fargate is the serverless launch type that lets you run containers without provisioning or managing any EC2 instances.

Navigation: ← 6.3.a — ECR and Container Images · 6.3.c — Application Load Balancer →


1. What Is Amazon ECS?

Amazon Elastic Container Service (ECS) is a container orchestration service. It answers the question: "I have Docker images in ECR — now how do I actually run them, keep them running, and scale them?"

ECS handles:

  • Scheduling — deciding where and when to run containers
  • Health monitoring — restarting containers that fail
  • Scaling — adding/removing containers based on demand
  • Deployment — rolling out new versions without downtime
  • Service discovery — letting containers find each other

Think of ECS as a production-grade docker-compose that runs in the cloud, auto-heals, auto-scales, and integrates with the entire AWS ecosystem.
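If the analogy helps, here is how a minimal compose file maps onto ECS vocabulary. This is a rough correspondence for orientation only, not something ECS imports:

```yaml
# docker-compose term      → ECS term (rough mapping)
#   services.<name>        → ECS service + task definition family
#   image                  → containerDefinitions[].image
#   ports                  → portMappings
#   environment            → environment (or secrets, for sensitive values)
#   deploy.replicas        → service desiredCount
services:
  user-service:
    image: user-service:latest
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
    deploy:
      replicas: 3
```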


2. ECS Core Concepts

┌─────────────────────────────────────────────────────────────────┐
│                        ECS HIERARCHY                             │
│                                                                  │
│  Cluster                                                         │
│  └── Service                                                     │
│      ├── Task (running instance)                                 │
│      │   └── Container(s)                                        │
│      ├── Task (running instance)                                 │
│      │   └── Container(s)                                        │
│      └── Task (running instance)                                 │
│          └── Container(s)                                        │
│                                                                  │
│  Task Definition (blueprint — like a docker-compose.yml)         │
│  └── Container Definition(s)                                     │
│      ├── Image, CPU, Memory                                      │
│      ├── Port mappings                                           │
│      ├── Environment variables                                   │
│      └── Log configuration                                       │
└─────────────────────────────────────────────────────────────────┘

Cluster

A cluster is a logical grouping of resources. It is the top-level container for your services. You typically have one cluster per environment (dev, staging, production) or one cluster per application.

# Create a cluster
aws ecs create-cluster --cluster-name production --region us-east-1

Task Definition

A task definition is a JSON blueprint that describes how to run your containers. It is like a docker-compose.yml for AWS. It defines:

  • Which image to use
  • How much CPU and memory to allocate
  • Port mappings
  • Environment variables
  • Logging configuration
  • IAM roles

Task definitions are versioned — each update creates a new revision (e.g., user-service:1, user-service:2, user-service:3).

Task

A task is a running instance of a task definition. When ECS "runs a task," it pulls the image from ECR, creates the container(s), and starts them. A task can contain one or more containers that share network and storage (similar to a Kubernetes pod).

Service

A service is a long-running, managed set of tasks. You tell ECS "I want 3 copies of the user-service task running at all times," and the service ensures exactly 3 tasks are always healthy. If one crashes, ECS launches a replacement. If you deploy a new version, the service handles rolling updates.


3. Task Definition Anatomy

Here is a complete, production-ready task definition for a Node.js microservice:

{
  "family": "user-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/userServiceTaskRole",
  "containerDefinitions": [
    {
      "name": "user-service",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 3000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        { "name": "NODE_ENV", "value": "production" },
        { "name": "PORT", "value": "3000" },
        { "name": "LOG_LEVEL", "value": "info" }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:user-service/db-url"
        },
        {
          "name": "JWT_SECRET",
          "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/user-service/jwt-secret"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/user-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Field-by-field breakdown

| Field | Purpose |
|-------|---------|
| family | Name of the task definition (groups revisions together) |
| networkMode: "awsvpc" | Each task gets its own ENI (elastic network interface) — required for Fargate |
| requiresCompatibilities | ["FARGATE"] or ["EC2"] or both |
| cpu / memory | Resource allocation (Fargate has specific valid combinations) |
| executionRoleArn | IAM role ECS uses to pull images and write logs (see 6.3.d) |
| taskRoleArn | IAM role your application code uses to call AWS services (see 6.3.d) |
| image | Full ECR URI with tag |
| essential | If true, the task stops when this container stops |
| portMappings | Which container port to expose |
| environment | Plain-text env vars |
| secrets | Env vars pulled from Secrets Manager or SSM Parameter Store |
| logConfiguration | Send logs to CloudWatch Logs |
| healthCheck | Container-level health check command |

Fargate CPU/Memory Valid Combinations

Fargate only allows specific CPU/memory combinations:

| CPU (units / vCPU) | Memory options (MiB) |
|--------------------|----------------------|
| 256 (0.25) | 512, 1024, 2048 |
| 512 (0.5) | 1024, 2048, 3072, 4096 |
| 1024 (1) | 2048, 3072, 4096, 5120, 6144, 7168, 8192 |
| 2048 (2) | 4096 through 16384, in 1024 increments |
| 4096 (4) | 8192 through 30720, in 1024 increments |
| 8192 (8) | 16384 through 61440, in 4096 increments |
| 16384 (16) | 32768 through 122880, in 8192 increments |
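An invalid pair is rejected only when you register the task definition, so a local sanity check can save a round trip. A minimal sketch that encodes the table above as a shell function (the authoritative validation is register-task-definition itself):

```shell
# Validate a Fargate CPU/memory pair (CPU units, memory in MiB),
# mirroring the combination table above. Returns 0 if the pair is valid.
fargate_combo_valid() {
  cpu=$1; mem=$2
  case "$cpu" in
    256)   case "$mem" in 512|1024|2048) return 0;; esac; return 1 ;;
    512)   case "$mem" in 1024|2048|3072|4096) return 0;; esac; return 1 ;;
    1024)  lo=2048;  hi=8192;   step=1024 ;;
    2048)  lo=4096;  hi=16384;  step=1024 ;;
    4096)  lo=8192;  hi=30720;  step=1024 ;;
    8192)  lo=16384; hi=61440;  step=4096 ;;
    16384) lo=32768; hi=122880; step=8192 ;;
    *)     return 1 ;;
  esac
  # Range check plus increment check for the larger CPU sizes
  [ "$mem" -ge "$lo" ] && [ "$mem" -le "$hi" ] && [ $((mem % step)) -eq 0 ]
}

fargate_combo_valid 256 512  && echo "256/512: ok"        # valid pair
fargate_combo_valid 256 4096 || echo "256/4096: invalid"  # too much memory
```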

4. Registering a Task Definition

# Register the task definition from a JSON file
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region us-east-1

# List all revisions of a task definition
aws ecs list-task-definitions \
  --family-prefix user-service \
  --region us-east-1

# Output:
# "taskDefinitionArns": [
#   "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:1",
#   "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:2",
#   "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:3"
# ]

5. Fargate vs EC2 Launch Types

This is one of the most important decisions when using ECS. You have two ways to run your tasks:

┌──────────────────────────────────┐  ┌──────────────────────────────────┐
│         EC2 Launch Type          │  │       Fargate Launch Type        │
│                                  │  │                                  │
│  You manage EC2 instances        │  │  AWS manages infrastructure      │
│  ┌────────────────────────┐      │  │  ┌────────────────────────┐      │
│  │ EC2 Instance           │      │  │  │ Fargate (serverless)   │      │
│  │ ┌──────┐ ┌──────┐     │      │  │  │ ┌──────┐ ┌──────┐     │      │
│  │ │Task 1│ │Task 2│     │      │  │  │ │Task 1│ │Task 2│     │      │
│  │ └──────┘ └──────┘     │      │  │  │ └──────┘ └──────┘     │      │
│  │ OS, Docker, ECS Agent │      │  │  │ No servers to manage   │      │
│  └────────────────────────┘      │  │  └────────────────────────┘      │
│  You patch, scale, monitor       │  │  AWS patches, scales, monitors   │
└──────────────────────────────────┘  └──────────────────────────────────┘

Comparison Table

| Dimension | EC2 Launch Type | Fargate Launch Type |
|-----------|-----------------|---------------------|
| Server management | You manage EC2 instances (patching, scaling) | No servers — AWS manages everything |
| Pricing | Pay for EC2 instances (even if underutilized) | Pay per task (per-second, CPU + memory) |
| Scaling | Must scale instances AND tasks | Only scale tasks (infrastructure auto-scales) |
| Startup time | ~30s (if instance capacity is available) | ~30-60s (provisions a new micro-VM) |
| Max resources | Limited by instance type (up to 96 vCPU, 384 GB) | Up to 16 vCPU, 120 GB per task |
| GPU support | Yes (P3, G4 instances) | No GPU support |
| Persistent storage | EBS volumes, instance store | Ephemeral storage (20 GB default, up to 200 GB) |
| SSH access | Yes (for debugging) | No (use ECS Exec for shell access) |
| Cost at scale | Cheaper with Reserved/Spot instances | More expensive per unit, but no waste |
| Complexity | Higher (capacity planning, AMI updates) | Lower (focus on containers, not infrastructure) |
| Best for | GPU workloads, large instances, cost optimization at scale | Most workloads, teams wanting simplicity |

When to Use Fargate

  • Default choice — start with Fargate unless you have a specific reason not to
  • Small to medium workloads
  • Teams without dedicated DevOps/infrastructure engineers
  • Variable traffic (pay only for what you use)
  • Rapid prototyping and MVP

When to Use EC2

  • GPU-dependent workloads (ML inference)
  • Need instances larger than 16 vCPU / 120 GB
  • Very high, predictable traffic (Reserved Instances save 30-60%)
  • Need SSH access to the host for debugging
  • Specialized kernel or OS requirements
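To make the cost trade-off concrete, here is a back-of-the-envelope estimator. The per-hour rates below are placeholders (roughly the us-east-1 Linux/x86 on-demand rates at one point in time); always check the current Fargate pricing page before relying on the numbers:

```shell
# Rough monthly Fargate cost for a steady service.
# RATES ARE ASSUMPTIONS, not authoritative pricing.
VCPU_RATE=0.04048   # USD per vCPU-hour (assumed)
GB_RATE=0.004445    # USD per GB-hour (assumed)

fargate_monthly_cost() {
  # args: vCPU per task, GB per task, task count, hours (default ~730/month)
  awk -v v="$1" -v m="$2" -v t="$3" -v h="${4:-730}" \
      -v vr="$VCPU_RATE" -v mr="$GB_RATE" \
      'BEGIN { printf "%.2f\n", t * h * (v * vr + m * mr) }'
}

# 3 tasks of 0.5 vCPU / 1 GB, running all month:
fargate_monthly_cost 0.5 1 3   # → 54.06
```

Compare the result against the smallest EC2 instances that could host the same tasks; the crossover point is roughly where Reserved or Spot pricing starts to win.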

6. Creating a Fargate Service

Step 1: Ensure prerequisites

# Cluster exists
aws ecs create-cluster --cluster-name production

# Task definition is registered (from Section 4)
aws ecs register-task-definition --cli-input-json file://task-definition.json

# VPC, subnets, security groups exist (see 6.3.d)
# ALB and target group exist (see 6.3.c)

Step 2: Create the service

aws ecs create-service \
  --cluster production \
  --service-name user-service \
  --task-definition user-service:3 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0a1b2c3d4e5f", "subnet-1a2b3c4d5e6f"],
      "securityGroups": ["sg-0a1b2c3d4e5f67890"],
      "assignPublicIp": "DISABLED"
    }
  }' \
  --load-balancers '[
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/user-svc-tg/abc123",
      "containerName": "user-service",
      "containerPort": 3000
    }
  ]' \
  --deployment-configuration '{
    "minimumHealthyPercent": 100,
    "maximumPercent": 200
  }' \
  --region us-east-1

Key parameters explained

| Parameter | Purpose |
|-----------|---------|
| --desired-count 3 | Run 3 copies of the task at all times |
| --launch-type FARGATE | Use Fargate (no EC2 instances to manage) |
| awsvpcConfiguration | Place tasks in private subnets with a security group |
| assignPublicIp: DISABLED | Tasks are in private subnets, accessed via the ALB |
| --load-balancers | Register tasks with the ALB target group |
| minimumHealthyPercent: 100 | During deployment, never drop below 100% of desired count |
| maximumPercent: 200 | During deployment, allow up to 200% temporarily (rolling update) |

7. Service Auto-Scaling

ECS integrates with Application Auto Scaling to adjust the number of running tasks based on demand.

Target tracking scaling (recommended)

# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

# Scale based on CPU utilization (target: 70%)
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

# Scale based on request count per target (from ALB)
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name request-count-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/my-alb/abc123/targetgroup/user-svc-tg/def456"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
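Conceptually, target tracking sizes the fleet so the metric lands back near the target: the new desired count is roughly the current count scaled by metric/target, clamped to the registered min/max capacity. A toy illustration of that arithmetic (the real algorithm works off CloudWatch alarms and is more conservative, but this captures the intuition):

```shell
# Approximate target-tracking arithmetic: scale the desired count in
# proportion to how far the metric is from the target, then clamp.
target_tracking_desired() {
  current=$1; metric=$2; target=$3; min=$4; max=$5
  want=$(( (current * metric + target - 1) / target ))  # ceiling division
  [ "$want" -lt "$min" ] && want=$min
  [ "$want" -gt "$max" ] && want=$max
  echo "$want"
}

target_tracking_desired 3 95 70 2 20   # CPU at 95% vs 70% target → 5
target_tracking_desired 3 30 70 2 20   # CPU at 30% → 2 (clamped to min) 
```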

Scaling strategies comparison

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| Target tracking | Maintain a target metric value (e.g., 70% CPU) | Most workloads — set and forget |
| Step scaling | Add/remove N tasks when a metric crosses thresholds | Bursty traffic with known patterns |
| Scheduled scaling | Scale at specific times (e.g., scale up at 9 AM) | Predictable daily/weekly patterns |
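Scheduled scaling, the third strategy, uses put-scheduled-action. The sketch below dry-runs by default, so nothing touches AWS until RUN=1 is set; the resource names mirror the earlier examples, and the cron hours are in UTC (13:00 UTC is roughly 9 AM US Eastern, an assumption to adjust for your zone):

```shell
# Dry-run wrapper: print the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "DRY RUN: $*"; fi; }

# Raise the capacity floor for business hours...
run aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name business-hours-up \
  --schedule "cron(0 13 * * ? *)" \
  --scalable-target-action MinCapacity=5,MaxCapacity=20

# ...and lower it again overnight.
run aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name overnight-down \
  --schedule "cron(0 1 * * ? *)" \
  --scalable-target-action MinCapacity=2,MaxCapacity=20
```

Note that scheduled actions adjust the min/max bounds; target tracking still moves the desired count within them.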

Cooldowns

ScaleOutCooldown: 60   → Wait 60s after scaling out before scaling out again
ScaleInCooldown: 300   → Wait 300s after scaling in before scaling in again

Why asymmetric? Scale out fast (handle traffic), scale in slowly (avoid flapping).

8. Rolling Deployments

When you update a service (new image version), ECS performs a rolling deployment:

Timeline of a rolling deployment (desired: 3, min: 100%, max: 200%; simplified to one replacement at a time, though with 200% headroom ECS may start all three new tasks in parallel):

Time 0:  [v1] [v1] [v1]           ← 3 old tasks running
Time 1:  [v1] [v1] [v1] [v2]     ← 1 new task starting (max 200% = 6)
Time 2:  [v1] [v1] [v1] [v2]     ← v2 task passes health check
Time 3:  [v1] [v1] [v2]          ← 1 old task drained and stopped
Time 4:  [v1] [v1] [v2] [v2]     ← another new task starting
Time 5:  [v1] [v2] [v2]          ← another old task stopped
Time 6:  [v1] [v2] [v2] [v2]     ← last new task starting
Time 7:  [v2] [v2] [v2]          ← all old tasks stopped — deployment complete
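The bounds in that timeline follow directly from the deployment percentages: ECS keeps the running count between ceil(desired * min% / 100) and floor(desired * max% / 100). A quick helper to compute the window (a convenience sketch, not an AWS API):

```shell
# Running-task window during a rolling deployment:
#   low  = ceil(desired * minimumHealthyPercent / 100)
#   high = floor(desired * maximumPercent / 100)
deployment_bounds() {
  desired=$1; min_pct=$2; max_pct=$3
  low=$(( (desired * min_pct + 99) / 100 ))   # ceiling division
  high=$(( desired * max_pct / 100 ))         # floor division
  echo "$low-$high"
}

deployment_bounds 3 100 200   # → 3-6: start up to 3 extra tasks, stop none early
deployment_bounds 4 50 150    # → 2-6: may stop 2 old tasks before starting new ones
```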

Triggering a deployment

# Update the service to use a new task definition revision
aws ecs update-service \
  --cluster production \
  --service user-service \
  --task-definition user-service:4 \
  --region us-east-1

# Monitor the deployment
aws ecs describe-services \
  --cluster production \
  --services user-service \
  --query 'services[0].deployments' \
  --region us-east-1

Deployment circuit breaker

ECS can automatically roll back a failed deployment:

aws ecs update-service \
  --cluster production \
  --service user-service \
  --deployment-configuration '{
    "minimumHealthyPercent": 100,
    "maximumPercent": 200,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }'

If the new tasks keep failing health checks, ECS automatically reverts to the previous working version.


9. ECS Exec (Debugging Running Containers)

Since Fargate has no SSH access, ECS Exec gives you a shell into a running container:

# Enable ECS Exec on the service (only tasks launched after this change
# get the capability, so force a new deployment to replace existing tasks)
aws ecs update-service \
  --cluster production \
  --service user-service \
  --enable-execute-command \
  --force-new-deployment

# Get a task ID
TASK_ID=$(aws ecs list-tasks \
  --cluster production \
  --service-name user-service \
  --query 'taskArns[0]' \
  --output text)

# Open a shell in the container
aws ecs execute-command \
  --cluster production \
  --task "$TASK_ID" \
  --container user-service \
  --interactive \
  --command "/bin/sh"

Warning: ECS Exec relies on the SSM agent; Fargate platform version 1.4.0+ injects it automatically. The task role must have ssmmessages permissions.
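The ssmmessages permissions boil down to four actions. A minimal policy statement for the task role (this is the standard set from the ECS Exec documentation; attach it to the role in taskRoleArn, not the execution role):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
```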


10. Complete Task Definitions for Microservices

Node.js API Service (lightweight)

{
  "family": "api-gateway",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiGatewayTaskRole",
  "containerDefinitions": [
    {
      "name": "api-gateway",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api-gateway:v2.0.0",
      "essential": true,
      "portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
      "environment": [
        { "name": "NODE_ENV", "value": "production" },
        { "name": "USER_SERVICE_URL", "value": "http://user-service.local:3000" },
        { "name": "ORDER_SERVICE_URL", "value": "http://order-service.local:3000" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-gateway",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Worker Service (no port mapping, higher memory)

{
  "family": "email-worker",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/emailWorkerTaskRole",
  "containerDefinitions": [
    {
      "name": "email-worker",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/email-worker:v1.5.0",
      "essential": true,
      "environment": [
        { "name": "NODE_ENV", "value": "production" },
        { "name": "SQS_QUEUE_URL", "value": "https://sqs.us-east-1.amazonaws.com/123456789012/email-queue" }
      ],
      "secrets": [
        {
          "name": "SMTP_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:email-worker/smtp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/email-worker",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

11. Key Takeaways

  1. ECS hierarchy: Cluster → Service → Task → Container(s). The task definition is the blueprint.
  2. Fargate is the default choice — no servers to manage, pay per second of task runtime.
  3. Task definitions are versioned — every update creates a new revision; roll back by pointing to an older revision.
  4. Services maintain desired count — if a task dies, ECS replaces it automatically.
  5. Rolling deployments ensure zero downtime — new tasks start before old tasks stop.
  6. Deployment circuit breakers auto-rollback failed deployments.
  7. Auto-scaling adjusts task count based on CPU, memory, or ALB request count.
  8. Secrets belong in Secrets Manager or SSM — never put them in the task definition environment block.

Explain-It Challenge

  1. Your ECS service has desired-count: 3 but only 2 tasks are running. List all possible reasons and how you would diagnose each.
  2. Explain the difference between executionRoleArn and taskRoleArn to a teammate who has never used ECS.
  3. A deployment has been stuck for 20 minutes — new tasks keep starting and stopping. What is happening and how do you fix it?

Navigation: ← 6.3.a — ECR and Container Images · 6.3.c — Application Load Balancer →