Episode 6 — Scaling Reliability Microservices Web3 / 6.3 — AWS Cloud Native Deployment

6.3.b — ECS and Fargate

In one sentence: Amazon ECS (Elastic Container Service) is AWS's container orchestrator that manages running, scaling, and deploying your containers, and Fargate is the serverless launch type that lets you run containers without provisioning or managing any EC2 instances.

Navigation: ← 6.3.a — ECR and Container Images · 6.3.c — Application Load Balancer →


1. What Is Amazon ECS?

Amazon Elastic Container Service (ECS) is a container orchestration service. It answers the question: "I have Docker images in ECR — now how do I actually run them, keep them running, and scale them?"

ECS handles:

  • Scheduling — deciding where and when to run containers
  • Health monitoring — restarting containers that fail
  • Scaling — adding/removing containers based on demand
  • Deployment — rolling out new versions without downtime
  • Service discovery — letting containers find each other

Think of ECS as a production-grade docker-compose that runs in the cloud, auto-heals, auto-scales, and integrates with the entire AWS ecosystem.
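If the analogy helps, here is how a minimal compose file maps onto ECS vocabulary. This is a rough correspondence for orientation only, not something ECS imports:

```yaml
# docker-compose term      → ECS term (rough mapping)
#   services.<name>        → ECS service + task definition family
#   image                  → containerDefinitions[].image
#   ports                  → portMappings
#   environment            → environment (or secrets, for sensitive values)
#   deploy.replicas        → service desiredCount
services:
  user-service:
    image: user-service:latest
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: production
    deploy:
      replicas: 3
```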


2. ECS Core Concepts

┌─────────────────────────────────────────────────────────────────┐
│                        ECS HIERARCHY                             │
│                                                                  │
│  Cluster                                                         │
│  └── Service                                                     │
│      ├── Task (running instance)                                 │
│      │   └── Container(s)                                        │
│      ├── Task (running instance)                                 │
│      │   └── Container(s)                                        │
│      └── Task (running instance)                                 │
│          └── Container(s)                                        │
│                                                                  │
│  Task Definition (blueprint — like a docker-compose.yml)         │
│  └── Container Definition(s)                                     │
│      ├── Image, CPU, Memory                                      │
│      ├── Port mappings                                           │
│      ├── Environment variables                                   │
│      └── Log configuration                                       │
└─────────────────────────────────────────────────────────────────┘

Cluster

A cluster is a logical grouping of resources. It is the top-level container for your services. You typically have one cluster per environment (dev, staging, production) or one cluster per application.

# Create a cluster
aws ecs create-cluster --cluster-name production --region us-east-1

Task Definition

A task definition is a JSON blueprint that describes how to run your containers. It is like a docker-compose.yml for AWS. It defines:

  • Which image to use
  • How much CPU and memory to allocate
  • Port mappings
  • Environment variables
  • Logging configuration
  • IAM roles

Task definitions are versioned — each update creates a new revision (e.g., user-service:1, user-service:2, user-service:3).

Task

A task is a running instance of a task definition. When ECS "runs a task," it pulls the image from ECR, creates the container(s), and starts them. A task can contain one or more containers that share network and storage (similar to a Kubernetes pod).

Service

A service is a long-running, managed set of tasks. You tell ECS "I want 3 copies of the user-service task running at all times," and the service ensures exactly 3 tasks are always healthy. If one crashes, ECS launches a replacement. If you deploy a new version, the service handles rolling updates.


3. Task Definition Anatomy

Here is a complete, production-ready task definition for a Node.js microservice:

{
  "family": "user-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/userServiceTaskRole",
  "containerDefinitions": [
    {
      "name": "user-service",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 3000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        { "name": "NODE_ENV", "value": "production" },
        { "name": "PORT", "value": "3000" },
        { "name": "LOG_LEVEL", "value": "info" }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:user-service/db-url"
        },
        {
          "name": "JWT_SECRET",
          "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/user-service/jwt-secret"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/user-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Field-by-field breakdown

| Field | Purpose |
|-------|---------|
| family | Name of the task definition (groups revisions together) |
| networkMode: "awsvpc" | Each task gets its own ENI (elastic network interface) — required for Fargate |
| requiresCompatibilities | ["FARGATE"] or ["EC2"] or both |
| cpu / memory | Resource allocation (Fargate has specific valid combinations) |
| executionRoleArn | IAM role ECS uses to pull images and write logs (see 6.3.d) |
| taskRoleArn | IAM role your application code uses to call AWS services (see 6.3.d) |
| image | Full ECR URI with tag |
| essential | If true, the task stops when this container stops |
| portMappings | Which container port to expose |
| environment | Plain-text env vars |
| secrets | Env vars pulled from Secrets Manager or SSM Parameter Store |
| logConfiguration | Send logs to CloudWatch Logs |
| healthCheck | Container-level health check command |

Fargate CPU/Memory Valid Combinations

Fargate only allows specific CPU/memory combinations:

| CPU (units / vCPU) | Memory options (MiB) |
|--------------------|----------------------|
| 256 (0.25) | 512, 1024, 2048 |
| 512 (0.5) | 1024, 2048, 3072, 4096 |
| 1024 (1) | 2048, 3072, 4096, 5120, 6144, 7168, 8192 |
| 2048 (2) | 4096 through 16384, in 1024 increments |
| 4096 (4) | 8192 through 30720, in 1024 increments |
| 8192 (8) | 16384 through 61440, in 4096 increments |
| 16384 (16) | 32768 through 122880, in 8192 increments |
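An invalid pair is rejected only when you register the task definition, so a local sanity check can save a round trip. A minimal sketch that encodes the table above as a shell function (the authoritative validation is register-task-definition itself):

```shell
# Validate a Fargate CPU/memory pair (CPU units, memory in MiB),
# mirroring the combination table above. Returns 0 if the pair is valid.
fargate_combo_valid() {
  cpu=$1; mem=$2
  case "$cpu" in
    256)   case "$mem" in 512|1024|2048) return 0;; esac; return 1 ;;
    512)   case "$mem" in 1024|2048|3072|4096) return 0;; esac; return 1 ;;
    1024)  lo=2048;  hi=8192;   step=1024 ;;
    2048)  lo=4096;  hi=16384;  step=1024 ;;
    4096)  lo=8192;  hi=30720;  step=1024 ;;
    8192)  lo=16384; hi=61440;  step=4096 ;;
    16384) lo=32768; hi=122880; step=8192 ;;
    *)     return 1 ;;
  esac
  # Range check plus increment check for the larger CPU sizes
  [ "$mem" -ge "$lo" ] && [ "$mem" -le "$hi" ] && [ $((mem % step)) -eq 0 ]
}

fargate_combo_valid 256 512  && echo "256/512: ok"        # valid pair
fargate_combo_valid 256 4096 || echo "256/4096: invalid"  # too much memory
```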

4. Registering a Task Definition

# Register the task definition from a JSON file
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json \
  --region us-east-1

# List all revisions of a task definition
aws ecs list-task-definitions \
  --family-prefix user-service \
  --region us-east-1

# Output:
# "taskDefinitionArns": [
#   "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:1",
#   "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:2",
#   "arn:aws:ecs:us-east-1:123456789012:task-definition/user-service:3"
# ]

5. Fargate vs EC2 Launch Types

This is one of the most important decisions when using ECS. You have two ways to run your tasks:

┌──────────────────────────────────┐  ┌──────────────────────────────────┐
│         EC2 Launch Type          │  │       Fargate Launch Type        │
│                                  │  │                                  │
│  You manage EC2 instances        │  │  AWS manages infrastructure      │
│  ┌────────────────────────┐      │  │  ┌────────────────────────┐      │
│  │ EC2 Instance           │      │  │  │ Fargate (serverless)   │      │
│  │ ┌──────┐ ┌──────┐     │      │  │  │ ┌──────┐ ┌──────┐     │      │
│  │ │Task 1│ │Task 2│     │      │  │  │ │Task 1│ │Task 2│     │      │
│  │ └──────┘ └──────┘     │      │  │  │ └──────┘ └──────┘     │      │
│  │ OS, Docker, ECS Agent │      │  │  │ No servers to manage   │      │
│  └────────────────────────┘      │  │  └────────────────────────┘      │
│  You patch, scale, monitor       │  │  AWS patches, scales, monitors   │
└──────────────────────────────────┘  └──────────────────────────────────┘

Comparison Table

| Dimension | EC2 Launch Type | Fargate Launch Type |
|-----------|-----------------|---------------------|
| Server management | You manage EC2 instances (patching, scaling) | No servers — AWS manages everything |
| Pricing | Pay for EC2 instances (even if underutilized) | Pay per task (per-second, CPU + memory) |
| Scaling | Must scale instances AND tasks | Only scale tasks (infrastructure auto-scales) |
| Startup time | ~30s (if instance capacity is available) | ~30-60s (provisions a new micro-VM) |
| Max resources | Limited by instance type (up to 96 vCPU, 384 GB) | Up to 16 vCPU, 120 GB per task |
| GPU support | Yes (P3, G4 instances) | No GPU support |
| Persistent storage | EBS volumes, instance store | Ephemeral storage (20 GB default, up to 200 GB) |
| SSH access | Yes (for debugging) | No (use ECS Exec for shell access) |
| Cost at scale | Cheaper with Reserved/Spot instances | More expensive per unit, but no waste |
| Complexity | Higher (capacity planning, AMI updates) | Lower (focus on containers, not infrastructure) |
| Best for | GPU workloads, large instances, cost optimization at scale | Most workloads, teams wanting simplicity |

When to Use Fargate

  • Default choice — start with Fargate unless you have a specific reason not to
  • Small to medium workloads
  • Teams without dedicated DevOps/infrastructure engineers
  • Variable traffic (pay only for what you use)
  • Rapid prototyping and MVP

When to Use EC2

  • GPU-dependent workloads (ML inference)
  • Need instances larger than 16 vCPU / 120 GB
  • Very high, predictable traffic (Reserved Instances save 30-60%)
  • Need SSH access to the host for debugging
  • Specialized kernel or OS requirements
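To make the cost trade-off concrete, here is a back-of-the-envelope estimator. The per-hour rates below are placeholders (roughly the us-east-1 Linux/x86 on-demand rates at one point in time); always check the current Fargate pricing page before relying on the numbers:

```shell
# Rough monthly Fargate cost for a steady service.
# RATES ARE ASSUMPTIONS, not authoritative pricing.
VCPU_RATE=0.04048   # USD per vCPU-hour (assumed)
GB_RATE=0.004445    # USD per GB-hour (assumed)

fargate_monthly_cost() {
  # args: vCPU per task, GB per task, task count, hours (default ~730/month)
  awk -v v="$1" -v m="$2" -v t="$3" -v h="${4:-730}" \
      -v vr="$VCPU_RATE" -v mr="$GB_RATE" \
      'BEGIN { printf "%.2f\n", t * h * (v * vr + m * mr) }'
}

# 3 tasks of 0.5 vCPU / 1 GB, running all month:
fargate_monthly_cost 0.5 1 3   # → 54.06
```

Compare the result against the smallest EC2 instances that could host the same tasks; the crossover point is roughly where Reserved or Spot pricing starts to win.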

6. Creating a Fargate Service

Step 1: Ensure prerequisites

# Cluster exists
aws ecs create-cluster --cluster-name production

# Task definition is registered (from Section 4)
aws ecs register-task-definition --cli-input-json file://task-definition.json

# VPC, subnets, security groups exist (see 6.3.d)
# ALB and target group exist (see 6.3.c)

Step 2: Create the service

aws ecs create-service \
  --cluster production \
  --service-name user-service \
  --task-definition user-service:3 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-0a1b2c3d4e5f", "subnet-1a2b3c4d5e6f"],
      "securityGroups": ["sg-0a1b2c3d4e5f67890"],
      "assignPublicIp": "DISABLED"
    }
  }' \
  --load-balancers '[
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/user-svc-tg/abc123",
      "containerName": "user-service",
      "containerPort": 3000
    }
  ]' \
  --deployment-configuration '{
    "minimumHealthyPercent": 100,
    "maximumPercent": 200
  }' \
  --region us-east-1

Key parameters explained

| Parameter | Purpose |
|-----------|---------|
| --desired-count 3 | Run 3 copies of the task at all times |
| --launch-type FARGATE | Use Fargate (no EC2 instances to manage) |
| awsvpcConfiguration | Place tasks in private subnets with a security group |
| assignPublicIp: DISABLED | Tasks are in private subnets, accessed via the ALB |
| --load-balancers | Register tasks with the ALB target group |
| minimumHealthyPercent: 100 | During deployment, never drop below 100% of desired count |
| maximumPercent: 200 | During deployment, allow up to 200% temporarily (rolling update) |

7. Service Auto-Scaling

ECS integrates with Application Auto Scaling to adjust the number of running tasks based on demand.

Target tracking scaling (recommended)

# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

# Scale based on CPU utilization (target: 70%)
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

# Scale based on request count per target (from ALB)
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name request-count-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/my-alb/abc123/targetgroup/user-svc-tg/def456"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
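Conceptually, target tracking sizes the fleet so the metric lands back near the target: the new desired count is roughly the current count scaled by metric/target, clamped to the registered min/max capacity. A toy illustration of that arithmetic (the real algorithm works off CloudWatch alarms and is more conservative, but this captures the intuition):

```shell
# Approximate target-tracking arithmetic: scale the desired count in
# proportion to how far the metric is from the target, then clamp.
target_tracking_desired() {
  current=$1; metric=$2; target=$3; min=$4; max=$5
  want=$(( (current * metric + target - 1) / target ))  # ceiling division
  [ "$want" -lt "$min" ] && want=$min
  [ "$want" -gt "$max" ] && want=$max
  echo "$want"
}

target_tracking_desired 3 95 70 2 20   # CPU at 95% vs 70% target → 5
target_tracking_desired 3 30 70 2 20   # CPU at 30% → 2 (clamped to min) 
```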

Scaling strategies comparison

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| Target tracking | Maintain a target metric value (e.g., 70% CPU) | Most workloads — set and forget |
| Step scaling | Add/remove N tasks when a metric crosses thresholds | Bursty traffic with known patterns |
| Scheduled scaling | Scale at specific times (e.g., scale up at 9 AM) | Predictable daily/weekly patterns |
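Scheduled scaling, the third strategy, uses put-scheduled-action. The sketch below dry-runs by default, so nothing touches AWS until RUN=1 is set; the resource names mirror the earlier examples, and the cron hours are in UTC (13:00 UTC is roughly 9 AM US Eastern, an assumption to adjust for your zone):

```shell
# Dry-run wrapper: print the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "DRY RUN: $*"; fi; }

# Raise the capacity floor for business hours...
run aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name business-hours-up \
  --schedule "cron(0 13 * * ? *)" \
  --scalable-target-action MinCapacity=5,MaxCapacity=20

# ...and lower it again overnight.
run aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production/user-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name overnight-down \
  --schedule "cron(0 1 * * ? *)" \
  --scalable-target-action MinCapacity=2,MaxCapacity=20
```

Note that scheduled actions adjust the min/max bounds; target tracking still moves the desired count within them.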

Cooldowns

ScaleOutCooldown: 60   → Wait 60s after scaling out before scaling out again
ScaleInCooldown: 300   → Wait 300s after scaling in before scaling in again

Why asymmetric? Scale out fast (handle traffic), scale in slowly (avoid flapping).

8. Rolling Deployments

When you update a service (new image version), ECS performs a rolling deployment:

Timeline of a rolling deployment (desired: 3, min: 100%, max: 200%; simplified to one replacement at a time, though with 200% headroom ECS may start all three new tasks in parallel):

Time 0:  [v1] [v1] [v1]           ← 3 old tasks running
Time 1:  [v1] [v1] [v1] [v2]     ← 1 new task starting (max 200% = 6)
Time 2:  [v1] [v1] [v1] [v2]     ← v2 task passes health check
Time 3:  [v1] [v1] [v2]          ← 1 old task drained and stopped
Time 4:  [v1] [v1] [v2] [v2]     ← another new task starting
Time 5:  [v1] [v2] [v2]          ← another old task stopped
Time 6:  [v1] [v2] [v2] [v2]     ← last new task starting
Time 7:  [v2] [v2] [v2]          ← all old tasks stopped — deployment complete
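The bounds in that timeline follow directly from the deployment percentages: ECS keeps the running count between ceil(desired * min% / 100) and floor(desired * max% / 100). A quick helper to compute the window (a convenience sketch, not an AWS API):

```shell
# Running-task window during a rolling deployment:
#   low  = ceil(desired * minimumHealthyPercent / 100)
#   high = floor(desired * maximumPercent / 100)
deployment_bounds() {
  desired=$1; min_pct=$2; max_pct=$3
  low=$(( (desired * min_pct + 99) / 100 ))   # ceiling division
  high=$(( desired * max_pct / 100 ))         # floor division
  echo "$low-$high"
}

deployment_bounds 3 100 200   # → 3-6: start up to 3 extra tasks, stop none early
deployment_bounds 4 50 150    # → 2-6: may stop 2 old tasks before starting new ones
```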

Triggering a deployment

# Update the service to use a new task definition revision
aws ecs update-service \
  --cluster production \
  --service user-service \
  --task-definition user-service:4 \
  --region us-east-1

# Monitor the deployment
aws ecs describe-services \
  --cluster production \
  --services user-service \
  --query 'services[0].deployments' \
  --region us-east-1

Deployment circuit breaker

ECS can automatically roll back a failed deployment:

aws ecs update-service \
  --cluster production \
  --service user-service \
  --deployment-configuration '{
    "minimumHealthyPercent": 100,
    "maximumPercent": 200,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }'

If the new tasks keep failing health checks, ECS automatically reverts to the previous working version.


9. ECS Exec (Debugging Running Containers)

Since Fargate has no SSH access, ECS Exec gives you a shell into a running container:

# Enable ECS Exec on the service (only tasks launched after this change
# get the capability, so force a new deployment to replace existing tasks)
aws ecs update-service \
  --cluster production \
  --service user-service \
  --enable-execute-command \
  --force-new-deployment

# Get a task ID
TASK_ID=$(aws ecs list-tasks \
  --cluster production \
  --service-name user-service \
  --query 'taskArns[0]' \
  --output text)

# Open a shell in the container
aws ecs execute-command \
  --cluster production \
  --task "$TASK_ID" \
  --container user-service \
  --interactive \
  --command "/bin/sh"

Warning: ECS Exec relies on the SSM agent; Fargate platform version 1.4.0+ injects it automatically. The task role must have ssmmessages permissions.
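The ssmmessages permissions boil down to four actions. A minimal policy statement for the task role (this is the standard set from the ECS Exec documentation; attach it to the role in taskRoleArn, not the execution role):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
```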


10. Complete Task Definitions for Microservices

Node.js API Service (lightweight)

{
  "family": "api-gateway",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiGatewayTaskRole",
  "containerDefinitions": [
    {
      "name": "api-gateway",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api-gateway:v2.0.0",
      "essential": true,
      "portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
      "environment": [
        { "name": "NODE_ENV", "value": "production" },
        { "name": "USER_SERVICE_URL", "value": "http://user-service.local:3000" },
        { "name": "ORDER_SERVICE_URL", "value": "http://order-service.local:3000" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-gateway",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Worker Service (no port mapping, higher memory)

{
  "family": "email-worker",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/emailWorkerTaskRole",
  "containerDefinitions": [
    {
      "name": "email-worker",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/email-worker:v1.5.0",
      "essential": true,
      "environment": [
        { "name": "NODE_ENV", "value": "production" },
        { "name": "SQS_QUEUE_URL", "value": "https://sqs.us-east-1.amazonaws.com/123456789012/email-queue" }
      ],
      "secrets": [
        {
          "name": "SMTP_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:email-worker/smtp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/email-worker",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

11. Key Takeaways

  1. ECS hierarchy: Cluster → Service → Task → Container(s). The task definition is the blueprint.
  2. Fargate is the default choice — no servers to manage, pay per second of task runtime.
  3. Task definitions are versioned — every update creates a new revision; roll back by pointing to an older revision.
  4. Services maintain desired count — if a task dies, ECS replaces it automatically.
  5. Rolling deployments ensure zero downtime — new tasks start before old tasks stop.
  6. Deployment circuit breakers auto-rollback failed deployments.
  7. Auto-scaling adjusts task count based on CPU, memory, or ALB request count.
  8. Secrets belong in Secrets Manager or SSM — never put them in the task definition environment block.

Explain-It Challenge

  1. Your ECS service has desired-count: 3 but only 2 tasks are running. List all possible reasons and how you would diagnose each.
  2. Explain the difference between executionRoleArn and taskRoleArn to a teammate who has never used ECS.
  3. A deployment has been stuck for 20 minutes — new tasks keep starting and stopping. What is happening and how do you fix it?

Navigation: ← 6.3.a — ECR and Container Images · 6.3.c — Application Load Balancer →