Episode 6 — Scaling, Reliability & Microservices (Web3) / 6.3 — AWS Cloud-Native Deployment

6.3 — Interview Questions: AWS Cloud-Native Deployment

Model answers for ECR, ECS/Fargate, Application Load Balancer, and VPC/IAM.

How to use this material (instructions)

  1. Read lessons in order: README.md, then 6.3.a–6.3.d.
  2. Practice out loud — definition → example → pitfall.
  3. Pair with exercises: 6.3-Exercise-Questions.md.
  4. Quick review: 6.3-Quick-Revision.md.

Beginner (Q1–Q4)

Q1. What is Amazon ECS and how does it relate to Docker?

Why interviewers ask: Tests foundational understanding of container orchestration on AWS — ensures you know why ECS exists beyond just running docker run on a server.

Model answer:

Amazon ECS (Elastic Container Service) is a container orchestration service that manages running Docker containers at scale. While Docker packages your application into a container image and can run it on a single machine, ECS solves the production problem: how do you run dozens or hundreds of containers, keep them healthy, scale them, and deploy new versions without downtime?

ECS manages the complete lifecycle: it pulls Docker images from ECR (AWS's managed Docker registry), schedules containers across infrastructure, monitors their health, restarts failures, performs rolling deployments, and integrates with load balancers and auto-scaling. The key building blocks are clusters (logical grouping), task definitions (blueprints describing how to run a container), tasks (running instances), and services (maintain a desired number of healthy tasks).

ECS supports two launch types: Fargate (serverless — AWS manages the infrastructure) and EC2 (you manage the underlying instances). For most teams, Fargate is the default choice because it eliminates server management entirely.
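The building blocks above can be sketched as a minimal Fargate task definition registered via the AWS CLI. Everything here — the service name, account ID, region, role ARN, port, and log group — is a placeholder for illustration:

```shell
# Register a minimal Fargate task definition (all names/IDs are hypothetical).
aws ecs register-task-definition --cli-input-json '{
  "family": "my-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [{
    "name": "my-service",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:v1.0",
    "essential": true,
    "portMappings": [{ "containerPort": 3000 }],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/my-service",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]
}'
```

A service then references this task definition by family name and keeps the desired number of tasks running from it.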


Q2. What is the difference between Fargate and EC2 launch types?

Why interviewers ask: A common decision point in AWS architecture — tests whether you understand the trade-offs and can recommend the right approach.

Model answer:

Fargate is the serverless launch type — you define CPU and memory for your task, and AWS provisions, manages, and scales the underlying infrastructure. You never see or manage any servers. You pay per second of task runtime (CPU + memory). EC2 requires you to provision and manage a fleet of EC2 instances. The ECS agent runs on each instance and places your tasks on available capacity.

The key trade-offs: Fargate is simpler (no patching, no capacity planning, no AMI updates) but more expensive per unit and limited to 16 vCPU / 120 GB per task. EC2 is more complex but cheaper at scale with Reserved or Spot instances, supports GPUs, and allows larger instances (up to 96+ vCPU). With EC2 you must also scale the instances themselves, not just the tasks.

My recommendation: start with Fargate unless you need GPUs, very large instances, or are running predictable high-volume workloads where Reserved Instance pricing makes EC2 significantly cheaper. The operational simplicity of Fargate is worth the premium for most teams.
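The launch-type choice surfaces as a single flag when creating the service — the task definition and everything downstream stay the same. A sketch with placeholder cluster, subnet, and security group IDs:

```shell
# Create a Fargate service (all IDs are placeholders).
aws ecs create-service \
  --cluster production \
  --service-name my-service \
  --task-definition my-service:1 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration \
    "awsvpcConfiguration={subnets=[subnet-0aa1,subnet-0bb2],securityGroups=[sg-0cc3],assignPublicIp=DISABLED}"
# The same call with --launch-type EC2 places tasks on your own registered
# container instances instead — provided the cluster has EC2 capacity.
```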


Q3. How do you push a Docker image to ECR?

Why interviewers ask: Tests hands-on CLI experience with the ECR workflow — a daily task in containerized deployments.

Model answer:

The ECR push workflow has five steps:

1. Create the repository (one-time): aws ecr create-repository --repository-name my-service

2. Build the image: docker build -t my-service:v1.0 .

3. Tag for ECR: docker tag my-service:v1.0 <account-id>.dkr.ecr.<region>.amazonaws.com/my-service:v1.0

4. Authenticate Docker with ECR: aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com — this returns a 12-hour temporary token.

5. Push: docker push <account-id>.dkr.ecr.<region>.amazonaws.com/my-service:v1.0

In practice, I always tag with both a version number (v1.0) and a git SHA so I can trace exactly which commit produced each image. I also enable scanOnPush on the repository to automatically scan for vulnerabilities, and set lifecycle policies to clean up old images and control storage costs.
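The five steps, plus the version-and-SHA tagging and hygiene settings described above, can be sketched as one script. Account ID, region, and repository name are placeholders:

```shell
# One-time repository setup with vulnerability scanning on push.
aws ecr create-repository \
  --repository-name my-service \
  --image-scanning-configuration scanOnPush=true

# Build, tag with both a version and the git SHA, authenticate, push.
REGISTRY=123456789012.dkr.ecr.us-east-1.amazonaws.com   # placeholder
SHA=$(git rev-parse --short HEAD)
docker build -t my-service:v1.0 .
docker tag my-service:v1.0 "$REGISTRY/my-service:v1.0"
docker tag my-service:v1.0 "$REGISTRY/my-service:$SHA"
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "$REGISTRY"
docker push "$REGISTRY/my-service:v1.0"
docker push "$REGISTRY/my-service:$SHA"

# Lifecycle policy: expire untagged images after 14 days to control storage.
aws ecr put-lifecycle-policy --repository-name my-service \
  --lifecycle-policy-text '{"rules":[{"rulePriority":1,
    "selection":{"tagStatus":"untagged","countType":"sinceImagePushed",
                 "countUnit":"days","countNumber":14},
    "action":{"type":"expire"}}]}'
```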


Q4. What is an Application Load Balancer and why do microservices need one?

Why interviewers ask: Ensures you understand Layer 7 routing and why it's essential in a microservices architecture.

Model answer:

An Application Load Balancer (ALB) is a Layer 7 (HTTP/HTTPS) load balancer that distributes incoming requests across multiple backend targets (ECS tasks). It provides three critical capabilities for microservices:

1. Path-based routing — a single ALB can route /api/users/* to the user-service, /api/orders/* to the order-service, and /api/payments/* to the payment-service. Clients interact with one domain and one HTTPS certificate.

2. Health checking — the ALB continuously verifies that each backend task is healthy (via a /health endpoint). Unhealthy tasks are automatically removed from rotation, and ECS replaces them.

3. HTTPS termination — the ALB handles SSL/TLS encryption using certificates from ACM (free, auto-renewing). Backend traffic between the ALB and ECS tasks is plain HTTP over the private network, which simplifies container configuration.

Without an ALB, you would need to expose each microservice directly to the internet (security risk), handle SSL certificates per container (operational burden), and build your own health checking and traffic distribution (reinventing the wheel).
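Path-based routing is configured as one listener rule per service on the shared HTTPS listener. A sketch — the listener and target group ARNs are placeholders:

```shell
# Route /api/users/* to the user-service target group (ARNs are placeholders).
aws elbv2 create-rule \
  --listener-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/main-alb/abc/def \
  --priority 10 \
  --conditions Field=path-pattern,Values='/api/users/*' \
  --actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/user-tg/123
```

Repeating the rule with different priorities and path patterns fans the remaining paths out to the other services.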


Intermediate (Q5–Q8)

Q5. Walk me through how a new version of a service gets deployed on ECS without downtime.

Why interviewers ask: Tests understanding of rolling deployments, health checks, and the interaction between ECS, ALB, and target groups.

Model answer:

A zero-downtime deployment on ECS with Fargate follows this sequence:

1. Push new image to ECR (my-service:v2.0).

2. Register new task definition revision that references the new image tag.

3. Update the ECS service to use the new task definition: aws ecs update-service --cluster production --service my-service --task-definition my-service:5.

4. ECS starts a rolling deployment. With minimumHealthyPercent=100 and maximumPercent=200, and a desired count of 3, ECS first launches new v2 tasks (up to 6 total). It does not stop any old tasks yet.

5. New tasks register with the ALB target group. The ALB begins health checks.

6. Once a new task passes the healthy threshold (e.g., 2 consecutive 200 responses), the ALB starts routing traffic to it. ECS then drains and stops one old v1 task (connection draining lets in-flight requests complete).

7. This cycle repeats until all 3 old tasks are replaced with new tasks.

If the new tasks keep failing health checks, the deployment circuit breaker (if enabled) automatically rolls back to the previous task definition. The key timing dependency: health check interval × healthy threshold sets the minimum time before a new task starts receiving traffic (e.g., a 15-second interval with a healthy threshold of 2 means at least 30 seconds).
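The deployment parameters described above map onto a single update-service call. A sketch with placeholder cluster and service names:

```shell
# Trigger a rolling deployment with the settings from steps 3-4, plus the
# circuit breaker with automatic rollback (names are placeholders).
aws ecs update-service \
  --cluster production \
  --service my-service \
  --task-definition my-service:5 \
  --deployment-configuration \
    'maximumPercent=200,minimumHealthyPercent=100,deploymentCircuitBreaker={enable=true,rollback=true}'
```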


Q6. Explain the difference between the ECS Task Execution Role and the Task Role.

Why interviewers ask: One of the most commonly confused ECS concepts — getting this wrong causes deployment failures and security issues.

Model answer:

These are two distinct IAM roles with completely different purposes:

The Execution Role (executionRoleArn) is used by the ECS agent — the AWS infrastructure that manages your task. It operates before and around your application. It needs permissions to: pull container images from ECR, write logs to CloudWatch, and fetch secrets from Secrets Manager or SSM Parameter Store. This role is typically shared across services because every task needs to pull images and write logs.

The Task Role (taskRoleArn) is used by your application code running inside the container. The AWS SDK in your Node.js application automatically assumes this role. It needs permissions specific to your service's business logic: read from S3, send messages to SQS, write to DynamoDB. Each microservice should have its own task role with only the permissions it needs (least privilege).

An analogy: the execution role is like a building janitor who opens doors and turns on lights (infrastructure). The task role is like an employee who does the actual work (business logic). If you mix them up — say, giving your application the execution role — your app gains unnecessary permissions to pull ECR images and access secrets for other services, violating least privilege.
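In the task definition the two roles are simply two separate fields, and the task role carries a least-privilege policy. A sketch for a hypothetical service that only reads one S3 bucket and sends to one SQS queue (all ARNs and names are placeholders):

```shell
# In the task definition JSON the roles are distinct fields:
#   "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
#   "taskRoleArn":      "arn:aws:iam::123456789012:role/my-service-task-role",

# Least-privilege inline policy for the task role (placeholder resources).
aws iam put-role-policy \
  --role-name my-service-task-role \
  --policy-name my-service-permissions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      { "Effect": "Allow", "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-service-assets/*" },
      { "Effect": "Allow", "Action": "sqs:SendMessage",
        "Resource": "arn:aws:sqs:us-east-1:123456789012:my-service-events" }
    ]
  }'
```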


Q7. Explain VPC subnets and why ECS tasks should run in private subnets.

Why interviewers ask: Tests understanding of network security architecture — a fundamental AWS skill.

Model answer:

A VPC is divided into subnets, each in a specific Availability Zone. Subnets are classified by their routing:

Public subnets have a route to an Internet Gateway — resources in them can have public IPs and be reached from the internet. We place the ALB and NAT Gateway here.

Private subnets have no direct internet route. Resources can only communicate within the VPC and (via NAT Gateway) make outbound connections. We place ECS tasks here.

ECS tasks should run in private subnets for defense in depth. Even if a security group is misconfigured, a task in a private subnet simply cannot be reached from the internet because there is no network path. Traffic must flow through the ALB, which acts as a controlled entry point with its own security group.

Private subnets still need outbound internet access (to pull images from ECR, call external APIs). A NAT Gateway in a public subnet provides this — it allows outbound connections but blocks all inbound connections from the internet. For even better security and cost savings, VPC Endpoints allow private subnets to reach AWS services (ECR, S3, CloudWatch) without traversing the internet at all.
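The public/private distinction comes down to one route in each subnet's route table. A sketch with placeholder resource IDs:

```shell
# Public subnet: default route to the Internet Gateway (IDs are placeholders).
aws ec2 create-route --route-table-id rtb-public-0aa1 \
  --destination-cidr-block 0.0.0.0/0 --gateway-id igw-0bb2

# Private subnet: default route to the NAT Gateway — outbound works,
# but nothing on the internet can initiate a connection inward.
aws ec2 create-route --route-table-id rtb-private-0cc3 \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0dd4
```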


Q8. How would you configure auto-scaling for an ECS service?

Why interviewers ask: Tests ability to design for variable traffic — a practical skill for any production workload.

Model answer:

ECS integrates with Application Auto Scaling to adjust the number of running tasks. The setup has three parts:

1. Register scalable target: Define the service and set min/max bounds (e.g., min 2, max 20 tasks).

2. Create scaling policies. I use target tracking as the default strategy — you set a target metric value and AWS handles the math. Common metrics: ECSServiceAverageCPUUtilization (target 70%), ECSServiceAverageMemoryUtilization, or ALBRequestCountPerTarget (e.g., 1000 requests per task).

3. Configure cooldown periods. Asymmetric cooldowns are key: short ScaleOutCooldown (60 seconds — respond to traffic spikes quickly) and longer ScaleInCooldown (300 seconds — avoid flapping by waiting before scaling down).

In practice, I often set two policies: one on CPU utilization and one on ALB request count. The service scales out when either metric triggers and scales in only when both are below threshold. I also consider scheduled scaling for predictable patterns — scale up before known traffic peaks (like 9 AM on weekdays) rather than reacting after latency spikes.

The combination of health checks, rolling deployments, and auto-scaling gives you a self-healing, self-adjusting system that can handle anything from a quiet night to a viral traffic spike.
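The three setup steps above can be sketched as two CLI calls, with cluster and service names as placeholders:

```shell
# 1. Register the service as a scalable target with min/max bounds.
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/production/my-service \
  --min-capacity 2 --max-capacity 20

# 2 + 3. Target-tracking policy on CPU with asymmetric cooldowns.
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/production/my-service \
  --policy-name cpu-70 \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ECSServiceAverageCPUUtilization" },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'
```

A second put-scaling-policy call with ALBRequestCountPerTarget adds the request-count dimension described above.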


Advanced (Q9–Q11)

Q9. Design the complete AWS infrastructure for deploying a microservices application with 3 services.

Why interviewers ask: Tests end-to-end architecture knowledge — can you design a production system, not just explain individual components?

Model answer:

For three services (user-service, order-service, payment-service), here is the complete architecture:

Networking (VPC): Custom VPC with a /16 CIDR (65K IPs). Two Availability Zones for high availability. In each AZ: one public subnet (ALB + NAT Gateway), one private subnet (ECS tasks), one data subnet (RDS + ElastiCache). Internet Gateway for public subnets. NAT Gateways (one per AZ for HA) for private subnet outbound access. VPC Endpoints for ECR, S3, CloudWatch Logs, Secrets Manager to reduce NAT costs and improve security.

Container Registry (ECR): Three repositories (one per service). Lifecycle policies to keep only the last 10 tagged images. Scan-on-push enabled.

Compute (ECS + Fargate): One ECS cluster. Three services, each with its own task definition. Fargate launch type. Each service has auto-scaling (min 2, max based on projected traffic). Deployment circuit breakers with automatic rollback.

Load Balancing (ALB): One internet-facing ALB in public subnets. HTTPS listener on port 443 with ACM certificate. HTTP listener redirects to HTTPS. Path-based routing rules: /api/users/* → user-tg, /api/orders/* → order-tg, /api/payments/* → payment-tg. Health checks on /health with 15s interval, 2/3 healthy/unhealthy thresholds.

Security (IAM + Security Groups): Security group chain: ALB (443 from internet) → ECS tasks (3000 from ALB SG only) → RDS (5432 from ECS SG only). Shared execution role for ECR pulls and logging. Separate task roles per service (user-service can access user-photos S3 bucket, order-service can access orders DynamoDB table, payment-service can call payment provider APIs). All secrets in Secrets Manager, injected via task definition.

Observability: CloudWatch Logs via awslogs log driver. CloudWatch alarms on 5xx rates, latency p99, and CPU utilization. ALB access logs to S3.
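The security group chain in this design can be sketched as three ingress rules, each admitting traffic only from the previous layer's security group rather than from CIDR ranges. Group IDs are placeholders:

```shell
# ALB accepts HTTPS from the internet (IDs are placeholders).
aws ec2 authorize-security-group-ingress \
  --group-id sg-alb --protocol tcp --port 443 --cidr 0.0.0.0/0

# ECS tasks accept port 3000 only from the ALB's security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-tasks --protocol tcp --port 3000 --source-group sg-alb

# RDS accepts port 5432 only from the tasks' security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-rds --protocol tcp --port 5432 --source-group sg-tasks
```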


Q10. An ECS deployment is failing — new tasks start but are immediately killed. How do you troubleshoot?

Why interviewers ask: Tests real-world debugging skills — this is the most common ECS deployment issue.

Model answer:

I follow a systematic approach, checking the most common causes first:

1. Check stopped task reason: aws ecs describe-tasks --cluster production --tasks <task-arn> --query 'tasks[0].stoppedReason'. Common reasons: "Essential container in task exited" (app crashed), "CannotPullContainerError" (ECR access, networking, or a nonexistent image tag), "ResourceNotFoundException" (a referenced resource such as a secret or log group doesn't exist).

2. Check container logs: Look at CloudWatch Logs for the task's log stream. If the app crashes on startup (missing env var, DB connection failure, port conflict), the error will be here. No logs at all? The container never started — check the execution role and ECR access.

3. Check health checks: If the task starts but the ALB health check fails, the task is marked unhealthy and ECS replaces it in a loop. Verify: Is /health returning 200? Is the container port correct in the target group? Is the health check interval long enough for the app to start (check startPeriod in the task health check)?

4. Check networking: In Fargate with awsvpc, each task gets its own ENI. If the subnet has no more IPs, tasks cannot launch. Check: Does the security group allow outbound on 443 (for Secrets Manager, ECR)? Does the private subnet have a NAT Gateway route? Are VPC Endpoints configured correctly?

5. Check resource limits: If the task is OOM-killed, increase memory. If CPU is maxed during startup, increase CPU. CloudWatch metrics will show task-level CPU/memory utilization.

6. Check secrets and parameters: If a secret referenced in the task definition doesn't exist or the execution role lacks secretsmanager:GetSecretValue permission, the task fails before the container even starts.

The deployment circuit breaker will auto-roll back after enough failures, but fixing the root cause requires this systematic investigation.
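A first-pass triage sketch covering steps 1–3; the cluster and service names are placeholders, and the task ARN stays whatever describe-services or the console reports:

```shell
# Why did the task stop, and with what exit code? (names are placeholders)
aws ecs describe-tasks --cluster production --tasks <task-arn> \
  --query 'tasks[0].{reason:stoppedReason,code:containers[0].exitCode}'

# Recent service events often name the failure directly
# (failed health checks, ENI/IP exhaustion, placement errors).
aws ecs describe-services --cluster production --services my-service \
  --query 'services[0].events[:5]'

# Application logs for startup crashes.
aws logs tail /ecs/my-service --since 15m
```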


Q11. How would you secure the communication between microservices running on ECS?

Why interviewers ask: Tests deep understanding of security beyond the basics — production systems need defense at multiple layers.

Model answer:

Security between microservices on ECS requires defense in depth across multiple layers:

Network layer: All services run in private subnets with no internet ingress. Security groups restrict which services can talk to which — the payment-service SG might only allow inbound from the order-service SG, not from the user-service. This prevents lateral movement if one service is compromised.

Service-to-service authentication: In a zero-trust architecture, network isolation alone is insufficient. Options include: AWS Cloud Map for service discovery combined with mutual TLS (mTLS) where each service presents a certificate. Alternatively, use AWS App Mesh (service mesh) which handles mTLS automatically between Envoy sidecars, plus provides traffic management, observability, and circuit breaking.

IAM authorization: Each service has its own task role. Even if a service can reach another service's endpoint over the network, it cannot impersonate that service or access its AWS resources. For AWS-service-to-service calls, IAM policies ensure least privilege.

Secrets management: All credentials (database passwords, API keys, JWT secrets) are stored in Secrets Manager or SSM Parameter Store, injected into containers at startup via the execution role. Secrets are never baked into images or task definition environment variables. Rotation is enabled where possible.

Runtime security: ECS runtime monitoring (via GuardDuty) detects anomalous behavior like cryptocurrency mining, reverse shells, or unusual API calls. Container images are scanned on push via ECR/Inspector.

Logging and audit: All inter-service traffic is logged. AWS CloudTrail records all IAM role assumptions. CloudWatch Logs capture application-level communication. In a breach investigation, you can trace exactly which service called which, when, and with what permissions.

The layered approach means that even if one control fails (e.g., a security group is misconfigured), other controls (mTLS, IAM, runtime monitoring) still protect the system.
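The lateral-movement restriction described above is a single security group rule: the payment-service admits traffic only from the order-service's security group, so a compromised user-service has no network path to it. A sketch with placeholder IDs and port:

```shell
# payment-service accepts inbound only from order-service's security group
# (group IDs and port are placeholders).
aws ec2 authorize-security-group-ingress \
  --group-id sg-payment --protocol tcp --port 3000 --source-group sg-order
```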


Quick-fire

# | Question                                  | One-line answer
1 | ECR authentication token lasts how long?  | 12 hours
2 | Fargate tasks use which network mode?     | awsvpc — each task gets its own ENI
3 | ALB operates at which OSI layer?          | Layer 7 (HTTP/HTTPS)
4 | Target type for Fargate target groups?    | ip (not instance)
5 | What does a NAT Gateway do?               | Allows outbound-only internet access from private subnets

← Back to 6.3 — AWS Cloud-Native Deployment (README)