Episode 6 — Scaling Reliability Microservices Web3 / 6.9 — Final Production Deployment

6.9.a -- Docker Deployment

In one sentence: A production Docker deployment uses multi-stage builds to create tiny, secure images, runs containers as a non-root user, manages multi-service stacks with Docker Compose, and enforces health checks, resource limits, and secrets so your system is reproducible, portable, and hardened from day one.

Navigation: <-- 6.9 Overview | 6.9.b -- EC2 and SSL -->


1. Why Docker for Production?

Docker solves the "works on my machine" problem by packaging your application, its dependencies, and its runtime environment into a single image that runs identically everywhere -- your laptop, staging, and production.

Without Docker:
  "It works on my machine" --> different Node version, missing env vars, OS differences

With Docker:
  Same image everywhere --> deterministic, reproducible, portable

Key benefits for production:

Benefit                   Explanation
Reproducibility           The same image runs in dev, staging, and production
Isolation                 Each service runs in its own container with its own filesystem
Portability               Runs on any machine with Docker installed -- AWS, GCP, Azure, bare metal
Fast rollbacks            Bad deploy? Run the previous image tag -- instant rollback
Microservice enablement   Each service has its own Dockerfile, image, and deployment lifecycle

2. Production Dockerfile Best Practices

2.1 Multi-Stage Builds

Multi-stage builds separate the build environment (where you install devDependencies and compile) from the runtime environment (where you run the app). This produces dramatically smaller images.

# ============================================
# Stage 1: Build
# ============================================
FROM node:20-alpine AS builder

WORKDIR /app

# Copy package files first (cache layer)
COPY package.json package-lock.json ./

# Install ALL dependencies (including devDependencies for building)
RUN npm ci

# Copy source code
COPY . .

# Build (TypeScript compile, bundle, etc.)
RUN npm run build

# ============================================
# Stage 2: Production Runtime
# ============================================
FROM node:20-alpine AS production

WORKDIR /app

# Copy only package files
COPY package.json package-lock.json ./

# Install ONLY production dependencies
# (--omit=dev replaces the deprecated --only=production flag in npm 8+)
RUN npm ci --omit=dev && npm cache clean --force

# Copy built output from builder stage
COPY --from=builder /app/dist ./dist

# Create non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Switch to non-root user
USER appuser

# Expose the port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

# Start the application
CMD ["node", "dist/index.js"]
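One operational note on the CMD above: node runs as PID 1 in the container, so `docker stop` sends SIGTERM straight to it, and Docker force-kills the container after the grace period (10 s by default) if the process has not exited. A minimal sketch of a shutdown handler, assuming `server` is whatever http.Server your entrypoint created:

```javascript
// graceful-shutdown.js -- drain connections on SIGTERM instead of dying mid-request.
function setupGracefulShutdown(server, timeoutMs = 10000) {
  const shutdown = (signal) => {
    console.log(`${signal} received, draining connections...`);
    // Stop accepting new connections; exit cleanly once in-flight requests finish.
    server.close(() => process.exit(0));
    // Hard deadline in case draining stalls; unref() so this timer alone
    // does not keep the process alive.
    setTimeout(() => process.exit(1), timeoutMs).unref();
  };
  process.on('SIGTERM', () => shutdown('SIGTERM'));
  process.on('SIGINT', () => shutdown('SIGINT'));
}

module.exports = { setupGracefulShutdown };
```

Call `setupGracefulShutdown(server)` right after `server.listen(...)` in your entrypoint; without it, `docker stop` and rolling deploys cut off in-flight requests.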

Why multi-stage matters:

Single-stage image:  ~1.2 GB  (includes devDependencies, source, build tools)
Multi-stage image:   ~150 MB  (only runtime deps + compiled output)

Savings: ~87% smaller image
  - Faster pulls from registry
  - Smaller attack surface
  - Lower storage costs

2.2 Non-Root User

By default, Docker containers run as root. If an attacker exploits your application, they get root access inside the container. Always create and switch to a non-root user.

# Create a system group and user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Change ownership of the app directory
RUN chown -R appuser:appgroup /app

# Switch to the non-root user for all subsequent commands
USER appuser

What happens without this:

Attacker exploits Node.js vulnerability
  --> Gets shell inside container as root
  --> Can modify container filesystem
  --> Potential container escape to host

With non-root user:
  --> Gets shell as appuser (limited permissions)
  --> Cannot modify system files
  --> Cannot install packages
  --> Blast radius contained
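If you want a runtime safety net on top of the Dockerfile's USER directive, the app can detect a root UID at startup. A sketch only -- `process.getuid` exists on POSIX platforms but not on Windows, hence the guard:

```javascript
// root-check.js -- warn (or fail fast) if the container was started as root.
function isRunningAsRoot() {
  // process.getuid is undefined on Windows, so guard the call.
  return typeof process.getuid === 'function' && process.getuid() === 0;
}

if (isRunningAsRoot()) {
  // In a hardened setup you might process.exit(1) here instead of warning.
  console.warn('WARNING: running as root -- check the Dockerfile USER directive');
}

module.exports = { isRunningAsRoot };
```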

2.3 .dockerignore

A .dockerignore file prevents unnecessary files from being sent to the Docker daemon during build. This speeds up builds and prevents secrets from leaking into images.

# .dockerignore

# Dependencies (installed inside container)
node_modules

# Source control
.git
.gitignore

# IDE
.vscode
.idea
*.swp

# Environment files (NEVER bake secrets into images)
.env
.env.*

# Build artifacts
dist
coverage

# Docker files (not needed inside the image)
Dockerfile
docker-compose*.yml
.dockerignore

# Documentation
README.md
docs/

# Tests (not needed in production image)
__tests__
*.test.js
*.spec.js
jest.config.js

2.4 Layer Optimisation

Docker caches layers. Order your Dockerfile so that infrequently changing layers come first and frequently changing layers come last.

# GOOD: Package files change less often than source code
COPY package.json package-lock.json ./   # <-- Layer 1 (cached most of the time)
RUN npm ci                                # <-- Layer 2 (cached when deps don't change)
COPY . .                                  # <-- Layer 3 (invalidated on every code change)

# BAD: Every code change reinstalls all dependencies
COPY . .                                  # <-- Layer 1 (invalidated on every change)
RUN npm ci                                # <-- Layer 2 (also invalidated -- cache miss)

Layer caching mental model:

Layer 1: FROM node:20-alpine           (changes: almost never)
Layer 2: COPY package*.json            (changes: when deps change)
Layer 3: RUN npm ci                    (changes: when deps change)
Layer 4: COPY . .                      (changes: every commit)
Layer 5: RUN npm run build             (changes: every commit)

If you only changed source code:
  Layers 1-3: CACHED (fast!)
  Layers 4-5: Rebuilt (only the changed parts)

2.5 Security Scanning

Scan your images for known vulnerabilities before deploying.

# Using Docker Scout (built into Docker Desktop)
docker scout cves my-app:latest

# Using Trivy (popular open-source scanner)
trivy image my-app:latest

# Using Snyk
snyk container test my-app:latest

# In CI/CD (GitHub Actions example)
# - name: Scan image
#   uses: aquasecurity/trivy-action@master
#   with:
#     image-ref: my-app:latest
#     severity: CRITICAL,HIGH
#     exit-code: 1   # Fail the build if critical vulnerabilities found

Best practices for secure images:

Practice                        Reason
Use -alpine base images         Minimal OS, fewer CVEs
Pin exact base image versions   node:20.11.1-alpine, not node:latest
Run as non-root                 Limit blast radius
No secrets in image             Use runtime env vars or Docker secrets
Scan in CI                      Catch vulnerabilities before deploy
Update base images regularly    Patch known CVEs

3. Docker Compose for Multi-Service Development

Docker Compose lets you define and run multiple containers as a single stack. In development, it replaces the need for installing PostgreSQL, Redis, and RabbitMQ locally.

3.1 Complete Multi-Service Docker Compose

# docker-compose.yml
version: "3.9"  # obsolete in Compose v2 -- safe to omit

services:
  # =============================================
  # API Gateway
  # =============================================
  gateway:
    build:
      context: ./services/gateway
      dockerfile: Dockerfile
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - USER_SERVICE_URL=http://user-service:3001
      - ORDER_SERVICE_URL=http://order-service:3002
      - REDIS_URL=redis://redis:6379
    depends_on:
      user-service:
        condition: service_healthy
      order-service:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
        reservations:
          cpus: "0.25"
          memory: 128M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # User Service
  # =============================================
  user-service:
    build:
      context: ./services/user-service
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://appuser:secret@postgres:5432/users_db
      - RABBITMQ_URL=amqp://rabbitmq:5672
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3001/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # Order Service
  # =============================================
  order-service:
    build:
      context: ./services/order-service
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://appuser:secret@postgres:5432/orders_db
      - RABBITMQ_URL=amqp://rabbitmq:5672
      - REDIS_URL=redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3002/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # PostgreSQL
  # =============================================
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: users_db  # orders_db is created by a script in ./init-scripts
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "1.00"
          memory: 512M
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d users_db"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  # =============================================
  # Redis
  # =============================================
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes --maxmemory 128mb --maxmemory-policy allkeys-lru
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.25"
          memory: 192M
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s

  # =============================================
  # RabbitMQ
  # =============================================
  rabbitmq:
    image: rabbitmq:3-management-alpine
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: guest
      RABBITMQ_DEFAULT_PASS: guest
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s

# =============================================
# Named Volumes (data persists across restarts)
# =============================================
volumes:
  postgres_data:
  redis_data:
  rabbitmq_data:

# =============================================
# Networks
# =============================================
networks:
  backend:
    driver: bridge

4. Docker Networking

Docker provides three main network drivers for connecting containers.

4.1 Network Types

bridge    Default. Creates an isolated virtual network; containers communicate
          by service name. Use case: multi-service stacks on a single host.
host      Container shares the host's network namespace -- no network isolation.
          Use case: maximum network performance (Linux only).
overlay   Spans multiple Docker hosts; used with Docker Swarm.
          Use case: multi-host deployments.

4.2 Service Discovery with Bridge Networks

networks:
  backend:
    driver: bridge

services:
  user-service:
    networks:
      - backend
  order-service:
    networks:
      - backend

With a shared bridge network, containers can reach each other by service name:

// Inside order-service, reach user-service by name
const userRes = await fetch('http://user-service:3001/api/users/123');
// Docker DNS resolves "user-service" to the container's IP
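In application code, keep those peer hostnames out of the source by resolving them from environment variables, falling back to the Docker DNS names. A small sketch -- the variable names mirror the Compose file above:

```javascript
// service-urls.js -- resolve peer-service base URLs from the environment,
// defaulting to the Docker DNS names defined in docker-compose.yml.
function serviceUrl(envVar, fallback) {
  return process.env[envVar] || fallback;
}

const USER_SERVICE = serviceUrl('USER_SERVICE_URL', 'http://user-service:3001');
const ORDER_SERVICE = serviceUrl('ORDER_SERVICE_URL', 'http://order-service:3002');

module.exports = { USER_SERVICE, ORDER_SERVICE };
```

This keeps the same code working outside Docker (point the env vars at localhost) and inside the Compose network (fall back to service names).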

4.3 Network Isolation

# Only gateway can talk to the frontend network AND the backend network.
# Database containers are on backend only -- not reachable from outside.

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge

services:
  gateway:
    networks:
      - frontend
      - backend
  user-service:
    networks:
      - backend
  postgres:
    networks:
      - backend  # Not on frontend -- cannot be reached directly

5. Volume Management for Persistence

Containers are ephemeral -- when a container is destroyed, its filesystem is gone. Volumes persist data across container restarts and recreations.

5.1 Volume Types

Type           Syntax                                    Use Case
Named volume   postgres_data:/var/lib/postgresql/data    Production databases, persistent state
Bind mount     ./local-dir:/container-dir                Development -- live code reloading
tmpfs          tmpfs: /tmp                               Ephemeral data that should never touch disk

5.2 Volume Commands

# List all volumes
docker volume ls

# Inspect a volume (see mount point on host)
docker volume inspect postgres_data

# Remove unused volumes (CAREFUL in production!)
docker volume prune

# Back up a named volume
docker run --rm -v postgres_data:/data -v $(pwd):/backup \
  alpine tar czf /backup/postgres_backup.tar.gz /data

6. Environment Variable Management

Never hardcode secrets in your Dockerfile or Compose file. Use environment variables injected at runtime.

6.1 Methods for Injecting Environment Variables

services:
  app:
    # Method 1: Inline (OK for non-sensitive defaults)
    environment:
      - NODE_ENV=production
      - PORT=3000

    # Method 2: .env file (good for local development)
    env_file:
      - .env

    # Method 3: Docker secrets (best for production)
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt

6.2 .env File Pattern

# .env (NEVER commit this file -- add to .gitignore)
DATABASE_URL=postgresql://user:password@postgres:5432/mydb
REDIS_URL=redis://redis:6379
JWT_SECRET=super-secret-key-change-me
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI...
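Whichever injection method you use, validate at startup that the required variables are actually present, so a misconfigured container crashes immediately instead of failing mid-request. A minimal sketch -- the variable list is an example:

```javascript
// validate-env.js -- fail fast on missing configuration at boot time.
function missingEnvVars(required, env = process.env) {
  return required.filter((name) => !env[name]);
}

function assertEnv(required, env = process.env) {
  const missing = missingEnvVars(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

module.exports = { missingEnvVars, assertEnv };
```

Call `assertEnv(['DATABASE_URL', 'REDIS_URL', 'JWT_SECRET'])` at the top of your entrypoint; combined with `restart: unless-stopped`, the crash loop surfaces the missing variable in the container logs right away.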

6.3 Docker Secrets (Swarm Mode)

services:
  app:
    secrets:
      - db_password
    environment:
      - DB_PASSWORD_FILE=/run/secrets/db_password

secrets:
  db_password:
    external: true  # Created with: docker secret create db_password ./password.txt

// Reading a Docker secret in Node.js
const fs = require('fs');
const dbPassword = fs.readFileSync('/run/secrets/db_password', 'utf8').trim();

7. Docker Health Checks

Health checks tell Docker (and orchestrators like ECS) whether a container is actually healthy, not just running.

7.1 Dockerfile Health Check

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

7.2 Health Endpoint in Express

// health.js -- production health check endpoint
const express = require('express');
const router = express.Router();

// Assumed to exist elsewhere in the project:
const pool = require('./db');      // pg Pool instance (path is illustrative)
const redis = require('./redis');  // Redis client instance (path is illustrative)

router.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    status: 'ok',
    checks: {}
  };

  // Check database connection
  try {
    await pool.query('SELECT 1');
    checks.checks.database = 'ok';
  } catch (err) {
    checks.checks.database = 'fail';
    checks.status = 'degraded';
  }

  // Check Redis connection
  try {
    await redis.ping();
    checks.checks.redis = 'ok';
  } catch (err) {
    checks.checks.redis = 'fail';
    checks.status = 'degraded';
  }

  const statusCode = checks.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(checks);
});

module.exports = router;

7.3 Health Check Parameters

Parameter        Default   Meaning
--interval       30s       Time between checks
--timeout        30s       Max time for a single check to respond
--start-period   0s        Grace period for slow-starting containers
--retries        3         Consecutive failures before marking "unhealthy"

8. Container Resource Limits

Without limits, a single container can consume all host CPU and memory, starving other containers.

services:
  app:
    deploy:
      resources:
        limits:
          cpus: "0.50"     # Max 50% of one CPU core
          memory: 256M      # Max 256 MB RAM -- container killed if exceeded (OOMKilled)
        reservations:
          cpus: "0.25"     # Guaranteed 25% of one CPU core
          memory: 128M      # Guaranteed 128 MB RAM

What happens when limits are hit:

Memory limit exceeded --> Container is OOMKilled (Out of Memory Killed)
CPU limit exceeded    --> Container is throttled (slowed down, not killed)
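One Node-specific wrinkle: V8 sizes its default heap limit from system memory, and older Node versions may not account for the container's cgroup limit, so a 256 MB container can be OOMKilled under memory pressure before Node ever garbage-collects aggressively. A common mitigation is capping the heap at roughly 75% of the container limit (the ratio is a rule of thumb, not a Node requirement):

```javascript
// heap-size.js -- derive a --max-old-space-size value from a container memory limit.
// e.g. for a 256M container: CMD ["node", "--max-old-space-size=192", "dist/index.js"]
function heapSizeMb(containerLimitMb, ratio = 0.75) {
  // Leave headroom for buffers, native memory, and the code itself.
  return Math.floor(containerLimitMb * ratio);
}

console.log(heapSizeMb(256)); // 192
```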

Sizing guidance for Node.js:

Service Type               CPU Limit   Memory Limit
API Gateway                0.50        256 MB
Microservice (stateless)   0.50        256 MB
Worker / Background job    1.00        512 MB
PostgreSQL                 1.00        512 MB - 1 GB
Redis (cache)              0.25        128 - 256 MB
RabbitMQ                   0.50        256 MB

9. Production vs Development Docker Configs

Use separate Compose files or override files for different environments.

9.1 Development Override

# docker-compose.override.yml (auto-loaded in development)
services:
  gateway:
    build:
      target: builder   # Use the builder stage (includes devDependencies)
    volumes:
      - ./services/gateway/src:/app/src  # Live code reloading
    environment:
      - NODE_ENV=development
      - LOG_LEVEL=debug
    command: npm run dev  # Nodemon for auto-restart

  postgres:
    ports:
      - "5432:5432"  # Expose to host for local tools (pgAdmin, DBeaver)

9.2 Production Compose

# docker-compose.prod.yml
services:
  gateway:
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:${IMAGE_TAG}
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    # No volume mounts -- use the built image
    # No port exposure except through reverse proxy

  postgres:
    # Note: Compose merges list values across files, so a plain empty list does
    # NOT remove the host port mapping from docker-compose.yml. Recent Compose
    # versions support the !reset YAML tag to clear it:
    ports: !reset []

9.3 Running with Environment-Specific Configs

# Development (uses docker-compose.yml + docker-compose.override.yml automatically)
docker compose up

# Production (explicit file, no override)
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Staging
docker compose -f docker-compose.yml -f docker-compose.staging.yml up -d

10. Complete Production Workflow

# Step 1: Build all images
docker compose -f docker-compose.yml -f docker-compose.prod.yml build

# Step 2: Run security scan
trivy image gateway:latest
trivy image user-service:latest

# Step 3: Tag images for registry
docker tag gateway:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:v1.2.3
docker tag user-service:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3

# Step 4: Push to registry
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:v1.2.3
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3

# Step 5: Deploy (pull and restart on production server)
ssh production-server "cd /app && docker compose -f docker-compose.prod.yml pull && docker compose -f docker-compose.prod.yml up -d"

# Step 6: Verify health
curl https://api.example.com/health

11. Key Takeaways

  1. Multi-stage builds cut image size by 80%+ and remove build tools from the runtime.
  2. Always run as a non-root user -- one line in the Dockerfile drastically reduces attack surface.
  3. .dockerignore prevents secrets and unnecessary files from entering the image.
  4. Layer ordering matters -- put COPY package*.json and RUN npm ci before COPY . . to maximise cache hits.
  5. Health checks let orchestrators know when a container is actually ready, not just running.
  6. Resource limits prevent one runaway container from killing the host.
  7. Separate dev and prod configs -- dev uses bind mounts and debug logging; prod uses built images and structured logging.

Explain-It Challenge

  1. A junior developer asks "why is our Docker image 1.2 GB?" Walk them through converting to a multi-stage build.
  2. Your container is getting OOMKilled in production. What is happening and how do you fix it?
  3. Explain why COPY . . before RUN npm ci destroys your build cache, using a concrete example.

Navigation: <-- 6.9 Overview | 6.9.b -- EC2 and SSL -->