Episode 6 — Scaling Reliability Microservices Web3 / 6.9 — Final Production Deployment

6.9.a -- Docker Deployment

In one sentence: A production Docker deployment uses multi-stage builds to create tiny, secure images, runs containers as a non-root user, manages multi-service stacks with Docker Compose, and enforces health checks, resource limits, and secrets so your system is reproducible, portable, and hardened from day one.

Navigation: <-- 6.9 Overview | 6.9.b -- EC2 and SSL -->


1. Why Docker for Production?

Docker solves the "works on my machine" problem by packaging your application, its dependencies, and its runtime environment into a single image that runs identically everywhere -- your laptop, staging, and production.

Without Docker:
  "It works on my machine" --> different Node version, missing env vars, OS differences

With Docker:
  Same image everywhere --> deterministic, reproducible, portable

Key benefits for production:

Benefit                   Explanation
Reproducibility           The same image runs in dev, staging, and production
Isolation                 Each service runs in its own container with its own filesystem
Portability               Runs on any machine with Docker installed -- AWS, GCP, Azure, bare metal
Fast rollbacks            Bad deploy? Run the previous image tag -- instant rollback
Microservice enablement   Each service has its own Dockerfile, image, and deployment lifecycle

2. Production Dockerfile Best Practices

2.1 Multi-Stage Builds

Multi-stage builds separate the build environment (where you install devDependencies and compile) from the runtime environment (where you run the app). This produces dramatically smaller images.

# ============================================
# Stage 1: Build
# ============================================
FROM node:20-alpine AS builder

WORKDIR /app

# Copy package files first (cache layer)
COPY package.json package-lock.json ./

# Install ALL dependencies (including devDependencies for building)
RUN npm ci

# Copy source code
COPY . .

# Build (TypeScript compile, bundle, etc.)
RUN npm run build

# ============================================
# Stage 2: Production Runtime
# ============================================
FROM node:20-alpine AS production

WORKDIR /app

# Copy only package files
COPY package.json package-lock.json ./

# Install ONLY production dependencies
# (--omit=dev replaces the deprecated --only=production flag in npm 8+)
RUN npm ci --omit=dev && npm cache clean --force

# Copy built output from builder stage
COPY --from=builder /app/dist ./dist

# Create non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Switch to non-root user
USER appuser

# Expose the port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

# Start the application
CMD ["node", "dist/index.js"]
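One operational note on the CMD above: node runs as PID 1 in the container, so `docker stop` sends SIGTERM straight to it, and Docker force-kills the container after the grace period (10 s by default) if the process has not exited. A minimal sketch of a shutdown handler, assuming `server` is whatever http.Server your entrypoint created:

```javascript
// graceful-shutdown.js -- drain connections on SIGTERM instead of dying mid-request.
function setupGracefulShutdown(server, timeoutMs = 10000) {
  const shutdown = (signal) => {
    console.log(`${signal} received, draining connections...`);
    // Stop accepting new connections; exit cleanly once in-flight requests finish.
    server.close(() => process.exit(0));
    // Hard deadline in case draining stalls; unref() so this timer alone
    // does not keep the process alive.
    setTimeout(() => process.exit(1), timeoutMs).unref();
  };
  process.on('SIGTERM', () => shutdown('SIGTERM'));
  process.on('SIGINT', () => shutdown('SIGINT'));
}

module.exports = { setupGracefulShutdown };
```

Call `setupGracefulShutdown(server)` right after `server.listen(...)` in your entrypoint; without it, `docker stop` and rolling deploys cut off in-flight requests.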

Why multi-stage matters:

Single-stage image:  ~1.2 GB  (includes devDependencies, source, build tools)
Multi-stage image:   ~150 MB  (only runtime deps + compiled output)

Savings: ~87% smaller image
  - Faster pulls from registry
  - Smaller attack surface
  - Lower storage costs

2.2 Non-Root User

By default, Docker containers run as root. If an attacker exploits your application, they get root access inside the container. Always create and switch to a non-root user.

# Create a system group and user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Change ownership of the app directory
RUN chown -R appuser:appgroup /app

# Switch to the non-root user for all subsequent commands
USER appuser

What happens without this:

Attacker exploits Node.js vulnerability
  --> Gets shell inside container as root
  --> Can modify container filesystem
  --> Potential container escape to host

With non-root user:
  --> Gets shell as appuser (limited permissions)
  --> Cannot modify system files
  --> Cannot install packages
  --> Blast radius contained
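If you want a runtime safety net on top of the Dockerfile's USER directive, the app can detect a root UID at startup. A sketch only -- `process.getuid` exists on POSIX platforms but not on Windows, hence the guard:

```javascript
// root-check.js -- warn (or fail fast) if the container was started as root.
function isRunningAsRoot() {
  // process.getuid is undefined on Windows, so guard the call.
  return typeof process.getuid === 'function' && process.getuid() === 0;
}

if (isRunningAsRoot()) {
  // In a hardened setup you might process.exit(1) here instead of warning.
  console.warn('WARNING: running as root -- check the Dockerfile USER directive');
}

module.exports = { isRunningAsRoot };
```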

2.3 .dockerignore

A .dockerignore file prevents unnecessary files from being sent to the Docker daemon during build. This speeds up builds and prevents secrets from leaking into images.

# .dockerignore

# Dependencies (installed inside container)
node_modules

# Source control
.git
.gitignore

# IDE
.vscode
.idea
*.swp

# Environment files (NEVER bake secrets into images)
.env
.env.*

# Build artifacts
dist
coverage

# Docker files (not needed inside the image)
Dockerfile
docker-compose*.yml
.dockerignore

# Documentation
README.md
docs/

# Tests (not needed in production image)
__tests__
*.test.js
*.spec.js
jest.config.js

2.4 Layer Optimisation

Docker caches layers. Order your Dockerfile so that infrequently changing layers come first and frequently changing layers come last.

# GOOD: Package files change less often than source code
COPY package.json package-lock.json ./   # <-- Layer 1 (cached most of the time)
RUN npm ci                                # <-- Layer 2 (cached when deps don't change)
COPY . .                                  # <-- Layer 3 (invalidated on every code change)

# BAD: Every code change reinstalls all dependencies
COPY . .                                  # <-- Layer 1 (invalidated on every change)
RUN npm ci                                # <-- Layer 2 (also invalidated -- cache miss)

Layer caching mental model:

Layer 1: FROM node:20-alpine           (changes: almost never)
Layer 2: COPY package*.json            (changes: when deps change)
Layer 3: RUN npm ci                    (changes: when deps change)
Layer 4: COPY . .                      (changes: every commit)
Layer 5: RUN npm run build             (changes: every commit)

If you only changed source code:
  Layers 1-3: CACHED (fast!)
  Layers 4-5: Rebuilt (only the changed parts)

2.5 Security Scanning

Scan your images for known vulnerabilities before deploying.

# Using Docker Scout (built into Docker Desktop)
docker scout cves my-app:latest

# Using Trivy (popular open-source scanner)
trivy image my-app:latest

# Using Snyk
snyk container test my-app:latest

# In CI/CD (GitHub Actions example)
# - name: Scan image
#   uses: aquasecurity/trivy-action@master
#   with:
#     image-ref: my-app:latest
#     severity: CRITICAL,HIGH
#     exit-code: 1   # Fail the build if critical vulnerabilities found

Best practices for secure images:

Practice                        Reason
Use -alpine base images         Minimal OS, fewer CVEs
Pin exact base image versions   node:20.11.1-alpine, not node:latest
Run as non-root                 Limit blast radius
No secrets in image             Use runtime env vars or Docker secrets
Scan in CI                      Catch vulnerabilities before deploy
Update base images regularly    Patch known CVEs

3. Docker Compose for Multi-Service Development

Docker Compose lets you define and run multiple containers as a single stack. In development, it replaces the need for installing PostgreSQL, Redis, and RabbitMQ locally.

3.1 Complete Multi-Service Docker Compose

# docker-compose.yml
version: "3.9"  # obsolete in Compose v2 -- safe to omit

services:
  # =============================================
  # API Gateway
  # =============================================
  gateway:
    build:
      context: ./services/gateway
      dockerfile: Dockerfile
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - USER_SERVICE_URL=http://user-service:3001
      - ORDER_SERVICE_URL=http://order-service:3002
      - REDIS_URL=redis://redis:6379
    depends_on:
      user-service:
        condition: service_healthy
      order-service:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
        reservations:
          cpus: "0.25"
          memory: 128M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # User Service
  # =============================================
  user-service:
    build:
      context: ./services/user-service
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://appuser:secret@postgres:5432/users_db
      - RABBITMQ_URL=amqp://rabbitmq:5672
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3001/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # Order Service
  # =============================================
  order-service:
    build:
      context: ./services/order-service
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://appuser:secret@postgres:5432/orders_db
      - RABBITMQ_URL=amqp://rabbitmq:5672
      - REDIS_URL=redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3002/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # PostgreSQL
  # =============================================
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: users_db  # orders_db is created by a script in ./init-scripts
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "1.00"
          memory: 512M
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d users_db"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  # =============================================
  # Redis
  # =============================================
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes --maxmemory 128mb --maxmemory-policy allkeys-lru
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.25"
          memory: 192M
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s

  # =============================================
  # RabbitMQ
  # =============================================
  rabbitmq:
    image: rabbitmq:3-management-alpine
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: guest
      RABBITMQ_DEFAULT_PASS: guest
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s

# =============================================
# Named Volumes (data persists across restarts)
# =============================================
volumes:
  postgres_data:
  redis_data:
  rabbitmq_data:

# =============================================
# Networks
# =============================================
networks:
  backend:
    driver: bridge

4. Docker Networking

Docker provides three main network drivers for connecting containers.

4.1 Network Types

bridge    Default. Creates an isolated virtual network; containers communicate
          by service name. Use case: multi-service stacks on a single host.
host      Container shares the host's network namespace -- no network isolation.
          Use case: maximum network performance (Linux only).
overlay   Spans multiple Docker hosts; used with Docker Swarm.
          Use case: multi-host deployments.

4.2 Service Discovery with Bridge Networks

networks:
  backend:
    driver: bridge

services:
  user-service:
    networks:
      - backend
  order-service:
    networks:
      - backend

With a shared bridge network, containers can reach each other by service name:

// Inside order-service, reach user-service by name
const userRes = await fetch('http://user-service:3001/api/users/123');
// Docker DNS resolves "user-service" to the container's IP
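In application code, keep those peer hostnames out of the source by resolving them from environment variables, falling back to the Docker DNS names. A small sketch -- the variable names mirror the Compose file above:

```javascript
// service-urls.js -- resolve peer-service base URLs from the environment,
// defaulting to the Docker DNS names defined in docker-compose.yml.
function serviceUrl(envVar, fallback) {
  return process.env[envVar] || fallback;
}

const USER_SERVICE = serviceUrl('USER_SERVICE_URL', 'http://user-service:3001');
const ORDER_SERVICE = serviceUrl('ORDER_SERVICE_URL', 'http://order-service:3002');

module.exports = { USER_SERVICE, ORDER_SERVICE };
```

This keeps the same code working outside Docker (point the env vars at localhost) and inside the Compose network (fall back to service names).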

4.3 Network Isolation

# Only gateway can talk to the frontend network AND the backend network.
# Database containers are on backend only -- not reachable from outside.

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge

services:
  gateway:
    networks:
      - frontend
      - backend
  user-service:
    networks:
      - backend
  postgres:
    networks:
      - backend  # Not on frontend -- cannot be reached directly

5. Volume Management for Persistence

Containers are ephemeral -- when a container is destroyed, its filesystem is gone. Volumes persist data across container restarts and recreations.

5.1 Volume Types

Type           Syntax                                    Use Case
Named volume   postgres_data:/var/lib/postgresql/data    Production databases, persistent state
Bind mount     ./local-dir:/container-dir                Development -- live code reloading
tmpfs          tmpfs: /tmp                               Ephemeral data that should never touch disk

5.2 Volume Commands

# List all volumes
docker volume ls

# Inspect a volume (see mount point on host)
docker volume inspect postgres_data

# Remove unused volumes (CAREFUL in production!)
docker volume prune

# Back up a named volume
docker run --rm -v postgres_data:/data -v $(pwd):/backup \
  alpine tar czf /backup/postgres_backup.tar.gz /data

6. Environment Variable Management

Never hardcode secrets in your Dockerfile or Compose file. Use environment variables injected at runtime.

6.1 Methods for Injecting Environment Variables

services:
  app:
    # Method 1: Inline (OK for non-sensitive defaults)
    environment:
      - NODE_ENV=production
      - PORT=3000

    # Method 2: .env file (good for local development)
    env_file:
      - .env

    # Method 3: Docker secrets (best for production)
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt

6.2 .env File Pattern

# .env (NEVER commit this file -- add to .gitignore)
DATABASE_URL=postgresql://user:password@postgres:5432/mydb
REDIS_URL=redis://redis:6379
JWT_SECRET=super-secret-key-change-me
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI...
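Whichever injection method you use, validate at startup that the required variables are actually present, so a misconfigured container crashes immediately instead of failing mid-request. A minimal sketch -- the variable list is an example:

```javascript
// validate-env.js -- fail fast on missing configuration at boot time.
function missingEnvVars(required, env = process.env) {
  return required.filter((name) => !env[name]);
}

function assertEnv(required, env = process.env) {
  const missing = missingEnvVars(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

module.exports = { missingEnvVars, assertEnv };
```

Call `assertEnv(['DATABASE_URL', 'REDIS_URL', 'JWT_SECRET'])` at the top of your entrypoint; combined with `restart: unless-stopped`, the crash loop surfaces the missing variable in the container logs right away.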

6.3 Docker Secrets (Swarm Mode)

services:
  app:
    secrets:
      - db_password
    environment:
      - DB_PASSWORD_FILE=/run/secrets/db_password

secrets:
  db_password:
    external: true  # Created with: docker secret create db_password ./password.txt

// Reading a Docker secret in Node.js
const fs = require('fs');
const dbPassword = fs.readFileSync('/run/secrets/db_password', 'utf8').trim();

7. Docker Health Checks

Health checks tell Docker (and orchestrators like ECS) whether a container is actually healthy, not just running.

7.1 Dockerfile Health Check

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

7.2 Health Endpoint in Express

// health.js -- production health check endpoint
const express = require('express');
const router = express.Router();

// Assumed to exist elsewhere in the project:
const pool = require('./db');      // pg Pool instance (path is illustrative)
const redis = require('./redis');  // Redis client instance (path is illustrative)

router.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    status: 'ok',
    checks: {}
  };

  // Check database connection
  try {
    await pool.query('SELECT 1');
    checks.checks.database = 'ok';
  } catch (err) {
    checks.checks.database = 'fail';
    checks.status = 'degraded';
  }

  // Check Redis connection
  try {
    await redis.ping();
    checks.checks.redis = 'ok';
  } catch (err) {
    checks.checks.redis = 'fail';
    checks.status = 'degraded';
  }

  const statusCode = checks.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(checks);
});

module.exports = router;

7.3 Health Check Parameters

Parameter        Default   Meaning
--interval       30s       Time between checks
--timeout        30s       Max time for a single check to respond
--start-period   0s        Grace period for slow-starting containers
--retries        3         Consecutive failures before marking "unhealthy"

8. Container Resource Limits

Without limits, a single container can consume all host CPU and memory, starving other containers.

services:
  app:
    deploy:
      resources:
        limits:
          cpus: "0.50"     # Max 50% of one CPU core
          memory: 256M      # Max 256 MB RAM -- container killed if exceeded (OOMKilled)
        reservations:
          cpus: "0.25"     # Guaranteed 25% of one CPU core
          memory: 128M      # Guaranteed 128 MB RAM

What happens when limits are hit:

Memory limit exceeded --> Container is OOMKilled (Out of Memory Killed)
CPU limit exceeded    --> Container is throttled (slowed down, not killed)
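One Node-specific wrinkle: V8 sizes its default heap limit from system memory, and older Node versions may not account for the container's cgroup limit, so a 256 MB container can be OOMKilled under memory pressure before Node ever garbage-collects aggressively. A common mitigation is capping the heap at roughly 75% of the container limit (the ratio is a rule of thumb, not a Node requirement):

```javascript
// heap-size.js -- derive a --max-old-space-size value from a container memory limit.
// e.g. for a 256M container: CMD ["node", "--max-old-space-size=192", "dist/index.js"]
function heapSizeMb(containerLimitMb, ratio = 0.75) {
  // Leave headroom for buffers, native memory, and the code itself.
  return Math.floor(containerLimitMb * ratio);
}

console.log(heapSizeMb(256)); // 192
```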

Sizing guidance for Node.js:

Service Type               CPU Limit   Memory Limit
API Gateway                0.50        256 MB
Microservice (stateless)   0.50        256 MB
Worker / Background job    1.00        512 MB
PostgreSQL                 1.00        512 MB - 1 GB
Redis (cache)              0.25        128 - 256 MB
RabbitMQ                   0.50        256 MB

9. Production vs Development Docker Configs

Use separate Compose files or override files for different environments.

9.1 Development Override

# docker-compose.override.yml (auto-loaded in development)
services:
  gateway:
    build:
      target: builder   # Use the builder stage (includes devDependencies)
    volumes:
      - ./services/gateway/src:/app/src  # Live code reloading
    environment:
      - NODE_ENV=development
      - LOG_LEVEL=debug
    command: npm run dev  # Nodemon for auto-restart

  postgres:
    ports:
      - "5432:5432"  # Expose to host for local tools (pgAdmin, DBeaver)

9.2 Production Compose

# docker-compose.prod.yml
services:
  gateway:
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:${IMAGE_TAG}
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    # No volume mounts -- use the built image
    # No port exposure except through reverse proxy

  postgres:
    # Note: Compose merges list values across files, so a plain empty list does
    # NOT remove the host port mapping from docker-compose.yml. Recent Compose
    # versions support the !reset YAML tag to clear it:
    ports: !reset []

9.3 Running with Environment-Specific Configs

# Development (uses docker-compose.yml + docker-compose.override.yml automatically)
docker compose up

# Production (explicit file, no override)
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Staging
docker compose -f docker-compose.yml -f docker-compose.staging.yml up -d

10. Complete Production Workflow

# Step 1: Build all images
docker compose -f docker-compose.yml -f docker-compose.prod.yml build

# Step 2: Run security scan
trivy image gateway:latest
trivy image user-service:latest

# Step 3: Tag images for registry
docker tag gateway:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:v1.2.3
docker tag user-service:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3

# Step 4: Push to registry
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:v1.2.3
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3

# Step 5: Deploy (pull and restart on production server)
ssh production-server "cd /app && docker compose -f docker-compose.prod.yml pull && docker compose -f docker-compose.prod.yml up -d"

# Step 6: Verify health
curl https://api.example.com/health

11. Key Takeaways

  1. Multi-stage builds cut image size by 80%+ and remove build tools from the runtime.
  2. Always run as a non-root user -- one line in the Dockerfile drastically reduces attack surface.
  3. .dockerignore prevents secrets and unnecessary files from entering the image.
  4. Layer ordering matters -- put COPY package*.json and RUN npm ci before COPY . . to maximise cache hits.
  5. Health checks let orchestrators know when a container is actually ready, not just running.
  6. Resource limits prevent one runaway container from killing the host.
  7. Separate dev and prod configs -- dev uses bind mounts and debug logging; prod uses built images and structured logging.

Explain-It Challenge

  1. A junior developer asks "why is our Docker image 1.2 GB?" Walk them through converting to a multi-stage build.
  2. Your container is getting OOMKilled in production. What is happening and how do you fix it?
  3. Explain why COPY . . before RUN npm ci destroys your build cache, using a concrete example.

Navigation: <-- 6.9 Overview | 6.9.b -- EC2 and SSL -->