Episode 6 — Scaling Reliability Microservices Web3 / 6.9 — Final Production Deployment
6.9.a -- Docker Deployment
In one sentence: A production Docker deployment uses multi-stage builds to create tiny, secure images, runs containers as a non-root user, manages multi-service stacks with Docker Compose, and enforces health checks, resource limits, and secrets so your system is reproducible, portable, and hardened from day one.
Navigation: <-- 6.9 Overview | 6.9.b -- EC2 and SSL -->
1. Why Docker for Production?
Docker solves the "works on my machine" problem by packaging your application, its dependencies, and its runtime environment into a single image that runs identically everywhere -- your laptop, staging, and production.
Without Docker:
"It works on my machine" --> different Node version, missing env vars, OS differences
With Docker:
Same image everywhere --> deterministic, reproducible, portable
Key benefits for production:
| Benefit | Explanation |
|---|---|
| Reproducibility | The same image runs in dev, staging, and production |
| Isolation | Each service runs in its own container with its own filesystem |
| Portability | Runs on any machine with Docker installed -- AWS, GCP, Azure, bare metal |
| Fast rollbacks | Bad deploy? Run the previous image tag -- instant rollback |
| Microservice enablement | Each service has its own Dockerfile, image, and deployment lifecycle |
2. Production Dockerfile Best Practices
2.1 Multi-Stage Builds
Multi-stage builds separate the build environment (where you install devDependencies and compile) from the runtime environment (where you run the app). This produces dramatically smaller images.
# ============================================
# Stage 1: Build
# ============================================
FROM node:20-alpine AS builder
WORKDIR /app
# Copy package files first (cache layer)
COPY package.json package-lock.json ./
# Install ALL dependencies (including devDependencies for building)
RUN npm ci
# Copy source code
COPY . .
# Build (TypeScript compile, bundle, etc.)
RUN npm run build
# ============================================
# Stage 2: Production Runtime
# ============================================
FROM node:20-alpine AS production
WORKDIR /app
# Copy only package files
COPY package.json package-lock.json ./
# Install ONLY production dependencies (--omit=dev; --only=production is deprecated)
RUN npm ci --omit=dev && npm cache clean --force
# Copy built output from builder stage
COPY --from=builder /app/dist ./dist
# Create non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
# Switch to non-root user
USER appuser
# Expose the port
EXPOSE 3000
# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
# Start the application
CMD ["node", "dist/index.js"]
Why multi-stage matters:
Single-stage image: ~1.2 GB (includes devDependencies, source, build tools)
Multi-stage image: ~150 MB (only runtime deps + compiled output)
Savings: ~87% smaller image
- Faster pulls from registry
- Smaller attack surface
- Lower storage costs
2.2 Non-Root User
By default, Docker containers run as root. If an attacker exploits your application, they get root access inside the container. Always create and switch to a non-root user.
# Create a system group and user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
# Change ownership of the app directory
RUN chown -R appuser:appgroup /app
# Switch to the non-root user for all subsequent commands
USER appuser
What happens without this:
Attacker exploits Node.js vulnerability
--> Gets shell inside container as root
--> Can modify container filesystem
--> Potential container escape to host
With non-root user:
--> Gets shell as appuser (limited permissions)
--> Cannot modify system files
--> Cannot install packages
--> Blast radius contained
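As a defence in depth, the application can also refuse to start as root at all, e.g. if someone later removes the `USER` directive or overrides it with `--user 0`. A small sketch (`isNonRoot` is an illustrative helper, not part of the lesson's codebase):

```javascript
// Defence-in-depth sketch: detect whether the process is running as root.
// process.getuid is only available on POSIX platforms, hence the guard.
function isNonRoot() {
  return typeof process.getuid !== 'function' || process.getuid() !== 0;
}

// At startup, fail fast rather than serve traffic as root:
if (!isNonRoot()) {
  console.error('Refusing to start as root -- set USER in the Dockerfile.');
  // process.exit(1); // enable in real startup code
}
```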
2.3 .dockerignore
A .dockerignore file prevents unnecessary files from being sent to the Docker daemon during build. This speeds up builds and prevents secrets from leaking into images.
# .dockerignore
# Dependencies (installed inside container)
node_modules
# Source control
.git
.gitignore
# IDE
.vscode
.idea
*.swp
# Environment files (NEVER bake secrets into images)
.env
.env.*
# Build artifacts
dist
coverage
# Docker files (not needed inside the image)
Dockerfile
docker-compose*.yml
.dockerignore
# Documentation
README.md
docs/
# Tests (not needed in production image)
__tests__
*.test.js
*.spec.js
jest.config.js
2.4 Layer Optimisation
Docker caches layers. Order your Dockerfile so that infrequently changing layers come first and frequently changing layers come last.
# GOOD: Package files change less often than source code
COPY package.json package-lock.json ./ # <-- Layer 1 (cached most of the time)
RUN npm ci # <-- Layer 2 (cached when deps don't change)
COPY . . # <-- Layer 3 (invalidated on every code change)
# BAD: Every code change reinstalls all dependencies
COPY . . # <-- Layer 1 (invalidated on every change)
RUN npm ci # <-- Layer 2 (also invalidated -- cache miss)
Layer caching mental model:
Layer 1: FROM node:20-alpine (changes: almost never)
Layer 2: COPY package*.json (changes: when deps change)
Layer 3: RUN npm ci (changes: when deps change)
Layer 4: COPY . . (changes: every commit)
Layer 5: RUN npm run build (changes: every commit)
If you only changed source code:
Layers 1-3: CACHED (fast!)
Layers 4-5: Rebuilt (only the changed parts)
2.5 Security Scanning
Scan your images for known vulnerabilities before deploying.
# Using Docker Scout (built into Docker Desktop)
docker scout cves my-app:latest
# Using Trivy (popular open-source scanner)
trivy image my-app:latest
# Using Snyk
snyk container test my-app:latest
# In CI/CD (GitHub Actions example)
# - name: Scan image
# uses: aquasecurity/trivy-action@master
# with:
# image-ref: my-app:latest
# severity: CRITICAL,HIGH
# exit-code: 1 # Fail the build if critical vulnerabilities found
Best practices for secure images:
| Practice | Reason |
|---|---|
| Use -alpine base images | Minimal OS, fewer CVEs |
| Pin exact base image versions | node:20.11.1-alpine not node:latest |
| Run as non-root | Limit blast radius |
| No secrets in image | Use runtime env vars or Docker secrets |
| Scan in CI | Catch vulnerabilities before deploy |
| Update base images regularly | Patch known CVEs |
3. Docker Compose for Multi-Service Development
Docker Compose lets you define and run multiple containers as a single stack. In development, it removes the need to install PostgreSQL, Redis, and RabbitMQ directly on your machine.
3.1 Complete Multi-Service Docker Compose
# docker-compose.yml
# (no top-level "version" key -- it is obsolete in the Compose Specification)
services:
  # =============================================
  # API Gateway
  # =============================================
  gateway:
    build:
      context: ./services/gateway
      dockerfile: Dockerfile
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - USER_SERVICE_URL=http://user-service:3001
      - ORDER_SERVICE_URL=http://order-service:3002
      - REDIS_URL=redis://redis:6379
    depends_on:
      user-service:
        condition: service_healthy
      order-service:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
        reservations:
          cpus: "0.25"
          memory: 128M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # User Service
  # =============================================
  user-service:
    build:
      context: ./services/user-service
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://appuser:secret@postgres:5432/users_db
      - RABBITMQ_URL=amqp://rabbitmq:5672
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3001/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # Order Service
  # =============================================
  order-service:
    build:
      context: ./services/order-service
      dockerfile: Dockerfile
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://appuser:secret@postgres:5432/orders_db
      - RABBITMQ_URL=amqp://rabbitmq:5672
      - REDIS_URL=redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3002/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  # =============================================
  # PostgreSQL
  # =============================================
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret # dev-only credential (see section 6 for production secrets)
      POSTGRES_DB: users_db
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "1.00"
          memory: 512M
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d users_db"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  # =============================================
  # Redis
  # =============================================
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes --maxmemory 128mb --maxmemory-policy allkeys-lru
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.25"
          memory: 192M
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s

  # =============================================
  # RabbitMQ
  # =============================================
  rabbitmq:
    image: rabbitmq:3-management-alpine
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: guest
      RABBITMQ_DEFAULT_PASS: guest
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    networks:
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s

# =============================================
# Named Volumes (data persists across restarts)
# =============================================
volumes:
  postgres_data:
  redis_data:
  rabbitmq_data:

# =============================================
# Networks
# =============================================
networks:
  backend:
    driver: bridge
4. Docker Networking
Docker provides three main network drivers for connecting containers.
4.1 Network Types
| Driver | Description | Use Case |
|---|---|---|
| bridge | Default. Creates an isolated virtual network. Containers communicate by service name. | Multi-service stacks on a single host |
| host | Container shares the host's network namespace. No network isolation. | Maximum network performance (Linux only) |
| overlay | Spans multiple Docker hosts. Used with Docker Swarm. | Multi-host deployments |
4.2 Service Discovery with Bridge Networks
networks:
  backend:
    driver: bridge

services:
  user-service:
    networks:
      - backend
  order-service:
    networks:
      - backend
With a shared bridge network, containers can reach each other by service name:
// Inside order-service, reach user-service by name
const userRes = await fetch('http://user-service:3001/api/users/123');
// Docker DNS resolves "user-service" to the container's IP
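Because cross-service calls like this traverse the bridge network, it is worth bounding them with a timeout so a slow peer cannot hang the caller indefinitely. A sketch using the global `fetch` and `AbortController` available in Node 18+ (`fetchWithTimeout` is an illustrative helper, not part of the lesson's codebase):

```javascript
// Sketch: bound cross-service calls so a slow peer on the bridge network
// cannot hang the caller. Requires Node 18+ for the global fetch.
async function fetchWithTimeout(url, ms = 2000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // Cancel the abort timer once the call settles
  }
}

// Usage (service name resolved by Docker's embedded DNS):
// const res = await fetchWithTimeout('http://user-service:3001/api/users/123');
```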
4.3 Network Isolation
# Only gateway can talk to the frontend network AND the backend network.
# Database containers are on backend only -- not reachable from outside.
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge

services:
  gateway:
    networks:
      - frontend
      - backend
  user-service:
    networks:
      - backend
  postgres:
    networks:
      - backend # Not on frontend -- cannot be reached directly
5. Volume Management for Persistence
Containers are ephemeral -- when a container is destroyed, its filesystem is gone. Volumes persist data across container restarts and recreations.
5.1 Volume Types
| Type | Syntax | Use Case |
|---|---|---|
| Named volume | postgres_data:/var/lib/postgresql/data | Production databases, persistent state |
| Bind mount | ./local-dir:/container-dir | Development -- live code reloading |
| tmpfs | tmpfs: /tmp | Ephemeral data that should never touch disk |
5.2 Volume Commands
# List all volumes
docker volume ls
# Inspect a volume (see mount point on host)
docker volume inspect postgres_data
# Remove unused volumes (CAREFUL in production!)
docker volume prune
# Back up a named volume
docker run --rm -v postgres_data:/data -v $(pwd):/backup \
  alpine tar czf /backup/postgres_backup.tar.gz /data
6. Environment Variable Management
Never hardcode secrets in your Dockerfile or Compose file. Use environment variables injected at runtime.
6.1 Methods for Injecting Environment Variables
services:
  app:
    # Method 1: Inline (OK for non-sensitive defaults)
    environment:
      - NODE_ENV=production
      - PORT=3000
    # Method 2: .env file (good for local development)
    env_file:
      - .env
    # Method 3: Docker secrets (best for production)
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt
6.2 .env File Pattern
# .env (NEVER commit this file -- add to .gitignore)
DATABASE_URL=postgresql://user:password@postgres:5432/mydb
REDIS_URL=redis://redis:6379
JWT_SECRET=super-secret-key-change-me
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI...
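Whichever injection method you use, it pays to validate required variables at startup so a missing value fails fast rather than surfacing on the first query. A sketch assuming the variable names from the `.env` example above (`loadConfig` is an illustrative helper, not part of the lesson's codebase):

```javascript
// Fail-fast config loader sketch: a missing DATABASE_URL should crash the
// container at boot (and show up in `docker ps` / health checks), not at
// the first database query.
const REQUIRED = ['DATABASE_URL', 'REDIS_URL', 'JWT_SECRET'];

function loadConfig(env = process.env) {
  const missing = REQUIRED.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
  return {
    databaseUrl: env.DATABASE_URL,
    redisUrl: env.REDIS_URL,
    jwtSecret: env.JWT_SECRET,
  };
}
```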
6.3 Docker Secrets (Swarm Mode)
services:
  app:
    secrets:
      - db_password
    environment:
      - DB_PASSWORD_FILE=/run/secrets/db_password

secrets:
  db_password:
    external: true # Created with: docker secret create db_password ./password.txt
// Reading a Docker secret in Node.js
const fs = require('fs');
const dbPassword = fs.readFileSync('/run/secrets/db_password', 'utf8').trim();
7. Docker Health Checks
Health checks tell Docker (and orchestrators like ECS) whether a container is actually healthy, not just running.
7.1 Dockerfile Health Check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
7.2 Health Endpoint in Express
// health.js -- production health check endpoint
const express = require('express');
// Assumed shared clients, exported elsewhere in the app:
const { pool } = require('./db');     // pg connection pool
const { redis } = require('./cache'); // Redis client
const router = express.Router();

router.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    status: 'ok',
    checks: {}
  };

  // Check database connection
  try {
    await pool.query('SELECT 1');
    checks.checks.database = 'ok';
  } catch (err) {
    checks.checks.database = 'fail';
    checks.status = 'degraded';
  }

  // Check Redis connection
  try {
    await redis.ping();
    checks.checks.redis = 'ok';
  } catch (err) {
    checks.checks.redis = 'fail';
    checks.status = 'degraded';
  }

  const statusCode = checks.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(checks);
});

module.exports = router;
7.3 Health Check Parameters
| Parameter | Default | Meaning |
|---|---|---|
| --interval | 30s | Time between checks |
| --timeout | 30s | Max time for a single check to respond |
| --start-period | 0s | Grace period for slow-starting containers |
| --retries | 3 | Consecutive failures before marking "unhealthy" |
8. Container Resource Limits
Without limits, a single container can consume all host CPU and memory, starving other containers.
services:
  app:
    deploy:
      resources:
        limits:
          cpus: "0.50"  # Max 50% of one CPU core
          memory: 256M  # Max 256 MB RAM -- container killed if exceeded (OOMKilled)
        reservations:
          cpus: "0.25"  # Guaranteed 25% of one CPU core
          memory: 128M  # Guaranteed 128 MB RAM
What happens when limits are hit:
Memory limit exceeded --> Container is OOMKilled (Out of Memory Killed)
CPU limit exceeded --> Container is throttled (slowed down, not killed)
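To pick sensible limits, measure before you cap: logging the process's real memory footprint over a day of traffic shows where the ceiling should sit. A sketch using `process.memoryUsage()` (`logMemory` is an illustrative helper, not part of the lesson's codebase):

```javascript
// Sketch: log the process's actual memory footprint so container limits
// are set from data, not guesses. Values are converted to MiB.
function logMemory() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mib = (bytes) => Math.round(bytes / 1024 / 1024);
  console.log(
    `rss=${mib(rss)}MiB heapUsed=${mib(heapUsed)}MiB heapTotal=${mib(heapTotal)}MiB`
  );
  return { rssMiB: mib(rss), heapUsedMiB: mib(heapUsed) };
}

// e.g. sample once a minute in production:
// setInterval(logMemory, 60_000).unref();
```

Note that Node's default heap ceiling can exceed a small container limit; setting `--max-old-space-size` (in MiB) somewhat below the container's memory limit is a common safeguard against OOMKills.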
Sizing guidance for Node.js:
| Service Type | CPU Limit | Memory Limit |
|---|---|---|
| API Gateway | 0.50 | 256 MB |
| Microservice (stateless) | 0.50 | 256 MB |
| Worker / Background job | 1.00 | 512 MB |
| PostgreSQL | 1.00 | 512 MB - 1 GB |
| Redis (cache) | 0.25 | 128 - 256 MB |
| RabbitMQ | 0.50 | 256 MB |
9. Production vs Development Docker Configs
Use separate Compose files or override files for different environments.
9.1 Development Override
# docker-compose.override.yml (auto-loaded in development)
services:
  gateway:
    build:
      target: builder # Use the builder stage (includes devDependencies)
    volumes:
      - ./services/gateway/src:/app/src # Live code reloading
    environment:
      - NODE_ENV=development
      - LOG_LEVEL=debug
    command: npm run dev # Nodemon for auto-restart

  postgres:
    ports:
      - "5432:5432" # Expose to host for local tools (pgAdmin, DBeaver)
9.2 Production Compose
# docker-compose.prod.yml
services:
  gateway:
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:${IMAGE_TAG}
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    # No volume mounts -- use the built image
    # No port exposure except through reverse proxy

  postgres:
    # No port exposure to host in production. Compose merging appends to
    # lists, so a plain empty list would NOT remove the published port --
    # the !reset tag is needed to clear it.
    ports: !reset []
9.3 Running with Environment-Specific Configs
# Development (uses docker-compose.yml + docker-compose.override.yml automatically)
docker compose up
# Production (explicit file, no override)
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
# Staging
docker compose -f docker-compose.yml -f docker-compose.staging.yml up -d
10. Complete Production Workflow
# Step 1: Build all images
docker compose -f docker-compose.yml -f docker-compose.prod.yml build
# Step 2: Run security scan
trivy image gateway:latest
trivy image user-service:latest
# Step 3: Tag images for registry
docker tag gateway:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:v1.2.3
docker tag user-service:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3
# Step 4: Push to registry
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/gateway:v1.2.3
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/user-service:v1.2.3
# Step 5: Deploy (pull and restart on production server)
ssh production-server "cd /app && docker compose -f docker-compose.prod.yml pull && docker compose -f docker-compose.prod.yml up -d"
# Step 6: Verify health
curl https://api.example.com/health
11. Key Takeaways
- Multi-stage builds cut image size by 80%+ and remove build tools from the runtime.
- Always run as a non-root user -- one line in the Dockerfile drastically reduces attack surface.
- `.dockerignore` prevents secrets and unnecessary files from entering the image.
- Layer ordering matters -- put `COPY package*.json` and `RUN npm ci` before `COPY . .` to maximise cache hits.
- Health checks let orchestrators know when a container is actually ready, not just running.
- Resource limits prevent one runaway container from killing the host.
- Separate dev and prod configs -- dev uses bind mounts and debug logging; prod uses built images and structured logging.
Explain-It Challenge
- A junior developer asks "why is our Docker image 1.2 GB?" Walk them through converting to a multi-stage build.
- Your container is getting OOMKilled in production. What is happening and how do you fix it?
- Explain why `COPY . .` before `RUN npm ci` destroys your build cache, using a concrete example.
Navigation: <-- 6.9 Overview | 6.9.b -- EC2 and SSL -->