Episode 6 — Scaling Reliability Microservices Web3 / 6.9 — Final Production Deployment

6.9.c -- CI/CD Pipelines

In one sentence: CI/CD automates the journey from git push to production -- Continuous Integration builds and tests every commit, Continuous Deployment ships passing builds automatically, and GitHub Actions orchestrates the entire pipeline with Docker builds, security scans, environment promotion, and rollback strategies.

Navigation: <-- 6.9.b EC2 and SSL | 6.9 Overview

1. What CI/CD Is and Why It Matters

1.1 The Problem CI/CD Solves

Without CI/CD (manual deployment):
  Developer writes code
    --> "Works on my machine!"
    --> Emails zip file to ops team
    --> Ops SSHes into server at 2 AM
    --> Runs deploy script (fingers crossed)
    --> 30-minute rollback if things break
    --> Team is afraid to deploy on Fridays

With CI/CD:
  Developer pushes to main
    --> Automated lint, test, build, scan
    --> Docker image pushed to registry
    --> Deployed to staging automatically
    --> One-click promotion to production
    --> Instant rollback to previous version
    --> Deploy 10 times a day with confidence

1.2 Definitions

Term	Definition	Example
Continuous Integration (CI)	Automatically build and test every commit	Run `npm test` and `npm run lint` on every push
Continuous Delivery (CD)	Every passing build is ready to deploy (manual approval)	Build passes all tests; human clicks "Deploy to Prod"
Continuous Deployment (CD)	Every passing build is automatically deployed	Merge to `main` --> live in production in 5 minutes

CI vs Delivery vs Deployment:

   Code --> Build --> Test --> [Stage]  --> [Prod]
   |                           |            |
   +--- CI (automated) -------+            |
   +--- Continuous Delivery ----+ (manual) |
   +--- Continuous Deployment ---+----------+ (automatic)

2. Continuous Integration: Build + Test on Every Push

CI is the foundation. Every push to the repository triggers an automated pipeline that:

Checks out the code
Installs dependencies
Lints the code (ESLint, Prettier)
Runs tests (unit, integration)
Builds the project (TypeScript compile, Docker image)
Scans for vulnerabilities

If any step fails, the pipeline stops and the team is notified. When the branch is protected with required status checks, the pull request is also blocked from merging.
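The fail-fast behavior described above can be sketched as a small shell script that runs the same stages locally. This is a minimal sketch, not what GitHub Actions does internally: `run_step` and `run_pipeline` are hypothetical helper names, and the npm scripts `lint`, `test`, and `build` are assumed to exist in `package.json`.

```shell
# Run one named step, report the result, and propagate its exit code.
run_step() {
  step_name=$1
  shift
  echo "==> ${step_name}"
  if "$@"; then
    echo "    ok: ${step_name}"
  else
    rc=$?
    echo "    FAILED: ${step_name} (exit ${rc})" >&2
    return "${rc}"
  fi
}

# Run the stages in order; "&&" stops the chain at the first failure.
# The npm script names are assumptions about package.json.
run_pipeline() {
  run_step "install" npm ci &&
  run_step "lint" npm run lint &&
  run_step "test" npm test &&
  run_step "build" npm run build
}
```

Calling `run_pipeline` from the repository root runs the four stages and returns the exit code of the first stage that fails, which is the same contract a CI job relies on.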

2.1 Why CI Matters

Without CI	With CI
Bugs found in production	Bugs caught before merge
"It works on my machine"	Works in a clean environment
Merge conflicts pile up	Small, frequent merges
No one runs tests locally	Tests run automatically
Inconsistent code style	Linting enforced on every PR

3. GitHub Actions Workflow Files

GitHub Actions workflows are defined in YAML files inside .github/workflows/. They run on GitHub-hosted runners by default; self-hosted runners are also supported.

3.1 Workflow File Structure

# .github/workflows/ci.yml
name: CI Pipeline          # Name shown in the GitHub UI

on:                         # Trigger conditions
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:                        # Global environment variables
  NODE_VERSION: "20"
  REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com

jobs:                       # One or more jobs
  lint:                     # Job name
    runs-on: ubuntu-latest  # Runner machine
    steps:                  # Sequential steps within the job
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: [lint]           # Run after lint passes
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"
      - run: npm ci
      - run: npm test

3.2 Key Concepts

Concept	Explanation
Trigger (`on`)	When the workflow runs: `push`, `pull_request`, `schedule`, or manual (`workflow_dispatch`)
Job	A set of steps that run on the same runner. Jobs run in parallel by default.
Step	A single command or action within a job. Steps run sequentially.
`needs`	Creates a dependency between jobs (sequential execution)
`uses`	References a reusable action (e.g., `actions/checkout@v4`)
`run`	Executes a shell command
Secrets	Encrypted variables accessed via `${{ secrets.MY_SECRET }}`
Artifacts	Files shared between jobs (test reports, build output)
Cache	Speeds up builds by reusing downloaded dependencies between runs (`cache: "npm"` in `setup-node` caches npm's download cache, not `node_modules`)

4. Build, Test, Lint, Deploy Pipeline

4.1 Pipeline Stages

+--------+     +--------+     +--------+     +--------+     +--------+
|  Lint  | --> |  Test  | --> | Build  | --> |  Scan  | --> | Deploy |
| ESLint |     | Jest   |     | Docker |     | Trivy  |     | ECS /  |
| Prettier|    | Integ  |     | Image  |     | Snyk   |     | EC2    |
+--------+     +--------+     +--------+     +--------+     +--------+
    |              |              |              |              |
    | Fail?        | Fail?        | Fail?        | Fail?        | Fail?
    v              v              v              v              v
  STOP           STOP          STOP           STOP         ROLLBACK
  Notify         Notify        Notify         Notify        Notify

4.2 Complete GitHub Actions Workflow

# .github/workflows/deploy.yml
name: Build, Test, and Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: "20"
  AWS_REGION: us-east-1
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  ECR_REPOSITORY: my-api
  ECS_CLUSTER: production-cluster
  ECS_SERVICE: my-api-service
  ECS_TASK_DEF: my-api-task

permissions:
  contents: read
  id-token: write

jobs:
  # =============================================
  # Job 1: Lint
  # =============================================
  lint:
    name: Lint Code
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"

      - name: Install dependencies
        run: npm ci

      - name: Run ESLint
        run: npm run lint

      - name: Check formatting
        run: npx prettier --check .

  # =============================================
  # Job 2: Test (runs after lint)
  # =============================================
  test:
    name: Run Tests
    runs-on: ubuntu-latest
    needs: [lint]
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test_db
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"

      - name: Install dependencies
        run: npm ci

      - name: Run unit tests
        run: npm run test:unit
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379

      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379

      - name: Upload coverage report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/

  # =============================================
  # Job 3: Build and Push Docker Image
  # =============================================
  build:
    name: Build Docker Image
    runs-on: ubuntu-latest
    needs: [test]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-role
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Generate image metadata
        id: meta
        run: |
          SHA=$(echo ${{ github.sha }} | cut -c1-7)
          echo "tags=${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${SHA}" >> $GITHUB_OUTPUT
          echo "sha_tag=${SHA}" >> $GITHUB_OUTPUT

      - name: Build Docker image
        run: |
          docker build \
            --build-arg BUILD_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
            --build-arg GIT_SHA=${{ github.sha }} \
            -t ${{ steps.meta.outputs.tags }} \
            -t ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:latest \
            .

      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master   # Pin to a release tag or commit SHA in real pipelines; @master can change without notice
        with:
          image-ref: ${{ steps.meta.outputs.tags }}
          format: "table"
          exit-code: "1"
          severity: "CRITICAL,HIGH"

      - name: Push image to ECR
        run: |
          docker push ${{ steps.meta.outputs.tags }}
          docker push ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:latest

  # =============================================
  # Job 4: Deploy to ECS
  # =============================================
  deploy:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: [build]
    environment: production     # Pauses for approval when required reviewers are configured for this environment
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-role
          aws-region: ${{ env.AWS_REGION }}

      - name: Download current task definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ env.ECS_TASK_DEF }} \
            --query taskDefinition \
            > task-definition.json

      - name: Update task definition with new image
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-api
          image: ${{ needs.build.outputs.image_tag }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 10

      - name: Verify deployment
        run: |
          HEALTH_URL="https://api.example.com/health"
          for i in {1..5}; do
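            # "|| true" keeps the loop retrying when curl cannot connect (the step shell runs with -e)
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$HEALTH_URL" || true)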
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed!"
              exit 0
            fi
            echo "Attempt $i: status $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Health check failed after 5 attempts"
          exit 1

5. Docker Image Build and Push in CI

5.1 The Build Flow

Source Code (GitHub)
    |
    v
GitHub Actions Runner
    |
    +--> docker build -t my-api:abc1234 .
    |       |
    |       +--> Stage 1 (builder): install deps, compile TS
    |       +--> Stage 2 (production): copy dist, prod deps only
    |       +--> Image: ~150 MB
    |
    +--> trivy image my-api:abc1234
    |       |
    |       +--> PASS: No critical vulnerabilities
    |
    +--> docker push ECR/my-api:abc1234
            |
            +--> Available for ECS to pull
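The flow above can be sketched as a script that chains build, scan, and push so that a failed scan prevents the push. This is a minimal sketch under stated assumptions: `run` and `build_scan_push` are hypothetical helpers, `DRY_RUN` is an invented switch for previewing the commands, and `docker` and `trivy` are assumed to be installed on the runner.

```shell
# Print each command; execute it only when DRY_RUN is not "1".
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Build, scan, and push one image; "&&" stops at the first failing stage,
# so an image with CRITICAL or HIGH findings is never pushed.
build_scan_push() {
  image=$1
  run docker build -t "${image}" . &&
  run trivy image --exit-code 1 --severity CRITICAL,HIGH "${image}" &&
  run docker push "${image}"
}
```

Setting `DRY_RUN=1` and calling `build_scan_push ECR/my-api:abc1234` prints the three commands without running them, which is a cheap way to review the sequence before wiring it into a workflow.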

5.2 Tagging Strategy

# Tag with git SHA (unique, traceable)
docker tag my-api:latest ECR/my-api:abc1234

# Tag with semantic version (human-readable)
docker tag my-api:latest ECR/my-api:v1.2.3

# Tag with "latest" (convenience, but dangerous for production)
docker tag my-api:latest ECR/my-api:latest

# Best practice: use SHA for deployments, semver for releases
# NEVER deploy :latest to production -- the tag is mutable, so you cannot tell which build is running or roll back to a known version
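The tagging rules above can be enforced with two small helpers. This is a minimal sketch: `image_ref` and `assert_specific_tag` are hypothetical names, and the 7-character short SHA mirrors the workflow in section 4.2.

```shell
# Build a full image reference from registry, repository, and git commit SHA.
# Only the first 7 characters of the SHA are used as the tag.
image_ref() {
  registry=$1
  repository=$2
  sha=$3
  short_sha=$(printf '%s' "${sha}" | cut -c1-7)
  printf '%s/%s:%s\n' "${registry}" "${repository}" "${short_sha}"
}

# Refuse to deploy an image reference that is untagged or tagged "latest".
assert_specific_tag() {
  name_and_tag=${1##*/}   # drop the registry/path so a registry port is not mistaken for a tag
  case "${name_and_tag}" in
    *:latest) echo "refusing to deploy mutable tag: $1" >&2; return 1 ;;
    *:*)      return 0 ;;
    *)        echo "image reference has no tag: $1" >&2; return 1 ;;
  esac
}
```

A deploy script could call `assert_specific_tag "$IMAGE" || exit 1` before touching the cluster, so a stray `:latest` fails early instead of reaching production.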

6. Deploying to ECS from CI/CD

6.1 ECS Deployment Flow

CI/CD pushes new image to ECR
    |
    v
Register a new ECS task definition revision with the new image tag
    |
    v
Update the ECS service to use the new revision
    |
    v
Rolling update begins:
    |
    +--> Start new task(s) with new image
    +--> Wait for new task(s) to pass health check
    +--> Drain connections from old task(s)
    +--> Stop old task(s)
    |
    v
Deployment complete (zero downtime)

6.2 Rolling Update Configuration

{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }
}

What these settings mean:

Setting	Value	Effect
`maximumPercent`	200	Can run up to 2x the desired count during deploy
`minimumHealthyPercent`	100	Old tasks stay until new ones are healthy
`deploymentCircuitBreaker`	enabled	Auto-rollback if new tasks keep failing
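The two percentages translate into task counts as follows. This is a worked sketch with hypothetical function names; it relies on the documented ECS rounding rules, where the upper bound is rounded down and the lower bound is rounded up.

```shell
# Upper bound on running + pending tasks during a deployment (rounded down).
max_tasks() {
  desired=$1
  maximum_percent=$2
  echo $(( desired * maximum_percent / 100 ))
}

# Lower bound on healthy running tasks during a deployment (rounded up).
min_healthy_tasks() {
  desired=$1
  minimum_healthy_percent=$2
  echo $(( (desired * minimum_healthy_percent + 99) / 100 ))
}

max_tasks 4 200          # prints 8
min_healthy_tasks 4 100  # prints 4
min_healthy_tasks 3 50   # prints 2
```

With a desired count of 4 and the settings above, ECS may start up to 4 extra tasks before stopping any old ones, which is what keeps capacity at 100% throughout the rollout.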

7. Environment-Specific Deployments

7.1 Staging to Production Promotion

develop branch --> Staging Environment
    |
    | Tests pass, QA approves
    v
main branch    --> Production Environment

# Separate workflows per environment
# .github/workflows/deploy-staging.yml
on:
  push:
    branches: [develop]
jobs:
  deploy:
    environment: staging
    # ... deploy to staging ECS cluster

# .github/workflows/deploy-production.yml
on:
  push:
    branches: [main]
jobs:
  deploy:
    environment: production
    # ... deploy to production ECS cluster

7.2 Environment Variables per Stage

# In the deploy job
- name: Set environment variables
  run: |
    if [ "${{ github.ref }}" = "refs/heads/main" ]; then
      echo "DEPLOY_ENV=production" >> $GITHUB_ENV
      echo "ECS_CLUSTER=prod-cluster" >> $GITHUB_ENV
      echo "API_URL=https://api.example.com" >> $GITHUB_ENV
    else
      echo "DEPLOY_ENV=staging" >> $GITHUB_ENV
      echo "ECS_CLUSTER=staging-cluster" >> $GITHUB_ENV
      echo "API_URL=https://staging-api.example.com" >> $GITHUB_ENV
    fi

8. Secrets Management in CI

8.1 GitHub Secrets

Store sensitive values in GitHub Settings > Secrets and variables > Actions.

# Reference secrets in workflow
- name: Configure AWS
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: us-east-1

- name: Deploy
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL }}
    JWT_SECRET: ${{ secrets.JWT_SECRET }}
  run: ./deploy.sh

8.2 Environment-Level Secrets

GitHub Secrets hierarchy:

Organization secrets --> Available to the repos selected by the organization's access policy
    |
    v
Repository secrets   --> Available to all workflows in this repo
    |
    v
Environment secrets  --> Available only in a specific environment (staging/production)

jobs:
  deploy:
    environment: production    # Environment secrets are available ONLY to jobs that name this environment
    steps:
      - run: ./deploy.sh       # Pass the secret through env; never echo it
        env:
          DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }}

8.3 Security Best Practices for CI Secrets

Practice	Reason
Never echo secrets	GitHub masks exact secret values in logs as `***`, but transformed values (base64, substrings, JSON-escaped) can slip through unmasked
Use OIDC for AWS	`role-to-assume` instead of access keys -- no long-lived credentials
Scope secrets to environments	Production secrets only available in production jobs
Rotate secrets regularly	Update secrets quarterly or after team changes
Audit secret access	Review who has permission to read/write repository secrets
Never commit secrets	Use `.gitignore` for `.env` files; use `git-secrets` pre-commit hook
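The "never commit secrets" rule can be approximated with a pre-commit style check. This is a minimal sketch, not a replacement for `git-secrets`: `scan_for_aws_keys` is a hypothetical helper, and it only looks for strings shaped like long-lived AWS access key IDs (`AKIA` followed by 16 uppercase letters or digits).

```shell
# Return non-zero if any given file contains something shaped like an AWS access key ID.
scan_for_aws_keys() {
  [ "$#" -gt 0 ] || return 0   # nothing to scan
  if grep -En 'AKIA[0-9A-Z]{16}' "$@"; then
    echo "possible AWS access key ID found; commit blocked" >&2
    return 1
  fi
  return 0
}
```

Wired into a pre-commit hook, the function would receive the list of staged files; a dedicated scanner covers far more credential formats than this single pattern.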

9. Rollback Strategies

9.1 Image-Based Rollback (Fastest)

# Every deployment tags the image with the git SHA
# To roll back, redeploy the task definition revision that references the previous image tag

# Find the previous working image
aws ecr describe-images --repository-name my-api \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:].[imageTags[0],imagePushedAt]' \
  --output table

# Point the ECS service back at the previous task definition revision
# (my-api-task:42 here), which still references the old image
aws ecs update-service \
  --cluster production-cluster \
  --service my-api-service \
  --task-definition my-api-task:42 \
  --force-new-deployment
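The revision passed to `--task-definition` can be derived from the one currently running. This is a minimal sketch with a hypothetical helper name, and it assumes the immediately preceding revision is the last known good one, which is not guaranteed if several bad revisions were registered in a row.

```shell
# Given "family:revision" or a full task definition ARN, print "family:(revision - 1)".
previous_revision() {
  current=$1
  revision=${current##*:}   # text after the last colon
  family=${current%:*}      # text before the last colon
  family=${family##*/}      # drop any ARN prefix up to the last slash
  case "${revision}" in
    ''|*[!0-9]*) echo "not a numeric revision: ${current}" >&2; return 1 ;;
  esac
  if [ "${revision}" -le 1 ]; then
    echo "no earlier revision than ${current}" >&2
    return 1
  fi
  echo "${family}:$(( revision - 1 ))"
}

previous_revision "my-api-task:43"   # prints my-api-task:42
```

The printed value can be passed straight to `aws ecs update-service --task-definition`, keeping the rollback to a single command during an incident.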

9.2 Git Revert Rollback

# Revert the problematic commit
git revert HEAD
git push origin main
# CI/CD pipeline runs automatically --> deploys the reverted code

9.3 ECS Circuit Breaker (Automatic)

{
  "deploymentCircuitBreaker": {
    "enable": true,
    "rollback": true
  }
}

If the new task definition fails health checks repeatedly, ECS automatically rolls back to the previous working task definition.

10. Blue/Green and Canary Deployments

10.1 Blue/Green Deployment

                    Load Balancer
                    /           \
                   /             \
          +-------+---+    +-----+-----+
          | BLUE      |    | GREEN     |
          | (current) |    | (new)     |
          | v1.2.3    |    | v1.3.0    |
          | 100%      |    | 0%        |
          +-----------+    +-----------+

Step 1: Deploy new version to GREEN (no traffic)
Step 2: Run smoke tests against GREEN
Step 3: Switch load balancer to GREEN (instant)
Step 4: If problems --> switch back to BLUE (instant rollback)
Step 5: Tear down BLUE after confirmation period

Advantages:

Instant rollback (just switch the load balancer)
Zero downtime
Full production testing before users see the new version

Disadvantage:

Requires 2x resources during deployment
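The cut-over decision in steps 2 to 4 can be sketched as a pure function. This is a minimal sketch: `other_color` and `next_live_color` are hypothetical helpers, and the smoke test is whatever command the caller passes in; the actual load balancer switch is left out.

```shell
# Return the color that is currently idle.
other_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown color: $1" >&2; return 1 ;;
  esac
}

# Decide which color should receive traffic after a deploy.
# $1 = currently live color; remaining arguments = smoke test command for the idle color.
next_live_color() {
  live=$1
  shift
  idle=$(other_color "${live}") || return 1
  if "$@" >/dev/null 2>&1; then
    echo "${idle}"   # smoke tests passed: cut over to the new version
  else
    echo "${live}"   # smoke tests failed: keep serving from the current version
  fi
}
```

For example, `next_live_color blue curl -fsS https://green.example.com/health` prints `green` only if the health request succeeds; the URL here is a placeholder, not one from this guide.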

10.2 Canary Deployment

                    Load Balancer
                    /           \
                   / 95%         \ 5%
          +-------+---+    +-----+-----+
          | STABLE    |    | CANARY    |
          | v1.2.3    |    | v1.3.0    |
          | 95% of    |    | 5% of     |
          | traffic   |    | traffic   |
          +-----------+    +-----------+

Step 1: Deploy new version to CANARY (5% traffic)
Step 2: Monitor error rates, latency, logs
Step 3: Gradually increase: 5% --> 25% --> 50% --> 100%
Step 4: If problems at any stage --> route 100% back to STABLE

Advantages:

Limits blast radius (only 5% of users affected by bugs)
Real production traffic for testing
Gradual confidence building

Disadvantages:

More complex routing configuration
Need good monitoring to detect canary issues
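The stepwise promotion above can be sketched as a loop with an abort condition. This is a minimal sketch: `set_canary_weight` and `canary_error_rate_percent` are hypothetical stubs standing in for real routing and monitoring calls, and a real rollout would also wait between steps to collect enough traffic.

```shell
# Hypothetical hooks: replace these with real routing and monitoring calls.
set_canary_weight() { echo "routing $1% of traffic to canary"; }
canary_error_rate_percent() { echo 0; }   # stub: integer error rate in percent

# Step the canary through 5% --> 25% --> 50% --> 100%.
# Abort and route everything back to stable if the error rate exceeds the limit.
run_canary() {
  max_error_percent=$1
  for weight in 5 25 50 100; do
    set_canary_weight "${weight}"
    rate=$(canary_error_rate_percent)
    if [ "${rate}" -gt "${max_error_percent}" ]; then
      echo "error rate ${rate}% exceeds ${max_error_percent}%; rolling back" >&2
      set_canary_weight 0
      return 1
    fi
  done
  echo "canary promoted to 100%"
}
```

With the stubs as written, `run_canary 1` walks through all four steps and reports promotion; swapping in a stub that reports a higher error rate shows the rollback path.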

10.3 Deployment Strategy Comparison

Strategy	Rollback Speed	Risk	Complexity	Resource Cost
Rolling update	Minutes	Medium	Low	1x to 2x during deploy (bounded by `maximumPercent`)
Blue/Green	Seconds	Low	Medium	2x during deploy
Canary	Seconds	Very Low	High	1.05x during deploy
Recreate	Minutes (downtime)	High	Very Low	1x

11. Deploying to EC2 from CI/CD (Alternative to ECS)

For simpler setups without ECS, you can deploy directly to EC2.

# .github/workflows/deploy-ec2.yml
deploy:
  name: Deploy to EC2
  runs-on: ubuntu-latest
  needs: [test]
  if: github.ref == 'refs/heads/main'
  steps:
    - name: Deploy via SSH
      uses: appleboy/ssh-action@v1
      with:
        host: ${{ secrets.EC2_HOST }}
        username: ec2-user
        key: ${{ secrets.EC2_SSH_KEY }}
        script: |
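          set -e
          cd /home/ec2-user/app
          PREV_SHA=$(git rev-parse HEAD)   # remember the running version for rollback
          git pull origin main
          npm ci                           # dev dependencies are needed for the build
          npm run build
          npm prune --omit=dev             # drop dev dependencies after building
          pm2 reload ecosystem.config.js
          # Verify health; on failure, restore the previous commit and reload
          sleep 5
          if ! curl -fsS http://localhost:3000/health; then
            git reset --hard "$PREV_SHA"
            npm ci && npm run build && npm prune --omit=dev
            pm2 reload ecosystem.config.js
            exit 1
          fi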

12. Complete CI/CD Checklist

Pre-merge (CI):
  [ ] Code linted (ESLint, Prettier)
  [ ] Unit tests pass
  [ ] Integration tests pass
  [ ] Test coverage above threshold (e.g., 80%)
  [ ] No new lint warnings
  [ ] PR reviewed and approved

Post-merge (CD):
  [ ] Docker image built (multi-stage)
  [ ] Image scanned for vulnerabilities
  [ ] Image pushed to ECR
  [ ] Task definition updated
  [ ] ECS rolling deployment started
  [ ] Health checks pass
  [ ] Deployment verified
  [ ] Monitoring confirmed normal

Rollback ready:
  [ ] Previous image tag known
  [ ] Circuit breaker enabled
  [ ] Rollback procedure documented
  [ ] Team notified of deployment

13. Key Takeaways

CI catches bugs before merge -- lint, test, and build on every push, not just before release.
Continuous Deployment is the goal -- every merge to main should be production-ready.
GitHub Actions is free for public repos on standard GitHub-hosted runners; private repos get a monthly allowance of free minutes that depends on the plan.
Tag Docker images with git SHA -- traceable, unique, and enables instant rollback.
Never deploy :latest to production -- always use specific tags.
Use GitHub Secrets for all sensitive values -- never hardcode credentials in workflow files.
ECS circuit breaker provides automatic rollback if new deployments fail health checks.
Blue/green gives instant rollback; canary limits blast radius -- choose based on your risk tolerance.

Explain-It Challenge

A product manager asks "why can't we just FTP the code to the server?" Explain the value of CI/CD without using jargon.
Your CI pipeline takes 15 minutes. What strategies would you use to cut it to 5 minutes?
A deployment goes wrong at 5 PM on Friday. Walk through your rollback procedure step by step.
