Episode 6 — Scaling Reliability Microservices Web3 / 6.9 — Final Production Deployment
6.9.c -- CI/CD Pipelines
In one sentence: CI/CD automates the journey from git push to production -- Continuous Integration builds and tests every commit, Continuous Deployment ships passing builds automatically, and GitHub Actions orchestrates the entire pipeline with Docker builds, security scans, environment promotion, and rollback strategies.
Navigation: <-- 6.9.b EC2 and SSL | 6.9 Overview
1. What CI/CD Is and Why It Matters
1.1 The Problem CI/CD Solves
Without CI/CD (manual deployment):
Developer writes code
--> "Works on my machine!"
--> Emails zip file to ops team
--> Ops SSHes into server at 2 AM
--> Runs deploy script (fingers crossed)
--> 30-minute rollback if things break
--> Team is afraid to deploy on Fridays
With CI/CD:
Developer pushes to main
--> Automated lint, test, build, scan
--> Docker image pushed to registry
--> Deployed to staging automatically
--> One-click promotion to production
--> Instant rollback to previous version
--> Deploy 10 times a day with confidence
1.2 Definitions
| Term | Definition | Example |
|---|---|---|
| Continuous Integration (CI) | Automatically build and test every commit | Run npm test and npm run lint on every push |
| Continuous Delivery (CD) | Every passing build is ready to deploy (manual approval) | Build passes all tests; human clicks "Deploy to Prod" |
| Continuous Deployment (CD) | Every passing build is automatically deployed | Merge to main --> live in production in 5 minutes |
CI vs Delivery vs Deployment:
Code --> Build --> Test --> [Stage] --> [Prod]
|                     |        |           |
+-- CI (automated) ---+        |           |
+-- Continuous Delivery -------+ (manual)  |
+-- Continuous Deployment -----+-----------+ (automatic)
2. Continuous Integration: Build + Test on Every Push
CI is the foundation. Every push to the repository triggers an automated pipeline that:
- Checks out the code
- Installs dependencies
- Lints the code (ESLint, Prettier)
- Runs tests (unit, integration)
- Builds the project (TypeScript compile, Docker image)
- Scans for vulnerabilities
If any step fails, the pipeline stops and the team is notified. When the checks are marked as required in the branch protection rules, a failing run also blocks the pull request from merging.
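The fail-fast behaviour described above can be sketched locally as a small shell function. This is a minimal illustration, not how GitHub Actions is implemented: `run_stages` is a hypothetical helper name, and the npm script names in the usage comment are assumed to exist in your package.json.

```shell
# run_stages: execute each stage command in order and stop at the first failure.
# Stage commands are passed as arguments, so no particular tool is assumed here.
run_stages() {
  local stage
  for stage in "$@"; do
    echo "==> ${stage}"
    # Each stage runs in a child shell; a non-zero exit skips the remaining stages.
    if ! sh -c "${stage}"; then
      echo "FAILED: ${stage}" >&2
      return 1
    fi
  done
  echo "All stages passed"
}

# Example (hypothetical script names):
# run_stages "npm ci" "npm run lint" "npm test" "npm run build"
```

Running the same sequence locally before pushing gives a rough preview of what the CI run will report.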
2.1 Why CI Matters
| Without CI | With CI |
|---|---|
| Bugs found in production | Bugs caught before merge |
| "It works on my machine" | Works in a clean environment |
| Merge conflicts pile up | Small, frequent merges |
| No one runs tests locally | Tests run automatically |
| Inconsistent code style | Linting enforced on every PR |
3. GitHub Actions Workflow Files
GitHub Actions workflows are defined in YAML files inside .github/workflows/. They run on GitHub-hosted runners by default, or on self-hosted runners you register yourself.
3.1 Workflow File Structure
# .github/workflows/ci.yml
name: CI Pipeline               # Name shown in the GitHub UI
on:                             # Trigger conditions
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
env:                            # Global environment variables
  NODE_VERSION: "20"
  REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
jobs:                           # One or more jobs
  lint:                         # Job name
    runs-on: ubuntu-latest      # Runner machine
    steps:                      # Sequential steps within the job
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    needs: [lint]               # Run after lint passes
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"
      - run: npm ci
      - run: npm test
3.2 Key Concepts
| Concept | Explanation |
|---|---|
| Trigger (on) | When the workflow runs: push, pull_request, schedule, manual |
| Job | A set of steps that run on the same runner. Jobs run in parallel by default. |
| Step | A single command or action within a job. Steps run sequentially. |
| needs | Creates a dependency between jobs (sequential execution) |
| uses | References a reusable action (e.g., actions/checkout@v4) |
| run | Executes a shell command |
| Secrets | Encrypted variables accessed via ${{ secrets.MY_SECRET }} |
| Artifacts | Files shared between jobs (test reports, build output) |
| Cache | Speeds up builds by reusing downloaded dependencies (such as the npm cache) between runs |
4. Build, Test, Lint, Deploy Pipeline
4.1 Pipeline Stages
+----------+     +--------+     +--------+     +--------+     +--------+
| Lint     | --> | Test   | --> | Build  | --> | Scan   | --> | Deploy |
| ESLint   |     | Jest   |     | Docker |     | Trivy  |     | ECS /  |
| Prettier |     | Integ  |     | Image  |     | Snyk   |     | EC2    |
+----------+     +--------+     +--------+     +--------+     +--------+
     |               |              |              |              |
     | Fail?         | Fail?        | Fail?        | Fail?        | Fail?
     v               v              v              v              v
   STOP            STOP           STOP           STOP          ROLLBACK
   Notify          Notify         Notify         Notify         Notify
4.2 Complete GitHub Actions Workflow
# .github/workflows/deploy.yml
name: Build, Test, and Deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  NODE_VERSION: "20"
  AWS_REGION: us-east-1
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  ECR_REPOSITORY: my-api
  ECS_CLUSTER: production-cluster
  ECS_SERVICE: my-api-service
  ECS_TASK_DEF: my-api-task
permissions:
  contents: read
  id-token: write
jobs:
  # =============================================
  # Job 1: Lint
  # =============================================
  lint:
    name: Lint Code
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"
      - name: Install dependencies
        run: npm ci
      - name: Run ESLint
        run: npm run lint
      - name: Check formatting
        run: npx prettier --check .
  # =============================================
  # Job 2: Test (runs after lint)
  # =============================================
  test:
    name: Run Tests
    runs-on: ubuntu-latest
    needs: [lint]
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test_db
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7-alpine
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"
      - name: Install dependencies
        run: npm ci
      - name: Run unit tests
        run: npm run test:unit
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379
      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379
      - name: Upload coverage report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
  # =============================================
  # Job 3: Build and Push Docker Image
  # =============================================
  build:
    name: Build Docker Image
    runs-on: ubuntu-latest
    needs: [test]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-role
          aws-region: ${{ env.AWS_REGION }}
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Generate image metadata
        id: meta
        run: |
          SHA=$(echo ${{ github.sha }} | cut -c1-7)
          echo "tags=${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${SHA}" >> $GITHUB_OUTPUT
          echo "sha_tag=${SHA}" >> $GITHUB_OUTPUT
      - name: Build Docker image
        run: |
          docker build \
            --build-arg BUILD_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
            --build-arg GIT_SHA=${{ github.sha }} \
            -t ${{ steps.meta.outputs.tags }} \
            -t ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:latest \
            .
      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.meta.outputs.tags }}
          format: "table"
          exit-code: "1"
          severity: "CRITICAL,HIGH"
      - name: Push image to ECR
        run: |
          docker push ${{ steps.meta.outputs.tags }}
          docker push ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:latest
  # =============================================
  # Job 4: Deploy to ECS
  # =============================================
  deploy:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: [build]
    environment: production     # Requires approval in GitHub settings
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-role
          aws-region: ${{ env.AWS_REGION }}
      - name: Download current task definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ env.ECS_TASK_DEF }} \
            --query taskDefinition \
            > task-definition.json
      - name: Update task definition with new image
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-api
          image: ${{ needs.build.outputs.image_tag }}
      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 10
      - name: Verify deployment
        run: |
          HEALTH_URL="https://api.example.com/health"
          for i in {1..5}; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed!"
              exit 0
            fi
            echo "Attempt $i: status $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Health check failed after 5 attempts"
          exit 1
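The retry loop in the "Verify deployment" step can be factored into a reusable function. The sketch below is one possible shape, assuming a POSIX shell; `retry` is a hypothetical helper, not a built-in, and the URL in the usage comment is the same placeholder used in the workflow.

```shell
# retry: run a command up to N times, sleeping between attempts.
# usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    echo "Attempt ${i}/${attempts} failed" >&2
    # No need to sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example, mirroring the workflow step:
# retry 5 10 curl --fail --silent --output /dev/null https://api.example.com/health
```

Passing the command as arguments keeps the helper independent of curl, so the same function can wrap a database ping or a smoke-test script.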
5. Docker Image Build and Push in CI
5.1 The Build Flow
Source Code (GitHub)
  |
  v
GitHub Actions Runner
  |
  +--> docker build -t my-api:abc1234 .
  |      |
  |      +--> Stage 1 (builder): install deps, compile TS
  |      +--> Stage 2 (production): copy dist, prod deps only
  |      +--> Image: ~150 MB
  |
  +--> trivy image my-api:abc1234
  |      |
  |      +--> PASS: No critical vulnerabilities
  |
  +--> docker push ECR/my-api:abc1234
         |
         +--> Available for ECS to pull
5.2 Tagging Strategy
# Tag with git SHA (unique, traceable)
docker tag my-api:latest ECR/my-api:abc1234
# Tag with semantic version (human-readable)
docker tag my-api:latest ECR/my-api:v1.2.3
# Tag with "latest" (convenience, but dangerous for production)
docker tag my-api:latest ECR/my-api:latest
# Best practice: use SHA for deployments, semver for releases
# NEVER deploy :latest to production -- the tag is mutable, so it does not identify which build is running
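The tagging rules above can be expressed as small shell helpers. This is a sketch under assumptions: the function names `short_sha`, `image_ref`, and `is_pinned_tag` are hypothetical, and the registry and repository values are the document's placeholders.

```shell
# short_sha: first 7 characters of a commit SHA (same as `cut -c1-7` in the workflow).
short_sha() {
  printf '%s' "$1" | cut -c1-7
}

# image_ref: build registry/repository:tag for a given commit.
image_ref() {
  local registry="$1" repository="$2" sha="$3"
  printf '%s/%s:%s' "$registry" "$repository" "$(short_sha "$sha")"
}

# is_pinned_tag: succeed only for references with an explicit, non-"latest" tag.
is_pinned_tag() {
  local ref="$1" tag
  tag="${ref##*:}"
  case "$tag" in
    # No colon at all, a colon that belongs to a registry port, an empty tag,
    # or the mutable "latest" tag: all rejected.
    "$ref" | */* | latest | '') return 1 ;;
  esac
  return 0
}

# Example guard before a deploy step:
# is_pinned_tag "$IMAGE" || { echo "Refusing to deploy unpinned image" >&2; exit 1; }
```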
6. Deploying to ECS from CI/CD
6.1 ECS Deployment Flow
CI/CD pushes new image to ECR
  |
  v
Update ECS task definition with new image tag
  |
  v
ECS service detects new task definition
  |
  v
Rolling update begins:
  |
  +--> Start new task(s) with new image
  +--> Wait for new task(s) to pass health check
  +--> Drain connections from old task(s)
  +--> Stop old task(s)
  |
  v
Deployment complete (zero downtime)
6.2 Rolling Update Configuration
{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }
}
What these settings mean:
| Setting | Value | Effect |
|---|---|---|
| maximumPercent | 200 | Can run up to 2x the desired count during deploy |
| minimumHealthyPercent | 100 | Old tasks stay until new ones are healthy |
| deploymentCircuitBreaker | enabled | Auto-rollback if new tasks keep failing |
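To see what the two percentages mean for a concrete service, the arithmetic can be worked through in shell. This is a sketch: the function names are hypothetical, and the rounding (upper limit rounded down, lower limit rounded up) follows the behaviour described in the ECS documentation, so verify it against your own service before relying on it.

```shell
# max_running_tasks: desired count * maximumPercent / 100, rounded down.
max_running_tasks() {
  local desired="$1" max_pct="$2"
  echo $(( desired * max_pct / 100 ))
}

# min_healthy_tasks: desired count * minimumHealthyPercent / 100, rounded up.
min_healthy_tasks() {
  local desired="$1" min_pct="$2"
  echo $(( (desired * min_pct + 99) / 100 ))
}

# With the configuration above and a desired count of 4:
# max_running_tasks 4 200   # upper limit during the deploy
# min_healthy_tasks 4 100   # tasks that must stay healthy throughout
```

With a desired count of 4, the settings above allow up to 8 tasks while never dropping below 4 healthy ones, which is why the rolling update needs spare cluster capacity.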
7. Environment-Specific Deployments
7.1 Staging to Production Promotion
develop branch --> Staging Environment
|
| Tests pass, QA approves
v
main branch --> Production Environment
# Separate workflows per environment
# .github/workflows/deploy-staging.yml
on:
  push:
    branches: [develop]
jobs:
  deploy:
    environment: staging
    # ... deploy to staging ECS cluster
# .github/workflows/deploy-production.yml
on:
  push:
    branches: [main]
jobs:
  deploy:
    environment: production
    # ... deploy to production ECS cluster
7.2 Environment Variables per Stage
# In the deploy job
- name: Set environment variables
  run: |
    if [ "${{ github.ref }}" = "refs/heads/main" ]; then
      echo "DEPLOY_ENV=production" >> $GITHUB_ENV
      echo "ECS_CLUSTER=prod-cluster" >> $GITHUB_ENV
      echo "API_URL=https://api.example.com" >> $GITHUB_ENV
    else
      echo "DEPLOY_ENV=staging" >> $GITHUB_ENV
      echo "ECS_CLUSTER=staging-cluster" >> $GITHUB_ENV
      echo "API_URL=https://staging-api.example.com" >> $GITHUB_ENV
    fi
8. Secrets Management in CI
8.1 GitHub Secrets
Store sensitive values in GitHub Settings > Secrets and variables > Actions.
# Reference secrets in workflow
- name: Configure AWS
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: us-east-1
- name: Deploy
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL }}
    JWT_SECRET: ${{ secrets.JWT_SECRET }}
  run: ./deploy.sh
8.2 Environment-Level Secrets
GitHub Secrets hierarchy:
Organization secrets --> Available to all repos
|
v
Repository secrets --> Available to all workflows in this repo
|
v
Environment secrets --> Available only in a specific environment (staging/production)
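jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production   # Only jobs that target this environment can read its secrets
    steps:
      # Pass the secret to the process as an environment variable; do not echo it
      - name: Deploy
        env:
          DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }}
        run: ./deploy.sh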
8.3 Security Best Practices for CI Secrets
| Practice | Reason |
|---|---|
| Never echo secrets | GitHub masks exact secret values as *** in logs, but transformed or partial values (e.g., base64-encoded) are not masked and can leak |
| Use OIDC for AWS | role-to-assume instead of access keys -- no long-lived credentials |
| Scope secrets to environments | Production secrets only available in production jobs |
| Rotate secrets regularly | Update secrets quarterly or after team changes |
| Audit secret access | Review who has permission to read/write repository secrets |
| Never commit secrets | Use .gitignore for .env files; use git-secrets pre-commit hook |
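One way to apply the "never echo secrets" rule is to validate that required values are present without ever printing them. The helper below is a minimal sketch; `require_env` is a hypothetical name, and the variable names in the usage comment are taken from the deploy step shown earlier.

```shell
# require_env: fail if any named variable is unset or empty.
# Only variable NAMES are printed, never their values.
require_env() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    # Callers should pass fixed, trusted names only, since eval is used.
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example at the top of a deploy script:
# require_env DATABASE_URL JWT_SECRET || exit 1
```

Failing early with the variable name gives a useful log line while keeping the secret itself out of the output.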
9. Rollback Strategies
9.1 Image-Based Rollback (Fastest)
# Every deployment tags the image with the git SHA
# To roll back, re-deploy the previous image tag
# Find the previous working image
aws ecr describe-images --repository-name my-api \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:].[imageTags[0],imagePushedAt]' \
  --output table
# Point the ECS service at the previous task definition revision (42 here),
# which still references the old image
aws ecs update-service \
  --cluster production-cluster \
  --service my-api-service \
  --task-definition my-api-task:42 \
  --force-new-deployment
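If the revision currently deployed is known, the rollback target can be derived rather than typed by hand. The sketch below assumes revisions are numbered consecutively and that the one before the current revision is still active and was healthy; both assumptions should be checked (for example with aws ecs describe-task-definition) before use. `previous_revision` is a hypothetical helper.

```shell
# previous_revision: turn "family:N" into "family:N-1".
previous_revision() {
  local ref="$1" family revision
  family="${ref%:*}"
  revision="${ref##*:}"
  case "$revision" in
    # Reject references without a purely numeric revision suffix.
    '' | *[!0-9]*)
      echo "Not a family:revision reference: ${ref}" >&2
      return 1
      ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "No earlier revision for: ${ref}" >&2
    return 1
  fi
  echo "${family}:$((revision - 1))"
}

# Example:
# previous_revision my-api-task:43   # prints my-api-task:42
```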
9.2 Git Revert Rollback
# Revert the problematic commit
git revert HEAD
git push origin main
# CI/CD pipeline runs automatically --> deploys the reverted code
9.3 ECS Circuit Breaker (Automatic)
{
  "deploymentCircuitBreaker": {
    "enable": true,
    "rollback": true
  }
}
If the new task definition fails health checks repeatedly, ECS automatically rolls back to the previous working task definition.
10. Blue/Green and Canary Deployments
10.1 Blue/Green Deployment
        Load Balancer
         /         \
        /           \
+-------+---+   +-----+-----+
|   BLUE    |   |   GREEN   |
| (current) |   |   (new)   |
|  v1.2.3   |   |  v1.3.0   |
|   100%    |   |    0%     |
+-----------+   +-----------+
Step 1: Deploy new version to GREEN (no traffic)
Step 2: Run smoke tests against GREEN
Step 3: Switch load balancer to GREEN (instant)
Step 4: If problems --> switch back to BLUE (instant rollback)
Step 5: Tear down BLUE after confirmation period
Advantages:
- Instant rollback (just switch the load balancer)
- Zero downtime
- Full production testing before users see the new version
Disadvantage:
- Requires 2x resources during deployment
10.2 Canary Deployment
        Load Balancer
         /         \
        / 95%       \ 5%
+-------+---+   +-----+-----+
|  STABLE   |   |  CANARY   |
|  v1.2.3   |   |  v1.3.0   |
|  95% of   |   |   5% of   |
|  traffic  |   |  traffic  |
+-----------+   +-----------+
Step 1: Deploy new version to CANARY (5% traffic)
Step 2: Monitor error rates, latency, logs
Step 3: Gradually increase: 5% --> 25% --> 50% --> 100%
Step 4: If problems at any stage --> route 100% back to STABLE
Advantages:
- Limits blast radius (only 5% of users affected by bugs)
- Real production traffic for testing
- Gradual confidence building
Disadvantages:
- More complex routing configuration
- Need good monitoring to detect canary issues
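The canary steps above amount to a loop over traffic weights with a health gate at each stage. The following is a sketch only: `canary_rollout` is a hypothetical function, and the two callbacks stand in for whatever your load balancer and monitoring actually expose.

```shell
# canary_rollout: walk through the traffic weights, checking health at each step.
#   $1 = name of a function that sets the canary traffic weight (percent)
#   $2 = name of a function that returns 0 while the canary looks healthy
canary_rollout() {
  local set_weight="$1" is_healthy="$2" weight
  for weight in 5 25 50 100; do
    "$set_weight" "$weight"
    if ! "$is_healthy" "$weight"; then
      echo "Canary unhealthy at ${weight}%, routing all traffic back to stable" >&2
      # Weight 0 means the stable version serves 100% of traffic again.
      "$set_weight" 0
      return 1
    fi
  done
}

# Example with placeholder callbacks (replace with real routing and monitoring):
# set_weight() { echo "canary weight -> $1%"; }
# is_healthy() { true; }
# canary_rollout set_weight is_healthy
```

In practice each step would also wait long enough for error-rate and latency metrics to become meaningful before the health check runs.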
10.3 Deployment Strategy Comparison
| Strategy | Rollback Speed | Risk | Complexity | Resource Cost |
|---|---|---|---|---|
| Rolling update | Minutes | Medium | Low | 1x |
| Blue/Green | Seconds | Low | Medium | 2x during deploy |
| Canary | Seconds | Very Low | High | 1.05x during deploy |
| Recreate | Minutes (downtime) | High | Very Low | 1x |
11. Deploying to EC2 from CI/CD (Alternative to ECS)
For simpler setups without ECS, you can deploy directly to EC2.
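# .github/workflows/deploy-ec2.yml
jobs:
  # ... lint and test jobs as in the earlier workflow
  deploy:
    name: Deploy to EC2
    runs-on: ubuntu-latest
    needs: [test]
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ec2-user
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            set -e
            cd /home/ec2-user/app
            PREVIOUS=$(git rev-parse HEAD)   # remember the running commit for rollback
            git pull origin main
            npm ci                           # assumes the build needs devDependencies (e.g., TypeScript)
            npm run build
            npm prune --omit=dev             # drop devDependencies after building
            pm2 reload ecosystem.config.js
            # Verify health; on failure, restore the previous commit and fail the job
            sleep 5
            if ! curl -f http://localhost:3000/health; then
              git reset --hard "$PREVIOUS"
              npm ci && npm run build && npm prune --omit=dev
              pm2 reload ecosystem.config.js
              exit 1
            fi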
12. Complete CI/CD Checklist
Pre-merge (CI):
[ ] Code linted (ESLint, Prettier)
[ ] Unit tests pass
[ ] Integration tests pass
[ ] Test coverage above threshold (e.g., 80%)
[ ] No new lint warnings
[ ] PR reviewed and approved
Post-merge (CD):
[ ] Docker image built (multi-stage)
[ ] Image scanned for vulnerabilities
[ ] Image pushed to ECR
[ ] Task definition updated
[ ] ECS rolling deployment started
[ ] Health checks pass
[ ] Deployment verified
[ ] Monitoring confirmed normal
Rollback ready:
[ ] Previous image tag known
[ ] Circuit breaker enabled
[ ] Rollback procedure documented
[ ] Team notified of deployment
13. Key Takeaways
- CI catches bugs before merge -- lint, test, and build on every push, not just before release.
- Continuous Deployment is the goal -- every merge to main should be production-ready.
- GitHub Actions provides free CI/CD for public repos and generous limits for private repos.
- Tag Docker images with git SHA -- traceable, unique, and enables instant rollback.
- Never deploy :latest to production -- always use specific tags.
- Use GitHub Secrets for all sensitive values -- never hardcode credentials in workflow files.
- ECS circuit breaker provides automatic rollback if new deployments fail health checks.
- Blue/green gives instant rollback; canary limits blast radius -- choose based on your risk tolerance.
Explain-It Challenge
- A product manager asks "why can't we just FTP the code to the server?" Explain the value of CI/CD without using jargon.
- Your CI pipeline takes 15 minutes. What strategies would you use to cut it to 5 minutes?
- A deployment goes wrong at 5 PM on Friday. Walk through your rollback procedure step by step.