Episode 6 — Scaling Reliability Microservices Web3 / 6.9 — Final Production Deployment

Interview Questions: Final Production Deployment

Model answers for Docker deployment, EC2 and SSL, and CI/CD pipelines.

How to use this material (instructions)

  1. Read lessons in order -- README.md, then 6.9.a --> 6.9.c.
  2. Practice out loud -- definition --> example --> pitfall.
  3. Pair with exercises -- 6.9-Exercise-Questions.md.
  4. Quick review -- 6.9-Quick-Revision.md.

Beginner (Q1--Q4)

Q1. What is a multi-stage Docker build and why should you use one?

Why interviewers ask: Tests understanding of production Docker practices -- image size directly affects deploy speed, registry costs, and security surface area.

Model answer:

A multi-stage build uses multiple FROM statements in a single Dockerfile. The first stage (the "builder") installs all dependencies including devDependencies, compiles TypeScript, and runs any build steps. The second stage (the "production" stage) starts from a clean base image and copies only the compiled output and production dependencies from the builder stage.

This matters because a single-stage Node.js image is typically 1-1.5 GB (includes TypeScript compiler, test frameworks, build tools, source code). A multi-stage image is typically 100-200 MB -- only the runtime, production node_modules, and compiled JavaScript.

Smaller images mean: faster pulls from the registry during deployment, lower storage costs in ECR, smaller attack surface (fewer packages to have CVEs), and faster auto-scaling (new instances start sooner because the image downloads faster).

The pattern is: build everything in stage 1, then COPY --from=builder /app/dist ./dist in stage 2 to cherry-pick only what you need.
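A minimal sketch of the pattern, assuming a TypeScript service whose build step emits compiled JavaScript to dist/ (file paths and the entry point are illustrative):

```dockerfile
# Stage 1: builder -- full toolchain, devDependencies, TypeScript compiler
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build            # emits compiled JS to /app/dist

# Stage 2: production -- clean base, runtime artefacts only
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev        # production dependencies only
COPY --from=builder /app/dist ./dist
CMD ["node", "dist/index.js"]
```

Only the second stage becomes the shipped image; the builder stage and everything in it is discarded.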


Q2. Why do you need a reverse proxy like Nginx in front of Node.js?

Why interviewers ask: Evaluates understanding of production architecture -- a very common setup that candidates should be able to justify.

Model answer:

Node.js is a single-threaded runtime optimised for handling asynchronous I/O, not for heavy HTTP processing like SSL handshakes, static file serving, or connection buffering. Nginx is a C-based, event-driven server that runs multiple worker processes and is designed specifically for these tasks.

A reverse proxy provides: (1) SSL termination -- Nginx handles the TLS handshake much more efficiently than Node.js. (2) Static file serving -- Nginx serves images, CSS, and JS directly from disk without touching Node.js. (3) Connection buffering -- slow clients are handled by Nginx, freeing Node.js to process the next request. (4) Gzip compression -- Nginx compresses responses before sending. (5) Load balancing -- Nginx can distribute requests across multiple Node.js processes. (6) Security headers -- HSTS, X-Frame-Options, and other headers added at the proxy level.

Without Nginx, your Node.js process is exposed directly to the internet and must handle all these concerns itself, which is both slower and less secure.
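A minimal Nginx server block illustrating several of the points above, assuming the Node.js process listens on localhost:3000 (the domain and filesystem paths are illustrative):

```nginx
server {
    listen 80;
    server_name api.example.com;

    gzip on;                                # (4) compress responses at the proxy
    add_header X-Frame-Options "DENY";      # (6) security headers added at the proxy level

    location /static/ {
        root /var/www;                      # (2) serve static files from disk, bypassing Node.js
    }

    location / {
        proxy_pass http://localhost:3000;   # (3) buffer the client, forward to Node.js
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```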


Q3. What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?

Why interviewers ask: Foundational vocabulary -- if you cannot define these three terms, the interviewer doubts your deployment experience.

Model answer:

Continuous Integration (CI) is the practice of automatically building and testing code every time someone pushes a commit. The goal is to catch bugs and integration issues early. A typical CI pipeline runs lint, unit tests, and integration tests. If any step fails, the pull request is blocked from merging.

Continuous Delivery extends CI by ensuring that every passing build is ready to deploy to production. The code is built, tested, and packaged (e.g., a Docker image pushed to a registry), but a human manually approves the actual production deployment.

Continuous Deployment goes one step further -- every passing build is automatically deployed to production with no human intervention. The only thing stopping bad code from reaching production is your test suite and automated checks.

The key distinction: Delivery requires a human to press "deploy"; Deployment does it automatically. Most teams start with Delivery and move to Deployment as their test coverage and monitoring improve.
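One way the distinction shows up in a GitHub Actions workflow sketch: the test job is CI, and the deploy job is Continuous Delivery because it targets an environment with a required reviewer; drop the approval requirement and the same pipeline becomes Continuous Deployment (job and environment names are illustrative):

```yaml
name: ci-cd
on:
  push:
    branches: [main]

jobs:
  test:                        # Continuous Integration: build and test every push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint
      - run: npm test

  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production    # Delivery: a human approves before this job runs;
                               # without required reviewers it runs automatically (Deployment)
    steps:
      - run: echo "push image / update service here"
```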


Q4. How do you connect a custom domain to your server and enable HTTPS?

Why interviewers ask: Practical knowledge -- every production deployment needs a domain and SSL. Tests whether you have actually deployed something.

Model answer:

The process has three steps: DNS configuration, SSL certificate, and web server configuration.

Step 1 -- DNS: Register your domain or use an existing one. In your DNS provider (e.g., Route 53), create an A record pointing api.example.com to your server's public IP address (Elastic IP for EC2) or a CNAME/Alias record pointing to your ALB's DNS name.

Step 2 -- SSL certificate: For ALB-based setups, use AWS Certificate Manager (ACM) -- it is free, auto-renewing, and requires only DNS validation. For direct-to-server setups, use Let's Encrypt with Certbot -- also free, issues certificates via HTTP challenge, and auto-renews via a cron job or systemd timer.

Step 3 -- Web server: Configure Nginx to listen on port 443 with the SSL certificate, proxy requests to your Node.js application on localhost:3000, and redirect all port 80 (HTTP) traffic to HTTPS with a 301 redirect.

The result: users visit https://api.example.com, DNS resolves to your server, Nginx handles the SSL handshake, and proxies the decrypted request to Node.js.
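Step 3 might look like this in Nginx, assuming a Let's Encrypt certificate issued by Certbot (the certificate paths follow Certbot's default layout; the domain is illustrative):

```nginx
server {
    listen 80;
    server_name api.example.com;
    return 301 https://$host$request_uri;    # redirect all HTTP traffic to HTTPS
}

server {
    listen 443 ssl;
    server_name api.example.com;
    ssl_certificate     /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    location / {
        proxy_pass http://localhost:3000;    # decrypted request forwarded to Node.js
        proxy_set_header Host $host;
    }
}
```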


Intermediate (Q5--Q8)

Q5. Walk me through your Docker production checklist. What makes a Dockerfile "production-ready"?

Why interviewers ask: Separates candidates who have deployed Docker in production from those who only used it for development.

Model answer:

A production-ready Dockerfile follows this checklist:

1. Multi-stage build -- separate build and runtime stages. The final image contains only compiled output and production dependencies.

2. Minimal base image -- use node:20-alpine (built on the ~5 MB Alpine base) instead of node:20 (built on a Debian base that is an order of magnitude larger). Alpine ships far fewer packages, meaning fewer potential vulnerabilities.

3. Non-root user -- create a dedicated user with adduser and switch to it with USER. Never run the application as root inside the container.

4. .dockerignore -- exclude node_modules, .git, .env, test files, and documentation from the build context.

5. Layer optimisation -- copy package.json and package-lock.json first, run npm ci, then copy source code. This maximises cache hits when only source code changes.

6. npm ci not npm install -- npm ci deletes node_modules and installs exact versions from the lock file. Deterministic and faster.

7. Health check -- a HEALTHCHECK instruction tells Docker (and orchestrators) how to verify the container is healthy.

8. Resource limits -- set CPU and memory limits in Compose or the orchestrator to prevent runaway containers.

9. No secrets in the image -- inject secrets via environment variables, .env files, or Docker/Kubernetes secrets at runtime.

10. Security scanning -- scan the built image with Trivy or Snyk in CI before pushing to the registry.
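Items 3, 5, 6, and 7 from the checklist, sketched as a Dockerfile fragment (the user name, port, and /health endpoint are illustrative assumptions):

```dockerfile
FROM node:20-alpine
WORKDIR /app

# Item 5: copy manifests first so the dependency layer is cached
# unless package.json or the lock file changes
COPY package.json package-lock.json ./
RUN npm ci --omit=dev               # item 6: deterministic install from the lock file

COPY dist/ ./dist/

# Item 3: run as a dedicated non-root user
RUN addgroup -S app && adduser -S app -G app
USER app

# Item 7: let Docker and orchestrators verify liveness (assumes a /health endpoint)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -qO- http://localhost:3000/health || exit 1

CMD ["node", "dist/index.js"]
```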


Q6. Explain SSL termination. Where should it happen and what are the trade-offs?

Why interviewers ask: Tests understanding of network architecture and security considerations in cloud environments.

Model answer:

SSL termination is the process of decrypting HTTPS traffic. Wherever termination happens, that component handles the CPU-intensive TLS handshake and passes plaintext HTTP downstream.

Option A: Terminate at the ALB. The ALB uses an ACM certificate (free, auto-renewing). Traffic between the ALB and your application servers is HTTP over the VPC's private network. This is the most common pattern in AWS because it offloads SSL processing from your servers, centralises certificate management, and costs nothing extra. The trade-off is that traffic between ALB and EC2/ECS is unencrypted, but since it stays within your VPC, AWS considers this acceptable for most compliance requirements.

Option B: Terminate at Nginx on the server. Nginx on EC2 uses a Let's Encrypt certificate. This gives you end-to-end encryption from the client to your server. You manage renewal yourself (Certbot automates this). This is the right choice for single-server setups without an ALB or when compliance requires end-to-end encryption.

Option C: Re-encrypt at both levels. The ALB terminates and re-encrypts to Nginx, which terminates again and proxies to Node.js as HTTP. Maximum security, but double the certificate management and slight performance overhead.

For most AWS deployments, Option A (ALB termination) is correct. Use Option B for simple single-instance setups. Use Option C only when regulatory compliance demands it.


Q7. Your CI/CD pipeline builds and deploys on every push to main. A bad commit reaches production. Walk me through your rollback.

Why interviewers ask: Incident response and operational maturity. Tests whether you have actually dealt with production issues.

Model answer:

Immediate rollback happens in under 5 minutes:

Step 1 (30 seconds): Identify the previous working version. Every deployment tags the Docker image with the git commit SHA. Check the ECS service events or ECR image list to find the SHA of the last successful deployment.

Step 2 (2 minutes): Re-deploy the previous image. If using ECS, update the service to use the previous task definition revision: aws ecs update-service --cluster prod --service my-api --task-definition my-api-task:41 --force-new-deployment. If using EC2 with PM2, run git revert HEAD && git push origin main to trigger a new CI/CD pipeline that deploys the reverted code.

Step 3 (2 minutes): Verify. Hit the health endpoint: curl https://api.example.com/health. Check error rates in monitoring (CloudWatch, Datadog). Verify the rollback completed by checking the running task definition.

Step 4 (post-incident): Investigate. Read the logs for the failed deployment. Determine root cause. Add a test or check that would have caught this issue in CI. Write a post-mortem.

Prevention: ECS deployment circuit breaker should be enabled -- if the new task fails health checks, ECS automatically rolls back without human intervention. This handles most bad deployments before they serve significant traffic.
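The Step 2 rollback can be wrapped in a manually-triggered workflow so it is one click rather than a command typed under pressure. A sketch, assuming the cluster/service names above and AWS credentials already configured for the runner:

```yaml
name: rollback
on:
  workflow_dispatch:
    inputs:
      task_definition:
        description: "Known-good task definition revision, e.g. my-api-task:41"
        required: true

jobs:
  rollback:
    runs-on: ubuntu-latest
    steps:
      # Step 2: point the service back at the previous task definition
      - run: |
          aws ecs update-service \
            --cluster prod \
            --service my-api \
            --task-definition "${{ github.event.inputs.task_definition }}" \
            --force-new-deployment
      # Step 3: verify the health endpoint after the rollback
      - run: curl -fsS https://api.example.com/health
```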


Q8. How do you manage environment-specific configuration (development, staging, production) in Docker and CI/CD?

Why interviewers ask: Configuration management is a real source of production incidents. Tests practical experience with multi-environment systems.

Model answer:

Configuration is managed at three levels:

1. Docker Compose files. Use a base docker-compose.yml with shared service definitions, then environment-specific overrides: docker-compose.override.yml (auto-loaded in development, adds volume mounts and debug logging) and docker-compose.prod.yml (uses pre-built images from ECR, sets resource limits, removes exposed ports). Run with: docker compose -f docker-compose.yml -f docker-compose.prod.yml up.

2. Environment variables. The application reads configuration from environment variables, not hardcoded values. In development, a .env file provides them. In CI, GitHub Secrets provides them. In production, ECS task definitions or AWS Systems Manager Parameter Store provides them. The app code is identical across environments -- only the injected values differ.

3. CI/CD environment promotion. GitHub Actions workflows use the environment key to scope secrets and require approvals. Pushing to develop deploys to staging with staging secrets. Merging to main deploys to production with production secrets. The Docker image is the same -- only the environment variables change.

The key principle: build once, deploy many times. The same Docker image runs in staging and production. Never rebuild for a different environment -- only inject different configuration.
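The production override file from point 1 can be sketched as follows, merged over the base file with the two -f flags shown above (registry URL, tag, and limits are illustrative):

```yaml
# docker-compose.prod.yml -- applied with:
#   docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
services:
  api:
    # pre-built image from ECR, tagged with the git commit SHA (never rebuilt per environment)
    image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-api:abc1234
    restart: always
    deploy:
      resources:
        limits:
          cpus: "1.0"     # cap runaway containers
          memory: 512M
```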


Advanced (Q9--Q11)

Q9. Design a complete CI/CD pipeline for a microservices architecture with 5 services. How do you handle independent deployments, shared libraries, and database migrations?

Why interviewers ask: Tests system design and operational thinking at scale. Separates engineers who have managed multi-service deployments from those who have only worked with monoliths.

Model answer:

Repository structure: Monorepo with each service in its own directory, plus a shared libs/ directory. Each service has its own Dockerfile, its own GitHub Actions workflow, and its own ECS service.

Change detection: Each workflow uses paths filters so that pushing changes to services/user-service/ only triggers the user-service pipeline, not all 5 services. Changes to libs/ trigger all service pipelines that import from it.

on:
  push:
    paths:
      - 'services/user-service/**'
      - 'libs/shared/**'

Pipeline per service: Each service has: lint --> unit test --> integration test --> build Docker image --> scan --> push to ECR --> deploy to ECS. Services are independent -- the order-service can deploy without waiting for user-service.

Shared libraries: Published as internal npm packages or imported as workspace packages. When a shared library changes, all dependent service pipelines run and deploy.

Database migrations: Migrations run as a separate step before the service deployment. The CI/CD pipeline runs npm run db:migrate as a one-off ECS task (or a GitHub Actions step that connects to the database). Migrations must be backward-compatible -- the old version of the service must be able to run against the new database schema, because during a rolling deploy, both old and new versions run simultaneously.

Rollback complexity: With independent services, you can roll back one service without affecting others. But if a migration was involved, you need a forward-fix (deploy a new version that fixes the issue) rather than rolling back the migration, because other services may already depend on the new schema.


Q10. Your application serves 50,000 requests per second. Design the deployment architecture, including Docker, load balancing, CI/CD, and SSL. Justify every choice.

Why interviewers ask: Tests ability to design for scale with real production constraints. Expected for senior/staff-level positions.

Model answer:

Compute: ECS Fargate with auto-scaling. Each service task runs the multi-stage Docker image (Alpine-based, ~100 MB). Fargate eliminates EC2 management. Auto-scaling policy: target 60% CPU utilisation, minimum 10 tasks, maximum 50 tasks per service. At 50K RPS, assume each task handles ~2,000 RPS (Node.js with cluster mode in the container), so 25 tasks at steady state.

Load balancing: Application Load Balancer (ALB) with path-based routing: /api/users/* to user-service target group, /api/orders/* to order-service target group. Connection draining enabled (30 seconds). Health check: /health every 10 seconds, 2 consecutive failures to deregister.

SSL: ACM certificate on the ALB. SSL termination at the ALB -- at 50K RPS, terminating SSL on each container would waste significant CPU. Traffic between ALB and Fargate tasks is HTTP within the VPC (private subnets only).

DNS: Route 53 with Alias record pointing to the ALB. Low TTL (60 seconds) for fast failover. CloudFront in front of the ALB for global edge caching and DDoS protection.

CI/CD: GitHub Actions with parallel jobs per service (monorepo, path-filtered triggers). Build --> test --> scan --> push to ECR --> update ECS task definition --> rolling deploy with circuit breaker. Staging environment mirrors production at 1/10 scale. Canary deployment strategy: 5% of traffic to new version for 10 minutes, then automatic promotion to 100% if error rate stays below 0.1%.

Monitoring: CloudWatch Container Insights for ECS metrics. ALB access logs to S3. Application-level metrics (request latency, error rates) to Datadog. Alarms on: error rate > 1%, p99 latency > 500ms, CPU > 80% for 5 minutes.
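The capacity arithmetic behind those auto-scaling numbers can be checked in a few lines; the per-task throughput figure is the assumption stated above (Node.js cluster mode, ~2,000 RPS per task):

```python
import math

# Assumptions from the answer: 50K RPS total, ~2,000 RPS handled per task.
total_rps = 50_000
rps_per_task = 2_000
max_tasks = 50

steady_state = math.ceil(total_rps / rps_per_task)   # tasks needed at steady state
peak_capacity = max_tasks * rps_per_task             # throughput ceiling at max scale
print(steady_state, peak_capacity)                   # 25 tasks; 100,000 RPS ceiling
```

The 50-task maximum therefore gives roughly 2x headroom over steady-state load before the service saturates.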


Q11. A junior developer pushes a commit to main that passes CI but brings down production. Post-mortem: what went wrong, and how do you prevent it from happening again?

Why interviewers ask: Tests incident analysis, blameless culture, and systematic improvement. Evaluates maturity and leadership.

Model answer:

Immediate response:

  1. ECS circuit breaker should have caught this. If it did not, manually roll back: update ECS service to previous task definition.
  2. Verify health check passes on the rolled-back version.
  3. Communicate status to stakeholders: what happened, when it will be resolved, and what the customer impact is.

Root cause analysis (why CI passed but production broke):

Possible causes and their fixes:

(A) Missing integration tests. The unit tests mocked external dependencies. The actual database query, cache interaction, or third-party API call fails in production. Fix: Add integration tests that run against real service containers in CI (PostgreSQL, Redis as GitHub Actions service containers).

(B) Environment-specific configuration. The code works with staging environment variables but fails with production values (different database, different API keys, stricter rate limits). Fix: Test against production-like configuration in staging. Use the same Docker image in staging and production.

(C) Database migration issue. A migration altered a column that the currently-running (old) version depends on. During the rolling deploy, old tasks crash. Fix: Enforce backward-compatible migrations. Add a CI check that runs the old version's tests against the new schema.

(D) Load-dependent failure. The code works at low traffic (staging) but crashes under production load (memory leak, connection pool exhaustion). Fix: Add load testing to the CI/CD pipeline. Run k6 or Artillery against staging before promoting to production.
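Fix (A) in GitHub Actions terms: run integration tests in CI against real backing services rather than mocks, using service containers (image versions and connection values are illustrative):

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_PASSWORD: test
        ports: ["5432:5432"]
        options: >-
          --health-cmd "pg_isready" --health-interval 5s
          --health-timeout 5s --health-retries 10
      redis:
        image: redis:7-alpine
        ports: ["6379:6379"]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:integration   # hits the real Postgres and Redis above
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/postgres
          REDIS_URL: redis://localhost:6379
```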

Systemic improvements:

  1. Require staging soak time -- the new version must run in staging for at least 30 minutes before production promotion.
  2. Canary deploys -- route 5% of production traffic first, monitor for 10 minutes, then promote.
  3. Feature flags -- deploy code without activating it. Enable the feature gradually.
  4. Improved monitoring -- set alarms on the specific metric that would have caught this issue.
  5. Blameless post-mortem -- document what happened, why, and the action items. Never blame the developer -- blame the system that allowed the bug through.

Quick-fire

  1. Multi-stage build saves what? -- Image size: 80%+ reduction by excluding build tools and devDependencies.
  2. npm ci vs npm install? -- npm ci is deterministic (exact lock-file versions) and faster in CI.
  3. Why non-root in Docker? -- Limits blast radius if the container is compromised.
  4. ACM vs Let's Encrypt? -- ACM for ALB (free, auto-renew); Let's Encrypt for direct server SSL.
  5. PM2 reload vs restart? -- Reload = zero downtime (cluster mode); restart = brief downtime.
  6. :latest tag in production? -- Never; use a git SHA or semver tag for traceability.
  7. ECS circuit breaker does what? -- Auto-rolls back if new tasks keep failing health checks.
  8. Blue/green vs canary? -- Blue/green: instant switch (2x cost); canary: gradual shift (low risk).
  9. Where do CI secrets live? -- GitHub Settings > Secrets > Actions (encrypted, scoped per environment).
  10. Build once, deploy many means? -- Same Docker image in staging and production; only env vars differ.
