Episode 9 — System Design / 9.9 — Core Infrastructure
9.9.f Microservices Architecture
Microservices Recap for HLD
Microservices architecture decomposes a system into small, independently deployable services, each owning a specific business capability. In High-Level Design (HLD), you must decide service boundaries, communication patterns, and data ownership.
Monolith vs Microservices:
MONOLITH:                        MICROSERVICES:

+-------------------+            +------+  +------+  +------+
| User Module       |            | User |  |Order |  | Pay  |
| Order Module      |            | Svc  |  | Svc  |  | Svc  |
| Payment Module    |            +--+---+  +--+---+  +--+---+
| Notification Mod  |               |         |         |
| --- Shared DB --- |            +--+---+  +--+---+  +--+---+
+-------------------+            | User |  |Order |  | Pay  |
                                 | DB   |  | DB   |  | DB   |
Single deployment unit.          +------+  +------+  +------+
Single database.
Single codebase.                 Each service has its own DB.
                                 Independent deployment.
                                 Independent scaling.
Core Microservices Principles
| Principle | Description |
|---|---|
| Single Responsibility | Each service does one thing well |
| Loose Coupling | Services know minimal details about each other |
| High Cohesion | Related functionality is grouped together |
| Independent Deployment | Deploy one service without touching others |
| Decentralized Data | Each service owns its data (no shared database) |
| Design for Failure | Every service assumes other services can fail |
| Automate Everything | CI/CD, monitoring, scaling must be automated |
Service Discovery
In a dynamic environment (containers, auto-scaling), service locations change. Service discovery is how services find each other.
Client-Side Discovery
The client queries a service registry and picks an instance.
Client-Side Discovery:
Order Service                      Service Registry
      |                                   |
      |-- Where is Payment Svc? --------->|
      |<-- [10.0.0.5:8080, --------------|
      |     10.0.0.6:8080,               |
      |     10.0.0.7:8080]               |
      |
      | (client picks 10.0.0.6)
      |-- HTTP POST /charge ---------------------> Payment Service (10.0.0.6)
Client handles load balancing.
Examples: Netflix Eureka + Ribbon, Consul + custom client
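The client-side flow above can be sketched in a few lines of Python. This is a minimal illustration, not a real registry client: the in-memory `REGISTRY` dict stands in for a registry such as Eureka or Consul, and the addresses are the ones from the diagram.

```python
import random

# Illustrative in-memory stand-in for a service registry (Eureka, Consul).
# In practice the client queries the registry over HTTP and caches the result.
REGISTRY = {
    "payment-service": ["10.0.0.5:8080", "10.0.0.6:8080", "10.0.0.7:8080"],
}

def discover(service_name: str) -> list[str]:
    """Ask the registry for all healthy instances of a service."""
    instances = REGISTRY.get(service_name)
    if not instances:
        raise LookupError(f"no instances registered for {service_name}")
    return instances

def pick_instance(service_name: str) -> str:
    """Client-side load balancing: pick one instance (random here;
    round-robin or least-connections are common alternatives)."""
    return random.choice(discover(service_name))

# The client builds the request URL itself -- no load balancer in the path.
url = f"http://{pick_instance('payment-service')}/charge"
```

The key trade-off is visible here: the selection logic lives in every client, so each language in your stack needs its own discovery library.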
Server-Side Discovery
The client talks to a load balancer/router that handles discovery.
Server-Side Discovery:
Order Service          Load Balancer            Service Registry
      |                      |                         |
      |-- POST /charge ----->|                         |
      |                      |-- Where is Payment? --->|
      |                      |<-- [instances] ---------|
      |                      |-- route to instance ------> Payment Service
      |<-- response ---------|
Client does not know about instances.
Examples: AWS ALB + ECS, Kubernetes Services, Consul + Envoy
Service Discovery Comparison
| Approach | Pros | Cons | Examples |
|---|---|---|---|
| Client-side | No extra hop, flexible LB | Client complexity, language-specific | Eureka, Consul (direct) |
| Server-side | Simple client, language-agnostic | Extra network hop, LB is a dependency | K8s Services, AWS ALB |
| DNS-based | Universal, no special client | TTL caching delays, limited health checks | Consul DNS, Route 53 |
| Service mesh | Full-featured, transparent | Operational complexity | Istio, Linkerd |
Kubernetes Service Discovery
Kubernetes:
Pod: payment-service-abc12 IP: 10.0.1.5
Pod: payment-service-def34 IP: 10.0.1.6
Pod: payment-service-ghi56 IP: 10.0.1.7
Kubernetes Service: "payment-service"
ClusterIP: 10.96.0.100 (virtual IP)
Order Service calls: http://payment-service:8080/charge
kube-proxy routes to one of the pods.
If a pod dies, Kubernetes removes it automatically.
If a new pod starts, Kubernetes adds it automatically.
Inter-Service Communication
Synchronous Communication
Services call each other directly and wait for a response.
Synchronous Patterns:
1. REST (HTTP/JSON):
Order Svc --> GET http://user-svc/users/42 --> User Svc
Simple, universal, human-readable
2. gRPC (HTTP/2 + Protobuf):
Order Svc --> gRPC call --> User Svc
Fast, type-safe, bidirectional streaming
3. GraphQL (federated):
API Gateway --> GraphQL Federation --> User Svc, Order Svc
Clients query exactly what they need
Asynchronous Communication
Services communicate via messages without waiting for a response.
Asynchronous Patterns:
1. Event-Driven:
Order Svc: publish("order.placed", {order_id: 123})
Payment Svc: subscribe("order.placed") --> charge card
2. Command Queue:
Order Svc --> Queue: "process-payment" --> Payment Svc
3. Event Sourcing:
All state changes stored as events in an append-only log.
Services rebuild state by replaying events.
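The event-driven pattern above can be sketched with a tiny in-process event bus. The `EventBus` class and its synchronous delivery are illustrative stand-ins for a real broker such as Kafka or RabbitMQ, which would deliver messages asynchronously and durably.

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a message broker (Kafka, RabbitMQ, SNS)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # A real broker delivers asynchronously; synchronous here for clarity.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
charged = []

# Payment Svc reacts to order events; Order Svc never calls it directly.
bus.subscribe("order.placed", lambda e: charged.append(e["order_id"]))

# Order Svc fires and forgets -- it does not wait for payment to finish.
bus.publish("order.placed", {"order_id": 123})
```

Note the loose coupling: the publisher knows only the topic name, not who (if anyone) is listening.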
Communication Pattern Comparison
| Aspect | Synchronous (REST/gRPC) | Asynchronous (Events/Queues) |
|---|---|---|
| Coupling | Tighter (caller depends on callee) | Looser (fire-and-forget) |
| Latency | Real-time response | Eventual processing |
| Failure handling | Immediate error propagation | Retry via queue |
| Scalability | Limited by slowest service | Independent scaling |
| Debugging | Easier to trace | Requires distributed tracing |
| Use case | Query, validation, auth | Background tasks, notifications |
Service Communication Decision Matrix
Need immediate response?
|
+-- Yes --> Need streaming?
| |
| +-- Yes --> gRPC (bidirectional streaming)
| +-- No --> REST (simple) or gRPC (performance)
|
+-- No --> Need guaranteed delivery?
|
+-- Yes --> Message Queue (Kafka, RabbitMQ, SQS)
+-- No --> Fire-and-forget event (Redis Pub/Sub, SNS)
Data Management in Microservices
Database per Service
Each service owns its data: no service may read or write another service's database directly.
CORRECT:
Order Svc          User Svc          Payment Svc
    |                  |                  |
+---+---+          +---+---+          +---+---+
| Order |          | User  |          |Payment|
|  DB   |          |  DB   |          |  DB   |
+-------+          +-------+          +-------+
Order Svc needs user data? --> Call User Svc API (not User DB!)
WRONG (shared database):
Order Svc ----+
User Svc -----+---> Shared Database   <-- Creates tight coupling!
Payment Svc --+                           Schema changes break everyone.
Data Consistency Challenges
When each service has its own database, maintaining consistency across services is hard.
Problem: "Transfer $100 from Account A to Account B"
Monolith:
BEGIN TRANSACTION
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT; -- atomic!
Microservices:
Account Service A: debit $100 (succeeds)
Account Service B: credit $100 (fails -- network error!)
Now $100 has vanished! No single transaction spans services.
Solutions covered in the Distributed Transactions section below.
Data Replication Patterns
When service B frequently needs data from service A, calling the API every time is inefficient. Instead:
Pattern 1: API Call (simple but slow):
Order Svc --> GET /users/42 --> User Svc
Pattern 2: Data Replication via Events:
User Svc: publish("user.updated", {id: 42, name: "Alice"})
Order Svc: subscribe --> store local copy of user data
Pattern 3: Shared Cache:
User Svc --> write to Redis
Order Svc --> read from Redis
Pattern 4: CQRS (Command Query Responsibility Segregation):
Write Model (User Svc, Order Svc) --> Events --> Read Model (denormalized)
Query Service reads from denormalized read model.
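Pattern 2 can be sketched as follows: a hypothetical local read model inside the Order Service, kept fresh by `user.updated` events. The field names are illustrative, and the local copy is eventually consistent -- it is stale until the next event arrives.

```python
# Local read model inside the Order Service: a denormalized copy of the
# user fields it needs, maintained by "user.updated" events (Pattern 2).
local_users = {}

def on_user_updated(event):
    """Event handler: upsert the local copy. Reads may be stale until
    the event arrives (eventual consistency)."""
    local_users[event["id"]] = {"name": event["name"]}

def user_name(user_id):
    """Serve reads locally -- no synchronous call to the User Service."""
    user = local_users.get(user_id)
    return user["name"] if user else None

# The User Service published this event; the Order Service consumed it.
on_user_updated({"id": 42, "name": "Alice"})
```

The trade-off versus Pattern 1 is latency for consistency: reads are fast and survive User Service outages, but can briefly return outdated data.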
Distributed Transactions
Two-Phase Commit (2PC)
A coordination protocol that ensures all services commit or all rollback.
Two-Phase Commit:
Coordinator            Service A            Service B
     |                     |                    |
     |-- PREPARE --------->|                    |
     |-- PREPARE ------------------------------->|
     |                     |                    |
     |<-- VOTE YES --------|                    |
     |<-- VOTE YES ------------------------------|
     |                     |                    |
     |-- COMMIT ---------->|                    |
     |-- COMMIT -------------------------------->|
     |                     |                    |
     |<-- ACK -------------|                    |
     |<-- ACK -----------------------------------|

If ANY service votes NO:
     |-- ROLLBACK -------->|                    |
     |-- ROLLBACK ------------------------------>|
Problems with 2PC:
- Coordinator is a single point of failure
- Blocking: all participants hold locks during prepare phase
- Poor performance at scale
- Not suitable for microservices (too much coupling)
Saga Pattern (Preferred for Microservices)
A saga is a sequence of local transactions; when a step fails, the steps already completed are undone by compensating transactions.
Choreography (Event-Driven):
Order Saga - Choreography:
1. Order Svc: Create Order (PENDING)
--> publish "order.created"
2. Payment Svc: Charge Card
--> publish "payment.charged"
3. Inventory Svc: Reserve Items
--> publish "inventory.reserved"
4. Order Svc: Confirm Order (CONFIRMED)
FAILURE at step 3 (out of stock):
3. Inventory Svc: publish "inventory.failed"
--> Payment Svc: Refund Card (compensating transaction)
--> Order Svc: Cancel Order (compensating transaction)
Each service reacts to events. No central coordinator.
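A minimal sketch of the choreography above, using an in-memory publish/subscribe stand-in for the broker. Handler and topic names follow the example; a real system would additionally need durable delivery and idempotent handlers.

```python
from collections import defaultdict

subs = defaultdict(list)

def subscribe(topic, handler):
    subs[topic].append(handler)

def publish(topic, event):
    # Synchronous delivery for clarity; a broker delivers asynchronously.
    for handler in subs[topic]:
        handler(event)

log = []  # records each local transaction and compensation

def payment_on_order_created(event):
    log.append("payment: charge card")
    publish("payment.charged", event)

def inventory_on_payment_charged(event):
    if event.get("out_of_stock"):
        publish("inventory.failed", event)   # triggers compensations below
    else:
        log.append("inventory: reserve items")
        publish("inventory.reserved", event)

def order_on_inventory_reserved(event):
    log.append("order: CONFIRMED")

# Compensating transactions: each service undoes its own earlier step.
def payment_on_inventory_failed(event):
    log.append("payment: refund card")

def order_on_inventory_failed(event):
    log.append("order: CANCELLED")

subscribe("order.created", payment_on_order_created)
subscribe("payment.charged", inventory_on_payment_charged)
subscribe("inventory.reserved", order_on_inventory_reserved)
subscribe("inventory.failed", payment_on_inventory_failed)
subscribe("inventory.failed", order_on_inventory_failed)

# Out-of-stock order: payment is charged, then compensated.
publish("order.created", {"order_id": 1, "out_of_stock": True})
```

Notice that no component sees the whole flow -- each service only knows which events it consumes and emits, which is exactly why choreography is loosely coupled but hard to trace.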
Orchestration (Centralized):
Order Saga - Orchestration:
+-------------------+
| Saga Orchestrator |
+--------+----------+
|
|-- 1. Create Order ---------> Order Svc
|<-- Order Created ------------|
|
|-- 2. Charge Card ----------> Payment Svc
|<-- Payment Success ----------|
|
|-- 3. Reserve Inventory ----> Inventory Svc
|<-- Inventory Reserved -------|
|
|-- 4. Confirm Order --------> Order Svc
|<-- Order Confirmed ----------|
FAILURE at step 3:
|-- Refund Card -------------> Payment Svc
|-- Cancel Order ------------> Order Svc
The orchestrator controls the flow and handles compensation.
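The orchestration flow above can be sketched as a generic step/compensation loop. This is a simplified illustration: the service clients are assumed to expose `create`/`charge`/`reserve`-style methods that raise on failure, and production orchestrators (Temporal, AWS Step Functions) also persist saga state so a crashed orchestrator can resume.

```python
class OrderSagaOrchestrator:
    """Centralized saga runner: execute each step in order; if a step
    fails, run the compensations for all completed steps in reverse."""

    def __init__(self, order_svc, payment_svc, inventory_svc):
        self.order_svc = order_svc
        self.payment_svc = payment_svc
        self.inventory_svc = inventory_svc

    def run(self, order):
        # Each entry pairs a forward step with its compensating transaction.
        steps = [
            (lambda: self.order_svc.create(order),
             lambda: self.order_svc.cancel(order)),
            (lambda: self.payment_svc.charge(order),
             lambda: self.payment_svc.refund(order)),
            (lambda: self.inventory_svc.reserve(order),
             lambda: self.inventory_svc.release(order)),
            (lambda: self.order_svc.confirm(order),
             lambda: None),  # last step; nothing to undo after it
        ]
        completed = []
        for step, compensate in steps:
            try:
                step()
                completed.append(compensate)
            except Exception:
                for comp in reversed(completed):
                    comp()  # undo completed steps in reverse order
                return "ROLLED_BACK"
        return "CONFIRMED"
```

A failure at step 3 (reserve) therefore produces exactly the compensation sequence in the diagram: refund the card, then cancel the order.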
Saga Comparison
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Decentralized (events) | Centralized (orchestrator) |
| Coupling | Lower | Higher (orchestrator knows all services) |
| Complexity | Hard to follow flow | Easy to understand flow |
| Single point of failure | None | Orchestrator |
| Scalability | Better | Orchestrator can be bottleneck |
| Best for | Simple sagas (3-4 steps) | Complex sagas (5+ steps) |
Deployment Strategies
Blue-Green Deployment
Blue-Green:
Current (Blue): [Server 1] [Server 2] [Server 3] (v1.0)
New (Green): [Server 4] [Server 5] [Server 6] (v2.0)
Step 1: Deploy v2.0 to Green (Blue still serves traffic)
Step 2: Test Green environment
Step 3: Switch load balancer from Blue to Green
Step 4: Green serves all traffic (instant cutover)
Step 5: Keep Blue as rollback option
Rollback: Switch LB back to Blue (instant)
Canary Deployment
Canary:
Phase 1: 95% --> v1.0 servers 5% --> v2.0 server (canary)
Phase 2: Monitor metrics (errors, latency, CPU)
Phase 3: 80% --> v1.0 20% --> v2.0
Phase 4: Monitor
Phase 5: 50% --> v1.0 50% --> v2.0
Phase 6: Monitor
Phase 7: 0% --> v1.0 100% --> v2.0 (fully rolled out)
At any phase, if metrics degrade: rollback canary to 0%.
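One common way to implement the traffic split is deterministic bucketing, sketched below. The version labels and the SHA-256 bucketing scheme are illustrative choices, not a specific product's API.

```python
import hashlib

def canary_route(user_id: str, canary_percent: int) -> str:
    """Deterministic traffic split: hash the user id into a bucket 0-99.
    The same user always lands on the same version, so the experience
    stays stable while the canary percentage is ramped up."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2.0-canary" if bucket < canary_percent else "v1.0-stable"
```

Ramping from phase to phase is then just a config change to `canary_percent` (5, 20, 50, 100), and rollback is setting it back to 0.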
Rolling Deployment
Rolling:
Start: [v1][v1][v1][v1][v1] (5 instances)
Step 1: [v2][v1][v1][v1][v1] Update instance 1
Step 2: [v2][v2][v1][v1][v1] Update instance 2
Step 3: [v2][v2][v2][v1][v1] Update instance 3
Step 4: [v2][v2][v2][v2][v1] Update instance 4
Step 5: [v2][v2][v2][v2][v2] Update instance 5 (done)
Capacity is always >= 80%. No downtime.
Deployment Strategy Comparison
| Strategy | Zero Downtime | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Blue-Green | Yes | Instant | 2x (both envs live) | Low |
| Canary | Yes | Fast | 1x + canary | Very low |
| Rolling | Yes | Slow (roll back one by one) | 1x | Medium |
| Recreate | No (brief downtime) | Deploy old version | 1x | High |
Monitoring and Observability
Microservices are inherently harder to debug. You MUST invest in observability.
The Three Pillars
+------------------------------------------------------------------+
|                      OBSERVABILITY PILLARS                       |
|                                                                  |
|  1. METRICS           2. LOGS               3. TRACES            |
|  (What is happening)  (Why it happened)     (Where it happened)  |
|                                                                  |
|  - Request rate       - Structured logs     - Distributed traces |
|  - Error rate         - Error details       - Span context       |
|  - Latency (p50/p99)  - Stack traces        - Service dependency |
|  - CPU/Memory         - Business events     - Latency breakdown  |
|                                                                  |
|  Tools:               Tools:                Tools:               |
|  Prometheus           ELK Stack             Jaeger               |
|  Grafana              Datadog Logs          Zipkin               |
|  CloudWatch           Splunk                AWS X-Ray            |
+------------------------------------------------------------------+
Distributed Tracing
Trace: "GET /api/orders/42"
+-- [API Gateway: 120ms]
    |
    +-- [Order Service: 80ms]
        |
        +-- [User Service: 15ms]       GET /users/42
        |   +-- [User DB: 5ms]
        |
        +-- [Payment Service: 40ms]    GET /payments?order=42
        |   +-- [Payment DB: 20ms]
        |
        +-- [Redis Cache: 2ms]
Trace ID: abc-123 (propagated across all services)
Each span shows: service, operation, duration, status
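Propagation works by injecting the trace context into outgoing request headers, roughly as sketched below. The `X-Trace-Id`/`X-Span-Id` header names are illustrative; the W3C Trace Context standard uses a `traceparent` header, and libraries such as OpenTelemetry handle this automatically.

```python
import time
import uuid

def inject_trace_headers(headers: dict, trace_id=None, parent_span=None) -> dict:
    """Attach trace context to outgoing request headers so the next
    service can continue the same trace (header names illustrative)."""
    headers = dict(headers)  # do not mutate the caller's dict
    headers["X-Trace-Id"] = trace_id or str(uuid.uuid4())
    headers["X-Span-Id"] = str(uuid.uuid4())
    if parent_span:
        headers["X-Parent-Span-Id"] = parent_span
    return headers

class Span:
    """Minimal span record: service, operation, duration, status."""
    def __init__(self, trace_id, service, operation):
        self.trace_id, self.service, self.operation = trace_id, service, operation
        self.start = time.monotonic()

    def finish(self, status="OK"):
        return {"trace_id": self.trace_id, "service": self.service,
                "operation": self.operation, "status": status,
                "duration_ms": (time.monotonic() - self.start) * 1000}
```

Because every hop reuses the incoming trace id and records its parent span id, the tracing backend can reassemble the nested timeline shown above.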
Health Check Patterns
/health (shallow):
{"status": "UP"}
/health/detailed (deep):
{
"status": "UP",
"checks": {
"database": {"status": "UP", "latency_ms": 5},
"redis": {"status": "UP", "latency_ms": 1},
"payment-service": {"status": "UP", "latency_ms": 25},
"disk": {"status": "UP", "free_gb": 45}
}
}
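A deep health endpoint can be assembled from per-dependency checks, as in this sketch. The check callables and the response shape mirror the JSON above; treating "every dependency UP" as overall UP is an assumption here (some teams report DEGRADED instead).

```python
def deep_health(checks: dict) -> dict:
    """Aggregate dependency checks into the /health/detailed shape.
    Each check is a callable returning a per-dependency status dict;
    a check that raises is reported as DOWN rather than crashing
    the endpoint."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            results[name] = {"status": "DOWN", "error": str(exc)}
    overall = "UP" if all(r["status"] == "UP" for r in results.values()) else "DOWN"
    return {"status": overall, "checks": results}
```

Keep the shallow `/health` for load-balancer probes (cheap, frequent) and reserve the deep variant for dashboards and on-call diagnosis, since it fans out to every dependency.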
Key Metrics for Microservices (RED Method)
| Metric | What It Measures | Alert When |
|---|---|---|
| Rate | Requests per second | Sudden drop or spike |
| Errors | Error rate (% of requests failing) | > 1% errors |
| Duration | Latency (p50, p95, p99) | p99 > SLA threshold |
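A minimal sketch of collecting RED metrics in-process. In practice a library such as the Prometheus client does this; the nearest-rank percentile below is a simplification of histogram-based latency tracking.

```python
class RedMetrics:
    """Accumulate the RED metrics for one endpoint: request count,
    error count, and latency samples for percentile computation."""

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.latencies_ms = []

    def record(self, latency_ms: float, ok: bool):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile, e.g. percentile(99) for p99."""
        if not self.latencies_ms:
            return 0.0
        data = sorted(self.latencies_ms)
        idx = min(len(data) - 1, int(len(data) * p / 100))
        return data[idx]
```

An alerting rule then compares `error_rate()` against the 1% threshold and `percentile(99)` against the SLA latency from the table.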
When Monolith Is Better
Microservices are NOT always the answer. Start with a monolith unless you have specific reasons for microservices.
Monolith Advantages
| Advantage | Details |
|---|---|
| Simplicity | One codebase, one deployment, one database |
| Development speed | No inter-service communication overhead |
| Data consistency | Single database transactions |
| Debugging | Stack traces show the full picture |
| Testing | Integration tests are straightforward |
| Operational cost | No service mesh, no distributed tracing needed |
When to Use Microservices
| Signal | Reasoning |
|---|---|
| Team size > 20-30 engineers | Teams step on each other in a monolith |
| Different scaling needs | User service scales differently than video encoding |
| Different technology stacks | ML in Python, API in Go, frontend in Node.js |
| Independent deployment needed | Deploy payment fix without touching user service |
| Organizational boundaries | Separate teams own separate services |
| High availability requirements | Isolate blast radius of failures |
The Migration Path
Migration from Monolith to Microservices:
Phase 1: Modular Monolith
+-------------------+
| [User Module] |
| [Order Module] | Clear module boundaries
| [Payment Module] | But single deployment
+-------------------+
Phase 2: Strangler Fig Pattern
+-------------------+ +--------+
| [User Module] | |Payment | <-- Extracted first
| [Order Module] |---->|Service |
| [Payment Facade] | +--------+
+-------------------+
Phase 3: Full Microservices
+------+ +------+ +--------+
| User | |Order | |Payment |
| Svc | | Svc | | Svc |
+------+ +------+ +--------+
Anti-Patterns in Microservices
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Distributed monolith | Services are tightly coupled; must deploy together | Define clear boundaries; async communication |
| Shared database | Schema changes break multiple services | Database per service; API for data access |
| Too-fine granularity | 100 services for a 5-person team | Merge related services; right-size boundaries |
| Synchronous chains | A calls B calls C calls D (latency compounds) | Async events; reduce call depth |
| No API versioning | Breaking changes cascade | Version APIs; backward compatibility |
| Ignoring data ownership | Multiple services write to same entity | Single owner per entity; events for propagation |
| No circuit breakers | One slow service brings down everything | Circuit breaker pattern (Hystrix, Resilience4j) |
| Manual deployments | Error-prone, slow, inconsistent | CI/CD pipeline per service |
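The circuit breaker named in the table can be sketched as a small state machine. This is a simplified illustration of the pattern, not the Hystrix or Resilience4j API; the threshold and timeout values are arbitrary.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch.
    CLOSED:    calls pass through; failures are counted.
    OPEN:      calls fail fast without hitting the downstream service.
    HALF_OPEN: after a cooldown, one trial call decides the next state."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # allow one trial call through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "HALF_OPEN":
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.state = "CLOSED"
        return result
```

The point of failing fast is back-pressure: callers get an immediate error instead of queueing behind a slow dependency, which is what turns one slow service into a cascading outage.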
Microservices Architecture Checklist for System Design Interviews
When designing a microservices system, address:
[ ] Service boundaries (what does each service own?)
[ ] Communication (sync REST/gRPC vs async events?)
[ ] Data management (database per service, consistency strategy)
[ ] Service discovery (how do services find each other?)
[ ] API Gateway (single entry point for clients)
[ ] Load balancing (how is traffic distributed?)
[ ] Fault tolerance (circuit breakers, retries, timeouts)
[ ] Deployment (blue-green, canary, rolling)
[ ] Monitoring (metrics, logs, distributed tracing)
[ ] Security (mTLS, auth propagation, network policies)
Key Takeaways
- Microservices decompose by business capability -- not by technical layer
- Service discovery is essential in dynamic environments; prefer server-side or mesh
- Sync for queries, async for events -- match the communication pattern to the need
- Database per service is non-negotiable for true independence
- Saga pattern replaces distributed transactions; prefer choreography for simple flows
- Canary deployments minimize risk; blue-green enables instant rollback
- Observability is not optional -- invest in metrics, logs, and distributed tracing
- Start with a monolith unless your team and system clearly need microservices
- The biggest microservices mistake is premature decomposition
Explain-It Challenge
"Your company has a monolithic e-commerce application with 50 developers. Deployments take 4 hours, a bug in the recommendation engine caused a payment outage last week, and the ML team wants to use Python while the core is Java. The CTO asks you to plan the migration to microservices. Describe how you would identify service boundaries, what you would extract first and why, how you would handle data that is currently shared across modules, and what infrastructure you would put in place before the migration."