Episode 9 — System Design / 9.9 — Core Infrastructure

9.9.f Microservices Architecture

Microservices Recap for HLD

Microservices architecture decomposes a system into small, independently deployable services, each owning a specific business capability. In High-Level Design (HLD), you must decide service boundaries, communication patterns, and data ownership.

  Monolith vs Microservices:
  
  MONOLITH:                          MICROSERVICES:
  +-------------------+              +------+  +------+  +------+
  |  User Module      |              | User |  |Order |  | Pay  |
  |  Order Module     |              | Svc  |  | Svc  |  | Svc  |
  |  Payment Module   |              +--+---+  +--+---+  +--+---+
  |  Notification Mod |                 |         |         |
  |  --- Shared DB -- |              +--+---+  +--+---+  +--+---+
  +-------------------+              | User |  |Order |  | Pay  |
                                     |  DB  |  |  DB  |  |  DB  |
  Single deployment unit.            +------+  +------+  +------+
  Single database.
  Single codebase.                   Each service has its own DB.
                                     Independent deployment.
                                     Independent scaling.

Core Microservices Principles

  Principle              | Description
  -----------------------+-----------------------------------------------
  Single Responsibility  | Each service does one thing well
  Loose Coupling         | Services know minimal details about each other
  High Cohesion          | Related functionality is grouped together
  Independent Deployment | Deploy one service without touching others
  Decentralized Data     | Each service owns its data (no shared database)
  Design for Failure     | Every service assumes other services can fail
  Automate Everything    | CI/CD, monitoring, scaling must be automated

Service Discovery

In a dynamic environment (containers, auto-scaling), service locations change. Service discovery is how services find each other.

Client-Side Discovery

The client queries a service registry and picks an instance.

  Client-Side Discovery:
  
  Order Service                  Service Registry
       |                              |
       |-- Where is Payment Svc? ---->|
       |<-- [10.0.0.5:8080,       ----|
       |     10.0.0.6:8080,           |
       |     10.0.0.7:8080]           |
       |                              |
       | (client picks 10.0.0.6)      |
       |
       |-- HTTP POST /charge --------> Payment Service (10.0.0.6)
       
  Client handles load balancing.
  Examples: Netflix Eureka + Ribbon, Consul + custom client
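The registry lookup plus client-side load balancing above can be sketched in a few lines of Python. The in-memory REGISTRY dict, the ClientSideDiscovery class, and the round-robin policy are illustrative stand-ins for a real registry client such as Eureka or Consul:

```python
import itertools

# Hypothetical in-memory stand-in for a service registry (e.g. Eureka/Consul).
REGISTRY = {
    "payment-service": ["10.0.0.5:8080", "10.0.0.6:8080", "10.0.0.7:8080"],
}

class ClientSideDiscovery:
    """The client queries the registry and load-balances across instances itself."""

    def __init__(self, registry):
        self.registry = registry
        self._cursors = {}  # per-service round-robin cursor

    def resolve(self, service_name):
        instances = self.registry.get(service_name)
        if not instances:
            raise LookupError(f"no instances registered for {service_name}")
        # Round-robin: rotate through instances on successive calls.
        cursor = self._cursors.setdefault(service_name, itertools.cycle(instances))
        return next(cursor)

discovery = ClientSideDiscovery(REGISTRY)
print(discovery.resolve("payment-service"))  # 10.0.0.5:8080
print(discovery.resolve("payment-service"))  # 10.0.0.6:8080
```

Note that a real client also has to re-fetch the instance list periodically and skip unhealthy instances; this sketch deliberately omits both.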

Server-Side Discovery

The client talks to a load balancer/router that handles discovery.

  Server-Side Discovery:
  
  Order Service       Load Balancer       Service Registry     Payment Service
       |                    |                     |                    |
       |-- POST /charge --->|                     |                    |
       |                    |-- Where is Pay? --->|                    |
       |                    |<-- [instances] -----|                    |
       |                    |-- POST /charge ------------------------->|
       |                    |<-- response -----------------------------|
       |<-- response -------|                     |                    |
       
  Client does not know about instances.
  Examples: AWS ALB + ECS, Kubernetes Services, Consul + Envoy

Service Discovery Comparison

  Approach     | Pros                             | Cons                                      | Examples
  -------------+----------------------------------+-------------------------------------------+------------------------
  Client-side  | No extra hop, flexible LB        | Client complexity, language-specific      | Eureka, Consul (direct)
  Server-side  | Simple client, language-agnostic | Extra network hop, LB is a dependency     | K8s Services, AWS ALB
  DNS-based    | Universal, no special client     | TTL caching delays, limited health checks | Consul DNS, Route 53
  Service mesh | Full-featured, transparent       | Operational complexity                    | Istio, Linkerd

Kubernetes Service Discovery

  Kubernetes:
  
  Pod: payment-service-abc12    IP: 10.0.1.5
  Pod: payment-service-def34    IP: 10.0.1.6
  Pod: payment-service-ghi56    IP: 10.0.1.7
  
  Kubernetes Service: "payment-service"
  ClusterIP: 10.96.0.100 (virtual IP)
  
  Order Service calls: http://payment-service:8080/charge
  
  kube-proxy routes to one of the pods.
  If a pod dies, Kubernetes removes it automatically.
  If a new pod starts, Kubernetes adds it automatically.

Inter-Service Communication

Synchronous Communication

Services call each other directly and wait for a response.

  Synchronous Patterns:
  
  1. REST (HTTP/JSON):
     Order Svc --> GET http://user-svc/users/42 --> User Svc
     Simple, universal, human-readable
     
  2. gRPC (HTTP/2 + Protobuf):
     Order Svc --> gRPC call --> User Svc
     Fast, type-safe, bidirectional streaming
     
  3. GraphQL (federated):
     API Gateway --> GraphQL Federation --> User Svc, Order Svc
     Clients query exactly what they need

Asynchronous Communication

Services communicate via messages without waiting for a response.

  Asynchronous Patterns:
  
  1. Event-Driven:
     Order Svc: publish("order.placed", {order_id: 123})
     Payment Svc: subscribe("order.placed") --> charge card
     
  2. Command Queue:
     Order Svc --> Queue: "process-payment" --> Payment Svc
     
  3. Event Sourcing:
     All state changes stored as events in an append-only log.
     Services rebuild state by replaying events.
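The event-driven pattern above can be sketched with a tiny in-memory bus. The EventBus class is a hypothetical stand-in for a real broker such as Kafka or RabbitMQ; the point is that the publisher never learns who consumes the event:

```python
from collections import defaultdict

class EventBus:
    """Tiny in-memory stand-in for a message broker (Kafka, RabbitMQ, ...)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Fire-and-forget: the publisher does not wait on handler results.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
charged = []

# Payment Service reacts to order events; Order Service never calls it directly.
bus.subscribe("order.placed", lambda event: charged.append(event["order_id"]))

# Order Service publishes and moves on.
bus.publish("order.placed", {"order_id": 123})
print(charged)  # [123]
```

In a real deployment the broker also buffers events while a consumer is down, which is what gives asynchronous communication its failure-isolation properties.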

Communication Pattern Comparison

  Aspect           | Synchronous (REST/gRPC)            | Asynchronous (Events/Queues)
  -----------------+------------------------------------+---------------------------------
  Coupling         | Tighter (caller depends on callee) | Looser (fire-and-forget)
  Latency          | Real-time response                 | Eventual processing
  Failure handling | Immediate error propagation        | Retry via queue
  Scalability      | Limited by slowest service         | Independent scaling
  Debugging        | Easier to trace                    | Requires distributed tracing
  Use case         | Query, validation, auth            | Background tasks, notifications

Service Communication Decision Matrix

  Need immediate response?
  |
  +-- Yes --> Need streaming? 
  |           |
  |           +-- Yes --> gRPC (bidirectional streaming)
  |           +-- No  --> REST (simple) or gRPC (performance)
  |
  +-- No --> Need guaranteed delivery?
             |
             +-- Yes --> Message Queue (Kafka, RabbitMQ, SQS)
             +-- No  --> Fire-and-forget event (Redis Pub/Sub, SNS)

Data Management in Microservices

Database per Service

Each service owns its data. No other service can directly access another service's database.

  CORRECT:
  
  Order Svc        User Svc        Payment Svc
      |                |                |
  +---+---+        +---+---+        +---+---+
  |Order  |        |User   |        |Payment|
  |  DB   |        |  DB   |        |  DB   |
  +-------+        +-------+        +-------+
  
  Order Svc needs user data? --> Call User Svc API (not User DB!)
  
  
  WRONG (shared database):
  
  Order Svc ----+
                |---> Shared Database  <-- Creates tight coupling!
  User Svc  ----+                          Schema changes break everyone.
  Payment Svc --+

Data Consistency Challenges

When each service has its own database, maintaining consistency across services is hard.

  Problem: "Transfer $100 from Account A to Account B"
  
  Monolith:
  BEGIN TRANSACTION
    UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
    UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
  COMMIT;  -- atomic!
  
  Microservices:
  Account Service A: debit $100   (succeeds)
  Account Service B: credit $100  (fails -- network error!)
  
  Now $100 has vanished! No single transaction spans services.

Solutions to this problem are covered in the Distributed Transactions section below.

Data Replication Patterns

When service B frequently needs data from service A, calling the API every time is inefficient. Instead:

  Pattern 1: API Call (simple but slow):
  Order Svc --> GET /users/42 --> User Svc
  
  Pattern 2: Data Replication via Events:
  User Svc: publish("user.updated", {id: 42, name: "Alice"})
  Order Svc: subscribe --> store local copy of user data
  
  Pattern 3: Shared Cache:
  User Svc --> write to Redis
  Order Svc --> read from Redis
  
  Pattern 4: CQRS (Command Query Responsibility Segregation):
  Write Model (User Svc, Order Svc) --> Events --> Read Model (denormalized)
  Query Service reads from denormalized read model.
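Pattern 2 can be sketched as follows. The OrderService class and the event shape are hypothetical; they stand in for a consumer that applies "user.updated" events to a local, eventually consistent replica instead of calling User Service on every request:

```python
class OrderService:
    """Keeps a local, eventually consistent replica of user data fed by events."""

    def __init__(self):
        self.user_replica = {}  # local copy, owned and stored by Order Service

    def on_user_updated(self, event):
        # Apply the event locally; no synchronous call to User Service needed.
        self.user_replica[event["id"]] = {"name": event["name"]}

    def display_name(self, user_id):
        user = self.user_replica.get(user_id)
        return user["name"] if user else "unknown"

order_svc = OrderService()
# In production this event would arrive via a broker subscription.
order_svc.on_user_updated({"id": 42, "name": "Alice"})
print(order_svc.display_name(42))  # Alice
```

The trade-off: reads are fast and survive User Service outages, but the replica lags behind the source of truth until the next event arrives.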

Distributed Transactions

Two-Phase Commit (2PC)

A coordination protocol that ensures either all participants commit or all roll back.

  Two-Phase Commit:
  
  Coordinator          Service A         Service B
       |                   |                 |
       |-- PREPARE ------->|                 |
       |-- PREPARE ------------------------->|
       |                   |                 |
       |<-- VOTE YES ------|                 |
       |<-- VOTE YES ------------------------|
       |                   |                 |
       |-- COMMIT -------->|                 |
       |-- COMMIT -------------------------->|
       |                   |                 |
       |<-- ACK -----------|                 |
       |<-- ACK -----------------------------|

  If ANY service votes NO:
       |-- ROLLBACK ------>|                 |
       |-- ROLLBACK ------------------------>|

Problems with 2PC:

  • Coordinator is a single point of failure
  • Blocking: all participants hold locks during prepare phase
  • Poor performance at scale
  • Not suitable for microservices (too much coupling)

Saga Pattern (Preferred for Microservices)

A sequence of local transactions with compensating transactions for rollback.

Choreography (Event-Driven):

  Order Saga - Choreography:
  
  1. Order Svc: Create Order (PENDING)
     --> publish "order.created"
     
  2. Payment Svc: Charge Card
     --> publish "payment.charged"
     
  3. Inventory Svc: Reserve Items
     --> publish "inventory.reserved"
     
  4. Order Svc: Confirm Order (CONFIRMED)
  
  
  FAILURE at step 3 (out of stock):
  
  3. Inventory Svc: publish "inventory.failed"
  --> Payment Svc: Refund Card (compensating transaction)
  --> Order Svc: Cancel Order (compensating transaction)
  
  Each service reacts to events. No central coordinator.

Orchestration (Centralized):

  Order Saga - Orchestration:
  
  +-------------------+
  | Saga Orchestrator |
  +--------+----------+
           |
           |-- 1. Create Order ---------> Order Svc
           |<-- Order Created ------------|
           |
           |-- 2. Charge Card ----------> Payment Svc
           |<-- Payment Success ----------|
           |
           |-- 3. Reserve Inventory ----> Inventory Svc
           |<-- Inventory Reserved -------|
           |
           |-- 4. Confirm Order --------> Order Svc
           |<-- Order Confirmed ----------|
  
  FAILURE at step 3:
           |-- Refund Card -------------> Payment Svc
           |-- Cancel Order ------------> Order Svc
  
  The orchestrator controls the flow and handles compensation.
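The orchestration flow above can be sketched as a generic saga runner. The SagaOrchestrator class, the step names, and the fail helper are illustrative, not the API of any specific saga framework:

```python
class SagaOrchestrator:
    """Runs steps in order; on failure, runs compensations in reverse order."""

    def __init__(self, steps):
        # Each step is (name, action, compensation); compensation may be None.
        self.steps = steps

    def run(self):
        completed = []
        for name, action, compensate in self.steps:
            try:
                action()
                completed.append((name, compensate))
            except Exception:
                # Undo already-completed steps in reverse (LIFO) order.
                for _done, comp in reversed(completed):
                    if comp:
                        comp()
                return f"saga failed at {name}, compensated"
        return "saga completed"

def fail(msg):
    raise RuntimeError(msg)

log = []
saga = SagaOrchestrator([
    ("create_order",  lambda: log.append("order PENDING"),
                      lambda: log.append("order CANCELLED")),
    ("charge_card",   lambda: log.append("card charged"),
                      lambda: log.append("card refunded")),
    ("reserve_items", lambda: fail("out of stock"), None),  # step 3 fails
])
print(saga.run())  # saga failed at reserve_items, compensated
print(log)         # ['order PENDING', 'card charged', 'card refunded', 'order CANCELLED']
```

Running compensations in reverse order mirrors the failure flow in the diagram: the card is refunded before the order is cancelled. A production orchestrator must also persist saga state so compensation survives an orchestrator crash.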

Saga Comparison

  Aspect                  | Choreography             | Orchestration
  ------------------------+--------------------------+------------------------------------------
  Coordination            | Decentralized (events)   | Centralized (orchestrator)
  Coupling                | Lower                    | Higher (orchestrator knows all services)
  Complexity              | Hard to follow flow      | Easy to understand flow
  Single point of failure | None                     | Orchestrator
  Scalability             | Better                   | Orchestrator can be bottleneck
  Best for                | Simple sagas (3-4 steps) | Complex sagas (5+ steps)

Deployment Strategies

Blue-Green Deployment

  Blue-Green:
  
  Current (Blue):  [Server 1] [Server 2] [Server 3]  (v1.0)
  New (Green):     [Server 4] [Server 5] [Server 6]  (v2.0)
  
  Step 1: Deploy v2.0 to Green (Blue still serves traffic)
  Step 2: Test Green environment
  Step 3: Switch load balancer from Blue to Green
  Step 4: Green serves all traffic (instant cutover)
  Step 5: Keep Blue as rollback option
  
  Rollback: Switch LB back to Blue (instant)

Canary Deployment

  Canary:
  
  Phase 1: 95% --> v1.0 servers    5% --> v2.0 server (canary)
  Phase 2: Monitor metrics (errors, latency, CPU)
  Phase 3: 80% --> v1.0            20% --> v2.0
  Phase 4: Monitor
  Phase 5: 50% --> v1.0            50% --> v2.0
  Phase 6: Monitor
  Phase 7: 0%  --> v1.0            100% --> v2.0  (fully rolled out)
  
  At any phase, if metrics degrade: rollback canary to 0%.

Rolling Deployment

  Rolling:
  
  Start: [v1][v1][v1][v1][v1]   (5 instances)
  
  Step 1: [v2][v1][v1][v1][v1]   Update instance 1
  Step 2: [v2][v2][v1][v1][v1]   Update instance 2
  Step 3: [v2][v2][v2][v1][v1]   Update instance 3
  Step 4: [v2][v2][v2][v2][v1]   Update instance 4
  Step 5: [v2][v2][v2][v2][v2]   Update instance 5 (done)
  
  Capacity is always >= 80%. No downtime.

Deployment Strategy Comparison

  Strategy   | Zero Downtime       | Rollback Speed              | Resource Cost       | Risk
  -----------+---------------------+-----------------------------+---------------------+---------
  Blue-Green | Yes                 | Instant                     | 2x (both envs live) | Low
  Canary     | Yes                 | Fast                        | 1x + canary         | Very low
  Rolling    | Yes                 | Slow (roll back one by one) | 1x                  | Medium
  Recreate   | No (brief downtime) | Deploy old version          | 1x                  | High

Monitoring and Observability

Microservices are inherently harder to debug. You MUST invest in observability.

The Three Pillars

  +------------------------------------------------------------------+
  |                  OBSERVABILITY PILLARS                            |
  |                                                                   |
  |  1. METRICS           2. LOGS              3. TRACES             |
  |  (What is happening)  (Why it happened)    (Where it happened)   |
  |                                                                   |
  |  - Request rate       - Structured logs    - Distributed traces  |
  |  - Error rate         - Error details      - Span context        |
  |  - Latency (p50/p99) - Stack traces       - Service dependency  |
  |  - CPU/Memory        - Business events     - Latency breakdown   |
  |                                                                   |
  |  Tools:              Tools:                Tools:                 |
  |  Prometheus          ELK Stack             Jaeger                |
  |  Grafana             Datadog Logs          Zipkin                |
  |  CloudWatch          Splunk                AWS X-Ray             |
  +------------------------------------------------------------------+

Distributed Tracing

  Trace: "GET /api/orders/42"    (Trace ID: abc-123, propagated across all services)

  API Gateway ....................... 120ms
  +-- Order Service ................. 80ms
      +-- User Service .............. 15ms   GET /users/42
      |   +-- User DB ............... 5ms
      +-- Payment Service ........... 40ms   GET /payments?order=42
      |   +-- Payment DB ............ 20ms
      +-- Redis Cache ............... 2ms

  Each span shows: service, operation, duration, status
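The mechanics of trace propagation boil down to reusing one trace ID across every hop. A toy sketch follows; the X-Trace-Id header name and the call tree are assumptions for illustration (real systems use the W3C traceparent header or vendor-specific ones):

```python
import uuid

def handle_request(headers, service, downstream=()):
    """Reuse the incoming trace ID or start a new trace, then propagate it."""
    trace_id = headers.get("X-Trace-Id") or str(uuid.uuid4())
    spans = [f"{service} trace={trace_id}"]
    for child_service, child_calls in downstream:
        # Outbound calls carry the same trace ID in their headers.
        spans += handle_request({"X-Trace-Id": trace_id}, child_service, child_calls)
    return spans

spans = handle_request({}, "api-gateway",
                       [("order-service", [("user-service", []),
                                           ("payment-service", [])])])
for span in spans:
    print(span)  # every span carries the same trace ID
```

This is why tracing must be wired into every service up front: one hop that drops the header breaks the trace for everything downstream.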

Health Check Patterns

  /health (shallow):
  {"status": "UP"}
  
  /health/detailed (deep):
  {
    "status": "UP",
    "checks": {
      "database": {"status": "UP", "latency_ms": 5},
      "redis": {"status": "UP", "latency_ms": 1},
      "payment-service": {"status": "UP", "latency_ms": 25},
      "disk": {"status": "UP", "free_gb": 45}
    }
  }
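The deep health check above can be sketched as a function that probes each dependency and aggregates an overall status. The probe callables here are hypothetical placeholders for real database, Redis, and downstream-service pings:

```python
import json
import time

def check_dependency(probe):
    """Run one probe and report its status plus observed latency."""
    start = time.monotonic()
    try:
        probe()
        status = "UP"
    except Exception:
        status = "DOWN"
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {"status": status, "latency_ms": latency_ms}

def detailed_health(probes):
    checks = {name: check_dependency(probe) for name, probe in probes.items()}
    # Overall status is UP only if every dependency check passed.
    overall = "UP" if all(c["status"] == "UP" for c in checks.values()) else "DOWN"
    return {"status": overall, "checks": checks}

# Hypothetical probes; real ones would ping the DB, Redis, downstream services.
probes = {
    "database": lambda: None,          # pretend the ping succeeded
    "redis": lambda: None,
    "payment-service": lambda: 1 / 0,  # simulate a failing dependency
}
print(json.dumps(detailed_health(probes), indent=2))
```

One caution: load balancers should usually poll the shallow /health endpoint, because a deep check that fails on a downstream dependency can take healthy instances out of rotation and cascade the outage.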

Key Metrics for Microservices (RED Method)

  Metric   | What It Measures                   | Alert When
  ---------+------------------------------------+----------------------
  Rate     | Requests per second                | Sudden drop or spike
  Errors   | Error rate (% of requests failing) | > 1% errors
  Duration | Latency (p50, p95, p99)            | p99 > SLA threshold

When Monolith Is Better

Microservices are NOT always the answer. Start with a monolith unless you have specific reasons for microservices.

Monolith Advantages

  Advantage         | Details
  ------------------+------------------------------------------------
  Simplicity        | One codebase, one deployment, one database
  Development speed | No inter-service communication overhead
  Data consistency  | Single database transactions
  Debugging         | Stack traces show the full picture
  Testing           | Integration tests are straightforward
  Operational cost  | No service mesh, no distributed tracing needed

When to Use Microservices

  Signal                         | Reasoning
  -------------------------------+----------------------------------------------------
  Team size > 20-30 engineers    | Teams step on each other in a monolith
  Different scaling needs        | User service scales differently than video encoding
  Different technology stacks    | ML in Python, API in Go, frontend in Node.js
  Independent deployment needed  | Deploy payment fix without touching user service
  Organizational boundaries      | Separate teams own separate services
  High availability requirements | Isolate blast radius of failures

The Migration Path

  Migration from Monolith to Microservices:
  
  Phase 1: Modular Monolith
  +-------------------+
  | [User Module]     |
  | [Order Module]    |  Clear module boundaries
  | [Payment Module]  |  But single deployment
  +-------------------+
  
  Phase 2: Strangler Fig Pattern
  +-------------------+     +--------+
  | [User Module]     |     |Payment |  <-- Extracted first
  | [Order Module]    |---->|Service |
  | [Payment Facade]  |     +--------+
  +-------------------+
  
  Phase 3: Full Microservices
  +------+  +------+  +--------+
  | User |  |Order |  |Payment |
  | Svc  |  | Svc  |  | Svc    |
  +------+  +------+  +--------+

Anti-Patterns in Microservices

  Anti-Pattern            | Problem                                            | Solution
  ------------------------+----------------------------------------------------+-------------------------------------------------
  Distributed monolith    | Services are tightly coupled; must deploy together | Define clear boundaries; async communication
  Shared database         | Schema changes break multiple services             | Database per service; API for data access
  Too-fine granularity    | 100 services for a 5-person team                   | Merge related services; right-size boundaries
  Synchronous chains      | A calls B calls C calls D (latency compounds)      | Async events; reduce call depth
  No API versioning       | Breaking changes cascade                           | Version APIs; backward compatibility
  Ignoring data ownership | Multiple services write to same entity             | Single owner per entity; events for propagation
  No circuit breakers     | One slow service brings down everything            | Circuit breaker pattern (Hystrix, Resilience4j)
  Manual deployments      | Error-prone, slow, inconsistent                    | CI/CD pipeline per service
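A circuit breaker like those named above (Hystrix, Resilience4j) can be approximated in a few lines. This is a minimal single-threaded sketch with assumed defaults, not production-ready (no thread safety, no half-open call budget, no metrics):

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before a half-open trial
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling requests onto a sick service.
                raise RuntimeError("circuit OPEN: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Usage would look like `breaker.call(payment_client.charge, order_id)`: once the payment service fails repeatedly, callers get an immediate error instead of hanging on timeouts, which contains the blast radius.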

Microservices Architecture Checklist for System Design Interviews

  When designing a microservices system, address:
  
  [ ] Service boundaries (what does each service own?)
  [ ] Communication (sync REST/gRPC vs async events?)
  [ ] Data management (database per service, consistency strategy)
  [ ] Service discovery (how do services find each other?)
  [ ] API Gateway (single entry point for clients)
  [ ] Load balancing (how is traffic distributed?)
  [ ] Fault tolerance (circuit breakers, retries, timeouts)
  [ ] Deployment (blue-green, canary, rolling)
  [ ] Monitoring (metrics, logs, distributed tracing)
  [ ] Security (mTLS, auth propagation, network policies)

Key Takeaways

  1. Microservices decompose by business capability -- not by technical layer
  2. Service discovery is essential in dynamic environments; prefer server-side or mesh
  3. Sync for queries, async for events -- match the communication pattern to the need
  4. Database per service is non-negotiable for true independence
  5. Saga pattern replaces distributed transactions; prefer choreography for simple flows
  6. Canary deployments minimize risk; blue-green enables instant rollback
  7. Observability is not optional -- invest in metrics, logs, and distributed tracing
  8. Start with a monolith unless your team and system clearly need microservices
  9. The biggest microservices mistake is premature decomposition

Explain-It Challenge

"Your company has a monolithic e-commerce application with 50 developers. Deployments take 4 hours, a bug in the recommendation engine caused a payment outage last week, and the ML team wants to use Python while the core is Java. The CTO asks you to plan the migration to microservices. Describe how you would identify service boundaries, what you would extract first and why, how you would handle data that is currently shared across modules, and what infrastructure you would put in place before the migration."