Episode 9 — System Design / 9.9 — Core Infrastructure
9.9.f Microservices Architecture
Microservices Recap for HLD
Microservices architecture decomposes a system into small, independently deployable services, each owning a specific business capability. In High-Level Design (HLD), you must decide service boundaries, communication patterns, and data ownership.
Monolith vs Microservices:
MONOLITH:                        MICROSERVICES:

+-------------------+            +------+  +------+  +------+
| User Module       |            | User |  |Order |  | Pay  |
| Order Module      |            | Svc  |  | Svc  |  | Svc  |
| Payment Module    |            +--+---+  +--+---+  +--+---+
| Notification Mod  |               |         |         |
| --- Shared DB --- |            +--+---+  +--+---+  +--+---+
+-------------------+            | User |  |Order |  | Pay  |
                                 | DB   |  | DB   |  | DB   |
Single deployment unit.          +------+  +------+  +------+
Single database.
Single codebase.                 Each service has its own DB.
                                 Independent deployment.
                                 Independent scaling.
Core Microservices Principles
| Principle | Description |
|---|---|
| Single Responsibility | Each service does one thing well |
| Loose Coupling | Services know minimal details about each other |
| High Cohesion | Related functionality is grouped together |
| Independent Deployment | Deploy one service without touching others |
| Decentralized Data | Each service owns its data (no shared database) |
| Design for Failure | Every service assumes other services can fail |
| Automate Everything | CI/CD, monitoring, scaling must be automated |
Service Discovery
In a dynamic environment (containers, auto-scaling), service locations change. Service discovery is how services find each other.
Client-Side Discovery
The client queries a service registry and picks an instance.
Client-Side Discovery:
Order Service                      Service Registry
      |                                   |
      |-- Where is Payment Svc? --------->|
      |<-- [10.0.0.5:8080, --------------|
      |     10.0.0.6:8080,               |
      |     10.0.0.7:8080]               |
      |
      | (client picks 10.0.0.6)
      |-- HTTP POST /charge ---------------------> Payment Service (10.0.0.6)
Client handles load balancing.
Examples: Netflix Eureka + Ribbon, Consul + custom client
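The client-side flow above can be sketched in a few lines of Python. This is a minimal illustration, not a real registry client: the in-memory `REGISTRY` dict stands in for a registry such as Eureka or Consul, and the addresses are the ones from the diagram.

```python
import random

# Illustrative in-memory stand-in for a service registry (Eureka, Consul).
# In practice the client queries the registry over HTTP and caches the result.
REGISTRY = {
    "payment-service": ["10.0.0.5:8080", "10.0.0.6:8080", "10.0.0.7:8080"],
}

def discover(service_name: str) -> list[str]:
    """Ask the registry for all healthy instances of a service."""
    instances = REGISTRY.get(service_name)
    if not instances:
        raise LookupError(f"no instances registered for {service_name}")
    return instances

def pick_instance(service_name: str) -> str:
    """Client-side load balancing: pick one instance (random here;
    round-robin or least-connections are common alternatives)."""
    return random.choice(discover(service_name))

# The client builds the request URL itself -- no load balancer in the path.
url = f"http://{pick_instance('payment-service')}/charge"
```

The key trade-off is visible here: the selection logic lives in every client, so each language in your stack needs its own discovery library.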
Server-Side Discovery
The client talks to a load balancer/router that handles discovery.
Server-Side Discovery:
Order Service          Load Balancer            Service Registry
      |                      |                         |
      |-- POST /charge ----->|                         |
      |                      |-- Where is Payment? --->|
      |                      |<-- [instances] ---------|
      |                      |-- route to instance ------> Payment Service
      |<-- response ---------|
Client does not know about instances.
Examples: AWS ALB + ECS, Kubernetes Services, Consul + Envoy
Service Discovery Comparison
| Approach | Pros | Cons | Examples |
|---|---|---|---|
| Client-side | No extra hop, flexible LB | Client complexity, language-specific | Eureka, Consul (direct) |
| Server-side | Simple client, language-agnostic | Extra network hop, LB is a dependency | K8s Services, AWS ALB |
| DNS-based | Universal, no special client | TTL caching delays, limited health checks | Consul DNS, Route 53 |
| Service mesh | Full-featured, transparent | Operational complexity | Istio, Linkerd |
Kubernetes Service Discovery
Kubernetes:
Pod: payment-service-abc12 IP: 10.0.1.5
Pod: payment-service-def34 IP: 10.0.1.6
Pod: payment-service-ghi56 IP: 10.0.1.7
Kubernetes Service: "payment-service"
ClusterIP: 10.96.0.100 (virtual IP)
Order Service calls: http://payment-service:8080/charge
kube-proxy routes to one of the pods.
If a pod dies, Kubernetes removes it automatically.
If a new pod starts, Kubernetes adds it automatically.
Inter-Service Communication
Synchronous Communication
Services call each other directly and wait for a response.
Synchronous Patterns:
1. REST (HTTP/JSON):
Order Svc --> GET http://user-svc/users/42 --> User Svc
Simple, universal, human-readable
2. gRPC (HTTP/2 + Protobuf):
Order Svc --> gRPC call --> User Svc
Fast, type-safe, bidirectional streaming
3. GraphQL (federated):
API Gateway --> GraphQL Federation --> User Svc, Order Svc
Clients query exactly what they need
Asynchronous Communication
Services communicate via messages without waiting for a response.
Asynchronous Patterns:
1. Event-Driven:
Order Svc: publish("order.placed", {order_id: 123})
Payment Svc: subscribe("order.placed") --> charge card
2. Command Queue:
Order Svc --> Queue: "process-payment" --> Payment Svc
3. Event Sourcing:
All state changes stored as events in an append-only log.
Services rebuild state by replaying events.
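The event-driven pattern above can be sketched with a tiny in-process event bus. The `EventBus` class and its synchronous delivery are illustrative stand-ins for a real broker such as Kafka or RabbitMQ, which would deliver messages asynchronously and durably.

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a message broker (Kafka, RabbitMQ, SNS)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # A real broker delivers asynchronously; synchronous here for clarity.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
charged = []

# Payment Svc reacts to order events; Order Svc never calls it directly.
bus.subscribe("order.placed", lambda e: charged.append(e["order_id"]))

# Order Svc fires and forgets -- it does not wait for payment to finish.
bus.publish("order.placed", {"order_id": 123})
```

Note the loose coupling: the publisher knows only the topic name, not who (if anyone) is listening.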
Communication Pattern Comparison
| Aspect | Synchronous (REST/gRPC) | Asynchronous (Events/Queues) |
|---|---|---|
| Coupling | Tighter (caller depends on callee) | Looser (fire-and-forget) |
| Latency | Real-time response | Eventual processing |
| Failure handling | Immediate error propagation | Retry via queue |
| Scalability | Limited by slowest service | Independent scaling |
| Debugging | Easier to trace | Requires distributed tracing |
| Use case | Query, validation, auth | Background tasks, notifications |
Service Communication Decision Matrix
Need immediate response?
|
+-- Yes --> Need streaming?
| |
| +-- Yes --> gRPC (bidirectional streaming)
| +-- No --> REST (simple) or gRPC (performance)
|
+-- No --> Need guaranteed delivery?
|
+-- Yes --> Message Queue (Kafka, RabbitMQ, SQS)
+-- No --> Fire-and-forget event (Redis Pub/Sub, SNS)
Data Management in Microservices
Database per Service
Each service owns its data: no service may read or write another service's database directly.
CORRECT:
Order Svc          User Svc          Payment Svc
    |                  |                  |
+---+---+          +---+---+          +---+---+
| Order |          | User  |          |Payment|
|  DB   |          |  DB   |          |  DB   |
+-------+          +-------+          +-------+
Order Svc needs user data? --> Call User Svc API (not User DB!)
WRONG (shared database):
Order Svc ----+
User Svc -----+---> Shared Database   <-- Creates tight coupling!
Payment Svc --+                           Schema changes break everyone.
Data Consistency Challenges
When each service has its own database, maintaining consistency across services is hard.
Problem: "Transfer $100 from Account A to Account B"
Monolith:
BEGIN TRANSACTION
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT; -- atomic!
Microservices:
Account Service A: debit $100 (succeeds)
Account Service B: credit $100 (fails -- network error!)
Now $100 has vanished! No single transaction spans services.
Solutions covered in the Distributed Transactions section below.
Data Replication Patterns
When service B frequently needs data from service A, calling the API every time is inefficient. Instead:
Pattern 1: API Call (simple but slow):
Order Svc --> GET /users/42 --> User Svc
Pattern 2: Data Replication via Events:
User Svc: publish("user.updated", {id: 42, name: "Alice"})
Order Svc: subscribe --> store local copy of user data
Pattern 3: Shared Cache:
User Svc --> write to Redis
Order Svc --> read from Redis
Pattern 4: CQRS (Command Query Responsibility Segregation):
Write Model (User Svc, Order Svc) --> Events --> Read Model (denormalized)
Query Service reads from denormalized read model.
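Pattern 2 can be sketched as follows: a hypothetical local read model inside the Order Service, kept fresh by `user.updated` events. The field names are illustrative, and the local copy is eventually consistent -- it is stale until the next event arrives.

```python
# Local read model inside the Order Service: a denormalized copy of the
# user fields it needs, maintained by "user.updated" events (Pattern 2).
local_users = {}

def on_user_updated(event):
    """Event handler: upsert the local copy. Reads may be stale until
    the event arrives (eventual consistency)."""
    local_users[event["id"]] = {"name": event["name"]}

def user_name(user_id):
    """Serve reads locally -- no synchronous call to the User Service."""
    user = local_users.get(user_id)
    return user["name"] if user else None

# The User Service published this event; the Order Service consumed it.
on_user_updated({"id": 42, "name": "Alice"})
```

The trade-off versus Pattern 1 is latency for consistency: reads are fast and survive User Service outages, but can briefly return outdated data.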
Distributed Transactions
Two-Phase Commit (2PC)
A coordination protocol that ensures all services commit or all rollback.
Two-Phase Commit:
Coordinator            Service A            Service B
     |                     |                    |
     |-- PREPARE --------->|                    |
     |-- PREPARE ------------------------------->|
     |                     |                    |
     |<-- VOTE YES --------|                    |
     |<-- VOTE YES ------------------------------|
     |                     |                    |
     |-- COMMIT ---------->|                    |
     |-- COMMIT -------------------------------->|
     |                     |                    |
     |<-- ACK -------------|                    |
     |<-- ACK -----------------------------------|

If ANY service votes NO:
     |-- ROLLBACK -------->|                    |
     |-- ROLLBACK ------------------------------>|
Problems with 2PC:
- Coordinator is a single point of failure
- Blocking: all participants hold locks during prepare phase
- Poor performance at scale
- Not suitable for microservices (too much coupling)
Saga Pattern (Preferred for Microservices)
A saga is a sequence of local transactions; when a step fails, the steps already completed are undone by compensating transactions.
Choreography (Event-Driven):
Order Saga - Choreography:
1. Order Svc: Create Order (PENDING)
--> publish "order.created"
2. Payment Svc: Charge Card
--> publish "payment.charged"
3. Inventory Svc: Reserve Items
--> publish "inventory.reserved"
4. Order Svc: Confirm Order (CONFIRMED)
FAILURE at step 3 (out of stock):
3. Inventory Svc: publish "inventory.failed"
--> Payment Svc: Refund Card (compensating transaction)
--> Order Svc: Cancel Order (compensating transaction)
Each service reacts to events. No central coordinator.
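A minimal sketch of the choreography above, using an in-memory publish/subscribe stand-in for the broker. Handler and topic names follow the example; a real system would additionally need durable delivery and idempotent handlers.

```python
from collections import defaultdict

subs = defaultdict(list)

def subscribe(topic, handler):
    subs[topic].append(handler)

def publish(topic, event):
    # Synchronous delivery for clarity; a broker delivers asynchronously.
    for handler in subs[topic]:
        handler(event)

log = []  # records each local transaction and compensation

def payment_on_order_created(event):
    log.append("payment: charge card")
    publish("payment.charged", event)

def inventory_on_payment_charged(event):
    if event.get("out_of_stock"):
        publish("inventory.failed", event)   # triggers compensations below
    else:
        log.append("inventory: reserve items")
        publish("inventory.reserved", event)

def order_on_inventory_reserved(event):
    log.append("order: CONFIRMED")

# Compensating transactions: each service undoes its own earlier step.
def payment_on_inventory_failed(event):
    log.append("payment: refund card")

def order_on_inventory_failed(event):
    log.append("order: CANCELLED")

subscribe("order.created", payment_on_order_created)
subscribe("payment.charged", inventory_on_payment_charged)
subscribe("inventory.reserved", order_on_inventory_reserved)
subscribe("inventory.failed", payment_on_inventory_failed)
subscribe("inventory.failed", order_on_inventory_failed)

# Out-of-stock order: payment is charged, then compensated.
publish("order.created", {"order_id": 1, "out_of_stock": True})
```

Notice that no component sees the whole flow -- each service only knows which events it consumes and emits, which is exactly why choreography is loosely coupled but hard to trace.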
Orchestration (Centralized):
Order Saga - Orchestration:
+-------------------+
| Saga Orchestrator |
+--------+----------+
|
|-- 1. Create Order ---------> Order Svc
|<-- Order Created ------------|
|
|-- 2. Charge Card ----------> Payment Svc
|<-- Payment Success ----------|
|
|-- 3. Reserve Inventory ----> Inventory Svc
|<-- Inventory Reserved -------|
|
|-- 4. Confirm Order --------> Order Svc
|<-- Order Confirmed ----------|
FAILURE at step 3:
|-- Refund Card -------------> Payment Svc
|-- Cancel Order ------------> Order Svc
The orchestrator controls the flow and handles compensation.
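The orchestration flow above can be sketched as a generic step/compensation loop. This is a simplified illustration: the service clients are assumed to expose `create`/`charge`/`reserve`-style methods that raise on failure, and production orchestrators (Temporal, AWS Step Functions) also persist saga state so a crashed orchestrator can resume.

```python
class OrderSagaOrchestrator:
    """Centralized saga runner: execute each step in order; if a step
    fails, run the compensations for all completed steps in reverse."""

    def __init__(self, order_svc, payment_svc, inventory_svc):
        self.order_svc = order_svc
        self.payment_svc = payment_svc
        self.inventory_svc = inventory_svc

    def run(self, order):
        # Each entry pairs a forward step with its compensating transaction.
        steps = [
            (lambda: self.order_svc.create(order),
             lambda: self.order_svc.cancel(order)),
            (lambda: self.payment_svc.charge(order),
             lambda: self.payment_svc.refund(order)),
            (lambda: self.inventory_svc.reserve(order),
             lambda: self.inventory_svc.release(order)),
            (lambda: self.order_svc.confirm(order),
             lambda: None),  # last step; nothing to undo after it
        ]
        completed = []
        for step, compensate in steps:
            try:
                step()
                completed.append(compensate)
            except Exception:
                for comp in reversed(completed):
                    comp()  # undo completed steps in reverse order
                return "ROLLED_BACK"
        return "CONFIRMED"
```

A failure at step 3 (reserve) therefore produces exactly the compensation sequence in the diagram: refund the card, then cancel the order.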
Saga Comparison
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Decentralized (events) | Centralized (orchestrator) |
| Coupling | Lower | Higher (orchestrator knows all services) |
| Complexity | Hard to follow flow | Easy to understand flow |
| Single point of failure | None | Orchestrator |
| Scalability | Better | Orchestrator can be bottleneck |
| Best for | Simple sagas (3-4 steps) | Complex sagas (5+ steps) |
Deployment Strategies
Blue-Green Deployment
Blue-Green:
Current (Blue): [Server 1] [Server 2] [Server 3] (v1.0)
New (Green): [Server 4] [Server 5] [Server 6] (v2.0)
Step 1: Deploy v2.0 to Green (Blue still serves traffic)
Step 2: Test Green environment
Step 3: Switch load balancer from Blue to Green
Step 4: Green serves all traffic (instant cutover)
Step 5: Keep Blue as rollback option
Rollback: Switch LB back to Blue (instant)
Canary Deployment
Canary:
Phase 1: 95% --> v1.0 servers 5% --> v2.0 server (canary)
Phase 2: Monitor metrics (errors, latency, CPU)
Phase 3: 80% --> v1.0 20% --> v2.0
Phase 4: Monitor
Phase 5: 50% --> v1.0 50% --> v2.0
Phase 6: Monitor
Phase 7: 0% --> v1.0 100% --> v2.0 (fully rolled out)
At any phase, if metrics degrade: rollback canary to 0%.
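One common way to implement the traffic split is deterministic bucketing, sketched below. The version labels and the SHA-256 bucketing scheme are illustrative choices, not a specific product's API.

```python
import hashlib

def canary_route(user_id: str, canary_percent: int) -> str:
    """Deterministic traffic split: hash the user id into a bucket 0-99.
    The same user always lands on the same version, so the experience
    stays stable while the canary percentage is ramped up."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2.0-canary" if bucket < canary_percent else "v1.0-stable"
```

Ramping from phase to phase is then just a config change to `canary_percent` (5, 20, 50, 100), and rollback is setting it back to 0.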
Rolling Deployment
Rolling:
Start: [v1][v1][v1][v1][v1] (5 instances)
Step 1: [v2][v1][v1][v1][v1] Update instance 1
Step 2: [v2][v2][v1][v1][v1] Update instance 2
Step 3: [v2][v2][v2][v1][v1] Update instance 3
Step 4: [v2][v2][v2][v2][v1] Update instance 4
Step 5: [v2][v2][v2][v2][v2] Update instance 5 (done)
Capacity is always >= 80%. No downtime.
Deployment Strategy Comparison
| Strategy | Zero Downtime | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Blue-Green | Yes | Instant | 2x (both envs live) | Low |
| Canary | Yes | Fast | 1x + canary | Very low |
| Rolling | Yes | Slow (roll back one by one) | 1x | Medium |
| Recreate | No (brief downtime) | Deploy old version | 1x | High |
Monitoring and Observability
Microservices are inherently harder to debug. You MUST invest in observability.
The Three Pillars
+------------------------------------------------------------------+
|                      OBSERVABILITY PILLARS                       |
|                                                                  |
|  1. METRICS           2. LOGS               3. TRACES            |
|  (What is happening)  (Why it happened)     (Where it happened)  |
|                                                                  |
|  - Request rate       - Structured logs     - Distributed traces |
|  - Error rate         - Error details       - Span context       |
|  - Latency (p50/p99)  - Stack traces        - Service dependency |
|  - CPU/Memory         - Business events     - Latency breakdown  |
|                                                                  |
|  Tools:               Tools:                Tools:               |
|  Prometheus           ELK Stack             Jaeger               |
|  Grafana              Datadog Logs          Zipkin               |
|  CloudWatch           Splunk                AWS X-Ray            |
+------------------------------------------------------------------+
Distributed Tracing
Trace: "GET /api/orders/42"
+-- [API Gateway: 120ms]
    |
    +-- [Order Service: 80ms]
        |
        +-- [User Service: 15ms]       GET /users/42
        |   +-- [User DB: 5ms]
        |
        +-- [Payment Service: 40ms]    GET /payments?order=42
        |   +-- [Payment DB: 20ms]
        |
        +-- [Redis Cache: 2ms]
Trace ID: abc-123 (propagated across all services)
Each span shows: service, operation, duration, status
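Propagation works by injecting the trace context into outgoing request headers, roughly as sketched below. The `X-Trace-Id`/`X-Span-Id` header names are illustrative; the W3C Trace Context standard uses a `traceparent` header, and libraries such as OpenTelemetry handle this automatically.

```python
import time
import uuid

def inject_trace_headers(headers: dict, trace_id=None, parent_span=None) -> dict:
    """Attach trace context to outgoing request headers so the next
    service can continue the same trace (header names illustrative)."""
    headers = dict(headers)  # do not mutate the caller's dict
    headers["X-Trace-Id"] = trace_id or str(uuid.uuid4())
    headers["X-Span-Id"] = str(uuid.uuid4())
    if parent_span:
        headers["X-Parent-Span-Id"] = parent_span
    return headers

class Span:
    """Minimal span record: service, operation, duration, status."""
    def __init__(self, trace_id, service, operation):
        self.trace_id, self.service, self.operation = trace_id, service, operation
        self.start = time.monotonic()

    def finish(self, status="OK"):
        return {"trace_id": self.trace_id, "service": self.service,
                "operation": self.operation, "status": status,
                "duration_ms": (time.monotonic() - self.start) * 1000}
```

Because every hop reuses the incoming trace id and records its parent span id, the tracing backend can reassemble the nested timeline shown above.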
Health Check Patterns
/health (shallow):
{"status": "UP"}
/health/detailed (deep):
{
"status": "UP",
"checks": {
"database": {"status": "UP", "latency_ms": 5},
"redis": {"status": "UP", "latency_ms": 1},
"payment-service": {"status": "UP", "latency_ms": 25},
"disk": {"status": "UP", "free_gb": 45}
}
}
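A deep health endpoint can be assembled from per-dependency checks, as in this sketch. The check callables and the response shape mirror the JSON above; treating "every dependency UP" as overall UP is an assumption here (some teams report DEGRADED instead).

```python
def deep_health(checks: dict) -> dict:
    """Aggregate dependency checks into the /health/detailed shape.
    Each check is a callable returning a per-dependency status dict;
    a check that raises is reported as DOWN rather than crashing
    the endpoint."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            results[name] = {"status": "DOWN", "error": str(exc)}
    overall = "UP" if all(r["status"] == "UP" for r in results.values()) else "DOWN"
    return {"status": overall, "checks": results}
```

Keep the shallow `/health` for load-balancer probes (cheap, frequent) and reserve the deep variant for dashboards and on-call diagnosis, since it fans out to every dependency.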
Key Metrics for Microservices (RED Method)
| Metric | What It Measures | Alert When |
|---|---|---|
| Rate | Requests per second | Sudden drop or spike |
| Errors | Error rate (% of requests failing) | > 1% errors |
| Duration | Latency (p50, p95, p99) | p99 > SLA threshold |
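A minimal sketch of collecting RED metrics in-process. In practice a library such as the Prometheus client does this; the nearest-rank percentile below is a simplification of histogram-based latency tracking.

```python
class RedMetrics:
    """Accumulate the RED metrics for one endpoint: request count,
    error count, and latency samples for percentile computation."""

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.latencies_ms = []

    def record(self, latency_ms: float, ok: bool):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile, e.g. percentile(99) for p99."""
        if not self.latencies_ms:
            return 0.0
        data = sorted(self.latencies_ms)
        idx = min(len(data) - 1, int(len(data) * p / 100))
        return data[idx]
```

An alerting rule then compares `error_rate()` against the 1% threshold and `percentile(99)` against the SLA latency from the table.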
When Monolith Is Better
Microservices are NOT always the answer. Start with a monolith unless you have specific reasons for microservices.
Monolith Advantages
| Advantage | Details |
|---|---|
| Simplicity | One codebase, one deployment, one database |
| Development speed | No inter-service communication overhead |
| Data consistency | Single database transactions |
| Debugging | Stack traces show the full picture |
| Testing | Integration tests are straightforward |
| Operational cost | No service mesh, no distributed tracing needed |
When to Use Microservices
| Signal | Reasoning |
|---|---|
| Team size > 20-30 engineers | Teams step on each other in a monolith |
| Different scaling needs | User service scales differently than video encoding |
| Different technology stacks | ML in Python, API in Go, frontend in Node.js |
| Independent deployment needed | Deploy payment fix without touching user service |
| Organizational boundaries | Separate teams own separate services |
| High availability requirements | Isolate blast radius of failures |
The Migration Path
Migration from Monolith to Microservices:
Phase 1: Modular Monolith
+-------------------+
| [User Module] |
| [Order Module] | Clear module boundaries
| [Payment Module] | But single deployment
+-------------------+
Phase 2: Strangler Fig Pattern
+-------------------+ +--------+
| [User Module] | |Payment | <-- Extracted first
| [Order Module] |---->|Service |
| [Payment Facade] | +--------+
+-------------------+
Phase 3: Full Microservices
+------+ +------+ +--------+
| User | |Order | |Payment |
| Svc | | Svc | | Svc |
+------+ +------+ +--------+
Anti-Patterns in Microservices
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Distributed monolith | Services are tightly coupled; must deploy together | Define clear boundaries; async communication |
| Shared database | Schema changes break multiple services | Database per service; API for data access |
| Too-fine granularity | 100 services for a 5-person team | Merge related services; right-size boundaries |
| Synchronous chains | A calls B calls C calls D (latency compounds) | Async events; reduce call depth |
| No API versioning | Breaking changes cascade | Version APIs; backward compatibility |
| Ignoring data ownership | Multiple services write to same entity | Single owner per entity; events for propagation |
| No circuit breakers | One slow service brings down everything | Circuit breaker pattern (Hystrix, Resilience4j) |
| Manual deployments | Error-prone, slow, inconsistent | CI/CD pipeline per service |
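The circuit breaker named in the table can be sketched as a small state machine. This is a simplified illustration of the pattern, not the Hystrix or Resilience4j API; the threshold and timeout values are arbitrary.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch.
    CLOSED:    calls pass through; failures are counted.
    OPEN:      calls fail fast without hitting the downstream service.
    HALF_OPEN: after a cooldown, one trial call decides the next state."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # allow one trial call through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "HALF_OPEN":
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.state = "CLOSED"
        return result
```

The point of failing fast is back-pressure: callers get an immediate error instead of queueing behind a slow dependency, which is what turns one slow service into a cascading outage.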
Microservices Architecture Checklist for System Design Interviews
When designing a microservices system, address:
[ ] Service boundaries (what does each service own?)
[ ] Communication (sync REST/gRPC vs async events?)
[ ] Data management (database per service, consistency strategy)
[ ] Service discovery (how do services find each other?)
[ ] API Gateway (single entry point for clients)
[ ] Load balancing (how is traffic distributed?)
[ ] Fault tolerance (circuit breakers, retries, timeouts)
[ ] Deployment (blue-green, canary, rolling)
[ ] Monitoring (metrics, logs, distributed tracing)
[ ] Security (mTLS, auth propagation, network policies)
Key Takeaways
- Microservices decompose by business capability -- not by technical layer
- Service discovery is essential in dynamic environments; prefer server-side or mesh
- Sync for queries, async for events -- match the communication pattern to the need
- Database per service is non-negotiable for true independence
- Saga pattern replaces distributed transactions; prefer choreography for simple flows
- Canary deployments minimize risk; blue-green enables instant rollback
- Observability is not optional -- invest in metrics, logs, and distributed tracing
- Start with a monolith unless your team and system clearly need microservices
- The biggest microservices mistake is premature decomposition
Explain-It Challenge
"Your company has a monolithic e-commerce application with 50 developers. Deployments take 4 hours, a bug in the recommendation engine caused a payment outage last week, and the ML team wants to use Python while the core is Java. The CTO asks you to plan the migration to microservices. Describe how you would identify service boundaries, what you would extract first and why, how you would handle data that is currently shared across modules, and what infrastructure you would put in place before the migration."