Episode 9 — System Design / 9.9 — Core Infrastructure
9.9.c Load Balancing
What Is Load Balancing?
A load balancer distributes incoming network traffic across multiple servers to ensure no single server bears too much load. It is one of the most fundamental components in scalable system design.
WITHOUT Load Balancer:
All Users --> Single Server (overloaded, single point of failure)
WITH Load Balancer:
All Users --> Load Balancer --> Server 1
                            --> Server 2
                            --> Server 3
                            --> Server 4
Why load balancers exist:
- Scalability: Distribute traffic across many servers
- Availability: If one server dies, others take over
- Performance: Each server handles a fraction of total load
- Flexibility: Add/remove servers without downtime
Layer 4 vs Layer 7 Load Balancing
The "layer" refers to the OSI model. The key difference is how much the load balancer understands about the traffic.
Layer 4 (Transport Layer)
Operates on TCP/UDP level. Sees IP addresses and ports. Does NOT inspect HTTP content.
Layer 4 Load Balancer:
Client: 203.0.113.5:54321
|
v
LB sees: SRC=203.0.113.5:54321, DST=LB_IP:443, Protocol=TCP
LB decides: Route to Server 2 (based on IP hash or round robin)
|
v
Server 2: 10.0.0.2:8080
The LB does NOT look inside the HTTP request.
It does NOT know the URL, headers, or cookies.
Characteristics:
- Extremely fast (minimal processing)
- Protocol-agnostic (works with any TCP/UDP traffic)
- Cannot make routing decisions based on content
- Cannot terminate SSL (typically)
- Lower resource consumption
Examples: AWS NLB (Network Load Balancer), HAProxy (TCP mode), Linux IPVS
Layer 7 (Application Layer)
Operates on HTTP/HTTPS level. Inspects URLs, headers, cookies, request body.
Layer 7 Load Balancer:
Client Request:
GET /api/users HTTP/1.1
Host: example.com
Cookie: session=abc123
LB inspects:
- URL: /api/users --> route to API server pool
- Cookie: session=abc123 --> sticky to Server 3
- Header: Accept: application/json --> route to JSON API
LB can also:
- Terminate SSL
- Compress responses
- Add/modify headers
- Cache responses
- Rate limit by URL
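The content-based rules above boil down to a routing table. A minimal sketch of prefix-based path routing (the pool names and rules here are hypothetical, not from any particular LB):

```python
# Illustrative L7 routing table: first matching URL prefix wins,
# with "/" as the catch-all default.
routes = [
    ("/api/", "api-pool"),
    ("/static/", "static-pool"),
    ("/", "web-pool"),  # default pool
]

def route(path):
    """Pick the backend pool whose prefix matches the request path."""
    for prefix, pool in routes:
        if path.startswith(prefix):
            return pool

print(route("/api/users"))     # api-pool
print(route("/static/a.png"))  # static-pool
print(route("/about"))         # web-pool
```

Real L7 balancers (Nginx, ALB, Envoy) evaluate richer rule sets (host, headers, methods), but the core idea is the same ordered match.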
Characteristics:
- Content-aware routing (URL, headers, cookies)
- SSL/TLS termination
- HTTP caching and compression
- More resource-intensive
- Can modify requests and responses
Examples: AWS ALB (Application Load Balancer), Nginx, HAProxy (HTTP mode), Envoy
Comparison Table
| Feature | Layer 4 | Layer 7 |
|---|---|---|
| OSI Layer | Transport (TCP/UDP) | Application (HTTP/HTTPS) |
| Inspects content | No | Yes (URL, headers, cookies) |
| SSL termination | No (pass-through) | Yes |
| Routing decisions | IP, port, protocol | URL path, headers, cookies, body |
| Performance | Faster (less processing) | Slower (deep inspection) |
| Use case | Raw TCP traffic, databases, gaming | Web apps, APIs, microservices |
| Connection handling | 1:1 (client to server) | Can pool/multiplex connections |
| Cost | Lower | Higher |
When to Use Each
Decision Tree:
Need content-based routing?
|
+-- Yes --> Layer 7
|           |
|           +-- Route /api/* to API servers
|           +-- Route /static/* to file servers
|           +-- SSL termination needed
|           +-- Cookie-based sticky sessions
|
+-- No --> Layer 4
           |
           +-- Database load balancing
           +-- Raw TCP/UDP traffic
           +-- Maximum throughput needed
           +-- Protocol-agnostic routing
Load Balancing Algorithms
1. Round Robin
Distributes requests sequentially across servers in order.
Round Robin:
Request 1 --> Server A
Request 2 --> Server B
Request 3 --> Server C
Request 4 --> Server A (cycle repeats)
Request 5 --> Server B
Request 6 --> Server C
Pros: Simple, fair distribution, no state needed
Cons: Ignores server capacity differences and current load
Best for: Homogeneous servers with similar request costs
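A minimal round-robin sketch (server names are illustrative) -- the whole algorithm is a cycling index over the server list:

```python
from itertools import cycle

servers = ["server-a", "server-b", "server-c"]
rr = cycle(servers)  # yields A, B, C, A, B, C, ...

def next_server():
    """Return the next server in round-robin order."""
    return next(rr)

# Six requests wrap around the pool twice
assignments = [next_server() for _ in range(6)]
print(assignments)
# ['server-a', 'server-b', 'server-c', 'server-a', 'server-b', 'server-c']
```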
2. Weighted Round Robin
Like round robin, but servers with higher weights get more requests.
Weighted Round Robin (A=3, B=2, C=1):
Request 1 --> Server A
Request 2 --> Server A
Request 3 --> Server A
Request 4 --> Server B
Request 5 --> Server B
Request 6 --> Server C
(cycle repeats)
Best for: Heterogeneous servers (some more powerful than others)
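A sketch of the A=3, B=2, C=1 schedule above. Production LBs typically interleave the picks more evenly (e.g. Nginx uses a "smooth" weighted round robin); this version repeats each server by its weight to keep the idea minimal:

```python
# Weights matching the example above: A=3, B=2, C=1
weights = {"server-a": 3, "server-b": 2, "server-c": 1}

# One full cycle of the schedule: each server repeated by its weight.
schedule = [s for s, w in weights.items() for _ in range(w)]

def pick(i):
    """Server for the i-th request (0-based), cycling the schedule."""
    return schedule[i % len(schedule)]

print([pick(i) for i in range(6)])
# ['server-a', 'server-a', 'server-a', 'server-b', 'server-b', 'server-c']
```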
3. Least Connections
Routes to the server with the fewest active connections.
Least Connections:
Server A: 12 active connections
Server B: 5 active connections <-- next request goes here
Server C: 8 active connections
New request --> Server B (least loaded)
Pros: Adapts to varying request processing times
Cons: Requires tracking active connection counts
Best for: Requests with variable processing time (e.g., some API calls take 10ms, others take 5s)
4. Weighted Least Connections
Combines weights with connection counts.
Score = active_connections / weight
Server A: 12 connections, weight=4 --> score = 3.0
Server B: 5 connections, weight=2 --> score = 2.5 <-- lowest
Server C: 8 connections, weight=2 --> score = 4.0
New request --> Server B
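The score computation above is a one-line minimum over the pool. A sketch with the same numbers (server names and the `stats` shape are illustrative; plain least connections is the special case where every weight is 1):

```python
def pick_server(stats):
    """stats: {server: (active_connections, weight)}.
    Choose the server with the lowest connections/weight score."""
    return min(stats, key=lambda s: stats[s][0] / stats[s][1])

stats = {
    "server-a": (12, 4),  # score 3.0
    "server-b": (5, 2),   # score 2.5  <-- lowest
    "server-c": (8, 2),   # score 4.0
}
print(pick_server(stats))  # server-b
```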
5. IP Hash
Hashes the client IP to determine the server. Same client always goes to the same server.
IP Hash:
Client 203.0.113.5 --> hash --> Server A (always)
Client 198.51.100.2 --> hash --> Server C (always)
Client 192.0.2.1 --> hash --> Server B (always)
Pros: Session affinity without cookies, stateless LB
Cons: Uneven distribution when many clients share IPs (NAT, corporate proxies); most clients are remapped when the server count changes
Best for: Basic session affinity at Layer 4
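As a sketch, IP hash reduces to hashing the address into the pool. The function name and the choice of MD5 here are illustrative; any stable hash works:

```python
import hashlib

servers = ["server-a", "server-b", "server-c"]

def ip_hash(client_ip):
    """Map a client IP to a server by hashing.
    Stable while the pool is fixed; changing len(servers) remaps
    most clients -- the problem consistent hashing solves."""
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# The same IP always lands on the same server
print(ip_hash("203.0.113.5") == ip_hash("203.0.113.5"))  # True
```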
6. Consistent Hashing
An advanced hashing algorithm that minimizes redistribution when servers are added or removed.
Consistent Hash Ring:
        S1
      / | \
     /  |  \
   S4   |   S2
     \  |  /
      \ | /
        S3
Keys map to points on the ring.
Each key is handled by the next server clockwise.
If S2 is removed:
- Only keys between S1 and S2 are reassigned to S3
- S1, S4 keys are UNAFFECTED
With virtual nodes:
S1 appears at multiple points on the ring
--> More even distribution
Pros: Minimal redistribution on topology changes; even distribution with virtual nodes
Cons: More complex to implement
Best for: Caching layers, distributed systems, stateful services
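A compact hash-ring sketch with virtual nodes (class, method, and parameter names are illustrative). Removing a node remaps only the keys that were assigned to it; every other key keeps its server:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node) points
        for node in nodes:
            for i in range(vnodes):
                # Each node appears at `vnodes` points on the ring
                bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key):
        """Next node clockwise from the key's position on the ring."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["s1", "s2", "s3", "s4"])
before = {k: ring.get(k) for k in ("user:1", "user:2", "user:3")}

# Rebuild the ring without s2: keys that were NOT on s2 must not move
smaller = HashRing(["s1", "s3", "s4"])
moved = [k for k, n in before.items() if n != "s2" and smaller.get(k) != n]
print(moved)  # [] -- only keys that were on s2 are remapped
```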
7. Least Response Time
Routes to the server with the fastest recent response time.
Server A: avg response = 45ms
Server B: avg response = 12ms <-- next request goes here
Server C: avg response = 30ms
Best for: When server response time varies significantly
Algorithm Comparison
| Algorithm | Stateful? | Even Distribution | Handles Heterogeneous Servers | Session Affinity |
|---|---|---|---|---|
| Round Robin | No | Yes (if homogeneous) | No | No |
| Weighted Round Robin | No | Yes | Yes | No |
| Least Connections | Yes | Adaptive | Partial | No |
| IP Hash | No | Depends on IPs | No | Yes |
| Consistent Hashing | No | Yes (with vnodes) | Partial (via vnode counts) | Yes |
| Least Response Time | Yes | Adaptive | Yes | No |
Health Checks
Load balancers must detect unhealthy servers and stop routing traffic to them.
Types of Health Checks
1. TCP Health Check (Layer 4)
LB --> SYN --> Server
Server --> SYN-ACK --> LB
If SYN-ACK received: server is healthy
If timeout or RST: server is unhealthy
2. HTTP Health Check (Layer 7)
LB --> GET /health --> Server
Server --> 200 OK {"status": "healthy", "db": "ok", "cache": "ok"}
If 200 OK: server is healthy
If 5xx or timeout: server is unhealthy
3. Deep Health Check
# /health endpoint implementation (FastAPI; the check_* helpers are app-specific)
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
def health_check():
    checks = {
        "database": check_db_connection(),
        "cache": check_redis_connection(),
        "disk": check_disk_space(),
        "memory": check_memory_usage(),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        status_code=status_code,
        content={"status": "healthy" if all_healthy else "degraded", "checks": checks},
    )
Health Check Configuration
Health Check Parameters:
+-- Interval: 10 seconds (check every 10s)
+-- Timeout: 5 seconds (wait up to 5s for response)
+-- Unhealthy threshold: 3 (mark unhealthy after 3 failures)
+-- Healthy threshold: 2 (mark healthy after 2 successes)
Timeline:
[OK] [OK] [FAIL] [FAIL] [FAIL] --> Mark UNHEALTHY, stop routing
[OK] [OK] --> Mark HEALTHY, resume
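The threshold logic above can be sketched as a small state machine (class and parameter names are illustrative, not tied to any specific LB):

```python
class HealthTracker:
    """Tracks consecutive successes/failures against thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._fails = 0
        self._oks = 0

    def record(self, ok):
        """Record one probe result; return current health state."""
        if ok:
            self._fails, self._oks = 0, self._oks + 1
            if not self.healthy and self._oks >= self.healthy_threshold:
                self.healthy = True
        else:
            self._oks, self._fails = 0, self._fails + 1
            if self.healthy and self._fails >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy

# The timeline from the text: healthy -> 3 failures -> 2 successes
t = HealthTracker()
timeline = [True, True, False, False, False, True, True]
states = [t.record(r) for r in timeline]
print(states)
# [True, True, True, True, False, False, True]
```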
Graceful Degradation
Server draining (graceful shutdown):
1. Server signals "going down" (deregister or fail health check)
2. LB stops sending NEW requests
3. Existing connections are allowed to complete
4. After drain timeout, server is fully removed
No requests are dropped during deployment!
Sticky Sessions (Session Affinity)
Sticky sessions ensure a user's requests always go to the same server. Needed when servers hold session state in memory.
Without Sticky Sessions:
Request 1 (login) --> Server A (session created)
Request 2 (profile) --> Server B (no session! user appears logged out)
With Sticky Sessions:
Request 1 (login) --> Server A (session created)
Request 2 (profile) --> Server A (same server, session found)
Request 3 (order) --> Server A (same server)
Implementation Methods
1. Cookie-Based (Layer 7)
LB sets cookie: Set-Cookie: SERVERID=server-a; Path=/
Subsequent requests include: Cookie: SERVERID=server-a
LB routes to server-a
2. IP-Based (Layer 4)
hash(client_ip) % num_servers = target_server
Same IP always maps to same server
Why Sticky Sessions Are Problematic
| Problem | Description |
|---|---|
| Uneven load | Popular sessions concentrate on one server |
| Server failure | Session lost if server dies |
| Scaling | Adding/removing servers disrupts sessions |
| State coupling | Servers are no longer interchangeable |
Better alternative: Store sessions in a shared store (Redis) and make servers stateless.
Stateless Servers + Shared Session Store:
Request 1 --> Server A --> Redis (store session)
Request 2 --> Server B --> Redis (read session) <-- works!
Request 3 --> Server C --> Redis (read session) <-- works!
Any server can handle any request.
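A sketch of the stateless pattern, using a plain dict as a stand-in for Redis (function names are illustrative; a real setup would use a Redis client with per-session TTLs):

```python
import uuid

# Shared store stand-in: in production this would be Redis/Memcached,
# reachable by every application server.
session_store = {}

def login(server, user):
    """Any server can create a session; state lives in the shared store."""
    sid = str(uuid.uuid4())
    session_store[sid] = {"user": user, "created_by": server}
    return sid

def handle_request(server, sid):
    """Any other server can read the same session -- no stickiness needed."""
    session = session_store.get(sid)
    return session["user"] if session else None

sid = login("server-a", "alice")
print(handle_request("server-b", sid))  # alice -- different server, still works
print(handle_request("server-c", sid))  # alice
```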
Global Server Load Balancing (GSLB)
GSLB distributes traffic across multiple data centers or regions worldwide.
Global Server Load Balancing:
User (Tokyo) --> DNS Query: api.example.com
|
v
GSLB DNS Server
"User is in Asia-Pacific"
"US-East datacenter: healthy, 150ms"
"EU-West datacenter: healthy, 200ms"
"AP-Southeast datacenter: healthy, 10ms" <-- closest
|
v
Return IP of AP-Southeast datacenter
|
v
User (Tokyo) --> AP-Southeast Load Balancer --> Servers
GSLB Routing Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Geographic | Route to nearest region | Latency-sensitive apps |
| Latency-based | Route to fastest region (measured) | Performance optimization |
| Failover | Route to primary; switch to backup on failure | Disaster recovery |
| Weighted | Distribute % of traffic across regions | Gradual migration, canary |
| Round Robin (DNS) | Rotate DNS across regions | Basic multi-region |
DNS-Based vs Anycast
DNS-Based GSLB:
User --> DNS --> Returns regional IP --> User connects to regional LB
TTL=60s
Pros: Fine-grained control, weighted routing
Cons: DNS caching delays failover, TTL compliance varies
Anycast:
User --> Same IP globally --> BGP routes to nearest PoP
Pros: Instant failover, no DNS TTL issues
Cons: Less control over routing, BGP convergence time
Load Balancer in System Architecture
Common Architecture Patterns
Pattern 1: Single LB (Simple)
Internet --> LB --> [Server 1, Server 2, Server 3]
Pattern 2: LB Pair with Failover (HA)
Internet --> Active LB -----> [Servers]
                 |
             Heartbeat
                 |
             Standby LB (takes over if active fails)
Pattern 3: Multi-Tier LB (Large Scale)
Internet --> L4 LB (NLB) --> L7 LB (ALB) --> [API Servers]
                         --> L7 LB (ALB) --> [Web Servers]
                         --> L7 LB (ALB) --> [Static Servers]
Pattern 4: Service Mesh (Microservices)
Service A --> Sidecar Proxy (Envoy) --> Service B
Service A --> Sidecar Proxy (Envoy) --> Service C
Each service has its own proxy that handles load balancing,
retries, circuit breaking, and observability.
No central load balancer needed for east-west traffic.
Load Balancer as Single Point of Failure
Problem: LB itself can fail
Solution 1: Active-Passive (VRRP/Keepalived)
+----------+       +----------+
|  Active  |<----->| Passive  |
|    LB    |       |    LB    |
+-----+----+       +-----+----+
      |                  |
      Virtual IP (VIP) floats between them
Solution 2: DNS Round Robin to multiple LBs
api.example.com --> 10.0.0.1 (LB1)
                --> 10.0.0.2 (LB2)
Solution 3: Cloud-managed LB (AWS ALB/NLB)
AWS manages redundancy automatically across AZs
Load Balancer Technologies
| Technology | Type | Layer | Best For |
|---|---|---|---|
| AWS ALB | Cloud | L7 | HTTP/HTTPS, microservices |
| AWS NLB | Cloud | L4 | High throughput TCP/UDP |
| Nginx | Software | L4/L7 | Web servers, reverse proxy |
| HAProxy | Software | L4/L7 | High-performance load balancing |
| Envoy | Software | L7 | Service mesh, gRPC |
| F5 BIG-IP | Hardware | L4/L7 | Enterprise, on-premise |
| Cloudflare LB | Cloud | L7 | Global load balancing |
| Traefik | Software | L7 | Container-native, auto-discovery |
Key Takeaways
- Layer 4 is for raw TCP/UDP performance; Layer 7 is for content-aware routing
- Round robin is the simplest algorithm; least connections adapts to variable loads
- Consistent hashing minimizes disruption when servers change -- essential for caches
- Health checks are non-negotiable -- unhealthy servers must be removed automatically
- Avoid sticky sessions -- prefer stateless servers with shared session storage
- GSLB distributes traffic globally using DNS or Anycast
- The LB itself must be highly available -- use active-passive pairs or cloud-managed LBs
- In interviews, always mention which layer and algorithm you would choose, and why
Explain-It Challenge
"You are designing a video streaming service that serves 1 million concurrent viewers worldwide. The service has a mix of live streams (latency-sensitive, WebSocket-based) and on-demand videos (HTTP-based, cacheable). Design the load balancing strategy at every layer: global routing, regional load balancing, and service-level routing. Explain your choice of algorithm at each layer."