Episode 9 — System Design / 9.9 — Core Infrastructure

9.9.c Load Balancing

What Is Load Balancing?

A load balancer distributes incoming network traffic across multiple servers to ensure no single server bears too much load. It is one of the most fundamental components in scalable system design.

  WITHOUT Load Balancer:
  
  All Users --> Single Server (overloaded, single point of failure)
  
  
  WITH Load Balancer:
  
  All Users --> Load Balancer --> Server 1
                              --> Server 2
                              --> Server 3
                              --> Server 4

Why load balancers exist:

  • Scalability: Distribute traffic across many servers
  • Availability: If one server dies, others take over
  • Performance: Each server handles a fraction of total load
  • Flexibility: Add/remove servers without downtime

Layer 4 vs Layer 7 Load Balancing

The "layer" refers to the OSI model. The key difference is how much the load balancer understands about the traffic.

Layer 4 (Transport Layer)

Operates at the TCP/UDP level. It sees IP addresses and ports but does NOT inspect HTTP content.

  Layer 4 Load Balancer:
  
  Client: 203.0.113.5:54321
     |
     v
  LB sees: SRC=203.0.113.5:54321, DST=LB_IP:443, Protocol=TCP
  LB decides: Route to Server 2 (based on IP hash or round robin)
     |
     v
  Server 2: 10.0.0.2:8080
  
  The LB does NOT look inside the HTTP request.
  It does NOT know the URL, headers, or cookies.

Characteristics:

  • Extremely fast (minimal processing)
  • Protocol-agnostic (works with any TCP/UDP traffic)
  • Cannot make routing decisions based on content
  • Cannot terminate SSL (typically)
  • Lower resource consumption

Examples: AWS NLB (Network Load Balancer), HAProxy (TCP mode), Linux IPVS

Layer 7 (Application Layer)

Operates at the HTTP/HTTPS level. It inspects URLs, headers, cookies, and request bodies.

  Layer 7 Load Balancer:
  
  Client Request:
  GET /api/users HTTP/1.1
  Host: example.com
  Cookie: session=abc123
  
  LB inspects:
  - URL: /api/users --> route to API server pool
  - Cookie: session=abc123 --> sticky to Server 3
  - Header: Accept: application/json --> route to JSON API
  
  LB can also:
  - Terminate SSL
  - Compress responses
  - Add/modify headers
  - Cache responses
  - Rate limit by URL

Characteristics:

  • Content-aware routing (URL, headers, cookies)
  • SSL/TLS termination
  • HTTP caching and compression
  • More resource-intensive
  • Can modify requests and responses

Examples: AWS ALB (Application Load Balancer), Nginx, HAProxy (HTTP mode), Envoy

Comparison Table

  Feature              Layer 4                            Layer 7
  -------------------  ---------------------------------  --------------------------------
  OSI layer            Transport (TCP/UDP)                Application (HTTP/HTTPS)
  Inspects content     No                                 Yes (URL, headers, cookies)
  SSL termination      No (pass-through)                  Yes
  Routing decisions    IP, port, protocol                 URL path, headers, cookies, body
  Performance          Faster (less processing)           Slower (deep inspection)
  Use case             Raw TCP traffic, databases, gaming Web apps, APIs, microservices
  Connection handling  1:1 (client to server)             Can pool/multiplex connections
  Cost                 Lower                              Higher

When to Use Each

  Decision Tree:
  
  Need content-based routing?
  |
  +-- Yes --> Layer 7
  |   |
  |   +-- Route /api/* to API servers
  |   +-- Route /static/* to file servers  
  |   +-- SSL termination needed
  |   +-- Cookie-based sticky sessions
  |
  +-- No --> Layer 4
      |
      +-- Database load balancing
      +-- Raw TCP/UDP traffic
      +-- Maximum throughput needed
      +-- Protocol-agnostic routing

Load Balancing Algorithms

1. Round Robin

Distributes requests sequentially across servers in order.

  Round Robin:
  
  Request 1 --> Server A
  Request 2 --> Server B
  Request 3 --> Server C
  Request 4 --> Server A  (cycle repeats)
  Request 5 --> Server B
  Request 6 --> Server C

Pros: Simple, fair distribution, no state needed
Cons: Ignores server capacity differences, ignores current load
Best for: Homogeneous servers with similar request costs
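
The rotation above can be sketched in a few lines of Python (a minimal in-memory sketch; a real load balancer would also skip servers that fail health checks):

```python
from itertools import cycle

def make_round_robin(servers):
    """Return a callable that hands out servers in fixed rotation."""
    ring = cycle(servers)
    return lambda: next(ring)

next_server = make_round_robin(["server-a", "server-b", "server-c"])
assignments = [next_server() for _ in range(6)]
# The cycle repeats: a, b, c, a, b, c
```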

2. Weighted Round Robin

Like round robin, but servers with higher weights get more requests.

  Weighted Round Robin (A=3, B=2, C=1):
  
  Request 1 --> Server A
  Request 2 --> Server A
  Request 3 --> Server A
  Request 4 --> Server B
  Request 5 --> Server B
  Request 6 --> Server C
  (cycle repeats)

Best for: Heterogeneous servers (some more powerful than others)
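
One naive way to implement this is to expand each server by its weight and rotate over the expanded list (a sketch; Nginx and HAProxy use a "smooth" variant that interleaves servers instead of emitting each quota in a run):

```python
def weighted_round_robin(weights):
    """Naive weighted round robin: each server appears in the rotation
    as many times as its weight."""
    expanded = [server for server, w in weights.items() for _ in range(w)]
    i = 0
    while True:
        yield expanded[i % len(expanded)]
        i += 1

gen = weighted_round_robin({"A": 3, "B": 2, "C": 1})
one_cycle = [next(gen) for _ in range(6)]
# A, A, A, B, B, C -- then the cycle repeats
```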

3. Least Connections

Routes to the server with the fewest active connections.

  Least Connections:
  
  Server A: 12 active connections
  Server B: 5 active connections   <-- next request goes here
  Server C: 8 active connections
  
  New request --> Server B (least loaded)

Pros: Adapts to varying request processing times
Cons: Requires tracking connection counts
Best for: Requests with variable processing time (e.g., some API calls take 10ms, others take 5s)
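
The selection itself is a one-liner over the connection table the LB already maintains (a sketch; the LB increments the count on dispatch and decrements it when the connection closes):

```python
def pick_least_connections(active):
    """active maps server name -> current open connection count."""
    return min(active, key=active.get)

active = {"A": 12, "B": 5, "C": 8}
target = pick_least_connections(active)   # B has the fewest connections
active[target] += 1                       # incremented on dispatch
```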

4. Weighted Least Connections

Combines weights with connection counts.

  Score = active_connections / weight
  
  Server A: 12 connections, weight=4 --> score = 3.0
  Server B: 5 connections, weight=2  --> score = 2.5  <-- lowest
  Server C: 8 connections, weight=2  --> score = 4.0
  
  New request --> Server B
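
The score formula above translates directly to code (a minimal sketch):

```python
def pick_weighted_least_connections(servers):
    """servers maps name -> (active_connections, weight);
    the lowest connections/weight score wins."""
    return min(servers, key=lambda s: servers[s][0] / servers[s][1])

servers = {"A": (12, 4), "B": (5, 2), "C": (8, 2)}
# Scores: A = 3.0, B = 2.5, C = 4.0 -> B wins
```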

5. IP Hash

Hashes the client IP to determine the server. Same client always goes to the same server.

  IP Hash:
  
  Client 203.0.113.5  --> hash --> Server A (always)
  Client 198.51.100.2 --> hash --> Server C (always)
  Client 192.0.2.1    --> hash --> Server B (always)

Pros: Session affinity without cookies, stateless LB
Cons: Uneven distribution if IP ranges cluster, fails when server count changes
Best for: When you need basic session affinity at Layer 4
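
A sketch of the hash-then-modulo mapping (md5 is used here only for even distribution, not security; Python's built-in hash() is avoided because it is randomized per process):

```python
import hashlib

def ip_hash(client_ip, servers):
    """Deterministically map a client IP onto one server."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["A", "B", "C"]
# The same IP always lands on the same server:
assert ip_hash("203.0.113.5", servers) == ip_hash("203.0.113.5", servers)
```

Because the modulo depends on len(servers), adding or removing a server remaps most clients, which is the weakness consistent hashing addresses.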

6. Consistent Hashing

An advanced hashing algorithm that minimizes redistribution when servers are added or removed.

  Consistent Hash Ring:
  
              S1
             / |  \
            /  |   \
         S4   |    S2
            \  |   /
             \ |  /
              S3
  
  Keys map to points on the ring.
  Each key is handled by the next server clockwise.
  
  If S2 is removed:
  - Only keys between S1 and S2 are reassigned to S3
  - S1, S4 keys are UNAFFECTED
  
  With virtual nodes:
  S1 appears at multiple points on the ring
  --> More even distribution

Pros: Minimal redistribution on topology changes; even distribution with virtual nodes
Cons: More complex to implement
Best for: Caching layers, distributed systems, stateful services
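
The ring above can be sketched with a sorted list and binary search (a minimal sketch with virtual nodes; production rings add per-server weights, replication, and faster hash functions):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []   # sorted list of (hash_point, server)
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add(self, server):
        # Each server occupies `vnodes` points for even distribution
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self.ring = [entry for entry in self.ring if entry[1] != server]

    def get(self, key):
        # First virtual node clockwise from the key's point (wraps around)
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3", "S4"])
keys = [f"user:{i}" for i in range(20)]
before = {k: ring.get(k) for k in keys}
ring.remove("S2")
after = {k: ring.get(k) for k in keys}
# Only the keys that were on S2 move; every other key keeps its server.
```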

7. Least Response Time

Routes to the server with the fastest recent response time.

  Server A: avg response = 45ms
  Server B: avg response = 12ms  <-- next request goes here
  Server C: avg response = 30ms

Best for: When server response time varies significantly
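
A sketch of the selection plus one common way to keep the per-server latency estimate fresh, an exponentially weighted moving average (the alpha value here is an illustrative assumption):

```python
def pick_fastest(avg_latency_ms):
    """Route to the server with the lowest tracked average latency."""
    return min(avg_latency_ms, key=avg_latency_ms.get)

def ewma(current_avg, sample_ms, alpha=0.2):
    """Blend a new latency sample into the running average."""
    return alpha * sample_ms + (1 - alpha) * current_avg

latencies = {"A": 45.0, "B": 12.0, "C": 30.0}
target = pick_fastest(latencies)                    # "B"
latencies[target] = ewma(latencies[target], 20.0)   # fold in a new sample
```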

Algorithm Comparison

  Algorithm             Stateful?  Even Distribution     Handles Heterogeneous Servers  Session Affinity
  --------------------  ---------  --------------------  -----------------------------  ----------------
  Round Robin           No         Yes (if homogeneous)  No                             No
  Weighted Round Robin  No         Yes                   Yes                            No
  Least Connections     Yes        Adaptive              Partial                        No
  IP Hash               No         Depends on IPs        No                             Yes
  Consistent Hashing    No         Yes (with vnodes)     No                             Yes
  Least Response Time   Yes        Adaptive              Yes                            No

Health Checks

Load balancers must detect unhealthy servers and stop routing traffic to them.

Types of Health Checks

1. TCP Health Check (Layer 4)

  LB --> SYN --> Server
  Server --> SYN-ACK --> LB
  
  If SYN-ACK received: server is healthy
  If timeout or RST: server is unhealthy

2. HTTP Health Check (Layer 7)

  LB --> GET /health --> Server
  Server --> 200 OK {"status": "healthy", "db": "ok", "cache": "ok"}
  
  If 200 OK: server is healthy
  If 5xx or timeout: server is unhealthy

3. Deep Health Check

# /health endpoint implementation (FastAPI-style sketch; the check_*
# helpers are app-specific probes for each dependency)
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
def health_check():
    checks = {
        "database": check_db_connection(),
        "cache": check_redis_connection(),
        "disk": check_disk_space(),
        "memory": check_memory_usage(),
    }

    all_healthy = all(checks.values())
    # A 503 tells the load balancer to pull this server from rotation
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        status_code=status_code,
        content={"status": "healthy" if all_healthy else "degraded", "checks": checks},
    )

Health Check Configuration

  Health Check Parameters:
  
  +-- Interval: 10 seconds (check every 10s)
  +-- Timeout: 5 seconds (wait up to 5s for response)
  +-- Unhealthy threshold: 3 (mark unhealthy after 3 failures)
  +-- Healthy threshold: 2 (mark healthy after 2 successes)
  
  Timeline:
  [OK] [OK] [FAIL] [FAIL] [FAIL] --> Mark UNHEALTHY, stop routing
                                      [OK] [OK] --> Mark HEALTHY, resume
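
The threshold logic above can be sketched as a small state machine (one tracker per backend target; the defaults mirror the example parameters):

```python
class HealthTracker:
    """Marks a target unhealthy after N consecutive failed checks and
    healthy again after M consecutive successes."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self.fail_streak = 0
        self.ok_streak = 0

    def record(self, check_passed):
        if check_passed:
            self.ok_streak, self.fail_streak = self.ok_streak + 1, 0
            if not self.healthy and self.ok_streak >= self.healthy_threshold:
                self.healthy = True   # resume routing
        else:
            self.fail_streak, self.ok_streak = self.fail_streak + 1, 0
            if self.healthy and self.fail_streak >= self.unhealthy_threshold:
                self.healthy = False  # stop routing
        return self.healthy

# Replay the timeline: OK OK FAIL FAIL FAIL -> unhealthy, then OK OK -> healthy
tracker = HealthTracker()
history = [tracker.record(ok) for ok in (True, True, False, False, False, True, True)]
```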

Graceful Degradation

  Server draining (graceful shutdown):
  
  1. Server signals "going down" (deregister or fail health check)
  2. LB stops sending NEW requests
  3. Existing connections are allowed to complete
  4. After drain timeout, server is fully removed
  
  No requests are dropped during deployment!

Sticky Sessions (Session Affinity)

Sticky sessions ensure a user's requests always go to the same server. Needed when servers hold session state in memory.

  Without Sticky Sessions:
  
  Request 1 (login)   --> Server A (session created)
  Request 2 (profile) --> Server B (no session! user appears logged out)
  
  
  With Sticky Sessions:
  
  Request 1 (login)   --> Server A (session created)
  Request 2 (profile) --> Server A (same server, session found)
  Request 3 (order)   --> Server A (same server)

Implementation Methods

1. Cookie-Based (Layer 7)

LB sets cookie: Set-Cookie: SERVERID=server-a; Path=/
Subsequent requests include: Cookie: SERVERID=server-a
LB routes to server-a

2. IP-Based (Layer 4)

hash(client_ip) % num_servers = target_server
Same IP always maps to same server

Why Sticky Sessions Are Problematic

  Problem         Description
  --------------  ------------------------------------------
  Uneven load     Popular sessions concentrate on one server
  Server failure  Session lost if server dies
  Scaling         Adding/removing servers disrupts sessions
  State coupling  Servers are no longer interchangeable

Better alternative: Store sessions in a shared store (Redis) and make servers stateless.

  Stateless Servers + Shared Session Store:
  
  Request 1 --> Server A --> Redis (store session)
  Request 2 --> Server B --> Redis (read session)  <-- works!
  Request 3 --> Server C --> Redis (read session)  <-- works!
  
  Any server can handle any request.
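
The pattern can be sketched with an in-memory dict standing in for the shared store (a hypothetical SessionStore class; with Redis the same set-with-TTL / get shape maps onto SETEX and GET):

```python
import json
import time

class SessionStore:
    """In-memory stand-in for a shared session store such as Redis."""

    def __init__(self):
        self._data = {}   # session_id -> (serialized session, expiry timestamp)

    def set(self, session_id, session, ttl_seconds=3600):
        self._data[session_id] = (json.dumps(session), time.time() + ttl_seconds)

    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None or entry[1] < time.time():
            return None   # missing or expired
        return json.loads(entry[0])

store = SessionStore()                  # shared by every server
store.set("abc123", {"user_id": 42})    # Server A writes at login
session_on_b = store.get("abc123")      # Server B reads on the next request
```

Because every server reads and writes the same store, the load balancer is free to use any algorithm, with no stickiness required.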

Global Server Load Balancing (GSLB)

GSLB distributes traffic across multiple data centers or regions worldwide.

  Global Server Load Balancing:
  
  User (Tokyo) --> DNS Query: api.example.com
                       |
                       v
                  GSLB DNS Server
                  "User is in Asia-Pacific"
                  "US-East datacenter: healthy, 150ms"
                  "EU-West datacenter: healthy, 200ms"
                  "AP-Southeast datacenter: healthy, 10ms"  <-- closest
                       |
                       v
                  Return IP of AP-Southeast datacenter
                       |
                       v
  User (Tokyo) --> AP-Southeast Load Balancer --> Servers

GSLB Routing Strategies

  Strategy           How It Works                                   Best For
  -----------------  ---------------------------------------------  -------------------------
  Geographic         Route to nearest region                        Latency-sensitive apps
  Latency-based      Route to fastest region (measured)             Performance optimization
  Failover           Route to primary; switch to backup on failure  Disaster recovery
  Weighted           Distribute % of traffic across regions         Gradual migration, canary
  Round Robin (DNS)  Rotate DNS across regions                      Basic multi-region

DNS-Based vs Anycast

DNS-Based GSLB:

User --> DNS --> Returns regional IP --> User connects to regional LB
                TTL=60s

Pros: Fine-grained control, weighted routing
Cons: DNS caching delays failover, TTL compliance varies

Anycast:

User --> Same IP globally --> BGP routes to nearest PoP

Pros: Instant failover, no DNS TTL issues
Cons: Less control over routing, BGP convergence time

Load Balancer in System Architecture

Common Architecture Patterns

Pattern 1: Single LB (Simple)

  Internet --> LB --> [Server 1, Server 2, Server 3]

Pattern 2: LB Pair with Failover (HA)

  Internet --> Active LB -----> [Servers]
                  |
              Heartbeat
                  |
               Standby LB (takes over if active fails)

Pattern 3: Multi-Tier LB (Large Scale)

  Internet --> L4 LB (NLB) --> L7 LB (ALB) --> [API Servers]
                            --> L7 LB (ALB) --> [Web Servers]
                            --> L7 LB (ALB) --> [Static Servers]

Pattern 4: Service Mesh (Microservices)

  Service A --> Sidecar Proxy (Envoy) --> Service B
  Service A --> Sidecar Proxy (Envoy) --> Service C
  
  Each service has its own proxy that handles load balancing,
  retries, circuit breaking, and observability.
  No central load balancer needed for east-west traffic.

Load Balancer as Single Point of Failure

  Problem: LB itself can fail
  
  Solution 1: Active-Passive (VRRP/Keepalived)
  +----------+     +----------+
  | Active   |<--->| Passive  |
  | LB       |     | LB       |
  +-----+----+     +-----+----+
        |                |
  Virtual IP (VIP) floats between them
  
  Solution 2: DNS Round Robin to multiple LBs
  api.example.com --> 10.0.0.1 (LB1)
                  --> 10.0.0.2 (LB2)
  
  Solution 3: Cloud-managed LB (AWS ALB/NLB)
  AWS manages redundancy automatically across AZs

Load Balancer Technologies

  Technology     Type      Layer  Best For
  -------------  --------  -----  --------------------------------
  AWS ALB        Cloud     L7     HTTP/HTTPS, microservices
  AWS NLB        Cloud     L4     High-throughput TCP/UDP
  Nginx          Software  L4/L7  Web servers, reverse proxy
  HAProxy        Software  L4/L7  High-performance load balancing
  Envoy          Software  L7     Service mesh, gRPC
  F5 BIG-IP      Hardware  L4/L7  Enterprise, on-premise
  Cloudflare LB  Cloud     L7     Global load balancing
  Traefik        Software  L7     Container-native, auto-discovery

Key Takeaways

  1. Layer 4 is for raw TCP/UDP performance; Layer 7 is for content-aware routing
  2. Round robin is the simplest algorithm; least connections adapts to variable loads
  3. Consistent hashing minimizes disruption when servers change -- essential for caches
  4. Health checks are non-negotiable -- unhealthy servers must be removed automatically
  5. Avoid sticky sessions -- prefer stateless servers with shared session storage
  6. GSLB distributes traffic globally using DNS or Anycast
  7. The LB itself must be highly available -- use active-passive pairs or cloud-managed LBs
  8. In interviews, always mention which layer and algorithm you would choose, and why

Explain-It Challenge

"You are designing a video streaming service that serves 1 million concurrent viewers worldwide. The service has a mix of live streams (latency-sensitive, WebSocket-based) and on-demand videos (HTTP-based, cacheable). Design the load balancing strategy at every layer: global routing, regional load balancing, and service-level routing. Explain your choice of algorithm at each layer."