Episode 9 — System Design / 9.9 — Core Infrastructure

9.9.c Load Balancing

What Is Load Balancing?

A load balancer distributes incoming network traffic across multiple servers to ensure no single server bears too much load. It is one of the most fundamental components in scalable system design.

  WITHOUT Load Balancer:
  
  All Users --> Single Server (overloaded, single point of failure)
  
  
  WITH Load Balancer:
  
  All Users --> Load Balancer --> Server 1
                              --> Server 2
                              --> Server 3
                              --> Server 4

Why load balancers exist:

  • Scalability: Distribute traffic across many servers
  • Availability: If one server dies, others take over
  • Performance: Each server handles a fraction of total load
  • Flexibility: Add/remove servers without downtime

Layer 4 vs Layer 7 Load Balancing

The "layer" refers to the OSI model. The key difference is how much the load balancer understands about the traffic.

Layer 4 (Transport Layer)

Operates at the TCP/UDP level. It sees IP addresses and ports but does NOT inspect HTTP content.

  Layer 4 Load Balancer:
  
  Client: 203.0.113.5:54321
     |
     v
  LB sees: SRC=203.0.113.5:54321, DST=LB_IP:443, Protocol=TCP
  LB decides: Route to Server 2 (based on IP hash or round robin)
     |
     v
  Server 2: 10.0.0.2:8080
  
  The LB does NOT look inside the HTTP request.
  It does NOT know the URL, headers, or cookies.

Characteristics:

  • Extremely fast (minimal processing)
  • Protocol-agnostic (works with any TCP/UDP traffic)
  • Cannot make routing decisions based on content
  • Cannot terminate SSL (typically)
  • Lower resource consumption

Examples: AWS NLB (Network Load Balancer), HAProxy (TCP mode), Linux IPVS

Layer 7 (Application Layer)

Operates at the HTTP/HTTPS level. It inspects URLs, headers, cookies, and request bodies.

  Layer 7 Load Balancer:
  
  Client Request:
  GET /api/users HTTP/1.1
  Host: example.com
  Cookie: session=abc123
  
  LB inspects:
  - URL: /api/users --> route to API server pool
  - Cookie: session=abc123 --> sticky to Server 3
  - Header: Accept: application/json --> route to JSON API
  
  LB can also:
  - Terminate SSL
  - Compress responses
  - Add/modify headers
  - Cache responses
  - Rate limit by URL

Characteristics:

  • Content-aware routing (URL, headers, cookies)
  • SSL/TLS termination
  • HTTP caching and compression
  • More resource-intensive
  • Can modify requests and responses

Examples: AWS ALB (Application Load Balancer), Nginx, HAProxy (HTTP mode), Envoy

Comparison Table

  Feature              Layer 4                            Layer 7
  -------------------  ---------------------------------  --------------------------------
  OSI layer            Transport (TCP/UDP)                Application (HTTP/HTTPS)
  Inspects content     No                                 Yes (URL, headers, cookies)
  SSL termination      No (pass-through)                  Yes
  Routing decisions    IP, port, protocol                 URL path, headers, cookies, body
  Performance          Faster (less processing)           Slower (deep inspection)
  Use case             Raw TCP traffic, databases, gaming Web apps, APIs, microservices
  Connection handling  1:1 (client to server)             Can pool/multiplex connections
  Cost                 Lower                              Higher

When to Use Each

  Decision Tree:
  
  Need content-based routing?
  |
  +-- Yes --> Layer 7
  |   |
  |   +-- Route /api/* to API servers
  |   +-- Route /static/* to file servers  
  |   +-- SSL termination needed
  |   +-- Cookie-based sticky sessions
  |
  +-- No --> Layer 4
      |
      +-- Database load balancing
      +-- Raw TCP/UDP traffic
      +-- Maximum throughput needed
      +-- Protocol-agnostic routing

Load Balancing Algorithms

1. Round Robin

Distributes requests sequentially across servers in order.

  Round Robin:
  
  Request 1 --> Server A
  Request 2 --> Server B
  Request 3 --> Server C
  Request 4 --> Server A  (cycle repeats)
  Request 5 --> Server B
  Request 6 --> Server C

Pros: Simple, fair distribution, no state needed
Cons: Ignores server capacity differences, ignores current load
Best for: Homogeneous servers with similar request costs
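
The rotation above can be sketched in a few lines of Python (a minimal in-memory sketch; a real load balancer would also skip servers that fail health checks):

```python
from itertools import cycle

def make_round_robin(servers):
    """Return a callable that hands out servers in fixed rotation."""
    ring = cycle(servers)
    return lambda: next(ring)

next_server = make_round_robin(["server-a", "server-b", "server-c"])
assignments = [next_server() for _ in range(6)]
# The cycle repeats: a, b, c, a, b, c
```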

2. Weighted Round Robin

Like round robin, but servers with higher weights get more requests.

  Weighted Round Robin (A=3, B=2, C=1):
  
  Request 1 --> Server A
  Request 2 --> Server A
  Request 3 --> Server A
  Request 4 --> Server B
  Request 5 --> Server B
  Request 6 --> Server C
  (cycle repeats)

Best for: Heterogeneous servers (some more powerful than others)
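
One naive way to implement this is to expand each server by its weight and rotate over the expanded list (a sketch; Nginx and HAProxy use a "smooth" variant that interleaves servers instead of emitting each quota in a run):

```python
def weighted_round_robin(weights):
    """Naive weighted round robin: each server appears in the rotation
    as many times as its weight."""
    expanded = [server for server, w in weights.items() for _ in range(w)]
    i = 0
    while True:
        yield expanded[i % len(expanded)]
        i += 1

gen = weighted_round_robin({"A": 3, "B": 2, "C": 1})
one_cycle = [next(gen) for _ in range(6)]
# A, A, A, B, B, C -- then the cycle repeats
```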

3. Least Connections

Routes to the server with the fewest active connections.

  Least Connections:
  
  Server A: 12 active connections
  Server B: 5 active connections   <-- next request goes here
  Server C: 8 active connections
  
  New request --> Server B (least loaded)

Pros: Adapts to varying request processing times
Cons: Requires tracking connection counts
Best for: Requests with variable processing time (e.g., some API calls take 10ms, others take 5s)
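
The selection itself is a one-liner over the connection table the LB already maintains (a sketch; the LB increments the count on dispatch and decrements it when the connection closes):

```python
def pick_least_connections(active):
    """active maps server name -> current open connection count."""
    return min(active, key=active.get)

active = {"A": 12, "B": 5, "C": 8}
target = pick_least_connections(active)   # B has the fewest connections
active[target] += 1                       # incremented on dispatch
```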

4. Weighted Least Connections

Combines weights with connection counts.

  Score = active_connections / weight
  
  Server A: 12 connections, weight=4 --> score = 3.0
  Server B: 5 connections, weight=2  --> score = 2.5  <-- lowest
  Server C: 8 connections, weight=2  --> score = 4.0
  
  New request --> Server B
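
The score formula above translates directly to code (a minimal sketch):

```python
def pick_weighted_least_connections(servers):
    """servers maps name -> (active_connections, weight);
    the lowest connections/weight score wins."""
    return min(servers, key=lambda s: servers[s][0] / servers[s][1])

servers = {"A": (12, 4), "B": (5, 2), "C": (8, 2)}
# Scores: A = 3.0, B = 2.5, C = 4.0 -> B wins
```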

5. IP Hash

Hashes the client IP to determine the server. Same client always goes to the same server.

  IP Hash:
  
  Client 203.0.113.5  --> hash --> Server A (always)
  Client 198.51.100.2 --> hash --> Server C (always)
  Client 192.0.2.1    --> hash --> Server B (always)

Pros: Session affinity without cookies, stateless LB
Cons: Uneven distribution if IP ranges cluster, fails when server count changes
Best for: When you need basic session affinity at Layer 4
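
A sketch of the hash-then-modulo mapping (md5 is used here only for even distribution, not security; Python's built-in hash() is avoided because it is randomized per process):

```python
import hashlib

def ip_hash(client_ip, servers):
    """Deterministically map a client IP onto one server."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

servers = ["A", "B", "C"]
# The same IP always lands on the same server:
assert ip_hash("203.0.113.5", servers) == ip_hash("203.0.113.5", servers)
```

Because the modulo depends on len(servers), adding or removing a server remaps most clients, which is the weakness consistent hashing addresses.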

6. Consistent Hashing

An advanced hashing algorithm that minimizes redistribution when servers are added or removed.

  Consistent Hash Ring:
  
              S1
             / |  \
            /  |   \
         S4   |    S2
            \  |   /
             \ |  /
              S3
  
  Keys map to points on the ring.
  Each key is handled by the next server clockwise.
  
  If S2 is removed:
  - Only keys between S1 and S2 are reassigned to S3
  - S1, S4 keys are UNAFFECTED
  
  With virtual nodes:
  S1 appears at multiple points on the ring
  --> More even distribution

Pros: Minimal redistribution on topology changes; even distribution with virtual nodes
Cons: More complex to implement
Best for: Caching layers, distributed systems, stateful services
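
The ring above can be sketched with a sorted list and binary search (a minimal sketch with virtual nodes; production rings add per-server weights, replication, and faster hash functions):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []   # sorted list of (hash_point, server)
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add(self, server):
        # Each server occupies `vnodes` points for even distribution
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self.ring = [entry for entry in self.ring if entry[1] != server]

    def get(self, key):
        # First virtual node clockwise from the key's point (wraps around)
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3", "S4"])
keys = [f"user:{i}" for i in range(20)]
before = {k: ring.get(k) for k in keys}
ring.remove("S2")
after = {k: ring.get(k) for k in keys}
# Only the keys that were on S2 move; every other key keeps its server.
```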

7. Least Response Time

Routes to the server with the fastest recent response time.

  Server A: avg response = 45ms
  Server B: avg response = 12ms  <-- next request goes here
  Server C: avg response = 30ms

Best for: When server response time varies significantly
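
A sketch of the selection plus one common way to keep the per-server latency estimate fresh, an exponentially weighted moving average (the alpha value here is an illustrative assumption):

```python
def pick_fastest(avg_latency_ms):
    """Route to the server with the lowest tracked average latency."""
    return min(avg_latency_ms, key=avg_latency_ms.get)

def ewma(current_avg, sample_ms, alpha=0.2):
    """Blend a new latency sample into the running average."""
    return alpha * sample_ms + (1 - alpha) * current_avg

latencies = {"A": 45.0, "B": 12.0, "C": 30.0}
target = pick_fastest(latencies)                    # "B"
latencies[target] = ewma(latencies[target], 20.0)   # fold in a new sample
```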

Algorithm Comparison

  Algorithm             Stateful?  Even Distribution     Handles Heterogeneous Servers  Session Affinity
  --------------------  ---------  --------------------  -----------------------------  ----------------
  Round Robin           No         Yes (if homogeneous)  No                             No
  Weighted Round Robin  No         Yes                   Yes                            No
  Least Connections     Yes        Adaptive              Partial                        No
  IP Hash               No         Depends on IPs        No                             Yes
  Consistent Hashing    No         Yes (with vnodes)     No                             Yes
  Least Response Time   Yes        Adaptive              Yes                            No

Health Checks

Load balancers must detect unhealthy servers and stop routing traffic to them.

Types of Health Checks

1. TCP Health Check (Layer 4)

  LB --> SYN --> Server
  Server --> SYN-ACK --> LB
  
  If SYN-ACK received: server is healthy
  If timeout or RST: server is unhealthy

2. HTTP Health Check (Layer 7)

  LB --> GET /health --> Server
  Server --> 200 OK {"status": "healthy", "db": "ok", "cache": "ok"}
  
  If 200 OK: server is healthy
  If 5xx or timeout: server is unhealthy

3. Deep Health Check

# /health endpoint implementation (FastAPI-style sketch; the check_*
# helpers are app-specific probes for each dependency)
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
def health_check():
    checks = {
        "database": check_db_connection(),
        "cache": check_redis_connection(),
        "disk": check_disk_space(),
        "memory": check_memory_usage(),
    }

    all_healthy = all(checks.values())
    # A 503 tells the load balancer to pull this server from rotation
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        status_code=status_code,
        content={"status": "healthy" if all_healthy else "degraded", "checks": checks},
    )

Health Check Configuration

  Health Check Parameters:
  
  +-- Interval: 10 seconds (check every 10s)
  +-- Timeout: 5 seconds (wait up to 5s for response)
  +-- Unhealthy threshold: 3 (mark unhealthy after 3 failures)
  +-- Healthy threshold: 2 (mark healthy after 2 successes)
  
  Timeline:
  [OK] [OK] [FAIL] [FAIL] [FAIL] --> Mark UNHEALTHY, stop routing
                                      [OK] [OK] --> Mark HEALTHY, resume
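
The threshold logic above can be sketched as a small state machine (one tracker per backend target; the defaults mirror the example parameters):

```python
class HealthTracker:
    """Marks a target unhealthy after N consecutive failed checks and
    healthy again after M consecutive successes."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self.fail_streak = 0
        self.ok_streak = 0

    def record(self, check_passed):
        if check_passed:
            self.ok_streak, self.fail_streak = self.ok_streak + 1, 0
            if not self.healthy and self.ok_streak >= self.healthy_threshold:
                self.healthy = True   # resume routing
        else:
            self.fail_streak, self.ok_streak = self.fail_streak + 1, 0
            if self.healthy and self.fail_streak >= self.unhealthy_threshold:
                self.healthy = False  # stop routing
        return self.healthy

# Replay the timeline: OK OK FAIL FAIL FAIL -> unhealthy, then OK OK -> healthy
tracker = HealthTracker()
history = [tracker.record(ok) for ok in (True, True, False, False, False, True, True)]
```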

Graceful Degradation

  Server draining (graceful shutdown):
  
  1. Server signals "going down" (deregister or fail health check)
  2. LB stops sending NEW requests
  3. Existing connections are allowed to complete
  4. After drain timeout, server is fully removed
  
  No requests are dropped during deployment!

Sticky Sessions (Session Affinity)

Sticky sessions ensure a user's requests always go to the same server. Needed when servers hold session state in memory.

  Without Sticky Sessions:
  
  Request 1 (login)   --> Server A (session created)
  Request 2 (profile) --> Server B (no session! user appears logged out)
  
  
  With Sticky Sessions:
  
  Request 1 (login)   --> Server A (session created)
  Request 2 (profile) --> Server A (same server, session found)
  Request 3 (order)   --> Server A (same server)

Implementation Methods

1. Cookie-Based (Layer 7)

LB sets cookie: Set-Cookie: SERVERID=server-a; Path=/
Subsequent requests include: Cookie: SERVERID=server-a
LB routes to server-a

2. IP-Based (Layer 4)

hash(client_ip) % num_servers = target_server
Same IP always maps to same server

Why Sticky Sessions Are Problematic

  Problem         Description
  --------------  ------------------------------------------
  Uneven load     Popular sessions concentrate on one server
  Server failure  Session lost if server dies
  Scaling         Adding/removing servers disrupts sessions
  State coupling  Servers are no longer interchangeable

Better alternative: Store sessions in a shared store (Redis) and make servers stateless.

  Stateless Servers + Shared Session Store:
  
  Request 1 --> Server A --> Redis (store session)
  Request 2 --> Server B --> Redis (read session)  <-- works!
  Request 3 --> Server C --> Redis (read session)  <-- works!
  
  Any server can handle any request.
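
The pattern can be sketched with an in-memory dict standing in for the shared store (a hypothetical SessionStore class; with Redis the same set-with-TTL / get shape maps onto SETEX and GET):

```python
import json
import time

class SessionStore:
    """In-memory stand-in for a shared session store such as Redis."""

    def __init__(self):
        self._data = {}   # session_id -> (serialized session, expiry timestamp)

    def set(self, session_id, session, ttl_seconds=3600):
        self._data[session_id] = (json.dumps(session), time.time() + ttl_seconds)

    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None or entry[1] < time.time():
            return None   # missing or expired
        return json.loads(entry[0])

store = SessionStore()                  # shared by every server
store.set("abc123", {"user_id": 42})    # Server A writes at login
session_on_b = store.get("abc123")      # Server B reads on the next request
```

Because every server reads and writes the same store, the load balancer is free to use any algorithm, with no stickiness required.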

Global Server Load Balancing (GSLB)

GSLB distributes traffic across multiple data centers or regions worldwide.

  Global Server Load Balancing:
  
  User (Tokyo) --> DNS Query: api.example.com
                       |
                       v
                  GSLB DNS Server
                  "User is in Asia-Pacific"
                  "US-East datacenter: healthy, 150ms"
                  "EU-West datacenter: healthy, 200ms"
                  "AP-Southeast datacenter: healthy, 10ms"  <-- closest
                       |
                       v
                  Return IP of AP-Southeast datacenter
                       |
                       v
  User (Tokyo) --> AP-Southeast Load Balancer --> Servers

GSLB Routing Strategies

  Strategy           How It Works                                   Best For
  -----------------  ---------------------------------------------  -------------------------
  Geographic         Route to nearest region                        Latency-sensitive apps
  Latency-based      Route to fastest region (measured)             Performance optimization
  Failover           Route to primary; switch to backup on failure  Disaster recovery
  Weighted           Distribute % of traffic across regions         Gradual migration, canary
  Round Robin (DNS)  Rotate DNS across regions                      Basic multi-region

DNS-Based vs Anycast

DNS-Based GSLB:

User --> DNS --> Returns regional IP --> User connects to regional LB
                TTL=60s

Pros: Fine-grained control, weighted routing
Cons: DNS caching delays failover, TTL compliance varies

Anycast:

User --> Same IP globally --> BGP routes to nearest PoP

Pros: Instant failover, no DNS TTL issues
Cons: Less control over routing, BGP convergence time

Load Balancer in System Architecture

Common Architecture Patterns

Pattern 1: Single LB (Simple)

  Internet --> LB --> [Server 1, Server 2, Server 3]

Pattern 2: LB Pair with Failover (HA)

  Internet --> Active LB -----> [Servers]
                  |
              Heartbeat
                  |
               Standby LB (takes over if active fails)

Pattern 3: Multi-Tier LB (Large Scale)

  Internet --> L4 LB (NLB) --> L7 LB (ALB) --> [API Servers]
                            --> L7 LB (ALB) --> [Web Servers]
                            --> L7 LB (ALB) --> [Static Servers]

Pattern 4: Service Mesh (Microservices)

  Service A --> Sidecar Proxy (Envoy) --> Service B
  Service A --> Sidecar Proxy (Envoy) --> Service C
  
  Each service has its own proxy that handles load balancing,
  retries, circuit breaking, and observability.
  No central load balancer needed for east-west traffic.

Load Balancer as Single Point of Failure

  Problem: LB itself can fail
  
  Solution 1: Active-Passive (VRRP/Keepalived)
  +----------+     +----------+
  | Active   |<--->| Passive  |
  | LB       |     | LB       |
  +-----+----+     +-----+----+
        |                |
  Virtual IP (VIP) floats between them
  
  Solution 2: DNS Round Robin to multiple LBs
  api.example.com --> 10.0.0.1 (LB1)
                  --> 10.0.0.2 (LB2)
  
  Solution 3: Cloud-managed LB (AWS ALB/NLB)
  AWS manages redundancy automatically across AZs

Load Balancer Technologies

  Technology     Type      Layer  Best For
  -------------  --------  -----  --------------------------------
  AWS ALB        Cloud     L7     HTTP/HTTPS, microservices
  AWS NLB        Cloud     L4     High-throughput TCP/UDP
  Nginx          Software  L4/L7  Web servers, reverse proxy
  HAProxy        Software  L4/L7  High-performance load balancing
  Envoy          Software  L7     Service mesh, gRPC
  F5 BIG-IP      Hardware  L4/L7  Enterprise, on-premise
  Cloudflare LB  Cloud     L7     Global load balancing
  Traefik        Software  L7     Container-native, auto-discovery

Key Takeaways

  1. Layer 4 is for raw TCP/UDP performance; Layer 7 is for content-aware routing
  2. Round robin is the simplest algorithm; least connections adapts to variable loads
  3. Consistent hashing minimizes disruption when servers change -- essential for caches
  4. Health checks are non-negotiable -- unhealthy servers must be removed automatically
  5. Avoid sticky sessions -- prefer stateless servers with shared session storage
  6. GSLB distributes traffic globally using DNS or Anycast
  7. The LB itself must be highly available -- use active-passive pairs or cloud-managed LBs
  8. In interviews, always mention which layer and algorithm you would choose, and why

Explain-It Challenge

"You are designing a video streaming service that serves 1 million concurrent viewers worldwide. The service has a mix of live streams (latency-sensitive, WebSocket-based) and on-demand videos (HTTP-based, cacheable). Design the load balancing strategy at every layer: global routing, regional load balancing, and service-level routing. Explain your choice of algorithm at each layer."