Episode 9 — System Design / 9.9 — Core Infrastructure
9.9.e Message Queues
Why Asynchronous Communication?
In a synchronous system, the caller waits for the callee to respond. This creates tight coupling, and if any service is slow or down, the entire chain suffers.
SYNCHRONOUS (Tight Coupling):
User --> Order Service --> Payment Service --> Notification Service
              |                  |                    |
            WAITS              WAITS                WAITS
If Payment Service takes 5 seconds, user waits 5+ seconds.
If Notification Service is down, the entire order fails.
ASYNCHRONOUS (Loose Coupling):
User --> Order Service --> Message Queue
                                |
                 +--------------+--------------+
                 |              |              |
                 v              v              v
             Payment        Inventory     Notification
             Service         Service        Service
Order Service responds immediately.
Downstream services process at their own pace.
If Notification Service is down, messages wait in the queue.
Benefits of Async Communication
| Benefit | Description |
|---|---|
| Decoupling | Producer does not need to know about consumers |
| Resilience | Messages survive if a consumer is temporarily down |
| Scalability | Consumers can scale independently based on queue depth |
| Load leveling | Queue absorbs traffic spikes; consumers process at steady rate |
| Guaranteed delivery | Messages are persisted until acknowledged |
| Temporal decoupling | Producer and consumer do not need to be online simultaneously |
Queue vs Topic
These are two fundamental messaging patterns with different semantics.
Queue (Point-to-Point)
Each message is consumed by exactly one consumer.
Queue (Point-to-Point):
Producer --> [MSG-3][MSG-2][MSG-1] --> Consumer A (gets MSG-1)
                                   --> Consumer B (gets MSG-2)
                                   --> Consumer C (gets MSG-3)
Messages are distributed across consumers.
Each message is processed ONCE by ONE consumer.
Use case: Task processing, job queues, order processing
Topic (Pub/Sub)
Each message is delivered to all subscribers.
Topic (Publish/Subscribe):
Publisher --> [MSG-1] --> Topic --> Subscriber A (gets MSG-1)
                                --> Subscriber B (gets MSG-1)
                                --> Subscriber C (gets MSG-1)
ALL subscribers receive EVERY message.
Each subscriber processes independently.
Use case: Event broadcasting, notifications, real-time updates
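The fan-out semantics above can be sketched with a minimal in-memory topic. This is an illustration of the pattern, not a real broker API; the `Topic` class and its method names are invented for the example.

```python
from collections import defaultdict

class Topic:
    """Minimal in-memory topic: every subscriber receives every message."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        # Fan-out: deliver the message to ALL subscribers, independently.
        for callback in self.subscribers:
            callback(message)

# Three independent subscribers all see the same event.
received = defaultdict(list)
topic = Topic()
for name in ("email", "inventory", "analytics"):
    topic.subscribe(lambda msg, name=name: received[name].append(msg))

topic.publish({"event": "order.placed", "order_id": 123})
```

Note that adding a fourth subscriber would require no change to the publisher side, which is the decoupling property described above.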
Comparison
| Aspect | Queue | Topic |
|---|---|---|
| Delivery | One consumer per message | All subscribers per message |
| Pattern | Competing consumers | Fan-out |
| Use case | Work distribution | Event broadcasting |
| Scaling | Add consumers to increase throughput | Each subscriber scales independently |
| Example | "Process this order" | "An order was placed" |
RabbitMQ vs Kafka vs SQS
RabbitMQ
A traditional message broker that supports multiple messaging patterns.
RabbitMQ Architecture:
Producer --> Exchange --> Binding --> Queue --> Consumer
Exchange Types:
- Direct: route by exact routing key
- Topic: route by pattern matching (e.g., order.*)
- Fanout: broadcast to all bound queues
- Headers: route by message headers
Example (Direct Exchange):
Producer: publish(exchange="orders", routing_key="payment")
Exchange "orders":
routing_key="payment" --> Queue: payment-queue --> Payment Consumer
routing_key="inventory" --> Queue: inventory-queue --> Inventory Consumer
routing_key="email" --> Queue: email-queue --> Email Consumer
Key characteristics:
- Push-based: broker pushes messages to consumers
- Per-message acknowledgment
- Rich routing via exchange types
- Messages are deleted after consumption
- Supports priorities, TTL, dead letter queues
- Written in Erlang (excellent concurrency)
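The routing behavior of direct, topic, and fanout exchanges can be simulated in a few lines. This is an in-memory sketch of the semantics, not the real AMQP client API (for that, see a library such as pika); also, real topic exchanges match `.`-separated words with `*` and `#`, while this sketch approximates patterns with `fnmatch`.

```python
import fnmatch

class Exchange:
    """In-memory sketch of RabbitMQ-style exchange routing (not a real broker)."""
    def __init__(self, type_):
        self.type = type_          # "direct", "topic", or "fanout"
        self.bindings = []         # (binding_key, queue) pairs

    def bind(self, binding_key, queue):
        self.bindings.append((binding_key, queue))

    def publish(self, routing_key, message):
        for binding_key, queue in self.bindings:
            if self.type == "fanout":
                queue.append(message)                 # broadcast to every bound queue
            elif self.type == "direct" and binding_key == routing_key:
                queue.append(message)                 # exact routing-key match
            elif self.type == "topic" and fnmatch.fnmatch(routing_key, binding_key):
                queue.append(message)                 # pattern match, e.g. "order.*"

payment_q, inventory_q, audit_q = [], [], []

orders = Exchange("direct")
orders.bind("payment", payment_q)
orders.bind("inventory", inventory_q)
orders.publish("payment", "charge order 123")         # only payment_q receives this

events = Exchange("topic")
events.bind("order.*", audit_q)
events.publish("order.placed", "order 123 placed")    # matches the "order.*" binding
```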
Apache Kafka
A distributed event streaming platform designed for high-throughput, ordered, durable event streams.
Kafka Architecture:
+------------------------------------------------------------------+
| Kafka Cluster |
| |
| Topic: "orders" (3 partitions) |
| |
| Partition 0: [msg-0][msg-3][msg-6][msg-9] --> Consumer A |
| Partition 1: [msg-1][msg-4][msg-7] --> Consumer B |
| Partition 2: [msg-2][msg-5][msg-8] --> Consumer C |
| |
| Consumer Group "order-processing": |
| Each partition assigned to exactly one consumer in the group. |
| Messages within a partition are strictly ordered. |
+------------------------------------------------------------------+
Producer: send(topic="orders", key="user-42", value={...})
Key determines partition: hash("user-42") % 3 = partition 1
All messages for user-42 go to partition 1 (ordered!)
Key characteristics:
- Pull-based: consumers pull messages at their own pace
- Messages are retained (configurable retention, e.g., 7 days)
- Consumers track their offset (position in the log)
- Partitions enable parallelism and ordering
- Replication for fault tolerance
- Consumer groups for load distribution
- Can replay messages (rewind offset)
- Extremely high throughput (millions of messages/sec)
Amazon SQS
A fully managed message queue service from AWS.
SQS Architecture:
Standard Queue:
Producer --> SQS Queue --> Consumer
- At-least-once delivery
- Best-effort ordering
- Nearly unlimited throughput
FIFO Queue:
Producer --> SQS FIFO Queue --> Consumer
- Exactly-once processing
- Strict ordering within message group
- 3,000 msg/sec with batching
Key characteristics:
- Fully managed (no infrastructure to operate)
- Pay-per-message pricing
- Standard queues: unlimited throughput, at-least-once, best-effort order
- FIFO queues: exactly-once, strict order, limited throughput
- Visibility timeout (prevents double processing)
- Dead letter queues built-in
- Long polling for efficiency
Feature Comparison
| Feature | RabbitMQ | Kafka | SQS |
|---|---|---|---|
| Type | Message broker | Event streaming platform | Managed queue |
| Delivery model | Push | Pull | Pull (long polling) |
| Ordering | Per-queue FIFO | Per-partition FIFO | Best-effort (Standard) / Strict (FIFO) |
| Throughput | ~50K msg/sec | Millions msg/sec | Unlimited (Standard) / 3K (FIFO) |
| Message retention | Until consumed | Configurable (days/weeks) | Up to 14 days |
| Replay | No (message deleted) | Yes (rewind offset) | No |
| Routing | Rich (exchanges) | Topic + partition key | Simple (queue per consumer) |
| Exactly-once | No (at-least-once) | Yes (with transactions) | Yes (FIFO only) |
| Operations | Self-managed | Self-managed or managed (Confluent, MSK) | Fully managed |
| Best for | Complex routing, RPC | Event sourcing, log aggregation, streaming | Simple async tasks, AWS-native |
| Protocols | AMQP, MQTT, STOMP | Custom (Kafka protocol) | HTTP/HTTPS (AWS API) |
| Consumer groups | Competing consumers | Native support | Not built-in (use multiple queues) |
Decision Guide
Which Message System?
Need event streaming / replay / high throughput?
|
+-- Yes --> Kafka
|
Need complex routing (topic exchanges, headers)?
|
+-- Yes --> RabbitMQ
|
Need fully managed, simple queue, AWS-native?
|
+-- Yes --> SQS
|
Need real-time pub/sub with low latency?
|
+-- Yes --> RabbitMQ or Redis Pub/Sub
|
Need to process millions of events for analytics?
|
+-- Yes --> Kafka
Producer-Consumer Pattern
The most fundamental messaging pattern. Producers create messages; consumers process them.
Producer-Consumer:
+----------+      +---------+      +----------+
| Producer |----->|  Queue  |----->| Consumer |
| (Order   |      | [m3][m2]|      | (Payment |
|  Service)|      | [m1]    |      |  Worker) |
+----------+      +---------+      +----------+
Multiple producers, multiple consumers:
Producer A --+                 +--> Consumer 1
             |    +---------+  |
Producer B --+--->|  Queue  |--+--> Consumer 2
             |    +---------+  |
Producer C --+                 +--> Consumer 3
Messages are distributed among consumers (competing consumers).
Each message is processed by exactly one consumer.
Acknowledgment and Visibility
Message Lifecycle:
1. Producer sends message to queue
2. Message is persisted in queue
3. Consumer receives message (message becomes "invisible")
4. Consumer processes message
5a. SUCCESS: Consumer ACKs --> message is deleted
5b. FAILURE: Consumer NACKs or timeout --> message becomes visible again
Visibility Timeout (SQS) / Acknowledgment (RabbitMQ):

Time:        0s      10s     20s     30s     40s     50s     60s
             |       |       |       |       |       |       |
Recv         |------>|
Processing           |-------------->|
ACK                                  X   (message deleted)

If no ACK before the timeout expires:

Recv         |------>|
Processing           |------------------------------>|   (too slow)
TIMEOUT                                      X   (message becomes visible again for retry)
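The receive/ack/timeout lifecycle can be sketched with a small in-memory queue. This models the semantics only; the `VisibilityQueue` class and its methods are invented for illustration, not the SQS API. Time is passed in explicitly so the redelivery behavior is easy to see.

```python
import time

class VisibilityQueue:
    """Sketch of SQS-style receive/ack with a visibility timeout."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}          # msg_id -> [body, invisible_until]
        self._next_id = 0

    def send(self, body):
        self.messages[self._next_id] = [body, 0.0]   # visible immediately
        self._next_id += 1

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg_id, entry in self.messages.items():
            body, invisible_until = entry
            if invisible_until <= now:               # visible (or never received)
                entry[1] = now + self.visibility_timeout
                return msg_id, body
        return None

    def ack(self, msg_id):
        del self.messages[msg_id]                    # deleted only on explicit ACK

q = VisibilityQueue(visibility_timeout=30.0)
q.send("process order 123")

msg_id, body = q.receive(now=0.0)        # message becomes invisible until t=30
assert q.receive(now=10.0) is None       # still invisible: no double delivery
redelivered = q.receive(now=31.0)        # no ACK arrived -> visible again
q.ack(redelivered[0])                    # second attempt succeeds and ACKs
```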
Publish/Subscribe Pattern
A message is broadcast to all interested subscribers.
Pub/Sub Architecture:
+----------+      Topic:          +---> Subscriber A: Email Service
|  Event:  |      order.placed    |
|  "Order  |---------------------->+---> Subscriber B: Inventory Service
|  Placed" |                      |
+----------+                      +---> Subscriber C: Analytics Service
                                  |
                                  +---> Subscriber D: Notification Service
ALL subscribers receive the event.
Each processes independently.
Adding a new subscriber requires NO changes to the publisher.
Kafka Consumer Groups Enable Both Patterns
Kafka Topic: "orders" (3 partitions)
Consumer Group A (order-processing):
Consumer A1 <-- Partition 0
Consumer A2 <-- Partition 1
Consumer A3 <-- Partition 2
(Point-to-point within this group)
Consumer Group B (analytics):
Consumer B1 <-- Partition 0, 1, 2
(Receives ALL messages)
Consumer Group C (audit-log):
Consumer C1 <-- Partition 0, 1
Consumer C2 <-- Partition 2
(Point-to-point within this group)
Result: Each group gets ALL messages (pub/sub between groups).
Within a group, messages are distributed (competing consumers).
Dead Letter Queues (DLQ)
A dead letter queue captures messages that fail processing after a maximum number of retries.
Dead Letter Queue Flow:
Main Queue                              DLQ
+--------+                              +--------+
| MSG-1  | --> Consumer --> FAIL (1)    |        |
|        | --> Consumer --> FAIL (2)    |        |
|        | --> Consumer --> FAIL (3)    | MSG-1  |  (moved after 3 failures)
+--------+                              +--------+
DLQ allows:
- Investigation of why messages failed
- Manual reprocessing after fixing the bug
- Alerting on DLQ depth (something is wrong!)
- Preventing poison messages from blocking the queue
DLQ Best Practices:
| Practice | Why |
|---|---|
| Set max retries (3-5) | Prevent infinite retry loops |
| Alert on DLQ depth | Catch processing failures early |
| Preserve original message + error | Enable debugging |
| Set DLQ retention longer than main queue | Give time to investigate |
| Implement DLQ reprocessing | Replay messages after fixing bugs |
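A minimal sketch of the retry-then-dead-letter flow, using plain lists as queues. The function name and message shape are invented for the example; a real broker tracks the attempt count for you (e.g., SQS's `ApproximateReceiveCount`). Note that the failed message is preserved together with its last error, per the best practices above.

```python
MAX_RETRIES = 3

def consume_with_dlq(queue, dlq, handler):
    """Retry each message up to MAX_RETRIES, then move it to the DLQ."""
    while queue:
        message = queue.pop(0)
        attempts = message.setdefault("attempts", 0)
        try:
            handler(message["body"])              # ACK is implicit on success
        except Exception as exc:
            message["attempts"] = attempts + 1
            if message["attempts"] >= MAX_RETRIES:
                # Preserve the original body plus the last error for debugging.
                dlq.append({**message, "error": repr(exc)})
            else:
                queue.append(message)             # requeue for another attempt

# A poison message fails every attempt; a good one succeeds immediately.
main_q = [{"body": "good"}, {"body": "poison"}]
dlq = []

def handler(body):
    if body == "poison":
        raise ValueError("cannot parse")

consume_with_dlq(main_q, dlq, handler)
```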
Ordering Guarantees
Why Ordering Matters
Scenario: Bank account updates
Message 1: Deposit $100 (balance: $100)
Message 2: Withdraw $80 (balance: $20)
If processed out of order:
Message 2: Withdraw $80 (balance: -$80 ???) <-- REJECTED or negative balance!
Message 1: Deposit $100 (balance: $100)
Ordering by System
| System | Ordering Guarantee |
|---|---|
| Kafka | Strict order per partition. Use partition key to group related messages. |
| RabbitMQ | FIFO per queue (single consumer). Lost with competing consumers. |
| SQS Standard | Best-effort ordering (messages can arrive out of order) |
| SQS FIFO | Strict ordering per message group ID |
Kafka Partition Ordering
Topic: "account-events"
Partition key: account_id
Account 42: [deposit $100] --> [withdraw $80] --> [deposit $50]
All go to partition hash("42") % N = partition 2
Processed in exact order by one consumer.
Account 99: [deposit $200] --> [withdraw $150]
All go to partition hash("99") % N = partition 0
Processed in exact order by a different consumer.
Account 42 and Account 99 events are NOT ordered relative to each other.
(They do not need to be -- they are independent.)
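The key-to-partition mapping above can be sketched directly. Kafka's default partitioner actually uses murmur2; CRC32 is used here only as a stand-in for any stable hash. The point demonstrated is that every event for a given account lands on the same partition, in send order.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key):
    """Stable key -> partition mapping (Kafka uses murmur2; CRC32 as a stand-in)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("42", "deposit $100"),
    ("99", "deposit $200"),
    ("42", "withdraw $80"),
    ("99", "withdraw $150"),
    ("42", "deposit $50"),
]
for account_id, event in events:
    # Same key -> same partition every time, so per-account order is preserved.
    partitions[partition_for(account_id)].append(event)

p42 = partition_for("42")   # the partition all of account 42's events went to
```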
Exactly-Once Delivery
The holy grail of messaging. Three delivery semantics exist:
At-Most-Once
Message may be lost but never duplicated.
Producer --> Queue --> Consumer
Producer sends, does not wait for ACK.
If message is lost in transit, it is gone.
Use case: Metrics, logs (losing a few is OK)
At-Least-Once
Message is guaranteed to be delivered but may be duplicated.
Producer --> Queue --> Consumer
If consumer fails after processing but before ACK:
Message is redelivered --> processed AGAIN (duplicate!)
Most common semantic. Requires idempotent consumers.
Exactly-Once
Message is delivered and processed exactly once. Hard to achieve.
Approaches:
1. Idempotent Consumer (most practical):
Consumer checks: "Have I processed message ID abc123?"
If yes: skip (deduplicate)
If no: process and record message ID
+-----------------------------------------------------+
| Consumer |
| |
| msg_id = message.id |
| if msg_id in processed_set: |
| return # already processed, skip |
| else: |
| process(message) |
| processed_set.add(msg_id) |
| ack(message) |
+-----------------------------------------------------+
2. Kafka Transactions (Kafka-specific):
Producer uses transactional API.
Consumer uses read_committed isolation.
Exactly-once within the Kafka ecosystem.
3. Outbox Pattern (database + queue):
Write to DB and outbox table in same transaction.
Separate process reads outbox and publishes to queue.
Guarantees consistency between DB and queue.
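Approach 1 (the idempotent consumer) is small enough to sketch in full. This toy version keeps the processed-ID set in memory; in production the set and the side effect would live in the same database transaction, otherwise a crash between the two steps reintroduces duplicates. The function and variable names are illustrative.

```python
processed_ids = set()
side_effects = []           # stand-in for the real work (charging a card, etc.)

def handle(message):
    """Idempotent consumer: deduplicate on message ID before doing side effects."""
    if message["id"] in processed_ids:
        return "skipped"                      # duplicate delivery: do nothing
    side_effects.append(message["body"])      # perform the real work
    processed_ids.add(message["id"])          # record AFTER the work succeeds
    return "processed"

# At-least-once delivery can hand us the same message twice:
msg = {"id": "abc123", "body": "charge $80"}
first = handle(msg)
second = handle(msg)        # redelivery after a missed ACK
```

Because the consumer deduplicates, at-least-once delivery from the broker becomes effectively exactly-once processing, which is why this combination is called the most practical approach.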
Delivery Guarantee Comparison
| Semantic | Message Loss | Duplication | Complexity | Use Case |
|---|---|---|---|---|
| At-most-once | Possible | Never | Low | Metrics, non-critical logs |
| At-least-once | Never | Possible | Medium | Most business operations |
| Exactly-once | Never | Never | High | Financial transactions, critical operations |
Backpressure
Backpressure is a mechanism to slow down producers when consumers cannot keep up.
Without Backpressure:
Producer (1000 msg/s) --> Queue (growing!) --> Consumer (100 msg/s)
Queue grows unboundedly:
t=0: 0 messages
t=10: 9,000 messages
t=60: 54,000 messages
t=300: 270,000 messages --> OUT OF MEMORY!
With Backpressure:
Producer (1000 msg/s) --> Queue (bounded, max 10,000)
Queue full? Producer options:
1. Block (wait until space available)
2. Drop (discard message)
3. Buffer locally and retry
4. Signal upstream to slow down (HTTP 429)
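Option 2 (drop when full) is easy to demonstrate with Python's bounded `queue.Queue`; `put_nowait` raises `queue.Full` at the boundary instead of letting the backlog grow without limit. The `produce` helper is invented for the example.

```python
import queue

bounded = queue.Queue(maxsize=3)    # the backpressure boundary
dropped = []

def produce(msg):
    """Shed load when the queue is full (alternatives: block, buffer, return 429)."""
    try:
        bounded.put_nowait(msg)
        return True
    except queue.Full:
        dropped.append(msg)         # make the drop visible so it can be monitored
        return False

for i in range(5):                  # producer outpaces the (absent) consumer
    produce(i)
```

Swapping `put_nowait` for a blocking `put(msg, timeout=...)` turns the same boundary into option 1 (block the producer) instead.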
Backpressure Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| Bounded queue | Queue has max size; rejects when full | May drop messages |
| Rate limiting producers | Limit producer send rate | Slows entire pipeline |
| Consumer scaling | Auto-scale consumers based on queue depth | Takes time, costs money |
| Priority queues | Process high-priority messages first | Low-priority may starve |
| Load shedding | Drop non-essential messages under load | Acceptable for some use cases |
Auto-Scaling Based on Queue Depth (example policy):
  Queue depth > 10,000:      10 consumers (scale up more)
  Queue depth 1,000-10,000:  5 consumers (scale up)
  Queue depth < 500:         2 consumers (scale down)
  Note: the scale-down threshold (500) is lower than the scale-up
  threshold (1,000) so the system does not flap around one boundary.
CloudWatch Alarm --> Auto Scaling Group --> Add/Remove Consumers
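A policy like this is just a function from queue depth to desired consumer count. The sketch below uses the thresholds above and adds a hold band between scale-down and scale-up thresholds so the fleet does not flap; the function name and exact thresholds are illustrative.

```python
def desired_consumers(queue_depth, current):
    """Map queue depth to a consumer count, with hysteresis to avoid flapping."""
    if queue_depth > 10_000:
        return 10                   # heavy backlog: scale up more
    if queue_depth >= 1_000:
        return 5                    # growing backlog: scale up
    if queue_depth < 500:
        return 2                    # clearly drained: scale back down
    return current                  # 500-999: hold steady (hysteresis band)
```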
When to Use Queues in System Design
Use a Queue When...
| Scenario | Example | Why Queue Helps |
|---|---|---|
| Async processing | Order placed --> send email | User does not wait for email |
| Workload smoothing | Flash sale spike | Queue absorbs burst; workers process steadily |
| Service decoupling | Order service --> Inventory service | Services evolve independently |
| Retry handling | Payment processing | Failed payments retry automatically |
| Fan-out | New blog post --> notify all followers | Queue distributes to notification workers |
| Rate limiting | API calls to external service | Queue controls outbound request rate |
| Batch processing | Aggregate events for analytics | Collect in queue, process in batches |
Do NOT Use a Queue When...
| Scenario | Why Not |
|---|---|
| Synchronous response needed | User needs immediate result (e.g., login response) |
| Simple request-response | Adding a queue adds unnecessary complexity |
| Low latency critical | Queue adds milliseconds of latency |
| Data consistency required immediately | Queue introduces eventual consistency |
| Small, simple system | Overkill; direct HTTP calls suffice |
Common Queue Patterns in System Design
Pattern 1: Work Queue (Task Distribution)
Image Upload Service:
User uploads image --> API Server --> Queue: "image-processing"
                                            |
                           +--------+-------+--------+
                           |        |       |        |
                           v        v       v        v
                        Worker   Worker  Worker   Worker
                             (resize, compress, thumbnail)
                                        |
                                        v
                             S3 (store processed images)
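A work queue like this maps directly onto Python's thread-safe `queue.Queue` with a pool of worker threads as competing consumers. The "image processing" is faked here; the structure (shared queue, N workers, one sentinel per worker for shutdown) is the point.

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    """Competing consumer: each job is processed by exactly one worker."""
    while True:
        image = jobs.get()
        if image is None:                  # sentinel: shut this worker down
            jobs.task_done()
            return
        with lock:                         # results list is shared across threads
            results.append(f"thumbnail of {image}")
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(6):
    jobs.put(f"img-{i}.png")               # producer: API server enqueues uploads
for _ in threads:
    jobs.put(None)                         # one shutdown sentinel per worker
for t in threads:
    t.join()
```

Scaling throughput means adding workers, not touching the producer, which is the competing-consumers property from earlier in this section.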
Pattern 2: Event-Driven Architecture
E-Commerce Event Flow:
Order Service: publish("order.placed", {order_id: 123, ...})
        |
        v
  Event Bus (Kafka)
        |
        +---> Inventory Service: subscribe("order.placed")
        |         --> Reserve items
        |
        +---> Payment Service: subscribe("order.placed")
        |         --> Charge customer
        |
        +---> Email Service: subscribe("order.placed")
        |         --> Send confirmation
        |
        +---> Analytics Service: subscribe("order.placed")
                  --> Record metrics
Pattern 3: Saga Pattern (Distributed Transactions)
Order Saga (Choreography):
1. Order Service: create order --> publish "order.created"
2. Payment Service: charge card --> publish "payment.success"
3. Inventory Service: reserve items --> publish "inventory.reserved"
4. Shipping Service: create shipment --> publish "shipment.created"
If step 3 fails (out of stock):
3. Inventory Service: publish "inventory.failed"
--> Payment Service: refund charge (compensating transaction)
--> Order Service: cancel order (compensating transaction)
Each step communicates via messages. No distributed lock needed.
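The choreography above, including the compensation path, can be traced with a toy synchronous event bus standing in for Kafka or RabbitMQ. The event names follow the saga steps above; the handler functions and the `in_stock` flag are invented for the example.

```python
log = []

def on_order_created(event, bus):
    log.append("payment charged")                 # step 2
    bus("payment.success", event)

def on_payment_success(event, bus):
    if event["in_stock"]:
        log.append("inventory reserved")          # step 3, happy path
        bus("inventory.reserved", event)
    else:
        log.append("inventory failed")            # step 3 fails: out of stock
        bus("inventory.failed", event)

def on_inventory_failed(event, bus):
    # Compensating transactions: undo the earlier steps in reverse.
    log.append("payment refunded")
    log.append("order cancelled")

handlers = {
    "order.created": on_order_created,
    "payment.success": on_payment_success,
    "inventory.failed": on_inventory_failed,
}

def bus(event_name, event):
    """Toy synchronous event bus; a real saga would go through a broker."""
    if event_name in handlers:
        handlers[event_name](event, bus)

# Simulate an order for an out-of-stock item:
bus("order.created", {"order_id": 1, "in_stock": False})
```

No service here holds a distributed lock or calls another service directly; each one only reacts to events, which is what makes the saga resilient to individual failures.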
Pattern 4: CQRS with Event Sourcing
Command Side:                          Query Side:

Write API --> Command Handler          Read API --> Query Handler
                   |                                     |
                   v                                     v
         Event Store (Kafka)              Read Database (Elasticsearch)
                   |                                     ^
                   +--- Events --------------------------+
                        (projected into read model)
Message Queue Best Practices
| Practice | Details |
|---|---|
| Make consumers idempotent | Process same message twice without side effects |
| Use dead letter queues | Capture failed messages for investigation |
| Set message TTL | Prevent stale messages from being processed |
| Monitor queue depth | Alert when queues grow unexpectedly |
| Use correlation IDs | Track a request across multiple services |
| Batch when possible | Reduce overhead by processing messages in batches |
| Test failure scenarios | What happens when consumer crashes mid-processing? |
| Version your message schemas | Allow producers and consumers to evolve independently |
| Start simple | SQS or RabbitMQ before Kafka unless you need streaming |
Key Takeaways
- Async communication decouples services, improves resilience, and smooths load
- Queues are point-to-point (one consumer); topics are pub/sub (all subscribers)
- Kafka for high-throughput event streaming; RabbitMQ for routing; SQS for simplicity
- At-least-once + idempotent consumers is the most practical delivery guarantee
- Dead letter queues are essential -- always configure them
- Ordering is per-partition (Kafka) or per-queue -- use partition keys wisely
- Backpressure prevents queues from growing unboundedly
- Use queues when you need decoupling, async processing, or load leveling
- Do not use queues when you need synchronous responses or immediate consistency
Explain-It Challenge
"You are designing a food delivery platform. When a customer places an order, you need to: (1) charge their card, (2) notify the restaurant, (3) find a driver, (4) send the customer a confirmation, and (5) update analytics. Some of these must succeed for the order to be valid; others are best-effort. Design the messaging architecture, specifying which steps are synchronous, which are async via queues, how you handle failures at each step, and what message broker you would choose."