Episode 9 — System Design / 9.9 — Core Infrastructure

9.9.e Message Queues

Why Asynchronous Communication?

In a synchronous system, the caller blocks until the callee responds. This creates tight coupling: if any service in the chain is slow or down, the entire chain suffers.

  SYNCHRONOUS (Tight Coupling):
  
  User --> Order Service --> Payment Service --> Notification Service
                  |                  |                    |
               WAITS              WAITS                WAITS
               
  If Payment Service takes 5 seconds, user waits 5+ seconds.
  If Notification Service is down, the entire order fails.
  
  
  ASYNCHRONOUS (Loose Coupling):
  
  User --> Order Service --> Message Queue
                                |
                      +----+----+----+
                      |    |         |
                      v    v         v
                  Payment  Inventory  Notification
                  Service  Service    Service
                  
  Order Service responds immediately.
  Downstream services process at their own pace.
  If Notification Service is down, messages wait in the queue.

Benefits of Async Communication

  Benefit             | Description
  --------------------+---------------------------------------------------------------
  Decoupling          | Producer does not need to know about consumers
  Resilience          | Messages survive if a consumer is temporarily down
  Scalability         | Consumers can scale independently based on queue depth
  Load leveling       | Queue absorbs traffic spikes; consumers process at a steady rate
  Guaranteed delivery | Messages are persisted until acknowledged
  Temporal decoupling | Producer and consumer do not need to be online simultaneously

Queue vs Topic

These are two fundamental messaging patterns with different semantics.

Queue (Point-to-Point)

Each message is consumed by exactly one consumer.

  Queue (Point-to-Point):
  
  Producer --> [MSG-3][MSG-2][MSG-1] --> Consumer A (gets MSG-1)
                                    --> Consumer B (gets MSG-2)
                                    --> Consumer C (gets MSG-3)
  
  Messages are distributed across consumers.
  Each message is processed ONCE by ONE consumer.
  
  Use case: Task processing, job queues, order processing

Topic (Pub/Sub)

Each message is delivered to all subscribers.

  Topic (Publish/Subscribe):
  
  Publisher --> [MSG-1] --> Topic --> Subscriber A (gets MSG-1)
                                 --> Subscriber B (gets MSG-1)
                                 --> Subscriber C (gets MSG-1)
  
  ALL subscribers receive EVERY message.
  Each subscriber processes independently.
  
  Use case: Event broadcasting, notifications, real-time updates

Comparison

  Aspect   | Queue                                 | Topic
  ---------+---------------------------------------+--------------------------------------
  Delivery | One consumer per message              | All subscribers per message
  Pattern  | Competing consumers                   | Fan-out
  Use case | Work distribution                     | Event broadcasting
  Scaling  | Add consumers to increase throughput  | Each subscriber scales independently
  Example  | "Process this order"                  | "An order was placed"
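The two delivery models can be sketched in a few lines of plain Python. This is an illustrative in-memory model, not a real broker: a queue hands each message to exactly one consumer, while a topic fans every message out to all subscribers.

```python
from collections import deque

# Queue (point-to-point): each message goes to exactly ONE consumer.
work_queue = deque(["MSG-1", "MSG-2", "MSG-3"])
consumers = {"A": [], "B": [], "C": []}
for name in consumers:                  # consumers compete for messages
    consumers[name].append(work_queue.popleft())

# Topic (pub/sub): each message goes to ALL subscribers.
subscribers = {"A": [], "B": [], "C": []}
for msg in ["MSG-1", "MSG-2", "MSG-3"]:
    for inbox in subscribers.values():  # broadcast to every subscriber
        inbox.append(msg)

print(consumers)    # each consumer received a different message
print(subscribers)  # every subscriber received every message
```

The queue drains as it is read; the topic delivers the same message repeatedly, once per subscriber.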

RabbitMQ vs Kafka vs SQS

RabbitMQ

A traditional message broker that supports multiple messaging patterns.

  RabbitMQ Architecture:
  
  Producer --> Exchange --> Binding --> Queue --> Consumer
  
  Exchange Types:
  - Direct:  route by exact routing key
  - Topic:   route by pattern matching (e.g., order.*)
  - Fanout:  broadcast to all bound queues
  - Headers: route by message headers
  
  
  Example (Direct Exchange):
  
  Producer: publish(exchange="orders", routing_key="payment")
  
  Exchange "orders":
    routing_key="payment"  --> Queue: payment-queue  --> Payment Consumer
    routing_key="inventory" --> Queue: inventory-queue --> Inventory Consumer
    routing_key="email"    --> Queue: email-queue     --> Email Consumer

Key characteristics:

  • Push-based: broker pushes messages to consumers
  • Per-message acknowledgment
  • Rich routing via exchange types
  • Messages are deleted after consumption
  • Supports priorities, TTL, dead letter queues
  • Written in Erlang (excellent concurrency)

Apache Kafka

A distributed event streaming platform designed for high-throughput, ordered, durable event streams.

  Kafka Architecture:
  
  +------------------------------------------------------------------+
  |  Kafka Cluster                                                    |
  |                                                                   |
  |  Topic: "orders" (3 partitions)                                   |
  |                                                                   |
  |  Partition 0: [msg-0][msg-3][msg-6][msg-9]  --> Consumer A       |
  |  Partition 1: [msg-1][msg-4][msg-7]         --> Consumer B       |
  |  Partition 2: [msg-2][msg-5][msg-8]         --> Consumer C       |
  |                                                                   |
  |  Consumer Group "order-processing":                               |
  |  Each partition assigned to exactly one consumer in the group.    |
  |  Messages within a partition are strictly ordered.                |
  +------------------------------------------------------------------+
  
  Producer: send(topic="orders", key="user-42", value={...})
  Key determines partition: hash("user-42") % 3 = partition 1
  All messages for user-42 go to partition 1 (ordered!)
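The key-to-partition mapping can be sketched as follows. CRC32 stands in for Kafka's actual default partitioner (murmur2); the principle is the same: hash the key, take it modulo the partition count, and the same key always lands on the same partition.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Deterministic hash -> same key always maps to the same partition.
    # (Kafka's default partitioner uses murmur2; CRC32 is a stand-in here.)
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# All events for one user hit one partition, so they stay ordered.
p = partition_for("user-42")
assert all(partition_for("user-42") == p for _ in range(100))

# Different keys spread across the partitions.
print({k: partition_for(k) for k in ["user-42", "user-7", "user-99"]})
```

Note that Python's built-in `hash()` is randomized per process, which is why a stable hash like CRC32 is used for the sketch.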

Key characteristics:

  • Pull-based: consumers pull messages at their own pace
  • Messages are retained (configurable retention, e.g., 7 days)
  • Consumers track their offset (position in the log)
  • Partitions enable parallelism and ordering
  • Replication for fault tolerance
  • Consumer groups for load distribution
  • Can replay messages (rewind offset)
  • Extremely high throughput (millions of messages/sec)

Amazon SQS

A fully managed message queue service from AWS.

  SQS Architecture:
  
  Standard Queue:
  Producer --> SQS Queue --> Consumer
  - At-least-once delivery
  - Best-effort ordering
  - Nearly unlimited throughput
  
  FIFO Queue:
  Producer --> SQS FIFO Queue --> Consumer
  - Exactly-once processing
  - Strict ordering within message group
  - 3,000 msg/sec with batching

Key characteristics:

  • Fully managed (no infrastructure to operate)
  • Pay-per-message pricing
  • Standard queues: unlimited throughput, at-least-once, best-effort order
  • FIFO queues: exactly-once, strict order, limited throughput
  • Visibility timeout (prevents double processing)
  • Dead letter queues built-in
  • Long polling for efficiency

Feature Comparison

  Feature           | RabbitMQ             | Kafka                                      | SQS
  ------------------+----------------------+--------------------------------------------+----------------------------------------
  Type              | Message broker       | Event streaming platform                   | Managed queue
  Delivery model    | Push                 | Pull                                       | Pull (long polling)
  Ordering          | Per-queue FIFO       | Per-partition FIFO                         | Best-effort (Standard) / strict (FIFO)
  Throughput        | ~50K msg/sec         | Millions of msg/sec                        | Unlimited (Standard) / 3K (FIFO)
  Message retention | Until consumed       | Configurable (days/weeks)                  | Up to 14 days
  Replay            | No (message deleted) | Yes (rewind offset)                        | No
  Routing           | Rich (exchanges)     | Topic + partition key                      | Simple (queue per consumer)
  Exactly-once      | No (at-least-once)   | Yes (with transactions)                    | Yes (FIFO only)
  Operations        | Self-managed         | Self-managed or managed (Confluent, MSK)   | Fully managed
  Best for          | Complex routing, RPC | Event sourcing, log aggregation, streaming | Simple async tasks, AWS-native
  Protocols         | AMQP, MQTT, STOMP    | Custom (Kafka protocol)                    | HTTP/HTTPS (AWS API)
  Consumer groups   | Competing consumers  | Native support                             | Not built-in (use multiple queues)

Decision Guide

  Which Message System?
  
  Need event streaming / replay / high throughput?
  |
  +-- Yes --> Kafka
  |
  Need complex routing (topic exchanges, headers)?
  |
  +-- Yes --> RabbitMQ
  |
  Need fully managed, simple queue, AWS-native?
  |
  +-- Yes --> SQS
  |
  Need real-time pub/sub with low latency?
  |
  +-- Yes --> RabbitMQ or Redis Pub/Sub
  |
  Need to process millions of events for analytics?
  |
  +-- Yes --> Kafka

Producer-Consumer Pattern

The most fundamental messaging pattern. Producers create messages; consumers process them.

  Producer-Consumer:
  
  +----------+     +---------+     +----------+
  | Producer |---->|  Queue  |---->| Consumer |
  | (Order   |     | [m3][m2]|     | (Payment |
  |  Service)|     | [m1]    |     |  Worker) |
  +----------+     +---------+     +----------+
  
  Multiple producers, multiple consumers:
  
  Producer A --+     +---------+     +--> Consumer 1
               |---->|  Queue  |---->|
  Producer B --+     |         |     +--> Consumer 2
               |---->|         |---->|
  Producer C --+     +---------+     +--> Consumer 3
  
  Messages are distributed among consumers (competing consumers).
  Each message is processed by exactly one consumer.
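A minimal competing-consumers setup can be sketched with Python's thread-safe `queue.Queue` standing in for the broker. The worker logic and message names are illustrative.

```python
import queue
import threading

work = queue.Queue()
processed = []                      # (worker_id, message) pairs
lock = threading.Lock()

def worker(worker_id: int):
    while True:
        msg = work.get()            # blocks until a message is available
        if msg is None:             # poison pill: shut this worker down
            work.task_done()
            return
        with lock:
            processed.append((worker_id, msg))
        work.task_done()            # "ACK": tell the queue we are done

# A producer enqueues work...
for i in range(10):
    work.put(f"order-{i}")

# ...and multiple competing consumers drain it.
threads = [threading.Thread(target=worker, args=(w,)) for w in range(3)]
for t in threads:
    t.start()
for _ in threads:
    work.put(None)                  # one poison pill per worker
work.join()
for t in threads:
    t.join()

print(len(processed))               # 10: each message consumed exactly once
```

However the ten messages are split among the three workers, each one is processed exactly once.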

Acknowledgment and Visibility

  Message Lifecycle:
  
  1. Producer sends message to queue
  2. Message is persisted in queue
  3. Consumer receives message (message becomes "invisible")
  4. Consumer processes message
  5a. SUCCESS: Consumer ACKs --> message is deleted
  5b. FAILURE: Consumer NACKs or timeout --> message becomes visible again
  
  
  Visibility Timeout (SQS) / Acknowledgment (RabbitMQ):
  
  Time: 0s    10s   20s   30s   40s   50s   60s
        |      |     |     |     |     |     |
  Recv  |----->|     |     |     |     |     |
  Processing   |---->|---->|     |     |     |
  ACK                      |---->|     |     |  (message deleted)
  
  If no ACK by timeout:
        |      |     |     |     |     |     |
  Recv  |----->|     |     |     |     |     |
  Processing   |---->|---->|---->|---->|     |
  TIMEOUT                              |---->|  (message reappears for retry)

Publish/Subscribe Pattern

A message is broadcast to all interested subscribers.

  Pub/Sub Architecture:
  
  +----------+
  | Event:   |     +---> Subscriber A: Email Service
  | "Order   |---->| Topic: order.placed
  |  Placed" |     +---> Subscriber B: Inventory Service
  +----------+     |
                   +---> Subscriber C: Analytics Service
                   |
                   +---> Subscriber D: Notification Service
  
  ALL subscribers receive the event.
  Each processes independently.
  Adding a new subscriber requires NO changes to the publisher.
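A minimal in-memory event bus shows why the publisher never changes when subscribers are added; the handler names here are illustrative.

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)     # topic -> handlers

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:  # fan out to everyone
            handler(event)

bus = EventBus()
received = []

bus.subscribe("order.placed", lambda e: received.append(("email", e)))
bus.subscribe("order.placed", lambda e: received.append(("inventory", e)))
# Adding analytics later touches ONLY this line, never the publisher:
bus.subscribe("order.placed", lambda e: received.append(("analytics", e)))

bus.publish("order.placed", {"order_id": 123})
print(len(received))   # 3: every subscriber got the event
```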

Kafka Consumer Groups Enable Both Patterns

  Kafka Topic: "orders" (3 partitions)
  
  Consumer Group A (order-processing):
    Consumer A1 <-- Partition 0
    Consumer A2 <-- Partition 1
    Consumer A3 <-- Partition 2
    (Point-to-point within this group)
  
  Consumer Group B (analytics):
    Consumer B1 <-- Partition 0, 1, 2
    (Receives ALL messages)
  
  Consumer Group C (audit-log):
    Consumer C1 <-- Partition 0, 1
    Consumer C2 <-- Partition 2
    (Point-to-point within this group)
  
  Result: Each group gets ALL messages (pub/sub between groups).
  Within a group, messages are distributed (competing consumers).
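This dual behavior can be sketched as an append-only log with one read cursor per group. It is an illustrative model: real Kafka tracks an offset per partition per group, but the idea is the same, and it also shows why replay is just rewinding an offset.

```python
log = [f"msg-{i}" for i in range(6)]               # the topic: an append-only log
offsets = {"order-processing": 0, "analytics": 0}  # one cursor per group

def poll(group, max_records):
    # Each group reads from its OWN offset, so groups never steal
    # messages from each other (pub/sub between groups).
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] += len(batch)
    return batch

a = poll("order-processing", 6)   # group A consumes everything
b = poll("analytics", 6)          # group B STILL sees every message
assert a == b == log

offsets["analytics"] = 0          # replay: rewind the offset
assert poll("analytics", 6) == log
```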

Dead Letter Queues (DLQ)

A dead letter queue captures messages that fail processing after a maximum number of retries.

  Dead Letter Queue Flow:
  
  Main Queue                              DLQ
  +--------+                          +--------+
  |  MSG-1 | --> Consumer --> FAIL    |        |
  |        | --> Consumer --> FAIL    |        |
  |        | --> Consumer --> FAIL    |  MSG-1 | (moved after 3 failures)
  +--------+    (3 retries)          +--------+
  
  DLQ allows:
  - Investigation of why messages failed
  - Manual reprocessing after fixing the bug
  - Alerting on DLQ depth (something is wrong!)
  - Preventing poison messages from blocking the queue

DLQ Best Practices:

  Practice                                 | Why
  -----------------------------------------+-----------------------------------
  Set max retries (3-5)                    | Prevent infinite retry loops
  Alert on DLQ depth                       | Catch processing failures early
  Preserve original message + error        | Enable debugging
  Set DLQ retention longer than main queue | Give time to investigate
  Implement DLQ reprocessing               | Replay messages after fixing bugs
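The retry-then-DLQ flow can be sketched like this; the `MAX_RETRIES` value and the always-failing handler are illustrative.

```python
from collections import deque

main_queue = deque([{"id": "MSG-1", "attempts": 0}])
dlq = []
MAX_RETRIES = 3

def handler(msg):
    raise ValueError("poison message")   # always fails, for the demo

while main_queue:
    msg = main_queue.popleft()
    try:
        handler(msg)
    except Exception as err:
        msg["attempts"] += 1
        if msg["attempts"] >= MAX_RETRIES:
            # Preserve the original message PLUS the error for debugging.
            dlq.append({**msg, "error": str(err)})
        else:
            main_queue.append(msg)       # requeue for another attempt

print(dlq)   # the poison message lands here after exhausting its retries
```

Because the poison message is moved aside instead of requeued forever, it can no longer block the messages behind it.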

Ordering Guarantees

Why Ordering Matters

  Scenario: Bank account updates
  
  Message 1: Deposit $100 (balance: $100)
  Message 2: Withdraw $80  (balance: $20)
  
  If processed out of order:
  Message 2: Withdraw $80  (balance: -$80 ???)  <-- REJECTED or negative balance!
  Message 1: Deposit $100 (balance: $100)

Ordering by System

  System       | Ordering Guarantee
  -------------+--------------------------------------------------------------------------
  Kafka        | Strict order per partition; use a partition key to group related messages
  RabbitMQ     | FIFO per queue (single consumer); lost with competing consumers
  SQS Standard | Best-effort ordering (messages can arrive out of order)
  SQS FIFO     | Strict ordering per message group ID

Kafka Partition Ordering

  Topic: "account-events"
  Partition key: account_id
  
  Account 42:  [deposit $100] --> [withdraw $80] --> [deposit $50]
               All go to partition hash("42") % N = partition 2
               Processed in exact order by one consumer.
  
  Account 99:  [deposit $200] --> [withdraw $150]
               All go to partition hash("99") % N = partition 0
               Processed in exact order by a different consumer.
  
  Account 42 and Account 99 events are NOT ordered relative to each other.
  (They do not need to be -- they are independent.)

Exactly-Once Delivery

The holy grail of messaging. Three delivery semantics exist:

At-Most-Once

Message may be lost but never duplicated.

  Producer --> Queue --> Consumer
  
  Producer sends, does not wait for ACK.
  If message is lost in transit, it is gone.
  
  Use case: Metrics, logs (losing a few is OK)

At-Least-Once

Message is guaranteed to be delivered but may be duplicated.

  Producer --> Queue --> Consumer
  
  If consumer fails after processing but before ACK:
  Message is redelivered --> processed AGAIN (duplicate!)
  
  Most common semantic. Requires idempotent consumers.

Exactly-Once

Message is delivered and processed exactly once. Hard to achieve.

  Approaches:
  
  1. Idempotent Consumer (most practical):
     Consumer checks: "Have I processed message ID abc123?"
     If yes: skip (deduplicate)
     If no: process and record message ID
     
     +-----------------------------------------------------+
     | Consumer                                             |
     |                                                      |
     | msg_id = message.id                                  |
     | if msg_id in processed_set:                          |
     |     return  # already processed, skip                |
     | else:                                                |
     |     process(message)                                 |
     |     processed_set.add(msg_id)                        |
     |     ack(message)                                     |
     +-----------------------------------------------------+
  
  2. Kafka Transactions (Kafka-specific):
     Producer uses transactional API.
     Consumer uses read_committed isolation.
     Exactly-once within the Kafka ecosystem.
  
  3. Outbox Pattern (database + queue):
     Write to DB and outbox table in same transaction.
     Separate process reads outbox and publishes to queue.
     Guarantees consistency between DB and queue.
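The idempotent-consumer approach above, made runnable. The processed-ID set lives in memory here; in production it would be a database table or a Redis set so it survives restarts.

```python
processed_ids = set()      # in production: a DB table or Redis set
balance = 0

def handle(message):
    """Apply a deposit exactly once, even if the message is redelivered."""
    global balance
    if message["id"] in processed_ids:
        return             # duplicate delivery: skip silently
    balance += message["amount"]
    processed_ids.add(message["id"])

msg = {"id": "abc123", "amount": 100}
handle(msg)
handle(msg)                # at-least-once redelivery of the SAME message
print(balance)             # 100, not 200: the duplicate was deduplicated
```

This is how at-least-once delivery plus a deduplication check yields effectively-once processing.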

Delivery Guarantee Comparison

  Semantic      | Message Loss | Duplication | Complexity | Use Case
  --------------+--------------+-------------+------------+---------------------------------------------
  At-most-once  | Possible     | Never       | Low        | Metrics, non-critical logs
  At-least-once | Never        | Possible    | Medium     | Most business operations
  Exactly-once  | Never        | Never       | High       | Financial transactions, critical operations

Backpressure

Backpressure is a mechanism to slow down producers when consumers cannot keep up.

  Without Backpressure:
  
  Producer (1000 msg/s) --> Queue (growing!) --> Consumer (100 msg/s)
  
  Queue grows unboundedly:
  t=0:   0 messages
  t=10:  9,000 messages
  t=60:  54,000 messages
  t=300: 270,000 messages  --> OUT OF MEMORY!
  
  
  With Backpressure:
  
  Producer (1000 msg/s) --> Queue (bounded, max 10,000)
  
  Queue full? Producer options:
  1. Block (wait until space available)
  2. Drop (discard message)
  3. Buffer locally and retry
  4. Signal upstream to slow down (HTTP 429)
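Options 1 and 2 can be demonstrated with Python's bounded `queue.Queue`, used here as an in-process stand-in for a broker with a size limit.

```python
import queue

q = queue.Queue(maxsize=3)          # bounded: backpressure kicks in at 3

# Option 2: drop when full (load shedding at the producer).
dropped = 0
for i in range(5):
    try:
        q.put_nowait(f"msg-{i}")    # raises queue.Full instead of growing
    except queue.Full:
        dropped += 1

print(q.qsize(), dropped)           # 3 queued, 2 dropped

# Option 1 would be q.put(msg, timeout=...), which BLOCKS the producer
# until the consumer frees space (or the timeout expires).
```

The bound is the whole point: a full queue pushes the problem back to the producer instead of growing until the broker runs out of memory.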

Backpressure Strategies

  Strategy                | How It Works                              | Trade-off
  ------------------------+-------------------------------------------+-------------------------------
  Bounded queue           | Queue has max size; rejects when full     | May drop messages
  Rate limiting producers | Limit producer send rate                  | Slows entire pipeline
  Consumer scaling        | Auto-scale consumers based on queue depth | Takes time, costs money
  Priority queues         | Process high-priority messages first      | Low-priority may starve
  Load shedding           | Drop non-essential messages under load    | Acceptable for some use cases

  Auto-Scaling Based on Queue Depth:
  
  Queue depth < 1,000:   2 consumers
  Queue depth 1,000-10K: 5 consumers  (scale up)
  Queue depth > 10,000:  10 consumers (scale up more)
  Queue depth < 500:     2 consumers  (scale down)
  
  CloudWatch Alarm --> Auto Scaling Group --> Add/Remove Consumers
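The scaling policy above, written as a function. The thresholds are the ones from the diagram and would be tuned per workload.

```python
def desired_consumers(queue_depth: int) -> int:
    # Thresholds mirror the policy above; tune for your workload.
    if queue_depth > 10_000:
        return 10
    if queue_depth >= 1_000:
        return 5
    return 2            # below 1,000 (including < 500): baseline

assert desired_consumers(100) == 2
assert desired_consumers(5_000) == 5
assert desired_consumers(50_000) == 10
```

In a real deployment this decision lives in the autoscaler (e.g., a CloudWatch alarm driving an Auto Scaling group), not in application code.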

When to Use Queues in System Design

Use a Queue When...

  Scenario           | Example                              | Why Queue Helps
  -------------------+--------------------------------------+-----------------------------------------------
  Async processing   | Order placed --> send email          | User does not wait for email
  Workload smoothing | Flash sale spike                     | Queue absorbs burst; workers process steadily
  Service decoupling | Order service --> Inventory service  | Services evolve independently
  Retry handling     | Payment processing                   | Failed payments retry automatically
  Fan-out            | New blog post --> notify followers   | Queue distributes to notification workers
  Rate limiting      | API calls to external service        | Queue controls outbound request rate
  Batch processing   | Aggregate events for analytics       | Collect in queue, process in batches

Do NOT Use a Queue When...

  Scenario                              | Why Not
  --------------------------------------+-----------------------------------------------------
  Synchronous response needed           | User needs immediate result (e.g., login response)
  Simple request-response               | Adding a queue adds unnecessary complexity
  Low latency critical                  | Queue adds milliseconds of latency
  Data consistency required immediately | Queue introduces eventual consistency
  Small, simple system                  | Overkill; direct HTTP calls suffice

Common Queue Patterns in System Design

Pattern 1: Work Queue (Task Distribution)

  Image Upload Service:
  
  User uploads image --> API Server --> Queue: "image-processing"
                                            |
                              +----+----+----+
                              |    |    |    |
                              v    v    v    v
                           Worker Worker Worker Worker
                           (resize, compress, thumbnail)
                              |
                              v
                           S3 (store processed images)

Pattern 2: Event-Driven Architecture

  E-Commerce Event Flow:
  
  Order Service: publish("order.placed", {order_id: 123, ...})
       |
       v
  Event Bus (Kafka)
       |
       +---> Inventory Service: subscribe("order.placed")
       |     --> Reserve items
       |
       +---> Payment Service: subscribe("order.placed")
       |     --> Charge customer
       |
       +---> Email Service: subscribe("order.placed")
       |     --> Send confirmation
       |
       +---> Analytics Service: subscribe("order.placed")
             --> Record metrics

Pattern 3: Saga Pattern (Distributed Transactions)

  Order Saga (Choreography):
  
  1. Order Service: create order --> publish "order.created"
  2. Payment Service: charge card --> publish "payment.success"
  3. Inventory Service: reserve items --> publish "inventory.reserved"
  4. Shipping Service: create shipment --> publish "shipment.created"
  
  If step 3 fails (out of stock):
  3. Inventory Service: publish "inventory.failed"
  --> Payment Service: refund charge (compensating transaction)
  --> Order Service: cancel order (compensating transaction)
  
  Each step communicates via messages. No distributed lock needed.
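The choreography above can be simulated with a tiny event loop. The service reactions and event names are illustrative, and the inventory step is hard-wired to fail so the compensating transactions run.

```python
from collections import deque

events = deque(["order.created"])
actions = []   # audit trail of what each service did

def react(event):
    """Each service reacts to events and emits new ones (choreography)."""
    if event == "order.created":
        actions.append("payment: charge card")
        return ["payment.success"]
    if event == "payment.success":
        actions.append("inventory: reserve items")
        return ["inventory.failed"]          # simulate: out of stock
    if event == "inventory.failed":
        # Compensating transactions undo the earlier steps.
        actions.append("payment: refund charge")
        actions.append("order: cancel order")
        return []
    return []

while events:
    for new_event in react(events.popleft()):
        events.append(new_event)

print(actions[-2:])   # compensation ran: refund, then cancel
```

No coordinator and no distributed lock: each service only reacts to the events it cares about and publishes its own.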

Pattern 4: CQRS with Event Sourcing

  Command Side:                    Query Side:
  
  Write API --> Command Handler    Read API --> Query Handler
                    |                              |
                    v                              v
              Event Store (Kafka)         Read Database (Elasticsearch)
                    |                              ^
                    +--- Events ------------------>|
                         (projected into read model)

Message Queue Best Practices

  Practice                     | Details
  -----------------------------+--------------------------------------------------------
  Make consumers idempotent    | Process the same message twice without side effects
  Use dead letter queues       | Capture failed messages for investigation
  Set message TTL              | Prevent stale messages from being processed
  Monitor queue depth          | Alert when queues grow unexpectedly
  Use correlation IDs          | Track a request across multiple services
  Batch when possible          | Reduce overhead by processing messages in batches
  Test failure scenarios       | What happens when a consumer crashes mid-processing?
  Version your message schemas | Allow producers and consumers to evolve independently
  Start simple                 | SQS or RabbitMQ before Kafka unless you need streaming

Key Takeaways

  1. Async communication decouples services, improves resilience, and smooths load
  2. Queues are point-to-point (one consumer); topics are pub/sub (all subscribers)
  3. Kafka for high-throughput event streaming; RabbitMQ for routing; SQS for simplicity
  4. At-least-once + idempotent consumers is the most practical delivery guarantee
  5. Dead letter queues are essential -- always configure them
  6. Ordering is per-partition (Kafka) or per-queue -- use partition keys wisely
  7. Backpressure prevents queues from growing unboundedly
  8. Use queues when you need decoupling, async processing, or load leveling
  9. Do not use queues when you need synchronous responses or immediate consistency

Explain-It Challenge

"You are designing a food delivery platform. When a customer places an order, you need to: (1) charge their card, (2) notify the restaurant, (3) find a driver, (4) send the customer a confirmation, and (5) update analytics. Some of these must succeed for the order to be valid; others are best-effort. Design the messaging architecture, specifying which steps are synchronous, which are async via queues, how you handle failures at each step, and what message broker you would choose."