Episode 9 — System Design / 9.9 — Core Infrastructure
9.9.e Message Queues
Why Asynchronous Communication?
In a synchronous system, the caller waits for the callee to respond. This creates tight coupling, and if any service is slow or down, the entire chain suffers.
SYNCHRONOUS (Tight Coupling):
User --> Order Service --> Payment Service --> Notification Service
              |                  |                    |
            WAITS              WAITS                WAITS
If Payment Service takes 5 seconds, user waits 5+ seconds.
If Notification Service is down, the entire order fails.
ASYNCHRONOUS (Loose Coupling):
User --> Order Service --> Message Queue
                                |
                 +--------------+--------------+
                 |              |              |
                 v              v              v
             Payment        Inventory     Notification
             Service         Service        Service
Order Service responds immediately.
Downstream services process at their own pace.
If Notification Service is down, messages wait in the queue.
Benefits of Async Communication
| Benefit | Description |
|---|---|
| Decoupling | Producer does not need to know about consumers |
| Resilience | Messages survive if a consumer is temporarily down |
| Scalability | Consumers can scale independently based on queue depth |
| Load leveling | Queue absorbs traffic spikes; consumers process at steady rate |
| Guaranteed delivery | Messages are persisted until acknowledged |
| Temporal decoupling | Producer and consumer do not need to be online simultaneously |
Queue vs Topic
These are two fundamental messaging patterns with different semantics.
Queue (Point-to-Point)
Each message is consumed by exactly one consumer.
Queue (Point-to-Point):
Producer --> [MSG-3][MSG-2][MSG-1] --> Consumer A (gets MSG-1)
                                   --> Consumer B (gets MSG-2)
                                   --> Consumer C (gets MSG-3)
Messages are distributed across consumers.
Each message is processed ONCE by ONE consumer.
Use case: Task processing, job queues, order processing
Topic (Pub/Sub)
Each message is delivered to all subscribers.
Topic (Publish/Subscribe):
Publisher --> [MSG-1] --> Topic --> Subscriber A (gets MSG-1)
                                --> Subscriber B (gets MSG-1)
                                --> Subscriber C (gets MSG-1)
ALL subscribers receive EVERY message.
Each subscriber processes independently.
Use case: Event broadcasting, notifications, real-time updates
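The fan-out semantics above can be sketched with a minimal in-memory topic. This is an illustration of the pattern, not a real broker API; the `Topic` class and its method names are invented for the example.

```python
from collections import defaultdict

class Topic:
    """Minimal in-memory topic: every subscriber receives every message."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        # Fan-out: deliver the message to ALL subscribers, independently.
        for callback in self.subscribers:
            callback(message)

# Three independent subscribers all see the same event.
received = defaultdict(list)
topic = Topic()
for name in ("email", "inventory", "analytics"):
    topic.subscribe(lambda msg, name=name: received[name].append(msg))

topic.publish({"event": "order.placed", "order_id": 123})
```

Note that adding a fourth subscriber would require no change to the publisher side, which is the decoupling property described above.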
Comparison
| Aspect | Queue | Topic |
|---|---|---|
| Delivery | One consumer per message | All subscribers per message |
| Pattern | Competing consumers | Fan-out |
| Use case | Work distribution | Event broadcasting |
| Scaling | Add consumers to increase throughput | Each subscriber scales independently |
| Example | "Process this order" | "An order was placed" |
RabbitMQ vs Kafka vs SQS
RabbitMQ
A traditional message broker that supports multiple messaging patterns.
RabbitMQ Architecture:
Producer --> Exchange --> Binding --> Queue --> Consumer
Exchange Types:
- Direct: route by exact routing key
- Topic: route by pattern matching (e.g., order.*)
- Fanout: broadcast to all bound queues
- Headers: route by message headers
Example (Direct Exchange):
Producer: publish(exchange="orders", routing_key="payment")
Exchange "orders":
routing_key="payment" --> Queue: payment-queue --> Payment Consumer
routing_key="inventory" --> Queue: inventory-queue --> Inventory Consumer
routing_key="email" --> Queue: email-queue --> Email Consumer
Key characteristics:
- Push-based: broker pushes messages to consumers
- Per-message acknowledgment
- Rich routing via exchange types
- Messages are deleted after consumption
- Supports priorities, TTL, dead letter queues
- Written in Erlang (excellent concurrency)
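The routing behavior of direct, topic, and fanout exchanges can be simulated in a few lines. This is an in-memory sketch of the semantics, not the real AMQP client API (for that, see a library such as pika); also, real topic exchanges match `.`-separated words with `*` and `#`, while this sketch approximates patterns with `fnmatch`.

```python
import fnmatch

class Exchange:
    """In-memory sketch of RabbitMQ-style exchange routing (not a real broker)."""
    def __init__(self, type_):
        self.type = type_          # "direct", "topic", or "fanout"
        self.bindings = []         # (binding_key, queue) pairs

    def bind(self, binding_key, queue):
        self.bindings.append((binding_key, queue))

    def publish(self, routing_key, message):
        for binding_key, queue in self.bindings:
            if self.type == "fanout":
                queue.append(message)                 # broadcast to every bound queue
            elif self.type == "direct" and binding_key == routing_key:
                queue.append(message)                 # exact routing-key match
            elif self.type == "topic" and fnmatch.fnmatch(routing_key, binding_key):
                queue.append(message)                 # pattern match, e.g. "order.*"

payment_q, inventory_q, audit_q = [], [], []

orders = Exchange("direct")
orders.bind("payment", payment_q)
orders.bind("inventory", inventory_q)
orders.publish("payment", "charge order 123")         # only payment_q receives this

events = Exchange("topic")
events.bind("order.*", audit_q)
events.publish("order.placed", "order 123 placed")    # matches the "order.*" binding
```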
Apache Kafka
A distributed event streaming platform designed for high-throughput, ordered, durable event streams.
Kafka Architecture:
+------------------------------------------------------------------+
| Kafka Cluster |
| |
| Topic: "orders" (3 partitions) |
| |
| Partition 0: [msg-0][msg-3][msg-6][msg-9] --> Consumer A |
| Partition 1: [msg-1][msg-4][msg-7] --> Consumer B |
| Partition 2: [msg-2][msg-5][msg-8] --> Consumer C |
| |
| Consumer Group "order-processing": |
| Each partition assigned to exactly one consumer in the group. |
| Messages within a partition are strictly ordered. |
+------------------------------------------------------------------+
Producer: send(topic="orders", key="user-42", value={...})
Key determines partition: hash("user-42") % 3 = partition 1
All messages for user-42 go to partition 1 (ordered!)
Key characteristics:
- Pull-based: consumers pull messages at their own pace
- Messages are retained (configurable retention, e.g., 7 days)
- Consumers track their offset (position in the log)
- Partitions enable parallelism and ordering
- Replication for fault tolerance
- Consumer groups for load distribution
- Can replay messages (rewind offset)
- Extremely high throughput (millions of messages/sec)
Amazon SQS
A fully managed message queue service from AWS.
SQS Architecture:
Standard Queue:
Producer --> SQS Queue --> Consumer
- At-least-once delivery
- Best-effort ordering
- Nearly unlimited throughput
FIFO Queue:
Producer --> SQS FIFO Queue --> Consumer
- Exactly-once processing
- Strict ordering within message group
- 3,000 msg/sec with batching
Key characteristics:
- Fully managed (no infrastructure to operate)
- Pay-per-message pricing
- Standard queues: unlimited throughput, at-least-once, best-effort order
- FIFO queues: exactly-once, strict order, limited throughput
- Visibility timeout (prevents double processing)
- Dead letter queues built-in
- Long polling for efficiency
Feature Comparison
| Feature | RabbitMQ | Kafka | SQS |
|---|---|---|---|
| Type | Message broker | Event streaming platform | Managed queue |
| Delivery model | Push | Pull | Pull (long polling) |
| Ordering | Per-queue FIFO | Per-partition FIFO | Best-effort (Standard) / Strict (FIFO) |
| Throughput | ~50K msg/sec | Millions msg/sec | Unlimited (Standard) / 3K (FIFO) |
| Message retention | Until consumed | Configurable (days/weeks) | Up to 14 days |
| Replay | No (message deleted) | Yes (rewind offset) | No |
| Routing | Rich (exchanges) | Topic + partition key | Simple (queue per consumer) |
| Exactly-once | No (at-least-once) | Yes (with transactions) | Yes (FIFO only) |
| Operations | Self-managed | Self-managed or managed (Confluent, MSK) | Fully managed |
| Best for | Complex routing, RPC | Event sourcing, log aggregation, streaming | Simple async tasks, AWS-native |
| Protocols | AMQP, MQTT, STOMP | Custom (Kafka protocol) | HTTP/HTTPS (AWS API) |
| Consumer groups | Competing consumers | Native support | Not built-in (use multiple queues) |
Decision Guide
Which Message System?
Need event streaming / replay / high throughput?
|
+-- Yes --> Kafka
|
Need complex routing (topic exchanges, headers)?
|
+-- Yes --> RabbitMQ
|
Need fully managed, simple queue, AWS-native?
|
+-- Yes --> SQS
|
Need real-time pub/sub with low latency?
|
+-- Yes --> RabbitMQ or Redis Pub/Sub
|
Need to process millions of events for analytics?
|
+-- Yes --> Kafka
Producer-Consumer Pattern
The most fundamental messaging pattern. Producers create messages; consumers process them.
Producer-Consumer:
+----------+      +---------+      +----------+
| Producer |----->|  Queue  |----->| Consumer |
| (Order   |      | [m3][m2]|      | (Payment |
|  Service)|      | [m1]    |      |  Worker) |
+----------+      +---------+      +----------+
Multiple producers, multiple consumers:
Producer A --+                 +--> Consumer 1
             |    +---------+  |
Producer B --+--->|  Queue  |--+--> Consumer 2
             |    +---------+  |
Producer C --+                 +--> Consumer 3
Messages are distributed among consumers (competing consumers).
Each message is processed by exactly one consumer.
Acknowledgment and Visibility
Message Lifecycle:
1. Producer sends message to queue
2. Message is persisted in queue
3. Consumer receives message (message becomes "invisible")
4. Consumer processes message
5a. SUCCESS: Consumer ACKs --> message is deleted
5b. FAILURE: Consumer NACKs or timeout --> message becomes visible again
Visibility Timeout (SQS) / Acknowledgment (RabbitMQ):

Time:        0s      10s     20s     30s     40s     50s     60s
             |       |       |       |       |       |       |
Recv         |------>|
Processing           |-------------->|
ACK                                  X   (message deleted)

If no ACK before the timeout expires:

Recv         |------>|
Processing           |------------------------------>|   (too slow)
TIMEOUT                                      X   (message becomes visible again for retry)
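The receive/ack/timeout lifecycle can be sketched with a small in-memory queue. This models the semantics only; the `VisibilityQueue` class and its methods are invented for illustration, not the SQS API. Time is passed in explicitly so the redelivery behavior is easy to see.

```python
import time

class VisibilityQueue:
    """Sketch of SQS-style receive/ack with a visibility timeout."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}          # msg_id -> [body, invisible_until]
        self._next_id = 0

    def send(self, body):
        self.messages[self._next_id] = [body, 0.0]   # visible immediately
        self._next_id += 1

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg_id, entry in self.messages.items():
            body, invisible_until = entry
            if invisible_until <= now:               # visible (or never received)
                entry[1] = now + self.visibility_timeout
                return msg_id, body
        return None

    def ack(self, msg_id):
        del self.messages[msg_id]                    # deleted only on explicit ACK

q = VisibilityQueue(visibility_timeout=30.0)
q.send("process order 123")

msg_id, body = q.receive(now=0.0)        # message becomes invisible until t=30
assert q.receive(now=10.0) is None       # still invisible: no double delivery
redelivered = q.receive(now=31.0)        # no ACK arrived -> visible again
q.ack(redelivered[0])                    # second attempt succeeds and ACKs
```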
Publish/Subscribe Pattern
A message is broadcast to all interested subscribers.
Pub/Sub Architecture:
+----------+      Topic:          +---> Subscriber A: Email Service
|  Event:  |      order.placed    |
|  "Order  |---------------------->+---> Subscriber B: Inventory Service
|  Placed" |                      |
+----------+                      +---> Subscriber C: Analytics Service
                                  |
                                  +---> Subscriber D: Notification Service
ALL subscribers receive the event.
Each processes independently.
Adding a new subscriber requires NO changes to the publisher.
Kafka Consumer Groups Enable Both Patterns
Kafka Topic: "orders" (3 partitions)
Consumer Group A (order-processing):
Consumer A1 <-- Partition 0
Consumer A2 <-- Partition 1
Consumer A3 <-- Partition 2
(Point-to-point within this group)
Consumer Group B (analytics):
Consumer B1 <-- Partition 0, 1, 2
(Receives ALL messages)
Consumer Group C (audit-log):
Consumer C1 <-- Partition 0, 1
Consumer C2 <-- Partition 2
(Point-to-point within this group)
Result: Each group gets ALL messages (pub/sub between groups).
Within a group, messages are distributed (competing consumers).
Dead Letter Queues (DLQ)
A dead letter queue captures messages that fail processing after a maximum number of retries.
Dead Letter Queue Flow:
Main Queue                              DLQ
+--------+                              +--------+
| MSG-1  | --> Consumer --> FAIL (1)    |        |
|        | --> Consumer --> FAIL (2)    |        |
|        | --> Consumer --> FAIL (3)    | MSG-1  |  (moved after 3 failures)
+--------+                              +--------+
DLQ allows:
- Investigation of why messages failed
- Manual reprocessing after fixing the bug
- Alerting on DLQ depth (something is wrong!)
- Preventing poison messages from blocking the queue
DLQ Best Practices:
| Practice | Why |
|---|---|
| Set max retries (3-5) | Prevent infinite retry loops |
| Alert on DLQ depth | Catch processing failures early |
| Preserve original message + error | Enable debugging |
| Set DLQ retention longer than main queue | Give time to investigate |
| Implement DLQ reprocessing | Replay messages after fixing bugs |
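A minimal sketch of the retry-then-dead-letter flow, using plain lists as queues. The function name and message shape are invented for the example; a real broker tracks the attempt count for you (e.g., SQS's `ApproximateReceiveCount`). Note that the failed message is preserved together with its last error, per the best practices above.

```python
MAX_RETRIES = 3

def consume_with_dlq(queue, dlq, handler):
    """Retry each message up to MAX_RETRIES, then move it to the DLQ."""
    while queue:
        message = queue.pop(0)
        attempts = message.setdefault("attempts", 0)
        try:
            handler(message["body"])              # ACK is implicit on success
        except Exception as exc:
            message["attempts"] = attempts + 1
            if message["attempts"] >= MAX_RETRIES:
                # Preserve the original body plus the last error for debugging.
                dlq.append({**message, "error": repr(exc)})
            else:
                queue.append(message)             # requeue for another attempt

# A poison message fails every attempt; a good one succeeds immediately.
main_q = [{"body": "good"}, {"body": "poison"}]
dlq = []

def handler(body):
    if body == "poison":
        raise ValueError("cannot parse")

consume_with_dlq(main_q, dlq, handler)
```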
Ordering Guarantees
Why Ordering Matters
Scenario: Bank account updates
Message 1: Deposit $100 (balance: $100)
Message 2: Withdraw $80 (balance: $20)
If processed out of order:
Message 2: Withdraw $80 (balance: -$80 ???) <-- REJECTED or negative balance!
Message 1: Deposit $100 (balance: $100)
Ordering by System
| System | Ordering Guarantee |
|---|---|
| Kafka | Strict order per partition. Use partition key to group related messages. |
| RabbitMQ | FIFO per queue (single consumer). Lost with competing consumers. |
| SQS Standard | Best-effort ordering (messages can arrive out of order) |
| SQS FIFO | Strict ordering per message group ID |
Kafka Partition Ordering
Topic: "account-events"
Partition key: account_id
Account 42: [deposit $100] --> [withdraw $80] --> [deposit $50]
All go to partition hash("42") % N = partition 2
Processed in exact order by one consumer.
Account 99: [deposit $200] --> [withdraw $150]
All go to partition hash("99") % N = partition 0
Processed in exact order by a different consumer.
Account 42 and Account 99 events are NOT ordered relative to each other.
(They do not need to be -- they are independent.)
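The key-to-partition mapping above can be sketched directly. Kafka's default partitioner actually uses murmur2; CRC32 is used here only as a stand-in for any stable hash. The point demonstrated is that every event for a given account lands on the same partition, in send order.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key):
    """Stable key -> partition mapping (Kafka uses murmur2; CRC32 as a stand-in)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("42", "deposit $100"),
    ("99", "deposit $200"),
    ("42", "withdraw $80"),
    ("99", "withdraw $150"),
    ("42", "deposit $50"),
]
for account_id, event in events:
    # Same key -> same partition every time, so per-account order is preserved.
    partitions[partition_for(account_id)].append(event)

p42 = partition_for("42")   # the partition all of account 42's events went to
```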
Exactly-Once Delivery
The holy grail of messaging. Three delivery semantics exist:
At-Most-Once
Message may be lost but never duplicated.
Producer --> Queue --> Consumer
Producer sends, does not wait for ACK.
If message is lost in transit, it is gone.
Use case: Metrics, logs (losing a few is OK)
At-Least-Once
Message is guaranteed to be delivered but may be duplicated.
Producer --> Queue --> Consumer
If consumer fails after processing but before ACK:
Message is redelivered --> processed AGAIN (duplicate!)
Most common semantic. Requires idempotent consumers.
Exactly-Once
Message is delivered and processed exactly once. Hard to achieve.
Approaches:
1. Idempotent Consumer (most practical):
Consumer checks: "Have I processed message ID abc123?"
If yes: skip (deduplicate)
If no: process and record message ID
+-----------------------------------------------------+
| Consumer |
| |
| msg_id = message.id |
| if msg_id in processed_set: |
| return # already processed, skip |
| else: |
| process(message) |
| processed_set.add(msg_id) |
| ack(message) |
+-----------------------------------------------------+
2. Kafka Transactions (Kafka-specific):
Producer uses transactional API.
Consumer uses read_committed isolation.
Exactly-once within the Kafka ecosystem.
3. Outbox Pattern (database + queue):
Write to DB and outbox table in same transaction.
Separate process reads outbox and publishes to queue.
Guarantees consistency between DB and queue.
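Approach 1 (the idempotent consumer) is small enough to sketch in full. This toy version keeps the processed-ID set in memory; in production the set and the side effect would live in the same database transaction, otherwise a crash between the two steps reintroduces duplicates. The function and variable names are illustrative.

```python
processed_ids = set()
side_effects = []           # stand-in for the real work (charging a card, etc.)

def handle(message):
    """Idempotent consumer: deduplicate on message ID before doing side effects."""
    if message["id"] in processed_ids:
        return "skipped"                      # duplicate delivery: do nothing
    side_effects.append(message["body"])      # perform the real work
    processed_ids.add(message["id"])          # record AFTER the work succeeds
    return "processed"

# At-least-once delivery can hand us the same message twice:
msg = {"id": "abc123", "body": "charge $80"}
first = handle(msg)
second = handle(msg)        # redelivery after a missed ACK
```

Because the consumer deduplicates, at-least-once delivery from the broker becomes effectively exactly-once processing, which is why this combination is called the most practical approach.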
Delivery Guarantee Comparison
| Semantic | Message Loss | Duplication | Complexity | Use Case |
|---|---|---|---|---|
| At-most-once | Possible | Never | Low | Metrics, non-critical logs |
| At-least-once | Never | Possible | Medium | Most business operations |
| Exactly-once | Never | Never | High | Financial transactions, critical operations |
Backpressure
Backpressure is a mechanism to slow down producers when consumers cannot keep up.
Without Backpressure:
Producer (1000 msg/s) --> Queue (growing!) --> Consumer (100 msg/s)
Queue grows unboundedly:
t=0: 0 messages
t=10: 9,000 messages
t=60: 54,000 messages
t=300: 270,000 messages --> OUT OF MEMORY!
With Backpressure:
Producer (1000 msg/s) --> Queue (bounded, max 10,000)
Queue full? Producer options:
1. Block (wait until space available)
2. Drop (discard message)
3. Buffer locally and retry
4. Signal upstream to slow down (HTTP 429)
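Option 2 (drop when full) is easy to demonstrate with Python's bounded `queue.Queue`; `put_nowait` raises `queue.Full` at the boundary instead of letting the backlog grow without limit. The `produce` helper is invented for the example.

```python
import queue

bounded = queue.Queue(maxsize=3)    # the backpressure boundary
dropped = []

def produce(msg):
    """Shed load when the queue is full (alternatives: block, buffer, return 429)."""
    try:
        bounded.put_nowait(msg)
        return True
    except queue.Full:
        dropped.append(msg)         # make the drop visible so it can be monitored
        return False

for i in range(5):                  # producer outpaces the (absent) consumer
    produce(i)
```

Swapping `put_nowait` for a blocking `put(msg, timeout=...)` turns the same boundary into option 1 (block the producer) instead.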
Backpressure Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| Bounded queue | Queue has max size; rejects when full | May drop messages |
| Rate limiting producers | Limit producer send rate | Slows entire pipeline |
| Consumer scaling | Auto-scale consumers based on queue depth | Takes time, costs money |
| Priority queues | Process high-priority messages first | Low-priority may starve |
| Load shedding | Drop non-essential messages under load | Acceptable for some use cases |
Auto-Scaling Based on Queue Depth (example policy):
  Queue depth > 10,000:      10 consumers (scale up more)
  Queue depth 1,000-10,000:  5 consumers (scale up)
  Queue depth < 500:         2 consumers (scale down)
  Note: the scale-down threshold (500) is lower than the scale-up
  threshold (1,000) so the system does not flap around one boundary.
CloudWatch Alarm --> Auto Scaling Group --> Add/Remove Consumers
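A policy like this is just a function from queue depth to desired consumer count. The sketch below uses the thresholds above and adds a hold band between scale-down and scale-up thresholds so the fleet does not flap; the function name and exact thresholds are illustrative.

```python
def desired_consumers(queue_depth, current):
    """Map queue depth to a consumer count, with hysteresis to avoid flapping."""
    if queue_depth > 10_000:
        return 10                   # heavy backlog: scale up more
    if queue_depth >= 1_000:
        return 5                    # growing backlog: scale up
    if queue_depth < 500:
        return 2                    # clearly drained: scale back down
    return current                  # 500-999: hold steady (hysteresis band)
```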
When to Use Queues in System Design
Use a Queue When...
| Scenario | Example | Why Queue Helps |
|---|---|---|
| Async processing | Order placed --> send email | User does not wait for email |
| Workload smoothing | Flash sale spike | Queue absorbs burst; workers process steadily |
| Service decoupling | Order service --> Inventory service | Services evolve independently |
| Retry handling | Payment processing | Failed payments retry automatically |
| Fan-out | New blog post --> notify all followers | Queue distributes to notification workers |
| Rate limiting | API calls to external service | Queue controls outbound request rate |
| Batch processing | Aggregate events for analytics | Collect in queue, process in batches |
Do NOT Use a Queue When...
| Scenario | Why Not |
|---|---|
| Synchronous response needed | User needs immediate result (e.g., login response) |
| Simple request-response | Adding a queue adds unnecessary complexity |
| Low latency critical | Queue adds milliseconds of latency |
| Data consistency required immediately | Queue introduces eventual consistency |
| Small, simple system | Overkill; direct HTTP calls suffice |
Common Queue Patterns in System Design
Pattern 1: Work Queue (Task Distribution)
Image Upload Service:
User uploads image --> API Server --> Queue: "image-processing"
                                            |
                           +--------+-------+--------+
                           |        |       |        |
                           v        v       v        v
                        Worker   Worker  Worker   Worker
                             (resize, compress, thumbnail)
                                        |
                                        v
                             S3 (store processed images)
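A work queue like this maps directly onto Python's thread-safe `queue.Queue` with a pool of worker threads as competing consumers. The "image processing" is faked here; the structure (shared queue, N workers, one sentinel per worker for shutdown) is the point.

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    """Competing consumer: each job is processed by exactly one worker."""
    while True:
        image = jobs.get()
        if image is None:                  # sentinel: shut this worker down
            jobs.task_done()
            return
        with lock:                         # results list is shared across threads
            results.append(f"thumbnail of {image}")
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(6):
    jobs.put(f"img-{i}.png")               # producer: API server enqueues uploads
for _ in threads:
    jobs.put(None)                         # one shutdown sentinel per worker
for t in threads:
    t.join()
```

Scaling throughput means adding workers, not touching the producer, which is the competing-consumers property from earlier in this section.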
Pattern 2: Event-Driven Architecture
E-Commerce Event Flow:
Order Service: publish("order.placed", {order_id: 123, ...})
        |
        v
  Event Bus (Kafka)
        |
        +---> Inventory Service: subscribe("order.placed")
        |         --> Reserve items
        |
        +---> Payment Service: subscribe("order.placed")
        |         --> Charge customer
        |
        +---> Email Service: subscribe("order.placed")
        |         --> Send confirmation
        |
        +---> Analytics Service: subscribe("order.placed")
                  --> Record metrics
Pattern 3: Saga Pattern (Distributed Transactions)
Order Saga (Choreography):
1. Order Service: create order --> publish "order.created"
2. Payment Service: charge card --> publish "payment.success"
3. Inventory Service: reserve items --> publish "inventory.reserved"
4. Shipping Service: create shipment --> publish "shipment.created"
If step 3 fails (out of stock):
3. Inventory Service: publish "inventory.failed"
--> Payment Service: refund charge (compensating transaction)
--> Order Service: cancel order (compensating transaction)
Each step communicates via messages. No distributed lock needed.
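The choreography above, including the compensation path, can be traced with a toy synchronous event bus standing in for Kafka or RabbitMQ. The event names follow the saga steps above; the handler functions and the `in_stock` flag are invented for the example.

```python
log = []

def on_order_created(event, bus):
    log.append("payment charged")                 # step 2
    bus("payment.success", event)

def on_payment_success(event, bus):
    if event["in_stock"]:
        log.append("inventory reserved")          # step 3, happy path
        bus("inventory.reserved", event)
    else:
        log.append("inventory failed")            # step 3 fails: out of stock
        bus("inventory.failed", event)

def on_inventory_failed(event, bus):
    # Compensating transactions: undo the earlier steps in reverse.
    log.append("payment refunded")
    log.append("order cancelled")

handlers = {
    "order.created": on_order_created,
    "payment.success": on_payment_success,
    "inventory.failed": on_inventory_failed,
}

def bus(event_name, event):
    """Toy synchronous event bus; a real saga would go through a broker."""
    if event_name in handlers:
        handlers[event_name](event, bus)

# Simulate an order for an out-of-stock item:
bus("order.created", {"order_id": 1, "in_stock": False})
```

No service here holds a distributed lock or calls another service directly; each one only reacts to events, which is what makes the saga resilient to individual failures.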
Pattern 4: CQRS with Event Sourcing
Command Side:                          Query Side:

Write API --> Command Handler          Read API --> Query Handler
                   |                                     |
                   v                                     v
         Event Store (Kafka)              Read Database (Elasticsearch)
                   |                                     ^
                   +--- Events --------------------------+
                        (projected into read model)
Message Queue Best Practices
| Practice | Details |
|---|---|
| Make consumers idempotent | Process same message twice without side effects |
| Use dead letter queues | Capture failed messages for investigation |
| Set message TTL | Prevent stale messages from being processed |
| Monitor queue depth | Alert when queues grow unexpectedly |
| Use correlation IDs | Track a request across multiple services |
| Batch when possible | Reduce overhead by processing messages in batches |
| Test failure scenarios | What happens when consumer crashes mid-processing? |
| Version your message schemas | Allow producers and consumers to evolve independently |
| Start simple | SQS or RabbitMQ before Kafka unless you need streaming |
Key Takeaways
- Async communication decouples services, improves resilience, and smooths load
- Queues are point-to-point (one consumer); topics are pub/sub (all subscribers)
- Kafka for high-throughput event streaming; RabbitMQ for routing; SQS for simplicity
- At-least-once + idempotent consumers is the most practical delivery guarantee
- Dead letter queues are essential -- always configure them
- Ordering is per-partition (Kafka) or per-queue -- use partition keys wisely
- Backpressure prevents queues from growing unboundedly
- Use queues when you need decoupling, async processing, or load leveling
- Do not use queues when you need synchronous responses or immediate consistency
Explain-It Challenge
"You are designing a food delivery platform. When a customer places an order, you need to: (1) charge their card, (2) notify the restaurant, (3) find a driver, (4) send the customer a confirmation, and (5) update analytics. Some of these must succeed for the order to be valid; others are best-effort. Design the messaging architecture, specifying which steps are synchronous, which are async via queues, how you handle failures at each step, and what message broker you would choose."