Staff Prep 25: Message Queues — Kafka vs SQS vs Redis Streams
Back to Part 24: Redis Patterns.
Message queues decouple producers from consumers and enable async processing at scale. Kafka, SQS, and Redis Streams solve the same fundamental problem but with very different trade-offs in durability, throughput, operational complexity, and replay capability. Choosing wrong is expensive to undo.
Core concepts: topics, partitions, consumer groups
Topic: A named stream of messages. Producers write to topics; consumers read from topics.
Partition: A topic is split into N partitions. Each partition is an ordered, append-only log. Partitions enable parallelism: each consumer in a group is assigned one or more partitions exclusively.
Consumer Group: Multiple consumers sharing the work of processing a topic. Each message is delivered to exactly one consumer in the group. Different groups each receive all messages independently.
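These mechanics can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Kafka's actual implementation: the real default partitioner hashes keys with murmur2, and group assignment is negotiated by a coordinator, but the invariants (same key, same partition; each partition owned by exactly one consumer in a group) are the same.

```python
def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner:
    # the same key always maps to the same partition
    return sum(key) % num_partitions

def assign_partitions(partitions: list[int], consumers: list[str]) -> dict:
    # Each partition is owned by exactly one consumer in the group;
    # consumers beyond the partition count receive nothing
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions shared by a group of 5 consumers
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"]))
```

Note the consequence: with 3 partitions and 5 consumers in one group, only 3 consumers do work; the other 2 sit idle as hot spares. Partition count caps consumer parallelism.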
Kafka: high-throughput, durable, replayable
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer
import asyncio
import json
# Producer (a producer is created per call here for brevity;
# in production, create one long-lived producer and reuse it)
async def produce_order_event(order_id: int, event_type: str, data: dict):
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
    )
    await producer.start()
    try:
        # Partition by order_id for ordering guarantees within an order
        await producer.send_and_wait(
            "order-events",
            value={"order_id": order_id, "type": event_type, "data": data},
            key=str(order_id).encode(),  # same key = same partition = ordered
        )
    finally:
        await producer.stop()
# Consumer with manual offset management
async def consume_order_events():
    consumer = AIOKafkaConsumer(
        "order-events",
        bootstrap_servers="kafka:9092",
        group_id="order-processor",
        auto_offset_reset="earliest",
        enable_auto_commit=False,  # manual commit => at-least-once processing
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    await consumer.start()
    try:
        async for message in consumer:
            try:
                await process_order_event(message.value)
                await consumer.commit()  # commit AFTER processing (at-least-once)
            except ProcessingError:
                # Don't commit: the message is redelivered after a restart
                # or rebalance (the in-memory position still advances, so
                # it is NOT retried immediately within this loop)
                pass
    finally:
        await consumer.stop()
Kafka delivery semantics
from aiokafka import AIOKafkaProducer

# At-most-once: commit before processing (message may be lost on crash)
async def at_most_once(consumer, message):
    await consumer.commit()  # commit first
    await process(message)   # if this crashes, the message is lost

# At-least-once: commit after processing (message may be processed twice)
async def at_least_once(consumer, message):
    await process(message)   # process first
    await consumer.commit()  # crash between process and commit: redelivered

# Exactly-once: producer transactions + idempotent consumer
async def exactly_once_produce(processed_data):
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",
        transactional_id="my-producer-1",  # enables transactions
        enable_idempotence=True,           # deduplication at producer level
    )
    await producer.start()
    async with producer.transaction():
        await producer.send("output-topic", value=processed_data)
        # Only visible to downstream consumers if the transaction commits

# Consumers must use isolation_level="read_committed" to see only committed messages
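In practice, "exactly-once" across arbitrary downstream systems means at-least-once delivery plus an idempotent consumer. A minimal sketch, assuming each event carries an idempotency_key as in the SQS example below; the in-memory set is a stand-in, and a real system would use a database unique constraint or a Redis SETNX key instead:

```python
processed_keys: set[str] = set()  # stand-in; use a DB unique constraint in production

def handle_once(event: dict) -> bool:
    """Apply an event's side effects at most once per idempotency key.

    Returns True if processed, False if the event was a duplicate redelivery.
    """
    key = event["idempotency_key"]
    if key in processed_keys:
        return False  # duplicate from a redelivery: safely skipped
    # ... real side effects (DB writes, emails) would go here ...
    processed_keys.add(key)
    return True
```

With this in place, at-least-once redelivery becomes harmless: the second delivery is detected and dropped, which is what "exactly-once processing" usually means end to end.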
SQS: fully managed, simpler, limited
import boto3
import json
from typing import Optional
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events"
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events-dlq"
# Producer (FIFO queue: the queue name must end in .fifo, otherwise
# MessageGroupId / MessageDeduplicationId are rejected)
def publish_event(event: dict):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=str(event["order_id"]),  # FIFO: ordering per group
        MessageDeduplicationId=event["idempotency_key"],  # FIFO dedup (5-minute window)
    )
# Consumer
def consume_messages(max_messages: int = 10):
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=20,    # long polling (reduces empty receives)
        VisibilityTimeout=60,  # message hidden from other consumers for 60s
    )
    for message in response.get("Messages", []):
        try:
            body = json.loads(message["Body"])
            process_event(body)
            # Delete after successful processing
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"],
            )
        except Exception as e:
            # Do NOT delete: the message becomes visible again after VisibilityTimeout.
            # After maxReceiveCount failures, SQS moves it to the DLQ automatically.
            print(f"Processing failed: {e}")
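Deleting messages one at a time costs one API call each. SQS's delete_message_batch removes up to 10 per call, which matches the receive batch size above. A sketch (build_delete_entries is our helper name, not a boto3 API; the client and queue URL are passed in):

```python
def build_delete_entries(messages: list) -> list:
    # SQS batch entries need a unique Id per entry; max 10 entries per call
    return [
        {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
        for i, m in enumerate(messages[:10])
    ]

def delete_batch(sqs_client, queue_url: str, messages: list) -> list:
    resp = sqs_client.delete_message_batch(
        QueueUrl=queue_url,
        Entries=build_delete_entries(messages),
    )
    # Entries that failed to delete simply become visible again
    # after the visibility timeout, so failures here are recoverable
    return resp.get("Failed", [])
```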
Dead letter queues: handling persistent failures
import boto3
import json

sqs = boto3.client("sqs")

# Configure DLQ: after 5 failed processing attempts, move to DLQ
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789:order-events-dlq",
            "maxReceiveCount": "5",
        })
    },
)
# Monitor DLQ depth for alerting
def get_dlq_depth() -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=DLQ_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# Replay DLQ messages (after fixing the bug)
def redrive_dlq():
    sqs.start_message_move_task(
        SourceArn="arn:aws:sqs:us-east-1:123456789:order-events-dlq",
        DestinationArn="arn:aws:sqs:us-east-1:123456789:order-events",
        MaxNumberOfMessagesPerSecond=10,
    )
Comparison: Kafka vs SQS vs Redis streams
Kafka: 1M+ msg/s, message retention (replay), consumer groups, complex to operate, great for event sourcing and analytics pipelines.
SQS: nearly unlimited throughput on standard queues; FIFO queues are capped at roughly 300 msg/s per queue (3,000 with batching, more in high-throughput mode). Fully managed, no replay, 14-day max retention, dead letter queues built in, simplest operations.
Redis Streams: ~100k msg/s, configurable retention, consumer groups, light operational overhead if Redis is already deployed. Best for moderate-throughput use cases.
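For completeness, the consumer-group pattern in Redis Streams uses XADD / XREADGROUP / XACK. A sketch with redis-py: the client r and the handle_event callback are passed in, and the stream and group names are illustrative. decode_fields is our helper, needed because redis-py returns stream fields as bytes.

```python
STREAM = "order-events"
GROUP = "order-processor"

def decode_fields(fields: dict) -> dict:
    # redis-py returns bytes keys and values from streams; decode for use
    return {k.decode(): v.decode() for k, v in fields.items()}

def consume(r, consumer_name: str, handle_event):
    # r is a redis.Redis client; create the consumer group once
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception:
        pass  # BUSYGROUP: the group already exists

    # ">" means: only messages never delivered to this group
    entries = r.xreadgroup(GROUP, consumer_name, {STREAM: ">"}, count=10, block=5000)
    for _stream, messages in entries:
        for msg_id, fields in messages:
            handle_event(decode_fields(fields))
            r.xack(STREAM, GROUP, msg_id)  # ack only after successful processing

# Producer side, with capped retention instead of unbounded growth:
# r.xadd(STREAM, {"order_id": "42", "type": "created"},
#        maxlen=100_000, approximate=True)
```

Unacked messages stay in the group's pending entries list, so a crashed consumer's work can be inspected (XPENDING) and reassigned (XCLAIM / XAUTOCLAIM), which is Redis Streams' answer to redelivery.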
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Kafka retains messages for days or weeks. SQS retains for up to 14 days. What capability does retention enable that a traditional queue does not?
- You have a Kafka consumer processing payment events. The consumer processes the message but crashes before committing the offset. What happens on restart? Is this safe?
- What is a Dead Letter Queue? When should a message be sent to the DLQ instead of retried? How do you handle DLQ messages after fixing the underlying bug?
- Your order processing pipeline has 5 consumers in a Kafka consumer group reading from a topic with 3 partitions. How many consumers are actually working at any given time? How many are idle?
- You need ordering guarantees for all events related to the same order. How do you achieve this in Kafka? In SQS?
Next up — Part 26: Load Balancing Strategies. L4 vs L7, consistent hashing, health checks, and sticky session trade-offs.