Staff Prep 25: Message Queues — Kafka vs SQS vs Redis Streams

April 4, 2026 · 10 min read · Part 03 / 06

Back to Part 24: Redis Patterns. Message queues decouple producers from consumers and enable async processing at scale. Kafka, SQS, and Redis Streams solve the same fundamental problem but with very different trade-offs in durability, throughput, operational complexity, and replay capability. Choosing wrong is expensive to undo.

Core concepts: topics, partitions, consumer groups

Topic: A named stream of messages. Producers write to topics; consumers read from topics.

Partition: A topic is split into N partitions. Each partition is an ordered, append-only log. Partitions enable parallelism: each consumer in a group is assigned one or more partitions exclusively.

Consumer Group: Multiple consumers sharing the work of processing a topic. Each message is delivered to exactly one consumer in the group. Different groups each receive all messages independently.
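
The partition/consumer-group relationship can be sketched with a toy assignment function (illustrative only — Kafka's real assignors are pluggable strategies like range, round-robin, and cooperative-sticky, negotiated by the group coordinator):

```python
# Toy round-robin partition assignment (illustrative; Kafka's actual
# assignors are pluggable: range, round-robin, cooperative-sticky).
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Each partition goes to exactly one consumer in the group."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 3 partitions, 5 consumers: two consumers get nothing and sit idle.
print(assign_partitions(3, ["c1", "c2", "c3", "c4", "c5"]))
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': [], 'c5': []}
```

Note the consequence: a consumer group can never usefully have more active consumers than the topic has partitions.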

Kafka: high-throughput, durable, replayable

python
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer, TopicPartition
import asyncio
import json

# Producer
async def produce_order_event(order_id: int, event_type: str, data: dict):
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
    )
    await producer.start()
    try:
        # Partition by order_id for ordering guarantees within an order
        await producer.send_and_wait(
            "order-events",
            value={"order_id": order_id, "type": event_type, "data": data},
            key=str(order_id).encode(),  # same key = same partition = ordered
        )
    finally:
        await producer.stop()

# Consumer with manual offset management
async def consume_order_events():
    consumer = AIOKafkaConsumer(
        "order-events",
        bootstrap_servers="kafka:9092",
        group_id="order-processor",
        auto_offset_reset="earliest",
        enable_auto_commit=False,  # manual commit gives at-least-once delivery
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    await consumer.start()
    try:
        async for message in consumer:
            try:
                # process_order_event and ProcessingError are application code
                await process_order_event(message.value)
                await consumer.commit()  # commit AFTER processing (at-least-once)
            except ProcessingError:
                # Don't commit — rewind to the failed offset so it is redelivered.
                # (Silently continuing would let a later commit skip this message.)
                consumer.seek(
                    TopicPartition(message.topic, message.partition),
                    message.offset,
                )
    finally:
        await consumer.stop()
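
The `key=` argument above is what pins all of an order's events to one partition. A minimal stand-in for the partitioner (Kafka's default actually hashes keys with murmur2; any stable hash shows the idea) makes the guarantee concrete:

```python
# Stand-in for Kafka's partitioner (the real default hashes keys with
# murmur2); any stable hash gives the same-key -> same-partition property.
def partition_for(key: bytes, num_partitions: int) -> int:
    h = 0
    for b in key:
        h = (h * 31 + b) & 0x7FFFFFFF  # simple stable hash over the key bytes
    return h % num_partitions

# Every event keyed by order 42 maps to the same partition, so its events
# stay ordered relative to each other (no ordering ACROSS orders):
assert partition_for(b"42", 3) == partition_for(b"42", 3)
assert 0 <= partition_for(b"42", 3) < 3
```

One caveat worth knowing: changing the partition count changes the key-to-partition mapping, which is why resizing a keyed topic breaks ordering assumptions.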

Kafka delivery semantics

python
from aiokafka import AIOKafkaProducer

# At-most-once: commit before processing (message may be lost on crash)
async def at_most_once(consumer, message):
    await consumer.commit()  # commit first
    await process(message)   # if this crashes, message is lost

# At-least-once: commit after processing (message may be processed twice)
async def at_least_once(consumer, message):
    await process(message)   # process first
    await consumer.commit()  # if crash between process and commit: redelivered

# Exactly-once: producer transactions + an idempotent consumer
async def exactly_once_produce(processed_data: bytes):
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",
        transactional_id="my-producer-1",  # enables transactions
        enable_idempotence=True,           # broker dedupes producer retries
    )
    await producer.start()
    try:
        async with producer.transaction():
            await producer.send("output-topic", value=processed_data)
            # The write is visible only if the transaction commits. In a
            # consume-process-produce loop, send the consumed offsets to the
            # same transaction so input and output commit atomically.
            # Consumers must use isolation_level="read_committed" to skip
            # uncommitted/aborted messages.
    finally:
        await producer.stop()
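
Transactions cover the produce side; the consumer side of "exactly-once" is idempotency. A minimal sketch of an idempotent handler, with an in-memory set standing in for what would be a Redis or database dedup store in production:

```python
# Minimal idempotent-consumer sketch. In production the seen-set lives in
# Redis or the database, updated in the same transaction as the side effect;
# a plain set() just shows the shape.
processed_keys: set[str] = set()

def handle_once(event: dict) -> bool:
    """Apply the event's side effect once; return False on duplicates."""
    key = event["idempotency_key"]
    if key in processed_keys:
        return False               # redelivered duplicate: skip the side effect
    # ... perform the side effect here (charge card, update order) ...
    processed_keys.add(key)
    return True

assert handle_once({"idempotency_key": "evt-1"}) is True
assert handle_once({"idempotency_key": "evt-1"}) is False  # redelivery is a no-op
```

With an idempotent handler, plain at-least-once delivery already behaves like exactly-once processing from the application's point of view.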

SQS: fully managed, simpler, limited

python
import boto3
import json

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events.fifo"  # FIFO features need the .fifo suffix
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events-dlq.fifo"

# Producer
def publish_event(event: dict):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=str(event["order_id"]),  # FIFO queue: ordering per group
        MessageDeduplicationId=event["idempotency_key"],  # FIFO: dedupes sends within a 5-minute window
    )

# Consumer
def consume_messages(max_messages: int = 10):
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=20,  # long polling (reduces empty receives)
        VisibilityTimeout=60,  # message hidden from other consumers for 60s
    )

    for message in response.get("Messages", []):
        try:
            body = json.loads(message["Body"])
            process_event(body)
            # Delete after successful processing
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"]
            )
        except Exception as e:
            # Do NOT delete — message becomes visible again after VisibilityTimeout
            # After maxReceiveCount failures, SQS moves to DLQ automatically
            print(f"Processing failed: {e}")
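
The visibility-timeout mechanics are worth internalizing; a toy in-memory model (purely illustrative, not boto3) makes the at-least-once behavior concrete:

```python
import time

# Toy in-memory model of SQS visibility timeout (illustrative, not boto3).
class ToyQueue:
    def __init__(self) -> None:
        self._messages: dict[int, tuple[str, float]] = {}  # id -> (body, visible_at)
        self._next_id = 0

    def send(self, body: str) -> None:
        self._messages[self._next_id] = (body, 0.0)  # visible immediately
        self._next_id += 1

    def receive(self, visibility_timeout: float = 60.0):
        now = time.monotonic()
        for mid, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                # Hide, don't delete: if the consumer crashes, the message
                # reappears after the timeout (at-least-once delivery).
                self._messages[mid] = (body, now + visibility_timeout)
                return mid, body
        return None

    def delete(self, mid: int) -> None:
        self._messages.pop(mid, None)  # the explicit "ack"

q = ToyQueue()
q.send("order-1")
mid, body = q.receive(visibility_timeout=60)
assert q.receive() is None   # hidden while the first consumer works on it
q.delete(mid)                # processed successfully: gone for good
```

This is also why slow consumers must either finish within the visibility timeout or extend it; otherwise a second consumer receives the same message mid-flight.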

Dead letter queues: handling persistent failures

python
import boto3
import json

sqs = boto3.client("sqs")

# Configure DLQ: after 5 failed processing attempts, move to DLQ
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789:order-events-dlq.fifo",  # a FIFO queue's DLQ must also be FIFO
            "maxReceiveCount": "5",
        })
    }
)

# Monitor DLQ for alerts
def get_dlq_depth() -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=DLQ_URL,
        AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# Replay DLQ messages (after fixing the bug)
def redrive_dlq():
    sqs.start_message_move_task(
        SourceArn="arn:aws:sqs:us-east-1:123456789:order-events-dlq.fifo",
        DestinationArn="arn:aws:sqs:us-east-1:123456789:order-events.fifo",
        MaxNumberOfMessagesPerSecond=10,
    )

Comparison: Kafka vs SQS vs Redis Streams

Kafka: 1M+ msg/s, message retention (replay), consumer groups, complex to operate, great for event sourcing and analytics pipelines.

SQS: effectively unlimited throughput on standard queues (FIFO is capped at 300 msg/s, 3,000 with batching, higher in high-throughput mode), fully managed, no replay, 14-day max retention, dead letter queues built-in, simplest operations.

Redis Streams: ~100k msg/s, configurable retention, consumer groups, light operational overhead if Redis is already deployed. Best for moderate-throughput use cases.
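
The trade-offs above can be collapsed into a toy decision helper. The thresholds are rough assumptions for illustration, not hard vendor limits:

```python
# Toy decision helper encoding the comparison above. Thresholds are rough
# assumptions for illustration, not hard vendor limits.
def pick_queue(needs_replay: bool, peak_msgs_per_sec: int, managed_only: bool) -> str:
    if needs_replay and peak_msgs_per_sec > 100_000:
        return "kafka"          # retention + very high throughput
    if needs_replay:
        return "redis-streams"  # replay at moderate scale, light ops
    if managed_only:
        return "sqs"            # fully managed, DLQs built in, no replay
    return "redis-streams"

assert pick_queue(True, 500_000, False) == "kafka"
assert pick_queue(False, 1_000, True) == "sqs"
assert pick_queue(True, 50_000, False) == "redis-streams"
```

In interviews, walking through a function like this out loud (replay? throughput? ops budget?) is usually more persuasive than naming a product first.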

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

  1. Kafka retains messages for days or weeks. SQS retains for up to 14 days. What capability does retention enable that a traditional queue does not?
  2. You have a Kafka consumer processing payment events. The consumer processes the message but crashes before committing the offset. What happens on restart? Is this safe?
  3. What is a Dead Letter Queue? When should a message be sent to the DLQ instead of retried? How do you handle DLQ messages after fixing the underlying bug?
  4. Your order processing pipeline has 5 consumers in a Kafka consumer group reading from a topic with 3 partitions. How many consumers are actually working at any given time? How many are idle?
  5. You need ordering guarantees for all events related to the same order. How do you achieve this in Kafka? In SQS?

Next up — Part 26: Load Balancing Strategies. L4 vs L7, consistent hashing, health checks, and sticky session trade-offs.
