Staff Prep 25: Message Queues — Kafka vs SQS vs Redis Streams
Back to Part 24: Redis Patterns.
Message queues decouple producers from consumers and enable async processing at scale. Kafka, SQS, and Redis Streams solve the same fundamental problem but with very different trade-offs in durability, throughput, operational complexity, and replay capability. Choosing wrong is expensive to undo.
Core concepts: topics, partitions, consumer groups
Topic: A named stream of messages. Producers write to topics; consumers read from topics.
Partition: A topic is split into N partitions. Each partition is an ordered, append-only log. Partitions enable parallelism: each consumer in a group is assigned one or more partitions exclusively.
Consumer Group: Multiple consumers sharing the work of processing a topic. Each message is delivered to exactly one consumer in the group. Different groups each receive all messages independently.
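These mechanics can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Kafka's actual implementation: the real default partitioner hashes keys with murmur2, and group assignment is negotiated by a coordinator, but the invariants (same key, same partition; each partition owned by exactly one consumer in a group) are the same.

```python
def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner:
    # the same key always maps to the same partition
    return sum(key) % num_partitions

def assign_partitions(partitions: list[int], consumers: list[str]) -> dict:
    # Each partition is owned by exactly one consumer in the group;
    # consumers beyond the partition count receive nothing
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions shared by a group of 5 consumers
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"]))
```

Note the consequence: with 3 partitions and 5 consumers in one group, only 3 consumers do work; the other 2 sit idle as hot spares. Partition count caps consumer parallelism.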
Kafka: high-throughput, durable, replayable
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer
import asyncio
import json
# Producer (a producer is created per call here for brevity;
# in production, create one long-lived producer and reuse it)
async def produce_order_event(order_id: int, event_type: str, data: dict):
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
    )
    await producer.start()
    try:
        # Partition by order_id for ordering guarantees within an order
        await producer.send_and_wait(
            "order-events",
            value={"order_id": order_id, "type": event_type, "data": data},
            key=str(order_id).encode(),  # same key = same partition = ordered
        )
    finally:
        await producer.stop()
# Consumer with manual offset management
async def consume_order_events():
    consumer = AIOKafkaConsumer(
        "order-events",
        bootstrap_servers="kafka:9092",
        group_id="order-processor",
        auto_offset_reset="earliest",
        enable_auto_commit=False,  # manual commit => at-least-once processing
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    await consumer.start()
    try:
        async for message in consumer:
            try:
                await process_order_event(message.value)
                await consumer.commit()  # commit AFTER processing (at-least-once)
            except ProcessingError:
                # Don't commit: the message is redelivered after a restart
                # or rebalance (the in-memory position still advances, so
                # it is NOT retried immediately within this loop)
                pass
    finally:
        await consumer.stop()
Kafka delivery semantics
from aiokafka import AIOKafkaProducer

# At-most-once: commit before processing (message may be lost on crash)
async def at_most_once(consumer, message):
    await consumer.commit()  # commit first
    await process(message)   # if this crashes, the message is lost

# At-least-once: commit after processing (message may be processed twice)
async def at_least_once(consumer, message):
    await process(message)   # process first
    await consumer.commit()  # crash between process and commit: redelivered

# Exactly-once: producer transactions + idempotent consumer
async def exactly_once_produce(processed_data):
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",
        transactional_id="my-producer-1",  # enables transactions
        enable_idempotence=True,           # deduplication at producer level
    )
    await producer.start()
    async with producer.transaction():
        await producer.send("output-topic", value=processed_data)
        # Only visible to downstream consumers if the transaction commits

# Consumers must use isolation_level="read_committed" to see only committed messages
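In practice, "exactly-once" across arbitrary downstream systems means at-least-once delivery plus an idempotent consumer. A minimal sketch, assuming each event carries an idempotency_key as in the SQS example below; the in-memory set is a stand-in, and a real system would use a database unique constraint or a Redis SETNX key instead:

```python
processed_keys: set[str] = set()  # stand-in; use a DB unique constraint in production

def handle_once(event: dict) -> bool:
    """Apply an event's side effects at most once per idempotency key.

    Returns True if processed, False if the event was a duplicate redelivery.
    """
    key = event["idempotency_key"]
    if key in processed_keys:
        return False  # duplicate from a redelivery: safely skipped
    # ... real side effects (DB writes, emails) would go here ...
    processed_keys.add(key)
    return True
```

With this in place, at-least-once redelivery becomes harmless: the second delivery is detected and dropped, which is what "exactly-once processing" usually means end to end.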
SQS: fully managed, simpler, limited
import boto3
import json
from typing import Optional
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events"
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events-dlq"
# Producer (FIFO queue: the queue name must end in .fifo, otherwise
# MessageGroupId / MessageDeduplicationId are rejected)
def publish_event(event: dict):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=str(event["order_id"]),  # FIFO: ordering per group
        MessageDeduplicationId=event["idempotency_key"],  # FIFO dedup (5-minute window)
    )
# Consumer
def consume_messages(max_messages: int = 10):
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=20,    # long polling (reduces empty receives)
        VisibilityTimeout=60,  # message hidden from other consumers for 60s
    )
    for message in response.get("Messages", []):
        try:
            body = json.loads(message["Body"])
            process_event(body)
            # Delete after successful processing
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"],
            )
        except Exception as e:
            # Do NOT delete: the message becomes visible again after VisibilityTimeout.
            # After maxReceiveCount failures, SQS moves it to the DLQ automatically.
            print(f"Processing failed: {e}")
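Deleting messages one at a time costs one API call each. SQS's delete_message_batch removes up to 10 per call, which matches the receive batch size above. A sketch (build_delete_entries is our helper name, not a boto3 API; the client and queue URL are passed in):

```python
def build_delete_entries(messages: list) -> list:
    # SQS batch entries need a unique Id per entry; max 10 entries per call
    return [
        {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
        for i, m in enumerate(messages[:10])
    ]

def delete_batch(sqs_client, queue_url: str, messages: list) -> list:
    resp = sqs_client.delete_message_batch(
        QueueUrl=queue_url,
        Entries=build_delete_entries(messages),
    )
    # Entries that failed to delete simply become visible again
    # after the visibility timeout, so failures here are recoverable
    return resp.get("Failed", [])
```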
Dead letter queues: handling persistent failures
import boto3
import json

sqs = boto3.client("sqs")

# Configure DLQ: after 5 failed processing attempts, move to DLQ
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789:order-events-dlq",
            "maxReceiveCount": "5",
        })
    },
)
# Monitor DLQ depth for alerting
def get_dlq_depth() -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=DLQ_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# Replay DLQ messages (after fixing the bug)
def redrive_dlq():
    sqs.start_message_move_task(
        SourceArn="arn:aws:sqs:us-east-1:123456789:order-events-dlq",
        DestinationArn="arn:aws:sqs:us-east-1:123456789:order-events",
        MaxNumberOfMessagesPerSecond=10,
    )
Comparison: Kafka vs SQS vs Redis streams
Kafka: 1M+ msg/s, message retention (replay), consumer groups, complex to operate, great for event sourcing and analytics pipelines.
SQS: nearly unlimited throughput on standard queues; FIFO queues are capped at roughly 300 msg/s per queue (3,000 with batching, more in high-throughput mode). Fully managed, no replay, 14-day max retention, dead letter queues built in, simplest operations.
Redis Streams: ~100k msg/s, configurable retention, consumer groups, light operational overhead if Redis is already deployed. Best for moderate-throughput use cases.
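For completeness, the consumer-group pattern in Redis Streams uses XADD / XREADGROUP / XACK. A sketch with redis-py: the client r and the handle_event callback are passed in, and the stream and group names are illustrative. decode_fields is our helper, needed because redis-py returns stream fields as bytes.

```python
STREAM = "order-events"
GROUP = "order-processor"

def decode_fields(fields: dict) -> dict:
    # redis-py returns bytes keys and values from streams; decode for use
    return {k.decode(): v.decode() for k, v in fields.items()}

def consume(r, consumer_name: str, handle_event):
    # r is a redis.Redis client; create the consumer group once
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception:
        pass  # BUSYGROUP: the group already exists

    # ">" means: only messages never delivered to this group
    entries = r.xreadgroup(GROUP, consumer_name, {STREAM: ">"}, count=10, block=5000)
    for _stream, messages in entries:
        for msg_id, fields in messages:
            handle_event(decode_fields(fields))
            r.xack(STREAM, GROUP, msg_id)  # ack only after successful processing

# Producer side, with capped retention instead of unbounded growth:
# r.xadd(STREAM, {"order_id": "42", "type": "created"},
#        maxlen=100_000, approximate=True)
```

Unacked messages stay in the group's pending entries list, so a crashed consumer's work can be inspected (XPENDING) and reassigned (XCLAIM / XAUTOCLAIM), which is Redis Streams' answer to redelivery.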
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Kafka retains messages for days or weeks. SQS retains for up to 14 days. What capability does retention enable that a traditional queue does not?
- You have a Kafka consumer processing payment events. The consumer processes the message but crashes before committing the offset. What happens on restart? Is this safe?
- What is a Dead Letter Queue? When should a message be sent to the DLQ instead of retried? How do you handle DLQ messages after fixing the underlying bug?
- Your order processing pipeline has 5 consumers in a Kafka consumer group reading from a topic with 3 partitions. How many consumers are actually working at any given time? How many are idle?
- You need ordering guarantees for all events related to the same order. How do you achieve this in Kafka? In SQS?
Next up — Part 26: Load Balancing Strategies. L4 vs L7, consistent hashing, health checks, and sticky session trade-offs.