Skip to content

Cloud & Infrastructure · Architecture

Event-Driven Architecture in Practice: Kafka, Redis Streams, and When Each Fits

Message queues and event streams solve different problems. Kafka is not always the right answer. Here's how to think through event-driven patterns and choose the right tool for your production workload.

Anurag Verma

Anurag Verma

9 min read

Event-Driven Architecture in Practice: Kafka, Redis Streams, and When Each Fits

Sponsored

Share

Most teams reach for Kafka the moment they hear “event-driven architecture.” Kafka is the name they know. But Kafka is a distributed streaming platform designed for high-throughput, durable, replayable event streams. If you’re running a few thousand events per day and need background job processing, you’ve added a Zookeeper dependency (or KRaft), multiple brokers, consumer groups, and offset management to solve a problem that could have been handled with Redis Streams and BullMQ.

Choosing the wrong tool doesn’t just mean extra configuration. It means operational complexity that doesn’t pay off at your actual scale.

What Event-Driven Architecture Actually Solves

The synchronous request-response model is fine until it isn’t. When an API endpoint needs to:

  • Send a welcome email
  • Trigger a Slack notification
  • Kick off a billing job
  • Sync data to a CRM

…doing all of that in the request handler means slow responses, tight coupling, and a failure in the Slack API taking down your signup flow.

Event-driven architecture decouples the action from the consequence. The signup handler publishes an event (user.created). Subscribers (email service, billing service, CRM sync) handle it independently, asynchronously, and in isolation. If the CRM is down, the event waits in the queue rather than failing the signup.

That decoupling is the core value. Everything else (replay, fan-out, consumer groups) is built on top of it.

Events vs Commands vs Messages

These three terms are often used interchangeably but mean different things in practice.

A command is a directive: “SendWelcomeEmail” to user 42. It has one intended recipient and implies the sender cares about the outcome. Commands are point-to-point.

An event is a fact: “UserCreated” with user ID 42 at timestamp T. The publisher doesn’t know or care who consumes it. Multiple services can react independently. Events are broadcast.

A message is the envelope. Both commands and events are messages; the terminology describes the semantic, not the transport format.

In a well-designed system, services publish events and subscribe to events they care about. The publisher doesn’t maintain a list of subscribers. Adding a new subscriber to user.created doesn’t change the signup service at all.

When Not to Use Event-Driven

Before choosing tools, check whether you need event-driven at all.

Event-driven introduces asynchrony. Asynchrony introduces eventual consistency. Eventual consistency makes debugging harder, tracing harder, and user-facing error handling more complex (“Your email will arrive shortly” is a worse UX than “Email sent”).

If your operations are:

  • Fast (sub-100ms)
  • Strongly sequential (B must happen before C, A before B)
  • Low volume (hundreds per day, not thousands per second)
  • Needing immediate confirmation to the user

…synchronous request-response is simpler and often better. Don’t add async infrastructure to solve a problem that doesn’t exist yet.

Kafka: When You Actually Need It

Kafka is the right tool when you need:

  • High throughput: Millions of events per second across a distributed cluster
  • Durable, replayable streams: Events are stored on disk and can be replayed by any consumer at any time, even days later
  • Multiple independent consumers: Different services consuming the same event stream at different rates and offsets
  • Long retention: Event history as a system of record, not just a queue that empties

A typical Kafka producer in Python:

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "acks": "all",                    # wait for all replicas
    "enable.idempotence": True,       # prevent duplicate messages
})

def publish_event(topic: str, key: str, payload: dict):
    producer.produce(
        topic=topic,
        key=key.encode("utf-8"),
        value=json.dumps(payload).encode("utf-8"),
        callback=delivery_callback,
    )
    producer.flush()

def delivery_callback(err, msg):
    if err:
        logger.error(f"Event delivery failed: {err}")
    else:
        logger.debug(f"Event delivered: {msg.topic()} partition {msg.partition()}")

And a consumer:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "email-service",
    "auto.offset.reset": "earliest",    # start from beginning if no offset stored
    "enable.auto.commit": False,         # commit manually after processing
})

consumer.subscribe(["user.created"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        logger.error(f"Consumer error: {msg.error()}")
        continue
    
    event = json.loads(msg.value())
    try:
        send_welcome_email(event["user_id"])
        consumer.commit(msg)            # only commit on success
    except Exception as e:
        logger.error(f"Failed to process event: {e}")
        # Don't commit — message will be redelivered

The consumer group (group.id) determines how Kafka distributes partitions. All instances of email-service share a group ID, so each event is processed by one instance. A separate group ID for billing-service means billing consumes every event independently.

Kafka’s operational cost is real. You need to manage:

  • Broker replication and leader election
  • Partition count (set too low and you can’t scale; changing it later is disruptive)
  • Consumer group lag monitoring
  • Topic retention policies
  • Schema compatibility (Avro + Schema Registry, or JSON with version fields)

Confluent Cloud and AWS MSK take the infrastructure overhead off your plate but add cost. For teams that genuinely need Kafka’s capabilities, this is worth it. For teams that don’t, managed Kafka is an expensive solution to a problem they don’t have.

Redis Streams: The Middle Ground

Redis Streams, added in Redis 5.0, give you a durable, consumer-group-aware event stream without Kafka’s operational complexity. If you’re already using Redis (and most teams are), you’re one command away from a production-grade event system.

Publishing to a stream:

import redis

r = redis.Redis(host="localhost", port=6379)

# XADD appends to the stream, Redis auto-generates an ID
event_id = r.xadd(
    "user.created",
    {
        "user_id": "42",
        "email": "user@example.com",
        "timestamp": "2026-05-15T09:00:00Z",
    }
)

Consumer group setup and reading:

# Create consumer group (run once)
try:
    r.xgroup_create("user.created", "email-service", id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass  # Group already exists

# Read from consumer group
def consume_events(worker_id: str):
    while True:
        messages = r.xreadgroup(
            groupname="email-service",
            consumername=f"worker-{worker_id}",
            streams={"user.created": ">"},  # ">" means undelivered messages
            count=10,
            block=5000,  # block for 5s if no messages
        )
        
        for stream, events in (messages or []):
            for event_id, fields in events:
                try:
                    process_event(fields)
                    r.xack("user.created", "email-service", event_id)
                except Exception as e:
                    logger.error(f"Failed: {e}")
                    # Don't ACK — it stays in the pending list

Redis Streams supports:

  • Consumer groups (just like Kafka)
  • Message acknowledgment and pending message tracking
  • Dead letter pattern via XPENDING and XCLAIM (reclaim unacknowledged messages after a timeout)
  • Stream trimming by message count or time (MAXLEN)

What it doesn’t support:

  • Cross-datacenter replication without external tooling
  • Retention measured in days at petabyte scale
  • Built-in schema registry

For most services that need async event processing within a single region, Redis Streams is the pragmatic choice. It’s dramatically simpler to operate than Kafka, and the feature set covers 90% of use cases.

BullMQ: Background Jobs Without the Ceremony

BullMQ (Node.js) and Celery (Python) are job queue libraries built on Redis. They’re not a streaming platform; they’re for discrete background jobs with retries, delays, and priorities.

import { Queue, Worker } from "bullmq";

const emailQueue = new Queue("email", {
  connection: { host: "localhost", port: 6379 },
});

// Enqueue a job
await emailQueue.add("welcome-email", {
  userId: 42,
  email: "user@example.com",
}, {
  attempts: 3,
  backoff: { type: "exponential", delay: 2000 },
  delay: 0,
});

// Worker processes jobs
const worker = new Worker("email", async (job) => {
  await sendWelcomeEmail(job.data.userId, job.data.email);
}, {
  connection: { host: "localhost", port: 6379 },
  concurrency: 10,
});

BullMQ handles retries with backoff, delayed jobs, job prioritization, rate limiting, and job result storage. For the common case of “run this task in the background and retry on failure,” it’s the least-friction option if you’re on Node.js. Celery is the Python equivalent with broader broker support (Redis, RabbitMQ, SQS).

Choosing Between Them

RequirementRedis Streams / BullMQKafka
< 10k events/secGood fitOverkill
> 500k events/secStarts to strainRight tool
Event replay (days)Limited by memoryNative
Multiple independent consumersYesYes
Already using RedisYes, no new infraNo
Cross-datacenter streamsNeeds setupMirrorMaker
Schema evolutionManualSchema Registry
Ops team has Kafka experienceMaybeYes

The honest version: most product teams should start with Redis Streams or BullMQ. If you hit Redis’s limits (millions of events per second, multi-datacenter durability requirements, or long-term event replay across many consumer groups), then Kafka is justified. Starting there is premature optimization with a high operational price.

Patterns Worth Knowing

Fan-out: One event, many consumers. Both Kafka (consumer groups) and Redis Streams (consumer groups) handle this natively. Publish once, each subscriber gets a copy at its own pace.

Saga (distributed transaction): A sequence of events that represents a multi-step workflow. Order placed → payment initiated → inventory reserved → fulfillment triggered. Each step is its own event. If inventory fails, the saga publishes a compensation event (inventory.reserve.failed) that triggers upstream rollback.

Transactional Outbox: The hardest part of event publishing is guaranteeing that writing to the database and publishing an event happen atomically. If the database write succeeds but the Kafka publish fails (or vice versa), you have inconsistent state.

The outbox pattern writes the event to a database table in the same transaction as the business data write. A separate process reads the outbox table and publishes events, then marks them as published. This guarantees at-least-once delivery with no risk of data/event mismatch:

-- In the same transaction as the user insert
INSERT INTO outbox (event_type, payload, published)
VALUES ('user.created', '{"user_id": 42}', false);

Observability for Async Systems

Debugging synchronous systems is hard enough. Async systems are harder because failures happen in a different process than the one that triggered the work.

Track these at minimum:

  • Consumer lag: How far behind is each consumer group? High lag means consumers can’t keep up with producers.
  • Processing latency: Time from event publish to processing completion.
  • Dead letter queue depth: Unprocessable events that need human attention.
  • Retry count distribution: If many events are hitting max retries, there’s a systemic problem.

Tools like Kafka’s JMX metrics, Redis’s XPENDING, and BullMQ’s built-in dashboard surface these. Connect them to your alerting before the queue fills up, not after.

The goal isn’t to add event-driven infrastructure for its own sake. It’s to decouple operations that don’t need to be synchronous, reduce blast radius when services fail, and make systems that scale without rewriting. Pick the simplest tool that achieves that, and add complexity only when the requirements demand it.

Sponsored

Enjoyed it? Pass it on.

Share this article.

Sponsored

The dispatch

Working notes from
the studio.

A short letter twice a month — what we shipped, what broke, and the AI tools earning their keep.

No spam, ever. Unsubscribe anytime.

Discussion

Join the conversation.

Comments are powered by GitHub Discussions. Sign in with your GitHub account to leave a comment.

Sponsored