Event-Driven Architecture in Practice: Kafka, Redis Streams, and When Each Fits

Most teams reach for Kafka the moment they hear “event-driven architecture.” Kafka is the name they know. But Kafka is a distributed streaming platform designed for high-throughput, durable, replayable event streams. If you’re running a few thousand events per day and need background job processing, you’ve added a ZooKeeper dependency (or KRaft), multiple brokers, consumer groups, and offset management to solve a problem that Redis Streams or BullMQ could have handled.

Choosing the wrong tool doesn’t just mean extra configuration. It means operational complexity that doesn’t pay off at your actual scale.

What Event-Driven Architecture Actually Solves

The synchronous request-response model is fine until it isn’t. When an API endpoint needs to:

Send a welcome email
Trigger a Slack notification
Kick off a billing job
Sync data to a CRM

…doing all of that in the request handler means slow responses, tight coupling, and a failure in the Slack API taking down your signup flow.

Event-driven architecture decouples the action from the consequence. The signup handler publishes an event (user.created). Subscribers (email service, billing service, CRM sync) handle it independently, asynchronously, and in isolation. If the CRM is down, the event waits in the queue rather than failing the signup.

That decoupling is the core value. Everything else (replay, fan-out, consumer groups) is built on top of it.
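
To make the isolation concrete, here is a minimal in-process sketch. The bus, the handler names, and the `signup` function are all hypothetical, and the handlers run inline rather than asynchronously, so it shows only the failure isolation; a real broker adds the queue that lets a failed event wait and be retried.

```python
import logging
from collections import defaultdict
from typing import Callable

logger = logging.getLogger("events")

# Hypothetical in-process bus; a real system would put a broker here.
_subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)


def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    _subscribers[event_type].append(handler)


def publish(event_type: str, payload: dict) -> list[str]:
    """Deliver to every subscriber; return the names of handlers that failed."""
    failed = []
    for handler in _subscribers[event_type]:
        try:
            handler(payload)
        except Exception:
            # One failing subscriber must not affect the others or the publisher.
            logger.exception("handler %s failed", handler.__name__)
            failed.append(handler.__name__)
    return failed


sent_emails: list[int] = []


def send_welcome_email(event: dict) -> None:
    sent_emails.append(event["user_id"])


def sync_to_crm(event: dict) -> None:
    raise ConnectionError("CRM is down")  # simulated outage


def signup(user_id: int) -> dict:
    # The handler only records the fact; it knows nothing about subscribers.
    failed = publish("user.created", {"user_id": user_id})
    return {"status": "created", "user_id": user_id, "failed_handlers": failed}


subscribe("user.created", send_welcome_email)
subscribe("user.created", sync_to_crm)
```

Calling `signup(42)` still returns a created user and still sends the email, even though the CRM handler raises.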

Events vs Commands vs Messages

These three terms are often used interchangeably but mean different things in practice.

A command is a directive: “SendWelcomeEmail” to user 42. It has one intended recipient and implies the sender cares about the outcome. Commands are point-to-point.

An event is a fact: “UserCreated” with user ID 42 at timestamp T. The publisher doesn’t know or care who consumes it. Multiple services can react independently. Events are broadcast.

A message is the envelope. Both commands and events are messages; the terminology describes the semantic, not the transport format.

In a well-designed system, services publish events and subscribe to events they care about. The publisher doesn’t maintain a list of subscribers. Adding a new subscriber to user.created doesn’t change the signup service at all.
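
One way to picture the distinction is a shared envelope type with a `kind` field. This is only a sketch: the `Message` class, its field names, and the `email-service` target are illustrative assumptions, not a standard format.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Message:
    """The envelope: the same shape carries both commands and events."""
    kind: str                     # "event" or "command"
    name: str
    payload: dict
    target: Optional[str] = None  # commands name a recipient; events do not
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(default_factory=_now)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def event(name: str, payload: dict) -> Message:
    # A fact, broadcast to whoever subscribes.
    return Message(kind="event", name=name, payload=payload)


def command(name: str, target: str, payload: dict) -> Message:
    # A directive, addressed to exactly one recipient.
    return Message(kind="command", name=name, payload=payload, target=target)


user_created = event("UserCreated", {"user_id": 42})
send_email = command("SendWelcomeEmail", "email-service", {"user_id": 42})
```

The only structural difference is the `target` field: the event has none, because the publisher does not know who is listening.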

When Not to Use Event-Driven

Before choosing tools, check whether you need event-driven at all.

Event-driven introduces asynchrony. Asynchrony introduces eventual consistency. Eventual consistency makes debugging harder, tracing harder, and user-facing error handling more complex (“Your email will arrive shortly” is a worse UX than “Email sent”).

If your operations are:

Fast (sub-100ms)
Strongly sequential (B must happen before C, A before B)
Low volume (hundreds per day, not thousands per second)
Needing immediate confirmation to the user

…synchronous request-response is simpler and often better. Don’t add async infrastructure to solve a problem that doesn’t exist yet.

Kafka: When You Actually Need It

Kafka is the right tool when you need:

High throughput: Millions of events per second across a distributed cluster
Durable, replayable streams: Events are stored on disk and can be replayed by any consumer at any time, even days later
Multiple independent consumers: Different services consuming the same event stream at different rates and offsets
Long retention: Event history as a system of record, not just a queue that empties

A typical Kafka producer in Python:

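import json
import logging

from confluent_kafka import Producer

logger = logging.getLogger(__name__)

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "acks": "all",                    # wait for all in-sync replicas
    "enable.idempotence": True,       # prevent duplicates caused by producer retries
})

def delivery_callback(err, msg):
    # Called from flush() or poll() once the broker confirms or rejects the message
    if err:
        logger.error(f"Event delivery failed: {err}")
    else:
        logger.debug(f"Event delivered: {msg.topic()} partition {msg.partition()}")

def publish_event(topic: str, key: str, payload: dict):
    producer.produce(
        topic=topic,
        key=key.encode("utf-8"),
        value=json.dumps(payload).encode("utf-8"),
        callback=delivery_callback,
    )
    # flush() blocks until delivery completes. That is simple and safe at low volume;
    # for high throughput, call producer.poll(0) here and flush() once at shutdown.
    producer.flush()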

And a consumer:

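import json
import logging

from confluent_kafka import Consumer, TopicPartition

logger = logging.getLogger(__name__)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "email-service",
    "auto.offset.reset": "earliest",    # start from beginning if no offset stored
    "enable.auto.commit": False,         # commit manually after processing
})

consumer.subscribe(["user.created"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            logger.error(f"Consumer error: {msg.error()}")
            continue

        event = json.loads(msg.value())
        try:
            send_welcome_email(event["user_id"])
            consumer.commit(message=msg, asynchronous=False)  # only commit on success
        except Exception as e:
            logger.error(f"Failed to process event: {e}")
            # Skipping the commit is not enough: poll() would move on, and a later
            # commit would cover this offset too. Rewind so the message is read again.
            # Production code should cap retries or route to a dead letter topic.
            consumer.seek(TopicPartition(msg.topic(), msg.partition(), msg.offset()))
finally:
    consumer.close()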

The consumer group (group.id) determines how Kafka distributes partitions. All instances of email-service share a group ID, so each event is processed by one instance. A separate group ID for billing-service means billing consumes every event independently.
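
The group behaviour described above can be simulated without a broker. The sketch below is a toy model, not Kafka's actual algorithm: the partition count, the CRC32 hash, the round-robin assignment, and the instance names are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count for the topic


def partition_for(key: str) -> int:
    # Stand-in hash for illustration; real clients use their own partitioner.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def assign(partitions, instances: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin across the instances of one group."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[instances[i % len(instances)]].append(partition)
    return dict(assignment)


def deliveries(keys: list[str], groups: dict[str, list[str]]) -> dict:
    """Return {group: {instance: [keys it would process]}}."""
    result = {}
    for group, instances in groups.items():
        owner = {
            partition: instance
            for instance, parts in assign(range(NUM_PARTITIONS), instances).items()
            for partition in parts
        }
        per_instance = defaultdict(list)
        for key in keys:
            per_instance[owner[partition_for(key)]].append(key)
        result[group] = dict(per_instance)
    return result


keys = [f"user-{i}" for i in range(20)]
out = deliveries(keys, {
    "email-service": ["email-1", "email-2"],   # two instances share the work
    "billing-service": ["billing-1"],          # separate group sees everything
})
```

Within `email-service`, each key lands on exactly one instance; `billing-service`, being a different group, receives all twenty keys on its own.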

Kafka’s operational cost is real. You need to manage:

Broker replication and leader election
Partition count (set too low and you can’t scale; increasing it later changes which partition each key maps to, and it can’t be decreased)
Consumer group lag monitoring
Topic retention policies
Schema compatibility (Avro + Schema Registry, or JSON with version fields)

Confluent Cloud and AWS MSK take the infrastructure overhead off your plate but add cost. For teams that genuinely need Kafka’s capabilities, this is worth it. For teams that don’t, managed Kafka is an expensive solution to a problem they don’t have.

Redis Streams: The Middle Ground

Redis Streams, added in Redis 5.0, give you an append-only, consumer-group-aware event stream without Kafka’s operational complexity. How durable it is depends on how Redis persistence (AOF or RDB snapshots) is configured. If you’re already running Redis, as many teams are, you can adopt Streams without deploying any new infrastructure.

Publishing to a stream:

import redis

# decode_responses=True returns str keys and values instead of bytes
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# XADD appends to the stream, Redis auto-generates an ID
event_id = r.xadd(
    "user.created",
    {
        "user_id": "42",
        "email": "user@example.com",
        "timestamp": "2026-05-15T09:00:00Z",
    }
)

Consumer group setup and reading:

import logging

import redis

logger = logging.getLogger(__name__)

# Create consumer group (run once)
try:
    r.xgroup_create("user.created", "email-service", id="0", mkstream=True)
except redis.exceptions.ResponseError as e:
    if "BUSYGROUP" not in str(e):
        raise  # only ignore "group already exists"; surface anything else

# Read from consumer group
def consume_events(worker_id: str):
    while True:
        messages = r.xreadgroup(
            groupname="email-service",
            consumername=f"worker-{worker_id}",
            streams={"user.created": ">"},  # ">" means messages never delivered to this group
            count=10,
            block=5000,  # block for 5s if no messages
        )

        for stream, events in (messages or []):
            for event_id, fields in events:
                try:
                    process_event(fields)
                    r.xack("user.created", "email-service", event_id)
                except Exception as e:
                    logger.error(f"Failed: {e}")
                    # Don't ACK: the entry stays in the pending list until it is
                    # reclaimed (XCLAIM or XAUTOCLAIM) and retried

Redis Streams supports:

Consumer groups (just like Kafka)
Message acknowledgment and pending message tracking
Dead letter pattern via XPENDING and XCLAIM (reclaim unacknowledged messages after a timeout)
Stream trimming by message count (MAXLEN) or by minimum entry ID, which approximates trimming by time (MINID, Redis 6.2+)
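
The dead letter pattern mentioned in the list can be sketched with XPENDING and XCLAIM as follows. This is a sketch under assumptions: `r` is taken to be a redis-py client, and the `.dead` stream suffix, the thresholds, and the `handler` callback are hypothetical choices rather than Redis conventions.

```python
from typing import Callable


def reclaim_stale(r, stream: str, group: str, consumer: str,
                  handler: Callable[[dict], None],
                  min_idle_ms: int = 60_000, max_deliveries: int = 5) -> dict:
    """Retry idle pending entries; park those that exceeded max_deliveries.

    `r` is assumed to be a redis-py client (redis.Redis).
    """
    dead_stream = f"{stream}.dead"  # hypothetical naming convention
    stats = {"retried": 0, "dead_lettered": 0}

    # XPENDING (extended form): one dict per unacknowledged entry
    pending = r.xpending_range(stream, group, min="-", max="+", count=100)
    for entry in pending:
        if entry["time_since_delivered"] < min_idle_ms:
            continue  # another worker may still be processing it

        msg_id = entry["message_id"]
        # XCLAIM transfers ownership only if the entry is still idle long enough
        claimed = r.xclaim(stream, group, consumer,
                           min_idle_time=min_idle_ms, message_ids=[msg_id])
        if not claimed:
            continue  # someone else claimed it first
        _, fields = claimed[0]
        if not fields:
            continue  # entry was trimmed or deleted from the stream

        if entry["times_delivered"] >= max_deliveries:
            # Give up: copy to the dead letter stream and acknowledge the original
            r.xadd(dead_stream, fields)
            r.xack(stream, group, msg_id)
            stats["dead_lettered"] += 1
            continue

        try:
            handler(fields)
        except Exception:
            continue  # stays pending; a later pass will try again
        r.xack(stream, group, msg_id)
        stats["retried"] += 1
    return stats
```

Running this on a timer from each worker keeps the pending list from growing without bound. On Redis 6.2 and later, XAUTOCLAIM combines the scan and the claim into one command.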

What it doesn’t support:

Cross-datacenter replication without external tooling
Retention measured in days at petabyte scale
Built-in schema registry

For most services that need async event processing within a single region, Redis Streams is the pragmatic choice. It’s dramatically simpler to operate than Kafka, and its feature set covers the large majority of everyday use cases.

BullMQ: Background Jobs Without the Ceremony

BullMQ (Node.js) and Celery (Python) are job queue libraries. BullMQ is built on Redis; Celery can use Redis as one of several brokers. Neither is a streaming platform; they’re for discrete background jobs with retries, delays, and priorities.

import { Queue, Worker } from "bullmq";

const emailQueue = new Queue("email", {
  connection: { host: "localhost", port: 6379 },
});

// Enqueue a job
await emailQueue.add("welcome-email", {
  userId: 42,
  email: "user@example.com",
}, {
  attempts: 3,
  backoff: { type: "exponential", delay: 2000 },
  delay: 0,
});

// Worker processes jobs
const worker = new Worker("email", async (job) => {
  await sendWelcomeEmail(job.data.userId, job.data.email);
}, {
  connection: { host: "localhost", port: 6379 },
  concurrency: 10,
});

BullMQ handles retries with backoff, delayed jobs, job prioritization, rate limiting, and job result storage. For the common case of “run this task in the background and retry on failure,” it’s the least-friction option if you’re on Node.js. Celery is the Python equivalent with broader broker support (Redis, RabbitMQ, SQS).
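
To see what the `attempts: 3` and exponential `delay: 2000` settings above imply, the retry schedule can be worked out by hand. The sketch assumes the common doubling formula (base delay times 2 to the power of the retry index) with no jitter; BullMQ's exact formula should be checked against its documentation.

```python
def backoff_delays(attempts: int, base_delay_ms: int) -> list[int]:
    """Delay before each retry, assuming the delay doubles every time.

    `attempts` counts the first try, so there are attempts - 1 retries.
    """
    retries = max(attempts - 1, 0)
    return [base_delay_ms * 2 ** i for i in range(retries)]


def worst_case_wait_ms(attempts: int, base_delay_ms: int) -> int:
    """Total time spent waiting between tries if every attempt fails."""
    return sum(backoff_delays(attempts, base_delay_ms))


schedule = backoff_delays(attempts=3, base_delay_ms=2000)
# schedule is [2000, 4000]: retry after 2s, then after a further 4s
```

Under that assumption, a job that fails all three attempts spends about six seconds waiting before it is marked as failed.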

Choosing Between Them

Requirement	Redis Streams / BullMQ	Kafka
< 10k events/sec	Good fit	Overkill
> 500k events/sec	Starts to strain	Right tool
Event replay (days)	Limited by memory	Native
Multiple independent consumers	Streams: yes; BullMQ: one queue per consumer	Yes
Already using Redis	Yes, no new infra	No
Cross-datacenter streams	Needs setup	MirrorMaker
Schema evolution	Manual	Schema Registry
Ops team has Kafka experience	Maybe	Yes

The honest version: most product teams should start with Redis Streams or BullMQ. If you hit Redis’s limits (millions of events per second, multi-datacenter durability requirements, or long-term event replay across many consumer groups), then Kafka is justified. Starting there is premature optimization with a high operational price.

Patterns Worth Knowing

Fan-out: One event, many consumers. Both Kafka (consumer groups) and Redis Streams (consumer groups) handle this natively. Publish once, each subscriber gets a copy at its own pace.

Saga (distributed transaction): A sequence of events that represents a multi-step workflow. Order placed → payment initiated → inventory reserved → fulfillment triggered. Each step is its own event. If inventory fails, the saga publishes a compensation event (inventory.reserve.failed) that triggers upstream rollback.
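
The compensation logic can be sketched in a single process. This is an orchestrated simplification with hypothetical step names and an in-memory `ctx` dict; in a real saga each step would be a separate service reacting to events, but the rule is the same: when a step fails, undo the completed steps in reverse order.

```python
from typing import Callable

# (step name, action, compensation)
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list[Step], ctx: dict) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[dict], None]]] = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except Exception:
            log.append(f"{name}.failed")
            for done_name, undo in reversed(completed):
                undo(ctx)
                log.append(f"{done_name}.compensated")
            return log
        log.append(f"{name}.succeeded")
        completed.append((name, compensate))
    return log


def initiate_payment(ctx: dict) -> None:
    ctx["charged"] = True


def refund_payment(ctx: dict) -> None:
    ctx["charged"] = False
    ctx["refunded"] = True


def reserve_inventory(ctx: dict) -> None:
    raise RuntimeError("out of stock")  # simulated failure


def release_inventory(ctx: dict) -> None:
    ctx["reserved"] = False


def trigger_fulfillment(ctx: dict) -> None:
    ctx["fulfillment"] = "started"


order_ctx: dict = {"order_id": 1001}
saga_log = run_saga([
    ("payment.initiate", initiate_payment, refund_payment),
    ("inventory.reserve", reserve_inventory, release_inventory),
    ("fulfillment.trigger", trigger_fulfillment, lambda ctx: None),
], order_ctx)
```

Here the inventory step fails, so the payment is refunded and fulfillment never starts.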

Transactional Outbox: The hardest part of event publishing is guaranteeing that writing to the database and publishing an event happen atomically. If the database write succeeds but the Kafka publish fails (or vice versa), you have inconsistent state.

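The outbox pattern writes the event to a database table in the same transaction as the business data write. A separate process reads the outbox table and publishes events, then marks them as published. Because the event row commits or rolls back together with the business data, the two can’t diverge. Delivery is at-least-once: if the relay crashes after publishing but before marking a row, that event is published again, so consumers should be idempotent.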

-- In the same transaction as the user insert
INSERT INTO outbox (event_type, payload, published)
VALUES ('user.created', '{"user_id": 42}', false);
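
An end-to-end version of the pattern, as a minimal sketch using SQLite so it runs anywhere. The table layout, the function names, and the `publish` callback are assumptions for illustration; a production relay would also need locking so that two relay processes don't pick up the same rows.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def create_user(conn: sqlite3.Connection, user_id: int, email: str) -> None:
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("user.created", json.dumps({"user_id": user_id})),
        )


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[str, dict], None],
               batch_size: int = 100) -> int:
    """Publish unpublished rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried next pass.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

The relay marks a row only after `publish` returns, which is where the at-least-once behaviour comes from: a crash between the two steps causes a repeat publish, never a lost event.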

Observability for Async Systems

Debugging synchronous systems is hard enough. Async systems are harder because failures happen in a different process than the one that triggered the work.

Track these at minimum:

Consumer lag: How far behind is each consumer group? High lag means consumers can’t keep up with producers.
Processing latency: Time from event publish to processing completion.
Dead letter queue depth: Unprocessable events that need human attention.
Retry count distribution: If many events are hitting max retries, there’s a systemic problem.
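
Consumer lag, the first metric in the list, is simple arithmetic once you have the offsets. The sketch below assumes you have already fetched the newest offset and the group's committed offset for each partition; the function names and the alert threshold are hypothetical.

```python
def consumer_lag(end_offsets: dict[int, int],
                 committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: newest offset minus the group's committed offset.

    A partition with no committed offset is treated as fully unread.
    """
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def should_alert(lag: dict[int, int], threshold: int) -> bool:
    # Alert on total lag; per-partition alerts would catch hot partitions too.
    return sum(lag.values()) > threshold


lag = consumer_lag(
    end_offsets={0: 120, 1: 80, 2: 50},
    committed={0: 100, 1: 80},
)
# partition 0 is 20 behind, partition 1 is caught up, partition 2 is unread
```

Tracking this value over time matters more than any single reading: lag that keeps growing means consumers are falling behind producers.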

Tools like Kafka’s JMX metrics, Redis’s XPENDING command, and dashboards built for BullMQ (Bull Board is one open-source option) surface these. Connect them to your alerting before the queue fills up, not after.

The goal isn’t to add event-driven infrastructure for its own sake. It’s to decouple operations that don’t need to be synchronous, reduce blast radius when services fail, and make systems that scale without rewriting. Pick the simplest tool that achieves that, and add complexity only when the requirements demand it.
