Kafka Streams · Apache Flink · Streaming · Stateful Processing · Data Engineering

Real-Time Data Processing with Kafka Streams and Flink: A Production Guide

October 31, 2025 · 11 min read

I have built stream processors with both Kafka Streams and Apache Flink, and the decision is rarely about which one is better. It is about where the state should live, how much operational overhead the team can handle, and whether the pipeline needs to join multiple sources or just keep up with a single Kafka cluster. I have seen teams pick Flink when they only needed a Kafka-native consumer, and teams stretch Kafka Streams too far.

This guide is my production view of the tradeoffs. I compare the two systems, show windowing and stateful aggregation patterns, and end with a decision matrix I use in architecture reviews. The examples are real patterns from production, in Java for Kafka Streams and PyFlink for Flink.

When to Use Kafka Streams

Kafka Streams is a library, not a separate system. You deploy it like any other JVM service, ship it with your application, and it scales by adding instances and partitions. I use it when the source of truth is Kafka and I want an embedded processor co-located with Kafka, like sessionization on clickstream events or enrichment with compacted topics.

When to Use Apache Flink

Flink is a distributed stream processing engine. It is the right choice when the job is complex, state is large, or the pipeline needs to ingest from multiple sources beyond Kafka. I reach for Flink when I need rich event time semantics, complex windowing, large joins, or integrations with databases and data lakes that require two phase commit.

Windowing Patterns with Kafka Streams

Windowing is where stateful stream processing becomes real. The patterns I use most are tumbling windows for fixed buckets, sliding windows for continuous aggregation, and session windows for user activity. Kafka Streams uses event time windows with optional grace periods.

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Purchase> purchases = builder.stream("purchases");

// Tumbling window, 5 minutes
KTable<Windowed<String>, Long> countPerUser = purchases
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count();

// Hopping window (overlapping buckets): size 10 minutes, advance every 1 minute
KTable<Windowed<String>, Double> spendPerUser = purchases
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(10))
        .advanceBy(Duration.ofMinutes(1)))
    .aggregate(
        () -> 0.0,
        (key, purchase, total) -> total + purchase.amount(),
        Materialized.with(Serdes.String(), Serdes.Double())
    );

// Session window, 30 minute inactivity gap
KTable<Windowed<String>, Long> sessions = purchases
    .groupByKey()
    .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30)))
    .count();

Windowed aggregations create state in RocksDB. I size state stores for the worst case and set retention on window stores to avoid infinite growth.
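The assignment logic behind these three window types is easy to model in plain Python. This is an illustrative sketch, not the Kafka Streams implementation: timestamps are epoch milliseconds and the helper names are mine.

```python
from typing import List, Tuple

def tumbling_window(ts_ms: int, size_ms: int) -> Tuple[int, int]:
    """A timestamp lands in exactly one fixed, non-overlapping bucket."""
    start = (ts_ms // size_ms) * size_ms
    return (start, start + size_ms)

def hopping_windows(ts_ms: int, size_ms: int, advance_ms: int) -> List[Tuple[int, int]]:
    """A timestamp lands in every overlapping window that covers it."""
    windows = []
    # Earliest window start that could still contain ts_ms.
    first = ((ts_ms - size_ms) // advance_ms + 1) * advance_ms
    start = max(first, 0)
    while start <= ts_ms:
        windows.append((start, start + size_ms))
        start += advance_ms
    return windows

def session_windows(timestamps: List[int], gap_ms: int) -> List[Tuple[int, int]]:
    """Events closer than gap_ms merge into one session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap_ms:
            sessions[-1] = (sessions[-1][0], ts)
        else:
            sessions.append((ts, ts))
    return sessions
```

Note how a hopping window multiplies state: one event lands in size/advance windows, which is why the 10-minute/1-minute example above keeps ten buckets per key.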

Windowing Patterns with PyFlink

Flink gives you explicit control over event time and watermarks. The APIs are more verbose, but you can build precise behavior around late events. I use Flink for sessionization or business rules where lateness is expected and I need deterministic handling.

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import (
    TumblingEventTimeWindows,
    SlidingEventTimeWindows,
    EventTimeSessionWindows,
)
from pyflink.common.time import Time

env = StreamExecutionEnvironment.get_execution_environment()
# `source` and `watermark_strategy` are defined elsewhere
stream = env.from_source(source, watermark_strategy, "purchases")

# Tumbling window, 5 minutes
stream.key_by(lambda x: x.user_id) \
    .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
    .reduce(lambda a, b: a.add(b))

# Sliding window, 10 minutes size, 1 minute slide
stream.key_by(lambda x: x.user_id) \
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1))) \
    .reduce(lambda a, b: a.add(b))

# Session window, 30 minute gap
stream.key_by(lambda x: x.user_id) \
    .window(EventTimeSessionWindows.with_gap(Time.minutes(30))) \
    .reduce(lambda a, b: a.add(b))

Flink windowing is more expressive, and the tradeoff is that you have to think about watermarks, late data, and state backend configuration.
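The watermark strategy I reach for most is bounded out-of-orderness, and its core can be modeled in a few lines. This is a conceptual sketch, not Flink's implementation; the class name and the details of the lateness check are mine.

```python
class BoundedOutOfOrdernessWatermark:
    """Tracks the highest timestamp seen and trails it by a fixed bound.

    An event is late if its timestamp is at or below the current
    watermark, meaning a window that should contain it may have fired.
    """
    def __init__(self, max_out_of_orderness_ms: int):
        self.bound = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, ts_ms: int) -> bool:
        """Advance the watermark; return True if the event arrived late."""
        late = ts_ms <= self.current_watermark()
        self.max_ts = max(self.max_ts, ts_ms)
        return late

    def current_watermark(self) -> float:
        return self.max_ts - self.bound

wm = BoundedOutOfOrdernessWatermark(5_000)
wm.on_event(10_000)  # watermark advances to 5_000
wm.on_event(7_000)   # out of order, but within the bound: not late
wm.on_event(4_000)   # below the watermark: late, needs explicit handling
```

The bound is the knob: a larger value tolerates more disorder but delays every window firing by the same amount.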

Stateful Aggregations and Joins

The basic stateful building blocks are count, sum, and join. In Kafka Streams, you define a topology, Kafka handles partitioning, and RocksDB stores state locally with changelogs in Kafka. In Flink, you get richer join semantics and can manage state across more inputs.

KStream<String, Order> orders = builder.stream("orders");
KTable<String, Customer> customers = builder.table("customers");

// Sum order total per customer
KTable<String, Double> revenue = orders
    .groupByKey()
    .aggregate(
        () -> 0.0,
        (key, order, total) -> total + order.total(),
        Materialized.with(Serdes.String(), Serdes.Double())
    );

// Stream-table join for enrichment
KStream<String, EnrichedOrder> enriched = orders.join(
    customers,
    (order, customer) -> new EnrichedOrder(order, customer)
);

# PyFlink interval join for streams
from pyflink.datastream.functions import ProcessJoinFunction

orders = env.from_source(order_source, watermark_strategy, "orders")
customers = env.from_source(customer_source, watermark_strategy, "customers")

class EnrichJoin(ProcessJoinFunction):  # process() takes a ProcessJoinFunction, not a bare lambda
    def process_element(self, order, customer, ctx):
        yield EnrichedOrder(order, customer)

joined = orders.key_by(lambda o: o.customer_id) \
    .interval_join(customers.key_by(lambda c: c.id)) \
    .between(Time.seconds(-5), Time.seconds(5)) \
    .process(EnrichJoin())

My rule of thumb is that Kafka Streams joins are great when both sides are in Kafka and the state fits comfortably on each instance. If the join is large, the window is long, or the sources are heterogeneous, Flink is safer because it scales state across TaskManagers.
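What `between(Time.seconds(-5), Time.seconds(5))` actually buys you can be shown with a plain-Python simulation of the interval join semantics. The record layout and names here are mine, for illustration only.

```python
from typing import List, Tuple

# (key, timestamp_ms, payload) triples stand in for keyed stream records.
Record = Tuple[str, int, str]

def interval_join(left: List[Record], right: List[Record],
                  lower_ms: int, upper_ms: int) -> List[Tuple[str, str]]:
    """Pair records with equal keys whose timestamps satisfy
    left.ts + lower <= right.ts <= left.ts + upper."""
    out = []
    for lk, lts, lval in left:
        for rk, rts, rval in right:
            if lk == rk and lts + lower_ms <= rts <= lts + upper_ms:
                out.append((lval, rval))
    return out

orders = [("c1", 10_000, "order-1"), ("c2", 20_000, "order-2")]
customers = [("c1", 12_000, "cust-c1"), ("c2", 40_000, "cust-c2")]
interval_join(orders, customers, -5_000, 5_000)
# order-1 pairs with cust-c1 (2s apart); order-2 finds no match within 5s
```

The interval bounds are also what lets Flink expire join state: once the watermark passes an event's timestamp plus the upper bound, that event can never match again and its state is dropped.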

Exactly-Once Semantics in Practice

Kafka Streams exactly-once is built on Kafka transactions. Each task writes to changelog topics and output topics within a single transaction. When you enable exactly-once v2, the producer is idempotent, commits are transactional, and offsets are committed with the output. This gives you end to end exactly-once as long as all sources and sinks are Kafka.
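The transactional loop can be modeled as a unit of work that buffers output records and the consumed offset together, then commits or aborts both atomically. This is a toy model of the idea, not the Kafka client API; all names are mine.

```python
class TransactionalSink:
    """Toy model: output records and the consumed offset become
    visible together, or not at all, which makes reprocessing safe."""
    def __init__(self):
        self.committed_records = []
        self.committed_offset = -1
        self._pending = None

    def begin(self, offset: int, records: list):
        self._pending = (offset, records)

    def commit(self):
        offset, records = self._pending
        self.committed_records.extend(records)
        self.committed_offset = offset
        self._pending = None

    def abort(self):
        # On failure the offset is not advanced, so the batch is
        # reprocessed, and no partial output was ever visible.
        self._pending = None

sink = TransactionalSink()
sink.begin(offset=5, records=["a", "b"])
sink.abort()    # crash mid-batch: nothing became visible
sink.begin(offset=5, records=["a", "b"])
sink.commit()   # retry succeeds: records and offset appear atomically
```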

Flink uses checkpointing to coordinate state and sinks. It snapshots state across operators, and sinks that support two phase commit can participate in the checkpoint so that state and output stay consistent. Flink can provide exactly-once even with sinks outside Kafka, which is why I use it for pipelines that write to databases or data lakes. The tradeoff is checkpoint overhead and tuning.
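The two-phase-commit sink protocol can be sketched as: stage writes between checkpoints, pre-commit when the checkpoint barrier arrives, and finalize only when the whole checkpoint succeeds. A conceptual model with my own simplified method names, not Flink's sink API.

```python
class TwoPhaseCommitSink:
    """Sketch of the 2PC sink protocol: staged writes become durable
    only after the whole checkpoint succeeds."""
    def __init__(self):
        self.staged = []         # written since the last barrier
        self.pre_committed = []  # flushed, waiting on checkpoint success
        self.durable = []        # visible to downstream readers

    def write(self, record):
        self.staged.append(record)

    def pre_commit(self):
        # Checkpoint barrier arrived: flush staged writes so they can
        # be finalized later, even after a crash and recovery.
        self.pre_committed.extend(self.staged)
        self.staged = []

    def commit(self):
        # Checkpoint completed on every operator: make writes visible.
        self.durable.extend(self.pre_committed)
        self.pre_committed = []

sink = TwoPhaseCommitSink()
sink.write("row-1")
sink.pre_commit()    # barrier for checkpoint N
sink.write("row-2")  # belongs to checkpoint N+1
sink.commit()        # checkpoint N acknowledged: row-1 is durable
```

Because nothing depends on the sink being Kafka, any system that can stage and later finalize writes can participate, which is the whole point.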

Deployment Patterns

Kafka Streams is deployed as a library. You package the topology inside your application, deploy it on Kubernetes, ECS, or bare metal, and scale by adding instances. There is no separate control plane, so rollouts look like normal service deployments.

Flink is deployed as a cluster with JobManagers and TaskManagers. You submit jobs, manage versions, and coordinate upgrades. In exchange you get centralized resource management, rescaling, and checkpointing for long running jobs.

Production Lessons I Learned the Hard Way

Backpressure handling is the first real operational challenge. Kafka Streams will slow consumers when downstream processing slows, but you still need to watch consumer lag, RocksDB compaction, and large windowed state. Flink has explicit backpressure metrics, but you still need to tune buffers and parallelism or the job will stall.

State store sizing is the second lesson. I have seen Kafka Streams instances die because the local disk filled up, and I have seen Flink jobs crawl because RocksDB was starved for memory. Size for peak, keep retention tight, and separate state from log directories.

Checkpointing is the third lesson. Kafka Streams hides it inside the commit interval; Flink exposes every knob. Use a checkpoint interval that matches your recovery objectives, and watch checkpoint duration so it does not approach the interval.
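That last point is easy to turn into a monitoring rule. A hedged sketch: the 0.8 threshold is my habit, not a Flink default.

```python
def checkpoint_pressure(durations_ms, interval_ms, threshold=0.8):
    """Flag checkpoints whose duration approaches the interval; sustained
    hits mean the job spends most of its time checkpointing."""
    return [d for d in durations_ms if d >= threshold * interval_ms]

# 60s interval; the 55s checkpoint leaves only 5s of real processing time
checkpoint_pressure([12_000, 55_000, 20_000], 60_000)
# → [55000]
```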

Decision Matrix

Choose Kafka Streams when: you process only Kafka sources and sinks, state fits on each instance, and the team wants a library they can deploy like any other service.

Choose Flink when: you need multi source joins, large state, or long windows, or you must write to non Kafka sinks with exactly-once guarantees. It is also the right call when you want a shared streaming platform with centralized operations.

Borderline case: if your job is Kafka only today but likely to grow into complex joins, I still start in Kafka Streams for speed, then migrate once requirements stabilize.

Closing

I have shipped production systems on both stacks, and neither one is a silver bullet. Kafka Streams is the pragmatic choice for Kafka native processing with moderate state and low operational overhead. Flink is the right choice for complex, large scale, multi source pipelines that need strong event time semantics and rich state management.

Questions or pushback on any of this? Find me on LinkedIn.

Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Spark, and writes about data engineering patterns from production experience. See his full portfolio.