Kafka vs Kinesis: A Data Engineer's Guide to Real-Time Streaming in 2026
November 9, 2025 · 9 min read
I've built production pipelines on both. I've been paged at 2 AM because of both. I've watched teams choose the wrong one and spend six months undoing that decision. This post is the guide I wish existed when I was making those calls.
The short answer: Kafka is more powerful and more complicated. Kinesis is easier to start with and harder to scale cheaply. The right choice depends almost entirely on your team, your AWS footprint, and how much operational complexity you're willing to carry long-term.
Let me give you the full picture.
The Architecture That Actually Matters
Kafka and Kinesis both implement a distributed, partitioned, append-only log. Messages go in, they stay for a configurable retention window, consumers read them at their own pace. That's where the surface-level similarity ends.
Kafka is an open-source distributed system. It's built around brokers, topics, and partitions. You control the number of partitions per topic, replication factor, retention period (time or size-based), and compression. The Kafka protocol is its own beast, and clients communicate directly with brokers. Consumer groups are first-class citizens: each group maintains its own committed offsets, and Kafka handles rebalancing when consumers join or leave. You can have as many consumer groups reading the same topic as you want, and each gets an independent read position.
Kinesis Data Streams is a managed AWS service. It uses shards instead of partitions. Each shard handles 1 MB/s of write throughput and 2 MB/s of read throughput. You provision shards explicitly (or use on-demand mode). Retention defaults to 24 hours, is extendable to 7 days, and long-term retention pushes it to 365 days at extra cost. The enhanced fan-out feature gives each consumer its own dedicated 2 MB/s read pipe per shard, which solves the shared-throughput problem but adds cost.
The philosophical difference: Kafka gives you control over everything. Kinesis gives you guardrails and bills you for the privilege.
Producer and Consumer Models
Here's what the code actually looks like in practice.
Producing to Kafka with the Confluent Python client:
```python
from confluent_kafka import Producer
import json

conf = {
    'bootstrap.servers': 'broker1:9092,broker2:9092',
    'acks': 'all',                   # wait for all in-sync replicas
    'enable.idempotence': True,      # no duplicates on producer retries
    'compression.type': 'snappy',
    'batch.size': 65536,
    'linger.ms': 5,                  # wait up to 5 ms to fill batches
}

producer = Producer(conf)

def delivery_report(err, msg):
    if err:
        print(f'Delivery failed for record {msg.key()}: {err}')
    else:
        print(f'Record delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}')

for event in events:
    producer.produce(
        topic='user-events',
        key=str(event['user_id']).encode('utf-8'),
        value=json.dumps(event).encode('utf-8'),
        callback=delivery_report,
    )
    producer.poll(0)   # serve queued delivery callbacks

producer.flush()
```

Producing to Kinesis with boto3:
```python
import boto3
import json

client = boto3.client('kinesis', region_name='us-east-1')

for event in events:
    response = client.put_record(
        StreamName='user-events',
        Data=json.dumps(event).encode('utf-8'),
        PartitionKey=str(event['user_id']),
    )
    print(f"Shard: {response['ShardId']}, Seq: {response['SequenceNumber']}")
```

The Kinesis code is simpler to get running. You don't need to think about brokers, replication, or acks. But notice you're calling put_record one record at a time. At scale you want put_records (up to 500 records per call), and you have to handle partial failures yourself, because the batch API returns per-record success/failure codes. Kafka handles batching internally.
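That batching pattern is easy to get subtly wrong, so here's a hedged sketch: put_records takes up to 500 records per call, and on partial failure the response carries FailedRecordCount plus a per-record ErrorCode telling you what to resend. The helper names (chunked, put_records_with_retry) are mine, not AWS's:

```python
import json
import time

def chunked(records, size=500):
    """Yield lists of at most `size` records (the put_records limit)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def put_records_with_retry(client, stream_name, events, max_attempts=3):
    """Send events via put_records, retrying only the failed entries.

    `client` is a boto3 Kinesis client, e.g. boto3.client('kinesis').
    """
    for batch in chunked(events):
        entries = [
            {
                'Data': json.dumps(e).encode('utf-8'),
                'PartitionKey': str(e['user_id']),
            }
            for e in batch
        ]
        for attempt in range(max_attempts):
            resp = client.put_records(StreamName=stream_name, Records=entries)
            if resp['FailedRecordCount'] == 0:
                break
            # Responses line up positionally with the request; keep only the
            # entries whose result carries an ErrorCode (often a throttle).
            entries = [
                entry for entry, result in zip(entries, resp['Records'])
                if 'ErrorCode' in result
            ]
            time.sleep(2 ** attempt)  # simple exponential backoff
```

Note the zip: the batch API matches responses to requests by position, which is exactly the bookkeeping Kafka's client does for you.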
Consumer group semantics are where the difference really bites. In Kafka, you create a consumer group, set a group.id, and Kafka does partition assignment and offset tracking. Multiple groups read the same topic independently. In Kinesis, you manage state yourself (DynamoDB via KCL, or EventBridge Pipes, or Lambda with manual checkpoint logic). Enhanced fan-out helps with throughput isolation but doesn't give you consumer group semantics out of the box.
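For contrast, here's roughly what the Kafka side looks like with the confluent-kafka client; the topic and group names are invented. Everything group-related lives in config, and Kafka handles partition assignment and offset tracking:

```python
import json

# Consumers sharing a group.id split the partitions between them; a
# different group.id gets its own independent read position on the topic.
consumer_conf = {
    'bootstrap.servers': 'broker1:9092,broker2:9092',
    'group.id': 'analytics-pipeline',   # hypothetical group name
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,        # commit explicitly after processing
}

def run():
    from confluent_kafka import Consumer  # pip install confluent-kafka
    consumer = Consumer(consumer_conf)
    consumer.subscribe(['user-events'])
    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                print(f'Consumer error: {msg.error()}')
                continue
            event = json.loads(msg.value())
            # ... process the event ...
            consumer.commit(msg)          # at-least-once: commit after the work
    finally:
        consumer.close()

# run()  # start consuming; requires a reachable broker
```

Replicating even this much on Kinesis means a KCL lease table in DynamoDB or hand-rolled checkpointing.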
Throughput, Latency, and Retention: Real Numbers
Throughput: Kafka is effectively unlimited: add partitions, add brokers. Large production clusters sustaining multiple GB/s are well documented, and in practice MSK clusters handle hundreds of MB/s without drama. Kinesis: 1 MB/s or 1,000 records/s per shard for writes, 2 MB/s per shard for reads (shared across consumers unless you use enhanced fan-out). If you need 100 MB/s of write throughput, you're provisioning 100 shards.
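The shard arithmetic is worth making explicit, because both write limits bind independently: you need enough shards for your MB/s and for your records/s, whichever is larger. A back-of-envelope helper (my own, not an AWS API):

```python
import math

def required_shards(mb_per_sec, records_per_sec):
    """Kinesis write limits per shard: 1 MB/s and 1,000 records/s.
    You need enough shards for whichever limit binds first."""
    by_bytes = math.ceil(mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    return max(by_bytes, by_records, 1)
```

required_shards(100, 5_000) comes out to 100, while required_shards(0.5, 4_000) comes out to 4: a stream of tiny records can force more shards than raw bandwidth suggests.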
Latency: Kafka end-to-end latency is typically 2-5 ms at p50. You can tune it lower by shrinking linger.ms and batch.size, but you trade throughput for it. Kinesis typically runs 70-200 ms end-to-end; AWS states under 1 second at p99, and you'll usually see 200-300 ms under normal load. Fine for most real-time use cases. Not fine for sub-100 ms requirements.
Retention: Kafka is configurable per topic. 7 days is common. A week, a month, forever if storage allows. This is a genuine architectural advantage. You can replay any partition from the beginning. Kinesis standard: 24 hours default, up to 7 days. Extended retention bumps it to 365 days but adds real cost. For long replay windows, Kafka wins cleanly.
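Since retention is a per-topic setting in Kafka, you set it at topic creation (or alter it later). A sketch with the confluent-kafka AdminClient; the broker address, partition count, and topic name are illustrative:

```python
RETENTION_MS = 30 * 24 * 60 * 60 * 1000   # 30 days, in milliseconds

def create_long_retention_topic():
    # pip install confluent-kafka
    from confluent_kafka.admin import AdminClient, NewTopic
    admin = AdminClient({'bootstrap.servers': 'broker1:9092'})
    topic = NewTopic(
        'user-events',
        num_partitions=12,
        replication_factor=3,
        config={'retention.ms': str(RETENTION_MS)},
    )
    # create_topics is asynchronous; wait on the returned futures
    for name, future in admin.create_topics([topic]).items():
        future.result()
        print(f'Created {name}')

# create_long_retention_topic()  # requires a reachable broker
```

A 30-day replay window like this is a one-line config in Kafka; on Kinesis it means paying for extended retention.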
Operational Reality: Managed vs. Self-Hosted vs. Kinesis
Here's where most engineers underestimate the decision.
Self-hosted Kafka is powerful and exhausting. You own broker upgrades, disk management, replication tuning, consumer lag monitoring, and ZooKeeper (or KRaft, which you should be using now). You will debug obscure partition leadership issues. You will fight with Java heap settings at some point. If you have a dedicated platform team, this can work well. If you're a two-person data team, it will own you.
Amazon MSK is self-hosted Kafka without the hardware. AWS manages the brokers, handles multi-AZ replication, and does the Kafka upgrades with your approval. You still manage topic configuration, consumer groups, and everything at the application layer. A 3-broker MSK cluster running 24/7 is about $150/month before storage. That's cheap for production infrastructure.
Confluent Cloud is the best fully-managed Kafka experience. Schema Registry, ksqlDB, Kafka Streams, Connectors, role-based access control, Terraform provider, the whole ecosystem. A moderately busy cluster runs $400-800/month. Worth it if you want Kafka capabilities without any operational burden.
Kinesis Data Streams is serverless in spirit. Provision shards (or use on-demand mode), point your producers at it, and you're done. No cluster to manage, no broker to upgrade. At low volume, Kinesis is very cheap. At high volume, shard costs compound fast.
Ecosystem Comparison
This is where Kafka pulls ahead for engineering teams that want to do serious work.
Kafka ecosystem: Kafka Streams for stateful stream processing embedded in your app. ksqlDB for SQL over Kafka. Flink on Kafka as the gold standard for complex CEP and ML feature pipelines. Kafka Connect with 200+ connectors for CDC, databases, object stores, and SaaS tools. Debezium on Kafka is still the best CDC story in the industry.
Kinesis ecosystem: Kinesis Data Firehose for zero-code delivery to S3, Redshift, OpenSearch, and Splunk. Managed Service for Apache Flink (formerly Kinesis Data Analytics). Lambda triggers for native event-driven patterns. EventBridge Pipes for Kinesis-to-target routing with filtering.
The honest assessment: if you want Kafka Streams or ksqlDB, you're using Kafka. If you want to dump events to S3 with no operational overhead, Kinesis Firehose is genuinely unbeatable.
Cost at Scale: Some Real Math
Let's compare a 10 MB/s sustained write throughput scenario.
Kinesis provisioned: You need 10 shards. At $0.015/shard-hour, that's roughly $109.50/month for shards. Add PUT costs at $0.014 per million payload units (a payload unit is up to 25 KB, so with ~1 KB records every record is one unit). At 10 MB/s of 1 KB records you're ingesting roughly 26 billion units/month, which is about $364 in PUT costs. Total: roughly $475/month before enhanced fan-out or extended retention.
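To make the Kinesis arithmetic reproducible, here's the same math as code, with the assumptions spelled out (us-east-1 provisioned-mode list prices as of writing, so verify against current AWS pricing; 730 hours/month; ~1 KB records, so each record is a single payload unit):

```python
SHARD_HOUR_USD = 0.015       # provisioned shard-hour (check current pricing)
PUT_PER_MILLION_USD = 0.014  # per million 25 KB payload units
HOURS_PER_MONTH = 730
SECONDS_PER_MONTH = 86_400 * 30

def kinesis_monthly_cost(shards, records_per_sec):
    """Assumes records <= 25 KB, so each record is one payload unit."""
    shard_cost = shards * SHARD_HOUR_USD * HOURS_PER_MONTH
    units = records_per_sec * SECONDS_PER_MONTH
    put_cost = units / 1_000_000 * PUT_PER_MILLION_USD
    return shard_cost, put_cost

shard_cost, put_cost = kinesis_monthly_cost(shards=10, records_per_sec=10_000)
print(f'Shards ${shard_cost:.2f} + PUTs ${put_cost:.2f} = ${shard_cost + put_cost:.2f}/month')
# → Shards $109.50 + PUTs $362.88 = $472.38/month
```

Rerun it with your own record sizes before trusting any blog post's totals, mine included.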
MSK kafka.m5.2xlarge, 3 brokers: Approximately $885/month plus storage. At 10 MB/s sustained with 7-day retention, storage runs about $580/month. Total: roughly $1,465/month.
Confluent Cloud at 10 MB/s: At roughly $0.14/GB ingress, about $3,465/month in ingress alone.
At 10 MB/s, Kinesis is actually cheapest in raw infrastructure. The math flips at higher throughput where shard costs dominate, and it flips if you need long retention. For most teams under 5 MB/s with simple delivery requirements: Kinesis wins on cost. For teams above 20 MB/s or with complex processing requirements: Kafka (MSK especially) becomes more cost-effective.
Recommended Architectures by Team Size
Small team (1-3 data engineers), AWS-native shop: Use Kinesis Data Streams with Lambda consumers and Firehose to S3. Stand up a Glue catalog, query with Athena. Total managed surface area is minimal. You will not be paged about broker leadership. If you outgrow it, migrate to MSK later.
Mid-size team (4-10 engineers), polyglot infrastructure: MSK with a Schema Registry (Glue Schema Registry or Confluent-compatible). Add Flink for stateful processing if you need it. MSK gives you Kafka semantics without self-hosted pain. Budget $500-1,500/month depending on cluster size.
Large platform team (10+ engineers), serious scale: Confluent Cloud or self-hosted Kafka on EC2 with dedicated ops ownership. You want full Kafka Connect, ksqlDB, and Streams. The Confluent ecosystem is worth the premium if your team will actually use it.
Greenfield AWS project, simple event streaming: EventBridge. Not Kinesis, not Kafka. EventBridge handles event routing, replay (24 hours), and fan-out without writing a single producer. For service-to-service events in a microservices architecture, EventBridge is often the right answer and people reach for Kinesis out of habit.
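To show how little producer code EventBridge needs, here's a minimal sketch; the source and detail-type strings are invented, and put_events accepts up to 10 entries per call:

```python
import json

def build_entry(source, detail_type, detail, bus_name='default'):
    """Shape one EventBridge entry for put_events."""
    return {
        'Source': source,
        'DetailType': detail_type,
        'Detail': json.dumps(detail),
        'EventBusName': bus_name,
    }

def publish(entries):
    import boto3  # deferred import: the pure helper above has no AWS dependency
    client = boto3.client('events', region_name='us-east-1')
    resp = client.put_events(Entries=entries)
    if resp['FailedEntryCount']:
        raise RuntimeError(f"{resp['FailedEntryCount']} entries failed")

entry = build_entry('orders.service', 'OrderPlaced', {'order_id': 123})
```

Routing, filtering, and fan-out then live in EventBridge rules, not in consumer code you maintain.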
My Take: Pick One, Don't Agonize
I use Kafka in my current stack. I've used Kinesis in previous roles. Here's where I actually land after building both.
Pick Kinesis if: Your entire infrastructure is AWS-native, your team is small, your throughput is under 5 MB/s, and you genuinely don't have operational capacity to run a cluster. Kinesis Data Streams plus Firehose handles a surprising number of real production use cases. Don't let anyone tell you it's a toy.
Pick Kafka if: You need long retention for replay, you have multiple independent consumer groups with different processing speeds, you want Flink or Kafka Streams for stateful processing, or you're not exclusively AWS. Kafka gives you more control and more ecosystem, and with MSK that control doesn't require running servers manually.
Pick neither if: You're doing low-volume service-to-service events in AWS. That's EventBridge. Stop over-engineering it.
The mistake I see most often is teams picking Kafka because it sounds impressive, then spending their first quarter just keeping the cluster alive. Kafka is a serious piece of infrastructure. It rewards teams that invest in understanding it. It punishes teams that treat it as a magic queue.
If you're unsure, start with Kinesis, prove your pipeline, and migrate to MSK when you hit the limits. That path is less painful than it sounds. The partition model maps cleanly between the two.
Getting Started
For Kafka: Install the Confluent CLI, spin up a free Confluent Cloud cluster, and use the Python client above. Confluent's free tier gives you enough to build a real prototype.
For Kinesis: Create a stream in the AWS console, enable enhanced fan-out if you have multiple consumers, and use boto3 with put_records batching from day one.
Whichever you pick, build your observability first. Consumer lag, producer error rates, and partition hot-spotting will tell you more about the health of your pipeline than anything else.
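Consumer lag is just the log-end offset minus the committed offset, per partition. In a real pipeline you'd pull both numbers from the broker and your consumer group; the offsets below are invented to show the shape of the check:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus committed offset.
    A partition with no commit counts its whole log as lag."""
    per_partition = {
        p: end - committed_offsets.get(p, 0)
        for p, end in end_offsets.items()
    }
    return per_partition, sum(per_partition.values())

# Invented offsets: partition 2 is far behind, a classic hot-partition symptom.
per_part, total = consumer_lag(
    end_offsets={0: 5_000, 1: 5_200, 2: 9_900},
    committed_offsets={0: 4_990, 1: 5_200, 2: 1_000},
)
```

One partition lagging while the rest keep up usually means a hot key, not a slow consumer.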
Questions or pushback on any of this? Find me on LinkedIn.
Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Spark, and writes about data engineering patterns from production experience. See his full portfolio.