We do a lot of Kafka cost reviews with customers: vendor, instances, replication factor (RF), tiered storage, fetch-from-follower, networking, topic usage, and some maths. A recent Get Kafka-Nated episode walks through most of it, and we also dug into the surprising cost of partition waste, one of the biggest hidden drivers on the broker side.
Why TCO is everywhere right now
It comes up in every customer conversation: CAB (Customer Advisory Board) sessions, roadmap reviews, renewal calls. Orgs are shrinking (headcount cuts, "do more with less", AI eating the easy stuff) while data volumes keep going up. Security comes with it: nobody wants to optimize cost by cutting corners.
Infra is a big chunk of the cost, but not the only one: the "usage" layer matters just as much.
Where does the bill actually come from?
Vendor calculators are hard to compare because the assumptions baked in are rarely visible: replication multipliers, disk class (the AWS EBS volume type: gp3 is meaningfully cheaper per GB than gp2), the compression ratio assumed for the workload, and whether tiered storage is billed at the replicated rate or at the actual S3 rate.
What to consider in the infra cost:
- With RF=3, the per-GB list price is effectively multiplied by 3 for every byte you store. Tiered storage is sometimes still billed at the replicated rate even though only one copy lives in S3, so you pay the RF=3 rate for data Kafka no longer replicates. Ask your provider and check your bill.
- Cross-VPC, in-region traffic between your account and the vendor's account lands on your own cloud bill at roughly 1c/GB each way, depending on the path (peering, PrivateLink, transit gateway).
- Without fetch-from-follower, most consumer fetches cross AZ boundaries. With three balanced AZs, ~2/3 of consumer reads are cross-AZ (the leader lives in one AZ, so two reads in three come from another).
- Compression is often left off. With zstd at sensible batch sizes (32KB and up), logs and metrics commonly compress 8-10x depending on payload, and far higher ratios are reachable on highly repetitive payloads (JSON). Going from 5x to 10x halves stored bytes and halves the replication bytes flowing inside the cluster.
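The arithmetic behind the last two bullets can be sketched in a few lines. This is a minimal sketch under stated assumptions: leaders and consumers are spread evenly across AZs, every fetch goes to the leader, and the function names and the per-GB price parameter are ours, not any vendor's pricing model.

```python
def cross_az_read_fraction(num_azs: int) -> float:
    """Share of consumer reads that leave the consumer's AZ.

    Assumes leaders and consumers are balanced across AZs and that
    every fetch goes to the partition leader (no fetch-from-follower).
    """
    if num_azs < 1:
        raise ValueError("num_azs must be at least 1")
    return (num_azs - 1) / num_azs


def stored_gb(raw_gb: float, compression_ratio: float, replication_factor: int) -> float:
    """Bytes on broker disks: raw bytes, compressed, then replicated."""
    return raw_gb / compression_ratio * replication_factor


def cross_az_read_cost(consumed_gb: float, num_azs: int, usd_per_gb: float) -> float:
    """Transfer cost of leader-only reads; usd_per_gb is an input you
    take from your own bill, not a value asserted here."""
    return consumed_gb * cross_az_read_fraction(num_azs) * usd_per_gb


if __name__ == "__main__":
    print(cross_az_read_fraction(3))   # two reads in three cross an AZ
    print(stored_gb(1000, 5, 3))       # 5x compression, RF=3
    print(stored_gb(1000, 10, 3))      # 10x compression halves it
```

Plug in your own consumed volume and the transfer rate from your bill. The point is that both the cross-AZ share and the stored bytes are simple multipliers you can check yourself.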
Fetch-from-follower in 30 seconds: since Kafka 2.4, consumers can read from a follower replica in their own AZ instead of always hitting the leader, which may be in another AZ. AWS charges nothing for same-AZ traffic within your VPC.
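In configuration terms, fetch-from-follower comes down to three settings: `broker.rack` and `replica.selector.class` on each broker, and `client.rack` on each consumer. The sketch below only assembles those properties. The helper names and the example AZ ID are illustrative, and how you roll the settings out depends on your deployment tooling.

```python
def follower_fetch_settings(az_id: str) -> dict:
    """Settings for one broker and one consumer living in the same AZ."""
    broker = {
        # Tells the cluster which AZ (rack) this broker lives in.
        "broker.rack": az_id,
        # Lets the leader point a consumer at a replica in its own rack.
        "replica.selector.class":
            "org.apache.kafka.common.replica.RackAwareReplicaSelector",
    }
    consumer = {
        # Must match the broker.rack value of the brokers in the same AZ.
        "client.rack": az_id,
    }
    return {"broker": broker, "consumer": consumer}


def to_properties(settings: dict) -> str:
    """Render a settings dict as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    settings = follower_fetch_settings("use1-az1")  # example AZ ID
    print(to_properties(settings["broker"]))
    print(to_properties(settings["consumer"]))
```

If `client.rack` is unset or matches no replica's rack, the consumer keeps reading from the leader, so a mismatch fails silently on the bill rather than loudly in the logs.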
Do it all: fetch-from-follower, tiered storage, compression enforcement, partition right-sizing, BYOC to apply your existing AWS discount, single-AZ topics, proxies for routing and failover. But every one of these is infrastructure tuning. Let's look at the layer above.
Cost is not just infra, it's a stack
When tuning anything in Kafka, we think in layers, bottom-up: hardware, JVM, broker config, producer and consumer tuning, topic design, application code. Understand the layers to optimize the one you need.
- Cloud infrastructure. Instance types, AZ placement, networking, BYOC (Bring Your Own Cloud) negotiation with AWS. At hyperscale contract sizes, negotiated networking discounts can reach 90%, but only if your traffic flows through your own AWS account. Sign with a SaaS and that lever disappears (the vendor keeps that discount for itself).
- Broker and protocol tuning. Compression, retention, replication factor, fetch-from-follower, tiered storage, partition count. Easy because these are simple config changes.
Most clusters carry 40 to 70% partition waste. On managed Kafka it shows up as per-partition-hour billing. On self-managed, the rule-of-thumb ceiling is 4,000 to 6,000 partition-replicas per broker (RF=3 turns 100,000 partitions into 300,000 replicas to host and track). KRaft raises that ceiling but it still exists.
- Architecture. Diskless topics (KIP-1150, brokers backed directly by object storage), Iceberg topics (Kafka topics readable as Apache Iceberg tables), single-AZ topics, proxies sitting between clients and brokers. One cluster maps to one workload. This is where the next wave of Kafka cost reduction is happening.
Proxy usage is booming here: multi-tenancy, failover, multi-cluster routing. Non-prod is often the obvious win: are staging and dev doubling the bill? Consolidate them onto shared virtual clusters. Proxies also provide advanced features like topic concentration: collapsing many low-volume topics (like Debezium reading all tables from databases) onto fewer physical topics can cut partition counts by 90%+. See Gateway.
- Usage. Fan-out, governance, discovery, self-service. The one with the highest payoff. Let's focus on this layer.
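Before moving up the stack, the partition-replica arithmetic from the broker layer is worth making concrete. This is a minimal sketch: the function names are ours, and the 4,000 to 6,000 ceiling is the rule of thumb quoted above, not a hard limit.

```python
def total_replicas(partitions: int, replication_factor: int) -> int:
    """Partition-replicas the cluster has to host and track."""
    return partitions * replication_factor


def min_brokers(partitions: int, replication_factor: int, ceiling_per_broker: int) -> int:
    """Smallest broker count that keeps every broker under the ceiling,
    assuming replicas are spread evenly."""
    replicas = total_replicas(partitions, replication_factor)
    # Integer ceiling division, avoids float rounding.
    return -(-replicas // ceiling_per_broker)


if __name__ == "__main__":
    print(total_replicas(100_000, 3))     # 300000
    print(min_brokers(100_000, 3, 4_000))
    print(min_brokers(100_000, 3, 6_000))
```

Read the other way round, every wasted partition inflates the minimum broker count on partition grounds alone, before throughput is even considered.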
Fan-out 1:N or Why Kafka even exists
Kafka was built for fan-out. The whole reason the log abstraction exists is that one byte written can be read by N independent consumers, decoupled in time, without coordinating with the producer.
If your average fan-out is 1, you probably shouldn't use Kafka. LinkedIn ran at an average of 5.4: the same bytes written once, read by 5.4 independent teams.
The cost-per-business-outcome collapses as fan-out grows: same software/hardware to serve five use cases instead of one.
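A rough way to put a number on this, assuming you measure fan-out as bytes read divided by bytes written and treat the platform cost as fixed (a simplification, since extra consumers do add some egress):

```python
def average_fan_out(bytes_written: float, bytes_read: float) -> float:
    """How many times, on average, each written byte is consumed."""
    if bytes_written <= 0:
        raise ValueError("bytes_written must be positive")
    return bytes_read / bytes_written


def cost_per_use_case(platform_cost: float, fan_out: float) -> float:
    """Platform cost spread across the use cases reading the same data."""
    if fan_out <= 0:
        raise ValueError("fan_out must be positive")
    return platform_cost / fan_out


if __name__ == "__main__":
    fan_out = average_fan_out(bytes_written=100.0, bytes_read=540.0)
    print(fan_out)
    # The 10,000 below is an illustrative monthly figure, not a benchmark.
    print(cost_per_use_case(10_000.0, 1))
    print(cost_per_use_case(10_000.0, 5))
```

Both inputs are typically available from broker bytes-in and bytes-out metrics, once replication traffic is excluded from the read side.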
If your Kafka footprint is growing, how do you tell whether that growth is good (more business use cases) or bad (waste)?
- Duplicated topics/data means more storage, replication, data pipelines, probably due to a lack of discovery and federated ownership.
- Too many partitions because no one knows how to size them, so teams over-provision. You can't reduce a topic's partition count, and increasing it changes the key-to-partition mapping, which breaks per-key ordering. Surfacing the waste requires chargeback at the team-and-topic level. You can't optimize what you can't see attributed.
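Team-and-topic chargeback can start very simply. The sketch below splits a bill in proportion to bytes per topic and rolls it up by owner. The allocation key (bytes), the topic names, and the "unowned" bucket are our assumptions for illustration. Real chargeback models usually also weigh partitions, storage, and egress.

```python
from collections import defaultdict


def chargeback(total_cost: float, topic_bytes: dict, topic_owner: dict) -> dict:
    """Allocate total_cost to teams in proportion to their topics' bytes.

    Topics with no entry in topic_owner land in an "unowned" bucket,
    which is usually the most interesting line of the report.
    """
    total_bytes = sum(topic_bytes.values())
    if total_bytes <= 0:
        raise ValueError("no traffic to allocate")
    per_team = defaultdict(float)
    for topic, nbytes in topic_bytes.items():
        owner = topic_owner.get(topic, "unowned")
        per_team[owner] += total_cost * nbytes / total_bytes
    return dict(per_team)


if __name__ == "__main__":
    bill = chargeback(
        total_cost=1000.0,
        topic_bytes={"orders": 600, "clicks": 300, "tmp-export": 100},
        topic_owner={"orders": "checkout", "clicks": "analytics"},
    )
    for team, cost in sorted(bill.items()):
        print(team, round(cost, 2))
```

A large "unowned" share is the same signal as traffic nobody can explain: cost with no one accountable for it.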
"A third of our traffic, we know what it has to do with, but we don't know exactly what they're doing."
— Platform engineer, US retail company
That's the usage layer leaking: it costs real money, and nobody can fix anything because nobody owns it. These are governance, discovery, and self-service problems, and they have nothing to do with infrastructure. From our customers:
- How do you grow a Kafka footprint 5x, from 60 to 300+ applications and 800+ users, without expanding the platform team or the infra? You remove friction and add discovery, ownership, and self-service provisioning.
- How do you consolidate 25 clusters across three data centers, 170 dependent applications, in nine months with zero downtime? You add visibility about what's running on those clusters.
Cost optimization is everybody's concern and nobody's KPI (maybe your CFO's, but the CFO is far from these technical concerns). It's hard to trim the initial safe over-provisioning: what if we need that capacity later? What if something breaks when we touch it? It's all about risk management.
Saying "it's expensive" is not a business case. What works: show the waste, the annual cost, and the effort to reduce it. Until someone turns the cost into a business case with an owner, over-provisioning stays the safe default.
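The three numbers in that business case (waste, annual cost, effort) fit in a few lines. This sketch assumes per-partition-hour billing. The price, the waste share, and the effort cost are placeholders you replace with figures from your own bill and your own estimate.

```python
HOURS_PER_YEAR = 24 * 365


def annual_partition_waste_cost(
    partitions: int, waste_share: float, usd_per_partition_hour: float
) -> float:
    """Yearly spend on partitions nobody needs, under per-partition-hour billing."""
    return partitions * waste_share * usd_per_partition_hour * HOURS_PER_YEAR


def payback_months(annual_saving: float, one_off_effort_cost: float) -> float:
    """Months of savings needed to cover the effort of removing the waste."""
    if annual_saving <= 0:
        raise ValueError("annual_saving must be positive")
    return one_off_effort_cost / (annual_saving / 12)


if __name__ == "__main__":
    # All three inputs below are hypothetical placeholders.
    waste = annual_partition_waste_cost(
        partitions=10_000, waste_share=0.5, usd_per_partition_hour=0.001
    )
    print(round(waste, 2))
    print(round(payback_months(waste, one_off_effort_cost=10_950.0), 1))
```

A payback period and an owner are what turn "it's expensive" into something a budget holder can approve.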
Where to spend your Kafka cost optimization effort
Most Kafka deployments we see have more headroom in their usage layer than in their infra layer: topics nobody reads, partitions nobody needs, and teams that would benefit from streaming but find it too complex to onboard.
There's a recurring pattern in the streaming industry too: we chase the next architectural idea (diskless, Iceberg topics, single-AZ) while we haven't even fixed what we could in the usage layer: Who is using this? For what? And why aren't more teams using it?
If you're working on your Kafka cost in 2026:
- Do the infrastructure pass once. Instance types, AZ placement, BYOC.
- Do a config pass once. Compression, retention, partition right-sizing, fetch-from-follower, tiered storage.
- Spend the rest of the year on the usage layer. Fan-out, ownership, discovery, chargeback, self-service provisioning.
Want to see where your usage layer is leaking?
Get a free Kafka cost analysis with our field engineering team. We will walk your estate together, map cost back to teams and topics, and show you where the payoff sits above the broker.
