
Whitepaper

Where Kafka Costs Hide: A Field Guide

For platform leads, Kafka architects, and streaming engineers who know Kafka is expensive but can't fully explain where or why. A structured way to find, quantify, and act on the hidden costs in your estate.

Executive summary

Most teams running Kafka in production suspect something in the bill is wasted, but don't have an easy way to scope it. This guide is for platform leads, Kafka architects, and streaming engineers who want a structured way to find, quantify, and act on the waste in their own estate.

The headline finding from the cost analyses we've run: about 25 to 40 percent of a typical Kafka infrastructure bill is recoverable, mostly from non-production environments where the business risk is low. The waste isn't coming from usage. It's coming from how the estate is structured and provisioned, and from decisions made at topic creation that nobody ever revisits.

How we measure "recoverable"

The 25 to 40 percent figure covers what can be removed through config changes, topic retirement, cluster consolidation, and replatforming where warranted. The flex capacity a healthy estate carries for growth and burst sits outside that figure.

The rest of the guide is the methodology behind that finding. Part 1 shows what the waste actually looks like when we run the analysis. The six patterns that follow each get their own section, with the diagnostic, what each costs at scale, how each surfaces on hosted versus self-managed infrastructure, and where the patterns cascade into each other. A closing section maps each pattern to the response approaches that fit, plus the data points worth gathering before any platform review or contract conversation.

Before getting into the framework, here is what we keep running into when we look at real Kafka estates. Each of these is a specific finding from a real cost analysis, anonymized but otherwise unedited.

The 300-partition topic with no traffic. Sitting on a managed cluster that gets billed by capacity unit. The partition count was set at creation against a peak load that never came. Nobody has questioned it since, and nobody can lower it without recreating the topic and coordinating with every producer and consumer.
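A first pass at spotting this pattern can be sketched from a metrics export. Every name and number below is hypothetical; in practice the inputs would come from broker JMX metrics or your provider's usage reports:

```python
# Flag topics whose partition count is far out of proportion to their
# observed throughput. The topic names, partition counts, and byte rates
# here are illustrative placeholders, not data from a real cluster.
topics = [
    # (topic, partitions, avg bytes-in per second over the lookback window)
    ("orders.events",      300,       0),
    ("payments.ledger",     48, 250_000),
    ("audit.trail",        120,      40),
]

def overpartitioned(topics, min_partitions=64, max_bytes_per_sec=1_000):
    """Topics carrying many partitions but near-zero traffic."""
    return [
        (name, parts, bps)
        for name, parts, bps in topics
        if parts >= min_partitions and bps <= max_bytes_per_sec
    ]

for name, parts, bps in overpartitioned(topics):
    print(f"{name}: {parts} partitions, {bps} B/s in")
```

The thresholds are judgment calls for your estate; the point is that the signal is a ratio, not either number on its own.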

200 clusters for 200 projects. One cluster per project. The strategy made sense at adoption: per-cluster isolation gave each team clean ownership and avoided the operational complexity of multi-tenancy. The economics didn't scale. Each cluster carried the same baseline cost regardless of how much it actually did, and there was no mechanism for any team to share infrastructure with another.

5,000 empty topics on a single cluster. None producing, none consuming, none safe to delete. The platform team didn't own them and couldn't get the project teams to confirm which were genuinely abandoned and which were waiting on a consumer that ran monthly or quarterly. Topics accumulated indefinitely because deletion was a coordination problem nobody had time to solve.
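The coordination problem gets easier with a conservative first cut: only nominate a topic for retirement when neither producing nor consuming activity has been seen for longer than the slowest known consumer cadence. A minimal sketch, with entirely hypothetical topics and timestamps:

```python
from datetime import datetime, timedelta

# Classify topics as retirement candidates only when no produce or consume
# activity falls inside a window wide enough to cover quarterly jobs.
# All topic names and timestamps below are hypothetical.
NOW = datetime(2024, 6, 1)
QUARTERLY = timedelta(days=100)  # safety margin over a quarterly consumer

last_activity = {
    # topic: (last produce, last consume) -- None means never observed
    "reports.quarterly": (datetime(2024, 3, 5), datetime(2024, 3, 6)),
    "legacy.import":     (datetime(2023, 1, 10), None),
    "tmp.loadtest":      (None, None),
}

def retirement_candidates(activity, now, window):
    out = []
    for topic, (produced, consumed) in activity.items():
        latest = max((t for t in (produced, consumed) if t), default=None)
        if latest is None or now - latest > window:
            out.append(topic)
    return sorted(out)

print(retirement_candidates(last_activity, NOW, QUARTERLY))
```

Note that the quarterly reporting topic survives the cut: the window exists precisely so that slow-cadence consumers don't get their topics deleted out from under them.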

A topic running 30 times more egress than ingress. The application consuming it had grown internally to spawn hundreds of independent consumer instances, each polling at the default rate. Nothing in the platform's monitoring flagged this. It was a valid configuration that just happened to be expensive.
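Because it's a valid configuration, nothing fails; the only tell is the ratio. A sketch of the check, using hypothetical byte counts (in practice the inputs would be the brokers' bytes-in and bytes-out metrics, or provider usage reports, over the same window):

```python
# Surface topics whose egress-to-ingress ratio suggests runaway consumer
# fan-out. The topic names and byte counts are illustrative only.
usage = {
    # topic: (bytes in, bytes out) over the same measurement window
    "inventory.sync": (1_000_000, 30_000_000),
    "orders.events":  (5_000_000,  9_000_000),
}

def fanout_alerts(usage, max_ratio=10):
    """Topics where egress exceeds ingress by more than max_ratio."""
    return {
        topic: round(out / max(inb, 1), 1)
        for topic, (inb, out) in usage.items()
        if out / max(inb, 1) > max_ratio
    }

print(fanout_alerts(usage))
```

A healthy fan-out of a few consumer groups is normal; the threshold separates expected replication of reads from the hundreds-of-instances case described above.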

A monthly Kafka bill that doubled in a year. $60K to $120K, while business throughput grew at a fraction of that pace. The doubling came from accumulation: new topics inheriting peak-load partition defaults, retention bumps that became permanent, clusters spun up for projects that were never decommissioned.
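The arithmetic behind that comparison is worth making explicit, because it separates the share of growth the business explains from the share that is structural accumulation. The figures mirror the anonymized example above; the 20 percent throughput growth is an illustrative assumption:

```python
# Compare bill growth with throughput growth to estimate how much of the
# increase is structural accumulation rather than usage. The 20% business
# growth rate is an assumed, illustrative figure.
bill_start, bill_end = 60_000, 120_000   # monthly bill, year over year
throughput_growth = 0.20                 # assumed business growth

bill_growth = bill_end / bill_start - 1           # +100%
usage_driven = bill_start * throughput_growth     # increase usage explains
structural = (bill_end - bill_start) - usage_driven

print(f"bill grew {bill_growth:.0%}; ~${structural:,.0f}/mo of the "
      f"increase is not explained by throughput")
```

Under those assumptions, roughly $48K of the $60K monthly increase has no usage-side explanation, which is what makes accumulation, not growth, the thing to investigate.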

None of these came from teams making mistakes. They came from teams operating under the constraints Kafka imposes on every estate at scale: no built-in cost visibility, partition counts that can't be reduced once set, retention settings that harden into policy nobody questions, and cluster decisions locked in years ago. A well-run estate still accumulates waste over time; the patterns just become invisible from inside day-to-day operations.

The pattern under the patterns
  • The waste is structural, not behavioral. Every observation above came from a team that was making reasonable individual decisions. The patterns are what those decisions add up to over years.
  • Most of the recoverable spend is invisible from inside day-to-day operations. Kafka doesn't expose cost-by-team, partition-vs-throughput, or aggregate utilization in a way that surfaces drift.
  • The methodology is in the rest of this guide. Each pattern below has its own section, with the diagnostic to spot it, what it costs, and how it surfaces on hosted versus self-managed platforms.
