Kafka Costs: Stop Overpaying by 30-50%
Teams overpay for Kafka by 30-50% through over-provisioning and topic sprawl. Measure per-team usage and right-size clusters to cut costs.

Your Kafka bill isn't a fixed cost. It's accumulated decisions.
Every topic created "just in case," every retention policy set to 30 days when data gets consumed in 30 minutes, every cluster provisioned for peak load that hits once a quarter—these decisions compound into waste. Most teams overpay for Kafka by 30-50% not because cloud providers are expensive, but because infrastructure is provisioned once and never right-sized afterward.
The real cost isn't your monthly cloud bill. It's the engineering time spent managing sprawl: duplicated topics because teams didn't know what existed, over-provisioned clusters because capacity planning happened once during setup, retention policies set to "forever" because nobody reviewed them. Organizations report $200K+ in annual savings from addressing these patterns—not through vendor negotiation, but through operational discipline.
Cost optimization isn't about switching providers. It's about matching infrastructure to actual usage instead of worst-case assumptions.
The Over-Provisioning Tax
Over-provisioning happens for good reasons: nobody wants to be the engineer who underestimated capacity and caused an outage. So teams provision for peak load—Black Friday traffic, end-of-quarter reporting, annual data migrations—and pay for that capacity year-round.
The problem is that peak load might be 10x normal load but happens 1% of the time. Paying for peak capacity continuously means that 99% of the time, you're paying for idle capacity.
This manifests in three ways: storage, compute, and network.
Storage over-provisioning happens when retention policies don't match consumption patterns. A topic configured for 30-day retention where consumers read within 1 hour means 29 days of unnecessary storage. Every extra day of retention adds storage and replication cost; retention should align with real business recovery needs, not habit.
Default retention settings are often too generous. If old data isn't being used, shorter retention saves disk space and reduces replication load. Storing 1TB for 30 days costs roughly 30x as much as storing it for 1 day, yet teams rarely revisit retention after initial topic creation.
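Auditing retention is a good first step. The sketch below uses the confluent-kafka Python AdminClient to flag topics whose retention.ms exceeds a chosen budget; the bootstrap address and the 7-day threshold are placeholder assumptions, not recommendations.

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder address
MAX_RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # flag anything retained longer than 7 days

topics = [t for t in admin.list_topics(timeout=10).topics if not t.startswith("__")]
resources = [ConfigResource(ConfigResource.Type.TOPIC, t) for t in topics]

for resource, future in admin.describe_configs(resources).items():
    retention_ms = int(future.result()["retention.ms"].value)
    if retention_ms == -1 or retention_ms > MAX_RETENTION_MS:
        print(f"{resource.name}: retention.ms={retention_ms} exceeds the budget")
```

Running a report like this on a schedule turns retention from a set-and-forget default into a reviewed decision.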
Compute over-provisioning happens when clusters are sized for peak throughput but run at 20% CPU most of the time. Cloud costs are based on provisioned capacity, not actual usage. A cluster sized for 100k messages/second that processes 20k messages/second on average wastes 80% of compute budget.
Network over-provisioning happens through unnecessary cross-region replication or excessive consumer groups. Every additional consumer increases network egress. Each cross-region replica adds another full copy of every message to network transfer and storage. Without monitoring which consumers are actually active, teams pay for data transfer that serves no purpose.
Topic Sprawl and Duplication
Topic sprawl happens when teams create resources without knowing what already exists. A team needs order data, doesn't find the existing orders-processed topic, and creates order-events-v2. Six months later, three topics serve the same purpose, and infrastructure costs reflect the duplication.
The cost isn't just storage—it's the compounding inefficiency. Three teams maintain three pipelines to produce the same data. Consumers subscribe to all three "just in case," tripling network egress. Schema evolution happens independently, breaking consumers when changes aren't coordinated.
Organizations report saving 3,500+ hours per year after implementing topic discovery mechanisms. The savings come from preventing duplication: teams find existing topics instead of rebuilding them, reuse established schemas instead of creating incompatible versions, and consolidate redundant pipelines.
Discovery prevents waste, but it requires infrastructure: topic catalogs that make resources searchable, ownership metadata that shows who to contact, and usage metrics that show whether a topic is active or abandoned.
Retention Policy Antipatterns
Retention policies determine how long data persists and directly impact storage costs. Three antipatterns drive unnecessary costs:
Set-and-forget retention: Topics created with 30-day retention are never revisited, even though actual consumption patterns show consumers read within hours. Review retention.ms settings periodically. If consumers lag by at most 2 hours during normal operations and you need 48 hours of safety margin for incidents, 7-day retention is sufficient, not 30 days; the config sketch after these antipatterns shows the change.
Uniform retention across environments: Production retention might need 30 days for compliance, but development and staging don't. Setting the same retention in all environments means paying production-grade storage for dev and staging data that doesn't require compliance guarantees, roughly tripling storage spend when both mirror production.
Retention without compaction: Topics storing entity state (user profiles, product catalogs) don't need full history—they need latest state. Compacted topics retain only the most recent value per key, eliminating historical data that consumers don't read. Using standard retention instead of compaction for state topics wastes storage on messages nobody needs.
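Both fixes are plain topic configuration changes. Here is a minimal sketch with the confluent-kafka Python AdminClient; the topic names and values are hypothetical, and note that alter_configs replaces a topic's full override set, so re-specify any existing overrides you want to keep.

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder address

# Shorten retention on an event topic whose consumers read within hours.
events = ConfigResource(ConfigResource.Type.TOPIC, "orders-processed")
events.set_config("retention.ms", str(7 * 24 * 60 * 60 * 1000))  # 7 days

# Switch a state topic to compaction: keep only the latest value per key.
state = ConfigResource(ConfigResource.Type.TOPIC, "customer-profiles")
state.set_config("cleanup.policy", "compact")

# alter_configs sets the complete override set for each resource,
# so include any other overrides these topics already carry.
for resource, future in admin.alter_configs([events, state]).items():
    future.result()  # raises if the broker rejects the change
    print(f"updated {resource.name}")
```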
Multi-Cluster and Multi-Region Costs
Running Kafka across multiple clusters or regions serves legitimate purposes: environment separation (dev, staging, prod), compliance (data residency requirements), and disaster recovery (cross-region replication). But each adds cost.
Environment proliferation: Running separate Kafka clusters for every feature team or microservice increases fixed costs. A cluster requires minimum infrastructure: brokers, ZooKeeper/KRaft, monitoring. Multiplying this by 20 feature teams means 20x the fixed costs, even if aggregate traffic would fit comfortably in 3 clusters.
The alternative is multi-tenancy: shared clusters with namespace isolation. This reduces infrastructure overhead but requires access control to prevent teams from interfering with each other.
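One way to enforce that isolation is prefix-scoped ACLs: each team's principal can only read, write, and describe topics under its own prefix. A sketch with the confluent-kafka Python client, assuming a hypothetical convention where the payments team owns every topic named payments.*:

```python
from confluent_kafka.admin import (
    AclBinding, AclOperation, AclPermissionType,
    AdminClient, ResourcePatternType, ResourceType,
)

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder address

# Hypothetical principal and prefix; one binding per allowed operation.
acls = [
    AclBinding(ResourceType.TOPIC, "payments.", ResourcePatternType.PREFIXED,
               "User:payments-service", "*", op, AclPermissionType.ALLOW)
    for op in (AclOperation.READ, AclOperation.WRITE, AclOperation.DESCRIBE)
]

for binding, future in admin.create_acls(acls).items():
    future.result()  # raises if the ACL could not be created
```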
Cross-region replication: Disaster recovery requires data replication across regions, but indiscriminate replication multiplies costs: if every message written to one region is replicated to two others, network egress and storage triple.
Right-sized cross-region replication replicates only business-critical topics, not every development topic. If a topic can be rebuilt from source systems, it doesn't need cross-region replication. If losing 24 hours of data is acceptable, asynchronous replication is cheaper than synchronous.
Network egress: Cloud providers charge for data leaving regions. Kafka consumers in different regions than producers pay network egress for every message consumed. For high-throughput topics, this cost exceeds compute and storage.
AWS MSK pricing optimization strategies recommend keeping consumers in the same region as brokers when possible, using VPC endpoints to avoid internet egress charges, and leveraging AWS Direct Connect for on-premises consumers.
Right-Sizing Based on Actual Usage
Right-sizing means matching infrastructure to actual workload, not theoretical worst-case.
Storage right-sizing: Measure actual retention needed based on consumer lag patterns. If consumers lag by at most 4 hours during incidents, 48-hour retention provides a safety margin. Anything beyond that is waste. Tiered storage moves older data to cost-effective object storage while maintaining access, dramatically reducing storage costs for long retention periods without sacrificing data availability.
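Measuring actual lag doesn't require new tooling: committed offsets compared against the log end offset give it directly. A minimal sketch with the confluent-kafka Python client; the cluster address, topic, and group are placeholders, and converting offset lag into hours still requires checking message timestamps.

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "orders-enrichment",        # the consumer group being audited
    "enable.auto.commit": False,
})

topic = "orders-processed"  # placeholder topic
partitions = [TopicPartition(topic, p)
              for p in consumer.list_topics(topic, timeout=10).topics[topic].partitions]

total_lag = 0
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet on this partition
    total_lag += high - committed

print(f"total lag for {topic}: {total_lag} messages")
consumer.close()
```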
Compute right-sizing: Monitor actual CPU and memory usage over 30 days. If clusters run at 30% CPU and 40% memory, they're over-provisioned. Resize instances to match actual load, leaving headroom for traffic spikes but not paying for idle capacity year-round.
Use committed-use discounts for predictable workloads: long-term commitments save up to 40% on compute costs compared to on-demand pricing. Reserve capacity for baseline load and scale on-demand instances for peaks.
Partition right-sizing: Partitions enable parallelism but consume resources. A topic with 50 partitions requires more memory and file handles than one with 10 partitions. If consumer parallelism never exceeds 10 instances, 50 partitions waste resources.
Over-partitioning also impacts broker performance: thousands of partitions per broker increase replication overhead and leader election time. Right-sizing partition counts balances consumer parallelism with broker efficiency.
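A partition audit can be as simple as comparing each topic's partition count against the largest consumer group that actually reads it. The sketch below uses the confluent-kafka Python AdminClient; the parallelism map is a hypothetical input you would populate from deployment configs or consumer group descriptions.

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder address

# Hypothetical: maximum consumer instances observed per topic.
consumer_parallelism = {"orders-processed": 10, "clickstream-raw": 24}

for name, meta in admin.list_topics(timeout=10).topics.items():
    if name.startswith("__"):
        continue  # skip internal topics
    partitions = len(meta.partitions)
    readers = consumer_parallelism.get(name)
    if readers and partitions > 2 * readers:
        print(f"{name}: {partitions} partitions, but at most {readers} consumer instances")
```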
Preventing Duplication Through Discovery
Duplication is invisible waste: teams don't know they're duplicating infrastructure because they don't know what exists.
Topic catalogs make existing resources discoverable. Engineers search for "order" and find orders-processed, preventing them from creating order-events-v2. Labels and descriptions clarify purpose: "raw order events from checkout service" vs. "enriched orders with customer and inventory data."
Usage metrics show whether resources are active. A topic with zero consumers for 90 days is probably abandoned. A schema version with zero producers means it was registered but never used. Surfacing these metrics helps teams identify waste and clean it up.
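Producer-side activity is easy to sample: if a topic's log end offsets don't move between two snapshots, nothing is writing to it. A rough sketch with the confluent-kafka Python client; the connection details and topic are placeholders, and a real audit would compare snapshots taken days apart rather than one minute.

```python
import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({"bootstrap.servers": "localhost:9092",  # placeholder address
                     "group.id": "usage-audit"})

def end_offsets(topic):
    meta = consumer.list_topics(topic, timeout=10)
    return {p: consumer.get_watermark_offsets(TopicPartition(topic, p), timeout=10)[1]
            for p in meta.topics[topic].partitions}

topic = "order-events-v2"  # hypothetical cleanup candidate
before = end_offsets(topic)
time.sleep(60)  # a real audit would compare snapshots taken days apart
after = end_offsets(topic)

if before == after:
    print(f"{topic}: no new messages in the sample window, flag for review")
consumer.close()
```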
Ownership tracking maps resources to teams. When duplication is discovered, ownership information shows who to contact about consolidation. Without ownership, duplicated topics persist because nobody knows who's responsible for cleaning them up.
Organizations implementing topic catalogs report 75% fewer provisioning tickets as teams self-serve by discovering and reusing existing resources instead of filing requests to create duplicates.
Cost Visibility Across Clusters
Cost optimization requires knowing where money goes. Most Kafka deployments lack cost attribution: storage, compute, and network costs are aggregated, but which teams or applications drive them isn't visible. Conduktor cost control provides this visibility.
Cost attribution by topic: Tagging topics with owning team allows costs to be attributed. If the analytics team owns 60% of storage costs but represents 20% of headcount, their infrastructure might be over-provisioned or under-optimized.
Cost attribution by environment: Development and staging shouldn't cost as much as production, but without environment tagging, you can't verify this. If non-production costs exceed 30% of total spend, environments are probably over-provisioned.
Cost trend analysis: Costs should correlate with business metrics. If message volume is flat but storage costs grow 20% quarter-over-quarter, retention policies might be too long or compaction isn't working.
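The mechanics of attribution are straightforward once topics carry ownership and environment metadata. A purely illustrative sketch; the per-topic sizes, catalog entries, and storage price are hypothetical inputs that would normally come from broker metrics and a topic catalog.

```python
from collections import defaultdict

# Hypothetical inputs: retained bytes per topic and catalog metadata.
topic_bytes = {"orders-processed": 12e12, "clickstream-raw": 48e12, "orders-staging": 9e12}
topic_meta = {
    "orders-processed": {"owner": "payments", "env": "prod"},
    "clickstream-raw": {"owner": "analytics", "env": "prod"},
    "orders-staging": {"owner": "payments", "env": "staging"},
}
COST_PER_TB_MONTH = 25.0  # placeholder storage price

by_owner, by_env = defaultdict(float), defaultdict(float)
for topic, size in topic_bytes.items():
    cost = size / 1e12 * COST_PER_TB_MONTH
    by_owner[topic_meta[topic]["owner"]] += cost
    by_env[topic_meta[topic]["env"]] += cost

print("monthly storage cost by team:", dict(by_owner))
print("monthly storage cost by environment:", dict(by_env))
```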
Measuring and Optimizing
Track three metrics: cost per message, storage efficiency, and cost per team.
Cost per message divides total Kafka spending by messages processed. If this trends upward while traffic stays flat, efficiency is degrading. Investigate retention policies, partition counts, and replication settings.
Storage efficiency measures storage cost divided by data written. If you write 1TB/day but store 100TB, data persists 100 days on average. If business requirements need only 7 days, storage efficiency is poor.
Cost per team attributes spending to teams or applications. Teams with disproportionate costs should audit their resources: are retention policies appropriate? Are unused topics being cleaned up? Is replication necessary?
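All three are simple ratios. A worked example with hypothetical monthly figures:

```python
# Hypothetical monthly figures.
total_spend_usd = 42_000
messages_processed = 2.1e9      # messages per month
bytes_written_per_day = 1e12    # 1 TB/day
bytes_retained = 100e12         # 100 TB on disk

cost_per_million_messages = total_spend_usd / (messages_processed / 1e6)
average_retention_days = bytes_retained / bytes_written_per_day

print(f"cost per million messages: ${cost_per_million_messages:.2f}")     # $20.00
print(f"average effective retention: {average_retention_days:.0f} days")  # 100 days
# If the business needs only 7 days, roughly 93% of retained data is candidate waste.
```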
The Path Forward
Kafka cost optimization isn't about switching cloud providers or negotiating discounts. It's about operational discipline: matching retention to actual needs, preventing duplication through discovery, right-sizing based on actual load, and attributing costs to teams.
Conduktor provides cost visibility through topic catalogs, usage metrics, and retention policy recommendations, and cost governance can be automated with Terraform or the CLI. Organizations report $200K+ in annual savings from consolidating duplicated topics, right-sizing retention policies, and eliminating unused resources. The savings don't come from infrastructure changes—they come from knowing what exists, who owns it, and whether it's being used.
If your Kafka bill keeps growing but traffic stays flat, the problem isn't Kafka—it's visibility into where the costs come from.
Related: 8 Ways to Cut Kafka Costs → · Chargeback for Kafka → · Conduktor Insights →