Kafka Cost Optimization

Find waste. Attribute spend. Run leaner.

Kafka usage grows by accretion: more topics, clusters, integrations, and the operational overhead behind them. Conduktor gives platform teams the visibility, attribution, and architectural levers to find what's recoverable, run more efficiently, and keep the gains from eroding.

Get your Kafka Cost Analysis

Where Kafka Costs Actually Hide

The Kafka bill is rarely one line item. It surfaces across four layers, and waste in one pulls the others up with it. Operational overhead is the layer most teams underestimate. Most platform teams find 25 to 40 percent recoverable spend across these layers without a replatform or a renegotiation.

Infrastructure

Brokers, partitions, storage, replication. The biggest layer and the easiest to act on. Six patterns drive most of the waste, from partition overprovisioning to topic proliferation.

Ecosystem tooling

Streams, ksqlDB, Flink, Connect. Sits on top of infrastructure and inherits its inefficiencies. Filter-only jobs alone can dominate stream-processing usage.

Vendor and licensing

Support contracts priced as a percentage of platform spend. Tier upgrades and add-ons scale with cluster footprint. Shrink the infrastructure and these shrink with it.

Operational overhead

The cost that never appears on the bill, and the largest one most teams miss. Platform-team time on manual provisioning, firefighting, and cleanup coordination that better tooling would absorb.

Read: A better conversation about Kafka costs →

From patterns to solutions

Most Kafka estates carry the same six waste patterns. Conduktor helps in two ways: identifying the waste and addressing the inefficiencies behind it.

Pattern	What it looks like	How Conduktor helps
Partition overprovisioning	Topics with 30+ partitions for use cases that need 3, brokers approaching the per-broker partition-replica ceiling	Insights finds the over-partitioned topics. Cost Guardrails bound new ones at creation.
Retention misalignment	Default retention on every topic regardless of consumer lag, retention longer than any consumer reads back	Insights flags retention overrun. Cost Guardrails cap retention windows on new topics.
Cluster sprawl	A cluster per project, environment, or business unit, each carrying its own broker fixed cost	Capacity Pooling consolidates dedicated isolation clusters onto shared infrastructure.
Topic proliferation and duplication	Orphan topics with no traffic, near-duplicates, derived topics from filter-only stream processing	Insights surfaces orphans and duplicates. Topic Views replace filter-only derivations. Chargeback gets app teams retiring what they don't need.
Inefficient client patterns	Producers without compression, idempotence misconfigured, consumers without partition awareness	Cost Guardrails enforce compression, idempotence, and ack policies on every producer.
Static capacity per resource	Every topic gets dedicated partitions and replicas regardless of actual throughput	Capacity Pooling replaces dedicated partitions with shared backing topics via concentration.

Identify Waste

You can't fix what you can't see, and cleanup pushed centrally never sticks. Insights finds the waste, while Chargeback puts the bill in terms the business can act on.

Insights

Find the waste hiding in your estate

Insights surfaces the topics you are paying for that you should not be: orphans, oversized, over-retained. The Cost Control view ranks them so cleanup runs by impact, not by cluster.

Cost Control ranks expensive topics by storage, partitions, and throughput, with the underlying pattern (over-partitioned, retention overrun, orphan) flagged on each
Filter by application, topic prefix, or cluster so cleanup scopes to the topics that matter
RBAC-aware so app owners see the topics they own and act on them without escalating to a platform admin
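
The flagging and ranking described above can be sketched in a few lines. This is an illustration only, not how Insights is implemented: the `TopicStats` fields, the thresholds, and the choice to rank by storage alone are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MAX_PARTITIONS_PER_MBPS = 6   # partitions tolerated per MB/s of peak throughput
MIN_PARTITIONS = 3            # floor below which a topic is never flagged
ORPHAN_IDLE_DAYS = 30         # no traffic for this long counts as orphaned


@dataclass
class TopicStats:
    name: str
    partitions: int
    storage_gb: float
    peak_mb_per_s: float
    retention_days: float
    max_consumer_lookback_days: float
    idle_days: int


def flag_patterns(t: TopicStats) -> list[str]:
    """Return the waste patterns that apply to one topic."""
    flags = []
    if t.idle_days >= ORPHAN_IDLE_DAYS:
        flags.append("orphan")
    if t.partitions > max(MIN_PARTITIONS, t.peak_mb_per_s * MAX_PARTITIONS_PER_MBPS):
        flags.append("over-partitioned")
    if t.retention_days > t.max_consumer_lookback_days:
        flags.append("retention overrun")
    return flags


def rank_by_storage(topics: list[TopicStats]) -> list[TopicStats]:
    """Most expensive first, using storage as a stand-in for cost."""
    return sorted(topics, key=lambda t: t.storage_gb, reverse=True)


topics = [
    TopicStats("audit-archive", 3, 10.0, 0.1, 7, 7, 45),
    TopicStats("clickstream-raw", 30, 500.0, 0.2, 7, 1, 0),
]
for t in rank_by_storage(topics):
    print(t.name, flag_patterns(t))
```

Ranking before flagging keeps the cleanup list ordered by impact, so the first topics an owner sees are the ones worth the coordination effort.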

Learn more about Insights →


Chargeback

Make app teams own their costs

Cleanup pushed centrally never sticks. Chargeback turns consumption into an accounting view of spend, so teams generating the cost see their own bill before anyone has to ask.

Spend rolls up by application, service account, or any label (team, department, business unit) so reports match the org chart, not the cluster topology
Configurable unit costs for storage, partitions, and ingress/egress tied to your actual contract terms
Confluent Cloud direct: pulls ingress and egress from the Confluent Metrics API, no Gateway required
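
As a rough illustration of how unit costs and labels combine into a per-team bill, the sketch below multiplies metered usage by a unit price and rolls the result up by label. The prices, metric names, and row layout are hypothetical and do not reflect Conduktor's data model or any real contract.

```python
from collections import defaultdict

# Hypothetical unit costs; real values would come from your contract terms.
UNIT_COSTS = {
    "storage_gb": 0.10,   # per GB-month stored
    "partitions": 0.50,   # per partition-month
    "ingress_gb": 0.05,   # per GB written
    "egress_gb": 0.08,    # per GB read
}


def topic_cost(usage: dict) -> float:
    """Price one topic's usage; metrics that are missing count as zero."""
    return sum(usage.get(metric, 0.0) * price for metric, price in UNIT_COSTS.items())


def chargeback(usage_rows: list[dict], label: str = "team") -> dict:
    """Roll spend up by any label; rows without the label land in 'unlabeled'."""
    totals = defaultdict(float)
    for row in usage_rows:
        totals[row["labels"].get(label, "unlabeled")] += topic_cost(row["usage"])
    return {key: round(value, 2) for key, value in totals.items()}


rows = [
    {"labels": {"team": "payments"},
     "usage": {"storage_gb": 100, "partitions": 12, "ingress_gb": 40, "egress_gb": 50}},
    {"labels": {}, "usage": {"storage_gb": 10}},
]
print(chargeback(rows))
```

Keeping an explicit "unlabeled" bucket makes unowned spend visible instead of silently dropping it from the report.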

Learn how chargeback works →


Address Inefficiency

Visibility surfaces the waste, but cutting it takes a different mix of levers: capacity pooling, topic views in place of filter-only stream processing, and creation-time guardrails on every new resource.

Capacity Pooling

Pool capacity instead of allocating it

Most cost waste is structural: each team gets its own cluster, each topic gets its own dedicated partitions. Virtual clusters and topic concentration share the underlying infrastructure without breaking isolation.

Virtual clusters consolidate dedicated isolation clusters onto shared physical infrastructure, eliminating per-team broker overhead
Topic concentration multiplexes low-volume topics onto a single physical backing topic, dropping partition counts on sparse topics by 50 to 90 percent
Hard isolation between environments, brands, or business units with no new brokers required

Learn more about virtual clusters → · Topic concentration →

Virtual Clusters · Topic Concentration

# Two isolation boundaries on one physical cluster.
# No new brokers, no new licenses, no naming conventions.
apiVersion: gateway/v2
kind: VirtualCluster
metadata:
  name: payments-team
spec:
  type: Standard
  aclEnabled: true
  superUsers:
    - payments-admin
---
apiVersion: gateway/v2
kind: VirtualCluster
metadata:
  name: orders-team
spec:
  type: Standard
  aclEnabled: true
  superUsers:
    - orders-admin

# Sparse regional topics: 24 logical partitions each,
# backed by 6 physical partitions on a shared topic.
apiVersion: gateway/v2
kind: ConcentratedTopic
metadata:
  name: customer-events-eu
  vCluster: payments-team
spec:
  advertisedPartitions: 24
  backingTopic: _concentrated_customer_events
---
apiVersion: gateway/v2
kind: ConcentratedTopic
metadata:
  name: customer-events-us
  vCluster: payments-team
spec:
  advertisedPartitions: 24
  backingTopic: _concentrated_customer_events

# Two topics × 24 partitions = 48 advertised.
# Physically backed by 6 partitions: 87.5% reduction.
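
The 87.5% figure follows directly from the partition counts in the example. A short worked calculation, with a replication factor of 3 added as an assumption to show the effect on partition replicas:

```python
def partition_reduction(advertised_per_topic: list[int], physical_partitions: int) -> float:
    """Share of advertised partitions that no longer exist physically."""
    advertised = sum(advertised_per_topic)
    return 1 - physical_partitions / advertised


advertised = [24, 24]          # customer-events-eu and customer-events-us
physical = 6                   # partitions on the shared backing topic
replication_factor = 3         # assumption, not stated in the example above

reduction = partition_reduction(advertised, physical)
replicas_avoided = (sum(advertised) - physical) * replication_factor

print(f"{reduction:.1%} fewer partitions")           # 87.5% fewer partitions
print(f"{replicas_avoided} partition replicas avoided")  # 126 partition replicas avoided
```

Every additional logical topic concentrated onto the same backing topic raises the reduction further, since the physical count stays at 6.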

Topic Views

Replace filter-only stream processing

Most stream-processing jobs do one thing: read a topic, drop rows a consumer does not need, and write the rest somewhere new. The team pays three times: engine, dev time, derived infrastructure. Topic Views handle this at the proxy.

SQL-based topic views serve a filtered or projected subset at the proxy layer, with no new physical topic or partitions to pay for
Caching for high-frequency repetitive reads, reducing broker fetch load on the source topic

Learn more about topic views →

SQL Topic View · Caching

# Replaces a Flink job that filtered "customers" to EU adults.
# No derived topic, no new partitions, no Flink instance.
apiVersion: gateway/v2
kind: Interceptor
metadata:
  name: customers-eu-adults
spec:
  pluginClass: io.conduktor.gateway.interceptor.VirtualSqlTopicPlugin
  priority: 100
  config:
    virtualTopic: customers-eu-adults
    statement: |
      SELECT firstName, lastName, email, country
      FROM customers
      WHERE age >= 18 AND country IN ('FR', 'DE', 'ES')
    schemaRegistryConfig:
      host: http://schema-registry:8081
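
For readers less used to SQL, the statement above amounts to a filter plus a projection. The sketch below expresses the same logic over plain dictionaries; it says nothing about how the proxy evaluates the view, and the sample records are invented.

```python
COUNTRIES = {"FR", "DE", "ES"}
FIELDS = ("firstName", "lastName", "email", "country")


def customers_eu_adults(records):
    """Yield the projected records that pass the WHERE clause."""
    for record in records:
        if record["age"] >= 18 and record["country"] in COUNTRIES:
            yield {field: record[field] for field in FIELDS}


# Invented sample data, for illustration only.
customers = [
    {"firstName": "Ana", "lastName": "Ruiz", "email": "ana@example.com",
     "country": "ES", "age": 34, "loyaltyTier": "gold"},
    {"firstName": "Tom", "lastName": "Hale", "email": "tom@example.com",
     "country": "US", "age": 41, "loyaltyTier": "silver"},
    {"firstName": "Léa", "lastName": "Martin", "email": "lea@example.com",
     "country": "FR", "age": 16, "loyaltyTier": "none"},
]
print(list(customers_eu_adults(customers)))
```

Only the first record survives: the second fails the country test and the third fails the age test. Fields outside the SELECT list, such as `loyaltyTier`, never reach the consumer.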

# High-frequency read patterns served from cache,
# reducing broker fetch load and outbound bandwidth.
apiVersion: gateway/v2
kind: Interceptor
metadata:
  name: cache-reference-data
spec:
  pluginClass: io.conduktor.gateway.interceptor.CacheInterceptorPlugin
  priority: 100
  config:
    topic: "reference.*"
    cacheConfig:
      type: IN_MEMORY
      inMemConfig:
        cacheSize: 1000
        expireTimeMs: 60000
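
To make the two cache settings concrete, here is a minimal in-memory cache with a size bound and a time-based expiry. It is a generic sketch, not the interceptor's implementation: the class name, the oldest-first eviction rule, and the `fetch` callback are assumptions.

```python
import time
from collections import OrderedDict


class TtlCache:
    """Bounded cache whose entries expire after a fixed time."""

    def __init__(self, cache_size=1000, expire_time_ms=60_000, clock=time.monotonic):
        self.cache_size = cache_size
        self.expire_s = expire_time_ms / 1000
        self.clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, value)

    def get(self, key, fetch):
        """Return a fresh cached value, or call fetch(key) and store the result."""
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.expire_s:
            return hit[1]
        value = fetch(key)  # stands in for a fetch against the broker
        self._entries[key] = (now, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.cache_size:
            self._entries.popitem(last=False)  # evict the oldest stored entry
        return value


broker_fetches = []


def fetch_from_broker(key):
    broker_fetches.append(key)
    return f"payload-for-{key}"


cache = TtlCache(cache_size=1000, expire_time_ms=60_000)
cache.get("reference.countries", fetch_from_broker)
cache.get("reference.countries", fetch_from_broker)
print(len(broker_fetches))  # 1: the second read was served from cache
```

Repeated reads inside the expiry window cost one broker fetch instead of one per read, which is where the reduction in fetch load comes from.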

Cost Guardrails

Stop waste at creation time

The cheapest cleanup is the one you never have to run. Bound partitions, retention, and replication at topic creation and require an owner on every new resource, so future loads inherit the discipline.

Partition and retention bounds enforced at topic creation, with override-to-fixed or block actions
Replication factor enforcement to prevent quietly doubled replication on non-critical topics
Producer policies for compression and idempotence to standardize client efficiency
Federated ownership required on every new resource so orphan topics are blocked by construction

Learn more about safeguards → · Federated ownership →

Topic Creation Policy · Producer Policy

# New topics are bounded on partitions, retention,
# and replication factor at creation time.
apiVersion: gateway/v2
kind: Interceptor
metadata:
  name: topic-cost-policy
spec:
  pluginClass: io.conduktor.gateway.interceptor.safeguard.CreateTopicPolicyPlugin
  priority: 100
  config:
    numPartition:
      min: 3
      max: 12
      action: OVERRIDE
      overrideValue: 6
    replicationFactor:
      min: 3
      max: 3
      action: BLOCK
    retentionMs:
      min: 86400000            # 1 day
      max: 604800000           # 7 days
      action: OVERRIDE
      overrideValue: 259200000 # 3 days
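
The two actions can be read as follows: a request inside the bounds passes through, an out-of-bounds request is either replaced with the fixed override value or rejected. The sketch below encodes that reading. It is an assumption about the semantics based on the description above, and `apply_bound` and `PolicyViolation` are hypothetical names.

```python
class PolicyViolation(Exception):
    """Raised when a BLOCK policy rejects a topic-creation request."""


def apply_bound(name, requested, minimum, maximum, action, override_value=None):
    """Return the value the topic is created with, or raise if blocked."""
    if minimum <= requested <= maximum:
        return requested
    if action == "OVERRIDE":
        return override_value
    raise PolicyViolation(f"{name}={requested} is outside [{minimum}, {maximum}]")


# A request for 30 partitions and 30 days of retention, checked against the policy above.
partitions = apply_bound("numPartition", 30, 3, 12, "OVERRIDE", 6)
retention = apply_bound("retentionMs", 30 * 86_400_000, 86_400_000, 604_800_000,
                        "OVERRIDE", 259_200_000)
print(partitions, retention)  # 6 259200000

try:
    apply_bound("replicationFactor", 2, 3, 3, "BLOCK")
except PolicyViolation as err:
    print("blocked:", err)
```

OVERRIDE keeps topic creation self-service while still bounding cost; BLOCK suits settings such as replication factor, where silently changing the request could hide a misunderstanding.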

# Producers without an allowed compression type, acks=all,
# or idempotence are blocked. Consistent client efficiency across every team.
apiVersion: gateway/v2
kind: Interceptor
metadata:
  name: producer-efficiency-policy
spec:
  pluginClass: io.conduktor.gateway.interceptor.safeguard.ProducerPolicyPlugin
  priority: 100
  config:
    compressionType:
      allowed: ["zstd", "lz4", "snappy"]
      action: BLOCK
    acks:
      required: "all"
      action: BLOCK
    enableIdempotence:
      required: true
      action: BLOCK
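
Teams can check their own client settings against a policy like this before deploying. The sketch below does that for a producer configuration expressed with the standard Kafka property names `compression.type`, `acks`, and `enable.idempotence`. The `violations` helper is hypothetical, and it deliberately treats unset properties as violations rather than relying on client defaults, which differ across client versions.

```python
ALLOWED_COMPRESSION = {"zstd", "lz4", "snappy"}


def violations(producer_config: dict) -> list[str]:
    """Return the property names that would fail the producer policy."""
    problems = []
    if producer_config.get("compression.type") not in ALLOWED_COMPRESSION:
        problems.append("compression.type")
    if str(producer_config.get("acks")) not in {"all", "-1"}:
        problems.append("acks")
    if str(producer_config.get("enable.idempotence")).lower() != "true":
        problems.append("enable.idempotence")
    return problems


compliant = {"compression.type": "zstd", "acks": "all", "enable.idempotence": True}
legacy = {"compression.type": "gzip", "acks": 1}

print(violations(compliant))  # []
print(violations(legacy))     # ['compression.type', 'acks', 'enable.idempotence']
```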

Three approaches to sequencing the work

The capabilities above are levers. How they get applied depends on which approach the team is taking. Most platform teams run all three in parallel: defaults catch new loads, optimization works through the existing estate, and architectural changes land over a longer horizon.

Update defaults for new loads

Set policies so new topics, clusters, and clients do not inherit waste. Partition defaults, retention policies, replication enforcement, and ownership requirements at creation time. Low coordination, fast to implement. Slows future cost growth without producing immediate savings.

Optimize existing loads

Hygiene and right-sizing on what is already running: tuning retention, retiring orphans, right-sizing partition counts, consolidating duplicate topics. The gating factor is coordinating with producers and consumers, not the technical work itself. Typically moves the infrastructure bill by 10 to 20 percent in weeks to months, with reductions of 50 percent or more in estates with significant accumulated waste.

Rethink workloads

Reshape data flows: pooled capacity at the cluster and topic layer, replacing filter-only stream processing with topic views, consolidating per-team clusters. The most variable approach in timeline and outcome, with the largest impact in big estates carrying years of accumulated structural decisions.

What to expect

Typical ranges from the estates we have analyzed. Your number depends on where you are starting from and which patterns dominate.

25 to 40 percent recoverable

The typical share of the Kafka infrastructure bill that's recoverable without a replatform. Configuration tuning and topic retirement do most of the work; consolidation and architectural changes close the rest.

10 to 20 percent typical infra reduction

What optimization on existing workloads usually moves the bill by, in weeks to months. Reductions of 50 percent or more in estates with significant accumulated orphans and over-partitioning.

50 to 90 percent partition reduction

Topic concentration on sparse low-throughput topics. Regional topics that would otherwise need hundreds of partitions are backed by a fraction of the physical footprint.

~90 percent less cleanup coordination

Cleanup outreach drops from roughly 1.5 hours per project to under 15 minutes. Teams arrive already aware of their own consumption from the chargeback dashboard.

Stale topic rate from ~10% to ~3%

What happens when teams see their own bill. Steady-state staleness drops 60 to 80 percent across the estates we measure.

Support contract proportionally smaller

Support contracts on most hosted Kafka platforms are billed as a percentage of total platform spend. Cleaning up infrastructure shrinks that line item automatically, no renegotiation needed.
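
A quick worked example of how the ranges above translate into money. The monthly bill and the support rate are hypothetical placeholders; only the 25 to 40 percent range comes from this page.

```python
def savings_range(monthly_bill: float, low: float = 0.25, high: float = 0.40):
    """Recoverable spend at the low and high end of the typical range."""
    return monthly_bill * low, monthly_bill * high


def support_fee(platform_spend: float, support_rate: float) -> float:
    """Support billed as a flat percentage of platform spend."""
    return platform_spend * support_rate


bill = 100_000.0   # hypothetical monthly infrastructure bill
rate = 0.15        # hypothetical support rate, not a quoted figure

low, high = savings_range(bill)
fee_before = support_fee(bill, rate)
fee_after = support_fee(bill - low, rate)

print(f"recoverable: {low:,.0f} to {high:,.0f} per month")
print(f"support fee: {fee_before:,.0f} -> {fee_after:,.0f} at the low end")
```

The support line falls by the same percentage as the infrastructure bill, which is why it needs no separate negotiation.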

Frequently asked questions

How much can I actually save?

A meaningful share of the typical Kafka infrastructure bill is recoverable through configuration changes, retirement, consolidation, and architectural levers. The 25 to 40 percent range is what we see across estates we analyze closely. Optimization on existing workloads typically moves the bill 10 to 20 percent in weeks to months, with reductions of 50 percent or more in estates carrying significant accumulated waste. Architectural changes like pooled capacity yield more variable returns with the largest impact in big estates.

Where do the savings tend to come from?

A small number of patterns account for most recoverable cost: partition overprovisioning, retention not matched to consumer needs, cluster sprawl, topic proliferation and duplication, inefficient client patterns, and static capacity per resource. Estates vary in which dominate, but the data usually points to the biggest one or two quickly.

Where should I start?

Visibility first, almost always. Without a per-team breakdown of consumption, every other lever is operating blind. Insights and Chargeback are the typical starting point because they make the patterns and the accountability story concrete before any cleanup or architectural work begins.

What is topic concentration and how does it reduce costs?

Topic concentration multiplexes multiple low-volume logical topics onto a single physical backing topic. Applications still see independent topics with their own names and advertised partition counts, but the underlying storage is shared. Partition counts on sparse topics can drop 50 to 90 percent, directly lowering broker CPU, storage, and replication costs. Best fit is non-production environments and low-throughput production topics.

How does cost attribution work across multiple clusters and providers?

Conduktor connects to every cluster regardless of provider and tracks consumption at the application level. Gateway meters bytes in and out per service account, topic, and virtual cluster. Console aggregates this into chargeback dashboards so cost breakdowns surface by team, application, or business label across the entire Kafka estate, not one cluster at a time.

Can I reduce costs without changing platforms?

Yes. Most cost work happens within the existing platform: tuning configurations, retiring waste, consolidating clusters, and pooling capacity through virtual clusters and topic concentration. Replatforming applies to a smaller set of cases and is the longest commitment, while configuration tuning and waste retirement usually close most of the gap. When a Kafka migration is planned, cost cleanup typically runs alongside it.

How do I keep the savings from eroding?

Defaults and policies are necessary but not sufficient. Holding the gains involves three things: visibility into what the estate contains and what it costs, clear ownership of every topic and cluster, and a regular cadence of reviewing what is actually being used. Federated ownership and self-service with guardrails are the structural pieces that hold up over years.

Does Conduktor work with managed Kafka (Confluent Cloud, AWS MSK)?

Yes. Console Insights and Chargeback connect to every cluster regardless of provider. Gateway features like virtual clusters, topic concentration, and topic views work the same way, since clients connect to the gateway exactly like a broker. Managed-Kafka customers often see proportionally larger savings from concentration and view-based filtering because per-unit infrastructure cost is higher.

Ready to find what is recoverable in your estate?

Book a cost analysis with our field engineering team. We will walk through your estate together, identify the waste patterns that apply, and give a concrete estimate of where the savings sit.


