Customer Story

How an Investment Fund Used AI to Get Its Kafka Ready for More AI

An AI assistant on the Conduktor MCP rightsized an investment fund's Kafka topics, freeing about 30% of a 50+ TB cluster for its growing AI workloads.

Industry: Financial Services / Investment Management

Use Cases

Kafka cluster rightsizing
AI-assisted analysis via the Conduktor MCP
Cluster health and source of truth

Outcomes

~30% of a 50+ TB cluster reclaimed
AI did the rightsizing analysis via the MCP; the team applied the changes
Source of truth during a major broker incident

"I did the whole rightsizing of my topics almost 100% with Claude and the Conduktor MCP." - The fund's platform team

Executive Summary

A multi-strategy investment fund runs Kafka on self-managed, on-premise Confluent: market data feeding trading desks, application logs in flight, hundreds of developers producing and consuming.

14 brokers

50+ TB cluster, ~80% full before

~30% reclaimed by rightsizing

The cluster was oversized from the start, retention crept up, and topics accumulated faster than anyone pruned them. By early 2026, disk-space alerts were firing repeatedly, abnormal enough that the team stopped to look closely. An AI assistant working through the Conduktor MCP did the analysis; the team applied the changes itself, rightsizing the handful of topics driving the growth and freeing about 30% of the 50+ TB cluster. Two things came out of it right away: a much clearer picture of how the cluster was actually used, and enough headroom to settle the running debate about buying a bigger cluster. That headroom matters now: AI is starting to push more of the firm's data through Kafka toward its data lake, and the cleared cluster is what that next wave will run on.

Challenges

Kafka here had the opposite problem from most teams: not runaway growth, but a system left untouched for too long.

"Our cluster size has been very static for the last seven years. We haven't added or removed any brokers."

Oversized at birth, never re-measured. The cluster was built big and grew bigger by accretion. Capacity headroom hid the cost of long retention and abandoned topics until the cluster neared 80% full.
No clear health signal. External professional services brought in for the version upgrade never answered the team's basic question: what should it actually track? The team wanted concrete thresholds, such as how many partitions and replicas per broker is too many and what good looks like, and got reassurance instead of numbers.
A handful of topics drove almost all the weight. Eight to ten large, heavily replicated topics, mostly telemetry and trading data, accounted for most of the storage, while only needing about a day of retention.
Alerts said "full", not "why". When a disk-space alert fired, the team could see which topic was large, but not whether one or two topics had suddenly grown. A point-in-time size is not a growth signal, and chasing the difference by hand each time was slow.
A small team runs all of it. A lean platform team carries Kafka for the whole firm. Any cleanup or sizing analysis had to fit into a calendar already full with a separate platform migration.

Solution

The missing piece was knowing what good looked like: how many partitions and replicas per broker is safe, and which topics were actually driving the storage. Conduktor put real numbers on both. The team didn't add hardware or buy a bigger cluster; it measured what it had and cut the waste, with an AI assistant doing most of the legwork.

Find the source of truth first

The trust in Conduktor predates the rightsizing. During a major incident in April 2025, roughly 1,400 topics were deleted in a day, and when the servers came back the team couldn't see all its brokers through its usual monitoring. Conduktor showed the correct state of the cluster when it mattered most.

"On the day the incident happened, Datadog was not immediate enough. But Conduktor was prompt, giving us absolutely correct information."

After that, Conduktor was the team's first stop.

"Conduktor is our main visualization tool. We don't even rely on Control Center 99% of the time."

Rightsize with an AI assistant on the MCP

The analysis ran through the Conduktor MCP, which exposes the platform's metadata and metrics to an AI assistant. Instead of hand-querying topic sizes, replication factors, and retention across thousands of topics, the team worked the problem in conversation: which topics drive storage, which retention settings are excessive, what to cut.

The MCP is read-only by design, so nothing changed on the cluster on its own. It surfaced the picture and the plan; the team stayed in control and applied the changes: rightsizing the eight to ten topics driving the storage, and dropping retention on the telemetry and trading-data topics that never needed more than a day.
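To make the shape of that analysis concrete, here is a minimal Python sketch that ranks topics by on-disk footprint and estimates what a shorter retention would free. It is an illustration only: the `TopicStats` structure and both function names are hypothetical and are not the Conduktor MCP's output format or the team's actual method. The estimate also assumes a steady write rate, so that stored bytes scale linearly with retention, which real topics only approximate.

```python
from dataclasses import dataclass


@dataclass
class TopicStats:
    """Hypothetical per-topic snapshot; field names are assumptions."""
    name: str
    size_bytes: int        # on-disk size summed across all replicas
    retention_hours: int   # current time-based retention


def top_storage_drivers(topics, n):
    """Return the n largest topics by on-disk size, largest first."""
    return sorted(topics, key=lambda t: t.size_bytes, reverse=True)[:n]


def estimate_reclaim(topic, target_retention_hours):
    """Estimate bytes freed by lowering retention to the target.

    Assumes a steady write rate, so size is proportional to retention.
    Returns 0 when the target is not shorter than the current retention.
    """
    if target_retention_hours < 0:
        raise ValueError("target retention must be non-negative")
    if topic.retention_hours <= 0 or target_retention_hours >= topic.retention_hours:
        return 0
    dropped_hours = topic.retention_hours - target_retention_hours
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return topic.size_bytes * dropped_hours // topic.retention_hours
```

Under these assumptions, a topic holding seven days of data that only needs one day would free roughly six sevenths of its footprint. The numbers such a sketch produces are a starting point for a human decision, not a substitute for checking actual consumer lag and replay needs before changing retention.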

Operational value, not a cost cut

The fund treats Kafka as shared infrastructure rather than a charged-back service, so the case for cleanup was operational, not financial: more headroom, a smaller blast radius, less risk when a spike hits. The same goes for partition count. Running on-premise, the team first read partition waste as a managed-Kafka concern, something that only shows up on a cloud bill. But on a self-managed cluster the cost is just as real, in CPU, memory, open files, and recovery time, which is the case we make in "The Surprising Cost of Kafka Partition Waste".

Results

~30% of a 50+ TB cluster reclaimed. Fill dropped from near 80% to a comfortable level, on the same hardware.
The "bigger cluster" debate, settled. The reclaimed space took buying more hardware off the table for now, ending a running internal discussion about expanding the cluster.
A clearer read on the cluster. The same pass left the team understanding what actually drives storage, not just that it was filling.
AI did the analysis, the team stayed in control. The MCP surfaced what to rightsize; the team applied the changes by hand.
A trusted source of truth. Through a major broker incident and day-to-day operations, Conduktor is the tool the team reaches for first.

A foundation for what's next

Rightsizing did more than free space; it reset a seven-year-old cluster into something the team can grow on. The AI-driven demand is already taking shape, with Kafka increasingly the buffer in front of Snowflake. A cluster sitting near 80% full could not have absorbed that comfortably; a de-risked one can.

The rightsizing was a one-off pass, and the team wants the visibility to become permanent. The plan, not yet in production, is to use the MCP to poll every topic on a schedule and correlate total cluster usage against per-topic sizes, so the next disk alert arrives with an answer attached: which topic, or topics, just grew. That turns a point-in-time size into a growth signal, the gap that made every past alert a manual investigation.
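The core of that idea, comparing two snapshots rather than reading one, can be sketched in a few lines of Python. This is a hypothetical illustration, not the team's planned implementation: the function names are invented here, the snapshots are plain dictionaries of topic name to size in bytes, and how they would be collected through the MCP on a schedule is deliberately left out.

```python
from __future__ import annotations


def topic_growth(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Per-topic size change in bytes between two snapshots.

    Topics that are new in `current` count their full size as growth.
    """
    return {name: size - previous.get(name, 0) for name, size in current.items()}


def explain_growth(previous: dict[str, int], current: dict[str, int], top_n: int = 3):
    """Return the total cluster size change and the topics that grew the most."""
    total_delta = sum(current.values()) - sum(previous.values())
    deltas = topic_growth(previous, current)
    growers = sorted(
        ((name, delta) for name, delta in deltas.items() if delta > 0),
        key=lambda item: item[1],
        reverse=True,
    )[:top_n]
    return total_delta, growers


if __name__ == "__main__":
    # Made-up snapshot values, purely for illustration.
    before = {"telemetry.raw": 100, "trades.ticks": 80, "app.logs": 20}
    after = {"telemetry.raw": 160, "trades.ticks": 82, "app.logs": 20}
    total, growers = explain_growth(before, after)
    print(f"cluster grew by {total} bytes; top growers: {growers}")
```

Attaching the list of top growers to a disk alert is what would let the alert say which topics changed, rather than only that the disk is filling.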

Geography is the other front. The cluster is New York-centric while its consumers span the globe, from the New York desks to Hong Kong, so latency is the next thing to solve. Rather than stand up and run regional clusters, that load can move to the edge with Conduktor Gateway: a proxy that caches hot data close to remote readers, so a record crosses the ocean once instead of for every consumer, with access and encryption policy enforced right there at the edge.

Two large Kafka upgrades come first: removing ZooKeeper, then moving to the next major Confluent version. After that, the fund sees governance as the clear next step, with a replicated lower environment of representative data already standing by to trial it.

"The days of us designing our own solutions are past. We don't have time to reinvent the wheel. The next step in Kafka for me is improving our governance, and that's where something like Conduktor fits."

Frequently Asked Questions

How did the fund reclaim 30% of its Kafka cluster?

The platform team used Conduktor Console and the Conduktor MCP to see storage, replication, and retention across the whole cluster, identified the eight to ten heavily replicated topics driving most of the weight (mainly telemetry and trading data), and rightsized them, including cutting retention on topics that only needed about a day. The cluster went from near 80% full to a comfortable level on the same hardware.

What is the Conduktor MCP and how was it used here?

The Conduktor MCP exposes the platform's metadata and metrics to an AI assistant. The team ran the topic-rightsizing analysis in conversation, with the assistant doing almost all of the analytical work, then applied the changes itself. The MCP is read-only, so it surfaces the picture and the plan but never changes anything on the cluster, and a human stays in the loop on every action.

Does cluster rightsizing only matter on the cloud?

No. This was a self-managed, on-premise cluster with no per-partition cloud billing. The value was operational: more headroom, a smaller blast radius when a spike hits, and shorter retention on the heaviest topics, rather than a line-item cost reduction.