Testing a Kafka Proxy: Taming Millions of Permutations

Matt Searle June 25, 2026 20 min read
60-second summary
  • Proxying Kafka sounds easy, but the protocol is ~90 APIs and 600+ wire formats.
  • Every broker, client, and proxy setting multiplies that into ~9 million permutations.
  • You can't brute-force it, so we split it into three axes and use pairwise expansion.
  • Scenarios are written once, run against real Kafka clients, and assert every skip.
  • One run: 500,000+ real API calls, ~30 deployments, under network chaos, green in under 30 minutes.

When people are new to Kafka, I tell them to hold onto one fact: at its core, Kafka is brutally simple. You append to a file, and you read that data back, over the network. That part is a few hundred lines of code. The other ~300,000 lines exist to make that one operation reliable when it's distributed at scale, which, folks, is much harder than it looks.

That simplicity is a strength. A lot of Kafka's famous resilience and scaling comes from its laser focus on doing one thing well. Don't half-ass two things; whole-ass one thing.

So when you set out to proxy Kafka traffic, where you aren't even on the hook for persisting the data, surely things are straightforward?

🚫 "It's just produce and fetch, with the odd low-volume admin call. Right?"

Well. Yes and no.

Produce and fetch are a fraction of the protocol

Under the hood, Kafka clients and brokers talk a lot. Beyond produce and fetch there's a swathe of constantly-called APIs: version negotiation, authentication, leader discovery, consumer-group coordination, transactions, topology discovery. Almost every one supports multiple versions, so old and new clients can share a cluster.

~90 APIs × (request + response) × many versions → 600+ wire formats
Around 600 distinct request/response shapes to get right, before a single feature is switched on.

The version differences aren't cosmetic. Behaviour and feature access change between them, so a proxy very much has to care which version each client is using.
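To make the version point concrete, here is a minimal sketch of how a usable version could be picked for one API when a proxy sits in the middle: the highest version that client, proxy, and broker all support. The `negotiate` helper and every version range below are illustrative assumptions, not Gateway's implementation and not real broker values.

```python
from typing import Optional, Tuple

VersionRange = Tuple[int, int]  # inclusive (min_version, max_version)


def negotiate(*ranges: VersionRange) -> Optional[int]:
    """Return the highest version inside every range, or None if they don't overlap."""
    lowest_allowed = max(lo for lo, _ in ranges)
    highest_allowed = min(hi for _, hi in ranges)
    return highest_allowed if highest_allowed >= lowest_allowed else None


# Hypothetical ranges for a single API, for illustration only.
old_client = (0, 7)
new_client = (4, 13)
proxy = (2, 11)
broker = (0, 12)

print(negotiate(old_client, proxy, broker))  # 7
print(negotiate(new_client, proxy, broker))  # 11
print(negotiate((12, 13), proxy, broker))    # None: no common version
```

Two clients on the same cluster end up speaking different versions of the same API, which is why each one needs its own coverage.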

And in a test run like ours, by raw call count, produce/fetch are far from the whole story (by bytes they still dominate, but that's not what a proxy spends most of its decisions on). Everything else forms a significant bulk: the ApiVersions negotiation alone fires tens of thousands of times in a single run as connections churn. A large amount of what a proxy handles is conversation about the data, not the data itself. (A single consumer's poll() already fans out into a surprising number of these calls before it reads a byte.)

Brokers and clients add their own variables

The 600+ wire formats are just the start. Brokers and clients each have levers that change the conversation:

Side | Varies
Broker | listeners; security protocol (PLAINTEXT, SSL/mTLS, SASL_PLAINTEXT, SASL_SSL); controller (ZooKeeper or KRaft); topology
Client | connection and partition-leader failover handling (often 10+ broker connections at once); compression; batching; which APIs and versions it triggers. And there are two main families with their own quirks, Java and librdkafka (which underpins most non-JVM clients)
Some choices are static, baked in at startup. Others shift from one request to the next, or during the lifespan of a single connection.

So a real test of "produce and consume" isn't one test at all. We run it across Java clients reaching back several major versions plus librdkafka, against brokers from older Kafka through modern KRaft, and even non-Apache brokers like Redpanda. The same scenarios and the same assertions, with capability-aware skips where a broker lacks a feature, over wildly different wire conversations underneath.

The proxy adds another layer of variation

Finally, the proxy isn't a transparent pane of glass. You install it precisely because you want it to have an effect, from simple address translation up to field-level encryption and topic virtualisation. And most of those features behave differently depending on every other variable in play: API version, broker topology, client config.

Conduktor Gateway has a lot of them:

Each has its own options, some static config, others toggled while traffic is flowing.

A single test run exercises deployments like:

Deployment | Auth / front
1 gw → 1 broker | mutual-TLS client certs
2 gw → 3 brokers | OIDC via Keycloak
5 gw ⇄ 3 brokers (SNI) | multi-cluster + live cluster-switching
3 gw ⇄ 5 brokers (SNI) | HAProxy proxy-protocol front
…each running the relevant slices of the same scenario library.

The problem space: ~9 million permutations

Put it all together and the surface area is daunting:

protocol × broker variant × broker config × topology × client variant × client config × proxy config

Even conservatively, that's a base problem space of roughly:

50 (protocol) × 3 (broker) × 5 (broker cfg) × 5 (topology) × 5 (client) × 10 (client cfg) × 50 (proxy cfg) ≈ 9,000,000 permutations
A conservative cross-product, and this is still before client behaviour over time.

…and that's before the actual patterns of client behaviour over time, which push the functional space into hundreds of millions of possible event chains. You are not going to brute-force that.
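The arithmetic is quick to check. The sketch below multiplies the factor sizes from the estimate above; the one-second-per-case budget is a hypothetical figure, chosen only to show the scale.

```python
import math

# Factor sizes copied from the estimate above.
factors = {
    "protocol": 50,
    "broker": 3,
    "broker cfg": 5,
    "topology": 5,
    "client": 5,
    "client cfg": 10,
    "proxy cfg": 50,
}

total = math.prod(factors.values())
print(f"{total:,}")  # 9,375,000

# Hypothetical budget: one second per case, run back to back.
days = total / (60 * 60 * 24)
print(f"{days:.1f} days")  # 108.5 days
```

Even at an optimistic second per case, a single exhaustive pass would take months.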

Kafka's speed is on our side

Kafka is fast. Even a modest CI box chews through huge volumes in seconds. Good black-box testing should exploit that, and avoid the trap of grinding through cases one at a time, each in its own stand-up/tear-down bookends. The universe doesn't have time for that.

So strip black-box testing to its parts:

  1. Choose an environment, the static setup
  2. Execute a set of actions, the dynamic setup, plus the "application" traffic
  3. Observe what happened, because if you can't see it happen, it isn't really a test
  4. Assert that, given what you did and saw, the right thing happened (or was correctly prevented)

And Kafka hands us step 3 almost for free: observing what happened is one of its superpowers. The cluster is already a record of events. No extra machinery required.
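The four steps can be sketched against a toy stand-in. `FakeCluster` and `run_case` are hypothetical names, and an in-memory list is an assumption made so the example runs anywhere; a real run would drive actual Kafka clients.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeCluster:
    """In-memory stand-in for a cluster: one append-only list per topic."""
    topics: Dict[str, List[bytes]] = field(default_factory=dict)

    def produce(self, topic: str, value: bytes) -> int:
        log = self.topics.setdefault(topic, [])
        log.append(value)
        return len(log) - 1  # offset of the record just written

    def fetch(self, topic: str, offset: int = 0) -> List[bytes]:
        return self.topics.get(topic, [])[offset:]


def run_case(batch: List[bytes]) -> List[bytes]:
    cluster = FakeCluster()             # 1. choose an environment
    for value in batch:                 # 2. execute the actions
        cluster.produce("orders", value)
    return cluster.fetch("orders")      # 3. observe by reading the log back


sent = [b"a", b"b", b"c"]
observed = run_case(sent)
assert observed == sent                 # 4. assert on what was observed
```

Observation here is just a read of the log, which is the point: the record of what happened already exists.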

Separate what varies, then recombine

Our testing approach is built around Kafka's speed. That's our first pillar: achieve maximum coverage in as short a time as possible.

This matters more than ever in the age of AI-assisted development, because what we really want to optimise for is developer feedback. A good testing tool is as much about discovery, debugging, and experimentation ("what happens if I try this setup?") as it is about regression and fault prevention. Nothing kills engineering flow like waiting three hours for a verdict from some enormous, impenetrable CI job.

Which leads to our second pillar: the testing machinery has to be valuable during development, not just a checklist at the end. It should:

  • Run locally
  • Let a developer choose exactly what to run
  • Be highly deterministic, with no mercurial, flaky behaviour to send you on time-sink side-quests
  • Be highly extensible, and as far as possible keep environment, client behaviour, and scenario separate, so that any difference (intended or surprising) can be isolated by changing as little as possible

Three independent axes

Concretely, we split the setup into three independent areas:

Axis | What it varies
Environments | Broker auth, topology, controller mode, …
Proxy behaviour | Gateway topology, auth, routing, feature config, …
Client behaviour | Batching, compression, client family/version, …
Each axis uses pairwise expansion to cover as much of the problem space as possible with an effective minimum of variants. That single technique is what turns "9 million permutations" from a horror story into a tractable test suite.
[Figure: Environments (auth, topology, kraft / zk, …) × Proxy behaviour (topology, routing, encryption, …) × Client behaviour (batching, compress, java / rd, …) → pairwise expansion (not full cross-product) → a small, representative set of compiled setups to test]
Three independent axes, recombined by pairwise expansion into a small, representative set of compiled setups, not the full nine-million cross-product.
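For readers who haven't met pairwise (all-pairs) expansion, here is a minimal greedy sketch in Python. It is one common heuristic, not necessarily the generator used here, and the axis values are hypothetical. It enumerates the full product of a single axis first, which is fine for a small axis but would not scale to the whole problem space at once.

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

Case = Dict[str, str]
Pair = Tuple[Tuple[str, str], Tuple[str, str]]  # ((param, value), (param, value))


def pairs_of(case: Case) -> Set[Pair]:
    """Every pair of (param, value) settings that one case exercises."""
    return set(combinations(sorted(case.items()), 2))


def pairwise(params: Dict[str, List[str]]) -> List[Case]:
    """Greedy all-pairs: keep picking the case that covers the most new pairs."""
    names = sorted(params)
    candidates = [dict(zip(names, values))
                  for values in product(*(params[n] for n in names))]
    uncovered: Set[Pair] = set()
    for case in candidates:
        uncovered |= pairs_of(case)

    chosen: List[Case] = []
    while uncovered:
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


# Hypothetical values for the environment axis.
environment_axis = {
    "auth": ["PLAINTEXT", "SSL", "SASL_PLAINTEXT", "SASL_SSL"],
    "controller": ["zookeeper", "kraft"],
    "brokers": ["1", "3", "5"],
    "listeners": ["single", "multi"],
}
setups = pairwise(environment_axis)
print(len(setups), "setups instead of", 4 * 2 * 3 * 2)
```

Every pair of settings still appears together in at least one setup. The bet behind the technique is that most defects are triggered by the interaction of one or two settings rather than by one specific combination of all of them.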

The result is a compiled setup, ready to be tested against. Each setup advertises the capabilities it supports, and those capabilities can constrain things. (Run with no authentication, for instance, and there's simply no client-specific auth behaviour to exercise.)

Scenarios that adapt

A test scenario, the actions performed, is then kept entirely independent of the environment. Scenarios are built from sets of templates, each with their own steps and their own pairwise optionality for expansion. Critically, every scenario is aware of the capabilities it requires to run.

That awareness is what makes scenarios portable. A single scenario can target many different environments and simply skip the steps that make no sense or aren't supported. Those skips are recorded, and asserted against the expected set of skips. No test can be silently skipped without us noticing.
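A minimal sketch of capability-aware skipping, with the skips themselves asserted. `Step`, `run_scenario`, the step names, and the capability names are all hypothetical; the shape of the check is what matters.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    requires: FrozenSet[str] = frozenset()


def run_scenario(steps: List[Step], capabilities: Set[str]) -> Tuple[List[str], Set[str]]:
    """Run the steps a setup supports; return (executed, skipped) step names."""
    executed: List[str] = []
    skipped: Set[str] = set()
    for step in steps:
        if step.requires <= capabilities:
            executed.append(step.name)  # a real runner would drive a Kafka client here
        else:
            skipped.add(step.name)
    return executed, skipped


scenario = [
    Step("produce"),
    Step("consume"),
    Step("deny-unauthorised-write", frozenset({"acl"})),
    Step("commit-transaction", frozenset({"transactions"})),
]

# Hypothetical setup with no authentication, so there is no ACL behaviour to exercise.
executed, skipped = run_scenario(scenario, capabilities={"transactions"})

# The skip set is an assertion target, not a side effect.
expected_skips = {"deny-unauthorised-write"}
assert skipped == expected_skips, f"unexpected skips: {skipped ^ expected_skips}"
```

If a capability regresses and a step starts skipping where it used to run, the skip set changes and the run fails.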

A worked example: the core suite

This is best seen in what we call the core suite, the job that most embodies the whole approach. It is one broad set of scenarios: smoke checks, produce/consume, consumer groups, ACL enforcement (allow and deny), topic lifecycle, transactions, virtual-cluster isolation, topic aliases, and more.

One job will point this suite at a two-gateway, three-broker authenticated cluster, and run it across three different client drivers (a current Java client, an older Java client, and librdkafka) and under different Gateway feature-flag configurations. Nobody hand-curated a per-client or per-flag list. Every scenario is offered to every driver, and the ones that don't apply to a given combination skip themselves, with each skip checked against the expected set.

The result of one such pass:

Driver | Cases | Steps | Pass | Fail
java-3.9 | 40 | 345 | 40 | 0
java-4.2 | 40 | 345 | 40 | 0
rdkafka | 32 | 245 | 32 | 0
Total | 112 | 935 | 112 | 0
A hundred-plus cases and the better part of a thousand discrete, observed-and-asserted steps, across three client drivers and two proxy configurations, green in under ten minutes. Tens of thousands of Kafka API calls are made. rdkafka runs fewer cases not because we wrote fewer, but because it self-skips the handful that don't apply to it. Same scenarios, written once.

A bug a mock would never find. A refactor of our topic-id fetch path stopped stripping the upstream fetch-session id. KIP-227 fetch sessions are broker-local state, so a proxy has to terminate and recreate them independently on each leg; ours let the upstream id leak through to the client. The first, full fetch still worked, so every short-lived test passed clean. But a consumer that kept polling then sent incremental fetches against a session the downstream broker had never actually established, and from the second round on it got back zero records. A sustained consumer would silently stall while a quick one looked fine.

A mock, or a hand-rolled frame, would never have surfaced that. It only showed up because the test drove a real client consuming over time, where round two behaves differently from round one. The matrix caught it in CI, before it ever reached a release, and the fix landed with two regression scenarios that fail without it, so the same bug can never come back unnoticed.

Inside the machine: compile, then run

The environment config and the scenario are compiled together into an explicit set of concrete actions and observations: the actual test plan. It's fully self-contained, data-driven, and deterministic. These plans are large (for us, a sprawling JSON document) and that's perfectly fine, because a human never has to read it. The test runner does.

[Figure: environment + scenario → compile (resolve all decisions) → test plan (self-contained, deterministic JSON) → run (real Kafka clients) → observe (capture every event) → assert (offline, capability-aware skips)]
Every decision is resolved up front, at compile time. The runner that executes the plan is deliberately dumb, which is exactly what makes a run reproducible.

Note where the complexity lives: it's all up front, at compile time. Every decision, which versions, which steps, which skips, is resolved before a single packet moves. The runner that executes the plan is deliberately dumb, which is exactly what makes a run reproducible.
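The compile-then-run split can be sketched in a few lines. The plan layout, field names, and the `compile_plan` and `run_plan` helpers are hypothetical; the real plans are far larger, but the division of labour is the same: decisions at compile time, none at run time.

```python
import json
from typing import Any, Callable, Dict, List


def compile_plan(environment: Dict[str, Any], scenario: List[Dict[str, Any]]) -> str:
    """Resolve every decision up front and emit a self-contained JSON plan."""
    caps = set(environment["capabilities"])
    actions = []
    for step in scenario:
        missing = sorted(set(step.get("requires", [])) - caps)
        actions.append({
            "step": step["name"],
            "action": "skip" if missing else "run",
            "missing": missing,
        })
    plan = {"environment": environment["name"], "actions": actions}
    return json.dumps(plan, sort_keys=True)  # stable key order: same input, same bytes


def run_plan(plan_json: str, execute: Callable[[str], None]) -> List[str]:
    """A deliberately dumb runner: no decisions, just walk the plan in order."""
    log = []
    for action in json.loads(plan_json)["actions"]:
        if action["action"] == "run":
            execute(action["step"])
        log.append(f'{action["action"]}:{action["step"]}')
    return log


env = {"name": "1gw-1broker-noauth", "capabilities": ["transactions"]}
scenario = [{"name": "produce"}, {"name": "acl-deny", "requires": ["acl"]}]

plan = compile_plan(env, scenario)
assert plan == compile_plan(env, scenario)  # compiling twice yields identical plans

log = run_plan(plan, execute=lambda step: None)
print(log)  # ['run:produce', 'skip:acl-deny']
```

Because the plan is plain data, it can be diffed between two compiles, archived alongside a failing run, and replayed later without recompiling.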

The payoff is reuse. One easy-to-review, easy-to-maintain suite of scenarios runs against many environments without modification:

  • Did we cover test case T for environment E? If T and E both exist, then yes, by construction.
  • Can we add coverage for a new environment? Set it up, and the existing scenarios run against it automatically.
  • Added a scenario for your new feature? Write it once. It runs everywhere.

All of it in CI, and all of it on your machine, with the same tools. We even bridge CI back down to a local machine, so that when something does fail you can reach in, reproduce it, and go digging, instead of squinting at a log file and hoping.

The network is a dimension too

There's one more axis worth calling out, because it's the one most easily forgotten in a test lab: the network itself.

Production networks do not behave like the loopback interface on a CI box. Packets are delayed, reordered, and dropped; connections stall and reset; a broker that was reachable a moment ago times out. (This is the same instinct behind chaos engineering for Kafka, pointed at the proxy instead of the cluster.) A proxy sits squarely in that path, and a lot of its hardest-won correctness is about what it does when the network misbehaves mid-conversation, exactly when a client is mid-failover between brokers, say.

So we make the network a thing we can configure too. By slotting a network-disruption sidecar between the moving parts, the same scenarios can be replayed with latency, jitter, and severed connections injected on demand, and we run them both with the chaos off and with it on, then compare. It's yet another dimension folded into the same framework, rather than a separate, bolted-on stress test.

[Figure: chaos off: client, scenario, gateway, broker, ✓ pass; chaos on: client, netem sidecar, gateway, broker, ✓ pass (latency · jitter · reorder · drop); compare the two runs]
The same scenarios, replayed with the wire misbehaving. We run them with chaos off and on, then compare the results.
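The compare step can be illustrated with a toy simulation. This is not how the sidecar works (that operates on real network traffic); `Link`, the drop rate, and the retry limit are assumptions made so the idea fits in a runnable example.

```python
import random
from typing import List


class Link:
    """Stand-in for the wire; with chaos on, it drops a share of sends."""

    def __init__(self, drop_rate: float = 0.0, seed: int = 0) -> None:
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)  # seeded so a chaotic run is repeatable
        self.delivered: List[str] = []

    def send(self, record: str) -> bool:
        if self.rng.random() < self.drop_rate:
            return False  # simulated severed connection
        self.delivered.append(record)
        return True


def scenario(link: Link, records: List[str], max_attempts: int = 50) -> List[str]:
    """Send each record, retrying on failure, then report what arrived."""
    for record in records:
        for _ in range(max_attempts):
            if link.send(record):
                break
        else:
            raise RuntimeError(f"gave up on {record!r}")
    return link.delivered


records = [f"r{i}" for i in range(100)]
quiet = scenario(Link(drop_rate=0.0), records)
chaotic = scenario(Link(drop_rate=0.3, seed=42), records)
assert quiet == chaotic == records  # same outcome with chaos off and on
```

The scenario and its assertions are identical in both runs; only the wire changes. Any difference between the two results is then attributable to behaviour under disruption.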

What the runs actually cover

We won't pretend to a single triumphant "percentage covered," because the honest answer is that the space is effectively unbounded and any such figure would be slightly misleading. But the runs are not shy about their scale. One recent end-to-end run drove over half a million individual Kafka API calls, across ~50 distinct APIs, through multiple real clients, against around thirty different Gateway deployments, and finished green, with every expected skip accounted for, in well under 30 minutes of wall-clock time. This is not a Kafka performance run; it's a functional one, asserting every last detail of what happened (yes, over half a million assertions).

Beyond the raw number, here's the shape of the coverage:

  • Every Kafka API the Gateway proxies, exercised through real, unmodified Kafka clients, not mocks or hand-rolled frames. This includes example Kafka Streams applications.
  • Multiple client families across several years of releases, so a behaviour is proven on old and new clients alike, not just the one a developer happened to have handy.
  • A broad matrix of broker and proxy configurations, authentication models, single- and multi-broker topologies, single- and multi-gateway deployments, virtual clusters, routing and listener arrangements, and the major Gateway features, recombined rather than enumerated by hand.
  • The same suites run under network disruption, not just on a quiet wire.
  • Every run is deterministic, fully captured, and independently asserted, with skips accounted for rather than ignored.

The point of all of this isn't only to chase a number and prevent regression. It's also that adding the next environment, the next client version, or the next feature mostly means describing it once, and the coverage you already have comes along for the ride.

Just like Kafka itself: brutally simple at the core, a universe of behaviour at the edges. The trick to testing a Kafka proxy isn't more test cases. It's separating the things that vary, and letting the machine recombine them ad infinitum.


For you, this is the difference between hoping an upgrade is safe and knowing it. Every Gateway release ships having been exercised against old and new client families, multiple broker versions, virtual clusters, and the failure modes your network will eventually throw at it. The proxy in front of your clusters should behave the same on a bad day as on a good one, and proving that is the whole point.

Conduktor Gateway is the proxy this rigour is built around. If you want a Kafka proxy whose correctness is proven across millions of recombined permutations, real clients, and a network that misbehaves on purpose, explore Conduktor Gateway or book a demo.