What's Really Inside Kafka's poll()?

Matt Searle June 15, 2026 10 min read

Isometric wireframe: one connection forks at a glowing lime junction into several channels reaching different Kafka broker structures, data packets in flight

Ask an engineer to draw a Kafka consumer and you'll usually get two boxes and an arrow: a client on the left, Kafka on the right, poll() in the middle pulling records. One connection, one socket, done.

How we usually draw it

The reality is that a consumer calling poll() is running a small distributed protocol against your cluster:

it discovers brokers
finds coordinators
joins a group
fetches from several leaders at once
heartbeats on a timer
commits offsets
re-authenticates every new connection it opens along the way
...

By the time it reaches steady state, it holds many open TCP connections to different brokers, and only Kafka decides which is which.

We focus on mainstream clients (Java and librdkafka families). The newer protocols change a bit the picture (KIP-848 for consumer groups, KIP-890 for transactions) but the overall shape is more-or-less the same.

One consumer, many brokers

Each of those connections goes to a different broker doing a different job: some are partition leaders for the topics you read, one is your group's coordinator, one might be a transaction coordinator.

One consumer, many role-specific connections

This complexity raises a lot of questions:

How does the client get here?
What keeps these connections alive?
Why is this fan-out worth understanding before you put a proxy, a mesh, or a load balancer anywhere near Kafka?

Let's dig in.

Bootstrap is discovery, not traffic

When you set up a Kafka client, you only configure one address: bootstrap.servers. It's usually a single hostname, or a few for redundancy. What matters is what that address is for, and it's not what you'd expect.

Bootstrap is discovery, not traffic. bootstrap.servers is just the address (or addresses) the client uses to make its first connection. Once that connection is up (it negotiates ApiVersions first, and authenticates if SASL is on), the client calls the Metadata API. The response carries the full broker list (each broker with its own broker.id, host, and port), plus every partition leader.

Now the client has the full map and can connect directly to whichever broker owns whatever it needs.

Said differently: one hostname in the config can become a dozen-plus broker connections (locally you can use netstat to see that).

Bootstrap is the seed: one Metadata exchange, then the client connects straight to the brokers it needs.

A few warnings about bootstrap servers:

Multiple bootstrap.servers entries are failover for the initial contact: the client shuffles the list and picks one to reach (not strictly in config order) until one answers. They are not load-balancing for your traffic.
A load balancer in front of bootstrap.servers is fine. Any broker can answer Metadata, so it doesn't matter which one the LB picks.
After that first discovery, periodic Metadata refreshes use the brokers the client already knows about. The bootstrap addresses matter again only if the client has to re-bootstrap.

🚫 "We'll just put all the brokers behind one load balancer and point clients at the VIP."

Don't do that, this does not work. After bootstrap, a client reaches each broker by the advertised host and port that broker returned in Metadata, because it needs that specific broker: the one that leads the partition it wants to read. A shared VIP that round-robins to "some broker" sends a Fetch for topicX-0 to a broker that doesn't lead topicX-0, and you get a storm of NOT_LEADER_OR_FOLLOWER errors in your applications.

Behind a shared VIP, the Fetch for topicX-0 lands on whatever broker answers, not the one that leads it. Broker B rejects it with NOT_LEADER_OR_FOLLOWER while Broker A, the real leader, is never contacted.

Load-balance the bootstrap; never load-balance partition-leader traffic.

Coordinators are partition leaders too

A consumer doesn't only pull data from brokers. It also relies on them for some of Kafka's core mechanisms, like coordinating with the other consumers in its group, and transactions:

Broker Role	"Leader" is leader of…	Found via	Used for
Partition leader	the topic-partition itself	`Metadata`	`Produce`, `Fetch`, `ListOffsets`
Group coordinator	the `__consumer_offsets` partition for your `group.id`	`FindCoordinator(GROUP)`	`JoinGroup`, `OffsetCommit`, `Heartbeat`
Txn coordinator	the `__transaction_state` partition for your `transactional.id`	`FindCoordinator(TRANSACTION)`	`InitProducerId`, `EndTxn`

Each of these "leaders" has a role to play, and each can move when the underlying topic-partition gets a new leader through normal election.

How does a client find out when this changes? It refreshes its metadata periodically, or it simply hits an error and re-discovers on the fly.

The consumer startup sequence

If you combine discovery and group membership together, you get this sequence:

From bootstrap to consuming data, including the offset lookup before the first fetch

Every time the client contacts a new broker, it starts over: ApiVersions, the connection handshake, and authentication. These connections are sticky for as long as the role lasts.

Note that KIP-848 collapses the classic two round-trips (JoinGroup/SyncGroup) into a single ConsumerGroupHeartbeat.

Connections are sticky

Once it's running, a consumer keeps several connections open, each pinned to a broker role, and refreshes its metadata regularly (every 5 minutes by default). A connection stays open as long as its role is in use; an idle one is closed after connections.max.idle.ms (9 minutes by default). You typically have:

The group coordinator connection to handle heartbeats, offset commits, and group changes.
One connection per assigned partition leader to handle fetches (the actual consuming).
A transaction coordinator connection if the producer is transactional.

If you have a consumer reading a 12-partition topic whose leaders sit on 12 different brokers, it can have up to 13 TCP connections opened.

It stays one connection per broker no matter how you tune the client. max.in.flight.requests.per.connection (5 by default) is how many unacknowledged requests the client will pipeline on that single connection before it waits, not a number of connections. A producer writing to those 12 leaders still opens 12 connections, each carrying up to 5 in-flight requests, not 60.

Three jobs, three cadences

Inside the client, three jobs run on their own cadences. In the classic consumer only the heartbeat has its own background thread; fetches and offset commits ride on your poll() calls:

Loop	RPC	Cadence	Target
Heartbeat	`Heartbeat` / `ConsumerGroupHeartbeat`	every 3s (`heartbeat.interval.ms`); server-set ~5s (KIP-848)	Group coordinator
Fetch	`Fetch`	continuous, per partition	Partition leaders
OffsetCommit	`OffsetCommit`	auto-commit interval, or explicit	Group coordinator

Heartbeat~3s (heartbeat.interval.ms)

Fetchcontinuous, per partition

OffsetCommitauto-commit interval

Three jobs, three cadences, two different brokers.

Each one can fail:

Miss heartbeats and the coordinator evicts you from the group, triggering an instant rebalance.
If a commit fails, you will reprocess from the last good offset on restart.
If you fetch too slowly, you fall behind: lag climbs, even while everything looks healthy.

Transactions add another coordinator

When you turn on transactions, Kafka adds a fourth role: the transaction coordinator. Its job is to track transactions as they're started, committed, or rolled back across the cluster:

A transactional commit (transaction protocol v1) spans three coordinators.

That's a lot of coordination just to guarantee the I in ACID.

The diagram shows transaction protocol v1. The producer registers each partition with the transaction coordinator (AddPartitionsToTxn) before it writes there, so the coordinator knows where to place markers at commit time. Kafka 4.x's v2 (KIP-890) drops both client-side calls: the broker verifies and registers partitions itself on Produce, there's no separate AddOffsetsToTxn, and TxnOffsetCommit goes straight to the group coordinator. Fewer round-trips, same guarantees.

WriteTxnMarkers is intra-cluster. On EndTxn, the transaction coordinator writes the commit to its own log and can acknowledge the producer right away, then asynchronously sends the markers to every involved partition leader, including the __consumer_offsets leader, which tells the group coordinator to surface the committed offsets.

Error codes and recovery

Because every "leader" can move, clients spend a lot of their lives reacting to errors that mean go find the new owner. Each error code is really a state-machine transition. Here's what each one means:

Error	What it means	Recovery
`NOT_LEADER_OR_FOLLOWER`	The partition leader moved to another broker.	Refresh `Metadata`, reconnect to the new leader.
`LEADER_NOT_AVAILABLE`	No leader yet; an election is in progress.	Back off and retry.
`NOT_COORDINATOR`	The group or transaction coordinator moved.	Re-run `FindCoordinator`.
`COORDINATOR_LOAD_IN_PROGRESS`	The coordinator is still loading its state.	Back off and retry. Do not re-discover.
`ILLEGAL_GENERATION`	A rebalance happened (classic protocol).	Re-join the group.
`STALE_MEMBER_EPOCH`	The request's member epoch doesn't match the coordinator's (KIP-848).	Retry once the next `ConsumerGroupHeartbeat` returns the updated epoch.
`UNKNOWN_MEMBER_ID`	The coordinator forgot this member.	Re-join the group.
`FENCED_INSTANCE_ID`	Two instances share one `group.instance.id`.	Application error; cannot auto-recover.
`OFFSET_OUT_OF_RANGE`	The fetch position is before or past the log.	Apply `auto.offset.reset`, or an app handler.
`TOPIC_AUTHORIZATION_FAILED`	An ACL changed.	Surface to the application; no auto-recovery.

Inside one poll() call

When your code calls poll(100ms) and gets a small batch of records back, here's what the client may actually have done in that window:

Update coordinator state: run any pending rebalance and refresh the assignment
Send a queued OffsetCommit if auto-commit is due
If records are already buffered from a previous fetch, return them right away
Otherwise send Fetch to each assigned partition leader and poll the network, up to the timeout
Refresh Metadata if it's stale or a recent error asked for it
Decode the batches into records and hand them to the caller

The heartbeat isn't in this list: in the classic consumer it runs on its own background thread, not inside poll().

The fan-out is a control plane

Kafka is not just about Produce and Consume.

It's about discovery, coordination, group membership, transactions, and error-driven re-routing. That's where the real complexity lives. It stays invisible to users, but Kafka, and anything standing in for it, has to handle all of it.

Whatever sits between a client and Kafka (a proxy, a service mesh, a load balancer) has to understand which connection is which, because each one terminates at a specific broker for a specific reason. Send a Fetch to the wrong broker and you get NOT_LEADER_OR_FOLLOWER failures, or rebalances that never settle.

This is the world a Kafka proxy lives in. Conduktor Gateway sits in exactly this position: it speaks the full client protocol, tracks which connection is which, and routes each accordingly.