# What's Really Inside Kafka's poll()?

Ask an engineer to draw a Kafka consumer and you'll usually get two boxes and an arrow: a client on the left, Kafka on the right, `poll()` in the middle pulling records. One connection, one socket, done.

How we usually draw it

The reality is that a consumer calling `poll()` is running a small distributed protocol against your cluster:

- it discovers brokers
- finds coordinators
- joins a group
- fetches from several leaders at once
- heartbeats on a timer
- commits offsets
- re-authenticates every new connection it opens along the way
- ...

By the time it reaches steady state, it holds many open TCP connections to *different* brokers, and only Kafka decides which is which.

> We focus on mainstream clients (Java and `librdkafka` families). The newer protocols change a bit the picture (KIP-848 for consumer groups, KIP-890 for transactions) but the overall shape is more-or-less the same.

## One consumer, many brokers

Each of those connections goes to a different broker doing a different job: some are partition leaders for the topics you read, one is your group's coordinator, one might be a transaction coordinator.

One consumer, many role-specific connections

This complexity raises a lot of questions:

- How does the client get here?
- What keeps these connections alive?
- Why is this fan-out worth understanding before you put a proxy, a mesh, or a load balancer anywhere near Kafka?

Let's dig in.

## Bootstrap is discovery, not traffic

When you set up a Kafka client, *you only configure one address*: `bootstrap.servers`. It's usually a single hostname, or a few for redundancy. What matters is what that address is *for*, and it's not what you'd expect.

Bootstrap is discovery, not traffic. `bootstrap.servers` is just the address (or addresses) the client uses to make its *first* connection. Once that connection is up (it negotiates `ApiVersions` first, and authenticates if SASL is on), the client calls the `Metadata` API. The response carries the full broker list (each broker with its own `broker.id`, host, and port), plus every partition leader.

Now the client has the full map and can connect directly to whichever broker owns whatever it needs.

Said differently: one hostname in the config can become a dozen-plus broker connections (locally you can use `netstat` to see that).

Bootstrap is the seed: one Metadata exchange, then the client connects straight to the brokers it needs.

A few warnings about bootstrap servers:

- Multiple `bootstrap.servers` entries are **failover for the initial contact**: the client shuffles the list and picks one to reach (not strictly in config order) until one answers. They are not load-balancing for your traffic.
- A load balancer in front of `bootstrap.servers` is fine. Any broker can answer `Metadata`, so it doesn't matter which one the LB picks.
- After that first discovery, periodic `Metadata` refreshes use the brokers the client already knows about. The bootstrap addresses matter again only if the client has to re-bootstrap.

> 🚫 *"We'll just put all the brokers behind one load balancer and point clients at the VIP."*

Don't do that, this does not work. After bootstrap, a client reaches each broker by the **advertised host and port** that broker returned in `Metadata`, because it needs *that specific broker*: the one that leads the partition it wants to read. A shared VIP that round-robins to "some broker" sends a `Fetch` for `topicX-0` to a broker that doesn't lead `topicX-0`, and you get a storm of `NOT_LEADER_OR_FOLLOWER` errors in your applications.

Behind a shared VIP, the Fetch for topicX-0 lands on whatever broker answers, not the one that leads it. Broker B rejects it with NOT_LEADER_OR_FOLLOWER while Broker A, the real leader, is never contacted.

**Load-balance the bootstrap; never load-balance partition-leader traffic.**

## Coordinators are partition leaders too

A consumer doesn't only pull data from brokers. It also relies on them for some of Kafka's core mechanisms, like coordinating with the other consumers in its group, and transactions:

| Broker Role | "Leader" is leader of… | Found via | Used for |
|---|---|---|---|
| **Partition leader** | the topic-partition itself | `Metadata` | `Produce`, `Fetch`, `ListOffsets` |
| **Group coordinator** | the `__consumer_offsets` partition for your `group.id` | `FindCoordinator(GROUP)` | `JoinGroup`, `OffsetCommit`, `Heartbeat` |
| **Txn coordinator** | the `__transaction_state` partition for your `transactional.id` | `FindCoordinator(TRANSACTION)` | `InitProducerId`, `EndTxn` |

Each of these "leaders" has a role to play, and each can move when the underlying topic-partition gets a new leader through normal election.

How does a client find out when this changes? It refreshes its metadata periodically, or it simply hits an error and re-discovers on the fly.

## The consumer startup sequence

If you combine discovery and group membership together, you get this sequence:

From bootstrap to consuming data, including the offset lookup before the first fetch

Every time the client contacts a new broker, it starts over: `ApiVersions`, the connection handshake, and authentication. These connections are sticky for as long as the role lasts.

Note that KIP-848 collapses the classic two round-trips (`JoinGroup`/`SyncGroup`) into a single `ConsumerGroupHeartbeat`.

## Connections are sticky

Once it's running, a consumer keeps several connections open, each pinned to a broker role, and refreshes its metadata regularly (every 5 minutes by default). A connection stays open as long as its role is in use; an idle one is closed after `connections.max.idle.ms` (9 minutes by default). You typically have:

- **The group coordinator connection** to handle heartbeats, offset commits, and group changes.
- **One connection per assigned partition leader** to handle fetches (the actual consuming).
- **A transaction coordinator connection** if the producer is transactional.

If you have a consumer reading a 12-partition topic whose leaders sit on 12 different brokers, it can have **up to 13 TCP connections** opened.

> It stays one connection per broker no matter how you tune the client. `max.in.flight.requests.per.connection` (5 by default) is how many unacknowledged requests the client will pipeline on that *single* connection before it waits, not a number of connections. A producer writing to those 12 leaders still opens 12 connections, each carrying up to 5 in-flight requests, not 60.

## Three jobs, three cadences

Inside the client, three jobs run on their own cadences. In the classic consumer only the heartbeat has its own background thread; fetches and offset commits ride on your `poll()` calls:

| Loop | RPC | Cadence | Target |
|---|---|---|---|
| **Heartbeat** | `Heartbeat` / `ConsumerGroupHeartbeat` | every 3s (`heartbeat.interval.ms`); server-set ~5s (KIP-848) | Group coordinator |
| **Fetch** | `Fetch` | continuous, per partition | Partition leaders |
| **OffsetCommit** | `OffsetCommit` | auto-commit interval, or explicit | Group coordinator |

**Heartbeat**~3s (heartbeat.interval.ms)
**Fetch**continuous, per partition
**OffsetCommit**auto-commit interval

Three jobs, three cadences, two different brokers.

Each one can fail:

- Miss heartbeats and the coordinator evicts you from the group, triggering an instant rebalance.
- If a commit fails, you will reprocess from the last good offset on restart.
- If you fetch too slowly, you fall behind: lag climbs, even while everything looks healthy.

## Transactions add another coordinator

When you turn on transactions, Kafka adds a fourth role: the transaction coordinator. Its job is to track transactions as they're started, committed, or rolled back across the cluster:

A transactional commit (transaction protocol v1) spans three coordinators.

That's a lot of coordination just to guarantee the I in ACID.

The diagram shows transaction protocol v1. The producer registers each partition with the transaction coordinator (`AddPartitionsToTxn`) before it writes there, so the coordinator knows where to place markers at commit time. Kafka 4.x's v2 (KIP-890) drops both client-side calls: the broker verifies and registers partitions itself on `Produce`, there's no separate `AddOffsetsToTxn`, and `TxnOffsetCommit` goes straight to the group coordinator. Fewer round-trips, same guarantees.

> `WriteTxnMarkers` is intra-cluster. On `EndTxn`, the transaction coordinator writes the commit to its own log and can acknowledge the producer right away, then asynchronously sends the markers to every involved partition leader, including the `__consumer_offsets` leader, which tells the group coordinator to surface the committed offsets.

## Error codes and recovery

Because every "leader" can move, clients spend a lot of their lives reacting to errors that mean *go find the new owner*. Each error code is really a state-machine transition. Here's what each one means:

| Error | What it means | Recovery |
|---|---|---|
| `NOT_LEADER_OR_FOLLOWER` | The partition leader moved to another broker. | Refresh `Metadata`, reconnect to the new leader. |
| `LEADER_NOT_AVAILABLE` | No leader yet; an election is in progress. | Back off and retry. |
| `NOT_COORDINATOR` | The group or transaction coordinator moved. | Re-run `FindCoordinator`. |
| `COORDINATOR_LOAD_IN_PROGRESS` | The coordinator is still loading its state. | Back off and retry. Do not re-discover. |
| `ILLEGAL_GENERATION` | A rebalance happened (classic protocol). | Re-join the group. |
| `STALE_MEMBER_EPOCH` | The request's member epoch doesn't match the coordinator's (KIP-848). | Retry once the next `ConsumerGroupHeartbeat` returns the updated epoch. |
| `UNKNOWN_MEMBER_ID` | The coordinator forgot this member. | Re-join the group. |
| `FENCED_INSTANCE_ID` | Two instances share one `group.instance.id`. | Application error; cannot auto-recover. |
| `OFFSET_OUT_OF_RANGE` | The fetch position is before or past the log. | Apply `auto.offset.reset`, or an app handler. |
| `TOPIC_AUTHORIZATION_FAILED` | An ACL changed. | Surface to the application; no auto-recovery. |

## Inside one poll() call

When your code calls `poll(100ms)` and gets a small batch of records back, here's what the client may actually have done in that window:

Update coordinator state: run any pending rebalance and refresh the assignment
Send a queued OffsetCommit if auto-commit is due
If records are already buffered from a previous fetch, return them right away
Otherwise send Fetch to each assigned partition leader and poll the network, up to the timeout
Refresh Metadata if it's stale or a recent error asked for it
Decode the batches into records and hand them to the caller

The heartbeat isn't in this list: in the classic consumer it runs on its own background thread, not inside `poll()`.

## The fan-out is a control plane

Kafka is not just about Produce and Consume.

It's about discovery, coordination, group membership, transactions, and error-driven re-routing. That's where the real complexity lives. It stays invisible to users, but Kafka, and anything standing in for it, has to handle all of it.

Whatever sits between a client and Kafka (a proxy, a service mesh, a load balancer) has to understand *which connection is which*, because each one terminates at a specific broker for a specific reason. Send a `Fetch` to the wrong broker and you get `NOT_LEADER_OR_FOLLOWER` failures, or rebalances that never settle.

This is the world a Kafka proxy lives in. [Conduktor Gateway](https://www.conduktor.io/gateway) sits in exactly this position: it speaks the full client protocol, tracks which connection is which, and routes each accordingly.
