Kafka Streams async processing

Why external calls cap Kafka Streams throughput per thread, why adding threads won't help past partition count, and the async pattern that lifts it.

By Stéphane Derosiaux · July 23, 2026

Understand the per-thread throughput wall external calls create, and the real ways around it.

A Kafka Streams app that just filters and routes records is limited by your network and your CPU. The moment you put a synchronous external call inside the topology (a database read, a REST call, a model inference), the math changes completely, and not in your favor. Throughput stops being about how big your box is and becomes a function of one number: how long that call takes. This is the throughput wall that sends people searching for "Kafka Streams async processing" at 2am, and the honest answer is more nuanced than "add threads".

What you'll learn:

Why a stream thread processes records serially, and why that makes throughput = 1 / record-latency
Why adding num.stream.threads does nothing once you've hit the task ceiling
The three real fixes (partitions, co-located lookups, and true async processing) and when each one applies
What the async pattern has to solve to stay correct (per-key order, exactly-once, offset commits)

How a stream thread actually spends its time

The thing most people get wrong about Kafka Streams parallelism is invisible until you put a slow call in the middle of it. A stream thread processes records for its assigned tasks serially, one record at a time, depth-first through the topology. It picks a record, runs it all the way through every processor (map, filter, process, the store writes, the forwards to children), and only then picks up the next record. There is no concurrency within a task. One record, start to finish, then the next.

That design is deliberate and it's what gives you per-key ordering for free. It's also why a blocking call is so destructive. If a processor calls out to DynamoDB and that call takes 10ms, the thread does nothing for those 10ms but wait. It can't get ahead on the next record, because that would break ordering. So the ceiling for that task is brutally simple:

throughput per thread  ≈  1 / per-record latency

A 10ms call → ~100 records/second per thread. That's the whole story for one thread. (Illustrative arithmetic, not a benchmark: your real number depends on the call, but the 1/latency shape holds.)

It compounds. Add a second external call, say a 90ms model inference after the 10ms lookup, and the per-record latency is now ~100ms, so the same thread does ~10 records/second. The slow call dominates, and the latencies add up rather than overlap, because everything in a task runs in series.

record → [10ms DynamoDB read] → [90ms model call] → emit
         └──────────── thread idle ~100ms, doing nothing ─────────────┘
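The arithmetic above can be written down in a few lines. This is a minimal sketch using only the JDK; the class and method names are hypothetical, and it models nothing beyond "blocking calls run back to back, so their latencies sum".

```java
import java.time.Duration;
import java.util.List;

/** Illustrative arithmetic only: serial blocking calls add up, and the rate is 1 / total latency. */
final class ThreadThroughput {

    /** Records per second one stream thread sustains when each record waits on every call in series. */
    static double recordsPerSecond(List<Duration> blockingCallsPerRecord) {
        long totalNanos = 0;
        for (Duration call : blockingCallsPerRecord) {
            totalNanos += call.toNanos();   // calls run back to back, so latencies sum
        }
        return 1_000_000_000.0 / totalNanos;
    }

    public static void main(String[] args) {
        // One 10ms lookup per record.
        System.out.println(recordsPerSecond(List.of(Duration.ofMillis(10))));   // prints 100.0
        // The same lookup followed by a 90ms model call.
        System.out.println(recordsPerSecond(
                List.of(Duration.ofMillis(10), Duration.ofMillis(90))));        // prints 10.0
    }
}
```

Real processing adds CPU time and serialization on top of the calls, so treat the result as an upper bound for a thread dominated by blocking I/O.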

Why you can't just add threads

The instinct is to throw num.stream.threads at it. Here's why that hits a wall faster than you'd expect.

Kafka Streams parallelism is capped by partition count, not by threads. Each partition maps to exactly one task, and a task runs on exactly one thread at a time. So with an 8-partition input topic you get at most 8 tasks, and at most 8 threads doing useful work, whether they live on one big box or eight small ones. Past 8 threads, the rest are idle: there's no task to give them. (Full mental model: scaling and parallelism.)

Put the two facts together and you get a hard ceiling. With the ~100ms per-record latency from the two-call example, 8 partitions × ~10 records/second/thread ≈ ~80 records/second, app-wide, no matter how much hardware you provision. You cannot buy your way out with a bigger instance, and you cannot buy your way out with more threads, because every thread past the eighth has nothing to do.

🚫 "It's slow because of the external call, bump num.stream.threads and it'll keep up."
Threads only help up to the task count (one task per partition of each sub-topology, summed across sub-topologies; the bottleneck stage's useful parallelism is its own sub-topology's partition count). Once num.stream.threads × instances ≥ that, every extra thread is idle and the slow call still bottlenecks each active thread at 1/latency. More threads add CPU cost and zero throughput. The lever is fewer/faster calls or more tasks, not more workers.
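To see why the thread count stops mattering, here is a small sketch of the ceiling as a function of tasks, threads, and latency. It is a simplified model under one assumption: every active thread is dominated by the blocking call. The class and parameter names are made up for illustration.

```java
/** Simplified model: useful threads are capped by task count, and each one runs at 1 / latency. */
final class AppCeiling {

    /** Threads that can actually hold a task: never more than the number of tasks. */
    static int activeThreads(int tasks, int threadsPerInstance, int instances) {
        return Math.min(tasks, threadsPerInstance * instances);
    }

    /** App-wide records per second when each record blocks for perRecordLatencyMs. */
    static double recordsPerSecond(int tasks, int threadsPerInstance, int instances,
                                   double perRecordLatencyMs) {
        return activeThreads(tasks, threadsPerInstance, instances) * (1000.0 / perRecordLatencyMs);
    }

    public static void main(String[] args) {
        // 8 tasks, 100ms per record.
        System.out.println(recordsPerSecond(8, 2, 2, 100.0));   // 4 threads  -> prints 40.0
        System.out.println(recordsPerSecond(8, 4, 2, 100.0));   // 8 threads  -> prints 80.0
        System.out.println(recordsPerSecond(8, 16, 4, 100.0));  // 64 threads -> prints 80.0
    }
}
```

Going from 8 to 64 threads changes nothing in this model; only the task count or the latency moves the result.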

This is the blocking-call anti-pattern seen from the throughput side. The anti-patterns page warns it stalls a partition and can blow transaction.timeout.ms; this page is about the ceiling it imposes and how to lift it.

Fix 1: more tasks, raise partitions or `repartition(N)`

The standard, vendor-neutral lever is to give the slow stage more tasks so more threads run in parallel. Two ways:

More input partitions. 24 partitions instead of 8 raises the ceiling to 24 × your per-thread rate. This is the cleanest answer, if you can do it. On a stateful app you cannot just bump partition count: changing the partition count changes hash(key) % N, so a key now hashes to a different partition than the one holding its state. The failure is loud, not subtle: on restart Streams sees the changelog topic still has the old partition count and dies with a StreamsException ("Existing internal topic ... has invalid partitions") that points you at the StreamsResetter tool, which means a reset and full reprocess. The silent-corruption risk is real only if you force past that check, or for stateful consumers outside Streams. Plan partitions for peak up front, or treat an increase as a full reprocess.
repartition(N) before the slow stage. If the work is stateless (an enrichment call that doesn't touch a store), you can re-partition into a wider internal topic and run the downstream sub-topology with more tasks than the input has partitions:

stream
    .repartition(Repartitioned.<String, Order>as("enrich-fanout")
        .withNumberOfPartitions(48))           // 48 tasks downstream, regardless of input width
    .mapValues(order -> enrichWithBlockingCall(order));

This genuinely beats the input-partition ceiling for the downstream stage. It isn't free: you've added an internal topic, extra storage, extra produce/fetch, an extra network hop, and you've widened a sub-topology boundary that you now have to size and watch. But for "I need 6× the parallelism on one enrichment step", it's the legitimate, in-the-box answer, and it's where most teams should start. One footgun: Repartitioned.as(...) falls back to the default serdes, so without default.key.serde/default.value.serde configured this snippet fails at startup ("Failed to initialize key serdes for source node"); pass them explicitly with Repartitioned.with(keySerde, valueSerde).

Fix 2: don't make the call at all

The fastest external call is the one you delete. A huge fraction of "RPC in the topology" cases are enrichment: for each event, look up a row and attach it. If that reference data fits in memory and is itself a Kafka topic (or can be one), you don't need a network call, you need a join.

Pattern	Per-record cost	Fits when
Synchronous RPC per record	A network round-trip (the `1/latency` wall)	Data is huge, or lives only in a remote system you can't mirror
`KTable` join (co-partitioned)	A local store lookup, no network	Reference data is keyed the same as the stream and co-partitioned
`GlobalKTable` join	A local lookup by any field, fully replicated to every instance	Reference data is small enough to replicate to every instance

A GlobalKTable is replicated in full to every instance, so any key is a local lookup with no co-partitioning requirement, ideal for smallish, slowly-changing dimension data (currency rates, feature flags, a product catalog). A co-partitioned KTable join scales to larger reference data. Either way the lookup is in-process and microsecond-cheap, and your throughput goes back to being bounded by CPU instead of by someone else's API. Replace the call before you parallelize it. See joins for the co-partitioning contract that makes this work, and the silent ways it breaks if you get it wrong.

This doesn't cover everything. A genuinely large dataset, a system of record you can't mirror into Kafka, or a model you have to call live: none of those can be turned into a local lookup. That's where fix 3 comes in.

Fix 3: the async processing pattern

Figure: serial processing sends one record at a time; the async pattern puts calls for keys A, B, and C in flight concurrently while preserving same-key order and draining before each commit.

When the call is irreducible (you must hit that REST API or that model per record) and raising partitions isn't enough, there's a third option: process records of different keys within a single partition concurrently, while preserving same-key order. This is the "async processing" pattern.

The intuition: within one partition, records for key A and key B are independent. There's no ordering requirement between them, only within each key. So you can have the call for A in flight at the same time as the call for B, and a third for C, using a small worker pool, instead of forcing all of them through one thread in single file. If a partition carries thousands of distinct keys (it usually does), you can have many calls outstanding at once, so you are limited by how much concurrency the external system can absorb rather than by its latency. The ceiling moves from 1/latency toward concurrency/latency.
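That intuition can be sketched with plain JDK concurrency. The code below is not how any particular library implements it; it is a minimal illustration in which each key's work is chained behind that key's previous work, while different keys run in parallel on a pool. All class and method names are hypothetical, and it deliberately ignores transactions, offsets, and state stores.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Runs work for different keys concurrently, but work for the same key strictly in submit order. */
final class KeyOrderedExecutor {
    private final ExecutorService pool;
    // Last submitted future per key. Only touched by the single submitting thread.
    private final Map<String, CompletableFuture<Void>> tailByKey = new HashMap<>();

    KeyOrderedExecutor(ExecutorService pool) {
        this.pool = pool;
    }

    /** Call from one thread only (the stand-in for the stream thread), in offset order. */
    CompletableFuture<Void> submit(String key, Runnable blockingCall) {
        CompletableFuture<Void> previous =
                tailByKey.getOrDefault(key, CompletableFuture.<Void>completedFuture(null));
        // Chained after the key's previous call; if that call failed, this one is skipped.
        CompletableFuture<Void> next = previous.thenRunAsync(blockingCall, pool);
        tailByKey.put(key, next);
        return next;
    }

    /** Blocks until everything submitted so far has finished, as a commit would require. */
    void drain() {
        CompletableFuture.allOf(tailByKey.values().toArray(new CompletableFuture[0])).join();
        tailByKey.clear();
    }
}

final class KeyOrderedDemo {
    /** Returns labels in completion order; A1 always precedes A2, B1 is free to finish anywhere. */
    static List<String> run() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        KeyOrderedExecutor executor = new KeyOrderedExecutor(pool);
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        executor.submit("A", () -> { pause(60); finished.add("A1"); });
        executor.submit("B", () -> { pause(10); finished.add("B1"); });
        executor.submit("A", () -> { pause(10); finished.add("A2"); });
        executor.drain();
        pool.shutdown();
        return finished;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);   // stands in for a blocking external call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The slow first call for A and the call for B overlap, so the three records take roughly 70ms in this sketch instead of the 80ms a single thread would need, and the gap widens with more keys and a larger pool.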

That sounds simple. It is not, and the hard parts are exactly the guarantees Kafka Streams gives you that you must not lose:

Per-key ordering. Records for the same key must still be processed in offset order, even though different keys are now interleaved out of order across the worker pool. The pattern needs per-key queues so A's second record never overtakes A's first.
Exactly-once. Under EOS, the writes a record produces are part of a transaction. With calls in flight, some records have started but not finished when a commit comes around, so the commit has to drain every in-flight record first; otherwise a transaction could commit the offset of a half-processed record.
Offset-commit correctness. Streams commits the offset of processed records. If the record at offset 100 finishes before offset 95 (different keys, different call latencies), you cannot commit 100: a crash would skip whichever of 95 to 99 had not finished. The committed offset has to track the lowest unfinished offset, not the highest finished one.
Store writes and forwards. A processor that writes to a state store or forwards downstream can't do that from an arbitrary worker thread, because the store and the downstream processors are owned by the stream thread. So forwards and store writes have to be buffered on the worker and replayed on the stream thread when the record completes, in a sandboxed processor context.
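The offset-commit rule in particular is easy to get wrong, so here is a minimal sketch of it for a single partition. It assumes records are marked as started in offset order, and it follows the Kafka convention that a committed offset is the next record to read after a restart. The class name and methods are hypothetical, not taken from any library.

```java
import java.util.TreeSet;

/** Tracks in-flight offsets for one partition; the safe commit point is the lowest unfinished one. */
final class CommitTracker {
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private long nextOffset;   // one past the highest offset handed to a worker so far

    CommitTracker(long startOffset) {
        this.nextOffset = startOffset;
    }

    /** Called when a record is handed to a worker. */
    synchronized void started(long offset) {
        inFlight.add(offset);
        nextOffset = Math.max(nextOffset, offset + 1);
    }

    /** Called when a record's external call and outputs are complete. */
    synchronized void finished(long offset) {
        inFlight.remove(offset);
    }

    /** Offset that is safe to commit: everything below it has finished. */
    synchronized long committable() {
        return inFlight.isEmpty() ? nextOffset : inFlight.first();
    }
}
```

With offsets 95 to 100 started and only 100 finished, `committable()` stays at 95; it moves to 96 once 95 finishes, and reaches 101 only when all six are done. An implementation that drains before every commit always lands in that last state, which is why draining and offset correctness go together.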

This is genuinely advanced, and today it's mostly delivered by a library rather than core Apache Kafka. Responsive's Async Processor is the notable implementation: a worker thread pool per stream thread, per-key queues to hold ordering, and a sandboxed ProcessorContext that buffers forwards and store writes for replay on the stream thread, solving exactly the four problems above. It's a Processor API feature (not the DSL), and it's an SDK addition, not part of the Apache Kafka distribution. Responsive has stated an intent to contribute the pattern upstream; as of writing it is not in core Kafka, so version-gate any plan that depends on it.

Async changes your commit and transaction settings. Every commit drains all in-flight records, so an aggressive commit.interval.ms cancels much of the async benefit: you spend your time waiting for in-flight calls to finish instead of starting new ones. And under EOS the default drops from 30000ms to 100ms, so an exactly-once app starts at the aggressive end without you touching anything. Implementations of this pattern recommend raising the commit interval and setting transaction.timeout.ms comfortably above it (commit interval plus headroom), so transactions don't time out while records are in flight. Treat the commit interval as a throughput knob here, not a default.
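As a rough illustration of that tuning, the sketch below builds the relevant settings with plain `java.util.Properties`. The config keys are standard Kafka Streams names (the `producer.` prefix is how Streams passes a setting to its embedded producer); the helper class and the example numbers are assumptions chosen for illustration, not recommendations from any library.

```java
import java.util.Properties;

/** Hypothetical helper: commit interval as a throughput knob, transaction timeout kept above it. */
final class AsyncCommitSettings {

    static Properties build(long commitIntervalMs, long headroomMs) {
        Properties props = new Properties();
        props.put("processing.guarantee", "exactly_once_v2");
        // Under EOS the default is 100ms; a larger value means fewer drains of in-flight records.
        props.put("commit.interval.ms", Long.toString(commitIntervalMs));
        // Transaction timeout = commit interval plus headroom for calls still in flight.
        props.put("producer.transaction.timeout.ms", Long.toString(commitIntervalMs + headroomMs));
        return props;
    }

    public static void main(String[] args) {
        // Illustrative values only: a 5s commit interval with 25s of headroom.
        Properties props = build(5_000, 25_000);
        System.out.println(props.getProperty("commit.interval.ms"));               // prints 5000
        System.out.println(props.getProperty("producer.transaction.timeout.ms"));  // prints 30000
    }
}
```

A real application would add `application.id`, `bootstrap.servers`, and serdes before passing the properties to Kafka Streams. A longer commit interval also means more reprocessing after a failure and higher end-to-end latency for exactly-once output, so size it against both.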

Which fix, in what order

Reach for these top to bottom. Most teams never need the third.

Fix	Reach for it when	Cost
Co-located lookup (join)	The call is enrichment and the data can live in Kafka	Memory for the table; co-partitioning discipline
More tasks (partitions / `repartition(N)`)	The work is stateless, or you can plan partitions up front	An internal topic + a sub-topology boundary to watch
Async processing (library pattern)	The call is irreducible and more tasks still aren't enough	A dependency, Processor API only, careful commit/txn tuning

Try to delete the call first. If you can't, add tasks. Only when the call is truly unavoidable and tasks aren't enough do you take on the complexity, and the library dependency, of true async processing.

Frequently Asked Questions

Why is my Kafka Streams app slow with external calls?

A stream thread processes its task's records serially, one at a time. A blocking call makes the thread wait per record, so throughput per thread is roughly 1 / call-latency: a 10ms call caps a thread near 100 records/second, regardless of CPU. Multiple calls per record add up rather than overlap.

How do I parallelize Kafka Streams beyond partition count?

Useful parallelism is normally capped at the partition count (one task per partition). To go wider on a stateless stage, insert repartition(N) to write to a wider internal topic and run more downstream tasks. Beyond that, the async processing pattern runs records of different keys concurrently within one partition.

Can Kafka Streams process records concurrently?

Not within a task by default: processing is single-threaded per task to preserve per-key ordering. Records of different keys in the same partition can be processed concurrently with the async processing pattern (e.g. Responsive's Async Processor), which keeps per-key order, exactly-once, and offset-commit correctness while calls are in flight.

Should I call a REST API from a Kafka Streams processor?

Avoid a synchronous one if you can. It caps throughput at 1/latency per thread and can blow transaction.timeout.ms under exactly-once. Prefer enriching via a co-partitioned KTable or GlobalKTable lookup; if you must call out, use the async pattern and raise the commit interval.

What is the Responsive Async Processor?

A Processor-API feature in the Responsive SDK that processes records of different keys concurrently using a per-stream-thread worker pool, per-key queues for ordering, and a sandboxed context that replays store writes and forwards on the stream thread. It is one implementation of the async pattern, not part of core Apache Kafka.

See it in practice with Conduktor
The throughput wall shows up as one unambiguous signal: consumer group lag climbing while CPU sits idle. Conduktor Console shows per-partition lag and the consumer group's partition assignment, so you can tell a genuine throughput ceiling (uniform lag, more threads won't help) from a fix you can actually apply, and watch the internal topic a repartition(N) creates once you add one.

Next steps

Scaling & parallelism: the task ceiling this page works around, and the partition-increase trap
Anti-patterns: why a blocking call in the topology is the most damaging Streams mistake
Joins: replace an enrichment RPC with a co-partitioned KTable or GlobalKTable lookup