KStream vs KTable vs GlobalKTable
KStream vs KTable vs GlobalKTable: events vs changelog vs fully-replicated lookup, tombstone semantics, and which one to actually use in Kafka Streams.
Get the core Kafka Streams mental model right.
Almost every confusing thing in Kafka Streams traces back to one question: is this data a stream of events, or a table of current state? The library gives you three abstractions for that, KStream, KTable, and GlobalKTable, and picking the wrong one is how you end up with joins that drop data, aggregations that double-count, and the occasional NullPointerException you can't explain.
This is the single most useful concept in the library. Get it right and the rest of the DSL stops feeling arbitrary.
What you'll learn:
- KStream vs KTable: events versus a changelog, and why that distinction is everything
- What a tombstone is, and the
NullPointerExceptionit sets you up for - When GlobalKTable earns its memory cost, and when it doesn't
- A concrete rule for which abstraction to reach for
KStream: an append-only log of events
A KStream is an unbounded, append-only sequence of records. Every record is an independent fact that happened: a click, a payment, a sensor reading. Two records with the same key are two distinct events, not an update, if a customer places three orders, that's three records, and all three matter.
KStream<String, String> orders = builder.stream("orders");
// key=_1_ appears 3 times β 3 separate order events This is the right model for anything event-shaped: transactions, page views, log lines, IoT readings. You filter them, transform them, route them, and aggregate them, but you never "overwrite" one. The log is the truth, and the order of records carries meaning.
KTable: a changelog, where the latest value wins
A KTable is the opposite framing. It models a table where each key has exactly one current value, and the underlying topic is read as a changelog: each new record for a key is an update (an upsert) that replaces the previous value. Three records for customer-42 don't accumulate, the third one wins, and the first two are history.
KTable<String, String> profiles = builder.table("user-profiles");
// key=_1_ appears 3 times β table holds only the latest value This is the right model for entity and reference data: a user's current profile, an account balance, a product's current price, a device's last-known status. You're not interested in every change as an event, you want the current value per key, and a KTable maintains exactly that, backed by a local state store.
The tombstone: a null value means delete
Here is the part the 101 courses gloss over, and it bites people in production. In a changelog, you need a way to say "this key no longer exists." That's a tombstone: a record with a non-null key and a null value. When a KTable sees one, it removes that key from the table.
Tombstones have semantics that will surprise you the first time:
π« "
mapValuesandfilterrun on every record in my KTable, so I'm safe to dereference the value."
They don't. On a KTable, filter and mapValues skip your function for tombstones and forward the null straight through, the delete has to propagate downstream, so the table can't just drop it. Your lambda never sees it. The trap springs when you call toStream():
profiles
.mapValues(profile -> profile.toUpperCase()) // skipped for tombstones, null passes through
.toStream()
.foreach((key, value) -> process(value.length())); // NullPointerException on a deleted key toStream() faithfully surfaces those tombstones as records with null values, and your downstream KStream code, which assumed mapValues had already run, dereferences null. The fix is to filter nulls explicitly after toStream() when you don't care about deletes:
profiles
.toStream()
.filter((key, value) -> value != null) // drop tombstones before touching the value
.foreach((key, value) -> process(value.length())); If you do care about deletes (propagating them to a downstream system, for instance), keep the tombstone and handle null deliberately. Either way, the rule is: on a stream derived from a table, assume nulls are real and decide what to do with them.
Stream-table duality
KStream and KTable aren't separate worlds, they're two views of the same data, and you convert between them freely. This is stream-table duality, and it's the idea the whole DSL rests on.
- A table is a stream you've folded up. Replay a changelog from the beginning, keeping only the latest value per key, and you've reconstructed the table. That's literally how a
KTableis restored after a crash, by replaying its changelog topic. - A stream is the changes a table emitted. Call
toStream()on aKTableand you get the sequence of updates (including tombstones) that produced it.
Concretely: you groupBy + aggregate a KStream and the result is a KTable (a running view, see aggregations); you toStream() that KTable to publish the changes back out. You move across the boundary constantly, often without thinking about it.
GlobalKTable: replicated to every instance
A KTable is partitioned, each instance of your app holds only the keys for the partitions it owns. That's efficient, but it constrains joins: to join a KStream against a KTable, both sides must be co-partitioned (same partition count, same partitioning, so matching keys land on the same task). And don't expect Kafka Streams to catch a violation for you: the famous Topics not co-partitioned error only fires when the topology contains an internal repartition topic. Without one, a mismatched join starts fine, reaches RUNNING, and silently misses matches (verified on 3.9 and 4.3).
A GlobalKTable removes that constraint by making a different trade. It reads all partitions of its source topic into every instance of your app. Every instance holds a full, complete copy of the table.
GlobalKTable<String, String> countries = builder.globalTable("country-reference"); That full replication buys you two things:
- No co-partitioning required. Because every instance has every key, a
KStream-GlobalKTablejoin can look up any key locally. The stream side doesn't need to match the table's partition count or be re-keyed first. This is the usual reason people reach for it. - Bootstrapped before processing, not time-synchronized. A
GlobalKTableis loaded fully on startup. The join is a pure point-in-time lookup against current state, it does not try to align record timestamps the way aKStream-KTablejoin does (best-effort ordering by default; an exact temporal lookup only with versioned state stores, KIP-914). That's a feature for stable reference data, a footgun if you expected event-time correctness.
The cost is the part people underestimate. Every instance stores the entire table. A small lookup table (a few thousand currency codes, country names, feature flags) is free. A large one is not:
A 30 GB GlobalKTable is 30 GB on every instance. Run ten instances and you're holding ten full copies, memory and disk on each, plus the startup time to load all partitions before the app can process anything. Engineers regularly ask whether they can put a multi-gigabyte table in a
GlobalKTable; the honest answer is that you can, but you're paying for it N times over and slowing every restart. For large lookup data, a partitionedKTablejoin (with proper co-partitioning) or an external store is usually the better call.
One more limitation: you can't re-key a GlobalKTable. If you need to join on something other than its key, you supply a KeyValueMapper at join time that derives the join key from the stream's record, the table itself stays keyed as-is.
Which one do I actually use?
This is the question that sends people to the forums. A blunt rule covers most cases:
| Your data is⦠| Use | Why |
|---|---|---|
| Events, each record is an independent fact | KStream | Order matters, nothing is overwritten |
| Entity / reference data, partitioned, large | KTable | Latest value per key, scales with partitions |
| Reference / lookup data, small, joined by any key | GlobalKTable | Full copy everywhere, no co-partitioning, fast local lookup |
- "Did it happen" β KStream. "What is it now" β KTable. Payments happen; an account balance is. If you'd ever want to replay each occurrence, it's a stream.
- Reach for
GlobalKTableonly when the table is small and you want to skip co-partitioning. Currency rates, country codes, feature flags, small product catalogs, classic fits. The moment the table is large or grows unbounded, step back to aKTable. - When unsure between KTable and GlobalKTable, default to KTable. It scales with your partitions instead of replicating in full. You upgrade to
GlobalKTabledeliberately, to dodge a co-partitioning problem, not as a default.
The "which do I use for X" confusion almost always dissolves once you classify the data as event-vs-entity first, and only then pick the abstraction.
What is the difference between a KStream and a KTable?
A KStream is an append-only log of events where each record is an independent fact, so two records with the same key are two separate events. A KTable is a changelog where each new record for a key is an upsert, so only the latest value per key is kept.
When should I use a KTable vs a GlobalKTable?
Use a KTable for partitioned entity data that scales with your partitions; use a GlobalKTable only when the lookup table is small and you want to skip co-partitioning, because it replicates the entire table to every instance. When unsure, default to KTable and upgrade to GlobalKTable deliberately.
What is stream-table duality?
Stream-table duality is the idea that a stream and a table are two views of the same data. A table is a stream folded up to the latest value per key, and a stream is the sequence of changes a table emits, which is why a KTable is restored by replaying its changelog and toStream() turns it back into updates.
Why does a GlobalKTable use so much memory?
Because a GlobalKTable reads all partitions of its source topic into every instance of your app, so a 30 GB table is 30 GB on each instance, plus the startup time to load it before processing begins. For large lookup data, a partitioned KTable join or an external store is usually the better call.
What is a tombstone in a KTable?
A tombstone is a record with a non-null key and a null value; it tells a KTable to delete that key. On a KTable, filter and mapValues skip your function for tombstones and forward the null through, so code that dereferences the value after toStream() can hit a NullPointerException.
See it in practice with Conduktor
A
KTableis just a compacted Kafka topic read as a changelog, and aGlobalKTablereads all partitions of one. Conduktor Console lets you inspect those source topics, browse records to confirm a key's latest value, spot the tombstones (null-value records) that drive deletes, and check partition counts before you build a join that depends on co-partitioning.
Next steps
- Kafka Streams joins, co-partitioning, and the GlobalKTable join that skips it
- Aggregations, turn a KStream into a KTable with count, reduce, and aggregate
- State stores, the local store and changelog that back every KTable