# KStream vs KTable vs GlobalKTable

*Get the core Kafka Streams mental model right.*

Almost every confusing thing in Kafka Streams traces back to one question: is this data a stream of events, or a table of current state? The library gives you three abstractions for that, `KStream`, `KTable`, and `GlobalKTable`, and picking the wrong one is how you end up with joins that drop data, aggregations that double-count, and the occasional `NullPointerException` you can't explain.

This is the single most useful concept in the library. Get it right and the rest of the DSL stops feeling arbitrary.

**What you'll learn:**
- KStream vs KTable: events versus a changelog, and why that distinction is everything
- What a tombstone is, and the `NullPointerException` it sets you up for
- When GlobalKTable earns its memory cost, and when it doesn't
- A concrete rule for which abstraction to reach for

## KStream: an append-only log of events

A `KStream` is an unbounded, append-only sequence of records. Every record is an independent fact that *happened*: a click, a payment, a sensor reading. Two records with the same key are two distinct events, not an update, if a customer places three orders, that's three records, and all three matter.

```java
KStream<String, String> orders = builder.stream("orders");
// key="customer-42" appears 3 times → 3 separate order events
```

This is the right model for anything event-shaped: transactions, page views, log lines, IoT readings. You filter them, transform them, route them, and aggregate them, but you never "overwrite" one. The log is the truth, and the order of records carries meaning.

## KTable: a changelog, where the latest value wins

A `KTable` is the opposite framing. It models a table where each key has exactly one current value, and the underlying topic is read as a **changelog**: each new record for a key is an *update* (an upsert) that replaces the previous value. Three records for `customer-42` don't accumulate, the third one wins, and the first two are history.

```java
KTable<String, String> profiles = builder.table("user-profiles");
// key="customer-42" appears 3 times → table holds only the latest value
```

This is the right model for entity and reference data: a user's current profile, an account balance, a product's current price, a device's last-known status. You're not interested in every change as an event, you want *the current value per key*, and a `KTable` maintains exactly that, backed by a local [state store](https://www.conduktor.io/kafka-streams/state-store).

### The tombstone: a null value means delete

Here is the part the 101 courses gloss over, and it bites people in production. In a changelog, you need a way to say "this key no longer exists." That's a **tombstone**: a record with a non-null key and a **null value**. When a `KTable` sees one, it removes that key from the table.

Tombstones have semantics that will surprise you the first time:

> 🚫 *"`mapValues` and `filter` run on every record in my KTable, so I'm safe to dereference the value."*

They don't. On a `KTable`, `filter` and `mapValues` **skip your function for tombstones** and forward the `null` straight through, the delete has to propagate downstream, so the table can't just drop it. Your lambda never sees it. The trap springs when you call `toStream()`:

```java
profiles
    .mapValues(profile -> profile.toUpperCase()) // skipped for tombstones, null passes through
    .toStream()
    .foreach((key, value) -> process(value.length())); // NullPointerException on a deleted key
```

`toStream()` faithfully surfaces those tombstones as records with `null` values, and your downstream `KStream` code, which assumed `mapValues` had already run, dereferences `null`. The fix is to filter nulls explicitly after `toStream()` when you don't care about deletes:

```java
profiles
    .toStream()
    .filter((key, value) -> value != null) // drop tombstones before touching the value
    .foreach((key, value) -> process(value.length()));
```

If you *do* care about deletes (propagating them to a downstream system, for instance), keep the tombstone and handle `null` deliberately. Either way, the rule is: **on a stream derived from a table, assume nulls are real and decide what to do with them.**

## Stream-table duality

KStream and KTable aren't separate worlds, they're two views of the same data, and you convert between them freely. This is *stream-table duality*, and it's the idea the whole DSL rests on.

- A **table is a stream you've folded up.** Replay a changelog from the beginning, keeping only the latest value per key, and you've reconstructed the table. That's literally how a `KTable` is restored after a crash, by replaying its [changelog topic](https://www.conduktor.io/kafka/kafka-topic-configuration-log-compaction).
- A **stream is the changes a table emitted.** Call `toStream()` on a `KTable` and you get the sequence of updates (including tombstones) that produced it.

Concretely: you `groupBy` + aggregate a `KStream` and the result is a `KTable` (a running view, see [aggregations](https://www.conduktor.io/kafka-streams/aggregations)); you `toStream()` that `KTable` to publish the changes back out. You move across the boundary constantly, often without thinking about it.

## GlobalKTable: replicated to every instance

A `KTable` is **partitioned**, each instance of your app holds only the keys for the partitions it owns. That's efficient, but it constrains joins: to join a `KStream` against a `KTable`, both sides must be [co-partitioned](https://www.conduktor.io/kafka-streams/joins) (same partition count, same partitioning, so matching keys land on the same task). And don't expect Kafka Streams to catch a violation for you: the famous `Topics not co-partitioned` error only fires when the topology contains an internal repartition topic. Without one, a mismatched join starts fine, reaches RUNNING, and silently misses matches (verified on 3.9 and 4.3).

A `GlobalKTable` removes that constraint by making a different trade. It reads **all** partitions of its source topic into **every** instance of your app. Every instance holds a full, complete copy of the table.

```java
GlobalKTable<String, String> countries = builder.globalTable("country-reference");
```

That full replication buys you two things:

- **No co-partitioning required.** Because every instance has every key, a `KStream`-`GlobalKTable` join can look up *any* key locally. The stream side doesn't need to match the table's partition count or be re-keyed first. This is the usual reason people reach for it.
- **Bootstrapped before processing, not time-synchronized.** A `GlobalKTable` is loaded fully on startup. The join is a pure point-in-time lookup against current state, it does **not** try to align record timestamps the way a `KStream`-`KTable` join does (best-effort ordering by default; an exact temporal lookup only with versioned state stores, KIP-914). That's a feature for stable reference data, a footgun if you expected event-time correctness.

The cost is the part people underestimate. *Every instance stores the entire table.* A small lookup table (a few thousand currency codes, country names, feature flags) is free. A large one is not:

> **A 30 GB GlobalKTable is 30 GB on every instance.** Run ten instances and you're holding ten full copies, memory and disk on each, plus the startup time to load all partitions before the app can process anything. Engineers regularly ask whether they can put a multi-gigabyte table in a `GlobalKTable`; the honest answer is that you *can*, but you're paying for it N times over and slowing every restart. For large lookup data, a partitioned `KTable` join (with proper co-partitioning) or an external store is usually the better call.

One more limitation: you **can't re-key** a `GlobalKTable`. If you need to join on something other than its key, you supply a `KeyValueMapper` at join time that derives the join key from the stream's record, the table itself stays keyed as-is.

## Which one do I actually use?

This is the question that sends people to the forums. A blunt rule covers most cases:

| Your data is… | Use | Why |
|---|---|---|
| Events, each record is an independent fact | **KStream** | Order matters, nothing is overwritten |
| Entity / reference data, partitioned, large | **KTable** | Latest value per key, scales with partitions |
| Reference / lookup data, small, joined by any key | **GlobalKTable** | Full copy everywhere, no co-partitioning, fast local lookup |

Sharper heuristics:

- **"Did it happen" → KStream. "What is it now" → KTable.** Payments happen; an account balance *is*. If you'd ever want to replay each occurrence, it's a stream.
- **Reach for `GlobalKTable` only when the table is small and you want to skip co-partitioning.** Currency rates, country codes, feature flags, small product catalogs, classic fits. The moment the table is large or grows unbounded, step back to a `KTable`.
- **When unsure between KTable and GlobalKTable, default to KTable.** It scales with your partitions instead of replicating in full. You upgrade to `GlobalKTable` deliberately, to dodge a co-partitioning problem, not as a default.

The "which do I use for X" confusion almost always dissolves once you classify the *data* as event-vs-entity first, and only then pick the abstraction.

**What is the difference between a KStream and a KTable?**

A `KStream` is an append-only log of events where each record is an independent fact, so two records with the same key are two separate events. A `KTable` is a changelog where each new record for a key is an upsert, so only the latest value per key is kept.

**When should I use a KTable vs a GlobalKTable?**

Use a `KTable` for partitioned entity data that scales with your partitions; use a `GlobalKTable` only when the lookup table is small and you want to skip co-partitioning, because it replicates the entire table to every instance. When unsure, default to `KTable` and upgrade to `GlobalKTable` deliberately.

**What is stream-table duality?**

Stream-table duality is the idea that a stream and a table are two views of the same data. A table is a stream folded up to the latest value per key, and a stream is the sequence of changes a table emits, which is why a `KTable` is restored by replaying its changelog and `toStream()` turns it back into updates.

**Why does a GlobalKTable use so much memory?**

Because a `GlobalKTable` reads all partitions of its source topic into every instance of your app, so a 30 GB table is 30 GB on each instance, plus the startup time to load it before processing begins. For large lookup data, a partitioned `KTable` join or an external store is usually the better call.

**What is a tombstone in a KTable?**

A tombstone is a record with a non-null key and a null value; it tells a `KTable` to delete that key. On a `KTable`, `filter` and `mapValues` skip your function for tombstones and forward the null through, so code that dereferences the value after `toStream()` can hit a `NullPointerException`.

> **See it in practice with Conduktor**
> A `KTable` is just a compacted Kafka topic read as a changelog, and a `GlobalKTable` reads all partitions of one. [Conduktor Console](https://docs.conduktor.io/guide/manage-kafka/kafka-resources/topics) lets you inspect those source topics, browse records to confirm a key's latest value, spot the tombstones (null-value records) that drive deletes, and check partition counts before you build a join that depends on co-partitioning.

## Next steps

- [Kafka Streams joins](https://www.conduktor.io/kafka-streams/joins), co-partitioning, and the GlobalKTable join that skips it
- [Aggregations](https://www.conduktor.io/kafka-streams/aggregations), turn a KStream into a KTable with count, reduce, and aggregate
- [State stores](https://www.conduktor.io/kafka-streams/state-store), the local store and changelog that back every KTable
