# Kafka Data Quality: A Schema Registry Isn't Enough

You turned on the [Schema Registry](https://www.conduktor.io/blog/what-is-the-schema-registry-and-why-do-you-need-to-use-it). Every message references a schema. So your data is clean now, right?

Not really. A Schema Registry tells you a shape was declared. It says nothing about whether the data inside that shape is true.

By the time a consumer reads a bad record and notices something is off, that record is already durable on disk, replicated across brokers, and fanned out to every other consumer of the topic. You are not catching the problem. You are doing forensics on it.

So the question is not "how do we validate when we consume?". It is "why are we letting garbage into the log in the first place?". Validate at write time, not read time.

> 🚫 *"We have a Schema Registry, so our data is validated."*

## The broker validates nothing

To a vanilla Kafka broker, your carefully designed event is just a byte array. It stores the bytes, replicates the bytes, serves the bytes. It never deserializes them, never looks inside, never asks whether `amount` makes sense. Kafka is built to move data fast and to assume well-behaved clients. That is a feature (it is why Kafka scales), and it is also the hole.

"But we run broker-side schema validation." OK, so what does that check? Confluent's [server-side schema validation](https://docs.confluent.io/platform/current/schema-registry/schema-validation.html) confirms that a record carries a *registered schema ID*. It does not deserialize the payload to confirm the bytes match that schema, and it certainly does not judge whether the values are sane. A message can reference a valid schema and still carry nonsense.

So "we have a Schema Registry" guarantees one thing: somebody declared a structure once. It does not guarantee the producer respected it, and it has no opinion at all on whether a negative invoice or a license plate in the IBAN field should be allowed through.
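To make the gap concrete, here is a minimal sketch of an ID-level check. It assumes the Confluent wire format (a magic byte `0` followed by a 4-byte big-endian schema ID); `frame`, `id_only_check`, and the registered ID `42` are hypothetical names for illustration, not Confluent's implementation.

```python
import struct

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a payload with the magic byte and a 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def id_only_check(message: bytes, registered_ids: set) -> bool:
    """Hypothetical stand-in for an ID-level check: reads the header only.

    The payload after byte 5 is never decoded, so its content cannot
    influence the result.
    """
    if len(message) < 5:
        return False
    magic, schema_id = struct.unpack(">bI", message[:5])
    return magic == MAGIC_BYTE and schema_id in registered_ids


registered = {42}  # assumption: schema ID 42 exists in the registry
garbage = frame(42, b"\xde\xad\xbe\xef definitely not Avro")
accepted = id_only_check(garbage, registered)  # the junk payload is accepted
```

The check accepts `garbage` because the only thing it ever inspects is the five-byte header. Everything after that is taken on faith.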

Structure is not quality. Don't confuse them.

## Read time is too late

The cleanest way to see why read-time validation fails is the poison pill: a single record your consumer cannot deserialize.

Watch what happens. A record that fails deserialization is never handed to your code. `poll()` throws before you get the `ConsumerRecord`, so your record handler has no offset to commit past and nothing to `seek()` over. Unless the code around `poll()` catches the exception and moves the position itself, the consumer re-fetches the exact same offset, fails the exact same way, and loops. This is the decade-old [KAFKA-4740](https://issues.apache.org/jira/browse/KAFKA-4740), and it is still how the consumer behaves:

```
org.apache.kafka.common.errors.SerializationException:
  Error deserializing key/value for partition orders-0 at offset 2
```

One bad record, and the partition stops for every consumer assigned to it. Your options, all of them after the fact:

- skip the record and hope it didn't matter
- alert a human and stall until they wake up
- route it to a dead letter queue for later (more on "later" in a minute)
- halt and catch fire

Every one of those is recovery, not prevention. And they share one problem: the bad record is *already in the log*, durable and replicated, before your validation logic ever runs. "Validate when you consume" assumes you get to run. The poison pill is the case where you don't.
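The loop is easy to reproduce in miniature. The sketch below is a toy model of one partition, not the Kafka client API: `ToyConsumer` and its JSON "deserializer" are invented for illustration, and the only property it borrows from the real consumer is that a failed deserialization leaves the position where it was.

```python
import json


class ToyConsumer:
    """Toy model of a single partition; not the Kafka client API."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self):
        """Return the next decoded record, or None at the end of the log.

        If decoding raises, the position is NOT advanced, so the next
        poll() hits the same offset again.
        """
        if self.position >= len(self.log):
            return None
        value = json.loads(self.log[self.position])  # raises on the poison pill
        self.position += 1
        return value

    def seek(self, offset):
        self.position = offset


log = [b'{"id": 1}', b'{"id": 2}', b"<<corrupt>>", b'{"id": 3}']
consumer = ToyConsumer(log)
consumed, failures = [], []

for _ in range(6):  # bounded so the sketch terminates; a real loop would not
    try:
        record = consumer.poll()
    except ValueError:  # json.JSONDecodeError is a ValueError
        failures.append(consumer.position)  # same offset every time
        continue
    if record is not None:
        consumed.append(record["id"])

# After-the-fact recovery: skip the bad offset by hand and carry on.
consumer.seek(failures[-1] + 1)
recovered = consumer.poll()
```

Records 1 and 2 are consumed, then every remaining iteration fails at offset 2 and record 3 is never reached until someone seeks past the pill. Note what the "fix" is: a manual skip, performed after the bad record is already in the log.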

## One bad write fans out to every consumer

The record that crashes your consumer is the lucky case, because at least it crashes. The dangerous record is the one that passes the schema and is still wrong.

A negative invoice: an `amount` of `-42.50` where your payment logic never expected one. A timestamp in seconds where the rest of the pipeline assumes milliseconds. An IBAN with valid structure and broken [check digits](https://www.iban.com/iban-checker). Every one of those satisfies the schema. Nothing throws. The data is simply false, and it flows.
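None of these checks is hard to write. Here is a minimal sketch of all three as semantic rules, assuming an event with hypothetical field names `amount`, `timestamp`, and `iban`. The IBAN rule is the standard mod-97 check; the timestamp rule is a rough heuristic (anything below 10^12 is treated as seconds), not a definitive test.

```python
from decimal import Decimal


def iban_checksum_ok(iban: str) -> bool:
    """Standard IBAN mod-97 check.

    Move the first four characters to the end, map letters A..Z to
    10..35, and require the resulting integer to be 1 modulo 97.
    """
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def semantic_violations(event: dict) -> list:
    """Return the semantic rules this event breaks (empty list = clean)."""
    violations = []
    if Decimal(str(event["amount"])) < 0:
        violations.append("amount is negative")
    # Heuristic: epoch milliseconds have been >= 10**12 since 2001.
    if event["timestamp"] < 10**12:
        violations.append("timestamp looks like seconds, not milliseconds")
    if not iban_checksum_ok(event["iban"]):
        violations.append("IBAN check digits do not validate")
    return violations


good = {"amount": "42.50", "timestamp": 1_700_000_000_000,
        "iban": "GB82 WEST 1234 5698 7654 32"}
bad = {"amount": "-42.50", "timestamp": 1_700_000_000,
       "iban": "GB83 WEST 1234 5698 7654 32"}  # check digits altered
```

Both events have identical structure, so a schema treats them as the same thing. Only rules like these tell them apart, and only if they run before the write.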

Now multiply that by your consumer count:

![Diagram comparing two Kafka pipelines. Left, no write-time gate: one bad red record is appended to the topic and fans out to four downstream consumers (a database, an analytics job, an ML feature store, and a search index), each turning red as it ingests the bad data. Right, with a write-time validation gate in front of the topic: the bad red record is caught and rejected at the gate, and only clean records fan out, so every downstream consumer stays clean.](https://www.conduktor.io/assets/images/blog/kafka-data-quality-write-vs-read-time.svg)

A bad write is not one failure. It is one failure per consumer, because each consumer re-derives meaning on its own and gets it wrong on its own.

> *"There are eight possible chances that a consumer will misinterpret the data from an event stream. And the more consumers and topics you have, the greater the chance they misinterpret data compared to their peers."*
> — Adam Bellemare, [Shift Left: Bad Data in Event Streams](https://www.confluent.io/blog/shift-left-bad-data-in-event-streams-part-1/)

Grab hit exactly this and wrote it up. Schema-passing data that did not match reality, and the consequence was not a crash. In their words, the issues "persist for periods of time, impacting various online and offline downstream systems before being discovered." Then, once someone finally noticed the numbers were wrong, teams "face difficulties in pinpointing the exact poison data." A bad write doesn't announce itself. It seeps.

## What bad data costs

A crash you can fix in an afternoon. A wrong number you can't, because by the time anyone notices, your dashboards, ML features, and finance reports have all consumed it and acted on it.

A few numbers:

- **71%** of data professionals worry about incorrect or hallucinated data reaching stakeholders, and the share rating trust in data as important jumped to **83%** ([dbt Labs State of Analytics Engineering, 2026](https://www.getdbt.com/resources/state-of-analytics-engineering-2026)).
- An average data team fields **67 data quality incidents a month**, each taking around **15 hours** to resolve ([Monte Carlo, 2023](https://montecarlo.ai/blog-data-quality-survey)). At those rates, that is roughly 1,000 hours a month spent on cleanup.
- Gartner's working figure for the cost of poor data quality is **$12.9M a year** on average (worth a caveat: that sample skews toward enterprises already buying data-quality tooling, so read it as a ceiling, not a median).

Every pipeline reports success while the figures it produces drift from reality, until people stop trusting them at all. That is the most expensive outcome and the least recoverable, because trust doesn't come back with a backfill.

One of our customers, running ingestion across many data sources into a lake, framed their core problem like this:

> *"The primary challenge we have is the enrichment of the data and the cleansing of the data, plus some of the latency of the data itself."*
> — Data and analytics lead at a US health insurer

That is the cleansing tax. Let garbage into the log once and every consumer downstream pays to scrub it, forever. And the DLQ you promised to drain "later"? Later rarely comes. Teams end up writing throwaway scripts per incident, keeping a zombie consumer alive just to drain an old topic, watching dead messages rot as the schema moves on under them. The smartest advice I have seen on this is brutally simple: validate every fixed record against an agreed-good schema *before* you republish it, because nothing is worse than dumping another few thousand bad messages onto the queue you are already trying to empty.
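That last piece of advice fits in a few lines. A minimal sketch, assuming a hypothetical agreed-good contract of two typed fields (`REQUIRED_FIELDS` and both function names are invented here; a real setup would check against the actual registered schema):

```python
# Hypothetical agreed-good contract: field name -> required Python type.
REQUIRED_FIELDS = {"order_id": str, "amount": float}


def matches_contract(record: dict) -> bool:
    """True only if every required field is present with the right type."""
    return all(
        isinstance(record.get(name), kind)
        for name, kind in REQUIRED_FIELDS.items()
    )


def split_for_replay(repaired: list) -> tuple:
    """Split 'fixed' DLQ records into (republish, still_bad)."""
    republish, still_bad = [], []
    for record in repaired:
        (republish if matches_contract(record) else still_bad).append(record)
    return republish, still_bad


repaired = [
    {"order_id": "A-1", "amount": 19.99},
    {"order_id": "A-2", "amount": "19.99"},  # the "fix" left amount as a string
]
republish, still_bad = split_for_replay(repaired)
```

Only `republish` goes back to the topic. Whatever lands in `still_bad` stays out of the log until it is fixed properly.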

## Producer-side validation isn't enforcement

The obvious answer is: fine, validate in the producer before `send()`. Necessary. Not sufficient.

Producer-side validation holds right up until a producer doesn't do it. And one will. A new service ships without the validation library. A team pins a stale version. A batch job speaks the Kafka wire protocol directly. A bug slips the check. The moment any client bypasses the rule, the rule is gone, and your "validated" topic has garbage in it again.

> *"For a mature multi-team Kafka deployment, that is not enforcement. It is a courtesy."*
> — Robert Allen, on client-side schema validation

The motivation for [KIP-729](https://cwiki.apache.org/confluence/display/KAFKA/KIP-729%3A+Custom+validation+of+records+on+the+broker+prior+to+log+append) notes that asking producers to validate "is difficult as producers could have bugs," which is precisely why the proposal wanted to make topics schema-aware at the broker. There is even a follow-up (KIP-940) aimed at rejecting records from a misconfigured client.

> **Quick note on status.** Broker-side content validation in open-source Kafka is still proposal territory (KIP-729, KIP-940), not shipped. As of mid-2026 the broker will not enforce your data's meaning for you. Anyone telling you Apache Kafka rejects bad payloads out of the box is describing a roadmap, not a release.

It is worse for polyglot shops. Rich client-side semantic validation is largely a Java story; your Python, Go, and Node producers can't enforce the same rules at the edge even if they wanted to. So the one place every producer is guaranteed to pass through, regardless of language, version, or how well-behaved it is, is the path to the broker. That is where enforcement has to live.

## Where to put the gate

Two places to put it:

- **At the build.** LinkedIn pushed governance so far upstream that an event schema missing required business metadata fails the build. Their client wraps the producer so schemas auto-register and a developer can't skip it. Quality is enforced before a single byte is produced, which works when you own every producer and every build pipeline.
- **On the path.** Put a checkpoint between producers and brokers that validates payload, schema, and semantic rules before the record is appended. Robinhood does a producer-driven version with a validation gate plus a quarantine store for what fails. A [Kafka proxy](https://www.conduktor.io/blog/enforcing-kafka-data-quality-at-scale) does it for every client at once, including the polyglot ones and the ones speaking raw protocol, with no application code to change.

This is where [Conduktor Gateway](https://docs.conduktor.io/guide/conduktor-concepts/interceptors) sits. It validates records against their schema and runs [data quality rules](https://docs.conduktor.io/guide/conduktor-concepts/data-quality-policies) (structural and semantic) at the proxy, then rejects or quarantines anything that fails before it reaches the brokers. The point is defense in depth: even when a producer skips its own validation, the gate doesn't, because it is the one thing on the path that no client can route around. Pair it with real schema compatibility rules so [breaking changes can't ship either](https://www.conduktor.io/blog/kafka-data-contracts), and the topic stops being a place where bad data accumulates.
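Stripped of any product, the on-path gate is a small idea. The sketch below is a toy in-memory model, not Conduktor Gateway's implementation or configuration: `WriteGate`, the rule names, and the list standing in for a topic are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WriteGate:
    """Toy write-time gate: every record passes through produce()."""

    rules: dict  # rule name -> callable(record) -> bool
    topic: list = field(default_factory=list)       # stands in for the log
    quarantine: list = field(default_factory=list)  # rejected records + reasons

    def produce(self, record: dict) -> bool:
        """Append the record only if every rule passes; otherwise quarantine it."""
        failed = [name for name, rule in self.rules.items() if not rule(record)]
        if failed:
            self.quarantine.append((record, failed))
            return False
        self.topic.append(record)
        return True


gate = WriteGate(rules={
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
    "currency_present": lambda r: bool(r.get("currency")),
})

ok = gate.produce({"amount": 42.5, "currency": "EUR"})
rejected = gate.produce({"amount": -42.5, "currency": "EUR"})
```

The design choice that matters is that `produce()` is the only way into `topic`. The rules live in one place, so a producer that forgot its own validation still hits them.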

So, back to where we started. A Schema Registry tells you the shape was declared. It does not tell you the data is true. That part is a write-time job, and nobody downstream can do it for you.

If your data quality plan is "we'll validate when we consume," you don't have a data quality plan. You have a growing log of garbage with a timestamp on every record.

[Book a demo](https://www.conduktor.io/contact/demo) to see Conduktor Gateway validate and quarantine bad records on the data path, before they fan out to everyone who trusts the topic.

---

**Related**: [Enforcing Data Quality at Scale →](https://www.conduktor.io/blog/enforcing-kafka-data-quality-at-scale) · [Kafka Data Contracts →](https://www.conduktor.io/blog/kafka-data-contracts) · [Schema Registry, Explained →](https://www.conduktor.io/blog/what-is-the-schema-registry-and-why-do-you-need-to-use-it)
