Kafka Policy Enforcement: Code Over Docs

Kafka policies in Confluence are suggestions. CEL-based enforcement at the API layer prevents bad configs before they reach production.

Stéphane Derosiaux · January 21, 2026

Documented policies don't prevent violations. Enforced policies do.

Most organizations have Kafka governance policies: topics must follow naming conventions, retention must stay within bounds, replication factor must be 3 in production. These policies exist in Confluence, get mentioned in code reviews, and are sometimes followed.

The problem isn't that policies are wrong. It's that enforcement depends on human memory and discipline. When creating a topic at 4 PM on Friday to fix a production issue, engineers forget to check the wiki for naming conventions. Code reviewers miss configuration mistakes. Policies document what should happen; enforcement makes it impossible for violations to reach production.

Real policy enforcement means: the platform rejects topic creation requests with invalid names, denies retention policies outside acceptable ranges, and prevents replication factor 1 in production—before these configurations touch Kafka. If governance requires remembering and manually validating policies, failures are inevitable.

The Policy Enforcement Problem

Policy enforcement fails at scale for three reasons: policy sprawl (too many rules to remember), policy drift (documented policies diverge from reality), and lack of automation (validation requires manual review).

Policy sprawl happens when organizations accumulate rules without pruning outdated ones. A 50-item policy checklist covering naming conventions, retention limits, partition counts, compression settings, ACL patterns, and schema compatibility is comprehensive but impossible to remember.

Engineers violate policies not from negligence but from cognitive load. They can't remember 50 rules while solving urgent production issues.

Policy drift happens when documented policies don't match operational reality. The wiki says "replication factor 3 required" but 30% of production topics have RF 2 because exceptions were granted informally and never documented.

Over time, documented policies become aspirational rather than descriptive. New engineers see existing violations and conclude policies are optional.

Manual validation doesn't scale. If every topic creation requires platform team review against a 50-item checklist, provisioning takes days and platform teams become bottlenecks. Manual review also introduces errors—reviewers miss violations, especially under time pressure.

Policy as Code

Policy as code expresses rules programmatically, enabling automated validation at creation time.

CEL (Common Expression Language) is one approach. Policies are written as expressions that evaluate to true (compliant) or false (violation):

// Topic name must match pattern team.domain.entity
resource.spec.name.matches("^[a-z]+\\.[a-z]+\\.[a-z]+$")

// Retention between 1 hour and 7 days
resource.spec.retentionMs >= 3600000 &&
resource.spec.retentionMs <= 604800000

// Replication factor 3 in production
cluster.environment == "production" ?
  resource.spec.replicationFactor >= 3 :
  resource.spec.replicationFactor >= 1

When a developer submits a topic creation request, these policies evaluate automatically. If all expressions return true, the topic is created. If any expression returns false, the request is rejected with an error message explaining the violation.

OPA (Open Policy Agent) is an alternative policy framework that uses the Rego language. It supports more complex policy logic (querying external data, multi-step validation) at the cost of additional infrastructure. Organizations using OPA can implement similar validation patterns.

Custom validators run arbitrary code (Python, JavaScript) to validate requests. This provides maximum flexibility but requires maintaining custom code instead of declarative policies.

The choice depends on policy complexity. Simple rules (naming patterns, numeric bounds) work well in CEL. Complex rules (cross-referencing external systems, multi-resource validation) might need custom validators or alternative frameworks.

What Should Be Enforced

Not every organizational preference needs enforcement. Focus on policies that prevent incidents, compliance violations, or operational problems.

Naming conventions prevent chaos. If topics follow the team.domain.entity pattern, names are self-documenting and ownership is obvious. Enforce this: reject topics with names like test, foo, or mydata.

Custom error message:

Topic name 'test' doesn't match required pattern 'team.domain.entity'.
Example: 'platform.orders.created'

Retention and partition limits prevent over-provisioning and under-provisioning. Too-long retention wastes storage. Too-short retention causes data loss when consumers lag. Too many partitions create broker overhead. Too few partitions limit parallelism.

Enforce bounds:

retention: 1 hour minimum, 30 days maximum (unless approved exception)
partitions: 3 minimum (for parallelism), 50 maximum (to limit overhead)
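
Both bounds translate directly into CEL, in the same style as the earlier expressions. The partitions field name is illustrative, following the resource.spec shape used above:

// Retention between 1 hour and 30 days
resource.spec.retentionMs >= 3600000 &&
resource.spec.retentionMs <= 2592000000

// Partition count between 3 and 50
resource.spec.partitions >= 3 &&
resource.spec.partitions <= 50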

Replication factor requirements prevent data loss. RF 1 means a single broker failure loses data. RF 3 tolerates two broker failures. Production data must be replicated.

Enforce: production clusters require RF ≥ 3, dev/staging allow RF ≥ 1.

Schema compatibility modes prevent breaking changes. Schemas without compatibility checks allow producers to break consumers. Enforce: schemas must use BACKWARD, FORWARD, or FULL compatibility—not NONE.
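
As a CEL sketch, assuming the schema registration request exposes the requested mode as a compatibility field:

// Compatibility must be an evolution-safe mode, never NONE
resource.spec.compatibility in ["BACKWARD", "FORWARD", "FULL"]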

ACL patterns prevent overly broad permissions. Wildcard ACLs (user:developer can read topic:*) violate least privilege. Enforce: ACLs must target specific topics or patterns scoped to team ownership.
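
A CEL sketch of the same rule; resourceName, patternType, and metadata.team are illustrative field names, not a fixed schema:

// No bare wildcards; prefixed ACLs must stay inside the owning team's namespace
resource.spec.resourceName != "*" &&
(resource.spec.patternType != "PREFIXED" ||
 resource.spec.resourceName.startsWith(resource.metadata.team + "."))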

Enforcement Architecture

Policies are enforced at the control plane, before requests reach Kafka. This prevents invalid configurations from ever existing, rather than merely detecting them after creation.

API gateway enforcement: Requests to create topics, register schemas, or grant ACLs flow through a control plane API. Policies evaluate before execution. Compliant requests proceed; non-compliant requests are rejected with actionable errors.

This works regardless of access method: web console, CLI, GitOps, or Terraform. All paths hit the same API, and all enforce the same policies.

GitOps enforcement: Policies validate during CI/CD. When a developer commits topic definition to Git, CI runs policy checks. Failing policies fail the build, preventing merge.

This provides fast feedback (developers see policy violations before code review) and prevents policy violations from reaching production repositories.
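
A minimal sketch of that gate as a CI job. The policy-check command and its flags are hypothetical stand-ins for whatever validator evaluates the CEL policies:

validate-topics:
  stage: validate
  script:
    # Non-zero exit on any violation fails the pipeline and blocks the merge
    - policy-check --policies policies/ --resources topics/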

Runtime enforcement: Even if policies are bypassed during creation (emergency override, migration from unmanaged resources), runtime enforcement detects drift. Periodic scans check existing resources against policies and flag violations for remediation.

This catches: resources created before policies existed, manual changes to managed resources, policy changes that make existing resources non-compliant.

Custom Error Messages

Generic errors frustrate developers. "Policy violation" doesn't explain what's wrong or how to fix it. Custom error messages teach the policy at the moment of violation, reducing support burden.

Bad error message:

Topic creation failed: policy violation

Good error message:

Topic name 'orders' doesn't match required pattern 'team.domain.entity'.

Required format: {team}.{domain}.{entity}
Example: platform.orders.created

Current value: 'orders'

The good message explains:

  • What's wrong (name doesn't match pattern)
  • What the pattern is (team.domain.entity)
  • An example (platform.orders.created)
  • The violating value (orders)

Developers fix violations without asking for help, reducing interruptions for the platform team.

Policy Testing and Iteration

Policies should evolve based on usage patterns. Brittle policies that reject valid use cases need adjustment.

Policy testing validates policies before deployment. Test cases cover:

  • Valid inputs (should pass)
  • Invalid inputs (should fail)
  • Edge cases (boundary values, special characters)

Example test:

policy: naming-convention
cases:
  - input: "platform.orders.created"
    expect: pass
  - input: "orders"
    expect: fail
  - input: "platform.orders"
    expect: fail
  - input: "platform-orders.created"
    expect: fail  # hyphens not allowed
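
The naming example covers valid and invalid inputs; boundary values for numeric policies fit the same illustrative format, shown here against the earlier 1-hour/7-day retention rule:

policy: retention-bounds
cases:
  - input: { retentionMs: 3600000 }    # exactly the 1-hour minimum
    expect: pass
  - input: { retentionMs: 3599999 }    # one millisecond below
    expect: fail
  - input: { retentionMs: 604800000 }  # exactly the 7-day maximum
    expect: pass
  - input: { retentionMs: 604800001 }  # one millisecond above
    expect: fail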

Tests prevent policy regressions (changes that unintentionally break valid cases).

Feedback loops tune policies based on violations. Track: which policies are violated most often? Are violations legitimate use cases or policy gaps?

If "retention must be under 7 days" is violated 50 times/month because analytics team needs 30-day retention, the policy is wrong—not the requests. Adjust policy to allow 30-day retention for analytics topics or implement exception workflow.

Policy metrics measure enforcement effectiveness: percentage of requests compliant on first submission, percentage rejected due to violations, time to fix violations.

If 80% of requests are compliant on first try, policies are well-understood. If 40% are rejected initially, policies need clearer documentation or defaults.

Exception Handling

Strict policies need escape hatches for legitimate exceptions. The exception process should be audited and time-bound.

Manual approval for exceptions: If a team needs 90-day retention (exceeding policy max of 30 days), they submit an exception request explaining why. Platform team reviews business justification and approves time-limited exception.

Exception metadata:

  • Who requested
  • Business justification
  • Approver
  • Expiration date (exceptions aren't permanent)
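
An illustrative exception record carrying those fields (format and field names are hypothetical):

exception:
  resource: analytics.events.raw
  policy: retention-max
  requestedBy: analytics-team
  justification: "90-day lookback needed for quarterly model retraining"
  approvedBy: platform-lead
  expiresAt: 2026-07-21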

Exception auditing: All exceptions are logged and reviewed quarterly. Are exceptions still justified? Should policy be adjusted to accommodate legitimate patterns?

Temporary overrides: Emergency situations might require bypassing policies. This should be:

  • Logged with incident ticket reference
  • Time-limited (expires after 7 days)
  • Reviewed in postmortem

Overrides are break-glass mechanisms for genuine emergencies, not routine workarounds.

Measuring Policy Compliance

Track compliance rate: percentage of resources compliant with policies. Target: 95%+ compliance within 30 days of policy deployment.

Compliance by policy: Which policies are frequently violated? If naming convention is violated 30% of the time, enforcement isn't working or the policy is unclear.

Compliance by team: Which teams have lowest compliance? They might need training, better documentation, or policy adjustments for their use cases.

Compliance trends: Is compliance improving (teams learning policies) or degrading (policies being ignored)? Improving trends indicate effective enforcement and education.

The Path Forward

Kafka policy enforcement shifts from documentation ("here's what you should do") to code ("the platform prevents you from doing it wrong"). Policies expressed as code validate requests automatically, rejecting non-compliant configurations before they reach Kafka.

Conduktor enforces policies through CEL-based validation, custom error messages, and enforcement across all provisioning paths (Console, CLI, GitOps). Teams define policies once; the platform enforces everywhere. Organizations report fewer governance incidents and faster provisioning because compliant requests succeed instantly without manual review.

If your governance depends on engineers remembering policies documented in wikis, the problem isn't the engineers—it's the lack of automated enforcement.


Related: Topic as a Service · Kafka Governance · Terraform x Conduktor