# Kafka Automation Platform: Self-Serve by Policy

How long does it take to get a Kafka topic at your company? Or a service account, or a certificate?

At one global retailer we work with, the honest answer for a certificate was: it depends who's around.

> *"We ask for a certificate file. Sometimes we get it in thirty minutes, sometimes it takes two weeks. And sometimes the certificate is just copy-pasted into a chat."*
> — a technical lead at a global retailer

Nobody on that team is slow or careless. A human has to be available, find the file, and hand it over. When they're in a meeting, you wait. When they're on holiday, you wait longer. The work is manual, so it moves at the speed of people, and people have plenty of other things to do.

The same is true of your platform team. The people aren't the problem; the manual process doesn't scale, and it puts the weight on the few people who hold the whole thing in their heads.

Automation isn't about replacing those people with scripts. It's about giving them their time back, and letting developers move without filing a ticket and waiting on someone else's calendar.

## When the manual model starts to hurt

Manual operations are fine at small scale. Ten teams, twenty topics, a handful of service accounts. Slack requests and a spreadsheet get you a long way.

Then adoption grows, and three things start to show up.

The queue gets longer. Topic creation that took an hour now takes days, not because the work got harder, but because there's more of it and the same two people are doing it. Every "did you create my topic yet?" is a context switch for someone who was halfway through real work.

Configuration drifts. When every topic is created by hand, conventions live in Confluence, not in production. One team gets 7-day retention, another gets 30 for the same use case. Replication factor depends on who was on call. Nobody did anything wrong; the rules just weren't enforced anywhere.

And the knowledge concentrates. Two people know how to provision things correctly, and the runbook they wrote is six months stale. When they're both out, everything waits.

The usual answer is "hire more platform engineers." I've watched managers fight hard for that headcount and not get it.

> *"I keep trying to get her more resources. So far I've not been successful."*
> — a data lead at an enterprise HR-software company, on his small Kafka platform team

And even when the headcount does come, it doesn't really fix things. More people doing manual work is still manual work. You've raised the ceiling, not changed the shape of the curve.

What changes the shape is automation.

## Automation isn't a pile of scripts

When people hear "automate Kafka provisioning," they picture bash scripts in a Git repo. That's not it (and it usually makes things worse, because now the scripts are the thing only two people understand).

Real automation is policy-based. The platform team sets the guardrails once, and the right thing becomes the easy thing to do.

A developer asks for a topic. Before anything touches Kafka, the request is checked against policy:

- Naming convention
- Retention bounds
- Replication factor
- Partition count

If it's fine, the topic is created and ownership is recorded. If it's not, the developer gets a clear message saying what's wrong and how to fix it, in seconds, without opening a ticket or waiting on anyone's calendar.

That check is doing a job a person used to do from memory, every single time, hoping they hadn't forgotten a setting. The machine doesn't forget, and the people who used to carry all of it in their heads get to put it down.
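To make that concrete, here is a minimal sketch of such a check in Python. It is not how any particular product implements it. The `TopicRequest` shape, the `validate` function, and every limit in it are assumptions chosen for illustration; only the four checks themselves and the `team.domain.entity.version` naming shape come from this post.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# All limits below are illustrative assumptions, not recommended values.
NAME_PATTERN = re.compile(r"[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.v[0-9]+")
MIN_PARTITIONS, MAX_PARTITIONS = 1, 48
MIN_RETENTION_MS = 60 * 60 * 1000              # 1 hour
MAX_RETENTION_MS = 30 * 24 * 60 * 60 * 1000    # 30 days
PROD_MIN_REPLICATION_FACTOR = 3


@dataclass(frozen=True)
class TopicRequest:
    """Hypothetical request shape; a real platform will have its own."""
    name: str
    partitions: int
    replication_factor: int
    retention_ms: int
    environment: str  # "prod" or anything else


def validate(request: TopicRequest) -> list[str]:
    """Return readable policy violations; an empty list means the request passes."""
    errors = []
    if not NAME_PATTERN.fullmatch(request.name):
        errors.append(
            f"name '{request.name}' must look like team.domain.entity.v1"
        )
    if not MIN_PARTITIONS <= request.partitions <= MAX_PARTITIONS:
        errors.append(
            f"partitions must be between {MIN_PARTITIONS} and {MAX_PARTITIONS}, "
            f"got {request.partitions}"
        )
    if not MIN_RETENTION_MS <= request.retention_ms <= MAX_RETENTION_MS:
        errors.append(
            f"retention_ms must be between {MIN_RETENTION_MS} and "
            f"{MAX_RETENTION_MS}, got {request.retention_ms}"
        )
    if (
        request.environment == "prod"
        and request.replication_factor < PROD_MIN_REPLICATION_FACTOR
    ):
        errors.append(
            f"production topics need replication factor "
            f">= {PROD_MIN_REPLICATION_FACTOR}, got {request.replication_factor}"
        )
    return errors
```

Returning every violation at once, rather than stopping at the first, is what lets the developer fix the request in one pass instead of resubmitting four times.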

> This post is about the operational toil: the tickets, the waiting, the manual steps. The governance and policy side has its own walkthrough in [Kafka policy enforcement](https://www.conduktor.io/blog/kafka-policy-enforcement) and [governed self-service](https://www.conduktor.io/blog/governed-kafka-self-service).

## What to automate, in order

You don't need to automate everything at once. Each step stands on its own and buys back time you can spend on the next one.

1. **Topic creation.** It's the most common request and the easiest to get right. Encode your naming convention as a pattern (`team.domain.entity.version`), set partition and retention bounds, require replication factor 3 in production. The request hits validation first, whether it comes through Console, the [CLI](https://docs.conduktor.io/guide/conduktor-in-production/automate/cli-automation), or a [pull request](https://docs.conduktor.io/guide/conduktor-in-production/automate/terraform-automation). Pass, and it's created and owned. Fail, and you get a clear error. This one change usually takes the biggest bite out of the ticket queue.
2. **Access at the application level, not per individual.** Granting ACLs to individual service accounts by hand is where mistakes and wildcards creep in. Define an application that owns a pattern instead: the `orders-service` owns `orders.*`, and its account gets the right permissions automatically. A new orders topic? The ACLs follow the ownership rule. Nobody hand-writes them at midnight.
3. **Schema compatibility.** A schema that breaks consumers shouldn't be registrable in the first place. Require a compatibility mode and check it at registration time. The producer team finds out before they ship, not after a downstream consumer falls over.
4. **Exceptions, on purpose.** Most requests should be self-service. A few shouldn't: cross-team data access, a retention window past the normal limit, production access for a brand-new app. Route those through an approval workflow where the data owner makes the call, not the platform team. The platform team shouldn't be the gatekeeper for decisions it doesn't own.
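Step 2 is the one people find hardest to picture, so here is a rough sketch of ACLs following an ownership rule. The `Application` and `AclGrant` types, the `sa-` account names, and the choice of operations are all hypothetical; the only thing taken from the list above is that `orders-service` owns `orders.*`. Pattern matching uses glob semantics from Python's standard `fnmatch` module, which is an assumption about how ownership patterns are interpreted.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Application:
    name: str
    service_account: str   # hypothetical account naming
    owned_pattern: str     # glob-style, e.g. "orders.*"


@dataclass(frozen=True)
class AclGrant:
    principal: str
    topic: str
    operations: tuple[str, ...]


# Hypothetical registry of applications and the topic patterns they own.
APPLICATIONS = [
    Application("orders-service", "sa-orders-service", "orders.*"),
    Application("billing-service", "sa-billing-service", "billing.*"),
]


def acls_for_new_topic(topic: str, applications: list[Application]) -> list[AclGrant]:
    """Derive grants from ownership: any application whose pattern matches gets access."""
    return [
        AclGrant(app.service_account, topic, ("READ", "WRITE", "DESCRIBE"))
        for app in applications
        if fnmatchcase(topic, app.owned_pattern)
    ]
```

Calling `acls_for_new_topic("orders.created.v1", APPLICATIONS)` yields a single grant for `sa-orders-service`, and a topic that no application owns yields none, so nothing is granted by default.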

Notice the shape of it. Every step moves a decision from "a person does it each time" to "the system does it, because a person set the rule once." That's the whole game.
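Even the exceptions in step 4 can be written down as a rule. The sketch below is one possible way to do that, not a description of any real workflow engine. The `Request` and `Decision` types and the 30-day limit are assumptions; the three triggers are the ones listed in step 4, and the approver is always the owning team, never the platform team.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

MAX_SELF_SERVICE_RETENTION_DAYS = 30  # illustrative limit, not a recommendation


@dataclass(frozen=True)
class Request:
    requesting_team: str
    owning_team: str
    retention_days: Optional[int] = None
    environment: str = "dev"
    app_is_new: bool = False


@dataclass(frozen=True)
class Decision:
    approver: Optional[str]      # None means the request is self-service
    reasons: tuple[str, ...]


def route(request: Request) -> Decision:
    """Send a request to the data owner only when it matches an exception rule."""
    reasons = []
    if request.requesting_team != request.owning_team:
        reasons.append("cross-team data access")
    if (
        request.retention_days is not None
        and request.retention_days > MAX_SELF_SERVICE_RETENTION_DAYS
    ):
        reasons.append("retention past the normal limit")
    if request.environment == "prod" and request.app_is_new:
        reasons.append("production access for a new application")
    if reasons:
        return Decision(approver=request.owning_team, reasons=tuple(reasons))
    return Decision(approver=None, reasons=())
```

A request from the owning team inside normal limits comes back with `approver=None` and goes straight through; anything else lands with the team that owns the data, along with the reasons it was escalated.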

## But what if someone asks for 10,000 partitions?

This is the real fear with self-service, and it's a fair one. What if someone:

- sets retention to one millisecond
- grants a wildcard ACL to everyone
- asks for ten thousand partitions on a whim

They can't, because the guardrails run before Kafka ever sees the request. You're not handing developers root on the cluster. You're handing them a validated path that can't produce a broken state: partition floors and ceilings, retention bounds, a replication minimum. A topic that violates the policy is rejected with a reason, the same way a type checker rejects code before it ever runs.
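The wildcard case is worth a sketch of its own, because it is the one that bounds on numbers don't catch. This is a simplified, hypothetical guard: the function name and messages are made up, and a real platform would also check that the resource falls inside a pattern the application owns. It relies on two Kafka conventions: `User:*` as a principal matches every user, and a literal resource name of `*` matches every resource of that type.

```python
from __future__ import annotations


def check_acl_grant(principal: str, resource_name: str) -> list[str]:
    """Reject grants that would apply to every user or every resource."""
    errors = []
    if principal.strip() in {"*", "User:*"}:
        errors.append("principal must be a specific service account, not a wildcard")
    if resource_name.strip() == "*":
        errors.append("resource must be scoped to a name or prefix, not '*'")
    return errors
```

A grant for `User:sa-orders-service` on the `orders.` prefix passes with no errors, while `User:*` on `*` is rejected on both counts before it reaches the cluster.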

**Self-service without guardrails is chaos. Guardrails without self-service is the ticket queue you already have.** You want both.

## How do you know it's working?

Three things tell you:

- **Lead time** from "I need a topic" to "it's ready." Manual, that's days. Automated, minutes.
- **Ticket volume** for routine requests. If those drop while incident tickets stay flat, the team has room to breathe again.
- **Time to resolve** when something does break. That improves when ownership is recorded and the context is one click away, instead of buried in old Slack threads nobody can find.
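Lead time is the easiest of the three to compute if you log when a request was made and when the resource became usable. A minimal sketch, assuming you can export those two timestamps per request (the function name and the input shape are hypothetical):

```python
from __future__ import annotations

from datetime import datetime
from statistics import median


def median_lead_time_minutes(requests: list[tuple[datetime, datetime]]) -> float:
    """Median minutes between 'requested' and 'ready' across (requested, ready) pairs."""
    durations = [
        (ready - requested).total_seconds() / 60
        for requested, ready in requests
    ]
    return median(durations)
```

The median is used here rather than the mean so that one request stuck for two weeks doesn't hide the fact that most are now done in minutes; track the slow outliers separately.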

For what it's worth, the teams we've helped through this tend to land around 75% fewer provisioning tickets and roughly 4x faster provisioning ([our own numbers](https://www.conduktor.io/federated-ownership), so weigh them as such). But the metric I'd actually watch is softer: are developers creating topics without asking permission first, and is the platform team building things again instead of answering the same request for the tenth time?

## The platform team was never the bottleneck

When provisioning is slow, the instinct is to ask for more platform engineers. Usually that isn't the fix, and usually it isn't approved anyway.

The bottleneck isn't the people. It's that the work runs at the speed of whoever happens to be free to do it by hand. Move the routine decisions into policy, give developers a path they can't break, and the platform team gets to go back to the work they actually wanted to do.

Your platform team isn't a queue. There's no reason to keep treating them like one.

---

**Learn more**: [How Conduktor automates Kafka governance at scale →](https://www.conduktor.io/federated-ownership)
