Kafka Access Management: Fix ACL Sprawl

Kafka ACL sprawl makes permissions unknowable. Application-based access with periodic reviews replaces manual ACL-per-service-account chaos.

Stéphane Derosiaux · October 28, 2025

ACLs don't document themselves.

After six months of operation, most Kafka clusters have hundreds of ACLs. Some were granted for production services still running. Some were granted for testing and never revoked. Some use wildcard patterns that matched 5 topics when created but now match 50. Nobody remembers why they all exist, and explaining what they collectively allow requires archaeology through ticket systems and Slack logs.

This is ACL sprawl: permissions accumulate faster than they're cleaned up, patterns grant broader access than intended, and institutional knowledge lives in individuals' heads rather than systems. The security issue isn't individual ACLs—it's that nobody can answer "who can access topic X?" without hours of investigation.

Real access management means: ownership is clear (who can approve access), permissions map to business logic (application-based, not individual grants), access reviews happen automatically (stale permissions are flagged), and answers to "who has access?" take seconds, not hours.

Why ACL Sprawl Happens

Lack of cleanup is the primary cause. When a service is decommissioned, its service account should lose access. But if nobody documents which service accounts map to which applications, cleanup doesn't happen. Permissions accumulate.

Over time, clusters have ACLs for: services that were deleted months ago, test consumers that ran once during development, individual engineers who left the company, and temporary access granted for migrations that finished long ago.

Overly broad grants happen because narrowly scoped permissions require more work. It's easier to grant service-x read access to topic:* (all topics) than to list the three specific topics it needs.

The problem manifests over time. When the grant was created, the cluster had 10 topics and topic:* was acceptable. Six months later, the cluster has 100 topics including sensitive PII data. The wildcard grant now provides access to topics that service-x shouldn't touch.
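The drift described above can be made concrete with a small sketch. The topic names are hypothetical, and Python's fnmatch is only a stand-in for pattern matching here; Kafka itself matches ACLs through LITERAL or PREFIXED resource patterns.

```python
from fnmatch import fnmatchcase


def topics_matched(pattern: str, topics: list[str]) -> list[str]:
    """Return the topics that a glob-style pattern matches right now."""
    return [t for t in topics if fnmatchcase(t, pattern)]


# Hypothetical cluster contents when the grant was created and six months later.
topics_at_grant = ["orders.created", "orders.updated", "metrics.cpu"]
topics_today = topics_at_grant + ["customers.pii", "payments.cards"]

# The grant itself never changed, but what it exposes did.
before = topics_matched("*", topics_at_grant)
after = topics_matched("*", topics_today)
newly_exposed = sorted(set(after) - set(before))
print(newly_exposed)  # ['customers.pii', 'payments.cards']
```

Running a comparison like this on a schedule is one way to notice that an old wildcard grant has quietly widened.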

No ownership mapping means you can't answer "who owns this topic?" If ownership is unclear, access reviews are impossible. You can't determine whether a service account should have access if you don't know who's responsible for approving it.

Teams work around this by asking in Slack: "Who owns the customer-events topic? Someone requested access and we don't know who to ask." This doesn't scale.

Application-Based Permissions

Individual ACL management doesn't scale. Application-based permissions do.

Application definitions group related resources under single ownership. Instead of "grant service-account-abc read access to orders-topic," define "orders-service owns topics matching orders.* and its service account gets appropriate permissions automatically."

Example application definition:

apiVersion: self-serve/v1
kind: Application
metadata:
  name: "orders-service"
spec:
  title: "Orders Service"
  owner: "platform-team"
---
apiVersion: self-serve/v1
kind: ApplicationInstance
metadata:
  application: "orders-service"
  name: "orders-service-prod"
spec:
  cluster: "prod-cluster"
  serviceAccount: "orders-service-prod"
  resources:
    - type: TOPIC
      patternType: PREFIXED
      name: "orders."
    - type: TOPIC
      patternType: PREFIXED
      ownershipMode: LIMITED
      name: "inventory."

From this declaration, the platform auto-generates:

ACLs allowing orders-service-prod to produce to orders.*
ACLs allowing orders-service-prod to consume from inventory.*
Ownership metadata linking orders.* topics to platform-team

When a new orders.shipment-confirmed topic is created, permissions apply automatically based on the pattern. No manual ACL creation, no ticket to the platform team, no opportunity for human error.
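A minimal sketch of how a declaration like the one above could expand into ACLs. This is not the platform's actual generator: the Acl class, the generate_acls and covers helpers, and the per-resource "operations" field are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str
    operation: str     # READ or WRITE
    pattern_type: str  # PREFIXED or LITERAL
    name: str


def generate_acls(service_account: str, resources: list[dict]) -> list[Acl]:
    """Expand each declared resource into one ACL per listed operation."""
    return [
        Acl(f"User:{service_account}", op, res["patternType"], res["name"])
        for res in resources
        for op in res["operations"]  # hypothetical field, not in the YAML above
    ]


def covers(acl: Acl, topic: str) -> bool:
    """A PREFIXED ACL covers any topic that starts with its name."""
    if acl.pattern_type == "PREFIXED":
        return topic.startswith(acl.name)
    return acl.name == topic


resources = [
    {"patternType": "PREFIXED", "name": "orders.", "operations": ["WRITE"]},
    {"patternType": "PREFIXED", "name": "inventory.", "operations": ["READ"]},
]
acls = generate_acls("orders-service-prod", resources)

# A topic created later is covered without any new ACL.
print(any(covers(a, "orders.shipment-confirmed") for a in acls))  # True
```

Because the ACLs are derived from the declaration rather than typed by hand, regenerating them always yields the same result for the same input.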

Benefits: Permissions map to business logic (which application needs what data), ACLs are generated consistently (no variance from manual creation), and ownership is documented (every topic maps to an owning team through patterns).

Principle of Least Privilege at Scale

Least privilege means granting minimum access necessary. For Kafka, this translates to specific guidance.

Grant access to specific topics, not wildcards. If a service needs three topics, list them explicitly: orders.created, orders.updated, orders.cancelled. Avoid orders.* unless the service genuinely needs all topics matching that pattern.

Exception: Application-based permissions use patterns (orders.*) because they represent ownership. The orders-service owns all orders topics, so orders.* is appropriate. But external consumers shouldn't get wildcard access.

Grant read-only access unless writes are required. Consumers that only read data shouldn't have write permissions. This limits the blast radius if credentials are compromised: an attacker can read data but can't corrupt topics or publish malicious messages.

Use service accounts per application. Don't share credentials across services. If five services share one service account, you can't determine which service accessed which topic. Separate service accounts enable tracing access to specific applications.

Expire temporary access automatically. Access granted for testing, migration, or one-time analysis should have expiration dates. After expiration, permissions are automatically revoked. This prevents temporary access from becoming permanent.
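Automatic expiry can be sketched as a periodic check over grants that carry an expiration date. The Grant class, the expired helper, and the principal names below are hypothetical; the actual revocation call is left out because it depends on the tooling in use.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Grant:
    principal: str
    topic: str
    expires_at: Optional[datetime]  # None means the grant is permanent


def expired(grants: list[Grant], now: datetime) -> list[Grant]:
    """Return the grants whose expiration date has passed."""
    return [g for g in grants if g.expires_at is not None and g.expires_at <= now]


grants = [
    Grant("User:migration-job", "orders.created", datetime(2025, 3, 1, tzinfo=timezone.utc)),
    Grant("User:orders-service-prod", "orders.created", None),
]
now = datetime(2025, 10, 28, tzinfo=timezone.utc)
for g in expired(grants, now):
    print(f"revoke {g.principal} on {g.topic}")  # revoke User:migration-job on orders.created
```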

Access Request and Approval Workflows

Cross-team data access requires coordination. The analytics team wants to read orders data owned by the platform team. This shouldn't be a Slack thread—it should be a structured workflow.

Request submission: An analytics engineer submits an access request through Console, CLI, or GitOps:

apiVersion: self-serve/v1
kind: ApplicationInstancePermission
metadata:
  application: "orders-service"
  appInstance: "orders-service-prod"
  name: "analytics-reads-orders"
spec:
  resource:
    type: TOPIC
    name: "orders.created"
    patternType: LITERAL
  serviceAccountPermission: READ
  grantedTo: "analytics-app-prod"

Routing to owners: The request routes automatically to platform-team, identified as the owner of orders.* through the application definition. No manual lookup of "who owns this topic?" is needed.

Owner review: Platform team reviews:

Is this appropriate use? (yes, revenue reporting is legitimate)
Should access be scoped? (full access vs. masked PII fields)
What's the approval duration? (permanent vs. temporary)

Approval execution: Platform team approves. ACLs generate automatically. Analytics team gets access within minutes, not days.

Audit trail: Every step is logged: who requested, when, business justification, who approved, when approval happened. Compliance teams can report on data sharing without manual investigation.

Workflows integrate with GitOps: approval happens through pull request review. The request becomes YAML in Git, approval is a PR merge, and CI/CD applies the change, with full version control and rollback capability.
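The owner-routing step in this workflow can be sketched as a prefix lookup over the ownership patterns. The ownership map and the find_owner helper are hypothetical, and resolving overlaps by longest prefix is an assumption rather than documented platform behavior.

```python
from typing import Optional

# Hypothetical ownership map derived from application definitions: prefix -> team.
OWNERSHIP = {
    "orders.": "platform-team",
    "orders.analytics.": "analytics-team",
}


def find_owner(topic: str, ownership: dict[str, str]) -> Optional[str]:
    """Return the team owning the longest prefix that matches the topic."""
    matching = [prefix for prefix in ownership if topic.startswith(prefix)]
    if not matching:
        return None  # no owner on record: needs manual triage
    return ownership[max(matching, key=len)]


print(find_owner("orders.created", OWNERSHIP))          # platform-team
print(find_owner("orders.analytics.daily", OWNERSHIP))  # analytics-team
print(find_owner("customer-events", OWNERSHIP))         # None
```

A topic that resolves to None is exactly the "who owns this?" case that otherwise ends up in Slack, so it is worth surfacing rather than silently skipping.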

Access Reviews and Revocation

Permissions granted shouldn't persist indefinitely. Quarterly access reviews ensure permissions still match requirements.

Automated access review reminders: Every 90 days, owning teams receive reports: "These service accounts have access to your topics. Verify each is still needed."

The report includes:

Service account name
Which topics it can access
When access was granted
Last access timestamp (when did this account last read/write?)

Stale permission detection: If a service account hasn't accessed a topic in 90 days, flag it for review. This usually indicates one of the following:

Service was decommissioned but permissions weren't revoked
Service no longer uses this topic (logic changed, data source switched)
Service account is unused (test account, personal access)

Bulk revocation: After review, teams revoke stale permissions in bulk. Instead of individual ACL deletions, select "revoke all flagged permissions" and execute.

Service account lifecycle management: When services are decommissioned, service accounts should be deleted automatically. Integration with service catalogs (tracking which services exist) enables automatic revocation when services are marked as retired.
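Stale permission detection can be sketched as a filter over last-access timestamps. Where those timestamps come from (audit logs, for example) is outside this sketch, and the flag_stale helper and the sample principals are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(days=90)


def flag_stale(
    last_access: dict[tuple[str, str], Optional[datetime]],
    now: datetime,
) -> list[tuple[str, str]]:
    """Return (principal, topic) pairs unused for 90 days or never used."""
    return sorted(
        key
        for key, seen in last_access.items()
        if seen is None or now - seen >= STALE_AFTER
    )


now = datetime(2025, 10, 28, tzinfo=timezone.utc)
last_access = {
    ("User:orders-service-prod", "orders.created"): now - timedelta(days=1),
    ("User:test-consumer", "orders.created"): now - timedelta(days=200),
    ("User:old-migration", "inventory.stock"): None,  # granted but never used
}
print(flag_stale(last_access, now))
```

The output is a review list, not a revocation list: an account that reads a topic once a quarter would also show up here, which is why the owning team confirms before anything is removed.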

Visualizing and Understanding ACLs

Kafka ACLs are difficult to reason about. The kafka-acls CLI lists grants but doesn't explain what they collectively allow.

ACL visualization shows: which principals have access to which topics, with what permissions, granted when, and by whom. Instead of parsing 500 lines of kafka-acls output, engineers see a visual representation.

Example view:

Topic: orders.created
  Read access:
    - analytics-processor (granted 2025-01-15 by platform-team)
    - fraud-detection (granted 2024-11-20 by security-team)
  Write access:
    - orders-service (granted 2024-10-01 by platform-team)

This answers "who can access this topic?" in seconds.
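A view like this can be derived by resolving every ACL against the topic name. The sketch below handles ALLOW rules only and ignores DENY rules; the AclBinding class, the helpers, and the sample bindings are hypothetical. The matching rules themselves follow Kafka's pattern types: LITERAL matches the exact name (or every resource when the name is *), and PREFIXED matches any name starting with the prefix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclBinding:
    principal: str
    operation: str     # READ or WRITE
    pattern_type: str  # LITERAL or PREFIXED
    name: str


def matches(acl: AclBinding, topic: str) -> bool:
    """Apply LITERAL (including the '*' wildcard) and PREFIXED matching."""
    if acl.pattern_type == "LITERAL":
        return acl.name == "*" or acl.name == topic
    if acl.pattern_type == "PREFIXED":
        return topic.startswith(acl.name)
    return False


def who_can_access(topic: str, acls: list[AclBinding]) -> dict[str, list[str]]:
    """Group the principals that can reach a topic by operation."""
    result: dict[str, set[str]] = {}
    for acl in acls:
        if matches(acl, topic):
            result.setdefault(acl.operation, set()).add(acl.principal)
    return {op: sorted(principals) for op, principals in result.items()}


acls = [
    AclBinding("User:analytics-processor", "READ", "LITERAL", "orders.created"),
    AclBinding("User:fraud-detection", "READ", "PREFIXED", "orders."),
    AclBinding("User:orders-service", "WRITE", "PREFIXED", "orders."),
    AclBinding("User:inventory-service", "WRITE", "PREFIXED", "inventory."),
]
print(who_can_access("orders.created", acls))
# {'READ': ['User:analytics-processor', 'User:fraud-detection'], 'WRITE': ['User:orders-service']}
```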

Permission testing: Before granting access, test "what would this ACL allow?" If granting analytics-* read access to orders.*, the visualization shows that this grants access to 47 topics, including orders.sensitive-pii. Is that intended?

Testing prevents accidental over-permissioning before it happens.

Impact analysis: Before revoking access, check "what breaks if we remove this ACL?" If revoking service-x read access to customer-events, does service-x actively consume that topic? Impact analysis shows the last access timestamp and consumer lag, indicating whether service-x depends on the permission.

Measuring Access Management Health

Track three metrics: permission sprawl rate, access review completion, and time to grant access.

Permission sprawl rate measures ACL growth over time. If ACLs increase from 200 to 800 over six months but topic count only doubled, sprawl is happening. Old permissions aren't being cleaned up.

Target: ACL growth should correlate with resource growth. If topics increase 2x, ACLs should increase roughly 2x (accounting for new consumers).
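One way to express this target as a number is the ratio of ACL growth to topic growth. The sprawl_ratio helper is a hypothetical sketch, not a standard metric definition.

```python
def sprawl_ratio(acls_before: int, acls_after: int,
                 topics_before: int, topics_after: int) -> float:
    """ACL growth divided by topic growth; about 1.0 means they grew in step."""
    acl_growth = acls_after / acls_before
    topic_growth = topics_after / topics_before
    return acl_growth / topic_growth


# ACL counts are from the example above; the topic counts are hypothetical
# (the text only says the topic count doubled).
ratio = sprawl_ratio(acls_before=200, acls_after=800,
                     topics_before=50, topics_after=100)
print(ratio)  # 2.0
```

A ratio of 2.0 means ACLs grew twice as fast as topics over the period, which is the signal that old permissions are not being cleaned up.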

Access review completion measures whether teams complete quarterly reviews. If only 60% of teams complete reviews on time, the other 40% are probably accumulating stale permissions.

Target: 90%+ completion rate within 30 days of review cycle start.

Time to grant access measures lead time from request to approval. Manual processes measure this in days. Automated workflows measure it in hours.

Target: Under 4 hours for standard requests (no exceptions needed), under 1 day for exception requests (require special approval).

Security Implications

Poor access management creates security risks.

Privilege escalation: If permissions aren't reviewed, low-privilege accounts can accumulate high-privilege access over time. An attacker who compromises a test account may discover that it still has production read access that was never revoked after testing finished.

Data exfiltration: Overly broad permissions put more data at risk from a single compromise. If analytics accounts have wildcard access to all topics (not just analytics-relevant ones), compromised analytics credentials expose all data, not just what analytics needs.

Compliance violations: Auditors ask "who accessed customer PII in the last year?" Without access logs and permission tracking, this question takes weeks to answer through log archaeology. With proper access management, it's a report generated in seconds.

Insider threats: Employees with unnecessary access can exfiltrate data intentionally. Least privilege and access reviews limit what any individual can access, reducing insider threat risk.

The Path Forward

Kafka access management scales through application-based permissions (not individual ACLs), approval workflows (not Slack threads), and automated access reviews (not manual quarterly meetings).

Conduktor provides application catalogs for ownership, approval workflows for cross-team access, ACL visualization for understanding permissions, and automated access review reminders. Teams know who has access to what, why they have it, and whether they still need it—without manual investigation.

If explaining your Kafka permissions requires archaeology, the problem isn't Kafka ACLs—it's the lack of structure around them.