What is Kafka Connect?
Kafka Connect is a tool to stream data between Apache Kafka and other data systems in a reliable & scalable way. Kafka Connect makes it simple to quickly start “connectors“ to move continuous & large data sets into Kafka or out of Kafka. It was released circa Nov-2015 alongside Kafka 0.9.
What is a Connector?
Kafka Connect manages many “sources“ and “sinks“ technologies where data can be stored
- Common sources are PostgreSQL, MySQL, JDBC, Cassandra, DynamoDB, MongoDB, Solr, SQS, Redis etc.
- Common sinks are Amazon S3, Elasticsearch, HDFS, most databases, Cassandra, DynamoDB, MongoDB, Hbase, Redis, solr etc.
How to create a Connector in Conduktor
Conduktor helps you creating Kafka Connect connectors by providing a rich interface experience to guide you.
First, it will list all the sources and sinks supported by our Kafka Connect cluster when you want to create one:
If the one we are looking for is not there, you must install them on your Kafka Connect cluster (drop the correspond .jar). We can click on It’s not listed here which takes us to the official Confluent Hub where we can find hundreds of connectors, and install them using
confluent-hub command line and/or add this to some Dockerfile for instance.
Kafka Connect is about reusing the wheel
It is recommended to leverage existing connectors and not write them ourselves.
Other companies (including Confluent itself) and talented developers already wrote most of the connectors for us. They already have done a very good job in writing these sources and sinks connectors which are battle-tested.
Kafka Connect is really about: don’t write code and use a battled-tested connector instead. We just provide a piece of configuration and we’re good to go. Hopefully, it is still possible to write a new connector plugin from scratch.
When we have created our connectors with our custom configurations, Kafka Connect will create and manage “tasks“, by creating them on its Kafka Connect “workers“ (Kafka Connect is a cluster of workers). Tasks are the minimum unit of processing in the data model of Kafka Connect.
Each connector instance coordinates a set of tasks to actually move, copy, or process the data, in or out. By allowing the connector to break a single job into many tasks, Kafka Connect provides built-in support for parallelism and scalability. Why not enjoy this freely?
Kafka Connect Architecture
Here is a logical representation of the different parts:
- Our sources data: databases, JDBC, MongoDB, Redis, Solr etc., that we want to copy to our Kafka cluster
- In between our source data and our Kafka cluster: a Kafka Connect cluster, made of multiple Kafka Connect workers where connectors & tasks are running. The tasks are pulling data from the sources and push them safely to our Kafka cluster.
- We can also send our data from our Kafka cluster, to any sink: Amazon S3, Cassandra, Redis, MongoDB, HDFS, etc. In the same way as previously, the tasks will pull data from the Kafka cluster and write them to our sinks.
We’re always stronger together
The workers are the instances of a same Kafka Connect cluster. A cluster can be composed of 1 worker up to 1000 workers and beyond.
There are two ways of running Kafka Connect workers:
A single Kafka Connect instance runs all our connectors and tasks (one worker).
It is very simple to setup and is the preferred way for development and testing purposes.
It’s quite limited. It does not provide any fault tolerance if the server crashes: the processing is stopped, data can be lost, there is no way to recover them. It’s also delicate to scale up a Kafka Connect standalone instance, as it only supports vertical scaling (more CPU, more memory).
Kafka Connect in standalone mode relies on a local file (configured by
offset.storage.file.filename). It is used by source connectors, to keep track of the source offsets from the source system. The next time the connector is restarted, it will read this file, and know from where to start in the source (instead of starting from scratch).
Distributed mode: scale-up and beyond
This time, multiple Kafka Connect instance are running our connectors and tasks, and are coordinating each other.
It is easy to scale horizontally: just start more instances! They use the Kafka cluster itself to be “connected”. All the workers are connected to Kafka and share the same
group.id configuration (to declare they “belong“ together).
The workers will automatically synchronize themselves up to schedule and distribute the work across all the available workers. When a worker crashes (or if we shutdown one of them manually for maintenance, or if we upgrade one, etc.), the other workers will quickly detect it, and a “rebalance“ will be automatically triggered to re-distribute the work belong the alive workers (all the work or just the crashed tasks, according to the configuration).
In distributed mode, the workers also use the Kafka cluster to save their data and configuration (like a database, instead of a simple file as in standalone mode). We just need to configure some properties to tell Kafka Connect where to save its data in Kafka:
status.storage.topic and this makes Kafka Connect “distributed“.
How Conduktor helps us working with Kafka Connect clusters?
Let’s say we have 3 Kafka Connect instances, running 3x connectors doing a copy from some database tables to our Kafka cluster:
We can see our three connectors are in “Running“ state: they are listening to our database and are copying data in “almost realtime“ to our Kafka cluster.
We can select one or multiple connectors and Pause/Resume, Restart (after a change of configuration for instance), or Stop them. “Stopping” a connector means removing it from our Kafka Connect cluster, it’s gone.
If an instance crashes or is stopped, Conduktor will alert you:
The number of available instances is now 2 instead of the expected 3.
Kafka Connect failover
Let’s say the three connectors have 2 tasks each which are spread amongst these three workers:
Worker 2 dies, a rebalancing occurs automatically: tasks of Worker 2 are re-assigned to the available workers Worker 1 and 3:
As mentioned above, source offsets (the checkpoint the connectors used to remember where they “were” in the source data) are stored in a Kafka topic in distributed mode. Therefore, in case of failure, Kafka Connect recovers the necessary source offsets from its topic, and start new tasks from these previously committed offsets.
Kafka Connect is one of the nicest accelerators around Apache Kafka, to integrate our legacy databases into Apache Kafka, or to populate our large BI databases from realtime data processed by Kafka. All this without coding anything! How magic is that?
Also, don’t use the standalone mode with Kafka Connect, as it is unreliable.