All articles

Why use Kafka Connect?

~ NaN minutes

Using Conduktor to stream data between Apache Kafka and other data systems in a reliable and scalable way.

Written by
Author's avatarStéphane Derosiaux
Published onJan 13, 2021
Blog's image cover

    What is Kafka Connect?

    Kafka Connect is a tool to stream data between Apache Kafka and other data systems in a reliable & scalable way. Kafka Connect makes it simple to quickly start “connectors“ to move continuous & large data sets into Kafka or out of Kafka. It was released circa Nov-2015 alongside Kafka 0.9.

    What is a Connector?

    Kafka Connect manages many “sources“ and “sinks“ technologies where data can be stored

    • Common sources are PostgreSQL, MySQL, JDBC, Cassandra, DynamoDB, MongoDB, Solr, SQS, Redis, etc.
    • Common sinks are Amazon S3, Elasticsearch, HDFS, most databases, Cassandra, DynamoDB, MongoDB, Hbase, Redis, solr, etc.

    How to create a Connector in Conduktor

    Conduktor helps you create Kafka Connect connectors by providing a rich interface experience to guide you.

    You can see all of your source and sink connectors listed, with their status displayed, in the Kafka Connect section:

    Creating a Connector in Conduktor

    Adding a Connector requires 2 things: first, you must have setup Kafka Connect within your Cluster when creating it. We provide an example Connect cluster for you.

    With that done, you will then need to generate a JSON configuration for your chosen connector. Full instructions on how to do both of these steps are available in the Conduktor documentation.

    To find more connectors, you can head to the Confluent Hub where we can find hundreds of connectors, and install them using confluent-hub command line and/or add this to some Dockerfile for instance.

    Kafka Connect is about reusing the wheel

    It is recommended to leverage existing connectors and not write them ourselves.

    Other companies (including Confluent itself) and talented developers already wrote most of the connectors for us. They already have done a very good job in writing these source and sink connectors which are battle-tested.

    Kafka Connect is really about: don't write code and use a battled-tested connector instead. We just provide a piece of configuration and we're good to go. Hopefully, it is still possible to write a new connector plugin from scratch.

    When we have created our connectors with our custom configurations, Kafka Connect will create and manage “tasks“, by creating them on its Kafka Connect “workers“ (Kafka Connect is a cluster of workers). Tasks are the minimum unit of processing in the data model of Kafka Connect.

    Each connector instance coordinates a set of tasks to actually move, copy, or process the data, in or out. By allowing the connector to break a single job into many tasks, Kafka Connect provides built-in support for parallelism and scalability. Why not enjoy this freely?

    Kafka Connect Architecture

    Here is a logical representation of the different parts:

    Kafka Connect Architecture

    • Our sources data: databases, JDBC, MongoDB, Redis, Solr, etc., that we want to copy to our Kafka cluster
    • In between our source data and our Kafka cluster: a Kafka Connect cluster, made of multiple Kafka Connect workers where connectors & tasks are running. The tasks are pulling data from the sources and pushing them safely to our Kafka cluster.
    • We can also send our data from our Kafka cluster, to any sink: Amazon S3, Cassandra, Redis, MongoDB, HDFS, etc. In the same way as previously, the tasks will pull data from the Kafka cluster and write them to our sinks.

    We're always stronger together

    The workers are the instances of the same Kafka Connect cluster. A cluster can be composed of 1 worker up to 1000 workers and beyond.

    There are two ways of running Kafka Connect workers:

    • Standalone
    • Distributed

    Standalone mode

    A single Kafka Connect instance runs all our connectors and tasks (one worker).

    It is very simple to setup and is the preferred way for development and testing purposes.

    It's quite limited. It does not provide any fault tolerance if the server crashes: the processing is stopped, data can be lost, and there is no way to recover them. It's also delicate to scale up a Kafka Connect standalone instance, as it only supports vertical scaling (more CPU, more memory).

    Kafka Connect in standalone mode relies on a local file (configured by offset.storage.file.filename). It is used by source connectors, to keep track of the source offsets from the source system. The next time the connector is restarted, it will read this file, and know from where to start in the source (instead of starting from scratch).

    Distributed mode: scale-up and beyond

    This time, multiple Kafka Connect instances are running our connectors and tasks, and are coordinating each other.

    It is easy to scale horizontally: just start more instances! They use the Kafka cluster itself to be "connected". All the workers are connected to Kafka and share the same group.id configuration (to declare they “belong“ together).

    The workers will automatically synchronize themselves up to schedule and distribute the work across all the available workers. When a worker crashes (or if we shutdown one of them manually for maintenance, or if we upgrade one, etc.), the other workers will quickly detect it, and a “rebalance“ will be automatically triggered to re-distribute the work belong the alive workers (all the work or just the crashed tasks, according to the configuration).

    In distributed mode, the workers also use the Kafka cluster to save their data and configuration (like a database, instead of a simple file as in standalone mode). We just need to configure some properties to tell Kafka Connect where to save its data in Kafka: config.storage.topic, offset.storage.topic, and status.storage.topic and this makes Kafka Connect “distributed“.

    How Conduktor helps us work with Kafka Connect clusters?

    Let's go back to our earlier Conduktor example, in which we had 1 Kafka connect instance, running 2 Kafka connectors, one a sink and one a source:

    Creating a Connector in Conduktor

    We can see our two connectors are in “Running“ state: they performing actions in “almost realtime“ to our Kafka cluster, with the source drawing from wikipedia and our sink sending to elasticsearch.

    We can select one or multiple connectors and Pause/Resume, Restart (after a change of configuration for instance), or remove them.

    ControlConnectors

    If a connector crashes or is paused, Conduktor will alert you:

    PausedConnector The number of available connectors and tasks is now 1 instead of the expected 2.

    Kafka Connect failover

    Let's say the three connectors have 2 tasks each which are spread amongst these three workers:

    Kafka Connect failover

    Worker 2 dies, a rebalancing occurs automatically: tasks of Worker 2 are re-assigned to the available workers Worker 1 and 3:

    Kafka Connect failover

    As mentioned above, source offsets (the checkpoint the connectors used to remember where they "were" in the source data) are stored in a Kafka topic in distributed mode. Therefore, in case of failure, Kafka Connect recovers the necessary source offsets from its topic, and starts new tasks from these previously committed offsets.

    Conclusion

    Kafka Connect is one of the nicest accelerators around Apache Kafka, to integrate our legacy databases into Apache Kafka, or to populate our large BI databases from realtime data processed by Kafka. All this without coding anything! How magic is that?

    Also, don't use the standalone mode with Kafka Connect, as it is unreliable.