Kafka Connect CLI tutorial

Learn how to use Kafka Connect in standalone mode

Kafka Connect provides a scalable way to move data between Kafka and external systems. This tutorial demonstrates running a connector in standalone mode using real-time Wikipedia changes.

What you'll learn:

What Kafka Connect is and when to use it
How to configure a standalone connector
How to set up the required properties files
How to verify data flowing into Kafka

Live data stream
This tutorial uses the Wikipedia recent changes stream - a real-time feed of edits happening across Wikipedia.

What is Kafka Connect?

Kafka Connect is a framework for streaming data between Kafka and external systems using reusable connectors. Instead of writing custom producer/consumer code for common integrations, you can use pre-built connectors.

Flowchart: source systems (database, S3, APIs) feed source connectors, which write to Kafka topics, which sink connectors read and deliver to sink systems (Elasticsearch, data warehouse, S3)

Connector type	Direction	Examples
Source	External → Kafka	Debezium, JDBC, S3, MongoDB, Twitter
Sink	Kafka → External	Elasticsearch, S3, JDBC, HDFS, Splunk

Find connectors on Confluent Hub.

How to use Kafka Connect in standalone mode

To use Kafka Connect in standalone mode, follow these steps:

Download a Kafka Connect connector, either from GitHub or from Confluent Hub
Create a configuration file for your connector
Use the connect-standalone.sh CLI to start the connector

Example: Kafka Connect standalone with Wikipedia data

Create the Kafka topic wikipedia.recentchange with 3 partitions:

kafka-topics --bootstrap-server localhost:9092 --topic wikipedia.recentchange --create --partitions 3 --replication-factor 1

Also create the dead letter queue topic wikipedia.dlq to catch any errors:

kafka-topics --bootstrap-server localhost:9092 --topic wikipedia.dlq --create --partitions 3 --replication-factor 1
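
Before moving on, it can help to confirm that both topics exist. The following is a minimal sketch, assuming the same broker address as above; topics_present is a hypothetical helper name, not part of the Kafka CLI. It reads the output of kafka-topics --list and reports any expected topic that is missing.

```shell
# Assumption: the broker listens on localhost:9092, as in the commands above.
BOOTSTRAP="localhost:9092"

# topics_present NAME...: read a topic listing on stdin (one name per line)
# and return 0 only when every NAME appears in it.
topics_present() {
  topic_listing="$(cat)"
  for t in "$@"; do
    printf '%s\n' "$topic_listing" | grep -qxF -- "$t" \
      || { echo "missing topic: $t"; return 1; }
  done
}

# Only contact the broker when the Kafka CLI is installed.
if command -v kafka-topics >/dev/null 2>&1; then
  kafka-topics --bootstrap-server "$BOOTSTRAP" --list \
    | topics_present wikipedia.recentchange wikipedia.dlq \
    && echo "both topics exist"
fi
```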

Download the release JAR and configuration from here and unzip the archive on your computer at kafka_2.13-2.8.1/connectors/kafka-connect-sse:

~/kafka_2.13-2.8.1/connectors  ls -R
kafka-connect-sse

./kafka-connect-sse:
connector.properties                            kafka-connect-sse-1.0-jar-with-dependencies.jar

Edit the configuration file connectors/kafka-connect-sse/connector.properties with the following properties:

name=sse-source-connector
tasks.max=1
connector.class=com.github.cjmatta.kafka.connect.sse.ServerSentEventsSourceConnector
topic=wikipedia.recentchange
sse.uri=https://stream.wikimedia.org/v2/stream/recentchange
errors.tolerance=all
errors.deadletterqueue.topic.name=wikipedia.dlq
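
A misspelled key in a properties file may go unnoticed, so a quick lookup can be worth running before starting the worker. The sketch below writes the same configuration to a scratch file and checks that the expected keys are present; errors.tolerance is the key name Kafka Connect documents. The get_prop helper and the scratch file location are assumptions made for illustration.

```shell
# Write the connector configuration to a scratch file (illustrative location).
CONF="$(mktemp)"
cat > "$CONF" <<'EOF'
name=sse-source-connector
tasks.max=1
connector.class=com.github.cjmatta.kafka.connect.sse.ServerSentEventsSourceConnector
topic=wikipedia.recentchange
sse.uri=https://stream.wikimedia.org/v2/stream/recentchange
errors.tolerance=all
errors.deadletterqueue.topic.name=wikipedia.dlq
EOF

# get_prop FILE KEY: print the value of KEY; exit status 1 if KEY is absent.
get_prop() {
  awk -F= -v k="$2" '
    $1 == k { sub(/^[^=]*=/, ""); print; found = 1; exit }
    END     { exit !found }
  ' "$1"
}

# Report any expected key that is missing or misspelled in the file.
for key in name connector.class topic sse.uri errors.tolerance; do
  get_prop "$CONF" "$key" >/dev/null || echo "missing key: $key"
done
```

To check your real file, pass its path instead of the scratch file, for example get_prop connectors/kafka-connect-sse/connector.properties errors.tolerance.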

Next, go to your Kafka installation directory (where your bin and config folders are) and edit the content of the config/connect-standalone.properties file:

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.flush.interval.ms=10000

# EDIT BELOW IF NEEDED
bootstrap.servers=localhost:9092
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=/Users/stephanemaarek/kafka_2.13-2.8.1/connectors

The last three lines matter most for getting everything to work.

In particular, plugin.path must point to the folder where you stored the Kafka connectors you downloaded earlier.

This must be an absolute path (not relative, and no ~ shortcut) to your connectors directory.

If this path is wrong, Kafka Connect cannot find the connector class and will stop shortly after starting.
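
The absolute-path requirement is easy to check before starting the worker. Here is a minimal sketch; check_plugin_path is a hypothetical helper, and it assumes plugin.path holds a single directory (Kafka Connect also accepts a comma-separated list, which this sketch does not handle).

```shell
# check_plugin_path FILE: validate the plugin.path entry of a worker
# properties file. Assumes a single directory, not a comma-separated list.
check_plugin_path() {
  plugin_dir="$(grep -E '^plugin\.path=' "$1" | head -n 1 | cut -d= -f2-)"
  case "$plugin_dir" in
    "")  echo "plugin.path is not set"; return 1 ;;
    /*)  ;;  # absolute path, keep checking
    *)   echo "plugin.path is not absolute: $plugin_dir"; return 1 ;;
  esac
  [ -d "$plugin_dir" ] || { echo "plugin.path directory not found: $plugin_dir"; return 1; }
  echo "plugin.path looks fine: $plugin_dir"
}
```

For example, check_plugin_path ~/kafka_2.13-2.8.1/config/connect-standalone.properties prints a one-line verdict. A value such as ~/connectors is rejected because the worker reads the file literally and does not expand ~.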

Next, start the Kafka Connect standalone worker, passing the worker properties first and the connector properties second:

connect-standalone ~/kafka_2.13-2.8.1/config/connect-standalone.properties ~/kafka_2.13-2.8.1/connectors/kafka-connect-sse/connector.properties
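
Once the worker is running, its REST API can report whether the connector and its task started. The sketch below assumes the worker listens on the default REST port 8083 (the worker properties shown earlier do not change it) and that curl is installed; status_url and connector_status are hypothetical helper names.

```shell
# Assumption: the Connect worker exposes its REST API on the default port 8083.
CONNECT_URL="http://localhost:8083"

# status_url NAME: build the status endpoint for a connector.
status_url() {
  printf '%s/connectors/%s/status\n' "$CONNECT_URL" "$1"
}

# connector_status NAME: fetch the connector and task state as JSON.
connector_status() {
  curl -s "$(status_url "$1")"
}
```

Running connector_status sse-source-connector from a second terminal should return a JSON document with a state for the connector and for each task.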

And as we can see, the data is flowing into our wikipedia.recentchange topic:

kafka-console-consumer --bootstrap-server localhost:9092 --topic wikipedia.recentchange
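
The consumer above keeps printing until you stop it. To look at only a handful of records, a bounded variant can be sketched as follows; sample_topic is a hypothetical wrapper, and the default of 5 records and the 10-second timeout are arbitrary choices.

```shell
# sample_topic TOPIC [COUNT]: print up to COUNT records (default 5) from the
# beginning of TOPIC, then exit. Also exits if no record arrives for 10 seconds.
sample_topic() {
  kafka-console-consumer --bootstrap-server localhost:9092 \
    --topic "$1" --from-beginning \
    --max-messages "${2:-5}" --timeout-ms 10000
}
```

For example, sample_topic wikipedia.recentchange 3 prints three records and returns. Because the worker uses JsonConverter with schemas.enable=true, each record is a JSON envelope with a schema field and a payload field.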

Standalone vs distributed mode

Mode	Use case	Scalability
Standalone	Development, testing, single tasks	Single worker
Distributed	Production, high availability	Multiple workers

Standalone mode runs on a single machine with no fault tolerance. For production deployments, use distributed mode with multiple workers.
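
In distributed mode, connectors are usually submitted to the worker's REST API as JSON rather than passed as a properties file on the command line. As a rough sketch of that difference, the hypothetical helper below converts a connector properties file into the JSON body expected by POST /connectors. It assumes simple key=value lines whose values contain no quotes or backslashes.

```shell
# props_to_json NAME FILE: print a JSON body for POST /connectors.
# Assumes simple key=value lines with no quotes or backslashes in the values.
props_to_json() {
  awk -F= -v name="$1" '
    /^[ \t]*(#|$)/ { next }   # skip comments and blank lines
    $1 == "name"   { next }   # the name goes at the top level instead
    {
      key = $1
      val = $0
      sub(/^[^=]*=/, "", val) # keep any "=" that appears inside the value
      pairs = pairs sep "\"" key "\":\"" val "\""
      sep = ","
    }
    END { printf "{\"name\":\"%s\",\"config\":{%s}}\n", name, pairs }
  ' "$2"
}
```

Assuming a distributed worker on the default port 8083, the output could be submitted with:
props_to_json sse-source-connector connector.properties | curl -s -X POST -H "Content-Type: application/json" --data @- http://localhost:8083/connectors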

See it in practice with Conduktor
Conduktor Console provides a visual interface for managing Kafka Connect clusters, deploying connectors and monitoring connector health and throughput.

Next steps

Start programming with Kafka to build producers and consumers in code
Monitor Kafka clusters including Connect workers
Explore Confluent Hub for more available connectors