Teach yourself Apache Kafka and Python with a Jupyter Notebook


One of the best ways to learn a new technology is to try it in an assisted environment that anybody can replicate and get working within a few minutes. Notebooks excel in this field by allowing people to share and use pre-built content that combines written descriptions, media and executable code in a single page.

This blog post aims to teach you the basics of Apache Kafka Producers and Consumers through building an interactive notebook in Python. If you want to browse a full ready-made solution instead, check out our dedicated GitHub repository.

Language support and multi-window GUI: the case for JupyterLab

One of the main actors in the notebook space is the Jupyter project. With JupyterLab, it provides a solid web interface where it's possible to create and distribute notebooks written in a variety of languages (called kernels in Jupyter terms).

JupyterLab can be started as a Docker container. Using Docker allows us to focus on learning without having to deal with software installation and configuration: a big win for productivity. We start by creating a folder named kafka-jupyter and navigating to it. If you prefer the terminal over a GUI, you can achieve the same result by issuing the following two commands:

mkdir -p kafka-jupyter 
cd kafka-jupyter

Now we can start the Docker container with the following command:

docker run --rm -p 8888:8888 \
  -e JUPYTER_ENABLE_LAB=yes \
  -v "$PWD":/home/jovyan/work \
  jupyter/datascience-notebook

The above command creates a Jupyter Docker container with JupyterLab enabled, mapping the existing kafka-jupyter folder into it. This step is required if we want to share files from the host computer with the container. When the command is executed, we see a message like this:

To access the server, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/jpserver-9-open.html
Or copy and paste one of these URLs:
http://639a69244ab1:8888/lab?token=5031e1652236a8050ea2a9213df7c6ade24a790d3710b239
http://127.0.0.1:8888/lab?token=5031e1652236a8050ea2a9213df7c6ade24a790d3710b239

Now, JupyterLab is accessible at http://127.0.0.1:8888/?token=<token>, where <token> is the one shown in the above message.

One of the great things about JupyterLab is that it provides an easy GUI-driven method to configure and arrange the user interface. This makes it the perfect choice to learn not only sequential step-by-step tutorials, but also more complex and branched examples where multiple code sections have to run in parallel… Can you see where I’m going?

Why Kafka on a Notebook?

Apache Kafka is a streaming technology. This means that to understand its beauty you need to have data flowing from Point A (aka the Producer) to Point B (aka the Consumer). Kafka step-by-step tutorials can become complex to follow, since they usually require continuously switching focus between various applications or windows. JupyterLab, with its great language support and multi-window layout, is the perfect way to dig into the basics of Kafka with text descriptions, media and executable code available in a single Web UI. This way you can focus on technology concepts rather than on your local setup.

Now, as a basic playground, let’s create a Kafka instance with Aiven Console. If you haven’t done it already, sign up for an Aiven account and redeem the free credit to start your trial.

To create a Kafka Service, select a cloud provider, the region where you want to deploy the service and the plan which determines the amount of resources available for your cluster. Finally set the service name; in this example we’ll refer to an instance named kafka-notebook but you can choose any name you wish.

(Tip! A video is available that walks through the whole service creation process.)

While we wait for the service to be ready, let's click on it to check its details. On the Overview tab we can find the Host and Port information we'll later use to connect to the cluster. While we're here, we can download the three SSL certificates required to authenticate to Kafka (Access Key, Access Certificate and CA Certificate) into our local kafka-jupyter folder.

One last change is required: scroll down the Overview tab to the Advanced configuration section and enable the kafka.auto_create_topics_enable parameter, which allows us to produce messages to Kafka without needing to create a topic beforehand.

Now, it’s time for an espresso while we wait a couple of minutes until all the Nodes lights turn green, meaning that our kafka-notebook Kafka instance is running and ready to be used.

🟢🟢🟢

Producing the first message

If we now check the JupyterLab Web UI at http://127.0.0.1:8888/, we should see something like this:

On the top left we can spot the work folder. Double-clicking on it shows that it contains the three certificates (ca.pem, service.cert, service.key) we downloaded earlier into our host kafka-jupyter folder. It's now time to create a Kafka producer by selecting the Python 3 icon under the Notebook section of the main page. A notebook opens with a first empty cell that we can use to install the Python library needed to connect to Kafka. Copy the following into the cell and run it:

%%bash 
pip install kafka-python

Even though we are creating a Python notebook, the %%bash prefix allows us to execute bash commands. This cell installs kafka-python, the main Python client for Apache Kafka.

Now we’re all set to produce our first record to Kafka.

A new empty code block should already be there; if not, click the + icon at the top of the notebook. In this new code section we'll speak Pythonese and create an instance of KafkaProducer. Copy and paste the following code into the block, replacing the <host> and <port> parameters with the values taken from Aiven's console:

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='<host>:<port>',
    security_protocol="SSL",
    ssl_cafile="./ca.pem",
    ssl_certfile="./service.cert",
    ssl_keyfile="./service.key",
    value_serializer=lambda v: json.dumps(v).encode('ascii')
)

The code creates a producer pointing to Kafka via the bootstrap_servers parameter, using SSL authentication and the three SSL certificates. The value_serializer transforms our JSON message value into an array of bytes, the format Kafka requires and understands.
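To see what the value_serializer does in isolation, here's a quick sketch using only the standard library; the sample booking dict is purely illustrative:

```python
import json

# the same serializer function we pass to KafkaProducer
serialize = lambda v: json.dumps(v).encode('ascii')

payload = serialize({"name": "Giuseppe Rossi", "hotel": "Luxury Hotel"})
print(type(payload))  # <class 'bytes'>: the format Kafka expects on the wire
```

Note that json.dumps escapes non-ASCII characters by default, so even emoji-laden messages survive the encode('ascii') step.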

Now let's produce our first message. Since it's time to think about summer holidays, we'll create a hotel booking message by pasting the following into a new code block and executing it.

producer.send(
    'hotel-booking-request',
    value={
        "name": "Giuseppe Rossi",
        "hotel": "Luxury Hotel",
        "dateFrom": "25-06-2021",
        "dateTo": "07-07-2021",
        "details": "I want the best room 😀😀😀😀😀!!!!"
    }
)
producer.flush()

The above code adds Giuseppe’s booking for Luxury Hotel to a buffer of pending records, which will be sent to a topic named hotel-booking-request. With the flush() method we make sure the record is actually sent to Kafka. Let's save our producer notebook as Producer.ipynb.

Happy times! Our first message has gone to Kafka. How can we be sure? Well… let's create a Consumer.

Consuming the Message(s)

We now create a new Python notebook to host our Consumer code. In general, it's a good idea to create separate notebooks for producer and consumer, since they solve two different problems and are usually placed in different sections of the containing application. It also enables us to keep the related code separate and to focus on only one block at a time.

JupyterLab, however, allows us to visualise the Consumer alongside the Producer: drag and drop the newly created notebook next to the Producer one, as shown in the image below.

It's time now to create a KafkaConsumer by pasting the following code into the first code block of our new notebook and, after amending the <host>:<port> section as we did for the producer, executing it.

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    bootstrap_servers='<host>:<port>',
    security_protocol="SSL",
    ssl_cafile="./ca.pem",
    ssl_certfile="./service.cert",
    ssl_keyfile="./service.key",
    value_deserializer=lambda v: json.loads(v.decode('ascii')),
    auto_offset_reset='earliest'
)

The consumer is ready, pointing to our Kafka cluster and using a deserialization function that takes the bytes from the message value and transforms them into a JSON structure, performing the opposite transformation to the one applied during production.
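The serializer and deserializer form a lossless round trip that we can verify with the standard library alone; the sample booking below is illustrative data:

```python
import json

serialize = lambda v: json.dumps(v).encode('ascii')    # producer side
deserialize = lambda v: json.loads(v.decode('ascii'))  # consumer side

booking = {"name": "Giuseppe Rossi", "details": "I want the best room 😀!"}
# deserializing the serialized bytes gives back the original structure
assert deserialize(serialize(booking)) == booking
```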

By default, a consumer starts reading from a Kafka topic at the point in time it attaches to the cluster; earlier messages are not read. We change this behaviour with the auto_offset_reset='earliest' parameter, allowing us to read from the beginning of the topic. We are now ready to subscribe to the hotel-booking-request topic and start reading from it with the following code:

consumer.subscribe(topics='hotel-booking-request')
for message in consumer:
    print("%d:%d: v=%s" % (message.partition,
                           message.offset,
                           message.value))

The consumer loop never ends: in the streaming world there is no "end time", and we always want to consume messages as soon as they're available in the Kafka topic. We should also see the first message appearing in our consumer output.

0:0: v={'name': 'Giuseppe Rossi',
'hotel': 'Luxury Hotel',
'dateFrom': '25-06-2021',
'dateTo': '07-07-2021',
'details': 'I want the best room 😀😀😀😀😀!!!!'}

Now if we go back to the Producer notebook, we can produce a holiday booking for Carlo Bianchi by pasting the following code in a new code block:

producer.send(
    'hotel-booking-request',
    key=b'Average Hotel',
    value={
        "name": "Carlo Bianchi",
        "hotel": "Average Hotel",
        "dateFrom": "12-07-2021",
        "dateTo": "23-07-2021",
        "details": "Room next to the highway 🚗🚗🚗🚗"
    }
)
producer.flush()

After executing it, we should immediately receive the same message on the consumer side.

0:1: v={'name': 'Carlo Bianchi',
'hotel': 'Average Hotel',
'dateFrom': '12-07-2021',
'dateTo': '23-07-2021',
'details': 'Room next to the highway 🚗🚗🚗🚗'}

If you're wondering what the 0:1 prefix is, check out the consumer code: it shows the topic's partition and offset, meaning that we are reading the second message (offsets start at 0) from partition 0 of the topic. Our Producer/Consumer pipeline is working. Step 1 complete. Congrats!
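To make that prefix concrete, here's the consumer's print format applied to plain values, with hypothetical sample data so no cluster is needed:

```python
# mimic the consumer's "%d:%d: v=%s" output with sample values
partition, offset = 0, 1  # second message (offset 1) on partition 0
value = {'name': 'Carlo Bianchi', 'hotel': 'Average Hotel'}

line = "%d:%d: v=%s" % (partition, offset, value)
print(line)  # 0:1: v={'name': 'Carlo Bianchi', 'hotel': 'Average Hotel'}
```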

Wrapping up

Notebooks are an awesome way to learn new concepts and technologies, encapsulating text explanations, media, and executable code in a single artifact. JupyterLab, with its broad language support and multi-window layout, constitutes the perfect playground for technologies, like Apache Kafka, composed of multiple pieces working simultaneously.

Want more?

If this first notebook whet your Kafka appetite, then check out our pizza-based Kafka Python notebook for further examples of Kafka concepts like Partitioning, Consumer Groups and Kafka Connect. Please try the whole set of notebooks — and let us know if there’s anything else you’d like included in it.

This blog post provides a first and very basic Apache Kafka Producer/Consumer setup. If you want to understand and test more, here are some additional resources:

Not using Aiven services yet? Sign up now for your free trial at https://console.aiven.io/signup!

In the meantime, make sure you follow our changelog and blog RSS feeds or our LinkedIn and Twitter accounts to stay up-to-date with product and feature-related news.

Originally published at https://aiven.io.
