Apache Kafka and the great database debate
It’s stormy times on the internet, and the waters are seething. The rolling waves strike hard against the shores of Kafka, leaving behind arguments for and against this Great Question: Is It A Database?
Here’s a handy tl;dr for this article: Yes and no. It depends.
If you keep your eyes on Kafka’s logical niche in data architecture, and kind of squint, then yes, you could call it a database. It takes in data and gives out data, after all. It contains information about events that have occurred (to the extent that it’s configured to retain them, of course). You may argue that having all that data in there is good enough reason to be using it as a database.
But does that mean Kafka is a database?
In the main, we’d have to come down on the side of No.
What do relational databases do well?
The Kafka vs. databases debate largely boils down to events vs. states.
A relational database stores data as states. For example, a PostgreSQL database will tell you that your central warehouse has 41 chasing hammers that you can deliver to retail outlets. In other words, it tells you the state of affairs right now.
In contrast, Kafka will tell you that you ordered 100 birch-handled chasing hammers in July; 8 were delivered to Outlet A; 19 were delivered to Outlet B; 48 were delivered to Outlet C; 20 more hammers were ordered (Outlet C seems to be very good at selling hammers, or they just lost a shipment); and 14 were delivered to Outlet D.
If at this point you need to ask “How many birch-handled chasing hammers are in the central warehouse?”, the relational database has your answer ready. Kafka really has to think about it, because the answer has to be worked out by reading back through the events. And the longer the chain of events involving chasing hammers in the central warehouse, the longer it thinks. So Kafka isn’t well-suited for ad hoc queries about the current state.
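The difference between looking up a state and replaying events can be sketched in a few lines of Python. This is a toy model, not Kafka client code: the `StockEvent` shape, the `replay_stock` helper and the `state_table` dictionary are hypothetical names made up for illustration, and the sketch assumes that “ordered” means the hammers have arrived at the central warehouse. With no opening stock, the six events listed above net out to 31 hammers.

```python
from collections import namedtuple

# Hypothetical event shape; the field names are illustrative only.
StockEvent = namedtuple("StockEvent", ["kind", "quantity", "party"])

# The events from the example above, in the order they occurred.
EVENTS = [
    StockEvent("ordered", 100, "supplier"),
    StockEvent("delivered", 8, "Outlet A"),
    StockEvent("delivered", 19, "Outlet B"),
    StockEvent("delivered", 48, "Outlet C"),
    StockEvent("ordered", 20, "supplier"),
    StockEvent("delivered", 14, "Outlet D"),
]


def replay_stock(events, opening_stock=0):
    """Derive the warehouse stock by walking every event in order."""
    stock = opening_stock
    for event in events:
        if event.kind == "ordered":
            stock += event.quantity  # assumption: ordered goods reach the warehouse
        elif event.kind == "delivered":
            stock -= event.quantity  # goods leave the warehouse for an outlet
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return stock


# A state store keeps the answer current, so a query is a single lookup.
state_table = {"birch-handled chasing hammer": replay_stock(EVENTS)}

print(replay_stock(EVENTS))                         # work grows with the event count
print(state_table["birch-handled chasing hammer"])  # work stays constant
```

The replay loop touches every event each time the question is asked, which is why the cost of an occasional query grows with the length of the history.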
To take a much more common use case, consider holding on to inventory data for delivery to an online shop, so that consumers can purchase their birch-handled chasing hammers directly. In normal circumstances, Kafka and relational databases perform equally well here.
But what if there’s a problem and the data warehouse system has to be rebooted? A relational database is simply restored from a backup, since the backup already holds the state. To get the same state back out of Kafka, you have to run it through all the same hoops of goods purchased, sent, ordered and received, just to reach the state it was in before it went away.
What does Kafka do well?
Kafka works on a completely different principle than a relational database. The fact that it stores data is incidental; what it actually does is make a note of events as they unfold.
Well then, you say, if all Kafka can do is publish to streams and consume them, what’s it good for?
Because Kafka is all about events, it makes an excellent message bus component for a data pipeline. Kafka is definitely at its best as short-term storage from which other systems (including long-term storage databases) can retrieve data in a robust, ACID-compliant way. It eliminates data silos by allowing any interested component to find and consume data.
And note that Kafka can step up to every letter of that acronym:
- Atomicity: data in Kafka is either written in its entirety or it’s not written, and if a consumer fails, it can just go back and re-read the partition.
- Consistency: Kafka as a whole eats anything, but constraints can be mimicked by using partitions. In any case, this is more a property of the downstream application than it is of the data store itself.
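- Isolation: data is absolutely, definitely serialised within each Kafka partition, since events in a partition are stored in the order they were written and consumers will always read them in that same order. Ordering across different partitions is not guaranteed.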
- Durability: Kafka writes data to disk and replicates it (to multiple brokers), which is precisely what any database worth its salt does. If you want backups, well, backups aren’t part of a database anyway, but it’s certainly possible to back up all the messages you write to Kafka.
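The ordering point in the list above is worth a small illustration, because Kafka guarantees order within a partition rather than across a whole topic. The following is a minimal sketch in plain Python, assuming a toy hash-based partitioner: `partition_for`, `append_all` and the sample records are hypothetical, and `zlib.crc32` merely stands in for whatever partitioner a real client uses.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from the key (toy hash, not Kafka's real partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append_all(records, num_partitions):
    """Append (key, value) records to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in records:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


def values_for(key, logs, num_partitions):
    """Read one key's values back from the partition that holds them."""
    partition = logs[partition_for(key, num_partitions)]
    return [value for record_key, value in partition if record_key == key]


# Illustrative records only; the keys and values are made up for this sketch.
records = [
    ("outlet-a", "delivered 8"),
    ("outlet-b", "delivered 19"),
    ("outlet-a", "returned 1"),
    ("outlet-c", "delivered 48"),
    ("outlet-b", "returned 2"),
]

logs = append_all(records, num_partitions=3)
print(values_for("outlet-a", logs, 3))
```

Records that share a key land in the same partition and are read back in the order they were written. Records with different keys may sit in different partitions, so their relative order is only known if the messages themselves carry it.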
The fact that one can even think of Kafka as being able to replace a database shows that Kafka really is versatile and multifunctional. Kafka can easily process enormous amounts of data, which comes in handy for recording transactions (such as purchases of chasing hammers!), compiling metrics, and handling streaming data of any kind.
Using Kafka as long-term data storage doesn’t take full advantage of its strengths and ignores its weak recovery arrangements. The recovery aspects are weak specifically because Kafka was never intended to store data, only to deliver it. It’s a data store only to the same degree that a pizza courier is a cooling rack.
As you can see from the image, the most productive way to use Kafka is between systems. This completely divorces systems sending in data from systems picking it up. You don’t need custom integrations between every building block in your architecture; just plug Kafka into the middle, and it takes care of the data for the short time it needs to persist before being read by a receiving component. The applications for this approach are wide and varied.
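To make the decoupling concrete, here is a minimal in-memory sketch, assuming a single append-only log with one read position per consumer. The `TopicLog` class and the consumer names are hypothetical and only mimic the idea; a real deployment would use a Kafka client library against a broker.

```python
class TopicLog:
    """Append-only list standing in for a single-partition topic."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next unread records for this consumer and advance it."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for record in ("hammer ordered", "hammer shipped", "hammer received"):
    log.publish(record)

# Two independent readers: neither knows about the other or about the sender.
warehouse_view = log.poll("warehouse-db")
metrics_view = log.poll("metrics", max_records=2)
metrics_rest = log.poll("metrics")

print(warehouse_view)
print(metrics_view, metrics_rest)
```

The sender only ever calls `publish`, and each reader only ever calls `poll` with its own name, so adding another receiving component needs no change to anything that already exists.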
Kafka is great. At Aiven, we all think so; after all, it’s part of our product offering. Kafka can do plenty of cool stuff. But in practice, to keep your data safe and your database admins happy, pair it with some other great open source data store.
For further reading, you might enjoy An introduction to Apache Kafka to refresh your memory of what this Kafka thing even is.
Originally published at https://aiven.io.