Apache Kafka in a Nutshell
Once upon a time, a computer program was a monolith. At its simplest, at least one argument and a bit of data went in, and from the same system, deterministically, a result came out.
Things became more complicated when programs needed to talk to each other and pass data back and forth. In such cases, data was addressed to the receiving app, who then theoretically received it, did something to it, and either displayed it locally, or maybe passed the result back to the sender.
Things continued to evolve with systems becoming distributed amongst billions of machines and users. Events like status updates, logins and presence, along with data about uploads and other user activity needed to be handled in real time across systems that spanned the planet via a constellation of users. Suddenly, there was a need for those billions of point applications (and other systems) to communicate with each other.
Imagine the complexity of such a system: masses of unique, distributed implementations, scattered across continents, each with unique needs, and all trying to exchange data — but only the relevant stuff, willy nilly. Would each one have a distinct point-to-point exchange mechanism? Would each transport mechanism be different for the type of data being exchanged?
If so, this would be problematic. How would you build, manage and maintain such a system? After all, the order of complexity of a system with 1 billion apps communicating point to point is a product: n*n . In other words, 1 quintillion, or 1018. So technically, that’s the possible upper limit of how many separate communication protocols and implementations could be needed for 1 billion point apps to communicate 1:1.
Surely there’s a simpler way. What if communication were built around a common interface? And, better still, what if applications could subscribe to and receive only the events they wanted?
Enter the pub-sub mechanism.
Earlier attempts at messaging
But first, let’s look at some early efforts to address this scalability problem.
An earlier endeavor for loosely-coupled messaging involved the Java Messaging Service that emerged in 1998. Unlike more tightly-coupled protocols (like TCP network sockets, or RMI with JMS), an intermediary component enabled software elements to speak to each other indirectly, either via a publish-and-subscribe, or a point-to-point model. As it turns out, JMS offered many features which would be core to Apache Kafka later on, such as producer, consumer, message, queue and topic.
Both Pivotal’s RabbitMQ and Apache ActiveMQ eventually became provider implementations of JMS, with the former implementing Advanced Message Queuing Protocol (AMQP) and compatibility with a range of languages, and the latter offering “Enterprise features” — including multi-tenancy.
But even these solutions came up short in some cases. For example, RabbitMQ stores messages in DRAM until the DRAM is completely consumed, at which point messages are written to disk, severely impacting performance.
Also, the routing logic of AMQP can be fairly complicated as opposed to Apache Kafka. For instance, each consumer simply decides which messages to read in Kafka.
In addition to message routing simplicity, there are places where developers and DevOps staff prefer Apache Kafka for its high throughput, scalability, performance, and durability; although, developers still swear by all three systems for various reasons.
So, in order to understand what makes Apache Kafka special, let’s look at what it is.