We’ve previously looked at the current data and event ecosystem and the 7 challenges that data pipelines must solve. Next, let’s look into the future. In order to understand where things are going, it’s useful to first look at where they’ve been.
200 years ago, if you wanted to send a message somewhere, you wrote a letter and potentially waited years for a response. 150 years ago, you could accomplish roughly the same thing by telegraph in as little as a day. A few decades later, telephone and radio were able to transmit and receive immediately.
The same happened with visual information 80 years ago thanks to the first commercial television. Next came the rise of PCs in the early 1980s, when data could be readily stored and searched (locally). Then, just 25 years ago, the world wide web emerged, and with it, the need for dynamic, distributed data stores.
What have we learned?
For argument’s sake, let’s consider our “message” and “data” to be identical. So what have we learned? With each innovation, the data in our pipelines grew and transformed, in terms of:
- Amount — The sheer volume of data a system (a data pipeline) is required to handle within a set timeframe.
- Velocity — The speed at which the data travels through the pipeline, which subsequently affects the speed at which a response can be expected.
- Purpose — The function the data being transmitted serves. Is it the message payload itself? Metadata for use by the transmission mechanism? Formatting? Headers? Instructions for another system?
- Trajectory — The direction(s) the data moves in. Data no longer merely moves from point to point, as with a telegraph, but often between several different sets of points simultaneously, e.g. a TV broadcast, a peer-to-peer network, or a blockchain. This implies that for every single producer, there are possibly many consumers.
- Format — Data may come as structured, unstructured, plaintext, encrypted, binary, or even embedded within other data. Data can also be commands to subsequent systems in the pipeline. Or any combination thereof.
So, what about now?
With mobile and smart devices and, more recently, the world of IoT with its connected infrastructure, we’re seeing the amount of generated data explode and continue its metamorphosis in form and function.
According to IDC, there were 16.1 zettabytes of data generated in 2016. That’s projected to grow to 163 zettabytes in 2025 with users worldwide expected to interact with a data-driven endpoint device every 18 seconds on average.
Let’s look at what we believe data pipelines and components will need to be able to achieve in terms of functionality, design, compliance, usability, performance, and scalability to handle this scale.
1. Functionality
Just a decade ago, a pipeline was generally unidirectional and point-to-point: it dealt with siloed, background business data and ingested it in batch (often schema-rigid, through an inflexible ETL) during off hours when CPU resources and bandwidth were free.
Today, data is ubiquitous and can even be life critical. As such, data is in the foreground of users’ everyday lives. Consequently, pipelines may need to run polydirectionally: from the point of ingestion to one or more central data lakes or data warehouses for processing, and back out to edge data centers and even endpoint devices, where it is further processed, visualized and rendered.
Much of this happens in real- or near-real-time. Core-to-endpoint analytics, like those found in some modern cars, are a good example of this.
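As a rough illustration of what this polydirectional, near-real-time flow can look like, here is a minimal Python sketch assuming a Kafka-backed pipeline accessed via the kafka-python client; the broker address, topic names and telemetry fields are purely hypothetical:

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

# Hypothetical source topic carrying telemetry from endpoint devices.
consumer = KafkaConsumer(
    "vehicle-telemetry",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

for message in consumer:
    event = message.value
    enriched = {**event, "overheating": event.get("engine_temp_c", 0) > 110}

    # One direction: toward a central lake/warehouse for heavier processing.
    producer.send("warehouse-ingest", enriched)

    # The other direction: back toward the edge/endpoint, in near real time.
    if enriched["overheating"]:
        producer.send("edge-alerts", {"vehicle_id": event.get("vehicle_id"),
                                      "action": "reduce_load"})
```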
An interesting side-trend emerging from these phenomena is that the data pipelines are not just an IT component supporting the business, but, increasingly are the business.
Alooma is one notable example of a data-pipeline-as-a-service; many other services work in a specific domain with a real-time data pipeline at their core.
As embedded IoT devices proliferate and mobile real-time data grows, the immense volume of data generated within such pipelines must be accommodated and made appropriately available. This suggests a few requirements:
- pipelines and their components must be capable of auto-scaling, sharding and partition-tolerance with minimal — if any — human interaction;
- pipelines and their data flow be troubleshootable and configurable on the fly;
- pipelines be agnostic to — and able to accommodate — a range of formats, from fully structured, ACID-compliant data to completely unstructured data, but;
- pipelines implement measures to capture, fix and requeue events that error out (see the sketch after this list).
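The last requirement lends itself to a small sketch. The Python below shows one minimal, library-agnostic way to capture failing events along with error metadata (a dead-letter pattern) so an operator or automated process can fix and requeue them; the function and field names are illustrative, not tied to any particular product:

```python
import time

def process_with_dlq(events, handler, dead_letter_sink, max_retries=3):
    """Run handler over each event; failures are captured with context so
    they can be inspected, fixed and requeued instead of silently dropped."""
    for event in events:
        for attempt in range(1, max_retries + 1):
            try:
                handler(event)
                break
            except Exception as exc:
                if attempt == max_retries:
                    # Park the original payload plus error metadata for later repair.
                    dead_letter_sink.append({
                        "payload": event,
                        "error": repr(exc),
                        "failed_at": time.time(),
                        "attempts": attempt,
                    })
                else:
                    time.sleep(0.1 * attempt)  # simple backoff before retrying

# Toy usage: the malformed event lands in the dead-letter sink for later requeue.
dead_letters = []
process_with_dlq(
    events=[{"temp_c": "21.5"}, {"temp_c": None}],
    handler=lambda e: float(e["temp_c"]),
    dead_letter_sink=dead_letters,
)
print(dead_letters)  # the failing event, annotated with its error and retry count
```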
Analytics pipelines will also increasingly serve as a funnel and conduit for data being ingested and used to train AI and ML models. This is already built into systems like Apache HiveMall, which sit atop the data pipeline and make deductions, detect anomalies, or translate data into commands for, say, endpoint devices or connected systems.
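To make that idea concrete without reproducing HiveMall’s own syntax, here is a hedged Python sketch of the pattern: a rolling z-score detector that sits on a stream of readings and translates anomalies into commands for a downstream device. The class, threshold and command format are illustrative assumptions only:

```python
import random
from collections import deque
from statistics import mean, stdev

class StreamingAnomalyDetector:
    """Rolling z-score detector: flags readings that sit far outside the
    recent window and turns them into a command for a downstream device."""

    def __init__(self, window=50, threshold=3.0, min_samples=10):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, reading):
        command = None
        if len(self.history) >= self.min_samples:
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(reading - mu) / sigma > self.threshold:
                # Translate the anomaly into an instruction for an endpoint device.
                command = {"action": "throttle", "reason": "anomalous_reading",
                           "value": reading}
        self.history.append(reading)
        return command

# Toy stream: stable temperatures followed by one obvious outlier.
detector = StreamingAnomalyDetector()
stream = [70 + random.gauss(0, 0.5) for _ in range(40)] + [120.0]
for reading in stream:
    command = detector.observe(reading)
    if command:
        print("would dispatch:", command)
```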
These systems will largely self-perpetuate and automate continuous improvements via updates to both the data pipelines and the software components of products.
That said, in the future, the challenge for humans supporting these AI systems won’t be so much training intelligence algorithms or devices (AI will learn to self-train) as keeping an eye on things and, when necessary, intervening to refine, tune or retrain the machine learning component as it self-develops, to keep these systems from going off the rails.
Visit The Future of Data Pipelines post on the Aiven blog to get the full breakdown of what data pipelines will have to achieve in terms of compliance, usability, performance, and scalability.