The challenges a modern data pipeline must solve

Imagine: you’re running an international hotel chain with thermostat data written to flat files in an S3 bucket, and energy consumption data logged to an accounting spreadsheet accessible only to the accounts department.

Or picture this: you’re running a stock tracking site, but you can only batch-import the previous day’s stock data once every 24 hours.

What if you’re a mobile game developer monetizing game levels by tracking progress? Collecting a handful of SDK events onto the commodity-hardware MySQL server under your desk works great for a few players, but it quickly falls over as your player base grows.

And what if you suddenly needed to collect and accommodate variable-length events, such as when a player advances to the next level and points are tallied with the player’s name, rank, and position, along with a timestamp?
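A variable-length event like that can be modeled as a record with a fixed core plus arbitrary extra fields. A minimal sketch in Python, with entirely hypothetical field names:

```python
import json
import time

# A hypothetical "level complete" event: the core fields are always
# present, while optional extras vary from event to event.
def level_complete_event(player, rank, position, points, **extras):
    event = {
        "type": "level_complete",
        "player": player,
        "rank": rank,
        "position": position,
        "points": points,
        "timestamp": time.time(),
    }
    event.update(extras)  # variable-length payload: bonus fields, etc.
    return event

print(json.dumps(level_complete_event("alice", 12, 3, 4500, bonus=200)))
```

A rigid, fixed-column backend has no obvious place to put `bonus`, which is exactly why such events strain a homespun setup.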

Previous attempts (and lessons learned)

Not long ago, these scenarios were common. Data from different sources went into separate silos that various stakeholders couldn’t access. Data couldn’t be viewed, interpreted, or analyzed in transit. Data was almost exclusively processed in daily batches, nowhere near real time.

Data volume and velocity grew faster than a homespun pipeline was designed to handle. And data ingestion would often fail silently when the incoming events’ schema didn’t match the backend’s, resulting in long and painful troubleshoot-and-restart cycles.

Adding insult to injury, the data formats and fields kept changing in the race to meet ever-changing business needs. The result was predictable:

Analytics data was typically stored on-premises, in ACID-compliant databases on commodity hardware. This worked until you needed to scale the hardware and the available analytics and visualization tools no longer gave analysts what they needed.

What’s more, analysts were tied up with infrastructure maintenance and growth-related chores like sharding and replication, and that’s before handling periodic software crashes and hardware failures.

On top of that, homespun storage and management tools rarely get the same attention as the product, and it’s difficult to justify the investment. But whenever technical debt goes unaddressed, system stakeholders eventually pay the price.

Now, how you host and store your data is crucial to reaching your data analytics goals: managed cloud hosting and storage can free analysts from disaster cleanup, as well as the repetitive work of maintenance, sharding, and replication.

The Seven Challenges

A data pipeline is any set of automated workflows that extract data from multiple sources. Most agree that a data pipeline should cover extraction, transformation, combination, validation, and loading.

Modern data pipelines need to accomplish at least two things:

  1. Define what data is collected, where, and how; and,
  2. Automatically extract, transform, combine, validate, and load it for further analysis and visualization…
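Those stages can be sketched as a small composable loop. Everything below is an illustrative stand-in: the sources, the transform, and the “load” target are hypothetical, not a specific pipeline product:

```python
# A minimal extract-transform-validate-load sketch (illustrative only).
def extract(sources):
    for source in sources:
        yield from source  # each source is any iterable of raw records

def transform(record):
    # Normalize a raw record into the shape downstream tools expect.
    return {"player": record["player"].lower(), "points": int(record["points"])}

def validate(record):
    if record["points"] < 0:
        raise ValueError(f"negative points: {record}")
    return record

def load(records, sink):
    for record in records:
        sink.append(record)  # stand-in for a database/warehouse write

def run_pipeline(sources, sink):
    load((validate(transform(r)) for r in extract(sources)), sink)
```

The point of the sketch is the shape, not the code: each stage is a small, testable step, and records flow through lazily rather than as one monolithic daily batch.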

…visit the original blog post at Aiven to take a deep dive into the seven challenges.

Your database in the cloud,
