What if you built a data pipeline, but couldn’t connect to where your data was stored?
Imagine: you’re running an international hotel chain with thermostat data written to flat files in an S3 bucket, and energy consumption data logged to a spreadsheet accessible only to the accounting department. How could you correlate the thermostat and energy consumption data to optimize energy use?
Or picture this: you’re running a stock-tracking site, but you can only batch-import stock data from the previous day, once every 24 hours. Could any trader make accurate, well-timed decisions using your platform?
What if you’re a mobile game developer monetizing game levels by tracking player progress? Collecting a handful of SDK events on the commodity-hardware MySQL server under your desk works fine for a few players, but what if your game goes viral?
And what if you suddenly needed to collect and accommodate variable-length events, such as a player advancing to the next level, where points are tallied along with the player’s name, rank, position, and a timestamp?
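To make that concrete, here is a sketch of what such a variable-length level-up event might look like. All field names here are illustrative assumptions, not taken from any particular game SDK:

```python
import json
import time

# Illustrative level-up event; every field name here is hypothetical.
# "scores" is the variable-length part: one entry per point source
# tallied when the player advances to the next level.
level_up_event = {
    "event": "level_up",
    "timestamp": int(time.time()),
    "player": {"name": "alice", "rank": 12, "position": 3},
    "scores": [
        {"source": "combat", "points": 250},
        {"source": "exploration", "points": 75},
    ],
    "total_points": 325,
}

# Events like this are typically serialized to JSON before being
# shipped off to a collector endpoint.
payload = json.dumps(level_up_event)
```

A rigid, fixed-column backend schema struggles with exactly this kind of payload, since the length of `scores` differs from event to event.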
Previous attempts (and lessons learned)
Not long ago, these scenarios were common. Data from different sources went into separate silos that various stakeholders couldn’t access. Data couldn’t be viewed, interpreted, or analyzed in transit. And data was almost exclusively processed in daily batches, nowhere near real time.
Data volume and velocity grew faster than homespun pipelines were designed to handle. And data ingestion would often fail silently when an incoming event’s schema didn’t match the backend’s, resulting in long and painful troubleshoot-and-restart cycles.
Adding insult to injury, data formats and fields kept changing in the race to meet ever-changing business needs. The result was predictable:

- Stale, inconclusive insights
- Unmanageable latency and performance bottlenecks
- Undetected data import failures
- Wasted time

You need to get your data where you want it, so you have a single, canonical store for analytics.
Analytics data was typically stored on-premises, in ACID-compliant databases on commodity hardware. This worked until you needed to scale your hardware, and the available analytics and visualization tools no longer gave analysts what they needed.
What’s more, analysts were tied up with infrastructure maintenance and growth-related chores like sharding and replication, and that was before handling periodic software crashes and hardware failures.
On top of that, homespun storage and management tools rarely get the same attention as the product, and the investment is difficult to justify. But whenever technical debt goes unaddressed, system stakeholders eventually pay the price.
How you host and store your data is crucially important for reaching your data analytics goals: managed cloud hosting and storage can free analysts from disaster cleanup, as well as the repetitive work of maintenance, sharding, and replication.
The Seven Challenges
A data pipeline is any set of automated workflows that extract data from multiple sources. Most agree that a data pipeline should offer connection support, elasticity, schema flexibility, data mobility, transformation, and visualization.
Modern data pipelines need to accomplish at least two things:
- Define what data is collected, where, and how; and
- Automatically extract, transform, combine, validate, and load it for further analysis and visualization.
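The two requirements above can be sketched as a toy pipeline in a few lines. The sources, field names, and validation rule here are all assumptions made up for illustration (they echo the hotel-chain scenario from the introduction), not a real implementation:

```python
# --- Extract: pull records from two hypothetical sources ---
def extract():
    thermostat = [{"room": "101", "temp_c": 21.5, "day": "2023-05-01"}]
    energy = [{"room": "101", "kwh": 12.4, "day": "2023-05-01"}]
    return thermostat, energy

# --- Transform & combine: join the two sources on shared keys ---
def transform(thermostat, energy):
    usage = {(r["room"], r["day"]): r["kwh"] for r in energy}
    return [
        {**t, "kwh": usage.get((t["room"], t["day"]))}
        for t in thermostat
    ]

# --- Validate: drop records that fail basic sanity checks ---
def validate(rows):
    return [r for r in rows if r["kwh"] is not None and -50 < r["temp_c"] < 60]

# --- Load: here, just append to an in-memory "store" ---
store = []

def load(rows):
    store.extend(rows)

thermostat, energy = extract()
load(validate(transform(thermostat, energy)))
```

In a real pipeline each stage would hit external systems (an S3 bucket, a warehouse table) rather than in-memory lists, but the shape of the flow — extract, transform and combine, validate, then load — is the same.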