The data ingestion ecosystem

At the beginning of any data pipeline, data ingestion involves procuring events from sources (applications, IoT devices, web and server logs, and even data file uploads) and transporting them into a data store for further processing.

Embulk

Embulk is a parallel bulk data uploader built around a core and a set of community-contributed input and output plugins. It supports bulk data transfer between various data stores, databases, NoSQL stores, and cloud services.

StreamSets

With over 2 million downloads, StreamSets Data Collector is a popular “low-latency ingest infrastructure tool that lets you create continuous data ingest pipelines with a drag and drop UI.” Open source under the Apache 2.0 license, StreamSets is a good way to set up data ingestion graphically, with minimal code and configuration.

Fluentd

Fluentd, a “Unified Logging Layer,” is an open-source streaming data collector that decouples data sources from backend systems. A favorite of Yukihiro Matsumoto, creator of Ruby, Fluentd consists, like Embulk, of a community-maintained core combined with input and output plugins. Fluent Bit is the version maintained for embedded systems.
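To make the "core plus plugins" idea concrete, here is a minimal sketch of a collector that routes tagged events to whichever outputs have subscribed to them. It is a toy model only: `ToyCollector`, `add_output`, and `emit` are hypothetical names invented for illustration and are not Fluentd's or Embulk's API.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Output = Callable[[str, Event], None]


class ToyCollector:
    """Hypothetical core: sources emit tagged events, outputs subscribe by tag prefix."""

    def __init__(self) -> None:
        self._routes: List[Tuple[str, Output]] = []

    def add_output(self, tag_prefix: str, output: Output) -> None:
        # An "output plugin" here is just a callable that receives (tag, event).
        self._routes.append((tag_prefix, output))

    def emit(self, tag: str, event: Event) -> int:
        # Deliver the event to every output whose prefix matches the tag,
        # and report how many outputs received it.
        delivered = 0
        for prefix, output in self._routes:
            if tag.startswith(prefix):
                output(tag, event)
                delivered += 1
        return delivered
```

Because sources only know the tag they emit under, and outputs only know the prefix they listen to, either side can be swapped without touching the other. That is the decoupling described above, in miniature.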

Apache Sqoop, Flume, and Spark

Apache Sqoop is a tool for transporting bulk data between Apache Hadoop and structured datastores like relational databases. By offloading certain tasks (such as extract, transform, load operations) onto Hadoop, it can make data warehouses more efficient.

The data transport ecosystem

Data transport overlaps somewhat with data ingestion, but “ingestion” is about extracting data from one system and loading it into another, while “transport” is about getting data from any location to any other.

  • Real-time streaming data pipelines between systems or applications;
  • Real-time streaming applications that transform streams of data;
  • Real-time streaming applications that react to streams of data.
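The three patterns in the list above can be sketched with plain Python generators. This is an in-memory illustration under simplifying assumptions: there is no broker or network involved, and the function names and the temperature fields (`temp_c`, `temp_f`, `sensor`) are made up for the example.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, object]


def move(events: Iterable[Event], sink: List[Event]) -> None:
    """Pipeline: copy every event from one system (an iterable) into another (a list)."""
    for event in events:
        sink.append(event)


def to_fahrenheit(events: Iterable[Event]) -> Iterator[Event]:
    """Transform: derive a new stream from an existing one, one event at a time."""
    for event in events:
        yield {**event, "temp_f": event["temp_c"] * 9 / 5 + 32}


def alert_on_heat(events: Iterable[Event], threshold_f: float) -> Iterator[str]:
    """React: emit an alert whenever an event crosses the threshold."""
    for event in events:
        if event["temp_f"] > threshold_f:
            yield f"ALERT {event['sensor']}: {event['temp_f']:.1f}F"
```

Chaining them, for example `alert_on_heat(to_fahrenheit(readings), 100.0)`, processes each event as it arrives rather than waiting for a complete batch.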

Data storage and data management ecosystem

No one talks about Big Data or its ecosystem without including Apache Hadoop and Apache Spark. Hadoop is a framework that can process large data sets across clusters; Spark is “a unified analytics engine for large scale data processing.”

PostgreSQL

PostgreSQL is an open-source object-relational database management system that emphasizes extensibility and standards compliance. It has been around so long that it has become a standby for companies ranging from manufacturing to IoT. Aiven offers a fully managed PostgreSQL service.

Redis

Redis is a superfast variant of the NoSQL database known as a key-value store. As such, it’s an extremely simple database that stores only key-value pairs and serves search results by retrieving the value associated with a known key.
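As a rough illustration of the key-value model, the sketch below wraps a Python dictionary in a tiny store. It is not the Redis client API and it keeps everything in local memory; `ToyKeyValueStore` and its methods are hypothetical names chosen to mirror the idea of setting and getting a value by key.

```python
from typing import Dict, Optional


class ToyKeyValueStore:
    """Hypothetical in-memory store: every lookup goes through a known key."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        # Writing to an existing key simply overwrites the old value.
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        # Returns None when the key is absent; there is no query language,
        # only retrieval by exact key.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # True if the key existed and was removed, False otherwise.
        return self._data.pop(key, None) is not None
```

The design choice worth noticing is what is missing: no joins, no secondary lookups, no schema. That narrow interface is a large part of why key-value stores can be so fast.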

Cassandra

If you’re working with large, active data sets and need to tune the tradeoff between consistency, availability, and partition tolerance, then Apache Cassandra may be your solution. Because data is replicated across nodes, when one node (or even an entire data center) goes down, the data remains available on other nodes, depending on the replication and consistency level settings.
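The consistency tradeoff can be expressed as simple arithmetic. The sketch below uses the standard quorum rule: with a replication factor RF, a read is guaranteed to overlap the latest acknowledged write when the number of replicas that acknowledge the write plus the number consulted on the read exceeds RF. The function names are illustrative and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when the read and write replica sets must share at least one replica."""
    return write_acks + read_acks > replication_factor


def tolerated_replica_failures(replication_factor: int, required_acks: int) -> int:
    """How many replicas can be down while an operation can still gather its acks."""
    return replication_factor - required_acks
```

With a replication factor of 3, quorum reads and writes (2 replicas each) overlap and still survive one failed replica, while reading and writing at a single replica tolerates two failures but no longer guarantees that a read sees the latest write.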

InfluxDB

The rapid instrumentation of the physical world due to IoT and data-collecting applications has led to an explosion of time-stamped data. Time series databases serve this evolving niche, and among them, InfluxDB is emerging as a major player. InfluxDB, like others, can handle complex logic or business rules atop massive — and fast-growing — data sets, and InfluxDB adds the advantage of a range of ingestion methods, as well as the ability to append tags to different data points. Aiven also provides a managed version, Aiven InfluxDB.
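To show what a tagged, time-stamped data point looks like, here is a simplified sketch that renders a point in InfluxDB's line protocol text format (measurement, comma-separated tags, fields, then a timestamp). It handles only the common escaping cases and is meant as an illustration, not a replacement for an official client library; the function names and the sample measurement are assumptions made for the example.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    # Backslash-escape each special character (simplified: no other cases handled).
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value: FieldValue) -> str:
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, FieldValue], timestamp_ns: int) -> str:
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorting tags by key keeps output deterministic
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(f"{_escape(k, ',= ')}={format_field(v)}" for k, v in fields.items())
    return f"{head} {body} {timestamp_ns}"
```

Tags (such as a sensor ID or site name) describe where a point came from and are what you typically filter and group by, while fields hold the measured values themselves.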

Data visualization

When you want to develop insights and reach conclusions to support your hypotheses, you’re in the domain of data scientists. Data visualization tools and dashboards also support managers, marketers, and even end consumers, but there are simply too many such tools, with too many areas of specialty, to possibly cover in this article.

Other tools

Often the need arises to handle, search, and process raw text. Based on Lucene, Elasticsearch is a distributed document and full-text indexing solution that supports complex data analytics in real time. Aiven’s enhanced Elasticsearch offering is frequently used alongside other Aiven services such as Aiven Kafka, PostgreSQL, and Redis.
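Full-text indexing rests on a data structure called an inverted index, which maps each term to the documents that contain it. The toy version below is a sketch of that idea only: it is not Elasticsearch's or Lucene's API, it does no ranking or stemming, and the class and function names are invented for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyInvertedIndex:
    """Hypothetical index: term -> set of IDs of the documents containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return the documents that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

Searching becomes a matter of intersecting small sets rather than scanning every document, which is what makes real-time queries over large text collections feasible.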

Where to learn more

Here are a few resources where you can learn more about data pipelines and related technologies:

General

The Data Science Handbook

On Data Ingestion

What is data ingestion?

On Message Brokers and Data Transport

What is a message broker?

On Data Storage and Management

What is a key-value store?

What is a time-series database?

On Data Visualization

What is data visualization?

Wrapping up

As you can see, the data and event ecosystem offers a vast number of components from which to build a data pipeline. To get a feel for real use cases, look at examples of companies that have built their data pipelines on the Aiven platform. With Aiven, you too can build your own data pipelines with just a few clicks! We have the infrastructure and expertise to help you get started.


Your database in the cloud, www.aiven.io
