What is a data pipeline?



Updated: 14 June 2024
Contributor: Cole Stryker


A data pipeline is a method in which raw data is ingested from various data sources,
transformed and then ported to a data store, such as a data lake or data warehouse, for
analysis.

Before data flows into a data repository, it usually undergoes some data processing. This includes data transformations, such as filtering, masking, and aggregations, which ensure appropriate data integration and standardization. This is particularly important when the destination for the dataset is a relational database. This type of data repository has a defined schema which requires alignment—that is, matching data columns and types—to update existing data with new data.
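
To make this concrete, here is a minimal pre-load transformation sketch in Python using pandas; the column names, sample values and target schema are hypothetical and simply illustrate filtering, masking and aggregation before aligning the output with a relational table.

# A minimal transformation sketch with pandas: filter, mask and aggregate raw
# records, then align column names and types with a hypothetical target schema.
import pandas as pd

raw = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "email": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "amount": ["10.50", "0.00", "25.00", "7.25"],   # arrives as strings
    "region": ["EU", "EU", "US", "US"],
})

# Filtering: drop zero-value orders.
df = raw[raw["amount"].astype(float) > 0].copy()

# Masking: hide personally identifiable information before it leaves the pipeline.
df["email"] = df["email"].str.replace(r"^[^@]+", "***", regex=True)

# Aggregation: total order value per region.
summary = (
    df.assign(amount=df["amount"].astype(float))
      .groupby("region", as_index=False)["amount"].sum()
)

# Schema alignment: match the destination table's column names and types.
summary = summary.rename(columns={"amount": "total_amount"}).astype(
    {"region": "string", "total_amount": "float64"}
)
print(summary)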
As the name suggests, data pipelines act as the “piping” for data science projects or
business intelligence dashboards. Data can be sourced through a wide variety of places
—APIs, SQL and NoSQL databases, files, et cetera—but unfortunately, that data usually
isn’t ready for immediate use. During sourcing, data lineage is tracked to document the relationship between enterprise data in various business and IT applications, for example, where data currently resides and how it’s stored in an environment, such as on-premises, in a data lake or in a data warehouse.

Data preparation tasks usually fall on the shoulders of data scientists or data engineers,
who structure the data to meet the needs of the business use cases and handle huge
amounts of data. The type of data processing that a data pipeline requires is usually
determined through a mix of exploratory data analysis and defined business
requirements. Once the data has been appropriately filtered, merged, and summarized,
it can then be stored and surfaced for use. Well-organized data pipelines provide the
foundation for a range of data projects; this can include exploratory data analyses, data
visualizations, and machine learning tasks.


Types of data pipelines


There are several main types of data pipelines, each appropriate for
specific tasks on specific platforms.

Batch processing

The development of batch processing was a critical step in building data infrastructures
that were reliable and scalable. In 2004, MapReduce, a batch processing algorithm, was
patented and then subsequently integrated into open-source systems, such as Hadoop,
CouchDB and MongoDB.

As the name implies, batch processing loads “batches” of data into a repository during
set time intervals, which are typically scheduled during off-peak business hours. This
way, other workloads aren’t impacted as batch processing jobs tend to work with large
volumes of data, which can tax the overall system. Batch processing is usually the
optimal data pipeline when there isn’t an immediate need to analyze a specific dataset
(for example, monthly accounting), and it is more associated with the ETL data
integration process, which stands for “extract, transform, and load.”

Batch processing jobs form a workflow of sequenced commands, where the output of
one command becomes the input of the next command. For example, one command
might kick off data ingestion, the next command may trigger filtering of specific
columns, and the subsequent command may handle aggregation. This series of
commands will continue until the data is completely transformed and written into a data repository.
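
As an illustration of such a sequenced workflow, the sketch below chains the steps in Python so that each step's output becomes the next step's input; the file paths and field names are assumptions, and a scheduler such as cron would typically run the chain during off-peak hours.

# A minimal batch-job sketch: ingest -> filter columns -> aggregate -> write,
# with each step's output feeding the next. Paths and field names are hypothetical.
import csv
from collections import defaultdict

def ingest(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def filter_columns(rows, keep):
    return [{k: r[k] for k in keep} for r in rows]

def aggregate(rows, key, value):
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += float(r[value])
    return totals

def write(totals, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["store", "total_sales"])
        writer.writerows(sorted(totals.items()))

# A scheduler (for example, cron running overnight) would invoke this chain.
rows = ingest("daily_sales.csv")
rows = filter_columns(rows, keep=["store", "sales"])
totals = aggregate(rows, key="store", value="sales")
write(totals, "sales_by_store.csv")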

Streaming data

Unlike batch processing, streaming data pipelines—also known as event-driven architectures—continuously process events generated by various sources, such as sensors or user interactions within an application. Events are processed and analyzed, and then either stored in databases or sent downstream for further analysis.

Streaming data is used when data must be continuously updated. For
example, apps or point-of-sale systems need real-time data to update inventory and
sales history of their products; that way, sellers can inform consumers if a product is in
stock or not. A single action, such as a product sale, is considered an “event,” and
related events, such as adding an item to checkout, are typically grouped together as a
“topic” or “stream.” These events are then transported via messaging systems or
message brokers, such as the open-source offering, Apache Kafka.

Since data events are processed shortly after occurring, streaming processing systems have lower latency than batch systems, but they aren’t considered as reliable as batch processing systems because messages can be unintentionally dropped or spend a long time in the queue. Message brokers help to address this concern through acknowledgements, where a consumer confirms processing of the message to the broker so it can be removed from the queue.
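
The sketch below shows one way this acknowledgement pattern can look with the open-source kafka-python client; the broker address, topic name and consumer group are hypothetical, and auto-commit is disabled so that each message is acknowledged only after it has been processed.

# A minimal streaming-consumer sketch using the kafka-python client.
# Auto-commit is disabled so the offset is committed (acknowledged) only
# after the event has been processed. Broker, topic and group are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sales-events",
    bootstrap_servers="localhost:9092",
    group_id="inventory-updater",
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Process the event, for example by updating inventory counts downstream.
    print(f"sold product {event.get('product_id')} (offset {message.offset})")
    consumer.commit()  # acknowledge so the broker can treat the message as handled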


Data integration pipelines

Data integration pipelines concentrate on merging data from multiple sources into a
single unified view. These pipelines often involve extract, transform and load (ETL)
processes that clean, enrich, or otherwise modify raw data before storing it in a
centralized repository such as a data warehouse or data lake. Data integration pipelines
are essential for handling disparate systems that generate incompatible formats or
structures. For example, a connection can be added to Amazon S3 (Amazon Simple Storage Service), a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface.
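
For illustration, the sketch below uses the boto3 library to pull a raw object from Amazon S3 as the first step of an integration pipeline; the bucket and key names are hypothetical, and credentials are assumed to come from the environment.

# A minimal sketch of reading a raw object from Amazon S3 with boto3.
# Bucket and key are hypothetical; credentials come from the environment.
import boto3

s3 = boto3.client("s3")
response = s3.get_object(Bucket="raw-landing-zone", Key="orders/2024/06/14/orders.json")
raw_bytes = response["Body"].read()

# From here the pipeline would parse, clean and enrich the records before
# loading them into a centralized data warehouse or data lake.
print(f"fetched {len(raw_bytes)} bytes from S3")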

Cloud-native data pipelines

A modern data platform includes a suite of cloud-first, cloud-native software products that enable the collection, cleansing, transformation and analysis of an organization’s data to help improve decision making. Today’s data pipelines have become increasingly complex and important for data analytics and making data-driven decisions. A modern data platform builds trust in this data by ingesting, storing, processing and transforming it in a way that ensures accurate and timely information, reduces data silos, enables self-service and improves data quality.

Data pipeline architecture

Three core steps make up the architecture of a data pipeline.

1. Data ingestion: Data is collected from various sources—including software-as-a-service (SaaS) platforms, internet-of-things (IoT) devices and mobile devices—and various data structures, both structured and unstructured data. Within streaming data, these raw data sources are typically known as producers, publishers, or senders. While businesses can choose to extract data only when ready to process it, it’s a better practice to land the raw data within a cloud data warehouse provider first. This way, the business can update any historical data if they need to make adjustments to data processing jobs. During this data ingestion process, various validations and checks can be performed to ensure the consistency and accuracy of data.

2. Data transformation: During this step, a series of jobs is executed to process data into the format required by the destination data repository. These jobs embed automation and governance for repetitive workstreams, such as business reporting, ensuring that data is cleansed and transformed consistently. For example, a data stream may come in a nested JSON format, and the data transformation stage will aim to unroll that JSON to extract the key fields for analysis.

3. Data storage: The transformed data is then stored within a data repository, where it can be exposed to various stakeholders. Within streaming data, the downstream systems and users that receive this transformed data are typically known as consumers, subscribers, or recipients. A minimal end-to-end sketch of these three steps follows below.
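
The sketch below walks through the three steps on a toy example in Python: ingest nested JSON events, flatten them with pandas to extract key fields, then store the result in a local SQLite database standing in for a data repository; the event shape and table name are hypothetical.

# A minimal end-to-end sketch of ingestion, transformation and storage.
import sqlite3
import pandas as pd

# 1. Ingestion: raw nested events as they might arrive from a producer,
#    with a basic validation check applied on the way in.
raw_events = [
    {"id": 1, "user": {"id": 42, "country": "DE"}, "order": {"total": 19.99}},
    {"id": 2, "user": {"id": 7, "country": "US"}, "order": {"total": 5.00}},
]
assert all("id" in e and "order" in e for e in raw_events)

# 2. Transformation: unroll the nested JSON into flat columns for analysis.
df = pd.json_normalize(raw_events)
df = df.rename(columns={"user.country": "country", "order.total": "total"})

# 3. Storage: load into the destination repository for downstream consumers.
with sqlite3.connect("analytics.db") as conn:
    df[["id", "country", "total"]].to_sql("orders", conn, if_exists="replace", index=False)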

Data pipeline vs. ETL pipeline


You might find that some terms, such as data pipeline and ETL pipeline, are used
interchangeably in conversation. However, you should think about an ETL pipeline as a
subcategory of data pipelines. The two types of pipelines are distinguished by three key
features:
– ETL pipelines follow a specific sequence. As the abbreviation implies, they extract
data, transform data, and then load and store data in a data repository. Not all data
pipelines need to follow this sequence. In fact, ELT (extract, load, transform)
pipelines have become more popular with the advent of cloud-native tools where
data can be generated and stored across multiple sources and platforms. While data
ingestion still occurs first with this type of pipeline, any transformations are applied
after the data has been loaded into the cloud-based data warehouse.

– ETL pipelines also tend to imply the use of batch processing, but as noted above, the scope of data pipelines is broader. They can also include stream processing.

– Finally, data pipelines as a whole do not necessarily need to undergo data transformations, as ETL pipelines do, although it’s rare to see a data pipeline that doesn’t use transformations to facilitate data analysis. (A minimal sketch contrasting the ETL and ELT orderings follows this list.)
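
To make the ordering difference concrete, here is a minimal ELT-style sketch in Python: the raw records are landed in a stand-in warehouse (SQLite) first, and the transformation then runs as SQL inside the warehouse; table and column names are hypothetical.

# A minimal ELT sketch: load raw data first, transform afterwards in the warehouse.
import sqlite3

raw_rows = [("EU", 10.5), ("EU", 0.0), ("US", 25.0)]

with sqlite3.connect(":memory:") as conn:
    # Load: land the raw data as-is.
    conn.execute("CREATE TABLE raw_orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)

    # Transform: applied after loading, expressed as SQL in the warehouse.
    conn.execute(
        """
        CREATE TABLE orders_by_region AS
        SELECT region, SUM(amount) AS total_amount
        FROM raw_orders
        WHERE amount > 0
        GROUP BY region
        """
    )
    print(conn.execute("SELECT * FROM orders_by_region").fetchall())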

Use cases of data pipelines


As big data continues to grow, data management becomes an ever-increasing priority. While data pipelines serve various functions, the following are common business applications:

– Exploratory data analysis: Data scientists use exploratory data analysis (EDA) to
analyze and investigate data sets and summarize their main characteristics, often
employing data visualization methods. It helps determine how best to manipulate
data sources to get the needed answers, making it easier for data scientists to
discover patterns, spot anomalies, test a hypothesis or check assumptions.

– Data visualizations: To represent data via common graphics, data visualizations such as charts, plots, infographics, and even animations can be created. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.

– Machine learning: A branch of artificial intelligence (AI) and computer science, machine learning focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. Through the use of statistical methods, algorithms are trained to make classifications or predictions, uncovering key insights within data mining projects.

– Data observability: To verify the accuracy and safety of the data being used, data
observability applies a variety of tools for monitoring, tracking and alerting for both
expected events and anomalies.

IBM solutions

IBM DataStage

IBM® DataStage® is an industry-leading data integration tool that helps design, develop
and run jobs that move and transform data.

Explore IBM DataStage

IBM Data Replication

IBM Data Replication is data synchronization software that keeps multiple data stores in sync in near real time. IBM Data Replication is a low-impact solution, tracking only the data changes captured by the log.

Explore IBM Data Replication


IBM Databand

IBM® Databand® is observability software for data pipelines and warehouses that automatically collects metadata to build historical baselines, detect anomalies, triage alerts, and monitor the health and reliability of Apache Airflow directed acyclic graphs (DAGs).

Explore IBM Databand

IBM watsonx.data

IBM® watsonx.data™ is a fit-for-purpose data store built on an open-data lakehouse architecture to scale analytics and AI workloads, for all your data, anywhere.

Explore IBM watsonx.data

Resources

Create a strong data foundation for AI

Read the smartpaper on how to create a robust data foundation for AI by focusing on three key data management areas: access, governance, and privacy and compliance.

State Bank of India

Learn how the State Bank of India used several IBM solutions, along with IBM Garage™ methodology, to develop a comprehensive online banking platform.

Take the next step

IBM DataStage is an industry-leading data integration tool that helps you design,
develop and run jobs that move and transform data. At its core, DataStage supports
extract, transform and load (ETL) and extract, load and transform (ELT) patterns.

Explore DataStage

Try for free
