What Is A Data Pipeline - IBM
Before data flows into a data repository, it usually undergoes some data processing. This
includes data transformations, such as filtering, masking, and aggregations, which
ensure appropriate data integration and standardization. This is particularly important
when the destination for the dataset is a relational database. This type of data
repository has a defined schema which requires alignment—that is, matching data
columns and types—to update existing data with new data.
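As an illustration of those transformations, the following sketch (Python with pandas; the column names and values are made up for this example) applies filtering, masking, type alignment, and aggregation to a small raw dataset. It is a minimal sketch, not a production pipeline.

    import pandas as pd

    # Hypothetical raw extract; column names and values are illustrative only.
    raw = pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "email": ["a@example.com", "b@example.com", None, "d@example.com"],
        "region": ["EU", "EU", "US", "US"],
        "amount": ["10.50", "20.00", "5.25", "7.75"],
    })

    # Filtering: drop rows with missing contact information.
    clean = raw.dropna(subset=["email"]).copy()

    # Masking: hide the local part of the email address.
    clean["email"] = clean["email"].str.replace(r"^[^@]+", "***", regex=True)

    # Schema alignment: cast the column to the type the destination table expects.
    clean["amount"] = clean["amount"].astype(float)

    # Aggregation: summarize amounts by region before loading.
    summary = clean.groupby("region", as_index=False)["amount"].sum()
    print(summary)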
As the name suggests, data pipelines act as the “piping” for data science projects or
business intelligence dashboards. Data can be sourced from a wide variety of places, such as
APIs, SQL and NoSQL databases, and files, but unfortunately that data usually isn't ready for
immediate use. During sourcing, data lineage is tracked to document the relationship between
enterprise data in various business and IT applications, for example, where the data currently
resides and how it is stored across environments, such as on premises, in a data lake, or in a
data warehouse.
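To make the sourcing step concrete, here is a minimal Python sketch that pulls records from a hypothetical REST endpoint and from a local SQLite table; the URL, database, and column names are placeholders rather than part of any specific product.

    import sqlite3

    import requests  # third-party HTTP client

    # Source 1: a REST API (hypothetical endpoint that returns JSON records).
    api_rows = requests.get("https://example.com/api/orders", timeout=10).json()

    # Source 2: a SQL database (SQLite here for simplicity; table name is made up).
    conn = sqlite3.connect("sales.db")
    db_rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
    conn.close()

    # Later steps would normalize both sources into a common structure.
    print(len(api_rows), "rows from the API,", len(db_rows), "rows from the database")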
Data preparation tasks usually fall on the shoulders of data scientists or data engineers,
who structure the data to meet the needs of the business use cases and handle huge
amounts of data. The type of data processing that a data pipeline requires is usually
determined through a mix of exploratory data analysis and defined business
requirements. Once the data has been appropriately filtered, merged, and summarized,
it can then be stored and surfaced for use. Well-organized data pipelines provide the
foundation for a range of data projects; this can include exploratory data analyses, data
visualizations, and machine learning tasks.
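As a small illustration of the merge-and-summarize part of data preparation, the following pandas sketch joins two made-up datasets and produces a summary table; the names and columns are assumptions for the example.

    import pandas as pd

    # Two hypothetical sources that share a customer_id key.
    orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 15.0, 7.5]})
    customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["retail", "wholesale"]})

    # Merge the sources, then summarize revenue by customer segment.
    merged = orders.merge(customers, on="customer_id", how="left")
    summary = merged.groupby("segment", as_index=False)["amount"].sum()
    print(summary)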
Batch processing
The development of batch processing was a critical step in building data infrastructures
that were reliable and scalable. In 2004, MapReduce, a batch processing algorithm, was
patented and then subsequently integrated into open-source systems, such as Hadoop,
CouchDB and MongoDB.
As the name implies, batch processing loads “batches” of data into a repository during
set time intervals, which are typically scheduled during off-peak business hours. This
way, other workloads aren’t impacted as batch processing jobs tend to work with large
volumes of data, which can tax the overall system. Batch processing is usually the
optimal data pipeline when there isn’t an immediate need to analyze a specific dataset
(for example, monthly accounting), and it is more associated with the ETL data
integration process, which stands for “extract, transform, and load.”
Batch processing jobs form a workflow of sequenced commands, where the output of
one command becomes the input of the next command. For example, one command
might kick off data ingestion, the next command may trigger filtering of specific
columns, and the subsequent command may handle aggregation. This series of
commands will continue until the data is completely transformed and written into a data
repository.
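A minimal Python sketch of such a workflow is shown below: each step is a function whose output feeds the next, and the whole chain could be scheduled during off-peak hours by cron or a workflow orchestrator. The function names and sample data are illustrative, not drawn from any specific product.

    import pandas as pd

    def ingest() -> pd.DataFrame:
        # In practice this would read from files, APIs, or databases.
        return pd.DataFrame({"region": ["EU", "EU", "US"], "amount": [10.0, 20.0, 5.0]})

    def filter_columns(df: pd.DataFrame) -> pd.DataFrame:
        # Keep only the columns the downstream steps need.
        return df[["region", "amount"]]

    def aggregate(df: pd.DataFrame) -> pd.DataFrame:
        # Summarize before writing to the repository.
        return df.groupby("region", as_index=False)["amount"].sum()

    def run_batch_job() -> None:
        # The output of each command becomes the input of the next.
        result = aggregate(filter_columns(ingest()))
        result.to_csv("daily_summary.csv", index=False)  # stand-in for the data repository

    if __name__ == "__main__":
        run_batch_job()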
Streaming data
Streaming data is used when data needs to be continuously updated. For
example, apps or point-of-sale systems need real-time data to update inventory and
sales history of their products; that way, sellers can inform consumers if a product is in
stock or not. A single action, such as a product sale, is considered an “event,” and
related events, such as adding an item to checkout, are typically grouped together as a
“topic” or “stream.” These events are then transported via messaging systems or
message brokers, such as the open-source offering, Apache Kafka.
Since data events are processed shortly after occurring, streaming processing systems
have lower latency than batch systems, but they aren't considered as reliable as batch
processing systems because messages can be unintentionally dropped or spend a long time in
the queue. Message brokers help to address this concern through acknowledgements,
where a consumer confirms processing of the message to the broker to remove it from
the queue.
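As a minimal sketch of that acknowledgement pattern, the snippet below uses the kafka-python client with auto-commit disabled, so the consumer commits (acknowledges) an offset only after it has processed the message. The broker address, topic name, and handler are assumptions for the example.

    import json

    from kafka import KafkaConsumer  # kafka-python client

    def handle_event(event: dict) -> None:
        # Stand-in for real processing, e.g. updating inventory counts.
        print("processed event:", event)

    consumer = KafkaConsumer(
        "product-sales",                     # hypothetical topic of related events
        bootstrap_servers="localhost:9092",  # assumed local broker
        group_id="inventory-service",
        enable_auto_commit=False,            # we acknowledge manually
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        handle_event(message.value)
        consumer.commit()  # acknowledgement: tell the broker the message was processed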
Data integration pipelines concentrate on merging data from multiple sources into a
single unified view. These pipelines often involve extract, transform and load (ETL)
processes that clean, enrich, or otherwise modify raw data before storing it in a
centralized repository such as a data warehouse or data lake. Data integration pipelines
are essential for handling disparate systems that generate incompatible formats or
structures. For example, a connection can be added to Amazon S3 (Amazon Simple Storage
Service), an object storage service offered by Amazon Web Services (AWS) through a web
service interface.
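In practice, such a connection is often made through the AWS SDK; the Python sketch below uses boto3 to read one object from S3. The bucket and key names are placeholders, not part of the original example.

    import boto3  # AWS SDK for Python

    # Credentials are taken from the environment or AWS configuration files.
    s3 = boto3.client("s3")

    # Bucket and key are hypothetical; replace with real names.
    response = s3.get_object(Bucket="example-raw-data", Key="orders/2024/01/orders.csv")
    body = response["Body"].read().decode("utf-8")
    print(body[:200])  # first few characters of the ingested file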
2. Data transformation: During this step, a series of jobs is executed to process data
into the format required by the destination data repository. These jobs embed
automation and governance for repetitive workstreams, such as business reporting,
ensuring that data is cleansed and transformed consistently. For example, a data stream
may come in a nested JSON format, and the data transformation stage will aim to unroll
that JSON to extract the key fields for analysis.
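A minimal sketch of that unrolling step, assuming a made-up nested JSON event, could use pandas.json_normalize to flatten the structure into columns:

    import pandas as pd

    # Hypothetical nested event as it might arrive from a stream.
    event = {
        "order_id": 42,
        "customer": {"id": 7, "country": "DE"},
        "items": [
            {"sku": "A-1", "qty": 2},
            {"sku": "B-9", "qty": 1},
        ],
    }

    # Unroll the nested structure: one row per item, keeping key order fields.
    flat = pd.json_normalize(
        event,
        record_path="items",
        meta=["order_id", ["customer", "country"]],
    )
    print(flat)
    # Resulting columns: sku, qty, order_id, customer.country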
3. Data storage: The transformed data is then stored within a data repository, where it
can be exposed to various stakeholders. Within streaming data, these stakeholders are
typically known as consumers, subscribers, or recipients.
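As a small example of the storage step, the sketch below writes a transformed pandas DataFrame into a SQLite table that stands in for a data warehouse; the table and database names are illustrative.

    import sqlite3

    import pandas as pd

    transformed = pd.DataFrame({"region": ["EU", "US"], "total_amount": [30.0, 5.0]})

    # SQLite stands in for the destination repository.
    conn = sqlite3.connect("warehouse.db")
    transformed.to_sql("sales_summary", conn, if_exists="replace", index=False)

    # Downstream consumers query the stored, transformed data.
    print(conn.execute("SELECT * FROM sales_summary").fetchall())
    conn.close()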
Data pipelines are often compared with ETL pipelines; the two are related but not identical:
– ETL pipelines also tend to imply the use of batch processing, but as noted above, the
scope of data pipelines is broader. They can also include stream processing.
– Finally, data pipelines as a whole do not necessarily need to undergo data
transformations the way ETL pipelines do, although it is rare to see a data pipeline that
doesn't utilize transformations to facilitate data analysis.
– Exploratory data analysis: Data scientists use exploratory data analysis (EDA) to
analyze and investigate data sets and summarize their main characteristics, often
employing data visualization methods. It helps determine how best to manipulate
data sources to get the needed answers, making it easier for data scientists to
discover patterns, spot anomalies, test a hypothesis or check assumptions.
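To illustrate the exploratory data analysis described above, a few pandas one-liners often go a long way; the dataset below is made up.

    import pandas as pd

    df = pd.DataFrame({
        "region": ["EU", "EU", "US", "US", None],
        "amount": [10.0, 20.0, 5.0, 7.5, 400.0],
    })

    print(df.describe())          # summary statistics for numeric columns
    print(df.isna().sum())        # missing values per column
    print(df["region"].value_counts(dropna=False))  # category distribution
    # A value such as 400.0, far from the rest, hints at a possible anomaly.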
IBM solutions
IBM DataStage
IBM® DataStage® is an industry-leading data integration tool that helps you design, develop
and run jobs that move and transform data. At its core, DataStage supports extract, transform
and load (ETL) and extract, load and transform (ELT) patterns.
IBM Data Replication
IBM Data Replication is data synchronization software that keeps multiple data stores in
sync in near real time. IBM Data Replication is a low-impact solution, tracking only the data
changes captured by the log.
IBM Databand
IBM® Databand® is observability software for data pipelines and warehouses that
automatically collects metadata to build historical baselines, detect anomalies, triage
alerts, and monitor the health and reliability of Apache Airflow directed acyclic graphs
(DAGs).