What Is A Data Pipeline - IBM
Before data flows into a data repository, it usually undergoes some data processing. This
includes data transformations, such as filtering, masking, and aggregations, which
ensure appropriate data integration and standardization. This is particularly important
when the destination for the dataset is a relational database. This type of data
repository has a defined schema which requires alignment—that is, matching data
columns and types—to update existing data with new data.
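As an illustration of those transformations, the following sketch (Python with pandas; the column names and values are made up for this example) applies filtering, masking, type alignment, and aggregation to a small raw dataset. It is a minimal sketch, not a production pipeline.

    import pandas as pd

    # Hypothetical raw extract; column names and values are illustrative only.
    raw = pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "email": ["a@example.com", "b@example.com", None, "d@example.com"],
        "region": ["EU", "EU", "US", "US"],
        "amount": ["10.50", "20.00", "5.25", "7.75"],
    })

    # Filtering: drop rows with missing contact information.
    clean = raw.dropna(subset=["email"]).copy()

    # Masking: hide the local part of the email address.
    clean["email"] = clean["email"].str.replace(r"^[^@]+", "***", regex=True)

    # Schema alignment: cast the column to the type the destination table expects.
    clean["amount"] = clean["amount"].astype(float)

    # Aggregation: summarize amounts by region before loading.
    summary = clean.groupby("region", as_index=False)["amount"].sum()
    print(summary)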
As the name suggests, data pipelines act as the “piping” for data science projects or
business intelligence dashboards. Data can be sourced from a wide variety of places, such as
APIs, SQL and NoSQL databases, and files, but unfortunately that data usually isn't ready for
immediate use. During sourcing, data lineage is tracked to document the relationship between
enterprise data in various business and IT applications, for example, where the data currently
resides and how it is stored across environments, such as on premises, in a data lake, or in a
data warehouse.
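To make the sourcing step concrete, here is a minimal Python sketch that pulls records from a hypothetical REST endpoint and from a local SQLite table; the URL, database, and column names are placeholders rather than part of any specific product.

    import sqlite3

    import requests  # third-party HTTP client

    # Source 1: a REST API (hypothetical endpoint that returns JSON records).
    api_rows = requests.get("https://example.com/api/orders", timeout=10).json()

    # Source 2: a SQL database (SQLite here for simplicity; table name is made up).
    conn = sqlite3.connect("sales.db")
    db_rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
    conn.close()

    # Later steps would normalize both sources into a common structure.
    print(len(api_rows), "rows from the API,", len(db_rows), "rows from the database")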
Data preparation tasks usually fall on the shoulders of data scientists or data engineers,
who structure the data to meet the needs of the business use cases and handle huge
amounts of data. The type of data processing that a data pipeline requires is usually
determined through a mix of exploratory data analysis and defined business
requirements. Once the data has been appropriately filtered, merged, and summarized,
it can then be stored and surfaced for use. Well-organized data pipelines provide the
foundation for a range of data projects; this can include exploratory data analyses, data
visualizations, and machine learning tasks.
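As a small illustration of the merge-and-summarize part of data preparation, the following pandas sketch joins two made-up datasets and produces a summary table; the names and columns are assumptions for the example.

    import pandas as pd

    # Two hypothetical sources that share a customer_id key.
    orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 15.0, 7.5]})
    customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["retail", "wholesale"]})

    # Merge the sources, then summarize revenue by customer segment.
    merged = orders.merge(customers, on="customer_id", how="left")
    summary = merged.groupby("segment", as_index=False)["amount"].sum()
    print(summary)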
Batch processing
The development of batch processing was a critical step in building data infrastructures
that were reliable and scalable. In 2004, MapReduce, a batch processing algorithm, was
patented and then subsequently integrated into open-source systems, such as Hadoop,
CouchDB and MongoDB.
As the name implies, batch processing loads “batches” of data into a repository during
set time intervals, which are typically scheduled during off-peak business hours. This
way, other workloads aren’t impacted as batch processing jobs tend to work with large
volumes of data, which can tax the overall system. Batch processing is usually the
optimal data pipeline when there isn’t an immediate need to analyze a specific dataset
(for example, monthly accounting), and it is more associated with the ETL data
integration process, which stands for “extract, transform, and load.”
Batch processing jobs form a workflow of sequenced commands, where the output of
one command becomes the input of the next command. For example, one command
might kick off data ingestion, the next command may trigger filtering of specific
columns, and the subsequent command may handle aggregation. This series of
commands will continue until the data is completely transformed and written into a data
repository.
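A minimal Python sketch of such a workflow is shown below: each step is a function whose output feeds the next, and the whole chain could be scheduled during off-peak hours by cron or a workflow orchestrator. The function names and sample data are illustrative, not drawn from any specific product.

    import pandas as pd

    def ingest() -> pd.DataFrame:
        # In practice this would read from files, APIs, or databases.
        return pd.DataFrame({"region": ["EU", "EU", "US"], "amount": [10.0, 20.0, 5.0]})

    def filter_columns(df: pd.DataFrame) -> pd.DataFrame:
        # Keep only the columns the downstream steps need.
        return df[["region", "amount"]]

    def aggregate(df: pd.DataFrame) -> pd.DataFrame:
        # Summarize before writing to the repository.
        return df.groupby("region", as_index=False)["amount"].sum()

    def run_batch_job() -> None:
        # The output of each command becomes the input of the next.
        result = aggregate(filter_columns(ingest()))
        result.to_csv("daily_summary.csv", index=False)  # stand-in for the data repository

    if __name__ == "__main__":
        run_batch_job()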
Streaming data
Streaming data is used when data needs to be continuously updated. For
example, apps or point-of-sale systems need real-time data to update inventory and
sales history of their products; that way, sellers can inform consumers if a product is in
stock or not. A single action, such as a product sale, is considered an “event,” and
related events, such as adding an item to checkout, are typically grouped together as a
“topic” or “stream.” These events are then transported via messaging systems or
message brokers, such as the open-source offering, Apache Kafka.
Since data events are processed shortly after occurring, streaming processing systems
have lower latency than batch systems, but they aren't considered as reliable as batch
processing systems because messages can be unintentionally dropped or spend a long time in
the queue. Message brokers help to address this concern through acknowledgements,
where a consumer confirms processing of the message to the broker to remove it from
the queue.
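As a minimal sketch of that acknowledgement pattern, the snippet below uses the kafka-python client with auto-commit disabled, so the consumer commits (acknowledges) an offset only after it has processed the message. The broker address, topic name, and handler are assumptions for the example.

    import json

    from kafka import KafkaConsumer  # kafka-python client

    def handle_event(event: dict) -> None:
        # Stand-in for real processing, e.g. updating inventory counts.
        print("processed event:", event)

    consumer = KafkaConsumer(
        "product-sales",                     # hypothetical topic of related events
        bootstrap_servers="localhost:9092",  # assumed local broker
        group_id="inventory-service",
        enable_auto_commit=False,            # we acknowledge manually
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        handle_event(message.value)
        consumer.commit()  # acknowledgement: tell the broker the message was processed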
Data integration pipelines concentrate on merging data from multiple sources into a
single unified view. These pipelines often involve extract, transform and load (ETL)
processes that clean, enrich, or otherwise modify raw data before storing it in a
centralized repository such as a data warehouse or data lake. Data integration pipelines
are essential for handling disparate systems that generate incompatible formats or
structures. For example, a connection can be added to Amazon S3 (Amazon Simple Storage
Service), an object storage service offered by Amazon Web Services (AWS) through a web
service interface.
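In practice, such a connection is often made through the AWS SDK; the Python sketch below uses boto3 to read one object from S3. The bucket and key names are placeholders, not part of the original example.

    import boto3  # AWS SDK for Python

    # Credentials are taken from the environment or AWS configuration files.
    s3 = boto3.client("s3")

    # Bucket and key are hypothetical; replace with real names.
    response = s3.get_object(Bucket="example-raw-data", Key="orders/2024/01/orders.csv")
    body = response["Body"].read().decode("utf-8")
    print(body[:200])  # first few characters of the ingested file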
2. Data transformation: During this step, a series of jobs is executed to process data
into the format required by the destination data repository. These jobs embed
automation and governance for repetitive workstreams, such as business reporting,
ensuring that data is cleansed and transformed consistently. For example, a data stream
may come in a nested JSON format, and the data transformation stage will aim to unroll
that JSON to extract the key fields for analysis.
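A minimal sketch of that unrolling step, assuming a made-up nested JSON event, could use pandas.json_normalize to flatten the structure into columns:

    import pandas as pd

    # Hypothetical nested event as it might arrive from a stream.
    event = {
        "order_id": 42,
        "customer": {"id": 7, "country": "DE"},
        "items": [
            {"sku": "A-1", "qty": 2},
            {"sku": "B-9", "qty": 1},
        ],
    }

    # Unroll the nested structure: one row per item, keeping key order fields.
    flat = pd.json_normalize(
        event,
        record_path="items",
        meta=["order_id", ["customer", "country"]],
    )
    print(flat)
    # Resulting columns: sku, qty, order_id, customer.country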
3. Data storage: The transformed data is then stored within a data repository, where it
can be exposed to various stakeholders. Within streaming data, these stakeholders are
typically known as consumers, subscribers, or recipients.
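As a small example of the storage step, the sketch below writes a transformed pandas DataFrame into a SQLite table that stands in for a data warehouse; the table and database names are illustrative.

    import sqlite3

    import pandas as pd

    transformed = pd.DataFrame({"region": ["EU", "US"], "total_amount": [30.0, 5.0]})

    # SQLite stands in for the destination repository.
    conn = sqlite3.connect("warehouse.db")
    transformed.to_sql("sales_summary", conn, if_exists="replace", index=False)

    # Downstream consumers query the stored, transformed data.
    print(conn.execute("SELECT * FROM sales_summary").fetchall())
    conn.close()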
Data pipelines are often compared with ETL pipelines; the two are related but not identical:
– ETL pipelines also tend to imply the use of batch processing, but as noted above, the
scope of data pipelines is broader. They can also include stream processing.
– Finally, data pipelines as a whole do not necessarily need to undergo data
transformations the way ETL pipelines do, although it is rare to see a data pipeline that
doesn't utilize transformations to facilitate data analysis.
– Exploratory data analysis: Data scientists use exploratory data analysis (EDA) to
analyze and investigate data sets and summarize their main characteristics, often
employing data visualization methods. It helps determine how best to manipulate
data sources to get the needed answers, making it easier for data scientists to
discover patterns, spot anomalies, test a hypothesis or check assumptions.
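To illustrate the exploratory data analysis described above, a few pandas one-liners often go a long way; the dataset below is made up.

    import pandas as pd

    df = pd.DataFrame({
        "region": ["EU", "EU", "US", "US", None],
        "amount": [10.0, 20.0, 5.0, 7.5, 400.0],
    })

    print(df.describe())          # summary statistics for numeric columns
    print(df.isna().sum())        # missing values per column
    print(df["region"].value_counts(dropna=False))  # category distribution
    # A value such as 400.0, far from the rest, hints at a possible anomaly.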
IBM solutions
IBM DataStage
IBM® DataStage® is an industry-leading data integration tool that helps you design, develop
and run jobs that move and transform data. At its core, DataStage supports extract, transform
and load (ETL) and extract, load and transform (ELT) patterns.
IBM Data Replication
IBM Data Replication is data synchronization software that keeps multiple data stores in
sync in near real time. IBM Data Replication is a low-impact solution, tracking only the data
changes captured by the log.
IBM Databand
IBM® Databand® is observability software for data pipelines and warehouses that
automatically collects metadata to build historical baselines, detect anomalies, triage
alerts, and monitor the health and reliability of Apache Airflow directed acyclic graphs
(DAGs).