What Is ETL?

ETL is used to integrate data from multiple sources and transform it before loading it into a target data warehouse. It became popular in the 1970s as organizations began using multiple databases.

ETL stands for Extract, Transform, Load. It is a process that extracts data from different source systems, transforms the data (e.g. through calculations, concatenations), and loads the data into a data warehouse.

There are several reasons to use ETL, including analyzing business data to make critical decisions, providing a common data repository, allowing data verification and transformation, and migrating data into a data warehouse.

ETL History

ETL gained popularity in the 1970s when organizations began using multiple data
repositories, or databases, to store different types of business information. The need to
integrate data that was spread across these databases grew quickly. ETL became the
standard method for taking data from disparate sources and transforming it before
loading it into a target system, or destination.

In the late 1980s and early 1990s, data warehouses came onto the scene. A distinct
type of database, data warehouses provided integrated access to data from multiple
systems – mainframe computers, minicomputers, personal computers and
spreadsheets. But different departments often chose different ETL tools to use with
different data warehouses.

What is ETL?
ETL is an abbreviation of Extract, Transform and Load. In this process, an
ETL tool extracts data from different RDBMS source systems, transforms it
(for example by applying calculations or concatenations), and then loads it
into the Data Warehouse system.
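The three stages can be sketched as three small functions chained together. This is only an illustrative sketch: the in-memory source list, the sales_fact table and the function names are assumptions made for the example, and SQLite stands in for a real Data Warehouse.

```python
import sqlite3

def extract(source_rows):
    """Extract: read raw records from a source (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Transform: apply a calculation (quantity * unit_price) to each record."""
    return [(r["order_id"], r["quantity"] * r["unit_price"]) for r in rows]

def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact (order_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    conn.commit()

# Hypothetical source data standing in for an RDBMS source system.
source = [
    {"order_id": 1, "quantity": 2, "unit_price": 10.0},
    {"order_id": 2, "quantity": 1, "unit_price": 5.5},
]

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))

result = warehouse.execute(
    "SELECT order_id, amount FROM sales_fact ORDER BY order_id"
).fetchall()
print(result)
```

Keeping the stages as separate functions makes it easier to test and rerun each one on its own.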

It's tempting to think that creating a Data warehouse is simply a matter of
extracting data from multiple sources and loading it into the Data warehouse
database. This is far from the truth: it requires a complex ETL process. The
ETL process is technically challenging and requires active input from various
stakeholders, including developers, analysts, testers and top executives.

In order to maintain its value as a tool for decision-makers, a Data warehouse
system needs to change as the business changes. ETL is a recurring activity
(daily, weekly, monthly) of a Data warehouse system and needs to be agile,
automated, and well documented.

Why do you need ETL?


There are many reasons for adopting ETL in an organization:
• It helps companies analyze their business data to make critical
business decisions.
• Transactional databases cannot answer complex business questions
that can be answered with ETL.
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving data from various sources into a
data warehouse.
• As data sources change, recurring ETL runs keep the Data Warehouse up
to date.
• A well-designed and documented ETL system is almost essential to the
success of a Data Warehouse project.
• It allows verification of data transformation, aggregation and
calculation rules.
• The ETL process allows sample data comparison between the source and
the target system.
• The ETL process can perform complex transformations, which requires an
extra area (the staging area) to store the data.
• ETL helps to migrate data into a Data Warehouse, converting the
various formats and types to adhere to one consistent system.
• ETL is a predefined process for accessing and manipulating source
data into the target database.
• ETL offers deep historical context for the business.
• It helps to improve productivity because it codifies and reuses
processes without a need for technical skills.

Step 1) Extraction in Data Warehouses
In this step, data is extracted from the source system into the staging area.
Transformations, if any, are done in the staging area so that the performance
of the source system is not degraded. Also, if corrupted data is copied
directly from the source into the Data warehouse database, rollback will be a
challenge. The staging area gives an opportunity to validate extracted data
before it moves into the Data warehouse.
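As a rough sketch of the staging idea, the example below lands raw rows in a staging table, validates them there, and moves only the valid rows into the warehouse table. The table names (staging_orders, dw_orders), the sample rows and the is_valid check are assumptions made for illustration, with SQLite standing in for both databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, amount TEXT);
    CREATE TABLE dw_orders (order_id INTEGER PRIMARY KEY, amount REAL);
""")

# The raw extract lands in staging first; the second row is corrupted on purpose.
extracted = [(1, "19.99"), (2, "N/A"), (3, "5.00")]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", extracted)

def is_valid(amount):
    """Return True if the staged amount can be parsed as a number."""
    try:
        float(amount)
        return True
    except ValueError:
        return False

# Validate in staging, so bad rows never reach the warehouse table.
staged = conn.execute("SELECT order_id, amount FROM staging_orders").fetchall()
good = [(oid, float(amt)) for oid, amt in staged if is_valid(amt)]
rejected = [oid for oid, amt in staged if not is_valid(amt)]

conn.executemany("INSERT INTO dw_orders VALUES (?, ?)", good)
conn.commit()

loaded = conn.execute(
    "SELECT order_id, amount FROM dw_orders ORDER BY order_id"
).fetchall()
print(loaded, rejected)
```

Because the corrupted row is caught in staging, nothing has to be rolled back in the warehouse table.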

A Data warehouse needs to integrate systems that have different DBMS,
hardware, operating systems and communication protocols. Sources could
include legacy applications like mainframes, customized applications,
point-of-contact devices like ATMs and call switches, text files,
spreadsheets, ERP, and data from vendors and partners, amongst others.

Hence one needs a logical data map before data is extracted and loaded
physically. This data map describes the relationship between source and
target data.
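One possible way to picture a logical data map is as a list of source-to-target mappings, each with an optional conversion rule. Everything named here (the Mapping class, the crm and erp systems, the field names) is hypothetical and only meant to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Mapping:
    """One entry of the data map: where a target column comes from."""
    source_system: str
    source_field: str
    target_column: str
    rule: Callable[[str], object] = str  # conversion applied on the way in

# Hypothetical data map relating two source systems to one target row.
DATA_MAP = [
    Mapping("crm", "CUST_NM", "customer_name", str.strip),
    Mapping("erp", "ORD_AMT", "order_amount", float),
    Mapping("erp", "ORD_DT", "order_date"),
]

def apply_map(records_by_system):
    """Build one target row by following the data map."""
    row = {}
    for m in DATA_MAP:
        raw = records_by_system[m.source_system][m.source_field]
        row[m.target_column] = m.rule(raw)
    return row

target_row = apply_map({
    "crm": {"CUST_NM": "  Acme Corp "},
    "erp": {"ORD_AMT": "120.50", "ORD_DT": "2024-01-15"},
})
print(target_row)
```

Writing the map down as data, rather than burying it in code, also gives analysts and testers something they can review.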

Three Data Extraction methods:

1. Full Extraction
2. Partial Extraction, without update notification
3. Partial Extraction, with update notification

Irrespective of the method used, extraction should not affect the performance
and response time of the source systems. These source systems are live
production databases. Any slowdown or locking could affect the company's
bottom line.
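The three extraction methods can be contrasted in a few lines. This is a sketch under assumptions: the updated_at column, the last_run watermark and the changed_ids set are not part of the original text. Here, "without update notification" is read as the ETL job finding changes itself by comparing timestamps, and "with update notification" as the source system reporting which records changed.

```python
from datetime import datetime

# Hypothetical source table with a last-modified timestamp per record.
SOURCE = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]

def full_extraction(source):
    """Full extraction: copy every record, every time."""
    return list(source)

def partial_without_notification(source, last_run):
    """No notification: the ETL job detects changes itself, here by timestamp."""
    return [r for r in source if r["updated_at"] > last_run]

def partial_with_notification(source, changed_ids):
    """With notification: the source reports which record ids changed."""
    return [r for r in source if r["id"] in changed_ids]

full_ids = [r["id"] for r in full_extraction(SOURCE)]
since_ids = [
    r["id"] for r in partial_without_notification(SOURCE, datetime(2024, 1, 4))
]
notified_ids = [r["id"] for r in partial_with_notification(SOURCE, {3})]
print(full_ids, since_ids, notified_ids)
```

Both partial methods read fewer records than a full extraction, which helps keep the load on the live source system low.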

Some validations are done during Extraction:

• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Check data types
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place
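A few of these checks can be sketched in one pass over the extracted rows. The function name validate_extract, the report fields and the sample rows are assumptions for illustration. Reconciliation is reduced to a simple record-count comparison.

```python
def validate_extract(source_count, rows, key, types):
    """Run basic extraction checks and return (clean_rows, report)."""
    report = {
        "reconciled": source_count == len(rows),  # record counts match the source
        "missing_key": 0,
        "bad_type": 0,
        "duplicates": 0,
    }
    seen = set()
    clean = []
    for row in rows:
        if row.get(key) is None:
            report["missing_key"] += 1          # key check
        elif any(not isinstance(row.get(col), typ) for col, typ in types.items()):
            report["bad_type"] += 1             # data type check
        elif row[key] in seen:
            report["duplicates"] += 1           # duplicate removal
        else:
            seen.add(row[key])
            clean.append(row)
    return clean, report

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},    # duplicate of the first row
    {"id": None, "amount": 3.0},  # missing key
    {"id": 4, "amount": "oops"},  # wrong data type
]
clean, report = validate_extract(4, rows, "id", {"id": int, "amount": float})
print(clean)
print(report)
```

A real job would usually log or quarantine the rejected rows rather than only counting them.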

Step 2) Transformation
Data extracted from the source server is raw and not usable in its original
form. Therefore it needs to be cleansed, mapped and transformed. In fact, this
is the key step where the ETL process adds value and changes data so that
insightful BI reports can be generated.

In this step, you apply a set of functions to the extracted data. Data that
does not require any transformation is called direct move or pass-through data.

In the transformation step, you can perform customized operations on data. For
instance, the user may want a sum-of-sales revenue figure that is not in the
database, or the first name and the last name in a table may be in different
columns. It is possible to calculate the sum, or concatenate the names, before
loading.
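The two examples above, plus a pass-through column, might look like this in code. The column names (first_name, last_name, region, sale) and the sample rows are made up for the sketch.

```python
from collections import defaultdict

# Hypothetical extracted rows: names in separate columns, one row per sale.
extracted = [
    {"first_name": "Ada", "last_name": "Lovelace", "region": "EU", "sale": 100.0},
    {"first_name": "Ada", "last_name": "Lovelace", "region": "EU", "sale": 50.0},
    {"first_name": "Alan", "last_name": "Turing", "region": "UK", "sale": 75.0},
]

def transform(rows):
    """Concatenate names, sum sales per customer, pass region through unchanged."""
    totals = defaultdict(float)
    regions = {}
    for r in rows:
        full_name = f"{r['first_name']} {r['last_name']}"  # concatenation
        totals[full_name] += r["sale"]                     # sum-of-sales
        regions[full_name] = r["region"]                   # pass-through data
    return [
        {"customer": name, "region": regions[name], "total_sales": totals[name]}
        for name in sorted(totals)
    ]

transformed = transform(extracted)
print(transformed)
```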

Step 3) Loading
Loading data into the target Data warehouse database is the last step of the
ETL process. In a typical Data warehouse, a huge volume of data needs to be
loaded in a relatively short period (usually at night). Hence, the load
process should be optimized for performance.

In case of load failure, recovery mechanisms should be configured to restart
from the point of failure without loss of data integrity. Data Warehouse
admins need to monitor, resume and cancel loads as per prevailing server
performance.
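One common way to restart from the point of failure is to load in batches and record a checkpoint in the same transaction as each batch. The sketch below assumes this approach. The fact and load_checkpoint tables, the function name and the fail_at_offset switch (used only to simulate a failure) are all hypothetical.

```python
import sqlite3

def load_with_checkpoint(conn, rows, batch_size, fail_at_offset=None):
    """Load rows in batches, resuming from the last committed checkpoint."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact (id INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_checkpoint (next_offset INTEGER)")
    saved = conn.execute("SELECT next_offset FROM load_checkpoint").fetchone()
    if saved is None:
        conn.execute("INSERT INTO load_checkpoint VALUES (0)")
        conn.commit()
    offset = saved[0] if saved else 0

    while offset < len(rows):
        batch = rows[offset:offset + batch_size]
        new_offset = offset + len(batch)
        try:
            # Batch and checkpoint commit together, or roll back together.
            with conn:
                if fail_at_offset == offset:
                    raise RuntimeError("simulated load failure")
                conn.executemany("INSERT INTO fact VALUES (?, ?)", batch)
                conn.execute("UPDATE load_checkpoint SET next_offset = ?", (new_offset,))
        except RuntimeError:
            return offset  # stop here; the next run resumes at this offset
        offset = new_offset
    return offset

rows = [(i, f"v{i}") for i in range(1, 7)]
conn = sqlite3.connect(":memory:")

first_run = load_with_checkpoint(conn, rows, batch_size=2, fail_at_offset=4)
second_run = load_with_checkpoint(conn, rows, batch_size=2)
row_count = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(first_run, second_run, row_count)
```

Because the checkpoint is updated in the same transaction as the batch, a failed batch leaves neither partial rows nor a misleading checkpoint behind.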

Types of Loading:

• Initial Load: populating all the Data Warehouse tables.
• Incremental Load: applying ongoing changes as and when needed,
periodically.
• Full Refresh: erasing the contents of one or more tables and
reloading with fresh data.
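The three load types can be illustrated against a single table. This is a minimal sketch using SQLite. The dim_product table and the helper names are assumptions, and the incremental load is implemented here as an upsert with SQLite's INSERT OR REPLACE, which is only one of several possible strategies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT)")

def initial_load(rows):
    """Initial load: populate the empty table for the first time."""
    with conn:
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)

def incremental_load(changes):
    """Incremental load: apply only new or changed rows (upsert by primary key)."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO dim_product VALUES (?, ?)", changes)

def full_refresh(rows):
    """Full refresh: erase the table contents and reload with fresh data."""
    with conn:
        conn.execute("DELETE FROM dim_product")
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)

def snapshot():
    return conn.execute("SELECT id, name FROM dim_product ORDER BY id").fetchall()

initial_load([(1, "Pen"), (2, "Ink")])
after_initial = snapshot()

incremental_load([(2, "Blue Ink"), (3, "Paper")])
after_incremental = snapshot()

full_refresh([(10, "Notebook")])
after_refresh = snapshot()
print(after_initial, after_incremental, after_refresh)
```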

ETL Tools for Data Warehouses


Designing and maintaining the ETL process is often considered one of the most difficult and
resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL
tools to manage this process. Oracle Warehouse Builder (OWB), for example, provides ETL
capabilities and takes advantage of inherent database abilities. Other data warehouse builders
create their own ETL tools and processes, either inside or outside the database.

Beyond extraction, transformation, and loading, some other tasks are important for a successful
ETL implementation as part of the daily operations of the data warehouse and its support for
further enhancements. In addition to supporting the design of a data warehouse and its data flow,
ETL tools such as OWB typically address these tasks.

Oracle is not an ETL tool and does not provide a complete solution for ETL. However, Oracle does
provide a rich set of capabilities that can be used by both ETL tools and customized ETL solutions.
Oracle offers techniques for transporting data between Oracle databases, for transforming large
volumes of data, and for quickly loading new data into a data warehouse.

ETL tools
There are many Data Warehousing tools available in the market. Here are
some of the most prominent ones:
1. MarkLogic:

MarkLogic is a data warehousing solution which makes data integration easier
and faster using an array of enterprise features. It can query different types
of data like documents, relationships, and metadata.

http://developer.marklogic.com/products

2. Oracle:

Oracle is the industry-leading database. It offers a wide range of Data
Warehouse solutions, both on-premises and in the cloud. It helps to
optimize customer experiences by increasing operational efficiency.

https://www.oracle.com/index.html

3. Amazon Redshift:

Amazon Redshift is a Data warehouse tool. It is a simple and cost-effective
tool to analyze all types of data using standard SQL and existing BI tools. It
also allows running complex queries against petabytes of structured data.

https://aws.amazon.com/redshift/?nc2=h_m1

