ETL - PPT v0.2
ETL Overview
ETL
Extraction
Transformation
Loading
Extraction Transformation Loading – ETL
ETL is short for extract, transform, load: three database functions combined into one tool to pull data out of a source and place it into a target.
Extract is the process of reading data from a source. In this stage, the data is collected, often from multiple sources of different types.
Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into the target. Transformation occurs by applying rules or lookup tables, or by combining the data with other data.
Load is the process of writing the data into the target.
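A minimal sketch of the three steps in plain Python, assuming a hypothetical sales.csv source with id, customer and amount columns and a SQLite target table (none of these names come from the slides):

import csv
import sqlite3

# Extract: read rows from the source file.
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: apply simple rules so the data fits the target form.
def transform(rows):
    return [(r["id"], r["customer"].strip().upper(), float(r["amount"])) for r in rows]

# Load: write the transformed rows into the target table.
def load(rows, db_path="target.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))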
Why ETL
ETL processes heterogeneous data and makes it homogeneous, which makes it seamless for data scientists and data analysts to analyze the data and derive business intelligence from it.
ETL tools can collect, read and migrate data from multiple data structures and across different databases and platforms, such as mainframes.
ETL technology can also identify "delta" changes as they occur, which enables ETL tools to copy only the changed data without performing full data refreshes (see the sketch below).
ETL tools include ready-to-use operations such as filtering, reformatting, sorting, joining, merging and aggregation, as well as advanced data profiling and cleansing.
Additionally, ETL tools support transformation scheduling, version control, monitoring and unified metadata management.
ETL tools have powerful debugging capabilities that let users see how data will flow in real time. We can also see how an attribute was transformed, when the data was most recently refreshed, and from which source system the data was extracted. The error messages thrown provide precise information about the error and where it occurred.
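A sketch of the "copy only changed data" idea, assuming the source table carries a last_modified timestamp and a watermark file is kept between runs (both are illustrative assumptions, not something the slides specify):

import sqlite3
from datetime import datetime, timezone

def copy_delta(source_db, target_db, watermark_file="last_run.txt"):
    # Read the watermark written by the previous run; fall back to the epoch on the first run.
    try:
        with open(watermark_file) as f:
            last_run = f.read().strip()
    except FileNotFoundError:
        last_run = "1970-01-01T00:00:00"

    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    # Pull only rows modified since the last run instead of a full refresh.
    changed = src.execute(
        "SELECT id, customer, amount, last_modified FROM sales WHERE last_modified > ?",
        (last_run,),
    ).fetchall()
    # Assumes the target table already exists with id as its primary key.
    tgt.executemany(
        "INSERT OR REPLACE INTO sales (id, customer, amount, last_modified) VALUES (?, ?, ?, ?)",
        changed,
    )
    tgt.commit()

    # Persist the new watermark for the next run.
    with open(watermark_file, "w") as f:
        f.write(datetime.now(timezone.utc).isoformat())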
How It Works
The data source layer can consist of different types of data: operational data, including business data such as sales, customer, finance and product data, as well as web server logs, etc.
Data from the data sources is extracted into the warehouse staging area, where it is minimally cleaned with no major transformations.
Business intelligence logic then transforms the transactional data into analytical data. This is the most time-consuming phase in the whole DWH architecture and is the chief process between the data source and the presentation layer of the DWH.
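A sketch of the layering described above: a staging table that gets only light cleanup, and an analytical table built from it by business logic. The table names and the daily-revenue rule are made up for illustration.

import sqlite3

con = sqlite3.connect("dwh.db")

# Staging area: land source rows with minimal cleaning (trim whitespace), no business logic yet.
con.execute("CREATE TABLE IF NOT EXISTS stg_sales (sale_id TEXT, store TEXT, amount TEXT, sold_at TEXT)")

def land_in_staging(rows):
    cleaned = [tuple(v.strip() for v in row) for row in rows]
    con.executemany("INSERT INTO stg_sales VALUES (?, ?, ?, ?)", cleaned)
    con.commit()

# Presentation layer: business intelligence logic turns transactional rows into analytical data.
con.execute("CREATE TABLE IF NOT EXISTS fact_daily_sales (store TEXT, sale_date TEXT, revenue REAL)")

def build_analytical_layer():
    con.execute("""
        INSERT INTO fact_daily_sales (store, sale_date, revenue)
        SELECT store, date(sold_at), SUM(CAST(amount AS REAL))
        FROM stg_sales
        GROUP BY store, date(sold_at)
    """)
    con.commit()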
Talend Products
What Talend Can Do
Types of sources Talend can handle:
• Flat files: delimited, XML, JSON, Excel…
• Applications/platforms: CRM, SAP, Salesforce…
• Relational databases: MySQL, Vertica, Postgres, Oracle, MS SQL…
• NoSQL: MongoDB, HBase, Cassandra…
• Big Data: Hive, HBase, Pig, Sqoop…
How do we transport:
• File system
• FTP
• SFTP
• Web services (SOAP, REST)
• HTTP
• Mail
Big Data component:
Talend is a graphical user interface tool that can "translate" an ETL job into a MapReduce job.
Talend Open Studio for Big Data lets users leverage Hadoop loading and processing technologies such as HDFS, HBase, Hive and Pig without having to write Hadoop application code (a command-line sketch of the same steps follows this list):
Load data into HDFS (Hadoop Distributed File System).
Use Hadoop Pig to transform data in HDFS.
Load data into a Hadoop Hive based data warehouse.
Perform ELT (extract, load, transform) aggregations in Hive.
Leverage Sqoop to integrate relational databases.
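In Talend these steps are configured graphically; the sketch below drives the standard hdfs and hive command-line clients from Python only to make the sequence concrete. Paths, table names and the aggregation are hypothetical.

import subprocess

# Load a local file into HDFS.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/sales"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "sales.csv", "/data/sales/"], check=True)

# ELT in Hive: expose the raw file as a table, then let Hive run the aggregation inside the cluster.
hive_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (id STRING, store STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/sales';
CREATE TABLE IF NOT EXISTS sales_by_store AS
SELECT store, SUM(amount) AS revenue FROM sales_raw GROUP BY store;
"""
subprocess.run(["hive", "-e", hive_sql], check=True)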
Apex Parks Group (APG)
Apex Parks Group is a privately held operating company based in Aliso Viejo, California. It owns 14 Family Entertainment Centers across multiple states in the USA.
Objective:
Provide Apex Parks Group (APG) the ability to explore its data and use it for further processing in building an EDW for reports and analytics.
Projects:
1. POS Data Staging
2. Cloudbeds Reservation Data
3. ADP Labor Data Ingestion
4. Item Attribute Extraction
5. Ultimate HR Data Ingestion
6. Migration and Support
APG Projects
POS Staging:
Ingest data from 14 POS systems across three time zones, spanning 3 years of data
Cloudbeds:
Ingestion of reservation data
APG POS Database Data flow
• Built a logical schema of 42 tables, rationalized across the 14 POS schemas of the APG parks
• Bulk ingestion of historical data, followed by daily and hourly loads
• Built and tested logic for net change in data across each data source
• Built logging and alerting logic for job run status (see the sketch below)
• Built and tested scheduling logic across time zones while avoiding production windows
• Validated business queries spanning multiple parks, tables and time zones using the staging DB
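A sketch of the job-run logging and alerting pattern, assuming a small log table and an email alert on failure; the SMTP host, addresses and table layout are placeholders, not APG's actual setup.

import sqlite3
import smtplib
import traceback
from datetime import datetime, timezone
from email.message import EmailMessage

log_db = sqlite3.connect("etl_log.db")
log_db.execute("CREATE TABLE IF NOT EXISTS job_runs (job TEXT, started TEXT, finished TEXT, status TEXT, detail TEXT)")

def alert(subject, body):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "etl@example.com"      # placeholder sender
    msg["To"] = "oncall@example.com"     # placeholder recipient
    msg.set_content(body)
    with smtplib.SMTP("mail.example.com") as smtp:   # placeholder SMTP host
        smtp.send_message(msg)

def run_with_logging(job_name, job_fn):
    started = datetime.now(timezone.utc).isoformat()
    try:
        job_fn()
        status, detail = "SUCCESS", ""
    except Exception:
        status, detail = "FAILURE", traceback.format_exc()
        alert(f"ETL job {job_name} failed", detail)
    # Record every run, successful or not, so job status can be monitored.
    log_db.execute(
        "INSERT INTO job_runs VALUES (?, ?, ?, ?, ?)",
        (job_name, started, datetime.now(timezone.utc).isoformat(), status, detail),
    )
    log_db.commit()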
Different Types of Delta Load
The delta load is implemented based on the following, as not every table has a date/time column to identify delta records.
• Complete Load
For tables where a date/time column is not present and even the ID is not reliable, all the records of the source table are extracted and populated into the target table.
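A sketch of the complete-load case: with no usable date/time or ID column, the target is simply emptied and fully reloaded each run. Connection and table names are hypothetical, and the target table is assumed to already exist with the source's schema.

import sqlite3

def complete_load(source_db, target_db, table="park_settings"):
    # Full refresh: no date/time column and no reliable ID, so replace everything.
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    rows = src.execute(f"SELECT * FROM {table}").fetchall()
    tgt.execute(f"DELETE FROM {table}")  # empty the target before reloading
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        tgt.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    tgt.commit()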
APG ADP Data Ingestion
• Received 7 files, updated each day, on an FTP location from the ADP system
• The data in the files spanned the past 7 days of time-punch data
• File-based data ingestion, backing up the source file on job completion (see the sketch below)
• Built a recovery mechanism for recovering from job run failures to plug data gaps
• Built logging and alerting logic for job run status
• Built and tested scheduling logic
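A sketch of the file-based ingestion with a backup step once the load completes; the FTP host, credentials, folder names and the load_file callback are all placeholders.

import os
from ftplib import FTP

def ingest_from_ftp(load_file):
    os.makedirs("incoming", exist_ok=True)
    with FTP("ftp.example.com") as ftp:        # placeholder host
        ftp.login("adp_user", "adp_password")  # placeholder credentials
        ftp.cwd("/adp/outbound")
        for name in ftp.nlst():
            local = os.path.join("incoming", name)
            with open(local, "wb") as f:
                ftp.retrbinary(f"RETR {name}", f.write)
            load_file(local)  # caller-supplied load step into the staging DB
            # Back up the source file on the server only after the load succeeded.
            ftp.rename(name, f"/adp/backup/{name}")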
APG Cloudbeds Data flow
APG Ultimate HR Data Ingestion
[Data flow diagram: UltiPro Core → SOAP API / REST API → Vertica staging → Email/Audit]
• Investigated the SOAP and REST APIs to identify the correct API, call sequence and output format
• Data ingestion using API-based ingestion
• Built a recovery mechanism for recovering from job run failures to plug data gaps (see the sketch below)
• Built logging and alerting logic for job run status
• Built and tested scheduling logic while avoiding production windows
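A sketch of the recovery idea behind "plug data gaps": keep a watermark of the last successfully loaded day so a failed run is simply re-covered by the next one. The state-file name and daily granularity are assumptions.

import json
from datetime import date, timedelta

STATE_FILE = "hr_ingest_state.json"  # hypothetical watermark store

def pending_dates(today=None):
    # Every day since the last successful load still needs to be pulled,
    # so gaps left by failed runs are filled automatically.
    today = today or date.today()
    try:
        with open(STATE_FILE) as f:
            last_ok = date.fromisoformat(json.load(f)["last_success"])
    except FileNotFoundError:
        last_ok = today - timedelta(days=1)
    return [last_ok + timedelta(days=i) for i in range(1, (today - last_ok).days + 1)]

def mark_success(loaded_through):
    with open(STATE_FILE, "w") as f:
        json.dump({"last_success": loaded_through.isoformat()}, f)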
APG SOAP API Data Flow
[Data flow diagram: SOAP logon call → token → successful data call]
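A sketch of the logon step shown above, posting a SOAP envelope and reading the token out of the response; the endpoint, element names and namespace are placeholders rather than the vendor's real schema.

import requests
import xml.etree.ElementTree as ET

LOGIN_URL = "https://service.example.com/LoginService"  # placeholder SOAP endpoint

# Minimal SOAP 1.1 logon envelope; element names are illustrative only.
LOGON_ENVELOPE = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Logon xmlns="http://example.com/hr">
      <UserName>api_user</UserName>
      <Password>api_password</Password>
    </Logon>
  </soap:Body>
</soap:Envelope>"""

def soap_logon():
    # POST the logon envelope and pull the session token out of the XML response.
    resp = requests.post(
        LOGIN_URL,
        data=LOGON_ENVELOPE,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        timeout=30,
    )
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # Token element name and namespace are placeholders for illustration.
    return root.find(".//{http://example.com/hr}Token").text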
APG REST API Data Flow
[Data flow diagram: Get Team API → Team → Punches]
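A sketch of the REST step, using the token from the SOAP logon to list teams and then pull punches per team; the base URL, paths and bearer-token header are illustrative assumptions.

import requests

BASE_URL = "https://service.example.com/api"  # placeholder REST endpoint

def fetch_punches(token):
    # Call the team endpoint first, then pull time punches for each team.
    headers = {"Authorization": f"Bearer {token}"}
    teams = requests.get(f"{BASE_URL}/teams", headers=headers, timeout=30).json()
    punches = []
    for team in teams:
        resp = requests.get(
            f"{BASE_URL}/teams/{team['id']}/punches",
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        punches.extend(resp.json())
    return punches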
EMAILING