ETL - PPT v0.2

ETL stands for Extract, Transform, Load: it pulls data from source systems and prepares it for analysis in the target system.

The main components of ETL are extraction of data from various sources, transformation of the data to prepare it for loading, and loading of the transformed data into the target system. Extraction pulls the data from the sources, transformation prepares it for loading by cleaning, filtering and so on, and loading writes the transformed data to the target system.

Talend can handle various data sources such as flat files, relational databases, applications/platforms, NoSQL databases, and big data sources such as Hive and HBase.

TALEND FOR APG

1
ETL Overview
• ETL
• Extraction
• Transformation
• Loading

2
Extraction, Transformation, Loading – ETL
• ETL is short for extract, transform, load: three database functions combined into one tool that pulls data out of a source and places it into a target.
• Extract is the process of reading data from a source. In this stage the data is collected, often from multiple sources of different types.
• Transform is the process of converting the extracted data from its original form into the form it needs to be in so that it can be placed into the target. Transformation is done by applying rules, using lookup tables, or combining the data with other data.
• Load is the process of writing the data into the target.
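
The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Talend builds a job: the CSV text, the REGION_LOOKUP table and the sales target table are hypothetical, and an in-memory SQLite database stands in for the target.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from files, databases or APIs.
SOURCE_CSV = """order_id,region_code,amount
1,W,10.50
2,E,20.00
3,W,
"""

# Hypothetical lookup table used by the transform step.
REGION_LOOKUP = {"W": "West", "E": "East"}


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, apply the lookup, convert types."""
    out = []
    for row in rows:
        if not row["amount"]:
            continue  # rule: skip rows without an amount
        region = REGION_LOOKUP[row["region_code"]]
        out.append((int(row["order_id"]), region, float(row["amount"])))
    return out


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, text=SOURCE_CSV):
    """Run extract -> transform -> load and return the loaded rows."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT order_id, region, amount FROM sales ORDER BY order_id"
    ).fetchall()
```

Calling run_etl(sqlite3.connect(":memory:")) loads the two complete rows and skips the row with the missing amount.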

3
Why ETL
• ETL turns heterogeneous data into homogeneous data, which makes it easier for data scientists and data analysts to analyze the data and derive business intelligence from it.
• ETL tools can collect, read and migrate data from multiple data structures and across different databases and platforms, such as mainframes.
• ETL technology can also identify "delta" changes as they occur, which lets ETL tools copy only the changed data without performing a full data refresh.
• ETL tools include ready-to-use operations such as filtering, reformatting, sorting, joining, merging and aggregation, as well as advanced data profiling and cleansing.
• ETL tools also support transformation scheduling, version control, monitoring and unified metadata management.
• ETL tools have powerful debugging capabilities that let the user see how real-time data will flow. We can also see how an attribute was transformed, when the data was most recently refreshed, and from which source system it was extracted. The error messages give precise information about the error and where it occurred.

4

How It Works
• The data source layer can consist of different types of data: operational data, including business data such as Sales, Customer, Finance and Product, as well as web server logs and so on.
• Data from the data sources is extracted and put into the warehouse staging area, where it is minimally cleaned with no major transformations.
• Business intelligence logic then transforms the transactional data into analytical data. This is the most time-consuming phase in the whole DWH architecture and is the chief process between the data source layer and the presentation layer of the DWH.

5
Talend Products

6
What All Talend Can Do
Types of sources Talend can handle:
• Flat files: delimited, XML, JSON, Excel…
• Applications/platforms: CRM, SAP, Salesforce…
• Relational databases: MySQL, Vertica, Postgres, Oracle, MSSQL…
• NoSQL: MongoDB, HBase, Cassandra…
• Big data: Hive, HBase, Pig, Sqoop…

How do we transport:
• File system
• FTP
• SFTP
• Web services (SOAP, REST)
• HTTP
• Mail

7
Big Data component:

• Talend is a graphical user interface tool that can "translate" an ETL job into a MapReduce job.
• Talend Open Studio for Big Data lets you leverage Hadoop loading and processing technologies such as HDFS, HBase, Hive and Pig without having to write Hadoop application code.
• Load data into HDFS (Hadoop Distributed File System).
• Use Hadoop Pig to transform data in HDFS.
• Load data into a Hadoop Hive based data warehouse.
• Perform ELT (extract, load, transform) aggregations in Hive.
• Leverage Sqoop to integrate relational databases.

8
Apex Parks Group (APG)
Apex Parks Group is a privately held operating company based in Aliso Viejo, California. It owns 14 Family Entertainment Centers in multiple states in the USA.

Objective:
Give Apex Parks Group (APG) the ability to explore its data and use it for further processing in building an EDW for reports and analytics.

Projects:
1. POS Data Staging
2. Cloudbeds Reservation Data
3. ADP Labor Data Ingestion
4. Item Attribute Extraction
5. Ultimate HR Data Ingestion
6. Migration and Support

9
APG Projects
• POS Staging: ingest data from 14 POS systems across three time zones, spanning 3 years of data.
• Cloudbeds: ingest reservation data.
• Item Attribute Extraction: APG lacks an Item Master and wanted to extract attributes from item descriptions to better understand item sales. This was built from the POS staging database.
• ADP Labor Data: ingest time punch data from ADP systems.
• Ultimate Labor Data Ingestion: ingest time punch data from Ultimate HR systems.
• Migration and Support: migration of POS from MySQL to Vertica for performance, and ongoing support.

10
APG POS Database Data flow

14 POS Database → VERTICA STAGING → Email/Audit

• Built a logical schema that rationalizes the 14 POS schemas across the APG parks (42 tables)
• Bulk ingestion of historical data, followed by daily and hourly loads
• Built and tested logic for net change in data across each data source
• Built logging and alerting logic for job run status
• Built and tested scheduling logic that works across time zones and avoids production windows
• Validated business queries spanning multiple parks, tables and time zones using the staging DB
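
The scheduling point above (running across time zones while avoiding production windows) could look roughly like the sketch below. The window hours, park names and function names are hypothetical; the deck does not say what APG's actual windows were.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical production window (local park time) during which loads must not run.
WINDOW_START = time(9, 0)
WINDOW_END = time(22, 0)


def in_production_window(now_utc, park_tz):
    """Return True if the park's local time falls inside the production window."""
    local = now_utc.astimezone(park_tz).time()
    return WINDOW_START <= local < WINDOW_END


def parks_safe_to_load(now_utc, park_timezones):
    """Return the parks whose local time is outside the production window."""
    return sorted(
        park
        for park, tz in park_timezones.items()
        if not in_production_window(now_utc, tz)
    )


if __name__ == "__main__":
    # Fixed UTC offsets keep the example dependency-free; zoneinfo.ZoneInfo
    # objects can be passed instead to follow daylight saving time.
    parks = {
        "park_west": timezone(timedelta(hours=-8)),
        "park_east": timezone(timedelta(hours=-5)),
    }
    print(parks_safe_to_load(datetime.now(timezone.utc), parks))
```

A scheduler would call parks_safe_to_load before each run and start jobs only for the parks it returns.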

11
Different Type Of Delta Load
The delta load is implemented in one of the following ways, because not every table has a date/time column that identifies delta records.

• Delta Load based on Timestamp
Used for all tables that have a timestamp column suitable for identifying delta records.
At each run, the current date and time is stored in the audit table.
The table's timestamp column is compared with the timestamp stored in the audit table by the last run, and all records with a date/time later than that stored timestamp are fetched.

• Delta Load based on ID
Used for tables that have no timestamp in the source table but where an ID can be used to identify delta records.
Source records with an ID greater than the maximum ID present in the target are fetched.

• Complete Load
Used for tables where no date/time is present and the ID is not reliable.
In such cases all records of the source table are extracted and populated in the target table.
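
The three strategies can be sketched as follows. This is a simplified illustration, assuming SQLite connections in place of the real source and Vertica target, a hypothetical audit table with table_name and last_run columns, and ISO-formatted timestamp strings so that string comparison matches time order.

```python
import sqlite3


def delta_by_timestamp(src, tgt, table, ts_col, run_time):
    """Fetch rows newer than the timestamp saved by the last run, then record this run.

    Table and column names are trusted constants here, not user input.
    """
    row = tgt.execute(
        "SELECT MAX(last_run) FROM audit WHERE table_name = ?", (table,)
    ).fetchone()
    last_run = row[0] or ""  # empty string sorts before any ISO timestamp
    rows = src.execute(
        f"SELECT * FROM {table} WHERE {ts_col} > ? ORDER BY {ts_col}", (last_run,)
    ).fetchall()
    tgt.execute(
        "INSERT INTO audit (table_name, last_run) VALUES (?, ?)", (table, run_time)
    )
    return rows


def delta_by_id(src, tgt, table, id_col):
    """Fetch source rows whose ID is greater than the maximum ID in the target."""
    max_id = tgt.execute(
        f"SELECT COALESCE(MAX({id_col}), 0) FROM {table}"
    ).fetchone()[0]
    return src.execute(
        f"SELECT * FROM {table} WHERE {id_col} > ? ORDER BY {id_col}", (max_id,)
    ).fetchall()


def complete_load(src, table):
    """Fetch every row when neither a timestamp nor a reliable ID exists."""
    return src.execute(f"SELECT * FROM {table}").fetchall()
```

Each function returns only the rows to fetch; writing them into the target is left to the caller.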

12
APG ADP Data Ingestion

CSV on FTP → VERTICA STAGING → Email/Audit

• Received 7 files at an FTP location from the ADP system, updated each day
• The data in the files spanned the past 7 days of time punch data
• Ingested the data with file-based ingestion, backing up the source file on job completion
• Built a recovery mechanism for recovering from job run failures to plug data gaps
• Built logging and alerting logic for job run status
• Built and tested scheduling logic
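
File-based ingestion with a backup on completion might be sketched like this. The column names, the time_punch table and the key-based replace are assumptions made for illustration: because each file spans several days, consecutive files overlap, and replacing rows by key is one way to avoid duplicates. SQLite stands in for the staging database, and the FTP download is assumed to have happened already.

```python
import csv
import shutil
import sqlite3
from pathlib import Path


def ingest_file(conn, csv_path, backup_dir):
    """Load one CSV file into staging, then move it to the backup folder."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS time_punch "
        "(punch_id TEXT PRIMARY KEY, employee_id TEXT, punched_at TEXT)"
    )
    with open(csv_path, newline="") as handle:
        rows = [
            (r["punch_id"], r["employee_id"], r["punched_at"])
            for r in csv.DictReader(handle)
        ]
    # Overlapping files repeat rows, so replace by key instead of appending.
    conn.executemany("INSERT OR REPLACE INTO time_punch VALUES (?, ?, ?)", rows)
    conn.commit()
    # Back up the source file only after the load has been committed.
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(csv_path), str(backup_dir / Path(csv_path).name))
    return len(rows)
```

Moving the file only after the commit means a failed run leaves the file in place, so the next run can pick it up again.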

13
APG Cloudbeds Data flow

CloudBeds (REST API) → VERTICA STAGING → Email/Audit

• Ingested data using the REST API exposed by Cloudbeds
• Built the authentication and refresh-token logic so that a job stays authenticated for its whole run time
• Bulk ingestion of historical data, followed by daily loads
• Built and tested logic for net change in data across each data source
• Built logging and alerting logic for job run status
• Built and tested scheduling logic
• Validated business queries spanning multiple parks using the staging DB
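
Keeping a long-running job authenticated usually means refreshing the access token shortly before it expires. The sketch below shows that pattern in general terms; it does not use the Cloudbeds API. TokenManager, fetch_token and the 60-second margin are hypothetical, and fetch_token is a caller-supplied function that would wrap the provider's token endpoint.

```python
import time


class TokenManager:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(self, fetch_token, clock=time.time, margin_seconds=60):
        # fetch_token is a caller-supplied function returning
        # (token, lifetime_seconds).
        self._fetch_token = fetch_token
        self._clock = clock
        self._margin = margin_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, refreshing it if missing or about to expire."""
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, lifetime = self._fetch_token()
            self._expires_at = self._clock() + lifetime
        return self._token
```

Each API call in the job would ask the manager for the token with get() instead of storing it once at start-up.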

14
APG Ultimate HR Data Ingestion

UltiPro / Core (SOAP API, REST API) → VERTICA STAGING → Email/Audit

• Investigated the SOAP and REST APIs to identify the correct API, call sequence and output format
• Ingested the data with API-based ingestion
• Built a recovery mechanism for recovering from job run failures to plug data gaps
• Built logging and alerting logic for job run status
• Built and tested scheduling logic that avoids production windows
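
One way to plug data gaps after failed runs is to compare the dates that loaded successfully against the expected date range and re-run only the missing days. The sketch below assumes daily loads and a caller-supplied load_day function; both are illustrative assumptions, not details from the project.

```python
from datetime import date, timedelta


def missing_load_dates(loaded_dates, start, end):
    """Return every date in [start, end] that has no successful load recorded."""
    loaded = set(loaded_dates)
    days = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(days))
    return [day for day in candidates if day not in loaded]


def recover(loaded_dates, start, end, load_day):
    """Re-run the load for each missing date; return the dates that still failed."""
    still_failed = []
    for day in missing_load_dates(loaded_dates, start, end):
        try:
            load_day(day)  # caller-supplied function that ingests one day of data
        except Exception:
            still_failed.append(day)  # leave the gap for the next recovery run
    return still_failed
```

The list of loaded dates would typically come from the job's audit log.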

15
APG SOAP API Data Flow
[Flow diagram]
SOAP Logon call → (successful call) Token
SOAP Execute call → (successful call) Report ID
SOAP Retrieve Report call → (successful call) Encoded XML
Decode extracted XML → VERTICA STAGING → Email/Audit
The Execute and Retrieve Report calls each also have a "Failed call" branch.
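
The call sequence in this flow can be sketched with the three SOAP calls passed in as plain functions. Everything here is an assumption made for illustration: logon, execute, retrieve and notify_failure are hypothetical caller-supplied callables rather than real UltiPro API names, the report is assumed to be base64-encoded XML (the slide says only "encoded"), and routing failures to a notification step is a guess at what the failed-call branches do.

```python
import base64
import xml.etree.ElementTree as ET


def fetch_report(logon, execute, retrieve, notify_failure):
    """Run logon -> execute -> retrieve; return the parsed report or None on failure.

    logon, execute and retrieve wrap the three SOAP calls;
    notify_failure stands in for an email/audit step.
    """
    token = logon()
    try:
        report_id = execute(token)
        encoded = retrieve(token, report_id)
    except Exception as exc:
        notify_failure(str(exc))  # failed call: report it and stop
        return None
    # Assumption: the report body is base64-encoded XML.
    xml_text = base64.b64decode(encoded).decode("utf-8")
    return ET.fromstring(xml_text)
```

The returned element tree can then be flattened into rows for the staging load.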

16
APG REST API Data Flow
[Flow diagram]
Get Team API → Team
Get Emp API → Emp
Get Work Summary API → Hours, Punches
Email/Audit

17
EMAILING

