Chapter 9: Fundamentals of an ETL Architecture

An ETL Infrastructure
Extraction, Transformation and Loading (ETL) technology is not a new concept. It is nothing more than a series of batch interfaces between systems, and batch interfaces have been around for quite some time now.
What makes ETL difficult today is the added emphasis on capturing and consolidating
business intelligence in conjunction with batch processing of data.
An ETL tool collects data about the ETL processes, makes it reusable, and is easier to manage and transfer knowledge from, but it can lack some of the horsepower needed to process complex transformations efficiently.
The Need for ETL Metadata
The following metadata items are necessary to build effective processes and to
communicate information about the execution of these processes. Some of this data does
not originate in the ETL process but originates in other tools or steps in your development
methodology.
Metadata Item | Originating Process
Source data definitions | Source analysis. Source database schemas and file definitions.
Target data definitions | Modeling process. Data modeling tool used to design the data warehouse or mart.
Business descriptions of data stores | Source analysis, modeling process, or ETL process. If the data is not available from the source schema or modeling tool, it will have to be manually typed into the ETL tool.
Data access methods, rights, and privileges | Middleware definitions, user names and passwords for development, and production access to source and target data.
Transformations | ETL process.
Transformation business definitions | ETL process.
Data lineage (description of all transformations that occur to source data to compute a target field) | ETL process.
Load statistics (i.e., last load time, number of records loaded, number of errors, etc.) | ETL process or operator tools.

ETL Tools versus Manual Development


The following table describes different types of platforms and processing needs, and whether you could use an ETL tool or manual development (that is, hand-coded SQL scripts, PL/SQL, COBOL, and so on) to facilitate each need.

Process/Need | ETL Tools | Manual Development
Metadata | Good | Difficult
Heterogeneous Connectivity | Good | Difficult
Error Logging | Good | Good
Reusability | Good | Good (if managed well)
Transformations | Good | Good (but can be more limiting)
Aggregations | Good (easier to manage) | Good (faster at processing individual aggregations)
Selecting an ETL Architecture


Need and value are probably the two most important criteria discussed with respect to ETL tools, and probably the two most overlooked.
Really, what you want from an ETL tool, or architecture, is a backbone for your data warehouse. What is needed is something that can support the extraction, transformation, and delivery of information from source to user.
Extraction
Data extraction is all about tapping your source system data as efficiently as possible.
But, no ETL tool effectively manages Change Data Capture (CDC) without the assistance
of native code on the source system.
To effectively handle CDC in a relational environment, triggers and/or stored procedures are often written to create a CDC log that contains, at minimum, the table name and primary key values of the records that change. There are other variations on this theme, some more efficient than others.
The point is that some code or objects must be created in the source system to efficiently
handle CDC. The ETL tool uses the result of CDC to extract the relevant data from the
source system.
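As a minimal sketch of trigger-based CDC, assuming a hypothetical ORDERS source table keyed on ORDER_ID and a hypothetical CDC_LOG table, the native code on the source system might look like the following (PL/SQL):

    -- Hypothetical CDC log table: one row per changed source record,
    -- holding the table name, primary key value, operation, and timestamp.
    CREATE TABLE cdc_log (
        table_name  VARCHAR2(30),
        pk_value    NUMBER,
        operation   CHAR(1),                -- 'I' insert, 'U' update, 'D' delete
        change_ts   DATE DEFAULT SYSDATE
    );

    -- Trigger on the (hypothetical) source table that writes one log row per change.
    CREATE OR REPLACE TRIGGER trg_orders_cdc
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW
    DECLARE
        v_op CHAR(1);
    BEGIN
        IF INSERTING THEN
            v_op := 'I';
        ELSIF UPDATING THEN
            v_op := 'U';
        ELSE
            v_op := 'D';
        END IF;

        INSERT INTO cdc_log (table_name, pk_value, operation)
        VALUES ('ORDERS', NVL(:NEW.order_id, :OLD.order_id), v_op);
    END;
    /

The ETL tool can then join CDC_LOG back to the source table to extract only the records that have changed since the last run.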
Oftentimes, source data resides in proprietary systems or uses older technology that does
not have open connectivity or an API to exchange data with other applications. In these
cases, proprietary languages or extract tools are used to create extract files.

An ETL tool should be able to read a file produced by programs written on the source system.
Once the processing is done on the source system, that data will be transported to the target environment. Another set of programs will be needed on the target to:
i. Resolve foreign key references, and
ii. Do any slowly changing dimension processing, separate the data into insert and update data streams, and redirect invalid records to an error file (a sketch of the insert/update split follows below).
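For illustration only, assuming a hypothetical STG_CUSTOMER staging table and a DIM_CUSTOMER target dimension whose natural key is CUSTOMER_NBR, the insert/update split might be sketched as:

    -- Hypothetical work tables; records with no match on the natural key
    -- form the insert stream.
    INSERT INTO stg_customer_inserts
    SELECT s.*
    FROM   stg_customer s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dim_customer d
                       WHERE  d.customer_nbr = s.customer_nbr);

    -- Records that do match the natural key form the update stream.
    INSERT INTO stg_customer_updates
    SELECT s.*
    FROM   stg_customer s
    WHERE  EXISTS (SELECT 1
                   FROM   dim_customer d
                   WHERE  d.customer_nbr = s.customer_nbr);

The STG_CUSTOMER_INSERTS and STG_CUSTOMER_UPDATES tables are assumed work tables; the same split could also be done in a single pass with an outer join.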
An ETL or replication tool will never be able to extract from every possible data source. But as long as it can effectively handle the transformation and load processing, and a large portion of the extraction, combined with source system file extracts or CDC triggers, it may be worth the investment.
Scheduling:
Scheduling processes across multiple platforms can be tricky. What makes it tricky is the heterogeneity of most systems in an enterprise.
In organizations with disparate systems, a third-party scheduling tool is often employed to manage the enterprise processing.
If an ETL tool is not used, three basic process types will need to be scheduled:
i. The extraction programs,
ii. An FTP or equivalent transportation method to move the data to a file system that the transformation and load programs can read from, and
iii. The transformation and load programs.
Third-party scheduling tools can handle all of the three basic steps above.
One of the strengths of a good ETL tool is the ability to bridge the gap between disparate systems using ODBC drivers, gateways, or FTP connections. Additionally, they can invoke external processes and, in some cases, remote processes. The execution of these processes can often be managed internally without the assistance of a third-party scheduler.
Staging Area
A staging area is a data store where data from the source systems can be integrated,
transformed, and prepared for loading into the target warehouse repositories.
Depending on the size of the warehousing effort, the staging area can be physical or virtual.
The staging area is where dimensions are constructed and conformed prior to being distributed to the target environments.

It is also where transformations are applied and surrogate key values can be resolved for
fact records.
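As a brief sketch, assuming a hypothetical STG_SALES staging table carrying the natural key PRODUCT_NBR and a conformed DIM_PRODUCT dimension carrying the surrogate key PRODUCT_KEY, surrogate key resolution for fact records might look like this:

    -- Hypothetical tables: resolve the product surrogate key for incoming
    -- fact rows by joining the staged facts to the current dimension row
    -- on the natural key.
    INSERT INTO fact_sales (product_key, sale_date, sale_amount)
    SELECT d.product_key,                 -- surrogate key from the dimension
           s.sale_date,
           s.sale_amount
    FROM   stg_sales   s
    JOIN   dim_product d
           ON  d.product_nbr  = s.product_nbr
           AND d.current_flag = 'Y';      -- pick the active dimension version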
When using a replication tool or homegrown trigger-based CDC, the replicated data or
CDC log is written to the staging area.
An effective staging area will allow you to build your warehouse literally around the
clock without affecting users.
The staging area does not have to be a complete copy of the warehouse. While it must have complete copies of the dimension tables in order to keep everything consistent, it does not need a complete history of the fact tables; perhaps two to six batch cycles of data are enough.
Development and Operations
Two key areas should be inherent in an ETL architecture:
Development life cycle controls, and
Operations.
Development life cycle control is the ability to control development and manage the
promotion of objects from development to test and on to production.
The ETL architecture needs administrative functions that allow for the versioning of objects and their promotion from the development environment to the test environment for integration and acceptance testing, and then on to production.
Target Databases
Oddly enough, the ability to target a variety of database types is overlooked or glossed
over more than one might think.
Most ETL and replication tools can write to most RDBMS systems.
Another target database platform to consider is a multidimensional database (MDDB) format such as Oracle Express, Oracle's MDDB OLAP product.
Change Data Capture
Change Data Capture (CDC) is an important element of the extract analysis.
Transactions that drive fact data almost always have timestamps. However, dimension data in the source systems does not always have timestamps because it is typically not tied to an event. Hence, CDC is the most difficult to implement with dimension data.
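For timestamped transaction data, a minimal incremental extract sketch (assuming a hypothetical ORDERS table with a LAST_UPDATE_TS column and a saved high-water mark) could be as simple as:

    -- Incremental extract of fact-driving transactions by timestamp
    -- (hypothetical table and column names).
    -- :last_extract_ts is the high-water mark recorded by the previous run.
    SELECT o.order_id,
           o.customer_nbr,
           o.order_amount,
           o.last_update_ts
    FROM   orders o
    WHERE  o.last_update_ts > :last_extract_ts;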
Target Table Refresh Strategy:
Not only the CDC of the source, but also the load strategy of the target tables, affects load increment processing:
Dimension Table Load Strategy,
Fact Table Load Strategy, and
Aggregate Table Load Strategy.

Dimension Table Load Strategy:


There are essentially three dimension load strategies. These three load strategies are
called slowly changing dimension (SCD) strategies. In all three types, all input is
compared to the existing data.
If a match on the natural key is not found, then the input record is inserted. The natural key is made up of the columns in the dimension, other than the surrogate key, that uniquely identify a dimension record.
If a match is found, then one of the following strategies will be followed.
Slowly Changing Dimension Type 1(SCD-1):
No history is preserved in an SCD-1. If the input record already exists in the target dimension table, based on the natural key values, then that record is updated or refreshed with the data from the input record.
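A minimal SCD-1 sketch, assuming a hypothetical DIM_CUSTOMER dimension (natural key CUSTOMER_NBR, surrogate key drawn from a DIM_CUSTOMER_SEQ sequence) loaded from a STG_CUSTOMER staging table:

    -- SCD-1 on hypothetical tables: overwrite the existing row in place;
    -- no history is kept. Non-matching rows are inserted with a new key.
    MERGE INTO dim_customer d
    USING stg_customer s
    ON (d.customer_nbr = s.customer_nbr)          -- match on the natural key
    WHEN MATCHED THEN
        UPDATE SET d.customer_name = s.customer_name,
                   d.city          = s.city
    WHEN NOT MATCHED THEN
        INSERT (customer_key, customer_nbr, customer_name, city)
        VALUES (dim_customer_seq.NEXTVAL, s.customer_nbr, s.customer_name, s.city);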
Slowly Changing Dimension Type2 (SCD-2):
Sometimes there are a number of critical values on a record that we may want to retain.
In an SCD-2, we preserve a dimension record the way it was when the associated facts occurred. So if a field or column value that is one of these critical values on the input record is different from its corresponding column in the target table, then the existing record is expired and a new record, based on the input record, is inserted and assigned a new surrogate key value. If none of the critical values are different from their corresponding columns in the target table, then the existing record is updated but not expired, much like SCD-1.
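Continuing the hypothetical DIM_CUSTOMER example, and treating CITY as the only critical value, an SCD-2 load might be sketched in two steps:

    -- SCD-2 on hypothetical tables.
    -- Step 1: expire the current row when a critical value has changed.
    UPDATE dim_customer d
    SET    d.current_flag = 'N',
           d.expiry_date  = SYSDATE
    WHERE  d.current_flag = 'Y'
    AND    EXISTS (SELECT 1
                   FROM   stg_customer s
                   WHERE  s.customer_nbr = d.customer_nbr
                   AND    s.city <> d.city);     -- the critical value differs

    -- Step 2: insert a new current version (and any brand-new customers)
    -- with a fresh surrogate key.
    INSERT INTO dim_customer
           (customer_key, customer_nbr, customer_name, city,
            effective_date, expiry_date, current_flag)
    SELECT dim_customer_seq.NEXTVAL, s.customer_nbr, s.customer_name, s.city,
           SYSDATE, NULL, 'Y'
    FROM   stg_customer s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dim_customer d
                       WHERE  d.customer_nbr = s.customer_nbr
                       AND    d.current_flag = 'Y');

Rows whose critical values did not change would be updated in place, as in SCD-1; that step is omitted from the sketch.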
Slowly Changing Dimension Type 3 (SCD-3):
Much like SCD-2, SCD-3 is used to track changes to critical values. However, instead of keeping a separate record for each change, separate columns on the existing record are used to store the current value and any previous n values.
When a critical change is detected, all the previous values are shifted to the next previous column and the nth oldest value is discarded.
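A minimal SCD-3 sketch for the same hypothetical DIM_CUSTOMER, keeping only one prior value of the critical CITY column in a PREV_CITY column:

    -- SCD-3 on hypothetical tables: shift the current value into the
    -- 'previous' column and store the new value, but only for rows whose
    -- critical value changed.
    UPDATE dim_customer d
    SET    (prev_city, city) =
           (SELECT d.city, s.city        -- old value moves to PREV_CITY
            FROM   stg_customer s
            WHERE  s.customer_nbr = d.customer_nbr)
    WHERE  EXISTS (SELECT 1
                   FROM   stg_customer s
                   WHERE  s.customer_nbr = d.customer_nbr
                   AND    s.city <> d.city);

With more history columns (PREV_CITY_2, PREV_CITY_3, and so on) the same shift is repeated from oldest to newest, discarding the oldest value.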
Note:

All dimension tables follow one of these three strategies, with only two exceptions: static dimension tables and dimension tables that are completely replaced with new data. Static dimension tables never change, and if they do, the only change is the addition of new records. A time dimension is a good example of a static dimension.
Job Scheduling
Determining when a job should run depends on several things. Dependencies are derived from:
External processing schedules,
Client expectations of data availability, and
Our own internal processing schedules.
External Processing
The best place to start is with the source data. When is the source data available? Is there a batch job that prepares the source data, and when does it complete? What backups and/or system maintenance need to be completed?
Client Expectations
After determining the source data availability, the client's needs are the next consideration. Where does the client expect to see this data? How often do they expect it to be updated? What are the service level agreements with the client on data availability?
It is generally best to examine the external schedule before attempting to alter the client's expectations.
Internal Processing
Just like the external systems, the internal processing schedule probably already includes other batch load processing and backups. There are also dependencies in the load schedule that require certain tables to be loaded before others so that referential integrity is ensured.
So scheduling has two issues:
What data dependencies exist that will affect when the jobs that load the data are scheduled, and
Does this schedule still satisfy the client's expectations?

Balancing
Balancing is the summing up of additive values that are in both the source and target tables.
Columns that have monetary values or other types of numeric data are the usual candidates for balancing.
Sometimes additive values do not exist in a source extract or target table. Queries that count the number of records by certain classifications provide a similar measure when additive values do not exist. Counts do not give the assurance that additive summations do, but they are still an accurate balancing mechanism.
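As a simple sketch, assuming hypothetical STG_SALES (source extract) and FACT_SALES (target) tables with an additive SALE_AMOUNT column:

    -- Balance an additive column between the source extract and the target
    -- (hypothetical table and column names).
    SELECT src.total_amount                    AS source_total,
           tgt.total_amount                    AS target_total,
           src.total_amount - tgt.total_amount AS difference
    FROM  (SELECT SUM(sale_amount) AS total_amount FROM stg_sales)  src,
          (SELECT SUM(sale_amount) AS total_amount FROM fact_sales) tgt;

    -- When no additive values exist, compare record counts instead.
    SELECT (SELECT COUNT(*) FROM stg_sales)  AS source_count,
           (SELECT COUNT(*) FROM fact_sales) AS target_count
    FROM   dual;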
How much to Balance?
Each table has distinctly different data sets, and the processing of them is unique to each table. Sometimes it makes sense to balance the complete table; sometimes it makes sense to balance only the data in the most recent load (incremental balancing). There are several factors that contribute to this decision:

Table size
Table refresh strategy and
Table update strategy

Table Size:
A table whose size stays constant or that has slow growth may need to be completely balanced after each load. Large tables that grow considerably during each load may be candidates for incremental balancing.
Table Refresh Strategy:
If the table is completely refreshed by the load, then completely balance it after each load. Tables that incrementally change or append data during each execution may be candidates for incremental balancing.
Table Update Strategy:
If the data loads exclusively append input records to the target warehouse table, then incremental balancing may be all that is necessary. If the input records cause updates to existing records in the warehouse table, then consider a complete balancing of the table after the load.
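An incremental balancing sketch, reusing the hypothetical STG_SALES and FACT_SALES tables and assuming a LOAD_BATCH_ID column that identifies the most recent load cycle:

    -- Incremental balancing on hypothetical tables: compare only the rows
    -- belonging to the latest load cycle on the target side.
    SELECT src.total_amount                    AS source_total,
           tgt.total_amount                    AS target_total,
           src.total_amount - tgt.total_amount AS difference
    FROM  (SELECT SUM(sale_amount) AS total_amount
           FROM   stg_sales) src,
          (SELECT SUM(sale_amount) AS total_amount
           FROM   fact_sales
           WHERE  load_batch_id = :current_batch_id) tgt;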
