Chapter 9 Fundamentals of An ETL Architecture
An ETL Infrastructure
Extraction, transformation, and loading (ETL) technology is not a new concept. It is
nothing more than a series of batch interfaces between systems, and batch interfaces have
been around for quite some time now.
What makes ETL difficult today is the added emphasis on capturing and consolidating
business intelligence in conjunction with batch processing of data.
An ETL tool collects data about the ETL processes, makes them reusable, and is easier to
manage and transfer knowledge from, but it can lack some of the horsepower needed to
process large or complex transformations efficiently.
The Need for ETL Metadata
The following metadata items are necessary to build effective processes and to
communicate information about the execution of these processes. Some of this data does
not originate in the ETL process but originates in other tools or steps in your development
methodology.
Metadata Item and Originating Process:
Source data definitions: Source analysis. Source database schemas and file definitions.
Target data definitions: Modeling process. Data modeling tool used to design the data
warehouse or mart.
Source-to-target mappings: Source analysis, modeling process, or ETL process. If the data
is not available from the source schema or modeling tool, it will have to be manually
typed into the ETL tool.
Data access methods, rights, and privileges: Middleware definitions, user names, and
passwords for development and production access to source and target data.
Transformations: ETL process.
Transformation business definitions: ETL process.
Data lineage (description of all transformations that occur to source data to compute a
target field): ETL process.
Load statistics (e.g., last load time, number of records loaded, number of errors): ETL
process.
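The load statistics in the last row originate in the ETL process itself. As a minimal sketch of
how a load run might record them, assuming a hypothetical etl_load_stats table kept in SQLite:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("etl_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS etl_load_stats (
        job_name       TEXT,
        last_load_time TEXT,
        records_loaded INTEGER,
        error_count    INTEGER
    )
""")

def record_load_stats(job_name, records_loaded, error_count):
    """Capture the load statistics metadata produced by each ETL run."""
    conn.execute(
        "INSERT INTO etl_load_stats VALUES (?, ?, ?, ?)",
        (job_name, datetime.now(timezone.utc).isoformat(), records_loaded, error_count),
    )
    conn.commit()

# Example: a nightly sales load that finished with 12,480 rows and 3 rejected records.
record_load_stats("load_fact_sales", 12480, 3)
```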
ETL Tools                           Manual Development
Good                                Difficult
Good                                Difficult
Good                                Good
Good                                Good (if managed well)
Good                                Good (but can be more limiting)
Good (easier to manage)             Good (faster at processing individual aggregations)
An ETL tool should be able to read a file produced by programs written on the source
system.
Once the processing is done on the source system, that data will be transported to the
target environment. Another set of programs will be needed on the target to:
Resolve foreign key references
Do any slowly changing dimension processing, separate the data into insert and update
data streams, and redirect invalid records to an error file (a sketch of this target-side
preparation follows below).
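As a rough illustration of the target-side preparation described above, the following Python
sketch resolves surrogate keys for the foreign key references and splits the incoming records
into insert, update, and error streams. The table layouts, key names, and simple in-memory
lookups are assumptions for illustration; slowly changing dimension handling is omitted.

```python
import csv

def prepare_target_load(incoming, customer_lookup, product_lookup, existing_facts):
    """Split incoming fact records into insert/update streams and an error stream.

    incoming        : iterable of dicts with natural keys (customer_id, product_id)
    customer_lookup : dict natural key -> surrogate key, built from the customer dimension
    product_lookup  : dict natural key -> surrogate key, built from the product dimension
    existing_facts  : set of (customer_sk, product_sk, order_date) already in the warehouse
    """
    inserts, updates, errors = [], [], []
    for rec in incoming:
        cust_sk = customer_lookup.get(rec["customer_id"])
        prod_sk = product_lookup.get(rec["product_id"])
        if cust_sk is None or prod_sk is None:
            # Foreign key reference could not be resolved; redirect to the error file.
            errors.append(rec)
            continue
        row = {"customer_sk": cust_sk, "product_sk": prod_sk,
               "order_date": rec["order_date"], "amount": rec["amount"]}
        if (cust_sk, prod_sk, rec["order_date"]) in existing_facts:
            updates.append(row)   # record already loaded; route to the update stream
        else:
            inserts.append(row)   # new record; route to the insert stream
    return inserts, updates, errors

def write_error_file(errors, path="rejected_records.csv"):
    """Persist rejected records so they can be investigated and reprocessed."""
    if errors:
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=errors[0].keys())
            writer.writeheader()
            writer.writerows(errors)
```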
An ETL or replication tool will never be able to extract from every possible data source.
But as long as it can effectively handle the transformation and load processing, and a large
portion of the extraction, combined with single-source-system file extracts or CDC
triggers, it may be worth the investment.
Scheduling:
Scheduling processes across multiple platforms can be tricky; what makes it tricky is the
heterogeneity of most systems in an enterprise.
In organizations with disparate systems, a third-party scheduling tool is often employed to
manage the enterprise processing.
If an ETL tool is not used, three basic process types will need to be scheduled:
i. Extract processes on the source systems
ii. File transfer processes that move the extracted data to the target environment
iii. Transformation and load processes on the target system
Third-party scheduling tools can handle all three of these basic steps.
One of the strengths of a good ETL tool is the ability to bridge the gap between disparate
systems using ODBC drivers, gateways, or FTP connections. Additionally, ETL tools can
invoke external processes and, in some cases, remote processes. The execution of these
processes can often be managed internally without the assistance of a third-party
scheduler.
Staging Area
A staging area is a data store where data from the source systems can be integrated,
transformed, and prepared for loading into the target warehouse repositories.
Depending on the size of warehousing effort, the staging area can be physical or virtual.
The staging area is where dimensions are constructed and conformed prior to being
distributed to the target environments.
It is also where transformations are applied and surrogate key values can be resolved for
fact records.
When using a replication tool or homegrown trigger-based CDC, the replicated data or
CDC log is written to the staging area.
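For the homegrown trigger-based CDC mentioned above, one possible shape is a set of triggers
on the source table that write each change into a staging table. This sketch uses SQLite syntax
purely for illustration; the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Source table (illustrative)
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT,
    region        TEXT
);

-- Staging table that receives the change log
CREATE TABLE stg_customer_changes (
    customer_id   INTEGER,
    customer_name TEXT,
    region        TEXT,
    change_type   TEXT,                      -- 'I' for insert, 'U' for update
    changed_at    TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Triggers capture every change as it happens on the source
CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO stg_customer_changes (customer_id, customer_name, region, change_type)
    VALUES (NEW.customer_id, NEW.customer_name, NEW.region, 'I');
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO stg_customer_changes (customer_id, customer_name, region, change_type)
    VALUES (NEW.customer_id, NEW.customer_name, NEW.region, 'U');
END;
""")

# The batch load then reads stg_customer_changes instead of scanning the full source table.
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd', 'EMEA')")
conn.execute("UPDATE customer SET region = 'APAC' WHERE customer_id = 1")
print(conn.execute("SELECT customer_id, change_type FROM stg_customer_changes").fetchall())
```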
An effective staging area will allow you to build your warehouse literally around the
clock without affecting users.
The staging area does not have to be a complete copy of the warehouse. While it must
have complete copies of the dimension tables in order to keep everything consistent, it
does not need a complete history of the fact tables; perhaps two to six batch cycles of data
are enough.
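If only a few batch cycles of fact data are kept in the staging area, a periodic purge is needed.
A small sketch, assuming the staging table carries a monotonically increasing batch_id:

```python
def prune_fact_staging(conn, keep_cycles=6):
    """Keep only the most recent batch cycles in the fact staging table.

    Assumes stg_sales carries a batch_id that increases by one with every load cycle.
    """
    conn.execute(
        """
        DELETE FROM stg_sales
        WHERE batch_id < (SELECT MAX(batch_id) FROM stg_sales) - ? + 1
        """,
        (keep_cycles,),
    )
    conn.commit()
```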
Development and Operations
Two key areas should be inherent in an ETL architecture:
Development lifecycle controls and
Operation
Development life cycle control is the ability to control development and manage the
promotion of objects from development to test and on to production.
The ETL architecture needs administrative functions that allow for the versioning of
objects and their promotion from the development environment to the test environment
for integration and acceptance testing.
Target Databases
Oddly enough, the ability to target a variety of database types is overlooked or glossed
over more than one might think.
Most ETL and replication tools can write to most RDBMS systems.
Another target database platform to consider is a multidimensional database (MDDB)
format such as Oracle Express, Oracle's MDDB OLAP product.
Change Data Capture
Change Data Capture (CDC) is an important element of the extract analysis.
Transactions that drive fact data almost always have timestamps. However, dimension
data in the source systems does not always have timestamps because it is typically not
tied to an event. Hence, CDC is most difficult to implement for dimension data.
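For fact data, timestamp-based CDC usually reduces to extracting everything stamped after a
saved watermark. A minimal sketch, assuming a transaction_ts column on the source sales
table and a DB-API connection:

```python
def extract_new_transactions(conn, last_extract_ts):
    """Pull only transactions stamped after the previous extract (timestamp-based CDC).

    The watermark last_extract_ts is stored by the ETL process between runs.
    """
    cur = conn.execute(
        "SELECT sale_id, customer_id, amount, transaction_ts "
        "FROM sales WHERE transaction_ts > ? ORDER BY transaction_ts",
        (last_extract_ts,),
    )
    rows = cur.fetchall()
    # The new watermark is the highest timestamp seen in this batch.
    new_watermark = rows[-1][3] if rows else last_extract_ts
    return rows, new_watermark
```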
Target Table Refresh Strategy:
Not only the CDC approach on the source, but also the load strategy of the target tables,
affects load increment processing.
All dimension tables follow one of these three strategies, with only two exceptions: static
dimension tables and dimension tables that are completely replaced with new data. Static
dimension tables never change, and if they do, the only change is the addition of new
records. A time dimension is a good example of a static dimension.
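Since the time dimension is the classic static dimension, here is a small sketch of how one
might be generated once and then only ever extended by appending new dates; the column
choices are illustrative.

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Generate one row per calendar day; a static dimension is built once and
    only ever extended by appending new dates."""
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),   # surrogate key derived from the date
            "calendar_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.isoweekday(),
        })
        day += timedelta(days=1)
    return rows

# Example: rows for one year, appended to the dimension table by the load job.
time_rows = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31))
```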
Job Scheduling
Determining when a job should run depends on several things. Dependencies are derived
from:
External processing schedules,
Client expectations of data availability, and
Your own internal processing schedules.
External Processing
The best place to start is with the source data. When is the source data available? Is there
a batch job that prepares the source data, and when does it complete? What backups
and/or system maintenance need to be completed?
Client Expectations
After determining the source data availability, the client's needs are the next
consideration. Where does the client expect to see this data? How often do they expect it
to be updated? What are the service level agreements with the client on data availability?
It is generally best to examine the external schedules before attempting to alter client
expectations.
Internal Processing
Just like the external systems, the internal processing schedule probably already includes
other batch load processing and backups. There are also dependencies in the load schedule
that require certain tables to be loaded before others so that referential integrity is ensured.
So scheduling has two issues:
What data dependencies exist that will affect when the jobs that load the data are
scheduled (a small dependency-ordering sketch follows this list), and
When the clients expect the data to be available.
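One way to express the load-order dependencies is as a small dependency graph that is
topologically sorted before scheduling, so dimensions always load before the facts that
reference them; the table names below are illustrative.

```python
from graphlib import TopologicalSorter

# Load-order dependencies: each table lists the tables it depends on.
# Dimension tables must be loaded before the fact tables that reference them
# so that referential integrity is preserved.
dependencies = {
    "dim_customer": set(),
    "dim_product": set(),
    "dim_time": set(),
    "fact_sales": {"dim_customer", "dim_product", "dim_time"},
    "fact_returns": {"dim_customer", "dim_product", "dim_time", "fact_sales"},
}

load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order)   # dimensions first, then the dependent fact tables
```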
Balancing
Balancing is the summing up of additive values that are in both the source and target
tables.
Columns that have monetary values or other types of numeric data are the usual
candidates for balancing.
Sometimes additive values do not exist in a source extract or target table. Queries that
count the number of records by certain classifications provide a similar measure when
additive values do not exist. Counts do not give the assurance that additive summations
do, but they are still an accurate balancing mechanism.
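A minimal sketch of a complete balancing check that compares both the summed additive
column and the record counts between the source extract and the target table; the table and
column names are assumptions, and both connections are assumed to follow the Python DB-API.

```python
def balance(src_conn, tgt_conn):
    """Complete balancing: compare the summed additive column and the row counts
    between the source extract and the target warehouse table."""
    src_sum, src_cnt = src_conn.execute(
        "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM src_sales_extract"
    ).fetchone()
    tgt_sum, tgt_cnt = tgt_conn.execute(
        "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM fact_sales"
    ).fetchone()
    if src_sum != tgt_sum or src_cnt != tgt_cnt:
        raise ValueError(
            f"Out of balance: sum {src_sum} vs {tgt_sum}, count {src_cnt} vs {tgt_cnt}"
        )
```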
How much to Balance?
Each table has a distinctly different data set, and the processing for each table is unique.
Sometimes it makes sense to balance all of the data in a table after each load (complete
balancing); at other times it makes sense to balance only the data in the most recent load
(incremental balancing). Several factors contribute to this decision:
Table size
Table refresh strategy and
Table update strategy
Table Size:
A table whose size stays constant or that has slow growth may need to be completely
balanced after each load. Large tables that grow considerably during each load may be
candidates for incremental balancing.
Table refresh strategy:
If the table is completely refreshed, then completely balance it after each load. Tables that
incrementally change or append data during each execution may be candidates for
incremental balancing.
Table update strategy:
If the data loads exclusively append input records to the target warehouse table, then
incremental balancing may be all that is necessary. If the input records cause updates to
existing records in the warehouse table, then consider a complete balancing of the table
after the load.
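Building on the complete-balancing sketch earlier, an incremental check restricts both sides to
the most recent load; the load_date column is an assumption.

```python
def balance_incremental(src_conn, tgt_conn, load_date):
    """Incremental balancing: compare sums for the most recent load only.
    Assumes both tables carry the load_date of the batch that delivered each row."""
    src_sum = src_conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM src_sales_extract WHERE load_date = ?",
        (load_date,),
    ).fetchone()[0]
    tgt_sum = tgt_conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM fact_sales WHERE load_date = ?",
        (load_date,),
    ).fetchone()[0]
    if src_sum != tgt_sum:
        raise ValueError(f"Incremental load out of balance: {src_sum} vs {tgt_sum}")
```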