
Article for DM Review

Author: Mark Zwijsen, Atos Origin Nederland B.V.
Date: September 26, 2008

Model-driven design of ETL functions

A structured approach leads to higher quality in all aspects.

Summary
In this article, the author presents a structured approach for describing ETL
functionality. This approach is platform-independent and, when applied to
average-sized and larger data warehouse environments, leads to cost reductions
in programming and testing, as well as in future design changes.

The volume and change rate of ETL programs demand a high-quality
design approach
Most data warehousing projects contain large numbers of ETL (Extract,
Transform, Load) components. An average data warehouse easily contains
dozens if not hundreds of tables (facts, dimensions, aggregates) and the number
of ETL components often exceeds the number of tables in a data warehouse.
This is because at least one separate ETL program is normally designed and built
for each table. In many cases more than one component is used per table: if a
table contains data from more than one source, an ETL component is usually built
for each source of that table.

A data warehouse is subject to change during its whole lifecycle. These changes
come from evolving user requirements, company processes or the market in
which the company operates. This means that new tables and ETL components
have to be built and existing components need adjustments.
An efficient method for the design and development of ETL components will
prove to be invaluable in a data warehousing project.

Current ETL design practices lead to low quality and high costs
Essentially, there is no difference between the design and development process
of ETL components and that of other data processing systems. In both cases,
functional requirements lead to the final product. However, in the practice of
designing or specifying the ETL-transformations, there is no clear distinction
between the logical and the technical design level. One of the reasons for this is
that formal specification languages for ETL are not available or widely
recognized. This is in contrast to the design practice for data structures, where
formal specification languages at the logical level have been available for many
years.

When modelling a data structure, a clear distinction is made between the logical
and technical data model. The logical level describes the structure, the meaning
and relationship of the data, without any knowledge or consideration of the
database management system (DBMS) that will be used. Entity Relationship
modelling is often used as the logical specification language. The technical level
describes the way in which the data structure needs to be implemented in the
specific DBMS.
The technical model originates from a transformation of the logical model, which
undergoes implementation specific adjustments needed for various technical
improvements, such as optimisation and I/O performance.

In designing ETL components, the distinction between the logical and the
technical level is rarely made. The design documents usually contain a mixture of
functional and technical descriptions and specifications. The language used is
informal, which often means that it is open to interpretation, incomplete and does
not comply with standard terminology. This means that the programmer is free to
make assumptions and interpretations, which can endanger the quality of the end
result (the ETL component).
In many data warehousing projects, the design documentation is produced after
the software has been developed. The temptation to document by reverse
engineering is huge. In this type of documentation it is not uncommon to find
many tool specific descriptions included.

High-quality ETL designs are purely functional and model-driven
So what is a high quality ETL design and why do we need it? In order to answer
this question, the term “quality” needs to be defined. In this article, we use a
definition based on publications by Cavano, McCall and Boehm on the aspects of
quality in software. The term ‘enduring usability’ is the key indicator. The
‘enduring usability’ of a product (in this case an ETL design) is measured by its
usability, maintainability and portability.
The actual design documentation is often completely written in the technical
language of the chosen ETL tool. Unfortunately, this has undesirable
consequences for maintainability on a functional level (structure, modifiability,
legibility, testability) and for portability. Any consideration of (partially) switching
to a different ETL tool is often obstructed by this tool lock-in of the documentation.
This means that pure functional designs are very important, especially for the
maintainability and portability of the data warehouse software.

A universal functional model for ETL functions forms a solid base for good
functional designs. This principle has already firmly proven itself in data
modeling. The question that remains is how such a model should look. To
answer this question we have analyzed a large set of ETL programs from
different data warehouse implementations.

The common functionality of most ETL programs can be split into two parts:
1. In the first part, a model transformation is applied. It could also be called a
metamorphosis: the shape of the data changes from the source model shape to
the target model shape.
2. In the second part, the data - which is now in the ‘target shape’ - is processed
into the data warehouse, where aspects like denormalisation and history handling
(for example, slowly changing dimensions) are dealt with.

This second part is rather commonplace and a lot of ETL tools offer automated
support (wizards). This part can be designed and developed with a small set of
rules.

The actual transformation takes place in the first part. This is where the data
undergoes its change of shape. The data from the source is the candidate data
for the data warehouse. A candidate undergoes a kind of metamorphosis to
become insertable into the target environment. During the transformation process
first the source shape of the candidate is constructed, and then the target shape
is derived from this source shape.

The transformation model can be divided into 7 steps


These steps are:

1. Collect
2. Enrich
3. Filter
4. Link
5. Validate
6. Convert
7. Deliver

Figure 1: The seven transformation steps from source model to target model. The
trigger entity feeds the Collect step; the Enrich, Filter, Link, Validate, Convert and
Deliver steps are each driven by their own set of rules (enrichment, filter, link,
validation, conversion and delivery rules).

The transformation steps can be classified as:

1. Source model oriented: The rules and criteria are mainly related to the data
model of the source collection: Collect, Enrich, Filter. This is where the source
shape of the candidates is constructed.
2. Target model oriented: The rules and criteria are mainly related to the data
model of the target collection: Link, Convert, Deliver. This is where the target
shape of the candidates is constructed.
3. Quality oriented: The rules and criteria concern the quality (completeness,
correctness) of the target shape and are related to corrective or adaptive
actions: Validate.

The fundamental idea behind this classification and ranking is that similar
decisions, or transformations, on the data are brought together in one
transformation step, and that all the necessary input for a step is formed by the
output of the preceding step.

The output of every step is summarized in Figure 2.

Transformation step   Output
Collect               All candidates with the directly related available attributes
Enrich                All candidates with the directly and indirectly related attributes
Filter                All relevant candidates with the directly and indirectly related
                      attributes
Link                  All relevant candidates with the directly and indirectly related
                      attributes, plus the referencing key attributes of the target model
Validate              All relevant and qualitatively approved candidates with the directly
                      and indirectly related attributes, plus the referencing key attributes
                      of the target model
Convert               All relevant and qualitatively approved candidates with all necessary
                      attributes for the target model
Deliver               Candidates processed into the target model
Figure 2: Result of the functional transformation steps
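
To make the chaining of the seven steps concrete, the sketch below expresses them as a
sequence of functions in which the output of each step is the only input of the next. This is
merely an illustration of the principle: the language (Python), the function names and the
representation of the candidates are assumptions made for this example, not part of the
model itself.

    # A minimal sketch of the seven-step chain; each step receives the previous
    # step's output. The concrete step implementations are placeholders to be
    # supplied per ETL component.
    def run_transformation(collect, enrich, filter_, link, validate, convert, deliver):
        candidates = collect()              # 1. Collect: candidates from the trigger entity type
        candidates = enrich(candidates)     # 2. Enrich: add attributes from related source entities
        candidates = filter_(candidates)    # 3. Filter: keep only the relevant candidates
        candidates = link(candidates)       # 4. Link: add the referencing keys of the target model
        candidates = validate(candidates)   # 5. Validate: keep only qualitatively approved candidates
        candidates = convert(candidates)    # 6. Convert: derive the remaining target attributes
        deliver(candidates)                 # 7. Deliver: process the candidates into the target model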

Though this 7-step functional decomposition describes a global processing order,
the order can be changed in the technical implementation. This technical
implementation is driven not only by the proposed functionality, but also by the
features of the chosen ETL tool and by ‘tuning’ activities to achieve optimal
operational performance.

Elaboration is done using the table metaphor

Tables are very suitable for this purpose, not only because they are a technical
implementation of the logical concept ‘entity type’, but also because the data can
be instantly visualized.
However, the data that is made visible has only symbolic value, which requires
some imagination and interpretation from the reader. The advantage is that the
tables remain relatively small, which simplifies reading.

1. Collect: All candidates from the trigger table

The first step is collecting the data from the trigger table. The trigger entity type
represents the actual candidates that need to be processed. It is called the trigger
entity type because changes to entities of this type are the origin of changes in the
target data.

The functional information stored for this step is the name of the trigger-entity
type.

In the table metaphor, this is shown in Figure 3. In fact this is nothing more than
a table with contents.

Trigger-table A
#   Attribute A1   Attribute A2   Attribute A3
1   A              B              C
2   D              E              F
3   G              H              I
4   J              K              L
5   M              N              O
Figure 3: Table contents after Collection
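
Purely as an illustration of the Collect step (the Python representation, the names and
the sample data are assumptions for this example), the trigger table of Figure 3 could be
read as follows:

    # Illustrative sketch of the Collect step: read every row of the trigger table.
    # The trigger table is hard-coded here to mirror Figure 3; in practice it would
    # be read from the source system or a staging area.
    TRIGGER_TABLE_A = [
        {"id": 1, "A1": "A", "A2": "B", "A3": "C"},
        {"id": 2, "A1": "D", "A2": "E", "A3": "F"},
        {"id": 3, "A1": "G", "A2": "H", "A3": "I"},
        {"id": 4, "A1": "J", "A2": "K", "A3": "L"},
        {"id": 5, "A1": "M", "A2": "N", "A3": "O"},
    ]

    def collect(trigger_table=TRIGGER_TABLE_A):
        """Return all candidates of the trigger entity type with their directly related attributes."""
        return [dict(row) for row in trigger_table]   # copy the rows so later steps can add attributes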

2. Enrich: add attributes from other source tables


The trigger entity type candidates are probably not yet in their full target shape. It
might be necessary to enrich these candidates with attributes from related
entities.
Two relevant motives for this enrichment are:
• Denormalization in the target model (for the sake of a dimension in a
dimensional model).
• Combination: Combining information from several sources.

The main purpose of the Enrich step is that the relationships between the trigger
entity type and the remaining relevant entity types are specified in such a way
that the correct attributes are unambiguously added to the candidates.

    Trigger-table A     Related table B   Related table C
#   A1    A2    A3      B1    B2          C1    C2
1   A     B     C       P     Q           R     S
2   D     E     F       T     U           V     W
3   G     H     I       X     Y           Z     1
4   J     K     L       2     3           4     5
5   M     N     O       6     7           8     9
Figure 4: Table contents after Enrichment
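
Continuing the same illustrative sketch (the join keys and lookup structures below are
assumptions made for this example; the real enrichment rules come from the functional
design), the Enrich step adds the attributes of related tables B and C to each candidate:

    # Illustrative sketch of the Enrich step: add attributes from related tables B and C.
    # The relationship is assumed to be a direct lookup on the candidate id.
    RELATED_TABLE_B = {1: {"B1": "P", "B2": "Q"}, 2: {"B1": "T", "B2": "U"},
                       3: {"B1": "X", "B2": "Y"}, 4: {"B1": "2", "B2": "3"},
                       5: {"B1": "6", "B2": "7"}}
    RELATED_TABLE_C = {1: {"C1": "R", "C2": "S"}, 2: {"C1": "V", "C2": "W"},
                       3: {"C1": "Z", "C2": "1"}, 4: {"C1": "4", "C2": "5"},
                       5: {"C1": "8", "C2": "9"}}

    def enrich(candidates):
        """Add the indirectly related attributes to each candidate (Figure 4)."""
        for candidate in candidates:
            candidate.update(RELATED_TABLE_B.get(candidate["id"], {}))
            candidate.update(RELATED_TABLE_C.get(candidate["id"], {}))
        return candidates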

3. Filter: Exclude Candidates


It is possible that not all candidates are relevant for processing into the target.
Reasons for not processing can be:
1. The whole population of the source table is offered for processing, whereas
only the changes in a specific period are needed.
2. If a department-specific data mart is being built, some candidates are simply
not relevant for that department.

To indicate which candidate is relevant and which is not, criteria on the attributes
are used.

Candidates 2 and 5 in Figure 5 do not meet the filter criteria. After this step, three
candidates remain (numbers 1, 3 and 4).

    Trigger-table A     Related table B   Related table C
#   A1    A2    A3      B1    B2          C1    C2
1   A     B     C       P     Q           R     S
2   D     E     F       T     U           V     W     (excluded)
3   G     H     I       X     Y           Z     1
4   J     K     L       2     3           4     5
5   M     N     O       6     7           8     9     (excluded)
Figure 5: Table contents after filtering (candidates 2 and 5 are excluded)
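
In the same illustrative sketch, the Filter step can be expressed as a predicate over the
candidate attributes. The criterion used below simply reproduces the outcome of Figure 5
and is an assumption for this example:

    # Illustrative sketch of the Filter step: keep only the relevant candidates.
    # A real filter would evaluate criteria on attribute values (e.g. a load period).
    def filter_candidates(candidates, excluded_ids=(2, 5)):
        """Return only the candidates that meet the filter criteria."""
        return [c for c in candidates if c["id"] not in excluded_ids]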

4. Link: Add key attributes of the target model


If a candidate in the target model has references to other entities (N : 1
relationship), then this step ensures that the reference keys are determined. In a
dimensional target model, this step mainly applies to fact candidates but in some
cases it can apply to dimension candidates.

This step has a lot of similarities with the ‘Enrich’ step. Like in the enrichment
process, the relationships between the candidates and other entity types need to
comply with specific criteria. The difference between this step and the enrichment
step is that here we look at the target model, and we are only interested in the
referencing key, not in the other attributes of the entity type to which we want to
link.

Figure 6 shows that the table is steadily growing. It is also starting to adopt the
shape of the target model.

    Trigger-table A     Related table B   Related table C   Target Table D
#   A1    A2    A3      B1    B2          C1    C2           FK1   FK2   FK3
1   A     B     C       P     Q           R     S            a     b     c
2   D     E     F       T     U           V     W            (excluded in Filter)
3   G     H     I       X     Y           Z     1            d     e
4   J     K     L       2     3           4     5            g     h     i
5   M     N     O       6     7           8     9            (excluded in Filter)
Figure 6: Table contents after Linking
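
In the illustrative sketch, the Link step resolves the referencing keys of the target model.
The lookup table and the key names FK1..FK3 are assumptions chosen to reproduce
Figure 6; in practice the link rules specify how each reference is resolved, for example
against a dimension table:

    # Illustrative sketch of the Link step: add the referencing key attributes of the
    # target model. Candidate 3 deliberately misses FK3, as in Figure 6.
    FK_LOOKUP = {
        1: {"FK1": "a", "FK2": "b", "FK3": "c"},
        3: {"FK1": "d", "FK2": "e"},
        4: {"FK1": "g", "FK2": "h", "FK3": "i"},
    }

    def link(candidates):
        """Add the referencing keys of the target model to each candidate."""
        for candidate in candidates:
            candidate.update(FK_LOOKUP.get(candidate["id"], {}))
        return candidates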

5. Validate: Verifying completeness and correctness of the candidates

Every candidate needs to comply with the specified quality criteria before being
inserted into the target. These criteria are also called validation rules.
A validation rule can be related to an enrichment (did the enrichment succeed or
fail?), a link (was the referencing key found or not?) or an attribute value (domain
validation, mandatory value).
This step has similarities with the Filter step. The main difference is that in the
Filter step a candidate is ignored and no further processing is done on that
candidate. In the Validate step every candidate is relevant, but a candidate may
need more processing before it can pass validation. Corrective actions are needed
to get such a candidate through the validation process. Examples of corrective
actions are offering the candidate for validation again at a later moment, or
replacing an invalid reference by a valid one (for example, a dummy reference).

Figure 7 shows that candidates 3 and 4 do not comply with the quality criteria. For
each criterion, a follow-up action is defined. After this step, one candidate remains
(number 1).

    Trigger-table A     Related table B   Related table C   Target Table D         Action
#   A1    A2    A3      B1    B2          C1    C2           FK1   FK2   FK3
1   A     B     C       P     Q           R     S            a     b     c
2   D     E     F       T     U           V     W            (excluded in Filter)
3   G     H     I       X     Y           Z     1            d     e                Re-process
4   J     K     L       2     3           4     5            g     h     i          Corrective action
5   M     N     O       6     7           8     9            (excluded in Filter)
Figure 7: Table contents after validation; only candidate 1 remains
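
In the illustrative sketch, the Validate step checks each candidate against the validation
rules and decides on a follow-up action for candidates that fail. The two rules below are
assumptions chosen purely to reproduce the situation of Figure 7:

    # Illustrative sketch of the Validate step: keep only the qualitatively approved
    # candidates and report a follow-up action for the others.
    def validate(candidates):
        """Return the approved candidates; print the follow-up action for rejected ones."""
        approved = []
        for candidate in candidates:
            if "FK3" not in candidate:                         # link validation: a reference was not found
                print(f"candidate {candidate['id']}: re-process later")
            elif not str(candidate.get("B2", "")).isalpha():   # domain validation on attribute B2
                print(f"candidate {candidate['id']}: corrective action needed")
            else:
                approved.append(candidate)
        return approved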

6. Convert: Establish target attributes derived from source attributes

The target shape of the candidates is completed by deriving all attributes that are
needed in the target model (not including the referencing key attributes, which
have already been determined in the Link step).
For each target model attribute, we specify how it must be derived from one or
more source attributes.

Figure 8 shows the contents of the table with the complete target model.

    Trigger-table A     Related table B   Related table C   Target Table D
#   A1    A2    A3      B1    B2          C1    C2           FK1   FK2   FK3   D1   D2   D3
1   A     B     C       P     Q           R     S            a     b     c     x    x    z
2   D     E     F       T     U           V     W            (excluded in Filter)
3   G     H     I       X     Y           Z     1            d     e           (rejected in Validate)
4   J     K     L       2     3           4     5            g     h     i     (rejected in Validate)
5   M     N     O       6     7           8     9            (excluded in Filter)
Figure 8: Table contents after converting
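
In the illustrative sketch, the Convert step derives the remaining target attributes D1..D3
from the source attributes. The derivation rules below are invented for this example; a real
design specifies, per target attribute, exactly how it is derived:

    # Illustrative sketch of the Convert step: derive all attributes needed in the
    # target model (the referencing keys were already determined in the Link step).
    def convert(candidates):
        """Return candidates in their complete target shape (Figure 8)."""
        converted = []
        for c in candidates:
            converted.append({
                "FK1": c["FK1"], "FK2": c["FK2"], "FK3": c["FK3"],
                "D1": c["A1"] + c["A2"],    # e.g. a concatenation of two source attributes
                "D2": c["B1"].lower(),      # e.g. a standardisation of a code
                "D3": c["C1"],              # e.g. a straight copy
            })
        return converted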

7. Deliver: loading the candidates into the target database


The last step is the actual delivery, or loading, of the new candidates into the
target database. In this step the following rules apply:
- rules concerning history handling (slowly changing dimension types);
- rules describing the atomicity of the load, i.e. when existing data has to be
replaced, whether this is done record by record or as a batch replacement.
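
To complete the illustrative sketch, the Deliver step below loads the converted candidates
into an in-memory stand-in for the target table. The record-by-record replacement without
history is an assumption for this example; the real delivery rules determine the history
handling and the atomicity of the load:

    # Illustrative sketch of the Deliver step: process the candidates into the target
    # model, here simulated as a dictionary keyed on the referencing keys.
    TARGET_TABLE_D = {}

    def deliver(candidates):
        """Insert or replace each candidate in the target table (no history kept)."""
        for c in candidates:
            key = (c["FK1"], c["FK2"], c["FK3"])
            TARGET_TABLE_D[key] = c          # record-based replacement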

So all you need is a set of rules and criteria

The transformation rules, per step and related to the source and target model,
can be recorded in a table. Together with the data models of the source and
target, this table forms the complete functional description of the transformation
process. This functional description is truly a ‘Platform Independent Model’.
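
Purely as a hypothetical illustration of such a table (the step names come from the model;
the example rules themselves are invented and do not refer to any real design), it could
look like this:

    Step      Rule
    Collect   Trigger entity type = Order line
    Enrich    Join Order line to Product on product id (add product name and product group)
    Filter    Only order lines with an order date within the load period
    Link      Resolve customer id against the Customer dimension to obtain the customer key
    Validate  Customer key must be found; otherwise link to the dummy customer and re-process later
    Convert   Revenue = quantity * unit price
    Deliver   Insert only; history handled per the chosen slowly changing dimension type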

Conclusion
This model-driven approach offers a number of advantages over the commonly
used approaches (prose descriptions, reverse engineering):
- This model can be automated, provided that it is complete. Case tools which
can automatically or semi-automatically generate data structures and more
classical data processing components have been available for a considerable
time. Case-tool automation should also be possible for ETL; further analysis is
needed to accomplish this.
- This model-driven approach generally complies with the quality requirements
described in this article. It excels especially in structure, efficiency,
consistency, transferability and testability.
- As always, there is also a financial side. Ultimately, this approach will deliver
financial rewards: the model offers a few simple methods for monitoring
completeness and correctness, and it cuts down the amount of rework.

Mark Zwijsen is Senior Consultant Data Warehousing and Business Intelligence at Atos Origin.
He can be reached at mark.zwijsen@atosorigin.com
