Model Driven Logical ETL Design, Part 1
Summary
In this article, the author presents a structured approach for describing ETL
functionality. This approach is platform-independent and, when applied to
average-sized and larger data warehouse environments, leads to cost reductions
in programming and testing, as well as in future design changes.
The volume and change rate of ETL programs demand a high quality
design approach
Most data warehousing projects contain large numbers of ETL (Extract,
Transform, Load) components. An average data warehouse easily contains
dozens if not hundreds of tables (facts, dimensions, aggregates) and the number
of ETL components often exceeds the number of tables in a data warehouse.
This is because at least one separate ETL program is normally designed and built
for each table. In many cases more than one component is used per table: if a
table contains data from more than one source, an ETL component is usually built
for each source of that table.
A data warehouse is subject to change during its whole lifecycle. These changes
come from evolving user requirements, company processes or the market in
which the company operates. This means that new tables and ETL components
have to be built and existing components need adjustments.
An efficient method for the design and development of ETL components will
prove to be invaluable in a data warehousing project.
Current ETL design practices lead to low quality and high costs
Essentially, there is no difference between the design and development process
of ETL components and that of other data processing systems. In both cases,
functional requirements lead to the final product. However, in the practice of
designing or specifying the ETL-transformations, there is no clear distinction
between the logical and the technical design level. One of the reasons for this is
that formal specification languages for ETL are not available or widely
recognized. This is in contrast to the design practice for data structures, where
formal specification languages at the logical level have been available for many
years.
When modelling a data structure, a clear distinction is made between the logical
and technical data model. The logical level describes the structure, the meaning
and relationship of the data, without any knowledge or consideration of the
database management system (DBMS) that will be used. Entity Relationship
modelling is often used as the logical specification language. The technical level
describes the way in which the data structure needs to be implemented in the
specific DBMS.
The technical model originates from a transformation of the logical model, which
then undergoes implementation-specific adjustments needed for various technical
improvements, such as optimisation and I/O performance.
In designing ETL components, the distinction between the logical and the
technical level is rarely made. The design documents usually contain a mixture of
functional and technical descriptions and specifications. The language used is
informal, which often means that it is open to interpretation, incomplete, and
does not comply with standard terminology. This leaves the programmer free to
make assumptions and interpretations, which can endanger the quality of the end
result (the ETL component).
In many data warehousing projects, the design documentation is produced after
the software has been developed. The temptation to document by reverse
engineering is huge. In this type of documentation it is not uncommon to find
many tool specific descriptions included.
High quality ETL designs are purely functional and model driven
So what is a high quality ETL design and why do we need it? In order to answer
this question, the term “quality” needs to be defined. In this article, we use a
definition based on publications by Cavano, McCall and Boehm on the aspects of
quality in software. The term ‘enduring usability’ is the key indicator. The
‘enduring usability’ of a product (in this case an ETL design) is measured by its
usability, maintainability and portability.
The actual design documentation is often completely written in the technical
language of the chosen ETL tool. Unfortunately, this has undesirable
consequences for maintainability on a functional level (structure, modifiability,
legibility, testability) and for portability. Any consideration of a (partial) switch to
a different ETL tool is often obstructed by this tool lock-in of the documentation.
This means that pure functional designs are very important, especially for the
maintainability and portability of the data warehouse software.
A universal functional model for ETL functions forms a solid base for good
functional designs. This principle has already firmly proven itself in data
modeling. The question that remains is what such a model should look like. To
answer this question we have analyzed a large set of ETL programs from
different data warehouse implementations.
The common functioning of most of the ETL programs can be split into two parts:
1. In the first part, a model transformation is applied. It could also be called a
metamorphosis: The shape of the data changes from source model shape to
target model shape.
2. In the second part, the data - which is now in the 'target shape' - is processed
into the data warehouse, where aspects such as denormalisation and history
(slowly changing dimensions, which are dimension-specific) are handled.
This second part is rather commonplace and a lot of ETL tools offer automated
support (wizards). This part can be designed and developed with a small set of
rules.
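To make that "small set of rules" concrete, the sketch below shows one such generic rule as plain Python: a minimal type-2 slowly changing dimension update over dictionaries. It is an illustration only; the function name, the business-key argument and the valid_from/valid_to columns are assumptions, not part of the model described in this article.

from datetime import date

def apply_scd2(dimension_rows, candidate, business_key, today=None):
    # Minimal type-2 history rule: close the current version of a changed
    # record and append the new version (all names are illustrative).
    today = today or date.today()
    for row in dimension_rows:
        if row[business_key] == candidate[business_key] and row["valid_to"] is None:
            if all(row.get(k) == v for k, v in candidate.items()):
                return dimension_rows                  # unchanged: nothing to do
            row["valid_to"] = today                    # close the current version
            break
    dimension_rows.append(dict(candidate, valid_from=today, valid_to=None))
    return dimension_rows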
The actual transformation takes place in the first part. This is where the data
undergoes its change of shape. The data from the source is the candidate data
for the data warehouse. A candidate undergoes a kind of metamorphosis to
become insertable into the target environment. During the transformation process,
the source shape of the candidate is constructed first, and the target shape is then
derived from it.
1. Collect
2. Enrich
3. Filter
4. Link
5. Validate
6. Convert
7. Deliver

Figure 1: The seven transformation steps, from the source model to the target model
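Purely as an illustration (the article itself stays at the functional level), the Python sketch below mirrors Figure 1: each of the seven steps is a function whose input is the output of the preceding step. The step bodies are deliberately trivial placeholders, not the author's rules.

def collect(source_rows):                      # 1. all trigger rows with their direct attributes
    return [dict(row) for row in source_rows]

def enrich(candidates):                        # 2. add the indirectly related attributes
    return candidates

def filter_candidates(candidates):             # 3. keep only the relevant candidates
    return [c for c in candidates if c.get("relevant", True)]

def link(candidates):                          # 4. add the referencing keys of the target model
    return candidates

def validate(candidates):                      # 5. keep only qualitatively approved candidates
    return [c for c in candidates if c.get("valid", True)]

def convert(candidates):                       # 6. derive the remaining target-model attributes
    return candidates

def deliver(candidates, target_rows):          # 7. process the candidates into the target model
    target_rows.extend(candidates)

def run(source_rows, target_rows):
    candidates = collect(source_rows)
    for step in (enrich, filter_candidates, link, validate, convert):
        candidates = step(candidates)
    deliver(candidates, target_rows)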
The fundamental idea behind this classification and ranking is that similar
decisions, or transformations, on the data are brought together in one
transformation step, and that all the necessary input for a step is formed by the
output of the preceding step.
Transformation step   Output
Collect               All the candidates with the directly related, available attributes
Enrich                All candidates with the directly and indirectly related attributes
Filter                All relevant candidates with the directly and indirectly related attributes
Link                  All relevant candidates with the directly and indirectly related attributes, plus the referencing key attributes of the target model
Validate              All relevant and qualitatively approved candidates with the directly and indirectly related attributes, plus the referencing key attributes of the target model
Convert               All relevant and qualitatively approved candidates with all attributes necessary for the target model
Deliver               Candidates processed into the target model

Figure 2: Result of the functional transformation steps
The functional information stored for the first step, Collect, is the name of the
trigger entity type.
In the table metaphor, this is shown in Figure 3. In fact this is nothing more than
a table with contents.
     Trigger-table A
nr   A1   A2   A3
1    A    B    C
2    D    E    F
3    G    H    I
4    J    K    L
5    M    N    O

Figure 3: Table contents after Collection
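A minimal sketch of the Collect step, using the rows of Figure 3 as literal data. The dictionary representation and the 'nr' key are assumptions made for the sake of the example.

# Trigger-table A as shown in Figure 3.
trigger_table_a = [
    {"nr": 1, "A1": "A", "A2": "B", "A3": "C"},
    {"nr": 2, "A1": "D", "A2": "E", "A3": "F"},
    {"nr": 3, "A1": "G", "A2": "H", "A3": "I"},
    {"nr": 4, "A1": "J", "A2": "K", "A3": "L"},
    {"nr": 5, "A1": "M", "A2": "N", "A3": "O"},
]

def collect(trigger_table):
    # Collect: one candidate per trigger row, carrying its directly related attributes.
    return [dict(row) for row in trigger_table]

candidates = collect(trigger_table_a)   # five candidates, as in Figure 3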
The main purpose of Enrichment is that the relationships between the trigger
entity type and the remaining relevant entity types are specified in such a way
that the correct attributes are unambiguously added to the target shape.
     Trigger-table A    Related table B   Related table C
nr   A1   A2   A3       B1   B2           C1   C2
1    A    B    C        P    Q            R    S
2    D    E    F        T    U            V    W
3    G    H    I        X    Y            Z    1
4    J    K    L        2    3            4    5
5    M    N    O        6    7            8    9

Figure 4: Table contents after Enrichment
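Continuing the sketch, the Enrich step can be illustrated as a lookup into the related tables B and C. The join keys and the assignment of the values to B1/B2 and C1/C2 are assumptions; the real relationships are part of the functional design.

# Hypothetical related tables B and C, keyed on the candidate number.
related_b = {1: {"B1": "P", "B2": "Q"}, 2: {"B1": "T", "B2": "U"},
             3: {"B1": "X", "B2": "Y"}, 4: {"B1": "2", "B2": "3"},
             5: {"B1": "6", "B2": "7"}}
related_c = {1: {"C1": "R", "C2": "S"}, 2: {"C1": "V", "C2": "W"},
             3: {"C1": "Z", "C2": "1"}, 4: {"C1": "4", "C2": "5"},
             5: {"C1": "8", "C2": "9"}}

def enrich(candidates):
    # Enrich: add the indirectly related attributes to every candidate.
    return [dict(c, **related_b[c["nr"]], **related_c[c["nr"]]) for c in candidates]

candidates = enrich(candidates)   # candidates now have the shape of Figure 4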
In the Filter step, criteria on the attributes indicate which candidates are relevant
and which are not. Candidates 2 and 5 in Figure 5 do not meet the filter criteria.
After this step, three candidates remain (numbers 1, 3 and 4).
     Trigger-table A    Related table B   Related table C
nr   A1   A2   A3       B1   B2           C1   C2
1    A    B    C        P    Q            R    S
2    D    E    F        T    U            V    W    (does not meet the filter criteria)
3    G    H    I        X    Y            Z    1
4    J    K    L        2    3            4    5
5    M    N    O        6    7            8    9    (does not meet the filter criteria)

Figure 5: Table contents after Filtering
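A sketch of the Filter step. The criterion used here (an invented predicate on attribute A2 that happens to drop candidates 2 and 5) only stands in for whatever attribute criteria the functional design specifies.

def is_relevant(candidate):
    # Illustrative criterion on the attributes; the real criteria are design-specific.
    return candidate["A2"] not in ("E", "N")

def filter_candidates(candidates):
    # Filter: keep only the relevant candidates.
    return [c for c in candidates if is_relevant(c)]

candidates = filter_candidates(candidates)   # candidates 1, 3 and 4 remain, as in Figure 5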
The Link step has many similarities with the Enrich step. As in the enrichment
process, the relationships between the candidates and other entity types need to
comply with specific criteria. The difference is that in this step we look at the
target model, and we are only interested in the referencing key, not in the other
attributes of the entity type we want to link to.
Figure 6 shows that the table is steadily growing. It is also starting to adopt the
shape of the target model.
     Trigger-table A    Related table B   Related table C   Target table D
nr   A1   A2   A3       B1   B2           C1   C2           FK1   FK2   FK3
1    A    B    C        P    Q            R    S            a     b     c
2    D    E    F        T    U            V    W
3    G    H    I        X    Y            Z    1            d     e
4    J    K    L        2    3            4    5            g     h     i
5    M    N    O        6    7            8    9

Figure 6: Table contents after Linking
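A sketch of the Link step, assuming a key lookup against the target model that yields only the referencing keys of target table D (values taken from Figure 6). Which source attributes the keys are looked up on is left out of the illustration.

# Hypothetical lookup of the referencing keys in the target model (values follow Figure 6).
fk_lookup = {
    1: {"FK1": "a", "FK2": "b", "FK3": "c"},
    3: {"FK1": "d", "FK2": "e"},               # one key could not be resolved for candidate 3
    4: {"FK1": "g", "FK2": "h", "FK3": "i"},
}

def link(candidates):
    # Link: add only the referencing key attributes of the target model, no other attributes.
    return [dict(c, **fk_lookup.get(c["nr"], {})) for c in candidates]

candidates = link(candidates)   # candidates now carry FK1..FK3 where they could be resolved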
Figure 7 shows that candidates 3 and 4 do not comply with the quality criteria of
the Validate step. For each criterion a follow-up action is defined. After this step,
one candidate remains (number 1).
     Trigger-table A    Related table B   Related table C   Target table D      Action
nr   A1   A2   A3       B1   B2           C1   C2           FK1   FK2   FK3
1    A    B    C        P    Q            R    S            a     b     c
2    D    E    F        T    U            V    W
3    G    H    I        X    Y            Z    1            d     e              Re-process
4    J    K    L        2    3            4    5            g     h     i        Corrective action
5    M    N    O        6    7            8    9

Figure 7: Table contents after Validation; only candidate 1 remains
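A sketch of the Validate step: every quality criterion has an associated follow-up action, failing candidates are routed to that action, and only approved candidates continue. The two criteria shown are invented so that the outcome matches Figure 7.

def validate(candidates):
    # Validate: apply the quality criteria; each failed criterion has its own follow-up action.
    approved, rejected = [], []
    for c in candidates:
        if not all(key in c for key in ("FK1", "FK2", "FK3")):
            rejected.append((c, "Re-process"))        # e.g. a referencing key is still unresolved
        elif c["nr"] == 4:                            # stand-in for some other quality rule
            rejected.append((c, "Corrective action"))
        else:
            approved.append(c)
    return approved, rejected

candidates, failures = validate(candidates)   # only candidate 1 is approved, as in Figure 7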
In the Convert step, the remaining attributes needed by the target model are
derived. Figure 8 shows the contents of the table with the complete target model.
     Trigger-table A    Related table B   Related table C   Target table D
nr   A1   A2   A3       B1   B2           C1   C2           FK1   FK2   FK3   D1   D2   D3
1    A    B    C        P    Q            R    S            a     b     c     x    x    z
2    D    E    F        T    U            V    W
3    G    H    I        X    Y            Z    1            d     e
4    J    K    L        2    3            4    5            g     h     i
5    M    N    O        6    7            8    9

Figure 8: Table contents after Converting
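Finally, a sketch of the Convert step, which derives the remaining attributes D1..D3 required by the target model. The derivation rules shown are placeholders; the actual rules come from the functional design.

def convert(candidates):
    # Convert: derive all remaining attributes that the target model needs.
    converted = []
    for c in candidates:
        converted.append(dict(c,
                              D1=c["A1"].lower(),    # placeholder derivation rules only;
                              D2=c["B1"].lower(),    # the real rules are design-specific
                              D3=c["C2"].lower()))
    return converted

candidates = convert(candidates)   # candidates now carry every target-model attribute (Figure 8)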
Conclusion
This model driven approach offers a number of advantages over the commonly
used approaches (prose descriptions, reverse engineering):
- This model can be automated, provided that it is complete. CASE tools that can
  automatically or semi-automatically generate data structures and more classical
  data processing components have been available for a considerable time. Such
  automation should also be possible for ETL; further analysis is needed to
  accomplish this.
- This model driven approach generally complies with the quality requirements
described in this article. This approach excels especially in structure,
efficiency, consistency, transferability and testability.
- And, as always, there is also a financial side to this. Ultimately, this approach
  will deliver financial rewards: the model offers a few straightforward methods for
  monitoring completeness and correctness and cuts down the amount of rework.
Mark Zwijsen is Senior Consultant Data Warehousing and Business Intelligence at Atos Origin.
He can be reached at mark.zwijsen@atosorigin.com