Calculates Totals or Other Aggregate Functions For Each Group. The Summed Totals For Each Group Are Output From The Stage Through The Output Link

The document discusses various ETL stages such as Aggregator, Pivot, and Surrogate Key Generator, along with their functions and properties. These stages are used to transform and cleanse data.

The Aggregator stage can perform different types of aggregations like calculation, count rows, and re-calculation. Calculation aggregates values into a new column, count rows counts the number of records in each group, and re-calculation recalculates a previously calculated column.

The Pivot stage converts all the columns into one column. It can be used to convert a slowly changing dimension from type 3 to type 2 by pivoting the column values into a single column with a derivation. This places the input column values into a single output column.

Aggregator Stage :

Definition : The Aggregator stage classifies data rows from a single input link into groups and
calculates totals or other aggregate functions for each group. The summed totals for each
group are output from the stage through the output link.

A group is a set of records with the same value for one or more columns.

Example : Transaction records might be grouped by both day of the week and by
month. These groupings might show that the busiest day of the week varies by season.
Input & View data :
The INPUT page shows you the metadata of
the incoming data.

The input data look like this…


Properties :

When "Aggregation Type = Calculation" …

Here, we group by "Gender".

The column to aggregate.

User-defined column to collect the aggregated values.
Output & View data :
The OUTPUT page shows only those columns
used to group and aggregate.

As we have grouped by "Gender", the incomes of Males and Females are summed
up and shown here.
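The "Calculation" behaviour described above can be sketched in plain Python. This is only an illustration of the grouping logic, not the stage's implementation; the sample rows, the function name `aggregate_sum`, and the output column name `TotalIncome` are assumptions made for the example.

```python
from collections import defaultdict

def aggregate_sum(rows, group_col, value_col, out_col):
    """Group rows by group_col and sum value_col into out_col."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    # Like the OUTPUT page, only the grouping and aggregate columns survive.
    return [{group_col: g, out_col: t} for g, t in totals.items()]

# Hypothetical sample rows; the real input is whatever arrives on the input link.
rows = [
    {"Name": "Anil", "Gender": "Male", "Income": 1000},
    {"Name": "Bina", "Gender": "Female", "Income": 1500},
    {"Name": "Carl", "Gender": "Male", "Income": 2000},
    {"Name": "Dina", "Gender": "Female", "Income": 500},
]
income_by_gender = aggregate_sum(rows, "Gender", "Income", "TotalIncome")
```

Each group produces exactly one output row, which is why the "Name" column does not appear in the result.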
Execution Mode :

Note :
The Aggregator stage can have only one output link.
Properties :

When "Aggregation Type = Count Rows" …

Here, we group by "Gender".

The column to be counted.


Output & View data :
The OUTPUT page shows only the grouping
column and the column to be counted.

As we have grouped by "Gender", the number of records for Males and for Females is
counted and shown here.
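A rough Python equivalent of "Count Rows" is shown here for comparison with the summing case. The rows and the output column name `RowCount` are hypothetical, and the stage itself is configured through its properties rather than code.

```python
from collections import Counter

def count_rows(rows, group_col, out_col):
    """Count how many rows fall into each group."""
    counts = Counter(row[group_col] for row in rows)
    return [{group_col: g, out_col: n} for g, n in counts.items()]

# Hypothetical sample rows.
rows = [
    {"Name": "Anil", "Gender": "Male"},
    {"Name": "Bina", "Gender": "Female"},
    {"Name": "Carl", "Gender": "Male"},
]
rows_per_gender = count_rows(rows, "Gender", "RowCount")
```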
Execution Mode :

Properties :

When "Aggregation Type = Re-calculation" …

Here, we group by "Gender".

The column to preserve the summary of the re-calculation.

Note :
When "Aggregation Type = Re-calculation", first place an extra Aggregator stage to aggregate a
column. The second Aggregator then re-calculates the previously calculated column.
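The two-Aggregator arrangement in this note can be pictured as two passes over the data. The sketch below is an assumption-laden illustration: it supposes the first pass takes a maximum per (Gender, Dept) group to produce "MaxVal" and the second pass sums "MaxVal" per Gender into "SumOfMaxVal". The "Dept" column, the sample rows, and the function names are hypothetical.

```python
def first_pass_max(rows, group_cols, value_col, out_col):
    """First Aggregator: keep the maximum of value_col per group."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        if key not in best or row[value_col] > best[key]:
            best[key] = row[value_col]
    return [dict(zip(group_cols, key), **{out_col: v}) for key, v in best.items()]

def second_pass_sum(rows, group_col, value_col, out_col):
    """Second Aggregator: re-calculate the already aggregated column."""
    totals = {}
    for row in rows:
        totals[row[group_col]] = totals.get(row[group_col], 0) + row[value_col]
    return [{group_col: g, out_col: t} for g, t in totals.items()]

# Hypothetical sample rows.
rows = [
    {"Gender": "Male", "Dept": "Sales", "Income": 100},
    {"Gender": "Male", "Dept": "Sales", "Income": 300},
    {"Gender": "Male", "Dept": "IT", "Income": 200},
    {"Gender": "Female", "Dept": "IT", "Income": 500},
    {"Gender": "Female", "Dept": "Sales", "Income": 400},
    {"Gender": "Female", "Dept": "Sales", "Income": 150},
]
max_vals = first_pass_max(rows, ["Gender", "Dept"], "Income", "MaxVal")
recalculated = second_pass_sum(max_vals, "Gender", "MaxVal", "SumOfMaxVal")
```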
Output & View data :
The OUTPUT page shows only the grouping
column and the column for recalculation.

The column "MaxVal" is recalculated as "SumOfMaxVal".
Execution Mode :

Change Apply Stage :

Definition : Takes the change data set, which contains the changes between the before and after
data sets, from the Change Capture stage and applies the encoded change operations to a
before data set to compute an after data set.

The Change Apply stage reads a record from the change data set and from the before data
set, compares their key column values, and acts accordingly.
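The relationship between Change Capture and Change Apply can be sketched as a round trip: capture the differences, then replay them on the before data set. The code below is a simplified model, assuming a single key column and using the strings "insert", "edit", and "delete" as stand-ins for the encoded change operations; the function names and sample rows are hypothetical.

```python
def capture_changes(before, after, key):
    """Model of Change Capture: describe how to get from before to after."""
    b = {r[key]: r for r in before}
    a = {r[key]: r for r in after}
    changes = []
    for k, row in a.items():
        if k not in b:
            changes.append(("insert", row))
        elif row != b[k]:
            changes.append(("edit", row))
    for k, row in b.items():
        if k not in a:
            changes.append(("delete", row))
    return changes

def apply_changes(before, changes, key):
    """Model of Change Apply: replay the changes on the before data set."""
    result = {r[key]: r for r in before}
    for op, row in changes:
        if op == "delete":
            result.pop(row[key], None)
        else:  # "insert" and "edit" both store the new version of the row
            result[row[key]] = row
    return sorted(result.values(), key=lambda r: r[key])

# Hypothetical before and after data sets keyed on "id".
before = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
after = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]
changes = capture_changes(before, after, "id")
rebuilt = apply_changes(before, changes, "id")
```

Unchanged rows produce no change record in this sketch, so they simply pass through from the before data set.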
Change Capture Property :

Change Apply Property :


Input (Before Changes) :

Output :

Input (After Changes) :


Job :
Compress Stage :

The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It
converts a data set from a sequence of records into a stream of raw binary data.
Steps :
Set the Stage Properties :
"Command = Compress"

Load the Metadata in the Output tab…


Example :

Limitations :
A compressed data set cannot be processed by many stages until it is expanded,
i.e., until its rows are returned to their normal format. Stages that do not perform column
based processing or reorder the rows can operate on compressed data sets. For example, you
can use the copy stage to create a copy of the compressed data set.
Expand Stage :

The Expand stage uses the UNIX compress or GZIP utility to expand the data set. It converts
a data set from a stream of raw binary data into a sequence of records.
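The Compress and Expand stages are inverses of each other, which can be illustrated with Python's standard `gzip` module. This is a stand-in only: the stages call the UNIX utilities, and the JSON-lines record encoding and the function names used here are assumptions for the example.

```python
import gzip
import json

def compress_records(records):
    """Turn a sequence of records into one raw binary blob."""
    text = "\n".join(json.dumps(r) for r in records)
    return gzip.compress(text.encode("utf-8"))

def expand_records(blob):
    """Turn the raw binary blob back into a sequence of records."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]

# Hypothetical records.
records = [{"EmpNo": 1, "Name": "Anil"}, {"EmpNo": 2, "Name": "Bina"}]
blob = compress_records(records)
restored = expand_records(blob)
```

While the data is in its compressed form it is just bytes, so nothing that needs to look at individual columns can work on it until it is expanded again.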
Steps :
Set the Stage Properties :
"Command = Uncompress"

Load the Metadata in the Output tab…


Example Job :
Filter Stage :

Definition : The Filter stage transfers, unmodified, the records of the input data set that
satisfy the specified requirements and filters out all other records.

The Filter stage can have a single input link, any number of output links and, optionally, a
single reject link. You can specify different requirements to route rows down different output
links. The filtered-out records can be routed to a reject link, if required.
Simple Job :
Criteria to Filter :

Note : The rejected rows are collected separately only if "Output Rejects=True"; otherwise,
rows that fail to satisfy the criteria are ignored.
Criteria : Salary>30000
Input :

Output : Reject Rows :


Complex Job :
Criteria to Filter :

Note : "Output Row Only Once=False" means that every input row is tested against each
and every criterion given. In other words, a row that satisfies one criterion is still tested against the
other criteria, so a row can be output more than once.
Input Data

Criteria 1 : Salary>30000 Criteria 2 : Number>1

Criteria 3 : Number=3 Reject Rows :


Criteria to Filter :

Note : "Output Row Only Once=True" means that an input row is not tested against
each and every criterion given. In other words, a row that satisfies one criterion is not tested
against the remaining criteria, so every row gets a single chance to be output.
Input Data

Criteria 1 : Salary>30000 Criteria 2 : Number>1

Criteria 3 : Number=3 Reject Rows :


Note : Though there is a row that
satisfies this criterion, it is not output
because it has already been output for
satisfying Criteria – 1.
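The difference between the two "Output Row Only Once" settings, and the role of "Output Rejects", can be sketched as follows. The function `filter_rows` and the sample rows are hypothetical; the three criteria mirror the ones used in the slides.

```python
def filter_rows(rows, criteria, output_row_only_once=False, output_rejects=True):
    """Route each row to the output link(s) whose criterion it satisfies."""
    outputs = [[] for _ in criteria]
    rejects = []
    for row in rows:
        matched = False
        for link, criterion in enumerate(criteria):
            if criterion(row):
                outputs[link].append(row)
                matched = True
                if output_row_only_once:
                    break  # stop at the first criterion that matches
        if not matched and output_rejects:
            rejects.append(row)
    return outputs, rejects

criteria = [
    lambda r: r["Salary"] > 30000,  # Criteria 1
    lambda r: r["Number"] > 1,      # Criteria 2
    lambda r: r["Number"] == 3,     # Criteria 3
]
# Hypothetical input rows.
rows = [
    {"Number": 1, "Salary": 40000},
    {"Number": 2, "Salary": 20000},
    {"Number": 3, "Salary": 50000},
    {"Number": 1, "Salary": 10000},
]
many_outputs, many_rejects = filter_rows(rows, criteria, output_row_only_once=False)
once_outputs, once_rejects = filter_rows(rows, criteria, output_row_only_once=True)
```

With the setting False, the row with Number 3 appears on all three links. With the setting True, it appears only on the first link, so the third link stays empty even though the row satisfies Criteria 3.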
Funnel Stage :

Definition : Funnel Stage copies multiple input data sets to a single output data set. This
operation is useful for combining separate data sets into a single large data set. The stage can
have any number of input links and a single output link.
Type – 1 : Continuous Funnel

Note : This type of Funnel combines the records of the input data in no guaranteed order. It
takes one record from each input link in turn. If data is not available on an input link, the stage
skips to the next link rather than waiting.
Continuous Funnel type …

Input - 1 view data : Input - 2 view data :

Output view data :


Type – 2 : Sequence

Note : This type of Funnel copies all records from the first input data set to the output data set,
then all the records from the second input data set, and so on.
Sequence type …

Input - 1 view data : Input - 2 view data :

Output view data :


Type – 3 : Sort Funnel

Note : This type of Funnel combines the input records in the order defined by the value(s) of one
or more key columns and the order of the output records is determined by these sorting keys.
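The three Funnel types can be compared side by side with a small Python model. It is a sketch under assumptions: the real Continuous Funnel gives no guaranteed order, whereas this model uses a deterministic round robin, and the Sort Funnel model assumes each input is already sorted on the key. Function names and sample rows are hypothetical.

```python
import heapq
from itertools import chain

def continuous_funnel(*inputs):
    """Take one record from each input in turn, skipping exhausted inputs."""
    iters = [iter(i) for i in inputs]
    out = []
    while iters:
        still_open = []
        for it in iters:
            try:
                out.append(next(it))
                still_open.append(it)
            except StopIteration:
                pass  # nothing left on this link; move on to the next one
        iters = still_open
    return out

def sequence_funnel(*inputs):
    """All records of the first input, then all of the second, and so on."""
    return list(chain(*inputs))

def sort_funnel(*inputs, key):
    """Merge inputs that are each already sorted on the key."""
    return list(heapq.merge(*inputs, key=key))

# Hypothetical inputs, each sorted on "id".
input_1 = [{"id": 1}, {"id": 4}, {"id": 7}]
input_2 = [{"id": 2}, {"id": 3}]
continuous = continuous_funnel(input_1, input_2)
sequence = sequence_funnel(input_1, input_2)
sorted_out = sort_funnel(input_1, input_2, key=lambda r: r["id"])
```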
Sort Funnel type …

Input - 1 view data : Input - 2 view data :

Output view data :


Job :
Join Stage :

Definition : Join Stage performs join operations on two or more data sets input to the stage
and then outputs the resulting data set.

The input data sets are notionally identified as the "left" set, the "right" set, and
"intermediate" sets. The stage can have two or more input links and a single output link.
Join Type = Full Outer

Job :
Left Input : Right Input :

Output :
Join Type = Inner

Job :
Left Input : Right Input :

Output :
Join Type = Left Outer

Job :
Left Input : Right Input :

Output :
Join Type = Right Outer

Job :
Left Input : Right Input :

Output :
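The four join types shown in this section differ only in what happens to rows without a partner. The sketch below models that for two inputs joined on one key column. It is an illustration, not the stage's algorithm: the function `join`, the sample rows, and the choice to leave unmatched columns absent (rather than filled with nulls or defaults) are assumptions.

```python
def join(left, right, key, how="inner"):
    """Join two lists of rows on key; how is inner, left, right, or full."""
    right_by_key = {}
    for r in right:
        right_by_key.setdefault(r[key], []).append(r)
    out = []
    matched_keys = set()
    for l in left:
        matches = right_by_key.get(l[key], [])
        if matches:
            matched_keys.add(l[key])
            out.extend({**l, **r} for r in matches)
        elif how in ("left", "full"):
            out.append(dict(l))  # unmatched left row is kept
    if how in ("right", "full"):
        # unmatched right rows are kept
        out.extend(dict(r) for r in right if r[key] not in matched_keys)
    return out

# Hypothetical inputs.
left = [{"id": 1, "name": "Anil"}, {"id": 2, "name": "Bina"}]
right = [{"id": 2, "dept": "Sales"}, {"id": 3, "dept": "IT"}]
inner = join(left, right, "id", "inner")
left_outer = join(left, right, "id", "left")
right_outer = join(left, right, "id", "right")
full_outer = join(left, right, "id", "full")
```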
Lookup Stage :

Definition : The Lookup stage is used to perform lookup operations on a data set read into memory
from any other Parallel job stage that can output data.

It can also perform lookups directly in a DB2 or Oracle database or in a lookup table contained
in a Lookup File Set stage.
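An in-memory lookup can be modelled as a dictionary probe per input row. In this sketch, rows with no matching reference row go to a reject list, which assumes the stage is configured to reject on lookup failure; the function name and sample data are hypothetical.

```python
def lookup(stream_rows, reference_rows, key):
    """Enrich each stream row from an in-memory reference table."""
    table = {r[key]: r for r in reference_rows}  # reference data held in memory
    output, rejects = [], []
    for row in stream_rows:
        ref = table.get(row[key])
        if ref is None:
            rejects.append(row)  # no reference row for this key
        else:
            output.append({**row, **ref})
    return output, rejects

# Hypothetical stream and reference data.
stream = [{"DeptNo": 10, "Name": "Anil"}, {"DeptNo": 40, "Name": "Bina"}]
reference = [{"DeptNo": 10, "DeptName": "Sales"}, {"DeptNo": 20, "DeptName": "IT"}]
looked_up, lookup_rejects = lookup(stream, reference, "DeptNo")
```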
Mappings :
Input : Output :

References : Reject :
Job :
Merge Stage :

Definition : The Merge stage combines a sorted master data set with one or more update data
sets. The columns from the records in the master and update data sets are merged so that the
output record contains all the columns from the master record plus any additional columns
from each update record.

A master record and an update record are merged only if both of them have the same values
for the merge key column(s) that you specify. Merge key columns are one or more columns
that exist in both the master and update records.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that
rows with the same key column values are located in the same partition and will be processed
by the same node.
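The merge rules above, together with the "Unmatched Masters Mode" option, can be sketched for a single update data set. This is a simplified model that ignores partitioning and sorting; the function `merge`, the parameter `unmatched_masters`, and the sample rows are hypothetical names chosen for the example.

```python
def merge(master, updates, key, unmatched_masters="keep"):
    """Merge update columns into master rows that share the same key."""
    updates_by_key = {u[key]: u for u in updates}
    used = set()
    output = []
    for m in master:
        u = updates_by_key.get(m[key])
        if u is not None:
            merged = dict(m)
            for col, val in u.items():
                merged.setdefault(col, val)  # add only the additional columns
            output.append(merged)
            used.add(m[key])
        elif unmatched_masters == "keep":
            output.append(dict(m))
        # with "drop", an unmatched master row is discarded
    rejects = [u for u in updates if u[key] not in used]
    return output, rejects

# Hypothetical master and update data sets, sorted on "EmpNo".
master = [{"EmpNo": 1, "Name": "Anil"}, {"EmpNo": 2, "Name": "Bina"}]
updates = [{"EmpNo": 2, "Dept": "Sales"}, {"EmpNo": 3, "Dept": "IT"}]
kept, kept_rejects = merge(master, updates, "EmpNo", unmatched_masters="keep")
dropped, dropped_rejects = merge(master, updates, "EmpNo", unmatched_masters="drop")
```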
Unmatched Masters Mode = Drop

Job :
Master : Output :

Updates : Reject :
Unmatched Masters Mode = Keep

Job :
Master : Output :

Updates : Reject :
Note : If the options "Warn on Reject
Updates = True" and "Warn on
Unmatched Masters = True" are set, the log
file shows warnings for rejected updates
and unmatched master records.
Note : If the options "Warn on Reject
Updates = False" and "Warn on
Unmatched Masters = False" are set, the log
file does not show warnings for rejected
updates and unmatched master
records.
Modify Stage :

Definition : The Modify stage alters the record schema of its input data set. The modified
data set is then output. It is a processing stage.

It can have a single input and single output.


Null Handling:

Job (before handling):


Step – 1:

"NULL" value
has to be
handled…

Null Handling…
Null
Handling

Step – 2:

Syntax : Column_Name=Handle_Null('Column_Name',Value)

Input Link Columns Output Link Columns

Null Handling…
Step – 3:

"NULL" value
has been
handled…

Null Handling…
Job (after execution):

Null Handling…
Drop Column(s):

Job (before execution):


Step – 1:

The column
"MGR" has to
be dropped…

Drop Columns …
Dropping
Column

Step – 2:

Syntax : DROP Column_Name

Input Link Columns Output Link Columns

Drop Columns …
Step – 3:

The column
"MGR" has
been dropped…

Drop Columns …
Job (after execution):

Drop Columns …
Keep Column(s):

Job (before execution):


Step – 1:

The column
"EmpNo" has to
be kept…

Keep Columns …
Keeping
Column

Step – 2:

Syntax : KEEP Column_Name

Input Link Columns Output Link Columns

Keep Columns …
Step – 3:

The column
"EmpNo" is
kept…

Keep Columns …
Job (after execution):

Keep Columns …
Type Conversion:

Job (before execution):


Step – 1:

The column
"HireDate" has to be
converted to Date…

Type Conversion …
Type
Conversion

Step – 2:

Syntax : Column_Name=type_conversion('Column_Name')

Input Link Columns Output Link Columns

Type Conversion …
Step – 3:

The column "HireDate"
has been converted to
Date…

Type Conversion …
Job (after execution):

Type Conversion …
Multiple Specifications:

Job (before execution):


Step – 1:

The column
"HireDate" has
to be converted to
Date…

The column
"MGR" has to be
Null handled…

Multiple Specification …
Null
Handling

Step – 2:

Type
Conversion

Input Link Columns Output Link Columns

Multiple Specification …
Step – 3:

The column
"HireDate" has
been converted to
Date…

The column
"MGR" has been
Null handled…

Multiple Specification …
Job (after execution):

Multiple Specification …
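The effect of combining several Modify specifications (null handling, dropping or keeping columns, and type conversion) can be pictured with the Python sketch below. It models the outcome only and does not reproduce the stage's specification syntax. The function `modify`, its parameters, the sample row, and the date format "%Y-%m-%d" are assumptions.

```python
from datetime import datetime

def modify(rows, null_defaults=None, drop=(), keep=None, conversions=None):
    """Apply null handling, drop/keep, and type conversion to every row."""
    null_defaults = null_defaults or {}
    conversions = conversions or {}
    out = []
    for row in rows:
        new_row = {}
        for col, val in row.items():
            if col in drop or (keep is not None and col not in keep):
                continue  # column is dropped (or not in the keep list)
            if val is None and col in null_defaults:
                val = null_defaults[col]  # null handling
            if val is not None and col in conversions:
                val = conversions[col](val)  # type conversion
            new_row[col] = val
        out.append(new_row)
    return out

# Hypothetical input row with a null MGR and a string HireDate.
rows = [{"EmpNo": 7369, "MGR": None, "HireDate": "1980-12-17"}]
to_date = lambda s: datetime.strptime(s, "%Y-%m-%d").date()

multiple = modify(rows, null_defaults={"MGR": 0}, conversions={"HireDate": to_date})
dropped_mgr = modify(rows, drop=("MGR",))
kept_empno = modify(rows, keep=("EmpNo",))
```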
Pivot Stage :

The Pivot stage converts columns into rows.

Eg., Mark-1 and Mark-2 are two columns.

Task : Convert all the columns into one column.

Implication : Can be used to convert an SCD from Type-3 to Type-2.

Method : In the derivation field of the output column, map the
input columns into one column.

Eg., Column Name – "Marks".

Derivation : Mark-1 and Mark-2.

Note : Column "Marks" is derived from the input columns Mark-1 and Mark-2.
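The pivot described above turns each input row into one output row per pivoted column. A minimal Python sketch follows, using the column names from the example; the function `pivot`, the "Name" column, and the sample values are hypothetical.

```python
def pivot(rows, pivot_cols, out_col):
    """Emit one output row per pivoted column of each input row."""
    out = []
    for row in rows:
        # columns that are not pivoted are repeated on every output row
        fixed = {c: v for c, v in row.items() if c not in pivot_cols}
        for col in pivot_cols:
            out.append({**fixed, out_col: row[col]})
    return out

# Hypothetical input rows.
rows = [
    {"Name": "Anil", "Mark-1": 70, "Mark-2": 85},
    {"Name": "Bina", "Mark-1": 60, "Mark-2": 90},
]
pivoted = pivot(rows, ["Mark-1", "Mark-2"], "Marks")
```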
Job (before execution):
Input metadata

Source File

Output Metadata:

Note the
change in the
derivation …
Job (after execution) :

Output file:
Remove Duplicates Stage:

Definition : The Remove Duplicates stage takes a single sorted data set as input, removes
all duplicate records, and writes the results to an output data set.

Removing duplicate records is a common way of cleansing a data set before you perform
further processing. Two records are considered duplicates if they are adjacent in the input
data set and have identical values for the key column(s).
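Because duplicates are defined as adjacent rows with identical key values, the logic maps naturally onto `itertools.groupby`. The sketch below assumes the input is already sorted on the key; the function name, the `retain` parameter, and the sample rows are hypothetical.

```python
from itertools import groupby

def remove_duplicates(sorted_rows, key_cols, retain="first"):
    """Keep one row from each run of adjacent rows with equal key values."""
    key_of = lambda r: tuple(r[c] for c in key_cols)
    out = []
    for _, run in groupby(sorted_rows, key=key_of):
        run = list(run)
        out.append(run[0] if retain == "first" else run[-1])
    return out

# Hypothetical input, sorted on "EmpNo".
rows = [
    {"EmpNo": 1, "Salary": 100},
    {"EmpNo": 1, "Salary": 200},
    {"EmpNo": 2, "Salary": 300},
]
keep_first = remove_duplicates(rows, ["EmpNo"], retain="first")
keep_last = remove_duplicates(rows, ["EmpNo"], retain="last")
```

If the input were not sorted, equal keys that are not adjacent would survive as separate rows, which is why the stage expects sorted input.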
Selecting Key & Retain Row : Sorting Column :

Input view data : Output view data :

Last duplicate row dropped…
Selecting Key & Retain Row :

Input view data : Output view data :

First duplicate row dropped…
Job (after execution) :
Surrogate Key Generator Stage :

Definition : The Surrogate Key stage generates key columns for an existing data set.

The user can specify certain characteristics of the key sequence. The stage generates sequentially
incrementing unique integers from a given starting point. The existing columns of the data set
are passed straight through the stage.

If the stage is operating in parallel, each node will increment the key by the number of
partitions being written to.
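Incrementing by the number of partitions is what keeps keys unique across nodes without any coordination. The sketch below assumes that partition p of N starts at start + p and then steps by N; the function name, the key column name "SKey", and the sample partitions are hypothetical.

```python
from itertools import count

def add_surrogate_keys(partitions, start=1, key_col="SKey"):
    """Give every row a unique integer key, one independent counter per partition."""
    n = len(partitions)
    keyed = []
    for p, rows in enumerate(partitions):
        keys = count(start + p, n)  # start+p, start+p+n, start+p+2n, ...
        # existing columns pass straight through; only the key column is added
        keyed.append([{**row, key_col: next(keys)} for row in rows])
    return keyed

# Hypothetical data already split over three partitions.
partitions = [
    [{"Name": "a"}, {"Name": "b"}],
    [{"Name": "c"}],
    [{"Name": "d"}, {"Name": "e"}, {"Name": "f"}],
]
keyed = add_surrogate_keys(partitions)
all_keys = [row["SKey"] for part in keyed for row in part]
```

The keys are unique but not necessarily contiguous: a partition with more rows than the others leaves gaps in the overall sequence.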
Job (before execution):
Input :

Property :

Output :
Job (after execution):
Switch Stage :

Definition : The switch stage takes a single data set as input and assigns each input row to
an output data set based on the value of a selector field.

It can have a single input link, up to 128 output links and a single rejects link. This stage
performs an operation similar to a C switch statement. Rows that satisfy none of the cases are
output on the rejects link.
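The comparison with a C switch statement can be made concrete with a mapping from selector values to output links. This is a sketch only; the function `switch`, the selector column "DeptNo", the case values, and the sample rows are assumptions.

```python
def switch(rows, selector, cases):
    """Send each row to the output link mapped to its selector value."""
    outputs = {link: [] for link in cases.values()}
    rejects = []
    for row in rows:
        link = cases.get(row[selector])
        if link is None:
            rejects.append(row)  # no case matches this selector value
        else:
            outputs[link].append(row)
    return outputs, rejects

# Hypothetical case mapping: selector value -> output link number.
cases = {10: 1, 20: 2, 30: 3}
rows = [
    {"EmpNo": 1, "DeptNo": 10},
    {"EmpNo": 2, "DeptNo": 30},
    {"EmpNo": 3, "DeptNo": 40},
    {"EmpNo": 4, "DeptNo": 10},
]
switched, switch_rejects = switch(rows, "DeptNo", cases)
```

Unlike the Filter stage, each row here is tested against a single selector value, so it lands on at most one output link.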
Property :
Output - 1 :

Input :

Output - 2 :

Reject :

Output - 3 :
Job (after execution):
Property :
Output - 1 :

Input : Output - 2 :

Output - 3 :
Job (after execution):
Property :
Output - 1 :

Input :

Output - 2 :

Reject :

Output - 3 :
Job (after execution):
