Pipeline Partitioning Overview Informatica
You create a session for each mapping you want the Informatica Server to run. Every mapping
contains one or more source pipelines. A source pipeline consists of a source qualifier and all the
transformations and targets that receive data from that source qualifier.
If you use PowerCenter, you can specify partitioning information for each source pipeline in a mapping.
If you use PowerMart, you must accept the default partitioning information. The partitioning
information for a pipeline controls the following factors:
• The number of reader, transformation, and writer threads that the master thread creates for
the pipeline. For more information, see Understanding Processing Threads.
• How the Informatica Server reads data from the source, including the number of connections
to the source.
• How the Informatica Server distributes rows of data to each transformation as it processes the
pipeline.
• How the Informatica Server writes data to the target, including the number of connections to
each target in the pipeline.
You can specify partitioning information for a pipeline by setting the following attributes:
• Location of partition points. Partition points mark the thread boundaries in a pipeline and
divide the pipeline into stages. The Informatica Server sets partition points at several
transformations in a pipeline by default. If you use PowerCenter, you can define other partition
points. When you add partition points, you increase the number of transformation threads,
which can improve session performance. The Informatica Server can redistribute rows of data
at partition points, which can also improve session performance. For more information on
partition points, see Partition Points.
• Number of partitions. A partition is a pipeline stage that executes in a single thread. If you
use PowerCenter, you can set the number of partitions at any partition point. If you use
PowerMart, the Informatica Server defines one partition for the pipeline. When you add
partitions, you increase the number of processing threads, which can improve session
performance. For more information, see Number of Partitions.
• Partition types. The Informatica Server specifies a default partition type at each partition
point. If you use PowerCenter, you can change the partition type. The partition type controls
how the Informatica Server redistributes data among partitions at partition points. For more
information, see Partition Types.
Partition Points
By default, the Informatica Server sets partition points at various transformations in the pipeline.
Partition points mark thread boundaries as well as divide the pipeline into stages. A stage is a section
of a pipeline between any two partition points. When you set a partition point at a transformation, the
new pipeline stage includes that transformation.
Table 10-1 lists the partition points that the Workflow Manager creates by default:
Table 10-1. Default Partition Points
Partition Point               Default Partition Type   Description
Source Qualifier or           Pass-through             Controls how the Informatica Server reads data from
Normalizer transformation                              the source and passes data into the source qualifier.
Rank and unsorted             Hash auto-keys           Ensures that the Informatica Server groups rows
Aggregator transformations                             properly before it sends them to the transformation.
The mapping in Figure 10-1 contains four stages. The partition point at the source qualifier marks the
boundary between the first (reader) and second (transformation) stages. The partition point at the
Aggregator transformation marks the boundary between the second and third (transformation) stages.
The partition point at the target instance marks the boundary between the third (transformation) and
fourth (writer) stages.
When you add a partition point, you increase the number of pipeline stages by one. Similarly, when
you delete a partition point, you reduce the number of stages by one. For more information, see
Understanding Processing Threads.
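The relationship between partition points and stages can be illustrated with a minimal Python sketch.
This is not Informatica code. The pipeline step names and the split_into_stages helper are
hypothetical, and the "source" entry is an assumption that stands in for the reader side of the
pipeline ahead of the source qualifier. Each partition point starts a new stage that includes the
transformation at which the point is set.

```python
def split_into_stages(pipeline, partition_points):
    """Split an ordered list of pipeline steps into stages.

    Each step that is a partition point starts a new stage, and that
    stage includes the step itself.
    """
    stages = [[]]
    for step in pipeline:
        # Start a new stage at a partition point, unless the current stage is still empty.
        if step in partition_points and stages[-1]:
            stages.append([])
        stages[-1].append(step)
    return stages


# Hypothetical pipeline with the three partition points described for Figure 10-1.
pipeline = ["source", "source_qualifier", "expression", "aggregator", "target"]
partition_points = {"source_qualifier", "aggregator", "target"}

stages = split_into_stages(pipeline, partition_points)
for number, stage in enumerate(stages, start=1):
    print(number, stage)

# Adding one more partition point adds exactly one more stage.
more_stages = split_into_stages(pipeline, partition_points | {"expression"})
```

With three partition points the sketch yields four stages, and adding a fourth point at the
hypothetical expression step yields five.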
Besides marking stage boundaries, partition points also mark the points in the pipeline where the
Informatica Server can redistribute data across partitions. For example, if you place a partition point
at a Filter transformation and define multiple partitions, the Informatica Server can redistribute rows
of data among the partitions before the Filter transformation processes the data. The partition type
you set at this partition point controls the way in which the Informatica Server passes rows of data to
each partition. For more information, see Partition Types.
For more information on adding and deleting partition points, see Adding and Deleting Partition Points.
Number of Partitions
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. By
default, the Informatica Server defines a single partition in the source pipeline. If you use
PowerCenter, you can increase the number of partitions. This increases the number of processing
threads, which can improve session performance.
For example, suppose you need to use the mapping in Figure 10-1 to extract data from three flat files of
various sizes. To do this, you define three partitions at the source qualifier to read the data
simultaneously. When you do this, the Workflow Manager defines three partitions in the pipeline.
Figure 10-2 shows the threads that the master thread creates for this mapping:
By default, the Informatica Server sets the number of partitions to one. If you use PowerMart, you
cannot change the number of partitions. If you use PowerCenter, you can generally define up to 16
partitions at any partition point. However, there are situations in which you can define only one
partition in the pipeline. For more information, see Restrictions on the Number of Partitions.
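As a rough illustration of how partitions multiply threads, the following Python sketch counts
processing threads under the assumption that every stage runs one thread per partition, which is
what the definition of a partition above suggests. The thread_counts function and MAX_PARTITIONS
constant are hypothetical names, the sketch assumes one reader stage and one writer stage, and it
does not count the master thread.

```python
MAX_PARTITIONS = 16  # general upper limit stated in the text


def thread_counts(transformation_stages, partitions):
    """Estimate processing threads, assuming one thread per stage per partition."""
    if not 1 <= partitions <= MAX_PARTITIONS:
        raise ValueError(f"partitions must be between 1 and {MAX_PARTITIONS}")
    return {
        "reader": partitions,                                  # one reader stage
        "transformation": transformation_stages * partitions,  # one thread per stage
        "writer": partitions,                                  # one writer stage
    }


# Four-stage mapping (two transformation stages) with three partitions.
counts = thread_counts(transformation_stages=2, partitions=3)
total = sum(counts.values())
print(counts, total)
```

Under these assumptions, the four-stage mapping with three partitions uses 12 processing threads,
compared with 4 for the default single partition.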
Note: Increasing the number of partitions or partition points increases the number of threads.
Therefore, increasing the number of partitions or partition points also increases the load on the server
machine. If the server machine contains ample CPU bandwidth, processing rows of data in a session
concurrently can increase session performance. However, if you create a large number of partitions or
partition points in a session that processes large amounts of data, you can overload the system.
For more information on adding and deleting partitions, see Adding and Deleting Partitions.
Partition Types
When you configure the partitioning information for a pipeline, you must specify a partition type at
each partition point in the pipeline. The partition type determines how the Informatica Server
redistributes data among partitions at that partition point.
The Workflow Manager allows you to specify the following partition types:
• Round-robin partitioning. The Informatica Server distributes data evenly among all
partitions. Use round-robin partitioning where you want each partition to process
approximately the same number of rows. For more information, see Round-Robin Partitioning.
• Hash partitioning. The Informatica Server applies a hash function to a partition key to group
data among partitions. If you select hash auto-keys, the Informatica Server uses all grouped
or sorted ports as the partition key. If you select hash user keys, you specify a number of
ports to form the partition key. Use hash partitioning where you want to ensure that the
Informatica Server processes groups of rows with the same partition key in the same partition.
For more information, see Hash Partitioning.
• Key range partitioning. You specify one or more ports to form a compound partition key.
The Informatica Server passes data to each partition depending on the ranges you specify for
each port. Use key range partitioning where the sources or targets in the pipeline are
partitioned by key range. For more information, see Key Range Partitioning.
• Pass-through partitioning. The Informatica Server passes all rows at one partition point to
the next partition point without redistributing them. Choose pass-through partitioning where
you want to create an additional pipeline stage to improve performance, but do not want to
change the distribution of data across partitions. For more information, see Pass-through
Partitioning.
You can specify different partition types at different points in the pipeline.
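The four partition types can be sketched as simple routing functions. The Python below is an
illustration of the general techniques only, not the Informatica Server's implementation: the
function names, the use of CRC32 as the hash function, the half-open range boundaries, and the
sample rows are all assumptions.

```python
import zlib
from bisect import bisect_right


def round_robin(rows, n):
    """Deal rows out in turn so each partition gets about the same number."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send rows with the same partition key to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        digest = zlib.crc32(repr(key(row)).encode())  # deterministic hash (assumed)
        parts[digest % n].append(row)
    return parts


def key_range(rows, boundaries, key):
    """Route by range: partition i receives boundaries[i-1] <= key < boundaries[i]."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, key(row))].append(row)
    return parts


def pass_through(parts):
    """Leave the existing distribution of rows untouched."""
    return parts


# Hypothetical item rows.
rows = [{"id": i, "desc": d} for i, d in
        enumerate(["bolt", "nut", "bolt", "cog", "nut", "bolt", "gear"], start=1)]

rr = round_robin(rows, 3)
hp = hash_partition(rows, 3, key=lambda r: r["desc"])
kr = key_range(rows, boundaries=[3, 6], key=lambda r: r["id"])
pt = pass_through(rr)
```

In this sketch, round-robin keeps partition sizes within one row of each other, hash partitioning
keeps every row with a given description in a single partition, and key range sends ids 1-2, 3-5,
and 6-7 to the first, second, and third partitions.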
The mapping in Figure 10-3 reads data about items and calculates average wholesale costs and prices.
The mapping must read item information from three flat files of various sizes, and then filter out
discontinued items. It sorts the active items by description, calculates the average prices and
wholesale costs, and writes the results to a relational database in which the target tables are
partitioned by key range.
When you use this mapping in a session, you can increase session performance by specifying different
partition types at the following partition points in the pipeline:
• Source qualifier. To read data from the three flat files concurrently, you must specify three
partitions at the source qualifier. Accept the default partition type, pass-through.
• Filter transformation. Since the source files vary in size, each partition processes a different
amount of data. Set a partition point at the Filter transformation, and choose round-robin
partitioning to balance the load going into the Filter transformation.
• Sorter transformation. To eliminate overlapping groups in the Sorter and Aggregator
transformations, use hash auto-keys partitioning at the Sorter transformation. This causes the
Informatica Server to group all items with the same description into the same partition before
the Sorter and Aggregator transformations process the rows. You can delete the default
partition point at the Aggregator transformation.
• Target. Since the target tables are partitioned by key range, specify key range partitioning at
the target to optimize writing data to the target.
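The effect of combining these partition types can be simulated in a few lines of Python. This is a
sketch under stated assumptions, not a model of the Informatica Server: the item rows are invented
sample data, CRC32 stands in for the hash function, and the key range step at the target is left
out. It shows round-robin balancing the uneven input ahead of the filter, and hashing on description
ensuring that each group is aggregated in exactly one partition.

```python
import zlib
from collections import defaultdict

# Hypothetical rows (description, price, discontinued) from three files of unequal size.
files = [
    [("bolt", 1.0, False), ("nut", 0.5, False), ("bolt", 3.0, False),
     ("gear", 9.0, True), ("nut", 1.5, False), ("cog", 4.0, False)],
    [("bolt", 2.0, False), ("cog", 6.0, False)],
    [("nut", 1.0, False)],
]
n = len(files)  # pass-through at the source qualifier: one partition per file

# Round-robin at the Filter: rebalance the rows, then drop discontinued items.
stream = [row for part in files for row in part]
balanced = [stream[i::n] for i in range(n)]
active = [[row for row in part if not row[2]] for part in balanced]

# Hash on description at the Sorter: rows with the same description share a partition.
hashed = [[] for _ in range(n)]
for part in active:
    for row in part:
        hashed[zlib.crc32(row[0].encode()) % n].append(row)

# Sort and aggregate inside each partition; no description may span two partitions.
averages = {}
for part in hashed:
    groups = defaultdict(list)
    for desc, price, _ in sorted(part):
        groups[desc].append(price)
    for desc, prices in groups.items():
        assert desc not in averages, "group overlaps two partitions"
        averages[desc] = sum(prices) / len(prices)

print([len(p) for p in files], [len(p) for p in balanced], averages)
```

Without the hash step, rows for one description could land in several partitions, and each
partition would compute a partial average for the same group.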