Data Warehouse Interview Questions and Answers

Data warehouse

A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of
data in support of management's decision-making process.
Subject oriented: A data warehouse has a defined scope and stores only data that falls within
that scope. For example, if the sales team of your company is creating a data warehouse,
it is expected to contain data related to sales (and not, for example, data
related to production management).
Non-volatile: Data, once stored in the data warehouse, is not normally updated or deleted;
it is retained so that history can be analyzed.
Integrated: Data from different sources is brought into a consistent format, so that facts and
figures relate to each other and present a single version of the truth.
Time variant: Data in the warehouse is associated with a point in time. As new data is loaded,
history is retained rather than overwritten, so the warehouse grows and changes can be analyzed over time.

Data mining
Data mining is sorting through data to identify patterns and establish relationships.
Data mining parameters include:

Association - looking for patterns where one event is connected to another event
Sequence or path analysis - looking for patterns where one event leads to another later event
Classification - looking for new patterns (may result in a change in the way the data is organized)
Clustering - finding and visually documenting groups of facts not previously known
Forecasting - discovering patterns in data that can lead to reasonable predictions about the
future (this area of data mining is known as predictive analytics)

Data mining techniques are used in many research areas, including mathematics, cybernetics,
genetics and marketing. Web mining, a type of data mining used in customer relationship
management (CRM), takes advantage of the huge amount of information gathered by a Web site
to look for patterns in user behavior.
Datamart
Data marts are generally designed for a single subject area.
Dimension

junk dimensions - a collection of miscellaneous attributes that are unrelated to any particular dimension.
degenerate dimensions - data that is dimensional in nature but stored in a fact table.
role playing dimensions - a dimension that can play different roles in a fact table depending on the context.
conformed dimensions - a dimension that has exactly the same meaning and content when being referred to from different fact tables.

conformed dimension
A conformed dimension is a dimension that has the same meaning to every fact with which it
relates. Conformed dimensions allow facts and measures to be categorized and described in the
same way across multiple facts and/or data marts, ensuring consistent reporting across the
enterprise.
A conformed dimension can exist as a single dimension table that relates to multiple fact tables
within the same data warehouse, or as identical dimension tables in separate data marts. Date is a
common conformed dimension because its attributes (day, week, month, quarter, year, etc.) have
the same meaning when joined to any fact table. A conformed product dimension with product
name, description, SKU, and other common attributes could exist in multiple data marts, each
containing data for one store in a chain.
star schema
A star schema is the simplest form of a dimensional model, in which data is organized into facts
and dimensions. A fact is an event that is counted or measured, such as a sale or login. A
dimension contains reference information about the fact, such as date, product, or customer. A star
schema is diagrammed by surrounding each fact with its associated dimensions. The resulting
diagram resembles a star.
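The fact-and-dimension layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustrative sketch: the table names (fact_sales, dim_date, dim_product), the columns and the sample rows are assumptions made up for the example, not part of any particular warehouse.

```python
import sqlite3

# In-memory database holding a tiny, hypothetical star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
-- The fact table sits in the middle and points at each dimension.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount REAL
);
INSERT INTO dim_date VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO dim_product VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO fact_sales VALUES (1, 10, 2, 20.0), (1, 20, 1, 15.0), (2, 10, 3, 30.0);
""")


def sales_by_product(connection):
    """Join the fact table to one dimension and total the measure."""
    return connection.execute("""
        SELECT p.product_name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY p.product_name
        ORDER BY p.product_name
    """).fetchall()


print(sales_by_product(conn))
```

Each report query needs only one join per dimension it uses, which is the main practical appeal of the star layout.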
What is Business Intelligence?
Business Intelligence is the art and science of presenting historical data
in a meaningful way (often by using different data visualization techniques).
Raw data stored in databases is turned into valuable information through
Business Intelligence processes.
-Describe the advantages of the CIF architecture versus the bus architecture with conformed
dimensions. Which would fit best in our environment given [some parameters they give you], and
why?
-Describe snowflaking.
-Describe factless fact tables.
-Draw a star schema of our business.
-Describe common optimization techniques applied at the data model level.
-How do you handle data rejects in a warehouse architecture?
-Describe common techniques for loading from the staging area to the warehouse when you only
have a small window.
-How do you load type 1 dimensions?
-How do you load type 2 dimensions, and how would you load them given our [insert business
particularity]?
-How would you model unbalanced hierarchies?
-How would you model cyclic relations?
-What major elements would you include in an audit model?
-How would you implement traceability?
http://informaticaramamohanreddy.blogspot.com/2012/08/final-interview-questions-etl.html

FINAL INTERVIEW QUESTIONS ( ETL - INFORMATICA)

Data warehousing Basics

Definition of data warehousing?


Data warehouse is a Subject oriented, Integrated, Time variant, Non volatile collection of
data in support of management's decision making process.

Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about
your company's sales data, you can build a warehouse that concentrates on sales. Using this
warehouse, you can answer questions like "Who was our best customer for this item last year?"
This ability to define a data warehouse by subject matter, sales in this case, makes the data
warehouse subject oriented.

Integrated
Integration is closely related to subject orientation. Data warehouses must put data from
disparate sources into a consistent format. They must resolve such problems as naming
conflicts and inconsistencies among units of measure. When they achieve this, they are said to be
integrated.

Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This
is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very
much in contrast to online transaction processing (OLTP) systems, where performance
requirements demand that historical data be moved to an archive. A data warehouse's focus on
change over time is what is meant by the term time variant.
How many stages are there in data warehousing?
Data warehousing generally includes two stages:
ETL
Report Generation
ETL
Short for extract, transform, load: three database functions that are combined into one tool.
Extract -- the process of reading data from a source database.
Transform -- the process of converting the extracted data from its previous form into the required
form.
Load -- the process of writing the data into the target database.
ETL is used to migrate data from one database to another, to form data marts and data
warehouses, and to convert databases from one format to another.
It retrieves data from various operational databases, transforms it into useful
information, and finally loads it into the data warehousing system.
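The three steps can be sketched as three small Python functions. This is a minimal illustration using sqlite3 in-memory databases; the orders and orders_clean tables, their columns and the sample rows are hypothetical, and a real ETL tool adds logging, error handling and scheduling on top of this.

```python
import sqlite3


def extract(source_conn):
    """Extract: read rows from the (hypothetical) source table."""
    return source_conn.execute(
        "SELECT id, name, amount_cents FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: tidy the names and convert cents to currency units."""
    return [
        (row_id, name.strip().title(), cents / 100.0)
        for row_id, name, cents in rows
    ]


def load(target_conn, rows):
    """Load: write the transformed rows into the target table."""
    target_conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", rows)
    target_conn.commit()


# Source system with slightly messy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 1250), (2, "BOB", 300)],
)

# Target warehouse table.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_clean (id INTEGER, customer TEXT, amount REAL)")

load(target, transform(extract(source)))
loaded = target.execute("SELECT * FROM orders_clean ORDER BY id").fetchall()
print(loaded)
```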

1. INFORMATICA
2. AB INITIO
3. DATASTAGE
4. BODI
5. ORACLE WAREHOUSE BUILDER

Report generation
In report generation, OLAP (online analytical processing) is used. It is a set of specifications
that allows client applications to retrieve data for analytical processing.
An OLAP tool is a specialized tool that sits between a database and the user in order to provide
various analyses of the data stored in the database.
It is a reporting tool that generates reports useful for decision support by
top-level management.
Business Objects
Cognos
MicroStrategy
Hyperion
Oracle Express
Microsoft Analysis Services

Difference Between OLTP and OLAP

No. | OLTP | OLAP
1 | Application oriented (e.g., a purchase order is a function of an application) | Subject oriented (subjects such as customer, product, item, time)
2 | Used to run the business | Used to analyze the business
3 | Detailed data | Summarized data
4 | Repetitive access | Ad hoc access
5 | Few records accessed at a time (tens), simple query | Large volumes accessed at a time (millions), complex query
6 | Small database | Large database
7 | Current data | Historical data
8 | Clerical user | Knowledge user
9 | Row-by-row loading | Bulk loading
10 | Time invariant | Time variant
11 | Normalized data | De-normalized data
12 | E-R schema | Star schema

What are the types of data warehousing?

EDW (Enterprise data warehouse)
It provides a central database for decision support throughout the enterprise
It is a collection of data marts
Data mart
It is a subset of the data warehouse
It is a subject-oriented database that supports the needs of individual departments in an
organization
It is called a high-performance query structure
It supports a particular line of business, such as sales or marketing
ODS (Operational data store)
It is defined as an integrated view of operational databases designed to support operational
monitoring
It is a collection of operational data sources designed to support transaction processing
Data is refreshed in near real time and used for business activity
It is an intermediate layer between OLTP and OLAP that helps to create instant
reports

What are the types of Approach in DWH?


Bottom up approach: first we develop the data marts, then we integrate these data marts into
the EDW
Top down approach: first we develop the EDW, then from that EDW we develop the data marts
Bottom up: OLTP -> ETL -> Data mart -> ETL -> DWH -> OLAP
Top down: OLTP -> DWH -> Data mart -> OLAP


Top down
Cost of initial planning and design is high
Takes longer to implement, typically more than a year
Bottom up
Planning and designing the data marts proceeds without waiting for the global warehouse design
Immediate results from the data marts
Tends to take less time to implement
Errors in critical modules are detected earlier
Benefits are realized in the early phases
It is considered the best approach
Data Modeling Types:
Conceptual Data Modeling
Logical Data Modeling
Physical Data Modeling
Dimensional Data Modeling

1. Conceptual Data Modeling
A conceptual data model includes all major entities and relationships, does not contain
much detail about attributes, and is often used in the INITIAL
PLANNING PHASE.
A conceptual data model is created by gathering business requirements from various sources
such as business documents and discussions with functional teams, business analysts, subject
matter experts and the end users who do the reporting on the database. Data modelers
create the conceptual data model and forward it to the functional team for review.
Conceptual data modeling gives the functional and technical teams an idea
of how business requirements will be projected in the logical data model.
2. Logical Data Modeling
This is the actual implementation and extension of a conceptual data model.
A logical data model includes all required entities, attributes, key groups, and relationships
that represent business information and define business rules.
3. Physical Data Modeling
A physical data model includes all required tables, columns, relationships and database
properties for the physical implementation of databases. Database performance, indexing
strategy, physical storage and denormalization are important parameters of a physical
model.

Logical vs. Physical Data Modeling

Logical Data Model | Physical Data Model
Represents business information and defines business rules | Represents the physical implementation of the model in a database
Entity | Table
Attribute | Column
Primary Key | Primary Key Constraint
Alternate Key | Unique Constraint or Unique Index
Inversion Key Entry | Non Unique Index
Rule | Check Constraint, Default Value
Relationship | Foreign Key
Definition | Comment

Dimensional Data Modeling

A dimensional model consists of fact and dimension tables
It is an approach to developing the schema for database designs
Types of dimensional modeling
Star schema
Snowflake schema
Star flake schema (or) Hybrid schema
Multi star schema

What is Star Schema?

A star schema is a logical database design that contains a centrally located fact table
surrounded by one or more dimension tables
Because the database design looks like a star, it is called a star schema
A dimension table contains a primary key and textual descriptions
It contains de-normalized business information
A fact table contains a composite key and measures
The measures are key performance indicators, which are used to evaluate
enterprise performance in terms of success and failure
Eg: Total revenue, product sales, discount given, number of customers
To generate a meaningful report, the report should draw on at least one dimension table and one
fact table
Advantages of star schema
Fewer joins
Improved query performance
Slicing down
Easy understanding of data
Disadvantage
Requires more storage space
Snowflake Schema
A snowflake schema is a star schema in which the dimension tables are split into one or more
further dimension tables
The de-normalized dimension tables are split into normalized dimension tables
Example of Snowflake Schema:
In the snowflake schema example diagram, there are 4 dimension tables, 4 lookup
tables and 1 fact table. The reason is that the hierarchies (category, branch, state, and month)
are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and
TIME) respectively, into separate lookup tables.
It increases the number of joins, which degrades data retrieval performance.
Some organizations normalize the dimension tables to save space.
Because dimension tables take up relatively little space, the snowflake schema approach may be avoided.
Bitmap indexes cannot be effectively utilized
Important aspects of Star Schema & Snowflake Schema
In a star schema, every dimension has a primary key.
In a star schema, a dimension table does not have any parent table,
whereas in a snowflake schema a dimension table has one or more parent tables.
In a star schema, hierarchies for the dimensions are stored in the dimension table itself,
whereas in a snowflake schema hierarchies are broken into separate tables. These
hierarchies help to drill down the data from the topmost level to the lowermost
level.
Star flake schema (or) Hybrid Schema
A hybrid schema is a combination of the star and snowflake schemas
Multi Star schema
Multiple fact tables sharing a set of dimension tables
Conformed dimensions are reusable dimensions:
dimensions that are used multiple times or in multiple data marts
and are common across those data marts.

Measure Types (or) Types of Facts


Additive - Measures that can be summed up across all dimensions.
Ex: Sales Revenue
Semi Additive - Measures that can be summed up across some dimensions but not others.
Ex: Current Balance
Non Additive - Measures that cannot be summed up across any of the dimensions.
Ex: Student attendance
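A small Python sketch can make the three measure types concrete. All figures below are invented for illustration, and reading the attendance example as an attendance rate (present / enrolled) is an assumption about what the original example means.

```python
# Additive: sales revenue can be summed across every dimension (day, store).
sales = [
    ("2024-01-01", "north", 10.0),
    ("2024-01-01", "south", 5.0),
    ("2024-01-02", "north", 7.5),
]
total_revenue = sum(amount for _, _, amount in sales)

# Semi-additive: a balance can be summed across accounts on one day,
# but summing the same account's balance across days is meaningless.
balances = [
    ("2024-01-01", "A", 100),
    ("2024-01-01", "B", 50),
    ("2024-01-02", "A", 120),
    ("2024-01-02", "B", 70),
]


def balance_on(rows, day):
    """Total balance across all accounts for a single day."""
    return sum(balance for d, _, balance in rows if d == day)


closing_balance = balance_on(balances, max(d for d, _, _ in balances))
naive_sum = sum(balance for _, _, balance in balances)  # not a real balance

# Non-additive: a rate cannot be summed; recompute it from its components.
attendance = [(18, 20), (9, 30)]  # (present, enrolled) per class
overall_rate = sum(p for p, _ in attendance) / sum(e for _, e in attendance)

print(total_revenue, closing_balance, naive_sum, overall_rate)
```

The naive sum over all balance rows is larger than any balance that ever existed, which is why semi-additive measures are usually reported as of one point in time.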
Surrogate Key
Joins between fact and dimension tables should be based on surrogate keys
Users should not obtain any information by looking at these keys
These keys should be simple integers
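As a rough sketch of the idea, a surrogate key can be a plain counter handed out the first time a natural (business) key is seen. The class name SurrogateKeyGenerator and the sample customer codes are hypothetical; in practice a database sequence or an ETL tool's sequence generator usually does this job.

```python
class SurrogateKeyGenerator:
    """Hands out simple integer keys that carry no business meaning."""

    def __init__(self):
        self._keys = {}   # natural key -> surrogate key
        self._next = 1

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or assign the next integer.
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]


generator = SurrogateKeyGenerator()
first = generator.key_for("CUST-0042")
second = generator.key_for("CUST-0007")
repeat = generator.key_for("CUST-0042")
print(first, second, repeat)
```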

A sample data warehouse schema


Why do we need a staging area for a DWH?

The staging area is used to clean operational data before loading it into the data warehouse.
Cleaning here includes merging data that comes from different sources.
It is the area where most of the ETL work is done.
Data Cleansing
It is used to remove duplicates
It is used to correct wrong email addresses
It is used to identify missing data
It is used to convert data types
It is used to capitalize names and addresses
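The cleansing steps listed above could look roughly like this in Python. This is a simplified sketch: the record layout (name, email, age) is made up, the email pattern is deliberately loose, and invalid emails are flagged as missing rather than corrected, since automatic correction needs rules that depend on the data.

```python
import re

# Deliberately loose pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def cleanse(records):
    """Capitalize names, validate emails, convert types, drop duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip().title()
        email = (rec.get("email") or "").strip().lower()
        if not EMAIL_PATTERN.match(email):
            email = None  # identify missing or invalid data
        try:
            age = int(rec.get("age"))  # convert the data type
        except (TypeError, ValueError):
            age = None
        key = (name, email)
        if key in seen:  # remove duplicates
            continue
        seen.add(key)
        cleaned.append({"name": name, "email": email, "age": age})
    return cleaned


raw = [
    {"name": "john smith", "email": "John@Example.com", "age": "34"},
    {"name": "JOHN SMITH ", "email": "john@example.com", "age": "34"},
    {"name": "mary jones", "email": "mary(at)example.com", "age": ""},
]
cleaned = cleanse(raw)
print(cleaned)
```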

Types of Dimensions:
Conformed Dimensions
Junk Dimensions (Garbage Dimensions)
Degenerate Dimensions
Slowly Changing Dimensions
A conformed dimension is one that can be shared by multiple fact tables or multiple data marts.
A junk dimension is a grouping of flag values.
A degenerate dimension is something dimensional in nature that exists in the fact table (for
example, an invoice number). It is neither a fact nor strictly a dimension attribute, but it is
useful for some kinds of analysis. Such values are kept as attributes in the fact table and are
called degenerate dimensions.
For example, suppose we have a fact table with customer_id, product_id, branch_id, employee_id, bill_no, and
date in the key section and price, quantity, and amount in the measure section. In this fact table, bill_no in the
key section is a single value; it has no associated dimension table. Instead of creating a separate
dimension table for that single value, we can include it in the fact table to improve performance. So
here the column bill_no is a degenerate dimension, or line item dimension.

Informatica Architecture

The PowerCenter domain

It is the primary unit of administration.
An installation can have a single domain or multiple domains.
It is a collection of nodes and services.
Nodes
A node is the logical representation of a machine in a domain.
One node in the domain acts as a gateway node to receive service requests from clients and route
them to the appropriate service and node.
Integration Service:
The Integration Service does the actual data movement. It extracts data from sources, processes it
according to the business logic and loads data to targets.
Repository Service:
The Repository Service fetches data from the repository and sends it back to the
requesting components (mostly client tools and the Integration Service).
PowerCenter Repository:
The repository is a relational database that stores all the metadata created in
PowerCenter.
PowerCenter Client Tools:
The PowerCenter Client consists of multiple tools.
PowerCenter Administration Console:
This is a web-based administration tool you can use to administer the PowerCenter
installation.

Q. How can you define a transformation? What are different types of transformations
available in Informatica?
A. A transformation is a repository object that generates, modifies, or passes data. The Designer
provides a set of transformations that perform specific functions. For example, an Aggregator
transformation performs calculations on groups of data. Below are the various transformations
available in Informatica:
Aggregator
Custom
Expression

External Procedure
Filter
Input
Joiner
Lookup
Normalizer
Rank
Router
Sequence Generator
Sorter
Source Qualifier
Stored Procedure
Transaction Control
Union
Update Strategy
XML Generator
XML Parser
XML Source Qualifier
Q. What is a source qualifier? What is meant by Query Override?
A. Source Qualifier represents the rows that the PowerCenter Server reads from a relational or flat
file source when it runs a session. When a relational or a flat file source definition is added to a
mapping, it is connected to a Source Qualifier transformation.
The PowerCenter Server generates a query for each Source Qualifier transformation whenever it runs
the session. The default query is a SELECT statement containing all the source columns. The Source
Qualifier can override this default query by changing the default settings of the
transformation properties. The list of selected ports and the order in which they appear in the
default query should not be changed in the overridden query.
Q. What is aggregator transformation?
A. The Aggregator transformation performs aggregate calculations, such as averages and
sums. Unlike the Expression transformation, the Aggregator transformation is used to
perform calculations on groups. The Expression transformation permits calculations on a row-by-row basis only.
The Aggregator transformation contains group by ports that indicate how to group the data. While
grouping the data, the Aggregator transformation outputs the last row of each group unless
otherwise specified in the transformation properties.
The aggregate functions available in Informatica are: AVG, COUNT, FIRST, LAST, MAX, MEDIAN,
MIN, PERCENTILE, STDDEV, SUM, VARIANCE.
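The difference between row-by-row and group-level calculation can be illustrated in plain Python. This is only an analogy for what a group by port does; it is not Informatica code, and the function name aggregate and the sample rows are invented for the example.

```python
from collections import defaultdict


def aggregate(rows, group_by, value):
    """Group rows by one column and compute a few aggregates per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value])
    return {
        key: {
            "SUM": sum(values),
            "AVG": sum(values) / len(values),
            "COUNT": len(values),
            "MIN": min(values),
            "MAX": max(values),
        }
        for key, values in groups.items()
    }


rows = [
    {"dept": "A", "salary": 100},
    {"dept": "A", "salary": 300},
    {"dept": "B", "salary": 50},
]
summary = aggregate(rows, "dept", "salary")
print(summary)
```

Three input rows produce two output rows, one per group, which is why the Aggregator is classed as an active transformation.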
Q. What is Incremental Aggregation?
A. When a session is created for a mapping that contains an Aggregator transformation, the session
option for Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it
passes only new source data through the mapping and uses the historical cache data to perform
the new aggregation calculations incrementally.
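The idea can be sketched in Python as a cache of running totals that each run updates with new rows only. This is a conceptual illustration, not how PowerCenter stores its cache files; the function name and the (total, count) cache layout are assumptions.

```python
def incremental_aggregate(cache, new_rows):
    """Fold only the new rows into previously saved (total, count) pairs."""
    for key, amount in new_rows:
        total, count = cache.get(key, (0, 0))
        cache[key] = (total + amount, count + 1)
    return cache


# First run: the cache starts empty and is built from the initial load.
cache = {}
incremental_aggregate(cache, [("north", 10), ("south", 5)])

# Second run: only the rows that arrived since the first run are processed.
incremental_aggregate(cache, [("north", 7)])
print(cache)
```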
Q. How is the Union transformation used?
A. The Union transformation is a multiple input group transformation that can be used to merge
data from various sources (or pipelines). This transformation works just like the UNION ALL statement
in SQL, which combines the result sets of two SELECT statements without removing duplicates.
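The UNION ALL behavior referred to above can be checked with a short sqlite3 sketch in Python. The tables src_a and src_b and their rows are made up; the point is only that UNION ALL keeps the duplicate row while plain UNION removes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_a (id INTEGER, name TEXT);
CREATE TABLE src_b (id INTEGER, name TEXT);
INSERT INTO src_a VALUES (1, 'x'), (2, 'y');
INSERT INTO src_b VALUES (2, 'y'), (3, 'z');
""")

# UNION ALL simply appends one result set to the other.
union_all_rows = conn.execute(
    "SELECT id, name FROM src_a UNION ALL SELECT id, name FROM src_b ORDER BY id"
).fetchall()

# UNION additionally removes duplicate rows.
union_rows = conn.execute(
    "SELECT id, name FROM src_a UNION SELECT id, name FROM src_b ORDER BY id"
).fetchall()

print(union_all_rows)
print(union_rows)
```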

Q. Can two flat files be joined with Joiner Transformation?


A. Yes, joiner transformation can be used to join data from two flat file sources.
Q. What is a look up transformation?
A. This transformation is used to lookup data in a flat file or a relational table, view or synonym. It
compares lookup transformation ports (input ports) to the source column values based on the
lookup condition. Later returned values can be passed to other transformations.
Q. Can a lookup be done on Flat Files?
A. Yes.
Q. What is a mapplet?
A. A mapplet is a reusable object that is created using the Mapplet Designer. The mapplet contains a set
of transformations and allows us to reuse that transformation logic in multiple mappings.
Q. What does reusable transformation mean?
A. Reusable transformations can be used multiple times in a mapping. The reusable
transformation is stored as a metadata separate from any other mapping that uses the
transformation. Whenever any changes to a reusable transformation are made, all the mappings
where the transformation is used will be invalidated.

Q. What is update strategy and what are the options for update strategy?
A. Informatica processes the source data row by row. By default, every row is marked to be inserted
into the target table. If a row has to be updated or inserted based on some logic, the Update Strategy
transformation is used. The condition can be specified in the Update Strategy to mark the processed
row for update or insert.
Following options are available for update strategy:
DD_INSERT: If this is used the Update Strategy flags the row for insertion. Equivalent numeric
value of DD_INSERT is 0.
DD_UPDATE: If this is used the Update Strategy flags the row for update. Equivalent numeric
value of DD_UPDATE is 1.
DD_DELETE: If this is used the Update Strategy flags the row for deletion. Equivalent numeric
value of DD_DELETE is 2.
DD_REJECT: If this is used the Update Strategy flags the row for rejection. Equivalent numeric
value of DD_REJECT is 3.
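The flagging logic can be sketched in Python using the same numeric values. This is an illustration of the decision only, not Informatica expression syntax; the function flag_row, the row fields (id, deleted) and the rules themselves are hypothetical.

```python
# Numeric values as listed above.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3


def flag_row(row, existing_keys):
    """Return the flag a row would receive under these example rules."""
    if row.get("id") is None:
        return DD_REJECT          # no key: cannot be loaded
    if row.get("deleted"):
        return DD_DELETE          # source marks the row as removed
    if row["id"] in existing_keys:
        return DD_UPDATE          # key already present in the target
    return DD_INSERT              # new key


existing = {1, 2}
flags = [
    flag_row({"id": 1}, existing),
    flag_row({"id": 5}, existing),
    flag_row({"id": 2, "deleted": True}, existing),
    flag_row({"id": None}, existing),
]
print(flags)
```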

Q. What are the types of loading in Informatica?
There are two types of loading: 1. Normal loading and 2. Bulk loading.
Normal loading loads record by record and writes a log entry for each. It takes comparatively longer
to load data to the target.
Bulk loading loads a number of records at a time to the target database. It takes less time to load
data to the target.
Q. What is aggregate cache in aggregator transformation?
The aggregator stores data in the aggregate cache until it completes aggregate calculations. When
you run a session that uses an aggregator transformation, the informatica server creates index
and data caches in memory to process the transformation. If the informatica server requires more
space, it stores overflow values in cache files.
Q. What type of repositories can be created using Informatica Repository Manager?
A. Informatica PowerCenter includes following type of repositories:
Standalone Repository: A repository that functions individually and is unrelated to any
other repositories.
Global Repository: This is a centralized repository in a domain. This repository can
contain objects shared across the repositories in a domain. The objects are shared through global
shortcuts.
Local Repository: A local repository is within a domain and is not a global repository. A local
repository can connect to a global repository using global shortcuts and can use objects in its
shared folders.
Versioned Repository: This can be either a local or a global repository, but it allows version control
for the repository. A versioned repository can store multiple copies, or versions, of an object. This
feature allows metadata to be developed, tested and deployed efficiently in the production
environment.
Q. What is a code page?
A. A code page contains encoding to specify characters in a set of one or more languages. The
code page is selected based on source of the data. For example if source contains Japanese text
then the code page should be selected to support Japanese text.
When a code page is chosen, the program or application for which the code page is set, refers to a
specific set of data that describes the characters the application recognizes. This influences the
way that application stores, receives, and sends character data.
Q. Which databases can PowerCenter Server on Windows connect to?
A. PowerCenter Server on Windows can connect to the following databases:
IBM DB2
Informix
Microsoft Access
Microsoft Excel
Microsoft SQL Server
Oracle
Sybase
Teradata
Q. Which databases can PowerCenter Server on UNIX connect to?
A. PowerCenter Server on UNIX can connect to the following databases:
IBM DB2
Informix
Oracle
Sybase
Teradata
Q. How to execute PL/SQL script from Informatica mapping?
A. The Stored Procedure (SP) transformation can be used to execute PL/SQL scripts. The PL/SQL
procedure name is specified in the SP transformation. Whenever the session is executed, the
session calls the PL/SQL procedure.

Q. What is Data Driven?


The informatica server follows instructions coded into update strategy transformations within the
session mapping which determine how to flag records for insert, update, delete or reject. If we do
not choose data driven option setting, the informatica server ignores all update strategy
transformations in the mapping.
Q. What are the types of mapping wizards that are provided in Informatica?
The Designer provides two mapping wizards.
1. Getting Started Wizard - Creates mapping to load static facts and dimension tables as well as
slowly growing dimension tables.
2. Slowly Changing Dimensions Wizard - Creates mappings to load slowly changing dimension
tables based on the amount of historical dimension data we want to keep and the method we
choose to handle historical dimension data.
Q. What is Load Manager?
A. While running a Workflow, the PowerCenter Server uses the Load Manager
process and the Data Transformation Manager Process (DTM) to run the workflow and carry
out workflow tasks. When the PowerCenter Server runs a workflow, the Load Manager performs
the following tasks:
1. Locks the workflow and reads workflow properties.
2. Reads the parameter file and expands workflow variables.
3. Creates the workflow log file.
4. Runs workflow tasks.
5. Distributes sessions to worker servers.
6. Starts the DTM to run sessions.
7. Runs sessions from master servers.
8. Sends post-session email if the DTM terminates abnormally.
When the PowerCenter Server runs a session, the DTM performs the following tasks:
1. Fetches session and mapping metadata from the repository.
2. Creates and expands session variables.
3. Creates the session log file.
4. Validates session code pages if data code page validation is enabled. Checks
query conversions if data code page validation is disabled.
5. Verifies connection object permissions.
6. Runs pre-session shell commands.
7. Runs pre-session stored procedures and SQL.
8. Creates and runs mapping, reader, writer, and transformation threads to extract,
transform, and load data.
9. Runs post-session stored procedures and SQL.
10. Runs post-session shell commands.
11. Sends post-session email.
Q. What is Data Transformation Manager?
A. After the Load Manager performs validations for the session, it creates the DTM
process. The DTM process is the second process associated with the session run. The
primary purpose of the DTM process is to create and manage threads that carry out
the session tasks.
The DTM allocates process memory for the session and divides it into buffers. This
is also known as buffer memory. It creates the main thread, which is called the
master thread. The master thread creates and manages all other threads.
If we partition a session, the DTM creates a set of threads for each partition to
allow concurrent processing. When the Informatica server writes messages to the
session log, it includes the thread type and thread ID.
The following are the types of threads that the DTM creates:
Master thread - Main thread of the DTM process. Creates and manages all other
threads.
Mapping thread - One thread for each session. Fetches session and mapping
information.
Pre- and post-session threads - One thread each to perform pre- and post-session
operations.
Reader thread - One thread for each partition for each source pipeline.
Writer thread - One thread for each partition, if a target exists in the source pipeline, to
write to the target.
Transformation thread - One or more transformation threads for each partition.

Q. What are sessions and batches?

Session - A session is a set of instructions that tells the Informatica server how
and when to move data from sources to targets. After creating the session, we
can use either the Server Manager or the command line program pmcmd to start
or stop the session.

Batches - A batch provides a way to group sessions for either serial or parallel execution by the
Informatica server. There are two types of batches:
1. Sequential - Runs sessions one after the other.
2. Concurrent - Runs sessions at the same time.
Q. In how many ways can you update a relational source definition, and what
are they?
A. Two ways
1. Edit the definition
2. Reimport the definition
Q. What is a transformation?
A. It is a repository object that generates, modifies or passes data.
Q. What are the designer tools for creating transformations?
A. Mapping designer
Transformation developer
Mapplet designer
Q. In how many ways can you create ports?
A. Two ways
1. Drag the port from another transformation
2. Click the add button on the ports tab.
Q. What are reusable transformations?
A. A transformation that can be reused is called a reusable transformation.
They can be created using two methods:
1. Using the Transformation Developer
2. Creating a normal transformation and promoting it to reusable
Q. What are the settings that you use to configure the Joiner transformation?
Master and detail source
Type of join
Condition of the join

Q. What are the join types in joiner transformation?


Normal (Default) -- only matching rows from both master and detail
Master outer -- all detail rows and only matching rows from master
Detail outer -- all master rows and only matching rows from detail
Full outer -- all rows from both master and detail (matching or non matching)
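The four join types can be illustrated with a small Python sketch. This is not Informatica code: the join function, the master and detail row lists and the dept_id key are hypothetical names chosen for illustration, and columns missing from an unmatched row are simply left out rather than filled with nulls.

```python
def join(master, detail, key, join_type="normal"):
    """Join two lists of dict rows following the Joiner join types described above."""
    # Index the master rows by join key (the Joiner also caches the master source).
    master_by_key = {}
    for m in master:
        master_by_key.setdefault(m[key], []).append(m)

    rows = []
    matched_master_keys = set()
    for d in detail:
        matches = master_by_key.get(d[key], [])
        if matches:
            matched_master_keys.add(d[key])
            for m in matches:
                rows.append({**m, **d})
        elif join_type in ("master outer", "full outer"):
            # Master outer keeps every detail row, matched or not.
            rows.append(dict(d))

    if join_type in ("detail outer", "full outer"):
        # Detail outer keeps every master row, matched or not.
        for k, master_rows in master_by_key.items():
            if k not in matched_master_keys:
                rows.extend(dict(m) for m in master_rows)
    return rows


master = [{"dept_id": 10, "dept": "Sales"}, {"dept_id": 20, "dept": "HR"}]
detail = [{"dept_id": 10, "emp": "Ann"}, {"dept_id": 30, "emp": "Bob"}]

for join_type in ("normal", "master outer", "detail outer", "full outer"):
    print(join_type, len(join(master, detail, "dept_id", join_type)))
```

With this sample data the normal join returns one row, each outer join returns two, and the full outer join returns three.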
Q. What are the joiner caches?
A. When a Joiner transformation occurs in a session, the Informatica Server reads all the records
from the master source and builds index and data caches based on the master rows. After building
the caches, the Joiner transformation reads records from the detail source and performs joins.
Q. What are the types of lookup caches?
Static cache: You can configure a static, or read-only, cache for any lookup table. By default the
Informatica server creates a static cache. It caches the lookup table and looks up values in the
cache for each row that comes into the transformation. When the lookup condition is true, the
Informatica server returns a value from the cache. It does not update the cache while it processes
the lookup transformation.
Dynamic cache: If you want to cache the target table and insert new rows into the cache and the
target, you can create a lookup transformation that uses a dynamic cache. The Informatica server
dynamically inserts rows into the cache as it passes them to the target table.
Persistent cache: You can save the lookup cache files and reuse them the next time the
Informatica server processes a lookup transformation configured to use the cache.
Recache from database: If the persistent cache is not synchronized with the lookup table, you
can configure the lookup transformation to rebuild the lookup cache.
Shared cache: You can share the lookup cache between multiple transformations. You can share an
unnamed cache between transformations in the same mapping.
Q. What is a transformation?
A: A transformation is a repository object that generates, modifies, or passes data.
Each transformation performs a specific function. There are two types of transformations:
1. Active
An active transformation can change the number of rows that pass
through it. Eg: Aggregator, Filter, Joiner, Normalizer, Rank, Router, Source Qualifier, Update
Strategy, ERP Source Qualifier, Advanced External Procedure.
2. Passive
A passive transformation does not change the number of rows that pass through it. Eg: Expression,
External Procedure, Input, Lookup, Stored Procedure, Output, Sequence Generator, XML Source Qualifier.
Q. What are the options/types to run a stored procedure?
A: Normal: During a session, the stored procedure runs where the
transformation exists in the mapping, on a row-by-row basis. This is useful for calling the stored
procedure for each row of data that passes through the mapping, such as running a calculation
against an input port. Connected stored procedures run only in normal mode.
Pre-load of the Source: Before the session retrieves data from the source, the stored procedure
runs. This is useful for verifying the existence of tables or performing joins of data in a temporary
table.
Post-load of the Source: After the session retrieves data from the source, the stored procedure
runs. This is useful for removing temporary tables.
Pre-load of the Target: Before the session sends data to the target, the stored procedure runs.
This is useful for verifying target tables or disk space on the target system.
Post-load of the Target: After the session sends data to the target, the stored procedure runs.
This is useful for re-creating indexes on the database.
A Stored Procedure transformation must contain at least one input and one output port.
Q. What kinds of sources and targets can be used in Informatica?
Sources may be flat files, relational databases or XML.
Targets may be relational tables, XML or flat files.
Q: What is the session process?
A: The Load Manager process starts the session, creates the DTM process, and
sends post-session email when the session completes.
Q. What is DTM process?
A: The DTM process creates threads to initialize the session, read, write, transform
data and handle pre and post-session operations.
Q. What are the different types of tracing levels?
The tracing level represents the amount of information that the Informatica Server writes to a log
file. Tracing levels store information about mappings and transformations. There are 4 types of
tracing levels supported:
1. Normal: Logs initialization and status information, a summary of the successful
rows and target rows, and information about rows skipped due to transformation errors.
2. Terse: Logs less than Normal: initialization information, error messages and notification of
rejected data.
3. Verbose Initialization: In addition to Normal tracing, logs the location of the data
cache files and index cache files that are created, and detailed transformation statistics for
each transformation within the mapping.
4. Verbose Data: In addition to Verbose Initialization tracing, logs every row processed by
the Informatica server.
Q. Types of Dimensions?

A dimension table consists of the attributes about the facts. Dimensions store the textual
descriptions of the business.
Conformed Dimension:
Conformed dimensions mean the exact same thing with every possible fact table to which they are
joined.
Eg: The date dimension table connected to the sales facts is identical to the date dimension
connected to the inventory facts.
Junk Dimension:
A junk dimension is a collection of random transactional codes, flags and/or text attributes that are
unrelated to any particular dimension. The junk dimension is simply a structure that provides a
convenient place to store the junk attributes.
Eg: Assume that we have a gender dimension and marital status dimension. In the fact table we
need to maintain two keys referring to these dimensions. Instead of that create a junk dimension
which has all the combinations of gender and marital status (cross join gender and marital status
table and create a junk table). Now we can maintain only one key in the fact table.
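The cross join described above can be sketched in a few lines of Python. The attribute values and the junk_key surrogate key name are assumptions made for illustration only.

```python
from itertools import product

genders = ["M", "F"]
marital_statuses = ["Single", "Married", "Divorced"]

# Cross join the two small attribute lists and assign one surrogate key per combination.
junk_dimension = [
    {"junk_key": key, "gender": gender, "marital_status": status}
    for key, (gender, status) in enumerate(product(genders, marital_statuses), start=1)
]

# Lookup used while loading the fact table: one attribute pair maps to one junk key.
junk_key_by_attrs = {
    (row["gender"], row["marital_status"]): row["junk_key"] for row in junk_dimension
}

print(len(junk_dimension))                   # 6 combinations
print(junk_key_by_attrs[("F", "Married")])   # 5
```

The fact table then stores the single junk_key instead of one key per low-cardinality attribute.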
Degenerate Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its
own dimension table.
Eg: A transactional code in a fact table.
Slowly changing dimension:
Slowly changing dimensions are dimension tables that have slowly increasing
data as well as updates to existing data.
Q. What are the output files that the Informatica server creates while a
session runs?
Informatica server log: The Informatica server (on UNIX) creates a log for all status and
error messages (default name: pm.server.log). It also creates an error log for error
messages. These files are created in the Informatica home directory.
Session log file: The Informatica server creates a session log file for each session. It writes
information about the session into the log file, such as the initialization process, creation of SQL
commands for reader and writer threads, errors encountered and the load summary. The
amount of detail in the session log file depends on the tracing level that you set.
Session detail file: This file contains load statistics for each target in the mapping.
Session details include information such as the table name and the number of rows written or
rejected. You can view this file by double clicking on the session in the monitor window.
Performance detail file: This file contains information known as session performance
details, which helps you identify where performance can be improved. To generate this file,
select the performance detail option in the session property sheet.
Reject file: This file contains the rows of data that the writer does not write to
targets.
Control file: Informatica server creates control file and a target file when you run a
session that uses the external loader. The control file contains the information about
the target flat file such as data format and loading instructions for the external
loader.
Post session email: Post session email allows you to automatically communicate
information about a session run to designated recipients. You can create two
different messages. One if the session completed successfully the other if the session
fails.
Indicator file: If you use the flat file as a target, you can configure the Informatica
server to create indicator file. For each target row, the indicator file contains a
number to indicate whether the row was marked for insert, update, delete or reject.
Output file: If session writes to a target file, the Informatica server creates the
target file based on file properties entered in the session property sheet.
Cache files: When the Informatica server creates memory cache it also creates cache
files.
For the following circumstances Informatica server creates index and data cache
files:
Aggregator transformation
Joiner transformation
Rank transformation
Lookup transformation
Q. What is meant by lookup caches?
A. The Informatica server builds a cache in memory when it processes the first row
of a data in a cached look up transformation. It allocates memory for the cache
based on the amount you configure in the transformation or session properties. The
Informatica server stores condition values in the index cache and output values in
the data cache.

Q. How do you identify existing rows of data in the target table using a lookup
transformation?
A. There are two ways to look up the target table to verify whether a row exists:
1. Use a connected lookup with dynamic cache and then check the value of the NewLookupRow
output port to decide whether the incoming record already exists in the table / cache
or not.
2. Use an unconnected lookup, call it from an expression transformation and check
the lookup condition port value (Null / Not Null) to decide whether the incoming
record already exists in the table or not.
Q. What are aggregate tables?
An aggregate table contains a summary of existing warehouse data, grouped to certain
levels of dimensions. Retrieving the required data from the actual table, which may have millions of
records, takes more time and also affects server performance. To avoid this, we can
aggregate the table to the required level and query that instead. These tables reduce the load on the
database server, improve query performance and return results very
quickly.
Q. What is the level of granularity of a fact table?
The level of granularity is the level of detail that you put into the fact table in a data warehouse. For
example, based on the design you can decide to store the sales data for each transaction. The level of
granularity then determines how much detail you keep for each transactional fact: product sales
recorded for each individual transaction, or aggregated up to the minute before being stored.
Q. What is session?
A session is a set of instructions to move data from sources to targets.
Q. What is a worklet?
Worklets are objects that represent a set of workflow tasks and allow you to reuse a set of workflow
logic in several workflows.
Use of worklets: You can bind many tasks in one place so that they can be easily identified
and serve a specific purpose.
Q. What is workflow?
A workflow is a set of instructions that tells the Informatica server how to execute the tasks.
Q. Why can't we use the sorted input option for incremental aggregation?
In incremental aggregation, the aggregate calculations are stored in a historical cache on the server.
In this historical cache the data need not be in sorted order. If you give sorted input, the records
come presorted for that particular run, but in the historical cache the data may not be in
sorted order. That is why this option is not allowed.
Q. What is the target load order plan?
You specify the target load order based on the source qualifiers in a mapping. If you have multiple
source qualifiers connected to multiple targets, you can designate the order in which the
Informatica server loads data into the targets.
The target load plan defines the order in which data is extracted from the source qualifier transformations.
It is set in Mappings (tab) > Target Load Order Plan
Q. What is constraint based loading?
Constraint based load order defines the order of loading the data into multiple targets based
on primary and foreign key constraints.
To set the option: Double click the session
Config Object > check Constraint Based Loading
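The idea behind a constraint based load order can be sketched as a topological sort over the foreign key relationships: parent tables are loaded before the tables that reference them. The table names below are hypothetical, and the sketch uses Python's standard graphlib module (Python 3.9 or later); it illustrates the ordering principle only, not how the Informatica server implements it.

```python
from graphlib import TopologicalSorter

# Hypothetical targets: each table maps to the parent tables its foreign keys reference.
foreign_keys = {
    "customer": set(),
    "product": set(),
    "orders": {"customer"},
    "order_items": {"orders", "product"},
}

# static_order() yields every parent table before the tables that depend on it.
load_order = list(TopologicalSorter(foreign_keys).static_order())
print(load_order)
```

Any order that places customer before orders, and both orders and product before order_items, satisfies the constraints.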
Q. What is the status code in the stored procedure transformation?
The status code provides error handling for the Informatica server during the session. The stored
procedure issues a status code that indicates whether or not the stored procedure completed
successfully. This value cannot be seen by the user. It is only used by the Informatica server to determine
whether to continue running the session or stop.
Q. Define Informatica Repository?
The Informatica repository is a relational database that stores information, or metadata, used by
the Informatica Server and Client tools. Metadata can include information such as mappings
describing how to transform source data, sessions indicating when you want the Informatica
Server to perform the transformations, and connect strings for sources and targets.
The repository also stores administrative information such as usernames and passwords,
permissions and privileges, and product version.
Use the Repository Manager to create the repository. The Repository Manager connects to the
repository database and runs the code needed to create the repository tables. These tables store
metadata in a specific format that the Informatica server and client tools use.
Q. What is metadata?
Designing a data mart involves writing and storing a complex set of instructions. You need to know
where to get data (sources), how to change it, and where to write the information (targets).
PowerMart and PowerCenter call this set of instructions metadata. Each piece of metadata (for
example, the description of a source table in an operational database) can contain comments
about it.
In summary, metadata can include information such as mappings describing how to transform
source data, sessions indicating when you want the Informatica Server to perform the
transformations, and connect strings for sources and targets.

Q. What is the metadata reporter?
It is a web based application that enables you to run reports against repository metadata. With the
metadata reporter you can access information about your repository without having knowledge of
SQL, the transformation language or the underlying tables in the repository.

Q. What are the types of metadata stored in the repository?
Source definitions. Definitions of database objects (tables, views, synonyms) or files that provide
source data.
Target definitions. Definitions of database objects or files that contain the target data.
Multidimensional metadata. Target definitions that are configured as cubes and dimensions.
Mappings. A set of source and target definitions along with transformations containing business
logic that you build into the transformation. These are the instructions that the Informatica Server
uses to transform and move data.
Reusable transformations. Transformations that you can use in multiple mappings.
Mapplets. A set of transformations that you can use in multiple mappings.
Sessions and workflows. Sessions and workflows store information about how and when the
Informatica Server moves data. A workflow is a set of instructions that describes how and when to
run tasks related to extracting, transforming, and loading data. A session is a type of task that you
can put in a workflow. Each session corresponds to a single mapping.
The following are the types of metadata stored in the repository:
Database connections
Global objects
Multidimensional metadata
Reusable transformations
Shortcuts
Transformations
Q. How can we store previous session logs?
Go to Session Properties > Config Object > Log Options
Select the properties as follows:
Save session log by > Session Runs
Save session log for these runs > Set the number of log files that you want to keep
(default is 0)
If you want to save all of the log files created by every run, select the option
Save session log by > Session Timestamp
You can find these properties in the session/workflow properties.
Q. What is Changed Data Capture?
Changed Data Capture (CDC) helps identify the data in the source system that has changed since
the last extraction. With CDC data extraction takes place at the same time the insert update or
delete operations occur in the source tables and the change data is stored inside the database in
change tables.
The change data thus captured is then made available to the target systems in a controlled
manner.
Q. What is an indicator file, and how can it be used?
An indicator file is used for event based scheduling when you don't know when the source data will be
available. A shell command, script or batch file creates this indicator file and sends it to a
directory local to the Informatica Server. The server waits for the indicator file to appear before running
the session.
Q. What is an audit table, and what are the columns in it?
An audit table is a table that records your workflow names and session names.
It contains information about workflow and session status and their details.
WKFL_RUN_ID
WKFL_NME
START_TMST
END_TMST
ROW_INSERT_CNT
ROW_UPDATE_CNT
ROW_DELETE_CNT
ROW_REJECT_CNT

Q. If a session fails after loading 10000 records into the target, how can we load from the 10001st
record when we run the session the next time?
Select the Recovery Strategy in the session properties as Resume from the last checkpoint.
Note: Set this property before running the session.
Q. Informatica reject file: how to identify the rejection reason?
Each column value in the reject file is followed by a column indicator:
D - Valid data or good data. The writer passes it to the target database. The target accepts it
unless a database error occurs, such as finding a duplicate key while inserting.
O - Overflowed numeric data. Numeric data exceeded the specified precision or scale for the
column. Bad data, if you configured the mapping target to reject overflow or truncated data.
N - Null value. The column contains a null value. Good data. The writer passes it to the target, which
rejects it if the target database does not accept null values.
T - Truncated string data. String data exceeded the specified precision for the column, so the
Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow
or truncated data.
Note also that the second column contains the column indicator flag value D, which signifies
that the row indicator is valid.
Now let us see what data in a bad file looks like:
0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T
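A minimal Python sketch of how such a line could be decoded is shown below. The parse_reject_line function is a hypothetical helper, and the row indicator codes (0 = insert, 1 = update, 2 = delete, 3 = reject) are an assumption that is not stated in the text above; the column indicator meanings are taken from the list above.

```python
import csv

COLUMN_INDICATORS = {
    "D": "valid data",
    "O": "overflowed numeric data",
    "N": "null value",
    "T": "truncated string data",
}
# Assumed row indicator codes; they are not listed in the text above.
ROW_INDICATORS = {"0": "insert", "1": "update", "2": "delete", "3": "reject"}


def parse_reject_line(line):
    """Split one reject file line into its row type and (value, reason) pairs."""
    fields = next(csv.reader([line]))
    row_type = ROW_INDICATORS.get(fields[0], "unknown")
    # After the first pair, fields alternate: value, column indicator, value, ...
    values = fields[2::2]
    flags = fields[3::2]
    columns = [(value, COLUMN_INDICATORS.get(flag, "unknown")) for value, flag in zip(values, flags)]
    return row_type, columns


row_type, columns = parse_reject_line(
    "0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T"
)
print(row_type)
for value, reason in columns:
    print(repr(value), reason)
```

For the sample line, the third data column (5000.375) is flagged as overflowed numeric data, the fourth is null and the last is a truncated string.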
Q. What is Insert Else Update and Update Else Insert?
These options are used when dynamic cache is enabled.

Insert Else Update option applies to rows entering the lookup transformation with the row
type of insert. When this option is enabled the integration service inserts new rows in the cache
and updates existing rows. When disabled, the Integration Service does not update existing rows.

Update Else Insert option applies to rows entering the lookup transformation with the row
type of update. When this option is enabled, the Integration Service updates existing rows, and
inserts a new row if it is new. When disabled, the Integration Service does not insert new rows.

Q. What are the different methods of loading dimension tables?
Conventional load - Before loading the data, all the table constraints are checked against the
data.
Direct load (faster loading) - All the constraints are disabled and the data is loaded directly.
Later the data is checked against the table constraints, and the bad data won't be indexed.
Q. What are the different types of Commit intervals?
The different commit intervals are:

Source-based commit. The Informatica Server commits data based on the number of
source rows. The commit point is the commit interval you configure in the session properties.

Target-based commit. The Informatica Server commits data based on the number of
target rows and the key constraints on the target table. The commit point also depends on the
buffer block size and the commit interval.
Q. How to add the source flat file header to the target file?
Edit Task-->Mapping-->Target-->Header Options--> Output field names
Q. How to load the name of the file into a relational target?
Source Definition-->Properties-->Add currently processed file name port
Q. How to return multiple columns through an unconnected lookup?
Suppose your lookup table has f_name, m_name and l_name and you are using an unconnected lookup. In
the lookup SQL override, return f_name||'~'||m_name||'~'||l_name as a single column; you can then get this
value using the unconnected lookup in an expression. Use the substring function in an Expression
transformation to separate these three columns and make them individual ports for downstream
transformations or the target.
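The concatenate-then-split idea can be sketched in Python. The two functions are hypothetical stand-ins for the lookup SQL override and for the substring logic in the Expression transformation; the sketch assumes the '~' delimiter never occurs in the data itself.

```python
DELIMITER = "~"


def lookup_full_name(f_name, m_name, l_name):
    # Stands in for the lookup SQL override: f_name || '~' || m_name || '~' || l_name
    return DELIMITER.join([f_name, m_name, l_name])


def split_lookup_result(value):
    # Stands in for the substring logic in the Expression transformation.
    f_name, m_name, l_name = value.split(DELIMITER)
    return f_name, m_name, l_name


packed = lookup_full_name("John", "Q", "Public")
print(packed)                       # John~Q~Public
print(split_lookup_result(packed))  # ('John', 'Q', 'Public')
```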
-----------------------------------------------------------------------------------------

Q. What is a factless fact table? For what purpose do we use it in DWH projects?
It is a fact table which does not contain any measurable data.
EX: Student attendance fact (it records only whether a student attended a class or
not: Yes or No.)

A factless fact table contains only keys and no measures; in other words,
it contains no facts. Generally it is used to integrate the fact tables.

A factless fact table contains only foreign keys. Two kinds of aggregate functions can be applied to
a factless fact: count and distinct count.

2 purposes of a factless fact table:
1. Coverage: to indicate what did NOT happen.
Like: which product did not sell well in a particular region?
2. Event tracking: to know whether an event took place or not.
Like: a fact for tracking student attendance will not contain any measures.
Q. What is a staging area?
The staging area is where we apply our logic to the data extracted from the sources: the data is
cleansed, made meaningful and summarized there before it is loaded into the data warehouse.
Q. Why is the Union transformation an active transformation?
A transformation is active when the row number of a row can change as it passes through.
A row number can change under 2 conditions:
1. The number of rows coming in and going out is different.
Eg: for a Filter, suppose we have data like
id name dept row_num
1 aa 4 1
2 bb 3 2
3 cc 4 3
and a filter condition like dept=4; then the output would
be
id name dept row_num
1 aa 4 1
3 cc 4 2
So the row number changed, and Filter is an active transformation.
2. The order of the rows changes.
Eg: when a Union transformation pulls in data, suppose we have
2 sources
source1:
id name dept row_num
1 aa 4 1
2 bb 3 2
3 cc 4 3
source2:
id name dept row_num
4 aaa 4 4
5 bbb 3 5
6 ccc 4 6
It never restricts the data from any source, so the rows can
arrive in any order:
id name dept row_num old_row_num
1 aa 4 1 1
4 aaa 4 2 4
5 bbb 3 3 5
2 bb 3 4 2
3 cc 4 5 3
6 ccc 4 6 6
So the row numbers change, and thus we say that Union is an active transformation.

Q. What is the use of a batch in Informatica? How many types of batches are there?
With a batch, we can run sessions either sequentially or concurrently.
A grouping of sessions is known as a batch.
Two types of batches:
1) Sequential: Runs sessions one after another.
2) Concurrent: Runs the sessions at the same time.

If you have sessions with source-target dependencies, you have to use a sequential batch to start the
sessions one after another. If you have several independent sessions, you can use concurrent batches,
which run all the sessions at the same time.
Q. What is the joiner cache?
When we use the Joiner transformation, the integration service maintains a cache and all the master
records are stored in it. The joiner cache has 2 types of cache: 1. Index cache 2. Data cache.

The index cache stores all the port values which participate in the join condition, and the data cache
stores all the ports which do not participate in the join condition.
Q. What is the location of parameter file in Informatica?
$PMBWPARAM
Q. How can you display only hidden files in UNIX?
ls -la lists all files, including the hidden ones, so it is not the answer:
$ ls -la
total 16
8 drwxrwxrwx 2 zzz yyy 4096 Apr 26 12:00 ./
8 drwxrwxrwx 9 zzz yyy 4096 Jul 31 16:59 ../
The correct answer is to filter the listing for names that start with a dot:
$ ls -a | grep "^\."
Q. How to delete the data in the target table after it is loaded?
SQ---> Properties tab-->Post SQL
delete from target_tablename
Post SQL statements on the source qualifier are executed using the source database connection after the
pipeline runs. Alternatively, write the post SQL on the target as truncate table <table name>. The session
also has a truncate target table option, which empties the target before the load.
Q. What is polling in Informatica?
It displays the updated information about the session in the monitor window. The monitor window
displays the status of each session when you poll the Informatica server.
Q. How will I stop my workflow after 10 errors?
Use the session level error handling property and set stop on errors to 10:
--->Config object > Error Handling > Stop on errors
Q. How can we calculate the fact table size?
The maximum number of rows in a fact table is the product of the number of members in each of its
dimensions.
For example, to estimate the fact table size for 3 years of daily historical data with 200 products and
200 stores:
3*365*200*200 = 43,800,000 rows
This is an upper bound: it assumes every product sells in every store on every day.
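The arithmetic can be checked with a short Python sketch. The daily grain is implied by the 365 factor; the bytes_per_row value is a made-up figure, used only to show how a row count would be turned into a storage estimate.

```python
years = 3
days_per_year = 365
products = 200
stores = 200

# Upper bound on rows: one row per product, per store, per day.
max_rows = years * days_per_year * products * stores
print(max_rows)  # 43800000

# Hypothetical row width, used only to turn the row count into a storage estimate.
bytes_per_row = 40
estimated_gib = max_rows * bytes_per_row / 1024 ** 3
print(round(estimated_gib, 2))
```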
Q. Without using an email task, how will you send a mail from Informatica?
By using the 'mailx' command in a UNIX shell script.
Q. How will you compare two mappings in two different repositories?
In the Designer client, go to the Mappings tab; there is an
option called 'Compare', which lets us compare two mappings in two different repositories.
In Informatica Designer go to mapping tab--->compare..
We can compare 2 folders within the same repository.
We can compare 2 folders in different repositories.
Q. What is constraint based load order?
Constraint based load order defines the order in which data loads into multiple targets based
on the primary key and foreign key relationship.
Q. What is a target load plan?
Suppose I have 3 pipelines in a single mapping:
emp source--->sq--->tar1
dept source--->sq--->tar2
bonus source--->sq--->tar3
My requirement is to load tar2 first, then tar1 and finally tar3.
For this type of loading, to control the extraction of data from the sources by the source qualifiers,
we use the target load plan.
Q. What is meant by data driven, and in which scenario do we use it?
Data driven is an option available at session level (Treat source rows as). When we use an Update
Strategy transformation, it tells the integration service to follow the instructions coded in that
transformation to decide how each row is applied to the target,
i.e. insert, update, delete or reject. If we use the Update Strategy transformation in a mapping, then
we select the data driven option in the session.
Q. How to run a workflow in UNIX?
Syntax: pmcmd startworkflow -sv <service name> -d <domain name> -u <user name> -p
<password> -f <folder name> <workflow name>
Example:
pmcmd startworkflow -service ${INFA_SERVICE} -domain ${INFA_DOMAIN} \
-uv xxx_PMCMD_ID -pv PSWD -folder ${ETLFolder} -wait ${ETLWorkflow}
Q. What is the main difference between a Joiner transformation and a Union
transformation?
A Joiner transformation merges data horizontally.
A Union transformation merges data vertically.

A Joiner transformation combines data records horizontally based on a join condition.
It can combine data from two different sources that have different metadata,
and it supports both heterogeneous and homogeneous data sources (for example, a SQL database and a
flat file).
A Union transformation combines data records vertically from multiple sources that have the same
metadata.
It is not limited to sources of the same type: the Union transformation also supports heterogeneous
data sources.
The Union transformation functions as the UNION ALL set operator.

Q. What is constraint based loading exactly, and how is it done? I think it applies when we
have a primary key-foreign key relationship. Is that correct?
Yes. Constraint based load order loads the data into multiple targets in an order that depends on the
primary key-foreign key relationship.
To set the option: Double click the session
Config Object > check Constraint Based Loading
Q. Difference between the top down (W. H. Inmon) and bottom up (Ralph Kimball) approaches?
Top down approach: As per W. H. Inmon, first we need to build the data warehouse and after that we build the
data marts, but the DWH is somewhat difficult to maintain this way.
Bottom up approach: As per Ralph Kimball, first we need to build the data marts and then we build up the
data warehouse.
This approach is the most useful in practice when creating a data warehouse.
Q. What are the different caches used in Informatica?
Static cache
Dynamic cache
Shared cache
Persistent cache
Q. What is the command to get the list of files in a directory in UNIX?
$ ls -lrt
Q. How to import multiple flat files into a single target when there is no common
column in the flat files?
In the workflow session properties, on the Mapping tab, choose Source filetype - Indirect
Give the Source filename: <file_path>
This <file_path> file should list all the files which you want to load.
Q. How to connect two or more tables with a single source qualifier?
Create an Oracle source with as many columns as you want and write the join query in the SQL
query override. The column order and data types should be the same as in the SQL query.

Q. How to call an unconnected lookup in an expression transformation?
:LKP.LKP_NAME(PORTS)
Q. What is the difference between a connected and an unconnected lookup?
Connected lookup:
It is used to join two tables.
It can return multiple columns from the same row.
It must be in the mapping pipeline.
You can implement a lookup condition.
Using a connected lookup you can generate sequence numbers by
enabling the dynamic lookup cache.
Unconnected lookup:
It returns a single output through the return port.
It acts as a lookup function (:LKP).
It is called by another transformation.
It is not connected to either the source or the target.
-----CONNECTED LOOKUP:
>> It participates in the data pipeline.
>> It has multiple inputs and multiple outputs.
>> It supports static and dynamic cache.
UNCONNECTED LOOKUP:
>> It does not participate in the data pipeline.
>> It has multiple inputs and a single output.
>> It supports static cache only.
Q. Types of partitioning in Informatica?
Partition 5 types
1. Simple pass through
2. Key range
3. Hash
4. Round robin
5. Database
Q. Which transformation uses cache?
1. Lookup transformation
2. Aggregator transformation
3. Rank transformation
4. Sorter transformation
5. Joiner transformation

Q. Explain the Union transformation?
A Union transformation is a multiple input group transformation, which is used to merge the data
from multiple sources, similar to the UNION ALL SQL statement that combines the results from 2 or more
SQL statements.
Similar to the UNION ALL statement, the Union transformation doesn't remove duplicate rows. It is an
active transformation.

Q. Explain the Joiner transformation?
The Joiner transformation is used to join source data from two related heterogeneous sources. However,
it can also be used to join data from the same source. The Joiner t/r joins sources with at least one
matching column. It uses a condition that matches one or more pairs of columns between the 2
sources.
To configure a Joiner t/r, the settings that we specify are:
1) Master and detail source
2) Type of join
3) Condition of the join

Q. Explain about Lookup transformation?


Lookup t/r is used in a mapping to look up data in a relational table, flat file, view or synonym.
The informatica server queries the look up source based on the look up ports in the
transformation. It compares look up t/r port values to look up source column values based on the
look up condition.
Look up t/r is used to perform the below mentioned tasks:
1) To get a related value.
2) To perform a calculation.
3) To update SCD tables.
Q. How to identify which row is for insert and which row is for update in a dynamic lookup cache?
Based on the NewLookupRow port, the Informatica server indicates which row is an insert and which is an
update.
NewLookupRow = 0: no change
NewLookupRow = 1: insert
NewLookupRow = 2: update
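The three flag values can be illustrated with a small Python simulation of a dynamic cache. The new_lookup_row function and the sample rows are hypothetical; the sketch only mirrors the 0/1/2 meanings listed above and is not how the Informatica server is implemented.

```python
def new_lookup_row(cache, key, values):
    """Return 0 (no change), 1 (insert) or 2 (update) and keep the cache in sync."""
    if key not in cache:
        cache[key] = values
        return 1
    if cache[key] != values:
        cache[key] = values
        return 2
    return 0


cache = {101: {"name": "Ann"}}
incoming = [(101, {"name": "Ann"}), (102, {"name": "Bob"}), (101, {"name": "Anne"})]
flags = [new_lookup_row(cache, key, values) for key, values in incoming]
print(flags)  # [0, 1, 2]
```

A downstream router or update strategy would then use the flag to send each row to the insert or update path.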
Q. How many ways can we implement SCD2?
1) Date range
2) Flag
3) Versioning
Q. How will you check the bottlenecks in Informatica? From where do you start
checking?
Check in this order:
1. Target
2. Source
3. Mapping
4. Session
5. System
Q. What is incremental aggregation?
When the Aggregator transformation executes, all the output data is stored in a temporary
location called the aggregator cache. The next time the mapping runs, the Aggregator
transformation processes only the new records loaded after the first run, and the cached output values are
incremented with the new values. This is called incremental aggregation, and it
improves performance.
--------------------------
Incremental aggregation means applying only the captured changes in the source to aggregate
calculations in a session.
When the source changes only incrementally and we can capture those changes, we can
configure the session to process only those changes. This allows the Informatica server to update
the target table incrementally, rather than forcing it to process the entire source and recalculate the
same calculations each time the session runs. This improves session
performance.
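The idea can be sketched outside Informatica in a few lines of Python. This is only an illustration of the concept, not how the Integration Service implements its cache: the dictionary stands in for the aggregator cache, and the region names and amounts are made-up sample data.

```python
from collections import defaultdict


def incremental_aggregate(cache, new_rows):
    """Fold only the newly captured rows into the saved aggregate cache."""
    for key, amount in new_rows:
        cache[key] += amount
    return cache


# First run: the whole source is aggregated and the result is kept as the cache.
cache = defaultdict(int)
incremental_aggregate(cache, [("east", 100), ("west", 50), ("east", 25)])

# Second run: only the rows loaded since the first run are processed.
incremental_aggregate(cache, [("east", 10), ("north", 5)])

totals = dict(cache)
print(totals)
```

The second call never re-reads the first batch; it only adds the new rows to the totals that were already saved.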
Q. How can I explain my project architecture in an interview? Tell me your project flow
from source to target.

A typical project architecture looks like this:

1. Source systems: such as Mainframe, Oracle, PeopleSoft, DB2.
2. Landing tables: tables that act like the source. They are used for easy access, for backup purposes,
and as a reusable source for other mappings.
3. Staging tables: from the landing tables we extract the data into staging tables after all validations
are done on the data.
4. Dimensions/Facts: the tables used for analysis and for making decisions by
analyzing the data.
5. Aggregation tables: tables with summarized data, useful for managers who want to
view month-wise sales, year-wise sales, etc.
6. Reporting layer: layers 4 and 5 are used by reporting developers to generate reports.
Q. What type of transformation is not supported by mapplets?

Normalizer transformation

COBOL sources, joiner

XML source qualifier transformation

XML sources

Target definitions

Pre & Post Session stored procedures

Other mapplets

Q. How informatica recognizes mapping?


Everything is coordinated by the Integration Service.
PowerCenter talks to the Integration Service, and the Integration Service talks to the session. The session holds
the mapping structure. This is the flow of execution.

Q. Can every transformation reusable? How?


Except for the Source Qualifier transformation, all transformations support the reusable property. A reusable
transformation can be developed in two ways.
1. In the mapping, select the transformation you want to reuse and double-click
it. Check the option that makes it a reusable transformation to convert it from non-reusable
to reusable. This option is not available for the Source Qualifier transformation.
2. By using the Transformation Developer.

Q. What is Pre Sql and Post Sql?


Pre SQL means that the integration service runs SQL commands against the source database
before it reads the data from source.

Post SQL means integration service runs SQL commands against target database after it writes to
the target.

Q. Insert else update option in which situation we will use?

With this option, if a record arriving on the associated port is not found in the lookup cache, it is inserted
into the cache. If the record is already in the lookup cache, the Integration Service finds that
record and updates it with the data from the associated port.

---------------------
We set this property when the lookup TRFM uses a dynamic cache and the session property TREAT
SOURCE ROWS AS "Insert" has been set.
-------------------
We use this option when we want to maintain the history.
If a record is not available in the target table, it is inserted into the target; if it is
available in the target table, it is updated.
Q. What is incremental loading? In which situations do we use incremental loading?
Incremental loading is an approach that loads only the data added since the previous load. Suppose you have
a mapping that loads data from an employee table to an employee_target table on the basis of hire date, and
you have already moved the employee data from source to target up to hire date 31-12-2009. Your organization
now wants to load employee_target today. The target already has the data for
employees with a hire date up to 31-12-2009, so you now pick up only the source rows with a hire date
from 1-1-2010 to date. You need not take the data before that date; doing
so adds the overhead of reloading data that already exists in the target. So in the Source
Qualifier you filter the records by hire date, and you can also parameterize the hire date to
control the date from which data is loaded into the target.
This is the concept of incremental loading.
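A minimal sketch of the same idea, using Python's built-in sqlite3 module in place of a real source and target. The employee rows are invented sample data, and the last_loaded_date argument plays the role of the parameterized hire date; in a real mapping the filter would sit in the Source Qualifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empno INTEGER, ename TEXT, hire_date TEXT);
    CREATE TABLE employee_target (empno INTEGER, ename TEXT, hire_date TEXT);
    INSERT INTO employee VALUES
        (1, 'Ann',  '2009-06-15'),
        (2, 'Bob',  '2009-12-31'),
        (3, 'Cara', '2010-01-04'),
        (4, 'Dev',  '2010-03-20');
    -- Rows already loaded by the earlier run (hire dates up to 2009-12-31).
    INSERT INTO employee_target
        SELECT * FROM employee WHERE hire_date <= '2009-12-31';
""")


def count_target(conn):
    return conn.execute("SELECT COUNT(*) FROM employee_target").fetchone()[0]


def incremental_load(conn, last_loaded_date):
    """Copy only rows hired after last_loaded_date; return how many were loaded."""
    before = count_target(conn)
    # ISO dates (YYYY-MM-DD) stored as text compare correctly as strings.
    conn.execute(
        "INSERT INTO employee_target "
        "SELECT empno, ename, hire_date FROM employee WHERE hire_date > ?",
        (last_loaded_date,),
    )
    return count_target(conn) - before


loaded = incremental_load(conn, "2009-12-31")
target_count = count_target(conn)
print(loaded, target_count)
```

Only the two rows hired in 2010 are read and inserted; the rows loaded earlier are left alone.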

Q. What is target update override?


By default the Integration Service updates the target based on key columns. But we might want to
update the target based on non-key columns as well; in that case we can override the
UPDATE statement for each target in the mapping. The target update override takes effect only when the
source rows are marked as update by an Update Strategy in the mapping.

Q. What is the Mapping parameter and Mapping variable?


Mapping parameter: A mapping parameter is a constant value that is defined before the mapping
runs. A mapping parameter lets the same mapping be reused with various constant values.

Mapping variable: A mapping variable represents a value that can change during the mapping
run. The value is stored in the repository; the Integration Service retrieves it from the repository and
uses the updated value for the next run.
Q. What are rank and dense rank in Informatica? Give an example and an SQL query for
both ranks.
For example, the file contains records with the column
empno
100
200 (repeated row)
200
300
400
500
Rank:
select empno, rank() over (order by empno) from emp
The rank function leaves a gap after the tied rows and gives
1
2
2
4
5
6
Dense rank:
select empno, dense_rank() over (order by empno) from emp
Dense rank leaves no gap after the tied rows and gives
1
2
2
3
4
5
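The two functions can be compared side by side with Python's sqlite3 module. This is a sketch under two assumptions: the bundled SQLite is version 3.25 or later (needed for window functions), and the window is ordered by empno with no partition, so that the repeated empno value is the tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?)",
    [(100,), (200,), (200,), (300,), (400,), (500,)],
)

rows = conn.execute("""
    SELECT empno,
           RANK()       OVER (ORDER BY empno) AS rnk,
           DENSE_RANK() OVER (ORDER BY empno) AS dense_rnk
    FROM emp
    ORDER BY empno
""").fetchall()

ranks = [row[1] for row in rows]        # gaps after ties
dense_ranks = [row[2] for row in rows]  # no gaps after ties
print(ranks)
print(dense_ranks)
```

Adding a PARTITION BY empno clause would restart the numbering inside each empno group, so every row would get rank 1; that is why the sketch orders the whole set instead.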
Q. What is the incremental aggregation?
The first time you run an upgraded session using incremental aggregation, the Integration Service
upgrades the index and data cache files. If you want to partition a session using a mapping with
incremental aggregation, the Integration Service realigns the index and data cache files.

Q. What is session parameter?


A parameter file is a text file in which we define the values of parameters. Session parameters
are used to assign values such as database connections.

Q. What is mapping parameter?


A mapping parameter represents a constant value that can be defined before the mapping runs. Its
value is defined in a parameter file, which is saved with a .prm extension. A mapping
parameter lets the same mapping be reused with various constant values.

Q. What is parameter file?


A parameter file is a text file that defines the values for parameters and
variables used in a session. It can be created with a text editor such as WordPad or
Notepad. You can define the following values in a parameter file:

Mapping parameters

Mapping variables

Session parameters

Q. What is session override?


Session override is an option in informatica at session level. Here we can manually give a sql query
which is issued to the database when the session runs. It is nothing but over riding the default sql
which is generated by a particular transformation at mapping level.
Q. What are the differences between Informatica versions 8.1.1 and 8.6.1?
There is a small change in the Administrator Console. In 8.1.1 we can create the IS, Repository
Service, web service, Domain, node and grid (if we have a licensed version). In 8.6.1 the Informatica
Admin Console lets us manage both the Domain page and the Security page. The Domain page covers all the
above: creation of the IS, Repository Service, web service, Domain, node, grid (if we have a
licensed version), etc. The Security page covers creation of users, privileges, LDAP configuration,
export and import of users and privileges, etc.

Q. What are the uses of a Parameter file?


A parameter file contains the values of mapping variables.
Type the following in Notepad and save it:
[foldername.sessionname]
$$inputvalue1=
--------------------------------
Parameter files are created with the extension .PRM.

They are created to pass values that can be changed for mapping parameters and session
parameters during the mapping run.

Mapping Parameters:
A parameter can be given a value in the parameter file only if it has already been created in the mapping
with a data type, precision and scale.

The Mapping parameter file syntax (xxxx.prm).


[FolderName.WF:WorkFlowName.ST:SessionName]
$$ParameterName1=Value
$$ParameterName2=Value

After that we have to select the properties Tab of Session and Set Parameter file name including
physical path of this xxxx.prm file.

Session Parameters:
The Session Parameter files syntax (yyyy.prm).

[FolderName.SessionName]
$InputFileValue1=Path of the source Flat file

After that we have to select the properties Tab of Session and Set Parameter file name including
physical path of this yyyy.prm file.

Make the following changes in the Mapping tab, in the Source Qualifier's Properties section:

Attribute                 Value
Source file type          Direct
Source file directory     Empty
Source file name          $InputFileValue1
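To make the layout of such a file concrete, here is a small Python sketch that reads text in the heading-and-assignment shape shown above. It is not an Informatica tool; parse_parameter_file is a hypothetical helper, and the path /data/in/source.txt is an invented example value.

```python
def parse_parameter_file(text):
    """Return {section_heading: {name: value}} for parameter-file-style text."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A heading such as [FolderName.SessionName] opens a new section.
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            name, value = line.split("=", 1)
            current[name.strip()] = value.strip()
    return sections


sample = """
[FolderName.WF:WorkFlowName.ST:SessionName]
$$ParameterName1=Value
$$ParameterName2=Value

[FolderName.SessionName]
$InputFileValue1=/data/in/source.txt
"""

params = parse_parameter_file(sample)
print(params)
```

Each heading scopes the assignments below it, which is why the same file can carry mapping parameters ($$ prefix) for one session and session parameters ($ prefix) for another.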

Q. What is the default data driven operation in informatica?


Data Driven is the default option when the mapping contains an Update Strategy transformation.
The Integration Service follows the instructions coded in the Update Strategy transformations within the session
mapping to determine how to flag records for insert, delete, update or reject. If you do not choose the Data
Driven option, the Integration Service ignores Update Strategy transformations in the mapping.
Q. What is threshold error in informatica?
When an Update Strategy flags rows for the target with DD_REJECT or DD_UPDATE and a limit on the error count
is set, the session ends with failed status if the number of rejected records exceeds that count.
This error is called a threshold error.

Q. So many times I saw "$PM parser error". What is meant by PM?
PM: PowerMart
1) A parsing error occurs for the input parameter to the lookup.
2) Informatica is not able to resolve the input parameter CLASS for your lookup.
3) Check that the port CLASS exists as either an input port or a variable port in your expression.
4) Check the data type of CLASS and the data type of the input parameter for your lookup.

Q. What is a candidate key?


A candidate key is a combination of attributes that can be used to uniquely identify a database
record without any extraneous data. Each table may have one or more candidate keys.
One of these candidate keys is selected as the table's primary key; the others are called alternate keys.

Q. What is the difference between Bitmap and Btree index?


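A bitmap index is used for columns with few distinct, frequently repeating values (low cardinality).
ex: Gender: male/female
Account status: Active/Inactive
A B-tree index is used for columns with unique or nearly unique values (high cardinality).
ex: empid.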
Q. What is ThroughPut in Informatica?
Throughput is the rate at which the PowerCenter server reads rows (in bytes) from the source or writes
rows (in bytes) into the target per second.

You can find it in the Workflow Monitor. Right-click the session, choose Properties, and open the
Source/Target Statistics tab to see throughput details for each source and
target instance.

Q. What are set operators in Oracle


UNION
UNION ALL
MINUS
INTERSECT
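The behaviour of the four operators can be checked with Python's sqlite3 module. This is a sketch on two invented single-column tables; note that SQLite spells Oracle's MINUS as EXCEPT, which is an assumption of this demo rather than Oracle syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (n INTEGER);
    CREATE TABLE b (n INTEGER);
    INSERT INTO a VALUES (1), (2), (2), (3);
    INSERT INTO b VALUES (2), (3), (4);
""")


def run(operator):
    """Combine table a and table b with the given set operator."""
    sql = f"SELECT n FROM a {operator} SELECT n FROM b ORDER BY n"
    return [row[0] for row in conn.execute(sql)]


union_rows = run("UNION")          # distinct rows from both tables
union_all_rows = run("UNION ALL")  # every row, duplicates kept
minus_rows = run("EXCEPT")         # rows in a that are not in b (Oracle: MINUS)
intersect_rows = run("INTERSECT")  # rows present in both tables

print(union_rows, union_all_rows, minus_rows, intersect_rows)
```

UNION ALL is the only one of the four that keeps duplicates; the other three return distinct rows.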

Q. How can I schedule an Informatica job with the Unix cron scheduling tool?

Crontab
The crontab command (cron derives from chronos, Greek for time; tab stands for table), found in
Unix and Unix-like operating systems, is used to schedule commands to be executed periodically.
To see which cron jobs are currently scheduled on your system, open a terminal and run:
sudo crontab -l
To edit the list of cron jobs, run:
sudo crontab -e
This opens the default editor (it could be vi or pico; you can change the default
editor) to let us edit the crontab. When you save and exit the editor, all your cron jobs are saved
into the crontab. Cron jobs are written in the following format:
* * * * * /bin/execute/this/script.sh
Scheduling explained
As you can see there are 5 stars. The stars represent different date parts in the following order:
1. minute (from 0 to 59)
2. hour (from 0 to 23)
3. day of month (from 1 to 31)
4. month (from 1 to 12)
5. day of week (from 0 to 6) (0=Sunday)

Execute every minute

A star, or asterisk, in a field means "every". Take the previous example
again:
* * * * * /bin/execute/this/script.sh
All five fields are asterisks, so this means
execute /bin/execute/this/script.sh:
1. every minute
2. of every hour
3. of every day of the month
4. of every month
5. and every day in the week.

In short: this script is executed every minute, without exception.

Execute every Friday 1AM

To schedule the script to run at 1AM every
Friday, we need the following cron job:
0 1 * * 5 /bin/execute/this/script.sh
The script is now executed when the system
clock hits:
1. minute: 0
2. of hour: 1
3. of day of month: * (every day of month)
4. of month: * (every month)
5. and weekday: 5 (=Friday)

Execute on weekdays 1AM

To schedule the script to run at 1AM every weekday (Monday to Friday), we need the following
cron job:
0 1 * * 1-5 /bin/execute/this/script.sh
The script is now executed when the system
clock hits:
1. minute: 0
2. of hour: 1
3. of day of month: * (every day of month)
4. of month: * (every month)
5. and weekday: 1-5 (=Monday to Friday)

Execute at 10 past every hour on the 1st of every month

Here is another one, for practice:
10 * 1 * * /bin/execute/this/script.sh
It takes some getting used to, but it offers great flexibility.
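To check your reading of a schedule, the five fields can be evaluated against a date in a few lines of Python. This is a simplified sketch, not cron itself: cron_matches is a hypothetical helper that understands only *, single numbers, ranges such as 1-5 and comma lists, and it ignores step values and month or day names.

```python
from datetime import datetime


def _field_matches(field, value):
    """True if value satisfies one cron field: '*', 'n', 'a-b' or a comma list."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression, when):
    """True if the five-field cron expression fires at the given datetime."""
    minute, hour, day_of_month, month, day_of_week = expression.split()
    cron_weekday = when.isoweekday() % 7  # cron counts Sunday as 0
    time_ok = (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(month, when.month)
    )
    dom_ok = _field_matches(day_of_month, when.day)
    dow_ok = _field_matches(day_of_week, cron_weekday)
    # When both day fields are restricted, standard cron fires if either matches.
    if day_of_month != "*" and day_of_week != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return time_ok and day_ok


# 5 January 2024 was a Friday.
print(cron_matches("0 1 * * 5", datetime(2024, 1, 5, 1, 0)))
```

The same function can be called with the weekday and first-of-month expressions from the examples above to confirm when they fire.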

Q. Can anyone tell me the difference between persistence and dynamic caches? On
which conditions we are using these caches?

Dynamic:
1) When you use a dynamic cache, the Informatica Server updates the lookup cache as it passes
rows to the target.
2) With a dynamic cache, we can update the cache with new data as well.
3) A dynamic cache is not reusable.
(Use a dynamic cache only when we need updated cache data.)

Persistent:
1) A Lookup transformation can be configured to use a non-persistent or persistent cache. The PowerCenter Server
saves or deletes lookup cache files after a successful session based on the Lookup Cache
Persistent property.
2) With a persistent cache, we are not able to update the cache with new data.
3) A persistent cache is reusable.

(Use a persistent cache only when we need the previous cache data.)
---------------------------------
A few more additions to the above answer:
1. A dynamic lookup allows modifying the cache, whereas a persistent lookup does not allow us to modify
the cache.
2. A dynamic lookup uses NewLookupRow, a default port in the cache, but a persistent lookup does not use any
default ports in the cache.
3. When the session completes, the dynamic cache is removed, but the persistent cache is saved on the Informatica
PowerCenter server.

Q. How to obtain performance data for individual transformations?


There is a property at session level Collect Performance Data, you can select that property. It
gives you performance details for all the transformations.

Q. List of Active and Passive Transformations in Informatica?

Active Transformation - An active transformation changes the number of rows that pass through
the mapping.
Source Qualifier Transformation
Sorter Transformations
Aggregator Transformations
Filter Transformation
Union Transformation
Joiner Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Update Strategy Transformation
Advanced External Procedure Transformation
Passive Transformation - Passive transformations do not change the number of rows that pass
through the mapping.
Expression Transformation
Sequence Generator Transformation
Lookup Transformation
Stored Procedure Transformation
XML Source Qualifier Transformation
External Procedure Transformation
Q. Eliminating of duplicate records without using dynamic lookups?
You can find duplicate records with a simple one-line SQL query:
Select id, count (*) from seq1 group by id having count (*)>1;
Below are the ways to eliminate the duplicate records:
1. By enabling the Select Distinct option in the Source Qualifier transformation.
2. By enabling the Distinct option in the Sorter transformation.
3. By using all the ports as group-by ports in the Aggregator transformation.
Q. Can anyone give idea on how do we perform test load in informatica? What do we
test as part of test load in informatica?
With a test load, the Informatica Server reads and transforms data without writing to targets. The
Informatica Server does everything else as if running the full session. For relational targets, it writes
the data but rolls it back when the session completes. So you can enable the
Collect Performance Details property and analyze how efficient your mapping is. If the session
runs for a long time, you may want to find out which bottlenecks exist. It may be
a target, source or mapping bottleneck, etc.

The basic idea behind a test load is to see the behavior of the Informatica Server with your session.
Q. What is ODS (Operational Data Store)?

A collection of operational or base data that is extracted from operational databases and
standardized, cleansed, consolidated, transformed, and loaded into the enterprise data architecture.
An ODS is used to support data mining of operational data, or as the store for base data that is
summarized for a data warehouse.
The ODS may also be used to audit the data warehouse to assure that summarized and derived data is
calculated properly. The ODS may further become the enterprise shared operational database,
allowing operational systems that are being reengineered to use the ODS as their operational
database.
Q. How many tasks are there in informatica?

Session Task

Email Task

Command Task

Assignment Task

Control Task

Decision Task

Event-Raise

Event- Wait

Timer Task

Link Task
Q. What are business components in Informatica?

Domains

Nodes

Services

Q. What is versioning?
It is used to keep a history of the changes made to mappings and workflows.
1. Check in: You check in when you are done with your changes so that everyone can see those
changes.
2. Check out: You check out from the main stream when you want to make any change to the
mapping/workflow.
3. Version history: It shows all the changes made and who made them.

Q. Diff between $$$sessstarttime and sessstarttime?

$$$SessStartTime - Returns session start time as a string value (String datatype)


SESSSTARTTIME - Returns the date along with date timestamp (Date datatype)
Q. Difference between $, $$, $$$ in Informatica?
1. $ refers to
system variables/session parameters like $Bad file, $input
file, $output file, $DB connection, $source, $target, etc.
2. $$ refers to
user-defined variables/mapping parameters like $$State, $$Time, $$Entity, $$Business_Date,
$$SRC, etc.
3. $$$ refers to
system parameters like $$$SessStartTime.
$$$SessStartTime returns the session start time as a string value. The format of the
string depends on the database you are using.
Q. Finding Duplicate Rows based on Multiple Columns?
SELECT firstname, surname, email, COUNT(*) FROM
employee
GROUP BY firstname, surname, email
HAVING COUNT(*) > 1;
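The grouping approach can be tried on a throwaway table with Python's sqlite3 module. This is a sketch on invented rows; it uses a single COUNT(*) per group, which is enough because every row in a group shares the same three column values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (firstname TEXT, surname TEXT, email TEXT);
    INSERT INTO employee VALUES
        ('Ann', 'Lee',  'ann@example.com'),
        ('Ann', 'Lee',  'ann@example.com'),
        ('Ann', 'Lee',  'ann.lee@example.com'),
        ('Bob', 'Khan', 'bob@example.com');
""")

# A group is a duplicate only when all three columns repeat together.
duplicates = conn.execute("""
    SELECT firstname, surname, email, COUNT(*) AS occurrences
    FROM employee
    GROUP BY firstname, surname, email
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

The third Ann Lee row is not reported because its email differs, which shows that the match is on the combination of columns and not on each column separately.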

Q. Finding Nth Highest Salary in Oracle?


Pick out the Nth highest salary, say the 4th highest salary.
Select * from
(select ename,sal,dense_rank() over (order by sal desc) emp_rank from emp)
where emp_rank=4;
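Q. Find out the third highest salary?
The following query uses TOP, which is SQL Server (Transact-SQL) syntax and is not available in Oracle:
SELECT MIN(sal) FROM emp WHERE
sal IN (SELECT DISTINCT TOP 3 sal FROM emp ORDER BY sal DESC);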
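The same question can be answered for any N with a small sketch in Python's sqlite3 module. Note the assumptions: the salaries are invented, nth_highest_salary is a hypothetical helper, and LIMIT with OFFSET is SQLite syntax standing in for TOP or DENSE_RANK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("A", 5000), ("B", 4000), ("C", 4000), ("D", 3000), ("E", 2000)],
)


def nth_highest_salary(conn, n):
    """Return the Nth highest distinct salary, or None if there are fewer than N."""
    row = conn.execute(
        "SELECT DISTINCT sal FROM emp ORDER BY sal DESC LIMIT 1 OFFSET ?",
        (n - 1,),
    ).fetchone()
    return row[0] if row else None


third = nth_highest_salary(conn, 3)
print(third)
```

DISTINCT makes tied salaries count once, which matches the dense-rank reading of "Nth highest" rather than the Nth row.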

Q. How do you handle error logic in Informatica? What are the transformations that you
used while handling errors? How did you reload those error records in target?
Row indicator: It generally appears when working with an Update Strategy transformation. The
writer/target rejects the rows going to the target.

Column indicator:
D - Valid
O - Overflow
N - Null
T - Truncated
When the data contains nulls or overflows, it is rejected instead of being written to the target.
The rejected data is stored in reject files. You can check the data and reload it into the target
using the reject reload utility.
Q. Difference between STOP and ABORT?
Stop - If the Integration Service is executing a Session task when you issue the stop command,
the Integration Service stops reading data. It continues processing and writing data and
committing data to targets. If the Integration Service cannot finish processing and committing
data, you can issue the abort command.
Abort - The Integration Service handles the abort command for the Session task like the stop
command, except it has a timeout period of 60 seconds. If the Integration Service cannot finish
processing and committing data within the timeout period, it kills the DTM process and terminates
the session.
Q. What is inline view?

An inline view is the term given to a subquery in the FROM clause of a query, which can be used as a table.
An inline view is effectively a named subquery.
Ex: Select Tab1.col1, Tab1.col2, Inview.col1, Inview.col2
From Tab1, (Select statement) Inview
Where Tab1.col1 = Inview.col1
SELECT D.DNAME, E.ENAME, E.SAL FROM EMP E,
(SELECT DNAME, DEPTNO FROM DEPT) D
WHERE E.DEPTNO = D.DEPTNO
In the above query (SELECT DNAME, DEPTNO FROM DEPT) D is the inline view.
Inline views are determined at runtime, and in contrast to normal views they are not stored in the
data dictionary.

Disadvantages of using a normal (stored) view:

1. A separate view needs to be created, which is an overhead
2. Extra time is taken in parsing the view
An inline view solves this problem by using a select statement in a subquery and using that as a
table.

Advantages of using inline views:

1. Better query performance
2. Better visibility of code
Practical uses of inline views:
1. Joining grouped data with non-grouped data
2. Getting data to use in another query
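The first practical use, joining grouped data with non-grouped data, can be sketched with Python's sqlite3 module. The emp rows are invented sample data, and the inline view d computes one average salary per department that is then joined back to the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (ename TEXT, sal INTEGER, deptno INTEGER);
    INSERT INTO emp VALUES ('A', 1000, 10), ('B', 3000, 10), ('C', 2000, 20);
""")

# The subquery in the FROM clause is the inline view; it exists only for this query.
rows = conn.execute("""
    SELECT e.ename, e.sal, d.avg_sal
    FROM emp e,
         (SELECT deptno, AVG(sal) AS avg_sal FROM emp GROUP BY deptno) d
    WHERE e.deptno = d.deptno
    ORDER BY e.ename
""").fetchall()

print(rows)
```

No view object is created in the database, so nothing has to be maintained or dropped afterwards.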
Q. What is generated key and generated column id in normalizer transformation?

The integration service increments the generated key (GK) sequence number each time it
processes a source row. When the source row contains a multiple-occurring column or a multiple-occurring
group of columns, the normalizer transformation returns a row for each occurrence. Each
row contains the same generated key value.

The normalizer transformation has a generated column ID (GCID) port for each multiple-occurring
column. The GCID is an index for the instance of the multiple-occurring data. For
example, if a column occurs 3 times in a source record, the normalizer returns a value of 1, 2 or 3
in the generated column ID.
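The relationship between the two values can be illustrated with a short Python sketch. It is not the Normalizer itself: normalize is a hypothetical function, and the store and quarterly sales columns (q1 to q3) are invented to play the role of a column that occurs 3 times.

```python
def normalize(rows, occurs_columns):
    """Yield one output row per occurrence, tagged with a GK and a GCID."""
    for gk, row in enumerate(rows, start=1):  # GK increments once per source row
        for gcid, column in enumerate(occurs_columns, start=1):  # GCID indexes the occurrence
            yield {"GK": gk, "GCID": gcid, "store": row["store"], "sales": row[column]}


source = [
    {"store": "S1", "q1": 10, "q2": 20, "q3": 30},
    {"store": "S2", "q1": 5, "q2": 15, "q3": 25},
]

output = list(normalize(source, ["q1", "q2", "q3"]))
for out_row in output:
    print(out_row)
```

Two source rows become six output rows: the three rows that came from the same source row share one GK, while GCID runs from 1 to 3 inside each group.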
Q. What is difference between SUBSTR and INSTR?

The INSTR function searches a string for a substring and returns an integer indicating the position of the
character in the string that is the first character of this occurrence.

The SUBSTR function returns a portion of a string, beginning at a given character position and substring_length
characters long. SUBSTR calculates lengths using characters as defined by the input character set.
Q. What are different Oracle database objects?

TABLES
VIEWS
INDEXES
SYNONYMS
SEQUENCES

TABLESPACES

Q. What is @@ERROR?
The @@ERROR automatic variable returns the error code of the last Transact-SQL statement. If
there was no error, @@ERROR returns zero. Because @@ERROR is reset after each Transact-SQL
statement, it must be saved to a variable if it is needed to process it further after checking it.
Q. What is difference between co-related sub query and nested sub query?

Correlated subquery runs once for each row selected by the outer query. It contains a reference
to a value from the row selected by the outer query.
Nested subquery runs only once for the entire nesting (outer) query. It does not contain any
reference to the outer query row.
For example,
Correlated Subquery:
Select e1.empname, e1.basicsal, e1.deptno from emp e1 where e1.basicsal = (select
max(basicsal) from emp e2 where e2.deptno = e1.deptno)
Nested Subquery:
Select empname, basicsal, deptno from emp where (deptno, basicsal) in (select deptno,
max(basicsal) from emp group by deptno)
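Both forms can be run against the same data to confirm that they return the same rows. This sketch uses Python's sqlite3 module with invented emp rows, and it assumes SQLite 3.15 or later for the (deptno, basicsal) IN (...) row-value comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (empname TEXT, basicsal INTEGER, deptno INTEGER);
    INSERT INTO emp VALUES
        ('A', 1000, 10), ('B', 3000, 10), ('C', 2000, 20), ('D', 1500, 20);
""")

# Correlated: the inner query refers to e1.deptno, so it depends on the outer row.
correlated = conn.execute("""
    SELECT e1.empname, e1.basicsal, e1.deptno
    FROM emp e1
    WHERE e1.basicsal = (SELECT MAX(basicsal) FROM emp e2 WHERE e2.deptno = e1.deptno)
    ORDER BY e1.deptno
""").fetchall()

# Nested: the inner query stands alone and can be evaluated once.
nested = conn.execute("""
    SELECT empname, basicsal, deptno
    FROM emp
    WHERE (deptno, basicsal) IN (SELECT deptno, MAX(basicsal) FROM emp GROUP BY deptno)
    ORDER BY deptno
""").fetchall()

print(correlated)
print(nested)
```

The result is the highest-paid employee of each department in both cases; only the way the inner query is tied to the outer query differs.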

Q. How does one escape special characters when building SQL queries?

The LIKE keyword allows for string searches. The _ wildcard character is used to match exactly
one character, and % is used to match zero or more occurrences of any characters. These characters
can be escaped in SQL. Example:
SELECT name FROM emp WHERE id LIKE '%\_%' ESCAPE '\';
Use two quotes for every one displayed. Example:
SELECT 'Franks''s Oracle site' FROM DUAL;
SELECT 'A ''quoted'' word.' FROM DUAL;
SELECT 'A ''''double quoted'''' word.' FROM DUAL;
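The effect of the ESCAPE clause and of doubled quotes can be seen with Python's sqlite3 module, which accepts the same LIKE ... ESCAPE syntax. The two id values are invented for the demonstration, and SQLite does not need a FROM DUAL clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("A_1", "with underscore"), ("AB1", "without underscore")],
)

# Escaped: \_ matches a literal underscore only.
escaped = conn.execute(
    r"SELECT name FROM emp WHERE id LIKE '%\_%' ESCAPE '\'"
).fetchall()

# Unescaped: _ is a wildcard for any single character, so every non-empty id matches.
unescaped = conn.execute("SELECT name FROM emp WHERE id LIKE '%_%'").fetchall()

# Two single quotes inside a string literal produce one quote in the result.
quoted = conn.execute("SELECT 'A ''quoted'' word.'").fetchone()[0]

print(escaped, len(unescaped), quoted)
```

Without the ESCAPE clause the pattern is far broader than intended, which is the usual reason for escaping in the first place.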

Q. Difference between Surrogate key and Primary key?


Surrogate key:
1. Query processing is fast.
2. It is only numeric
3. Developer develops the surrogate key using sequence generator transformation.
4. Eg: 12453

Primary key:
1. Query processing is slow
2. Can be alpha numeric
3. Source system gives the primary key.
4. Eg: C10999

Q. How does one eliminate duplicate rows in an Oracle Table?

Method 1:
DELETE from table_name A
where rowid > (select min(rowid) from table_name B where A.key_values = B.key_values);

Method 2:
Create table table_name2 as select distinct * from table_name1;
drop table table_name1;
rename table_name2 to table_name1;
In this method, all the indexes, constraints, triggers etc. have to be re-created.

Method 3:
DELETE from table_name t1
where exists (select 'x' from table_name t2 where t1.key_value = t2.key_value and t1.rowid >
t2.rowid);

Method 4:
DELETE from table_name where rowid not in (select max(rowid) from table_name group by
key_value);
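The rowid-based delete can be tried safely on an in-memory table with Python's sqlite3 module. This is a sketch of the approach in Method 4 on invented rows; it relies on SQLite's own rowid, which behaves like a unique row number but is not the same thing as an Oracle ROWID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_name (key_value TEXT, payload TEXT);
    INSERT INTO table_name VALUES ('k1', 'x'), ('k1', 'x'), ('k2', 'y'), ('k1', 'x');
""")

# Keep one row (the one with the highest rowid) per key_value and delete the rest.
conn.execute("""
    DELETE FROM table_name
    WHERE rowid NOT IN (SELECT MAX(rowid) FROM table_name GROUP BY key_value)
""")

remaining = conn.execute(
    "SELECT key_value, payload FROM table_name ORDER BY key_value"
).fetchall()

print(remaining)
```

Unlike Method 2, this keeps the original table in place, so its indexes and constraints do not have to be re-created.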

Q. Query to retrieve Nth row from an Oracle table?

The query is as follows:


select * from my_table where rownum <= n
MINUS
select * from my_table where rownum < n;

Q. How does the server recognize the source and target databases?
If it is relational - by using an ODBC connection
If it is a flat file - by using an FTP connection
Q. What are the different types of indexes supported by Oracle?
1. B-tree index
2. B-tree cluster index
3. Hash cluster index
4. Reverse key index
5. Bitmap index
6. Function Based index

Q. Types of Normalizer transformation?

There are two types of Normalizer transformation.


VSAM Normalizer transformation
A non-reusable transformation that is a Source Qualifier transformation for a COBOL source. The
Mapping Designer creates VSAM Normalizer columns from a COBOL source in a mapping. The
column attributes are read-only. The VSAM Normalizer receives a multiple-occurring source column
through one input port.

Pipeline Normalizer transformation


A transformation that processes multiple-occurring data from relational tables or flat files. You
might choose this option when you want to process multiple-occurring data from another
transformation in the mapping.
A VSAM Normalizer transformation has one input port for a multiple-occurring column. A pipeline
Normalizer transformation has multiple input ports for a multiple-occurring column.
When you create a Normalizer transformation in the Transformation Developer, you create a
pipeline Normalizer transformation by default. When you create a pipeline Normalizer
transformation, you define the columns based on the data the transformation receives from
another type of transformation such as a Source Qualifier transformation.
The Normalizer transformation has one output port for each single-occurring input port.
Q. What are all the transformation you used if source as XML file?

XML Source Qualifier

XML Parser

XML Generator

Q. List the files in ascending order in UNIX?

ls -lt (sort by last modified date, newest first)
ls -ltr (reverse order, oldest first)
ls -lS (sort by file size)
Q. How do you identify empty lines in a flat file in UNIX? How do you remove them?
grep -n '^$' filename (list the empty lines with their line numbers)
grep -v '^$' filename (print the file without the empty lines)
Q. How do you send the session report (.txt) to the manager after the session is completed?
Use the email variables: %a (attach the file), %g (attach the session log file)
Q. How to check all the running processes in UNIX?
$> ps -ef
Q. How can I display only the hidden files in the current directory?
ls -a|grep "^\."
Q. How to display the first 10 lines of a file?
# head -10 logfile
Q. How to display the last 10 lines of a file?
# tail -10 logfile

Q. How did you schedule sessions in your project?


1. Run once - Set 2 parameters, the date and time when the session should start.
2. Run every - The Informatica server runs the session at the regular interval we configure; parameters:
days, hours, minutes, end on, end after, forever.
3. Customized repeat - Repeat every 2 days, daily frequency in hours and minutes, every week, every
month.
Q. What is lookup override?
This feature is similar to entering a custom query in a Source Qualifier transformation. When
entering a Lookup SQL Override, you can enter the entire override, or generate and edit the
default SQL statement.
The lookup query override can include WHERE clause.
Q. What is Sql Override?
The Source Qualifier provides the SQL Query option to override the default query. You can enter
any SQL statement supported by your source database. You might enter your own SELECT
statement, or have the database perform aggregate calculations, or call a stored procedure or
stored function to read the data and perform some tasks.
Q. How to get sequence value using Expression?
v_temp = v_temp+1
o_seq = IIF(ISNULL(v_temp), 0, v_temp)
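Q. How to get Unique Record?
Source > SQ > SRT > EXP > FLT OR RTR > TGT
The Sorter orders the rows by eid so that duplicates are adjacent; the Filter or Router then passes only the
rows where flag_out = 'N'.
In Expression:
flag = DECODE(TRUE, eid = pre_eid, 'Y', 'N')
flag_out = flag
pre_eid = eid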
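The compare-with-previous-row logic is easy to trace in a short Python sketch. This is an illustration only: flag_duplicates is a hypothetical function, the eid values are invented, and the input is assumed to be sorted already, as it would be after the Sorter.

```python
def flag_duplicates(sorted_eids):
    """Return (eid, flag) pairs; flag is 'Y' when eid equals the previous row's eid."""
    flagged = []
    pre_eid = None  # plays the role of the variable port that remembers the last row
    for eid in sorted_eids:
        flag = "Y" if eid == pre_eid else "N"
        flagged.append((eid, flag))
        pre_eid = eid
    return flagged


flagged_rows = flag_duplicates([1, 1, 2, 3, 3, 3])
# Keeping only the rows flagged 'N' mirrors the Filter or Router step.
unique_rows = [eid for eid, flag in flagged_rows if flag == "N"]
print(unique_rows)
```

The order of evaluation matters: the flag is computed before pre_eid is overwritten, which is why the previous-value assignment comes last.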
Q. What are the different transaction levels available in transaction
control transformation (TCL)?

The following are the transaction levels or built-in variables:

TC_CONTINUE_TRANSACTION: The Integration Service does not perform any transaction
change for this row. This is the default value of the expression.

TC_COMMIT_BEFORE: The Integration Service commits the transaction, begins a new
transaction, and writes the current row to the target. The current row is in the new transaction.

TC_COMMIT_AFTER: The Integration Service writes the current row to the target, commits
the transaction, and begins a new transaction. The current row is in the committed transaction.

TC_ROLLBACK_BEFORE: The Integration Service rolls back the current transaction, begins
a new transaction, and writes the current row to the target. The current row is in the new
transaction.

TC_ROLLBACK_AFTER: The Integration Service writes the current row to the target, rolls
back the transaction, and begins a new transaction. The current row is in the rolled back
transaction.
Q. What is difference between grep and find?

Grep is used for finding a string in a file.

Syntax - grep <String> <filename>
Example - grep 'compu' details.txt
Displays every line in which the string compu is found.

Find is used to find a file or directory in a given path.

Syntax - find <path> -name <filename>
Example - find . -name 'compu*'
Displays all file names starting with compu.

Q. What are the difference between DDL, DML and DCL commands?

DDL is Data Definition Language statements

CREATE - to create objects in the database
ALTER - alters the structure of the database
DROP - delete objects from the database
TRUNCATE - remove all records from a table, including all spaces allocated for the records
COMMENT - add comments to the data dictionary

DML is Data Manipulation Language statements

SELECT - retrieve data from a database
INSERT - insert data into a table
UPDATE - updates existing data within a table
DELETE - deletes records from a table; the space for the records remains
CALL - call a PL/SQL or Java subprogram
EXPLAIN PLAN - explain access path to data
LOCK TABLE - control concurrency

DCL is Data Control Language statements

GRANT - gives users access privileges to the database
REVOKE - withdraw access privileges given with the GRANT command

Transaction control (TCL) statements are often listed together with DCL:

COMMIT - save work done
SAVEPOINT - identify a point in a transaction to which you can later roll back
ROLLBACK - restore the database to its state as of the last COMMIT
SET TRANSACTION - change transaction options, such as which rollback segment to use

Q. What is Stored Procedure?


A stored procedure is a named group of SQL statements that have been previously created and
stored in the server database. Stored procedures accept input parameters so that a single
procedure can be used over the network by several clients using different input data. And when
the procedure is modified, all clients automatically get the new version. Stored procedures reduce
network traffic and improve performance. Stored procedures can be used to help ensure the
integrity of the database.
Q. What is View?
A view is a tailored presentation of the data contained in one or more tables (or other views).
Unlike a table, a view is not allocated any storage space, nor does a view actually contain data;
rather, a view is defined by a query that extracts or derives data from the tables the view
references. These tables are called base tables.
Views present a different representation of the data that resides within the base tables. Views are
very powerful because they allow you to tailor the presentation of data to different types of users.
Views are often used to:

Provide an additional level of table security by restricting access to a predetermined set of rows and/or columns of a table

Hide data complexity

Simplify commands for the user

Present the data in a different perspective from that of the base table

Isolate applications from changes in definitions of base tables

Express a query that cannot be expressed without using a view

Q. What is Trigger?
A trigger is a SQL procedure that initiates an action when an event (INSERT, DELETE or UPDATE)
occurs. Triggers are stored in and managed by the DBMS. Triggers are used to maintain the
referential integrity of data by changing the data in a systematic fashion. A trigger cannot be
called or executed directly; the DBMS automatically fires the trigger as a result of a data
modification to the associated table. Triggers are similar to stored procedures in that both consist
of procedural logic that is stored at the database level. Stored procedures, however, are not
event-driven and are not attached to a specific table as triggers are. Stored procedures are
explicitly executed by invoking a CALL to the procedure, while triggers are implicitly executed. In
addition, triggers can also execute stored procedures.
Nested Trigger: A trigger can also contain INSERT, UPDATE and DELETE logic within itself, so
when the trigger is fired because of data modification it can also cause another data modification,
thereby firing another trigger. A trigger that contains data modification logic within itself is called a
nested trigger.
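A minimal sketch of a trigger firing implicitly, using Python's built-in sqlite3 module rather than any particular production DBMS. The table and trigger names (emp, emp_audit, trg_emp_sal) are made up for illustration, and trigger syntax differs between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (eid INTEGER PRIMARY KEY, sal INTEGER);
    CREATE TABLE emp_audit (eid INTEGER, old_sal INTEGER, new_sal INTEGER);
    -- Fired implicitly by the DBMS after every UPDATE of emp.sal
    CREATE TRIGGER trg_emp_sal AFTER UPDATE OF sal ON emp
    BEGIN
        INSERT INTO emp_audit VALUES (OLD.eid, OLD.sal, NEW.sal);
    END;
    INSERT INTO emp VALUES (1, 1000);
""")
conn.execute("UPDATE emp SET sal = 1200 WHERE eid = 1")  # no explicit call to the trigger
audit_rows = conn.execute("SELECT eid, old_sal, new_sal FROM emp_audit").fetchall()
print(audit_rows)
```

The UPDATE statement never mentions the trigger, yet the audit row appears, which is the implicit execution described above.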
Q. What is View?
A simple view can be thought of as a subset of a table. It can be used for retrieving data, as well
as updating or deleting rows. Rows updated or deleted in the view are updated or deleted in the
table the view was created with. It should also be noted that as data in the original table changes,
so does data in the view, as views are the way to look at part of the original table. The results of
using a view are not permanently stored in the database. The data accessed through a view is
actually constructed using standard T-SQL select command and can come from one to many
different base tables or even other views.
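The point that a view stores a query rather than data can be illustrated with Python's built-in sqlite3 module. This is a sketch with made-up table and view names; note that SQLite views are read-only, so it shows only the retrieval side and not updates through a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (eid INTEGER, ename TEXT, deptno INTEGER, sal INTEGER);
    INSERT INTO emp VALUES (1, 'Ann', 10, 1000), (2, 'Bob', 20, 2000);
    -- The view stores only this query, not the rows it returns
    CREATE VIEW emp_dept10 AS SELECT eid, ename FROM emp WHERE deptno = 10;
""")
before = conn.execute("SELECT * FROM emp_dept10").fetchall()
conn.execute("INSERT INTO emp VALUES (3, 'Cy', 10, 1500)")  # change the base table
after = conn.execute("SELECT * FROM emp_dept10").fetchall()
print(before, after)
```

The new employee shows up in the view without the view being touched, and the sal column stays hidden from anyone who queries only the view.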
Q. What is Index?
An index is a physical structure containing pointers to the data. Indices are created in an existing
table to locate rows more quickly and efficiently. It is possible to create an index on one or more
columns of a table, and each index is given a name. The users cannot see the indexes; they are
just used to speed up queries. Effective indexes are one of the best ways to improve performance
in a database application. A table scan happens when there is no index available to help a query.
In a table scan SQL Server examines every row in the table to satisfy the query results. Table scans
are sometimes unavoidable, but on large tables, scans have a severe impact on performance.
Clustered indexes define the physical sorting of a database table's rows in the storage media. For
this reason, each database table may have only one clustered index. Non-clustered indexes are
created outside of the database table and contain a sorted list of references to the table itself.
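The difference between a table scan and an index lookup can be observed with Python's built-in sqlite3 module. This sketch uses SQLite rather than SQL Server, so the plan wording differs from SQL Server's execution plans; the table and index names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (eid INTEGER, deptno INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [(i, i % 10) for i in range(1000)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable description
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT eid FROM emp WHERE deptno = 3"
plan_without_index = plan(query)   # no index available: every row is examined
conn.execute("CREATE INDEX idx_emp_deptno ON emp (deptno)")
plan_with_index = plan(query)      # the optimizer can now search the index
print(plan_without_index)
print(plan_with_index)
```

The first plan reports a scan of the table, and the second names idx_emp_deptno, although the query text is identical in both cases.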

Q. What is the difference between clustered and a non-clustered index?


A clustered index is a special type of index that reorders the way records in the table are
physically stored. Therefore a table can have only one clustered index. The leaf nodes of a clustered
index contain the data pages. A nonclustered index is a special type of index in which the logical
order of the index does not match the physical stored order of the rows on disk. The leaf node of a
nonclustered index does not consist of the data pages. Instead, the leaf nodes contain index rows.
Q. What is Cursor?
A cursor is a database object used by applications to manipulate data in a set on a row-by-row basis,
instead of the typical SQL commands that operate on all the rows in the set at one time.
In order to work with a cursor we need to perform some steps in the following order:

Declare cursor

Open cursor

Fetch row from the cursor

Process fetched row

Close cursor

Deallocate cursor
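The steps above can be sketched with a Python DB-API cursor from the built-in sqlite3 module. This is only an approximation of a T-SQL cursor: the DB-API has no separate deallocate step, and the table and data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (eid INTEGER, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [(1, 1000), (2, 2000), (3, 3000)])

cur = conn.cursor()                                   # declare
cur.execute("SELECT eid, sal FROM emp ORDER BY eid")  # open
raised = []
row = cur.fetchone()                                  # fetch
while row is not None:
    eid, sal = row
    raised.append((eid, sal * 2 if sal < 2000 else sal))  # process one row at a time
    row = cur.fetchone()
cur.close()                                           # close (no separate deallocate in Python)
print(raised)
```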

Q. What is the difference between a HAVING CLAUSE and a WHERE CLAUSE?


1. HAVING specifies a search condition for a group or an aggregate. HAVING can be used only with
the SELECT statement.
2. HAVING is typically used with a GROUP BY clause. When GROUP BY is not used, HAVING behaves
like a WHERE clause.
3. The HAVING clause filters groups after the GROUP BY clause has formed them. The WHERE clause
is applied to each row before the rows are grouped.
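A short sketch with Python's built-in sqlite3 module shows both clauses in one query. The emp table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (deptno INTEGER, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [(10, 500), (10, 3000), (20, 1500), (20, 1800), (30, 400)])

groups = conn.execute("""
    SELECT deptno, SUM(sal)
    FROM emp
    WHERE sal >= 1000          -- row filter, applied before grouping
    GROUP BY deptno
    HAVING SUM(sal) > 3000     -- group filter, applied to each aggregate
    ORDER BY deptno
""").fetchall()
print(groups)
```

WHERE first removes the rows with salaries 500 and 400. Department 10 is then left with a total of 3000, which fails the HAVING test, so only department 20 (total 3300) is returned.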

RANK CACHE
Sample Rank Mapping
When the Power Center Server runs a session with a Rank transformation, it compares an input
row with rows in the data cache. If the input row out-ranks a Stored row, the Power Center Server
replaces the stored row with the input row.
Example: Power Center caches the first 5 rows if we are finding the top 5 salaried employees. When
the 6th row is read, it compares it with the 5 rows in the cache and places it in the cache if needed.
1) RANK INDEX CACHE:
The index cache holds group information from the group by ports. If we are using Group By on
DEPTNO, then this cache stores values 10, 20, 30 etc.
All Group By Columns are in RANK INDEX CACHE. Ex. DEPTNO
2) RANK DATA CACHE:
It holds row data until the Power Center Server completes the ranking and is generally larger than
the index cache. To reduce the data cache size, connect only the necessary input/output ports to
subsequent transformations.
All Variable ports if there, Rank Port, All ports going out from RANK Transformations are stored in
RANK DATA CACHE.
Example: All ports except DEPTNO In our mapping example.
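The compare-and-replace behaviour of the rank cache can be sketched in Python with a small heap per group. This illustrates the idea only and is not how PowerCenter implements its caches; the function name and row layout are hypothetical.

```python
import heapq

def rank_top_n(rows, n):
    """rows: iterable of (deptno, ename, sal). Keeps the n highest salaries per DEPTNO."""
    cache = {}                     # group key (index cache) -> min-heap of rows (data cache)
    for deptno, ename, sal in rows:
        heap = cache.setdefault(deptno, [])
        if len(heap) < n:
            heapq.heappush(heap, (sal, ename))       # cache not full yet
        elif sal > heap[0][0]:
            heapq.heapreplace(heap, (sal, ename))    # input row out-ranks a stored row
    return {dept: sorted(heap, reverse=True) for dept, heap in cache.items()}

top = rank_top_n([(10, "A", 100), (10, "B", 300), (10, "C", 200), (20, "D", 50)], n=2)
print(top)
```

For department 10 the third row (salary 200) replaces the stored row with salary 100, so the cache never holds more than n rows per group.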

Aggregator Caches
1. The Power Center Server stores data in the aggregate cache until it completes Aggregate
calculations.
2. It stores group values in an index cache and row data in the data cache. If the Power Center
Server requires more space, it stores overflow values in cache files.
Note: The Power Center Server uses memory to process an Aggregator transformation with sorted
ports. It does not use cache memory. We do not need to configure cache memory for Aggregator
transformations that use sorted ports.
1) Aggregator Index Cache:
The index cache holds group information from the group by ports. If we are using Group By on
DEPTNO, then this cache stores values 10, 20, 30 etc.

All Group By Columns are in AGGREGATOR INDEX CACHE. Ex. DEPTNO

2) Aggregator Data Cache:


DATA CACHE is generally larger than the AGGREGATOR INDEX CACHE.
Columns in Data Cache:

Variable ports if any

Non group by input/output ports.

Non group by input ports used in non-aggregate output expression.

Port containing aggregate function

JOINER CACHES
Joiner always caches the MASTER table. We cannot disable caching. It builds Index cache and Data
Cache based on MASTER table.
1) Joiner Index Cache:
All Columns of MASTER table used in Join condition are in JOINER INDEX CACHE.

Example: DEPTNO in our mapping.


2) Joiner Data Cache:
Master columns that are not in the join condition but are used for output to another transformation
or target table are in the Data Cache.
Example: DNAME and LOC in our mapping example.

Lookup Cache Files


1. Lookup Index Cache:
Stores data for the columns used in the lookup condition.
2. Lookup Data Cache:

For a connected Lookup transformation, stores data for the connected output ports, not including
ports used in the lookup condition.

For an unconnected Lookup transformation, stores data from the return port.

OLTP and OLAP

Logical Data Modeling Vs Physical Data Modeling

Router Transformation And Filter Transformation


Source Qualifier And Lookup Transformation

Mapping And Mapplet

Joiner Transformation And Lookup Transformation

Dimension Table and Fact Table

Connected Lookup and Unconnected Lookup

Connected Lookup: Receives input values directly from the pipeline.
Unconnected Lookup: Receives input values from the result of a :LKP expression in another
transformation.

Connected Lookup: We can use a dynamic or static cache.
Unconnected Lookup: We can use a static cache.

Connected Lookup: Cache includes all lookup columns used in the mapping.
Unconnected Lookup: Cache includes all lookup/output ports in the lookup condition and the
lookup/return port.

Connected Lookup: If there is no match for the lookup condition, the Power Center Server returns
the default value for all output ports.
Unconnected Lookup: If there is no match for the lookup condition, the Power Center Server
returns NULL.

Connected Lookup: If there is a match for the lookup condition, the Power Center Server returns
the result of the lookup condition for all lookup/output ports.
Unconnected Lookup: If there is a match for the lookup condition, the Power Center Server
returns the result of the lookup condition into the return port.

Connected Lookup: Pass multiple output values to another transformation.
Unconnected Lookup: Pass one output value to another transformation.

Connected Lookup: Supports user-defined default values.
Unconnected Lookup: Does not support user-defined default values.

Cache Comparison

Persistent and Dynamic Caches

Dynamic
1) When you use a dynamic cache, the Informatica Server updates the lookup cache as it passes
rows to the target.
2) With a dynamic cache, we can also update the cache with new data.
3) A dynamic cache is not reusable.
(Use a dynamic cache only when we need updated cache data.)

Persistent
1) We can configure a Lookup transformation to use a non-persistent or persistent cache. The
PowerCenter Server saves or deletes lookup cache files after a successful session based on the
Lookup Cache Persistent property.
2) With a persistent cache, we cannot update the cache with new data.
3) A persistent cache is reusable.
(Use a persistent cache only when we need the cache data from a previous run.)
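A hedged Python sketch of the dynamic cache idea: the cache is updated as each row passes through, so a key that repeats within the same run is not inserted twice. The function name and the (id, name) row layout are hypothetical, and this is not how the Informatica Server represents its cache.

```python
def load_with_dynamic_cache(source_rows, cache):
    """cache: dict keyed by customer id, pre-loaded from the target (hypothetical layout)."""
    inserts, updates = [], []
    for cust_id, name in source_rows:
        if cust_id not in cache:
            cache[cust_id] = name          # cache is updated as the row passes to the target
            inserts.append((cust_id, name))
        elif cache[cust_id] != name:
            cache[cust_id] = name
            updates.append((cust_id, name))
    return inserts, updates

cache = {1: "Ann"}
inserts, updates = load_with_dynamic_cache([(2, "Bob"), (2, "Bob"), (1, "Anne")], cache)
print(inserts, updates)
```

With a static cache the second (2, "Bob") row would also look like a new customer, because the cache would still reflect the target as it was when the session started.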

View And Materialized View

Star Schema And Snow Flake Schema

Informatica - Transformations
In Informatica, Transformations help to transform the source data according to the requirements of
target system and it ensures the quality of the data being loaded into target.
Transformations are of two types: Active and Passive.

Active Transformation
An active transformation can change the number of rows that pass through it from source to
target, i.e., it can eliminate rows that do not meet the condition in the transformation.

Passive Transformation
A passive transformation does not change the number of rows that pass through it, i.e., it passes
all rows through the transformation.

Transformations can be Connected or Unconnected.

Connected Transformation
Connected transformation is connected to other transformations or directly to target table in the
mapping.

Unconnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is
called within another transformation, and returns a value to that transformation.

The following is the list of Transformations available in Informatica:


Aggregator Transformation
Expression Transformation
Filter Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Sequence Generator Transformation
Stored Procedure Transformation
Sorter Transformation
Update Strategy Transformation
XML Source Qualifier Transformation

In the following pages, we will explain all the above Informatica Transformations and their
significances in the ETL process in detail.
===========================================================
===================
Aggregator Transformation
Aggregator transformation is an Active and Connected transformation.
This transformation is useful to perform calculations such as averages and sums (mainly to
perform calculations on multiple rows or groups).
For example, to calculate total of daily sales or to calculate average of monthly or yearly sales.
Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, SUM etc. can be used in the
Aggregator transformation.
===========================================================
===================
Expression Transformation
Expression transformation is a Passive and Connected transformation.
This can be used to calculate values in a single row before writing to the target.
For example, to calculate the discount on each product, to concatenate first and last names,
or to convert a date to a string field.
===========================================================
===================
Filter Transformation
Filter transformation is an Active and Connected transformation.
This can be used to filter out rows in a mapping that do not meet the condition.
For example,
To find all the employees who are working in Department 10 or
To find the products that fall in the price range $500 to $1000.
===========================================================
===================
Joiner Transformation
Joiner Transformation is an Active and Connected transformation. This can be used to join two
sources coming from two different locations or from the same location. For example, to join a flat
file and a relational source, to join two flat files, or to join a relational source and an XML source.
In order to join two sources, there must be at least one matching port. When joining two sources,
one source must be specified as the master and the other as the detail.
The Joiner transformation supports the following types of joins:
1)Normal
2)Master Outer
3)Detail Outer
4)Full Outer
Normal join discards all the rows of data from the master and detail source that do not match,
based on the condition.
Master outer join discards all the unmatched rows from the master source and keeps all the
rows from the detail source and the matching rows from the master source.
Detail outer join keeps all rows of data from the master source and the matching rows from the
detail source. It discards the unmatched rows from the detail source.
Full outer join keeps all rows of data from both the master and detail sources.
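The four join types can be compared with a small Python sketch. It is an illustration of the row-keeping rules only, with a hypothetical function name and (key, value) row layout, not a model of how the Joiner caches or streams data.

```python
def joiner(master, detail, join_type="normal"):
    """master, detail: lists of (key, value). Returns (key, master_value, detail_value) rows."""
    master_keys = {k for k, _ in master}
    detail_keys = {k for k, _ in detail}
    # Matching rows are part of every join type
    out = [(dk, mv, dv) for dk, dv in detail for mk, mv in master if mk == dk]
    if join_type in ("master_outer", "full_outer"):
        # Master outer: all detail rows survive, unmatched master rows are discarded
        out += [(k, None, v) for k, v in detail if k not in master_keys]
    if join_type in ("detail_outer", "full_outer"):
        # Detail outer: all master rows survive, unmatched detail rows are discarded
        out += [(k, v, None) for k, v in master if k not in detail_keys]
    return out

master = [(10, "ACCOUNTING"), (40, "OPERATIONS")]   # e.g. a DEPT source
detail = [(10, "KING"), (50, "SMITH")]              # e.g. an EMP source
for jt in ("normal", "master_outer", "detail_outer", "full_outer"):
    print(jt, joiner(master, detail, jt))
```

The naming is easy to get backwards: a master outer join keeps every detail row, and a detail outer join keeps every master row.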
===========================================================
===================
Lookup transformation
Lookup transformation is Passive and can be either Connected or Unconnected. It is used to look
up data in a relational table, view, or synonym. The lookup definition can be imported from either
source or target tables.
For example, suppose we want to retrieve all the sales of the product with ID 10, and the sales
data resides in another table. Instead of using the sales table as one more source, use a Lookup
transformation to look up the data for the product with ID 10 in the sales table.
Connected lookup receives input values directly from the mapping pipeline, whereas
Unconnected lookup receives values from a :LKP expression in another transformation.
Connected lookup returns multiple columns from the same row, whereas
Unconnected lookup has one return port and returns one column from each row.
Connected lookup supports user-defined default values, whereas
Unconnected lookup does not support user-defined default values.
===========================================================
===================
Normalizer Transformation
Normalizer Transformation is an Active and Connected transformation.
It is used mainly with COBOL sources where most of the time data is stored in de-normalized
format.
Also, Normalizer transformation can be used to create multiple rows from a single row of data.
===========================================================
===================
Rank Transformation
Rank transformation is an Active and Connected transformation.
It is used to select the top or bottom rank of data.
For example,
To select top 10 Regions where the sales volume was very high
or
To select 10 lowest priced products.
===========================================================
===================
Router Transformation
Router is an Active and Connected transformation. It is similar to filter transformation.
The only difference is, filter transformation drops the data that do not meet the condition whereas
router has an option to capture the data that do not meet the condition. It is useful to test multiple
conditions.
It has input, output and default groups.
For example, if we want to split data into groups where State = 'Michigan', State = 'California',
State = 'New York' and all other states, it is easy to route each group to a different table.
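A minimal Python sketch of the routing idea, with user-defined groups plus a default group. The function name, group names, and row layout are hypothetical.

```python
def route(rows, groups):
    """groups: dict of group name -> predicate. Rows matching no group go to DEFAULT."""
    outputs = {name: [] for name in groups}
    outputs["DEFAULT"] = []
    for row in rows:
        matched = False
        for name, predicate in groups.items():
            if predicate(row):          # a row is sent to every group whose condition it meets
                outputs[name].append(row)
                matched = True
        if not matched:
            outputs["DEFAULT"].append(row)
    return outputs

groups = {
    "MICHIGAN": lambda r: r["state"] == "Michigan",
    "CALIFORNIA": lambda r: r["state"] == "California",
    "NEW_YORK": lambda r: r["state"] == "New York",
}
rows = [{"state": "Michigan"}, {"state": "Ohio"}, {"state": "New York"}]
routed = route(rows, groups)
print(routed)
```

A Filter would simply drop the Ohio row; here it is captured in the default group and can still be loaded somewhere.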
===========================================================
===================

Sequence Generator Transformation


Sequence Generator transformation is a Passive and Connected transformation. It is used to create
unique primary key values or cycle through a sequential range of numbers or to replace missing
keys.
It has two output ports, CURRVAL and NEXTVAL, to connect to other transformations (you cannot
add ports to this transformation).
The NEXTVAL port generates a sequence of numbers when it is connected to a transformation or target.
CURRVAL is NEXTVAL plus the Increment By value, that is, NEXTVAL plus one when Increment By is 1.
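The relationship between the two ports can be sketched as a Python generator. The parameter names (start, increment_by, end, cycle) are stand-ins chosen for this illustration, and stopping quietly at the end value is a simplification of what a real session does.

```python
from itertools import islice

def sequence_generator(start=1, increment_by=1, end=None, cycle=False):
    """Yields (NEXTVAL, CURRVAL) pairs, where CURRVAL = NEXTVAL + increment_by."""
    value = start
    while True:
        yield value, value + increment_by
        value += increment_by
        if end is not None and value > end:
            if not cycle:
                return          # simplification: just stop when the range is exhausted
            value = start       # cycle back to the start of the range

pairs = list(islice(sequence_generator(start=1, increment_by=1, end=3, cycle=True), 5))
print(pairs)
```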
===========================================================
===================
Sorter Transformation
Sorter transformation is a Connected and an Active transformation.
It allows sorting data in either ascending or descending order according to a specified field.
It can also be configured for case-sensitive sorting and to output only distinct rows.
===========================================================
===================
Source Qualifier Transformation
Source Qualifier transformation is an Active and Connected transformation. When adding a
relational or a flat file source definition to a mapping, it must be connected to a Source Qualifier
transformation.
The Source Qualifier performs various tasks, such as:
Overriding the default SQL query
Filtering records
Joining data from two or more tables
===========================================================
===================
Stored Procedure Transformation
Stored Procedure transformation is a Passive and Connected & Unconnected transformation. It is
useful to automate time-consuming tasks and it is also used in error handling, to drop and recreate
indexes and to determine the space in database, a specialized calculation etc.
The stored procedure must exist in the database before creating a Stored Procedure
transformation, and the stored procedure can exist in a source, target, or any database with a
valid connection to the Informatica Server. Stored Procedure is an executable script with SQL
statements and control statements, user-defined variables and conditional statements.
===========================================================
===================
Update Strategy Transformation
Update strategy transformation is an Active and Connected transformation.
It is used to update data in the target table, either to maintain a history of the data or to keep
only the most recent changes.
You can specify how to treat source rows in the target table: insert, update, delete, or data driven.
===========================================================
===================
XML Source Qualifier Transformation
XML Source Qualifier is a Passive and Connected transformation.
XML Source Qualifier is used only with an XML source definition.
It represents the data elements that the Informatica Server reads when it executes a session with
XML sources.
===========================================================
===================

Constraint-Based Loading
In the Workflow Manager, you can specify constraint-based loading for a session. When you select
this option, the Integration Service orders the target load on a row-by-row basis. For every row
generated by an active source, the Integration Service loads the corresponding transformed row
first to the primary key table, then to any foreign key tables. Constraint-based loading depends on
the following requirements:
Active source: Related target tables must have the same active source.
Key relationships: Target tables must have key relationships.
Target connection groups: Targets must be in one target connection group.
Treat rows as insert. Use this option when you insert into the target. You cannot use updates with
constraint based loading.
Active Source:
When target tables receive rows from different active sources, the Integration Service reverts to
normal loading for those tables, but loads all other targets in the session using constraint-based
loading when possible. For example, a mapping contains three distinct pipelines. The first two
contain a source, source qualifier, and target. Since these two targets receive data from different
active sources, the Integration Service reverts to normal loading for both targets. The third
pipeline contains a source, Normalizer, and two targets. Since these two targets share a single
active source (the Normalizer), the Integration Service performs constraint-based loading: loading
the primary key table first, then the foreign key table.
Key Relationships:
When target tables have no key relationships, the Integration Service does not perform
constraint-based loading.
Similarly, when target tables have circular key relationships, the Integration Service reverts to a
normal load. For example, you have one target containing a primary key and a foreign key related
to the primary key in a second target. The second target also contains a foreign key that
references the primary key in the first target. The Integration Service cannot enforce
constraint-based loading for these tables. It reverts to a normal load.
Target Connection Groups:
The Integration Service enforces constraint-based loading for targets in the same target
connection group. If you want to specify constraint-based loading for multiple targets that receive
data from the same active source, you must verify the tables are in the same target connection
group. If the tables with the primary key-foreign key relationship are in different target connection
groups, the Integration Service cannot enforce constraint-based loading when you run the
workflow. To verify that all targets are in the same target connection group, complete the following
tasks:

Verify all targets are in the same target load order group and receive data from the same
active source.

Use the default partition properties and do not add partitions or partition points.

Define the same target type for all targets in the session properties.


Define the same database connection name for all targets in the session properties.

Choose normal mode for the target load type for all targets in the session properties.
Treat Rows as Insert:
Use constraint-based loading when the session option Treat Source Rows As is set to insert. You
might get inconsistent data if you select a different Treat Source Rows As option and you configure
the session for constraint-based loading.
When the mapping contains Update Strategy transformations and you need to load data to a
primary key table first, split the mapping using one of the following options:

Load primary key table in one mapping and dependent tables in another mapping. Use
constraint-based loading to load the primary table.

Perform inserts in one mapping and updates in another mapping.


Constraint-based loading does not affect the target load ordering of the mapping. Target load
ordering defines the order the Integration Service reads the sources in each target load order
group in the mapping. A target load order group is a collection of source qualifiers,
transformations, and targets linked together in a mapping. Constraint based loading establishes
the order in which the Integration Service loads individual targets within a set of targets receiving
data from a single source qualifier.

Example
The following mapping is configured to perform constraint-based loading:
In the first pipeline, target T_1 has a primary key, and T_2 and T_3 contain foreign keys referencing
the T_1 primary key. T_3 has a primary key that T_4 references as a foreign key.
Since these tables receive records from a single active source, SQ_A, the Integration Service
loads rows to the target in the following order:
1. T_1
2. T_2 and T_3 (in no particular order)
3. T_4
The Integration Service loads T_1 first because it has no foreign key dependencies and contains a
primary key referenced by T_2 and T_3. The Integration Service then loads T_2 and T_3, but since
T_2 and T_3 have no dependencies on each other, they are not loaded in any particular order. The
Integration Service loads T_4 last, because it has a foreign key that references a primary key in T_3.
After loading the first set of targets, the Integration Service begins reading source B. If there are
no key relationships between T_5 and T_6, the Integration Service reverts to a normal load for both
targets.
If T_6 has a foreign key that references a primary key in T_5, since T_5 and T_6 receive data from a
single active source, the Aggregator AGGTRANS, the Integration Service loads rows to the tables in
the following order:
T_5
T_6
T_1, T_2, T_3, and T_4 are in one target connection group if you use the same database connection
for each target, and you use the default partition properties. T_5 and T_6 are in another target
connection group together if you use the same database connection for each target and you use
the default partition properties. The Integration Service includes T_5 and T_6 in a different target
connection group because they are in a different target load order group from the first four targets.
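The load order in this example follows from the key relationships alone, which can be sketched as a dependency ordering in Python. This is an illustration under the assumption that each target lists the targets whose primary keys it references; the function name is hypothetical, and the real Integration Service applies the ordering row by row rather than table by table.

```python
def load_levels(foreign_keys):
    """foreign_keys: dict of target -> set of targets whose primary keys it references."""
    remaining = {t: set(parents) for t, parents in foreign_keys.items()}
    levels = []
    while remaining:
        # Targets whose referenced tables have all been loaded are ready
        ready = sorted(t for t, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("circular key relationship: revert to a normal load")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for parents in remaining.values():
            parents.difference_update(ready)
    return levels

levels = load_levels({"T_1": set(), "T_2": {"T_1"}, "T_3": {"T_1"}, "T_4": {"T_3"}})
print(levels)
```

Targets that share a level, such as T_2 and T_3, have no dependency on each other and can load in any order. A circular relationship leaves no target ready, which corresponds to the case where constraint-based loading cannot be enforced.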
Enabling Constraint-Based Loading:
When you enable constraint-based loading, the Integration Service orders the target load on a
row-by-row basis. To enable constraint-based loading:
1. In the General Options settings of the Properties tab, choose Insert for the Treat Source Rows
As property.
2. Click the Config Object tab. In the Advanced settings, select Constraint Based Load Ordering.
3. Click OK.

Target Load Plan


When you use a mapplet in a mapping, the Mapping Designer lets you set the target load plan for
sources within the mapplet.
Setting the Target Load Order
You can configure the target load order for a mapping containing any type of target definition. In
the Designer, you can set the order in which the Integration Service sends rows to targets in
different target load order groups in a mapping. A target load order group is the collection of
source qualifiers, transformations, and targets linked together in a mapping. You can set the target
load order if you want to maintain referential integrity when inserting, deleting, or updating tables
that have the primary key and foreign key constraints.
The Integration Service reads sources in a target load order group concurrently, and it processes
target load order groups sequentially.
To specify the order in which the Integration Service sends data to targets, create one source
qualifier for each target within a mapping. To set the target load order, you then determine in
which order the Integration Service reads each source in the mapping.
The following figure shows two target load order groups in one mapping:
In this mapping, the first target load order group includes ITEMS, SQ_ITEMS, and T_ITEMS. The
second target load order group includes all other objects in the mapping, including the
TOTAL_ORDERS target. The Integration Service processes the first target load order group, and
then the second target load order group.
When it processes the second target load order group, it reads data from both sources at the same
time.
To set the target load order:
1. Create a mapping that contains multiple target load order groups.
2. Click Mappings > Target Load Plan.
The Target Load Plan dialog box lists all Source Qualifier transformations in the mapping and
the targets that receive data from each source qualifier.
3. Select a source qualifier from the list.
4. Click the Up and Down buttons to move the source qualifier within the load order.
5. Repeat steps 3 to 4 for other source qualifiers you want to reorder.
6. Click OK.

Mapping Parameters & Variables


Mapping parameters and variables represent values in mappings and mapplets.
When we use a mapping parameter or variable in a mapping, first we declare the mapping
parameter or variable for use in each mapplet or mapping. Then, we define a value for the
mapping parameter or variable before we run the session.
Mapping Parameters
A mapping parameter represents a constant value that we can define before running a session.
A mapping parameter retains the same value throughout the entire session.
Example: When we want to extract records of a particular month during the ETL process, we can
create a mapping parameter of a date/time data type and use it in the SQL override query to
compare it with the timestamp field.
After we create a parameter, it appears in the Expression Editor.
We can then use the parameter in any expression in the mapplet or mapping.
We can also use parameters in a source qualifier filter, user-defined join, or extract override, and in
the Expression Editor of reusable transformations.
Mapping Variables
Unlike mapping parameters, mapping variables are values that can change between sessions.

The Integration Service saves the latest value of a mapping variable to the repository at the
end of each successful session.

We can override a saved value with the parameter file.

We can also clear all saved values for the session in the Workflow Manager.

We might use a mapping variable to perform an incremental read of the source. For example, we
have a source table containing time stamped transactions and we want to evaluate the
transactions on a daily basis. Instead of manually entering a session override to filter source data
each time we run the session, we can create a mapping variable, $$IncludeDateTime. In the
source qualifier, create a filter to read only rows whose transaction date equals
$$IncludeDateTime, such as:
TIMESTAMP = $$IncludeDateTime
In the mapping, use a variable function to set the variable value to increment one day each time
the session runs. If we set the initial value of $$IncludeDateTime to 8/1/2004, the first time the
Integration Service runs the session, it reads only rows dated 8/1/2004. During the session, the
Integration Service sets $$IncludeDateTime to 8/2/2004. It saves 8/2/2004 to the repository at the
end of the session. The next time it runs the session, it reads only rows from August 2, 2004.
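The incremental read described above can be simulated in a few lines of Python. The repository is represented by a plain dict and run_session is a hypothetical stand-in for a successful session run; none of this is Informatica code.

```python
from datetime import date, timedelta

def run_session(source_rows, repository):
    """repository: dict standing in for the saved variable value (hypothetical)."""
    include_date = repository["$$IncludeDateTime"]             # start value for this run
    read = [row for row in source_rows if row[0] == include_date]
    # A variable function advances the value by one day; it is saved on success
    repository["$$IncludeDateTime"] = include_date + timedelta(days=1)
    return read

repo = {"$$IncludeDateTime": date(2004, 8, 1)}
rows = [(date(2004, 8, 1), "t1"), (date(2004, 8, 2), "t2"), (date(2004, 8, 2), "t3")]
first_run = run_session(rows, repo)
second_run = run_session(rows, repo)
print(first_run, second_run)
```

Each run reads only one day of transactions, and nobody has to edit the filter between runs because the saved value moves forward by itself.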
Variable functions can be used in the following transformations:
Expression
Filter
Router
Update Strategy
Initial and Default Value:
When we declare a mapping parameter or variable in a mapping or a mapplet, we can enter an
initial value. When the Integration Service needs an initial value, and we did not declare an initial
value for the parameter or variable, the Integration Service uses a default value based on the data
type of the parameter or variable.
Data type -> Default value
Numeric -> 0
String -> Empty string
Date/time -> 1/1/1
Variable Values: Start value and current value of a mapping variable
Start Value:
The start value is the value of the variable at the start of the session. The Integration Service looks
for the start value in the following order:

1. Value in parameter file
2. Value saved in the repository
3. Initial value
4. Default value
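The lookup order above can be illustrated with a small Python sketch. The function name `resolve_start_value` and the dictionaries passed to it are hypothetical; they only model the four places the Integration Service is described as checking.

```python
# Default values by data type, as listed in the table above.
DEFAULTS = {"numeric": 0, "string": "", "datetime": "1/1/1"}

def resolve_start_value(name, datatype, parameter_file, repository, initial_values):
    """Return the first value found, checking sources in the documented order."""
    for source in (parameter_file, repository, initial_values):
        if name in source:
            return source[name]
    # Nothing was declared anywhere: fall back to the data type default.
    return DEFAULTS[datatype]

# The parameter file wins over the repository and the initial value.
a = resolve_start_value("$$var_max", "numeric",
                        {"$$var_max": 500}, {"$$var_max": 3000}, {"$$var_max": 1500})
# No parameter file entry: the saved repository value is used.
b = resolve_start_value("$$var_max", "numeric", {}, {"$$var_max": 3000}, {"$$var_max": 1500})
# Nothing declared at all: the numeric default is used.
c = resolve_start_value("$$var_count", "numeric", {}, {}, {})
print(a, b, c)
```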
Current Value:
The current value is the value of the variable as the session progresses. When a session starts, the
current value of a variable is the same as the start value. The final current value for a variable is
saved to the repository at the end of a successful session. When a session fails to complete, the
Integration Service does not update the value of the variable in the repository.
Note: If a variable function is not used to calculate the current value of a mapping variable, the
start value of the variable is saved to the repository.
Variable Data type and Aggregation Type
When we declare a mapping variable in a mapping, we need to configure the data type and
aggregation type for the variable. The Integration Service uses the aggregation type of a
mapping variable to determine the final current value of the mapping variable.
Aggregation types are:
Count: Only integer and small integer data types are valid.
Max: All transformation data types except binary data type are valid.
Min: All transformation data types except binary data type are valid.
Variable Functions
Variable functions determine how the Integration Service calculates the current value of a mapping
variable in a pipeline.
SetMaxVariable: Sets the variable to the maximum value of a group of values. It ignores rows
marked for update, delete, or reject. Aggregation type set to Max.
SetMinVariable: Sets the variable to the minimum value of a group of values. It ignores rows
marked for update, delete, or reject. Aggregation type set to Min.
SetCountVariable: Increments the variable value by one. It adds one to the variable value when
a row is marked for insertion, and subtracts one when the row is marked for deletion. It ignores
rows marked for update or reject. Aggregation type set to Count.

SetVariable: Sets the variable to the configured value. At the end of a session, it compares the
final current value of the variable to the start value of the variable. Based on the aggregate type of
the variable, it saves a final value to the repository.
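The row-level behaviour of three of these functions can be modelled in Python as follows. This is only a sketch of the rules stated above, not the Integration Service implementation; the function names and the row-type constants are hypothetical.

```python
INSERT, UPDATE, DELETE, REJECT = "insert", "update", "delete", "reject"

def set_max_variable(current, value, row_type):
    """Keep the larger value; rows marked update, delete or reject are ignored."""
    return max(current, value) if row_type == INSERT else current

def set_min_variable(current, value, row_type):
    """Keep the smaller value; rows marked update, delete or reject are ignored."""
    return min(current, value) if row_type == INSERT else current

def set_count_variable(current, row_type):
    """Add one for inserts, subtract one for deletes, ignore update and reject."""
    if row_type == INSERT:
        return current + 1
    if row_type == DELETE:
        return current - 1
    return current

# Hypothetical rows: (SAL, row type). Start values follow the example in this section.
rows = [(800, INSERT), (3000, INSERT), (5000, UPDATE), (1250, DELETE)]
var_max, var_min, var_count = 1500, 1500, 0
for sal, row_type in rows:
    var_max = set_max_variable(var_max, sal, row_type)
    var_min = set_min_variable(var_min, sal, row_type)
    var_count = set_count_variable(var_count, row_type)
print(var_max, var_min, var_count)
```

The row with SAL 5000 does not change the maximum because it is marked for update, and the deleted row lowers the count by one.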
Creating Mapping Parameters and Variables

Open the folder where we want to create parameter or variable.

In the Mapping Designer, click Mappings > Parameters and Variables. -or- In the Mapplet
Designer, click Mapplet > Parameters and Variables.

Click the add button.

Enter name. Do not remove $$ from name.

Select Type and Data type. Select Aggregation type for mapping variables.

Give Initial Value. Click ok.

Example: Use of Mapping Parameters and Variables

EMP will be source table.


Create a target table MP_MV_EXAMPLE having columns: EMPNO, ENAME, DEPTNO, TOTAL_SAL,
MAX_VAR, MIN_VAR, COUNT_VAR and SET_VAR.
TOTAL_SAL = SAL+ COMM + $$BONUS (Bonus is mapping parameter that changes every month)
SET_VAR: We will add one month to the HIREDATE of every employee.
Create shortcuts as necessary.
Creating Mapping
1. Open folder where we want to create the mapping.
2. Click Tools -> Mapping Designer.
3. Click Mapping-> Create-> Give name. Ex: m_mp_mv_example
4. Drag EMP and target table.
5. Transformation -> Create -> Select Expression from list -> Create -> Done.
6. Drag EMPNO, ENAME, HIREDATE, SAL, COMM and DEPTNO to Expression.
7. Create Parameter $$Bonus and Give initial value as 200.
8. Create variable $$var_max of MAX aggregation type and initial value 1500.
9. Create variable $$var_min of MIN aggregation type and initial value 1500.
10. Create variable $$var_count of COUNT aggregation type and initial value 0. COUNT is
visible when datatype is INT or SMALLINT.
11. Create variable $$var_set of MAX aggregation type.
12. Create 5 output ports out_TOTAL_SAL, out_MAX_VAR, out_MIN_VAR,
out_COUNT_VAR and out_SET_VAR.
13. Open expression editor for TOTAL_SAL. Do the same as we did earlier for SAL + COMM. To add
$$Bonus to it, select variable tab and select the parameter from mapping parameter: SAL + COMM
+ $$Bonus
14. Open Expression editor for out_max_var.
15. Select the variable function SETMAXVARIABLE from left side pane. Select
$$var_max from variable tab and SAL from ports tab: SETMAXVARIABLE($$var_max,SAL)
17. Open Expression editor for out_min_var and write the following expression:
SETMINVARIABLE($$var_min,SAL). Validate the expression.
18. Open Expression editor for out_count_var and write the following expression:
SETCOUNTVARIABLE($$var_count). Validate the expression.
19. Open Expression editor for out_set_var and write the following expression:
SETVARIABLE($$var_set,ADD_TO_DATE(HIREDATE,'MM',1)). Validate.
20. Click OK.
21. Link all ports from expression to target and Validate Mapping and Save it.

PARAMETER FILE

A parameter file is a list of parameters and associated values for a workflow, worklet, or session.
Parameter files provide flexibility to change these variables each time we run a workflow or
session.
We can create multiple parameter files and change the file we use for a session or workflow. We
can create a parameter file using a text editor such as WordPad or Notepad.
Enter the parameter file name and directory in the workflow or session properties.
A parameter file contains the following types of parameters and variables:
Workflow variable: References values and records information in a workflow.
Worklet variable: References values and records information in a worklet. Use predefined worklet
variables in a parent workflow, but we cannot use workflow variables from the parent workflow in a
worklet.
Session parameter: Defines a value that can change from session to session, such as a database
connection or file name.
Mapping parameter and Mapping variable
USING A PARAMETER FILE
Parameter files contain several sections preceded by a heading. The heading identifies the
Integration Service, Integration Service process, workflow, worklet, or session to which we want to
assign parameters or variables.

Make session and workflow.

Give connection information for source and target table.

Run workflow and see result.


Sample Parameter File for Our example:
In the parameter file, folder and session names are case sensitive.
Create a text file in notepad with name Para_File.txt
[Practice.ST:s_m_MP_MV_Example]
$$Bonus=1000
$$var_max=500
$$var_min=1200
$$var_count=0
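To make the layout of this file concrete, here is a minimal Python sketch that reads the same text into a dictionary keyed by section heading. The parser `parse_parameter_file` is hypothetical and only illustrates the heading plus name=value structure; it is not how the Integration Service reads the file.

```python
def parse_parameter_file(text):
    """Group name=value lines under the [heading] that precedes them."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]          # heading, case preserved
            sections[current] = {}
        elif "=" in line and current is not None:
            name, value = line.split("=", 1)
            sections[current][name.strip()] = value.strip()
    return sections

sample = """[Practice.ST:s_m_MP_MV_Example]
$$Bonus=1000
$$var_max=500
$$var_min=1200
$$var_count=0
"""
params = parse_parameter_file(sample)
print(params["Practice.ST:s_m_MP_MV_Example"]["$$Bonus"])
```

The sketch keeps names exactly as written, which mirrors the note above that folder and session names are case sensitive.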
CONFIGURING PARAMETER FILE
We can specify the parameter file name and directory in the workflow or session properties.
To enter a parameter file in the workflow properties:
1. Open a Workflow in the Workflow Manager.
2. Click Workflows > Edit.
3. Click the Properties tab.
4. Enter the parameter directory and name in the Parameter Filename field.
5. Click OK.
To enter a parameter file in the session properties:
1. Open a session in the Workflow Manager.
2. Click the Properties tab and open the General Options settings.
3. Enter the parameter directory and name in the Parameter Filename field.
Example: D:\Files\Para_File.txt or $PMSourceFileDir\Para_File.txt
4. Click OK.

Mapplet
A mapplet is a reusable object that we create in the Mapplet Designer.
It contains a set of transformations and lets us reuse that transformation logic in multiple
mappings.
Created in Mapplet Designer in Designer Tool.
Suppose we need to use the same set of 5 transformations in, say, 10 mappings. Instead of
creating the 5 transformations in each of the 10 mappings, we create a mapplet of these 5
transformations and use this mapplet in all 10 mappings. Example: To create a surrogate key in
the target, we create a mapplet that uses a stored procedure to generate the primary key for the
target table. We give the target table name and key column name as input to the mapplet and get
the surrogate key as output.
Mapplets help simplify mappings in the following ways:
Include source definitions: Use multiple source definitions and source qualifiers to provide
source data for a mapping.
Accept data from sources in a mapping
Include multiple transformations: As many transformations as we need.
Pass data to multiple transformations: We can create a mapplet to feed data to multiple
transformations. Each Output transformation in a mapplet represents one output group in a
mapplet.
Contain unused ports: We do not have to connect all mapplet input and output ports in a
mapping.
Mapplet Input:
Mapplet input can originate from a source definition and/or from an Input transformation in the
mapplet. We can create multiple pipelines in a mapplet.
We use Mapplet Input transformation to give input to mapplet.
Use of Mapplet Input transformation is optional.
Mapplet Output:
The output of a mapplet is not connected to any target table.
We must use Mapplet Output transformation to store mapplet output.
A mapplet must contain at least one Output transformation with at least one connected port in the
mapplet.
Example1: We will join the EMP and DEPT tables, calculate the total salary, and pass the output
to a Mapplet Out transformation.
EMP and DEPT will be source tables.
Output will be given to transformation Mapplet_Out.
Steps:

Open folder where we want to create the mapping.

Click Tools -> Mapplet Designer.

Click Mapplets-> Create-> Give name. Ex: mplt_example1

Drag EMP and DEPT table.

Use Joiner transformation as described earlier to join them.

Transformation -> Create -> Select Expression from list -> Create -> Done

Pass all ports from joiner to expression and then calculate total salary as described in
expression transformation.

Now Transformation -> Create -> Select Mapplet Out from list -> Create -> Give name and
then done.

Pass all ports from expression to Mapplet output.

Mapplet -> Validate

Repository -> Save


Use of mapplet in mapping:
We can use a mapplet in a mapping by dragging the mapplet from the mapplet folder in the left
pane, just as we drag source and target tables.
When we use the mapplet in a mapping, the mapplet object displays only the ports from the Input
and Output transformations. These are referred to as the mapplet input and mapplet output ports.
Make sure to give correct connection information in session.
Making a mapping: We will use mplt_example1, and then create a filter
transformation to filter records whose Total Salary is >= 1500.

mplt_example1 will be source.

Create a target table with the same structure as the Mapplet_Out transformation.

Creating Mapping
Open folder where we want to create the mapping.

Click Tools -> Mapping Designer.

Click Mapping-> Create-> Give name. Ex: m_mplt_example1

Drag mplt_Example1 and target table.

Transformation -> Create -> Select Filter from list -> Create -> Done.

Drag all ports from mplt_example1 to filter and give filter condition.

Connect all ports from filter to target. We can add more transformations after filter if
needed.

Validate mapping and Save it.

Make session and workflow.


Give connection information for mapplet source tables.
Give connection information for target table.
Run workflow and see result.

Indirect Loading For Flat Files


Suppose we have 10 flat files with the same structure: all the files have the same number of
columns and the same data types. We need to load all 10 files into the same target.
The files are named, say, EMP1, EMP2 and so on.
Solution1:
1. Import one flat file definition and make the mapping as per need.
2. Now in session give the Source File name and Source File Directory location of one file.
3. Make workflow and run.
4. Now open session after workflow completes. Change the Filename and Directory to give
information of second file. Run workflow again.
5. Do the above for all 10 files.
Solution2:
1. Import one flat file definition and make the mapping as per need.
2. Now in session give the Source Directory location of the files.
3. Now in Source filename use $InputFileName. This is a session parameter.
4. Now make a parameter file and give the value of $InputFileName.
$InputFileName=EMP1.txt
5. Run the workflow
6. Now edit parameter file and give value of second file. Run workflow again.
7. Do same for remaining files.
Solution3:
1. Import one flat file definition and make the mapping as per need.
2. Now make a notepad file that contains the location and name of each of the 10 flat files.

Sample:
D:\EMP1.txt
E:\EMP2.txt
E:\FILES\DWH\EMP3.txt and so on
3. Now make a session and in Source file name and Source File Directory location fields, give the
name and location of above created file.
4. In Source file type field, select Indirect.
5. Click Apply.
6. Validate Session
7. Make Workflow. Save it to repository and run.
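The idea behind Solution3 (one list file that names the real source files) can be sketched in Python. This is only an analogy for what the Indirect source file type does; `read_indirect` and the temporary file names are hypothetical, and the demo files are created in a temporary directory so the sketch is self-contained.

```python
import csv
import os
import tempfile

def read_indirect(list_path):
    """Yield rows from every file named in the list file, in listed order."""
    with open(list_path) as listing:
        paths = [line.strip() for line in listing if line.strip()]
    for path in paths:
        with open(path, newline="") as f:
            yield from csv.reader(f)

# Build two small flat files with the same structure, plus the list file.
workdir = tempfile.mkdtemp()
paths = []
for i in (1, 2):
    p = os.path.join(workdir, f"EMP{i}.txt")
    with open(p, "w", newline="") as f:
        csv.writer(f).writerow([i, f"NAME{i}"])
    paths.append(p)

list_file = os.path.join(workdir, "file_list.txt")
with open(list_file, "w") as f:
    f.write("\n".join(paths) + "\n")

rows = list(read_indirect(list_file))
print(rows)
```

A single run reads every listed file, which is why Solution3 avoids the repeated manual edits of Solution1 and Solution2.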

Incremental Aggregation
When we enable the session option Incremental Aggregation, the Integration Service
performs incremental aggregation: it passes source data through the mapping and uses historical
cache data to perform aggregation calculations incrementally.
When using incremental aggregation, you apply captured changes in the source to aggregate
calculations in a session. If the source changes incrementally and you can capture changes, you
can configure the session to process those changes. This allows the Integration Service to update

the target incrementally, rather than forcing it to process the entire source and recalculate the
same data each time you run the session.
For example, you might have a session using a source that receives new data every day. You can
capture those incremental changes because you have added a filter condition to the mapping that
removes pre-existing data from the flow of data. You then enable incremental aggregation.
When the session runs with incremental aggregation enabled for the first time on March 1, you use
the entire source. This allows the Integration Service to read and store the necessary aggregate
data. On March 2, when you run the session again, you filter out all the records except those
timestamped March 2. The Integration Service then processes the new data and updates the target
accordingly.
Consider using incremental aggregation in the following circumstances:
You can capture new source data: Use incremental aggregation when you can capture new source
data each time you run the session. Use a Stored Procedure or Filter transformation to process new
data.
Incremental changes do not significantly change the target: Use incremental aggregation when the
changes do not significantly change the target. If processing the incrementally changed source
alters more than half the existing target, the session may not benefit from using incremental
aggregation. In this case, drop the table and recreate the target with complete source data.
Note: Do not use incremental aggregation if the mapping contains percentile or median functions.
The Integration Service uses system memory to process these functions in addition to the cache
memory you configure in the session properties. As a result, the Integration Service does not store
incremental aggregation values for percentile and median functions in disk caches.
Integration Service Processing for Incremental Aggregation
(i) The first time you run an incremental aggregation session, the Integration Service processes the
entire source. At the end of the session, the Integration Service stores aggregate data from that
session run in two files, the index file and the data file. The Integration Service creates the files in
the cache directory specified in the Aggregator transformation properties.
(ii) Each subsequent time you run the session with incremental aggregation, you use the
incremental source changes in the session. For each input record, the Integration Service checks
historical information in the index file for a corresponding group. If it finds a corresponding group,
the Integration Service performs the aggregate operation incrementally, using the aggregate data
for that group, and saves the incremental change. If it does not find a corresponding group, the
Integration Service creates a new group and saves the record data.
(iii) When writing to the target, the Integration Service applies the changes to the existing target. It
saves modified aggregate data in the index and data files to be used as historical data the next
time you run the session.
(iv) If the source changes significantly and you want the Integration Service to continue saving
aggregate data for future incremental changes, configure the Integration Service to overwrite
existing aggregate data with new aggregate data.
Each subsequent time you run a session with incremental aggregation, the Integration Service
creates a backup of the incremental aggregation files. The cache directory for the Aggregator
transformation must contain enough disk space for two sets of the files.
(v) When you partition a session that uses incremental aggregation, the Integration Service creates
one set of cache files for each partition.
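The group lookup described in steps (i) to (iii) can be sketched in Python with a dictionary standing in for the index and data files. The names `run_incremental` and `cache` are hypothetical, and a SUM aggregate is assumed purely for illustration.

```python
def run_incremental(cache, new_rows):
    """Fold new (group, amount) rows into the historical cache.

    Returns only the groups touched in this run, which is what would be
    applied to the existing target.
    """
    changed = {}
    for group, amount in new_rows:
        if group in cache:
            cache[group] += amount      # existing group: aggregate incrementally
        else:
            cache[group] = amount       # new group: create it and save the record data
        changed[group] = cache[group]
    return changed

cache = {}  # empty before the first run, so the first run processes the entire source
day1 = run_incremental(cache, [("A", 10), ("B", 5), ("A", 2)])
day2 = run_incremental(cache, [("A", 3), ("C", 7)])   # only the new source rows
print(day2)
print(cache)
```

On the second run only groups A and C are touched, so group B in the target is left as it was.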
The Integration Service creates new aggregate data, instead of using historical data, when you
perform one of the following tasks:

Save a new version of the mapping.

Configure the session to reinitialize the aggregate cache.

Move the aggregate files without correcting the configured path or directory for the files in
the session properties.

Change the configured path or directory for the aggregate files without moving the files to
the new location.

Delete cache files.

Decrease the number of partitions.

When the Integration Service rebuilds incremental aggregation files, the data in the previous
files is lost.
Note: To protect the incremental aggregation files from file corruption or disk failure, periodically
back up the files.
Preparing for Incremental Aggregation:

When you use incremental aggregation, you need to configure both mapping and session
properties:

Implement mapping logic or filter to remove pre-existing data.

Configure the session for incremental aggregation and verify that the file directory has
enough disk space for the aggregate files.
Configuring the Mapping
Before enabling incremental aggregation, you must capture changes in source data. You can use a
Filter or Stored Procedure transformation in the mapping to remove pre-existing source data during
a session.
Configuring the Session
Use the following guidelines when you configure the session for incremental aggregation:
(i) Verify the location where you want to store the aggregate files.
The index and data files grow in proportion to the source data. Be sure the cache directory has
enough disk space to store historical data for the session.
When you run multiple sessions with incremental aggregation, decide where you want the files
stored. Then, enter the appropriate directory for the process variable, $PMCacheDir, in the
Workflow Manager. You can enter session-specific directories for the index and data files. However,
by using the process variable for all sessions using incremental aggregation, you can easily
change the cache directory when necessary by changing $PMCacheDir.
Changing the cache directory without moving the files causes the Integration Service to reinitialize
the aggregate cache and gather new aggregate data.
In a grid, Integration Services rebuild incremental aggregation files they cannot find. When an
Integration Service rebuilds incremental aggregation files, it loses aggregate history.
(ii) Verify the incremental aggregation settings in the session properties.
You can configure the session for incremental aggregation in the Performance settings on the
Properties tab.
You can also configure the session to reinitialize the aggregate cache. If you choose to reinitialize
the cache, the Workflow Manager displays a warning indicating the Integration Service overwrites
the existing cache and a reminder to clear this option after running the session.

When should we go for hash partitioning?


Scenarios for choosing hash partitioning:
Not enough knowledge about how much data maps into a given range.
Sizes of range partition differ quite substantially, or are difficult to balance manually
Range partitioning would cause data to be clustered undesirably.
Features such as parallel DML, partition pruning, joins etc are important.
You can define the following partition types in the Workflow Manager:
1) Database Partitioning
The Integration Service queries the IBM DB2 or Oracle system for table partition information. It
reads partitioned data from the corresponding nodes in the database. Use database partitioning
with Oracle or IBM DB2 source instances on a multi-node tablespace. Use database partitioning
with DB2 targets.
2) Hash Partitioning
Use hash partitioning when you want the Integration Service to distribute rows to the partitions by
group. For example, you need to sort items by item ID, but you do not know how many items have
a particular ID number.
3) Key Range
You specify one or more ports to form a compound partition key. The Integration Service passes
data to each partition depending on the ranges you specify for each port. Use key range
partitioning where the sources or targets in the pipeline are partitioned by key range.

4) Simple Pass-Through
The Integration Service passes all rows at one partition point to the next partition point without
redistributing them. Choose pass-through partitioning where you want to create an additional
pipeline stage to improve performance, but do not want to change the distribution of data across
partitions.
5) Round-Robin
The Integration Service distributes data evenly among all partitions. Use round-robin partitioning
where you want each partition to process approximately the same number of rows.
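The difference between round-robin, hash and key range distribution can be shown with a short Python sketch. It only illustrates the distribution rules in general terms; the function names are hypothetical and the CRC32 hash is an arbitrary choice, not the hash the Integration Service uses.

```python
import zlib
from bisect import bisect_right

def round_robin(rows, n):
    """Deal rows out in turn so partition sizes differ by at most one."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    """Send all rows with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    return parts

def key_range(rows, key, upper_bounds):
    """Route rows by sorted exclusive upper bounds; the rest go to a final partition."""
    parts = [[] for _ in range(len(upper_bounds) + 1)]
    for row in rows:
        parts[bisect_right(upper_bounds, row[key])].append(row)
    return parts

items = [{"item_id": i % 4, "qty": i} for i in range(10)]
rr = round_robin(items, 3)
hashed = hash_partition(items, 3, "item_id")
ranged = key_range(items, "qty", [3, 6])
print([len(p) for p in rr])
print([len(p) for p in ranged])
```

Hash partitioning keeps each item ID together without knowing in advance how many rows share it, while round-robin only balances row counts.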

Partition Types Overview


Creating Partition Tables
To create a partitioned table, give the following statement:
Create table sales (year number(4),
product varchar2(10),
amt number(10))
partition by range (year)
(
partition p1 values less than (1992),
partition p2 values less than (1993),
partition p3 values less than (1994),
partition p4 values less than (1995),
partition p5 values less than (MAXVALUE)
);
The following example creates a table with list partitioning:
Create table customers (custcode number(5),
Name varchar2(20),
Addr varchar2(10),
City varchar2(20),
Bal number(10,2))
Partition by list (city)
(
Partition north_India values ('DELHI','CHANDIGARH'),
Partition east_India values ('KOLKOTA','PATNA'),
Partition south_India values ('HYDERABAD','BANGALORE',
'CHENNAI'),
Partition west_India values ('BOMBAY','GOA')
);
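To add a partition (for a range-partitioned table this works only while no MAXVALUE partition
exists; otherwise split the last partition instead):
alter table sales add partition p6 values less than (1996);
alter table customers add partition central_India values ('BHOPAL','NAGPUR');
To drop a partition:
Alter table sales drop partition p5;
To merge two partitions:
Alter table sales merge partitions p2, p3 into
partition p23;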
The following statement adds a new set of cities ('KOCHI', 'MANGALORE') to an existing partition
list.
ALTER TABLE customers
MODIFY PARTITION south_india
ADD VALUES ('KOCHI', 'MANGALORE');
The statement below drops a set of cities ('KOCHI' and 'MANGALORE') from an existing partition
value list.
ALTER TABLE customers
MODIFY PARTITION south_india
DROP VALUES ('KOCHI','MANGALORE');
SPLITTING PARTITIONS

You can split a single partition into two partitions. For example, to split the partition p5 of the
sales table into two partitions, give the following command:
Alter table sales split partition p5 at (1996) into
(Partition p6,
Partition p7);
TRUNCATING PARTITION
Truncating a partition will delete all rows from the partition.
To truncate a partition give the following statement
Alter table sales truncate partition p5;
LISTING INFORMATION ABOUT PARTITION TABLES

To see how many partitioned tables there are in your schema, give the following statement:
Select * from user_part_tables;
To see partition-level partitioning information:
Select * from user_tab_partitions;

TASKS
The Workflow Manager contains many types of tasks to help you build workflows and worklets. We
can create reusable tasks in the Task Developer.
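Types of tasks:
Task Type -> Tool where task can be created -> Reusable or not
Session -> Task Developer, Workflow Designer, Worklet Designer -> Yes
Email -> Task Developer, Workflow Designer, Worklet Designer -> Yes
Command -> Task Developer, Workflow Designer, Worklet Designer -> Yes
Event-Raise -> Workflow Designer, Worklet Designer -> No
Event-Wait -> Workflow Designer, Worklet Designer -> No
Timer -> Workflow Designer, Worklet Designer -> No
Decision -> Workflow Designer, Worklet Designer -> No
Assignment -> Workflow Designer, Worklet Designer -> No
Control -> Workflow Designer, Worklet Designer -> No
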

SESSION TASK

A session is a set of instructions that tells the Power Center Server how and when to move data
from sources to targets.
To run a session, we must first create a workflow to contain the Session task.
We can run as many sessions in a workflow as we need. We can run the Session tasks sequentially
or concurrently, depending on our needs.
The Power Center Server creates several files and in-memory caches depending on the
transformations and options used in the session.
EMAIL TASK
The Workflow Manager provides an Email task that allows us to send email during a workflow.
It is usually created by the Administrator, and we just drag it into our workflow and use it.
Steps:
1. In the Task Developer or Workflow Designer, choose Tasks-Create.
2. Select an Email task and enter a name for the task. Click Create.
3. Click Done.
4. Double-click the Email task in the workspace. The Edit Tasks dialog box appears.
5. Click the Properties tab.
6. Enter the fully qualified email address of the mail recipient in the Email User Name field.
7. Enter the subject of the email in the Email Subject field. Or, you can leave this field blank.
8. Click the Open button in the Email Text field to open the Email Editor.
9. Click OK twice to save your changes.
Example: To send an email when a session completes:
Steps:
1. Create a workflow wf_sample_email
2. Drag any session task to workspace.
3. Edit Session task and go to Components tab.
4. See On Success Email Option there and configure it.
5. In Type select reusable or Non-reusable.
6. In Value, select the email task to be used.
7. Click Apply -> Ok.
8. Validate workflow and Repository -> Save
9. We can also drag the email task and use as per need.
10. We can set the option to send email on success or failure in the Components tab of a
session task.
COMMAND TASK
The Command task allows us to specify one or more shell commands in UNIX or DOS commands in
Windows to run during the workflow.
For example, we can specify shell commands in the Command task to delete reject files, copy a
file, or archive target files.
Ways of using command task:
1. Standalone Command task: We can use a Command task anywhere in the workflow or worklet
to run shell commands.
2. Pre- and post-session shell command: We can call a Command task as the pre- or post-session
shell command for a Session task. This is done in COMPONENTS TAB of a session. We can run it in
Pre-Session Command or Post Session Success Command or Post Session Failure Command. Select
the Value and Type option as we did in Email task.
Example: to copy a file sample.txt from D drive to E.
Command: COPY D:\sample.txt E:\ in windows
Steps for creating command task:
1. In the Task Developer or Workflow Designer, choose Tasks-Create.
2. Select Command Task for the task type.
3. Enter a name for the Command task. Click Create. Then click done.
4. Double-click the Command task. Go to commands tab.
5. In the Commands tab, click the Add button to add a command.
6. In the Name field, enter a name for the new command.
7. In the Command field, click the Edit button to open the Command Editor.
8. Enter only one command in the Command Editor.
9. Click OK to close the Command Editor.
10. Repeat steps 5-9 to add more commands in the task.
11. Click OK.
Steps to create the workflow using command task:
1. Create a task using the above steps to copy a file in Task Developer.
2. Open Workflow Designer. Workflow -> Create -> Give name and click ok.
3. Start is displayed. Drag session say s_m_Filter_example and command task.
4. Link Start to Session task and Session to Command Task.
5. Double click link between Session and Command and give condition in editor as
$S_M_FILTER_EXAMPLE.Status=SUCCEEDED
6. Workflow -> Validate
7. Repository -> Save
WORKING WITH EVENT TASKS
We can define events in the workflow to specify the sequence of task execution.
Types of Events:
Pre-defined event: A pre-defined event is a file-watch event. This event waits for a specified file
to arrive at a given location.
User-defined event: A user-defined event is a sequence of tasks in the Workflow. We create
events and then raise them as per need.
Steps for creating User Defined Event:
1. Open any workflow where we want to create an event.
2. Click Workflow-> Edit -> Events tab.
3. Click the Add button to add events and give the names as per need.
4. Click Apply -> Ok. Validate the workflow and Save it.
Types of Events Tasks:
EVENT RAISE: Event-Raise task represents a user-defined event. We use this task to raise a user
defined event.
EVENT WAIT: Event-Wait task waits for a file watcher event or user defined event to occur before
executing the next session in the workflow.
Example1: Use an event wait task and make sure that session s_filter_example runs when abc.txt
file is present in D:\FILES folder.
Steps for creating workflow:
1. Workflow -> Create -> Give name wf_event_wait_file_watch -> Click ok.
2. Task -> Create -> Select Event Wait. Give name. Click create and done.
3. Link Start to Event Wait task.
4. Drag s_filter_example to workspace and link it to event wait task.
5. Right click on event wait task and click EDIT -> EVENTS tab.
6. Select Pre Defined option there. In the blank space, give directory and filename to watch.
Example: D:\FILES\abc.txt
7. Workflow validate and Repository Save.
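The file-watch behaviour in this example amounts to waiting until a file appears before the next task runs. A minimal Python sketch of that idea follows; `wait_for_file`, its timeout and its polling interval are hypothetical and are not Event-Wait task settings. A temporary directory stands in for D:\FILES.

```python
import os
import tempfile
import time

def wait_for_file(path, timeout=5.0, poll_interval=0.1):
    """Poll until the file exists or the timeout passes; return True if it was found."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    return os.path.exists(path)

watch_dir = tempfile.mkdtemp()
watched = os.path.join(watch_dir, "abc.txt")

missing = wait_for_file(watched, timeout=0.3)   # file not there yet
with open(watched, "w") as f:                   # the file arrives
    f.write("ready\n")
found = wait_for_file(watched, timeout=0.3)

if found:
    print("file present: the session linked after the wait can run")
```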
Example 2: Raise a user defined event when session s_m_filter_example succeeds. Capture this
event in event wait task and run session S_M_TOTAL_SAL_EXAMPLE
Steps for creating workflow:
1. Workflow -> Create -> Give name wf_event_wait_event_raise -> Click ok.
2. Workflow -> Edit -> Events Tab and add events EVENT1 there.
3. Drag s_m_filter_example and link it to START task.
4. Click Tasks -> Create -> Select EVENT RAISE from list. Give name ER_Example. Click Create
and then done. Link ER_Example to s_m_filter_example.
5. Right click ER_Example -> EDIT -> Properties Tab -> Open Value for User Defined Event and
Select EVENT1 from the list displayed. Apply -> OK.
6. Click link between ER_Example and s_m_filter_example and give the condition
$S_M_FILTER_EXAMPLE.Status=SUCCEEDED
7. Click Tasks -> Create -> Select EVENT WAIT from list. Give name EW_WAIT. Click Create and
then done.
8. Link EW_WAIT to START task.
9. Right click EW_WAIT -> EDIT-> EVENTS tab.
10. Select User Defined there. Select the Event1 by clicking Browse Events button.
11. Apply -> OK.
12. Drag S_M_TOTAL_SAL_EXAMPLE and link it to EW_WAIT.
13. Workflow -> Validate
14. Repository -> Save.
Run workflow and see.
TIMER TASK
The Timer task allows us to specify the period of time to wait before the Power Center Server runs
the next task in the workflow. The Timer task has two types of settings:
Absolute time: We specify the exact date and time or we can choose a user-defined workflow
variable to specify the exact time. The next task in workflow will run as per the date and time
specified.
Relative time: We instruct the Power Center Server to wait for a specified period of time after the
Timer task, the parent workflow, or the top-level workflow starts.
Example: Run session s_m_filter_example 1 minute after the Timer task starts.
Steps for creating workflow:
1. Workflow -> Create -> Give name wf_timer_task_example -> Click ok.
2. Click Tasks -> Create -> Select TIMER from list. Give name TIMER_Example. Click Create and
then done.
3. Link TIMER_Example to START task.
4. Right click TIMER_Example-> EDIT -> TIMER tab.
5. Select Relative Time Option and Give 1 min and Select From start time of this task Option.
6. Apply -> OK.
7. Drag s_m_filter_example and link it to TIMER_Example.
8. Workflow-> Validate and Repository -> Save.
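The difference between the two Timer settings comes down to how the wake-up time is computed. Below is a small Python sketch of that arithmetic; the function timer_wake_time and its parameters are hypothetical names chosen for illustration and are not part of PowerCenter.

```python
from datetime import datetime, timedelta

def timer_wake_time(mode, *, absolute=None, relative=None, reference_start=None):
    """Return the moment at which the task after the Timer may start.

    mode="absolute": use the exact date and time given in `absolute`.
    mode="relative": add `relative` to `reference_start`, which is the start
    time of the Timer task, the parent workflow, or the top-level workflow.
    """
    if mode == "absolute":
        if absolute is None:
            raise ValueError("absolute mode needs an exact datetime")
        return absolute
    if mode == "relative":
        if relative is None or reference_start is None:
            raise ValueError("relative mode needs a delay and a reference start time")
        return reference_start + relative
    raise ValueError(f"unknown mode: {mode}")

# The example above: wait 1 minute from the start time of the Timer task.
timer_started = datetime(2024, 1, 1, 9, 0, 0)   # made-up start time
wake = timer_wake_time("relative", relative=timedelta(minutes=1),
                       reference_start=timer_started)
```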
DECISION TASK
The Decision task allows us to enter a condition that determines the execution of the workflow,
similar to a link condition.
The Decision task has a pre-defined variable called $Decision_task_name.condition that represents
the result of the decision condition.
The Power Center Server evaluates the condition in the Decision task and sets the pre-defined
condition variable to True (1) or False (0).
We can specify one decision condition per Decision task.
Example: Command Task should run only if either s_m_filter_example or
S_M_TOTAL_SAL_EXAMPLE succeeds. If neither s_m_filter_example nor
S_M_TOTAL_SAL_EXAMPLE succeeds, then S_m_sample_mapping_EMP should run.
Steps for creating workflow:
1. Workflow -> Create -> Give name wf_decision_task_example -> Click ok.
2. Drag s_m_filter_example and S_M_TOTAL_SAL_EXAMPLE to workspace and link both of them to
START task.
3. Click Tasks -> Create -> Select DECISION from list. Give name DECISION_Example. Click Create
and then done. Link DECISION_Example to both s_m_filter_example and
S_M_TOTAL_SAL_EXAMPLE.
4. Right click DECISION_Example-> EDIT -> GENERAL tab.
5. Set Treat Input Links As to OR. Default is AND. Apply and click OK.
6. Now edit decision task again and go to PROPERTIES Tab. Open the Expression editor by clicking
the VALUE section of Decision Name attribute and enter the following condition:
$S_M_FILTER_EXAMPLE.Status = SUCCEEDED OR $S_M_TOTAL_SAL_EXAMPLE.Status = SUCCEEDED
7. Validate the condition -> Click Apply -> OK.
8. Drag command task and S_m_sample_mapping_EMP task to workspace and link them to
DECISION_Example task.
9. Double click link between S_m_sample_mapping_EMP & DECISION_Example & give the
condition: $DECISION_Example.Condition = 0. Validate & click OK.
10. Double click link between Command task and DECISION_Example and give the
condition: $DECISION_Example.Condition = 1. Validate and click OK.
11. Workflow Validate and repository Save.
Run workflow and see the result.
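The routing configured in these steps can be written out as plain logic. This is a Python sketch under the assumption that session statuses are available as strings; the dictionary layout and the names decision_condition and next_task are hypothetical.

```python
SUCCEEDED, FAILED = "SUCCEEDED", "FAILED"

def decision_condition(status):
    """Mimic the decision expression from step 6; returns 1 (True) or 0 (False)."""
    return int(status["s_m_filter_example"] == SUCCEEDED
               or status["S_M_TOTAL_SAL_EXAMPLE"] == SUCCEEDED)

def next_task(status):
    """Follow the link whose condition matches the Decision task's result."""
    if decision_condition(status) == 1:   # link condition: Condition = 1
        return "command_task"
    return "S_m_sample_mapping_EMP"       # link condition: Condition = 0
```

With this configuration the Command task runs when at least one session succeeds, and S_m_sample_mapping_EMP runs only when neither does.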

CONTROL TASK
We can use the Control task to stop, abort, or fail the top-level workflow or the parent workflow
based on an input link condition.
A parent workflow or worklet is the workflow or worklet that contains the Control task.
We give the condition to the link connected to Control Task.
Control Option: Description
Fail Me: Fails the Control task.
Fail Parent: Marks the status of the WF or worklet that contains the Control task as failed.
Stop Parent: Stops the WF or worklet that contains the Control task.
Abort Parent: Aborts the WF or worklet that contains the Control task.
Fail Top-Level WF: Fails the workflow that is running.
Stop Top-Level WF: Stops the workflow that is running.
Abort Top-Level WF: Aborts the workflow that is running.


Example: Drag any 3 sessions and if anyone fails, then Abort the top level workflow.
Steps for creating workflow:
1. Workflow -> Create -> Give name wf_control_task_example -> Click ok.
2. Drag any 3 sessions to workspace and link all of them to START task.
3. Click Tasks -> Create -> Select CONTROL from list. Give name cntr_task.
4. Click Create and then done.
5. Link all sessions to the control task cntr_task.
6. Double click link between cntr_task and any session say s_m_filter_example and give the
condition: $S_M_FILTER_EXAMPLE.Status = FAILED.
7. Repeat above step for remaining 2 sessions also.
8. Right click cntr_task-> EDIT -> GENERAL tab. Set Treat Input Links As to OR. Default is AND.
9. Go to PROPERTIES tab of cntr_task and select the value Fail Top-Level Workflow for Control
Option. Click Apply and OK.
10. Workflow Validate and repository Save.
Run workflow and see the result.
ASSIGNMENT TASK
The Assignment task allows us to assign a value to a user-defined workflow variable.
See Workflow variable topic to add user defined variables.

To use an Assignment task in the workflow, first create and add the Assignment task to the
workflow. Then configure the Assignment task to assign values or expressions to user-defined
variables.
We cannot assign values to pre-defined workflow variables.


Steps to create Assignment Task:
1. Open any workflow where we want to use Assignment task.
2. Edit Workflow and add user defined variables.
3. Choose Tasks-Create. Select Assignment Task for the task type.
4. Enter a name for the Assignment task. Click Create. Then click done.
5. Double-click the Assignment task to open the Edit Task dialog box.
6. On the Expressions tab, click Add to add an assignment.
7. Click the Open button in the User Defined Variables field.
8. Select the variable for which you want to assign a value. Click OK.
9. Click the Edit button in the Expression field to open the Expression Editor.
10. Enter the value or expression you want to assign.
11. Repeat steps 7-10 to add more variable assignments as necessary.
12. Click OK.

Scheduler
We can schedule a workflow to run continuously, repeat at a given time or interval, or we can
manually start a workflow. The Integration Service runs a scheduled workflow as configured.
By default, the workflow runs on demand. We can change the schedule settings by editing the
scheduler. If we change schedule settings, the Integration Service reschedules the workflow
according to the new settings.
A scheduler is a repository object that contains a set of schedule settings.
Scheduler can be non-reusable or reusable.
The Workflow Manager marks a workflow invalid if we delete the scheduler associated with the
workflow.
If we choose a different Integration Service for the workflow or restart the Integration Service, it
reschedules all workflows.
If we delete a folder, the Integration Service removes workflows from the schedule.
The Integration Service does not run the workflow if:
The prior workflow run fails.
We remove the workflow from the schedule
The Integration Service is running in safe mode
Creating a Reusable Scheduler
For each folder, the Workflow Manager lets us create reusable schedulers so we can reuse the
same set of scheduling settings for workflows in the folder.
Use a reusable scheduler so we do not need to configure the same set of scheduling settings in
each workflow.
When we delete a reusable scheduler, all workflows that use the deleted scheduler become
invalid. To make the workflows valid, we must edit them and replace the missing scheduler.

Steps:
Open the folder where we want to create the scheduler.
In the Workflow Designer, click Workflows > Schedulers.
Click Add to add a new scheduler.
In the General tab, enter a name for the scheduler.
Configure the scheduler settings in the Scheduler tab.
Click Apply and OK.
Configuring Scheduler Settings
Configure the Schedule tab of the scheduler to set run options, schedule options, start options,
and end options for the schedule.


There are 3 run options:
Run on Demand
Run Continuously
Run on Server initialization

1. Run on Demand:
Integration Service runs the workflow when we start the workflow manually.
2. Run Continuously:
Integration Service runs the workflow as soon as the service initializes. The Integration Service
then starts the next run of the workflow as soon as it finishes the previous run.
3. Run on Server initialization
Integration Service runs the workflow as soon as the service is initialized. The Integration Service
then starts the next run of the workflow according to settings in Schedule Options.
Schedule options for Run on Server initialization:
Run Once: To run the workflow just once.
Run every: Run the workflow at regular intervals, as configured.
Customized Repeat: Integration Service runs the workflow on the dates and times specified in
the Repeat dialog box.
Start options for Run on Server initialization:
Start Date
Start Time
End options for Run on Server initialization:
End on: IS stops scheduling the workflow on the selected date.
End After: IS stops scheduling the workflow after the set number of workflow runs.
Forever: IS schedules the workflow as long as the workflow does not fail.
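To see how the "Run every" schedule option interacts with the end options, here is a rough Python sketch that lists the planned start times. The function scheduled_runs and its parameters are hypothetical, treating the "End on" date as inclusive is an assumption, and the limit argument only exists so that the "Forever" case returns a finite list.

```python
from datetime import datetime, timedelta

def scheduled_runs(start, every, *, end_on=None, end_after=None, limit=100):
    """List start times for a 'Run every' schedule.

    end_on:    stop scheduling after this moment (assumed inclusive).
    end_after: stop after this many runs.
    Neither:   'Forever', cut off at `limit` entries for this sketch.
    """
    runs = []
    current = start
    while len(runs) < limit:
        if end_on is not None and current > end_on:
            break
        if end_after is not None and len(runs) >= end_after:
            break
        runs.append(current)
        current += every
    return runs

# Made-up example: every 6 hours from midnight, ending after 3 runs.
first = datetime(2024, 1, 1, 0, 0)
three_runs = scheduled_runs(first, timedelta(hours=6), end_after=3)
```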
Creating a Non-Reusable Scheduler
In the Workflow Designer, open the workflow.
Click Workflows > Edit.
In the Scheduler tab, choose Non-reusable. Select Reusable if we want to select an existing
reusable scheduler for the workflow.
Note: If we do not have a reusable scheduler in the folder, we must create one before we choose
Reusable.
Click the right side of the Scheduler field to edit scheduling settings for the non-reusable
scheduler.
If we select Reusable, choose a reusable scheduler from the Scheduler Browser dialog box.
Click OK.
Points to Ponder:
To remove a workflow from its schedule, right-click the workflow in the Navigator window and
choose Unschedule Workflow.
To reschedule a workflow on its original schedule, right-click the workflow in the Navigator window
and choose Schedule Workflow.
Pushdown Optimization Overview

You can push transformation logic to the source or target database using pushdown
optimization. When you run a session configured for pushdown optimization, the Integration
Service translates the transformation logic into SQL queries and sends the SQL queries
to the database. The source or target database executes the SQL queries to process the
transformations.

The amount of transformation logic you can push to the database depends on the database,
transformation logic, and mapping and session configuration. The Integration Service processes all
transformation logic that it cannot push to a database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic
that the Integration Service can push to the source or target database. You can also use the
Pushdown Optimization Viewer to view the messages related to pushdown optimization.

The following figure shows a mapping containing transformation logic that can be pushed to
the source database:

This mapping contains a Filter transformation that filters out all items except those with an
ID greater than 1005. The Integration Service can push the transformation logic to the database. It
generates the following SQL statement to process the transformation logic:

INSERT INTO ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC, n_PRICE)
SELECT ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC, CAST(ITEMS.PRICE AS INTEGER)
FROM ITEMS
WHERE (ITEMS.ITEM_ID > 1005)

The Integration Service generates an INSERT SELECT statement to get the ID, NAME, and
DESCRIPTION columns from the source table. It filters the data using a WHERE clause. The
Integration Service does not extract data from the database at this time.
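The effect of such a statement can be tried on a small scale with Python's built-in sqlite3 module. This is only an illustration and not what the Integration Service runs: SQLite stands in for the real database, the target table name T_ITEMS is a hypothetical choice so that source and target are distinct, and the sample rows are made up.

```python
import sqlite3

def run_pushdown_demo():
    """Filter and load rows entirely inside the database with INSERT ... SELECT."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ITEMS   (ITEM_ID INTEGER, ITEM_NAME TEXT, ITEM_DESC TEXT, PRICE REAL);
        CREATE TABLE T_ITEMS (ITEM_ID INTEGER, ITEM_NAME TEXT, ITEM_DESC TEXT, n_PRICE INTEGER);
        INSERT INTO ITEMS VALUES
            (1001, 'Bolt', 'Steel bolt', 1.75),
            (1006, 'Gear', 'Brass gear', 12.40),
            (1010, 'Belt', 'Drive belt', 30.00);
    """)
    # The filter (WHERE) and the type conversion (CAST) both run in the database;
    # no rows are fetched into Python until the final check below.
    conn.execute("""
        INSERT INTO T_ITEMS (ITEM_ID, ITEM_NAME, ITEM_DESC, n_PRICE)
        SELECT ITEMS.ITEM_ID, ITEMS.ITEM_NAME, ITEMS.ITEM_DESC, CAST(ITEMS.PRICE AS INTEGER)
        FROM ITEMS
        WHERE ITEMS.ITEM_ID > 1005
    """)
    rows = conn.execute(
        "SELECT ITEM_ID, ITEM_NAME, n_PRICE FROM T_ITEMS ORDER BY ITEM_ID"
    ).fetchall()
    conn.close()
    return rows
```

Only the two rows with an ID above 1005 reach the target table, and the price arrives already converted to an integer.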
Pushdown Optimization Types
You can configure the following types of pushdown optimization:
Source-side pushdown optimization. The Integration Service pushes as much transformation logic
as possible to the source database.
Target-side pushdown optimization. The Integration Service pushes as much transformation logic
as possible to the target database.
Full pushdown optimization. The Integration Service attempts to push all transformation logic to
the target database. If the Integration Service cannot push all transformation logic to the
database, it performs both source-side and target-side pushdown optimization.
Running Source-Side Pushdown Optimization Sessions
When you run a session configured for source-side pushdown optimization, the Integration Service
analyzes the mapping from the source to the target or until it reaches a downstream
transformation it cannot push to the database.
The Integration Service generates and executes a SELECT statement based on the transformation
logic for each transformation it can push to the database. Then, it reads the results of this SQL
query and processes the remaining transformations.
Running Target-Side Pushdown Optimization Sessions
When you run a session configured for target-side pushdown optimization, the Integration Service
analyzes the mapping from the target to the source or until it reaches an upstream transformation
it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on
the transformation logic for each transformation it can push to the database. The Integration
Service processes the transformation logic up to the point that it can push the transformation logic
to the target database. Then, it executes the generated SQL.
Running Full Pushdown Optimization Sessions
To use full pushdown optimization, the source and target databases must be in the same relational
database management system. When you run a session configured for full pushdown optimization,
the Integration Service analyzes the mapping from the source to the target or until it reaches a
downstream transformation it cannot push to the target database. It generates and executes SQL
statements against the source or target based on the transformation logic it can push to the
database.
When you run a session with large quantities of data and full pushdown optimization, the database
server must run a long transaction. Consider the following database performance issues when you
generate a long transaction:
A long transaction uses more database resources.
A long transaction locks the database for longer periods of time. This reduces database
concurrency and increases the likelihood of deadlock.
A long transaction increases the likelihood of an unexpected event.
To minimize database performance issues for long transactions, consider using source-side or
target-side pushdown optimization.
Integration Service Behavior with Full Optimization
When you configure a session for full optimization, the Integration Service analyzes the mapping
from the source to the target or until it reaches a downstream transformation it cannot push to the
target database. If the Integration Service cannot push all transformation logic to the target
database, it tries to push all transformation logic to the source database. If it cannot push all
transformation logic to the source or target, the Integration Service pushes as much
transformation logic as possible to the source database, processes intermediate transformations that it cannot
push to any database, and then pushes the remaining transformation logic to the target database.

The Integration Service generates and executes an INSERT SELECT, DELETE, or UPDATE statement
for each database to which it pushes transformation logic.
For example, a mapping contains the following transformations, in order: Source Qualifier,
Aggregator, Rank, Expression, and the target.
The Rank transformation cannot be pushed to the source or target database. If you configure the
session for full pushdown optimization, the Integration Service pushes the Source Qualifier
transformation and the Aggregator transformation to the source, processes the Rank
transformation, and pushes the Expression transformation and target to the target database. The
Integration Service does not fail the session if it can push only part of the transformation logic to
the database.
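The split described in this example can be sketched as a small planning function. This is a simplified Python model, not the Integration Service's actual algorithm: the function plan_full_pushdown is hypothetical, and "pushable" is reduced to a simple set of transformation names.

```python
def plan_full_pushdown(pipeline, pushable):
    """Split an ordered pipeline into (source side, Integration Service, target side).

    pipeline: transformation names in order from source to target.
    pushable: set of names that the database is able to process.
    """
    # If everything can be pushed, it all goes to the target database.
    if all(step in pushable for step in pipeline):
        return [], [], list(pipeline)
    # Walk downstream from the source until the first step that cannot be pushed.
    i = 0
    while pipeline[i] in pushable:
        i += 1
    # Walk upstream from the target until the last step that cannot be pushed.
    j = len(pipeline)
    while pipeline[j - 1] in pushable:
        j -= 1
    return pipeline[:i], pipeline[i:j], pipeline[j:]

mapping = ["Source Qualifier", "Aggregator", "Rank", "Expression", "Target"]
source_side, in_service, target_side = plan_full_pushdown(
    mapping, pushable={"Source Qualifier", "Aggregator", "Expression", "Target"}
)
```

For the mapping above, the Rank transformation is the only step left for the Integration Service, matching the behavior the text describes.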

Active and Idle Databases


During pushdown optimization, the Integration Service pushes the transformation logic to one
database, which is called the active database. A database that does not process transformation
logic is called an idle database. For example, a mapping contains two sources that are joined by a
Joiner transformation. If the session is configured for source-side pushdown optimization, the
Integration Service pushes the Joiner transformation logic to the source in the detail pipeline,
which is the active database. The source in the master pipeline is the idle database because it
does not process transformation logic.
The Integration Service uses the following criteria to determine which database is active or idle:
1. When using full pushdown optimization, the target database is active and the source database
is idle.
2. In sessions that contain a Lookup transformation, the source or target database is active, and
the lookup database is idle.
3. In sessions that contain a Joiner transformation, the source in the detail pipeline is active,
and the source in the master pipeline is idle.
4. In sessions that contain a Union transformation, the source in the first input group is active.
The sources in other input groups are idle.
To push transformation logic to an active database, the database user account of the active
database must be able to read from the idle databases.

Working with Databases


You can configure pushdown optimization for the following databases:
IBM DB2
Microsoft SQL Server
Netezza
Oracle
Sybase ASE
Teradata
Databases that use ODBC drivers
When you push transformation logic to a database, the database may produce different output
than the Integration Service. In addition, the Integration Service can usually push more
transformation logic to a database if you use a native driver, instead of an ODBC driver.
Comparing the Output of the Integration Service and Databases

The Integration Service and databases can produce different results when processing the same
transformation logic. The Integration Service sometimes converts data to a different format when
it reads data. The Integration Service and database may also handle null values, case sensitivity,
and sort order differently.
The database and Integration Service produce different output when the following settings and
conversions are different:
Nulls treated as the highest or lowest value. The Integration Service and a database can treat
null values differently. For example, you want to push a Sorter transformation to an Oracle
database. In the session, you configure nulls as the lowest value in the sort order. Oracle
treats null values as the highest value in the sort order.
Sort order. The Integration Service and a database can use different sort orders. For example,
you want to push the transformations in a session to a Microsoft SQL Server database, which is
configured to use a sort order that is not case sensitive. You configure the session properties
to use the binary sort order, which is case sensitive. The results differ based on whether the
Integration Service or the Microsoft SQL Server database processes the transformation logic.
Case sensitivity. The Integration Service and a database can treat case sensitivity differently.
For example, the Integration Service uses case sensitive queries and the database does not. A
Filter transformation uses the following filter condition: IIF(col_varchar2 = 'CA', TRUE, FALSE).
You need the database to return rows that match 'CA'. However, if you push this transformation
logic to a Microsoft SQL Server database that is not case sensitive, it returns rows that match
the values 'Ca', 'ca', 'cA', and 'CA'.
Numeric values converted to character values. The Integration Service and a database can
convert the same numeric value to a character value in different formats. The database can
convert numeric values to an unacceptable character format. For example, a table contains the
number 1234567890. When the Integration Service converts the number to a character value, it
inserts the characters '1234567890'. However, a database might convert the number to '1.2E9'.
The two sets of characters represent the same value. However, if you require the characters in
the format '1234567890', you can disable pushdown optimization.

Precision. The Integration Service and a database can have different precision for particular
datatypes. Transformation datatypes use a default numeric precision that can vary from the
native datatypes. For example, a transformation Decimal datatype has a precision of 1-28. The
corresponding Teradata Decimal datatype has a precision of 1-18. The results can vary if the
database uses a different precision than the Integration Service.
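Two of the differences listed above, null ordering and case sensitivity, are easy to reproduce in a few lines. The Python sketch below does not call any database; the functions sort_with_nulls and filter_state are hypothetical helpers that imitate the two behaviors so the differing results can be compared side by side.

```python
def sort_with_nulls(values, nulls="lowest"):
    """Sort values, placing None first ("lowest") or last ("highest")."""
    present = sorted(v for v in values if v is not None)
    missing = [None] * sum(v is None for v in values)
    return missing + present if nulls == "lowest" else present + missing

def filter_state(rows, wanted, case_sensitive=True):
    """Keep rows equal to `wanted`, with or without regard to letter case."""
    if case_sensitive:
        return [r for r in rows if r == wanted]
    return [r for r in rows if r.lower() == wanted.lower()]

data = [3, None, 1]
session_order = sort_with_nulls(data, nulls="lowest")    # session setting in the example
oracle_order = sort_with_nulls(data, nulls="highest")    # Oracle behavior per the text

states = ["CA", "Ca", "ca", "NY"]
strict_match = filter_state(states, "CA")                        # case sensitive
loose_match = filter_state(states, "CA", case_sensitive=False)   # not case sensitive
```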

Using ODBC Drivers


When you use native drivers for all databases, except Netezza, the Integration Service generates
SQL statements using native database SQL. When you use ODBC drivers, the Integration Service
usually cannot detect the database type. As a result, it generates SQL statements using ANSI SQL.
The Integration Service can generate more functions when it generates SQL statements using the
native language than ANSI SQL.
Note: Although the Integration Service uses an ODBC driver for the Netezza database, the
Integration Service detects that the database is Netezza and generates native database SQL when
pushing the transformation logic to the Netezza database.
In some cases, ANSI SQL is not compatible with the database syntax. The following sections
describe problems that you can encounter when you use ODBC drivers. When possible, use native
drivers to prevent these problems.

Working with Dates


The Integration Service and database can process dates differently. When you configure the
session to push date conversion to the database, you can receive unexpected results or the
session can fail.
The database can produce different output than the Integration Service when the following date
settings and conversions are different:
Date values converted to character values. The Integration Service converts the transformation
Date/Time datatype to the native datatype that supports subsecond precision in the database.
The session fails if you configure the datetime format in the session to a format that the
database does not support. For example, when the Integration Service performs the ROUND
function on a date, it stores the date value in a character column, using the format
MM/DD/YYYY HH:MI:SS.US. When the database performs this function, it stores the date in the
default date format for the database. If the database is Oracle, it stores the date as the
default DD-MON-YY. If you require the date to be in the format MM/DD/YYYY HH:MI:SS.US, you can
disable pushdown optimization.
Date formats for TO_CHAR and TO_DATE functions. The Integration Service uses the date format in
the TO_CHAR or TO_DATE function when the Integration Service pushes the function to the
database. The database converts each date string to a datetime value supported by the database.
For example, the Integration Service pushes the following expression to the database:
TO_DATE( DATE_PROMISED, 'MM/DD/YY' )
The database interprets the date string in the DATE_PROMISED port based on the specified date
format string MM/DD/YY. The database converts each date string, such as 01/22/98, to the
supported date value, such as Jan 22 1998 00:00:00.
If the Integration Service pushes a date format to an IBM DB2, a Microsoft SQL Server, or a Sybase
database that the database does not support, the Integration Service stops pushdown optimization
and processes the transformation.
The Integration Service converts all dates before pushing transformations to an Oracle or Teradata
database. If the database does not support the date format after the date conversion, the session
fails.
HH24 date format. You cannot use the HH24 format in the date format string for Teradata. When
the Integration Service generates SQL for a Teradata database, it uses the HH format string
instead.
Blank spaces in date format strings. You cannot use blank spaces in the date format string in
Teradata. When the Integration Service generates SQL for a Teradata database, it substitutes the
space with 'B'.
Handling subsecond precision for a Lookup transformation. If you enable subsecond precision for
a Lookup transformation, the database and Integration Service perform the lookup comparison
using the subsecond precision, but return different results. Unlike the Integration Service, the
database does not truncate the lookup results based on subsecond precision. For example, you
configure the Lookup transformation to show subsecond precision to the millisecond. If the
lookup result is 8:20:35.123456, a database returns 8:20:35.123456, but the Integration Service
returns 8:20:35.123.
SYSDATE built-in variable. When you use the SYSDATE built-in variable, the Integration Service
returns the current date and time for the node running the service process. However, when you
push the transformation logic to the database, the SYSDATE variable returns the current date
and time for the machine hosting the database. If the time zone of the machine hosting the
database is not the same as the time zone of the machine running the Integration Service
process, the results can vary.
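Two of the date behaviors above can be imitated with Python's datetime module. This is a sketch for illustration only: the helper names are hypothetical, and the two-digit-year rule shown is Python's own (values 69-99 map to 1969-1999), which a given database may or may not share.

```python
from datetime import datetime

def truncate_to_milliseconds(ts):
    """Drop everything below the millisecond, as in the Lookup example above."""
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)

def parse_mm_dd_yy(text):
    """Interpret a string in MM/DD/YY form, similar to TO_DATE(x, 'MM/DD/YY')."""
    return datetime.strptime(text, "%m/%d/%y")

lookup_result = datetime(2024, 1, 1, 8, 20, 35, 123456)   # made-up date, time from the text
service_view = truncate_to_milliseconds(lookup_result)    # what a truncating reader keeps
promised = parse_mm_dd_yy("01/22/98")
```

Here the full value keeps 123456 microseconds while the truncated one keeps 123000, which is the kind of mismatch that can make the two lookup results differ.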

http://shan-informatica.blogspot.com/


Datawarehouse - BASIC DEFINITIONS (by Shankar Prasad)


DWH : is a repository of integrated information, specifically structured for queries and analysis. Data
and information are extracted from heterogeneous sources as they are generated. This makes it much
easier and more efficient to run queries over data that originally came from different sources.
Data Mart : is a collection of subject areas organized for decision support based on the needs of a
given department, e.g. sales or marketing. The data mart is designed to suit the needs of a department.
Data in a data mart is much less granular than data in the warehouse.
Data Warehouse : is used at the enterprise level, while data marts are used at the business division /
department level. Data warehouses are arranged around the corporate subject areas found in the
corporate data model. Data warehouses contain more detailed information, while most data marts contain
more summarized or aggregated data.
OLTP : Online Transaction Processing. This is standard, normalized database structure. OLTP is
designed for Transactions, which means that inserts, updates and deletes must be fast.
OLAP : Online Analytical Processing. Read-only, historical, aggregated data.
Fact Table : contain the quantitative measures about the business
Dimension Table : descriptive data about the facts (business)
Conformed dimensions : dimension tables shared by multiple fact tables. These tables connect separate
star schemas into an enterprise star schema.
Star Schema : is a set of tables comprising a single, central fact table surrounded by de-normalized
dimensions. Star schemas implement dimensional data structures with de-normalized dimensions.
Snow Flake : is a set of tables comprising a single, central fact table surrounded by normalized
dimension hierarchies. Snowflake schemas implement dimensional data structures with fully
normalized dimensions.

Staging Area : it is the work place where raw data is brought in, cleaned, combined, archived and
exported to one or more data marts. The purpose of data staging area is to get data ready for loading
into a presentation layer.
Queries : The DWH contains 2 types of queries. There will be fixed queries that are clearly defined
and well understood, such as regular reports, canned queries and common aggregations.
There will also be ad hoc queries that are unpredictable, both in quantity and frequency.
Ad Hoc Queries : are the starting point for any analysis into a database. The ability to run any query
when desired and expect a reasonable response is what makes the data warehouse worthwhile and makes
the design such a significant challenge.
The end-user access tools are capable of automatically generating the database query that answers
any question posed by the user.
Canned Queries : are pre-defined queries. Canned queries contain prompts that allow you to
customize the query for your specific needs
Kimball (Bottom up) vs Inmon (Top down) approaches :
According to Ralph Kimball, when you plan to design analytical solutions for an enterprise, start by
building data marts. Once you have 3 or 4 such data marts, you effectively have an enterprise-wide data
warehouse, built up without time and effort spent exclusively on building the EDWH, because
the time required to build a data mart is less than for an EDWH.
INMON : build an enterprise-wide data warehouse first; all the data marts will then be subsets
of the EDWH. According to him, independent data marts cannot make up an enterprise data warehouse
under any circumstance; they remain isolated pieces of information (stovepipes).
************************************************************************************************************************
Dimensional Data Model :
Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal
form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be
stored differently in a dimensional model than in a 3rd normal form model.
To understand dimensional data modeling, let's define some of the terms commonly used in this type of
modeling:
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes within a
dimension. For example, one possible hierarchy in the Time dimension is Year --> Quarter --> Month --> Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be
such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can
be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a
store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the
lookup table for the Quarter attribute would include a list of all of the quarters available in the data
warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter,
and one or more additional fields that specifies how that particular quarter is represented on a report (for
example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables,
but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by
lookup tables. Attributes are the non-key columns in the lookup tables.
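A tiny version of this structure can be built with Python's sqlite3 module to show how a fact table joins to its lookup tables. The table names, column names, and sample rows below are all made up for illustration; they follow the "sales amount by store by day" example but are not taken from any real warehouse.

```python
import sqlite3

def sales_by_quarter():
    """Build a small fact table with two lookup tables and total sales per quarter."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE date_lookup  (date_key INTEGER PRIMARY KEY, quarter_label TEXT);
        CREATE TABLE store_lookup (store_key INTEGER PRIMARY KEY, store_name TEXT);
        CREATE TABLE sales_fact (
            date_key     INTEGER REFERENCES date_lookup(date_key),
            store_key    INTEGER REFERENCES store_lookup(store_key),
            sales_amount REAL
        );
        INSERT INTO date_lookup  VALUES (20010115, 'Q1 2001'), (20010410, 'Q2 2001');
        INSERT INTO store_lookup VALUES (1, 'Downtown'), (2, 'Airport');
        INSERT INTO sales_fact   VALUES (20010115, 1, 100.0),
                                        (20010115, 2, 50.0),
                                        (20010410, 1, 75.0);
    """)
    # The fact table holds the measure; the lookup table supplies the report label.
    rows = conn.execute("""
        SELECT d.quarter_label, SUM(f.sales_amount)
        FROM sales_fact AS f
        JOIN date_lookup AS d ON d.date_key = f.date_key
        GROUP BY d.quarter_label
        ORDER BY d.quarter_label
    """).fetchall()
    conn.close()
    return rows
```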
In designing data models for data warehouses / data marts, the most commonly used schema types are Star
Schema and Snowflake Schema.
Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially
connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or
complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star
explodes into more points. The main advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joining smaller lookup tables. The main
disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased
number of lookup tables.
Whether one uses a star or a snowflake largely depends on personal preference and business needs.
Personally, I am partial to snowflakes, when there is a business case to analyze the information at that
particular level.
Slowly Changing Dimensions:
The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this
applies to cases where the attribute for a record varies over time. We give an example below:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer
lookup table has the following record:
Customer Key    Name         State
1001            Christina    Illinois
At a later date, in January 2003, she moved to Los Angeles, California. How should ABC Inc. now modify its
customer table to reflect this change? This is the "Slowly Changing Dimension" problem.
There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated
essentially as two people.
Type 3: The original record is modified to reflect the change.
We next take a look at each of the scenarios and at what the data model and the data look like for each of
them. Finally, we compare and contrast the three alternatives.
Type 1 Slowly Changing Dimension:
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other
words, no history is kept.
In our example, recall we originally have the following table:
Customer Key    Name         State
1001            Christina    Illinois
After Christina moved from Illinois to California, the new information replaces the original record, and we
have the following table:
Customer Key    Name         State
1001            Christina    California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep
track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in
this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep
track of historical changes.
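A minimal sketch of the Type 1 overwrite, using a plain Python dictionary in place of the customer dimension table. The function name `apply_type1` is hypothetical.

```python
# The customer dimension, keyed by customer key (in-memory stand-in for a table).
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}

def apply_type1(dim, customer_key, new_state):
    """Overwrite the attribute in place; the previous value is not kept anywhere."""
    dim[customer_key]["state"] = new_state

apply_type1(customer_dim, 1001, "California")
```

After the call there is still exactly one row for key 1001, and nothing in it records that the state was ever Illinois.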
Type 2 Slowly Changing Dimension:
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information.
Therefore, both the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key    Name         State
1001            Christina    Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer Key    Name         State
1001            Christina    Illinois
1005            Christina    California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very
high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track
historical changes.
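A minimal sketch of a Type 2 change, assuming two columns that the example table above does not show: a natural key (`customer_id`) that ties the two rows to the same person, and an `is_current` flag. Both are illustrative additions, as is the function name `apply_type2`.

```python
rows = [
    {"customer_key": 1001, "customer_id": "C-1", "name": "Christina",
     "state": "Illinois", "is_current": True},
]

def apply_type2(rows, customer_id, new_state, new_key):
    """Close the current row and append a new row with its own primary key."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
            new_row = dict(row, customer_key=new_key, state=new_state, is_current=True)
            rows.append(new_row)
            return new_row
    raise KeyError(customer_id)

apply_type2(rows, "C-1", "California", new_key=1005)
```

Both the Illinois row (key 1001) and the California row (key 1005) remain in the table, which is why the table grows with every change.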
Type 3 Slowly Changing Dimension :
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of
interest, one indicating the original value, and one indicating the current value. There will also be a column
that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key    Name         State
1001            Christina    Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have the
following table (assuming the effective date of change is January 15, 2003):
Customer Key    Name         Original State    Current State    Effective Date
1001            Christina    Illinois          California       15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if
Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track
historical changes, and when such changes will only occur a finite number of times.
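A minimal sketch of Type 3, assuming the row starts out with the current state equal to the original state. The function name `apply_type3` is hypothetical. Applying two moves in a row shows the limitation noted above: the intermediate value is lost.

```python
from datetime import date

customer = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",   # assumption: starts equal to the original
    "effective_date": None,
}

def apply_type3(row, new_state, effective):
    """Update the current-value column in place; the original column is untouched."""
    row["current_state"] = new_state
    row["effective_date"] = effective

apply_type3(customer, "California", date(2003, 1, 15))
apply_type3(customer, "Texas", date(2003, 12, 15))   # California is now gone
```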
Surrogate key :
A surrogate key is frequently a sequential number but doesn't have to be. Having the key independent of all
other columns insulates the database relationships from changes in data values or database design and
guarantees uniqueness.
Some database designers use surrogate keys religiously regardless of the suitability of other candidate keys.
However, if a good key already exists, the addition of a surrogate key will merely slow down access,
particularly if it is indexed.
The concept of the surrogate key is important in data warehousing; surrogate means deputy or substitute.
A surrogate key is a small integer (say, 4 bytes) that uniquely identifies a record in the dimension
table; however, it has no business meaning. Data warehouse experts suggest that the production keys used
in the source databases should not be used as primary keys in the dimension tables; automatically
generated surrogate keys should be used in their place.
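One way to picture surrogate key generation is a sequence that hands out the next integer the first time a production (natural) key is seen and returns the same integer afterwards. The class name and the sample production keys below are made up for illustration; real tools typically use a database sequence or a sequence generator step.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Maps each production key to a small, meaningless sequential integer."""

    def __init__(self, start=1):
        self._next = count(start)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Assign a new surrogate only the first time this natural key appears.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]

keys = SurrogateKeyGenerator()
first = keys.key_for("CUST-A17")
second = keys.key_for("CUST-B02")
again = keys.key_for("CUST-A17")   # same production key, same surrogate
```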

Conceptual, Logical, And Physical Data Models:


There are three levels of data modeling. They are conceptual, logical, and physical. This section will explain
the difference among the three, the order with which each one is created, and how to go from one level to
the other.
Conceptual Data Model
Features of conceptual data model include:
Includes the important entities and the relationships among them.
No attribute is specified.
No primary key is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the different
entities.
Logical Data Model
Features of logical data model include:
Includes all entities and relationships among them.
All attributes for each entity are specified.
The primary key for each entity is specified.
Foreign keys (keys identifying the relationship between different entities) are specified.
Normalization occurs at this level.

At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to
how they will be physically implemented in the database.
In data warehousing, it is common for the conceptual data model and the logical data model to be combined
into a single step (deliverable).
The steps for designing the logical data model are as follows:
1. Identify all entities.
2. Specify primary keys for all entities.
3. Find the relationships between different entities.
4. Find all attributes for each entity.
5. Resolve many-to-many relationships.
6. Normalization.

Physical Data Model


Features of physical data model include:
Specification of all tables and columns.
Foreign keys are used to identify relationships between tables.
Denormalization may occur based on user requirements.
Physical considerations may cause the physical data model to be quite different from the logical data
model.

At this level, the data modeler will specify how the logical data model will be realized in the database
schema.
The steps for physical data model design are as follows:
1. Convert entities into tables.
2. Convert relationships into foreign keys.
3. Convert attributes into columns.
4. Modify the physical data model based on physical constraints / requirements.

What Is OLAP :
OLAP stands for On-Line Analytical Processing. The first attempt to provide a definition to OLAP was by Dr.
Codd, who proposed 12 rules for OLAP. Later, it was discovered that this particular white paper was sponsored
by one of the OLAP tool vendors, thus causing it to lose objectivity. The OLAP Report has proposed the FASMI
test, Fast Analysis of Shared Multidimensional Information. For a more detailed description of both Dr. Codd's
rules and the FASMI test, please visit The OLAP Report.
For people on the business side, the key feature out of the above list is "Multidimensional." In other words,
the ability to analyze metrics in different dimensions such as time, geography, gender, product, etc. For
example, sales for the company is up. What region is most responsible for this increase? Which store in this
region is most responsible for the increase? What particular product category or categories contributed the
most to the increase? Answering these types of questions in order means that you are performing an OLAP
analysis.
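The chain of questions above (which region, then which store, then which category) can be sketched as repeated grouping along one dimension at a time. The sales figures and the helper names `increase_by` and `top` are invented for the example.

```python
from collections import defaultdict

# (region, store, category, last_year, this_year) - made-up figures
sales = [
    ("East", "E1", "Toys", 100, 180),
    ("East", "E2", "Books", 90, 95),
    ("West", "W1", "Toys", 120, 125),
]

def increase_by(rows, dim_index):
    """Total year-over-year increase grouped by the dimension at dim_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim_index]] += row[4] - row[3]
    return dict(totals)

def top(totals):
    """Dimension member with the largest increase."""
    return max(totals, key=totals.get)

# Drill down: region first, then store within that region, then category.
best_region = top(increase_by(sales, 0))
region_rows = [r for r in sales if r[0] == best_region]
best_store = top(increase_by(region_rows, 1))
store_rows = [r for r in region_rows if r[1] == best_store]
best_category = top(increase_by(store_rows, 2))
```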
Depending on the underlying technology used, OLAP can be broadly divided into two different camps:
Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies
that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The
storage is not in the relational database, but in proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and
dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube is
built, it is not possible to include a large amount of data in the cube itself. This is not to say that the
data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this
case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist in the
organization. Therefore, adopting MOLAP technology is likely to require additional investments in human
and capital resources.
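The trade-off described above, where everything is calculated when the cube is built so that reads are fast, can be sketched with a dictionary that stores every roll-up in advance. This is only a toy model of the idea; real MOLAP engines use proprietary storage formats. The `"ALL"` marker and the function name `build_cube` are assumptions for the example.

```python
from itertools import product

# (region, category, amount) - made-up detail rows
facts = [("East", "Toys", 100), ("East", "Books", 50), ("West", "Toys", 70)]

def build_cube(rows):
    """Pre-compute totals for every combination, with "ALL" as the roll-up member."""
    cube = {}
    for region, category, amount in rows:
        for r, c in product((region, "ALL"), (category, "ALL")):
            cube[(r, c)] = cube.get((r, c), 0) + amount
    return cube

cube = build_cube(facts)   # all the work happens here, at build time

# Slicing and dicing is now a plain lookup with no calculation.
east_total = cube[("East", "ALL")]
toys_total = cube[("ALL", "Toys")]
```

The number of stored cells grows with the product of the dimension sizes, which is the reason for the data-volume limitation listed above.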
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of
traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause in the SQL statement.
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on
data size of the underlying relational database. In other words, ROLAP itself places no limitation on
data amount.
Can leverage functionalities inherent in the relational database: Often, relational database already
comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational
database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL
queries) in the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements
to query the relational database, and SQL statements do not fit all needs (for example, it is difficult
to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by
what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box
complex functions as well as the ability to allow users to define their own functions.
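The statement that each slice or dice amounts to adding a "WHERE" clause can be sketched as a small query builder over an in-memory SQLite table. The table, its columns, and the function name `total_sales` are hypothetical; a real ROLAP tool generates far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Toys", 100.0), ("East", "Books", 50.0), ("West", "Toys", 70.0)],
)

ALLOWED_DIMENSIONS = {"region", "category"}

def total_sales(conn, **slices):
    """Each keyword argument adds one 'column = ?' condition to the WHERE clause."""
    clauses, params = [], []
    for column, value in slices.items():
        if column not in ALLOWED_DIMENSIONS:   # never interpolate unknown names
            raise ValueError(f"unknown dimension: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = "SELECT COALESCE(SUM(amount), 0) FROM sales"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchone()[0]

everything = total_sales(conn)
east_only = total_sales(conn, region="East")
east_toys = total_sales(conn, region="East", category="Toys")
```

Every call runs a fresh query against the detail rows, which is why response time depends on the size of the underlying data.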
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information,
HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can
"drill through" from the cube into the underlying relational data.
Bill Inmon vs. Ralph Kimball:
In the data warehousing field, we often hear about discussions on where a person / organization's philosophy
falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe below the difference between the two.
Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise
has one data warehouse, and data marts source their information from the data warehouse. In the data
warehouse, information is stored in 3rd normal form.
Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise.
Information is always stored in the dimensional model.
There is no right or wrong between these two ideas, as they represent different data warehousing
philosophies. In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea. This is
because most data warehouses started out as a departmental effort, and hence they originated as a data
mart. Only when more data marts are built later do they evolve into a data warehouse.
********************************************************************************
******************** Shankar Prasad ****************************************
********************************************************************************

Informatica Interview Question Answer:


by Shankar Prasad
----------------------------------------------------------------------------------------------------------------------
Q. What are Target Types on the Server?
A. Target Types are File, Relational and ERP.
Q. How do you identify existing rows of data in the target table using
lookup transformation?
A. There are two ways to look up the target table to verify whether a row exists:
1. Use a connected Lookup with a dynamic cache and check the value of the
NewLookupRow output port to decide whether the incoming record already exists
in the table / cache.
2. Use an unconnected Lookup, call it from an Expression transformation, and
check the value it returns (Null / Not Null) to decide whether the
incoming record already exists in the table.
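The first approach can be pictured with a dictionary standing in for the dynamic lookup cache. The sketch assumes the usual NewLookupRow convention of 0 for no change, 1 for a row inserted into the cache, and 2 for a row updated in the cache; the function name and sample rows are illustrative and this is not Informatica code.

```python
def new_lookup_row(cache, key, values):
    """Return 0 (no change), 1 (inserted into cache) or 2 (updated in cache)."""
    if key not in cache:
        cache[key] = values
        return 1
    if cache[key] != values:
        cache[key] = values
        return 2
    return 0

# Cache pre-loaded from the target table (made-up contents).
target_cache = {1001: ("Christina", "Illinois")}

incoming = [
    (1001, ("Christina", "Illinois")),     # already in target, unchanged
    (1001, ("Christina", "California")),   # exists, but values differ
    (1005, ("Dana", "Texas")),             # not in target yet
]
flags = [new_lookup_row(target_cache, key, values) for key, values in incoming]
```

A downstream step would then route rows flagged 1 to an insert and rows flagged 2 to an update.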
Q. What are Aggregate transformations?
A. The Aggregator transformation is much like the GROUP BY clause in traditional SQL.
It is a connected, active transformation that takes the incoming data from the
mapping pipeline, groups it by the specified group-by ports, and calculates
aggregate functions (AVG, SUM, COUNT, STDDEV, etc.) for each of those groups.
From a performance perspective, if your mapping has an Aggregator transformation,
use filters and sorters very early in the pipeline if there is any need for them.
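The GROUP BY comparison can be sketched in a few lines: rows are collected per group, and the aggregate functions are applied once all rows have been read. The column names and the function name `aggregate` are made up for the example.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"dept": "A", "salary": 100},
    {"dept": "B", "salary": 80},
    {"dept": "A", "salary": 140},
]

def aggregate(rows, group_by, value):
    """Group rows on one port and compute SUM, AVG and COUNT for each group."""
    groups = defaultdict(list)
    for row in rows:                      # every group is held until input ends
        groups[row[group_by]].append(row[value])
    return {
        key: {"sum": sum(vals), "avg": mean(vals), "count": len(vals)}
        for key, vals in groups.items()
    }

summary = aggregate(rows, "dept", "salary")
```

Three input rows become two output rows, which is why the Aggregator counts as an active transformation.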
Q. What are various types of Aggregation?
A. Various types of aggregation are SUM, AVG, COUNT, MAX, MIN, FIRST, LAST,
MEDIAN, PERCENTILE, STDDEV, and VARIANCE.
Q. What are Dimensions and various types of Dimension?
A. Slowly changing dimensions are classified into 3 types:
1. SCD Type 1 (Slowly Changing Dimension): this contains current data only.
2. SCD Type 2 (Slowly Changing Dimension): this contains current data +
complete historical data.
3. SCD Type 3 (Slowly Changing Dimension): this contains current data +
partial historical data.
Q. What are 2 modes of data movement in Informatica Server?
A. The data movement mode depends on whether Informatica Server should
process single byte or multi-byte character data. This mode selection can affect
the enforcement of code page relationships and code page validation in the
Informatica Client and Server.
a) Unicode - IS allows 2 bytes for each character and uses additional byte for each
non-ascii character (such as Japanese characters)
b) ASCII - IS holds all data in a single byte
The IS data movement mode can be changed in the Informatica Server
configuration parameters. This comes into effect once you restart the Informatica
Server.

Q. What is Code Page Compatibility?
A. Compatibility between code pages is used for accurate data movement when
the Informatica Server runs in the Unicode data movement mode. If the code
pages are identical, then there will not be any data loss. One code page can be a
subset or superset of another. For accurate data movement, the target code page
must be a superset of the source code page.
Superset - A code page is a superset of another code page when it contains all the
characters encoded in the other code page and also contains additional characters
not contained in the other code page.
Subset - A code page is a subset of another code page when all characters in the
code page are encoded in the other code page.
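The superset rule can be demonstrated with Python's built-in codecs, which play the role of code pages here: text moves without loss only when the target encoding contains every character of the source text. The function name `can_move_without_loss` is hypothetical.

```python
def can_move_without_loss(text, target_encoding):
    """True if every character in text exists in the target code page."""
    try:
        text.encode(target_encoding)
        return True
    except UnicodeEncodeError:
        return False

source_text = "café"   # contains one non-ASCII character

# Latin-1 is a superset of ASCII, so it can hold this text; ASCII cannot.
ok_latin1 = can_move_without_loss(source_text, "latin-1")
ok_ascii = can_move_without_loss(source_text, "ascii")
```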
Q. What is Code Page used for?
A. A code page is used to identify characters that might be in different languages. If
you are importing Japanese data into a mapping, you must select the Japanese code
page for the source data.

Q. What is Router transformation?


A. It is different from filter transformation in that we can specify multiple
conditions and route the data to multiple targets depending on the condition.
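A rough sketch of the difference: a filter has one condition and one output, while a router tests each row against several group conditions and sends it to every group it satisfies. The sketch also assumes a default group for rows that match nothing. The group names and the function name `route` are made up for the example.

```python
def route(rows, groups):
    """groups is a list of (name, predicate); unmatched rows go to DEFAULT."""
    routed = {name: [] for name, _ in groups}
    routed["DEFAULT"] = []
    for row in rows:
        matched = False
        for name, predicate in groups:
            if predicate(row):
                routed[name].append(row)
                matched = True
        if not matched:
            routed["DEFAULT"].append(row)
    return routed

rows = [
    {"state": "CA", "amount": 500},
    {"state": "NY", "amount": 50},
    {"state": "CA", "amount": 20},
]
routed = route(rows, [
    ("CALIFORNIA", lambda r: r["state"] == "CA"),
    ("BIG_ORDERS", lambda r: r["amount"] >= 100),
])
```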

Q. What is Load Manager?


A. While running a Workflow, the PowerCenter Server uses the Load Manager
process and the Data Transformation Manager Process (DTM) to run the workflow
and carry out workflow tasks. When the PowerCenter Server runs a workflow, the
Load Manager performs the following tasks:
1. Locks the workflow and reads workflow properties.
2. Reads the parameter file and expands workflow variables.
3. Creates the workflow log file.
4. Runs workflow tasks.
5. Distributes sessions to worker servers.
6. Starts the DTM to run sessions.
7. Runs sessions from master servers.
8. Sends post-session email if the DTM terminates abnormally.

When the PowerCenter Server runs a session, the DTM performs the following
tasks:
1. Fetches session and mapping metadata from the repository.
2. Creates and expands session variables.
3. Creates the session log file.
4. Validates session code pages if data code page validation is enabled. Checks
query conversions if data code page validation is disabled.
5. Verifies connection object permissions.
6. Runs pre-session shell commands.
7. Runs pre-session stored procedures and SQL.
8. Creates and runs mappings, reader, writer, and transformation threads to
extract, transform, and load data.
9. Runs post-session stored procedures and SQL.
10. Runs post-session shell commands.
11. Sends post-session email.

Q. What is Data Transformation Manager?


A. After the load manager performs validations for the session, it creates the DTM
process. The DTM process is the second process associated with the session run.
The primary purpose of the DTM process is to create and manage threads that
carry out the session tasks.
The DTM allocates process memory for the session and divides it into buffers.
This is also known as buffer memory. It creates the main thread, which is called
the master thread. The master thread creates and manages all other threads.
If we partition a session, the DTM creates a set of threads for each partition
to allow concurrent processing. When the Informatica server writes messages to the
session log, it includes the thread type and thread ID.
Following are the types of threads that DTM creates:
Master Thread - Main thread of the DTM process. Creates and manages all other
threads.
Mapping Thread - One thread for each session. Fetches session and mapping
information.
Pre and Post Session Thread - One thread each to perform pre-session and post-session
operations.
Reader Thread - One thread for each partition for each source pipeline.
Writer Thread - One thread for each partition, if a target exists in the source pipeline,
to write to the target.
Transformation Thread - One or more transformation threads for each partition.
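The reader, transformation, and writer threads can be pictured as three stages connected by buffers, all started and joined by a master thread. The sketch below models a single partition with Python's standard `threading` and `queue` modules; it illustrates the thread roles only, not how the DTM is actually implemented.

```python
import queue
import threading

DONE = object()   # sentinel that marks the end of the data

def reader(source, out_q):
    for row in source:
        out_q.put(row)
    out_q.put(DONE)

def transformer(in_q, out_q):
    while (row := in_q.get()) is not DONE:
        out_q.put(row.upper())            # stand-in for the mapping logic
    out_q.put(DONE)

def writer(in_q, target):
    while (row := in_q.get()) is not DONE:
        target.append(row)

def run_pipeline(source):
    """The calling thread plays the master: it creates and manages the others."""
    q1, q2, target = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=reader, args=(source, q1)),
        threading.Thread(target=transformer, args=(q1, q2)),
        threading.Thread(target=writer, args=(q2, target)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return target

loaded = run_pipeline(["illinois", "california"])
```

With more partitions, the master would start one such set of threads per partition.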

Q. What is Session and Batches?


A. Session - A session is a set of instructions that tells the Informatica Server
how and when to move data from sources to targets. After creating the session,
we can use either the Server Manager or the command line program pmcmd to
start or stop the session.
Batches - A batch provides a way to group sessions for either serial or parallel
execution by the Informatica Server. There are two types of batches:
1. Sequential - Runs sessions one after the other.
2. Concurrent - Runs sessions at the same time.
Q. What is a source qualifier?
A. It represents all data queried from the source.
Q. Why do we use lookup transformations?
A. Lookup transformations can access data from relational tables that are not
sources in the mapping. With a Lookup transformation, we can accomplish the following
tasks:
Get a related value - Get the employee name from the Employee table based on the
employee ID.
Perform a calculation.
Update slowly changing dimension tables - We can use an unconnected Lookup
transformation to determine whether the records already exist in the target.
Q. While importing a relational source definition from a database, what
source metadata do you import?
A.
Source name
Database location
Column names
Data types
Key constraints
Q. In how many ways can you update a relational source
definition, and what are they?
A. Two ways:
1. Edit the definition
2. Reimport the definition
Q. Where should you place the flat file to import the flat file
definition to the designer?
A. Place it in a local folder.
Q. Which transformation do you need when using COBOL sources as
source definitions?
A. The Normalizer transformation, which is used to normalize the data, since COBOL
sources often consist of denormalized data.

Q. How can you create or import a flat file definition into the Warehouse
Designer?
A. You can create a flat file definition in the Warehouse Designer. In the Warehouse
Designer, create a new target and select the type as flat file. Save it, and you
can then enter the columns for the created target by editing its properties. Once
the target is created, save it. You can import it from the Mapping Designer.
Q. What is a mapplet?
A. A mapplet is a set of transformations whose logic can be reused. It should have
a mapplet Input transformation, which receives input values, and an Output
transformation, which passes the final modified data back to the mapping. When the
mapplet is displayed within a mapping, only its input and output ports are displayed,
so the internal logic is hidden from the end user's point of view.
Q. What is a transformation?
A. It is a repository object that generates, modifies or passes data.
Q. What are the designer tools for creating transformations?
A. Mapping designer
Transformation developer
Mapplet designer
Q. What are connected and unconnected transformations?
A. Connected transformation: A transformation that participates in the mapping
data flow. A connected transformation can receive multiple inputs and provide
multiple outputs.
Unconnected transformation: A transformation that does not participate in the mapping
data flow. It can receive multiple inputs and provides a single output.

Q. In how many ways can you create ports?


A. Two ways
1. Drag the port from another transformation
2. Click the add button on the ports tab.
Q. What are reusable transformations?
A. A transformation that can be reused is called a reusable transformation.
They can be created using two methods:
1. Using the Transformation Developer
2. Creating a normal one and promoting it to reusable
Q. What are mapping parameters and mapping variables?
A. A mapping parameter represents a constant value that you can define before
running a session. A mapping parameter retains the same value throughout the
entire session.
When you use a mapping parameter, you declare and use the parameter in a
mapping or mapplet, then define the value of the parameter in a parameter file for
the session.
Unlike a mapping parameter, a mapping variable represents a value that can
change throughout the session. The Informatica server saves the value of the
mapping variable to the repository at the end of the session run and uses that value
the next time you run the session.
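The difference can be sketched with an incremental load: the parameter is read once and stays fixed for the whole run, while the variable is updated during the run and saved for the next one. The dictionaries below stand in for the parameter file and the repository, and the names `$$Region` and `$$LastId` are hypothetical examples.

```python
def run_session(rows, parameters, repository):
    """Pick rows for one region that are newer than the last processed id."""
    region = parameters["$$Region"]            # parameter: constant for the run
    last_id = repository.get("$$LastId", 0)    # variable: saved by the previous run
    picked = [r for r in rows if r["region"] == region and r["id"] > last_id]
    if picked:
        # Persist the new value so the next run continues from here.
        repository["$$LastId"] = max(r["id"] for r in picked)
    return picked

rows = [
    {"id": 1, "region": "East"},
    {"id": 2, "region": "West"},
    {"id": 3, "region": "East"},
]
parameters = {"$$Region": "East"}   # stand-in for the parameter file
repository = {}                     # stand-in for the repository

first_run = run_session(rows, parameters, repository)
rows.append({"id": 4, "region": "East"})
second_run = run_session(rows, parameters, repository)
```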
Q. Can you use the mapping parameters or variables created in one
mapping in another mapping?
A. No.
We can use mapping parameters or variables in any transformation of the same
mapping or mapplet in which we have created them.

Q. How can you improve session performance in the Aggregator transformation?
A. 1. Use sorted input: place a Sorter before the Aggregator.
2. Do not forget to check the option on the Aggregator that tells it
that the input is sorted on the same keys as the group-by ports. The key order is
also very important.
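Why sorted input helps can be sketched with `itertools.groupby`: when rows arrive ordered by the group key, each group is complete as soon as the key changes, so its total can be emitted immediately instead of holding every group until the end. The function name and sample rows are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def aggregate_sorted(rows):
    """rows: (key, amount) pairs already sorted by key. Yields (key, total)."""
    for key, group in groupby(rows, key=itemgetter(0)):
        # Only the current group is in flight at any time.
        yield key, sum(amount for _, amount in group)

unsorted_rows = [("B", 5), ("A", 1), ("A", 2), ("B", 7)]

# The sort plays the role of the Sorter placed before the Aggregator.
sorted_rows = sorted(unsorted_rows, key=itemgetter(0))
totals = list(aggregate_sorted(sorted_rows))

# Without the sort, groupby would split "B" into two separate groups.
wrong = list(aggregate_sorted(unsorted_rows))
```

The last line shows why the sort keys and their order must match the group-by ports.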
Q. What is the aggregate cache in the Aggregator transformation?
A. The Aggregator stores data in the aggregate cache until it completes
aggregate calculations. When you run a session that uses an Aggregator
transformation, the Informatica server creates index and data caches in
memory to process the transformation. If the Informatica server requires more
space, it stores overflow values in cache files.
Q. What are the differences between the Joiner transformation and the Source
Qualifier transformation?
A. You can join heterogeneous data sources in a Joiner transformation, which
cannot be done in a Source Qualifier transformation.
You need matching keys to join two relational sources in a Source Qualifier
transformation, whereas you do not need matching keys to join two sources in a Joiner.
In a Source Qualifier, the two relational sources must come from the same data source.
In a Joiner, you can also join relational sources that come from different data sources.
Q. In which conditions can we not use joiner transformations?
A. You cannot use a Joiner transformation in the following situations (according
to Informatica 7.1):
Either input pipeline contains an Update Strategy transformation.
You connect a Sequence Generator transformation directly before the Joiner
transformation.
Q. What are the settings that you use to configure the Joiner
transformation?
A. Master and detail source
Type of join
Condition of the join
Q. What are the join types in joiner transformation?
A. Normal (Default) -- only matching rows from both master and detail
Master outer -- all detail rows and only matching rows from master
Detail outer -- all master rows and only matching rows from detail
Full outer -- all rows from both master and detail ( matching or non matching)
Q. What are the joiner caches?
A. When a Joiner transformation occurs in a session, the Informatica Server
reads all the records from the master source and builds index and data caches
based on the master rows.
After building the caches, the Joiner transformation reads records from the
detail source and performs joins.
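The four join types and the master-side cache can be sketched together: the master rows are indexed first, then the detail rows are streamed against that index. In this simplified model, columns from the missing side are simply absent rather than NULL. The function name `join` and the sample rows are made up for the example.

```python
def join(master, detail, key, join_type="normal"):
    # Build the cache from the master source first.
    index = {}
    for m in master:
        index.setdefault(m[key], []).append(m)

    matched_keys, out = set(), []
    for d in detail:                      # then read the detail source
        matches = index.get(d[key], [])
        for m in matches:
            out.append({**m, **d})
        if matches:
            matched_keys.add(d[key])
        elif join_type in ("master_outer", "full_outer"):
            out.append(dict(d))           # keep the unmatched detail row
    if join_type in ("detail_outer", "full_outer"):
        for m in master:
            if m[key] not in matched_keys:
                out.append(dict(m))       # keep the unmatched master row
    return out

master = [{"id": 1, "m": "a"}, {"id": 2, "m": "b"}]
detail = [{"id": 2, "d": "x"}, {"id": 3, "d": "y"}]

normal = join(master, detail, "id")
master_outer = join(master, detail, "id", "master_outer")
detail_outer = join(master, detail, "id", "detail_outer")
full_outer = join(master, detail, "id", "full_outer")
```

Because only the master side is cached, it usually makes sense to pick the smaller source as the master.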


Q. Why use the lookup transformation?
A. To perform the following tasks.
Get a related value. For example, if your source table includes employee ID,
but you want to include the employee name in your target table to make your
summary data easier to read.
Perform a calculation. Many normalized tables include values used in a
calculation, such as gross sales per invoice or sales tax, but not the calculated
value (such as net sales).
Update slowly changing dimension tables. You can use a Lookup transformation
to determine whether records already exist in the target.

Q. What is meant by lookup caches?
A. The Informatica server builds a cache in memory when it processes the first
row of data in a cached Lookup transformation. It allocates memory for the
cache based on the amount you configure in the transformation or session
properties. The Informatica server stores condition values in the index cache and
output values in the data cache.

Differences Between Connected and Unconnected Lookups

Connected Lookup: Receives input values directly from the pipeline.
Unconnected Lookup: Receives input values from the result of a :LKP expression in another
transformation.

Connected Lookup: You can use a dynamic or static cache.
Unconnected Lookup: You can use a static cache.

Connected Lookup: Cache includes all lookup columns used in the mapping (that is, lookup source
columns included in the lookup condition and lookup source columns linked as output ports to other
transformations).
Unconnected Lookup: Cache includes all lookup/output ports in the lookup condition and the
lookup/return port.

Connected Lookup: Can return multiple columns from the same row or insert into the dynamic lookup
cache.
Unconnected Lookup: Designate one return port (R). Returns one column from each row.

Connected Lookup: If there is no match for the lookup condition, the PowerCenter Server returns the
default value for all output ports. If you configure dynamic caching, the PowerCenter Server inserts
rows into the cache or leaves it unchanged.
Unconnected Lookup: If there is no match for the lookup condition, the PowerCenter Server returns
NULL.

Connected Lookup: If there is a match for the lookup condition, the PowerCenter Server returns the
result of the lookup condition for all lookup/output ports. If you configure dynamic caching, the
PowerCenter Server either updates the row in the cache or leaves the row unchanged.
Unconnected Lookup: If there is a match for the lookup condition, the PowerCenter Server returns the
result of the lookup condition into the return port.

Connected Lookup: Pass multiple output values to another transformation. Link lookup/output ports to
another transformation.
Unconnected Lookup: Pass one output value to another transformation. The lookup/output/return port
passes the value to the transformation calling the :LKP expression.

Connected Lookup: Supports user-defined default values.
Unconnected Lookup: Does not support user-defined default values.

Q. What are the types of lookup caches?
A. Persistent cache: You can save the lookup cache files and reuse them the next
time the Informatica server processes a Lookup transformation configured to use
the cache.
Recache from database: If the persistent cache is not synchronized with the
lookup table, you can configure the Lookup transformation to rebuild the lookup
cache.
Static cache: You can configure a static, or read-only, cache for any lookup table. By
default the Informatica server creates a static cache. It caches the lookup table and
looks up values in the cache for each row that comes into the transformation. When
the lookup condition is true, the Informatica server does not update the cache
while it processes the Lookup transformation.
Dynamic cache: If you want to cache the target table and insert new rows into the
cache and the target, you can configure a Lookup transformation to use a dynamic
cache. The Informatica server dynamically inserts rows into the cache as it passes
them to the target table.
Shared cache: You can share the lookup cache between multiple transformations. You
can share an unnamed cache between transformations in the same mapping.

Q: What do you know about Informatica and ETL?


A: Informatica is a very useful GUI based ETL tool.
Q: FULL and DELTA files. Historical and Ongoing load.
A: FULL file contains complete data as of today including history data, DELTA file
contains only the changes since last extract.
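A minimal sketch of how the two file types relate: the historical load starts from the FULL file, and each ongoing load applies a DELTA on top of it. The change-record layout (`op`, `key`, `value`) is an assumption made for the example, not a standard format.

```python
def apply_delta(snapshot, delta):
    """Return a new snapshot with the changes from the delta applied."""
    result = dict(snapshot)
    for change in delta:
        if change["op"] == "delete":
            result.pop(change["key"], None)
        else:                               # "insert" and "update" both set the value
            result[change["key"]] = change["value"]
    return result

# FULL file: complete data as of the extract (made-up contents).
full = {1001: "Illinois", 1002: "Ohio"}

# DELTA file: only the changes since the last extract.
delta = [
    {"op": "update", "key": 1001, "value": "California"},
    {"op": "insert", "key": 1005, "value": "Texas"},
    {"op": "delete", "key": 1002},
]

current = apply_delta(full, delta)
```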
Q: Power Center/ Power Mart which products have you worked with?
A: Power Center will have Global and Local repository, whereas Power Mart will
have only Local repository.
Q: Explain what are the tools you have used in Power Center and/or
Power Mart?
A: Designer, Server Manager, and Repository Manager.

Q: What is a Mapping?
A: A mapping represents the data flow between source and target.
Q: What components must a mapping contain?
A: Source definition, Transformation, Target Definition and Connectors

Q: What is Transformation?
A: A transformation is a repository object that generates, modifies, or passes data.
Each transformation performs a specific function. There are two types of transformations:
1. Active
Can change the number of rows that pass through it. E.g.: Aggregator, Filter,
Joiner, Normalizer, Rank, Router, Source Qualifier, Update Strategy, ERP Source
Qualifier, Advanced External Procedure.
2. Passive
Does not change the number of rows that pass through it. E.g.: Expression,
External Procedure, Input, Lookup, Stored Procedure, Output, Sequence
Generator, XML Source Qualifier.
Q: Which transformation can be overridden at the Server?
A: Source Qualifier and Lookup Transformations
Q: What is connected and unconnected Transformation and give
Examples?
Q: What are Options/Type to run a Stored Procedure?
A:
Normal: During a session, the stored procedure runs where the transformation
exists in the mapping on a row-by-row basis. This is useful for calling the stored
procedure for each row of data that passes through the mapping, such as running
a calculation against an input port. Connected stored procedures run only in
normal mode.
Pre-load of the Source. Before the session retrieves data from the source, the
stored procedure runs. This is useful for verifying the existence of tables or
performing joins of data in a temporary table.
Post-load of the Source. After the session retrieves data from the source, the
stored procedure runs. This is useful for removing temporary tables.
Pre-load of the Target. Before the session sends data to the target, the stored
procedure runs. This is useful for verifying target tables or disk space on the
target system.

Post-load of the Target. After the session sends data to the target, the stored
procedure runs. This is useful for re-creating indexes on the database.
It must contain at least one Input and one Output port.
Q: What kinds of sources and of targets can be used in Informatica?
A:
Sources may be Flat file, relational db or XML.
Target may be relational tables, XML or flat files.
Q: Transformations: What are the different transformations you have
worked with?
A:
Source Qualifier (XML, ERP, MQ)
Joiner
Expression
Lookup
Filter
Router
Sequence Generator
Aggregator
Update Strategy
Stored Proc
External Proc
Advanced External Proc
Rank
Normalizer
Q: What are active/passive transformations?
A: Passive transformations do not change the number of rows passing through them,
whereas active transformations can change the number of rows passing through them.
Active: Filter, Aggregator, Rank, Joiner, Source Qualifier
Passive: Expression, Lookup, Stored Proc, Seq. Generator
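The distinction can be pictured with two small Python functions. This is only an analogy, not Informatica code: expression_like and filter_like are hypothetical names standing in for an Expression and a Filter transformation.

```python
rows = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 150},
    {"id": 3, "amount": 300},
]


def expression_like(rows):
    # Passive: exactly one output row per input row, with a derived column.
    return [{**r, "is_large": r["amount"] > 100} for r in rows]


def filter_like(rows):
    # Active: rows failing the condition are dropped, so the count can change.
    return [r for r in rows if r["amount"] > 100]


print(len(rows), len(expression_like(rows)), len(filter_like(rows)))
```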
Q: What are connected/unconnected transformations?
A:
Connected transformations are part of the mapping pipeline. The input and
output ports are connected to other transformations.

Unconnected transformations are not part of the mapping pipeline. They are
not linked to other transformations through input or output ports. Eg: you can
pass multiple values to an unconnected Lookup, but only one column
of data will be returned from the transformation. Unconnected: Lookup, Stored
Proc.
Q: In target load ordering, what do you order - Targets or Source
Qualifiers?
A: Source Qualifiers. If there are multiple targets in the mapping, which are
populated from multiple sources, then we can use Target Load ordering.
Q: Have you used constraint-based load ordering? Where do you set
this?
A: Constraint based loading can be used when you have multiple targets in the
mapping and the target tables have a PK-FK relationship in the database. It can be
set in the session properties. You have to set Treat Source Rows As to INSERT
and check the Constraint Based Load Ordering box in the Advanced Tab.
Q: If you have a FULL file that you have to match and load into a
corresponding table, how will you go about it? Will you use Joiner
transformation?
A: Use Joiner and join the file and Source Qualifier.
Q: If you have 2 files to join, which file will you use as the master file?
A: Use the file with fewer records as the master file.
Q: If a sequence generator (with increment of 1) is connected to (say) 3
targets and each target uses the NEXTVAL port, what value will each
target get?
A: Each target will get the value in multiple of 3.
Q: Have you used the Abort, Decode functions?
A: Abort can be used to Abort / stop the session on an error condition.
If the primary key column contains NULL, and you need to stop the session from
continuing then you may use ABORT function in the default value for the port. It
can be used with IIF and DECODE function to Abort the session.
Q: Have you used SQL Override?
A: It is used to override the default SQL generated in the Source Qualifier / Lookup
transformation.

Q: If you make a local transformation reusable by mistake, can you undo
the reusable action?
A: No
Q: What is the difference between filter and router transformations?
A: A Filter tests records against ONE condition only, whereas a Router can
test records against multiple conditions.
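A rough Python analogy of a Router follows. The route function, the group names, and the DEFAULT bucket are assumptions made for this sketch; it also assumes that a row meeting several group conditions is passed to each matching group.

```python
def route(rows, groups):
    """Send each row to every group whose condition it satisfies.

    groups maps a (hypothetical) group name to a predicate function.
    Rows that satisfy no condition land in the DEFAULT bucket.
    """
    routed = {name: [] for name in groups}
    routed["DEFAULT"] = []
    for row in rows:
        matched = False
        for name, predicate in groups.items():
            if predicate(row):
                routed[name].append(row)
                matched = True
        if not matched:
            routed["DEFAULT"].append(row)
    return routed


rows = [
    {"id": 1, "region": "EU", "amount": 500},
    {"id": 2, "region": "US", "amount": 50},
    {"id": 3, "region": "APAC", "amount": 20},
]
groups = {
    "EUROPE": lambda r: r["region"] == "EU",
    "LARGE": lambda r: r["amount"] >= 100,
}
routed = route(rows, groups)
print({name: [r["id"] for r in out] for name, out in routed.items()})
```

A single Filter would correspond to keeping only one of these buckets and discarding everything else.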
Q: Lookup transformations: Cached/un-cached
A: When the Lookup Transformation is cached the Informatica Server caches the
data and index. This is done at the beginning of the session before reading the
first record from the source. If the Lookup is uncached then the Informatica server
reads the data from the database for every record coming from the Source Qualifier.
Q: Connected/unconnected if there is no match for the lookup, what is
returned?
A: Unconnected Lookup returns NULL if there is no matching record found in the
Lookup transformation.
Q: What is persistent cache?
A: When the Lookup is configured to be a persistent cache Informatica server does
not delete the cache files after completion of the session. In the next run
Informatica server uses the cache file from the previous session.
Q: What is dynamic lookup strategy?
A: The Informatica server compares the data in the lookup table and the cache, if
there is no matching record found in the cache file then it modifies the cache files
by inserting the record. You may use only (=) equality in the lookup condition.
If multiple matches are found in the lookup then Informatica fails the session. By
default the Informatica server creates a static cache.
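The insert-on-miss behaviour of a dynamic cache can be sketched in Python. This is an illustration only: process_with_dynamic_cache and the cust_id key are hypothetical, and the real server works with index and data cache files rather than a dictionary.

```python
def process_with_dynamic_cache(source_rows, target_rows, key="cust_id"):
    """Build a cache from the target, then insert source rows that miss it."""
    cache = {row[key]: row for row in target_rows}
    inserted = []
    for row in source_rows:
        if row[key] not in cache:
            # Cache miss: add the row to the cache and mark it for insert.
            cache[row[key]] = row
            inserted.append(row)
        # Cache hit: the row already exists, so nothing is inserted.
    return cache, inserted


target_rows = [{"cust_id": 1, "name": "Alice"}]
source_rows = [
    {"cust_id": 2, "name": "Bob"},
    {"cust_id": 2, "name": "Bob"},   # duplicate within the same run
    {"cust_id": 1, "name": "Alice"},
]
cache, inserted = process_with_dynamic_cache(source_rows, target_rows)
print(inserted)
```

Because the cache is updated as rows pass through, the second occurrence of a new key within the same run is treated as already present, which a static cache would not detect.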
Q: Mapplets: What are the 2 transformations used only in mapplets?
A: Mapplet Input / Source Qualifier, Mapplet Output
Q: Have you used Shortcuts?
A: Shortcuts may be used to refer to another mapping. Informatica refers to the
original mapping. If any changes are made to the original mapping / mapplet, they are
immediately reflected in the mapping where the shortcut is used.

Q: If you used a database when importing sources/targets that was
dropped later on, will your mappings still be valid?
A: No
Q: In expression transformation, how can you store a value from the
previous row?
A: By creating a variable in the transformation.
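The idea of carrying a value forward in a variable can be sketched in Python. The add_previous_value helper is hypothetical; the local variable previous plays the role of the variable port, and it starts as None here rather than the tool's own default.

```python
def add_previous_value(rows, column):
    """Attach the previous row's value of `column` to each row."""
    previous = None  # stands in for the variable port
    out = []
    for row in rows:
        # Read the variable first: it still holds the previous row's value.
        out.append({**row, "prev_" + column: previous})
        # Then update it so the next row sees the current value.
        previous = row[column]
    return out


result = add_previous_value([{"x": 10}, {"x": 20}, {"x": 30}], "x")
print(result)
```

The ordering matters: the output is computed before the variable is reassigned.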
Q: How does Informatica do variable initialization? Number/String/Date
A: Number 0, String blank, Date 1/1/1753
Q: Have you used the Informatica debugger?
A: Debugger is used to test the mapping during development. You can give
breakpoints in the mappings and analyze the data.
Q: What do you know about the Informatica server architecture? Load
Manager, DTM, Reader, Writer, Transformer.
A:
Load Manager is the first process started when the session runs. It checks for
validity of mappings, locks sessions and other objects.
DTM process is started once the Load Manager has completed its job. It starts
a thread for each pipeline.
Reader scans data from the specified sources.
Writer manages the target/output data.
Transformer performs the task specified in the mapping.
Q: Have you used partitioning in sessions? (not available with
Powermart)
A: It is available in PowerCenter. It can be configured in the session properties.
Q: Have you used External loader? What is the difference between
normal and bulk loading?
A: External loader will perform direct data load to the table/data files, bypass the
SQL layer and will not log the data. During normal data load, data passes through
SQL layer, data is logged in to the archive log file and as a result it is slow.
Q: Do you enable/disable decimal arithmetic in session properties?
A: Disabling Decimal Arithmetic will improve the session performance but it
converts numeric values to double, thus leading to reduced accuracy.

Q: When would you use multiple update strategies in a mapping?
A: When you would like to insert and update the records in a Type 2 Dimension
table.
Q: When would you truncate the target before running the session?
A: When we want to load the entire data set, including history, in one shot. In that
case the update strategy uses only dd_insert, with no dd_update or dd_delete.
Q: How do you use stored proc transformation in the mapping?
A: Inside a mapping we can use a stored procedure transformation, pass input
parameters and get back the output parameters. When handled through the session,
it can be invoked in either pre-session or post-session scripts.
Q: What did you do in the stored procedure? Why did you use stored
proc instead of using expression?
A:
Q: When would you use SQ, Joiner and Lookup?
A:
If we are using multiple source tables and they are related at the database,
then we can use a single SQ.
If we need to Lookup values in a table or Update Slowly Changing Dimension
tables then we can use Lookup transformation.
Joiner is used to join heterogeneous sources, e.g. Flat file and relational tables.
Q: How do you create a batch load? What are the different types of
batches?
A: Batch is created in the Server Manager. It contains multiple sessions. First
create sessions and then create a batch. Drag the sessions into the batch from
the session list window.
Batches may be sequential or concurrent. A sequential batch runs the sessions
one after another. A concurrent batch runs the sessions in parallel, thus optimizing
the server resources.
Q: How did you handle reject data? What file does Informatica create for
bad data?
A: Informatica saves the rejected data in a .bad file. Informatica adds a row
identifier for each rejected record, indicating whether the row was rejected
because of the Writer or the Target. Additionally, for every column there is an
indicator specifying whether the data was rejected due to overflow, null,
truncation, etc.
Q: How did you handle runtime errors? If the session stops abnormally
how were you managing the reload process?
Q: Have you used pmcmd command? What can you do using this
command?
A: pmcmd is a command line program. Using this command
You can start sessions
Stop sessions
Recover session
Q: What are the two default repository user groups
A: Administrators and Public
Q: What are the Privileges of Default Repository and Extended
Repository user?
A:
Default Repository Privileges
o Use Designer
o Browse Repository
o Create Session and Batches
Extended Repository Privileges
o Session Operator
o Administer Repository
o Administer Server
o Super User
Q: How many different locks are available for repository objects
A: There are five kinds of locks available on repository objects:
Read lock. Created when you open a repository object in a folder for which you
do not have write permission. Also created when you open an object with an
existing write lock.
Write lock. Created when you create or edit a repository object in a folder for
which you have write permission.
Execute lock. Created when you start a session or batch, or when the
Informatica Server starts a scheduled session or batch.
Fetch lock. Created when the repository reads information about repository
objects from the database.
Save lock. Created when you save information to the repository.
Q: What is Session Process?
A: The Load Manager process starts the session, creates the DTM process, and
sends post-session email when the session completes.
Q: What is DTM process?
A: The DTM process creates threads to initialize the session, read, write, transform
data, and handle pre and post-session operations.
Q: When the Informatica Server runs a session, what are the tasks
handled?
A:
Load Manager (LM):
o LM locks the session and reads session properties.
o LM reads the parameter file.
o LM expands the server and session variables and parameters.
o LM verifies permissions and privileges.
o LM validates source and target code pages.
o LM creates the session log file.
o LM creates the DTM (Data Transformation Manager) process.

Data Transformation Manager (DTM):
o DTM process allocates DTM process memory.
o DTM initializes the session and fetches the mapping.
o DTM executes pre-session commands and procedures.
o DTM creates reader, transformation, and writer threads for each
source pipeline. If the pipeline is partitioned, it creates a set of
threads for each partition.
o DTM executes post-session commands and procedures.
o DTM writes historical incremental aggregation and lookup data to
disk, and it writes persisted sequence values and mapping variables
to the repository.
o Load Manager sends post-session email

Q: What is Code Page?
A: A code page contains the encoding to specify characters in a set of one or more
languages.

Q: How to handle the performance in the server side?
A: Informatica tool has no role to play here. The server administrator will take up
the issue.
Q: What are the DTM (Data Transformation Manager) Parameters?
A:
DTM Memory parameter - Default buffer block size/Data & Index Cache size ,
Reader Parameter - Line Sequential buffer length for flat files,
General Parameter - Commit Interval (source and Target)/ Others- Enabling
Lookup cache,
Event based Scheduling - Indicator file to wait for.
1. Explain about your projects
Architecture
Dimension and Fact tables
Sources and Targets
Transformations used
Frequency of populating data
Database size
2. What is dimension modeling?
Unlike the ER model, the dimensional model is very asymmetric, with one large central
table called the fact table connected to multiple dimension tables. It is also called
star schema.
3. What are mapplets?
Mapplets are reusable objects that represent a collection of transformations.
Objects that cannot be included in mapplets are
Cobol source definitions
Joiner transformations
Normalizer Transformations
Non-reusable sequence generator transformations
Pre or post session procedures
Target definitions
XML Source definitions
IBM MQ source definitions
PowerMart 3.5 style Lookup functions
4. What are the transformations that use cache for performance?
Aggregator, Lookup, Joiner and Rank

5. What are the active and passive transformations?
An active transformation can change the number of rows that pass through the
mapping.
1. Source Qualifier
2. Filter transformation
3. Router transformation
4. Rank
5. Update strategy
6. Aggregator
7. Advanced External procedure
8. Normalizer
9. Joiner
Passive transformations do not change the number of rows that pass through the
mapping.
1. Expressions
2. Lookup
3. Stored procedure
4. External procedure
5. Sequence generator
6. XML Source qualifier

6. What is a lookup transformation?
Used to look up data in a relational table, view, or synonym. The informatica
server queries the lookup table based on the lookup ports in the transformation. It
compares lookup transformation port values to lookup table column values based
on the lookup condition. The result is passed to other transformations and the
target.
Used to :
Get related value
Perform a calculation
Update slowly changing dimension tables.
Diff between connected and unconnected lookups. Which is better?
Connected :
Receives input values directly from the pipeline.
Can use dynamic or static cache.
Cache includes all lookup columns used in the mapping.
Can return multiple columns from the same row.
If there is no match, can return default values.
Default values can be specified.
Unconnected :
Receives input values from the result of a LKP expression in another
transformation.
Only static cache can be used.
Cache includes all lookup/output ports in the lookup condition and lookup or return
port.
Can return only one column from each row.
If there is no match it returns null.
Default values cannot be specified.
Explain various caches :
Static:
Caches the lookup table before executing the transformation. Rows are not added
dynamically.
Dynamic:
Rows are added to the cache as they pass through the transformation.
Unshared:
Within the mapping, if the lookup table is used in more than one transformation
then the cache built for the first lookup can be used for the others. It cannot be
used across mappings.
Shared:
If the lookup table is used in more than one transformation/mapping then the
cache built for the first lookup can be used for the others. It can be used across
mappings.
Persistent :
If the cache generated for a Lookup needs to be preserved for subsequent use
then persistent cache is used. The server does not delete the index and data files.
It is useful only if the lookup table remains constant.
What are the uses of index and data caches?
The conditions are stored in index cache and records from the lookup are stored
in data cache
7. Explain aggregate transformation?
The aggregate transformation allows you to perform aggregate calculations, such
as averages, sum, max, min etc. The aggregate transformation is unlike the
Expression transformation, in that you can use the aggregator transformation to
perform calculations in groups. The expression transformation permits you to
perform calculations on a row-by-row basis only.
Performance issues ?
The Informatica server performs calculations as it reads, and stores the necessary
group and row data in an aggregate cache.
Create sorted input ports and pass the input records to the aggregator in sorted
form, by group and then by port.
Incremental aggregation?
In the session properties there is an option for performing incremental
aggregation. When the Informatica server performs incremental aggregation, it
passes new source data through the mapping and uses historical cache (index
and data cache) data to perform new aggregation calculations incrementally.
What are the uses of index and data cache?
The group data is stored in index files and the row data is stored in data files.
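The idea behind incremental aggregation can be sketched in plain Python. This is an analogy under stated assumptions: incremental_aggregate is a hypothetical helper, a dictionary stands in for the historical index and data cache, and SUM is the only aggregate shown.

```python
def incremental_aggregate(historical_totals, new_rows):
    """Fold only the new rows into previously saved group totals."""
    totals = dict(historical_totals)  # copy so the saved cache is not mutated
    for group, amount in new_rows:
        totals[group] = totals.get(group, 0) + amount
    return totals


# First run: no history yet, so every source row is processed.
run1 = incremental_aggregate({}, [("east", 10), ("west", 5), ("east", 7)])

# Second run: only the rows captured since run1 are passed through.
run2 = incremental_aggregate(run1, [("east", 3), ("north", 4)])
print(run2)
```

The second run touches two rows instead of re-reading all five, which is where the saving comes from.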
8. Explain update strategy?
Update strategy flags source rows for insert, update, delete, or
reject at the targets.
What are update strategy constants?
DD_INSERT = 0
DD_UPDATE = 1
DD_DELETE = 2
DD_REJECT = 3
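How these constants might drive row flagging can be sketched in Python. The numeric values come from the list above; the flag_row function and its rules (reject rows with a missing key, update known keys, insert the rest) are hypothetical and only illustrate the kind of logic an update strategy expression encodes.

```python
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3


def flag_row(row, existing_keys):
    """Return the flag a row would carry to the target (illustrative rules)."""
    if row["id"] is None:
        return DD_REJECT      # no key: cannot be loaded
    if row["id"] in existing_keys:
        return DD_UPDATE      # key already in the target
    return DD_INSERT          # new key


existing_keys = {1, 2}
flags = [flag_row(r, existing_keys) for r in [{"id": 1}, {"id": 5}, {"id": None}]]
print(flags)
```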
If DD_UPDATE is defined in update strategy and Treat source rows as
INSERT in Session . What happens?
Hints: If anything other than DATA DRIVEN is specified in the Session, then the Update
strategy in the mapping is ignored.
What are the three areas where the rows can be flagged for particular
treatment?
In mapping, In Session treat Source Rows and In Session Target Options.
What is the use of Forward/Reject rows in Mapping?
9. Explain the expression transformation ?
Expression transformation is used to calculate values in a single row before
writing to the target.
What are the default values for variables?
Hints: String = Null, Number = 0, Date = 1/1/1753
10. Difference between Router and filter transformation?
In filter transformation the records are filtered based on the condition and rejected
rows are discarded. In Router multiple conditions can be defined and the rejected
rows can be assigned to a port.
How many ways you can filter the records?
1. Source Qualifier
2. Filter transformation
3. Router transformation
4. Rank
5. Update strategy
11. How do you call stored procedure and external procedure
transformation ?
External Procedure can be called in the Pre-session and post session tag in the
Session property sheet.
Stored procedures are to be called in the mapping designer by three methods
1. Select the icon and add a Stored procedure transformation
2. Select Transformation - Import Stored Procedure
3. Select Transformation - Create and then select stored procedure.
12. Explain Joiner transformation and where it is used?
While a Source qualifier transformation can join data originating from a common
source database, the joiner transformation joins two related heterogeneous
sources residing in different locations or file systems.
Two relational tables existing in separate databases
Two flat files in different file systems.
Two different ODBC sources
In one transformation how many sources can be coupled?
Two sources can be coupled. If more than two are to be coupled, add another Joiner
in the hierarchy.
What are join options?
Normal (Default)
Master Outer
Detail Outer
Full Outer
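The four join options can be sketched in Python. The join function below is hypothetical, and the mapping of names to behaviour is stated as an assumption to check against the product documentation: Normal keeps only matching rows, Master Outer keeps all detail rows, Detail Outer keeps all master rows, and Full Outer keeps all rows from both. The master side is indexed in memory, which is also why the smaller source is usually chosen as master.

```python
def join(master, detail, key, join_type="normal"):
    """Join two lists of dicts on `key` using Joiner-style option names."""
    master_index = {}
    for m in master:
        master_index.setdefault(m[key], []).append(m)

    result = []
    matched_master_keys = set()
    for d in detail:
        matches = master_index.get(d[key], [])
        if matches:
            matched_master_keys.add(d[key])
            result.extend({**m, **d} for m in matches)
        elif join_type in ("master_outer", "full_outer"):
            result.append(dict(d))   # unmatched detail row is kept
    if join_type in ("detail_outer", "full_outer"):
        for m in master:
            if m[key] not in matched_master_keys:
                result.append(dict(m))   # unmatched master row is kept
    return result


master = [{"k": 1, "m": "a"}, {"k": 2, "m": "b"}]
detail = [{"k": 2, "d": "x"}, {"k": 3, "d": "y"}]
for jt in ("normal", "master_outer", "detail_outer", "full_outer"):
    print(jt, join(master, detail, "k", jt))
```

Unmatched rows simply lack the other side's columns in this sketch; a real join would fill them with nulls.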

13. Explain Normalizer transformation?
The Normalizer transformation normalizes records from COBOL and relational
sources, allowing you to organize the data according to your own needs. A
Normalizer transformation can appear anywhere in a data flow when you
normalize a relational source. Use a Normalizer transformation instead of the
Source Qualifier transformation when you normalize a COBOL source. When you
drag a COBOL source into the Mapping Designer Workspace, the Normalizer
transformation appears, creating input and output ports for every column in the
source.
14. What is Source qualifier transformation?
When you add a relational or flat file source definition to a mapping, you need to
connect it to a Source Qualifier transformation. The source qualifier represents the
records that the informatica server reads when it runs a session. It can be used to:
Join data originating from the same source database.
Filter records when the Informatica server reads the source data.
Specify an outer join rather than the default inner join.
Specify sorted ports
Select only distinct values from the source
Create a custom query to issue a special SELECT statement for the Informatica
server to read the source data.

15. What is Rank transformation?
Selects the required number of records from the top or from the bottom.
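A top-N or bottom-N selection of this kind can be sketched in Python with the standard heapq module. The rank helper and the sample column names are hypothetical.

```python
import heapq


def rank(rows, column, n, top=True):
    """Return the n rows with the largest (or smallest) value of `column`."""
    pick = heapq.nlargest if top else heapq.nsmallest
    return pick(n, rows, key=lambda r: r[column])


rows = [
    {"rep": "A", "sales": 5},
    {"rep": "B", "sales": 9},
    {"rep": "C", "sales": 1},
    {"rep": "D", "sales": 7},
]
top_two = rank(rows, "sales", 2)
bottom_two = rank(rows, "sales", 2, top=False)
print(top_two, bottom_two)
```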
16. What is target load option?
It defines the order in which informatica server loads the data into the targets.
This is to avoid integrity constraint violations

17. How do you identify the bottlenecks in Mappings?
Bottlenecks can occur in
1. Targets
The most common performance bottleneck occurs when the informatica server
writes to a target database. You can identify a target bottleneck by configuring
the session to write to a flat file target.
If the session performance increases significantly when you write to a flat file,
you have a target bottleneck.
Solution :
Drop or Disable index or constraints
Perform bulk load (Ignores Database log)
Increase commit interval (Recovery is compromised)
Tune the database for RBS, Dynamic Extension etc.,

2. Sources
Set a filter transformation after each SQ so that no records pass through.
If the time taken is the same, then there is a source problem.
You can also identify the source problem by
Read Test Session: copy the mapping, keep only the sources and SQs, remove
all transformations and connect to a file target. If the performance is the same,
then there is a source bottleneck.
Using a database query: copy the read query directly from the log. Execute the
query against the source database with a query tool. If the time it takes to
execute the query and the time to fetch the first row are significantly different,
then the query can be modified using optimizer hints.
Solutions:
Optimize Queries using hints.
Use indexes wherever possible.
3. Mapping
If both source and target are OK then the problem could be in the mapping.
Add a filter transformation before the target that passes no rows; if the time is
the same then there is a mapping problem.
(OR) Look at the performance monitor in the session property sheet and view
the counters.
Solutions:
High error rows and rows in lookup cache counters indicate a mapping bottleneck.
Optimize Single Pass Reading:
Optimize Lookup transformation :
1. Caching the lookup table:
When caching is enabled the informatica server caches the lookup table
and queries the cache during the session. When this option is not enabled
the server queries the lookup table on a row-by-row basis.
Static, Dynamic, Shared, Un-shared and Persistent cache
2. Optimizing the lookup condition
Whenever multiple conditions are placed, the condition with the equality
sign should take precedence.
3. Indexing the lookup table
For a cached lookup, the lookup table should be indexed on the ORDER BY
columns. The session log contains the ORDER BY statement.
For an un-cached lookup, since the server issues a SELECT statement for
each row passing into the lookup transformation, it is better to index the
lookup table on the columns in the condition.
Optimize Filter transformation:
You can improve efficiency by filtering early in the data flow. Instead of
using a filter transformation halfway through the mapping to remove a sizable
amount of data, use a source qualifier filter to remove those same rows at
the source.
If it is not possible to move the filter into the SQ, move the filter transformation
as close to the source qualifier as possible to remove unnecessary data early
in the data flow.
Optimize Aggregate transformation:
1. Group by simpler columns. Preferably numeric columns.
2. Use Sorted input. The sorted input decreases the use of aggregate
caches. The server assumes all input data are sorted and performs aggregate
calculations as it reads.
3. Use incremental aggregation in session property sheet.
Optimize Seq. Generator transformation:
1. Try creating a reusable Seq. Generator transformation and use it in
multiple mappings
2. The number of cached value property determines the number of values
the informatica server caches at one time.
Optimize Expression transformation:
1. Factoring out common logic
2. Minimize aggregate function calls.
3. Replace common sub-expressions with local variables.
4. Use operators instead of functions.
4. Sessions
If you do not have a source, target, or mapping bottleneck, you may have a
session bottleneck.
You can identify a session bottleneck by using the performance details. The
informatica server creates performance details when you enable Collect
Performance Data on the General Tab of the session properties.
Performance details display information about each Source Qualifier, target
definition, and individual transformation. All transformations have some basic
counters that indicate the number of input rows, output rows, and error rows.
Any value other than zero in the readfromdisk and writetodisk counters for
Aggregate, Joiner, or Rank transformations indicates a session bottleneck.
Low BufferInput_efficiency and BufferOutput_efficiency counters also
indicate a session bottleneck.
Small cache size, low buffer memory, and small commit intervals can cause
session bottlenecks.
5. System (Networks)
18. How to improve the Session performance?
1. Run concurrent sessions
2. Partition session (Power center)
3. Tune Parameter DTM buffer pool, Buffer block size, Index cache size, data
cache size, Commit Interval, Tracing level (Normal, Terse, Verbose Init, Verbose
Data)
The session has memory to hold 83 sources and targets. If there are more, then
the DTM buffer can be increased.
The informatica server uses the index and data caches for Aggregate, Rank,
Lookup and Joiner transformations. The server stores the transformed data from
the above transformations in the data cache before returning it to the data flow.
It stores group information for those transformations in the index cache.
If the allocated data or index cache is not large enough to store the data, the
server stores the data in a temporary disk file as it processes the session data.
Each time the server pages to the disk the performance slows. This can be seen
from the counters.
Since the data cache generally holds more than the index cache, it has to be
larger than the index cache.
4. Remove Staging area
5. Turn off Session recovery
6. Reduce error tracing
19. What are tracing levels?
Normal (default)
Logs initialization and status information, errors encountered, and skipped rows due
to transformation errors; summarizes session results but not at the row level.
Terse
Logs initialization, error messages, and notification of rejected data.
Verbose Init.
In addition to normal tracing, it also logs additional initialization
information, names of index and data files used and detailed transformation
statistics.
Verbose Data.
In addition to Verbose Init, it records row level logs.
20. What is Slowly changing dimensions?
Slowly changing dimensions are dimension tables that have slowly increasing data
as well as updates to existing data.
21. What are mapping parameters and variables?
A mapping parameter is a user definable constant that takes up a value before
running a session. It can be used in SQ expressions, Expression transformation
etc.
Steps:
Define the parameter in the mapping designer - parameter & variables .
Use the parameter in the Expressions.
Define the values for the parameter in the parameter file.
A mapping variable is also defined similar to the parameter except that the value
of the variable is subjected to change.
It picks up the value in the following order.
1. From the Session parameter file
2. As stored in the repository object in the previous run.
3. As defined in the initial values in the designer.
4. Default values
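The lookup order above amounts to a first-match search through several sources. A minimal Python sketch follows; the resolve_variable helper, the dictionaries, and the $$LastRunDate name are hypothetical and only illustrate the precedence.

```python
def resolve_variable(name, parameter_file, repository, initial_values,
                     datatype_default):
    """Return the first value found, following the precedence listed above."""
    for source in (parameter_file, repository, initial_values):
        if name in source:
            return source[name]
    return datatype_default  # nothing defined anywhere: fall back to default


parameter_file = {}
repository = {"$$LastRunDate": "2004-01-15"}
initial_values = {"$$LastRunDate": "2000-01-01"}

value = resolve_variable("$$LastRunDate", parameter_file, repository,
                         initial_values, datatype_default="")
print(value)
```

Here the repository value wins because the parameter file does not define the variable; adding it to parameter_file would override both.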
Q. What are the output files that the Informatica server creates during
the session running?
Informatica server log: Informatica server (on UNIX) creates a log for all status
and error messages (default name: pm.server.log). It also creates an error log for
error messages. These files will be created in Informatica home directory
Session log file: Informatica server creates session log file for each session. It
writes information about session into log files such as initialization process,
creation of sql commands for reader and writer threads, errors encountered and
load summary. The amount of detail in session log file depends on the tracing
level that you set.
Session detail file: This file contains load statistics for each target in mapping.
Session detail includes information such as table name, number of rows written or
rejected. You can view this file by double clicking on the session in monitor
window.
Performance detail file: This file contains information known as session
performance details, which helps you identify where performance can be improved. To
generate this file select the performance detail option in the session property
sheet.
Reject file: This file contains the rows of data that the writer does not write to
targets.
Control file: Informatica server creates control file and a target file when you run a
session that uses the external loader. The control file contains the information
about the target flat file such as data format and loading instructions for the
external loader.
Post session email: Post session email allows you to automatically communicate
information about a session run to designated recipients. You can create two
different messages. One if the session completed successfully the other if the
session fails.
Indicator file: If you use the flat file as a target, you can configure the Informatica
server to create indicator file. For each target row, the indicator file contains a
number to indicate whether the row was marked for insert, update, delete or
reject.
Output file: If session writes to a target file, the Informatica server creates the
target file based on file properties entered in the session property sheet.
Cache files: When the Informatica server creates memory cache it also creates
cache files.
The Informatica server creates index and data cache files for the following
transformations:
Aggregator transformation
Joiner transformation
Rank transformation
Lookup transformation
Q. What is the difference between joiner transformation and source
qualifier transformation?
A. You can join heterogeneous data sources in joiner transformation which we
cannot do in source qualifier transformation.
Q. What is meant by lookup caches?
A. The Informatica server builds a cache in memory when it processes the first
row of a data in a cached look up transformation. It allocates memory for the
cache based on the amount you configure in the transformation or session
properties. The Informatica server stores condition values in the index cache and
output values in the data cache.
Q. What is meant by parameters and variables in Informatica and how are
they used?
A. Parameter: A mapping parameter represents a constant value that you can
define before running a session. A mapping parameter retains the same value
throughout the entire session.
Variable: A mapping variable represents a value that can change through the
session. The Informatica server saves the value of a mapping variable to the
repository at the end of each successful session run and uses that value the next
time you run the session.
Q. What is target load order?
A. You specify the target load order based on source qualifiers in a mapping. If you
have multiple source qualifiers connected to multiple targets, you can define the
order in which the Informatica server loads data into the targets.
Informatica is a leading data integration software. The products of the company support
various enterprise-wide data integration and data quality solutions including data
warehousing, data migration, data consolidation, data synchronization, data governance,
master data management, and cross-enterprise data integration.
The important Informatica components are:
Power Exchange
Power Center
Power Center Connect
Power Channel
Metadata Exchange
Power Analyzer
Super Glue

This section will contain some useful tips and tricks for optimizing Informatica performance.
This includes some real-time problems or errors and ways to troubleshoot them, best
practices, etc.
Q1: Introduce Yourself.
Re: What is incremental aggregation and how is it done?
Answer: When using incremental aggregation, you apply captured
changes in the source to aggregate calculations in a
session. If the source changes only incrementally and you
can capture changes, you can configure the session to
process only those changes. This allows the Informatica
Server to update your target incrementally, rather than
forcing it to process the entire source and recalculate the
same calculations each time you run the session.

Q2: What is data warehousing?

A data warehouse is a collection of data designed to support management decision making. Data warehouses
contain a wide variety of data that present a coherent picture of business conditions at a
single point in time.
Development of a data warehouse includes development of systems to extract data from
operational systems plus installation of a warehouse database system that provides managers
flexible access to the data.
The term data warehousing generally refers to the combination of many different databases
across an entire enterprise. Contrast with data mart.
Q3: What is the need of datawarehousing?
Q4: Difference between OLTP and OLAP

OLTP                                 OLAP
Current data                         Current and historical data
Short database transactions          Long database transactions
Online update/insert/delete          Batch update/insert/delete
Normalization is promoted            Denormalization is promoted
High volume transactions             Low volume transactions
Transaction recovery is necessary    Transaction recovery is not necessary

Q5: Why do we use OLTP & OLAP


Q6: How to handle decimals in Informatica while using flat files?
While importing the flat file definition, just specify the scale for a numeric data type. In the mapping, the flat file source
supports only the number datatype (no decimal and integer). The Source Qualifier associated with that source will have the
decimal datatype for that number port of the source.

Q7: Why do we use update strategy?

Session properties such as Treat Source Rows As (INSERT, UPDATE, REJECT, DELETE) apply one operation to every row,
so using session properties we can handle a single flow only.
An SCD mapping needs both insert and update at the same time, which is possible only with the Update Strategy
transformation. Using the Update Strategy transformation we can create SCD mappings easily.
----------------
It is important to use an Update Strategy transformation in SCDs because SCDs maintain historical
data, especially Type 2 dimensions. In this case we may need to flag rows from the same target for
different database operations. Hence we have no choice but to use Update Strategy, as this is not
possible at session level.
Q8: Can we use update strategy in flat files?
Data in a flat file cannot be updated.
Q9: If yes why? If not why?
Q10: What is junk dimension?
A junk dimension is a collection of random transactional codes or text attributes that are unrelated to any
particular dimension. The junk dimension is simply a structure that provides a convenient place to store the
junk attributes.

Q11: Difference between IIF and DECODE?

You can use nested IIF statements to test multiple conditions. The following example tests for various
conditions and returns 0 if sales is zero or negative:

IIF( SALES > 0, IIF( SALES < 50, SALARY1, IIF( SALES < 100, SALARY2, IIF( SALES < 200, SALARY3, BONUS))), 0 )

You can use DECODE instead of IIF in many cases. DECODE may improve readability. The following shows
how you can use DECODE instead of IIF:

DECODE( TRUE,
    SALES > 0 AND SALES < 50, SALARY1,
    SALES > 49 AND SALES < 100, SALARY2,
    SALES > 99 AND SALES < 200, SALARY3,
    SALES > 199, BONUS )
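The two styles can be compared side by side in a small Python sketch. This is only an illustration, not Informatica code: `iif` and `decode_true` are hypothetical helpers that imitate the expression functions, the salary and bonus amounts are made up, and the DECODE version is given a default of 0 so that it matches the nested IIF for zero or negative sales.

```python
from typing import Any

def iif(condition: bool, when_true: Any, when_false: Any = None) -> Any:
    """Imitates IIF(condition, value_if_true, value_if_false)."""
    return when_true if condition else when_false

def decode_true(*pairs: Any, default: Any = None) -> Any:
    """Imitates DECODE(TRUE, cond1, val1, cond2, val2, ...): the first true condition wins."""
    for condition, value in zip(pairs[0::2], pairs[1::2]):
        if condition:
            return value
    return default

# Hypothetical amounts, used only for the illustration.
SALARY1, SALARY2, SALARY3, BONUS = 1000, 2000, 3000, 5000

def pay_with_iif(sales: int) -> int:
    # Nested form: each level handles one more range.
    return iif(sales > 0,
               iif(sales < 50, SALARY1,
                   iif(sales < 100, SALARY2,
                       iif(sales < 200, SALARY3, BONUS))),
               0)

def pay_with_decode(sales: int) -> int:
    # Flat form: one condition/value pair per range.
    return decode_true(
        sales > 0 and sales < 50, SALARY1,
        sales > 49 and sales < 100, SALARY2,
        sales > 99 and sales < 200, SALARY3,
        sales > 199, BONUS,
        default=0)
```

For whole-number sales values the two functions return the same result; the flat list of condition/value pairs is simply easier to scan than four levels of nesting.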

Q12: Difference between correlated subquery and nested subquery

A correlated subquery runs once for each row selected by the outer query. It contains a reference to a value
from the row selected by the outer query.
A nested subquery runs only once for the entire nesting (outer) query. It does not contain any reference to the
outer query row.
For example
Correlated Subquery:
select e1.empname, e1.basicsal, e1.deptno from emp e1 where e1.basicsal = (select max(basicsal) from emp e2
where e2.deptno = e1.deptno)
Nested Subquery:
select empname, basicsal, deptno from emp where (deptno, basicsal) in (select deptno, max(basicsal) from
emp group by deptno)
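To see that both forms return the highest-paid employee of each department, the queries can be tried against a throwaway database. The sketch below uses Python's built-in sqlite3 module as a stand-in for the real database; the sample rows are invented, and the row-value IN syntax assumes SQLite 3.15 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (empname TEXT, basicsal INTEGER, deptno INTEGER);
    INSERT INTO emp VALUES
        ('Asha', 5000, 10), ('Ben', 7000, 10),
        ('Chen', 6000, 20), ('Dev', 4000, 20);
""")

# Correlated: the inner query refers to e1.deptno, so it is evaluated per outer row.
correlated = conn.execute("""
    SELECT e1.empname, e1.basicsal, e1.deptno
    FROM emp e1
    WHERE e1.basicsal = (SELECT MAX(e2.basicsal) FROM emp e2
                         WHERE e2.deptno = e1.deptno)
    ORDER BY e1.deptno
""").fetchall()

# Nested: the inner query has no reference to the outer row and can run once.
nested = conn.execute("""
    SELECT empname, basicsal, deptno
    FROM emp
    WHERE (deptno, basicsal) IN (SELECT deptno, MAX(basicsal)
                                 FROM emp GROUP BY deptno)
    ORDER BY deptno
""").fetchall()

print(correlated)
print(nested)
```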

Q13: What is Union?


The Union transformation is a multiple input group transformation that you use to merge data from multiple
pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the
UNION ALL SQL statement to combine the results from two or more SQL statements. Similar to the
UNION ALL statement the Union transformation does not remove duplicate rows.

The Integration Service processes all input groups in parallel. The Integration Service concurrently reads
sources connected to the Union transformation and pushes blocks of data into the input groups of
the transformation. The Union transformation processes the blocks of data based on the order it receives the
blocks from the Integration Service.
You can connect heterogeneous sources to a Union transformation. The Union transformation merges sources
with matching ports and outputs the data from one output group with the same ports as the input groups.
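Since the Union transformation is described as behaving like UNION ALL, a quick SQL comparison shows what "does not remove duplicate rows" means in practice. This is a sketch using Python's sqlite3 module with two made-up order tables; it illustrates the SQL behaviour only, not the transformation itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_east (order_id INTEGER, amount INTEGER);
    CREATE TABLE orders_west (order_id INTEGER, amount INTEGER);
    INSERT INTO orders_east VALUES (1, 100), (2, 200);
    INSERT INTO orders_west VALUES (2, 200), (3, 300);
""")

# UNION ALL keeps every row from both inputs, including the repeated (2, 200).
union_all_rows = conn.execute("""
    SELECT order_id, amount FROM orders_east
    UNION ALL
    SELECT order_id, amount FROM orders_west
""").fetchall()

# Plain UNION removes the duplicate row.
union_rows = conn.execute("""
    SELECT order_id, amount FROM orders_east
    UNION
    SELECT order_id, amount FROM orders_west
""").fetchall()

print(len(union_all_rows), len(union_rows))
```

If duplicates must be removed after a Union transformation, that has to be done in a later step of the mapping.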

Q14: How to use union?


What is the difference between star schema and snowflake schema?
Star Schema : A star schema is a relational database schema for representing multidimensional data.
It is the simplest form of data warehouse schema and contains one or more dimensions and fact
tables. It is called a star schema because the entity-relationship diagram between dimensions and fact
tables resembles a star, where one fact table is connected to multiple dimensions. The center of the star
schema consists of a large fact table, and it points towards the dimension tables. The advantages of a star
schema are slicing down, increased performance and easy understanding of data.
Snowflake Schema : A snowflake schema is a term that describes a star schema structure normalized
through the use of outrigger tables, i.e. dimension table hierarchies are broken into simpler tables.
In a star schema every dimension will have a primary key.
In a star schema a dimension table will not have any parent table,
whereas in a snowflake schema a dimension table will have one or more parent tables.
Hierarchies for the dimensions are stored in the dimension table itself in a star schema,
whereas hierarchies are broken into separate tables in a snowflake schema. These hierarchies help to drill
down the data from the topmost hierarchies to the lowermost hierarchies.
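The difference is easiest to see in table definitions. The sketch below builds the same product dimension both ways in an in-memory SQLite database via Python; all table names, columns and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: the category level of the hierarchy lives inside the dimension table.
    CREATE TABLE dim_product_star (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_name TEXT
    );
    -- Snowflake: the category level is broken out into a parent table.
    CREATE TABLE dim_category (
        category_key INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product_snow (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (product_key INTEGER, amount INTEGER);

    INSERT INTO dim_product_star VALUES
        (1, 'Pen', 'Stationery'), (2, 'Ink', 'Stationery'), (3, 'Lamp', 'Lighting');
    INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Lighting');
    INSERT INTO dim_product_snow VALUES (1, 'Pen', 1), (2, 'Ink', 1), (3, 'Lamp', 2);
    INSERT INTO fact_sales VALUES (1, 10), (2, 20), (3, 70), (1, 5);
""")

# Star: one join from the fact table to the dimension.
star_totals = conn.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_key = f.product_key
    GROUP BY p.category_name
    ORDER BY p.category_name
""").fetchall()

# Snowflake: an extra join is needed to reach the parent (category) table.
snowflake_totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()

print(star_totals)
print(snowflake_totals)
```

Both queries give the same totals; the snowflake version trades one more join for a dimension without repeated category names.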

Q15: How many data sources are available?


Q16: What is SCD?
SCD stands for slowly changing dimension.
It captures data that changes slowly over time. For example, the
address of a customer changes only in rare cases; it never changes frequently.

There are 3 types of SCD:
Type 1 - only the most recent changed data is stored.
Type 2 - the recent data as well as all past (historical) data is stored.
Type 3 - partial history and recent data are stored, meaning it stores the most
recent update and the most recent history.
Because a data warehouse holds historical data, Type 2 is the most
useful for it.
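The three types can be sketched in a few lines of Python. This is an illustration under assumptions, not an Informatica mapping: the column names (`surrogate_key`, `start_date`, `end_date`, `is_current`, `previous_address`) are hypothetical, and a real Type 2 dimension may use a version number or flag instead of dates.

```python
from datetime import date

def scd_type1(row, new_address):
    """Type 1: overwrite the attribute in place; no history is kept."""
    return {**row, "address": new_address}

def scd_type2(history, new_address, change_date):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    expired = [
        {**row, "end_date": change_date, "is_current": False} if row["is_current"] else row
        for row in history
    ]
    new_version = {
        "customer_id": history[0]["customer_id"],
        "surrogate_key": max(r["surrogate_key"] for r in history) + 1,
        "address": new_address,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    }
    return expired + [new_version]

def scd_type3(row, new_address):
    """Type 3: keep only the previous value in an extra column."""
    return {**row, "previous_address": row["address"], "address": new_address}

original = {
    "customer_id": 7, "surrogate_key": 1, "address": "Old St",
    "start_date": date(2005, 1, 1), "end_date": None, "is_current": True,
}
type1_row = scd_type1(original, "New Rd")
type2_rows = scd_type2([original], "New Rd", date(2006, 7, 25))
type3_row = scd_type3(original, "New Rd")
```

Only the Type 2 result has two rows for the customer, which is why it preserves full history.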
Q17: Types of scd
Q18: How can we improve the session performance?
Re: How does the Informatica server increase session performance through partitioning the
source?
Answer: For relational sources, the Informatica server creates multiple
connections, one for each partition of a single source, and
extracts a separate range of data for each connection.
The Informatica server reads multiple partitions of a single
source concurrently. Similarly, for loading, the Informatica
server creates multiple connections to the target and loads
partitions of data concurrently.

For XML and file sources, the Informatica server reads multiple
files concurrently. For loading the data, the Informatica server
creates a separate file for each partition (of a source
file). You can choose to merge the targets.

Q19:What do you mean by informatica?

Q20: Difference between dimension and fact tables

Dimension Table features
1. It provides the context/descriptive information for fact table measurements.
2. It provides entry points to data.
3. Structure of a dimension - surrogate key, one or more other fields that compose the natural key (nk), and a set of
attributes.
4. The size of a dimension table is smaller than that of a fact table.
5. In a schema there are more dimension tables than fact tables.
6. A surrogate key is used to prevent primary key (pk) violations (when storing historical data).
7. Values of fields are in numeric and text representation.
Fact Table features
1. It provides measurements of an enterprise.
2. A measurement is the amount determined by observation.
3. Structure of a fact table - foreign keys (fk), degenerate dimensions and measurements.
4. The size of a fact table is larger than that of a dimension table.
5. In a schema there are fewer fact tables than dimension tables.
6. A composite of the degenerate dimension fields acts as the primary key.
7. Values of the fields are always in numeric or integer form.

Performance tuning in Informatica?

The goal of performance tuning is to optimize session performance so that sessions run during the available load window for
the Informatica server. You can increase session performance in the following ways.
The performance of the Informatica server is related to network connections. Data generally moves across a network at
less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Thus network connections often
affect session performance, so avoid network connections where possible.
Flat files: If your flat files are stored on a machine other than the Informatica server, move those files to the machine
that hosts the Informatica server.
Relational data sources: Minimize the connections to sources, targets and the Informatica server to
improve session performance. Moving the target database onto the server system may improve session
performance.
Staging areas: If you use staging areas, you force the Informatica server to perform multiple data passes.
Removing staging areas may improve session performance.
You can run multiple Informatica servers against the same repository. Distributing the session load across
multiple Informatica servers may improve session performance.
Running the Informatica server in ASCII data movement mode improves session performance, because ASCII data movement
mode stores a character value in one byte, while Unicode mode takes 2 bytes to store a character.
If a session joins multiple source tables in one Source Qualifier, optimizing the query may improve performance. Also,
single table select statements with an ORDER BY or GROUP BY clause may benefit from optimization such as adding
indexes.
We can improve session performance by configuring the network packet size, which determines how much
data crosses the network at one time. To do this, go to the Server Manager and choose server configure database connections.
If your target has key constraints and indexes, they slow the loading of data. To improve session performance in this
case, drop the constraints and indexes before you run the session and rebuild them after the session completes.
Running parallel sessions by using concurrent batches will also reduce the time needed to load the
data, so concurrent batches may also increase session performance.
Partitioning the session improves session performance by creating multiple connections to sources and targets and
loading data in parallel pipelines.
In some cases, if a session contains an Aggregator transformation, you can use incremental aggregation to improve session
performance.
Avoid transformation errors to improve session performance.
If the session contains a Lookup transformation, you can improve session performance by enabling the lookup cache.
If your session contains a Filter transformation, place that Filter transformation nearer to the sources,
or use a filter condition in the Source Qualifier.
Aggregator, Rank and Joiner transformations may often decrease session performance, because they must group data
before processing it. To improve session performance in this case, use the sorted ports option.

You can also perform the following tasks to optimize the mapping:
1. Configure single-pass reading.
2. Optimize datatype conversions.
3. Eliminate transformation errors.
4. Optimize transformations.
5. Optimize expressions.

RE: Why did you use stored procedure in your ETL Appli...
Using a stored procedure has the following advantages:
1. Checks the status of the target database
2. Drops and recreates indexes
3. Determines if enough space exists in the database
4. Performs a specialized calculation
=======================================
A stored procedure in Informatica is useful for imposing complex business rules.
=======================================
Static cache:

1. A static cache remains the same during the session run.
2. A static cache can be used for relational and flat file lookup types.
3. A static cache can be used for both unconnected and connected Lookup transformations.
4. We can handle multiple matches in a static cache.
5. We can use relational operators such as <, >, <=, >= as well as =.

Dynamic cache:

1. A dynamic cache changes during the session run.
2. A dynamic cache can be used only for relational lookup types.
3. A dynamic cache can be used only for connected lookups.
4. We cannot handle multiple matches in a dynamic cache.
5. We can use only the = operator with a dynamic cache.
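The practical effect of point 1 in each list can be shown with a small Python sketch. It is a simplified model under assumptions: the cache is a plain dictionary keyed on a hypothetical `id` column, and the labels `found`, `not found` and `inserted` are illustrative, not Informatica output values.

```python
def run_with_static_cache(target_rows, source_rows):
    """Static: the cache is built once and never changes during the run."""
    cache = {row["id"]: row for row in target_rows}
    results = []
    for row in source_rows:
        results.append("found" if row["id"] in cache else "not found")
    return results

def run_with_dynamic_cache(target_rows, source_rows):
    """Dynamic: rows that are not found are added, so later rows can see them."""
    cache = {row["id"]: row for row in target_rows}
    results = []
    for row in source_rows:
        if row["id"] in cache:
            results.append("found")
        else:
            cache[row["id"]] = row
            results.append("inserted")
    return results

target = [{"id": 1}]
source = [{"id": 1}, {"id": 2}, {"id": 2}]  # id 2 arrives twice in the same run

static_result = run_with_static_cache(target, source)
dynamic_result = run_with_dynamic_cache(target, source)
```

With the static cache the second occurrence of id 2 still looks new, while the dynamic cache recognises it; this is the situation where a dynamic cache helps avoid inserting the same key twice in one run.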
Q. What is the difference between $ & $$ in mapping or parameter file?
In which cases they are generally used?
A.
$ prefixes are used to denote session Parameter and variables and $$
prefixes are used to denote mapping parameters and variables.
How to connect two or more tables with a single source qualifier?

Create an Oracle source with as many columns as you want
and write the join query in the SQL query override. The
column order and data types should be the same as in the SQL query.

A set of workflow tasks is called a worklet.

Workflow tasks include:
1) timer 2) decision 3) command 4) event wait 5) event raise 6) email, etc.
We use these tasks in different situations.
=======================================
A worklet is a set of tasks. If a certain set of tasks has to be reused in many workflows, then we use
worklets. To execute a worklet, it has to be placed inside a workflow.
The use of a worklet in a workflow is similar to the use of a mapplet in a mapping.
A worklet is a reusable workflow. It might contain more than one task. We can use these worklets in
other workflows.

Which performs better, IIF or DECODE?

DECODE performs better than IIF, and DECODE can be
used instead of multiple nested IIF conditions.

The DECODE function is also found in SQL (for example, Oracle SQL), whereas IIF is not
part of standard SQL. DECODE also makes the logic clearer and
easier for others to read.

What is source qualifier transformation?

SQ is an active transformation. It can perform the following tasks: join data from the
same source database, filter rows when PowerCenter reads source data, perform an
outer join, and select only distinct values from the source.
In a Source Qualifier transformation a user can define join conditions, filter the data and
eliminate duplicates. The default Source Qualifier query can be overwritten by the above options;
this is known as SQL Override.
When we add a relational or a flat file source definition to a mapping, we need to connect it to
a Source Qualifier transformation. The Source Qualifier transformation represents the records
that the Informatica server reads when it runs a session.

How many dimension tables did you have in your project? Name some dimensions
(columns).
Product Dimension : Product Key, Product Id, Product Type, Product Name, Batch Number
Distributor Dimension : Distributor Key, Distributor Id, Distributor Location
Customer Dimension : Customer Key, Customer Id, CName, Age, Status, Address, Contact
Account Dimension : Account Key, Acct Id, Acct Type, Location, Balance

What is meant by clustering?

Clustering stores two (or more) tables together in a single buffer, so that the joined data can be retrieved easily.

What are the rank caches?

When the Informatica server runs a session with a Rank transformation, it compares an input row with the rows in the data
cache. If the input row out-ranks a stored row, the Informatica server replaces the stored row with the input row.
The Informatica server stores group information in an index cache and row data in a data cache.
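The replace-if-out-ranked behaviour can be modelled in Python as a top-N per group calculation. This is a rough sketch under assumptions: a dictionary of small heaps plays the role of the caches, and the `region` and `sales` column names are hypothetical.

```python
from collections import defaultdict
import heapq

def top_n_per_group(rows, group_key, rank_key, n):
    """Keep only the n highest-ranked rows per group while streaming through the input."""
    cache = defaultdict(list)  # group value -> min-heap of (rank value, sequence, row)
    for seq, row in enumerate(rows):
        heap = cache[row[group_key]]
        entry = (row[rank_key], seq, row)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The input row out-ranks the weakest stored row, so it replaces it.
            heapq.heapreplace(heap, entry)
    return {
        group: [entry[2] for entry in sorted(heap, reverse=True)]
        for group, heap in cache.items()
    }

rows = [
    {"region": "east", "sales": 100},
    {"region": "east", "sales": 300},
    {"region": "east", "sales": 200},
    {"region": "west", "sales": 50},
]
top2 = top_n_per_group(rows, "region", "sales", 2)
```

Only n rows per group are ever held, which is why the cache stays small even when the input is large.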
Q. What type of repositories can be created using Informatica Repository Manager?
A. Informatica PowerCenter includes the following types of repositories:

Standalone Repository : A repository that functions individually and this is unrelated to any
other repositories.

Global Repository : This is a centralized repository in a domain. This repository can contain
shared objects across the repositories in a domain. The objects are shared through global
shortcuts.

Local Repository : A local repository is within a domain and is not a global repository. A local
repository can connect to a global repository using global shortcuts and can use objects in its shared folders.

Versioned Repository : This can be either a local or a global repository, but it allows version
control for the repository. A versioned repository can store multiple copies, or versions, of an
object. This feature makes it possible to efficiently develop, test and deploy metadata in the
production environment.
Q. What is a code page?
A. A code page contains encoding to specify characters in a set of one or more languages. The
code page is selected based on source of the data. For example if source contains Japanese text
then the code page should be selected to support Japanese text.
When a code page is chosen, the program or application for which the code page is set, refers to a
specific set of data that describes the characters the application recognizes. This influences the
way that application stores, receives, and sends character data.
Q. Which databases can PowerCenter Server on Windows connect to?
A. PowerCenter Server on Windows can connect to the following databases:

IBM DB2
Informix
Microsoft Access
Microsoft Excel
Microsoft SQL Server
Oracle
Sybase
Teradata

Q. Which databases can PowerCenter Server on UNIX connect to?
A. PowerCenter Server on UNIX can connect to the following databases:

IBM DB2
Informix
Oracle
Sybase
Teradata

Informatica Mapping Designer


Q. How to execute PL/SQL script from Informatica mapping?
A. Stored Procedure (SP) transformation can be used to execute PL/SQL Scripts. In SP
Transformation PL/SQL procedure name can be specified. Whenever the session is executed, the
session will call the pl/sql procedure.
Q. How can you define a transformation? What are different types of transformations
available in Informatica?
A. A transformation is a repository object that generates, modifies, or passes data. The Designer
provides a set of transformations that perform specific functions. For example, an Aggregator
transformation performs calculations on groups of data. Below are the various transformations
available in Informatica:

Aggregator
Application Source Qualifier
Custom
Expression
External Procedure
Filter
Input
Joiner
Lookup
Normalizer
Output
Rank
Router
Sequence Generator
Sorter
Source Qualifier
Stored Procedure
Transaction Control
Union
Update Strategy
XML Generator
XML Parser

XML Source Qualifier


Q. What is a source qualifier? What is meant by Query Override?
A. Source Qualifier represents the rows that the PowerCenter Server reads from a relational or flat
file source when it runs a session. When a relational or a flat file source definition is added to a
mapping, it is connected to a Source Qualifier transformation.
PowerCenter Server generates a query for each Source Qualifier transformation whenever it runs
the session. The default query is a SELECT statement containing all the source columns. The Source
Qualifier can override this default query by changing the default settings of the
transformation properties. The list of selected ports and the order in which they appear in the default query
should not be changed in the overridden query.
Q. What is aggregator transformation?
A. The Aggregator transformation allows performing aggregate calculations, such as averages and
sums. Unlike the Expression transformation, the Aggregator transformation can only be used to
perform calculations on groups. The Expression transformation permits calculations on a row-by-row basis only.
Aggregator Transformation contains group by ports that indicate how to group the data. While
grouping the data, the aggregator transformation outputs the last row of each group unless
otherwise specified in the transformation properties.
Various group by functions available in Informatica are : AVG, COUNT, FIRST, LAST, MAX, MEDIAN,
MIN, PERCENTILE, STDDEV, SUM, VARIANCE.
Q. What is Incremental Aggregation?
A. Whenever a session is created for a mapping that contains an Aggregator transformation, the session option for
Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it
passes new source data through the mapping and uses historical cache data to perform new
aggregation calculations incrementally.
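The idea of reusing cached totals can be sketched in Python. This is a simplified model assuming a SUM aggregate and hypothetical `product` and `amount` columns; aggregates such as MEDIAN cannot be updated this simply, and the real cache file layout is not shown here.

```python
def full_aggregate(rows):
    """Recalculate every group total from the complete source."""
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0) + row["amount"]
    return totals

def incremental_aggregate(cached_totals, new_rows):
    """Apply only the new rows on top of the totals saved by the previous run."""
    totals = dict(cached_totals)
    for row in new_rows:
        totals[row["product"]] = totals.get(row["product"], 0) + row["amount"]
    return totals

day1 = [{"product": "pen", "amount": 10}, {"product": "ink", "amount": 4}]
day2 = [{"product": "pen", "amount": 5}, {"product": "lamp", "amount": 7}]

cache = full_aggregate(day1)                      # first run processes everything
after_day2 = incremental_aggregate(cache, day2)   # second run reads only the changes
```

The second run touches two rows instead of four, yet ends with the same totals as a full recalculation over both days.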
Q. How Union Transformation is used?
A. The union transformation is a multiple input group transformation that can be used to merge
data from various sources (or pipelines). This transformation works just like UNION ALL statement
in SQL, that is used to combine result set of two SELECT statements.
Q. Can two flat files be joined with Joiner Transformation?
A. Yes, joiner transformation can be used to join data from two flat file sources.
Q. What is a look up transformation?
A. This transformation is used to lookup data in a flat file or a relational table, view or synonym. It
compares lookup transformation ports (input ports) to the source column values based on the
lookup condition. Later returned values can be passed to other transformations.
Q. Can a lookup be done on Flat Files?
A. Yes.
Q. What is the difference between a connected look up and unconnected look up?
A. A connected lookup takes input values directly from other transformations in the pipeline.
An unconnected lookup doesn't take inputs directly from any other transformation, but it can be used in any
transformation (like Expression) and can be invoked as a function using a :LKP expression. So, an unconnected lookup can be
called multiple times in a mapping.
Q. What is a mapplet?
A. A mapplet is a reusable object that is created using mapplet designer. The mapplet contains set
of transformations and it allows us to reuse that transformation logic in multiple mappings.
Q. What does reusable transformation mean?
A. Reusable transformations can be used multiple times in a mapping and in multiple mappings. The reusable transformation
is stored as metadata separate from any mapping that uses the transformation. Whenever
any changes to a reusable transformation are made, all the mappings where the transformation is
used will be invalidated.
Q. What is update strategy and what are the options for update strategy?
A. Informatica processes the source data row-by-row. By default every row is marked to be inserted
in the target table. If the row has to be updated/inserted based on some logic Update Strategy
transformation is used. The condition can be specified in Update Strategy to mark the processed
row for update or insert.
Following options are available for update strategy :

DD_INSERT : If this is used the Update Strategy flags the row for insertion. Equivalent
numeric value of DD_INSERT is 0.

DD_UPDATE : If this is used the Update Strategy flags the row for update. Equivalent
numeric value of DD_UPDATE is 1.

DD_DELETE : If this is used the Update Strategy flags the row for deletion. Equivalent
numeric value of DD_DELETE is 2.

DD_REJECT : If this is used the Update Strategy flags the row for rejection. Equivalent
numeric value of DD_REJECT is 3.
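The flagging logic can be pictured with a short Python sketch. The four constants use the numeric values listed above; everything else (the `flag_row` helper, the `customer_id` and `is_deleted` fields, the rule order) is a hypothetical example of the kind of condition an Update Strategy expression might encode.

```python
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

def flag_row(row, existing_ids):
    """Return the flag for one source row, in the spirit of a nested IIF expression."""
    if row["customer_id"] is None:
        return DD_REJECT          # no key, so the row cannot be applied
    if row.get("is_deleted", False):
        return DD_DELETE          # source marks the row as removed
    if row["customer_id"] in existing_ids:
        return DD_UPDATE          # key already exists in the target
    return DD_INSERT              # new key

existing_ids = {101, 102}  # keys assumed to be already present in the target
source_rows = [
    {"customer_id": 101, "name": "Asha"},
    {"customer_id": 103, "name": "Ben"},
    {"customer_id": 102, "name": "Chen", "is_deleted": True},
    {"customer_id": None, "name": "Unknown"},
]
flags = [flag_row(row, existing_ids) for row in source_rows]
```

Each row gets its own flag, which is what allows inserts and updates to be mixed within a single run.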

Re: What are anti-joins?

Answer: Anti-joins are written using the NOT EXISTS or NOT IN
constructs. An anti-join between two tables returns rows
from the first table for which there are no corresponding
rows in the second table. In other words, it returns rows
that fail to match the sub-query on the right side.

Suppose you want a list of departments with no employees.
You could write a query like this:

SELECT d.department_name
FROM departments d
MINUS
SELECT d.department_name
FROM departments d, employees e
WHERE d.department_id = e.department_id
ORDER BY department_name;

The above query will give the desired results, but it might
be clearer to write the query using an anti-join:
SELECT d.department_name
FROM departments d
WHERE NOT EXISTS (SELECT NULL
FROM employees e
WHERE e.department_id = d.department_id)
ORDER BY d.department_name;
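Both formulations can be tried on a tiny data set. The sketch below uses Python's sqlite3 module with invented rows; because SQLite has no MINUS keyword, EXCEPT is used as its equivalent, which is an assumption about portability and not part of the original Oracle queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (department_id INTEGER, department_name TEXT);
    CREATE TABLE employees (employee_id INTEGER, department_id INTEGER);
    INSERT INTO departments VALUES (10, 'Sales'), (20, 'HR'), (30, 'Legal');
    INSERT INTO employees VALUES (1, 10), (2, 10), (3, 20);
""")

# Set-difference form (EXCEPT stands in for Oracle's MINUS).
set_difference = conn.execute("""
    SELECT d.department_name FROM departments d
    EXCEPT
    SELECT d.department_name FROM departments d, employees e
    WHERE d.department_id = e.department_id
    ORDER BY department_name
""").fetchall()

# Anti-join form using NOT EXISTS.
anti_join = conn.execute("""
    SELECT d.department_name FROM departments d
    WHERE NOT EXISTS (SELECT NULL FROM employees e
                      WHERE e.department_id = d.department_id)
    ORDER BY d.department_name
""").fetchall()

print(set_difference, anti_join)
```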

Re: Without using any transformations, how can you load the data into the target?

If I were the candidate, I would simply say: if there are no
transformations to be done, I will run an insert
script if the source and target can talk to each other, or
simply use source -> source qualifier -> target. If the
interviewer says SQ is a transformation, then say "Then I
don't know. I have always used Informatica when there is
some kind of transformation involved, because that is what
Informatica is mainly used for".
What is a source qualifier?
What is a surrogate key?

What is difference between Mapplet and reusable transformation?


What is DTM session?
What is a Mapplet?
What is a look up function? What is default transformation for the look up function?
What is difference between a connected look up and unconnected look up?
What is update strategy and what are the options for update strategy?
What is subject area?
What is the difference between truncate and delete statements?
What kind of Update strategies are normally used (Type 1, 2 & 3) & what are the differences?
What is the exact syntax of an update strategy?
What are bitmap indexes and how and why are they used?
What is bulk bind? How does it improve performance?
What are the different ways to filter rows using Informatica transformations?
What is referential Integrity error? How do you rectify it?
What is DTM process?
What is target load order?
What exactly is a shortcut and how do you use it?
What is a shared folder?
What are the different transformations where you can use a SQL override?
What is the difference between a Bulk and Normal mode and where exactly is it defined?
What is the difference between Local & Global repository?
What are data driven sessions?
What are the common errors while running a Informatica session?
What are worklets and what is their use?
What is change data capture?
What exactly is tracing level?
What is the difference between constraints based load ordering and target load plan?
What is a deployment group and what is its use?
When and how a partition is defined using Informatica?
How do you improve performance in an Update strategy?
How do you validate all the mappings in the repository at once?
How can you join two or more tables without using the source qualifier override SQL or a Joiner transformation?
How can you define a transformation? What are different types of transformations in Informatica?
How many repositories can be created in Informatica?
How many minimum groups can be defined in a Router transformation?
How do you define partitions in Informatica?
How can you improve performance in an Aggregator transformation?
How does the Informatica know that the input is sorted?
How many worklets can be defined within a workflow?
How do you define a parameter file? Give an example of its use.

If you join two or more tables and then pull out about two columns from each table into the source qualifier and then just
pull out one column from the source qualifier into an Expression transformation and then do a generate SQL in the
source qualifier how many columns will show up in the generated SQL.
In a Type 1 mapping with one source and one target table what is the minimum number of update strategy
transformations to be used?
At what levels can you define parameter files and what is the order?
In a session log file where can you find the reader and the writer details?
For joining three heterogeneous tables how many joiner transformations are required?
Can you look up a flat file using Informatica?
While running a session what default files are created?
Describe the use of Materialized views and how are they different from a normal view.
Contributed by Mukherjee, Saibal (ETL Consultant)
Many readers are asking "Where's the answer?" Well, it will take some time before I get time to write it. But there is no
reason to get upset. The Informatica help files should have all of these answers!

Loading & testing fact/transactional/balances (data), which is valid between dates!


Tuesday, July 25th, 2006

This is going to be a very interesting topic for ETL & Data modelers who design processes/tables to load fact or
transactional data which keeps on changing between dates.

ex: prices of shares, Company ratings, etc.

The table above shows an entity in the source system that contains time variant values but they dont change daily. The values are valid over a period of
time; then they change.

1. What table structure should be used in the data warehouse?

Maybe Ralph Kimball or Bill Inmon can come up with a better data model!
But for ETL developers or ETL leads the decision is already made, so let's look for a solution.
2. What should be the ETL design to load such a structure?
Design A

There is a one-to-one relationship between the source row and the target row.
There is a CURRENT_FLAG attribute, which means that every time the ETL process gets a new value it has to add a new row
with the current flag set and go back to the previous row and retire it. This is a very costly ETL step, and it will slow down the ETL
process.

From the report writer's point of view this model is a major challenge to use, because what if the report wants a rate which
is not current? Imagine the complex query.
Design B

In this design a snapshot of the source table is taken every day.
The ETL is very easy. But can you imagine the size of the fact table when the source table has more than 1 million
rows? (1 million x 365 days = 365 million rows per year.) And what if the values change every hour or every
minute?

But you have a very happy user who can write SQL reports very easily.
Design C

Can there be a compromise? How about using a from date (time) and a to date (time)!

The report writer can simply provide a date (time), and straight SQL can return the value/row that was valid at that moment.

However, the ETL is as complex as in the A model: while the current row is valid from the current date to infinity, the previous row has to be retired so that it runs from its from date to today's date - 1.

This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and
time only one instance of the row exists (for the primary key).
Which design is better? I have used all of them, depending on the situation.
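The from date / to date idea of Design C can be sketched in a few lines of Python. This is only an illustration under assumptions: the row layout, the INFINITY marker, the function names and the sample share prices are all made up here, and a real warehouse would do the same work with SQL updates and inserts.

```python
from datetime import date, timedelta

INFINITY = date(9999, 12, 31)  # assumed marker for the open-ended current row


def load_value(history, key, new_value, effective):
    """Retire the current row for `key` (if any) and append the new current row."""
    for row in history:
        if row["key"] == key and row["to_date"] == INFINITY:
            if row["value"] == new_value:
                return  # value unchanged, keep the current row open
            # Retire the previous row to (new from date - 1 day).
            row["to_date"] = effective - timedelta(days=1)
    history.append({"key": key, "value": new_value,
                    "from_date": effective, "to_date": INFINITY})


def value_on(history, key, as_of):
    """Return the value that was valid for `key` on the date `as_of`."""
    for row in history:
        if row["key"] == key and row["from_date"] <= as_of <= row["to_date"]:
            return row["value"]
    return None


history = []
load_value(history, "ACME", 10.0, date(2006, 7, 1))   # made-up share price
load_value(history, "ACME", 12.5, date(2006, 7, 20))
print(value_on(history, "ACME", date(2006, 7, 19)))  # 10.0
print(value_on(history, "ACME", date(2006, 7, 25)))  # 12.5
```

The lookup is the easy part (one range condition); the retire step in load_value is where the ETL cost and the testing effort go.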
3. What should be the unit test plan?
There are various cases the ETL can miss, and your test cases should be planned to test precisely those. Here are some examples
of test cases:
a. There should be only one value for a given date/date time.
b. During the initial load, when data is available for multiple days, the process should run sequentially and create the snapshots/ranges correctly.
c. At any given time there should be only one current row.
d. etc.
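Test cases (a) and (c) can be automated with two small checks. The sketch below is one possible way to do it, assuming rows are dictionaries with key, from_date and to_date fields and that a far-future date marks the current row; these names and the sample rows are illustrative, not part of any tool.

```python
from datetime import date

INFINITY = date(9999, 12, 31)  # assumed marker for "still current"


def find_overlaps(rows):
    """Return adjacent pairs of rows (per key) whose date ranges overlap."""
    by_key = {}
    for row in rows:
        by_key.setdefault(row["key"], []).append(row)
    problems = []
    for key_rows in by_key.values():
        key_rows.sort(key=lambda r: r["from_date"])
        for prev, nxt in zip(key_rows, key_rows[1:]):
            if nxt["from_date"] <= prev["to_date"]:
                problems.append((prev, nxt))
    return problems


def current_row_counts(rows):
    """Count the open-ended (current) rows for each key."""
    counts = {}
    for row in rows:
        if row["to_date"] == INFINITY:
            counts[row["key"]] = counts.get(row["key"], 0) + 1
    return counts


good = [
    {"key": "ACME", "from_date": date(2006, 7, 1), "to_date": date(2006, 7, 19)},
    {"key": "ACME", "from_date": date(2006, 7, 20), "to_date": INFINITY},
]
# A second current row that also overlaps the retired one.
bad = good + [{"key": "ACME", "from_date": date(2006, 7, 15), "to_date": INFINITY}]

print(len(find_overlaps(good)), current_row_counts(good))  # 0 {'ACME': 1}
print(len(find_overlaps(bad)), current_row_counts(bad))    # 2 {'ACME': 2}
```

Any overlap between two rows of the same key shows up as at least one overlapping adjacent pair once the rows are sorted by from date, so comparing neighbours is enough to flag a bad load.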

Datawarehouse and Informatica Interview Question


*******************Shankar Prasad*******************************

1. Can two fact tables share the same dimension tables? How many dimension tables are associated
with one fact table in your project?
Ans: Yes
2.What is ROLAP, MOLAP, and DOLAP...?
Ans: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP), and DOLAP (Desktop OLAP). In
these three OLAP
architectures, the interface to the analytic layer is typically the same; what is quite different
is how the data is physically stored.
In MOLAP, the premise is that online analytical processing is best implemented by storing
the data multidimensionally; that is,
data must be stored multidimensionally in order to be viewed in a multidimensional manner.
In ROLAP, architects believe that the data should be stored in the relational model; that is, OLAP
capabilities are best provided
against the relational database.
DOLAP is a variation that exists to provide portability for the OLAP user. It creates
multidimensional datasets that can be
transferred from server to desktop, requiring only the DOLAP software to exist on the target
system. This provides significant
advantages to portable computer users, such as salespeople who are frequently on the road
and do not have direct access to
their office server.
3. What is an MDDB, and what is the difference between MDDBs and RDBMSs?
Ans: MDDB stands for Multidimensional Database. There are two primary technologies that are used for storing
the data used in OLAP applications.
These two technologies are multidimensional databases (MDDB) and relational databases
(RDBMS). The major difference
between MDDBs and RDBMSs is in how they store data. Relational databases store their
data in a series of tables and
columns. Multidimensional databases, on the other hand, store their data in large
multidimensional arrays.
For example, in an MDDB world, you might refer to a sales figure as Sales with Date,
Product, and Location coordinates of
12-1-2001, Car, and south, respectively.
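The difference in storage can be pictured with plain Python structures. This is a toy sketch, not how any MDDB product is implemented: the sales figures are invented, and the lists stand in for relational rows and for a multidimensional array addressed by Date, Product and Location coordinates.

```python
# Relational style: one row per fact, found by matching column values.
rows = [
    {"date": "12-1-2001", "product": "Car", "location": "south", "sales": 150},
    {"date": "12-1-2001", "product": "Car", "location": "north", "sales": 90},
]
relational = next(
    r["sales"] for r in rows
    if (r["date"], r["product"], r["location"]) == ("12-1-2001", "Car", "south")
)

# Multidimensional style: each dimension member maps to an array position.
dates = ["12-1-2001", "12-2-2001"]
products = ["Car", "Truck"]
locations = ["south", "north"]
cube = [[[0 for _ in locations] for _ in products] for _ in dates]
cube[0][0][0] = 150  # Sales at (12-1-2001, Car, south)
cube[0][0][1] = 90   # Sales at (12-1-2001, Car, north)


def cell(date, product, location):
    """Look up a sales figure directly by its three coordinates."""
    return cube[dates.index(date)][products.index(product)][locations.index(location)]


print(relational, cell("12-1-2001", "Car", "south"))  # 150 150
```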
Advantages of MDDB:
Retrieval is very fast because:
- The data corresponding to any combination of dimension members can be retrieved with a single I/O.
- Data is clustered compactly in a multidimensional array.
- Values are calculated ahead of time.
- The index is small and can therefore usually reside completely in memory.
Storage is very efficient because:
- The blocks contain only data.
- A single index locates the block corresponding to a combination of sparse dimension numbers.
4. What is MDB modeling and RDB Modeling?
Ans:
5. What is a Mapplet and how do you create a Mapplet?
Ans: A mapplet is a reusable object that represents a set of transformations. It allows you to
reuse transformation logic and can
contain as many transformations as you need.
Create a mapplet when you want to use a standardized set of transformation logic in several
mappings. For example, if you
have several fact tables that require a series of dimension keys, you can create a mapplet
containing a series of Lookup
transformations to find each dimension key. You can then use the mapplet in each fact table
mapping, rather than recreate the
same lookup logic in each mapping.
To create a new mapplet:
1. In the Mapplet Designer, choose Mapplets-Create Mapplet.
2. Enter a descriptive mapplet name.
The recommended naming convention for mapplets is mpltMappletName.
3. Click OK.
The Mapping Designer creates a new mapplet in the Mapplet Designer.
4. Choose Repository-Save.
6. What are transformations used for?
Ans: Transformations are the manipulation of data from how it appears in the source system(s)
into another form in the data
warehouse or mart in a way that enhances or simplifies its meaning. In short, you transform
data into information.
This includes data merging, cleansing, and aggregation:
Data merging: The process of standardizing data types and fields. Suppose one source system
stores integer data as smallint
whereas another stores similar data as decimal. The data from the two source systems needs
to be rationalized when moved into
the Oracle data format called NUMBER.
Cleansing: This involves identifying and correcting any inconsistencies or inaccuracies:
- Eliminating inconsistencies in the data from multiple sources.
- Converting data from different systems into a single consistent data set suitable for
analysis.
- Meeting a standard for establishing data elements, codes, domains, formats and
naming conventions.
- Correcting data errors and filling in missing data values.
Aggregation: The process whereby multiple detailed values are combined into a single
summary value, typically summed numbers representing dollars spent or units sold.
- Generating summarized data for use in aggregate fact and dimension tables.
Data transformation is an interesting concept in that some transformation can occur
during the extract, some during the
transformation, or even (in limited cases) during the load portion of the ETL process. The
type of transformation function you
need will most often determine where it should be performed. Some transformation functions
could even be performed in more
than one place, because many of the transformations you will want to perform already exist in
some form or another in more than
one of the three environments (source database or application, ETL tool, or the target db).
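The three kinds of transformation named above can be shown together in a small Python sketch. Everything in it is illustrative: the two source lists, the field names, and the choice to treat a missing unit count as 0 are assumptions made for the example, not rules from any ETL tool.

```python
from collections import defaultdict
from decimal import Decimal

# Source A stores units as a small integer, source B as a decimal (made-up data).
source_a = [{"store": "S1", "units": 3}, {"store": " s2 ", "units": 5}]
source_b = [{"store": "S1", "units": Decimal("2.0")}, {"store": "S2", "units": None}]


def merge_and_cleanse(*sources):
    """Data merging and cleansing: one data type, consistent codes, no gaps."""
    clean = []
    for source in sources:
        for rec in source:
            store = rec["store"].strip().upper()  # consistent store codes
            # Assumption for this sketch: a missing value is loaded as 0.
            units = int(rec["units"]) if rec["units"] is not None else 0
            clean.append({"store": store, "units": units})
    return clean


def aggregate(records):
    """Aggregation: combine detailed values into one summary value per store."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["store"]] += rec["units"]
    return dict(totals)


print(aggregate(merge_and_cleanse(source_a, source_b)))  # {'S1': 5, 'S2': 5}
```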
7. What is the difference between OLTP & OLAP?
Ans: OLTP stands for Online Transaction Processing. This is a standard, normalized database
structure. OLTP is designed for
Transactions, which means that inserts, updates, and deletes must be fast. Imagine a call
center that takes orders. Call takers are continually taking calls and entering orders that may
contain numerous items. Each order and each item must be inserted into a database. Since
the performance of database is critical, we want to maximize the speed of inserts (and
updates and deletes). To maximize performance, we typically try to hold as few
records in the database as possible.
OLAP stands for Online Analytical Processing. OLAP is a term that means many things to
many people. Here, we will use the term OLAP and Star Schema pretty much interchangeably.
We will assume that a star schema database is an OLAP system. (This is not the same thing
that Microsoft calls OLAP; they extend OLAP to mean the cube structures built using their
product, OLAP Services.) Here, we will assume that any system of read-only, historical,
aggregated data is an OLAP system.
A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost
always used to support decision-making in the organization. That is why many data
warehouses are considered to be DSS (Decision-Support Systems).
Both a data warehouse and a data mart are storage mechanisms for read-only, historical,
aggregated data.
By read-only, we mean that the person looking at the data won't be changing it. If a user
wants to look at yesterday's sales for a certain product, they should not have the ability to change
that number.
The historical part may just be a few minutes old, but usually it is at least a day old. A data
warehouse usually holds data that goes back a certain period in time, such as five years. In
contrast, standard OLTP systems usually only hold data as long as it is current or active. An
order table, for example, may move orders to an archive table once they have been
completed, shipped, and received by the customer.
When we say that data warehouses and data marts hold aggregated data, we need to stress
that there are many levels of aggregation in a typical data warehouse.
8. If the data source is an Excel spreadsheet, how do you use it?
Ans: PowerMart and PowerCenter treat a Microsoft Excel source as a relational database, not a
flat file. Like relational sources,
the Designer uses ODBC to import a Microsoft Excel source. You do not need database
permissions to import Microsoft
Excel sources.

To import an Excel source definition, you need to complete the following tasks:
- Install the Microsoft Excel ODBC driver on your system.
- Create a Microsoft Excel ODBC data source for each source file in the ODBC 32-bit
Administrator.
- Prepare Microsoft Excel spreadsheets by defining ranges and formatting columns of
numeric data.
- Import the source definitions in the Designer.


Once you define ranges and format cells, you can import the ranges in the Designer. Ranges
display as source definitions
when you import the source.
9. Which databases are RDBMSs and which are MDDBs? Can you name them?
Ans: MDDB examples: Oracle Express Server (OES), Essbase by Hyperion Software, PowerPlay by
Cognos.
RDBMS examples: Oracle, SQL Server, etc.
10. What are the modules/tools in Business Objects? Explain their purpose briefly.
Ans: BO Designer, Business Query for Excel, BO Reporter, InfoView, Explorer, WEBI, BO
Publisher, Broadcast Agent, and BO
ZABO.
InfoView: IT portal entry into WebIntelligence & Business Objects.
Base module required for all options to view and refresh reports.
Reporter: Upgrade to create/modify reports on LAN or Web.
Explorer: Upgrade to perform OLAP processing on LAN or Web.
Designer: Creates semantic layer between user and database.
Supervisor: Administer and control access for group of users.
WebIntelligence: Integrated query, reporting, and OLAP analysis over the Web.
Broadcast Agent: Used to schedule, run, publish, push, and broadcast pre-built reports and
spreadsheets, including event
notification and response capabilities, event filtering, and calendar based
notification, over the LAN, email, pager, fax, Personal Digital Assistant (PDA), Short Messaging
Service (SMS), etc.
Set Analyzer: Applies set-based analysis to perform functions such as exclusion,
intersections, unions, and overlaps
visually.
Developer Suite: Build packaged, analytical, or customized apps.
11. What are ad hoc queries and canned queries/reports, and how do you create them?
(Please check this page: C\:BObjects\Quries\Data Warehouse - About Queries.htm)

Ans: The data warehouse will contain two types of query. There will be fixed queries that are
clearly defined and well understood, such as regular reports, canned queries (standard
reports) and common aggregations. There will also be ad hoc queries that are unpredictable,
both in quantity and frequency.
Ad Hoc Query: Ad hoc queries are the starting point for any analysis into a database. Any
business analyst wants to know what is inside the database. He then proceeds by calculating
totals, averages, maximum and minimum values for most attributes within the database. These
are the unpredictable element of a data warehouse. It is exactly that ability to run any query when
desired and expect a reasonable response that makes the data warehouse worthwhile, and makes
the design such a significant challenge.
The end-user access tools are capable of automatically generating the database query that
answers any question posed by the user. The user will typically pose questions in terms that they
are familiar with (for example, sales by store last week); this is converted into the database
query by the access tool, which is aware of the structure of information within the data
warehouse.
Canned queries: Canned queries are predefined queries. In most instances, canned queries
contain prompts that allow you to customize the query for your specific needs. For example, a
prompt may ask you for a School, department, term, or section ID. In this instance you would
enter the name of the School, department or term, and the query will retrieve the specified data
from the Warehouse. You can measure resource requirements of these queries, and the results can
be used for capacity planning and for database design.
The main reason for using a canned query or report rather than creating your own is that your
chances of misinterpreting data or getting the wrong answer are reduced. You are assured of
getting the right data and the right answer.
12. How many fact tables and how many dimension tables did you build? Which table precedes what?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp
13. What is the difference between STAR SCHEMA & SNOW FLAKE SCHEMA?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp
14. Why did you choose STAR SCHEMA only? What are the benefits of STAR SCHEMA?
Ans: Because of its denormalized structure, i.e., dimension tables are denormalized. Why
denormalize? The first (and often
only) answer is: speed. An OLTP structure is designed for data inserts, updates, and deletes,
but not data retrieval. Therefore,
we can often squeeze some speed out of it by denormalizing some of the tables and having
queries go against fewer tables.
These queries are faster because they perform fewer joins to retrieve the same recordset.
Joins are also confusing to many
end users. By denormalizing, we can present the user with a view of the data that is far
easier for them to understand.

Benefits of STAR SCHEMA:
- Far fewer tables.
- Designed for analysis across time.
- Simplifies joins.
- Less database space.
- Supports drilling in reports.
- Flexibility to meet business and technical needs.
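To make the "simplifies joins" point concrete, here is a minimal star schema built in an in-memory SQLite database from Python. The table names, columns and sales amounts are invented for the illustration; the point is only that each dimension is a single denormalized table joined directly to the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT, year INTEGER);
CREATE TABLE fact_sales (product_key INTEGER, date_key INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Sedan', 'Car'), (2, 'Pickup', 'Truck');
INSERT INTO dim_date VALUES (1, '2001-12-01', 2001), (2, '2002-01-15', 2002);
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 75.0);
""")

# One join per dimension: category lives in dim_product itself, so no extra
# lookup tables are needed to report sales by category and year.
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY p.category, d.year
ORDER BY p.category, d.year
""").fetchall()
print(rows)  # [('Car', 2001, 100.0), ('Car', 2002, 50.0), ('Truck', 2001, 75.0)]
```

In a snowflake design the category would sit in its own table and the same report would need one more join.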
15. How do you load the data using Informatica?
Ans: Using session.
16. (i) What is FTP? (ii) How do you connect to a remote machine? (iii) Is there another way to use FTP without
a special utility?
Ans: (i): The FTP (File Transfer Protocol) utility program is commonly used for copying files to
and from other computers. These
computers may be at the same site or at different sites thousands of miles apart. FTP is a
general protocol that works on UNIX
systems as well as non-UNIX systems.
(ii): Remote connect commands:
ftp machinename
ex: ftp 129.82.45.181 or ftp iesg
If the remote machine has been reached successfully, FTP responds by asking for a
login name and password. When you enter
your own login name and password for the remote machine, it returns a prompt like the one below
ftp>
and permits you access to your own home directory on the remote machine. You should be able to
move around in your own directory
and to copy files to and from your local machine using the FTP interface commands.
Note: You can set the mode of file transfer to ASCII (the default, which transmits seven bits per
character).
Use the ASCII mode with any of the following:
- Raw Data (e.g. *.dat or *.txt, codebooks, or other plain text documents)
- SPSS Portable files.
- HTML files.
You can also set the mode of file transfer to Binary (the binary mode transmits all eight bits per byte
and thus provides less chance of
a transmission error, and it must be used to transmit files other than ASCII files).
For example, use binary mode for the following types of files:
- SPSS System files
- SAS Dataset
- Graphic files (e.g., *.gif, *.jpg, *.bmp, etc.)
- Microsoft Office documents (*.doc, *.xls, etc.)
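The ASCII/binary rule above is easy to encode when file transfers are scripted. The helper below is a hypothetical sketch: the function name and the extension list are assumptions (only *.dat, *.txt and HTML files come from the lists above, and .htm is added as a guess), and anything not recognised as plain text falls back to binary, the safer choice.

```python
import os

# Extensions treated as plain text in this sketch (an assumed list).
ASCII_EXTENSIONS = {".dat", ".txt", ".html", ".htm"}


def transfer_mode(filename):
    """Return 'ascii' for known plain-text files and 'binary' for everything else."""
    ext = os.path.splitext(filename)[1].lower()
    return "ascii" if ext in ASCII_EXTENSIONS else "binary"


for name in ["codebook.TXT", "index.html", "chart.gif", "report.xls"]:
    print(name, transfer_mode(name))
```

At the ftp> prompt the result corresponds to typing the ascii or binary command before the transfer.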
(iii): Yes. If you are using Windows, you can access a text-based FTP utility from a DOS prompt.
To do this, perform the following steps:
1. From the Start menu, choose Programs, then MS-DOS Prompt.
2. Enter ftp ftp.geocities.com. A prompt will appear.
(or)
Enter ftp to get the ftp prompt, then ftp> open hostname, ex. ftp>open ftp.geocities.com (it
connects to the specified host).
3. Enter your Yahoo! GeoCities member name.
4. Enter your Yahoo! GeoCities password.
You can now use standard FTP commands to manage the files in your Yahoo! GeoCities directory.
17. What command is used to transfer multiple files at a time using FTP?
Ans: mget ==> To copy multiple files from the remote machine to the local machine. You will be
prompted for a y/n answer before
transferring each file. Ex: mget * (copies all files in the current remote directory to your
current local directory,
using the same file names).
mput ==> To copy multiple files from the local machine to the remote machine.
18. What is a Filter transformation? What options do you have in a Filter transformation?
Ans: The Filter transformation provides the means for filtering records in a mapping. You pass
all the rows from a source
transformation through the Filter transformation, then enter a filter condition for the
transformation. All ports in a Filter
transformation are input/output, and only records that meet the condition pass
through the Filter transformation.
Note: Discarded rows do not appear in the session log or reject files.
To maximize session performance, include the Filter transformation as close to the
sources in the mapping as possible.
Rather than passing records you plan to discard through the mapping, you then filter out
unwanted data early in the
flow of data from sources to targets.

You cannot concatenate ports from more than one transformation into the Filter
transformation; the input ports for the filter
must come from a single transformation. Filter transformations exist within the flow of the
mapping and cannot be
unconnected. The Filter transformation does not allow setting output default values.
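The advice to filter as close to the source as possible can be illustrated outside Informatica with a plain Python pipeline. This is an analogy only: the functions below are stand-ins written for this example and are not Informatica objects. The counters show how many rows the downstream step has to handle in each arrangement.

```python
def source_rows():
    """Ten made-up source records with amounts 10, 20, ..., 100."""
    return [{"id": i, "amount": i * 10} for i in range(1, 11)]


def filter_rows(rows, condition):
    """Pass through only the rows that meet the condition; the rest are dropped."""
    return [row for row in rows if condition(row)]


def enrich(rows, processed):
    """Stand-in for downstream transformations; records every row it handles."""
    out = []
    for row in rows:
        processed.append(row["id"])
        out.append({**row, "amount_with_tax": row["amount"] * 1.1})
    return out


def wanted(row):
    return row["amount"] > 70


late = []   # filter placed at the end of the flow
filter_rows(enrich(source_rows(), late), wanted)

early = []  # filter placed right after the source
result = enrich(filter_rows(source_rows(), wanted), early)

print(len(late), len(early))  # 10 3
```

Both arrangements deliver the same three rows to the target, but the early filter spares the downstream step seven rows of work.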
19. What are the default sources supported by Informatica PowerMart?
Ans:
- Relational tables, views, and synonyms.
- Fixed-width and delimited flat files that do not contain binary data.
- COBOL files.

20. When do you create the Source Definition? Can I use this Source Definition in any Transformation?
Ans: When working with a file that contains fixed-width binary data, you must create the
source definition.
The Designer displays the source definition as a table, consisting of names, datatypes, and
constraints. To use a source
definition in a mapping, connect a source definition to a Source Qualifier or Normalizer
transformation. The Informatica
Server uses these transformations to read the source data.
21. What are Active & Passive Transformations?
Ans: Active and Passive Transformations
Transformations can be active or passive. An active transformation can change the
number of records passed through it. A
passive transformation never changes the record count. For example, the Filter
transformation is active because it removes rows that do not
meet the filter condition defined in the transformation.
Active transformations that might change the record count include the following:
- Advanced External Procedure
- Aggregator
- Filter
- Joiner
- Normalizer
- Rank
- Source Qualifier
Note: If you use PowerConnect to access ERP sources, the ERP Source Qualifier is also an
active transformation.

/*
You can connect only one of these active transformations to the same
transformation or target, since the Informatica
Server cannot determine how to concatenate data from different sets of records with
different numbers of rows.
*/
Passive transformations that never change the record count include the following:
- Lookup
- Expression
- External Procedure
- Sequence Generator
- Stored Procedure
- Update Strategy
You can connect any number of these passive transformations, or connect one active
transformation with any number of
passive transformations, to the same transformation or target.
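The active/passive distinction comes down to row counts, which a short Python analogy can show. The two functions below only imitate an Expression and an Aggregator transformation; they are written for this example with made-up salary data and do not use any Informatica API.

```python
rows = [
    {"dept": "A", "salary": 100},
    {"dept": "A", "salary": 200},
    {"dept": "B", "salary": 50},
]


def expression(rows):
    """Passive: one output row per input row (adds a derived column)."""
    return [{**r, "bonus": r["salary"] // 10} for r in rows]


def aggregator(rows):
    """Active: groups rows, so the output can have fewer rows than the input."""
    totals = {}
    for r in rows:
        totals[r["dept"]] = totals.get(r["dept"], 0) + r["salary"]
    return [{"dept": d, "total_salary": t} for d, t in totals.items()]


print(len(rows), len(expression(rows)), len(aggregator(rows)))  # 3 3 2
```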
22. What is staging Area and Work Area?
Ans: Staging Area : - Holding Tables on DW Server.
- Loaded from Extract Process
- Input for Integration/Transformation
- May function as Work Areas
- Output to a work area or Fact Table
Work Area: - Temporary Tables
- Memory

23. What is Metadata? (please refer to the book Data Warehousing in the Real World, page # 125)
Ans: Definition: data about data.
Metadata contains descriptive data for end users. In a data warehouse the term metadata is
used in a number of different
situations.
Metadata is used for:
- Data transformation and load
- Data management
- Query management
Data transformation and load:
Metadata may be used during data transformation and load to describe the source data and any
changes that need to be made. The advantage of storing metadata about the data being
transformed is that as source data changes the changes can be captured in the metadata, and
transformation programs automatically regenerated.
For each source data field the following information is required:
Source Field:
Unique identifier (to avoid any confusion between two fields of the same name from
different sources).
Name (local field name).
Type (storage type of data, such as character, integer, floating point and so on).
Location
- system (system it comes from, ex. Accounting system).
- object (object that contains it, ex. Account table).
The destination field needs to be described in a similar way to the source:
Destination:
Unique identifier
Name
Type (database data type, such as Char, Varchar, Number and so on).
Tablename (name of the table the field will be part of).


The other information that needs to be stored is the transformation or transformations that need
to be applied to turn the source data into the destination data:
Transformation:
Transformation(s)
- Name
- Language (name of the language that the transformation is written in).
- module name
- syntax
The Name is the unique identifier that differentiates this from any other similar transformations.
The Language attribute contains the name of the language that the transformation is written
in.
The other attributes are module name and syntax. Generally these will be mutually exclusive,
with only one being defined. For simple transformations such as simple SQL functions the
syntax will be stored. For complex transformations the name of the module that contains the
code is stored instead.
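The source, destination and transformation descriptions above map naturally onto three record types. The Python dataclasses below are one possible sketch of such a metadata store; the class and attribute names follow the lists above, while the strict "exactly one of module name or syntax" check is an assumption made here (the text only says they are generally mutually exclusive), and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceField:
    identifier: str   # unique identifier
    name: str         # local field name
    type: str         # storage type, e.g. character, integer
    system: str       # location: system it comes from
    object: str       # location: object that contains it


@dataclass
class DestinationField:
    identifier: str
    name: str
    type: str         # database data type, e.g. Char, Varchar, Number
    table_name: str


@dataclass
class Transformation:
    name: str
    language: str
    module_name: Optional[str] = None  # for complex transformations
    syntax: Optional[str] = None       # for simple ones, e.g. a SQL function

    def __post_init__(self):
        # Assumed rule for this sketch: exactly one of the two must be set.
        if (self.module_name is None) == (self.syntax is None):
            raise ValueError("define exactly one of module_name or syntax")


src = SourceField("SRC-001", "acct_no", "character", "Accounting system", "Account table")
dst = DestinationField("DST-001", "ACCOUNT_NUMBER", "Varchar", "DIM_ACCOUNT")
trim = Transformation("trim_account", "SQL", syntax="TRIM(acct_no)")
print(src.name, "->", dst.name, "via", trim.name)
```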
Data management:
Metadata is required to describe the data as it resides in the data warehouse. This is needed by the
warehouse manager to allow it to track and control all data movements. Every object in the
database needs to be described.

Metadata is needed for all the following:
Tables
- Columns
- name
- type
Indexes
- Columns
- name
- type
Views
- Columns
- name
- type
Constraints
- name
- type
- table
- columns
Aggregations and partition information also need to be stored in metadata (for details refer to page
# 30).
Query Generation:
Metadata is also required by the query manager to enable it to generate queries. The same
metadata used by the warehouse manager to describe the data in the data warehouse is also
required by the query manager.
The query manager will also generate metadata about the queries it has run. This metadata
can be used to build a history of all queries run and generate a query profile for each user,
group of users and the data warehouse as a whole.
The metadata that is required for each query is:
- query
- tables accessed
- columns accessed
- name
- reference identifier
- restrictions applied
- column name
- table name
- reference identifier
- restriction
- join criteria applied
- aggregate functions used
- group by criteria
- sort criteria
- syntax
- execution plan
- resources

24. What kinds of Unix flavors are you experienced with?


Ans: Solaris 2.5 SunOs 5.5 (Operating System)
Solaris 2.6 SunOs 5.6 (Operating System)
Solaris 2.8 SunOs 5.8 (Operating System)
AIX 4.0.3
5.5.1 2.5.1 May 96 sun4c, sun4m, sun4d, sun4u, x86, ppc
5.6 2.6 Aug. 97 sun4c, sun4m, sun4d, sun4u, x86
5.7 7 Oct. 98 sun4c, sun4m, sun4d, sun4u, x86
5.8 8 2000 sun4m, sun4d, sun4u, x86

25. What are the tasks that are done by Informatica Server?
Ans: The Informatica Server performs the following tasks:
- Manages the scheduling and execution of sessions and batches
- Executes sessions and batches
- Verifies permissions and privileges
- Interacts with the Server Manager and pmcmd.
The Informatica Server moves data from sources to targets based on metadata stored in a
repository. For instructions on how to move and transform data, the Informatica Server reads a
mapping (a type of metadata that includes transformations and source and target definitions).
Each mapping uses a session to define additional information and to optionally override mapping-level options. You can group multiple sessions to run as a single unit, known as a batch.
26. What are the two programs that communicate with the Informatica Server?
Ans: Informatica provides Server Manager and pmcmd programs to communicate with the
Informatica Server:
Server Manager. A client application used to create and manage sessions and batches, and to
monitor and stop the Informatica Server. You can use information provided through the Server
Manager to troubleshoot sessions and improve session performance.
pmcmd. A command-line program that allows you to start and stop sessions and batches, stop
the Informatica Server, and verify if the Informatica Server is running.
27. When do you reinitialize the Aggregate Cache?
Ans: Reinitializing the aggregate cache overwrites historical aggregate data with new aggregate
data. When you reinitialize the
aggregate cache, instead of using the captured changes in source tables, you typically need
to use the entire source
table.
For example, you can reinitialize the aggregate cache if the source for a session changes
incrementally every day and
completely changes once a month. When you receive the new monthly source, you might
configure the session to reinitialize
the aggregate cache, truncate the existing target, and use the new source table during the
session.
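The daily-incremental, monthly-reinitialize pattern can be pictured with a dictionary standing in for the aggregate cache. This is a conceptual sketch only: run_session and its reinitialize flag are names invented for the example and do not reflect how the Informatica Server stores its cache.

```python
def run_session(cache, source_rows, reinitialize=False):
    """Fold source rows into the aggregate cache, optionally starting from scratch."""
    if reinitialize:
        cache.clear()  # discard the historical aggregates
    for key, amount in source_rows:
        cache[key] = cache.get(key, 0) + amount
    return dict(cache)


cache = {}
print(run_session(cache, [("A", 10), ("B", 5)]))  # first load: {'A': 10, 'B': 5}
print(run_session(cache, [("A", 3)]))             # daily change only: {'A': 13, 'B': 5}
# New monthly source: reinitialize and read the entire source again.
print(run_session(cache, [("A", 20), ("C", 1)], reinitialize=True))  # {'A': 20, 'C': 1}
```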

/? Note: To be clarified when server manger works for following ?/


To reinitialize the aggregate cache:
1. In the Server Manager, open the session property sheet.
2. Click the Transformations tab.
3. Check Reinitialize Aggregate Cache.
4. Click OK three times to save your changes.
5. Run the session.
The Informatica Server creates a new aggregate cache, overwriting the existing aggregate
cache.
/? To be check for step 6 & step 7 after successful run of session ?/
6. After running the session, open the property sheet again.
7. Click the Data tab.
8. Clear Reinitialize Aggregate Cache.
9. Click OK.
28. (i) What is Target Load Order in Designer?

Ans: Target Load Order: - In the Designer, you can set the order in which the
Informatica Server sends records to various target
definitions in a mapping. This feature is crucial if you want to maintain
referential integrity when inserting, deleting, or updating
records in tables that have the primary key and foreign key constraints
applied to them. The Informatica Server writes data to
all the targets connected to the same Source Qualifier or Normalizer
simultaneously, to maximize performance.
28. (ii) What are the minimum conditions that you need to meet in order to use the Target Load Order option
in Designer?
Ans: You need to have multiple Source Qualifier transformations.
To specify the order in which the Informatica Server sends data to targets, create one Source
Qualifier or Normalizer
transformation for each target within a mapping. To set the target load order, you then
determine the order in which each
Source Qualifier sends data to connected targets in the mapping.
When a mapping includes a Joiner transformation, the Informatica Server sends all
records to targets connected to that
Joiner at the same time, regardless of the target load order.
28(iii). How do you set the Target load order?
Ans: To set the target load order:
1. Create a mapping that contains multiple Source Qualifier transformations.
2. After you complete the mapping, choose Mappings-Target Load Plan.
A dialog box lists all Source Qualifier transformations in the mapping, as well as the
targets that receive data from each
Source Qualifier.
3. Select a Source Qualifier from the list.
4. Click the Up and Down buttons to move the Source Qualifier within the load order.
5. Repeat steps 3 and 4 for any other Source Qualifiers you wish to reorder.
6. Click OK and Choose Repository-Save.
29. What can you do with Repository Manager?
Ans: We can do the following tasks using Repository Manager:
To create usernames, you must have one of the following sets of privileges:
- Administer Repository privilege
- Super User privilege

To create a user group, you must have one of the following privileges :
- Administer Repository privilege
- Super User privilege
To assign or revoke privileges, you must have one of the following privileges:
- Administer Repository privilege
- Super User privilege
Note: You cannot change the privileges of the default user groups or the default repository
users.
30. What can you do with Designer?
Ans: The Designer client application provides five tools to help you create mappings:
Source Analyzer. Use to import or create source definitions for flat file, Cobol, ERP, and
relational sources.
Warehouse Designer. Use to import or create target definitions.
Transformation Developer. Use to create reusable transformations.
Mapplet Designer. Use to create mapplets.
Mapping Designer. Use to create mappings.
Note: The Designer allows you to work with multiple tools at one time. You can also work in
multiple folders and repositories.
31. What are the different types of Tracing Levels you have in Transformations?
Ans: Tracing Levels in Transformations:
Level: Description
Terse: Indicates when the Informatica Server initializes the session and its
components. Summarizes session results, but not at the level of individual
records.
Normal: Includes initialization information as well as error messages and
notification of rejected data.
Verbose initialization: Includes all information provided with the Normal setting plus more
extensive information about initializing transformations in the session.
Verbose data: Includes all information provided with the Verbose initialization
setting.
Note: By default, the tracing level for every transformation is
Normal.
To add a slight performance boost, you can also set the tracing level to Terse, writing the
minimum of detail to the session log
when running a session containing the transformation.

31(i). What is the difference between a database, a data warehouse and a data mart?
Ans: -- A database is an organized collection of information.
-- A data warehouse is a very large database with special sets of tools to extract and
cleanse data from operational systems
and to analyze data.
-- A data mart is a focused subset of a data warehouse that deals with a single area of data
and is organized for quick
analysis.
32. What are a Data Mart, a Data Warehouse and a Decision Support System? Explain briefly.
Ans: Data Mart:
A data mart is a repository of data gathered from operational data and other sources that is
designed to serve a particular
community of knowledge workers. In scope, the data may derive from an enterprise-wide
database or data warehouse or be more specialized. The emphasis of a data mart is on
meeting the specific demands of a particular group of knowledge users in terms of analysis,
content, presentation, and ease-of-use. Users of a data mart can expect to have data presented in
terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the presence of the other
in some form. However, most writers using the term seem to agree that the design of a data
mart tends to start from an analysis of user needs and that a data warehouse tends to
start from an analysis of what data already exists and how it can be collected in such a
way that the data can later be used. A data warehouse is a central aggregation of data (which
can be distributed physically); a data mart is a data repository that may derive from a data
warehouse or not and that emphasizes ease of access and usability for a particular designed
purpose. In general, a data warehouse tends to be a strategic but somewhat unfinished concept;
a data mart tends to be tactical and aimed at meeting an immediate need.
Data Warehouse:
A data warehouse is a central repository for all or significant parts of the data that an
enterprise's various business systems collect. The term was coined by W. H. Inmon. IBM
sometimes uses the term "information warehouse."
Typically, a data warehouse is housed on an enterprise mainframe server. Data from various
online transaction processing (OLTP) applications and other sources is selectively extracted and
organized on the data warehouse database for use by analytical applications and user queries.
Data warehousing emphasizes the capture of data from diverse sources for useful analysis and
access, but does not generally start from the point-of-view of the end user or knowledge worker
who may need access to specialized, sometimes local databases. The latter idea is known as the
data mart.
Data mining, Web mining, and a decision support system (DSS) are three kinds of
applications that can make use of a data warehouse.
Decision Support System:
A decision support system (DSS) is a computer program application that analyzes business data
and presents it so that users can make business decisions more easily. It is an "informational
application" (in distinction to an "operational application" that collects the data in the course of
normal business operation).
Typical information that a decision support application might gather and present would be:
=> Comparative sales figures between one week and the next
=> Projected revenue figures based on new product sales assumptions
=> The consequences of different decision alternatives, given past experience in a context that is
described
A decision support system may present information graphically and may include an expert system
or artificial intelligence (AI). It may be aimed at business executives or some other group of
knowledge workers.
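33. What are the differences between Heterogeneous and Homogeneous?
Ans:
----------------------------------------------------------------------------------
Heterogeneous                             Homogeneous
----------------------------------------------------------------------------------
Stored in different schemas               Common structure
Stored in different file or db types      Same database type
Spread across several countries           Same data center
Different platform and H/W config.        Same platform and H/W configuration
----------------------------------------------------------------------------------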

34. How do you use DDL commands in a PL/SQL block? For example, accept a table name from the
user and drop it if it exists, else display a message.
Ans: To invoke DDL commands in PL/SQL blocks we have to use Dynamic SQL; the package
used is DBMS_SQL.
35. What are the steps to work with Dynamic SQL?
Ans: Open a dynamic cursor, parse the SQL statement, bind input variables (if any), execute the
SQL statement of the dynamic cursor, and close the cursor.
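The open / parse / bind / execute / close sequence, applied to the "drop the table if it exists" case from question 34, can be sketched as below. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Oracle and DBMS_SQL, and the function name and the emp_stage table are hypothetical.

```python
import sqlite3

def drop_table_if_exists(conn: sqlite3.Connection, table_name: str) -> str:
    """Drop table_name if it exists; otherwise return a message."""
    cur = conn.cursor()                      # open a cursor
    try:
        # Prepare the catalog lookup, bind the table name, execute it.
        cur.execute(
            "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table_name,),
        )
        (found,) = cur.fetchone()
        if not found:
            return f"Table {table_name} does not exist"
        # DDL cannot take bind variables, so the statement text is built
        # dynamically; the identifier is quoted to keep it a single name.
        quoted = '"' + table_name.replace('"', '""') + '"'
        cur.execute(f"DROP TABLE {quoted}")
        return f"Table {table_name} dropped"
    finally:
        cur.close()                          # close the cursor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_stage (empno INTEGER)")  # hypothetical table
print(drop_table_if_exists(conn, "emp_stage"))
print(drop_table_if_exists(conn, "emp_stage"))
```

The second call finds no table and returns the message instead of raising an error, which is the behaviour the question asks for.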
36. Which package and procedure are used to find/check free space available for db objects like
tables/procedures/views/synonyms etc.?
Ans: The Package is DBMS_SPACE
The Procedure is UNUSED_SPACE
The Table is DBA_OBJECTS

Note: See the script to find free space @ c:\informatica\tbl_free_space


37. Does Informatica allow the load if EmpId is the primary key in the target table and the source
data has 2 rows with the same EmpID? If you use a lookup for the same situation, does it load 2
rows or only 1?
Ans: => No, it will not; it generates a primary key constraint violation (it loads 1 row).

=> Even then no, if EmpId is the primary key.


38. If Ename is varchar2(40) in one source (Siebel) and char(100) in another source (Oracle),
and the target has Name varchar2(50), how does Informatica handle this situation? How does
Informatica handle string and number datatypes from sources?
39. How do you debug mappings? I mean, where do you attack?
40. How do you query the metadata tables for Informatica?
41(i). When do you use a connected lookup and when do you use an unconnected lookup?
Ans:
Connected Lookups : A connected Lookup transformation is part of the mapping data flow. With connected
lookups, you can have multiple return values. That is, you can pass multiple values from the
same row in the lookup table out of the Lookup transformation.
Common uses for connected lookups include:
=> Finding a name based on a number ex. Finding a Dname based on deptno
=> Finding a value based on a range of dates
=> Finding a value based on multiple conditions
Unconnected Lookups : An unconnected Lookup transformation exists separate from the data flow in the mapping.
You write an expression using
the :LKP reference qualifier to call the lookup within another transformation.
Some common uses for unconnected lookups include:
=> Testing the results of a lookup in an expression
=> Filtering records based on the lookup results
=> Marking records for update based on the result of a lookup (for example, updating slowly
changing dimension tables)
=> Calling the same lookup multiple times in one mapping

41(ii). What are the differences between Connected lookups and Unconnected lookups?
Ans:
Although both types of lookups perform the same basic task, there are some
important differences:
--------------------------------------------------------------------------------------------------------------
Connected Lookup                                          Unconnected Lookup
--------------------------------------------------------------------------------------------------------------
Part of the mapping data flow.                            Separate from the mapping data flow.

Can return multiple values from the same row.             Returns one value from each row.

You link the lookup/output ports to another               You designate the return value with the
transformation.                                           Return port (R).

Supports default values.                                  Does not support default values.

If there's no match for the lookup condition, the         If there's no match for the lookup condition,
server returns the default value for all output ports.    the server returns NULL.

More visible. Shows the data passing in and out           Less visible. You write an expression using
of the lookup.                                            :LKP to tell the server when to perform the
                                                          lookup.

Cache includes all lookup columns used in the             Cache includes lookup/output ports in the
mapping (that is, lookup table columns included           Lookup condition and lookup/return port.
in the lookup condition and lookup table
columns linked as output ports to other
transformations).
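The difference in return values and no-match behaviour can be pictured with a small analogy. This is not Informatica code: the DEPT data, the default values and both function names are hypothetical, chosen only to mirror the rows of the comparison above.

```python
from typing import Optional

# Hypothetical lookup table keyed by deptno.
DEPT = {
    10: {"dname": "ACCOUNTING", "loc": "NEW YORK"},
    20: {"dname": "RESEARCH", "loc": "DALLAS"},
}

def connected_lookup(deptno: int) -> dict:
    """Returns several columns of the matching row; on no match it
    returns the default value configured for each output port."""
    defaults = {"dname": "UNKNOWN", "loc": "UNKNOWN"}
    return dict(DEPT.get(deptno, defaults))

def unconnected_lookup(deptno: int) -> Optional[str]:
    """Returns only the single designated return column,
    or None (standing in for NULL) when there is no match."""
    row = DEPT.get(deptno)
    return row["dname"] if row else None

print(connected_lookup(10))      # both columns of the matching row
print(connected_lookup(99))      # defaults for every output column
print(unconnected_lookup(20))    # one value
print(unconnected_lookup(99))    # None
```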
42. What do you need to concentrate on after getting the explain plan?
Ans: The 3 most significant columns in the plan table are named OPERATION, OPTIONS, and
OBJECT_NAME. For each step, these tell you which operation is going to be performed and which
object is the target of that operation.
Ex:
**************************
TO USE EXPLAIN PLAN FOR A QRY...
**************************
EXPLAIN PLAN
   SET STATEMENT_ID = 'PKAR02'
   FOR
SELECT JOB, MAX(SAL)
  FROM EMP
 GROUP BY JOB
HAVING MAX(SAL) >= 5000;

Explained.
**************************
TO QUERY THE PLAN TABLE :
**************************
SELECT RTRIM(ID) || ' ' ||
       LPAD(' ', 2*(LEVEL-1)) || OPERATION
       || ' ' || OPTIONS
       || ' ' || OBJECT_NAME STEP_DESCRIPTION
  FROM PLAN_TABLE
 START WITH ID = 0 AND STATEMENT_ID = 'PKAR02'
CONNECT BY PRIOR ID = PARENT_ID
       AND STATEMENT_ID = 'PKAR02'
 ORDER BY ID;

STEP_DESCRIPTION
----------------------------------------------------
0 SELECT STATEMENT
1   FILTER
2     SORT GROUP BY
3       TABLE ACCESS FULL EMP
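To see what the hierarchical PLAN_TABLE query is doing, the same indentation logic can be sketched outside the database. The rows below are illustrative values for the statement above (ID, PARENT_ID, OPERATION, OPTIONS, OBJECT_NAME), and describe_plan is a hypothetical helper, not an Oracle API.

```python
# Illustrative PLAN_TABLE rows: (id, parent_id, operation, options, object_name)
PLAN_ROWS = [
    (0, None, "SELECT STATEMENT", "", ""),
    (1, 0, "FILTER", "", ""),
    (2, 1, "SORT", "GROUP BY", ""),
    (3, 2, "TABLE ACCESS", "FULL", "EMP"),
]

def describe_plan(rows):
    """Indent each step by its depth in the parent/child tree,
    like LPAD(' ', 2*(LEVEL-1)) in the CONNECT BY query."""
    parents = {row[0]: row[1] for row in rows}

    def level(row_id):
        depth = 1
        while parents[row_id] is not None:
            row_id = parents[row_id]
            depth += 1
        return depth

    lines = []
    for row_id, _parent, operation, options, object_name in sorted(rows, key=lambda r: r[0]):
        text = " ".join(part for part in (operation, options, object_name) if part)
        lines.append(f"{row_id} {'  ' * (level(row_id) - 1)}{text}")
    return lines

for line in describe_plan(PLAN_ROWS):
    print(line)
```

Reading the result from the most indented step outwards gives the execution order: the full scan of EMP feeds the sort, which feeds the filter for the HAVING clause.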

43. How are components interfaced in Psoft?
Ans:
44. How do you do the analysis of an ETL?
Ans:
==============================================================
45. What are Standard Transformations, Reusable Transformations and Mapplets?
Ans: Mappings contain two types of transformations, standard and reusable. Standard
transformations exist within a single mapping. You cannot reuse a standard transformation you
created in another mapping, nor can you create a shortcut to that transformation. However, often
you want to create transformations that perform common tasks, such as calculating the average
salary in a department. Since a standard transformation cannot be used by more than one mapping,
you have to set up the same transformation each time you want to calculate the average salary in
a department. A reusable transformation avoids this: you define it once and use it in multiple
mappings.
Mapplet: A mapplet is a reusable object that represents a set of transformations. It allows
you to reuse transformation logic and can contain as many transformations as you need. A mapplet
can contain transformations, reusable transformations, and shortcuts to transformations.
46. How do you copy Mappings, Repositories and Sessions?
Ans: To copy an object (such as a mapping or reusable transformation) from a shared folder,
press the Ctrl key and drag and drop the mapping into the destination folder.
To copy a mapping from a non-shared folder, drag and drop the mapping into the destination
folder.
In both cases, the destination folder must be open with the related tool active.
For example, to copy a mapping, the Mapping Designer must be active. To copy a Source
Definition, the Source Analyzer must be active.
Copying Mapping:
To copy the mapping, open a workbook.
In the Navigator, click and drag the mapping slightly to the right, not dragging it to the
workbook.
When asked if you want to make a copy, click Yes, then enter a new name and click OK.

Choose Repository-Save.
Repository Copying: You can copy a repository from one database to another. You use this
feature before upgrading, to
preserve the original repository. Copying repositories provides a quick way to copy all
metadata you want to use as a basis for
a new repository.
If the database into which you plan to copy the repository contains an existing repository, the
Repository Manager deletes the existing repository. If you want to preserve the old repository,
cancel the copy. Then back up the existing repository before copying the new repository.
To copy a repository, you must have one of the following privileges:

Administer Repository privilege

Super User privilege


To copy a repository:
1. In the Repository Manager, choose Repository-Copy Repository.
2. Select a repository you wish to copy, then enter the following information:
---------------------------------------------------------------------------------------------------------
Copy Repository Field     Required/Optional     Description
---------------------------------------------------------------------------------------------------------
Repository                Required              Name for the repository copy. Each repository name must
                                                be unique within the domain and should be easily
                                                distinguished from all other repositories.
Database Username         Required              Username required to connect to the database. This login
                                                must have the appropriate database permissions to create
                                                the repository.
Database Password         Required              Password associated with the database username. Must be
                                                in US-ASCII.
ODBC Data Source          Required              Data source used to connect to the database.
Native Connect String     Required              Connect string identifying the location of the database.
Code Page                 Required              Character set associated with the repository. Must be a
                                                superset of the code page of the repository you want to
                                                copy.
---------------------------------------------------------------------------------------------------------


If you are not connected to the repository you want to copy, the Repository Manager
asks you to log in.
3. Click OK.
5. If asked whether you want to delete existing repository data in the second repository,
click OK to delete it. Click Cancel to preserve the existing repository.

Copying Sessions:
In the Server Manager, you can copy stand-alone sessions within a folder, or copy sessions in and
out of batches.
To copy a session, you must have one of the following:

Create Sessions and Batches privilege with read and write permission

Super User privilege


To copy a session:
1. In the Server Manager, select the session you wish to copy.
2. Click the Copy Session button or choose Operations-Copy Session.
The Server Manager makes a copy of the session. The Informatica Server names the copy after
the original session, appending a number, such as session_name1.
47. What are shortcuts, and what are their advantages?
Ans: Shortcuts allow you to use metadata across folders without making copies, ensuring uniform
metadata. A shortcut inherits all
properties of the object to which it points. Once you create a shortcut, you can configure the
shortcut name and description.
When the object the shortcut references changes, the shortcut inherits those changes. By
using a shortcut instead of a copy, you ensure each use of the shortcut exactly matches the
original object. For example, if you have a shortcut to a target definition, and you add a
column to the definition, the shortcut automatically inherits the additional column.
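The inheritance behaviour is the same as the difference between a reference and a copy in a programming language. The sketch below is only an analogy: the T_CUSTOMER definition and its column names are hypothetical, and nothing here calls the Informatica repository.

```python
from copy import deepcopy

# A hypothetical target definition stored in a shared folder.
original = {"name": "T_CUSTOMER", "columns": ["CUST_ID", "CUST_NAME"]}

shortcut = original           # a reference: always reflects the original
copied = deepcopy(original)   # an independent copy: frozen at copy time

original["columns"].append("CUST_EMAIL")  # add a column to the definition

print(shortcut["columns"])    # includes CUST_EMAIL
print(copied["columns"])      # still the two original columns
```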
Shortcuts allow you to reuse an object without creating multiple objects in the repository. For
example, you use a source
definition in ten mappings in ten different folders. Instead of creating 10 copies of the same
source definition, one in each
folder, you can create 10 shortcuts to the original source definition.
You can create shortcuts to objects in shared folders. If you try to create a shortcut to an
object in a non-shared folder, the Designer creates a copy of the object instead.
You can create shortcuts to the following repository objects:

Source definitions
Reusable transformations
Mapplets
Mappings
Target definitions
Business components
You can create two types of shortcuts:
Local shortcut. A shortcut created in the same repository as the original object.
Global shortcut. A shortcut created in a local repository that references an object in a global
repository.
Advantages: One of the primary advantages of using a shortcut is maintenance. If you need
to change all instances of an
object, you can edit the original repository object. All shortcuts accessing the object
automatically inherit the changes.
Shortcuts have the following advantages over copied repository objects:

You can maintain a common repository object in a single location. If you need to edit the
object, all shortcuts immediately inherit the changes you make.

You can restrict repository users to a set of predefined metadata by asking users to
incorporate the shortcuts into their work instead of developing repository objects independently.

You can develop complex mappings, mapplets, or reusable transformations, then reuse
them easily in other folders.

You can save space in your repository by keeping a single repository object and using

shortcuts to that object, instead of creating copies of the object in multiple folders or multiple
repositories.
48. What are Pre-session and Post-session Options?
(Please refer to Help: Using Shell Commands, and Post-Session Commands and Email)
Ans: The Informatica Server can perform one or more shell commands before or after the
session runs. Shell commands are
operating system commands. You can use pre- or post- session shell commands, for
example, to delete a reject file or
session log, or to archive target files before the session begins.
The status of the shell command, whether it completed successfully or failed, appears in
the session log file.
To call a pre- or post-session shell command you must:
1. Use any valid UNIX command or shell script for UNIX servers, or any valid DOS or batch file
for Windows NT servers.
2. Configure the session to execute the pre- or post-session shell commands.
You can configure a session to stop if the Informatica Server encounters an error while executing
pre-session shell commands.
For example, you might use a shell command to copy a file from one directory to another. For a
Windows NT server you would use the following shell command to copy the SALES_ADJ file
from the target directory, L, to the source, H:
copy L:\sales\sales_adj H:\marketing\
For a UNIX server, you would use the following command line to perform a similar operation:
cp sales/sales_adj marketing/
Tip: Each shell command runs in the same environment (UNIX or Windows NT) as the Informatica
Server. Environment settings in one shell command script do not carry over to other scripts. To
run all shell commands in the same environment, call a single shell script that in turn invokes
other scripts.
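The point in the tip, that environment settings made by one command are not visible to the next, can be demonstrated with two child processes. This is a sketch, not Informatica behaviour itself: the variable name DEMO_STAGE_DIR is hypothetical, and Python child processes stand in for shell commands so the example runs on both UNIX and Windows.

```python
import os
import subprocess
import sys

VAR = "DEMO_STAGE_DIR"  # hypothetical variable name used only for this demo

def run_in_child(snippet: str) -> str:
    """Run a Python snippet in its own child process and return its stdout."""
    # Start every child from the parent's environment, without VAR.
    env = {k: v for k, v in os.environ.items() if k != VAR}
    done = subprocess.run(
        [sys.executable, "-c", snippet],
        env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()

# First command sets the variable, but only inside its own process.
first = run_in_child(f"import os; os.environ['{VAR}'] = 'sales'; print(os.environ['{VAR}'])")
# Second command starts fresh, so the setting is gone.
second = run_in_child(f"import os; print(os.environ.get('{VAR}', 'unset'))")
print(first, second)
```

This is why the tip recommends one wrapper script that calls the others: everything started from that single script shares its environment.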
49. What are Folder Versions?
Ans: In the Repository Manager, you can create different versions within a folder to help you
archive work in development. You can copy versions to other folders as well. When you save a
version, you save all metadata at a particular point in development. Later versions contain new or
modified metadata, reflecting work that you have completed since the last version.
Maintaining different versions lets you revert to earlier work when needed. By archiving the
contents of a folder into a version each time you reach a development landmark, you can access
those versions if later edits prove unsuccessful.

You create a folder version after completing a version of a difficult mapping, then continue
working on the mapping. If you are unhappy with the results of subsequent work, you can revert
to the previous version, then create a new version to continue development. Thus you keep the
landmark version intact, but available for regression.
Note: You can only work within one version of a folder at a time.
50. How do you automate/schedule sessions/batches, and did you use any tool for automating
sessions/batches?
Ans: We scheduled our sessions/batches using Server Manager.
You can either schedule a session to run at a given time or interval, or you can manually
start the session.
You need to have the Create Sessions and Batches privilege with Read and Execute permissions,
or the Super User privilege.
If you configure a batch to run only on demand, you cannot schedule it.
Note: We did not use any tool for the automation process.
51. What are the differences between the 4.7 and 5.1 versions?
Ans: New transformations were added, such as the XML Transformation and MQ Series
Transformation, and PowerMart and PowerCenter are the same from version 5.1.
52. What is the procedure that you need to follow before moving mappings/sessions from
Testing/Development to Production?
Ans:
53. How many values does the Informatica Server return when it passes through a Connected Lookup
and an Unconnected Lookup?
Ans: A Connected Lookup can return multiple values, whereas an Unconnected Lookup returns only
one value, the Return Value.
54. What is the difference between PowerMart and PowerCenter in 4.7.2?
Ans: If You Are Using PowerCenter
PowerCenter allows you to register and run multiple Informatica Servers against the same
repository. Because you can run
these servers at the same time, you can distribute the repository session load across available
servers to improve overall
performance.
With PowerCenter, you receive all product functionality, including distributed metadata, the
ability to organize repositories into
a data mart domain and share metadata across repositories.
A PowerCenter license lets you create a single repository that you can configure as a global
repository, the core component of a data warehouse.
If You Are Using PowerMart
This version of PowerMart includes all features except distributed metadata and multiple
registered servers. Also, the various
options available with PowerCenter (such as PowerCenter Integration Server for BW,
PowerConnect for IBM DB2,
PowerConnect for SAP R/3, and PowerConnect for PeopleSoft) are not available with PowerMart.

55. What kind of modifications can you perform with each Transformation?

Ans: Using transformations, you can modify data in the following ways:
-----------------------------------------------------------------------------
Task                                                Transformation
-----------------------------------------------------------------------------
Calculate a value                                   Expression
Perform aggregate calculations                      Aggregator
Modify text                                         Expression
Filter records                                      Filter, Source Qualifier
Order records queried by the Informatica Server     Source Qualifier
Call a stored procedure                             Stored Procedure
Call a procedure in a shared library or in the      External Procedure
COM layer of Windows NT
Generate primary keys                               Sequence Generator
Limit records to a top or bottom range              Rank
Normalize records, including those read             Normalizer
from COBOL sources
Look up values                                      Lookup
Determine whether to insert, delete, update,        Update Strategy
or reject records
Join records from different databases               Joiner
or flat file systems
-----------------------------------------------------------------------------

56. Expressions in Transformations: explain briefly how you use them.
Ans: Expressions in Transformations
To transform data passing through a transformation, you can write an expression. The most
obvious examples of these are the Expression and Aggregator transformations, which perform
calculations on either single values or an entire range of values within a port.
Transformations that use expressions include the following:
------------------------------------------------------------------------------------------
Transformation      How It Uses Expressions
------------------------------------------------------------------------------------------
Expression          Calculates the result of an expression for each row passing through
                    the transformation, using values from one or more ports.
Aggregator          Calculates the result of an aggregate expression, such as a sum or
                    average, based on all data passing through a port or on groups
                    within that data.
Filter              Filters records based on a condition you enter using an expression.
Rank                Filters the top or bottom range of records, based on a condition you
                    enter using an expression.
Update Strategy     Assigns a numeric code to each record based on an expression,
                    indicating whether the Informatica Server should use the information
                    in the record to insert, delete, or update the target.
------------------------------------------------------------------------------------------

In each transformation, you use the Expression Editor to enter the expression. The Expression
Editor supports the transformation language for building expressions. The transformation
language uses SQL-like functions, operators, and other components to build the expression.
For example, as in SQL, the transformation language includes the functions COUNT and SUM.
However, the PowerMart/PowerCenter transformation language includes additional functions
not found in SQL.
When you enter the expression, you can use values available through ports. For example, if
the transformation has two input ports representing a price and sales tax rate, you can
calculate the final sales tax using these two values. The ports used in the expression can
appear in the same transformation, or you can use output ports in other transformations.
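As a rough picture of the two kinds of expression described above, the sketch below computes a row-level value from two input "ports" and an aggregate over groups. It is plain Python rather than the transformation language, and the function names, port names and sample rows are hypothetical.

```python
from collections import defaultdict

def sales_tax(price: float, tax_rate: float) -> float:
    """Row-level expression: evaluated once per row from two input ports."""
    return round(price * tax_rate, 2)

def sum_by_group(rows, group_port, value_port):
    """Aggregate expression: one result per group, like SUM with a group-by port."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_port]] += row[value_port]
    return dict(totals)

rows = [
    {"dept": "A", "sal": 100.0},
    {"dept": "A", "sal": 50.0},
    {"dept": "B", "sal": 70.0},
]
print(sales_tax(200.0, 0.08))
print(sum_by_group(rows, "dept", "sal"))
```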
57. If a flat file (which comes through FTP as a source) has not arrived, what happens?
Where do you set this option?
Ans: You get a fatal error, which causes the server to fail/stop the session.
You can set the Event-Based Scheduling option in Session Properties under General tab->Advanced
options.
--------------------------------------------------------------------------------------------------
Event-Based                   Required/Optional    Description
--------------------------------------------------------------------------------------------------
Indicator File to Wait For    Optional             Required to use event-based scheduling. Enter
                                                   the indicator file (or directory and file)
                                                   whose arrival schedules the session. If you do
                                                   not enter a directory, the Informatica Server
                                                   assumes the file appears in the server variable
                                                   directory $PMRootDir.
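The idea of an indicator file, where work starts only once a marker file has arrived, can be sketched as a simple polling loop. This is an illustration of the concept, not how the Informatica Server implements it; the function name, the sales.ind file name and the timeout values are assumptions.

```python
import os
import tempfile
import time

def wait_for_indicator(path: str, timeout_s: float = 5.0, poll_s: float = 0.1) -> bool:
    """Poll until the indicator file exists; give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

with tempfile.TemporaryDirectory() as root_dir:
    indicator = os.path.join(root_dir, "sales.ind")  # hypothetical file name
    arrived_before = wait_for_indicator(indicator, timeout_s=0.3)
    open(indicator, "w").close()                     # simulate the FTP arrival
    arrived_after = wait_for_indicator(indicator, timeout_s=0.3)

print(arrived_before, arrived_after)
```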
58. What is the Test Load Option and when do you use it in Server Manager?

Ans: When testing sessions in development, you may not need to process the entire source. If
this is true, use the Test Load Option (Session Properties -> General Tab -> Target Options:
choose Target Load options as Normal (option button), with Test Load checked (check box) and
No. of rows to test, e.g. 2000 (text box with scrolls)). You can also click the Start button.
-----------------------------------------------------------------------------------------------
59. SCD Type 2 and SGT difference?
60. Differences between 4.7 and 5.1?
61. Tuning Informatica Server for improving performance? Performance Issues?
Ans: See /* C:\pkar\Informatica\Performance Issues.doc */
62. What is Override Option? Which is better?
63. What will happen if you increase the buffer size?
64. What will happen if you increase commit intervals, and what if you decrease them?
65. What kind of complex mappings did you do? And what sort of problems did you face?
66. If you have 10 mappings designed and you need to implement some changes (maybe in an
existing mapping, or a new mapping needs to be designed), how much time does it take, from
easier to complex?
67. Can you refresh the Repository in 4.7 and 5.1? And can you also refresh pieces (partially)
of the repository in 4.7 and 5.1?
68. What is BI?
Ans: http://www.visionnet.com/bi/index.shtml
69. Benefits of BI?
Ans: http://www.visionnet.com/bi/bi-benefits.shtml
70. BI Faq
Ans: http://www.visionnet.com/bi/bi-faq.shtml
71. What is difference between data scrubbing and data cleansing?
Ans: Scrubbing data is the process of cleaning up the junk in legacy data and making it accurate
and useful for the next generations of automated systems. This is perhaps the most difficult of
all conversion activities. Very often, this is made more difficult when the customer wants to
make good data out of bad data. This is the dog work. It is also the most important and cannot
be done without the active participation of the user.
DATA CLEANING - a two step process including DETECTION and then CORRECTION of
errors in a data set.
72. What is Metadata and Repository?
Ans:
Metadata. Data about data.
It contains descriptive data for end users.
Contains data that controls the ETL processing.
Contains data about the current state of the data warehouse.
ETL updates metadata, to provide the most current state.
Repository. The place where you store the metadata is called a repository. The more
sophisticated your repository, the more
complex and detailed metadata you can store in it. PowerMart and PowerCenter use a
relational database as the
repository.

73. SQL * LOADER?


Ans: http://downloadwest.oracle.com/otndoc/oracle9i/901_doc/server.901/a90192/ch03.htm#1004678
74. Debugger in Mapping?
75. Parameter passing in version 5.1: exposure?
76. What is the filename which you need to configure in Unix while installing Informatica?
77. How do you select duplicate rows using Informatica, i.e., how do you use
Max(Rowid)/Min(Rowid) in Informatica?
**********************************Shankar Prasad*************************************************

Informatica - Question - Answer

Deleting duplicate row using Informatica


Q1. Suppose we have Duplicate records in Source System and we want to load only the unique records
in the Target System eliminating the duplicate rows. What will be the approach?
Ans.

Let us assume that the source system is a Relational Database and the source table has duplicate
rows. To eliminate the duplicate records, we can check the Distinct option of the Source
Qualifier of the source table and load the target accordingly.
Image: Source Qualifier Transformation DISTINCT clause

Deleting duplicate row for FLAT FILE sources


Now suppose the source system is a Flat File. Here in the Source Qualifier you will not be able to
select the distinct clause, as it is disabled for a flat file source. Hence the next approach may
be to use a Sorter Transformation and check the Distinct option. When we select the distinct
option, all the columns will be selected as keys, in ascending order by default.

Image: Sorter Transformation DISTINCT clause

Deleting Duplicate Record Using Informatica Aggregator


Another way to handle duplicate records in a source batch run is to use an Aggregator
Transformation and check the Group By checkbox on the ports that carry the duplicate data. Here
you have the flexibility to select the last or the first of the duplicate records. Apart from
that, using a Dynamic Lookup Cache of the target table, associating the input ports with the
lookup ports and checking the Insert Else Update option will help to eliminate the duplicate
records in the source and hence load unique records in the target.
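The two de-duplication styles described above, distinct over all columns versus keeping the first or last row per key, can be sketched as follows. This is plain Python standing in for the Sorter and Aggregator behaviour; the function names and sample rows are hypothetical.

```python
def distinct_rows(rows):
    """Sorter with Distinct: every column is part of the key, output is sorted ascending."""
    return sorted(set(rows))

def dedupe_by_key(rows, key_index, keep="last"):
    """Aggregator with Group By on the key port: keep the first or last row per key."""
    chosen = {}
    for row in rows:
        key = row[key_index]
        if keep == "last" or key not in chosen:
            chosen[key] = row
    return list(chosen.values())

rows = [(1, "A"), (2, "B"), (1, "A"), (1, "C")]
print(distinct_rows(rows))
print(dedupe_by_key(rows, 0, keep="first"))
print(dedupe_by_key(rows, 0, keep="last"))
```

Note the difference in what counts as a duplicate: the distinct approach keeps (1, "A") and (1, "C") because the rows differ in some column, while the group-by approach keeps only one row per key.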

Loading Multiple Target Tables Based on Conditions


Q2. Suppose we have some serial numbers in a flat file source. We want to load the serial numbers in
two target files one containing the EVEN serial numbers and the other file having the ODD ones.
Ans.
After the Source Qualifier place a Router Transformation. Create two groups, namely EVEN and
ODD, with filter conditions MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively.
Then output the two groups into two flat file targets.

Image: Router Transformation Groups Tab
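The two router groups amount to testing each serial number with a modulo condition. A minimal sketch in Python (the function name is hypothetical, and lists stand in for the two flat file targets):

```python
def route_even_odd(serial_numbers):
    """Split serial numbers into the EVEN and ODD router groups."""
    even = [n for n in serial_numbers if n % 2 == 0]  # MOD(SERIAL_NO,2)=0
    odd = [n for n in serial_numbers if n % 2 == 1]   # MOD(SERIAL_NO,2)=1
    return even, odd

even_target, odd_target = route_even_odd([1, 2, 3, 4, 5])
print(even_target, odd_target)
```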

Normalizer Related Questions


Q3. Suppose in our Source Table we have data as given below:
Student Name     Maths
Sam              100
John             75
Tom              80

We want to load our Target Table as:
Student Name     Subject Name
Sam              Maths
Sam              Life Science
Sam              Physical Science
John             Maths
John             Life Science
John             Physical Science
Tom              Maths
Tom              Life Science
Tom              Physical Science
Describe your approach.

Ans.
Here, to convert the columns to rows, we have to use the Normalizer Transformation followed by an
Expression Transformation to decode the column taken into consideration. For more details on how
the mapping is performed please visit Working with Normalizer.
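What the Normalizer does here is an unpivot: each subject column of a source row becomes its own output row. The sketch below shows the idea in Python. The Maths marks come from the question; the Life Science and Physical Science marks are made-up values for illustration, and the function name is hypothetical.

```python
SUBJECT_COLUMNS = ["Maths", "Life Science", "Physical Science"]

def normalize(rows):
    """Turn one row per student into one row per (student, subject)."""
    out = []
    for row in rows:
        for subject in SUBJECT_COLUMNS:
            out.append((row["Student Name"], subject, row[subject]))
    return out

# Only the Maths marks are from the question; the other marks are hypothetical.
source = [
    {"Student Name": "Sam", "Maths": 100, "Life Science": 70, "Physical Science": 80},
    {"Student Name": "John", "Maths": 75, "Life Science": 60, "Physical Science": 85},
]
for target_row in normalize(source):
    print(target_row)
```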

Q4. Name the transformations which convert one row to many rows, i.e. increase the input:output
row count. Also, what is the name of the reverse transformation?
Ans.
Normalizer as well as Router Transformations are the Active transformations which can increase
the number of output rows relative to the number of input rows.
Aggregator Transformation is the active transformation that performs the reverse action.
Q5. Suppose we have a source table and we want to load three target tables based on source rows
such that the first row moves to the first target table, the second row to the second target
table, the third row to the third target table, the fourth row again to the first target table,
and so on. Describe your approach.
Ans.
We can clearly understand that we need a Router transformation to route or filter source data to
the three target tables. Now the question is what the filter conditions will be. First of all we
need an Expression Transformation where we have all the source table columns and, along with
that, another i/o port, say seq_num, which gets a sequence number for each source row from the
NextVal port of a Sequence Generator with start value 0 and increment by 1. Now the filter
conditions for the three router groups will be:
MOD(SEQ_NUM,3)=1 connected to 1st target table, MOD(SEQ_NUM,3)=2 connected to 2nd target
table, MOD(SEQ_NUM,3)=0 connected to 3rd target table.

Image: Router Transformation Groups Tab
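The round-robin routing can be sketched in a few lines. This assumes the sequence numbers handed to the rows are 1, 2, 3, ... so that the first row satisfies MOD(SEQ_NUM,3)=1; the function name is hypothetical and lists stand in for the three target tables.

```python
def route_round_robin(rows, targets=3):
    """Distribute rows over the targets using MOD(SEQ_NUM, targets)."""
    groups = {number: [] for number in range(1, targets + 1)}
    for seq_num, row in enumerate(rows, start=1):  # assumed NEXTVAL: 1, 2, 3, ...
        remainder = seq_num % targets
        # Remainder 0 goes to the last target, as in MOD(SEQ_NUM,3)=0 -> 3rd table.
        target = remainder if remainder != 0 else targets
        groups[target].append(row)
    return groups

print(route_round_robin(["r1", "r2", "r3", "r4", "r5", "r6", "r7"]))
```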

Loading Multiple Flat Files using one mapping


Q6. Suppose we have ten source flat files of the same structure. How can we load all the files
into the target database in a single batch run using a single mapping?
Ans.
After we create a mapping to load data into the target database from flat files, we move on to
the session properties of the Source Qualifier. To load a set of source files we need to create a
file, say final.txt, containing the source flat file names (ten files in our case) and set the
Source filetype option to Indirect. Next, point to this flat file final.txt, fully qualified,
through the Source file directory and Source filename properties.
Image: Session Property Flat File
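An indirect file list is simply a file whose lines are the names of the real source files, read one after another as if they were a single source. A minimal sketch of that idea in Python follows; the function name and the part1.txt/part2.txt files are hypothetical, and only final.txt comes from the answer above.

```python
import os
import tempfile

def read_indirect(file_list_path):
    """Read every source file named in the list file and return all their rows."""
    rows = []
    with open(file_list_path) as file_list:
        for name in file_list:
            name = name.strip()
            if not name:
                continue  # ignore blank lines in the list
            with open(name) as source:
                rows.extend(line.rstrip("\n") for line in source)
    return rows

with tempfile.TemporaryDirectory() as src_dir:
    paths = []
    for file_name, content in [("part1.txt", "1,Sam\n2,John\n"), ("part2.txt", "3,Tom\n")]:
        path = os.path.join(src_dir, file_name)
        with open(path, "w") as handle:
            handle.write(content)
        paths.append(path)
    list_path = os.path.join(src_dir, "final.txt")
    with open(list_path, "w") as handle:
        handle.write("\n".join(paths) + "\n")
    all_rows = read_indirect(list_path)

print(all_rows)
```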
Q7. How can we implement an aggregation operation without using an Aggregator Transformation in
Informatica?
Ans.
We will use the very basic concept of the Expression Transformation: at any time we can access
the previous row's data as well as the currently processed row's data. What we need are simple
Sorter, Expression and Filter transformations to achieve aggregation at the Informatica level.
For detailed understanding visit Aggregation without Aggregator
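The Sorter, Expression and Filter chain can be sketched as: sort by the group key, keep a running total that resets when the key changes from the previous row, then keep only the last row of each group. The Python below is one way to express that idea under those assumptions; the function name and sample data are hypothetical.

```python
def aggregate_without_aggregator(rows):
    """rows: (key, amount) pairs. Returns one (key, total) pair per key."""
    ordered = sorted(rows, key=lambda row: row[0])           # Sorter: order by group key
    running_rows = []
    previous_key, running_total = object(), 0                # sentinel: matches no real key
    for key, amount in ordered:                              # Expression: compare with previous row
        running_total = running_total + amount if key == previous_key else amount
        running_rows.append((key, running_total))
        previous_key = key
    # Filter: keep only the last row of each group, which holds the group total.
    return [
        row for i, row in enumerate(running_rows)
        if i + 1 == len(running_rows) or running_rows[i + 1][0] != row[0]
    ]

print(aggregate_without_aggregator([("A", 10), ("B", 5), ("A", 20), ("B", 1)]))
```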
Q8. Suppose in our Source Table we have data as given below:
Student Name     Subject Name
Sam              Maths
Tom              Maths
Sam              Physical Science
John             Maths
Sam              Life Science
John             Life Science
John             Physical Science
Tom              Life Science
Tom              Physical Science

We want to load our Target Table as:
Student Name     Maths
Sam              100
John             75
Tom              80
Describe your approach.

Ans.
Here our scenario is to convert many rows to one row, and the transformation which will help us
to achieve this is the Aggregator. Our mapping will look like this:

Image: Mapping using Sorter and Aggregator

We will sort the source data based on STUDENT_NAME ascending followed by SUBJECT ascending.

Image: Sorter Transformation

Now, with STUDENT_NAME in the GROUP BY clause, the output subject columns are populated as
MATHS: MAX(MARKS, SUBJECT='Maths')
LIFE_SC: MAX(MARKS, SUBJECT='Life Science')
PHY_SC: MAX(MARKS, SUBJECT='Physical Science')

Image: Aggregator Transformation
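The conditional MAX per subject is a pivot: group by student, and for each subject keep the highest mark seen. A sketch in Python follows. The Maths marks are from the question; the other marks are made-up values, and the function name is hypothetical.

```python
def pivot_marks(rows):
    """rows: (student, subject, marks). Returns {student: {subject: max marks}}."""
    result = {}
    for student, subject, marks in rows:
        subjects = result.setdefault(student, {})             # GROUP BY STUDENT_NAME
        # MAX(MARKS, SUBJECT='...'): keep the highest mark per subject.
        subjects[subject] = max(marks, subjects.get(subject, marks))
    return result

# Only the Maths marks are from the question; the others are hypothetical.
source = [
    ("Sam", "Maths", 100),
    ("Tom", "Maths", 80),
    ("Sam", "Physical Science", 80),
    ("John", "Maths", 75),
    ("Sam", "Life Science", 70),
]
print(pivot_marks(source))
```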

Revisiting Source Qualifier Transformation


Q9. What is a Source Qualifier? What are the tasks we can perform using a SQ, and why is it an
ACTIVE transformation?
Ans.
A Source Qualifier is an Active and Connected Informatica transformation that reads the rows from a
relational database or flat file source.
We can configure the SQ to join [Both INNER as well as OUTER JOIN] data originating from the
same source database.
We can use a source filter to reduce the number of rows the Integration Service queries.
We can specify a number for sorted ports and the Integration Service adds an ORDER BY clause to
the default SQL query.
We can choose Select Distinct option for relational databases and the Integration Service adds a
SELECT DISTINCT clause to the default SQL query.
Also we can write a Custom/User Defined SQL query which will override the default query in the SQ
by changing the default settings of the transformation properties.
Also we have the option to write Pre as well as Post SQL statements to be executed before and after
the SQ query in the source database.
The transformation provides the Select Distinct property. When it is enabled, the Integration Service
adds a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows
returned by the database to the Integration Service. Hence it is an Active transformation.
Q10. What happens to a mapping if we alter the datatypes between Source and its corresponding
Source Qualifier?
Ans.
The Source Qualifier transformation displays the transformation datatypes. The transformation
datatypes determine how the source database binds data when the Integration Service reads it.
Now if we alter the datatypes in the Source Qualifier transformation or the datatypes in the source
definition and Source Qualifier transformation do not match, the Designer marks the mapping as
invalid when we save it.
Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports property in the SQ and
then we add Custom SQL Query. Explain what will happen.
Ans.
Whenever we add Custom SQL or SQL override query it overrides the User-Defined Join, Source
Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier transformation.
Hence only the user defined SQL Query will be fired in the database and all the other options will be
ignored.
Q12. Describe the situations where we will use the Source Filter, Select Distinct and Number Of Sorted
Ports properties of Source Qualifier transformation.
Ans.
Source Filter option is used basically to reduce the number of rows the Integration Service queries so
as to improve performance.
Select Distinct option is used when we want the Integration Service to select unique values from a
source, filtering out unnecessary data earlier in the data flow, which might improve performance.
Number Of Sorted Ports option is used when we want the source data to arrive sorted, so that
following transformations like Aggregator or Joiner, when configured for sorted input, can use it
and perform better.
Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query and the
OUTPUT PORTS order in SQ transformation do not match?
Ans.
A mismatch between the order of the columns in the SELECT list and the order of the connected transformation
output ports may result in session failure.
Q14. What happens if we include the keyword WHERE in the Source Filter property of the SQ transformation,
say, WHERE CUSTOMERS.CUSTOMER_ID > 1000?
Ans.
We use the source filter to reduce the number of source records. If we include the string WHERE in the
source filter, the Integration Service fails the session.
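How the Source Qualifier properties from Q9 to Q14 shape the generated query can be pictured with a toy model. The Python below is not how the Integration Service is implemented; build_sq_query, its parameters and the NAME column are hypothetical, and raising an error merely stands in for the session failure described above.

```python
def build_sq_query(table, columns, source_filter=None, sorted_ports=0,
                   select_distinct=False, sql_override=None):
    """Toy model of a Source Qualifier default query (illustration only)."""
    # A custom SQL override replaces the default query and the other settings.
    if sql_override:
        return sql_override

    # The source filter holds only the condition, without the WHERE keyword.
    if source_filter:
        tokens = source_filter.split()
        if tokens and tokens[0].upper() == "WHERE":
            raise ValueError("source filter must not include the WHERE keyword")

    qualified = [f"{table}.{column}" for column in columns]
    query = "SELECT DISTINCT " if select_distinct else "SELECT "
    query += ", ".join(qualified) + f" FROM {table}"
    if source_filter:
        query += f" WHERE {source_filter}"
    # Number Of Sorted Ports: ORDER BY on the first N ports.
    if sorted_ports:
        query += " ORDER BY " + ", ".join(qualified[:sorted_ports])
    return query


query = build_sq_query(
    "CUSTOMERS", ["CUSTOMER_ID", "NAME"],
    source_filter="CUSTOMERS.CUSTOMER_ID > 1000",
    sorted_ports=1, select_distinct=True,
)
```

Passing sql_override returns that text unchanged, which mirrors the point in Q11 that the other settings are then ignored.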
Q15. Describe the scenarios where we go for Joiner transformation instead of Source Qualifier
transformation.
Ans.
While joining Source Data of heterogeneous sources as well as to join flat files we will use the Joiner
transformation.
Use the Joiner transformation when we need to join the following types of sources:
Join data from different Relational Databases.
Join data from different Flat Files.
Join relational sources and flat files.
Q16. What is the maximum number we can use in Number Of Sorted Ports for Sybase source system.
Ans.
Sybase supports a maximum of 16 columns in an ORDER BY clause. So if the source is Sybase, do not
sort more than 16 columns.
Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to Target tables
TGT1 and TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?
Ans.
If we have multiple Source Qualifier transformations connected to multiple targets, we can designate
the order in which the Integration Service loads data into the targets.
In the Mapping Designer, we need to configure the Target Load Plan based on the Source Qualifier
transformations in a mapping to specify the required loading order.
Image: Target Load Plan

Target Load Plan Ordering

Q18. Suppose we have a Source Qualifier transformation that populates two target tables. How do you
ensure TGT2 is loaded after TGT1?
Ans.
In the Workflow Manager, we can Configure Constraint based load ordering for a session. The
Integration Service orders the target load on a row-by-row basis. For every row generated by an active
source, the Integration Service loads the corresponding transformed row first to the primary key table,
then to the foreign key table.
Hence if we have one Source Qualifier transformation that provides data for multiple target tables
having primary and foreign key relationships, we will go for Constraint based load ordering.
Image: Constraint based loading

Revisiting Filter Transformation


Q19. What is a Filter Transformation and why it is an Active one?
Ans.
A Filter transformation is an Active and Connected transformation that can filter rows in a mapping.
Only the rows that meet the Filter Condition pass through the Filter transformation to the next
transformation in the pipeline. TRUE and FALSE are the implicit return values from any filter
condition we set. If the filter condition evaluates to NULL, the row is assumed to be FALSE.
The numeric equivalent of FALSE is zero (0) and any non-zero value is the equivalent of TRUE.
As an ACTIVE transformation, the Filter transformation may change the number of rows passed
through it. A filter condition returns TRUE or FALSE for each row that passes through the
transformation, depending on whether a row meets the specified condition. Only rows that return
TRUE pass through this transformation. Discarded rows do not appear in the session log or reject files.
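The TRUE/FALSE/NULL rules above can be illustrated with a short Python sketch. It is only a model of the described behaviour: the employee rows, the COMM column and the helper names are hypothetical, and None is used to stand for NULL.

```python
def passes_filter(condition_result):
    """NULL (None) and 0 count as FALSE; any other value counts as TRUE."""
    if condition_result is None:
        return False
    return condition_result != 0


def filter_rows(rows, condition):
    """Pass on only the rows whose condition evaluates to TRUE."""
    return [row for row in rows if passes_filter(condition(row))]


def comm_above_100(row):
    # Assumed here: a comparison against NULL evaluates to NULL.
    if row["COMM"] is None:
        return None
    return row["COMM"] > 100


# Hypothetical rows.
employees = [
    {"ENAME": "A", "COMM": 300},
    {"ENAME": "B", "COMM": None},
    {"ENAME": "C", "COMM": 0},
    {"ENAME": "D", "COMM": 500},
]
kept = [row["ENAME"] for row in filter_rows(employees, comm_above_100)]
```

Rows B and C are silently dropped: B because the condition is NULL, C because it is FALSE.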
Q20. What is the difference between the Source Filter of a Source Qualifier transformation and a Filter
transformation?
Ans.
SQ Source Filter
Source Qualifier transformation filters rows when read from a source.
Source Qualifier transformation can only filter rows from Relational
Sources.
Source Qualifier limits the row set extracted from a source.
Source Qualifier reduces the number of rows used throughout the
mapping and hence it provides better performance.
The filter condition in the Source Qualifier transformation only uses
standard SQL as it runs in the database.

Filter Transformation
Filter transformation filters rows from within a mapping.
Filter transformation filters rows coming from any type of source
system at the mapping level.
Filter transformation limits the row set sent to a target.
To maximize session performance, include the Filter transformation as
close to the sources in the mapping as possible to filter out unwanted
data early in the flow of data from sources to targets.
Filter Transformation can define a condition using any statement or
transformation function that returns either a TRUE or FALSE value.

Revisiting Joiner Transformation


Q21. What is a Joiner Transformation and why it is an Active one?
Ans.
A Joiner is an Active and Connected transformation used to join source data from the same source
system or from two related heterogeneous sources residing in different locations or file systems.
The Joiner transformation joins sources with at least one matching column. The Joiner transformation
uses a condition that matches one or more pairs of columns between the two sources.
The two input pipelines include a master pipeline and a detail pipeline or a master and a detail branch.
The master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target.
In the Joiner transformation, we must configure the transformation properties namely Join Condition,
Join Type and Sorted Input option to improve Integration Service performance.
The join condition contains ports from both input sources that must match for the Integration Service to
join two rows. Depending on the type of join selected, the Integration Service either adds the row to
the result set or discards the row.
The Joiner transformation produces result sets based on the join type, condition, and input data sources.
Hence it is an Active transformation.
Q22. State the limitations where we cannot use Joiner in the mapping pipeline.
Ans.
The Joiner transformation accepts input from most transformations. However, following are the
limitations:

Joiner transformation cannot be used when either of the input pipelines contains an Update Strategy
transformation.
Joiner transformation cannot be used if we connect a Sequence Generator transformation directly
before the Joiner transformation.
Q23. Out of the two input pipelines of a joiner, which one will you set as the master pipeline?
Ans.
During a session run, the Integration Service compares each row of the master source against the
detail source.
The master and detail sources need to be configured for optimal performance.
To improve performance for an Unsorted Joiner transformation, use the source with fewer rows as the
master source. The fewer unique rows in the master, the fewer iterations of the join comparison occur,
which speeds the join process.
When the Integration Service processes an unsorted Joiner transformation, it reads all master rows
before it reads the detail rows. The Integration Service blocks the detail source while it caches rows
from the master source. Once the Integration Service reads and caches all master rows, it unblocks
the detail source and reads the detail rows.
To improve performance for a Sorted Joiner transformation, use the source with fewer duplicate key
values as the master source.
When the Integration Service processes a sorted Joiner transformation, it blocks data based on the
mapping configuration and it stores fewer rows in the cache, increasing performance. Blocking logic is
possible if master and detail input to the Joiner transformation originate from different sources.
Otherwise, it does not use blocking logic. Instead, it stores more rows in the cache.
Q24. What are the different types of Joins available in Joiner Transformation?
Ans.
In SQL, a join is a relational operator that combines data from multiple tables into a single result set.
The Joiner transformation is similar to an SQL join except that data can originate from different types
of sources.
The Joiner transformation supports the following types of joins:
Normal
Master Outer
Detail Outer
Full Outer

Join Type property of Joiner Transformation

Note: A normal or master outer join performs faster than a full outer or detail outer join.
Q25. Define the various Join Types of Joiner Transformation.
Ans.
In a normal join, the Integration Service discards all rows of data from the master and detail source
that do not match, based on the join condition.
A master outer join keeps all rows of data from the detail source and the matching rows from the
master source. It discards the unmatched rows from the master source.
A detail outer join keeps all rows of data from the master source and the matching rows from the detail
source. It discards the unmatched rows from the detail source.
A full outer join keeps all rows of data from both the master and detail sources.
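The four join types, together with the master-cache behaviour from Q23, can be sketched in Python. This is an illustrative model rather than the Joiner's real implementation; the function name joiner and the DEPT/EMP sample rows are hypothetical. NULL keys (None) are never matched, in line with Q27.

```python
def joiner(master, detail, key, join_type="normal"):
    """Join detail rows against cached master rows; returns (master, detail) pairs."""
    # Cache all master rows by join key first; NULL keys are not cached.
    cache = {}
    for m in master:
        if m[key] is not None:
            cache.setdefault(m[key], []).append(m)

    matched_keys = set()
    result = []
    # Then stream the detail rows against the cache.
    for d in detail:
        matches = cache.get(d[key], []) if d[key] is not None else []
        for m in matches:
            result.append((m, d))
        if matches:
            matched_keys.add(d[key])
        elif join_type in ("master outer", "full outer"):
            result.append((None, d))  # keep the unmatched detail row

    if join_type in ("detail outer", "full outer"):
        for m in master:
            if m[key] is None or m[key] not in matched_keys:
                result.append((m, None))  # keep the unmatched master row
    return result


# Hypothetical sample data: DEPT as master, EMP as detail.
dept = [{"DEPTNO": 10, "DNAME": "ACCOUNTING"}, {"DEPTNO": 20, "DNAME": "RESEARCH"},
        {"DEPTNO": 40, "DNAME": "OPERATIONS"}]
emp = [{"DEPTNO": 10, "ENAME": "CLARK"}, {"DEPTNO": 20, "ENAME": "SMITH"},
       {"DEPTNO": 30, "ENAME": "ALLEN"}]
counts = {jt: len(joiner(dept, emp, "DEPTNO", jt))
          for jt in ("normal", "master outer", "detail outer", "full outer")}
```

Note how master outer keeps the unmatched detail row (department 30) while detail outer keeps the unmatched master row (department 40).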
Q26. Describe the impact of number of join conditions and join order in a Joiner Transformation.
Ans.
We can define one or more conditions based on equality between the specified master and detail
sources.
Both ports in a condition must have the same datatype. If we need to use two ports in the join
condition with non-matching datatypes we must convert the datatypes so that they match. The Designer
validates datatypes in a join condition.
Additional ports in the join condition increase the time necessary to join two sources.

The order of the ports in the join condition can impact the performance of the Joiner transformation. If
we use multiple ports in the join condition, the Integration Service compares the ports in the order we
specified.
NOTE: Only equality operator is available in joiner join condition.
Q27. How does the Joiner transformation treat NULL value matching?
Ans.
The Joiner transformation does not match null values.
For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service
does not consider them a match and does not join the two rows.
To join rows with null values, replace null input with default values in the Ports tab of the joiner, and
then join on the default values.
Note: If a result set includes fields that do not contain data in either of the sources, the Joiner
transformation populates the empty fields with null values. If we know that a field will return a NULL
and we do not want to insert NULLs in the target, set a default value on the Ports tab for the
corresponding port.
Q28. Suppose we configure Sorter transformations in the master and detail pipelines with the following
sorted ports in order: ITEM_NO, ITEM_NAME, PRICE.
When we configure the join condition, what are the guidelines we need to follow to maintain the sort
order?
Ans.
If we have sorted both the master and detail pipelines in order of the ports say ITEM_NO,
ITEM_NAME and PRICE we must ensure that:
Use ITEM_NO in the First Join Condition.
If we add a Second Join Condition, we must use ITEM_NAME.
If we want to use PRICE as a Join Condition apart from ITEM_NO, we must also use ITEM_NAME in
the Second Join Condition.
If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the input sort order and the
Integration Service fails the session.
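The guideline can be read as a prefix rule: the join condition ports must follow the sorted ports in the same order, without skipping any. A minimal Python check of that reading (the function name is hypothetical, and this is not a Designer validation routine):

```python
def join_ports_keep_sort_order(sorted_ports, join_ports):
    """True if join_ports is a non-empty leading prefix of sorted_ports."""
    join_ports = list(join_ports)
    return bool(join_ports) and join_ports == list(sorted_ports[:len(join_ports)])


sorted_ports = ["ITEM_NO", "ITEM_NAME", "PRICE"]
ok_single = join_ports_keep_sort_order(sorted_ports, ["ITEM_NO"])
ok_all = join_ports_keep_sort_order(sorted_ports, ["ITEM_NO", "ITEM_NAME", "PRICE"])
skipped = join_ports_keep_sort_order(sorted_ports, ["ITEM_NO", "PRICE"])
```

Here skipped is False because ITEM_NAME was left out between ITEM_NO and PRICE.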
Q29. Which transformations must not be placed between the sort origin and the Joiner
transformation if we do not want to lose the input sort order?
Ans.
The best option is to place the Joiner transformation directly after the sort origin to maintain sorted
data.
However, do not place any of the following transformations between the sort origin and the Joiner
transformation:
Custom
Unsorted Aggregator
Normalizer
Rank
Union transformation
XML Parser transformation
XML Generator transformation
Mapplet [if it contains any one of the above mentioned transformations]
Q30. Suppose we have the EMP table as our source. In the target we want to view those employees
whose salary is greater than or equal to the average salary for their departments.
Describe your mapping approach.
Ans.
Our Mapping will look like this:
Image: Mapping using Joiner
To start with the mapping we need the following transformations:
After the Source Qualifier of the EMP table, place a Sorter Transformation. Sort based on the DEPTNO
port.

Sorter Ports Tab

Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE
SALARY for each (GROUP BY) DEPTNO.
When we perform this aggregation, we lose the data for individual employees. To keep the employee
data, we pass one branch of the pipeline to the Aggregator Transformation and another branch with
the same sorted source data directly to the Joiner transformation. When we join
both branches of the pipeline, we join the aggregated data with the original data.

Aggregator Ports Tab

Aggregator Properties Tab

So next we need a Sorted Joiner Transformation to join the sorted aggregated data with the original
data, based on DEPTNO.
Here we will be taking the aggregated pipeline as the Master and the original dataflow as the Detail pipeline.

Joiner Condition Tab

Joiner Properties Tab

After that we need a Filter Transformation to filter out the employees having salary less than average
salary for their department.
Filter Condition: SAL>=AVG_SAL

Filter Properties Tab

Lastly we have the Target table instance.
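The whole pipeline (Sorter, Aggregator branch, Joiner, Filter) can be followed end to end in a short Python sketch. It only mirrors the data flow described above; the sample EMP rows and the function name are hypothetical.

```python
from collections import defaultdict


def at_or_above_dept_average(emp_rows):
    """Return employees whose SAL is >= the average SAL of their DEPTNO."""
    # Sorter: order by DEPTNO so both branches see the same sorted data.
    ordered = sorted(emp_rows, key=lambda r: r["DEPTNO"])

    # Aggregator branch: AVG(SAL) grouped by DEPTNO.
    totals = defaultdict(lambda: [0, 0])
    for r in ordered:
        totals[r["DEPTNO"]][0] += r["SAL"]
        totals[r["DEPTNO"]][1] += 1
    avg_sal = {dept: total / count for dept, (total, count) in totals.items()}

    # Joiner: aggregated branch (master) joined to original rows (detail) on DEPTNO.
    joined = [dict(r, AVG_SAL=avg_sal[r["DEPTNO"]]) for r in ordered]

    # Filter: SAL >= AVG_SAL.
    return [r for r in joined if r["SAL"] >= r["AVG_SAL"]]


# Hypothetical EMP rows.
emp = [
    {"ENAME": "A", "DEPTNO": 10, "SAL": 1000},
    {"ENAME": "B", "DEPTNO": 10, "SAL": 3000},
    {"ENAME": "C", "DEPTNO": 20, "SAL": 2000},
    {"ENAME": "D", "DEPTNO": 20, "SAL": 2000},
]
selected = [r["ENAME"] for r in at_or_above_dept_average(emp)]
```

Employee A is dropped because 1000 is below the department 10 average of 2000; C and D stay because their salary equals their department average.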

Revisiting Sequence Generator Transformation


Q31. What is a Sequence Generator Transformation?
Ans.
A Sequence Generator transformation is a Passive and Connected transformation that generates
numeric values.
It is used to create unique primary key values, replace missing primary keys, or cycle through a
sequential range of numbers.
This transformation by default contains ONLY two OUTPUT ports, namely CURRVAL and
NEXTVAL. We can neither edit or delete these ports nor add ports to this
transformation.
We can create approximately two billion unique numeric values with the widest range from 1 to
2147483647.

Q32. Define the Properties available in Sequence Generator transformation in brief.
Ans.
Sequence Generator Properties and their Description:
Start Value: Start value of the generated sequence that we want the Integration Service to use if we
use the Cycle option. If we select Cycle, the Integration Service cycles back to this value when it
reaches the End Value. Default is 0.
Increment By: Difference between two consecutive values. Default is 1.
End Value: Maximum value generated by SeqGen. After reaching this value the session will fail if the
Sequence Generator is not configured to cycle. Default is 2147483647.
Current Value: Current value of the sequence. Enter the value we want the Integration Service to use
as the first value in the sequence. Default is 1.
Cycle: If selected, when the Integration Service reaches the configured End Value for the sequence,
it wraps around and starts the cycle again, beginning with the configured Start Value.
Number of Cached Values: Number of sequential values the Integration Service caches at a time.
Default value for a standard Sequence Generator is 0. Default value for a reusable Sequence
Generator is 1,000.
Reset: Restarts the sequence at the Current Value each time a session runs. This option is disabled
for reusable Sequence Generator transformations.
Q33. Suppose we have a source table populating two target tables. We connect the NEXTVAL port of
the Sequence Generator to the surrogate keys of both the target tables.
Will the surrogate keys in both the target tables be the same? If not, how can we flow the same sequence
values into both of them?
Ans.
When we connect the NEXTVAL output port of the Sequence Generator directly to the surrogate key
columns of the target tables, the sequence numbers will not be the same.
A block of sequence numbers is sent to one target table's surrogate key column. The second target
receives a block of sequence numbers from the Sequence Generator transformation only after the first
target table receives its block of sequence numbers.
Suppose we have 5 rows coming from the source, so the targets will have the sequence values as TGT1
(1,2,3,4,5) and TGT2 (6,7,8,9,10) [taking into consideration Start Value 0, Current Value 1 and
Increment By 1].
Now suppose the requirement is to have the same surrogate keys in both the targets.
Then the easiest way to handle the situation is to put an Expression Transformation in between the
Sequence Generator and the Target tables. The SeqGen will pass unique values to the Expression
transformation, and then the rows are routed from the Expression transformation to the targets.

Sequence Generator
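The difference between the two wirings can be mimicked with a plain counter. The Python below is a simplified model of the behaviour described in the answer, not of PowerCenter internals; the row labels are hypothetical.

```python
from itertools import count

rows = ["r1", "r2", "r3", "r4", "r5"]

# NEXTVAL wired directly to both targets: each target draws its own block.
nextval = count(1)
tgt1_keys = [next(nextval) for _ in rows]
tgt2_keys = [next(nextval) for _ in rows]

# NEXTVAL wired to an Expression first: one value per row, routed to both targets.
nextval = count(1)
per_row = [next(nextval) for _ in rows]
tgt1_shared, tgt2_shared = list(per_row), list(per_row)
```

In the first wiring the targets end up with 1 to 5 and 6 to 10; in the second both receive 1 to 5.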

Q34. Suppose we have 100 records coming from the source. Now for a target column population we
used a Sequence generator.
Suppose the Current Value is 0 and End Value of Sequence generator is set to 80. What will happen?
Ans.
End Value is the maximum value the Sequence Generator will generate. After it reaches the End value
the session fails with the following error message:
TT_11009 Sequence Generator Transformation: Overflow error.

Failing of session can be handled if the Sequence Generator is configured to Cycle through the
sequence, i.e. whenever the Integration Service reaches the configured end value for the sequence, it
wraps around and starts the cycle again, beginning with the configured Start Value.
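A small Python model makes the End Value and Cycle behaviour concrete. It is a sketch under simplifying assumptions: the class is hypothetical, it ignores caching, and its exact row counting is not claimed to match PowerCenter; only the overflow-versus-wrap-around logic is illustrated.

```python
class SequenceGenerator:
    """Simplified model of Start Value, Current Value, End Value and Cycle."""

    def __init__(self, start_value=0, current_value=1, increment_by=1,
                 end_value=2147483647, cycle=False):
        self.start_value = start_value
        self.current = current_value
        self.increment_by = increment_by
        self.end_value = end_value
        self.cycle = cycle

    def nextval(self):
        if self.current > self.end_value:
            if not self.cycle:
                # Stands in for the session failure (TT_11009).
                raise OverflowError("Sequence Generator Transformation: Overflow error.")
            self.current = self.start_value  # wrap around to the Start Value
        value = self.current
        self.current += self.increment_by
        return value


def rows_loaded(generator, row_count):
    """Count how many rows receive a value before an overflow stops the load."""
    loaded = 0
    try:
        for _ in range(row_count):
            generator.nextval()
            loaded += 1
    except OverflowError:
        pass
    return loaded


without_cycle = rows_loaded(SequenceGenerator(current_value=0, end_value=80), 100)
cycling = SequenceGenerator(start_value=0, current_value=0, end_value=80, cycle=True)
values = [cycling.nextval() for _ in range(100)]
```

Without Cycle the model stops before all 100 rows are served; with Cycle every row gets a value, but values repeat after the wrap-around, so they are no longer unique.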
Q35. What are the changes we observe when we promote a non-reusable Sequence Generator to a
reusable one?
And what happens if we set the Number of Cached Values to 0 for a reusable transformation?
Ans.
When we convert a non-reusable Sequence Generator to a reusable one we observe that the Number of
Cached Values is set to 1000 by default, and the Reset property is disabled.
When we try to set the Number of Cached Values property of a Reusable Sequence Generator to 0 in
the Transformation Developer we encounter the following error message:
The number of cached values must be greater than zero for reusable sequence transformation.

Which is the fastest? Informatica or Oracle?


In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle
and found that, in our test conditions, Oracle performs sorting 14% faster than Informatica. This time
we will look into the JOIN operation, not only because JOIN is the single most important data set
operation but also because the performance of JOIN can give crucial data to a developer in order to
develop proper push down optimization manually.
Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises
worldwide rely on Informatica to access, integrate and trust their information assets. On the
other hand, Oracle database is arguably the most successful and powerful RDBMS, trusted
since the 1980s in all sorts of business domains and across all major platforms. Both of these systems
are among the best in the technologies that they support. But when it comes to application development,
developers often face the challenge of striking the right balance of operational load sharing between these
systems. This article will help them make an informed decision.

Which JOINs data faster? Oracle or Informatica?


As an application developer, you have the choice of either using join syntax at the database level to
join your data or using a JOINER TRANSFORMATION in Informatica to achieve the same outcome.
The question is which system performs this faster?

Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will
start with 1 million records in the detail table and 0.1 million in the master table. Subsequently we will test with 2
million, 4 million and 6 million detail table data volumes and 0.2 million, 0.4 million and 0.6 million
master table data volumes. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in Oracle shared pool before the same is read
6. There is no session level partition in Informatica PowerCenter
7. There is no parallel hint provided in extraction SQL query
8. Informatica JOINER has enough cache size
We have used two sets of Informatica PowerCenter mappings created in Informatica PowerCenter
Designer. The first mapping m_db_side_join will use an INNER JOIN clause in the source qualifier to
join data at database level. The second mapping m_Infa_side_join will use an Informatica JOINER to join
data at Informatica level. We have executed these mappings with different data points and logged the
results.
Further to the above test we will execute the m_db_side_join mapping once again, this time with proper
database side indexes and statistics, and log the results.

Result
The following graph shows the performance of Informatica and the database in terms of the time taken by
each system to join data. The average time is plotted along the vertical axis and the data points are plotted
along the horizontal axis.

Data Point    Master Table Record Count
1             0.1 M
2             0.2 M
3             0.4 M
4             0.6 M

Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster
than the Informatica Joiner Transformation without a database index, and 42%
faster with a database index.
Assumption
1. Average server load remains same during all the experiments

2. Average network speed remains same during all the experiments

Note
1. This data can only be used for performance comparison but cannot be used for performance
benchmarking.
2. This data is only indicative and may vary in different testing conditions.

Which is the fastest? Informatica or Oracle?


Think about a typical ETL operation often used in enterprise level data integration. A lot of data
processing can be directed either to the database or to the ETL tool. In general, both the database and
the ETL tool are reasonably capable of doing such operations with almost the same efficiency and
capability. But in order to achieve optimized performance, a developer must carefully consider and
decide which system to trust with each individual processing task.
In this article, we will take a basic database operation, sorting, and we will put these two systems to the
test in order to determine which does it faster than the other, if at all.

Which sorts data faster? Oracle or Informatica?


As an application developer, you have the choice of either using ORDER BY in database level to sort
your data or using SORTER TRANSFORMATION in Informatica to achieve the same outcome. The
question is which system performs this faster?

Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will
start with 1 million records and we will double the volume for each subsequent data point. Here are the
details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in Oracle shared pool before the same is read
6. There is no session level partition in Informatica PowerCenter
7. There is no parallel hint provided in extraction SQL query
8. The source table has 10 columns and first 8 columns will be used for sorting
9. Informatica sorter has enough cache size
We have used two sets of Informatica PowerCenter mappings created in Informatica PowerCenter
Designer. The first mapping m_db_side_sort will use an ORDER BY clause in the source qualifier to
sort data at database level. The second mapping m_Infa_side_sort will use an Informatica sorter to sort data
at Informatica level. We have executed these mappings with different data points and logged the results.

Result
The following graph shows the performance of Informatica and Database in terms of time taken by
each system to sort data. The time is plotted along vertical axis and data volume is plotted along
horizontal axis.

Verdict
The above experiment demonstrates that Oracle
database is faster in SORT operation than Informatica
by an average factor of 14%.
Assumption
1. Average server load remains same during all the experiments
2. Average network speed remains same during all the experiments

Note
This data can only be used for performance comparison but cannot be used for performance
benchmarking.

Informatica Reject File - How to Identify rejection reason


Saurav Mitra

When we run a session, the Integration Service may create a reject file for each target instance in the
mapping to store the rejected target records. With the help of the Session Log and Reject File we can
identify the cause of data rejection in the session. Eliminating the cause of rejection will lead to
rejection free loads in the subsequent session runs. If the Informatica Writer or the Target Database
rejects data due to any valid reason, the Integration Service logs the rejected records into the reject file.
Every time we run the session the Integration Service appends the rejected records to the reject file.

Working with Informatica Bad Files or Reject Files


By default the Integration Service creates the reject files or bad files in the $PMBadFileDir process
variable directory. It writes the entire rejected row to the bad file although the problem may be in
any one of the columns. The reject files have a default naming convention like
[target_instance_name].bad. If we open the reject file in an editor we will see comma separated
values having some tags/indicators and some data values. We will see two types of indicators in the
reject file. One is the Row Indicator and the other is the Column Indicator.
The best way to read the bad file is to copy its contents and save them as a
CSV (Comma Separated Values) file. Opening the CSV file gives a spreadsheet-like look and feel.
The first column in the reject file is the Row Indicator, which determines whether the row was
destined for insert, update, delete or reject. It is basically a flag that determines the Update Strategy for
the data row. When the Commit Type of the session is configured as User-defined, the row indicator
indicates whether the transaction was rolled back due to a non-fatal error, or if the committed
transaction was in a failed target connection group.

List of Values of Row Indicators:

Row Indicator    Indicator Significance
0                Insert
1                Update
2                Delete
3                Reject
4                Rolled-back insert
5                Rolled-back update
6                Rolled-back delete
7                Committed insert
8                Committed update
9                Committed delete

Next come the column data values, each followed by its Column Indicator, which describes the data
quality of the corresponding column.

List of Values of Column Indicators:

Column Indicator    Type of data
D                   Valid data or Good Data.
O                   Overflowed Numeric Data.
N                   Null Value.
T                   Truncated String Data.

Also note that the second column contains the column indicator flag value 'D', which signifies that
the Row Indicator is valid.
Now let us see what data in a bad file looks like:
0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T
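The sample row above can be decoded with a few lines of Python. This is a minimal sketch, not an Informatica utility: parse_reject_line is a hypothetical helper, and the indicator letters O, N and T are taken from the sample row itself. It assumes the layout described above: the row indicator, its flag, then alternating column values and column indicators.

```python
import csv
import io

ROW_INDICATORS = {
    "0": "Insert", "1": "Update", "2": "Delete", "3": "Reject",
    "4": "Rolled-back insert", "5": "Rolled-back update", "6": "Rolled-back delete",
    "7": "Committed insert", "8": "Committed update", "9": "Committed delete",
}
COLUMN_INDICATORS = {
    "D": "Valid data", "O": "Overflowed numeric data",
    "N": "Null value", "T": "Truncated string data",
}


def parse_reject_line(line):
    """Split one reject-file row into its row indicator and (value, status) pairs."""
    fields = next(csv.reader(io.StringIO(line)))
    row_indicator, row_flag = fields[0], fields[1]
    # Remaining fields alternate: column value, column indicator.
    pairs = zip(fields[2::2], fields[3::2])
    return {
        "row_indicator": ROW_INDICATORS[row_indicator],
        "row_flag": COLUMN_INDICATORS[row_flag],
        "columns": [(value, COLUMN_INDICATORS[flag]) for value, flag in pairs],
    }


parsed = parse_reject_line("0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T")
```

For this row the decoded result shows an insert whose third column overflowed, whose fourth column was null and whose fifth column was truncated.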

Implementing Informatica Incremental Aggregation


Using incremental aggregation, we apply captured changes in the source data (CDC part) to aggregate
calculations in a session. If the source changes incrementally and we can capture the changes, then we
can configure the session to process those changes. This allows the Integration Service to update the
target incrementally, rather than forcing it to delete the previous load's data, process the entire source data
and recalculate the same data each time we run the session.

Incremental Aggregation
When the session runs with incremental aggregation enabled for the first time, say in the 1st week of Jan, we
will use the entire source. This allows the Integration Service to read and store the necessary aggregate
data. In the 2nd week of Jan, when we run the session again, we will select only the CDC
records from the source, i.e. the records loaded after the initial load. The Integration Service then
processes this new data and updates the target accordingly.
Use incremental aggregation when the changes do not significantly change the target. If
processing the incrementally changed source alters more than half the existing target, the session may
not benefit from using incremental aggregation. In this case, drop the table and recreate the target with
the entire source data and recalculate the same aggregation formula.
Incremental aggregation may be helpful in cases when we need to load data into
monthly facts on a weekly basis.
Let us see a sample mapping to implement incremental aggregation:
Image: Incremental Aggregation Sample Mapping
Look at the Source Qualifier query to fetch the CDC part using a BATCH_LOAD_CONTROL
table that saves the last successful load date for the particular mapping.
Image: Incremental Aggregation Source Qualifier

Look at the ports tab of Expression transformation.

Look at the ports tab of Aggregator Transformation.

Now the most important part: the session properties configuration to implement incremental aggregation.

If we want to reinitialize the aggregate cache, say during the first week of every month, we
configure another session identical to the previous one, the only change being that the Reinitialize aggregate
cache property is checked.

Now have a look at the source table data:


CUSTOMER_KEY   INVOICE_KEY
1111           5001
2222           5002
3333           5003
1111           6007
1111           6008
2222           6009
4444           1234
5555           6157

After the first Load on 1st week of Jan 2010, the data in the target is as follows:
CUSTOMER_KEY   INVOICE_KEY
1111           5001
2222           5002
3333           5003

Now during the 2nd week load, the session processes only the incremental data in the source, i.e. those records
having a load date greater than the last session run date. After the 2nd week's load, incremental
aggregation of the incremental source data with the aggregate cache file data updates the target
table with the following dataset:
CUSTOMER_KEY   INVOICE_KEY
1111           6008
2222           6009
3333           5003
4444           1234
5555           6157
The first time we run an incremental aggregation session, the Integration Service processes the entire
source. At the end of the session, the Integration Service stores aggregate data for that session run in
two files, the index file and the data file. The Integration Service creates the files in the cache directory
specified in the Aggregator transformation properties.
Each subsequent time we run the session with
incremental aggregation, we use the incremental source changes in the session. For each input record,
the Integration Service checks historical information in the index file for a corresponding group. If it
finds a corresponding group, the Integration Service performs the aggregate operation incrementally,
using the aggregate data for that group, and saves the incremental change. If it does not find a
corresponding group, the Integration Service creates a new group and saves the record data.
When writing to the target, the Integration Service applies the changes to the existing target. It saves
modified aggregate data in the index and data files to be used as historical data the next time we run
the session.
Each subsequent time we run a session with incremental aggregation, the Integration Service creates a
backup of the incremental aggregation files. The cache directory for the Aggregator transformation
must contain enough disk space for two sets of the files.
The Integration Service creates new aggregate data, instead of using historical data, when we configure
the session to reinitialize the aggregate cache, delete the cache files, and so on.
When the Integration Service rebuilds incremental aggregation files, the data in the previous files is
lost.
Note: To protect the incremental aggregation files from file corruption or disk failure,
periodically back up the files.
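The merge of cached groups with incremental rows can be imitated outside Informatica with a small awk sketch. It is an illustration only, not how the Integration Service is implemented: the file names agg_cache.txt and delta.txt are hypothetical, and "latest INVOICE_KEY per CUSTOMER_KEY" is assumed as the aggregate because that is what the sample target data suggests.

```shell
workdir=$(mktemp -d)
cache="$workdir/agg_cache.txt"   # stands in for the index and data files
delta="$workdir/delta.txt"       # stands in for the CDC rows of week 2

# Cache contents after the first (full) load: CUSTOMER_KEY INVOICE_KEY
printf '%s\n' '1111 5001' '2222 5002' '3333 5003' > "$cache"
# Incremental rows picked up by the second run
printf '%s\n' '1111 6007' '1111 6008' '2222 6009' '4444 1234' '5555 6157' > "$delta"

# Read the cache first and the delta second: a later row for an existing
# group replaces the stored value, a row for an unseen group adds a group.
result=$(awk '{ last[$1] = $2 } END { for (k in last) print k, last[k] }' \
    "$cache" "$delta" | LC_ALL=C sort)

cp "$cache" "$cache.bak"            # keep a backup of the previous cache
printf '%s\n' "$result" > "$cache"  # save the merged state for the next run
printf '%s\n' "$result"
rm -rf "$workdir"
```

Only the five delta rows are read from the "source" in the second run, yet the result covers all five customers, which is the point of keeping the cache between runs.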

Using Informatica Normalizer Transformation


Saurav Mitra
Normalizer, a native transformation in Informatica, can ease many complex data transformation
requirements. Here is how to use it effectively.

Using Normalizer Transformation


A Normalizer is an Active transformation that returns multiple rows from a single source row; it returns
duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring
columns from COBOL sources, relational tables, or other sources. It can be used to
transpose the data in columns to rows.
Normalizer effectively does the opposite of what Aggregator does!

Example of Data Transpose using Normalizer


Think of a relational table that stores four quarters of sales by store, where we need to create a row for
each sales occurrence. We can configure a Normalizer transformation to return a separate row for each
quarter, as below.
The following source rows contain four quarters of sales by store:
Source Table
Store    Quarter1  Quarter2  Quarter3  Quarter4
Store1   100       300       500       700
Store2   250       450       650       850
The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that
identifies the quarter number:
Target Table
Store    Sales  Quarter
Store1   100    1
Store1   300    2
Store1   500    3
Store1   700    4
Store2   250    1
Store2   450    2
Store2   650    3
Store2   850    4
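The column-to-row transposition can be sketched with awk. This is not the Normalizer itself, just the same reshaping done by hand; the source rows are assumed from the sales figures in the target table, and the third output column plays the role of the GCID index.

```shell
# Assumed source rows: one row per store with four quarterly sales columns.
src='Store1 100 300 500 700
Store2 250 450 650 850'

# Emit one row per (store, quarter): store name, sales value, and the
# position of the column within the repeating group (1 to 4).
normalized=$(printf '%s\n' "$src" |
    awk '{ for (i = 2; i <= NF; i++) print $1, $i, i - 1 }')
printf '%s\n' "$normalized"
```

Each input row produces four output rows, and the store name is repeated on every one of them, which mirrors how single-occurring columns are duplicated.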

How Informatica Normalizer Works


Suppose we have the following data in the source, where each row also carries one column per expense head
(Food, Houserent and Transportation):
Name   Month
Sam    Jan
John   Jan
Tom    Jan
Sam    Feb
John   Feb
Tom    Feb
and we need to transform the source data and populate the target table with one row per expense head
for each Name and Month, as below:
Name   Month
Sam    Jan
Sam    Jan
Sam    Jan
John   Jan
John   Jan
John   Jan
Tom    Jan
Tom    Jan
Tom    Jan
.. and so on.
Now below is the screen-shot of a complete mapping which shows how to achieve this result using
Informatica PowerCenter Designer. Image: Normalization Mapping Example 1
I will explain the mapping further below.

Setting Up Normalizer Transformation Property


First we need to set the number of occurrences property of the Expense head to 3 in the Normalizer tab
of the Normalizer transformation, since we have Food, Houserent and Transportation.
This in turn creates the corresponding 3 input ports in the Ports tab, along with the fields
Individual and Month.

In the Ports tab of the Normalizer, the ports are created automatically as configured in the
Normalizer tab.
Interestingly, we will observe two new columns, namely

GK_EXPENSEHEAD
GCID_EXPENSEHEAD
The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID
holds the value of the occurrence field, i.e. the column number of the input Expense head.
Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.

The GCID therefore tells us which expense corresponds to which field while converting columns to rows.
Below is the screen-shot of the expression that handles this GCID:
Image: Expression to handle GCID

Informatica Dynamic Lookup Cache


Saurav Mitra
A Lookup cache does not change once built. But what if the underlying lookup table changes
after the lookup cache is created? Is there a way to keep the cache up-to-date even if the
underlying table changes?
Dynamic Lookup Cache

Let's think about this scenario. You are loading your target table through a mapping. Inside the
mapping you have a Lookup, and in the Lookup you are actually looking up the same target table
you are loading. You may ask me, "So? What's the big deal? We all do it quite often...". And yes, you
are right. There is no "big deal", because Informatica (generally) caches the lookup table at the very
beginning of the mapping, so whatever records get inserted into the target table through the mapping
have no effect on the Lookup cache. The lookup will still hold the previously cached data, even if
the underlying target table is changing.

But what if you want your Lookup cache to get updated as and when the target table changes? What
if you want your lookup cache to always show the exact snapshot of the data in your target table at that
point in time? Clearly this requirement will not be fulfilled if you use a static cache. You will
need a dynamic cache to handle this.

But why would anyone need a dynamic cache?


To understand this, let's first look at a static cache scenario.

Static Cache Scenario


Let's suppose you run a retail business and maintain all your customer information in a customer
master table (an RDBMS table). Every night, all the customers from your customer master table are loaded
into a Customer Dimension table in your data warehouse. Your source customer table is a transaction
system table, probably in 3rd normal form, and does not store history. That is, if a customer changes
his address, the old address is overwritten with the new address. But your data warehouse table stores the
history (maybe in the form of SCD Type-II). There is a mapping that loads your data warehouse table from
the source table. Typically you do a Lookup on the target (static cache) and check every
incoming customer record to determine whether the customer already exists in the target. If the
customer does not exist in the target, you conclude the customer is new and INSERT the record,
whereas if the customer already exists, you may want to update the target record with this new
record (if the record has changed). This is illustrated below. You don't need a dynamic Lookup cache for
this.
Image: A static Lookup Cache to determine if a source record is new or updatable

Dynamic Lookup Cache Scenario


Notice that in the previous example I mentioned that your source table is an RDBMS table. This ensures
that your source table does not have any duplicate records.
But what if you had a flat file as the source, with many duplicate records?
Would the scenario be the same? No, see the illustration below.

Image: A Scenario illustrating the use of dynamic lookup cache


Here are some more examples of when you may consider using a dynamic lookup:
- Updating a master customer table with both new and updated customer information coming
  together, as shown above.
- Loading data into a slowly changing dimension table and a fact table at the same time.
  Remember, you typically look up the dimension while loading the fact. So you load the dimension
  table before loading the fact table. But using a dynamic lookup, you can load both simultaneously.
- Loading data from a file with many duplicate records and eliminating duplicate records in the
  target by updating a duplicate row, i.e. keeping the most recent row or the initial row.
- Loading the same data from multiple sources using a single mapping. Just consider the previous
  retail business example. If you have more than one shop and Linda has visited two of your
  shops for the first time, the customer record for Linda will come twice during the same load.

So, How does dynamic lookup work?


When the Integration Service reads a row from the source, it updates the lookup cache by performing
one of the following actions:
- Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service
  inserts the row in the cache based on the input ports or a generated Sequence-ID. The Integration
  Service flags the row as insert.
- Updates the row in the cache: If the row exists in the cache, the Integration Service updates
  the row in the cache based on the input ports. The Integration Service flags the row as update.
- Makes no change to the cache: This happens when the row exists in the cache and the lookup
  is configured to insert new rows only; or the row is not in the cache and the lookup is
  configured to update existing rows only; or the row is in the cache but, based on the lookup
  condition, nothing changes. The Integration Service flags the row as unchanged.
Notice that the Integration Service actually flags the rows based on the above three conditions.
And that's a great thing, because if you know the flag you can route the row to implement
different logic. This flag port is called
NewLookupRow
Using the value of this port, the rows can be routed for insert, for update, or to do nothing. You just need to
use a Router or Filter transformation followed by an Update Strategy.
The actual values that you can expect in the NewLookupRow port are:
0 = Integration Service does not update or insert the row in the cache.
1 = Integration Service inserts the row into the cache.
2 = Integration Service updates the row in the cache.
When the Integration Service reads a row, it changes the lookup cache depending on the results of the
lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the
NewLookupRow port to indicate whether it inserts or updates the row in the cache, or makes no change.
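The flagging logic can be mimicked with an awk associative array standing in for the cache. This is a rough sketch, not Informatica's implementation: the customer rows are invented, the cache is assumed to be configured for both insert and update, and the printed flag only imitates the 0, 1 and 2 values of NewLookupRow.

```shell
# Hypothetical target row already present when the cache is built.
target='101 Alice Paris'
# Hypothetical incoming rows: 102 is new and arrives twice, 101 has moved.
source_rows='102 Linda Rome
101 Alice Berlin
102 Linda Rome
101 Alice Berlin'

flags=$(printf '%s\n' "$source_rows" | awk -v target="$target" '
BEGIN {
    # Build the lookup cache from the target row (key -> rest of the row).
    split(target, f, " ")
    cache[f[1]] = f[2] " " f[3]
}
{
    value = $2 " " $3
    if (!($1 in cache))          flag = 1   # not cached: insert
    else if (cache[$1] != value) flag = 2   # cached but changed: update
    else                         flag = 0   # cached and identical: no change
    cache[$1] = value                       # the cache is refreshed per row
    print $1, flag
}')
printf '%s\n' "$flags"
```

Because the cache is refreshed row by row, the second occurrence of customer 102 is flagged 0 instead of being inserted again, which is what a static cache cannot do.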

1. Write a command to replace the first occurrence of the word "bad" with "good" on each line of a file?


sed 's/bad/good/' < filename
2. Write a command to replace the word "bad" with "good" globally in a file?
sed 's/bad/good/g' < filename
3. Write a command to replace the character '/' with ',' in a file?
sed 's/\//,/' < filename
sed 's|/|,|' < filename
4. Write a command to replace the word "apple" with "(apple)" in a file?
sed 's/apple/(&)/' < filename
5. Write a command to switch the two consecutive words "apple" and "mango" in a file?
sed 's/\(apple\) \(mango\)/\2 \1/' < filename
6. Write a command to replace the second occurrence of the word "bat" on each line with "ball"?
sed 's/bat/ball/2' < filename
7. Write a command to remove all the occurrences of the word "jhon" except the first one on each line of
the entire file?
sed 's/jhon//2g' < filename
Combining a number with the g flag, as in 2g, is a GNU sed extension.
8. Write a command to remove the first number on line 5 of a file?
sed '5 s/[0-9][0-9]*//' < filename
9. Write a command to remove the first number on all lines that start with "@"?
sed '/^@/ s/[0-9][0-9]*//' < filename
10. Write a command to replace the word "gum" with "drum" in the first 100 lines of a file?
sed '1,100 s/gum/drum/' < filename
11. Write a command to replace the word "lite" with "light" from the 100th line to the last line of a file?
sed '100,$ s/lite/light/' < filename
12. Write a command to remove the first 10 lines from a file?
sed '1,10 d' < filename
13. Write a command to duplicate each line in a file?
sed 'p' < filename
14. Write a command to duplicate empty lines in a file?
sed '/^$/ p' < filename
15. Write a sed command to print the lines that do not contain the word "run"?
sed -n '/run/!p' < filename
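A few of these substitutions can be tried on in-line sample text instead of a file. The sample lines and variable names below are invented for the demonstration; only standard sed features are used.

```shell
sample='bad bat bad bat
apple mango 42'

# First "bad" only, then every "bad", on the first sample line.
first=$(printf '%s\n' "$sample" | sed 's/bad/good/' | head -n 1)
all=$(printf '%s\n' "$sample" | sed 's/bad/good/g' | head -n 1)
# Second occurrence of "bat" on the line.
second=$(printf '%s\n' "$sample" | sed 's/bat/ball/2' | head -n 1)
# Swap two consecutive words using back-references; print changed lines only.
swapped=$(printf '%s\n' "$sample" | sed -n 's/\(apple\) \(mango\)/\2 \1/p')
# Wrap the match in parentheses; & stands for the matched text.
wrapped=$(printf '%s\n' "$sample" | sed -n 's/apple/(&)/p')

printf '%s\n' "$first" "$all" "$second" "$swapped" "$wrapped"
```

Quoting the sed script in single quotes matters here: without it the shell would try to interpret the parentheses and the ampersand itself.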

Find Command in Unix and Linux Examples


Find is one of the most powerful utilities of Unix (or Linux), used for searching for files in a directory
hierarchy. The syntax of the find command is
find [pathnames] [conditions]

Let us see some practical exercises on using the find command.


1. How to run the last executed find command?
!find

This will execute the last find command. It also displays the last find command executed along with the
result on the terminal.
2. How to find a file by name?
find -name "sum.java"
./bkp/sum.java
./sum.java

This will find all the files with name "sum.java" in the current directory and sub-directories.
3. How to find files by name, ignoring case?
find -iname "sum.java"
./SUM.java
./bkp/sum.java
./sum.java

This will find all the files named "sum.java", ignoring case, in the current directory and
sub-directories.
4. How to find a file in the current directory only?

find -maxdepth 1 -name "sum.java"
./sum.java

This will look for the file "sum.java" in the current directory only.
5. How to find files containing a specific word in their names?
find -name "*java*"
./SUM.java
./bkp/sum.java
./sum.java
./multiply.java

It displays all the files which have the word "java" in the filename.
6. How to find files in a specific directory?
find /etc -name "*java*"

This will look for files with "java" in the filename under the /etc directory.
7. How to find the files whose names are not "sum.java"?
find -not -name "sum.java"
.
./SUM.java
./bkp
./multiply.java

This is like inverting the match. It prints all the files except the given file "sum.java".
8. How to limit the file searches to specific directories?
find -name "sum.java"
./tmp/sum.java
./bkp/var/tmp/files/sum.java
./bkp/var/tmp/sum.java
./bkp/var/sum.java
./bkp/sum.java
./sum.java

You can see here that the find command displayed all the files named "sum.java" in the current
directory and sub-directories.


a. How to print the files in the current directory and one level down to the current directory?
find -maxdepth 2 -name "sum.java"
./tmp/sum.java
./bkp/sum.java
./sum.java

b. How to print the files in the current directory and two levels down to the current directory?
find -maxdepth 3 -name "sum.java"
./tmp/sum.java
./bkp/var/sum.java
./bkp/sum.java
./sum.java

c. How to print the files in the subdirectories between level 1 and 4?


find -mindepth 2 -maxdepth 5 -name "sum.java"
./tmp/sum.java
./bkp/var/tmp/files/sum.java
./bkp/var/tmp/sum.java
./bkp/var/sum.java
./bkp/sum.java

9. How to find the empty files in a directory?


find . -maxdepth 1 -empty
./empty_file

10. How to find the largest file in the current directory and sub-directories?
find . -type f -exec ls -s {} \; | sort -n -r | head -1

The find command "find . -type f -exec ls -s {} \;" lists all the files along with their sizes (ls -s reports
the size in allocated blocks). The sort command then sorts the files by size in descending order, and the
head command picks only the first line from the output of sort.
11. How to find the smallest file in the current directory and sub-directories?
find . -type f -exec ls -s {} \; | sort -n -r | tail -1

Another method using find is


find . -type f -exec ls -s {} \; | sort -n | head -1
12. How to find files based on the file type?


a. Finding socket files
find . -type s

b. Finding directories
find . -type d

c. Finding hidden directories


find -type d -name ".*"

d. Finding regular files


find . -type f

e. Finding hidden files


find . -type f -name ".*"

13. How to find files based on the size?


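a. Finding files whose size is 10M
find . -size 10M

GNU find rounds file sizes up to whole units before comparing, so with the M suffix this matches files
larger than 9 MiB and up to 10 MiB, not only files of exactly 10 MiB.
b. Finding files larger than 10M


find . -size +10M

c. Finding files smaller than 10M


find . -size -10M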

14. How to find the files which are modified after the modification of a given file?
find -newer "sum.java"

This will display all the files which were modified after the file "sum.java".
15. Display the files which are accessed after the modification of a given file.
find -anewer "sum.java"

16. Display the files which are changed after the modification of a given file.
find -cnewer "sum.java"

17. How to find the files based on the file permissions?


find . -perm 777

This will display the files whose permission bits are exactly 777, i.e. read, write and execute permissions
for owner, group and others. To see the permissions of files and directories, use the command "ls -l".
18. Find the files which are modified within 30 minutes.
find . -mmin -30

19. Find the files which are modified within 1 day.


find . -mtime -1

20. How to find the files which were modified 30 or more minutes ago?
find . -not -mmin -30

21. How to find the files which were modified 1 or more days ago?
find . -not -mtime -1

22. Print the files which are accessed within 1 hour.

find . -amin -60

23. Print the files which are accessed within 1 day.


find . -atime -1

24. Display the files which are changed within 2 hours.


find . -cmin -120

25. Display the files which are changed within 2 days.


find . -ctime -2

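26. How to find the files which were changed after file f1 but not after file f2?
find . -cnewer f1 -and ! -cnewer f2

Note that -cnewer compares the status change time (ctime) of each file, not a creation time.
So far we have just found the files and displayed them on the terminal. Now we will see how to perform some
operations on the files.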
1. How to find the permissions of the files which contain the name "java"?
find -name "*java*"|xargs ls -l

Alternate method is
find -name "*java*" -exec ls -l {} \;

2. Find the files which have the name "java" in it and then display only the files which have "class"
word in them?
find -name "*java*" -exec grep -H class {} \;

3. How to remove files whose names contain "java"?


find -name "*java*" -exec rm -r {} \;

This will delete all the files (and, because of the -r option, any directories) which have the word "java"
in their names, in the current directory and sub-directories.

The basic syntax of AWK:


awk 'BEGIN {start_action} {action} END {stop_action}' filename

Here the actions in the begin block are performed before processing the file and the actions in the end
block are performed after processing the file. The rest of the actions are performed while processing the
file.
Examples:
Create a file input_file with the following data. This file can be easily created using the output of ls -l.
-rw-r--r-- 1 center center 0 Dec 8 21:39 p1
-rw-r--r-- 1 center center 17 Dec 8 21:15 t1
-rw-r--r-- 1 center center 26 Dec 8 21:38 t2
-rw-r--r-- 1 center center 25 Dec 8 21:38 t3
-rw-r--r-- 1 center center 43 Dec 8 21:39 t4
-rw-r--r-- 1 center center 48 Dec 8 21:39 t5

From the data, you can observe that this file has rows and columns. The rows are separated by a newline
character and the columns are separated by space characters. We will use this file as the input for
the examples discussed here.
1. awk '{print $1}' input_file
Here $1 has a meaning: $1, $2, $3... represent the first, second, third columns... of a row respectively.
This awk command prints the first column of each row as shown below.
-rw-r--r--
-rw-r--r--
-rw-r--r--
-rw-r--r--
-rw-r--r--
-rw-r--r--

To print the 4th and 5th columns of a file, use awk '{print $4,$5}' input_file
Here the BEGIN and END blocks are not used. So the print command is executed for each
row read from the file. The next example shows how to use the BEGIN and END blocks.

2. awk 'BEGIN {sum=0} {sum=sum+$5} END {print sum}' input_file


This prints the sum of the values in the 5th column. In the BEGIN block the variable sum is assigned
the value 0. In the next block the value of the 5th column is added to the sum variable. This addition
is repeated for every row processed. When all the rows are processed,
the sum variable holds the sum of the values in the 5th column. This value is printed in the END
block.
3. In this example we will see how to execute the awk script written in a file. Create a file sum_column
and paste the below script in that file
#!/usr/bin/awk -f
BEGIN {sum=0}
{sum=sum+$5}
END {print sum}

Now execute the script using the awk command as


awk -f sum_column input_file
This runs the script in the sum_column file and displays the sum of the 5th column of input_file.
4. awk '{ if($9 == "t4") print $0;}' input_file
This awk command checks for the string "t4" in the 9th column and, if it finds a match, prints
the entire line. The output of this awk command is
-rw-r--r-- 1 center center 43 Dec 8 21:39 t4

5. awk 'BEGIN { for(i=1;i<=5;i++) print "square of", i, "is",i*i; }'


This will print the squares of the numbers from 1 to 5. The output of the command is
square of 1 is 1
square of 2 is 4
square of 3 is 9
square of 4 is 16
square of 5 is 25

Notice that the syntax of if and for are similar to the C language.

Awk Built in Variables:


You have already seen $0, $1, $2... which hold the entire line, first column, second column...
respectively. Now we will see other built-in variables with examples.
FS - Input field separator variable:
So far, we have seen fields separated by a space character. By default awk assumes that the fields in a
file are separated by space characters. If the fields in the file are separated by any other character, we can
use the FS variable to specify the delimiter.
6. awk 'BEGIN {FS=":"} {print $2}' input_file
OR
awk -F: '{print $2}' input_file
This will print the result as
39 p1
15 t1
38 t2
38 t3
39 t4
39 t5

OFS - Output field separator variable:


By default whenever we printed the fields using the print statement the fields are displayed with space
character as delimiter. For example
7. awk '{print $4,$5}' input_file
The output of this command will be
center 0
center 17
center 26
center 25
center 43
center 48

We can change this default behavior using the OFS variable as

awk 'BEGIN {OFS=":"} {print $4,$5}' input_file


center:0
center:17
center:26
center:25
center:43
center:48

Note: print $4,$5 and print $4$5 do not work the same way. The first one displays the output with
the output field separator (a space by default) between the fields. The second one displays the output
without any delimiter.
NF - Number of fields variable:
NF can be used to know the number of fields in a line.
8. awk '{print NF}' input_file
This will display the number of columns in each row.
NR - number of records variable:
The NR can be used to know the line number or count of lines in a file.
9. awk '{print NR}' input_file
This will display the line numbers from 1.
10. awk 'END {print NR}' input_file
This will display the total number of lines in the file.
String functions in Awk:
Some of the string functions in awk are:
index(string,search)
length(string)
split(string,array,separator)
substr(string,position)
substr(string,position,max)
tolower(string)
toupper(string)
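The string functions listed above can be tried from a BEGIN block, which needs no input file. The sample string and the variable names are arbitrary choices for this sketch.

```shell
out=$(awk 'BEGIN {
    s = "Data Warehouse"
    print index(s, "Ware")          # position of the substring
    print length(s)                 # number of characters
    print substr(s, 6)              # from position 6 to the end
    print substr(s, 1, 4)           # four characters from position 1
    print toupper(s)
    print tolower(s)
    n = split("a,b,c", parts, ",")  # split returns the number of pieces
    print n, parts[2]
}')
printf '%s\n' "$out"
```

Positions in awk strings start at 1, so index returns 0 only when the search string is not found.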
Advanced Examples:

1. Filtering lines using Awk split function


The awk split function splits a string into an array using the delimiter.
The syntax of split function is
split(string, array, delimiter)
Now we will see how to filter the lines using the split function with an example.
The input "file.txt" contains the data in the following format
1 U,N,UNIX,000
2 N,P,SHELL,111
3 I,M,UNIX,222
4 X,Y,BASH,333
5 P,R,SCRIPT,444

Required output: We have to print only the lines whose 2nd field contains the string "UNIX"
as its 3rd comma-separated item (the 2nd field of each line is itself delimited by commas).
The output is:
1 U,N,UNIX,000
3 I,M,UNIX,222

The awk command for getting the output is:


awk '{
split($2,arr,",");
if(arr[3] == "UNIX")
print $0
} ' file.txt


Examples of Awk Command in Unix - Part 2


1. Inserting a new line after every 2 lines
We will see how to implement this using the awk command with an example.
The input "file.txt" contains the below data:
1 A
2 B
3 C
4 D
5 E
6 F

Let us say we want to insert the new line "9 Z" after every two lines in the input file. The required output
data after inserting the new line looks as
1 A
2 B
9 Z
3 C
4 D
9 Z
5 E
6 F
9 Z
The awk command for getting this output is


awk '{
if(NR%2 == 0)
{
print $0"\n9 Z";
}
else
{
print $0
}
}' file.txt

2. Replace the Nth occurrence of a pattern


The input file contains the data.
AAA 1
BBB 2
CCC 3
AAA 4
AAA 5
BBB 6
CCC 7
AAA 8
BBB 9
AAA 0

Now we want to replace the fourth occurrence of the first field "AAA" with "ZZZ" in the file.
The required output is:
AAA 1
BBB 2
CCC 3
AAA 4
AAA 5
BBB 6
CCC 7
ZZZ 8
BBB 9
AAA 0

The awk command for getting this output is


awk 'BEGIN {count=0}
{
if($1 == "AAA")
{
count++
}
if(count == 4)
{
sub("AAA","ZZZ",$1)
}
}
{
print $0
}' file.txt

3. Find the sum of even and odd lines separately


The input file data:
A 10
B 39
C 22
D 44
E 75
F 89
G 67

You have to take the second field and find the sums of the odd-numbered and even-numbered lines separately.
The required output (odd lines first, then even lines) is
174, 172

The awk command for producing this output is


awk '{
if(NR%2 == 1)
{
sum_o = sum_o + $2
}
else
{
sum_e = sum_e + $2
}
}
END { print sum_o", "sum_e }' file.txt

4. Fibonacci series using awk command


Now we will produce the Fibonacci series using the awk command.
awk ' BEGIN{
for(i=0;i<=10;i++)
{
if (i <=1 )
{
x=0;
y=1;
print i;
}
else
{
z=x+y;
print z;
x=y;
y=z;

}
}
}'

The output is
0
1
1
2
3
5
8
13
21
34
55

5. Remove leading zeros from a file using the awk command. The input file contains the below data.
0012345
05678
01010
00001

After removing the leading zeros, the output should contain the below data.
12345
5678
1010
1

Either of the following awk commands does this.


awk '{print $1 + 0}' file.txt
awk '{printf "%d\n",$0}' file.txt
The basic syntax of the grep command is
grep [options] pattern [list of files]
Let us see some practical examples of the grep command.
1. Running the last executed grep command

This saves a lot of time if you are executing the same command again and again.
!grep
This displays the last executed grep command and also prints the result set of the command on the
terminal.
2. Search for a string in a file
This is the basic usage of grep command. It searches for the given string in the specified file.
grep "Error" logfile.txt
This searches for the string "Error" in the log file and prints all the lines that contain it.
3. Searching for a string in multiple files.
grep "string" file1 file2
grep "string" file_pattern
This is also basic usage of the grep command. You can list the files to search explicitly, or give a
file name pattern (a shell wildcard such as *.log) that the shell expands to the matching files.
4. Case insensitive search
The -i option searches for a string case insensitively in the given file. It matches words like
"UNIX", "Unix", "unix".
grep -i "UNix" file.txt

5. Specifying the search string as a regular expression pattern.


grep "^[0-9].*" file.txt
This will search for the lines which starts with a number. Regular expressions is huge topic and I am
not covering it here. This example is just for providing the usage of regular expressions.
6. Checking for the whole words in a file.
By default, grep matches the given string/pattern even if it found as a substring in a file. The -w option
to grep makes it match only the whole words.
grep -w "world" file.txt

7. Displaying the lines before the match.

Sometimes, when you are searching for an error in a log file, it is good to know the lines around
the error lines in order to find the cause of the error.
grep -B 2 "Error" file.txt
This prints the matched lines along with the two lines before each matched line.
8. Displaying the lines after the match.
grep -A 3 "Error" file.txt
This will display the matched lines along with the three lines after the matched lines.
9. Displaying the lines around the match
grep -C 5 "Error" file.txt
This will display the matched lines and also five lines before and after the matched lines.
10. Searching for a string in all files recursively
You can search for a string in all the files under the current directory and sub-directories with the help
of the -r option.
grep -r "string" *

11. Inverting the pattern match


You can display the lines that do not match the specified search string or pattern using the -v
option.
grep -v "string" file.txt

12. Displaying the non-empty lines


You can remove the blank lines using the grep command.
grep -v "^$" file.txt

13. Displaying the count of number of matches.


We can find the number of lines that match the given string/pattern.

grep -c "string" file.txt

14. Display the file names that matches the pattern.


We can just display the files that contains the given string/pattern.
grep -l "string" *

15. Display the file names that do not contain the pattern.
We can display the files which do not contain the matched string/pattern.
grep -L "string" *

16. Displaying only the matched pattern.


By default, grep displays the entire line which has the matched string. We can make the grep to display
only the matched string by using the -o option.
grep -o "string" file.txt

17. Displaying the line numbers.


We can make the grep command display the line number of each line which contains the matched string in
a file using the -n option.
grep -n "string" file.txt

18. Displaying the position of the matched string in the file


The -b option makes the grep command display the byte offset of each match within the file. Combined with -o, the offset is that of the matched string itself rather than of the line that contains it.
grep -o -b "string" file.txt
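The offsets are counted in bytes from the start of the file, beginning at 0. The sketch below assumes GNU grep and a made-up two-line ASCII file, so one character is one byte.

```shell
# Sample file: line 1 is 13 characters plus a newline, so line 2 starts at byte 14.
tmp=$(mktemp -d)
printf '%s\n' 'a string here' 'string again' > "$tmp/file.txt"

# -b alone reports the offset of the start of each matching line.
line_offsets=$(grep -b "string" "$tmp/file.txt")

# -o -b reports the offset of each matched string.
match_offsets=$(grep -o -b "string" "$tmp/file.txt")

printf '%s\n' "$line_offsets" "---" "$match_offsets"
rm -rf "$tmp"
```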

19. Matching the lines that start with a string


The ^ regular expression pattern specifies the start of a line. This can be used in grep to match the lines
which start with the given string or pattern.
grep "^start" file.txt

20. Matching the lines that end with a string


The $ regular expression pattern specifies the end of a line. This can be used in grep to match the lines
which end with the given string or pattern.
grep "end$" file.txt
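Both anchors can be tried on one sample file, and combined to match a whole line exactly. This is a sketch with made-up contents; single quotes are used so that the shell never tries to expand the $ character.

```shell
# Sample file, invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'start of the run' 'a late start' 'this is the end' 'end of the run' > "$tmp/file.txt"

starts=$(grep '^start' "$tmp/file.txt")                 # lines beginning with "start"
ends=$(grep 'end$' "$tmp/file.txt")                     # lines finishing with "end"
exact=$(grep -c '^start of the run$' "$tmp/file.txt")   # lines equal to the whole text

echo "$starts"
echo "$ends"
echo "$exact"
rm -rf "$tmp"
```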


1. Write a unix/linux cut command to print characters by position?


The cut command can be used to print characters in a line by specifying the position of the
characters. To print the characters in a line, use the -c option of the cut command.

cut -c4 file.txt


x
u
l

The above cut command prints the fourth character in each line of the file. You can print more than one
character at a time by specifying the character positions in a comma separated list, as shown in the
example below.
cut -c4,6 file.txt
xo
ui
ln

This command prints the fourth and sixth character in each line.
2. Write a unix/linux cut command to print characters by range?
You can print a range of characters in a line by specifying the start and end position of the characters.
cut -c4-7 file.txt
x or
unix
linu

The above cut command prints the characters from fourth position to the seventh position in each line.
To print the first six characters in a line, omit the start position and specify only the end position.

cut -c-6 file.txt


unix o
is uni
is lin

To print the characters from tenth position to the end, specify only the start position and omit the end
position.
cut -c10- file.txt
inux os
ood os
good os

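If you omit both the start and end positions, some older cut implementations print the entire line, but current GNU cut rejects a bare - as an invalid range. Starting the range at 1 prints the entire line portably.
cut -c1- file.txt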
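The character selections can be reproduced in one run. In the sketch below the contents of file.txt are an assumption: they are reconstructed from the outputs shown in the examples, not copied from the original file.

```shell
# Sample file reconstructed from the example outputs (assumed contents).
tmp=$(mktemp -d)
printf '%s\n' 'unix or linux os' 'is unix good os' 'is linux good os' > "$tmp/file.txt"

picked=$(cut -c4,6 "$tmp/file.txt")    # 4th and 6th characters
ranged=$(cut -c4-7 "$tmp/file.txt")    # characters 4 through 7
first6=$(cut -c-6 "$tmp/file.txt")     # start of line through character 6
from10=$(cut -c10- "$tmp/file.txt")    # character 10 through end of line
whole=$(cut -c1- "$tmp/file.txt")      # every character, i.e. the entire line

printf '%s\n' "$picked" "$ranged" "$first6" "$from10"
rm -rf "$tmp"
```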

3. Write a unix/linux cut command to print the fields using the delimiter?
You can use the cut command, much like the awk command, to extract the fields in a file using a delimiter. The
-d option in cut command can be used to specify the delimiter and -f option is used to specify the field
position.
cut -d' ' -f2 file.txt
or
unix
linux

This command prints the second field in each line by treating the space as delimiter. You can print more
than one field by specifying the position of the fields in a comma delimited list.
cut -d' ' -f2,3 file.txt
or linux
unix good
linux good

The above command prints the second and third field in each line.
Note: If the delimiter you specified does not exist in a line, then the cut command prints that entire line.
To suppress these lines, use the -s option in cut command.
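The effect of -s shows up as soon as one line lacks the delimiter. A sketch on a made-up colon-delimited file:

```shell
# Sample file: the middle line has no colon in it.
tmp=$(mktemp -d)
printf '%s\n' 'a:b:c' 'no delimiter here' 'x:y:z' > "$tmp/file.txt"

# Without -s, the line with no delimiter is passed through unchanged.
default_out=$(cut -d':' -f2 "$tmp/file.txt")

# With -s, lines that do not contain the delimiter are dropped.
suppressed=$(cut -s -d':' -f2 "$tmp/file.txt")

printf '%s\n' "$default_out" "---" "$suppressed"
rm -rf "$tmp"
```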
4. Write a unix/linux cut command to display range of fields?

You can print a range of fields by specifying the start and end position.
cut -d' ' -f1-3 file.txt

The above command prints the first, second and third fields. The same three fields can also be printed by
omitting the start position and specifying only the end position.
cut -d' ' -f-3 file.txt

To print the fields from the second field to the last field, you can omit the last field position.
cut -d' ' -f2- file.txt
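The field ranges can be checked the same way as the character ranges. The contents of file.txt below are an assumption, reconstructed from the outputs of the earlier examples.

```shell
# Sample file reconstructed from the earlier example outputs (assumed contents).
tmp=$(mktemp -d)
printf '%s\n' 'unix or linux os' 'is unix good os' 'is linux good os' > "$tmp/file.txt"

first3=$(cut -d' ' -f1-3 "$tmp/file.txt")   # fields 1 through 3
upto3=$(cut -d' ' -f-3 "$tmp/file.txt")     # same fields, start omitted
from2=$(cut -d' ' -f2- "$tmp/file.txt")     # field 2 through the last field

printf '%s\n' "$first3" "---" "$from2"
rm -rf "$tmp"
```

The selected fields are printed joined by the same delimiter that was used to split them.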

5. Write a unix/linux cut command to display the first field from /etc/passwd file?
The /etc/passwd is a delimited file and the delimiter is a colon (:). The cut command to display the first
field in /etc/passwd file is
cut -d':' -f1 /etc/passwd
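To try this without reading the real /etc/passwd, the same command can be run on a small file in the same colon-delimited layout. The file name passwd.sample and the two entries below are made up for illustration.

```shell
# Two invented entries in /etc/passwd layout (7 colon-separated fields).
tmp=$(mktemp -d)
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
    'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' > "$tmp/passwd.sample"

# Field 1 is the login name.
users=$(cut -d':' -f1 "$tmp/passwd.sample")

# Fields 1 and 7 together give the login name and the login shell.
user_shell=$(cut -d':' -f1,7 "$tmp/passwd.sample")

printf '%s\n' "$users" "---" "$user_shell"
rm -rf "$tmp"
```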

6. The input file contains the below text


> cat filenames.txt
logfile.dat
sum.pl
add_int.sh

Using the cut command, extract the portion after the dot.
First reverse the text in each line, cut the first field, and then reverse the result again to restore the character order.
rev filenames.txt | cut -d'.' -f1 | rev
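On a system where the rev command is not installed, the portion after the last dot can also be taken with the shell's own parameter expansion. This is a swapped-in technique rather than a cut solution, shown here as a sketch on the same three file names.

```shell
# Recreate the input file from the question.
tmp=$(mktemp -d)
printf '%s\n' 'logfile.dat' 'sum.pl' 'add_int.sh' > "$tmp/filenames.txt"

# ${name##*.} removes the longest leading part that ends in a dot,
# which leaves only the text after the last dot.
extensions=$(while IFS= read -r name; do
    printf '%s\n' "${name##*.}"
done < "$tmp/filenames.txt")

echo "$extensions"
rm -rf "$tmp"
```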

Compressing files under Linux or UNIX cheat sheet

Both Linux and UNIX include various commands for compressing and decompressing (that is, expanding compressed) files.

Compressing files

Syntax
gzip {filename}

bzip2 {filename}

zip {.zip-filename} {filename-to-compress}

tar -zcvf {.tgz-file} {files}


tar -jcvf {.tbz2-file} {files}

Decompressing files
Syntax
gzip -d {.gz file}
gunzip {.gz file}
bzip2 -d {.bz2-file}
bunzip2 {.bz2-file}
unzip {.zip file}
tar -zxvf {.tgz-file}
tar -jxvf {.tbz2-file}
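A round trip with gzip and tar shows how the compress and decompress forms pair up. This is a sketch; the file names notes.txt and backup.tgz are placeholders, and the -C option (supported by GNU tar and bsdtar) only tells tar which directory to work from.

```shell
# Throwaway working directory with one small file.
tmp=$(mktemp -d)
echo 'sample data' > "$tmp/notes.txt"

# gzip replaces notes.txt with notes.txt.gz; gzip -d (or gunzip) restores it.
gzip "$tmp/notes.txt"
after_gzip=$(ls "$tmp")
gzip -d "$tmp/notes.txt.gz"
after_gunzip=$(ls "$tmp")

# tar -zcvf creates a gzip-compressed archive; tar -zxvf extracts it.
tar -zcvf "$tmp/backup.tgz" -C "$tmp" notes.txt
mkdir "$tmp/restore"
tar -zxvf "$tmp/backup.tgz" -C "$tmp/restore"
restored=$(cat "$tmp/restore/notes.txt")

echo "$after_gzip / $after_gunzip / $restored"
rm -rf "$tmp"
```

Unlike gzip, tar leaves the original files in place when it creates the archive.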

List the contents of an archive/compressed file

Sometimes you just want to look at the files inside an archive or compressed file without extracting them. Most of the commands above have a listing option for this.
Syntax
gzip -l {.gz file}
unzip -l {.zip file}
tar -ztvf {.tar.gz}
tar -jtvf {.tbz2}
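The listing options can be tried on a small archive without extracting anything. In this sketch the archive name backup.tar.gz and its single member are placeholders; dropping -v from the tar listing prints the member names only.

```shell
# Build a small gzip-compressed tar archive to inspect.
tmp=$(mktemp -d)
echo 'sample data' > "$tmp/notes.txt"
tar -zcf "$tmp/backup.tar.gz" -C "$tmp" notes.txt

# Names only.
names=$(tar -ztf "$tmp/backup.tar.gz")

# With -v, each name is preceded by permissions, owner, size and date.
details=$(tar -ztvf "$tmp/backup.tar.gz")

# gzip -l reports the compressed and uncompressed sizes of the .gz file.
gzip -l "$tmp/backup.tar.gz"

echo "$names"
echo "$details"
rm -rf "$tmp"
```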
