ETL - Interview Questions & Answers - 2
SQL Queries (Null Values, Duplicate Records, Joins, Where, Group By, Count, NVL
Functions)
ETL Testing Concepts / Data Warehouse Concepts (SCD Type 1, 2, 3, Fact & Dimension
Tables, Star Schema, Snowflake Schema)
ETL Pre/Post Validations (Before/After Transformation)
Data Warehouse Project Workflow
Data Validations:
1st Level: Row Count, Null Values, Duplicates
2nd Level: Source-to-Target Validations
3rd Level (Data Model Level): Column Size, Date/Time Size
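A minimal sketch of these checks in SQL (src_customer, tgt_customer and customer_id are hypothetical names):
-- 1st level: row count comparison between source and target
select count(*) from src_customer;
select count(*) from tgt_customer;
-- null check on a key column
select count(*) from tgt_customer where customer_id is null;
-- duplicate check on the business key
select customer_id, count(*)
from tgt_customer
group by customer_id
having count(*) > 1;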
Big Data Project Workflow
HIVE Queries
Azure Cloud Testing
Azure Devops Test Plans(Epic, Bug, User Story, Task)
JIRA Tool Whole Procedure (How to raise a bug in JIRA)
Bug Life Cycle/ Software Testing Life Cycle
Testing: Regression Testing, Integration Testing, Retesting, Load Testing, Database
Testing, Functional Testing
Writing Test Cases/ Test Plans
Important Questions:
1) Tell me about yourself --- (Explain your years of experience, what kind of testing
you have experience in, which tools you have hands-on experience with, and finally
your roles and responsibilities as an ETL Tester.)
2) What is a Product Owner in Scrum?
Ans) A Product Owner is responsible for ensuring the success of a project in Scrum. The Product Owner
manages and optimizes the product backlog in order to maximize the value of the product. Scrum itself
is an Agile framework that facilitates communication and self-organization within a team.
(OR)
A Product Owner is part of the scrum team. The key responsibilities of a Product Owner are to define user
stories and create a product backlog. The Product Owner is the primary point of contact on behalf of the
customer to identify the product requirements for the development team.
JIRA Related Interview Questions:
4) What phases are there within a Sprint?
Ans) To Do --> In Progress --> Complete
5) What is an Epic in Agile?
Ans) In Agile, an epic is simply a collection of user stories. These stories are related to one another
and combine to form one large story. Epics can work across different teams and projects, but
they will be united under a broad banner label, known as a theme. An initiative groups
similar epics under one common objective within an organization.
Benefits of epics
Better organization
Improved time management
Clear client priorities
7) What options are available under Issue Type in the JIRA tool?
Ans) Epic, User Stories, Tasks, Sub Tasks, Bug, Test
8) Where do you execute Test Cases?
Ans) In the JIRA tool we have apps called Zephyr Squad and Xray.
Project Scenario Related Interview Questions:
The STTM (Source-to-Target Mapping) document is the blueprint of your ETL solution. In addition to
mapping the fields from source to target, the STTM document also captures other important
information, such as the transformation rules applied to each field.
9) Explain the Defect/Bug Life Cycle?
Ans) Defect Status or Bug Status in the defect life cycle is the present state of the defect or bug. The
goal of defect status is to precisely convey the current state or progress of a defect or bug in order to
better track and understand the actual progress of the defect life cycle.
The number of states that a defect goes through varies from project to project; the list below
covers all possible states.
New: When a defect is logged and posted for the first time, it is assigned the status
NEW.
Assigned: Once the bug is posted by the tester, the tester's lead approves the bug
and assigns it to the developer team.
Open: The developer starts analyzing and works on the defect fix.
Fixed: When the developer makes the necessary code change and verifies the change, he
or she can set the bug status to "Fixed."
Pending retest: Once the defect is fixed, the developer hands the code back to the tester
for retesting. Since the testing is still pending on the tester's end, the status
assigned is "Pending retest."
Retest: The tester retests the code at this stage to check whether the defect has been
fixed by the developer or not and changes the status to "Retest."
Verified: The tester re-tests the bug after it has been fixed by the developer. If no bug is
detected in the software, then the bug is fixed and the status assigned is "Verified."
Reopen: If the bug persists even after the developer has fixed it, the tester
changes the status to "Reopened." Once again the bug goes through the life cycle.
Closed: If the bug no longer exists, the tester assigns the status "Closed."
Duplicate: If the defect is repeated or corresponds to the same concept as another
bug, the status is changed to "Duplicate."
Rejected: If the developer feels the defect is not a genuine defect, they change the
status to "Rejected."
Deferred: If the present bug is not of prime priority and is expected to be fixed in
the next release, the status "Deferred" is assigned to such bugs.
Not a bug: If it does not affect the functionality of the application, the status
assigned to a bug is "Not a bug."
10) Where will you execute your Test Cases?
Ans) In the JIRA tool, using Zephyr Squad or Xray.
11) Who will execute Test Cases?
Ans) You (the tester) or the Team Lead.
12) Explain left join, right join, and the types of joins.
13) What is your project source and where is it located?
Ans)
14) What is the job, and who executes the job?
15) What is your company and where is it located?
16) To whom do you assign the job?
17) What is your source data and what type of data is it?
18) Explain the architecture of an ETL/DWH project.
19) How will you validate your source data and where will you validate it?
20) Where will you execute queries in your target system?
21) What kind of validations did you do in your project, and on which platform did you execute
those validations (Snowflake / HIVE Query Editor / Informatica Tool / Oracle)?
22) What type of databases are you using in your project?
11) How do you hide the columns that are not required for testing? (Views)
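A minimal sketch, assuming a hypothetical customer table: a view exposes only the columns approved for testing.
-- sensitive or unneeded columns are simply left out of the view
create view v_customer_test as
select customer_id, customer_name, city
from customer;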
12) WAQ (Write a Query) for RANK, ROW_NUMBER, DENSE_RANK
Scenarios: On Joins
Scenarios: Group By
Scenarios: Duplicates
Scenarios: Null Values
13) Difference between Rank and Dense Rank
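A quick SQL sketch for questions 12 and 13 (employees and salary are hypothetical names): RANK() leaves gaps after ties, DENSE_RANK() does not, and ROW_NUMBER() is always unique.
select emp_name, salary,
       rank()       over (order by salary desc) as rnk,
       dense_rank() over (order by salary desc) as dense_rnk,
       row_number() over (order by salary desc) as row_num
from employees;
-- for salaries 900, 900, 800: rank gives 1,1,3; dense_rank gives 1,1,2; row_number gives 1,2,3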
14) Write a query to join tables; types of joins.
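A short sketch of the common join types, assuming hypothetical emp and dept tables:
-- inner join: only matching rows
select e.emp_name, d.dept_name
from emp e
inner join dept d on e.dept_id = d.dept_id;
-- left join: all rows from emp, NULLs where dept has no match
select e.emp_name, d.dept_name
from emp e
left join dept d on e.dept_id = d.dept_id;
-- right join: all rows from dept, NULLs where emp has no match
select e.emp_name, d.dept_name
from emp e
right join dept d on e.dept_id = d.dept_id;
-- FULL OUTER JOIN returns all rows from both sides; CROSS JOIN returns the Cartesian product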
15) What are constraints?
16) How do you identify duplicate values?
17) WAQ to get the last 500 records out of 1000 records.
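One possible approach using ROW_NUMBER(), assuming an id column that defines the record order (my_table is a placeholder name):
select *
from (
  select t.*, row_number() over (order by id desc) as rn
  from my_table t
) x
where rn <= 500;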
18) Find out duplicate records.
-- completing the truncated query (employees is a hypothetical table):
select emp_id, count(*) from employees group by emp_id having count(*) > 1;
What is OLAP?
Online Analytical Processing (OLAP) is a category of software tools that provide analysis of data
for business decisions. OLAP systems allow users to analyze information from multiple
database systems at one time.
What is OLTP?
Online Transaction Processing (OLTP) supports transaction-oriented applications in a
3-tier architecture. OLTP administers the day-to-day transactions of an
organization.
Example of OLAP
Any data warehouse system is an OLAP system. Uses of OLAP are as follows:
A company might compare their mobile phone sales in September with sales in October,
then compare those results with another location, which may be stored in a separate
database.
Amazon analyzes purchases by its customers to come up with a personalized
homepage with products that are likely to interest the customer.
Example of OLTP
A classic example is an ATM withdrawal from a joint account, where two account holders try to
withdraw the full amount at the same time. The person who completes the authentication process
first will be able to get the money. In this case, the OLTP system makes sure that the withdrawn
amount is never more than the amount present in the bank. The key point to note here is that
OLTP systems are optimized for transaction processing rather than data analysis.
Other examples of OLTP applications are:
Online banking
Online airline ticket booking
Sending a text message
Order entry
Adding a book to a shopping cart
Key differences between OLTP and OLAP:
- OLTP manages database modification; OLAP queries the database to support analysis and
decision making.
- OLTP is characterized by a large number of short online transactions; OLAP is characterized
by complex queries over large volumes of data.
- OLTP systems are the data source for OLAP.
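To make the contrast concrete, a hedged sketch (accounts and orders are hypothetical tables; transaction syntax varies by database): an OLTP statement touches one row inside a short transaction, while an OLAP query aggregates history.
-- OLTP: short, single-row transaction
begin transaction;
update accounts set balance = balance - 100 where account_id = 42;
commit;
-- OLAP: aggregate analysis over a large history
select region, extract(year from order_date) as yr, sum(order_amount) as total_sales
from orders
group by region, extract(year from order_date);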
Informatica transformation types (name and description):
Input: A passive transformation that passes data into a mapplet. Can be used in a mapplet, but not in a
mapping.
Java: Executes user logic coded in Java. Can be active or passive.
Joiner: An active transformation that joins data from two sources.
Labeler: A passive transformation that adds a labeler asset that you created in Data Quality to a
mapping or mapplet. Use a labeler asset to identify the types of information in an input field and to
assign labels for each type to the data.
Lookup: Looks up data from a lookup object. Defines the lookup object and lookup connection. Also
defines the lookup condition and the return values. A passive Lookup transformation returns one row.
An active Lookup transformation returns more than one row.
Machine Learning: Runs a machine learning model and returns predictions to the mapping.
Mapplet: Inserts a mapplet into a mapping or another mapplet. A mapplet contains transformation logic
that you can create and use to transform data before it is loaded into the target. Can be active or
passive based on the transformation logic in the mapplet.
Normalizer: An active transformation that processes data with multiple-occurring fields and returns a
row for each instance of the multiple-occurring data.
Output: A passive transformation that passes data from a mapplet to a downstream transformation.
Can be used in a mapplet, but not in a mapping.
Parse: A passive transformation that adds a parse asset that you created in Data Quality to a mapping
or mapplet. Use a parse asset to parse the words or strings in an input field into one or more discrete
output fields based on the types of information that the words or strings contain.
Python: Runs Python code that defines transformation functionality. Can be active or passive.
Rank: An active transformation that limits records to a top or bottom range.
Router: An active transformation that you can use to apply a condition to incoming data.
Rule Specification: A passive transformation that adds a rule specification asset that you created in
Data Quality to a mapping or mapplet. Use a rule specification asset to apply the data requirements of
a business rule to a data set.
Sequence Generator: A passive transformation that generates a sequence of values.
Sorter: A passive transformation that sorts data in ascending or descending order, according to a
specified sort condition.
SQL: Calls a stored procedure or function or executes a query against a database. Passive when it calls
a stored procedure or function. Can be active or passive when it processes a query.
Structure Parser: A passive transformation that analyzes unstructured data from a flat file source and
writes the data in a structured format.
Transaction Control: An active transformation that commits or rolls back sets of rows during a
mapping run.
Union: An active transformation that merges data from multiple input groups into a single output
group.
Velocity: A passive transformation that executes a Velocity script to convert JSON or XML hierarchical
input from one format to another without flattening the data.
Verifier: A passive transformation that adds a verifier asset that you created in Data Quality to a
mapping or mapplet. Use a verifier asset to verify and enhance postal address data.
Web Services: An active transformation that connects to a web service as a web service client to
access, transform, or deliver data.
3) What is Informatica?
Ans) Informatica is a data integration tool based on ETL architecture. It provides
data integration software and services for various businesses.
(OR)
Informatica is a data processing tool that is widely used for ETL: extract, transform,
and load processing.
The latest version of Informatica PowerCenter available (at the time of writing) is 9.6.0. The
different editions of PowerCenter are:
Standard edition
Advanced edition
Premium edition
4) What is Normalisation?
Ans)
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize redundancy from a relation or set of relations. It is also used
to eliminate undesirable characteristics like Insertion, Update, and Deletion Anomalies.
o Normalization divides the larger table into smaller tables and links them using relationships.
o Normal forms are used to reduce redundancy in the database table.
The main reason for normalizing relations is removing these anomalies. Failure to eliminate anomalies
leads to data redundancy and can cause data integrity and other problems as the database grows.
Normalization consists of a series of guidelines that help you create a good database
structure.
Types of Normal Forms:
Normalization works through a series of stages called normal forms. The normal forms apply to
individual relations. A relation is said to be in a particular normal form if it satisfies that form's
constraints.
1NF: A relation is in 1NF if it contains only atomic (indivisible) values.
2NF: A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent
on the primary key.
3NF: A relation will be in 3NF if it is in 2NF and no non-key attribute is transitively dependent on the
primary key.
BCNF: A stronger version of 3NF; a relation is in Boyce-Codd normal form if, for every functional
dependency, the determinant is a super key.
4NF: A relation will be in 4NF if it is in Boyce-Codd normal form and has no multi-valued
dependency.
5NF: A relation is in 5NF if it is in 4NF and does not contain any join dependency; joining should be
lossless.
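As a small illustration (hypothetical tables): splitting a repeated department name out of an employee table removes the redundancy that normalization targets.
-- unnormalized: dept_name repeated on every employee row
-- employee(emp_id, emp_name, dept_id, dept_name)
-- normalized: split into two related tables
create table dept (
  dept_id   int primary key,
  dept_name varchar(50)
);
create table employee (
  emp_id   int primary key,
  emp_name varchar(50),
  dept_id  int references dept(dept_id)
);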
Advantages of Normalization
o Normalization helps to minimize data redundancy.
o Greater overall database organization.
o Data consistency within the database.
o Much more flexible database design.
o Enforces the concept of relational integrity.
5) What is Redundancy?
Ans) Redundancy means having multiple copies of the same data in the database. This
problem arises when a database is not normalized.
(OR)
Data redundancy refers to the practice of keeping data in two or more places within a
database or data storage system. Data redundancy ensures an organization can
provide continued operations or services in the event something happens to its data
-- for example, in the case of data corruption or data loss. The concept applies to
areas such as databases, computer memory and file storage systems.
Star Schema vs Snowflake Schema:
In a star schema, queries take less time to execute; in a snowflake schema, queries take more
time to execute than in a star schema.
A lookup table is a reference table: it supplies values to a referencing table at run time.
Lookup tables are also used extensively to validate input values by matching them against a list of
valid (or invalid) items.
Lookup tables act as a key-value store for faster lookups by key (e.g. use a lookup table to get all
users who clicked on a specific ad in a specific timeframe).
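A minimal sketch of validating input against a lookup table (stg_orders and lkp_country are hypothetical names):
-- flag staging rows whose country code is not in the lookup table
select s.*
from stg_orders s
left join lkp_country c on s.country_code = c.country_code
where c.country_code is null;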
12) Explain Data Warehouse Architectures.
1) Centralized or Enterprise Architecture
2) Federated Architecture
3) Multi-Tiered Architecture
12) What is a Fact Table & Dimension Table?
Ans) In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.
(or)
A fact table consists of facts of a particular business process.
Facts are also known as measurements or metrics.
Keys: A fact table has a primary key which is the accumulation (concatenation) of the primary keys of
all the dimension tables linked with it. That key is known as a concatenated key and helps to uniquely
identify each row of the fact table.
Fact Table Grain: The grain of a table describes the level of detail, or the depth of the information,
that is contained in that table. The finer the grain, the more detailed the information in the table.
Additivity: Measures are either fully additive or semi-additive. Fully additive measures can be
summed across all of the dimensions; semi-additive measures can be summed across some of the
dimensions but not all.
Sparse Data: There are records that have attributes or measures containing null values.
Types of Fact Table
Fact tables are categorized under three fundamental measurement events:
Transactional
Periodic Snapshot
Accumulating Snapshots
Transactional: A transaction fact table records the occurrence of an event at an instantaneous point in
time. The facts measured are valid only for that particular instant and only for that event. The grain
associated with a transaction fact table is "one row per line in a transaction". It usually contains data
at the most detailed level, which leads it to have a large number of dimensions associated with it.
Because it captures measurements at the most basic or atomic level of dimension, it gives users
robust dimensional grouping, roll-up and drill-down reporting capabilities. It can be dense or sparse,
and it can be large, maybe containing billions of records.
Periodic Snapshot: A periodic snapshot fact table captures the state of the business "at a moment". It
normally includes more non-additive and semi-additive facts. It helps to review the cumulative
performance of the business at regular and predictable intervals of time. The performance of an
activity at the end of each day, week, month or other time interval is represented, unlike the
transaction fact table where a new row is added for the occurrence of every event. Periodic snapshot
tables depend on the transaction fact table for the detailed data they summarize. They are mostly
dense and can be as large as transaction fact tables.
Accumulating Snapshot: An accumulating snapshot fact table describes the activity of a business
process that has a clear starting and end. Accumulating snapshots mostly have multiple date stamps
that represent the predictable phases or events that occur during the lifetime of the process.
Sometimes there is an extra column containing the date when the row was last updated.
13) Explain the types of Facts?
(Additive, Semi-Additive, Non-Additive)
Types of Facts
There are three types of facts:
1. Additive (summative) facts: Additive facts can be used with aggregation functions such as sum(),
average(), etc.
2. Semi-additive (semi-summative) facts: Only a small number of aggregation functions can be
applied to semi-additive facts.
For example, consider bank account details. We cannot apply sum() to a bank balance, as that will not
give useful results, but the minimum() and maximum() functions return useful information.
3. Non-additive facts: We cannot use numerical aggregation functions such as sum() or average() on
non-additive facts. For non-additive facts, a ratio or percentage is used.
1. Additive Type:
Examples: product-wise sales, daily sales.
2. Semi-Additive Type:
Measurements in a fact table that can be summed up across only a few dimension keys. For example,
a table used to record the current balance and profit margin for each account id at a particular instant
of time (day end): we cannot sum up the current balance across Acct Id. If we ask for the balance of
Id 21653, we will say 22000, not 22000 + 80000 (the sum with another account's balance).
3. Non-Additive Fact:
Note: A fact table without any measures is called a factless fact table.
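A small SQL illustration (sales and daily_balance are hypothetical tables): sales amounts are additive across any dimension, while balances should be taken at a single snapshot date rather than summed over time.
-- additive: summing sales across any dimension is meaningful
select product_id, sum(sale_amount)
from sales
group by product_id;
-- semi-additive: summing balances across time is not meaningful;
-- take the balance at a specific snapshot date instead
select acct_id, balance
from daily_balance
where snapshot_date = date '2020-01-31';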
SCD types summary (Type / Title / Allows Updates / Preserves History / Description):
Type 2: Add a new record. Allows updates: yes. Preserves history: yes. Track changes by creating
multiple records (e.g. ValidFrom / ValidTo columns).
Type 3: Add a new field. Allows updates: yes. Preserves history: partial. Track changes using separate
columns (e.g. CurrentValue / PreviousValue).
Very simply, there are six commonly cited types of Slowly Changing Dimension; the three most widely used (Types 1, 2 and 3) are described below:
SCD Type 1
SCD type 1 methodology is used when there is no need to store historical data in the dimension
table. This method overwrites the old data in the dimension table with the new data. It is used to
correct data errors in the dimension.
Surrogate_Key  Customer_ID  Customer_Name  Location
------------------------------------------------
1              1            Mark           Chicago
Here the customer location is Chicago, and the customer has moved to another location, New York. If
you use the Type 1 method, it simply overwrites the data. The updated table will be:
Surrogate_Key  Customer_ID  Customer_Name  Location
------------------------------------------------
1              1            Mark           New York
The advantage of Type 1 is ease of maintenance and less space occupied. The disadvantage is that
no historical data is kept in the data warehouse.
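A minimal SCD Type 1 sketch in SQL (dim_customer and stg_customer are hypothetical names): the changed attribute is simply overwritten in place.
-- overwrite the location with the latest value from staging; no history is kept
update dim_customer d
set location = (select s.location from stg_customer s where s.customer_id = d.customer_id)
where exists (select 1 from stg_customer s where s.customer_id = d.customer_id);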
SCD Type 2
SCD Type 2 stores the entire history of the data in the dimension table. With Type 2 we can store
unlimited history in the dimension table. In Type 2, you can store the data in three different ways:
Versioning
Flagging
Effective Date
SCD Type 2 Versioning
As an example, let's use the same example of a customer who changes location. Initially the
customer is in the Illinois location, and the data in the dimension table will look like:
Surrogate_Key  Customer_ID  Customer_Name  Location  Version
--------------------------------------------------------
1              1            Marston        Illinois  1
The customer moves from Illinois to Seattle and the version number is incremented. The
dimension table will look like:
Surrogate_Key  Customer_ID  Customer_Name  Location  Version
--------------------------------------------------------
1              1            Marston        Illinois  1
2              1            Marston        Seattle   2
Now again, if the customer moves to another location, a new record will be inserted into the
dimension table with the next version number.
SCD Type 2 Flagging
In the flagging method, a flag column is created in the dimension table. The current record has
the flag value 1 and the previous records have the flag value 0.
Surrogate_Key  Customer_ID  Customer_Name  Location  Flag
--------------------------------------------------------
1              1            Marston        Illinois  1
Now when the customer moves to a new location, the old record is updated with flag value 0
and the latest record gets flag value 1:
Surrogate_Key  Customer_ID  Customer_Name  Location  Flag
--------------------------------------------------------
1              1            Marston        Illinois  0
2              1            Marston        Seattle   1
SCD Type 2 Effective Date
In the effective date method, Start_Date and End_Date columns track the period during which each
version of the record was active. For example:
Surrogate_Key  Customer_ID  Customer_Name  Location  Start_Date   End_Date
-------------------------------------------------------------------------
1              1            Marston        Illinois  01-Jan-2020  31-May-2020
2              1            Marston        Seattle   01-Jun-2020  NULL
The NULL in the End_Date indicates the current version of the data, and the remaining records
indicate the past data.
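A hedged SQL sketch of a Type 2 load using the effective-date style (table and column names follow the hypothetical example above): expire the current row, then insert the new version.
-- 1) close out the current record for the changed customer
update dim_customer
set end_date = current_date
where customer_id = 1
  and end_date is null;
-- 2) insert the new version as the current record
insert into dim_customer (surrogate_key, customer_id, customer_name, location, start_date, end_date)
values (3, 1, 'Marston', 'New York', current_date, null);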
SCD Type 3
In the Type 3 method, only the current status and the previous status of the row are maintained in the
table. To track these changes, two separate columns are created in the table. The customer dimension
table in the Type 3 method will look like (after the move from Illinois to Seattle):
Customer_ID  Customer_Name  Current_Location  Previous_Location
--------------------------------------------------------------------------
1            Marston        Seattle           Illinois
Now again, if the customer moves from Seattle to New York, the updated table will be:
Customer_ID  Customer_Name  Current_Location  Previous_Location
--------------------------------------------------------------------------
1            Marston        New York          Seattle
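A minimal Type 3 sketch (same hypothetical columns): shift the current value into the previous column, then overwrite.
update dim_customer
set previous_location = current_location,
    current_location  = 'New York'
where customer_id = 1;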
3) Explain how you have been doing your Quality Assurance process for your ETL/DWH
project?
ETL testing is quite different from conventional testing. There are many challenges we
faced while performing data warehouse testing. Here is a list of a few ETL testing
challenges I experienced on my project:
- Incompatible and duplicate data.
- Loss of data during the ETL process.
- Unavailability of an inclusive test bed.
- Testers have no privileges to execute ETL jobs on their own.
- The volume and complexity of the data are very high.
- Faults in business processes and procedures.
- Trouble acquiring and building test data.
- Missing business flow information.
ETL Tester Responsibilities:
QA will validate whether all column names match the PDM document or not.
Expected Result: Column names in the table should match the PDM document.
Actual Result: After test case execution we have to specify the actual result.
Source SQL: the PDM document acts as the source here.
Target SQL: show table tablename;
9: Scenario Check:
We need to do this check based on business knowledge. Example 1: Every policy should have only one
policy term as an open record; we need to write a test case to validate this scenario. Example 2: The
load timestamp should be lower than the update timestamp; we need to write a test case to validate
this scenario.
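Hedged sketches of the two scenario checks above (policy_term, policy, load_ts and update_ts are hypothetical names):
-- Example 1: policies with more than one open policy term (should return no rows)
select policy_id, count(*)
from policy_term
where record_status = 'OPEN'
group by policy_id
having count(*) > 1;
-- Example 2: rows where the load timestamp is not lower than the update timestamp (should return no rows)
select *
from policy
where load_ts > update_ts;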
Test Planning
a) Changes in the scrum team, if required
b) Identify tactical risks
Note: Please share interview questions with me once you complete your interview. I
will add those questions, as they could be helpful for future candidates.