DWM Unit 1. Introduction To Data Warehousing
Padmashri Dr. Vitthalrao Vikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030
CM6I DWM 22621
Prof. V. D. Vaidya, CM Dept.
*Data warehousing:
Data warehousing (DW) is the process of collecting and managing business data from varied
sources to provide meaningful business insights.
A data warehouse is an information system that contains historical and cumulative data from
single or multiple sources.
A data warehouse is typically used to connect and analyse business data from
heterogeneous (different by nature) sources.
It simplifies the reporting and analysis process of the organization.
It also serves as a single version of truth for a company's decision making and forecasting.
1. Subject-Oriented
2. Integrated
3. Time-variant
4. Non-volatile
1. Subject-Oriented:
A data warehouse is subject-oriented because it offers information about a theme instead of
a company's ongoing operations. These subjects can be sales, marketing, distribution, etc.
A data warehouse never focuses on the ongoing operations. It concentrates on modelling
and analysis of data for decision making.
It also provides a simple and concise (summarized) view of the specific subject by
excluding data that does not help to support the decision process.
2. Integrated:
In a data warehouse, integration means establishing a common unit of measure for
all similar data from dissimilar databases.
A data warehouse is developed by integrating data from varied sources such as mainframes,
relational databases, flat files, etc. and converting all of it into a single standard format.
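The idea of a common unit of measure can be illustrated with a minimal sketch. The two source lists, the field names and the choice of kilograms as the standard unit are all assumptions made for this example; they do not come from any particular system.

```python
# Hypothetical records from two source systems that store weight differently.
SOURCE_A = [{"product": "P1", "weight": 2.0, "unit": "kg"}]
SOURCE_B = [{"product": "P2", "weight": 500.0, "unit": "g"}]

# Conversion factors to the assumed standard unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001}


def integrate(*sources):
    """Merge records from all sources, expressing every weight in kilograms."""
    merged = []
    for source in sources:
        for row in source:
            merged.append({
                "product": row["product"],
                "weight_kg": row["weight"] * TO_KG[row["unit"]],
            })
    return merged


warehouse_rows = integrate(SOURCE_A, SOURCE_B)
```

After integration every row carries the same field (`weight_kg`), so rows from different sources can be compared and summed directly.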
3. Time-Variant:
The data collected in a data warehouse is identified with a particular period (time) and offers
information from a historical point of view.
Another aspect of time variance is that once data is inserted in the warehouse, it can't be
updated or changed.
4. Non-volatile:
A data warehouse is also non-volatile, meaning that previous data is not erased when new data
is entered. Data is read-only and periodically refreshed.
This also helps to analyse historical data and understand what happened and when. Activities
like delete, update, and insert, which are performed in an operational application environment,
are omitted in the data warehouse environment.
The only two types of data operations performed in data warehousing are data loading and
data access.
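The restriction to loading and access can be sketched as a small class that simply offers no update or delete method. The class name `NonVolatileStore` and its methods are hypothetical and exist only for illustration.

```python
class NonVolatileStore:
    """Toy store that only supports loading and reading, never update or delete."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends; rows already stored are never overwritten or erased.
        self._rows.extend(dict(r) for r in rows)

    def access(self, predicate=lambda row: True):
        # Return copies so that callers cannot modify the stored history.
        return [dict(r) for r in self._rows if predicate(r)]


store = NonVolatileStore()
store.load([{"year": 2022, "sales": 100}])
store.load([{"year": 2023, "sales": 120}])
history = store.access()
```

The second load adds to the first instead of replacing it, so `history` still contains the 2022 row.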
Parameters      Operational Database System             Data Warehouse
Applications    OLTP (Online Transaction Processing)    OLAP (Online Analytical Processing)
1. Improving integration
2. Speeding up response times
3. Faster and more flexible reporting
4. Recording changes to build history
5. Increasing data quality
6. Unburdening the IT department (no need to track data regularly)
7. Increasing recognizability
8. Increasing findability
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs.
Analytical Processing
It supports various online analytical processing operations such as drill-down, roll-up, and
pivoting. The historical data is processed in both summarized and detailed formats.
Data Mining
It helps in the analysis of hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization
tools.
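The analytical operations named above (roll-up, drill-down and pivoting) can be sketched on a tiny fact table. The sales figures and the function names below are invented for illustration; a real OLAP server would run such operations over much larger, pre-aggregated data.

```python
from collections import defaultdict

# Hypothetical detailed sales facts: (year, quarter, region, amount).
FACTS = [
    (2023, "Q1", "East", 100),
    (2023, "Q2", "East", 150),
    (2023, "Q1", "West", 80),
    (2024, "Q1", "East", 120),
]


def roll_up(facts):
    """Summarize quarter-level detail up to yearly totals."""
    totals = defaultdict(int)
    for year, _quarter, _region, amount in facts:
        totals[year] += amount
    return dict(totals)


def drill_down(facts, year):
    """Go from one yearly total back to its quarter-level detail."""
    detail = defaultdict(int)
    for y, quarter, _region, amount in facts:
        if y == year:
            detail[quarter] += amount
    return dict(detail)


def pivot(facts):
    """Rotate the view: regions become rows, years become columns."""
    table = defaultdict(lambda: defaultdict(int))
    for year, _quarter, region, amount in facts:
        table[region][year] += amount
    return {region: dict(cols) for region, cols in table.items()}
```

Here `roll_up(FACTS)` gives the summarized view, `drill_down(FACTS, 2023)` gives the detailed view for one year, and `pivot(FACTS)` shows the same totals arranged by region and year.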
1. Airline:
Analysis of crew assignments, flight data, flight routes, fares.
2. Banking:
Analysis of customer data, transactions, loans, accounts, KYC.
3. Healthcare:
Generating patients' treatment reports and sharing data with tie-in insurance companies,
medical aid services, etc.
4. Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps government
agencies to maintain and analyse tax records and health policy records for every individual.
6. Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps
to track items, customer buying patterns and promotions, and is used for determining pricing policy.
7. Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions and
distribution decisions.
8. Hospitality Industry:
This industry uses data warehouse services to design and evaluate its advertising and
promotion campaigns, targeting clients based on their feedback and travel
patterns.
Single-tier architecture
The objective of a single-tier architecture is to minimize the amount of data stored by removing
data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-tier architecture physically separates the available sources from the data warehouse. This
architecture is not expandable and does not support a large number of end-users. It also
has connectivity problems because of network limitations.
Three-tier architecture:
This is the most widely used architecture.
It consists of the Top, Middle and Bottom Tier.
Fig: Three-Tiered Data Warehouse Architecture
(figure labels: Database, Metadata, Raw Data, Summary Data, Analytic Tool, Report Tool, Data Mining Tool)
1. Bottom Tier:
The bottom tier is the database of the data warehouse server, fed by the data sources.
It is usually a relational database system.
2. Middle Tier:
The middle tier in a data warehouse is an OLAP server, which is implemented using
either the ROLAP or the MOLAP model.
This application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end-user and the database.
3. Top-Tier:
The top tier is a front-end client layer.
It consists of the tools and APIs that users use to get data out of the data warehouse.
These tools include query tools, reporting tools, managed query tools, analysis
tools and data mining tools.
2. Data Marts:
A data mart is a subset of the data warehouse.
It is specially designed for a particular line of business, such as sales or
finance. In an independent data mart, data can be collected directly from sources.
Due to the large amount of data, a single warehouse can become overburdened.
So, to prevent the warehouse from becoming impossible to navigate, subdivisions are created,
called data marts.
These data marts divide the information saved in the warehouse into categories for specific
groups of users.
In simple words, a data mart is a subsidiary of a data warehouse.
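The relationship between a warehouse and a dependent data mart can be sketched as a simple filter. The `WAREHOUSE` rows, the `dept` field and the function `build_data_mart` are hypothetical names used only for this example.

```python
# Hypothetical warehouse rows covering several lines of business.
WAREHOUSE = [
    {"dept": "sales", "item": "order-1", "amount": 250},
    {"dept": "finance", "item": "invoice-7", "amount": 900},
    {"dept": "sales", "item": "order-2", "amount": 125},
]


def build_data_mart(warehouse, dept):
    """Return only the rows that belong to one line of business."""
    return [row for row in warehouse if row["dept"] == dept]


sales_mart = build_data_mart(WAREHOUSE, "sales")
```

Users of the sales mart see only the sales rows and never have to navigate the finance data.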
3. Virtual Warehouse:
A virtual warehouse is essentially a separate business database, which contains only the
data required for an operational system.
The data found in a virtual warehouse is usually copied from multiple sources throughout an
operational system.
A virtual warehouse is used to search the data quickly, without accessing the entire system.
It speeds up the overall access process.
1. Extraction:
The first step of the ETL process is extraction.
In this step, data from various source systems, which can be in various formats such as
relational databases, NoSQL, XML and flat files, is extracted into the staging area.
The extracted data cannot be loaded directly into the data warehouse because it arrives in
different formats; therefore, this is one of the most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation.
In this step, a set of rules or functions is applied to the extracted data to convert it into
a single standard format.
3. Loading:
The third and final step of the ETL process is loading.
In this step, the transformed data is finally loaded into the data warehouse.
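The three ETL steps can be sketched end to end with the Python standard library. The source strings, the `sales` table and the in-memory SQLite database standing in for the warehouse are assumptions made for this example, not part of any real ETL tool.

```python
import csv
import io
import json
import sqlite3

# Hypothetical source data arriving in two different formats.
CSV_SOURCE = "id,amount\n1,10.50\n2,20.00\n"
JSON_SOURCE = '[{"id": 3, "amount": "7.25"}]'


def extract():
    """Pull raw rows from each source into a staging list."""
    staging = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    staging.extend(json.loads(JSON_SOURCE))
    return staging


def transform(staging):
    """Apply one rule to every row: integer id, float amount."""
    return [(int(row["id"]), float(row["amount"])) for row in staging]


def load(rows, conn):
    """Write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The staging list plays the role of the staging area: raw rows keep their source formatting there, and only the transformed tuples reach the warehouse table.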
ETL Pipelining:
The ETL process can also use the pipelining concept, i.e. as soon as some data is extracted, it
can be transformed, and during that period some new data can be extracted. And while the
transformed data is being loaded into the data warehouse, the already extracted data can be
transformed.
ETL Tools: The most commonly used ETL tools are Sybase, Oracle Warehouse Builder,
CloverETL and MarkLogic.
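The pipelining idea can be sketched with Python generators, where each record moves through all three stages before the next record is extracted. This is only an approximation: generators interleave the stages record by record rather than running them at the same time, and a real pipeline would typically use threads or separate processes. All names below are hypothetical.

```python
events = []  # records the order in which the stages run


def extract(source):
    for record in source:
        events.append(f"extract {record}")
        yield record


def transform(records):
    for record in records:
        events.append(f"transform {record}")
        yield record.upper()


def load(records, warehouse):
    for record in records:
        events.append(f"load {record}")
        warehouse.append(record)


warehouse = []
load(transform(extract(["a", "b"])), warehouse)
```

Inspecting `events` afterwards shows that the first record is already loaded before the second one is extracted, so no stage waits for the whole data set.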
*Metadata Repository:
(Metadata: data about data; repository: big container)
Metadata is information about the structures that contain the actual data.
Metadata may describe the structure of any
data, of any subject, stored in any format.
A metadata repository holds the structures of all data in one place.
This provides plenty of information, more than is required, for decision making.
With this one-stop information, a business has more control over changes and can do
impact analysis.
A metadata repository is used for building, maintaining and managing the data warehouse.
For example, a line in a sales database may contain: 4030 KJ732 299.90
This data is meaningless until we consult the metadata that tells us what it represents.
The metadata of this data is:
• Model number: 4030
• Sales Agent ID: KJ732
• Total sales amount: $299.90
Therefore, metadata is an essential ingredient in the transformation of data into knowledge.
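The sales-line example can be sketched in code: the raw line only becomes a meaningful record once metadata supplies a name and type for each position. The field names follow the example above; the `SALES_METADATA` list and the `interpret` function are illustrative assumptions.

```python
# Metadata: a name and a type for each position in the raw line.
SALES_METADATA = [
    ("model_number", str),
    ("sales_agent_id", str),
    ("total_sales_amount", float),
]


def interpret(line, metadata):
    """Attach field names and types from the metadata to a raw line."""
    values = line.split()
    return {name: cast(value) for (name, cast), value in zip(metadata, values)}


record = interpret("4030 KJ732 299.90", SALES_METADATA)
```

Without `SALES_METADATA` the three values are just strings; with it, `record` can be queried by field name.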
1. Technical Metadata: This kind of metadata contains information about the warehouse
that is used by data warehouse designers and administrators.
2. Business Metadata: This kind of metadata contains details that give end-users an
easy way to understand the information stored in the data warehouse.
2. Saves time
A data warehouse standardizes, preserves, and stores data from different sources,
integrating all the data in one place.
So, all critical data is available to all users simultaneously.
2 Marks Questions:
Annexure 1
Metadata Repository:
Subjects (departments)          All departments in an organization    Specific department
Implementation time required    More                                  Less