DWM Unit 1. Introduction To Data Warehousing

Padmashri Dr. Vitthalrao Vikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030
CM6I DWM 22621
Prof. V. D. Vaidya, CM Dept.

Unit 1: Introduction to Data Warehousing (08 Marks)

*Data warehousing:
Data warehousing (DW) is the process of collecting and managing business data from varied
sources to provide meaningful business insights.
A data warehouse is an information system that contains historical and cumulative data from
a single source or multiple sources.
A data warehouse is typically used to connect and analyse business data from
heterogeneous (different by nature) sources.
It simplifies the reporting and analysis process of the organization.
It also serves as a single version of truth for a company's decision making and forecasting.

Characteristics of Data warehousing:

A data warehouse has the following characteristics:

1. Subject-Oriented
2. Integrated
3. Time-variant
4. Non-volatile

1. Subject-Oriented:
A data warehouse is subject oriented because it offers information about a theme instead of
a company's ongoing operations. These subjects can be sales, marketing, distribution, etc.
A data warehouse never focuses on the ongoing operations. It concentrates on modelling
and analysis of data for decision making.
It also provides a simple and concise (summarized) view of the specific subject by
excluding data that does not help to support the decision process.

2. Integrated:
In a data warehouse, integration means the establishment of a common unit of measure for
all similar data from dissimilar databases.
A data warehouse is developed by integrating data from varied sources such as mainframes,
relational databases, flat files, etc., and converting all of the data into a single standard format.

This integration helps in effective analysis of data.

Example: There are three different applications labelled A, B and C.
In Application A, the gender field stores values such as M or F.
In Application B, the gender field is a numerical value.
In Application C, the gender field is stored in the form of a character value.
However, after the transformation and cleaning process, all of this data is stored in a common format
in the data warehouse.
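
The conversion in this example can be sketched in Python as follows. The notes only say that the three applications represent gender differently, so the concrete source codes (1/0 for Application B, "male"/"female" for Application C), the target format "M"/"F" and the name `standardize_gender` are assumptions made for illustration.

```python
# Hypothetical per-application mappings to one warehouse format ("M"/"F").
# The concrete source codes below are assumed for illustration only.
GENDER_MAPS = {
    "A": {"M": "M", "F": "F"},          # Application A: already M / F
    "B": {1: "M", 0: "F"},              # Application B: numeric codes (assumed)
    "C": {"male": "M", "female": "F"},  # Application C: character strings (assumed)
}

def standardize_gender(application, value):
    """Return the common warehouse code, or None if the value is unknown."""
    if isinstance(value, str):
        value = value.strip()
        if application == "C":
            value = value.lower()
    return GENDER_MAPS[application].get(value)

records = [("A", "F"), ("B", 1), ("C", "Female")]
standardized = [standardize_gender(app, value) for app, value in records]
print(standardized)  # ['F', 'M', 'F']
```

Unknown values come back as None so that they can be reviewed during cleaning instead of being loaded silently.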

3. Time-Variant:
The data in a data warehouse is collected for a particular period (time) and offers information
from a historical point of view.
Another aspect of time variance is that once data is inserted in the warehouse, it can't be
updated or changed.

4. Non-volatile:
A data warehouse is also non-volatile, which means that previous data is not erased when new data
is entered in it. Data is read-only and periodically refreshed.
This also helps to analyse historical data and understand what happened and when. Activities
like delete, update, and insert, which are performed in an operational application environment,
are omitted in the data warehouse environment.
Only two types of data operations are performed in data warehousing: data loading and
data access.
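
The time-variant and non-volatile properties can be illustrated with a small append-only store. This is a minimal sketch, not a real warehouse: the class name `AppendOnlyWarehouse`, its methods and the sample rows are hypothetical, and the store deliberately offers only the two operations named above (loading and access).

```python
from datetime import date

class AppendOnlyWarehouse:
    """Toy store that allows only data loading and data access.
    Rows are never updated or deleted (non-volatile)."""

    def __init__(self):
        self._rows = []

    def load(self, rows, load_date):
        # Every loaded row is stamped with its period, so older snapshots
        # stay available next to newer ones (time-variant).
        for row in rows:
            self._rows.append({**row, "load_date": load_date})

    def access(self, predicate=lambda row: True):
        # Return copies so callers cannot modify the stored history.
        return [dict(row) for row in self._rows if predicate(row)]

warehouse = AppendOnlyWarehouse()
warehouse.load([{"product": "P1", "price": 100}], date(2023, 1, 1))
warehouse.load([{"product": "P1", "price": 120}], date(2024, 1, 1))

history = warehouse.access(lambda row: row["product"] == "P1")
print([(row["load_date"].year, row["price"]) for row in history])
# [(2023, 100), (2024, 120)]
```

The second load does not overwrite the first one, so both prices remain available for historical analysis.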

*Difference between Operational Database System and Data Warehouse:

Parameters         | Operational Database System                              | Data Warehouse
Definition         | Designed to support high volume transaction processing. | Designed to support high volume analytical processing.
Design             | Application oriented                                     | Subject oriented
Performance        | Low for analysis process                                 | High for analysis process
Data used          | Current data                                             | Historical data
Updating of data   | Regularly                                                | Rarely
Operations on data | Insert, delete, update                                   | Read only
Data redundancy    | No                                                       | Yes
Access to system   | Repetitive                                               | Ad-hoc
Function           | Day-to-day operations                                    | Decision making
Applications       | OLTP (Online Transaction Processing)                     | OLAP (Online Analytical Processing)
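
The two styles of work in the table can be contrasted with Python's built-in `sqlite3` module. This is only a sketch: the `sales` table and its rows are made up, and both statements run against one in-memory database for brevity, whereas in practice transactional and analytical work usually run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("North", 2023, 100.0), ("North", 2024, 150.0), ("South", 2024, 200.0)],
)

# OLTP-style work: a short transaction that changes a single current row.
conn.execute("UPDATE sales SET amount = 120.0 WHERE id = 1")
conn.commit()

# OLAP-style work: a read-only query that summarizes many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('North', 270.0), ('South', 200.0)]
```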

*Need for Data Warehousing:

1. Improving integration
2. Speeding up response times
3. Faster and more flexible reporting
4. Recording changes to build history
5. Increasing data quality
6. Unburdening the IT department (no need to track data regularly)
7. Increasing recognizability
8. Increasing findability

*Applications of Data Warehouse:

Information Processing:
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs.
Analytical Processing:
It supports various online analytical processing operations such as drill-down, roll-up, and pivoting. The
historical data is processed in both summarized and detailed formats.
Data Mining:
It helps in discovering hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization
tools.
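
The analytical operations named above (roll-up, drill-down and pivoting) can be sketched on a tiny set of sales facts. The data and the helper name `summarize` are hypothetical; a real OLAP server would run these operations over a multidimensional model.

```python
from collections import defaultdict

# Hypothetical detailed sales facts: (year, quarter, city, amount).
facts = [
    (2024, "Q1", "Pune", 100),
    (2024, "Q1", "Mumbai", 150),
    (2024, "Q2", "Pune", 200),
    (2023, "Q4", "Pune", 50),
]

def summarize(rows, key):
    """Group rows by key(row) and sum the amount (last field)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[-1]
    return dict(totals)

# Roll-up: summarize at the coarser year level.
by_year = summarize(facts, key=lambda r: r[0])

# Drill-down: go from year totals to the finer (year, quarter) level.
by_quarter = summarize(facts, key=lambda r: (r[0], r[1]))

# Pivot: show cities as rows and quarters as columns for 2024.
pivot = defaultdict(dict)
for year, quarter, city, amount in facts:
    if year == 2024:
        pivot[city][quarter] = pivot[city].get(quarter, 0) + amount

print(by_year)      # {2024: 450, 2023: 50}
print(by_quarter)   # {(2024, 'Q1'): 250, (2024, 'Q2'): 200, (2023, 'Q4'): 50}
print(dict(pivot))  # {'Pune': {'Q1': 100, 'Q2': 200}, 'Mumbai': {'Q1': 150}}
```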
1. Airline:
Analysis of crew assignments, flight data, flight routes, fares.

2. Banking:
Analysis of customer data, transactions, loans, accounts, KYC.

3. Healthcare:
Generate patient's treatment reports, share data with tie-in insurance companies, medical aid
services, etc.

4. Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps government
agencies to maintain and analyse tax records and health policy records for every individual.

5. Investment and Insurance sector:
In this sector, data warehouses are primarily used to analyse data patterns and customer trends,
and to track market movements.

6. Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps to
track items, customer buying patterns and promotions, and is used for determining pricing policy.

7. Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions and
distribution decisions.

8. Hospitality Industry:
This industry uses warehouse services to design as well as evaluate its advertising and
promotion campaigns, in which it targets clients based on their feedback and travel
patterns.

*Data Warehouse Architecture:

There are mainly three types of data warehouse architectures:

Single-tier architecture:
The objective of a single layer is to minimize the amount of data stored by removing
data redundancy. This architecture is not frequently used in practice.
Two-tier architecture:
A two-layer architecture physically separates the available sources from the data warehouse. This
architecture is not expandable and does not support a large number of end-users. It also
has connectivity problems because of network limitations.

Three-tier architecture:
This is the most widely used architecture.
It consists of the top, middle and bottom tiers.

Fig: Three-Tiered Data Warehouse Architecture
(Labels in the figure: Database, Metadata, Raw Data, Summary Data, Analytic Tool, Report Tool, Data Mining Tool)

1. Bottom Tier:
The bottom tier is the database server of the data warehouse, which holds the data collected from the data sources.
It is usually a relational database system.

2. Middle Tier:
The middle tier in a data warehouse is an OLAP server, which is implemented using
either the ROLAP or the MOLAP model.
This application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end-user and the database.

3. Top-Tier:
The top tier is a front-end client layer.
It consists of the tools and APIs that users use to get data out of the data warehouse.
The different tools are query tools, reporting tools, managed query tools, analysis
tools and data mining tools.
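
The division of responsibilities among the three tiers can be sketched as follows. The class and function names (`BottomTier`, `MiddleTier`, `report`) and the sample rows are hypothetical, and an in-memory list stands in for the relational database.

```python
# Bottom tier: the warehouse database (here just an in-memory list of rows).
class BottomTier:
    def __init__(self, rows):
        self._rows = rows

    def scan(self):
        return list(self._rows)

# Middle tier: an OLAP-like server that hides storage details and
# exposes an abstracted, aggregated view of the data.
class MiddleTier:
    def __init__(self, database):
        self._database = database

    def total_by(self, dimension, measure):
        totals = {}
        for row in self._database.scan():
            totals[row[dimension]] = totals.get(row[dimension], 0) + row[measure]
        return totals

# Top tier: a front-end reporting tool that only talks to the middle tier.
def report(olap_server):
    totals = olap_server.total_by("region", "amount")
    return [f"{region}: {amount}" for region, amount in sorted(totals.items())]

database = BottomTier([
    {"region": "North", "amount": 100},
    {"region": "South", "amount": 200},
    {"region": "North", "amount": 50},
])
lines = report(MiddleTier(database))
print(lines)  # ['North: 150', 'South: 200']
```

The reporting function never touches the stored rows directly, which mirrors the mediator role of the middle tier.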

*Data Warehouse Models:

Three main types of Data Warehouses are:

1. Enterprise Data Warehouse
2. Data Marts
3. Virtual Warehouse

1. Enterprise Data Warehouse (EDW):
An enterprise data warehouse is a centralized warehouse that aggregates information
automatically.
It offers a unified approach for organizing and representing data.
It also provides the ability to classify data according to subject and to give users access
accordingly.
It provides decision support service across the enterprise.

2. Data Marts:
A data mart is a subset of the data warehouse.
It is specially designed for a particular line of business, such as sales or
finance. In an independent data mart, data can be collected directly from the sources.
Because of the large amount of data, a single warehouse can become overburdened.
So, to prevent the warehouse from becoming impossible to navigate, subdivisions
called data marts are created.
These data marts divide the information saved in the warehouse into categories for specific
groups of users.
In simple words, a data mart is a subsidiary of a data warehouse.
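
The idea of a data mart as a department-specific subset of the warehouse can be sketched in a few lines. The rows, the `department` field and the function `build_data_mart` are hypothetical and serve only to illustrate the subdivision described above.

```python
# Hypothetical warehouse rows covering several departments.
warehouse_rows = [
    {"department": "sales", "item": "order-1", "amount": 500},
    {"department": "finance", "item": "invoice-7", "amount": 300},
    {"department": "sales", "item": "order-2", "amount": 250},
]

def build_data_mart(rows, department):
    """Return the subset of warehouse rows that one department needs."""
    return [row for row in rows if row["department"] == department]

sales_mart = build_data_mart(warehouse_rows, "sales")
finance_mart = build_data_mart(warehouse_rows, "finance")
print(len(sales_mart), len(finance_mart))  # 2 1
```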

3. Virtual Warehouse:
A virtual warehouse is essentially a separate business database that contains only
the data required from the operational systems.
The data found in a virtual warehouse is usually copied from multiple sources throughout an
operational system.
A virtual warehouse is used to search the data quickly, without accessing the entire system.
It speeds up the overall access process.

*ETL Process in Data Warehouse:

ETL is a process in data warehousing, and it stands for Extract, Transform and Load. It is
a process in which an ETL tool extracts the data from various data source systems,
transforms it in the staging area and then finally loads it into the data warehouse system.

Fig: ETL process in Data Warehousing

1. Extraction:
The first step of the ETL process is extraction.
In this step, data is extracted from various source systems, which can be in various
formats such as relational databases, NoSQL, XML and flat files, into the staging area.
The extracted data cannot be loaded directly into the data warehouse because it arrives in
different formats and may be inconsistent, so it is held in the staging area first. This makes
extraction one of the most important steps of the ETL process.

2. Transformation:
The second step of the ETL process is transformation.
In this step, a set of rules or functions is applied to the extracted data to convert it into
a single standard format.
It may involve the following processes/tasks:

• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling in NULL values and missing values.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally the key attribute).
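
The five transformation tasks listed above can be sketched on two sample records. All field names, the placeholder value "UNKNOWN" and the function `transform` are assumptions made for illustration; real transformation rules depend on the source systems.

```python
# Hypothetical extracted records; field names are assumed for illustration.
extracted = [
    {"id": 2, "first": "Asha", "last": "Patil", "joined": "2023-05-14", "city": None, "temp": "x"},
    {"id": 1, "first": "Ravi", "last": "Kale", "joined": "2024-01-02", "city": "Loni", "temp": "y"},
]

def transform(record):
    # Filtering: keep only the attributes the warehouse needs.
    row = {k: v for k, v in record.items() if k != "temp"}
    # Cleaning: fill in a missing (NULL) value with a placeholder.
    if row["city"] is None:
        row["city"] = "UNKNOWN"
    # Joining: combine two attributes into one.
    row["full_name"] = f"{row.pop('first')} {row.pop('last')}"
    # Splitting: break one attribute into several.
    year, month, day = row.pop("joined").split("-")
    row["join_year"], row["join_month"], row["join_day"] = int(year), int(month), int(day)
    return row

# Sorting: order the tuples on the key attribute.
transformed = sorted((transform(r) for r in extracted), key=lambda r: r["id"])
print(transformed[0]["full_name"], transformed[1]["city"])  # Ravi Kale UNKNOWN
```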

3. Loading:
The third and final step of the ETL process is loading.
In this step, the transformed data is finally loaded into the data warehouse.
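
The three steps can be put together in a small end-to-end sketch that uses only the Python standard library. The CSV text, the `sales` table and the function names are hypothetical. The transform step here simply skips rows with a missing amount, which is just one possible rule, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; a real job would read files or databases.
SOURCE_CSV = "agent_id,amount\nKJ732,299.90\nAB100,\nKJ732,100.10\n"

def extract(text):
    """Extract: read raw rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(staged_rows):
    """Transform: skip rows with a missing amount and convert types."""
    return [
        (row["agent_id"], float(row["amount"]))
        for row in staged_rows
        if row["amount"]
    ]

def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (agent_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
row_count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)  # 2
```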

ETL Pipelining:
The ETL process can also use the pipelining concept, i.e. as soon as some data is extracted, it
can be transformed, and during that period some new data can be extracted. And while the
transformed data is being loaded into the data warehouse, the already extracted data can be
transformed.
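
The hand-over between stages can be sketched with Python generators. This is a simplified illustration: generators pass records from stage to stage one at a time within a single thread, so each record moves on to the next stage before the following one is extracted, but the stages do not truly run at the same time (that would need threads, processes or a dedicated ETL tool). The function names and sample data are hypothetical.

```python
def extract(source):
    for record in source:
        yield record                  # hand over each record as soon as it is read

def transform(records):
    for record in records:
        yield record.strip().upper()  # transform one record at a time

def load(records, warehouse):
    for record in records:
        warehouse.append(record)      # load while later records are still upstream

source = [" north ", " south ", " east "]
warehouse = []
load(transform(extract(source)), warehouse)
print(warehouse)  # ['NORTH', 'SOUTH', 'EAST']
```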

ETL Tools: Commonly used ETL tools include Sybase, Oracle Warehouse Builder,
CloverETL and MarkLogic.

*Metadata Repository:
(Metadata: data about data; repository: big container)
Metadata is information about the structures that contain the actual data.
Metadata may describe the structure of any
data, of any subject, stored in any format.
A metadata repository holds the structures of all data in one place.
This makes ample information available for decision making.
With one-stop information, the business has more control over changes and can do
impact analysis.
A metadata repository is used for building, maintaining and managing the data warehouse.

For example, a line in a sales database may contain: 4030 KJ732 299.90
This data is meaningless until we consult the metadata, which tells us what each value is.
The metadata for this line is:
• Model number: 4030
• Sales Agent ID: KJ732
• Total sales amount: $299.90
Therefore, metadata is an essential ingredient in the transformation of data into knowledge.
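
The role of metadata in this example can be sketched in Python: the raw line only becomes meaningful once field names and types are attached to it. The structure `SALES_METADATA` and the function `interpret` are hypothetical names chosen for this sketch.

```python
# Metadata: describes the structure of each line in the sales data.
SALES_METADATA = [
    ("model_number", int),
    ("sales_agent_id", str),
    ("total_sales_amount", float),
]

def interpret(line, metadata):
    """Attach field names and types from the metadata to a raw line."""
    values = line.split()
    return {name: cast(value) for (name, cast), value in zip(metadata, values)}

record = interpret("4030 KJ732 299.90", SALES_METADATA)
print(record)
# {'model_number': 4030, 'sales_agent_id': 'KJ732', 'total_sales_amount': 299.9}
```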

Benefits of Metadata Repository:

a. Integration of metadata across the organization.
b. Building relationships between various metadata types.
c. Building relationships between various disparate (different in nature) systems.
d. Version control of changes at the structure level.
e. Linking views to master data.
f. Automatic synchronization with various authorized metadata source systems.
g. More control over business decisions.
h. Discovering discrepancies, gaps, lineage and metrics at the data structure level.

Metadata can be classified into following categories:

1. Technical Metadata: This kind of metadata contains information about the warehouse
that is used by data warehouse designers and administrators.
2. Business Metadata: This kind of metadata contains details that give end-users an
easy way to understand the information stored in the data warehouse.

*Benefits/Advantages of a Data Warehouse:

1. Delivers enhanced business intelligence
By having access to information from various sources on a single platform, decision makers
no longer need to rely on limited data.

2. Saves time
A data warehouse standardizes, preserves, and stores data from different sources,
integrating all of it in one place.
So, all critical data is available to all users simultaneously.

3. Enhances data quality and consistency
A data warehouse converts data from multiple sources into a consistent format.
The data from different sources can be filtered, sorted and cleaned.
This leads to more accurate data, which becomes the basis for solid decisions.

4. Generates a high Return on Investment (ROI)
Companies that invest in a data warehouse tend to experience higher revenues and cost savings
than those that have not.

5. Provides competitive advantage
Data warehouses help companies to get a holistic (as a whole, not in parts) view of their current standing
and to evaluate opportunities and risks, thus providing them with a competitive advantage.

6. Improves the decision-making process
Data warehousing provides better insights (detailed understanding) to decision makers by
maintaining a related database of current and historical data.

7. Enables organizations to forecast with confidence
With the advanced features of a data warehouse, an organization can plan its future line of action
more easily.

8. Streamlines (organizes well) the flow of information
Data warehousing facilitates the flow of information among all related and non-related parties.

Chapter 1: Assignment Questions

2 Marks Questions:

1. Define Data Warehouse.

2. Explain need of Data Warehousing.

3. Define extraction, transformation and loading in Data Warehouse.

4. Explain any two characteristics of Data Warehousing.

4/6 Marks Questions:

1. Differentiate between operational database system and data warehouse.

2. Explain three-tiered data warehouse architecture.

3. Explain any two data warehouse models.

4. Explain ETL process in data warehouse.

5. Explain metadata repository with its benefits.

6. Explain advantages of Data Warehousing.

Annexure 1

Metadata Repository:

Ex: Metadata of a Book Store:

1. Name of book
2. Summary of book
3. Publication of book
4. Edition of book
5. Author of book
6. Date of publication
7. Availability of book
8. Reviews of book
The above information (metadata) helps a reader to search for the book, access it, and decide
whether to purchase it.

Difference: Data Warehouse Vs Data Marts:

Parameters                   | Data Warehouse                       | Data Marts
Definition                   | Collection of large amounts of data. | Sub division of Data warehouse
Subjects (departments)       | All departments in an organization   | Specific department
Design process               | Complex                              | Simple
Implementation time required | More                                 | Less
Data handling time required  | More                                 | Less
Storage required             | More (100GB to 1TB)                  | Less (up to 100GB)
Flexibility                  | More                                 | Less
Function                     | Subject independent                  | Subject dependent
