Data Warehousing and Data Mining: Dr. Karunendra Verma
BASIC CONCEPTS
• What is Data?
Data is raw, unorganized facts that need to be processed. Data
can be something simple and seemingly random and useless
until it is organized.
• What is Information?
When data is processed, organized, structured or presented in a
given context so as to make it useful, it is called information.
• How are data and information related?
Example: Each student's test score is one piece of data.
The average score of a class or of the entire school is
information that can be derived from the given data.
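As a minimal sketch of this relationship, the snippet below treats individual test scores as data and derives the class average as information. The scores are made-up values used only for illustration.

```python
# Hypothetical test scores: each number is one piece of data.
scores = [72, 85, 90, 65]

# The class average is information derived from that data.
average = sum(scores) / len(scores)
print(average)
```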
What is a Warehouse in general?
❖ Definition of DSS
A Decision Support System (DSS) aims to extract high-level information from the
detailed data stored in transaction processing systems.
Business Intelligence
Problems in Storage and Retrieval of
Data for DSS
1. Many DSS queries cannot be written
using SQL
Reasons
2. Large companies have various sources of data
(branches spread worldwide)
o Located at different places
o Use different database schemas
o Designed and implemented at
different times
o May use different DBMS software
To avoid these problems, companies use a
data warehouse.
Data Warehouse Definition
Definition by Bill Inmon
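[Figure: the same attribute encoded differently across sources, e.g. 1/0, M/F, or X/Y for one field, and 134 cms or 1.6 meters for another]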
[Figure: an operational DBMS supports Insert, Update, and Delete; a data warehouse supports Load and Access]
This definition, given by Bill Inmon,
remained reasonably accurate
for almost 10 years.
Accepted Changes in Definition
Nowadays:
▪ A DW can be volatile, because it may need to hold multiple terabytes
(1 terabyte = 2^40 bytes) of data
▪ A DW has become more general. It can hold data about
more than one subject. A single-subject DW is
called a “data mart”, and a DW is now enterprise-wide.
▪ Only a limited period of history (3 years) is kept in the DW, and
older tuples are automatically “rolled off”
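The roll-off of older tuples can be sketched as a simple date filter. This is only an illustration: the tuple layout, the helper names `years_before` and `roll_off`, and the sample rows are assumptions, and a real warehouse would usually do this inside the database.

```python
from datetime import date

RETENTION_YEARS = 3  # history window mentioned on the slide

def years_before(day, years):
    """Return the same calendar day `years` earlier (29 Feb falls back to 28 Feb)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)

def roll_off(tuples, today):
    """Keep only tuples dated inside the retention window."""
    cutoff = years_before(today, RETENTION_YEARS)
    return [t for t in tuples if t["date"] >= cutoff]

# Hypothetical warehouse tuples.
tuples = [
    {"date": date(2020, 5, 1), "units": 10},
    {"date": date(2022, 5, 1), "units": 20},
    {"date": date(2024, 1, 15), "units": 30},
]
kept = roll_off(tuples, today=date(2024, 6, 1))
print([t["units"] for t in kept])
```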
What is “Data Warehousing”?
[Figure: branches B1, B2, ..., Bn with schemas S1, ..., Sn feed Data Loaders, which populate a database with a unified (un-normalized) schema]
Bi: Branches
Si: Schemas
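A minimal sketch of the data-loader idea: each branch keeps its own schema, and a loader renames the columns into one unified schema. The branch column names, the `BRANCH_MAPPINGS` table, and the sample rows are all hypothetical.

```python
# Hypothetical per-branch schemas: each branch names its columns differently.
BRANCH_MAPPINGS = {
    "B1": {"cust": "customer", "amt": "amount"},
    "B2": {"client_name": "customer", "total": "amount"},
}

def load(branch, rows):
    """Rename a branch's columns into the unified warehouse schema."""
    mapping = BRANCH_MAPPINGS[branch]
    return [
        {"branch": branch, **{mapping[col]: value for col, value in row.items()}}
        for row in rows
    ]

warehouse = (
    load("B1", [{"cust": "Asha", "amt": 120}])
    + load("B2", [{"client_name": "Ravi", "total": 80}])
)
print(warehouse)
```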
1. When and How to gather data?
[Figure: sources 1, 2, ..., n supplying data to the DW; surviving labels: Active, Passive, Order]
2. What schema to use?
1. Schema Integration
2. Data conversion
3. Data Transformation and Cleansing.
Data Transformation
[Figure: data transformation example; surviving label: Name]
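Two typical transformation and cleansing steps can be sketched as follows: converting lengths recorded in different units to one unit, and tidying a name field. The unit spellings, the function names, and the sample inputs are assumptions made for illustration.

```python
# Conversion factors to meters; the unit spellings are assumptions about the sources.
UNIT_TO_METERS = {"cm": 0.01, "cms": 0.01, "m": 1.0, "meters": 1.0}

def to_meters(text):
    """Convert a string such as '134 cms' to a length in meters."""
    value, unit = text.split()
    return round(float(value) * UNIT_TO_METERS[unit.lower()], 4)

def clean_name(name):
    """Collapse extra whitespace and normalize capitalization."""
    return " ".join(name.split()).title()

print(to_meters("134 cms"))
print(to_meters("1.6 meters"))
print(clean_name("  asha   RAO "))
```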
4. How to propagate Updates?
5. What Data To Summarize?
Data Warehouse Schemas
Consistency,
Avoiding redundancy,
Modification of data,
Ability to represent data…
Note Difference
1. Data analysis
2. To support interactive analysis of
summary
Data Warehouse Schema
[Figure: data visualization of unit counts by Item name, Color, and Size; surviving values: Yellow, blue, XL, 400 units, 100 units, 150 units, 300 units]
Fact Tables
EXAMPLE OF A FACT TABLE
1. Star schema
2. Snowflake schema
STAR SCHEMA
Sales (fact table) = (Item_id, Store_id, Customer_id, Date, Number, Price)
Item_info = (Item_id, Item_name, Color, Size, Category)
Store = (Store_id, City, State, Country)
Customer = (Customer_id, Name, Street, City, State)
Date_info = (Date, Day, Month, Year)
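A minimal sketch of a star schema like this one, using Python's built-in sqlite3 module. Only the item and store dimensions are included to keep it short, and every row value is made up. The query joins the fact table to both dimensions and aggregates the number sold by color and city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_info (item_id INTEGER PRIMARY KEY, item_name TEXT,
                        color TEXT, size TEXT, category TEXT);
CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT,
                    state TEXT, country TEXT);
-- Fact table: one row per item, store, and date.
CREATE TABLE sales (
    item_id  INTEGER REFERENCES item_info(item_id),
    store_id INTEGER REFERENCES store(store_id),
    date     TEXT,
    number   INTEGER,
    price    REAL,
    PRIMARY KEY (item_id, store_id, date)
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO item_info VALUES (?, ?, ?, ?, ?)", [
    (1, "shirt", "yellow", "XL", "clothing"),
    (2, "shirt", "blue", "XL", "clothing"),
])
conn.executemany("INSERT INTO store VALUES (?, ?, ?, ?)", [
    (1, "Pune", "Maharashtra", "India"),
    (2, "Delhi", "Delhi", "India"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "2024-01-01", 4, 10.0),
    (1, 1, "2024-01-02", 6, 10.0),
    (2, 1, "2024-01-01", 3, 12.0),
    (1, 2, "2024-01-01", 5, 10.0),
])

rows = conn.execute("""
    SELECT i.color, s.city, SUM(f.number)
    FROM sales f
    JOIN item_info i ON f.item_id = i.item_id
    JOIN store s ON f.store_id = s.store_id
    GROUP BY i.color, s.city
    ORDER BY i.color, s.city
""").fetchall()
print(rows)
```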
Characteristics of star schema
1. Contains only one fact table and
multiple dimension tables
2. The primary key of the fact table is a composite
key made up of all the dimension
keys present in it.
3. The fact table may include levels (e.g. item
1 sold at district, regional, and state
level)
4. A single fact table will contain detail
data such as sales at a store, of an
item k, on date xyz
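The composite-key characteristic can be illustrated with sqlite3: when the primary key is made up of all the dimension keys, a second fact row for the same item, store, and date is rejected. The table layout and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        item_id INTEGER, store_id INTEGER, date TEXT, number INTEGER,
        PRIMARY KEY (item_id, store_id, date)  -- composite key of dimension keys
    )
""")
conn.execute("INSERT INTO sales VALUES (1, 1, '2024-01-01', 4)")

try:
    # Same item, store, and date: violates the composite primary key.
    conn.execute("INSERT INTO sales VALUES (1, 1, '2024-01-01', 9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```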
How to design a star schema?
TUTORIAL 1
Q.1. Design a star schema for Olympic events.
-- Consider the particular example of attendance at Olympic events.
Facts are the numbers attending and the value of ticket sales. Dimensions
include Olympiad (year of the Olympics), venue, sport, type (heat
(common match), semifinal, final), and men's / women's. Venues
are classified by location and by type of building into central
enclosed, central open, and remote. Sports are subdivided into
events.
The following is a sample of a report representing
attendance at various events. (A page will be given.) Do the
following:
a) Construct a fact table for this Olympic event.
b) What is the key of the fact table?
c) Design a star schema by using the fact table designed in a)
and using dimension tables.
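[Figure: partial star schema for the Olympic events tutorial. Olympiad dimension: Olympiad, city, Organizing committee, Contact address. Venue dimension: Venue, Location, Region. Other surviving labels: Olympiad, Venue, Sport_event, Sports, Gender.]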
Drawbacks of star schema
Snowflake schema
[Figure: the Fact table references Location = (Location key, Street, City key), which in turn references City = (City key, City name, State, Country)]
Example of Snowflake Schema
Sales Fact Table = (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
time = (time_key, day, day_of_the_week, month, quarter, year)
item = (item_key, item_name, brand, type, supplier_key)
supplier = (supplier_key, supplier_type)
branch = (branch_key, branch_name, branch_type)
location = (location_key, street, city_key)
city = (city_key, city, state_or_province, country)
Difference
• Star schema
1. Dimension tables are un-normalized
2. Requires more space due to redundant data
3. Query evaluation cost is less due to fewer join operations
4. Simple and commonly used
• Snowflake schema
1. Some dimension tables are normalized
2. Requires less space since there is less redundancy
3. Query evaluation cost is more due to more joins
4. Difficult and less common
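The trade-off in the comparison above can be sketched with sqlite3: the same question (units sold per country) needs one join against an un-normalized location dimension and two joins once the city attributes are split out. Table names ending in `_star` and `_snow` and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one un-normalized location dimension (city and country repeat per street).
CREATE TABLE location_star (location_key INTEGER PRIMARY KEY, street TEXT,
                            city TEXT, country TEXT);
-- Snowflake: city attributes are moved into their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location_snow (location_key INTEGER PRIMARY KEY, street TEXT,
                            city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales (location_key INTEGER, units_sold INTEGER);

INSERT INTO location_star VALUES (1, 'MG Road', 'Pune', 'India');
INSERT INTO location_star VALUES (2, 'FC Road', 'Pune', 'India');
INSERT INTO city VALUES (10, 'Pune', 'India');
INSERT INTO location_snow VALUES (1, 'MG Road', 10);
INSERT INTO location_snow VALUES (2, 'FC Road', 10);
INSERT INTO sales VALUES (1, 5);
INSERT INTO sales VALUES (2, 7);
""")

# Star schema: one join reaches the country attribute.
star = conn.execute("""
    SELECT l.country, SUM(f.units_sold)
    FROM sales f JOIN location_star l ON f.location_key = l.location_key
    GROUP BY l.country
""").fetchall()

# Snowflake schema: an extra join through the city table is needed.
snow = conn.execute("""
    SELECT c.country, SUM(f.units_sold)
    FROM sales f
    JOIN location_snow l ON f.location_key = l.location_key
    JOIN city c ON l.city_key = c.city_key
    GROUP BY c.country
""").fetchall()

print(star, snow)
```

Both queries return the same totals; the snowflake version stores 'Pune' and 'India' once instead of once per street, at the price of the additional join.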
Q.2. Design a snowflake schema for the
partial star schema shown, by
decomposing the dimension tables into 3NF.
Assume that the following functional
dependencies hold on the customer and
store dimensions.
[Figure: partial star schema. Fact table keys: Customer id, Store id. Customer dimension: Customer id, Name, Address, Phone, location. Store dimension: Store id, Slocation id, Owner, City, State, country.]
Functional dependencies
Fact Constellation Schema
General example of fact constellation
[Figure: fact tables F1 and F2 linked to dimension tables D1, D2, and D3]
Example of Fact Constellation
Sales Fact Table = (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
Shipping Fact Table = (time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped)
time = (time_key, day, day_of_the_week, month, quarter, year)
item = (item_key, item_name, brand, type, supplier_type)
branch = (branch_key, branch_name, branch_type)
location = (location_key, street, city, province_or_state, country)
shipper = (shipper_key, shipper_name)
Q.3. Draw the fact constellation schema
diagram for the following tables by identifying
fact and dimension tables.
Sales = (time_key, item_key, branch_key, location_key, dollars_sold, units_sold)
Shipping = (item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped)
Time = (time_key, day_of_week, month, quarter, year)
Branch = (branch_key, branch_name, branch_type)
Location = (location_key, street, city, country)
Item = (item_key, item_name, brand, type, supplier)
Shipper = (shipper_key, shipper_name, location_key, shipper_type)
Q.3. a) How many subjects does this fact
constellation schema handle? Why?
b) How many fact tables are there?
Which ones?
c) Which tables are shared by all the
fact tables?
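One way to check which dimension tables are shared is to list, for each fact table, the dimension that each key column refers to and then intersect the results. This is a sketch, not a model answer: the mapping below is written by hand, and it assumes that from_location and to_location both refer to the Location table.

```python
# Key columns of each fact table mapped to the dimension table they reference.
# Assumption: from_location and to_location both reference Location.
FACT_DIMENSIONS = {
    "Sales": {
        "time_key": "Time",
        "item_key": "Item",
        "branch_key": "Branch",
        "location_key": "Location",
    },
    "Shipping": {
        "item_key": "Item",
        "time_key": "Time",
        "shipper_key": "Shipper",
        "from_location": "Location",
        "to_location": "Location",
    },
}

# Dimensions referenced by every fact table.
shared = set.intersection(*(set(d.values()) for d in FACT_DIMENSIONS.values()))
print(sorted(shared))
```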
Difference between OLTP and DW system
[Table comparing OLTP and DW; only the column headers and the fragment "time database" survive]
Data Cubes and OLAP
▪ A DW is modeled by a multidimensional
database structure, where each dimension
corresponds to one or more attributes and each cell
stores the value of an aggregate measure.
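A minimal sketch of this idea: a cube held as a dictionary whose keys are one value per dimension and whose cells store an aggregate (here, the sum of units). The three dimensions and the fact rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, country, quarter, units).
facts = [
    ("TV", "U.S.A", "Q1", 10),
    ("TV", "U.S.A", "Q1", 5),
    ("PC", "Canada", "Q1", 7),
    ("TV", "Mexico", "Q2", 3),
]

# Each cell is addressed by one value per dimension and stores the summed measure.
cube = defaultdict(int)
for product, country, quarter, units in facts:
    cube[(product, country, quarter)] += units

print(cube[("TV", "U.S.A", "Q1")])
```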
DATA CUBE
Browsing a Data Cube
On-Line Analytical Processing (OLAP)
• A relational DBMS holds 2-
dimensional data laid out in rows and
columns. Thus OLTP (On-Line
Transaction Processing) is the way to
use a DBMS.
• A DW is a multidimensional structure,
so OLAP is the proper way to use a DW.
• OLAP is a set of operations that
aggregate data along various
dimensions to present the data at
different levels.
Difference between OLTP and OLAP
No Feature OLTP OLAP
[Figure: concept hierarchies. Location tree: Country > State 1 ... State k > District 1 ... District m. Total order: street < city < state < country. Partial order: Day < {Week; Month < Quarter} < Year.]
Use of concept Hierarchy
1. Roll-Up (Drill-Up)
2. Drill-Down
3. Slice and Dice
4. Pivot (Rotate)
Roll-Up:
[Figure: data cube with a Product axis (TV, PC, VCR) and a Country axis (U.S.A, Canada, Mexico), with sum cells along each axis]
Roll-Up on dimension reduction
Extreme Roll-Up
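The roll-up variants named above can be sketched on a small cube stored as a dictionary: climbing a concept hierarchy (city to country), dimension reduction (dropping the location dimension), and the extreme roll-up to a single grand total. The cells, the `CITY_TO_COUNTRY` hierarchy, and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical cells: (city, product) -> units sold.
cells = {
    ("Pune", "TV"): 4,
    ("Delhi", "TV"): 6,
    ("Toronto", "TV"): 5,
    ("Toronto", "PC"): 2,
}

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Pune": "India", "Delhi": "India", "Toronto": "Canada"}

def roll_up(cells, key_fn):
    """Re-aggregate cells under a coarser key computed by key_fn."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key_fn(key)] += value
    return dict(out)

# Roll-up by climbing the hierarchy: city -> country.
by_country = roll_up(cells, lambda k: (CITY_TO_COUNTRY[k[0]], k[1]))
# Roll-up by dimension reduction: drop the location dimension.
by_product = roll_up(cells, lambda k: (k[1],))
# Extreme roll-up: drop every dimension, leaving one grand total.
grand_total = roll_up(cells, lambda k: ())

print(by_country, by_product, grand_total)
```

Drill-down is the reverse direction: going from `by_country` back to the per-city cells requires the more detailed data to still be available.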
Slice and Dice
Slice Operation
Dice
Dicing Operation
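Slice and dice can be sketched on a dictionary-based cube. In this sketch a slice fixes one dimension to a single value and drops that dimension from the result, while a dice keeps a sub-cube by restricting two or more dimensions to sets of values. The cube contents and function names are hypothetical.

```python
# Hypothetical cube: (product, country, quarter) -> units sold.
cube = {
    ("TV", "U.S.A", "Q1"): 15,
    ("PC", "Canada", "Q1"): 7,
    ("TV", "Mexico", "Q2"): 3,
    ("VCR", "Canada", "Q2"): 4,
}

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and remove it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: units
        for key, units in cube.items()
        if key[axis] == value
    }

def dice_cube(cube, allowed):
    """Keep cells whose value on each listed axis is in the permitted set."""
    return {
        key: units
        for key, units in cube.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }

q1_slice = slice_cube(cube, axis=2, value="Q1")
sub_cube = dice_cube(cube, {0: {"TV", "PC"}, 1: {"U.S.A", "Canada"}})
print(q1_slice)
print(sub_cube)
```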
Pivot (Rotate)
Before Rotating
After Rotating
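A pivot only changes the orientation of the view, not the values. A minimal sketch on a two-dimensional table, where rows and columns swap places; the table contents and the `pivot` helper are hypothetical.

```python
# Hypothetical 2-D view before rotating: rows are products, columns are countries.
table = {
    "TV": {"U.S.A": 15, "Mexico": 3},
    "PC": {"Canada": 7},
}

def pivot(table):
    """Swap the row and column axes of a nested-dictionary table."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated

# After rotating: rows are countries, columns are products.
rotated = pivot(table)
print(rotated)
```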
Three-Tier Data Warehouse Architecture
[Figure: three-tier architecture; surviving labels: Data, Data Marts, OLAP Engine, Front-End]
Bottom tier: DW database server
ROLAP vs. MOLAP vs. HOLAP
Data Warehousing - Interview Questions
• Q: Explain data mart.
• A: A data mart contains a subset of organization-wide data. This
subset of data is valuable to specific groups in an organization. In
other words, a data mart contains data specific to a
particular group.
• Q: List the phases involved in the data warehouse delivery
process.
• A: The stages are IT strategy, education, business case analysis,
technical blueprint, build the version, history load, ad hoc query,
requirement evolution, automation, and extending scope.
• Q: Define load manager.
• A: A load manager performs the operations required for the extract and
load process. The size and complexity of a load manager vary
between specific solutions, from one data warehouse to another.
• Q: Define the functions of a load manager.
• A: A load manager extracts data from the source system, fast-loads
the extracted data into a temporary data store, and performs simple
transformations into a structure similar to the one in the data
warehouse.
• Q: Define a warehouse manager.
• A: A warehouse manager is responsible for the warehouse
management process. The warehouse manager consists of third-
party system software, C programs, and shell scripts. The size and
complexity of a warehouse manager vary between specific
solutions.
• Q: Define the functions of a warehouse manager.
• A: The warehouse manager performs consistency and referential
integrity checks; creates indexes, business views, and partition
views against the base data; transforms and merges the source data
in the temporary store into the published data warehouse; backs
up the data in the data warehouse; and archives the data that has
reached the end of its captured life.
• Q: What is normalization?
• A: Normalization splits up the data into additional tables.
• Q: Of the star schema and the snowflake schema, which one has
normalized dimension tables?
• A: The snowflake schema uses the concept of normalization.
• Q: What is the benefit of normalization?
• A: Normalization helps in reducing data redundancy.
• Q: Which language is used for schema definition?
• A: Data Mining Query Language (DMQL) is used for schema
definition.
• Q: What language is the base of DMQL?
• A: DMQL is based on Structured Query Language (SQL).
• Q: What are the reasons for partitioning?
• A: Partitioning is done for various reasons, such as easier
management, assisting backup and recovery, and enhancing performance.
• Q: What kinds of costs are involved in data marting?
• A: Data marting involves hardware and software cost, network
access cost, and time cost.
THANK YOU
• http://www.tutorialspoint.com/dwh/dwh_quick_guide.htm