Data Warehousing and Data Mining: Dr. Karunendra Verma

Data Warehousing and Data Mining
Dr. Karunendra Verma
Assoc. Professor, Dept. of CSE
BASIC CONCEPTS
• What is Data?
Data is raw, unorganized facts that need to be processed. Data
can seem simple, random and useless until it is organized.
• What is Information?
When data is processed, organized, structured or presented in a
given context so as to make it useful, it is called information.
• How are data and information related?
Example: Each student's test score is one piece of data.
The average score of a class or of the entire school is
information that can be derived from the given data.
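The test-score example can be sketched in a few lines of Python. The student names and scores below are invented purely for illustration.

```python
from statistics import mean

# Hypothetical test scores: each individual score is one piece of data.
scores = {"Asha": 72, "Ben": 85, "Chitra": 91, "Dev": 64}

# Processing the data (here, averaging) yields information about the class.
class_average = mean(scores.values())
print(f"Class average: {class_average}")
```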
What is a Warehouse in general?
“A warehouse is a commercial building for storage of goods.”
Warehouses are used by manufacturers,
importers, exporters, wholesalers, transport
businesses, customs, etc. They are usually
large plain buildings in industrial parts of
towns. They come equipped with loading
docks to load and unload trucks, or are
sometimes loaded directly from
railways, airports, or seaports. They also
often have cranes and forklifts for moving
goods.
Decision Support System
• How are DBMS applications classified?
1. Transaction Processing System (TPS):
A transaction processing system (TPS) is an information
processing system for business transactions involving
the collection, modification and retrieval of all
transaction data.
2. Decision Support System (DSS)
❖ Definition of DSS
A DSS aims to get high-level information out of the
detailed data stored in transaction processing systems.
(High-level information: information generated after
observing several modifications to tuples.)
Uses of DSS
1. What products to stock?
2. When to plan for production?
3. In what quantity should a product be
manufactured?
4. What items are preferred by people?
5. What items are likely to be
purchased in future?
(Business Intelligence)
Problems in Storage and Retrieval of
Data for DSS
1. Many DSS queries cannot be written
using SQL
Reasons
a. They require extensive statistical analysis
b. They require some preprocessing for analysis
c. Some queries written in SQL are inefficient
2. Large companies have various sources of data
(branches spread worldwide)
o Located at different places
o Use different database schemas
o Designed and implemented at
different times
o May use different DBMS software
To avoid these problems, companies use a
Data Warehouse.
Data Warehouse Definition
• A Data Warehouse gathers data from
multiple sources under a unified schema
at a single site and thus provides a single
uniform interface to data. It also stores
historical data.
Bill Inmon, often called the father of the data
warehouse, coined the term “Data Warehouse” in 1990.
Definition by Bill Inmon
• A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of
management's decision-making process.
1. Subject Oriented:- data that gives information
about a particular subject (like sales, production)
rather than mixing various subjects about the
company's ongoing operations.
There must be a separate warehouse for each
subject.
Application and Subject Orientation
Application      Subject
❑ Loans          ❑ Customer
❑ Savings        ❑ Vendor
❑ Accounts       ❑ Sales
❑ Trusts         ❑ Products
❑ Checking       ❑ Purchase
A DW focuses on “Data Modeling” and “Data
Design”; “Process Design” is not part of
data warehousing.
2. Integrated:- data is gathered into the
data warehouse from a variety of sources and
merged into a single unified schema.
Operational DBMS                     Data Warehouse
Male / Female, 1 / 0, X / Y      →   M / F
134 cms, 5.8 feet, 1.6 meters    →   inches
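A minimal sketch of the integration step, assuming three hypothetical source systems that encode gender differently and report lengths in different units. The source names, the mapping tables and the reading of "5.8 feet" as decimal feet are assumptions for illustration only.

```python
# Per-source code tables (hypothetical): each operational system encodes gender differently.
GENDER_CODES = {
    "source_a": {"Male": "M", "Female": "F"},
    "source_b": {"1": "M", "0": "F"},
    "source_c": {"X": "M", "Y": "F"},
}

# Conversion factors to inches, the unit assumed for the warehouse.
INCHES_PER_UNIT = {"cm": 1 / 2.54, "feet": 12.0, "m": 100 / 2.54}


def integrate_gender(source, raw_value):
    """Map a source-specific gender code to the unified warehouse code."""
    return GENDER_CODES[source][str(raw_value)]


def to_inches(value, unit):
    """Convert a length from a source unit to inches."""
    return value * INCHES_PER_UNIT[unit]


print(integrate_gender("source_b", 1))
print(round(to_inches(134, "cm"), 1))
```

Keeping one code table per source matters because the same raw value (for example "1") may mean different things in different operational systems.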
3. Time Variant:- All the data in the data
warehouse is identified with a
particular time period, e.g.
- balance of an account on a given date
- total sale of a specific item on a date
Thus a data warehouse may have multiple tuples for the
same account, indicating the various transactions on it
to date.
4. Non-volatile:- Data is stable in the data
warehouse. More data is added but data
is never removed.
Operational DBMS: Insert, Update, Delete, Replace
Data Warehouse: Load, Access
This definition given by Bill Inmon remained
reasonably accurate for almost 10 years.
Accepted Changes in Definition
Nowadays
▪ A DW can be volatile, because it may otherwise have to
hold multiple terabytes (1 terabyte = 2^40 bytes) of data.
▪ A DW has become more general. It can have data about
more than one subject. A single-subject DW is
called a “Data Mart”, and a DW now covers the enterprise.
▪ Only a limited period of history (e.g. 3 years) is kept in
the DW, and older tuples are automatically “rolled off”
(i.e. tuples are deleted based on dates).
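Rolling off old tuples can be sketched as a date filter. This is only an illustration: the row layout, the field name sale_date and the 3-year default are assumptions, and a real warehouse would normally do this with partition drops or bulk deletes.

```python
from datetime import date


def roll_off(rows, today, keep_years=3):
    """Return only the rows whose date falls inside the retention window."""
    try:
        cutoff = today.replace(year=today.year - keep_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        cutoff = today.replace(year=today.year - keep_years, day=28)
    return [row for row in rows if row["sale_date"] >= cutoff]


# Hypothetical fact rows.
sales = [
    {"item": "shirt", "sale_date": date(2019, 5, 1)},
    {"item": "jeans", "sale_date": date(2022, 8, 15)},
    {"item": "scarf", "sale_date": date(2023, 1, 10)},
]
kept = roll_off(sales, today=date(2024, 1, 1))
print([row["item"] for row in kept])
```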
What is “Data Warehousing”?
• It is a set of hardware and software components that
can be used to better analyze the massive
amounts of data that companies are
accumulating, in order to make better business
decisions.
Data warehousing is not just the data in the data
warehouse; the term also covers other things, like
- Architecture
- Data loaders
- Query and analysis tools
[Figure: General architecture of a DW. Branches B1 … Bn, with
schemas S1 … Sn, feed the Data Loaders, which populate a
unified-schema (un-normalized) database used by the Analytic Tools.]
Bi - Branches
Si - Schemas
Issues in building a Data Warehouse
1. When and how to gather data?
Based on this we have two architectures
of DW.
a. Source-driven architecture
b. Destination-driven architecture
In a source-driven architecture, the data sources
transmit new information to the DW either
continually (as transaction processing
takes place) or periodically (e.g. every night).
In a source-driven architecture, the
“sources” are more active than the
destination.
In a destination-driven architecture, the DW
periodically sends requests for new data to the
sources.
In a destination-driven architecture, the
“destination” is more active than the
sources.
[Figure: Data sources 1 … n connected to the DW, drawn once for the
source-driven and once for the destination-driven architecture,
with the active and passive sides marked.]
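The difference between the two architectures is essentially push versus pull. The sketch below uses two hypothetical classes, Source and Warehouse, whose names and methods are invented for illustration and do not correspond to any particular product.

```python
class Warehouse:
    def __init__(self):
        self.rows = []

    def receive(self, new_rows):
        """Source-driven: called by an active source that pushes its changes."""
        self.rows.extend(new_rows)

    def pull_from(self, source):
        """Destination-driven: the warehouse actively asks a passive source."""
        self.rows.extend(source.fetch_new())


class Source:
    def __init__(self, name):
        self.name = name
        self.pending = []

    def record(self, row):
        """A transaction at the source creates a row not yet sent to the DW."""
        self.pending.append(row)

    def fetch_new(self):
        """Hand over the rows not yet sent, then clear the buffer."""
        rows, self.pending = self.pending, []
        return rows

    def push_to(self, warehouse):
        warehouse.receive(self.fetch_new())


dw = Warehouse()
branch1, branch2 = Source("branch1"), Source("branch2")

branch1.record({"branch": "branch1", "amount": 100})
branch1.push_to(dw)      # source-driven: the source initiates the transfer

branch2.record({"branch": "branch2", "amount": 250})
dw.pull_from(branch2)    # destination-driven: the DW initiates the transfer
```

In both cases the same rows end up in the warehouse; what changes is which side decides when the transfer happens.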
2. What schema to use?
❑ Since the various data sources have
different schemas and data models
(network, relational and object-oriented), the DW
has to perform the following tasks
1. Schema integration
2. Data conversion
2. What Schema To use ?
1. Schema Integration:- Finding a unified
and integrated schema suitable for all the
data sources.
2. Data Conversion:- Converting the data into the
unified and integrated schema.
Thus the data stored in a DW is not just a copy of the
data present in the multiple data sources.
3. Data Transformation and Cleansing.
Cleansing:- (means purification or refinement).
It is the task of correcting and preprocessing
the data.
Why is data cleansing needed?
(1) The same attribute value may be written in
different ways at various data sources, e.g.
company name
- I.B.M
- IBM
- International Business Machines
(2) To correct spelling mistakes in
- Names of city / state / country
- Other standard attributes
(3) To correct incorrect entries of
- ZIP codes
- Telephone codes
Data Transformation
Other than cleansing, data may be
transformed for
- Changing units of measure
- Converting data to a different
schema by joining multiple columns, e.g. name:
First-name + Middle-name + Last-name → Name
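Both steps can be sketched briefly: cleansing maps variant spellings to one canonical value, and transformation merges several source columns into one. The lookup table and function names are hypothetical, and real cleansing tools use far richer matching rules.

```python
import re

# Hypothetical lookup of known spellings to one canonical company name.
CANONICAL_NAMES = {
    "ibm": "IBM",
    "international business machines": "IBM",
}


def clean_company(name):
    """Normalize punctuation, spacing and case, then map to the canonical name."""
    key = re.sub(r"[.,]", "", name)
    key = re.sub(r"\s+", " ", key).strip().lower()
    # Unknown names are passed through unchanged (apart from outer whitespace).
    return CANONICAL_NAMES.get(key, name.strip())


def join_name(first, middle, last):
    """Transformation: merge three source columns into one Name column."""
    parts = (first, middle, last)
    return " ".join(part.strip() for part in parts if part and part.strip())


print(clean_company("I.B.M"))
print(join_name("John", "", "Smith"))
```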
4. How to propagate Updates?
- Updates on relations at the data
sources must be propagated to the DW.
- This is essentially the same problem as
“view maintenance” in a DBMS.
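One common way to handle this is incremental maintenance: instead of recomputing a summary from scratch, each source update adjusts it. The sketch below is a simplified illustration; the class SalesSummary and its method names are assumptions, not part of any DBMS.

```python
from collections import defaultdict


class SalesSummary:
    """Summary relation: total units sold per item, maintained incrementally."""

    def __init__(self):
        self.total_units = defaultdict(int)

    def on_insert(self, item, units):
        self.total_units[item] += units

    def on_delete(self, item, units):
        self.total_units[item] -= units

    def on_update(self, item, old_units, new_units):
        # An update is a delete of the old tuple followed by an insert of the new one.
        self.on_delete(item, old_units)
        self.on_insert(item, new_units)


summary = SalesSummary()
summary.on_insert("shirt", 3)
summary.on_insert("shirt", 2)
summary.on_update("shirt", 2, 5)   # the second sale is corrected from 2 to 5 units
summary.on_insert("jeans", 1)
```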
5. What Data To Summarize?
- Raw data generated online is too large
- It requires more space to store
(impracticable)
- It requires more time to answer queries
on it.
However, many queries can be answered
using a summary of the data (data obtained
by aggregation on relations) rather than by
maintaining big relations.
Example of aggregation / data summary
- Instead of storing data about every
sale of clothing, we can store total
sales by item-name and size.
ETL tasks:- the different steps involved in
getting data into the DW are Extraction (getting
data from the various data sources),
Transformation and Load.
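The clothing example can be sketched as a group-by in plain Python. The sample rows are made up; the point is only that the summary keeps one total per (item-name, size) pair instead of one row per sale.

```python
from collections import Counter

# Hypothetical raw sales: (item_name, size, units_sold), one row per sale.
sales = [
    ("shirt", "M", 2),
    ("shirt", "M", 1),
    ("shirt", "L", 4),
    ("jeans", "M", 3),
]

# Summary: total units per (item_name, size).
summary = Counter()
for item_name, size, units in sales:
    summary[(item_name, size)] += units

for key, total in sorted(summary.items()):
    print(key, total)
```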
Data Warehouse Schemas
❑ For what purpose are DBMS
schemas designed?
Consistency,
Avoiding redundancy,
Modification of data,
Ability to represent data…
Note the difference
❑ For what purpose are DW schemas
designed?
1. Data analysis
2. To support interactive analysis of
summaries
Data Warehouse Schema
❑ Data present in a DW is modeled as
“multidimensional data”.
Data that can be modeled as “dimension
attributes” and “measure attributes” is
called “multidimensional data”.
Measure Attribute:- An attribute that can be
measured and aggregated,
e.g. number of units sold, balance, etc. are
measure attributes.
Data Warehouse Schema
Dimension Attributes:- are those
attributes that define the dimensions on
which measure attributes and
summaries are viewed.
Sales = (item_name, color, size, units_sold)
Dimension attributes: item_name, color, size
Measure attribute: units_sold
Interpretation Of Dimension and
Measure Attributes
[Figure: Data visualization. A cube with axes item name, color
(e.g. yellow, blue) and size (e.g. XL); each cell holds the units
sold (e.g. 400, 100, 150 and 300 units).]
Fact Tables
• A DW contains a set of “fact tables”,
which are the tables containing
“multidimensional data”.
Facts about fact tables
- They are very large tables
- e.g. a table storing sales information of a
retail store, with one tuple for each purchase
made by a customer.
EXAMPLE OF A FACT TABLE
• The Sales table includes
Dimension attributes:
• Item_id
• DateOfPurchase
• Customer_id
• Store_id
Measure attributes:
• NoOfItemsSold
• PriceOfItem
To minimize storage requirements, the dimension
attributes used in the fact table are foreign keys that
reference the primary keys of other tables, called
dimension tables.
Data Warehouse Star Schema
1. Star schema
2. Snowflake schema
A star schema is a data warehouse
schema containing a fact table and
multiple dimension tables.
STAR SCHEMA
Sales (fact table) = (Item_id, Store_id, Customer_id, Date, Number, Price)
Item_info = (Item_id, Item_name, Color, Size, Category)
Store = (Store_id, City, State, Country)
Customer = (Customer_id, Name, Street, City, State)
Date_info = (Date, Day, Month, Year)
Characteristics of star schema
1. Contains only one fact table and
multiple dimension tables.
2. The primary key of the fact table is a composite
key made of all the dimension
attributes present in it.
3. The fact table may include a level attribute (e.g. item
1 sold at district, regional and state
level).
4. A single fact table contains detail
data, such as sales at a store, of an
item k, on date xyz.
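A star schema and a typical query against it can be sketched with Python's built-in sqlite3 module. This is a reduced illustration under stated assumptions: only two dimension tables are created, the column sale_date stands in for the Date attribute, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_info (
    item_id INTEGER PRIMARY KEY, item_name TEXT, color TEXT, size TEXT, category TEXT);
CREATE TABLE store (
    store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
-- Fact table: dimension attributes are keys of the dimension tables,
-- and together they form the composite primary key.
CREATE TABLE sales (
    item_id   INTEGER REFERENCES item_info(item_id),
    store_id  INTEGER REFERENCES store(store_id),
    sale_date TEXT,
    number    INTEGER,
    price     REAL,
    PRIMARY KEY (item_id, store_id, sale_date));
INSERT INTO item_info VALUES (1, 'shirt', 'blue', 'M', 'clothing');
INSERT INTO item_info VALUES (2, 'jeans', 'black', 'L', 'clothing');
INSERT INTO store VALUES (10, 'Pune', 'Maharashtra', 'India');
INSERT INTO store VALUES (20, 'Mumbai', 'Maharashtra', 'India');
INSERT INTO sales VALUES (1, 10, '2024-01-05', 3, 20.0);
INSERT INTO sales VALUES (1, 20, '2024-01-05', 2, 20.0);
INSERT INTO sales VALUES (2, 10, '2024-01-06', 1, 45.0);
""")

# One join per dimension table, then aggregate the measure attribute.
rows = conn.execute("""
    SELECT i.item_name, s.city, SUM(f.number) AS units
    FROM sales AS f
    JOIN item_info AS i ON i.item_id = f.item_id
    JOIN store AS s ON s.store_id = f.store_id
    GROUP BY i.item_name, s.city
    ORDER BY i.item_name, s.city
""").fetchall()
print(rows)
```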
How to design a star schema?
1. Find a unified schema by considering the
schemas at all the data sources.
2. Identify the measure and dimension
attributes in each table, along with their
primary keys.
3. Prepare a single fact table by collecting the
primary keys of all the tables plus some
additional attributes (e.g. level, total
sale, etc.).
4. Draw the schema diagram of the above
schema.
5. Before actually loading the data, do the
following
1. Data transformation
2. Data cleansing
3. Adding a time unit as part of the key
TUTORIAL 1
Q.1. Design a star schema for Olympic events.
-- Consider the particular example of attendance at Olympic events.
Facts are the numbers attending and the value of ticket sales. Dimensions
include Olympiad (year of the Olympics), venue, sport, type (heat
(common match), semifinal, final) and men's / women's. Venues
are classified by location and type of building into central
enclosed, central open and remote. Sports are subdivided into
events.
The following is a sample of a report representing
attendance at various events. (A page will be given.) Do the
following
a) Construct a fact table for this Olympic event.
b) What is the key of the fact table?
c) Design a star schema by using the fact table designed in a)
and using dimension tables.
[Figure: Star schema for Olympic events]
Fact = (Olympiad, Venue, Sport_event, Gender, Attendance, Ticket sale)
Olympiad = (Olympiad, city, Organizing committee, Contact address)
Venue = (Venue, Location, Region)
Sports = (Sport_event, Event_class, Subsport, Sporting-federation)
Gender = (Gender)
Drawbacks of star schema
1. Addition and deletion of levels in the
hierarchy will require physical
modification of the fact table
Fact = (store_key, store, district, region, zone, total_sale, units_sold)
2. Since dimension tables are unnormalized,
a star schema requires more space
Snowflake schema
❑ A snowflake is a branching crystal (flake) of snow
formed during snowfall.
Snowflake schema
❑ It is a variant of the star schema where some
dimension tables are normalized, by
splitting the data into additional tables.
❑ The resulting schema diagram looks
like a “snowflake”, hence the name
snowflake schema.
❑ The following partial schema diagram
shows an example of a snowflake
schema.
Fact = (…, Location key, …)
Location = (Location key, Street, City key)
City = (City key, City name, State, Country)
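The split of a flat location dimension into Location and City tables can be sketched as follows. The function name, the generated city keys and the sample rows are assumptions for illustration.

```python
def snowflake_location(rows):
    """Split flat (location_key, street, city, state, country) rows
    into a Location table and a City table linked by a city key."""
    city_keys = {}
    city_table = []
    location_table = []
    for location_key, street, city, state, country in rows:
        city_id = (city, state, country)
        if city_id not in city_keys:
            # First time this city is seen: assign the next surrogate key.
            city_keys[city_id] = len(city_keys) + 1
            city_table.append((city_keys[city_id], city, state, country))
        location_table.append((location_key, street, city_keys[city_id]))
    return location_table, city_table


# Hypothetical un-normalized dimension rows: city, state and country repeat.
flat = [
    (1, "MG Road", "Pune", "Maharashtra", "India"),
    (2, "FC Road", "Pune", "Maharashtra", "India"),
    (3, "Marine Drive", "Mumbai", "Maharashtra", "India"),
]
location_table, city_table = snowflake_location(flat)
```

After the split, the state and country of Pune are stored once in the City table instead of once per street, at the cost of one extra join at query time.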
Example of Snowflake Schema
Sales Fact Table = (time_key, item_key, branch_key, location_key;
                    measures: units_sold, dollars_sold, avg_sales)
time = (time_key, day, day_of_the_week, month, quarter, year)
item = (item_key, item_name, brand, type, supplier_key)
supplier = (supplier_key, supplier_type)
branch = (branch_key, branch_name, branch_type)
location = (location_key, street, city_key)
city = (city_key, city, state_or_province, country)
Difference
No  Star schema                        Snowflake schema
1   Dimension tables are               Some dimension tables are
    un-normalized                      normalized
2   Requires more space due to         Requires less space since there
    redundant data                     is less redundancy
3   Query evaluation cost is lower     Query evaluation cost is higher
    due to fewer join operations       due to more joins
4   Simple and commonly used           More difficult and less common
Q.2. Design a snowflake schema for the
partial star schema shown below, by
decomposing the dimension tables into 3NF.
Assume that the following functional
dependencies hold on the customer and
store dimensions.
Fact = (Customer id, Store id, …)
Customer = (Customer id, Name, Address, Phone, location)
Store = (Store id, Slocation id, Owner, City, State, country)
Functional dependencies
customerid → name, address
customerid → phone
phone → location
storeid → owner, slocationid
slocationid → city, state
state → country
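When working with functional dependencies like these, it can help to compute attribute closures, which show what each key determines and where transitive dependencies (the ones 3NF removes) occur. The sketch below is the standard closure algorithm; the function name and the tuple encoding of the dependencies are illustrative choices.

```python
def closure(attributes, dependencies):
    """Return the set of attributes determined by `attributes` under the given FDs."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in dependencies:
            # If the left side is already determined, the right side is too.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# The functional dependencies listed above, as (left side, right side) pairs.
FDS = [
    (("customerid",), ("name", "address")),
    (("customerid",), ("phone",)),
    (("phone",), ("location",)),
    (("storeid",), ("owner", "slocationid")),
    (("slocationid",), ("city", "state")),
    (("state",), ("country",)),
]

print(sorted(closure({"customerid"}, FDS)))
print(sorted(closure({"storeid"}, FDS)))
```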
Fact Constellation Schema
1. Constellation means “group of stars”.
2. A star schema handles only one fact table
and thus only one subject. When we
have to handle multiple subjects,
multiple fact tables must be used.
3. A fact constellation schema allows more
than one fact table, with the fact tables sharing
several dimension tables.
General example of fact constellation
[Figure: Two fact tables F1 and F2 linked to dimension tables
D1, D2 and D3, some of which are shared.]
Example of Fact Constellation
Sales Fact Table = (time_key, item_key, branch_key, location_key;
                    measures: units_sold, dollars_sold, avg_sales)
Shipping Fact Table = (time_key, item_key, shipper_key, from_location, to_location;
                       measures: dollars_cost, units_shipped)
time = (time_key, day, day_of_the_week, month, quarter, year)
item = (item_key, item_name, brand, type, supplier_type)
branch = (branch_key, branch_name, branch_type)
location = (location_key, street, city, province_or_state, country)
shipper = (shipper_key, shipper_name, …)
Q.3. Draw the fact constellation schema
diagram for the following tables by identifying
the fact and dimension tables.
Sales = (time_key, item_key, branch_key,
location_key, dollars_sold, units_sold)
Shipping = (item_key, time_key, shipper_key,
from_location, to_location, dollars_cost,
units_shipped)
Time = (time_key, day_of_week, month,
quarter, year)
Branch = (branch_key, branch_name,
branch_type)
Location = (location_key, street, city, country)
Item = (item_key, item_name, brand, type,
supplier)
Shipper = (shipper_key, shipper_name,
location_key, shipper_type)
Q.3. a) How many subjects does this fact
constellation schema handle? Why?
b) How many fact tables are there?
Which ones?
c) Which tables are shared by all the
fact tables?
Difference between OLTP and DW system
Feature                        OLTP                  DW
Indexes                        Few                   Many
Joins                          Many                  Few
Duplicated data                Less                  More
Derived data and aggregation   Rare                  Common
Database modification          One tuple at a time   Bulk
Data Cubes and OLAP
▪ A DW is modeled by multidimensional
database structures, where each dimension
corresponds to one or more attributes and each cell
stores the value of an aggregate measure.
▪ Thus the actual structure of a DW may be a
relational data store or a multidimensional
data cube.
[Figure: Data cube]
[Figure: Browsing a data cube]
On-Line Analytic Processing (OLAP)
• A relational DBMS holds 2-dimensional
data spread over rows and columns.
OLTP (On-Line Transaction Processing)
is therefore the natural way to
use a DBMS.
• A DW is a multidimensional structure,
so OLAP is the proper way to use a DW.
• OLAP is a set of operations that
aggregate data along various
dimensions, to present the data at
different levels of detail.
Difference between OLTP and OLAP
No  Feature                     OLTP                              OLAP
1   Characteristic              Operational processing            Informational processing
2   Orientation                 Transaction                       Analysis
3   Users                       Clerk, DBA, DB professionals      Knowledge workers (managers, executives, analysts)
4   Function                    Day-to-day operation              Long-term decision support
5   DB design                   ER based, application oriented    Star, snowflake, subject oriented
6   Data                        Current                           Historical
7   Summarization               Highly detailed                   Summarized
8   View                        Flat relational (2-D)             Multidimensional
9   Unit of work                Simple transaction                Complex query
10  Access                      Read/Write                        Mostly read
11  Focus                       Data in                           Information out
12  Operations                  Index on PK                       Lots of scans
13  Number of records accessed  Tens                              Millions
14  Number of users             Thousands                         Hundreds
15  DB size                     100 MB to GB                      100 GB to TB
16  Metric                      Transaction throughput            Response time
Concept Hierarchy
• A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to
higher-level, more general concepts.
• The next figure shows a concept hierarchy
for location.
[Figure: Concept hierarchy for location.
Country → State 1 … State k → District 1 … District m → City 1 … City n]
Concept Hierarchies
• Many concept hierarchies are
implicit within the database schema.
• Location = (id, street, city, state,
country)
• These attributes are related by a total
order, like street < city < state < country,
or by a partial order forming a lattice.
Total and Partial Concept Hierarchies
Total order: street < city < state < country
Partial order: day < week < year and day < month < quarter < year
Use of concept Hierarchy
• In the multidimensional model, data is
organized into multiple dimensions, and each
dimension contains multiple levels of
abstraction defined by concept
hierarchies.
• Organizing data in concept
hierarchies gives users the
flexibility to view data from different
perspectives.
OLAP OPERATIONS
1. Roll-Up (Drill-Up)
2. Drill-Down
3. Slice and Dice
4. Pivot (Rotate)
Roll-Up:
Roll-up is also called “drill-up” by some
vendors. It performs aggregation on a data cube
by climbing up a concept hierarchy for a
dimension or by dimension reduction.
Roll-Up on Date dimension
[Figure: Cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum),
Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum).
The highlighted cell is the total annual sales of TV in U.S.A.]
Roll-Up on dimension reduction
• When roll-up is performed by
dimension reduction, one or more
dimensions are removed from the
given data cube.
▪ For example, in a sales cube containing
only the two dimensions location and time,
aggregation can be done on location alone
rather than on both location and time.
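Both kinds of roll-up can be sketched on a tiny cube stored as a Python dictionary. The city-to-country mapping, the cell values and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada"}

# Base cube cells: (city, quarter) -> units sold.
cube = {
    ("Chicago", "Q1"): 10, ("Chicago", "Q2"): 20,
    ("New York", "Q1"): 5,
    ("Toronto", "Q1"): 7, ("Toronto", "Q2"): 3,
}


def roll_up_location(cells):
    """Climb the hierarchy: aggregate cities into countries, keep the time dimension."""
    result = defaultdict(int)
    for (city, quarter), units in cells.items():
        result[(CITY_TO_COUNTRY[city], quarter)] += units
    return dict(result)


def drop_time(cells):
    """Dimension reduction: aggregate over time, leaving only the location dimension."""
    result = defaultdict(int)
    for (location, _quarter), units in cells.items():
        result[location] += units
    return dict(result)


by_country = roll_up_location(cube)
by_city_only = drop_time(cube)
```

Drill-down is the reverse direction: it needs the more detailed cells (here, the per-city values), which is why detailed data must still be available somewhere.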
[Figure: Roll-up by dimension reduction]
Extreme Roll-Up
Aggregation along n-1 dimensions of an n-dimensional cube
Drill-Down
• It is the reverse of roll-up. It can be realized
by stepping down a concept hierarchy for
a dimension (e.g. from country to city), or
• by introducing a new dimension.
Slice and Dice
• The slice operation performs a selection
on one dimension of the given
cube, resulting in a sub-cube.
[Figure: Slice operation]
Dice
• Dicing refers to range selection in
multiple dimensions.
• The dice operation performs a selection on two
or more dimensions of a given cube and
produces a new sub-cube.
[Figure: Dicing operation]
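Slice and dice can be sketched as filters over the cells of a cube. The 3-dimensional cube, the dimension names and the helper functions below are hypothetical; this version keeps the sliced dimension in the result for simplicity, whereas many tools drop it.

```python
# Hypothetical 3-D cube: (city, quarter, item) -> units sold.
cube = {
    ("Toronto", "Q1", "TV"): 4,
    ("Toronto", "Q2", "TV"): 6,
    ("Toronto", "Q1", "PC"): 2,
    ("Chicago", "Q1", "TV"): 9,
    ("Chicago", "Q3", "PC"): 5,
}
DIMENSIONS = ("city", "quarter", "item")


def dice(cells, **allowed):
    """Keep cells whose coordinate lies in the allowed set for every named dimension."""
    def keep(coords):
        return all(
            coords[DIMENSIONS.index(dim)] in values
            for dim, values in allowed.items()
        )
    return {coords: units for coords, units in cells.items() if keep(coords)}


def slice_(cells, dimension, value):
    """Slice: a selection on exactly one dimension."""
    return dice(cells, **{dimension: {value}})


q1_slice = slice_(cube, "quarter", "Q1")
sub_cube = dice(cube, city={"Toronto"}, quarter={"Q1", "Q2"})
```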
Pivot (Rotate)
• Pivot (rotate) is a visualization
operation that rotates the data axes in
order to provide an alternative
presentation of the data.
[Figure: Before rotating]
[Figure: After rotating]
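For a 2-D view, pivoting amounts to swapping the row and column axes. A minimal sketch, assuming a small cross-tab held in a dictionary whose layout is invented for illustration:

```python
# Hypothetical 2-D view: rows are quarters, columns are items.
table = {
    "rows": ["Q1", "Q2"],
    "cols": ["TV", "PC"],
    "cells": [[4, 2],
              [6, 1]],
}


def pivot(view):
    """Rotate the axes: rows become columns and columns become rows."""
    return {
        "rows": view["cols"],
        "cols": view["rows"],
        "cells": [list(column) for column in zip(*view["cells"])],
    }


rotated = pivot(table)
```

The data values are unchanged; only their presentation differs, which is why pivot is described as a visualization operation.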
Three-Tier Data Warehouse
Architecture
• A DW can be designed using three
tiers.
1. Top tier:- Front-end tools for query,
reporting, analysis and data mining (client)
2. Middle tier:- OLAP server
3. Bottom tier:- Warehouse database
server
Multi-Tiered DW Architecture
[Figure: Multi-tiered DW architecture.
Bottom tier: operational DBs and other sources are extracted,
transformed, loaded and refreshed into the data warehouse and
data marts, supported by metadata and a monitor & integrator.
Middle tier: OLAP server (OLAP engine).
Top tier: front-end tools for analysis, query, reports and data mining.]
BOTTOM-tier:- DW database server
• Data from operational DBs and other
sources is extracted using application
programs called “gateways” (DB
interfaces).
• A gateway is supported by the underlying
DBMS, to provide DB connectivity.
• Examples are ODBC, OLE DB (Object
Linking and Embedding, Database) and JDBC.
MIDDLE-tier:- OLAP server
• An OLAP server is typically implemented
as one of
1. Relational OLAP (ROLAP):-
an extended relational DBMS that
maps operations on multidimensional
data to standard relational operations.
2. Multidimensional OLAP (MOLAP):-
a special-purpose server that
directly implements multidimensional
data and operations.
3. Hybrid OLAP (HOLAP):-
a combination of both ROLAP and
MOLAP.
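The ROLAP idea of mapping cube operations onto relational operations can be sketched with Python's built-in sqlite3 module: a roll-up becomes a GROUP BY on the coarser attribute and a slice becomes a WHERE clause. The table layout and rows are invented for illustration and do not reflect how any specific ROLAP server is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, city TEXT, quarter TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("USA", "Chicago", "Q1", 10),
        ("USA", "Chicago", "Q2", 20),
        ("USA", "New York", "Q1", 5),
        ("Canada", "Toronto", "Q1", 7),
    ],
)

# Roll-up from city to country: GROUP BY the coarser attribute.
roll_up = conn.execute(
    "SELECT country, quarter, SUM(units) FROM sales "
    "GROUP BY country, quarter ORDER BY country, quarter"
).fetchall()

# Slice on quarter = 'Q1': a plain selection.
slice_q1 = conn.execute(
    "SELECT city, units FROM sales WHERE quarter = 'Q1' ORDER BY city"
).fetchall()

print(roll_up)
print(slice_q1)
```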
ROLAP vs MOLAP vs HOLAP
ROLAP                        MOLAP                        HOLAP
Relational tables,           Data cubes                   Combination of tables
materialized views                                        and data cubes
Higher scalability           Faster computation           Both features
Easy to build by             Difficult to build since     Difficult to build since
extending a relational DB    it starts from scratch       it starts from scratch
Top-tier:- Client
• Query and reporting tools
• Analysis tools
• Data mining tools
(prediction, trend analysis)
Data Warehousing - Interview Questions
• Q: Define data warehouse?

• A : Data warehouse is a subject oriented, integrated, time-variant,
and nonvolatile collection of data that supports management's
decision-making process.
• Q: What does subject-oriented data warehouse signify?
• A : Subject oriented signifies that the data warehouse stores the
information around a particular subject such as product, customer,
sales, etc.
• Q: List any five applications of data warehouse.
• A : Some applications include financial services, banking services,
consumer goods, retail sectors, controlled manufacturing.
• Q: What do OLAP and OLTP stand for?
• A : OLAP is an acronym for Online Analytical Processing and
OLTP is an acronym for Online Transaction Processing.
• Q: What is the very basic difference between data warehouse
and operational databases?
• A : A data warehouse contains historical information that is made
available for analysis of the business whereas an operational
database contains current information that is required to run the
business.
• Q: List the schemas that a data warehouse system can
implement.
• A : A data Warehouse can implement star schema, snowflake
schema, and fact constellation schema.
• Q: What is Data Warehousing?
• A : Data Warehousing is the process of constructing and using the
data warehouse.
• Q: List the processes that are involved in Data Warehousing.
• A : Data Warehousing involves data cleaning, data integration and
data consolidation.
• Q: List the functions of data warehouse tools and utilities.
• A : The functions performed by data warehouse tools and utilities
are Data Extraction, Data Cleaning, Data Transformation, Data
Loading and Refreshing.
• Q: What do you mean by Data Extraction?
• A : Data extraction means gathering data from multiple
heterogeneous sources.
• Q: Define metadata?
• A : Metadata is simply defined as data about data. In other words,
we can say that metadata is the summarized data that leads us to
the detailed data.
• Q: What does the metadata repository contain?
• A : The metadata repository contains the definition of the data warehouse,
business metadata, operational metadata, data for mapping from the
operational environment to the data warehouse, and the algorithms for
summarization.
• Q: How does a Data Cube help?
• A : Data cube helps us to represent the data in multiple dimensions.
The data cube is defined by dimensions and facts.
• Q: Define dimension?
• A : The dimensions are the entities with respect to which an
enterprise keeps the records.
• Q: Explain data mart.
• A : Data mart contains the subset of organization-wide data. This
subset of data is valuable to specific groups of an organization. In
other words, we can say that a data mart contains data specific to a
particular group.
• Q: List the phases involved in the data warehouse delivery
process.
• A : The stages are IT strategy, Education, Business Case Analysis,
technical Blueprint, Build the version, History Load, Ad hoc query,
Requirement Evolution, Automation, and Extending Scope.
• Q: Define load manager.
• A : A load manager performs the operations required to extract and
load the data. The size and complexity of the load manager vary
between specific solutions from data warehouse to data warehouse.
• Q: Define the functions of a load manager.
• A : A load manager extracts data from the source system, fast-loads
the extracted data into a temporary data store, and performs simple
transformations into a structure similar to the one in the data
warehouse.
• Q: Define a warehouse manager.
• A : The warehouse manager is responsible for the warehouse
management process. It consists of third-party system software,
C programs and shell scripts. The size and complexity of the
warehouse manager vary between specific solutions.
• Q: Define the functions of a warehouse manager.
• A : The warehouse manager performs consistency and referential
integrity checks, creates indexes, business views and partition
views against the base data, transforms and merges the source data
from the temporary store into the published data warehouse, backs
up the data in the data warehouse, and archives the data that has
reached the end of its captured life.
• Q: What is Summary Information?
• A : Summary Information is the area in the data warehouse where the
predefined aggregations are kept.
• Q: What is the Query Manager responsible for?
• A : The Query Manager is responsible for directing the queries to the
suitable tables.
• Q: List the types of OLAP server
• A : There are four types of OLAP servers, namely Relational OLAP,
Multidimensional OLAP, Hybrid OLAP, and Specialized SQL
Servers.
• Q: Which one is faster, Multidimensional OLAP or Relational
OLAP?
• A : Multidimensional OLAP is faster than Relational OLAP.
• Q: List the functions performed by OLAP.
• A : OLAP performs functions such as roll-up, drill-down, slice, dice,
and pivot.
• Q: How many dimensions are selected in Slice operation?
• A : Only one dimension is selected for the slice operation.
• Q: How many dimensions are selected in dice operation?
• A : For dice operation two or more dimensions are selected for a
given cube.
• Q: How many fact tables are there in a star schema?
• A : There is only one fact table in a star Schema.
• Q: What is Normalization?
• A : Normalization splits up the data into additional tables.
• Q: Out of star schema and snowflake schema, whose
dimension table is normalized?
• A : Snowflake schema uses the concept of normalization.
• Q: What is the benefit of normalization?
• A : Normalization helps in reducing data redundancy.
• Q: Which language is used for schema definition?
• A : Data Mining Query Language (DMQL) is used for Schema
Definition.
• Q: What language is the base of DMQL?
• A : DMQL is based on Structured Query Language (SQL).
• Q: What are the reasons for partitioning?
• A : Partitioning is done for various reasons such as easy
management, to assist backup recovery, to enhance performance.
• Q: What kind of costs are involved in Data Marting?
• A : Data Marting involves hardware & software cost, network
access cost, and time cost.
THANK YOU
• http://www.tutorialspoint.com/dwh/dwh_quick_guide.htm
