Data Warehouse
CSP002N-week2 25
Cube Computation: ROLAP-Based Method
Efficient cube computation methods
ROLAP-based cubing algorithms
Array-based cubing algorithm
Bottom-up computation method
ROLAP-based cubing algorithms
Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related
tuples
Grouping is performed on some subaggregates as a partial
grouping step
Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
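The last point above can be illustrated with a minimal sketch. The fact rows, the column order (city, item, amount), and the helper name `group_sum` are all assumptions made up for this example; hash-based grouping stands in for whatever sorting or hashing strategy a real ROLAP engine would choose.

```python
from collections import defaultdict

# Hypothetical base fact table: (city, item, amount). Invented for illustration.
FACTS = [
    ("Vancouver", "TV", 100),
    ("Vancouver", "TV", 50),
    ("Vancouver", "Phone", 30),
    ("Toronto", "TV", 70),
    ("Toronto", "Phone", 20),
]


def group_sum(rows, key_positions):
    """Hash-based grouping: sum the last field of each row per grouping key."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        totals[key] += row[-1]
    return dict(totals)


# Cuboid (city, item) is computed from the base fact table.
by_city_item = group_sum(FACTS, (0, 1))

# Cuboid (city) is computed from the smaller (city, item) cuboid,
# not from the base fact table.
city_item_rows = [(city, item, total) for (city, item), total in by_city_item.items()]
by_city = group_sum(city_item_rows, (0,))

# The apex cuboid (grand total) is computed from the (city) cuboid.
apex = sum(by_city.values())

print(by_city_item)
print(by_city)
print(apex)
```

Reusing a finer-grained aggregate works here because SUM is distributive; the same shortcut would not apply directly to a measure such as the median.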
Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector; bit operations are fast
The length of the bit vector equals the number of records in the base table
The i-th bit is set if the i-th row of the base table has the value
for the indexed column
Not suitable for high-cardinality domains
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
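As a rough sketch of how such an index could be built and queried, the following uses the base table from this slide and stores each bit vector as a Python integer. The function names (`build_bitmap_index`, `bit_string`, `matching_rows`) are hypothetical, and a production system would use packed or compressed bitmaps rather than plain integers.

```python
# Base table from the slide: (Cust, Region, Type).
BASE_TABLE = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Return {value: bitmap}; bit i of the bitmap is set if row i holds the value."""
    index = {}
    for i, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << i)
    return index


def bit_string(bitmap, n_rows):
    """Render a bitmap in record order (RecID 1 first), as in the slide's tables."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(n_rows))


def matching_rows(bitmap, n_rows):
    """Return the 0-based positions of the rows whose bit is set."""
    return [i for i in range(n_rows) if (bitmap >> i) & 1]


region_index = build_bitmap_index(BASE_TABLE, 1)
type_index = build_bitmap_index(BASE_TABLE, 2)

# Region = 'Europe' AND Type = 'Dealer' becomes a single bitwise AND.
hits = region_index["Europe"] & type_index["Dealer"]
customers = [BASE_TABLE[i][0] for i in matching_rows(hits, len(BASE_TABLE))]

print(bit_string(region_index["Asia"], len(BASE_TABLE)))  # 10100
print(customers)  # ['C2', 'C5']
```

The selection never touches the Region or Type columns at query time; only the two bit vectors are combined, which is why the approach pays off when each column has few distinct values.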
Indexing OLAP Data: Bitmap Index
The purpose of constructing OLAP index structures is
to speed up query processing in data cubes.
Bitmap indexing is especially useful for low-cardinality
domains because comparison, join, and aggregation
operations are then reduced to bit arithmetic, which
substantially reduces the processing time. Bitmap
indexing also leads to significant reductions in space and I/O,
since a string of characters can be represented by a
single bit.
Plain bitmap indexing is not suitable for high-cardinality domains.
For such domains, the method can be
adapted using compression techniques.
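The slide does not name a specific compression technique. As one possible illustration, the sketch below applies run-length encoding, which suits the long runs of zeros that appear in each bit vector when a column has many distinct values. The function names and the sample vector are assumptions made for this example.

```python
from itertools import groupby


def rle_encode(bits):
    """Compress a bit string such as '0001000' into [(bit, run_length), ...]."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand [(bit, run_length), ...] back into the original bit string."""
    return "".join(bit * length for bit, length in runs)


# In a high-cardinality column each value matches few rows,
# so its bit vector is mostly zeros.
sparse_vector = "0" * 12 + "1" + "0" * 7
runs = rle_encode(sparse_vector)

print(runs)  # [('0', 12), ('1', 1), ('0', 7)]
print(rle_decode(runs) == sparse_vector)  # True
```

Here a 20-bit vector is described by three runs; the saving grows with the length of the zero runs, at the cost of decoding work (or run-aware bit operations) at query time.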
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id) where R(R-id, ...) ⋈ S(S-id, ...)
Traditional indices map the values to a list of
record ids
A join index materializes the relational join in a JI file and
speeds up the relational join, a rather costly
operation
In data warehouses, a join index relates the values
of the dimensions of a star schema to rows in
the fact table.
E.g. fact table: Sales, and two dimensions: city
and product
A join index on city maintains, for each
distinct city, a list of R-IDs of the tuples
recording the Sales in the city
Join indices can span multiple dimensions
Indexing OLAP Data: Join Indices
Join index table for location/sales
location      sales_key
Main Street   T57
Main Street   T238
Main Street   T884

Join index table for item/sales
item      sales_key
Sony-TV   T57
Sony-TV   T459

Join index table linking two dimensions for location/item/sales
location      item      sales_key
Main Street   Sony-TV   T57
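The three tables in this figure can be reproduced with a small sketch. The (value, sales_key) pairs are copied from the figure; the function names `build_join_index` and `composite_join_index` are hypothetical, and deriving the two-dimension index by intersecting key lists is just one simple way to obtain it.

```python
from collections import defaultdict

# (dimension value, sales_key) pairs copied from the figure.
LOCATION_SALES = [("Main Street", "T57"), ("Main Street", "T238"), ("Main Street", "T884")]
ITEM_SALES = [("Sony-TV", "T57"), ("Sony-TV", "T459")]


def build_join_index(pairs):
    """Map each dimension value to the list of fact-table keys it joins with."""
    index = defaultdict(list)
    for value, sales_key in pairs:
        index[value].append(sales_key)
    return dict(index)


def composite_join_index(first, second):
    """Link two dimensions: keep the sales keys shared by each pair of values."""
    composite = {}
    for value_a, keys_a in first.items():
        for value_b, keys_b in second.items():
            shared = sorted(set(keys_a) & set(keys_b))
            if shared:
                composite[(value_a, value_b)] = shared
    return composite


location_ji = build_join_index(LOCATION_SALES)
item_ji = build_join_index(ITEM_SALES)
location_item_ji = composite_join_index(location_ji, item_ji)

print(location_ji)       # {'Main Street': ['T57', 'T238', 'T884']}
print(item_ji)           # {'Sony-TV': ['T57', 'T459']}
print(location_item_ji)  # {('Main Street', 'Sony-TV'): ['T57']}
```

With the index in place, the sales tuples for "Sony-TV on Main Street" are found by a lookup rather than by joining the dimension tables with the fact table.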
Indexing OLAP Data: Join Indices
The join index records can identify joinable tuples without
performing costly join operations.
Suppose that there are 360 time values, 100 items, 50
branches, 30 locations, and 100 million sales tuples in the
sales_star data cube. If the sales fact table has recorded sales for
only 30 items, the remaining 70 items will obviously not
participate in joins. If join indices are not used, additional I/Os
have to be performed to bring the joining portions of the fact
table and dimension table together.
Indexing OLAP Data
The purpose of materializing cuboids and constructing
OLAP index structures is to speed up query processing
in data cubes.
To further speed up query processing, the join
indexing and bitmap indexing methods can be
integrated to form bitmapped join indices.
Microsoft SQL Server and Sybase IQ support bitmap
indices. Oracle 8 uses bitmap and join indices.
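A bitmapped join index, as mentioned on this slide, can be sketched by keeping one bit vector over the fact-table rows for each dimension value. The star-schema fragment below (table contents, key values, and the function name `bitmapped_join_index`) is invented for illustration and does not reflect how any of the products named above implement the idea.

```python
# Hypothetical star-schema fragment, invented for illustration.
CITY_DIM = {1: "Vancouver", 2: "Toronto"}   # city_key -> city name
PRODUCT_DIM = {10: "TV", 20: "Phone"}       # product_key -> product name
SALES_FACT = [                              # (city_key, product_key, amount)
    (1, 10, 100),
    (2, 10, 70),
    (1, 20, 30),
    (1, 10, 50),
]


def bitmapped_join_index(fact, fk_position, dimension):
    """Return {dimension value: bitmap}; bit i is set if fact row i joins to that value."""
    index = {name: 0 for name in dimension.values()}
    for i, row in enumerate(fact):
        index[dimension[row[fk_position]]] |= 1 << i
    return index


city_bji = bitmapped_join_index(SALES_FACT, 0, CITY_DIM)
product_bji = bitmapped_join_index(SALES_FACT, 1, PRODUCT_DIM)

# Total sales of TVs in Vancouver: AND the two bitmaps, then read only the matching rows.
hits = city_bji["Vancouver"] & product_bji["TV"]
total = sum(row[2] for i, row in enumerate(SALES_FACT) if (hits >> i) & 1)

print(bin(hits))  # 0b1001
print(total)      # 150
```

The join between the fact table and both dimension tables is replaced by one bitwise AND, combining the benefit of the join index (precomputed join) with that of the bitmap index (cheap bit arithmetic).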
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
Differences among the three tasks
Data Mining
Data mining, a popular technique for
finding interesting and unusual
patterns in data, has also been enabled by
the construction of data warehouses, and
there are claims of enhanced sales through
exploitation of patterns discovered in this
way.
What is association rule mining?
          milk  bread  sugar  butter  cereal  eggs
Basket 1  1     1      0      0       1       0
Basket 2  1     1      1      0       0       1
Basket 3  1     1      0      1       0       0
Basket 4  0     0      1      0       0       1
What is association rule mining? (cont.)
          milk  bread  sugar  butter  cereal  eggs
Basket 1  1     1      0      0       1       0
Basket 2  1     1      1      0       0       1
Basket 3  1     1      0      1       0       0
Basket 4  0     0      1      0       0       1
count     3     3      2      1       1       2
Support (milk)=3
Support (bread)=3
Support (sugar)=2
Support (milk U bread)=3
Support (milk U sugar)=1
Support (milk U bread U sugar)=1
Support (milk U bread U sugar U butter U cereal U eggs)=0
Confidence (A → B) = Support (A U B)/Support (A)
As Confidence (milk → bread)
= Support (milk U bread)/Support (milk) = 3/3 = 100%,
then milk → bread
If Confidence (A → B) >= min_conf, then A → B
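The support and confidence figures on this slide can be checked with a short sketch. The basket matrix is copied from the table above; support is an absolute basket count, as on the slide, and the function names and the `min_conf` value of 0.8 are assumptions chosen for this example.

```python
ITEMS = ["milk", "bread", "sugar", "butter", "cereal", "eggs"]
BASKETS = [
    [1, 1, 0, 0, 1, 0],  # Basket 1
    [1, 1, 1, 0, 0, 1],  # Basket 2
    [1, 1, 0, 1, 0, 0],  # Basket 3
    [0, 0, 1, 0, 0, 1],  # Basket 4
]

# Turn each 0/1 row into the set of items the basket contains.
TRANSACTIONS = [{item for item, bit in zip(ITEMS, row) if bit} for row in BASKETS]


def support(itemset):
    """Number of baskets that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for basket in TRANSACTIONS if items <= basket)


def confidence(antecedent, consequent):
    """Support(A U B) / Support(A); assumes the antecedent occurs in at least one basket."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def rule_holds(antecedent, consequent, min_conf):
    """Accept the rule A -> B if its confidence reaches the threshold."""
    return confidence(antecedent, consequent) >= min_conf


print(support({"milk", "bread"}))               # 3
print(support({"milk", "bread", "sugar"}))      # 1
print(confidence({"milk"}, {"bread"}))          # 1.0
print(rule_holds({"milk"}, {"bread"}, 0.8))     # True
print(rule_holds({"bread"}, {"sugar"}, 0.8))    # False
```

The last line shows the threshold at work: bread → sugar has confidence 1/3, so it is rejected at `min_conf = 0.8`, while milk → bread is accepted.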
How can DM improve your business?
Strategy 1: Placing milk
and bread within close
proximity may further
encourage the sale of
these items together
within single visits to the
store.
How can DM improve your business?
Strategy 2: Placing milk and
bread at opposite ends
of the store may entice
customers who purchase
such items to pick up
other items along the
way.
How can DM improve your business?
Strategy 3: Put these two
items into a package at
reduced price.
Summary
Data warehouse
A subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management's decision-making
process
A multi-dimensional model of a data warehouse
Star schema, snowflake schema, fact constellations
A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
Partial vs. full vs. no materialization
Multiway array aggregation
Bitmap index implementations
Exercises 1
1. Suppose that a data warehouse consists of the three dimensions time,
doctor, and patient, and the two measures count and charge, where
charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for
modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one
of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific
OLAP operations should be performed in order to list the total fee
collected by each doctor in 2000?
(d) To obtain the same list, write an SQL query assuming the data is
stored in a relational database with the schema fee (day, month, year,
doctor, hospital, patient, count, charge).
Exercises 2
2. Suppose that a data warehouse consists of the four
dimensions date, spectator, location, and game, and the
two measures count and charge, where charge is the fare
that a spectator pays when watching a game on a given
date. Spectators may be students, adults, or seniors, with
each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location,
game], what specific OLAP operations should be
performed in order to list the total charge paid by student
spectators at GM_Place in 2000?
Exercises 3
3. A popular data warehouse implementation is to construct a
multidimensional database, known as a data cube.
Unfortunately, this may often generate a huge, yet very
sparse multidimensional matrix.
(a) Present an example illustrating such a huge and sparse
data cube.
(b) Design an implementation method that can elegantly
overcome this sparse matrix problem. Note that you need
to explain your data structures in detail and discuss the
space needed, as well as how to retrieve data from your
structure.
Exercises 4
4. In data warehouse technology, a multidimensional view
can be implemented by a relational database technique
(ROLAP), by a multidimensional database technique
(MOLAP), or by a hybrid database technique (HOLAP).
(a) Briefly describe each implementation technique.
(b) For each technique, explain how the following function may be
implemented: the generation of a data warehouse
(including aggregation).
(c) Which implementation technique do you prefer, and
why?
Exercises 5
5. What are the differences between the three main
types of data warehouse usage: information
processing, analytical processing, and data
mining?