DATA Cube PDF
May 1997
Technical Report
MSR-TR-97-32
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
This paper appeared in Data Mining and Knowledge Discovery 1(1): 29-53 (1997)
1
IBM Research, 500 Harry Road, San Jose, CA. 95120
Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab, and Sub-Totals
Jim Gray
Surajit Chaudhuri
Adam Bosworth
Andrew Layman
Don Reichart
Murali Venkatrao
Frank Pellow1
Hamid Pirahesh2
Technical Report
MSR-TR-96-xx
Microsoft Research
Advanced Technology Division
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
2
IBM Research, 500 Harry Road, San Jose, CA. 95120
Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab, and Sub-Totals3
Abstract: Data analysis applications typically aggregate data across many dimensions looking for anomalies or unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or one-dimensional aggregates. Applications need the N-dimensional generalization of these operators. This paper defines that operator, called the data cube or simply cube. The cube operator generalizes the histogram, cross-tabulation, roll-up, drill-down, and sub-total constructs found in most report writers. The novelty is that cubes are relations. Consequently, the cube operator can be imbedded in more complex non-procedural data analysis programs. The cube operator treats each of the N aggregation attributes as a dimension of N-space. The aggregate of a particular set of attribute values is a point in this space. The set of points forms an N-dimensional cube. Super-aggregates are computed by aggregating the N-cube to lower dimensional spaces. This paper (1) explains the cube and roll-up operators, (2) shows how they fit in SQL, (3) explains how users can define new aggregate functions for cubes, and (4) discusses efficient techniques to compute the cube. Many of these features are being added to the SQL Standard.

... these visualization and data analysis tools represent the dataset as an N-dimensional space. Visualization tools render two and three-dimensional sub-slabs of this space as 2D or 3D objects.

Color and time (motion) add two more dimensions to the display giving the potential for a 5D display. A Spreadsheet application such as Excel is an example of a data visualization/analysis tool that is used widely. Data analysis tools often try to identify a subspace of the N-dimensional space which is “interesting” (e.g., discriminating attributes of the data set).

[Figure: Extract, Visualize, and Analyze & Formulate steps around a Table and a Spread Sheet, with two sample plots, "Size vs Speed" (Size (B) against Access Time (seconds)) and "Price vs Speed" ($/MB against Access Time (seconds)), covering Cache, Main, Secondary, Online Tape, Nearline Tape, and Offline Tape storage.]
3
An extended abstract of this paper appeared in [Gray et.al.]
Data Cube 1
tions use constructs such as histogram, cross-tabulation, subtotals, roll-up and drill-down extensively.

This paper examines how a relational engine can support efficient extraction of information from a SQL database that matches the above requirements of visualization and data analysis. We begin by discussing the relevant features in Standard SQL and some of the vendor-specific SQL extensions. Section 2 discusses why GROUP BY fails to adequately address the requirements. The CUBE and ROLLUP operators are introduced in Section 3, where we also discuss how these operators overcome some of the shortcomings of GROUP BY. Sections 4 and 5 discuss how we can address and compute the cube.

1.1. Relational and SQL Data Extraction

How do traditional relational databases fit into this multi-dimensional data analysis picture? How can 2D flat files (SQL tables) model an N-dimensional problem? Furthermore, how do relational systems support the operations over N-dimensional representations that are central to visualization and data analysis programs? We address each of these two issues in this section. The answer to the first question is that relational systems model N-dimensional data as a relation with N-attribute domains. For example, 4-dimensional (4D) earth temperature data is typically represented by a Weather table (Table 1). The first four columns represent the four dimensions: latitude, longitude, altitude, and time. Additional columns represent measurements at the 4D points such as temperature, pressure, humidity, and wind velocity. Each individual weather measurement is recorded as a new row of this table. Often these measured values are aggregates over time (the hour) or space (a measurement area centered on the point).

Table 1: Weather
Time (UCT)    Latitude    Longitude    Altitude (m)   Temp (c)   Pres (mb)
96/6/1:1500   37:58:33N   122:45:28W   102            21         1009

As mentioned in the introduction, visualization and data analysis tools extensively use dimensionality reduction (aggregation) for better comprehensibility. Often data along the other dimensions that are not included in a “2-D” representation are summarized via aggregation in the form of histogram, cross-tabulation, subtotals etc. In Standard SQL, we depend on aggregate functions and the GROUP BY operator to support aggregation.

The SQL standard [SQL], [Melton, Simon] provides five functions to aggregate the values in a table: COUNT(), SUM(), MIN(), MAX(), and AVG(). For example, the average of all measured temperatures is expressed as:
    SELECT AVG(Temp)
    FROM Weather;
In addition, SQL allows aggregation over distinct values. The following query counts the number of distinct reporting times in the Weather table:
    SELECT COUNT(DISTINCT Time)
    FROM Weather;
Aggregate functions return a single value. Using the GROUP BY construct, SQL can also create a table of many aggregate values indexed by a set of attributes. For example, the following query reports the average temperature for each reporting time and altitude:
    SELECT Time, Altitude, AVG(Temp)
    FROM Weather
    GROUP BY Time, Altitude;

[Figure 2 diagram labels: Grouping Values, Partitioned Table, Sum(), Aggregate Values]
Figure 2: The GROUP BY relational operator partitions a table into groups. Each group is then aggregated by a function. The aggregation function summarizes some column of groups returning a value for each group.
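The three queries above can be tried on any SQL engine. The following is a minimal sketch using Python's standard sqlite3 module; the column types and every sample row except the first (which comes from Table 1) are assumptions made for illustration.

```python
import sqlite3

# Sample rows: only the first is taken from Table 1, the rest are made up.
rows = [
    ("96/6/1:1500", "37:58:33N", "122:45:28W", 102, 21, 1009),
    ("96/6/1:1500", "37:58:33N", "122:45:28W", 102, 23, 1010),
    ("96/6/1:1600", "37:58:33N", "122:45:28W", 102, 25, 1008),
    ("96/6/1:1600", "37:58:33N", "122:45:28W", 500, 19, 960),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Weather (Time TEXT, Latitude TEXT, Longitude TEXT,"
    " Altitude INTEGER, Temp REAL, Pres REAL)"
)
conn.executemany("INSERT INTO Weather VALUES (?, ?, ?, ?, ?, ?)", rows)

# Zero-dimensional aggregates: each query returns a single value.
avg_temp = conn.execute("SELECT AVG(Temp) FROM Weather").fetchone()[0]
distinct_times = conn.execute(
    "SELECT COUNT(DISTINCT Time) FROM Weather"
).fetchone()[0]

# GROUP BY: one aggregate value per (Time, Altitude) group.
by_group = conn.execute(
    "SELECT Time, Altitude, AVG(Temp) FROM Weather"
    " GROUP BY Time, Altitude ORDER BY Time, Altitude"
).fetchall()

print(avg_temp, distinct_times)
for row in by_group:
    print(row)
```

The first two queries collapse the whole table to one value each, while the GROUP BY query returns one row per distinct (Time, Altitude) pair.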
Table 2: SQL Aggregates in Standard Benchmarks.
Benchmark   Queries   Aggregates   GROUP BYs
TPC-A, B    1         0            0
TPC-C       18        4            0
TPC-D       16        27           15
Wisconsin   18        3            2
AS3AP       23        20           2
SetQuery    7         5            1

1.2. Extensions In Some SQL Systems

Beyond the five standard aggregate functions defined so far, many SQL systems add statistical functions (median, standard deviation, variance, etc.), physical functions (center of mass, angular momentum, etc.), financial analysis (volatility, Alpha, Beta, etc.), and other domain-specific functions.

Some systems allow users to add new aggregation functions. The Informix Illustra system, for example, allows users to add aggregate functions by adding a program with the following three callbacks to the database system [Informix]:
Init (&handle): Allocates the handle and initializes the aggregate computation.
Iter (&handle, value): Aggregates the next value into the current aggregate.
value = Final(&handle): Computes and returns the resulting aggregate by using data saved in the handle. This invocation deallocates the handle.
Consider implementing the Average() function. The handle stores the count and the sum, both initialized to zero. When passed a new non-null value, Iter() increments the count and adds the value to the sum. The Final() call deallocates the handle and returns sum divided by count. IBM’s DB2 Common Server [Chamberlin] has a similar mechanism. This design has been added to the Draft Proposed standard for SQL [SQL97].

Red Brick systems, one of the larger UNIX OLAP vendors, add some interesting aggregate functions that enhance the GROUP BY mechanism [Red Brick]:
Rank(expression): Returns the expression’s rank in the set of all values of this domain of the table. If there are N values in the column, and this is the highest value, the rank is N; if it is the lowest value, the rank is 1.
N_tile(expression, n): The range of the expression (over all the input values of the table) is computed and divided into n value ranges of approximately equal population. The function returns the number of the range containing the expression’s value. If your bank account was among the largest 10% then your N_tile(account.balance,10) would return 10. Red Brick provides just N_tile(expression,3).
Ratio_To_Total(expression): Sums all the expressions. Then for each instance, divides the expression instance by the total sum.

To give an example, the following SQL statement
    SELECT Percentile, MIN(Temp), MAX(Temp)
    FROM Weather
    GROUP BY N_tile(Temp,10) as Percentile
    HAVING Percentile = 5;
returns one row giving the minimum and maximum temperatures of the middle 10% of all temperatures.

Red Brick also offers three cumulative aggregates that operate on ordered tables.
Cumulative(expression): Sums all values so far in an ordered list.
Running_Sum(expression,n): Sums the most recent n values in an ordered list. The initial n-1 values are NULL.
Running_Average(expression,n): Averages the most recent n values in an ordered list. The initial n-1 values are NULL.
These aggregate functions are optionally reset each time a grouping value changes in an ordered selection.

2. Problems With GROUP BY:

Certain common forms of data analysis are difficult with these SQL aggregation constructs. As explained next, three common problems are: (1) Histograms, (2) Roll-up Totals and Sub-Totals for drill-downs, (3) Cross Tabulations.

The standard SQL GROUP BY operator does not allow a direct construction of histograms (aggregation over computed categories). For example, for queries based on the Weather table, it would be nice to be able to group times into days, weeks, or months, and to group locations into areas (e.g., US, Canada, Europe,...). If a Nation() function maps latitude and longitude into the name of the country containing that location, then the following query would give the daily maximum reported temperature for each nation.
    SELECT day, nation, MAX(Temp)
    FROM Weather
    GROUP BY Day(Time) AS day,
             Nation(Latitude, Longitude) AS nation;
Some SQL systems support histograms directly but the standard does not4. In standard SQL, histograms are computed indirectly from a table-valued expression which is then aggregated. The following statement demonstrates this SQL92 construct using nested queries.

4 These criticisms led to a proposal to include these features in the draft SQL standard [SQL97].
    SELECT day, nation, MAX(Temp)
    FROM ( SELECT Day(Time) AS day,
                  Nation(Latitude, Longitude) AS nation,
                  Temp
           FROM Weather
         ) AS foo
    GROUP BY day, nation;

A more serious problem, and the main focus of this paper, relates to roll-ups using totals and sub-totals for drill-down reports. Reports commonly aggregate data at a coarse level, and then at successively finer levels. The car sales report in Table 3 shows the idea (this and other examples are based on the sales summary data in the table in Figure 4). Data is aggregated by Model, then by Year, then by Color. The report shows data aggregated at three levels. Going up the levels is called rolling-up the data. Going down is called drilling-down into the data. Data aggregated at each distinct level produces a sub-total.

Table 3.a: Sales Roll Up by Model by Year by Color
Model   Year   Color   Sales by Model     Sales by Model   Sales by Model
                       by Year by Color   by Year
Chevy   1994   black   50
               white   40
                                          90
        1995   black   85
               white   115
                                          200
                                                           290

Table 3.a suggests creating 2^N aggregation columns for a roll-up of N elements. Indeed, Chris Date recommends this approach [Date1]. His design gives rise to Table 3.b.

Table 3.b: Sales Roll-Up by Model by Year by Color as recommended by Chris Date [Date1].
Model   Year   Color   Sales   Sales by Model   Sales by Model
                               by Year
Chevy   1994   black   50      90               290
Chevy   1994   white   40      90               290
Chevy   1995   black   85      200              290
Chevy   1995   white   115     200              290

The representation of Table 3.a is not relational because the empty cells (presumably NULL values) cannot form a key. Representation 3.b is an elegant solution to this problem, but we rejected it because it implies enormous numbers of domains in the resulting tables. We were intimidated by the prospect of adding 64 columns to the answer set of a 6D TPC-D query. The representation of Table 3.b is also not convenient -- the number of columns grows as the power set of the number of aggregated attributes, creating difficult naming problems and very long names. The approach recommended by Date is reminiscent of pivot tables found in Excel (and now all other spreadsheets) [Excel], a popular data analysis feature of Excel5.

Table 4: An Excel pivot table representation of Table 3 with Ford sales data included.
Sum Sales     Year / Color
              1994    1994    1994    1995    1995    1995    Grand
Model         black   white   Total   black   white   Total   Total
Chevy         50      40      90      85      115     200     290
Ford          50      10      60      85      75      160     220
Grand Total   100     50      150     170     190     360     510

Table 4 is an alternative representation of Table 3.a (with Ford Sales data included) that illustrates how a pivot table in Excel can present the Sales data by Model, by Year, and then by Color. The pivot operator transposes a spreadsheet, typically aggregating cells based on values in the cells. Rather than just creating columns based on subsets of column names, pivot creates columns based on subsets of column values. This is a much larger set. If one pivots on two columns containing N and M values, the resulting pivot table has NxM values. We cringe at the prospect of so many columns and such obtuse column names.

Rather than extend the result table to have many new columns, a more conservative approach prevents the exponential growth of columns by overloading column values. The idea is to introduce an ALL value. Table 5.a demonstrates this relational and more convenient representation. The dummy value "ALL" has been added to fill in the super-aggregation items:

Table 5.a: Sales Summary
Model   Year   Color   Units
Chevy   1994   black   50
Chevy   1994   white   40
Chevy   1994   ALL     90
Chevy   1995   black   85
Chevy   1995   white   115
Chevy   1995   ALL     200
Chevy   ALL    ALL     290

5 It seems likely that a relational pivot operator will appear in database systems in the near future.
Table 5.a is not really a completely new representation or operation. Since Table 5.a is a relation, it is not surprising that it can be built using standard SQL. The SQL statement to build this SalesSummary table from the raw Sales data is:
    SELECT 'ALL', 'ALL', 'ALL', SUM(Sales)
    FROM Sales
    WHERE Model = 'Chevy'
    UNION
    SELECT Model, 'ALL', 'ALL', SUM(Sales)
    FROM Sales
    WHERE Model = 'Chevy'
    GROUP BY Model
    UNION
    SELECT Model, Year, 'ALL', SUM(Sales)
    FROM Sales
    WHERE Model = 'Chevy'
    GROUP BY Model, Year
    UNION
    SELECT Model, Year, Color, SUM(Sales)
    FROM Sales
    WHERE Model = 'Chevy'
    GROUP BY Model, Year, Color;

This is a simple 3-dimensional roll-up. Aggregating over N dimensions requires N such unions.

Roll-up is asymmetric: notice that Table 5.a aggregates sales by year but not by color. The missing rows are:

Table 5.b: Sales Summary rows missing from Table 5.a to convert the roll-up into a cube.
Model   Year   Color   Units
Chevy   ALL    black   135
Chevy   ALL    white   155

These additional rows could be captured by adding the following clause to the SQL statement above:
    UNION
    SELECT Model, 'ALL', Color, SUM(Sales)
    FROM Sales
    WHERE Model = 'Chevy'
    GROUP BY Model, Color;

The symmetric aggregation result is a table called a cross-tabulation, or cross tab for short. Tables 5.a and 5.b are the relational form of the cross-tabs, but cross tab data is routinely displayed in the more compact format of Table 6.

Table 6.a: Chevy Sales Cross Tab
Chevy         1994   1995   total (ALL)
black         50     85     135
white         40     115    155
total (ALL)   90     200    290

Table 6.b: Ford Sales Cross Tab
Ford          1994   1995   total (ALL)
black         50     85     135
white         10     75     85
total (ALL)   60     160    220

The cross-tab-array representation (Tables 6.a, 6.b) is equivalent to the relational representation using the ALL value. Both generalize to an N-dimensional cross tab. Most report writers build in a cross-tabs feature, building the report up from the underlying tabular data such as Table 5. See for example the TRANSFORM-PIVOT operator of Microsoft Access [Access].

The representation suggested by Table 5 and unioned GROUP BYs “solve” the problem of representing aggregate data in a relational data model. The problem remains that expressing roll-up and cross-tab queries with conventional SQL is daunting. A six dimension cross-tab requires a 64-way union of 64 different GROUP BY operators to build the underlying representation.

There is another very important reason why it is inadequate to use GROUP BYs. The resulting representation of aggregation is too complex to analyze for optimization. On most SQL systems this will result in 64 scans of the data, 64 sorts or hashes, and a long wait.

3. CUBE and ROLLUP Operators

The generalization of group by, roll-up and cross-tab ideas seems obvious: Figure 3 shows the concept for aggregation up to 3-dimensions. The traditional GROUP BY generates the N-dimensional data cube core. The N-1 lower-dimensional aggregates appear as points, lines, planes, cubes, or hyper-cubes hanging off the data cube core.

The data cube operator builds a table containing all these aggregate values. The total aggregate using function f() is represented as the tuple:
    ALL, ALL, ALL,..., ALL, f(*)
Points in higher dimensional planes or cubes have fewer ALL values.
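To make the cost of the unioned GROUP BY approach concrete, the sketch below generates one SELECT per subset of the grouping columns and unions them, using Python's standard sqlite3 module on the Chevy rows of Table 5.a. The helper names (grouping_sets, union_query) and the CAST to text are illustrative assumptions, not part of the paper.

```python
import sqlite3
from itertools import combinations

# Chevy rows at the finest level of aggregation (from Table 5.a).
sales = [
    ("Chevy", 1994, "black", 50),
    ("Chevy", 1994, "white", 40),
    ("Chevy", 1995, "black", 85),
    ("Chevy", 1995, "white", 115),
]
dims = ["Model", "Year", "Color"]

def grouping_sets(columns):
    """Yield all 2^N subsets of the grouping columns, in column order."""
    for k in range(len(columns) + 1):
        for subset in combinations(columns, k):
            yield subset

def union_query(columns):
    """Build one SELECT ... GROUP BY per subset and UNION them together."""
    parts = []
    for subset in grouping_sets(columns):
        # Columns outside the subset are replaced by the 'ALL' token.
        select = ", ".join(
            f"CAST({c} AS TEXT)" if c in subset else "'ALL'" for c in columns
        )
        sql = f"SELECT {select}, SUM(Sales) FROM Sales"
        if subset:
            sql += " GROUP BY " + ", ".join(subset)
        parts.append(sql)
    return " UNION ".join(parts), len(parts)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Sales INTEGER)"
)
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", sales)

query, n_group_bys = union_query(dims)
cube = {row[:3]: row[3] for row in conn.execute(query)}

print(n_group_bys)                        # 8 GROUP BYs for 3 dimensions
print(cube[("Chevy", "ALL", "black")])    # 135, a row of Table 5.b
print(sum(1 for _ in grouping_sets("abcdef")))  # 64 for six dimensions
```

Three dimensions already need eight SELECTs; the same generator yields 64 for six dimensions, which is the 64-way union described above.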
[Figure 3 diagram, "The Data Cube and The Sub-Space Aggregates": an Aggregate (Sum); a Group By (with total) By Color over RED, WHITE, BLUE; a Cross Tab By Make (Chevy, Ford) and By Color; and a 3D cube By Make, By Year (1990 to 1993), and By Color, with faces By Make & Year, By Color & Year, and By Make & Color.]
Figure 3: The CUBE operator is the N-dimensional generalization of simple aggregate functions. The 0D data cube is a point. The 1D data cube is a line with a point. The 2D data cube is a cross tabulation, a plane, two lines, and a point. The 3D data cube is a cube with three intersecting 2D cross tabs.

Creating a data cube requires generating the power set (set of all subsets) of the aggregation columns. Since the CUBE is an aggregation operation, it makes sense to externalize it by overloading the SQL GROUP BY operator. In fact, the CUBE is a relational operator, with GROUP BY and ROLLUP as degenerate forms of the operator. This can be conveniently specified by overloading the SQL GROUP BY6.

Figure 4 has an example of the cube syntax. To give another, here follows a statement to aggregate the set of temperature observations:
    SELECT day, nation, MAX(Temp)
    FROM Weather
    GROUP BY CUBE
             Day(Time) AS day,
             Country(Latitude, Longitude) AS nation;

The semantics of the CUBE operator are that it first aggregates over all the <select list> attributes in the GROUP BY clause as in a standard GROUP BY. Then, it UNIONs in each super-aggregate of the global cube -- substituting ALL for the aggregation columns. If there are N attributes in the <select list>, there will be 2^N - 1 super-aggregate values. If the cardinalities of the N attributes are C_1, C_2,..., C_N then the cardinality of the resulting cube relation is ∏ (C_i + 1). The extra value in each domain is ALL. For example, the SALES table has 2x3x3 = 18 rows, while the derived data cube has 3x4x4 = 48 rows.

    SELECT Model, Year, Color, SUM(sales) AS Sales
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
      AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE Model, Year, Color;

SALES
Model   Year   Color   Sales
Chevy   1990   red     5
Chevy   1990   white   87
Chevy   1990   blue    62
Chevy   1991   red     54
Chevy   1991   white   95
Chevy   1991   blue    49
Chevy   1992   red     31
Chevy   1992   white   54
Chevy   1992   blue    71
Ford    1990   red     64
Ford    1990   white   62
Ford    1990   blue    63
Ford    1991   red     52
Ford    1991   white   9
Ford    1991   blue    55
Ford    1992   red     27
Ford    1992   white   62
Ford    1992   blue    39

DATA CUBE
Model   Year   Color   Sales
Chevy   1990   blue    62
Chevy   1990   red     5
Chevy   1990   white   87
Chevy   1990   ALL     154
Chevy   1991   blue    49
Chevy   1991   red     54
Chevy   1991   white   95
Chevy   1991   ALL     198
Chevy   1992   blue    71
Chevy   1992   red     31
Chevy   1992   white   54
Chevy   1992   ALL     156
Chevy   ALL    blue    182
Chevy   ALL    red     90
Chevy   ALL    white   236
Chevy   ALL    ALL     508
Ford    1990   blue    63
Ford    1990   red     64
Ford    1990   white   62
Ford    1990   ALL     189
Ford    1991   blue    55
Ford    1991   red     52
Ford    1991   white   9
Ford    1991   ALL     116
Ford    1992   blue    39
Ford    1992   red     27
Ford    1992   white   62
Ford    1992   ALL     128
Ford    ALL    blue    157
Ford    ALL    red     143
Ford    ALL    white   133
Ford    ALL    ALL     433
ALL     1990   blue    125
ALL     1990   red     69
ALL     1990   white   149
ALL     1990   ALL     343
ALL     1991   blue    104
ALL     1991   red     106
ALL     1991   white   104
ALL     1991   ALL     314
ALL     1992   blue    110
ALL     1992   red     58
ALL     1992   white   116
ALL     1992   ALL     284
ALL     ALL    blue    339
ALL     ALL    red     233
ALL     ALL    white   369
ALL     ALL    ALL     941

Figure 4: A 3D data cube (the DATA CUBE table) built from the SALES table by the CUBE statement at the top of the figure.

If the application wants only a roll-up or drill-down report, similar to the data in Table 3.a, the full cube is overkill. Indeed, some parts of the full cube may be meaningless. If the answer set is not normalized, there may be functional dependencies among columns. For example, a date functionally defines a week, month, and year. Roll-ups by year, week, day are common, but a cube on these three attributes would be meaningless.

The solution is to offer ROLLUP in addition to CUBE. ROLLUP produces just the super-aggregates:
    (v1 ,v2 ,...,vn, f()),
    (v1 ,v2 ,...,ALL, f()),
    ...
    (v1 ,ALL,...,ALL, f()),
    (ALL,ALL,...,ALL, f()).

Cumulative aggregates, like running sum or running average, work especially well with ROLLUP because the answer set is naturally sequential (linear) while the full data cube is naturally non-linear (multi-dimensional). ROLLUP and CUBE must be ordered for cumulative operators to apply.

We investigated letting the programmer specify the exact list of super-aggregates but encountered complexities related to collation, correlation, and expressions. We believe ROLLUP and CUBE will serve the needs of most applications.

6 An earlier version of this paper [Gray et. al.] and the Microsoft SQL Server 6.5 product implemented a slightly different syntax. They suffix the GROUP BY clause with a ROLLUP or CUBE modifier. The SQL Standards body chose an infix notation so that GROUP BY and ROLLUP and CUBE could be mixed in a single statement. The improved syntax is described here.
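The CUBE semantics can be checked against Figure 4 with a few lines of Python. This is a minimal in-memory sketch, not an efficient algorithm: each input row is added to every cell obtained by replacing a subset of its dimension values with ALL. The function name cube and the string token "ALL" are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product
from math import prod

ALL = "ALL"

# The SALES table of Figure 4: (Model, Year, Color, Sales).
SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87),
    ("Chevy", 1990, "blue", 62), ("Chevy", 1991, "red", 54),
    ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54),
    ("Chevy", 1992, "blue", 71), ("Ford", 1990, "red", 64),
    ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9),
    ("Ford", 1991, "blue", 55), ("Ford", 1992, "red", 27),
    ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]

def cube(rows, n_dims):
    """Sum the measure over every combination of kept/ALL dimensions."""
    result = defaultdict(int)
    for row in rows:
        dims, value = row[:n_dims], row[n_dims]
        # Each row contributes to 2^N cells: one per subset of hidden dims.
        for mask in product((False, True), repeat=n_dims):
            key = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            result[key] += value
    return dict(result)

data_cube = cube(SALES, 3)

print(len(data_cube))                     # 48
print(prod(c + 1 for c in (2, 3, 3)))     # 48, the product of (C_i + 1)
print(data_cube[("Chevy", 1990, ALL)])    # 154
print(data_cube[(ALL, ALL, ALL)])         # 941
```

The row count matches the product of (C_i + 1) only because every combination of Model, Year, and Color occurs in SALES; a sparse input would produce fewer rows.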
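ROLLUP, by contrast, keeps only the prefixes of the dimension list. The sketch below, again an illustrative in-memory version with an assumed function name rollup, reproduces the roll-up of the Chevy data and shows that the "by color across years" rows are absent.

```python
from collections import defaultdict

ALL = "ALL"

def rollup(rows, n_dims):
    """Aggregate by each prefix of the dimension list, longest first."""
    result = defaultdict(int)
    for row in rows:
        dims, value = row[:n_dims], row[n_dims]
        # Keep the first `keep` dimensions and replace the rest with ALL,
        # giving N + 1 levels instead of the 2^N levels of a cube.
        for keep in range(n_dims, -1, -1):
            key = dims[:keep] + (ALL,) * (n_dims - keep)
            result[key] += value
    return dict(result)

chevy = [
    ("Chevy", 1994, "black", 50),
    ("Chevy", 1994, "white", 40),
    ("Chevy", 1995, "black", 85),
    ("Chevy", 1995, "white", 115),
]
summary = rollup(chevy, 3)

for key, total in summary.items():
    print(key, total)
```

The result holds the seven rows of Table 5.a plus the global (ALL, ALL, ALL) total; a key such as ("Chevy", ALL, "black") never appears, which is the asymmetry that CUBE removes.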
3.1. The GROUP, CUBE, ROLLUP Algebra

The GROUP BY, ROLLUP, and CUBE operators have an interesting algebra. The CUBE of a ROLLUP or GROUP BY is a CUBE. The ROLLUP of a GROUP BY is a ROLLUP. Algebraically, this operator algebra can be stated as:
    CUBE(ROLLUP) = CUBE
    ROLLUP(GROUP BY) = ROLLUP
So it makes sense to arrange the aggregation operators in the compound order where the “most powerful” cube operator is at the core, then a roll-up of the cubes and then a group by of the roll-ups. Of course, one can use any subset of the three operators:
    GROUP BY <select list>
    ROLLUP <select list>
    CUBE <select list>

The following SQL demonstrates a compound aggregate. The “shape” of the answer is diagrammed in Figure 5:
    SELECT Manufacturer,
           Year, Month, Day,
           Color, Model,
           SUM(price) AS Revenue
    FROM Sales
    GROUP BY Manufacturer,
    ROLLUP Year(Time) AS Year,
           Month(Time) AS Month,
           Day(Time) AS Day,
    CUBE
           Color,
           Model;

[Figure 5 diagram labels: Manufacturer; Year, Mo, Day; Model x Color]

    <aggregation list> ::=
        { ( <column name> | <expression>)
          [ AS <correlation name> ]
          [ <collate clause> ]
        ,...}

These extensions are independent of the CUBE operator. They remedy some pre-existing problems with GROUP BY. Many systems already allow these extensions.

Now extend SQL’s GROUP BY operator:
    GROUP BY [<aggregation list> ]
             [ ROLLUP <aggregation list> ]
             [ CUBE <aggregation list> ]

3.3. A Discussion of the ALL Value

Is the ALL value really needed? Each ALL value really represents a set: the set over which the aggregate was computed7. In the Table 5 SalesSummary data cube, the respective sets are:
    Model.ALL = ALL(Model) = {Chevy, Ford }
    Year.ALL = ALL(Year) = {1990,1991,1992}
    Color.ALL = ALL(Color) = {red,white,blue}

In reality, we have stumbled into the world of nested relations: relations can be values. This is a major step for relational systems. There is much debate on how to proceed. Rather than attack those problems here, we just use the ALL value as a token representing these sets. Thinking of the ALL value as the corresponding set defines the semantics of the relational operators (e.g., equals and IN). The ALL string is for display. A new ALL() function generates the set associated with this value as in the examples above. ALL() applied to any other value returns NULL.

7 This is distinct from saying that ALL represents one of the members of the set.
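One way to picture the ALL token is as a value that displays as the string ALL but carries the set it stands for. The Python sketch below is purely illustrative: the class name AllValue and the use of None for NULL are assumptions, and the paper does not prescribe any implementation.

```python
class AllValue:
    """Token standing for the set over which an aggregate was computed."""

    def __init__(self, members):
        self.members = frozenset(members)

    def __repr__(self):
        return "ALL"                 # the ALL string is for display only

    def __contains__(self, value):
        return value in self.members  # gives IN its set-based meaning


def ALL(value):
    """Return the set behind an ALL token, or None (NULL) otherwise."""
    return set(value.members) if isinstance(value, AllValue) else None


model_all = AllValue({"Chevy", "Ford"})
year_all = AllValue({1990, 1991, 1992})
color_all = AllValue({"red", "white", "blue"})

# A super-aggregate row of the cube: Chevy sales over all years and colors.
row = ("Chevy", year_all, color_all, 508)

print(row)                  # ('Chevy', ALL, ALL, 508)
print(ALL(row[1]))          # the set of years behind the token
print(ALL(row[0]))          # None: 'Chevy' is an ordinary value
```

Testing ALL(x) for a non-NULL result is then enough to tell an aggregate column value from an ordinary one.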
It is convenient to know when a column value is an aggregate. One way to test this is to apply the ALL() function to the value and test for a non-NULL value. This is so useful that we propose a Boolean function GROUPING() that, given a select list element, returns TRUE if the element is an ALL value, and FALSE otherwise.

3.4. Avoiding the ALL Value

Veteran SQL implementers will be terrified of the ALL value: like NULL, it will create many special cases. If the goal is to help report writer and GUI visualization software, then it may be simpler to adopt the following approach8:
• Use the NULL value in place of the ALL value.
• Do not implement the ALL() function.
• Implement the GROUPING() function to discriminate between NULL and ALL.
In this minimalist design, tools and users can simulate the ALL value with, for example:
    SELECT Model,Year,Color,SUM(sales),
           GROUPING(Model),
           GROUPING(Year),
           GROUPING(Color)
    FROM Sales
    GROUP BY CUBE Model, Year, Color;
Wherever the ALL value appeared before, now the corresponding value will be NULL in the data field and TRUE in the corresponding grouping field. For example, the global sum of Figure 4 will be the tuple:
    (NULL,NULL,NULL,941,TRUE,TRUE,TRUE)
rather than the tuple one would get with the “real” cube operator:
    ( ALL, ALL, ALL, 941 ).

3.5. Decorations

The next step is to allow decorations, columns that do not appear in the GROUP BY but that are functionally dependent on the grouping columns. Consider the example:
    SELECT department.name, sum(sales)
    FROM sales JOIN department
         USING (department_number)
    GROUP BY sales.department_number;

Decorations interact with aggregate values. If the aggregate tuple functionally defines the decoration value, then the value appears in the resulting tuple. Otherwise the decoration field is NULL. For example, in the following query the continent is not specified unless nation is.
    SELECT day,nation,MAX(Temp),
           continent(nation) AS continent
    FROM Weather
    GROUP BY CUBE
             Day(Time) AS day,
             Country(Latitude, Longitude) AS nation;
The query would produce the sample tuples:

Table 7: Demonstrating decorations and ALL
day         nation   max(Temp)   continent
25/1/1995   USA      28          North America
ALL         USA      37          North America
25/1/1995   ALL      41          NULL
ALL         ALL      48          NULL

3.6. Dimensions, Star, and Snowflake Queries

While strictly not part of the CUBE and ROLLUP operator design, there is an important database design concept that facilitates the use of aggregation operations. It is common to record events and activities with a detailed record giving all the dimensions of the event. For example, the sales item record in Figure 6 gives the id of the buyer, seller, the product purchased, the units purchased, the price, the date and the sales office that is credited with the sale. There are probably many more dimensions about this sale, but this example gives the idea.

There are side tables that for each dimension value give its attributes. For example, the San Francisco sales office is in the Northern California District, the Western Region, and the US Geography. This fact would be stored in a dimension table for the Office9. The dimension table may also have decorations describing other attributes of that Office. These dimension tables define a spectrum of aggregation granularities for the dimension. Analysts might want to cube various dimensions and then aggregate or roll up the cube at any or all of these granularities.

8 This is the syntax and approach used by Microsoft’s SQL Server (version 6.5).

9 Database normalization rules [Date2] would recommend that the California District be stored once, rather than storing it once for each Office. So there might be office, district, and region tables, rather than one big denormalized table. Query users find it convenient to use the denormalized table.
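A dimension table lets the same fact rows be rolled up at several granularities. The sketch below uses Python's standard sqlite3 module with a single denormalized Office table; the column layout, the San Jose and Los Angeles offices, and all sales figures are invented for illustration, and only the San Francisco hierarchy comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Office (
    Office TEXT PRIMARY KEY, District TEXT, Region TEXT, Geography TEXT);
CREATE TABLE Sales (Product TEXT, Units INTEGER, Price REAL, Office TEXT);
INSERT INTO Office VALUES
    ('San Francisco', 'Northern California', 'Western', 'US'),
    ('San Jose',      'Northern California', 'Western', 'US'),
    ('Los Angeles',   'Southern California', 'Western', 'US');
INSERT INTO Sales VALUES
    ('widget', 3, 10.0, 'San Francisco'),
    ('widget', 2, 10.0, 'San Jose'),
    ('gadget', 1, 50.0, 'Los Angeles');
""")

GRANULARITIES = ("Office", "District", "Region", "Geography")

def revenue_by(granularity):
    """Roll the fact table up to one granularity of the Office dimension."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {granularity}")
    sql = (
        f"SELECT o.{granularity}, SUM(s.Units * s.Price)"
        " FROM Sales s JOIN Office o ON s.Office = o.Office"
        f" GROUP BY o.{granularity} ORDER BY o.{granularity}"
    )
    return conn.execute(sql).fetchall()

print(revenue_by("District"))
print(revenue_by("Region"))
```

Each call joins the fact table to the dimension table once and groups at a different level, which is the star query pattern for a single dimension.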
[Figure 6 diagram: a fact table with columns Product, Seller, Buyer, Units, Price, Office, Date, surrounded by dimension granularities that each end in ALL, including Unit, Group, Division; Channel; Cust Type; Discount; District, Region, Geography; and Week, Month, Quarter, Year.]
Figure 6: A snowflake schema showing the core fact table and some of the many aggregation granularities of the core dimensions.

The general schema of Figure 6 is so common that it has been given a name: a snowflake schema. Simpler schemas that have a single dimension table for each dimension are called star schemas. Queries against these schemas are called snowflake queries and star queries respectively.

The diagram of Figure 6 suggests that the granularities form a pure hierarchy. In reality, the granularities typically form a lattice. To take just a very simple example, days nest in weeks but weeks do not nest in months or quarters or years (some weeks are partly in two years). Analysts often think of dates in terms of weekdays, weekends, sale days, various holidays (e.g., Christmas and the time leading up to it). So a fuller granularity graph of Figure 6 would be quite complex. Fortunately, graphical tools like pivot tables with pull down lists of categories hide much of this complexity from the analyst.

4. Addressing The Data Cube

Section 5 discusses how to compute data cubes and how users can add new aggregate operators. This section considers extensions to SQL syntax to easily access the elements of a data cube -- making it recursive and allowing aggregates to reference sub-aggregates.

It is not clear where to draw the line between the reporting-visualization tool and the query tool. Ideally, application designers should be able to decide how to split the function between the query system and the visualization tool. Given that perspective, the SQL system must be a Turing-complete programming environment.

SQL3 defines a Turing-complete procedural programming language. So, anything is possible. But, many things are not easy. Our task is to make simple and common things easy.

The most common request is for percent-of-total as an aggregate function. In SQL this is computed with nested SELECT statements:
    SELECT Model,Year,Color,SUM(Sales),
           SUM(Sales)/
             (SELECT SUM(Sales)
              FROM Sales
              WHERE Model IN ('Ford','Chevy')
                AND Year BETWEEN 1990 AND 1992
             )
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
      AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE Model, Year, Color;

It seems natural to allow the shorthand syntax to name the global aggregate:
    SELECT Model, Year, Color,
           SUM(Sales) AS total,
           SUM(Sales) / total(ALL,ALL,ALL)
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
      AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE Model, Year, Color;

This leads into deeper water. The next step is a desire to compute the index of a value -- an indication of how far the value is from the expected value. In a set of N values, one expects each item to contribute one Nth to the sum. So the 1D index of a set of values is:
    index(v_i) = v_i / (Σ_j v_j)

If the value set is two dimensional, this commonly used financial function is a nightmare of indices. It is best described in a programming language. The current approach to selecting a field value from a 2D cube would read as:
    SELECT v
    FROM cube
    WHERE row = :i
      AND column = :j
We recommend the simpler syntax:
    cube.v(:i, :j)
as a shorthand for the above selection expression. With this notation added to the SQL programming language, it should be fairly easy to compute super-super-aggregates from the base cube.

5. Computing Cubes and Roll-ups

CUBE and ROLLUP generalize aggregates and GROUP BY, so all the technology for computing those results also applies to computing the core of the cube [Graefe]. The basic technique for computing a ROLLUP is to sort the table on the aggregating attributes and then compute the aggregate functions (there is a more detailed discussion of the kinds of aggregates in a moment). If the ROLLUP result is small enough to fit in main memory, it can be computed by scanning the input set and applying each record to the in-memory ROLLUP. A cube is the union of many rollups, so the naive algorithm computes this union.

As Graefe [Graefe] points out, the basic techniques for computing aggregates are:
• To minimize data movement and consequent processing cost, compute aggregates at the lowest possible system level.
• If possible, use arrays or hashing to organize the aggregation columns in memory, storing one aggregate value for each array or hash entry.
• If the aggregation values are large strings, it may be wise to keep a hashed symbol table that maps each string to an integer so that the aggregate values are small. When a new value appears, it is assigned a new integer. With this organization, the values become dense and the aggregates can be stored as an N-dimensional array.
• If the number of aggregates is too large to fit in memory, use sorting or hybrid hashing to organize the data by value and then aggregate with a sequential scan of the sorted data.
• If the source data spans many disks or nodes, use parallelism to aggregate each partition and then coalesce these aggregates.

Some innovation is needed to compute the "ALL" tuples of the cube and roll-up from the GROUP BY core. The ALL value adds one extra value to each dimension in the CUBE. So an N-dimensional cube of N attributes, each with cardinality $C_i$, will have $\prod (C_i + 1)$ values. If each $C_i = 4$ then a 4D CUBE is 2.4 times larger than the base GROUP BY. We expect the $C_i$ to be large (tens or hundreds) so that the CUBE will be only a little larger than the GROUP BY. By comparison, an N-dimensional roll-up will add only N records to the answer set.

The cube operator allows many aggregate functions in the aggregation list of the GROUP BY clause. Assume in this discussion that there is a single aggregate function F() being computed on an N-dimensional cube. The extension to computing a list of functions is a simple generalization.

Figure 7 summarizes how aggregate functions are defined and implemented in many systems. It defines how the database execution engine initializes the aggregate function, calls the aggregate functions for each new value and then invokes the aggregate function to get the final value. More sophisticated systems allow the aggregate function to declare a computation cost so that the query optimizer knows to minimize calls to expensive functions. This design (except for the cost functions) is now part of the proposed SQL standard.

Figure 7: System defined and user defined aggregate functions are initialized with a start() call that allocates and initializes a scratchpad cell to compute the aggregate. Subsequently, the next() call is invoked for each value to be aggregated. Finally, the end() call computes the aggregate from the scratchpad values, deallocates the scratchpad and returns the result.

The simplest algorithm to compute the cube is to allocate a handle for each cube cell. When a new tuple $(x_1, x_2, \ldots, x_N, v)$ arrives, the Iter(handle, v) function is called $2^N$ times -- once for each handle of each cell of the cube matching this value. The $2^N$ comes from the fact that each coordinate can either be $x_i$ or ALL. When all the input tuples have been processed, the system invokes the final(&handle) function for each of the $\prod (C_i + 1)$ nodes in the cube. Call this the $2^N$-algorithm. There is a corresponding order-N algorithm for roll-up.

If the base table has cardinality T, the $2^N$-algorithm invokes the Iter() function $T \times 2^N$ times. It is often faster to compute the super-aggregates from the core GROUP BY, reducing the number of calls by approximately a factor of T. It is often possible to compute the cube from the core or from intermediate results only M times larger than the core. The following trichotomy characterizes the options in computing super-aggregates.

Consider aggregating a two dimensional set of values $\{X_{i,j} \mid i = 1, \ldots, I;\ j = 1, \ldots, J\}$. Aggregate functions can be classified into three categories:

Distributive: Aggregate function F() is distributive if there is a function G() such that $F(\{X_{i,j}\}) = G(\{F(\{X_{i,j} \mid i = 1, \ldots, I\}) \mid j = 1, \ldots, J\})$. COUNT(), MIN(), MAX(), SUM() are all distributive. In fact, F = G for all but COUNT(). G = SUM() for the COUNT() function. Once order is imposed, the cumulative aggregate functions also fit in the distributive class.
Algebraic: Aggregate function F() is algebraic if there is an M-tuple valued function G() and a function H() such that $F(\{X_{i,j}\}) = H(\{G(\{X_{i,j} \mid i = 1, \ldots, I\}) \mid j = 1, \ldots, J\})$. Average(), standard deviation, MaxN(), MinN(), center_of_mass() are all algebraic. For Average, the function G() records the sum and count of the subset. The H() function adds these two components and then divides to produce the global average. Similar techniques apply to finding the N largest values, the center of mass of group of objects, and other algebraic functions. The key to algebraic functions is that a fixed size result (an M-tuple) can summarize the sub-aggregation.

Holistic: Aggregate function F() is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. That is, there is no constant M, such that an M-tuple characterizes the computation $F(\{X_{i,j} \mid i = 1, \ldots, I\})$. Median(), MostFrequent() (also called the Mode()), and Rank() are common examples of holistic functions.

When the core GROUP BY operation completes, the CUBE algorithm passes the set of handles to each N-1 dimensional super-aggregate. When this is done the handles of these super-aggregates are passed to the super-super aggregates, and so on until the (ALL, ALL, ..., ALL) aggregate has been computed. This approach requires a new call for distributive aggregates:

    Iter_super(&handle, &handle)

which folds the sub-aggregate on the right into the super aggregate on the left. The same ordering idea (aggregate on the smallest list) applies at each higher aggregation level.
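The hand-off of scratchpads from the core GROUP BY to the super-aggregates can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation described in the paper: it uses AVG as the algebraic function with a (sum, count) scratchpad, and the names start, next_value, iter_super, end and cube_from_core are illustrative stand-ins for the start(), next()/Iter(), Iter_super() and end()/final() calls. Each group-by is folded from its smallest available parent.

```python
from itertools import combinations

ALL = "ALL"

# Scratchpad for AVG, an algebraic function: a fixed-size [sum, count] pair.
def start():
    return [0.0, 0]

def next_value(handle, v):
    # Stand-in for next()/Iter(handle, v): add one base value.
    handle[0] += v
    handle[1] += 1

def iter_super(handle, sub_handle):
    # Stand-in for Iter_super: fold a sub-aggregate (right) into a super-aggregate (left).
    handle[0] += sub_handle[0]
    handle[1] += sub_handle[1]

def end(handle):
    # Stand-in for end()/final(): compute the result from the scratchpad.
    return handle[0] / handle[1]

def cube_from_core(rows, n_dims):
    """rows: iterable of (x1, ..., xN, v). Returns {cell: AVG(v)} for every cube cell."""
    # Core GROUP BY: one scratchpad per distinct (x1, ..., xN).
    core = {}
    for *key, v in rows:
        next_value(core.setdefault(tuple(key), start()), v)

    # group_bys maps a set of grouped dimensions to its {cell: scratchpad} table.
    all_dims = frozenset(range(n_dims))
    group_bys = {all_dims: core}
    for size in range(n_dims - 1, -1, -1):
        for dims in map(frozenset, combinations(range(n_dims), size)):
            # Any parent with one more grouped dimension works; pick the smallest.
            parent = min((group_bys[dims | {d}] for d in all_dims - dims), key=len)
            cells = {}
            for cell, handle in parent.items():
                rolled = tuple(x if i in dims else ALL for i, x in enumerate(cell))
                iter_super(cells.setdefault(rolled, start()), handle)
            group_bys[dims] = cells

    return {cell: end(h) for cells in group_bys.values() for cell, h in cells.items()}

rows = [("Chevy", 1994, 10), ("Chevy", 1995, 20), ("Ford", 1994, 30)]
cube = cube_from_core(rows, n_dims=2)
print(cube[("Chevy", ALL)], cube[(ALL, ALL)])  # 15.0 20.0
```

Averaging the two per-model averages (15 and 30) would give 22.5 rather than 20, which is why the scratchpad carries the sum and the count instead of the finished average.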
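The $2^N$-algorithm and the $\prod (C_i + 1)$ size estimate from earlier in this section can be checked on a small example. The sketch below is illustrative only: it assumes SUM as the aggregate, so a plain number serves as the scratchpad, and the names cube_2n and cube_size_bound are hypothetical.

```python
from itertools import product
from math import prod

ALL = "ALL"

def cube_2n(rows):
    """SUM cube via the 2^N-algorithm: every row updates all 2^N matching cells."""
    cells = {}
    calls = 0
    for *key, v in rows:
        # Each coordinate is either its own value x_i or ALL: 2^N combinations.
        for cell in product(*((x, ALL) for x in key)):
            cells[cell] = cells.get(cell, 0) + v  # plays the role of Iter(handle, v)
            calls += 1
    return cells, calls

def cube_size_bound(cardinalities):
    """Upper bound on the number of cube cells: the product of (C_i + 1)."""
    return prod(c + 1 for c in cardinalities)

rows = [("Chevy", 1994, 10), ("Chevy", 1995, 20), ("Ford", 1994, 30)]
cells, calls = cube_2n(rows)
print(calls)                                  # T x 2^N = 3 x 4 = 12
print(cube_size_bound([4, 4, 4, 4]) / 4**4)   # 625 / 256 = 2.44140625
```

With T = 3 rows and N = 2 dimensions the aggregate is touched 12 times, and four dimensions of cardinality 4 give 625 cells against 256 in the base GROUP BY, matching the factor of 2.4 quoted in the text.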
Figure 8: Computing the cube with a minimal number of calls to aggregation functions. If the aggregation operator is algebraic or distributive, then it is possible to compute the core of the cube as usual. Then, the higher dimensions of the cube are computed by calling the super-iterator function passing the lower-level scratchpads. Once an N-dimensional space has been computed, the operation repeats to compute the N-1 dimensional space. This repeats until N=0.

Interestingly, the distributive, algebraic, and holistic taxonomy is very useful in computing aggregates for parallel database systems. In those systems, aggregates are computed for each partition of a database in parallel. Then the results of these parallel computations are combined. The combination step is very similar to the logic and mechanism used in Figure 8.

If the data cube does not fit into memory, array techniques do not work. Rather one must either partition the cube with a hash function or sort it. These are standard techniques for computing the GROUP BY. The super-aggregates are likely to be orders of magnitude smaller than the core, so they are very likely to fit in memory. Sorting is especially convenient for ROLLUP since the user often wants the answer set in a sorted order -- so the sort must be done anyway.

It is possible that the core of the cube is sparse. In that case, only the non-null elements of the core and of the super-aggregates should be represented. This suggests that hashing or a B-tree be used as the indexing scheme for aggregation values [Essbase].

6. Maintaining Cubes and Roll-ups

SQL Server 6.5 has supported the CUBE and ROLLUP operators for about six months now. We have been surprised that some customers use these operators to compute and store the cube. These customers then define triggers on the underlying tables so that when the tables change, the cube is dynamically updated.

This of course raises the question: how can one incrementally compute (user-defined) aggregate functions after the cube has been materialized? Harinarayn, Rajaraman, and Ullman have interesting ideas on pre-computing sub-cubes of the cube assuming all functions are holistic [Harinarayn, Rajaraman, and Ullman]. Our view is that users avoid holistic functions by using approximation techniques. Most functions we see in practice are distributive or algebraic. For example, medians and quartiles are approximated using statistical techniques rather than being computed exactly.

The discussion of distributive, algebraic, and holistic functions in the previous section was completely focused on SELECT statements, not on UPDATE, INSERT, or DELETE statements.

Surprisingly, the issues of maintaining a cube are quite different from computing it in the first place. To give a simple example: it is easy to compute the maximum value in a cube -- max is a distributive function. It is also easy to propagate inserts into a "max" N-dimensional cube. When a record is inserted into the base table, just visit the $2^N$ super-aggregates of this record in the cube and take the max of the current and new value. This computation can be shortened -- if the new value "loses" one competition, then it will lose in all lower dimensions. Now suppose a delete or update changes the largest value in the base table. Then $2^N$ elements of the cube must be recomputed. The recomputation needs to find the global maximum. This seems to require a recomputation of the entire cube. So, max is distributive for SELECT and INSERT, but it is holistic for DELETE.

This simple example suggests that there are orthogonal hierarchies for SELECT, INSERT, and DELETE functions (update is just delete plus insert). If a function is algebraic for insert, update, and delete (count() and sum() are such functions), then it is easy to maintain the cube. If the function is distributive for insert, update, and delete, then by maintaining the scratchpads for each cell of the cube, it is fairly inexpensive to maintain the cube. If the
function is delete-holistic (as max is) then it is expensive to maintain the cube. These ideas deserve more study.

7. Summary:

The cube operator generalizes and unifies several common and popular concepts:
    aggregates,
    group by,
    histograms,
    roll-ups and drill-downs, and
    cross tabs.

The cube operator is based on a relational representation of aggregate data using the ALL value to denote the set over which each aggregation is computed. In certain cases it makes sense to restrict the cube operator to just a roll-up aggregation for drill-down reports.

The data cube is easy to compute for a wide class of functions (distributive and algebraic functions). SQL's basic set of five aggregate functions needs careful extension to include functions such as rank, N_tile, cumulative, and percent of total to ease typical data mining operations. These are easily added to SQL by supporting user-defined aggregates. These extensions require a new super-aggregate mechanism to allow efficient computation of cubes.

8. Acknowledgments

Joe Hellerstein suggested interpreting the ALL value as a set. Tanj Bennett, David Maier and Pat O'Neil made many helpful suggestions that improved the presentation.

9. References

[Graefe] G. Graefe, "Query Evaluation Techniques for Large Databases," ACM Computing Surveys, 25.2, June 1993, pp. 73-170.
[Gray] J. Gray (Editor) The Benchmark Handbook, Morgan Kaufmann, San Francisco, CA, 1991.
[Gray et. al.] J. Gray, A. Bosworth, A. Layman, H. Pirahesh, "Data Cube: A Relational Operator Generalizing Group-By, Cross-Tab, and Roll-up," Proc International Conf. On Data Engineering, IEEE Press, Feb 1996, New Orleans.
[Harinarayn, Rajaraman, and Ullman] V. Harinarayn, A. Rajaraman, and J.D. Ullman, "Implementing Data Cubes Efficiently," Proc. ACM SIGMOD, June 1996, Montreal, pp. 205-216.
[Informix] DataBlade Developer's Kit: Users Guide 2.0, Informix Software, Menlo Park, CA, May 1996.
[Melton & Simon] J. Melton and A. R. Simon, Understanding the New SQL: A Complete Guide, Morgan Kaufmann, San Francisco, CA, 1993.
[ADGNRS] R. Agrawal, P. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, S. Sarawagi, "On the Computation of Multidimensional Aggregates," Proc. 21st VLDB, Bombay, Sept 1996.
[SQL Server] Microsoft SQL Server: Transact-SQL Reference, Document 63900, May 1996, Microsoft Corp. Redmond, WA.
[Red Brick] RISQL Reference Guide, Red Brick Warehouse VPT Version 3, Part no: 401530, Red Brick Systems, Los Gatos, CA, 1994.
[SDNR] A. Shukla, P. Deshpande, J. F. Naughton, K. Ramaswamy, "Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies," Proc. 21st VLDB, Bombay, Sept 1996.
[SQL] IS 9075 International Standard for Database Language SQL, document ISO/IEC 9075:1992, J. Melton, Editor, October 1992.
[SQL97] ISO/IEC DBL:MCI-006 (ISO Working Draft) Database Language SQL - Part 4: Persistent Stored Modules (SQL/PSM), J. Melton, Editor, March 1996.
[TPC] The Benchmark Handbook for Database and Transaction Processing Systems - 2nd edition, J. Gray (ed.), Morgan Kaufmann, San Francisco, CA, 1993. Or http://www.tpc.org/