
Data Cube Aggregation

Aggregates and group-by operators produce 0- or 1-dimensional answers; cross-tabs produce 2-d
answers; the data-cube produces N-dimensional aggregates and answers by treating each of the
N aggregation attributes as a dimension of N-space.
Data analysis applications look for unusual patterns in data. They summarize data values, extract
statistical information, and then contrast one category with another. There are two steps to such
data analysis: extracting the aggregated data from the database into a file or table, and visualizing
the results in a graphical way. Visualization tools display trends, clusters, and differences. The
most exciting work in data analysis focuses on presenting new graphical metaphors that allow
people to discover data trends and anomalies.
Overview/Main Points

aggregation operators:
o 5 functions in SQL: count, sum, min, max, and avg; each returns a single value.
o the group by operator creates a table of aggregates indexed by some set of attributes
(e.g. max(salary) group by department).
o proprietary extensions to SQL: cumulative sum, running sum, running average
(Red Brick Systems).
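
As a rough illustration of the first two points, the sketch below runs a plain aggregate and a group by through Python's built-in sqlite3 module. The emp table and its rows are hypothetical, invented only for this example.

```python
import sqlite3

# Hypothetical table and rows, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("ann", "eng", 120), ("bob", "eng", 100), ("cy", "ops", 90)],
)

# A plain aggregate collapses the whole table to one value (a 0-d answer).
top_salary = conn.execute("SELECT MAX(salary) FROM emp").fetchone()[0]

# GROUP BY yields one aggregate per department (a 1-d answer).
by_department = dict(
    conn.execute("SELECT department, MAX(salary) FROM emp GROUP BY department")
)
```

Here top_salary is a single number, while by_department maps each department to its own maximum.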

problems with group by:

o does not easily permit histograms (you need to construct a table-valued expression, then
aggregate over the resulting table, and even then the histogram does not necessarily
have evenly spaced categories).
o roll-up totals and sub-totals for drill-downs are awkward: they require N unions for an
N-dimensional roll-up.

roll-up total: reports aggregate data at a coarse level (e.g. average(sales)
group by model), then at a finer level (e.g. average(sales) group by
model, year), etc. for successively finer levels.

going from finer to coarser is rolling up the data; going from coarser to finer is
drilling down into the data (I think - paper was ambiguous about which term
corresponds to which direction.)

o cross-tabulations. Imagine the data:

Model   Year   Color   Sales
Chevy   1994   black   50
Chevy   1994   white   40
Chevy   1995   black   85
Chevy   1995   white   115
o The cross-tabulation for this adds rows and columns to give symmetric aggregation
results across each dimension:

Chevy        1994   1995   total(ALL)
black        50     85     135
white        40     115    155
total(ALL)   90     200    290

o The four interior cells (50, 85, 40, 115) are the original data from the table. The extra
column on the right is the aggregation across years grouped by color. The extra row on the
bottom is the aggregation across colors grouped by year (i.e. a 1-dimensional
aggregate of the 2-d data). The extra point in the bottom-right is the aggregation
across all data (i.e. the 0-dimensional aggregate of the 2-d data).
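
To make the union problem concrete, here is a minimal sketch that builds the Chevy cross-tab with plain group by queries glued together by UNION ALL, again via Python's sqlite3 module. The table name sales and the use of the string 'ALL' as a placeholder are assumptions of this sketch, not something the notes prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (model TEXT, year INTEGER, color TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Chevy", 1994, "black", 50),
        ("Chevy", 1994, "white", 40),
        ("Chevy", 1995, "black", 85),
        ("Chevy", 1995, "white", 115),
    ],
)

# One GROUP BY per subset of {color, year}; 'ALL' marks a dimension that
# has been aggregated away. Years are cast to text so the column is uniform.
query = """
SELECT color, CAST(year AS TEXT), SUM(sales) FROM sales GROUP BY color, year
UNION ALL
SELECT color, 'ALL', SUM(sales) FROM sales GROUP BY color
UNION ALL
SELECT 'ALL', CAST(year AS TEXT), SUM(sales) FROM sales GROUP BY year
UNION ALL
SELECT 'ALL', 'ALL', SUM(sales) FROM sales
"""
crosstab = {(color, year): total for color, year, total in conn.execute(query)}
```

Even for two dimensions this takes four separate queries, which is the kind of repetition the notes list as a problem with group by.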

data-cube: simply a generalization of the cross-tab to N dimensions (usually 3). The core
of the cube is the data. Extra planes are added for 2-d aggregations of the 3-d data. Extra
edges are added for 1-d aggregations of the 3-d data. An extra vertex is added for the 0-d
aggregation of the 3-d data.

end up with 2^N aggregates for the N-d data cube, one for each set in the power set of the
N aggregation attributes.
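
The power-set structure can be sketched in a few lines of Python. This is a naive illustration that rescans the rows once per grouping, not an efficient cube algorithm; the function names groupings and data_cube and the string used for ALL are hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder token; using a plain string is an assumption


def groupings(dims):
    """Return every subset of the dimensions: 2^N groupings for N dims."""
    return [kept for k in range(len(dims) + 1) for kept in combinations(dims, k)]


def data_cube(rows, dims, measure):
    """Sum `measure` for every grouping, writing ALL for aggregated-away dims."""
    cube = defaultdict(int)
    for kept in groupings(dims):
        for row in rows:
            key = tuple(row[d] if d in kept else ALL for d in dims)
            cube[key] += row[measure]
    return dict(cube)


rows = [
    {"model": "Chevy", "year": 1994, "color": "black", "sales": 50},
    {"model": "Chevy", "year": 1994, "color": "white", "sales": 40},
    {"model": "Chevy", "year": 1995, "color": "black", "sales": 85},
    {"model": "Chevy", "year": 1995, "color": "white", "sales": 115},
]
dims = ("model", "year", "color")
cube = data_cube(rows, dims, "sales")
```

With three dimensions, groupings(dims) has 2^3 = 8 entries: the empty grouping gives the single vertex (ALL, ALL, ALL), and the full grouping gives back the core data.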

issues:
o to implement as a relation, need the power set of the relation in question - use the
special token ALL to indicate fields that have been aggregated across (e.g. if a
tuple from the sales table is (Chevy, 1990, blue, 64), then you could imagine a
new tuple in the power set (Chevy, ALL, blue, 182) which is the aggregation
across all years of blue Chevys).
o the ALL token greatly complicates SQL code. Think of ALL as being set-valued, e.g.
above, ALL would represent the set of years {1990, 1991, 1992}.
o How do you efficiently compute this power set (which is really the data cube)? If the
data aggregation function has special properties, e.g. is Distributive, Algebraic, or
Holistic, the algorithm for computing the data cube can exploit those properties.
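
A small sketch of why the function class matters, using the usual reading of the three terms: a distributive function can be recombined from its own sub-aggregates, an algebraic one from a fixed-size summary per partition, and a holistic one from neither in general. The partitioned values below are hypothetical numbers chosen for illustration.

```python
from statistics import median

# Hypothetical raw values, already split into two partitions.
partitions = [[50, 40], [85, 115, 30]]
all_values = [v for part in partitions for v in part]

# Distributive (SUM): the sum of the partial sums is the overall sum.
total = sum(sum(part) for part in partitions)

# Algebraic (AVG): keep a fixed-size summary (sum, count) per partition,
# then combine the summaries to get the overall average.
summaries = [(sum(part), len(part)) for part in partitions]
average = sum(s for s, _ in summaries) / sum(n for _, n in summaries)

# Holistic (MEDIAN): combining per-partition medians does not, in general,
# recover the true median; it differs for this data.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

A cube algorithm can therefore fill in coarser cells for sum or avg from finer cells it has already computed, whereas a function like median may force a return to the underlying data.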
Submitted By,

Pranav Sharma
1325934(MCA Vth sem).
