
Data Warehousing and Data Mining
Data Warehouse Design
Logical Design

Requirement analysis
Requirement specification
Conceptual design
Logical design
Physical design

Conceptual design
  is not a design as such; it is used just to describe the business
  should be a business model, not a data design model
  should identify real-world business objects (e.g. Customer, Order, Sale, Policy)
  should identify the relationships between these objects

Translating the agreed business requirements into system deliverables, within the scope defined by the conceptual design, results in the logical and then the physical design
Logical Design
Is more conceptual and abstract than physical design
Looks at the logical relationships among the objects
Defines the types of information that are needed

Entity-Relationship model
Identify the things of importance (Entities)
Identify the properties of these things (Attributes)
Show how two or more entities are related (Relationships)

The entities, attributes and relationships existing in an ER model are translated into a star model.

Logical design arranges data into a series of logical relationships called entities and attributes
An entity
  represents a chunk of information
  often maps to a table
An attribute
  is a component of an entity that helps define the uniqueness of the entity
  often maps to a column
A unique identifier
  is added to tables so that the same item can be told apart when it appears in different places
  is usually a primary key
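The entity, attribute and unique-identifier mapping above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the Customer entity, its columns and the sample rows are assumptions, not part of the original material.

```python
import sqlite3

# In-memory database; the Customer entity and its attributes are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (               -- entity -> table
        customer_id INTEGER PRIMARY KEY,  -- unique identifier -> primary key
        name        TEXT NOT NULL,        -- attribute -> column
        city        TEXT                  -- attribute -> column
    )
    """
)

# Two customers who share a name are still distinguishable by their key.
conn.executemany(
    "INSERT INTO customer (customer_id, name, city) VALUES (?, ?, ?)",
    [(1, "A. Kumar", "Pune"), (2, "A. Kumar", "Delhi")],
)
same_name = conn.execute(
    "SELECT customer_id FROM customer WHERE name = ? ORDER BY customer_id",
    ("A. Kumar",),
).fetchall()
print(same_name)
```

The name alone cannot tell the two rows apart; the primary key can, which is the role the slide gives to the unique identifier.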

record the associations between objects and facts

In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, we
  identify which information belongs to a central fact table and which information belongs to its associated dimension tables
  identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject

Logical design
  often starts with a conceptual schema and then generates relational structures
  should result in
    a set of entities and attributes corresponding to fact tables and dimension tables
    a model that maps operational data from your source into subject-oriented information in the target data warehouse schema
  means mapping the conceptual model structures to the logical model ones
  takes into account implementation issues, which are not considered in the conceptual schema

Identify all applicable entities (the conceptual model doesn't express all the details)
Add attributes to the data entities, fully or mostly, using business nomenclature
Assign datatype domains (e.g. text, date, numeric) rather than datatypes (e.g. varchar, integer)
Resolve M:M relationships (e.g. with an associative entity, record versioning, etc.)
Formalize keys (primary, alternate, foreign)
Carry out resolution of subtypes
Perform abstraction (e.g. abstracting conceptual entities such as Customer, Prospect, Supplier, etc. into a generalized entity such as Party) as part of the normalization process (so that data can be stored once)
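One of the steps above, resolving an M:M relationship with an associative entity, can be sketched as follows. The orders, product and order_line tables and their rows are hypothetical names chosen for the example, run through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders  (order_id   INTEGER PRIMARY KEY, order_date TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);

    -- Associative entity: one row per (order, product) pair, so the
    -- many-to-many link becomes two one-to-many relationships.
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL REFERENCES orders(order_id),
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

    INSERT INTO orders  VALUES (1, '2024-01-05'), (2, '2024-01-06');
    INSERT INTO product VALUES (10, 'Pen'), (11, 'Notebook');
    INSERT INTO order_line VALUES (1, 10, 3), (1, 11, 1), (2, 10, 5);
    """
)

# One order can hold many products ...
products_in_order_1 = [row[0] for row in conn.execute(
    "SELECT p.name FROM order_line ol JOIN product p USING (product_id) "
    "WHERE ol.order_id = 1 ORDER BY p.name")]

# ... and one product can appear in many orders.
orders_with_pen = [row[0] for row in conn.execute(
    "SELECT ol.order_id FROM order_line ol JOIN product p USING (product_id) "
    "WHERE p.name = 'Pen' ORDER BY ol.order_id")]
```

The composite primary key on order_line is one possible choice; a surrogate key would work equally well in this sketch.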

A multidimensional database
  is a type of database that is optimized for data warehouse and online analytical processing (OLAP) applications
  is designed to make the best use of storing and utilizing data
  is created using input from existing relational databases
  implies the ability to rapidly process the data in the database so that answers can be generated quickly
  can receive data from a variety of relational databases and structure the information into categories and sections that can be accessed in a number of different ways
  lets users obtain data more easily, more quickly and more succinctly
  uses the idea of a data cube to represent the dimensions of data available to a user

Store data in dimensions
Multiple dimensions, aka cubes (also called hypercubes), allow users to analyze any view of data
Can consolidate data much faster than a relational database

"sales" could be viewed in the dimensions of product model, geography, time, or some additional dimension. In this case, "sales" is known as the measure attribute of the data cube and the other dimensions are seen as feature attributes. More hierarchies and levels can be defined within a dimension (for example, state and city levels within a regional hierarchy).

Product Dimension: ProductKey
Time Dimension: TimeKey
Store Dimension: StoreKey
Sales Fact Table: TimeKey (FK), ProductKey (FK), StoreKey (FK), Sale Amount
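The star layout above, one fact table with a foreign key to each dimension, can be sketched in SQLite through Python's sqlite3 module. The key and measure names follow the slide; the table names (ProductDim, TimeDim, StoreDim, SalesFact), the descriptive columns and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per product, time period and store.
    CREATE TABLE ProductDim (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE TimeDim    (TimeKey    INTEGER PRIMARY KEY, Year INTEGER);
    CREATE TABLE StoreDim   (StoreKey   INTEGER PRIMARY KEY, City TEXT);

    -- Fact table: foreign keys to every dimension plus the measure.
    CREATE TABLE SalesFact (
        TimeKey    INTEGER REFERENCES TimeDim(TimeKey),
        ProductKey INTEGER REFERENCES ProductDim(ProductKey),
        StoreKey   INTEGER REFERENCES StoreDim(StoreKey),
        SaleAmount INTEGER
    );

    INSERT INTO ProductDim VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO TimeDim    VALUES (1, 2023), (2, 2024);
    INSERT INTO StoreDim   VALUES (1, 'Mumbai'), (2, 'Chennai');
    INSERT INTO SalesFact  VALUES (1, 1, 1, 1000), (1, 2, 1, 1200), (2, 1, 2, 1500);
    """
)

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_city = conn.execute(
    """
    SELECT s.City, SUM(f.SaleAmount)
    FROM SalesFact f
    JOIN StoreDim s ON s.StoreKey = f.StoreKey
    GROUP BY s.City
    ORDER BY s.City
    """
).fetchall()
print(sales_by_city)
```

Swapping StoreDim for TimeDim or ProductDim in the join gives the same measure viewed along a different dimension.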

SALES Fact Table (columns: Time Key, Product Key, Stores Key, Sales Amount; sample amounts Rs.1000, Rs.1200, Rs.1500)
[Figure: the same facts drawn as a data cube with axes Product, Store and Time; individual cells hold sale amounts such as Rs.1000 and Rs.1200]

Relational Database
  structured for keyword searches and for building a query by specifying fields and parameters, using SQL

Multi Dimensional Database
  a user simply poses the question in everyday language
  the user is helped by online help tools like those associated with word processing and spreadsheet applications, as well as several of the more popular search engines currently in use

SQL has several aggregate operators:
  SUM(), MIN(), MAX(), COUNT(), AVG()
Some systems extend this with many others:
  statistical functions, financial functions
  e.g. RANK(), N_TILE(), RATIO_TO_TOTAL()

The basic idea is:
  Combine all values in a column into a single scalar value
SELECT AVG(Temp)
FROM Weather;
[Figure: the Temp column of Weather (values such as 13, 17, ...) reduced by AVG() to a single value]
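A quick way to see an aggregate collapse a column into one scalar is to run the five standard operators against a tiny table. The sketch below uses Python's sqlite3 module; the three temperature values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Weather (Temp INTEGER)")
# Illustrative readings only.
conn.executemany("INSERT INTO Weather VALUES (?)", [(13,), (17,), (21,)])

# Each aggregate turns the whole Temp column into one value,
# so the query returns exactly one row.
stats = conn.execute(
    "SELECT SUM(Temp), MIN(Temp), MAX(Temp), COUNT(Temp), AVG(Temp) FROM Weather"
).fetchone()
print(stats)
```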

GROUP BY allows aggregates over table sub-groups
Result is a new table
Syntax
SELECT Time, Altitude, AVG(Temp)
FROM Weather
GROUP BY Time, Altitude;

Weather (Latitude and Longitude values not shown)
Time         Altitude (m)  Temp
07/9/5:1500  20            24
07/9/5:1500  20            22
07/9/5:1500  100           17
07/9/9:1500  50            19
07/9/9:1500  50            21

Result
Time         Altitude (m)  AVG(Temp)
07/9/5:1500  20            23
07/9/5:1500  100           17
07/9/9:1500  50            20
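The GROUP BY query from this slide can be run as written against a small Weather table. In the sketch below (Python's sqlite3 module) the rows are illustrative, the Latitude and Longitude columns are left out for brevity, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Weather (Time TEXT, Altitude INTEGER, Temp INTEGER)")
# Illustrative readings: several per (Time, Altitude) sub-group.
conn.executemany("INSERT INTO Weather VALUES (?, ?, ?)", [
    ("07/9/5:1500", 20, 24),
    ("07/9/5:1500", 20, 22),
    ("07/9/5:1500", 100, 17),
    ("07/9/9:1500", 50, 19),
    ("07/9/9:1500", 50, 21),
])

# One output row per distinct (Time, Altitude) pair.
grouped = conn.execute(
    """
    SELECT Time, Altitude, AVG(Temp)
    FROM Weather
    GROUP BY Time, Altitude
    ORDER BY Time, Altitude
    """
).fetchall()
for row in grouped:
    print(row)
```

Five input rows collapse to three output rows, one per sub-group, which is the "new table" the slide refers to.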

Users want Histograms
Suppose:
  Day(): maps a time to a day
  Nation(): maps a latitude and longitude to the name of the country
SELECT day, nation, MAX(Temp)
FROM Weather
GROUP BY Day(Time) AS day,
         Nation(Latitude, Longitude) AS nation;
[Figure: histogram of MAX(Temp) by day, nation]

The following is not a STANDARD SQL query!!
SELECT day, nation, MAX(Temp)
FROM Weather
GROUP BY Day(Time) AS day,
         Nation(Latitude, Longitude) AS nation;

In standard SQL (a nested query):
SELECT day, nation, MAX(Temp)
FROM (SELECT Day(Time) AS day,
             Nation(Latitude, Longitude) AS nation,
             Temp
      FROM Weather) AS foo
GROUP BY day, nation;
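The nested-query form can be tried in SQLite by registering Day() and Nation() as user-defined functions. Both Python implementations below are toy stand-ins invented for this sketch (the bounding box in particular is crude), and the Weather rows are made up; the inner query selects Temp so that the outer MAX(Temp) has a column to work on.

```python
import sqlite3

def day(time_text):
    # Toy Day(): keep the date part of a 'yy/m/d:hhmm' string.
    return time_text.split(":")[0]

def nation(latitude, longitude):
    # Toy Nation(): a rough bounding box, purely for illustration.
    if 33 <= latitude <= 39 and 124 <= longitude <= 131:
        return "Korea"
    return "Other"

conn = sqlite3.connect(":memory:")
conn.create_function("Day", 1, day)
conn.create_function("Nation", 2, nation)
conn.execute(
    "CREATE TABLE Weather (Time TEXT, Latitude REAL, Longitude REAL, Temp INTEGER)"
)
conn.executemany("INSERT INTO Weather VALUES (?, ?, ?, ?)", [
    ("07/9/5:1500", 37.5, 127.0, 24),
    ("07/9/5:1800", 35.1, 129.0, 22),
    ("07/9/5:1500", 48.8, 2.3, 17),
    ("07/9/9:1500", 37.5, 127.0, 21),
])

# The inner query computes the derived columns; the outer one groups on them.
histogram = conn.execute(
    """
    SELECT day, nation, MAX(Temp)
    FROM (SELECT Day(Time) AS day,
                 Nation(Latitude, Longitude) AS nation,
                 Temp
          FROM Weather) AS foo
    GROUP BY day, nation
    ORDER BY day, nation
    """
).fetchall()
for row in histogram:
    print(row)
```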

Users want Roll-Up Reports
Attributes: Model, Year, Color, and Sales
Chevy Sales Roll Up by Model by Year by Color
Keyword ALL stands for the set of values that were aggregated over, e.g. {Black, White} for Color and {1994, 1995} for Year

Problems with GROUP BY - Roll-Up Reports
To build the Chevy Sales Roll Up: unioned GROUP BYs
Too many GROUP BYs and UNIONs!!
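To make the complaint concrete, here is a sketch of a three-level roll-up written as unioned GROUP BYs, run through Python's sqlite3 module. The Sales table layout follows the attributes named on the slide; the figures are illustrative, and the string 'ALL' is used as the placeholder for an aggregated-away column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Sales INTEGER)")
# Illustrative figures only.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("Chevy", 1994, "Black", 50),
    ("Chevy", 1994, "White", 40),
    ("Chevy", 1995, "Black", 85),
    ("Chevy", 1995, "White", 115),
])

# One GROUP BY per roll-up level, glued together with UNION ALL.
roll_up = conn.execute(
    """
    SELECT Model, Year, Color, SUM(Sales) FROM Sales GROUP BY Model, Year, Color
    UNION ALL
    SELECT Model, Year, 'ALL', SUM(Sales) FROM Sales GROUP BY Model, Year
    UNION ALL
    SELECT Model, 'ALL', 'ALL', SUM(Sales) FROM Sales GROUP BY Model
    UNION ALL
    SELECT 'ALL', 'ALL', 'ALL', SUM(Sales) FROM Sales
    """
).fetchall()
for row in roll_up:
    print(row)
```

Three grouping attributes already need four SELECT statements; every extra attribute adds another, which is the problem the slide points at.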

Users want Cross-Tabulations


Chevy Sales Cross-Tab

By adding the following clause

Problems with GROUP BY
GROUP BY cannot directly construct:
Histograms
Roll-Up Reports
Cross-Tabs
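For cross-tabs, a common workaround in plain SQL is conditional aggregation: one SUM(CASE ...) per output column. The sketch below is that workaround, not the clause the slides had in mind; the Sales rows are illustrative and the query runs through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Sales INTEGER)")
# Illustrative figures only.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("Chevy", 1994, "Black", 50),
    ("Chevy", 1994, "White", 40),
    ("Chevy", 1995, "Black", 85),
    ("Chevy", 1995, "White", 115),
])

# Rows are colors, columns are years; every year needs its own CASE expression.
cross_tab = conn.execute(
    """
    SELECT Color,
           SUM(CASE WHEN Year = 1994 THEN Sales ELSE 0 END) AS y1994,
           SUM(CASE WHEN Year = 1995 THEN Sales ELSE 0 END) AS y1995,
           SUM(Sales) AS total
    FROM Sales
    WHERE Model = 'Chevy'
    GROUP BY Color
    ORDER BY Color
    """
).fetchall()
for row in cross_tab:
    print(row)
```

The column list has to be written out by hand for each year, and the bottom row of column totals would need yet another unioned query, so GROUP BY still does not produce the cross-tab directly.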

CUBE Operator
Generalizes GROUP BY, Roll-Up and Cross-Tabs!!

CUBE

Think of ALL as a token representing the set

{red, white, blue}

{1990, 1991, 1992}

{Chevy, Ford}

Sample syntax:
SELECT Model, Make, Year, SUM(Sales)
FROM Sales
WHERE Model IN {Chevy, Ford}
AND Year BETWEEN 1990 AND 1994
GROUP BY CUBE Model, Make, Year
HAVING SUM(Sales) > 0;
Note: the GROUP BY operator repeats the aggregate list
  in the select list
  in the group by list
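What CUBE computes can be emulated by running one GROUP BY for every subset of the grouping columns and marking the dropped columns with ALL. The sketch below does this in Python over SQLite, which has no CUBE operator of its own; the cube() helper, the Make/Year/Color layout and the rows are assumptions for illustration, not the slide's syntax.

```python
import sqlite3
from itertools import combinations

def cube(conn, table, dims, measure):
    """Emulate GROUP BY CUBE: one GROUP BY per subset of dims (2**N in total)."""
    rows = []
    for size in range(len(dims), -1, -1):
        for kept in combinations(dims, size):
            # Columns not in this subset are reported as the token 'ALL'.
            select = ", ".join(d if d in kept else "'ALL'" for d in dims)
            sql = f"SELECT {select}, SUM({measure}) FROM {table}"
            if kept:
                sql += " GROUP BY " + ", ".join(kept)
            rows.extend(conn.execute(sql).fetchall())
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Make TEXT, Year INTEGER, Color TEXT, Sales INTEGER)")
# Illustrative figures only.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "blue", 87),
    ("Ford", 1990, "red", 64),
    ("Ford", 1991, "blue", 8),
])

cube_rows = cube(conn, "Sales", ["Make", "Year", "Color"], "Sales")
for row in cube_rows:
    print(row)
```

With three dimensions the helper issues eight queries; the table and column names are interpolated into the SQL text, so this sketch should only be fed trusted identifiers.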

Allows functional aggregations (e.g., Sales by quarter):
SELECT Store, quarter, SUM(Sales)
FROM Sales
WHERE nation=Korea AND Year=1994
GROUP BY ROLLUP Store, Quarter(Date) AS quarter;

ROLLUP Operator
A Subset of CUBE Operator
Return Sales Roll Up by Store by Quarter in 1994.
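ROLLUP yields only the prefixes of the grouping list, (Store, quarter), (Store) and the grand total, whereas CUBE would also add a per-quarter total across stores. The sketch below emulates that in SQLite via Python's sqlite3 module; the Quarter() implementation, the table layout and the rows are assumptions, and the nation and year filters from the slide are left out for brevity.

```python
import sqlite3

def quarter(date_text):
    # Toy Quarter(): 'YYYY-MM-DD' -> 1..4
    month = int(date_text[5:7])
    return (month - 1) // 3 + 1

conn = sqlite3.connect(":memory:")
conn.create_function("Quarter", 1, quarter)
conn.execute("CREATE TABLE Sales (Store TEXT, Date TEXT, Sales INTEGER)")
# Illustrative figures only.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    ("Seoul", "1994-01-15", 10),
    ("Seoul", "1994-02-20", 20),
    ("Seoul", "1994-05-03", 30),
    ("Busan", "1994-01-09", 40),
])

# Derived table that applies the function once; reused at every level.
base = "(SELECT Store, Quarter(Date) AS quarter, Sales FROM Sales) AS s"
rollup_rows = conn.execute(
    f"""
    SELECT Store, quarter, SUM(Sales) FROM {base} GROUP BY Store, quarter
    UNION ALL
    SELECT Store, 'ALL', SUM(Sales) FROM {base} GROUP BY Store
    UNION ALL
    SELECT 'ALL', 'ALL', SUM(Sales) FROM {base}
    """
).fetchall()
for row in rollup_rows:
    print(row)
```

No row has ALL in the Store column together with a specific quarter; that missing combination is what makes ROLLUP a subset of CUBE.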

An Example of ..
3D Data Cube
[Figure: a 3D data cube with axes Make (Chevy, Ford), Year (1990 to 1993) and Color (Red, White, Blue); the aggregates are labelled Sum, By Make, By Year, By Color, By Make & Color and By Color & Year]

A data cube
  is a multi-dimensional structure containing data points that represent unique combinations of several classifications
  is a flexible way of storing and disseminating data
  is a data structure that allows fast analysis of data
  overcomes a limitation of relational databases
  can be thought of as an extension of the two-dimensional array of a spreadsheet
  consists of numeric facts (also called measures), which are categorized by dimensions

The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.

         Year
Country  2000     2001     2002     2003
AAA      123 456  124 567  125 678  126 789
BBB      987 654  988 654  989 654  999 654
CCC      35 789   36 789   37 789   38 789

Many recent statistical data management models and systems are based on cubes
Users can select just those data that are of interest
Cubes can easily be expanded, e.g. for extra years, countries, or other categories
At least in theory, cubes can have an infinite number of dimensions
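The Country by Year table and the points about selecting and expanding can be sketched with a plain Python dictionary keyed by cell coordinates. The values are copied from the table; the select() helper and the 2004 figure are hypothetical additions for the example.

```python
# Cube cells keyed by (country, year); values taken from the Country x Year table.
cube = {
    ("AAA", 2000): 123456, ("AAA", 2001): 124567, ("AAA", 2002): 125678, ("AAA", 2003): 126789,
    ("BBB", 2000): 987654, ("BBB", 2001): 988654, ("BBB", 2002): 989654, ("BBB", 2003): 999654,
    ("CCC", 2000): 35789, ("CCC", 2001): 36789, ("CCC", 2002): 37789, ("CCC", 2003): 38789,
}

def select(cells, country=None, year=None):
    """Return only the cells matching the given coordinates (None means any)."""
    return {
        (c, y): value
        for (c, y), value in cells.items()
        if (country is None or c == country) and (year is None or y == year)
    }

# Users select just the data of interest: one country, or one year.
ccc_only = select(cube, country="CCC")
total_2000 = sum(select(cube, year=2000).values())

# Expanding the cube is just adding cells; this 2004 value is made up.
cube[("AAA", 2004)] = 127890
```

Adding a further dimension would mean lengthening the key tuple, for example (country, year, category), which is how the structure stays open-ended.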
