Data Mining
Advantages of Warehousing
Advantages of Mediator Systems
No need to copy data
less storage
no need to purchase data
The Architecture of Data Warehousing
[Architecture diagram: operational databases and external data sources are brought into the data warehouse through extract, transform, load, and refresh steps; a metadata repository describes the warehouse; the warehouse and its data marts serve an OLAP server, which supports reports, OLAP, and data mining.]
Data Sources
Data sources are often the operational
systems, providing the lowest level of data.
Data sources are designed for operational
use, not for decision support, and the data
reflect this fact.
Multiple data sources often come from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized.
Multiple data sources introduce a large number of issues, such as semantic conflicts.
Centralized
Distributed
Federated
Tiered
[Centralized architecture: clients query a single central data warehouse built from the sources.]
[Federated architecture: local data marts form a logical data warehouse over the sources.]
[Tiered architecture: a physical data warehouse with local data marts on higher tiers.]
Tiered architecture
The central data warehouse is physical.
Local data marts exist on different tiers and store copies or summarizations of the previous tier.
Conceptual Modeling of
Data Warehouses
Three basic conceptual schemas:
Star schema
Snowflake schema
Fact constellations
Star schema
product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
store(storeId, city)
customer(custId, name, address, city)
Star schema (example data)

product:
prodId  name  price
p1      bolt  10
p2      nut   5

sale:
custId  prodId  storeId  qty  amt
53      p1      c1       1    12
53      p2      c1       2    11
111     p1      c3       5    50

customer:
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

store:
storeId  city
c1       nyc
c2       sfo
c3       la
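To make the star schema concrete, here is a minimal Python sketch (hypothetical code; the data are the example rows above) that joins the sale fact table to the store dimension and totals amt per store city:

    # Star-schema example data from the slides, held as simple Python structures.
    sale = [
        {"custId": 53,  "prodId": "p1", "storeId": "c1", "qty": 1, "amt": 12},
        {"custId": 53,  "prodId": "p2", "storeId": "c1", "qty": 2, "amt": 11},
        {"custId": 111, "prodId": "p1", "storeId": "c3", "qty": 5, "amt": 50},
    ]
    store = {"c1": "nyc", "c2": "sfo", "c3": "la"}   # storeId -> city

    # Join the fact table to the store dimension and aggregate amt by city.
    sales_by_city = {}
    for row in sale:
        city = store[row["storeId"]]
        sales_by_city[city] = sales_by_city.get(city, 0) + row["amt"]

    print(sales_by_city)   # {'nyc': 23, 'la': 50}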
Terms
Basic notion: a measure (e.g. sales,
qty, etc)
Given a collection of numeric
measures
Each measure depends on a set of
dimensions (e.g. sales volume as a
function of product, time, and location)
Terms
The relation that relates the dimensions to the measure of interest is called the fact table (e.g. sale)
Information about dimensions can be
represented as a collection of
relations called the dimension
tables (product, customer, store)
Each dimension can have a set of
associated attributes
[Star schema example: fact table Measurements (unit_sales, dollar_sales, schilling_sales) with dimension tables Product (ProductNo, ProdName, ProdDesc, Category, QOH), Date (Date, Month, Year), Store (StoreID, City, State, Country, Region), and Customer (CustId, CustName, CustCity, CustCountry).]
Dimension Hierarchies
For each dimension, the set of associated
attributes can be structured as a hierarchy
store → city → region; store → sType
customer → city → state → country
Dimension Hierarchies

store:
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType:
tId  size   location
t1   small  downtown
t2   large  suburbs

city:
cityId  pop  regId
sfo     1M   north
la      5M   south

region:
regId  name
north  cold region
south  warm region
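A small Python sketch (hypothetical code, using the store/city/region tables above; the per-store sales figures are made up) showing how following the dimension hierarchy store → city → region lets a store-level measure be rolled up to the region level:

    # Dimension tables from the example above.
    store  = {"s5": "sfo", "s7": "sfo", "s9": "la"}      # storeId -> cityId
    city   = {"sfo": "north", "la": "south"}             # cityId  -> regId
    region = {"north": "cold region", "south": "warm region"}

    # Hypothetical store-level measure (e.g. sales amount per store).
    sales_by_store = {"s5": 100, "s7": 40, "s9": 25}

    # Roll the measure up the hierarchy: store -> city -> region.
    sales_by_region = {}
    for store_id, amount in sales_by_store.items():
        reg_id = city[store[store_id]]
        sales_by_region[reg_id] = sales_by_region.get(reg_id, 0) + amount

    print({region[r]: total for r, total in sales_by_region.items()})
    # {'cold region': 140, 'warm region': 25}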
Snowflake Schema
Snowflake schema: A refinement of
star schema where the dimensional
hierarchy is represented explicitly by
normalizing the dimension tables
[Snowflake schema example: fact table Measurements (unit_sales, dollar_sales, schilling_sales) with normalized dimensions Product (ProductNo, ProdName, ProdDesc, Category, QOH); Date (Date, Month) → Month (Month, Year) → Year; Store (StoreID, City) → City (City, State) → State (State, Country) → Country (Country, Region); and Cust (CustId, CustName, CustCity, CustCountry).]
Fact constellations
Activity
Choosing the process
Choosing the grain
Identifying and conforming the dimensions
Choosing the facts
Storing the precalculations in the fact table
Rounding out the dimension tables
Choosing the duration of the database
Tracking slowly changing dimensions
Deciding the query priorities and the query modes
Fact relation
sale:
Product  Client  Amt
p1       c1      12
p2       c1      11
p1       c3      50
p2       c2      8

Cross-tabulation of Amt by Product and Client:
      c1   c2   c3
p1    12        50
p2    11   8
3-dimensional cube
sale with a Date dimension:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

Cube slices of Amt by Product and Client:
day 1:      c1   c2   c3
      p1    12        50
      p2    11   8
day 2:      c1   c2   c3
      p1    44   4
      p2
Aggregation over the sale relation above: summing amt for date 1 gives result 81.
Aggregating amt by Date:
Date  sum(amt)
1     81
2     48
Aggregating amt by Product and Client:
Product  Client  Sum(amt)
p1       c1      56
p1       c2      4
p1       c3      50
p2       c1      11
p2       c2      8

Cross-tabulation with totals:
      p1    p2    Sum
c1    56    11    67
c2    4     8     12
c3    50          50
Sum   110   19    129
Aggregates
Operators: sum, count, max, min,
median, ave
Having clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)
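The aggregations shown above can be reproduced with a short Python sketch (hypothetical code; the rows are the six sale tuples from the 3-dimensional cube example):

    from collections import defaultdict

    # (product, client, date, amt) rows from the 3-dimensional cube example.
    sale = [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ]

    # Sum amt by date (gives 1 -> 81, 2 -> 48).
    by_date = defaultdict(int)
    for product, client, date, amt in sale:
        by_date[date] += amt

    # Sum amt by (product, client) (gives (p1, c1) -> 56, ..., grand total 129).
    by_prod_client = defaultdict(int)
    for product, client, date, amt in sale:
        by_prod_client[(product, client)] += amt

    print(dict(by_date))          # {1: 81, 2: 48}
    print(dict(by_prod_client))   # {('p1', 'c1'): 56, ('p2', 'c1'): 11, ...}
    print(sum(by_date.values()))  # 129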
Cube Aggregation
[Rolling up the day 1 and day 2 slices over the Date dimension gives the totals (p1,c1)=56, (p2,c1)=11, (p1,c2)=4, (p2,c2)=8, (p1,c3)=50, and the overall sum 129.]
Cube Operators
[The cube operator adds the special value * ('all') to each dimension: sale(c1,*,*) aggregates over all products and dates for client c1, sale(c2,p2,*) aggregates over all dates for client c2 and product p2, sale(*,p2,*) aggregates over all clients and dates for product p2, and sale(*,*,*) = 129 is the grand total.]
Aggregation along the customer hierarchy (customer → region → country), with customer c1 in Region A and customers c2, c3 in Region B:
      Region A   Region B
p1    12         50
p2    11         8
[Figure: sales cubes with dimensions product (CD, video, camera), date of sale / quarter (1Q-4Q), city or region (e.g. Poznań), and country (USA, Canada, Mexico), illustrating aggregation with respect to city and sums along each dimension.]
Exercise (1)
Suppose the AAA Automobile Co. builds a
data warehouse to analyze sales of its cars.
The measure - price of a car
We would like to answer the following typical
queries:
find total sales by day, week, month and year
find total sales by week, month, ... for each dealer
find total sales by week, month, ... for each car
model
find total sales by month for all dealers in a given
city, region and state.
Exercise (2)
Dimensions:
time (day, week, month, quarter, year)
dealer (name, city, state, region, phone)
cars (serialno, model, color, category, ...)
Metadata
Metadata is data about data that describes
the data warehouse.
Metadata can be classified into the following:
Technical metadata
Business metadata
Data warehouse operational information, such as data history, ownership, extract audit trail, and usage data
Technical Metadata
Information about data sources
Transformation descriptions: the mapping from the operational databases into the warehouse, and the algorithms used to convert, enhance, or transform data
Rules to perform data cleanup and data enhancement
Data structure definitions for data targets
Data-mapping operations when capturing data from source systems and applying it to the target warehouse database
Access authorisation, backup history, archive history, information delivery history, data acquisition history, data access, and so on
Business Metadata
Subject areas and information object type,
including queries, reports, images, video and/or
audio clips
Internet home pages
Other information to support all data
warehousing components. For example, the
information related to the information delivery
system should include subscription information;
scheduling information; details of delivery
destinations; and the business query objects
such as predefined queries, reports and
analyses.
The information directory and the entire metadata repository will have
the following attributes
Tool Taxonomy
OLAP tools
OLAP tools can be classified as multidimensional (MOLAP), relational (ROLAP), and hybrid (HOLAP) tools. Some of the more popular OLAP tools are Microsoft Decision Support Services, MicroStrategy DSS Server, Oracle Express, MetaCube from Informix, and so on.
Discovering knowledge
Segmentation
Classification
Association
Preferencing
Visualization
Data Marts
A data mart is directed at a partition of data that is created for the use of a dedicated group of users. It is a set of denormalized, summarized, or aggregated data.
Data Mining
The process of employing one or
more computer learning techniques
to automatically analyze and
extract knowledge from data.
[Process: data selected from the Data Warehouse via SQL queries → Data Mining → Interpretation & Evaluation → Result → Application]
Problem Definition
Creating a Database for Datamining
Exploring the database
Preparation for creating a Data Mining
Model
Building a Data Mining Model
Evaluating the Data Mining Model
Deploying the Data Mining Model
Data Mining Issues
Human Interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets
High dimensionality
Multimedia data
Missing Data
Irrelevant data
Noisy data
Changing data
Integration
Application
Data Mining Metrics
A data mining metric measures the effectiveness or usefulness of a data mining effort.
Effectiveness may be measured, for example, as an increase in sales or a reduction in advertising cost, and expressed as a return on investment (ROI).
The metrics used also include traditional metrics of space and time, for example similarity measures.
Scalability
Real-world data
Update
Ease of use
Decision Tree
A tree structure where nonterminal nodes represent tests on
one or more attributes and
terminal nodes reflect decision
outcomes.
Patient  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1        Yes          Yes    Yes             Yes         Yes       Strep throat
2        No           No     No              Yes         Yes       Allergy
3        Yes          Yes    No              Yes         No        Cold
4        Yes          No     Yes             No          No        Strep throat
5        No           Yes    No              Yes         No        Cold
6        No           No     No              Yes         No        Allergy
7        No           No     Yes             No          Yes       Strep throat
8        Yes          No     No              Yes         Yes       Allergy
9        No           Yes    No              Yes         Yes       Cold
10       Yes          Yes    No              Yes         Yes       Cold
Decision tree:
Swollen Glands?
  Yes → Diagnosis = Strep Throat
  No  → Fever?
          Yes → Diagnosis = Cold
          No  → Diagnosis = Allergy
Patient  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11       No           No     Yes             Yes         Yes       ?
12       Yes          Yes    No              No          Yes       ?
13       No           No     No              No          Yes       ?
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
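The production rules above can be applied directly; a minimal Python sketch (hypothetical function and attribute names) that classifies the three unlabeled patients from the second table:

    def diagnose(swollen_glands, fever):
        """Apply the production rules derived from the decision tree."""
        if swollen_glands == "Yes":
            return "Strep Throat"
        if fever == "Yes":
            return "Cold"
        return "Allergy"

    # Unlabeled patients 11-13 as (Swollen Glands, Fever) from the test table.
    patients = {11: ("Yes", "No"), 12: ("No", "Yes"), 13: ("No", "No")}
    for pid, (glands, fever) in patients.items():
        print(pid, diagnose(glands, fever))
    # 11 Strep Throat, 12 Cold, 13 Allergy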
Mining Association
Rules: An Example
Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex
No               No                        No                     Male
Yes              Yes                       No                     Female
No               No                        No                     Male
Yes              Yes                       Yes                    Male
No               Yes                       No                     Female
No               No                        No                     Female
No               Yes                       Yes                    Male
Yes              No                        No                     Male
No               No                        No                     Male
Yes              Yes                       No                     Female
General Considerations
We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
We are also interested in association rules that show a lower-than-expected confidence for a particular association.
Nearest Neighbour
Objects that are near each other will also
have similar prediction values. Thus, if you
know the prediction value of one of the
objects, you can predict it for its nearest
neighbours.
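A minimal 1-nearest-neighbour sketch in Python (hypothetical code and data; Euclidean distance is assumed as the similarity measure):

    import math

    def nearest_neighbour(query, examples):
        """Return the label of the training example closest to the query point."""
        def distance(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        best = min(examples, key=lambda ex: distance(query, ex[0]))
        return best[1]

    # Hypothetical labelled points: ((x, y), label).
    training = [((1.0, 1.5), "A"), ((1.0, 4.5), "B"), ((5.0, 6.0), "B")]
    print(nearest_neighbour((1.5, 2.0), training))   # "A"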
Table 3.6
Instance  x    y
1         1.0  1.5
2         1.0  4.5
3         2.0  1.5
4         2.0  3.5
5         3.0  2.5
6         5.0  6.0
[Scatter plot of the six instances]
Cluster Centers   Cluster Points    Squared Error
(2.67, 4.67)      2, 4, 6           14.50
(2.00, 1.83)      1, 3, 5
(1.5, 1.5)        1, 3              15.94
(2.75, 4.125)     2, 4, 5, 6
(1.8, 2.7)        1, 2, 3, 4, 5     9.60
(5, 6)            6
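A short Python sketch (hypothetical code) that reproduces the squared-error column of the table above from the instances in Table 3.6:

    # Instances from Table 3.6: instance number -> (x, y).
    points = {1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
              4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0)}

    def squared_error(clustering):
        """Sum of squared distances of each point to its cluster center."""
        total = 0.0
        for center, members in clustering:
            for m in members:
                x, y = points[m]
                total += (x - center[0]) ** 2 + (y - center[1]) ** 2
        return total

    outcomes = [
        [((2.67, 4.67), [2, 4, 6]), ((2.00, 1.83), [1, 3, 5])],   # ~14.50
        [((1.5, 1.5), [1, 3]), ((2.75, 4.125), [2, 4, 5, 6])],    # ~15.94
        [((1.8, 2.7), [1, 2, 3, 4, 5]), ((5.0, 6.0), [6])],       # ~9.60
    ]
    for outcome in outcomes:
        print(round(squared_error(outcome), 2))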
General Considerations
Requires real-valued data.
We must select the number of clusters present in
the data.
Works best when the clusters in the data are of
approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Bayesian Classification
ID  Income  Credit  Class  x(i)
1   4       e       h1     x4
2   3       g       h1     x7
3   2       e       h1     x2
4   3       g       h1     x7
5   4       g       h1     x8
6   2       e       h1     x2
7   3       b       h2     x11
8   2       b       h2     x10
9   3       b       h3     x11
10  1       b       h4     x9
11  2       g       h2     x6
P(h1 | xi) = P(xi | h1) · P(h1) / Σj P(xi | hj) · P(hj)
Let h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize and report to police.
Income Group
1: 0 - 10,000
2: 10,000 - 50,000
3: 50,000 - 100,000
4: over 100,000
Construct a Table
Credit  Income 1  Income 2  Income 3  Income 4
e       x1        x2        x3        x4
g       x5        x6        x7        x8
b       x9        x10       x11       x12
Attribute counts by class (Short, Medium, Tall):
Gender: Medium 2/8, Tall 3/3 for one value; Medium 6/8, Tall 0/3 for the other
Height 0-1.6: Short 2/4
Height 1.6-1.7: Short 2/4
Height 1.7-1.8: Medium 4/8
Height 1.9-2.0: Medium 1/8, Tall 1/3
Height over 2.0: Tall 2/3
P(t|short) = … × 0 = 0
P(t|medium) = 2/8 × 1/8 = 0.031
P(t|tall) = 3/3 × 1/3 = 0.333
Likelihood of being short = 0 × 0.267 = 0
Likelihood of being medium = 0.031 × 0.533 = 0.0166
Likelihood of being tall = 0.333 × 0.2 = 0.066
P(t) = 0 + 0.0166 + 0.066 = 0.0826
P(short|t) = 0 × 0.267 / 0.0826 = 0
P(medium|t) = 0.031 × 0.533 / 0.0826 = 0.2
P(tall|t) = 0.333 × 0.2 / 0.0826 = 0.799
The tuple t is classified as tall, since that probability is highest.
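The calculation above can be checked with a few lines of Python (hypothetical code; the conditional probabilities and priors are the values used in the slide):

    # P(t | class) from the count table, and the class priors P(class).
    likelihood = {"short": 0.0, "medium": (2 / 8) * (1 / 8), "tall": (3 / 3) * (1 / 3)}
    prior = {"short": 4 / 15, "medium": 8 / 15, "tall": 3 / 15}

    # Bayes rule: P(class | t) = P(t | class) * P(class) / P(t).
    joint = {c: likelihood[c] * prior[c] for c in prior}
    p_t = sum(joint.values())
    posterior = {c: joint[c] / p_t for c in prior}

    print(round(p_t, 4))   # ~0.0833 (the slide's 0.0826 comes from rounding intermediate values)
    print({c: round(p, 2) for c, p in posterior.items()})
    # {'short': 0.0, 'medium': 0.2, 'tall': 0.8}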
ID3 Algorithm
The concept used to quantify information is
called entropy. Entropy is used to measure
the amount of uncertainty or surprise or
randomness in a set of data.
The basic strategy used by ID3 is to choose
splitting attributes with the highest
information gain first.
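A minimal Python sketch (hypothetical code, using base-2 entropy as is conventional for ID3; the GainRatio slide that follows uses base-10 logs instead) showing why the Strep-throat example above splits on Swollen Glands first, since its information gain is higher than that of Fever:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy (base 2) of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

    def information_gain(rows, attribute, target="Diagnosis"):
        """Entropy reduction obtained by splitting rows on the given attribute."""
        base = entropy([r[target] for r in rows])
        remainder = 0.0
        for value in set(r[attribute] for r in rows):
            subset = [r[target] for r in rows if r[attribute] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder

    # The ten training instances from the table above (only the two attributes compared here).
    data = [
        {"Swollen Glands": "Yes", "Fever": "Yes", "Diagnosis": "Strep throat"},
        {"Swollen Glands": "No",  "Fever": "No",  "Diagnosis": "Allergy"},
        {"Swollen Glands": "No",  "Fever": "Yes", "Diagnosis": "Cold"},
        {"Swollen Glands": "Yes", "Fever": "No",  "Diagnosis": "Strep throat"},
        {"Swollen Glands": "No",  "Fever": "Yes", "Diagnosis": "Cold"},
        {"Swollen Glands": "No",  "Fever": "No",  "Diagnosis": "Allergy"},
        {"Swollen Glands": "Yes", "Fever": "No",  "Diagnosis": "Strep throat"},
        {"Swollen Glands": "No",  "Fever": "No",  "Diagnosis": "Allergy"},
        {"Swollen Glands": "No",  "Fever": "Yes", "Diagnosis": "Cold"},
        {"Swollen Glands": "No",  "Fever": "Yes", "Diagnosis": "Cold"},
    ]
    print(round(information_gain(data, "Swollen Glands"), 3))  # ~0.881
    print(round(information_gain(data, "Fever"), 3))           # ~0.725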
C4.5 or C5.0
GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)
To calculate the GainRatio for the gender split, we first find the entropy associated with the split, ignoring classes:
H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292
This gives the GainRatio value for the gender attribute as 0.09688 / 0.292 = 0.332
The entropy for the split on height is
H(2/15, 2/15, 3/15, 4/15, 2/15) = 2/15 log(15/2) + 2/15 log(15/2) + 3/15 log(15/3) + 4/15 log(15/4) + 2/15 log(15/2) = 0.1166 × 3 + 0.1397 + 0.15307 = 0.64257
This gives the GainRatio value for the height attribute as 0.09688 / 0.64257 = 0.1507
Neural Network
How to solve a classification problem using a neural network:
Determine the number of output nodes and the attributes to be used as input
Determine the labels and functions to be used for the graph
Each tuple is evaluated by filtering it through the network structure
For each tuple ti in Di, propagate ti through the network and classify the tuple
Perceptron
The simplest type of neural network is the perceptron.
The perceptron uses a sigmoidal function.
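A minimal sketch of a single sigmoid unit in Python (hypothetical code; the weights, bias, and 0.5 threshold are made up for illustration):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def perceptron_output(inputs, weights, bias):
        """Weighted sum of the inputs passed through the sigmoidal activation."""
        s = sum(w * x for w, x in zip(weights, inputs)) + bias
        return sigmoid(s)

    # Output above 0.5 is taken as class 1, otherwise class 0.
    out = perceptron_output(inputs=[1.0, 0.5], weights=[0.8, -0.4], bias=-0.1)
    print(out, 1 if out > 0.5 else 0)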
Association Rules
Let I = {I1, I2, ..., In} be a set of items and {t1, t2, ..., tn} a database of transactions, where each transaction t = {Ii1, Ii2, ..., Iim} and each Iij belongs to I. An association rule is an implication of the form X => Y, where X and Y are itemsets contained in I and X ∩ Y = ∅.
Apriori Algorithm
This algorithm is an association rule algorithm that finds the large itemsets from a given dataset.
Transaction  Items
T1           Bread, Jam, Butter
T2           Bread, Butter
T3           Bread, Cold-drink, Butter
T4           Milk, Bread
T5           Milk, Cold-drink

Candidates                                       Large Itemsets
{Milk}, {Bread}, {Jam}, {Cold-drink}, {Butter}   {Milk}, {Bread}, {Cold-drink}, {Butter}
                                                 {Bread, Butter}
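A small Python sketch (hypothetical code) of the Apriori level-wise passes on the five transactions above with a minimum support of 2, which reproduces the large itemsets {Milk}, {Bread}, {Cold-drink}, {Butter} and {Bread, Butter}:

    from itertools import combinations

    transactions = [
        {"Bread", "Jam", "Butter"},
        {"Bread", "Butter"},
        {"Bread", "Cold-drink", "Butter"},
        {"Milk", "Bread"},
        {"Milk", "Cold-drink"},
    ]
    MIN_SUPPORT = 2   # an itemset is "large" if it occurs in at least 2 transactions

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Pass 1: count single items and keep the large 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    large1 = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]

    # Pass 2: candidates are unions of pairs of large 1-itemsets; keep the large ones.
    candidates2 = {a | b for a, b in combinations(large1, 2)}
    large2 = [c for c in candidates2 if support(c) >= MIN_SUPPORT]

    print([set(s) for s in large1])  # {'Bread'}, {'Butter'}, {'Cold-drink'}, {'Milk'}
    print([set(s) for s in large2])  # [{'Bread', 'Butter'}]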
Sampling Algorithm
To avoid counting itemsets over a large dataset in every scan, the sampling algorithm is used. It reduces the number of database scans to one in the best case and two in the worst case. Like the Apriori algorithm, the sampling algorithm finds the large itemsets, but only for a sample drawn from the dataset. These sample itemsets are treated as potentially large itemsets and used as candidates when counting over the entire database.
Clustering
Hierarchical
Agglomerative
Divisive
Partitional
Categorical
Large DB
Sampling
Compression
Hierarchical
A nested set of clusters is created. Each level in the hierarchy has a separate set of clusters.
Agglomerative: clusters are created in a bottom-up fashion.
Divisive: clusters are split in a top-down fashion.
Hypothesis testing
Null hypothesis
Alternative hypothesis
Chi square testing
Regression and correlation