
Data Warehousing and Data Mining

What is a Data Warehouse


"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." --- W. H. Inmon
Collection of data that is used primarily in organizational decision making
A decision support database that is maintained separately from the organization's operational database

Data Warehouse - Subject Oriented

Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model.
E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
Operational DBs and applications may be organized differently
E.g. based on type of insurance: auto, life, medical, fire, ...

Data Warehouse - Integrated

There is no consistency in encoding, naming conventions, etc., among different data sources
Heterogeneous data sources
When data is moved to the warehouse, it is converted.

Data Warehouse - Non-Volatile

Operational data is regularly accessed and manipulated a record at a time, and updates are applied to data in the operational environment.
Warehouse data is loaded and accessed. Update of data does not occur in the data warehouse environment.

Data Warehouse - Time Variance

The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: nothing more than a sophisticated series of snapshots, each taken at some moment in time.
The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.

Why Separate Data Warehouse?

Performance
Special data organization, access methods, and implementation methods are needed to support the multidimensional views and operations typical of OLAP
Complex OLAP queries would degrade performance for operational transactions
Concurrency control and recovery modes of OLTP are not compatible with OLAP analysis

Why Separate Data Warehouse?

Function
Missing data: decision support requires historical data which operational DBs do not typically maintain
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

Advantages of Warehousing

High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
Modify, summarize (store aggregates)
Add historical information

Advantages of Mediator Systems

No need to copy data
less storage
no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

The Architecture of Data Warehousing

[Architecture diagram] Operational databases and external data sources feed the data warehouse through Extract, Transform, Load and Refresh processes; a metadata repository describes the warehouse; data marts are derived from the warehouse; an OLAP server serves reports, OLAP analysis and data mining tools.

Data Sources
Data sources are often the operational
systems, providing the lowest level of data.
Data sources are designed for operational
use, not for decision support, and the data
reflect this fact.
Multiple data sources are often from different
systems, run on a wide range of hardware
and much of the software is built in-house or
highly customized.
Multiple data sources introduce a large
number of issues -- semantic conflicts.

Creating and Maintaining a Warehouse

A data warehouse needs several tools that automate or support tasks such as:
Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW).
Data cleaning (finding and resolving inconsistency in the source data)
Integration and transformation of data (between different data formats, languages, etc.)

Creating and Maintaining a Warehouse

Data loading (loading the data into the data warehouse)
Data replication (replicating the source database into the data warehouse)
Data refreshment
Data archiving
Checking for data quality
Analyzing metadata

Physical Structure of Data Warehouse

There are three basic architectures for constructing a data warehouse:
Centralized
Federated
Tiered
The data warehouse is distributed for load balancing, scalability and higher availability.

Physical Structure of Data Warehouse

[Centralized architecture] Clients query a single central data warehouse, which is loaded from the source systems.

Physical Structure of Data Warehouse

[Federated architecture] End users access local data marts (e.g. marketing, financial, distribution) built on top of a logical data warehouse that integrates the sources.

Physical Structure of Data Warehouse

[Tiered architecture] A physical data warehouse, loaded from the sources, feeds local data marts, which in turn feed workstations holding highly summarized data.

Physical Structure of Data Warehouse

Federated architecture
The logical data warehouse is only virtual
Tiered architecture
The central data warehouse is physical
There exist local data marts on different tiers which store copies or summarizations of the previous tier.

Conceptual Modeling of
Data Warehouses
Three basic conceptual schemas:
Star schema
Snowflake schema
Fact constellations

Star schema

Star schema: a single object (fact table) in the middle connected to a number of dimension tables

Star schema

product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
store(storeId, city)
customer(custId, name, address, city)

Star schema - example instance

product:
prodId  name  price
p1      bolt  10
p2      nut   5

sale:
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50

customer:
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

store:
storeId  city
c1       nyc
c2       sfo
c3       la
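To make the star-schema idea concrete, here is a minimal sketch (pandas assumed; the toy data is copied from the tables above, and the query itself is only an illustration, not part of the slides) that joins the fact table to one dimension table and aggregates a measure:

```python
import pandas as pd

# Fact table (one row per sale) and one dimension table, copied from the example above
sale = pd.DataFrame({
    "orderId": ["o100", "o102", "o105"],
    "custId":  [53, 53, 111],
    "prodId":  ["p1", "p2", "p1"],
    "storeId": ["c1", "c1", "c3"],
    "qty":     [1, 2, 5],
    "amt":     [12, 11, 50],
})
store = pd.DataFrame({"storeId": ["c1", "c2", "c3"],
                      "city":    ["nyc", "sfo", "la"]})

# A typical star-schema query: join the fact table to a dimension, then aggregate a measure
amt_per_city = sale.merge(store, on="storeId").groupby("city")["amt"].sum()
print(amt_per_city)   # la -> 50, nyc -> 23
```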

Terms
Basic notion: a measure (e.g. sales,
qty, etc)
Given a collection of numeric
measures
Each measure depends on a set of
dimensions (e.g. sales volume as a
function of product, time, and location)

Terms
The relation which relates the dimensions to the measures of interest is called the fact table (e.g. sale)
Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store)
Each dimension can have a set of associated attributes

Example of Star Schema

Sales Fact Table: Date, Product, Store, Customer (foreign keys) plus the measurements unit_sales, dollar_sales, schilling_sales
Date dimension: Date, Month, Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City, State, Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry

Dimension Hierarchies

For each dimension, the set of associated attributes can be structured as a hierarchy, e.g.:
store -> city -> region (and store -> sType)
customer -> city -> state -> country

Dimension Hierarchies

store:
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType:
tId  size   location
t1   small  downtown
t2   large  suburbs

city:
cityId  pop  regId
sfo     1M   north
la      5M   south

region:
regId  name
north  cold region
south  warm region

Snowflake Schema
Snowflake schema: A refinement of
star schema where the dimensional
hierarchy is represented explicitly by
normalizing the dimension tables

Example of Snowflake Schema

Sales Fact Table: Date, Product, Store, Customer (foreign keys) plus the measurements unit_sales, dollar_sales, schilling_sales
Normalized dimension hierarchies:
Date dimension: Date, Month -> Month: Month, Year -> Year: Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City -> City: City, State -> State: State, Country -> Country: Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry

Fact constellations

Fact constellations: multiple fact tables share dimension tables

Database design methodology for data warehouses (1)

Nine-step methodology proposed by Kimball:
1. Choosing the process
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing the pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query modes

Database design methodology for data warehouses (2)

There are many approaches that offer alternative routes to the creation of a data warehouse.
A typical approach decomposes the design of the data warehouse into manageable parts, the data marts.
At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
The methodology specifies the steps required for the design of a data mart; however, it also ties together separate data marts so that over time they merge into a coherent overall data warehouse.

Step 1: Choosing the process

The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
The best choice for the first data mart tends to be the one that is related to sales.

Step 2: Choosing the grain

Choosing the grain means deciding exactly what a fact table record represents. For example, the entity Sales may represent the facts about each property sale. Therefore, the grain of the Property_Sales fact table is an individual property sale.
Only when the grain for the fact table has been chosen can we identify the dimensions of the fact table.
The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for Property_Sales is an individual property sale, then the grain of the Client dimension is the detail of the client who bought a particular property.

Step 3: Identifying and conforming the dimensions

Dimensions set the context for formulating queries about the facts in the fact table.
We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.
If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two data marts can share one or more dimensions in the same application).
When a dimension is used in more than one data mart, the dimension is referred to as being conformed.

Step 4: Choosing the facts

The grain of the fact table determines which facts can be used in the data mart: all facts must be expressed at the level implied by the grain.
In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale (the facts should be numeric and additive).

Step 5: Storing pre-calculations in the fact table

Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.
Common example: a profit or loss statement.
These types of facts are useful since they are additive quantities, from which we can derive valuable information.
This is particularly true for a value that is fundamental to an enterprise, or if there is any chance of a user calculating the value incorrectly.

Step 6: Rounding out the dimension tables

In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible.
The text descriptions should be as intuitive and understandable to the users as possible.

Step 7: Choosing the duration of the data warehouse

The duration measures how far back in time the fact table goes.
For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.
Very large fact tables raise at least two very significant data warehouse design issues:
The older the data, the more likely there will be problems in reading and interpreting the old files
It is mandatory that the old versions of the important dimensions be used, not the most current versions (we will discuss this issue later on)

Step 8: Tracking slowly changing dimensions

The changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse schema
Usually, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time
There are different types of changes in dimensions:
A dimension attribute is overwritten
A dimension attribute causes a new dimension record to be created
etc.

Step 9: Deciding the query priorities and the query modes

In this step we consider physical design issues:
The presence of pre-stored summaries and aggregates
Indices
Materialized views
Security issues
Backup issues
Archive issues

Database design methodology for data warehouses - summary

At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse.
A dimensional model which contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation.

Multidimensional Data Model


Sales of products may be represented
in one dimension (as a fact relation) or
in two dimensions, e.g. : clients and
products

Multidimensional Data Model

Fact relation sale:
Product  Client  Amt
p1       c1      12
p2       c1      11
p1       c3      50
p2       c2      8

Two-dimensional cube (Product x Client):
      c1   c2   c3
p1    12        50
p2    11   8

Multidimensional Data Model

Fact relation sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

Three-dimensional cube (Product x Client x Date):
day 1:      c1   c2   c3
       p1   12        50
       p2   11   8
day 2:      c1   c2   c3
       p1   44   4
       p2

Multidimensional Data Model and Aggregates

Add up amounts for day 1
In SQL: SELECT sum(Amt) FROM SALE WHERE Date = 1

sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

result: 81

Multidimensional Data Model and Aggregates

Add up amounts by day
In SQL: SELECT Date, sum(Amt) FROM SALE GROUP BY Date

sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

result:
Date  sum
1     81
2     48

Multidimensional Data Model and Aggregates

Add up amounts by client, product
In SQL: SELECT Client, Product, sum(Amt) FROM SALE GROUP BY Client, Product

Multidimensional Data Model and Aggregates

sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

result:
Product  Client  Sum
p1       c1      56
p1       c2      4
p1       c3      50
p2       c1      11
p2       c2      8

Multidimensional Data Model and Aggregates

In the multidimensional data model, together with the measure values we usually also store summarizing information (aggregates):

      c1   c2   c3   Sum
p1    56   4    50   110
p2    11   8         19
Sum   67   12   50   129
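The cross-tab above, including its marginal sums, can be reproduced with a short script. This is only an illustrative sketch (pandas assumed; the toy data comes from the running sale example):

```python
import pandas as pd

sale = pd.DataFrame({
    "Product": ["p1", "p2", "p1", "p2", "p1", "p1"],
    "Client":  ["c1", "c1", "c3", "c2", "c1", "c2"],
    "Date":    [1, 1, 1, 1, 2, 2],
    "Amt":     [12, 11, 50, 8, 44, 4],
})

# Product x Client cross-tab of summed amounts, with marginal totals ("Sum" row and column)
cube = sale.pivot_table(index="Product", columns="Client", values="Amt",
                        aggfunc="sum", margins=True, margins_name="Sum")
print(cube)   # grand total 129, as in the table above
```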

Aggregates

Operators: sum, count, max, min, median, avg
HAVING clause
Using the dimension hierarchy:
average by region (within store)
maximum by month (within date)

Cube Aggregation

Example: computing sums
[Figure] Rolling up the Date dimension: the day 1 and day 2 slices of the Product x Client x Date cube are summed into a single Product x Client table:

      c1   c2   c3
p1    56   4    50
p2    11   8

Further aggregation over all dimensions yields the grand total 129.

Cube Operators

[Figure] The same roll-ups written with cube operators: sale(c1,*,*) sums over all products and dates for client c1, sale(*,p2,*) sums over all clients and dates for product p2, sale(c2,p2,*) fixes a client and a product, and sale(*,*,*) = 129 is the grand total over the whole cube.

Aggregation Using Hierarchies

customer hierarchy: customer -> region -> country
[Figure] Aggregating the day 1 slice of the cube along the customer hierarchy (customer c1 is in region A; customers c2 and c3 are in region B):

      region A   region B
p1    12         50
p2    11         8

Aggregation Using Hierarchies

client hierarchy: client -> city -> region
[Figure] A cube of sales of CDs, video and cameras by client and date of sale, where clients c1, c2, c3 are in New Orleans (NO) and client c4 is in Poznań (PN); aggregation with respect to city gives:

      Video   Camera   CD
NO    22      8        30
PN    23      18       22

A Sample Data Cube

[Figure] A three-dimensional data cube of sales with dimensions Date (quarters 1Q-4Q plus sum), Product (camera, video, CD) and Country (USA, Canada, Mexico plus sum), with totals along each dimension.

Exercise (1)
Suppose the AAA Automobile Co. builds a
data warehouse to analyze sales of its cars.
The measure - price of a car
We would like to answer the following typical
queries:
find total sales by day, week, month and year
find total sales by week, month, ... for each dealer
find total sales by week, month, ... for each car
model
find total sales by month for all dealers in a given
city, region and state.

Exercise (2)

Dimensions:
time (day, week, month, quarter, year)
dealer (name, city, state, region, phone)
cars (serialno, model, color, category, ...)

Design the conceptual data warehouse schema

Data Warehouse Database

Different technological approaches to the data warehouse database are:
1. Parallel relational database designs that require a parallel computing platform
2. An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans
3. Multidimensional databases, designed to overcome any limitations placed on the warehouse by the nature of the relational data model

Sourcing, Acquisition, Cleanup and Transformation Tools

The functionality includes the following:
a. Removing unwanted data from operational databases
b. Converting to common data names and definitions
c. Calculating summaries and derived data
d. Establishing defaults for missing data
e. Accommodating source data definition changes
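As a toy illustration of items b-d, the sketch below standardizes column names, fills defaults for missing values and derives a summary column. The field names and rules are hypothetical examples, not taken from the slides:

```python
import pandas as pd

# Hypothetical extract from an operational source
raw = pd.DataFrame({
    "CUST_NM": ["joe", "fred", None],
    "SALE_AMT": [12.0, None, 50.0],
    "QTY": [1, 2, 5],
})

# b. Convert to common data names and definitions
clean = raw.rename(columns={"CUST_NM": "customer_name", "SALE_AMT": "amount"})

# d. Establish defaults for missing data
clean["customer_name"] = clean["customer_name"].fillna("unknown")
clean["amount"] = clean["amount"].fillna(0.0)

# c. Calculate derived data (e.g. a unit price per row)
clean["unit_price"] = clean["amount"] / clean["QTY"]
print(clean)
```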

Issues on data sourcing, cleanup, extract and transformation

Database heterogeneity: DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery and so on
Data heterogeneity: the way data is defined and used in different models

Metadata
Metadata is data about data that describes
the data warehouse.
Metadata can be classified into the following
Technical Metadata
Business Metadata
Data warehouse operational information
such as data history, ownership, extract
audit trail, usage data.

Technical Metadata
Information about data sources
Transformation descriptions: the mapping method from the operational database into the warehouse, and algorithms used to convert/enhance/transform data
Rules to perform data cleanup and data enhancement
Data structure definitions for data targets
Data-mapping operations when capturing data from source systems and applying it to the target warehouse database
Access authorisation, backup history, archive history, information delivery history, data acquisition history, data access and so on

Business Metadata
Subject areas and information object type,
including queries, reports, images, video and/or
audio clips
Internet home pages
Other information to support all data
warehousing components. For example, the
information related to the information delivery
system should include subscription information;
scheduling information; details of delivery
destinations; and the business query objects
such as predefined queries, reports and
analyses.

The information directory and the entire metadata repository will have the following attributes:

Should be the gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
The information directory components should be accessible by any browser and run on all major platforms
The data structures of the metadata repository should be supported on all major relational or object-oriented databases
Should support easy distribution and replication of its content for high performance and availability
Should be searchable by business-oriented key words
Should be able to define the content of structured and unstructured data
Should act as a launch platform for end-user data access and analysis tools
Should support the sharing of information objects
Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven and conditional delivery
Should support and provide interfaces to other applications such as e-mail, spreadsheets and so on
Examples of metadata repositories include Microsoft Repository, R&O Rochade, Prism Solutions Directory Manager and CA/Platinum Technologies

Accessing and Visualizing Information

Effective data visualization provides the user with the following:
Capability to compare data
Capability to control scale
Capability to map the visualization back to the detail data that created it
Capability to filter data to look only at subsets of it

Tool Taxonomy

Data query and reporting tools
Application development tools
Executive information system tools
Online analytical processing tools
Data mining tools

Query and Reporting Tools

Production reporting tools let companies generate regular operational reports
Report writers are inexpensive desktop tools designed for end users
Managed query tools are designed for ease of use, with point-and-click, visual navigation, and either accept SQL or generate SQL statements to query relational data stored in the warehouse

Application Development Tools

Organizations will often rely on the tried and proven approach of in-house application development, using graphical data access environments designed primarily for client/server environments.

OLAP Tools

OLAP tools can be classified as multidimensional (MOLAP), relational (ROLAP) and hybrid (HOLAP) tools. Some of the more popular OLAP tools are Microsoft Decision Support Services, MicroStrategy DSS Server, Oracle Express, MetaCube from Informix and so on.

Data mining tools

Discovering knowledge
Segmentation
Classification
Association
Preferencing
Visualization

Data Marts

The data mart is directed at a partition of data that is created for the use of a dedicated group of users. A data mart is a set of denormalized, summarized or aggregated data.

Data Warehouse Administration and Management

Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery

Data Mining

Data Mining
The process of employing one or
more computer learning techniques
to automatically analyze and
extract knowledge from data.

A Simple Data Mining Process Model

[Figure] Operational database -> data warehouse -> SQL queries -> data mining -> interpretation & evaluation -> result application.

General Phases of the Data Mining Process

Problem definition
Creating a database for data mining
Exploring the database
Preparation for creating a data mining model
Building a data mining model
Evaluating the data mining model
Deploying the data mining model

Data Mining Tasks

The models used to solve a problem are classified as:
Predictive models
Classification
Regression
Time series analysis
Prediction
Descriptive models
Clustering
Summarization
Association rules
Sequence discovery

Data Mining Techniques

Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbor method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
Rule induction: the extraction of useful if-then rules from data based on statistical significance.

Data Mining Issues

Human interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration
Application

Data Mining Metrics

Measuring the effectiveness or usefulness of data mining is done with a data mining metric
It can be measured, for example, as an increase in sales or a reduction in advertising cost, i.e. as return on investment (ROI)
The metrics used also include the traditional metrics of space and time, as well as, for example, similarity measures

Social implications of Datamining


Targeted advertising
Datamining applications can derive much
demographic data concerning customers
that was previously not known or hidden in
the data
Fraud detection, Criminal suspects,
prediction of terrorists.

Data Mining from a Database Perspective

Scalability
Real-world data
Update
Ease of use

Decision Tree

A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes    Yes             Yes         Yes       Strep throat
2           No           No     No              Yes         Yes       Allergy
3           Yes          Yes    No              Yes         No        Cold
4           Yes          No     Yes             No          No        Strep throat
5           No           Yes    No              Yes         No        Cold
6           No           No     No              Yes         No        Allergy
7           No           No     Yes             No          No        Strep throat
8           Yes          No     No              Yes         Yes       Allergy
9           No           Yes    No              Yes         Yes       Cold
10          Yes          Yes    No              Yes         Yes       Cold

[Decision tree] The root node tests Swollen Glands: if Yes, Diagnosis = Strep Throat; if No, test Fever: if Yes, Diagnosis = Cold; if No, Diagnosis = Allergy.

Table 1.2 Data Instances with an Unknown Classification

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11          No           No     Yes             Yes         Yes       ?
12          Yes          Yes    No              No          Yes       ?
13          No           No     No              No          Yes       ?

Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
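A minimal sketch of applying these production rules to the unclassified instances in Table 1.2 (Python assumed; the attribute values are taken directly from the table):

```python
def diagnose(swollen_glands, fever):
    # Production rules read off the decision tree above
    if swollen_glands == "Yes":
        return "Strep Throat"
    return "Cold" if fever == "Yes" else "Allergy"

# Patients 11-13 from Table 1.2: (Swollen Glands, Fever)
unknown = {11: ("Yes", "No"), 12: ("No", "Yes"), 13: ("No", "No")}
for pid, (glands, fever) in unknown.items():
    print(pid, diagnose(glands, fever))   # 11 -> Strep Throat, 12 -> Cold, 13 -> Allergy
```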

An Algorithm for Building Decision Trees

1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
- Create child links from this node where each link represents a unique value for the chosen attribute.
- Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
- If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
- If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.

Generating Association Rules

Rule confidence
Given a rule of the form "IF A THEN B", rule confidence is the conditional probability that B is true when A is known to be true.
Rule support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
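A small sketch of these two measures over a list of transactions (illustrative only; the helper names are made up for this example, and the transactions are the ones used in the Apriori example later in these notes):

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # P(rhs | lhs): support of the whole rule divided by support of its IF-part
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [{"Bread", "Jam", "Butter"}, {"Bread", "Butter"},
                {"Bread", "Cold-drink", "Butter"}, {"Milk", "Bread"},
                {"Milk", "Cold-drink"}]
print(support({"Bread", "Butter"}, transactions))        # 0.6
print(confidence({"Bread"}, {"Butter"}, transactions))    # 0.75
```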

Mining Association Rules: An Example

Table 3.3 A Subset of the Credit Card Promotion Database

Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       No                     Female
No                  No               No                        No                     Male
Yes                 Yes              Yes                       Yes                    Male
Yes                 No               Yes                       No                     Female
No                  No               No                        No                     Female
Yes                 No               Yes                       Yes                    Male
No                  Yes              No                        No                     Male
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       No                     Female

Table 3.4 Single-Item Sets

Single-Item Set                   Number of Items
Magazine Promotion = Yes          7
Watch Promotion = Yes             4
Watch Promotion = No              6
Life Insurance Promotion = Yes    5
Life Insurance Promotion = No     5
Credit Card Insurance = No        8
Sex = Male                        6
Sex = Female                      4

Table 3.5 Two-Item Sets

Two-Item Set                                                  Number of Items
Magazine Promotion = Yes & Watch Promotion = No               4
Magazine Promotion = Yes & Life Insurance Promotion = Yes     5
Magazine Promotion = Yes & Credit Card Insurance = No         5
Magazine Promotion = Yes & Sex = Male                         4
Watch Promotion = No & Life Insurance Promotion = No          4
Watch Promotion = No & Credit Card Insurance = No             5
Watch Promotion = No & Sex = Male                             4
Life Insurance Promotion = No & Credit Card Insurance = No    5
Life Insurance Promotion = No & Sex = Male                    4
Credit Card Insurance = No & Sex = Male                       4
Credit Card Insurance = No & Sex = Female                     4

Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)
IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)

Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)
IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)

General Considerations

We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
We are also interested in association rules that show a lower than expected confidence for a particular association.

Nearest Neighbour
Objects that are near each other will also
have similar prediction values. Thus, if you
know the prediction value of one of the
objects, you can predict it for its nearest
neighbours.
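A minimal 1-nearest-neighbour sketch of this idea (Python; the points and their prediction values are made-up toy data, not taken from the slides):

```python
import math

# Toy labelled objects: (x, y) -> known prediction value
known = {(1.0, 1.5): "low", (2.0, 3.5): "high", (5.0, 6.0): "high"}

def nearest_neighbour_predict(point):
    # Predict the value of the closest known object
    closest = min(known, key=lambda p: math.dist(p, point))
    return known[closest]

print(nearest_neighbour_predict((1.2, 1.0)))   # "low": its nearest neighbour is (1.0, 1.5)
```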

The K-Means Algorithm

1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-5 until the cluster centers do not change.
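An illustrative implementation of these five steps on the six points of Table 3.6 below (Python only; as an assumption, the initial centers are taken to be the first K points rather than chosen randomly, so the run is deterministic and no cluster becomes empty for this data):

```python
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans(points, k=2):
    centers = points[:k]                                   # step 2 (deterministic here)
    while True:
        # Step 3: assign each instance to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 4: recompute each center as the mean of its cluster
        new_centers = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
        if new_centers == centers:                         # step 5: stop when centers are stable
            return centers, clusters
        centers = new_centers

print(kmeans(points))   # centers ~ (2.0, 1.83) and (2.67, 4.67), i.e. outcome 1 in Table 3.7
```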

Table 3.6 K-Means Input Values

Instance  x    f(x)
1         1.0  1.5
2         1.0  4.5
3         2.0  1.5
4         2.0  3.5
5         3.0  2.5
6         5.0  6.0

[Figure] Scatter plot of the six input instances.

Table 3.7 Several Applications of the K-Means Algorithm (K = 2)

Outcome  Cluster Centers   Cluster Points    Squared Error
1        (2.67, 4.67)      2, 4, 6           14.50
         (2.00, 1.83)      1, 3, 5
2        (1.5, 1.5)        1, 3              15.94
         (2.75, 4.125)     2, 4, 5, 6
3        (1.8, 2.7)        1, 2, 3, 4, 5     9.60
         (5, 6)            6

[Figure] Scatter plot of the instances with the resulting cluster centers.

General Considerations
Requires real-valued data.
We must select the number of clusters present in
the data.
Works best when the clusters in the data are of
approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.

Bayesian Classification

ID  Income  Credit  Class  x(i)
1   4       e       h1     x4
2   3       g       h1     x7
3   2       e       h1     x2
4   3       g       h1     x7
5   4       g       h1     x8
6   2       e       h1     x2
7   3       b       h2     x11
8   2       b       h2     x10
9   3       b       h3     x11
10  1       b       h4     x9
11  2       g       h2     x6

P(h1|xi) = P(xi|h1) * P(h1) / sum over j of (P(xi|hj) * P(hj))

Let h1 = authorize purchase, h2 = authorize after identification, h3 = do not authorize, h4 = do not authorize and report to police.

Income groups:
1: 0-10,000
2: 10,000-50,000
3: 50,000-100,000
4: 100,000 and above

Construct a table of attribute-value combinations xi (Credit x Income group):

         1    2    3    4
e        x1   x2   x3   x4
g        x5   x6   x7   x8
b        x9   x10  x11  x12

P(x7|h1) = 2/6, P(x4|h1) = 1/6, P(x2|h1) = 2/6, P(x8|h1) = 1/6
P(h1|x4) = P(x4|h1) * P(h1) / (sum over all classes) = 1

[Probability table] Counts and conditional probabilities P(value | class) for the classes Short, Medium and Tall, for the attributes Gender and Height (binned into the ranges 0-1.6, 1.6-1.7, 1.7-1.8, 1.8-1.9, 1.9-2.0, above 2.0). The entries recoverable from this slide include: Gender = M: 2/8 for Medium and 3/3 for Tall; Gender = F: 6/8 for Medium and 0/3 for Tall; Height 1.9-2.0: 1/8 for Medium and 1/3 for Tall; Height above 2.0: 2/3 for Tall; plus the fractions 2/4, 2/4 and 4/8 for the lower height ranges.

Classifying a new tuple t (a male whose height falls in the 1.9-2.0 range):
P(t|Short) = P(M|Short) * 0 = 0
P(t|Medium) = 2/8 * 1/8 = 0.031
P(t|Tall) = 3/3 * 1/3 = 0.333
Likelihood of being Short = 0 * 0.267 = 0
Likelihood of being Medium = 0.031 * 0.533 = 0.0166
Likelihood of being Tall = 0.333 * 0.2 = 0.066
P(t) = 0 + 0.0166 + 0.066 = 0.0826
P(Short|t) = 0 * 0.267 / 0.0826 = 0
P(Medium|t) = 0.031 * 0.533 / 0.0826 = 0.2
P(Tall|t) = 0.333 * 0.2 / 0.0826 = 0.799
The tuple t is classified as Tall, since this posterior probability is the highest.
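The same computation as a short script (illustrative only; the conditional probabilities come from the fragments above, the priors from the 15-instance training set used in the ID3 example, and the value 1/4 for P(M|Short) is an assumption needed to complete the table):

```python
# P(attribute value | class) for the tuple t = (Gender = M, Height in 1.9-2.0)
p_gender_m     = {"short": 1/4, "medium": 2/8, "tall": 3/3}   # 1/4 for "short" is an assumed value
p_height_19_20 = {"short": 0.0, "medium": 1/8, "tall": 1/3}
prior          = {"short": 4/15, "medium": 8/15, "tall": 3/15}

likelihood = {c: p_gender_m[c] * p_height_19_20[c] * prior[c] for c in prior}
evidence = sum(likelihood.values())
posterior = {c: likelihood[c] / evidence for c in prior}
print(posterior)   # ~ {'short': 0.0, 'medium': 0.2, 'tall': 0.8} -> classify t as Tall
```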

ID3 Algorithm
The concept used to quantify information is
called entropy. Entropy is used to measure
the amount of uncertainty or surprise or
randomness in a set of data.
The basic strategy used by ID3 is to choose
splitting attributes with the highest
information gain first.

Given probabilities p1, p2, ..., ps where sum(pi) = 1, entropy is defined as
H(p1, p2, ..., ps) = sum over i of (pi * log(1/pi))
Gain(D, S) = H(D) - sum over i of (P(Di) * H(Di))
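A small sketch of these two formulas (Python; a base-10 logarithm is assumed, since it reproduces the 0.4384 starting entropy used in the worked example below):

```python
import math

def entropy(probabilities):
    # H(p1, ..., ps) = sum(pi * log(1/pi)), base-10 logarithm assumed
    return sum(p * math.log10(1 / p) for p in probabilities if p > 0)

def gain(parent_probs, subsets):
    # subsets: list of (weight, class-probability list) pairs, one pair per split branch
    return entropy(parent_probs) - sum(w * entropy(ps) for w, ps in subsets)

print(entropy([4/15, 8/15, 3/15]))                           # ~0.4384, the starting set
print(gain([4/15, 8/15, 3/15],
           [(9/15, [3/9, 6/9]), (6/15, [1/6, 2/6, 3/6])]))   # ~0.0969, the gender split
```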

Example: Short = 4/15, Medium = 8/15 and Tall = 3/15.
The entropy of the starting set is
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Choosing gender as the splitting attribute, 9 instances are F and 6 are M.
The entropy of the subset that are F is
3/9 log(9/3) + 6/9 log(9/6) = 0.2764
The entropy of the subset that are M is
1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
The ID3 algorithm must determine what the information gain is for this split.
Calculate the weighted sum of these last two entropies to get
9/15 * 0.2764 + 6/15 * 0.4392 = 0.34152
The gain in entropy by using the gender attribute is thus
0.4384 - 0.34152 = 0.09688
Looking at the height attribute, we divide it into ranges:
(0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0], (2.0,inf)
(0,1.6] -> (2/2(0) + 0 + 0) = 0, (1.6,1.7] -> 0, ..., (1.9,2.0] -> (0 + 1/2 log(2) + 1/2 log(2)) = 0.301
The gain in entropy by using the height attribute is
0.4384 - 2/15(0.301) = 0.3983

C4.5 or C5.0

GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)
To calculate the GainRatio for the gender split, we first find the entropy associated with the split ignoring classes:
H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292
This gives the GainRatio value for the gender attribute as
0.09688 / 0.292 = 0.332
The entropy for the split on height is
H(2/15, 2/15, 3/15, 4/15, 2/15) = 2/15 log(15/2) + 2/15 log(15/2) + 3/15 log(15/3) + 4/15 log(15/4) + 2/15 log(15/2) = 0.1166*3 + 0.1397 + 0.15307 = 0.64257
This gives the GainRatio value for the height attribute as
0.09688 / 0.64257 = 0.1507

Neural Network

How to solve a classification problem using a neural network:
Determine the number of output nodes and the attributes to be used as input
Determine the labels and functions to be used for the graph (network)
Each tuple needs to be evaluated by filtering it through the structure of the network
For each tuple ti in the database D, propagate ti through the network and classify the tuple

Various issues in neural network classification are:

Deciding the attributes to be used as splitting attributes
Determination of the number of hidden nodes
Determination of the number of hidden layers and the best number of hidden nodes per hidden layer
Determination of the number of sinks
Interconnectivity of the nodes
Using different activation functions

Propagation in a Neural Network

The output of each node i in the neural network is based on an activation function fi. When fi is applied to the inputs {x1i, x2i, ..., xki} with weights {w1i, w2i, ..., wki}, the weighted sum of these inputs is
S = sum over h = 1..k of (whi * xhi)

For each node i in the input layer:
  output xi on each output arc from i
For each hidden layer, for each node i:
  Si = sum of (wji * xji) over the incoming arcs
  output (1 - e^-Si) / (1 + e^-Si) on each output arc from i
For each node i in the output layer:
  Si = sum of (wji * xji) over the incoming arcs
  output = 1 / (1 + e^(-c*Si))
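A compact forward-propagation sketch along these lines (Python; the two-input, one-hidden-layer topology and the weight values are made up purely for illustration):

```python
import math

def hidden_activation(s):
    return (1 - math.exp(-s)) / (1 + math.exp(-s))   # squashing function used in the hidden layer

def output_activation(s, c=1.0):
    return 1 / (1 + math.exp(-c * s))                # sigmoid with steepness c, used at the output

def propagate(x, hidden_weights, output_weights):
    # One hidden layer: each row of hidden_weights holds the weights into one hidden node
    hidden = [hidden_activation(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    return output_activation(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical input tuple and weights
print(propagate(x=[0.5, 1.0],
                hidden_weights=[[0.2, -0.4], [0.7, 0.1]],
                output_weights=[1.5, -0.3]))
```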

Radial Basis Function Network

A function whose value changes as it moves away from a central point is known as a radial function, e.g.
fi(S) = e^(-S^2 / v)

Perceptron

The simplest type of neural network is called a perceptron.
The perceptron uses a sigmoidal activation function.
Association Rules

Let I = {I1, I2, ..., In} be a set of items and let {t1, t2, ..., tm} be a database of transactions, where each transaction ti = {Ii1, Ii2, ..., Iik} with Iij in I. An association rule is an implication of the form X => Y, where X and Y are itemsets contained in I and X intersection Y is empty.
Basic Concepts of Association Rules

Support: the support for an association rule X => Y is the percentage of transactions in the database that contain X union Y.
Confidence: the confidence for an association rule X => Y is the ratio of the number of transactions that contain X union Y to the number of transactions that contain X.
Large itemset: a large itemset is an itemset whose number of occurrences is above a threshold (minimum support). L represents the complete set of large itemsets and I represents an individual itemset. The itemsets that are counted from the data set are called candidates, and the collection of these counted itemsets is known as the candidate itemset.

Apriori Algorithm

This algorithm is an association rule algorithm that finds the large itemsets in a given dataset.

Transaction  Items
T1           Bread, Jam, Butter
T2           Bread, Butter
T3           Bread, Cold-drink, Butter
T4           Milk, Bread
T5           Milk, Cold-drink

Candidates and Large Itemsets using Apriori

Scan  Candidates                                            Large Itemsets
1     {Milk}, {Bread}, {Jam}, {Cold-drink}, {Butter}        {Milk}, {Bread}, {Cold-drink}, {Butter}
2     {Milk, Bread}, {Milk, Cold-drink}, {Milk, Butter},    {Bread, Butter}
      {Bread, Cold-drink}, {Bread, Butter},
      {Cold-drink, Butter}
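A sketch of the level-wise counting behind this table (Python; a minimum support of 2 transactions is assumed, since that threshold reproduces the large itemsets shown above):

```python
from itertools import combinations

transactions = [{"Bread", "Jam", "Butter"}, {"Bread", "Butter"},
                {"Bread", "Cold-drink", "Butter"}, {"Milk", "Bread"},
                {"Milk", "Cold-drink"}]
MIN_SUPPORT = 2   # assumed threshold (absolute transaction count)

def count(itemset):
    return sum(itemset <= t for t in transactions)

# Scan 1: single-item candidates -> large 1-itemsets
items = sorted(set().union(*transactions))
large1 = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUPPORT]

# Scan 2: candidate pairs built from the large 1-itemsets -> large 2-itemsets
candidates2 = [a | b for a, b in combinations(large1, 2)]
large2 = [c for c in candidates2 if count(c) >= MIN_SUPPORT]
print(large1)   # {Milk}, {Bread}, {Cold-drink}, {Butter}
print(large2)   # {Bread, Butter}
```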

Sampling Algorithm

To overcome the cost of counting itemsets over a large dataset in each scan, a sampling algorithm is used. The sampling algorithm reduces the number of dataset scans to 1 in the best case and 2 in the worst case. The sampling algorithm finds the large itemsets for a sample of the data set, in the same way as the Apriori algorithm. These sample itemsets are treated as potentially large itemsets and used as candidates when counting over the entire database.

Clustering

Hierarchical
  Agglomerative
  Divisive
Partitional
Categorical
Large DB
  Sampling
  Compression

Hierarchical

A nested set of clusters is created. Each level in the hierarchy has a separate set of clusters.
Agglomerative: clusters are created in a bottom-up fashion.
Divisive: clusters are created in a top-down fashion.
A tree data structure called a dendrogram can be used to illustrate the hierarchical clustering and the sets of different clusters.

Similarity and Distance Measures

Centroid: Cm = sum(tmi) / N
Radius: Rm = sqrt(sum((tmi - Cm)^2) / N)
Diameter: Dm = sqrt(sum((tmi - tmj)^2) / (N * (N - 1)))

Methods to calculate the distance between clusters

Single link: smallest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = min(dis(til, tjm)) for every til in Ki and every tjm in Kj.
Complete link: largest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = max(dis(til, tjm)) for every til in Ki and every tjm in Kj.
Average: average distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = mean(dis(til, tjm)) for every til in Ki and every tjm in Kj.
Centroid: if the clusters have representative centroids, then the centroid distance is defined as the distance between the centroids. We thus have dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid of Ki and Cj is the centroid of Kj.
Medoid: using a medoid to represent each cluster, the distance between the clusters can be defined as the distance between the medoids: dis(Ki, Kj) = dis(mi, mj).
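A brief sketch of the first three linkage measures (Python; Euclidean distance and made-up 2-D points are assumed):

```python
import math
from statistics import mean

def single_link(ki, kj):
    # Smallest pairwise distance between the two clusters
    return min(math.dist(a, b) for a in ki for b in kj)

def complete_link(ki, kj):
    # Largest pairwise distance between the two clusters
    return max(math.dist(a, b) for a in ki for b in kj)

def average_link(ki, kj):
    # Mean of all pairwise distances between the two clusters
    return mean(math.dist(a, b) for a in ki for b in kj)

# Hypothetical clusters
ki = [(1.0, 1.5), (2.0, 1.5)]
kj = [(2.0, 3.5), (5.0, 6.0)]
print(single_link(ki, kj), complete_link(ki, kj), average_link(ki, kj))
```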

Hypothesis testing
Null hypothesis
Alternative hypothesis
Chi square testing
Regression and correlation
