
Data Warehousing and Data Mining

What is a Data Warehouse


"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." --- W. H. Inmon
Collection of data that is used primarily in organizational decision making
A decision support database that is maintained separately from the organization's operational database

Data Warehouse - Subject Oriented

Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model.
E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
Operational DBs and applications may be organized differently
E.g. based on type of insurance: auto, life, medical, fire, ...

Data Warehouse - Integrated

There is no consistency in encoding, naming conventions, etc., among different data sources
Heterogeneous data sources
When data is moved to the warehouse, it is converted.

Data Warehouse - Non-Volatile

Operational data is regularly accessed and manipulated a record at a time, and updates are applied to data in the operational environment.
Warehouse data is loaded and accessed. Update of data does not occur in the data warehouse environment.

Data Warehouse - Time Variance

The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: nothing more than a sophisticated series of snapshots, each taken at some moment in time.
The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.

Why Separate Data Warehouse?

Performance
Special data organization, access methods, and implementation methods are needed to support the multidimensional views and operations typical of OLAP
Complex OLAP queries would degrade performance for operational transactions
Concurrency control and recovery modes of OLTP are not compatible with OLAP analysis

Why Separate Data Warehouse?

Function
Missing data: decision support requires historical data which operational DBs do not typically maintain
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

Advantages of Warehousing

High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
Modify, summarize (store aggregates)
Add historical information

Advantages of Mediator Systems

No need to copy data
less storage
no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

The Architecture of Data Warehousing

[Architecture diagram] Operational databases and external data sources feed the data warehouse through Extract, Transform, Load and Refresh processes; a metadata repository describes the warehouse; data marts are derived from the warehouse; an OLAP server serves reports, OLAP analysis and data mining tools.

Data Sources
Data sources are often the operational
systems, providing the lowest level of data.
Data sources are designed for operational
use, not for decision support, and the data
reflect this fact.
Multiple data sources are often from different
systems, run on a wide range of hardware
and much of the software is built in-house or
highly customized.
Multiple data sources introduce a large
number of issues -- semantic conflicts.

Creating and Maintaining a Warehouse

A data warehouse needs several tools that automate or support tasks such as:
Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW).
Data cleaning (finding and resolving inconsistency in the source data)
Integration and transformation of data (between different data formats, languages, etc.)

Creating and Maintaining a Warehouse

Data loading (loading the data into the data warehouse)
Data replication (replicating the source database into the data warehouse)
Data refreshment
Data archiving
Checking for data quality
Analyzing metadata

Physical Structure of Data Warehouse

There are three basic architectures for constructing a data warehouse:
Centralized
Federated
Tiered
The data warehouse is distributed for load balancing, scalability and higher availability.

Physical Structure of Data Warehouse

[Centralized architecture] Clients query a single central data warehouse, which is loaded from the source systems.

Physical Structure of Data Warehouse

[Federated architecture] End users access local data marts (e.g. marketing, financial, distribution) built on top of a logical data warehouse that integrates the sources.

Physical Structure of Data Warehouse

[Tiered architecture] A physical data warehouse, loaded from the sources, feeds local data marts, which in turn feed workstations holding highly summarized data.

Physical Structure of Data Warehouse

Federated architecture
The logical data warehouse is only virtual
Tiered architecture
The central data warehouse is physical
There exist local data marts on different tiers which store copies or summarizations of the previous tier.

Conceptual Modeling of
Data Warehouses
Three basic conceptual schemas:
Star schema
Snowflake schema
Fact constellations

Star schema

Star schema: a single object (fact table) in the middle connected to a number of dimension tables

Star schema

product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
store(storeId, city)
customer(custId, name, address, city)

Star schema - example instance

product:
prodId  name  price
p1      bolt  10
p2      nut   5

sale:
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50

customer:
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

store:
storeId  city
c1       nyc
c2       sfo
c3       la
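To make the star-schema idea concrete, here is a minimal sketch (pandas assumed; the toy data is copied from the tables above, and the query itself is only an illustration, not part of the slides) that joins the fact table to one dimension table and aggregates a measure:

```python
import pandas as pd

# Fact table (one row per sale) and one dimension table, copied from the example above
sale = pd.DataFrame({
    "orderId": ["o100", "o102", "o105"],
    "custId":  [53, 53, 111],
    "prodId":  ["p1", "p2", "p1"],
    "storeId": ["c1", "c1", "c3"],
    "qty":     [1, 2, 5],
    "amt":     [12, 11, 50],
})
store = pd.DataFrame({"storeId": ["c1", "c2", "c3"],
                      "city":    ["nyc", "sfo", "la"]})

# A typical star-schema query: join the fact table to a dimension, then aggregate a measure
amt_per_city = sale.merge(store, on="storeId").groupby("city")["amt"].sum()
print(amt_per_city)   # la -> 50, nyc -> 23
```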

Terms
Basic notion: a measure (e.g. sales,
qty, etc)
Given a collection of numeric
measures
Each measure depends on a set of
dimensions (e.g. sales volume as a
function of product, time, and location)

Terms
The relation which relates the dimensions to the measures of interest is called the fact table (e.g. sale)
Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store)
Each dimension can have a set of associated attributes

Example of Star Schema

Sales Fact Table: Date, Product, Store, Customer (foreign keys) plus the measurements unit_sales, dollar_sales, schilling_sales
Date dimension: Date, Month, Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City, State, Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry

Dimension Hierarchies

For each dimension, the set of associated attributes can be structured as a hierarchy, e.g.:
store -> city -> region (and store -> sType)
customer -> city -> state -> country

Dimension Hierarchies

store:
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

sType:
tId  size   location
t1   small  downtown
t2   large  suburbs

city:
cityId  pop  regId
sfo     1M   north
la      5M   south

region:
regId  name
north  cold region
south  warm region

Snowflake Schema
Snowflake schema: A refinement of
star schema where the dimensional
hierarchy is represented explicitly by
normalizing the dimension tables

Example of Snowflake Schema

Sales Fact Table: Date, Product, Store, Customer (foreign keys) plus the measurements unit_sales, dollar_sales, schilling_sales
Normalized dimension hierarchies:
Date dimension: Date, Month -> Month: Month, Year -> Year: Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City -> City: City, State -> State: State, Country -> Country: Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry

Fact constellations

Fact constellations: multiple fact tables share dimension tables

Database design methodology for data warehouses (1)

Nine-step methodology proposed by Kimball:
1. Choosing the process
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing the pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query modes

Database design methodology for data warehouses (2)

There are many approaches that offer alternative routes to the creation of a data warehouse.
A typical approach decomposes the design of the data warehouse into manageable parts, the data marts.
At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
The methodology specifies the steps required for the design of a data mart; however, it also ties together separate data marts so that over time they merge into a coherent overall data warehouse.

Step 1: Choosing the process

The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
The best choice for the first data mart tends to be the one that is related to sales.

Step 2: Choosing the grain

Choosing the grain means deciding exactly what a fact table record represents. For example, the entity Sales may represent the facts about each property sale. Therefore, the grain of the Property_Sales fact table is an individual property sale.
Only when the grain for the fact table has been chosen can we identify the dimensions of the fact table.
The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for Property_Sales is an individual property sale, then the grain of the Client dimension is the detail of the client who bought a particular property.

Step 3: Identifying and conforming the dimensions

Dimensions set the context for formulating queries about the facts in the fact table.
We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.
If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two data marts can share one or more dimensions in the same application).
When a dimension is used in more than one data mart, the dimension is referred to as being conformed.

Step 4: Choosing the facts

The grain of the fact table determines which facts can be used in the data mart: all facts must be expressed at the level implied by the grain.
In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale (the facts should be numeric and additive).

Step 5: Storing pre-calculations in the fact table

Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.
Common example: a profit or loss statement.
These types of facts are useful since they are additive quantities, from which we can derive valuable information.
This is particularly true for a value that is fundamental to an enterprise, or if there is any chance of a user calculating the value incorrectly.

Step 6: Rounding out the dimension tables

In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible.
The text descriptions should be as intuitive and understandable to the users as possible.

Step 7: Choosing the duration of the data warehouse

The duration measures how far back in time the fact table goes.
For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.
Very large fact tables raise at least two very significant data warehouse design issues:
The older the data, the more likely there will be problems in reading and interpreting the old files
It is mandatory that the old versions of the important dimensions be used, not the most current versions (we will discuss this issue later on)

Step 8: Tracking slowly changing dimensions

The changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse schema
Usually, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time
There are different types of changes in dimensions:
A dimension attribute is overwritten
A dimension attribute causes a new dimension record to be created
etc.

Step 9: Deciding the query priorities and the query modes

In this step we consider physical design issues:
The presence of pre-stored summaries and aggregates
Indices
Materialized views
Security issues
Backup issues
Archive issues

Database design methodology for data warehouses - summary

At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse.
A dimensional model which contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation.

Multidimensional Data Model


Sales of products may be represented
in one dimension (as a fact relation) or
in two dimensions, e.g. : clients and
products

Multidimensional Data Model

Fact relation sale:
Product  Client  Amt
p1       c1      12
p2       c1      11
p1       c3      50
p2       c2      8

Two-dimensional cube (Product x Client):
      c1   c2   c3
p1    12        50
p2    11   8

Multidimensional Data Model

Fact relation sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

Three-dimensional cube (Product x Client x Date):
day 1:      c1   c2   c3
       p1   12        50
       p2   11   8
day 2:      c1   c2   c3
       p1   44   4
       p2

Multidimensional Data Model and Aggregates

Add up amounts for day 1
In SQL: SELECT sum(Amt) FROM SALE WHERE Date = 1

sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

result: 81

Multidimensional Data Model and Aggregates

Add up amounts by day
In SQL: SELECT Date, sum(Amt) FROM SALE GROUP BY Date

sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

result:
Date  sum
1     81
2     48

Multidimensional Data Model and Aggregates

Add up amounts by client, product
In SQL: SELECT Client, Product, sum(Amt) FROM SALE GROUP BY Client, Product

Multidimensional Data Model and Aggregates

sale:
Product  Client  Date  Amt
p1       c1      1     12
p2       c1      1     11
p1       c3      1     50
p2       c2      1     8
p1       c1      2     44
p1       c2      2     4

result:
Product  Client  Sum
p1       c1      56
p1       c2      4
p1       c3      50
p2       c1      11
p2       c2      8

Multidimensional Data Model and Aggregates

In the multidimensional data model, together with the measure values we usually also store summarizing information (aggregates):

      c1   c2   c3   Sum
p1    56   4    50   110
p2    11   8         19
Sum   67   12   50   129
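The cross-tab above, including its marginal sums, can be reproduced with a short script. This is only an illustrative sketch (pandas assumed; the toy data comes from the running sale example):

```python
import pandas as pd

sale = pd.DataFrame({
    "Product": ["p1", "p2", "p1", "p2", "p1", "p1"],
    "Client":  ["c1", "c1", "c3", "c2", "c1", "c2"],
    "Date":    [1, 1, 1, 1, 2, 2],
    "Amt":     [12, 11, 50, 8, 44, 4],
})

# Product x Client cross-tab of summed amounts, with marginal totals ("Sum" row and column)
cube = sale.pivot_table(index="Product", columns="Client", values="Amt",
                        aggfunc="sum", margins=True, margins_name="Sum")
print(cube)   # grand total 129, as in the table above
```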

Aggregates

Operators: sum, count, max, min, median, avg
HAVING clause
Using the dimension hierarchy:
average by region (within store)
maximum by month (within date)

Cube Aggregation

Example: computing sums
[Figure] Rolling up the Date dimension: the day 1 and day 2 slices of the Product x Client x Date cube are summed into a single Product x Client table:

      c1   c2   c3
p1    56   4    50
p2    11   8

Further aggregation over all dimensions yields the grand total 129.

Cube Operators

[Figure] The same roll-ups written with cube operators: sale(c1,*,*) sums over all products and dates for client c1, sale(*,p2,*) sums over all clients and dates for product p2, sale(c2,p2,*) fixes a client and a product, and sale(*,*,*) = 129 is the grand total over the whole cube.

Aggregation Using Hierarchies

customer hierarchy: customer -> region -> country
[Figure] Aggregating the day 1 slice of the cube along the customer hierarchy (customer c1 is in region A; customers c2 and c3 are in region B):

      region A   region B
p1    12         50
p2    11         8

Aggregation Using Hierarchies

client hierarchy: client -> city -> region
[Figure] A cube of sales of CDs, video and cameras by client and date of sale, where clients c1, c2, c3 are in New Orleans (NO) and client c4 is in Poznań (PN); aggregation with respect to city gives:

      Video   Camera   CD
NO    22      8        30
PN    23      18       22

A Sample Data Cube

[Figure] A three-dimensional data cube of sales with dimensions Date (quarters 1Q-4Q plus sum), Product (camera, video, CD) and Country (USA, Canada, Mexico plus sum), with totals along each dimension.

Exercise (1)
Suppose the AAA Automobile Co. builds a
data warehouse to analyze sales of its cars.
The measure - price of a car
We would like to answer the following typical
queries:
find total sales by day, week, month and year
find total sales by week, month, ... for each dealer
find total sales by week, month, ... for each car
model
find total sales by month for all dealers in a given
city, region and state.

Exercise (2)

Dimensions:
time (day, week, month, quarter, year)
dealer (name, city, state, region, phone)
cars (serialno, model, color, category, ...)

Design the conceptual data warehouse schema

Data Warehouse Database

Different technological approaches to the data warehouse database are:
1. Parallel relational database designs that require a parallel computing platform
2. An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans
3. Multidimensional databases, designed to overcome any limitations placed on the warehouse by the nature of the relational data model

Sourcing, Acquisition, Cleanup and Transformation Tools

The functionality includes the following:
a. Removing unwanted data from operational databases
b. Converting to common data names and definitions
c. Calculating summaries and derived data
d. Establishing defaults for missing data
e. Accommodating source data definition changes
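As a toy illustration of items b-d, the sketch below standardizes column names, fills defaults for missing values and derives a summary column. The field names and rules are hypothetical examples, not taken from the slides:

```python
import pandas as pd

# Hypothetical extract from an operational source
raw = pd.DataFrame({
    "CUST_NM": ["joe", "fred", None],
    "SALE_AMT": [12.0, None, 50.0],
    "QTY": [1, 2, 5],
})

# b. Convert to common data names and definitions
clean = raw.rename(columns={"CUST_NM": "customer_name", "SALE_AMT": "amount"})

# d. Establish defaults for missing data
clean["customer_name"] = clean["customer_name"].fillna("unknown")
clean["amount"] = clean["amount"].fillna(0.0)

# c. Calculate derived data (e.g. a unit price per row)
clean["unit_price"] = clean["amount"] / clean["QTY"]
print(clean)
```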

Issues on data sourcing, cleanup, extract and transformation

Database heterogeneity: DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery and so on
Data heterogeneity: the way data is defined and used in different models

Metadata
Metadata is data about data that describes
the data warehouse.
Metadata can be classified into the following
Technical Metadata
Business Metadata
Data warehouse operational information
such as data history, ownership, extract
audit trail, usage data.

Technical Metadata
Information about data sources
Transformation descriptions: the mapping method from the operational database into the warehouse, and algorithms used to convert/enhance/transform data
Rules to perform data cleanup and data enhancement
Data structure definitions for data targets
Data-mapping operations when capturing data from source systems and applying it to the target warehouse database
Access authorisation, backup history, archive history, information delivery history, data acquisition history, data access and so on

Business Metadata
Subject areas and information object type,
including queries, reports, images, video and/or
audio clips
Internet home pages
Other information to support all data
warehousing components. For example, the
information related to the information delivery
system should include subscription information;
scheduling information; details of delivery
destinations; and the business query objects
such as predefined queries, reports and
analyses.

The information directory and the entire metadata repository will have the following attributes:

Should be the gateway to the data warehouse environment, and thus should be accessible from any platform via transparent and seamless connections
The information directory components should be accessible by any browser and run on all major platforms
The data structures of the metadata repository should be supported on all major relational or object-oriented databases
Should support easy distribution and replication of its content for high performance and availability
Should be searchable by business-oriented key words
Should be able to define the content of structured and unstructured data
Should act as a launch platform for end-user data access and analysis tools
Should support the sharing of information objects
Should support a variety of scheduling options for requests against the data warehouse, including on-demand, one-time, repetitive, event-driven and conditional delivery
Should support and provide interfaces to other applications such as e-mail, spreadsheets and so on
Examples of metadata repositories include Microsoft Repository, R&O Rochade, Prism Solutions Directory Manager and CA/Platinum Technologies

Accessing and Visualizing Information

Effective data visualization provides the user with the following:
Capability to compare data
Capability to control scale
Capability to map the visualization back to the detail data that created it
Capability to filter data to look only at subsets of it

Tool Taxonomy

Data query and reporting tools
Application development tools
Executive information system tools
Online analytical processing tools
Data mining tools

Query and Reporting Tools

Production reporting tools let companies generate regular operational reports
Report writers are inexpensive desktop tools designed for end users
Managed query tools are designed for ease of use, with point-and-click, visual navigation, and either accept SQL or generate SQL statements to query relational data stored in the warehouse

Application Development Tools

Organizations will often rely on the tried and proven approach of in-house application development, using graphical data access environments designed primarily for client/server environments.

OLAP Tools

OLAP tools can be classified as multidimensional (MOLAP), relational (ROLAP) and hybrid (HOLAP) tools. Some of the more popular OLAP tools are Microsoft Decision Support Services, MicroStrategy DSS Server, Oracle Express, MetaCube from Informix and so on.

Data mining tools

Discovering knowledge
Segmentation
Classification
Association
Preferencing
Visualization

Data Marts

The data mart is directed at a partition of data that is created for the use of a dedicated group of users. A data mart is a set of denormalized, summarized or aggregated data.

Data Warehouse Administration and Management

Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating metadata
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery

Data Mining

Data Mining
The process of employing one or
more computer learning techniques
to automatically analyze and
extract knowledge from data.

A Simple Data Mining Process Model

[Figure] Operational database -> data warehouse -> SQL queries -> data mining -> interpretation & evaluation -> result application.

General Phases of the Data Mining Process

Problem definition
Creating a database for data mining
Exploring the database
Preparation for creating a data mining model
Building a data mining model
Evaluating the data mining model
Deploying the data mining model

Data Mining Tasks

The models used to solve a problem are classified as:
Predictive models
Classification
Regression
Time series analysis
Prediction
Descriptive models
Clustering
Summarization
Association rules
Sequence discovery

Data Mining Techniques

Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbor method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
Rule induction: the extraction of useful if-then rules from data based on statistical significance.

Data Mining Issues

Human interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration
Application

Data Mining Metrics

Measuring the effectiveness or usefulness of data mining is done with a data mining metric
It can be measured, for example, as an increase in sales or a reduction in advertising cost, i.e. as return on investment (ROI)
The metrics used also include the traditional metrics of space and time, as well as, for example, similarity measures

Social implications of Datamining


Targeted advertising
Datamining applications can derive much
demographic data concerning customers
that was previously not known or hidden in
the data
Fraud detection, Criminal suspects,
prediction of terrorists.

Data Mining from a Database Perspective

Scalability
Real-world data
Update
Ease of use

Decision Tree

A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes    Yes             Yes         Yes       Strep throat
2           No           No     No              Yes         Yes       Allergy
3           Yes          Yes    No              Yes         No        Cold
4           Yes          No     Yes             No          No        Strep throat
5           No           Yes    No              Yes         No        Cold
6           No           No     No              Yes         No        Allergy
7           No           No     Yes             No          No        Strep throat
8           Yes          No     No              Yes         Yes       Allergy
9           No           Yes    No              Yes         Yes       Cold
10          Yes          Yes    No              Yes         Yes       Cold

[Decision tree] The root node tests Swollen Glands: if Yes, Diagnosis = Strep Throat; if No, test Fever: if Yes, Diagnosis = Cold; if No, Diagnosis = Allergy.

Table 1.2 Data Instances with an Unknown Classification

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11          No           No     Yes             Yes         Yes       ?
12          Yes          Yes    No              No          Yes       ?
13          No           No     No              No          Yes       ?

Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
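A minimal sketch of applying these production rules to the unclassified instances in Table 1.2 (Python assumed; the attribute values are taken directly from the table):

```python
def diagnose(swollen_glands, fever):
    # Production rules read off the decision tree above
    if swollen_glands == "Yes":
        return "Strep Throat"
    return "Cold" if fever == "Yes" else "Allergy"

# Patients 11-13 from Table 1.2: (Swollen Glands, Fever)
unknown = {11: ("Yes", "No"), 12: ("No", "Yes"), 13: ("No", "No")}
for pid, (glands, fever) in unknown.items():
    print(pid, diagnose(glands, fever))   # 11 -> Strep Throat, 12 -> Cold, 13 -> Allergy
```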

An Algorithm for Building Decision Trees

1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
- Create child links from this node where each link represents a unique value for the chosen attribute.
- Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
- If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
- If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.

Generating Association Rules

Rule confidence
Given a rule of the form "IF A THEN B", rule confidence is the conditional probability that B is true when A is known to be true.
Rule support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
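A small sketch of these two measures over a list of transactions (illustrative only; the helper names are made up for this example, and the transactions are the ones used in the Apriori example later in these notes):

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # P(rhs | lhs): support of the whole rule divided by support of its IF-part
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [{"Bread", "Jam", "Butter"}, {"Bread", "Butter"},
                {"Bread", "Cold-drink", "Butter"}, {"Milk", "Bread"},
                {"Milk", "Cold-drink"}]
print(support({"Bread", "Butter"}, transactions))        # 0.6
print(confidence({"Bread"}, {"Butter"}, transactions))    # 0.75
```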

Mining Association Rules: An Example

Table 3.3 A Subset of the Credit Card Promotion Database

Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       No                     Female
No                  No               No                        No                     Male
Yes                 Yes              Yes                       Yes                    Male
Yes                 No               Yes                       No                     Female
No                  No               No                        No                     Female
Yes                 No               Yes                       Yes                    Male
No                  Yes              No                        No                     Male
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       No                     Female

Table 3.4 Single-Item Sets

Single-Item Set                   Number of Items
Magazine Promotion = Yes          7
Watch Promotion = Yes             4
Watch Promotion = No              6
Life Insurance Promotion = Yes    5
Life Insurance Promotion = No     5
Credit Card Insurance = No        8
Sex = Male                        6
Sex = Female                      4

Table 3.5 Two-Item Sets

Two-Item Set                                                  Number of Items
Magazine Promotion = Yes & Watch Promotion = No               4
Magazine Promotion = Yes & Life Insurance Promotion = Yes     5
Magazine Promotion = Yes & Credit Card Insurance = No         5
Magazine Promotion = Yes & Sex = Male                         4
Watch Promotion = No & Life Insurance Promotion = No          4
Watch Promotion = No & Credit Card Insurance = No             5
Watch Promotion = No & Sex = Male                             4
Life Insurance Promotion = No & Credit Card Insurance = No    5
Life Insurance Promotion = No & Sex = Male                    4
Credit Card Insurance = No & Sex = Male                       4
Credit Card Insurance = No & Sex = Female                     4

Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)
IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)

Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)
IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)

General Considerations

We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
We are also interested in association rules that show a lower than expected confidence for a particular association.

Nearest Neighbour
Objects that are near each other will also
have similar prediction values. Thus, if you
know the prediction value of one of the
objects, you can predict it for its nearest
neighbours.
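A minimal 1-nearest-neighbour sketch of this idea (Python; the points and their prediction values are made-up toy data, not taken from the slides):

```python
import math

# Toy labelled objects: (x, y) -> known prediction value
known = {(1.0, 1.5): "low", (2.0, 3.5): "high", (5.0, 6.0): "high"}

def nearest_neighbour_predict(point):
    # Predict the value of the closest known object
    closest = min(known, key=lambda p: math.dist(p, point))
    return known[closest]

print(nearest_neighbour_predict((1.2, 1.0)))   # "low": its nearest neighbour is (1.0, 1.5)
```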

The K-Means Algorithm

1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-5 until the cluster centers do not change.
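An illustrative implementation of these five steps on the six points of Table 3.6 below (Python only; as an assumption, the initial centers are taken to be the first K points rather than chosen randomly, so the run is deterministic and no cluster becomes empty for this data):

```python
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans(points, k=2):
    centers = points[:k]                                   # step 2 (deterministic here)
    while True:
        # Step 3: assign each instance to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Step 4: recompute each center as the mean of its cluster
        new_centers = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
        if new_centers == centers:                         # step 5: stop when centers are stable
            return centers, clusters
        centers = new_centers

print(kmeans(points))   # centers ~ (2.0, 1.83) and (2.67, 4.67), i.e. outcome 1 in Table 3.7
```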

Table 3.6 K-Means Input Values

Instance  x    f(x)
1         1.0  1.5
2         1.0  4.5
3         2.0  1.5
4         2.0  3.5
5         3.0  2.5
6         5.0  6.0

[Figure] Scatter plot of the six input instances.

Table 3.7 Several Applications of the K-Means Algorithm (K = 2)

Outcome  Cluster Centers   Cluster Points    Squared Error
1        (2.67, 4.67)      2, 4, 6           14.50
         (2.00, 1.83)      1, 3, 5
2        (1.5, 1.5)        1, 3              15.94
         (2.75, 4.125)     2, 4, 5, 6
3        (1.8, 2.7)        1, 2, 3, 4, 5     9.60
         (5, 6)            6

[Figure] Scatter plot of the instances with the resulting cluster centers.

General Considerations
Requires real-valued data.
We must select the number of clusters present in
the data.
Works best when the clusters in the data are of
approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.

Bayesian Classification

ID  Income  Credit  Class  x(i)
1   4       e       h1     x4
2   3       g       h1     x7
3   2       e       h1     x2
4   3       g       h1     x7
5   4       g       h1     x8
6   2       e       h1     x2
7   3       b       h2     x11
8   2       b       h2     x10
9   3       b       h3     x11
10  1       b       h4     x9
11  2       g       h2     x6

P(h1|xi) = P(xi|h1) * P(h1) / sum over j of (P(xi|hj) * P(hj))

Let h1 = authorize purchase, h2 = authorize after identification, h3 = do not authorize, h4 = do not authorize and report to police.

Income groups:
1: 0-10,000
2: 10,000-50,000
3: 50,000-100,000
4: 100,000 and above

Construct a table of attribute-value combinations xi (Credit x Income group):

         1    2    3    4
e        x1   x2   x3   x4
g        x5   x6   x7   x8
b        x9   x10  x11  x12

P(x7|h1) = 2/6, P(x4|h1) = 1/6, P(x2|h1) = 2/6, P(x8|h1) = 1/6
P(h1|x4) = P(x4|h1) * P(h1) / (sum over all classes) = 1

[Probability table] Counts and conditional probabilities P(value | class) for the classes Short, Medium and Tall, for the attributes Gender and Height (binned into the ranges 0-1.6, 1.6-1.7, 1.7-1.8, 1.8-1.9, 1.9-2.0, above 2.0). The entries recoverable from this slide include: Gender = M: 2/8 for Medium and 3/3 for Tall; Gender = F: 6/8 for Medium and 0/3 for Tall; Height 1.9-2.0: 1/8 for Medium and 1/3 for Tall; Height above 2.0: 2/3 for Tall; plus the fractions 2/4, 2/4 and 4/8 for the lower height ranges.

Classifying a new tuple t (a male whose height falls in the 1.9-2.0 range):
P(t|Short) = P(M|Short) * 0 = 0
P(t|Medium) = 2/8 * 1/8 = 0.031
P(t|Tall) = 3/3 * 1/3 = 0.333
Likelihood of being Short = 0 * 0.267 = 0
Likelihood of being Medium = 0.031 * 0.533 = 0.0166
Likelihood of being Tall = 0.333 * 0.2 = 0.066
P(t) = 0 + 0.0166 + 0.066 = 0.0826
P(Short|t) = 0 * 0.267 / 0.0826 = 0
P(Medium|t) = 0.031 * 0.533 / 0.0826 = 0.2
P(Tall|t) = 0.333 * 0.2 / 0.0826 = 0.799
The tuple t is classified as Tall, since this posterior probability is the highest.
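The same computation as a short script (illustrative only; the conditional probabilities come from the fragments above, the priors from the 15-instance training set used in the ID3 example, and the value 1/4 for P(M|Short) is an assumption needed to complete the table):

```python
# P(attribute value | class) for the tuple t = (Gender = M, Height in 1.9-2.0)
p_gender_m     = {"short": 1/4, "medium": 2/8, "tall": 3/3}   # 1/4 for "short" is an assumed value
p_height_19_20 = {"short": 0.0, "medium": 1/8, "tall": 1/3}
prior          = {"short": 4/15, "medium": 8/15, "tall": 3/15}

likelihood = {c: p_gender_m[c] * p_height_19_20[c] * prior[c] for c in prior}
evidence = sum(likelihood.values())
posterior = {c: likelihood[c] / evidence for c in prior}
print(posterior)   # ~ {'short': 0.0, 'medium': 0.2, 'tall': 0.8} -> classify t as Tall
```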

ID3 Algorithm
The concept used to quantify information is
called entropy. Entropy is used to measure
the amount of uncertainty or surprise or
randomness in a set of data.
The basic strategy used by ID3 is to choose
splitting attributes with the highest
information gain first.

Given probabilities p1, p2, ..., ps where sum(pi) = 1, entropy is defined as
H(p1, p2, ..., ps) = sum over i of (pi * log(1/pi))
Gain(D, S) = H(D) - sum over i of (P(Di) * H(Di))
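A small sketch of these two formulas (Python; a base-10 logarithm is assumed, since it reproduces the 0.4384 starting entropy used in the worked example below):

```python
import math

def entropy(probabilities):
    # H(p1, ..., ps) = sum(pi * log(1/pi)), base-10 logarithm assumed
    return sum(p * math.log10(1 / p) for p in probabilities if p > 0)

def gain(parent_probs, subsets):
    # subsets: list of (weight, class-probability list) pairs, one pair per split branch
    return entropy(parent_probs) - sum(w * entropy(ps) for w, ps in subsets)

print(entropy([4/15, 8/15, 3/15]))                           # ~0.4384, the starting set
print(gain([4/15, 8/15, 3/15],
           [(9/15, [3/9, 6/9]), (6/15, [1/6, 2/6, 3/6])]))   # ~0.0969, the gender split
```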

Example: Short = 4/15, Medium = 8/15 and Tall = 3/15.
The entropy of the starting set is
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Choosing gender as the splitting attribute, 9 instances are F and 6 are M.
The entropy of the subset that are F is
3/9 log(9/3) + 6/9 log(9/6) = 0.2764
The entropy of the subset that are M is
1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
The ID3 algorithm must determine what the information gain is for this split.
Calculate the weighted sum of these last two entropies to get
9/15 * 0.2764 + 6/15 * 0.4392 = 0.34152
The gain in entropy by using the gender attribute is thus
0.4384 - 0.34152 = 0.09688
Looking at the height attribute, we divide it into ranges:
(0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0], (2.0,inf)
(0,1.6] -> (2/2(0) + 0 + 0) = 0, (1.6,1.7] -> 0, ..., (1.9,2.0] -> (0 + 1/2 log(2) + 1/2 log(2)) = 0.301
The gain in entropy by using the height attribute is
0.4384 - 2/15(0.301) = 0.3983

C4.5 or C5.0

GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)
To calculate the GainRatio for the gender split, we first find the entropy associated with the split ignoring classes:
H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292
This gives the GainRatio value for the gender attribute as
0.09688 / 0.292 = 0.332
The entropy for the split on height is
H(2/15, 2/15, 3/15, 4/15, 2/15) = 2/15 log(15/2) + 2/15 log(15/2) + 3/15 log(15/3) + 4/15 log(15/4) + 2/15 log(15/2) = 0.1166*3 + 0.1397 + 0.15307 = 0.64257
This gives the GainRatio value for the height attribute as
0.09688 / 0.64257 = 0.1507

Neural Network

How to solve a classification problem using a neural network:
Determine the number of output nodes and the attributes to be used as input
Determine the labels and functions to be used for the graph (network)
Each tuple needs to be evaluated by filtering it through the structure of the network
For each tuple ti in the database D, propagate ti through the network and classify the tuple

Various issues in neural network classification are:

Deciding the attributes to be used as splitting attributes
Determination of the number of hidden nodes
Determination of the number of hidden layers and the best number of hidden nodes per hidden layer
Determination of the number of sinks
Interconnectivity of the nodes
Using different activation functions

Propagation in a Neural Network

The output of each node i in the neural network is based on an activation function fi. When fi is applied to the inputs {x1i, x2i, ..., xki} with weights {w1i, w2i, ..., wki}, the weighted sum of these inputs is
S = sum over h = 1..k of (whi * xhi)

For each node i in the input layer:
  output xi on each output arc from i
For each hidden layer, for each node i:
  Si = sum of (wji * xji) over the incoming arcs
  output (1 - e^-Si) / (1 + e^-Si) on each output arc from i
For each node i in the output layer:
  Si = sum of (wji * xji) over the incoming arcs
  output = 1 / (1 + e^(-c*Si))
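A compact forward-propagation sketch along these lines (Python; the two-input, one-hidden-layer topology and the weight values are made up purely for illustration):

```python
import math

def hidden_activation(s):
    return (1 - math.exp(-s)) / (1 + math.exp(-s))   # squashing function used in the hidden layer

def output_activation(s, c=1.0):
    return 1 / (1 + math.exp(-c * s))                # sigmoid with steepness c, used at the output

def propagate(x, hidden_weights, output_weights):
    # One hidden layer: each row of hidden_weights holds the weights into one hidden node
    hidden = [hidden_activation(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    return output_activation(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical input tuple and weights
print(propagate(x=[0.5, 1.0],
                hidden_weights=[[0.2, -0.4], [0.7, 0.1]],
                output_weights=[1.5, -0.3]))
```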

Radial Basis Function Network

A function whose value changes as it moves away from a central point is known as a radial function, e.g.
fi(S) = e^(-S^2 / v)

Perceptron

The simplest type of neural network is called a perceptron.
The perceptron uses a sigmoidal activation function.
Association Rules

Let I = {I1, I2, ..., In} be a set of items and let {t1, t2, ..., tm} be a database of transactions, where each transaction ti = {Ii1, Ii2, ..., Iik} with Iij in I. An association rule is an implication of the form X => Y, where X and Y are itemsets contained in I and X intersection Y is empty.
Basic Concepts of Association Rules

Support: the support for an association rule X => Y is the percentage of transactions in the database that contain X union Y.
Confidence: the confidence for an association rule X => Y is the ratio of the number of transactions that contain X union Y to the number of transactions that contain X.
Large itemset: a large itemset is an itemset whose number of occurrences is above a threshold (minimum support). L represents the complete set of large itemsets and I represents an individual itemset. The itemsets that are counted from the data set are called candidates, and the collection of these counted itemsets is known as the candidate itemset.

Apriori Algorithm

This algorithm is an association rule algorithm that finds the large itemsets in a given dataset.

Transaction  Items
T1           Bread, Jam, Butter
T2           Bread, Butter
T3           Bread, Cold-drink, Butter
T4           Milk, Bread
T5           Milk, Cold-drink

Candidates and Large Itemsets using Apriori

Scan  Candidates                                            Large Itemsets
1     {Milk}, {Bread}, {Jam}, {Cold-drink}, {Butter}        {Milk}, {Bread}, {Cold-drink}, {Butter}
2     {Milk, Bread}, {Milk, Cold-drink}, {Milk, Butter},    {Bread, Butter}
      {Bread, Cold-drink}, {Bread, Butter},
      {Cold-drink, Butter}
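A sketch of the level-wise counting behind this table (Python; a minimum support of 2 transactions is assumed, since that threshold reproduces the large itemsets shown above):

```python
from itertools import combinations

transactions = [{"Bread", "Jam", "Butter"}, {"Bread", "Butter"},
                {"Bread", "Cold-drink", "Butter"}, {"Milk", "Bread"},
                {"Milk", "Cold-drink"}]
MIN_SUPPORT = 2   # assumed threshold (absolute transaction count)

def count(itemset):
    return sum(itemset <= t for t in transactions)

# Scan 1: single-item candidates -> large 1-itemsets
items = sorted(set().union(*transactions))
large1 = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUPPORT]

# Scan 2: candidate pairs built from the large 1-itemsets -> large 2-itemsets
candidates2 = [a | b for a, b in combinations(large1, 2)]
large2 = [c for c in candidates2 if count(c) >= MIN_SUPPORT]
print(large1)   # {Milk}, {Bread}, {Cold-drink}, {Butter}
print(large2)   # {Bread, Butter}
```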

Sampling Algorithm

To overcome the cost of counting itemsets over a large dataset in each scan, a sampling algorithm is used. The sampling algorithm reduces the number of dataset scans to 1 in the best case and 2 in the worst case. The sampling algorithm finds the large itemsets for a sample of the data set, in the same way as the Apriori algorithm. These sample itemsets are treated as potentially large itemsets and used as candidates when counting over the entire database.

Clustering

Hierarchical
  Agglomerative
  Divisive
Partitional
Categorical
Large DB
  Sampling
  Compression

Hierarchical

A nested set of clusters is created. Each level in the hierarchy has a separate set of clusters.
Agglomerative: clusters are created in a bottom-up fashion.
Divisive: clusters are created in a top-down fashion.
A tree data structure called a dendrogram can be used to illustrate the hierarchical clustering and the sets of different clusters.

Similarity and Distance Measures

Centroid: Cm = sum(tmi) / N
Radius: Rm = sqrt(sum((tmi - Cm)^2) / N)
Diameter: Dm = sqrt(sum((tmi - tmj)^2) / (N * (N - 1)))

Methods to calculate the distance between clusters

Single link: smallest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = min(dis(til, tjm)) for every til in Ki and every tjm in Kj.
Complete link: largest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = max(dis(til, tjm)) for every til in Ki and every tjm in Kj.
Average: average distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = mean(dis(til, tjm)) for every til in Ki and every tjm in Kj.
Centroid: if the clusters have representative centroids, then the centroid distance is defined as the distance between the centroids. We thus have dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid of Ki and Cj is the centroid of Kj.
Medoid: using a medoid to represent each cluster, the distance between the clusters can be defined as the distance between the medoids: dis(Ki, Kj) = dis(mi, mj).
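A brief sketch of the first three linkage measures (Python; Euclidean distance and made-up 2-D points are assumed):

```python
import math
from statistics import mean

def single_link(ki, kj):
    # Smallest pairwise distance between the two clusters
    return min(math.dist(a, b) for a in ki for b in kj)

def complete_link(ki, kj):
    # Largest pairwise distance between the two clusters
    return max(math.dist(a, b) for a in ki for b in kj)

def average_link(ki, kj):
    # Mean of all pairwise distances between the two clusters
    return mean(math.dist(a, b) for a in ki for b in kj)

# Hypothetical clusters
ki = [(1.0, 1.5), (2.0, 1.5)]
kj = [(2.0, 3.5), (5.0, 6.0)]
print(single_link(ki, kj), complete_link(ki, kj), average_link(ki, kj))
```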

Hypothesis testing
Null hypothesis
Alternative hypothesis
Chi square testing
Regression and correlation
