Distance between nominal attribute values:

d(V_1, V_2) = \sum_i \left| \frac{n_{1i}}{n_1} - \frac{n_{2i}}{n_2} \right|

where n_{1i} is the number of records with value V_1 that belong to class i, and n_1 is the total number of records with value V_1 (similarly for V_2).
d(Single, Married)       = |2/4 - 0/4| + |2/4 - 4/4| = 1
d(Single, Divorced)      = |2/4 - 1/2| + |2/4 - 1/2| = 0
d(Married, Divorced)     = |0/4 - 1/2| + |4/4 - 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Class counts by Refund:
              Refund=No   Refund=Yes
Class = No        4           3
Class = Yes       3           0
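The distances above can be reproduced directly from the per-value class counts in this table. A minimal Python sketch (the function name mvdm and the count dictionaries are illustrative, not part of the original slides):

from fractions import Fraction

def mvdm(counts_v1, counts_v2):
    # Modified Value Difference Metric: d(V1, V2) = sum_i |n1i/n1 - n2i/n2|
    n1, n2 = sum(counts_v1.values()), sum(counts_v2.values())
    classes = set(counts_v1) | set(counts_v2)
    return sum(abs(Fraction(counts_v1.get(c, 0), n1) - Fraction(counts_v2.get(c, 0), n2))
               for c in classes)

# Class counts (Cheat = No / Yes) for each Marital Status value, read off the table above
single   = {"No": 2, "Yes": 2}
married  = {"No": 4, "Yes": 0}
divorced = {"No": 1, "Yes": 1}

print(mvdm(single, married))                            # 1
print(mvdm(single, divorced))                           # 0
print(mvdm(married, divorced))                          # 1
print(mvdm({"No": 3, "Yes": 0}, {"No": 4, "Yes": 3}))   # Refund=Yes vs Refund=No: 6/7

Attribute values with similar class distributions (Single and Divorced) come out close together, while values with very different distributions (Married versus the others) come out far apart.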
Example: PEBLS
Distance between record X and record Y:

\Delta(X, Y) = w_X\, w_Y \sum_{i=1}^{d} d(X_i, Y_i)^2

where:

w_X = (Number of times X is used for prediction) / (Number of times X predicts correctly)

w_X \approx 1 if X makes accurate predictions most of the time
w_X > 1 if X is not reliable for making predictions

Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No
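A sketch of the record-level distance for X and Y, reusing the per-attribute distances from the previous slide; both weights are assumed to be 1 because the slides do not give the prediction counts, and the continuous Taxable Income attribute is omitted for simplicity.

# Per-attribute distances taken from the previous slide (nominal attributes only)
d_refund  = 6 / 7    # d(Refund=Yes, Refund=No)
d_marital = 1.0      # d(Single, Married)

def record_distance(attr_distances, w_x=1.0, w_y=1.0):
    # Delta(X, Y) = w_X * w_Y * sum_i d(X_i, Y_i)^2
    return w_x * w_y * sum(d ** 2 for d in attr_distances)

# Both weights assumed to be 1 (prediction counts are not given on the slide)
print(record_distance([d_refund, d_marital]))   # (6/7)^2 + 1^2 ≈ 1.73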
Bayes Classifier
O A probabilistic framework for solving classification
problems
O Conditional Probability:

  P(C \mid A) = \frac{P(A, C)}{P(A)} \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}

O Bayes theorem:

  P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}
Example of Bayes Theorem
O Given:
A doctor knows that meningitis causes stiff neck 50% of the
time
Prior probability of any patient having meningitis is 1/50,000
Prior probability of any patient having stiff neck is 1/20
O If a patient has stiff neck, what's the probability he/she has meningitis?
P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002
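A one-line check of this arithmetic in Python (variable names are illustrative):

p_s_given_m = 0.5         # P(stiff neck | meningitis)
p_m = 1 / 50000           # prior P(meningitis)
p_s = 1 / 20              # prior P(stiff neck)

print(p_s_given_m * p_m / p_s)   # ≈ 0.0002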
Bayesian Classifiers
O Consider each attribute and class label as random
variables
O Given a record with attributes (A_1, A_2, ..., A_n)
  Goal is to predict class C
  Specifically, we want to find the value of C that maximizes P(C | A_1, A_2, ..., A_n)
O Can we estimate P(C | A_1, A_2, ..., A_n) directly from data?
Bayesian Classifiers
O Approach:
  Compute the posterior probability P(C | A_1, A_2, ..., A_n) for all values of C using the Bayes theorem:

  P(C \mid A_1 A_2 \cdots A_n) = \frac{P(A_1 A_2 \cdots A_n \mid C)\, P(C)}{P(A_1 A_2 \cdots A_n)}

  Choose the value of C that maximizes P(C | A_1, A_2, ..., A_n)
  Equivalent to choosing the value of C that maximizes P(A_1, A_2, ..., A_n | C) P(C)
O How to estimate P(A_1, A_2, ..., A_n | C)?
Naïve Bayes Classifier
O Assume independence among attributes A_i when class is given:

  P(A_1, A_2, ..., A_n | C_j) = P(A_1 | C_j) P(A_2 | C_j) ... P(A_n | C_j)

  Can estimate P(A_i | C_j) for all A_i and C_j.
  New point is classified to C_j if P(C_j) \prod_i P(A_i | C_j) is maximal.
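A minimal sketch of this decision rule, using priors and conditional probabilities counted from the 10-record training set shown on the next slide (the dictionary layout and function name are illustrative; Taxable Income is omitted here and would be handled with the Gaussian estimate described later):

# Priors and conditional probabilities counted from the 10 training records
prior = {"No": 7/10, "Yes": 3/10}
cond = {
    "No":  {"Refund=Yes": 3/7, "Refund=No": 4/7,
            "Single": 2/7, "Married": 4/7, "Divorced": 1/7},
    "Yes": {"Refund=Yes": 0/3, "Refund=No": 3/3,
            "Single": 2/3, "Married": 0/3, "Divorced": 1/3},
}

def naive_bayes_classify(attrs):
    # Pick the class C_j that maximizes P(C_j) * prod_i P(A_i | C_j)
    scores = {c: prior[c] for c in prior}
    for c in scores:
        for a in attrs:
            scores[c] *= cond[c][a]
    return max(scores, key=scores.get), scores

print(naive_bayes_classify(["Refund=No", "Married"]))
# ('No', {'No': ≈0.23, 'Yes': 0.0}) -> predicted class is No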
How to Estimate Probabilities from Data?
O Class: P(C) = N_c / N
  e.g., P(No) = 7/10, P(Yes) = 3/10
O For discrete attributes:

  P(A_i \mid C_k) = \frac{|A_{ik}|}{N_{c_k}}

  where |A_{ik}| is the number of instances having attribute value A_i that belong to class C_k, and N_{c_k} is the number of instances in class C_k
  Examples:
  P(Status=Married | No) = 4/7
  P(Refund=Yes | Yes) = 0
Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
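A small counting sketch for the discrete-attribute estimates above; the records list mirrors the table (Refund, Marital Status, Evade), and the function name is illustrative:

# (Refund, Marital Status, Evade) for the 10 training records above
records = [
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),  ("No", "Single",  "No"),
    ("Yes", "Married",  "No"),  ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"), ("No", "Married", "No"),
    ("No",  "Single",   "Yes"),
]

def p_attr_given_class(attr_index, value, cls):
    # P(A_i | C_k) = |A_ik| / N_ck : count within the class, divide by the class size
    in_class = [r for r in records if r[2] == cls]
    return sum(r[attr_index] == value for r in in_class) / len(in_class)

print(p_attr_given_class(1, "Married", "No"))   # 4/7 ≈ 0.571
print(p_attr_given_class(0, "Yes", "Yes"))      # 0.0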
How to Estimate Probabilities from Data?
O For continuous attributes:
Discretize the range into bins
one ordinal attribute per bin
violates independence assumption
Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once the probability distribution is known, can use it to estimate the conditional probability P(A_i | c)
How to Estimate Probabilities from Data?
O Normal distribution:

  P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\!\left( -\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right)

  One for each (A_i, c_i) pair
O For (Income, Class=No):
  If Class=No,
    sample mean = 110
    sample variance = 2975

  P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\!\left( -\frac{(120 - 110)^2}{2(2975)} \right) = 0.0072

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
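A quick check of the density computation above (standard-library math only; the function name is illustrative):

import math

def gaussian_pdf(x, mean, variance):
    # Normal density used to estimate P(A_i | c_j) for a continuous attribute
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Income given Class = No: sample mean 110, sample variance 2975
print(gaussian_pdf(120, 110, 2975))   # ≈ 0.0072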
[Diagram: input nodes X_1, X_2, X_3 feed a "black box" output node Y, with link weights 0.3, 0.3, 0.3 and threshold t = 0.4]

Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0)

where I(z) = 1 if z is true, and 0 otherwise
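A direct transcription of this thresholded unit (function and argument names are illustrative):

def perceptron_output(x1, x2, x3, weights=(0.3, 0.3, 0.3), t=0.4):
    # Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)
    s = weights[0] * x1 + weights[1] * x2 + weights[2] * x3
    return 1 if s - t > 0 else 0

print(perceptron_output(1, 1, 0))   # 1: fires when at least two binary inputs are 1
print(perceptron_output(1, 0, 0))   # 0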
Artificial Neural Networks (ANN)
O Model is an assembly of
inter-connected nodes
and weighted links
O Output node sums up each of its input values according to the weights of its links
O Compare output node
against some threshold t
[Diagram: input nodes X_1, X_2, X_3 connected by weighted links w_1, w_2, w_3 to an output node Y ("black box") with threshold t]

Perceptron Model:

Y = I\!\left( \sum_i w_i X_i - t \right) \quad \text{or} \quad Y = \mathrm{sign}\!\left( \sum_i w_i X_i - t \right)
General Structure of ANN
[Diagram: a single neuron i with inputs I_1, I_2, I_3, weights w_{i1}, w_{i2}, w_{i3}, threshold t, weighted sum S_i, activation function g(S_i), and output O_i]

[Diagram: a multilayer network with an input layer (x_1, ..., x_5), a hidden layer, and an output layer producing y]
Training an ANN means learning the weights of the neurons
Algorithm for learning ANN
O Initialize the weights (w_0, w_1, ..., w_k)
O Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples
Objective function:

  E = \sum_i \left[ Y_i - f(w_i, X_i) \right]^2

Find the weights w_i's that minimize the above objective function
e.g., backpropagation algorithm (see lecture notes)
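A minimal sketch of minimizing this squared-error objective by gradient descent for a single linear unit; the slides refer to backpropagation for multilayer networks, so this single-layer version, its toy data, and its learning rate are illustrative simplifications:

def train_linear_unit(X, Y, lr=0.01, epochs=500):
    # Minimize E = sum_i (Y_i - w·X_i)^2 by stochastic gradient descent on w
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, Y):
            pred = sum(wj * xj for wj, xj in zip(w, xi))
            err = yi - pred
            # Gradient of (yi - w·xi)^2 w.r.t. w_j is -2 * err * x_j
            w = [wj + lr * 2 * err * xj for wj, xj in zip(w, xi)]
    return w

# Toy data: last column of X is a constant 1 acting as the bias input
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
Y = [0, 1, 1, 2]                 # target = x1 + x2
print(train_linear_unit(X, Y))   # ≈ [1.0, 1.0, 0.0]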
Support Vector Machines
O Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
O One possible solution: hyperplane B_1
Support Vector Machines
O Another possible solution: hyperplane B_2
Support Vector Machines
O Other possible solutions (e.g., B_2)
Support Vector Machines
O Which one is better? B1 or B2?
O How do you define better?
Support Vector Machines
O Find the hyperplane that maximizes the margin => B_1 is better than B_2
[Figure: B_1 with margin boundaries b_11 and b_12 has a wider margin than B_2 with boundaries b_21 and b_22]
Support Vector Machines
[Figure: hyperplane B_1 with margin boundaries b_11 and b_12]

\vec{w} \cdot \vec{x} + b = 0
\vec{w} \cdot \vec{x} + b = +1
\vec{w} \cdot \vec{x} + b = -1

f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}

\text{Margin} = \frac{2}{\|\vec{w}\|^2}
Support Vector Machines
O We want to maximize:

  \text{Margin} = \frac{2}{\|\vec{w}\|^2}

  Which is equivalent to minimizing:

  L(w) = \frac{\|\vec{w}\|^2}{2}

  But subject to the following constraints:

  f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 \end{cases}

  This is a constrained optimization problem
  Numerical approaches to solve it (e.g., quadratic programming)
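A small sketch of the decision function and margin width for a given (w, b); the weight vector here is made up for illustration, and the margin is computed as 2/||w|| (the geometric distance between the two boundary hyperplanes), which leads to the same optimization as the 2/||w||^2 form used on the slide:

import math

def svm_predict(w, b, x):
    # Assign the class by the sign of w·x + b (the decision rule implied by the slide's f(x))
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def margin_width(w):
    # Geometric distance between w·x + b = +1 and w·x + b = -1, i.e. 2 / ||w||
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Illustrative separating hyperplane (not from the slides)
w, b = [1.0, -1.0], 0.0
print(svm_predict(w, b, [2.0, 0.5]))   # +1
print(svm_predict(w, b, [0.5, 2.0]))   # -1
print(margin_width(w))                 # 2 / sqrt(2) ≈ 1.41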
Support Vector Machines
O What if the problem is not linearly separable?
Support Vector Machines
O What if the problem is not linearly separable?
  Introduce slack variables \xi_i
  Need to minimize:

  L(w) = \frac{\|\vec{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i \right)^k

  Subject to:

  f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 + \xi_i \end{cases}
Nonlinear Support Vector Machines
O What if decision boundary is not linear?
Nonlinear Support Vector Machines
O Transform data into higher dimensional space
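One way to make this idea concrete: an explicit mapping to a higher-dimensional space in which an XOR-like pattern becomes linearly separable. The feature map, data, and hand-picked hyperplane below are illustrative (a real SVM would learn a maximum-margin equivalent, typically via a kernel rather than an explicit mapping):

def phi(x1, x2):
    # Map 2-D input to 3-D feature space: (x1, x2) -> (x1, x2, x1*x2)
    return (x1, x2, x1 * x2)

# XOR labels are not linearly separable in the original 2-D space ...
data = {(0, 0): -1, (0, 1): +1, (1, 0): +1, (1, 1): -1}

# ... but in the 3-D feature space a single hyperplane separates them.
# (w, b) chosen by hand for illustration.
w, b = (1.0, 1.0, -2.0), -0.5

for (x1, x2), label in data.items():
    z = phi(x1, x2)
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    print((x1, x2), label, +1 if score > 0 else -1)   # predictions match the labels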
Ensemble Methods
O Construct a set of classifiers from the training
data
O Predict class label of previously unseen records
by aggregating predictions made by multiple
classifiers
General Idea
[Diagram]
Step 1: Create multiple data sets D_1, D_2, ..., D_{t-1}, D_t from the original training data D
Step 2: Build multiple classifiers C_1, C_2, ..., C_{t-1}, C_t
Step 3: Combine classifiers into C^*
Why does it work?
O Suppose there are 25 base classifiers
Each classifier has error rate ε = 0.35
Assume classifiers are independent
Probability that the ensemble classifier makes
a wrong prediction:
\sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1 - \varepsilon)^{25-i} = 0.06
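A quick check of this binomial computation (math.comb is from the Python standard library):

import math

eps, n = 0.35, 25
p_ensemble_wrong = sum(
    math.comb(n, i) * eps**i * (1 - eps)**(n - i)
    for i in range(13, n + 1)   # the ensemble errs when a majority (>= 13) of the 25 err
)
print(round(p_ensemble_wrong, 2))   # 0.06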
Examples of Ensemble Methods
O How to generate an ensemble of classifiers?
Bagging
Boosting
Bagging
O Sampling with replacement
O Build classifier on each bootstrap sample
O Each record has probability 1 - (1 - 1/n)^n of being selected at least once in a bootstrap sample of size n (about 0.632 for large n)
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
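A sketch of how such bootstrap rounds can be generated; since the sampling is random, the exact rows will differ from the table above:

import random

def bootstrap_sample(records, rng=random):
    # Sample len(records) items with replacement (one bagging round)
    return [rng.choice(records) for _ in range(len(records))]

original = list(range(1, 11))   # record ids 1..10
for round_no in range(1, 4):
    print(f"Bagging (Round {round_no})", bootstrap_sample(original))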
Boosting
O An iterative procedure to adaptively change
distribution of training data by focusing more on
previously misclassified records
Initially, all N records are assigned equal
weights
Unlike bagging, weights may change at the end of each boosting round
Boosting
O Records that are wrongly classified will have their
weights increased
O Records that are classified correctly will have
their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
Example 4 is hard to classify
Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds
Example: AdaBoost
O Base classifiers: C_1, C_2, ..., C_T
O Error rate:

  \varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j\, \delta\big( C_i(x_j) \neq y_j \big)

O Importance of a classifier:

  \alpha_i = \frac{1}{2} \ln\!\left( \frac{1 - \varepsilon_i}{\varepsilon_i} \right)
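A small numeric illustration of the importance formula (the error-rate values are chosen for illustration; α ≈ 1.95 matches the round-1 value shown in the AdaBoost illustration later in the deck):

import math

def alpha(error_rate):
    # Importance of a base classifier: 0.5 * ln((1 - eps) / eps)
    return 0.5 * math.log((1 - error_rate) / error_rate)

print(alpha(0.35))   # ≈ 0.31  (a weak classifier gets low importance)
print(alpha(0.02))   # ≈ 1.95  (an accurate classifier gets high importance)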
Example: AdaBoost
O Weight update:

  w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}

  where Z_j is the normalization factor
O If any intermediate rounds produce an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
O Classification:

  C^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j\, \delta\big( C_j(x) = y \big)
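A sketch of one round of the weight update above; the correctness vector and the α value are made up for illustration:

import math

def update_weights(weights, correct, alpha_j):
    # One AdaBoost round: down-weight correct records, up-weight wrong ones, normalize by Z_j
    new_w = [w * math.exp(-alpha_j if ok else alpha_j)
             for w, ok in zip(weights, correct)]
    z = sum(new_w)                      # normalization factor Z_j
    return [w / z for w in new_w]

weights = [0.1] * 10                    # initial weights 1/N
correct = [True] * 9 + [False]          # suppose record 10 is misclassified
print(update_weights(weights, correct, alpha_j=1.0))
# the nine correct records end up with small weights; the misclassified one gets the largest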
Illustrating AdaBoost

[Figure: ten training data points ("Original Data") with labels + + + - - - - - + +, each given an initial weight of 0.1]
[Figure: Boosting Round 1, base classifier B1 with alpha = 1.9459, predicts + + + - - - - - - -; updated point weights 0.0094, 0.0094, 0.4623]
Illustrating AdaBoost

[Figure: Boosting Round 1 (B1, alpha = 1.9459): predictions + + + - - - - - - -, point weights 0.0094, 0.0094, 0.4623]
[Figure: Boosting Round 2 (B2, alpha = 2.9323): predictions - - - - - - - - + +, point weights 0.3037, 0.0009, 0.0422]
[Figure: Boosting Round 3 (B3, alpha = 3.8744): predictions + + + + + + + + + +, point weights 0.0276, 0.1819, 0.0038]
Overall (weighted vote of B1, B2, B3): + + + - - - - - + +