Supervised Learning
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
An example application
Another application
Age
Marital status
Annual salary
Outstanding debts
Credit rating
etc.
Approved or not
training data
Testing: Test the model using unseen test data to assess the model accuracy
Accuracy
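In symbols (a standard definition, stated here for completeness):
$$\text{Accuracy} = \frac{\text{number of correct classifications on the test set}}{\text{total number of test cases}}$$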
What do we mean by learning?
Given
a data set D,
a task T, and
a performance measure M,
An example
Fundamental assumption of learning
Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples).
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Introduction
Choose an attribute to partition data
Information theory
$$\mathrm{entropy}(D) = -\sum_{j=1}^{|C|} \Pr(c_j)\,\log_2 \Pr(c_j), \qquad \text{where } \sum_{j=1}^{|C|} \Pr(c_j) = 1$$
Information gain
An example
$$\mathrm{entropy}(D) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971$$

$$\mathrm{entropy}_{Own\_house}(D) = \frac{6}{15}\,\mathrm{entropy}(D_1) + \frac{9}{15}\,\mathrm{entropy}(D_2) = \frac{6}{15}\times 0 + \frac{9}{15}\times 0.918 = 0.551$$

$$\mathrm{entropy}_{Age}(D) = \frac{5}{15}\,\mathrm{entropy}(D_1) + \frac{5}{15}\,\mathrm{entropy}(D_2) + \frac{5}{15}\,\mathrm{entropy}(D_3) = \frac{5}{15}\times 0.971 + \frac{5}{15}\times 0.971 + \frac{5}{15}\times 0.722 = 0.888$$
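A minimal Python sketch that reproduces these numbers; the attribute names and the class splits within each partition (a pure 6/0 split and a 6/3 split for Own_house; 3/2, 2/3 and 4/1 splits for the three Age groups) are assumed from the entropies shown above:

    import math

    def entropy(counts):
        """Entropy of a class distribution given as a list of class counts."""
        total = sum(counts)
        probs = [c / total for c in counts if c > 0]
        return -sum(p * math.log2(p) for p in probs)

    def split_entropy(partitions):
        """Weighted average entropy after splitting D into the given partitions."""
        total = sum(sum(part) for part in partitions)
        return sum(sum(part) / total * entropy(part) for part in partitions)

    print(round(entropy([6, 9]), 3))                          # 0.971  (whole data set D)
    print(round(split_entropy([[6, 0], [6, 3]]), 3))          # 0.551  (split on Own_house)
    print(round(split_entropy([[3, 2], [2, 3], [4, 1]]), 3))  # 0.888  (split on Age)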
Handling continuous attributes
Handle a continuous attribute by splitting its value range into intervals (typically a two-way split at a chosen threshold).
An example in a continuous space
Avoid overfitting in classification
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Evaluating classification methods
Predictive accuracy
Efficiency
Evaluation methods
a training set,
a validation set and
a test set.
Classification measures
$$r = \frac{TP}{TP + FN}.$$
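A small helper (a generic sketch, not code from the slides) that computes precision, recall and the F-score from confusion-matrix counts; the example counts are made up:

    def precision_recall(tp, fp, fn):
        """Precision p = TP/(TP+FP), recall r = TP/(TP+FN), F-score = 2pr/(p+r)."""
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f = 2 * p * r / (p + r)
        return p, r, f

    print(precision_recall(tp=80, fp=20, fn=40))  # (0.8, 0.666..., 0.727...)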
An example
Receiver operating characteristic (ROC) curve
Then we have
An example
An example
Lift curve
Bin   No. of positive cases   % of positives   Cumulative %
1     210                     42%              42%
2     120                     24%              66%
3     60                      12%              78%
4     40                      8%               86%
5     22                      4.40%            90.40%
6     18                      3.60%            94%
7     12                      2.40%            96.40%
8     7                       1.40%            97.80%
9     6                       1.20%            99%
10    5                       1%               100%
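A short sketch that recomputes the percentage and cumulative columns from the per-bin positive counts above (500 positives in total):

    bin_positives = [210, 120, 60, 40, 22, 18, 12, 7, 6, 5]  # positives found in each bin
    total = sum(bin_positives)                                # 500 positive cases in all

    cumulative = 0
    for i, n in enumerate(bin_positives, start=1):
        cumulative += n
        print(f"bin {i:2d}: {n:4d}  {100 * n / total:5.2f}%  cumulative {100 * cumulative / total:6.2f}%")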
[Chart: the lift curve plotted against the random baseline; x-axis: percent of test cases (0-100), y-axis: cumulative percent of positive cases (0-100).]
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Summary
Introduction
Sequential covering
Learn one rule at a time, sequentially.
After a rule is learned, the training examples covered by the rule are removed.
Only the remaining data are used to find subsequent rules.
The process repeats until some stopping criteria are met.
Note: a rule covers an example if the example satisfies the conditions of the rule.
We introduce two specific algorithms.
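A high-level sketch of sequential covering; learn_one_rule, good_enough and the rule object's covers method are hypothetical placeholders for the two concrete algorithms introduced next:

    def sequential_covering(data, learn_one_rule, good_enough):
        """Learn an ordered rule list, one rule at a time.
        learn_one_rule builds a rule from the remaining data, good_enough is the
        stopping criterion, and rule.covers(ex) tests whether a rule covers an example."""
        rule_list = []
        remaining = list(data)
        while remaining:
            rule = learn_one_rule(remaining)
            if rule is None or not good_enough(rule, remaining):
                break  # stopping criteria met
            rule_list.append(rule)
            # Remove the training examples covered by the rule just learned.
            remaining = [ex for ex in remaining if not rule.covers(ex)]
        return rule_list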
Differences:
Learn-one-rule-1 function
Learn-one-rule-1 function (cont.)
In iteration m, each (m-1)-condition rule in the current candidate rule set is extended by attaching one additional attribute-value pair as a new condition.
Learn-one-rule-1 algorithm
Learn-one-rule-2 function
Learn-one-rule-2 algorithm
Rule evaluation in learn-one-rule-2
Let the current partially developed rule be R: av1, .., avk → class, and the rule after adding one more condition be R+: av1, .., avk, av(k+1) → class. The evaluation of adding the condition is
$$\mathrm{gain}(R, R^{+}) = p_1\left(\log_2\frac{p_1}{p_1+n_1} - \log_2\frac{p_0}{p_0+n_0}\right)$$
where p0 (n0) is the number of positive (negative) examples covered by R in the training data, and p1 (n1) is the corresponding number for R+.
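A direct transcription of this gain function (a minimal sketch; the counts below are hypothetical):

    import math

    def rule_gain(p0, n0, p1, n1):
        """gain(R, R+) = p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
        if p1 == 0:
            return 0.0
        return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

    # A condition that keeps 8 of 10 positives but drops 8 of 10 negatives has positive gain.
    print(rule_gain(p0=10, n0=10, p1=8, n1=2))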
$$v(\mathit{BestRule}, \mathit{PrunePos}, \mathit{PruneNeg}) = \frac{p - n}{p + n}$$
where p (n) is the number of examples in PrunePos (PruneNeg) covered by the current rule (after a deletion).
Discussions
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Three approaches
Considerations in CAR mining
Multiple minimum class supports
Building classifiers
Coverage: rare-item rules are not found using the classic algorithm. Multiple minimum supports and the support difference constraint help a great deal.
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Bayesian classification
The denominator is obtained by summing over all classes (law of total probability):
$$\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|}) = \sum_{r=1}^{|C|} \Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_r)\,\Pr(C=c_r)$$
Computing probabilities
Conditional independence assumption
Formally, we assume,
Pr(A1=a1 | A2=a2, ..., A|A|=a|A|, C=cj) = Pr(A1=a1 | C=cj)
$$\Pr(C=c_j \mid A_1=a_1,\ldots,A_{|A|}=a_{|A|}) = \frac{\Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)}{\sum_{r=1}^{|C|}\Pr(C=c_r)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_r)}$$
We are done!
How do we estimate Pr(Ai = ai | C = cj)? Easy!
An example
An example (cont.)
For C = t, we have
$$\Pr(C=t)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=t) = \frac{1}{2}\times\frac{2}{5}\times\frac{2}{5} = \frac{2}{25}$$
For C = f, we have
$$\Pr(C=f)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=f) = \frac{1}{2}\times\frac{1}{5}\times\frac{2}{5} = \frac{1}{25}$$
Since 2/25 > 1/25, C = t is more probable, so t is the predicted class.
Additional issues
If an attribute value never occurs with a class in the training data, its estimated probability is 0 and wipes out the whole product, so a smoothed estimate is used:
$$\Pr(A_i=a_i \mid C=c_j) = \frac{n_{ij} + \lambda}{n_j + \lambda\, n_i}$$
where n_j is the number of training examples with class c_j, n_ij is the number of those that also have A_i = a_i, n_i is the number of possible values of A_i, and λ is a small smoothing constant (λ = 1 gives Laplace smoothing).
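A minimal Python sketch of this smoothed estimate (the notation above and the λ = 1 default follow standard Lidstone/Laplace smoothing; the toy data is hypothetical):

    from collections import Counter

    def nb_estimates(examples, attr_index, lam=1.0):
        """Smoothed estimates of Pr(A_i = a_i | C = c_j) for one attribute.
        `examples` is a list of (attribute_tuple, class_label) pairs."""
        values = {attrs[attr_index] for attrs, _ in examples}   # possible values of A_i
        n_i = len(values)
        estimates = {}
        for c_j in {cls for _, cls in examples}:
            n_j = sum(1 for _, cls in examples if cls == c_j)
            counts = Counter(attrs[attr_index] for attrs, cls in examples if cls == c_j)
            for a_i in values:
                estimates[(a_i, c_j)] = (counts[a_i] + lam) / (n_j + lam * n_i)
        return estimates

    data = [(("yes", "single"), "t"), (("no", "married"), "f"), (("yes", "married"), "t")]
    print(nb_estimates(data, attr_index=0))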
Advantages:
Easy to implement
Very efficient
Good results obtained in many applications
Disadvantages:
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Text classification/categorization
Due to the rapid growth of online documents in organizations and on the Web, automated document classification has become an important research issue.
Probabilistic framework
Generative model: Each document is generated by a parametric distribution governed by a set of hidden parameters.
The generative model makes two assumptions.
Mixture model
An example
Document generation
$$\Pr(d_i \mid \Theta) = \sum_{j=1}^{|C|} \Pr(c_j \mid \Theta)\,\Pr(d_i \mid c_j;\Theta) \qquad (23)$$
Multinomial distribution
$$\Pr(d_i \mid c_j;\Theta) = \Pr(|d_i|)\,|d_i|!\prod_{t=1}^{|V|}\frac{\Pr(w_t \mid c_j;\Theta)^{N_{ti}}}{N_{ti}!} \qquad (24)$$
where N_ti is the number of times word w_t occurs in document d_i, and
$$\sum_{t=1}^{|V|}\Pr(w_t \mid c_j;\Theta) = 1. \qquad (25)$$
Parameter estimation
The word probabilities are estimated from the training documents, weighting word counts N_ti by the class membership probabilities Pr(c_j | d_i):
$$\Pr(w_t \mid c_j;\hat\Theta) = \frac{\sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)} \qquad (26)$$
With Lidstone smoothing to avoid zero probabilities (λ = 1 gives Laplace smoothing):
$$\Pr(w_t \mid c_j;\hat\Theta) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)} \qquad (27)$$
$$\Pr(c_j \mid \hat\Theta) = \frac{\sum_{i=1}^{|D|}\Pr(c_j \mid d_i)}{|D|} \qquad (28)$$
Classification
Given a test document di, from Eqs. (23), (27) and (28),
$$\Pr(c_j \mid d_i;\hat\Theta) = \frac{\Pr(c_j \mid \hat\Theta)\,\Pr(d_i \mid c_j;\hat\Theta)}{\Pr(d_i \mid \hat\Theta)} = \frac{\Pr(c_j \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j;\hat\Theta)}{\sum_{r=1}^{|C|}\Pr(c_r \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_r;\hat\Theta)}$$
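A compact sketch of multinomial naïve Bayes for text along the lines of Eqs. (27), (28) and the posterior above, assuming hard labels (Pr(c_j | d_i) ∈ {0, 1}) for the training documents; the tiny corpus and tokenization are hypothetical:

    import math
    from collections import Counter

    def train_multinomial_nb(docs):
        """docs: list of (word_list, class_label). Returns class priors and
        Laplace-smoothed word probabilities."""
        classes = {c for _, c in docs}
        vocab = {w for words, _ in docs for w in words}
        prior = {c: sum(1 for _, cl in docs if cl == c) / len(docs) for c in classes}
        word_prob = {}
        for c in classes:
            counts = Counter(w for words, cl in docs if cl == c for w in words)
            total = sum(counts.values())
            for w in vocab:
                word_prob[(w, c)] = (1 + counts[w]) / (len(vocab) + total)
        return classes, vocab, prior, word_prob

    def classify(words, classes, vocab, prior, word_prob):
        """Pick the class maximizing log Pr(c) + sum_k log Pr(w_k | c)."""
        def score(c):
            return math.log(prior[c]) + sum(
                math.log(word_prob[(w, c)]) for w in words if w in vocab)
        return max(classes, key=score)

    docs = [("buy cheap pills now".split(), "spam"),
            ("meeting agenda for tomorrow".split(), "ham"),
            ("cheap pills cheap".split(), "spam")]
    model = train_multinomial_nb(docs)
    print(classify("cheap meeting".split(), *model))  # "spam"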
Discussions
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Introduction
Basic concepts
$$y_i = \begin{cases} 1 & \text{if } \langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b \ge 0 \\ -1 & \text{if } \langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b < 0 \end{cases}$$
The hyperplane
The distance from a point x_i to the hyperplane ⟨w · x⟩ + b = 0 is
$$\frac{|\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b|}{\|\mathbf{w}\|} \qquad (36)$$
where
$$\|\mathbf{w}\| = \sqrt{\langle\mathbf{w}\cdot\mathbf{w}\rangle} = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2} \qquad (37)$$
For a support vector x_s on the margin hyperplane (|⟨w · x_s⟩ + b| = 1), the distance to the decision boundary is
$$d_{+} = d_{-} = \frac{|\langle\mathbf{w}\cdot\mathbf{x}_s\rangle + b|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|} \qquad (38)$$
so the margin is
$$\mathrm{margin} = d_{+} + d_{-} = \frac{2}{\|\mathbf{w}\|} \qquad (39)$$
(39)
119
A optimization problem!
Definition (Linear SVM: separable case): Given a set of
linearly separable training examples,
D = {(x1, y1), (x2, y2), , (xr, yr)}
Learning is to solve the following constrained
minimization problem,
w w
Minimize :
2
Subject to : yi ( w x i b) 1, i 1, 2, ..., r
yi ( w x i b 1, i 1, 2, ..., r
w xi + b 1
w xi + b -1
CS583, Bing Liu, UIC
(40)
summarizes
for yi = 1
for yi = -1.
$$L_P = \frac{1}{2}\langle\mathbf{w}\cdot\mathbf{w}\rangle - \sum_{i=1}^{r}\alpha_i\bigl[y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) - 1\bigr] \qquad (41)$$
Kuhn-Tucker conditions
Dual formulation
$$L_D = \sum_{i=1}^{r}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{r} y_i y_j \alpha_i\alpha_j \langle\mathbf{x}_i\cdot\mathbf{x}_j\rangle \qquad (55)$$
The decision boundary is
$$\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = \sum_{i\in sv} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}\rangle + b = 0 \qquad (57)$$
and a test instance z is classified with
$$\mathrm{sign}(\langle\mathbf{w}\cdot\mathbf{z}\rangle + b) = \mathrm{sign}\Bigl(\sum_{i\in sv} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{z}\rangle + b\Bigr) \qquad (58)$$
If (58) returns 1, then the test instance z is classified as positive; otherwise, it is classified as negative.
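A small NumPy illustration of Eq. (58): given (hypothetical) support vectors, their labels and Lagrange multipliers, classify a test instance z:

    import numpy as np

    def svm_predict(z, sv_x, sv_y, alpha, b):
        """sign( sum_{i in sv} y_i * alpha_i * <x_i . z> + b ), as in Eq. (58)."""
        return int(np.sign(np.sum(sv_y * alpha * (sv_x @ z)) + b))

    sv_x = np.array([[1.0, 1.0], [-1.0, -1.0]])  # support vectors (toy values)
    sv_y = np.array([1.0, -1.0])                 # their class labels
    alpha = np.array([0.5, 0.5])                 # their Lagrange multipliers
    b = 0.0
    print(svm_predict(np.array([2.0, 0.5]), sv_x, sv_y, alpha, b))  # 1 (positive side)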
$$\text{Minimize: } \frac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2} \qquad\quad \text{Subject to: } y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1,\; i = 1, 2, \ldots, r$$
$$\alpha_i \ge 0,\quad i = 1, 2, \ldots, r.$$
Geometric interpretation
objective function.
A natural way of doing it is to assign an extra cost for errors, changing the objective function to
$$\text{Minimize: } \frac{\langle\mathbf{w}\cdot\mathbf{w}\rangle}{2} + C\Bigl(\sum_{i=1}^{r}\xi_i\Bigr)^{k} \qquad (60)$$
k = 1 is commonly used, which has the advantage that neither ξ_i nor its Lagrange multipliers appear in the dual formulation.
$$y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) \ge 1 - \xi_i,\quad i = 1, 2, \ldots, r \qquad (61)$$
$$\xi_i \ge 0,\quad i = 1, 2, \ldots, r \qquad (62)$$
The Lagrangian is
$$L_P = \frac{1}{2}\langle\mathbf{w}\cdot\mathbf{w}\rangle + C\sum_{i=1}^{r}\xi_i - \sum_{i=1}^{r}\alpha_i\bigl[y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b) - 1 + \xi_i\bigr] - \sum_{i=1}^{r}\mu_i\xi_i$$
Kuhn-Tucker conditions
Dual
$$\frac{1}{y_j} - b - \sum_{i\in sv} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}_j\rangle = 0 \qquad (73)$$
(used to compute b from a support vector x_j).
$$\langle\mathbf{w}\cdot\mathbf{x}\rangle + b = \sum_{i=1}^{r} y_i\alpha_i\langle\mathbf{x}_i\cdot\mathbf{x}\rangle + b = 0 \qquad (75)$$
Space transformation
$$\mathbf{x} \mapsto \phi(\mathbf{x}) \qquad (77)$$
Geometric interpretation
An example space transformation
$$(x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$$
Kernel functions
Polynomial kernel
$$K(\mathbf{x},\mathbf{z}) = \langle\mathbf{x}\cdot\mathbf{z}\rangle^{d} \qquad (83)$$
Let us compute the kernel with degree d = 2 in a 2-dimensional space: x = (x1, x2) and z = (z1, z2).
$$\langle\mathbf{x}\cdot\mathbf{z}\rangle^{2} = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = \langle(x_1^2, x_2^2, \sqrt{2}x_1 x_2)\cdot(z_1^2, z_2^2, \sqrt{2}z_1 z_2)\rangle = \langle\phi(\mathbf{x})\cdot\phi(\mathbf{z})\rangle, \qquad (84)$$
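A quick numerical check of this identity with arbitrary test values:

    import math

    x1, x2, z1, z2 = 1.0, 2.0, 3.0, -1.0
    lhs = (x1 * z1 + x2 * z2) ** 2                      # <x . z>^2
    phi_x = (x1 ** 2, x2 ** 2, math.sqrt(2) * x1 * x2)  # phi(x)
    phi_z = (z1 ** 2, z2 ** 2, math.sqrt(2) * z1 * z2)  # phi(z)
    rhs = sum(a * b for a, b in zip(phi_x, phi_z))      # <phi(x) . phi(z)>
    print(lhs, rhs)                                     # both equal 1.0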
Kernel trick
Is it a kernel function?
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
k-Nearest Neighbor Classification (kNN)
kNN algorithm
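A minimal kNN sketch (Euclidean distance, majority vote); the toy data and the choice of k are hypothetical:

    import math
    from collections import Counter

    def knn_classify(query, data, k=3):
        """data: list of (feature_vector, label). Majority vote among the k
        training points closest to `query` (Euclidean distance)."""
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        neighbors = sorted(data, key=lambda ex: dist(query, ex[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    data = [((1, 1), "science"), ((1, 2), "science"), ((5, 5), "sports"), ((6, 5), "sports")]
    print(knn_classify((1.5, 1.0), data, k=3))  # "science"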
[Figure: a new data point to be classified; Pr(science | •) is estimated from its k nearest neighbors.]
Discussions
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging and Boosting
Summary
Combining classifiers
Bagging
Boosting
Bagging (Breiman, 1996)
Bagging (cont.)
Training
Testing
Bagging example
[Figure: the original training set and four bootstrap training sets (Training set 1-4), each drawn from the original by sampling with replacement.]
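A schematic sketch of bagging; train_classifier stands for a hypothetical base learner (e.g., a tree inducer) that returns a callable model:

    import random
    from collections import Counter

    def bagging(data, train_classifier, n_models=10):
        """Train n_models classifiers, each on a bootstrap sample of `data`
        (drawn with replacement, same size as the original training set)."""
        models = []
        for _ in range(n_models):
            sample = [random.choice(data) for _ in range(len(data))]
            models.append(train_classifier(sample))
        return models

    def bagged_predict(models, x):
        """Testing: majority vote of the individual classifiers."""
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]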
Bagging (cont.)
Boosting
A family of methods:
Training
Testing
AdaBoost
[Diagram: maintain a weighted training set (non-negative weights that sum to 1); in each round, build a classifier h_t whose accuracy on the weighted training set is > 1/2 (better than random), called a weak classifier; then change the weights.]
AdaBoost algorithm
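A sketch of the boosting loop in the standard discrete AdaBoost formulation (which may differ in detail from the exact algorithm on the slide); train_weak_learner is a hypothetical base learner that accepts example weights:

    import math

    def adaboost(data, train_weak_learner, rounds=10):
        """data: list of (x, y) with y in {-1, +1}. Returns a weighted ensemble."""
        n = len(data)
        w = [1.0 / n] * n                    # non-negative weights summing to 1
        ensemble = []                        # list of (alpha_t, h_t) pairs
        for _ in range(rounds):
            h = train_weak_learner(data, w)  # classifier built on the weighted set
            err = sum(wi for wi, (x, y) in zip(w, data) if h(x) != y)
            if err == 0 or err >= 0.5:       # must be better than random guessing
                break
            alpha = 0.5 * math.log((1 - err) / err)
            ensemble.append((alpha, h))
            # Increase the weights of misclassified examples, decrease the others.
            w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, data)]
            total = sum(w)
            w = [wi / total for wi in w]     # renormalize so the weights sum to 1
        return ensemble

    def boosted_predict(ensemble, x):
        """Sign of the alpha-weighted vote of the weak classifiers."""
        score = sum(alpha * h(x) for alpha, h in ensemble)
        return 1 if score >= 0 else -1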
[Figures: Bagged C4.5 vs. C4.5; Boosted C4.5 vs. C4.5; Boosting vs. Bagging.]
Road Map
Basic concepts
Decision tree induction
Evaluation of classifiers
Rule induction
Classification using association rules
Naïve Bayesian classification
Naïve Bayes for text classification
Support vector machines
K-nearest neighbor
Summary
Summary