Sigir03: Support Vector and Kernel Methods

SIGIR 2003 Tutorial
Support Vector and Kernel Methods
Thorsten Joachims
Cornell University
Computer Science Department
tj@cs.cornell.edu
http://www.joachims.org
0
Overview
What is an SVM?
• optimal hyperplane and soft-margin for inseparable data
• handling non-linear rules and non-standard data using kernels
How to use SVMs effectively and efficiently?

How to train SVMs?
• decomposition algorithms / primal vs. dual / shrinking
Why can SVMs learn?

• worst-case / average-case / relation to cross-validation
When do SVMs work well?

• properties of classification tasks - a case study in text classification
SVM-{ranking, novelty detection, regression, ...}?

• ranking e.g. learning retrieval functions
• novelty detection: e.g. topic detection
1
What I will not (really) talk about...
• SVMs in the transductive setting
[Vapnik, 1998][Joachims, 1999c][Bennet & Demiriz, 1999]
• Kernel Principal Component Analysis
[Schoelkopf et al., 1998]
• connection to related methods (i.e. Gaussian Process Classifiers,
Ridge Regression, Logistic Regression, Boosting)
[Cristianini & Shawe-Tylor, 2000][MacKay, 1997][Schoelkopf &
Smola, 2002]
Warning: At some points throughout this tutorial, precision is
sacrificed for better intuition (e.g. uniform convergence bounds for
SVMs).
2
Text Classification
E.D. And F. MAN TO BUY INTO HONG KONG

FIRM
The U.K. Based commodity house E.D. And F. Man
Ltd and Singapore’s Yeo Hiap Seng Ltd jointly
announced that Man will buy a substantial stake in
Yeo’s 71.1 pct held unit, Yeo Hiap Seng Enterprises
Ltd. Man will develop the locally listed soft drinks
manufacturer into a securities and commodities
brokerage arm and will rename the firm Man Pacific
(Holdings) Ltd.
About a corportate acquisition?

YES NO
3
Learning Text Classifiers
Real-World
Process
Training Set New Documents
Classifier
Learner
Goal:
• Learner uses training set to find classifier with low prediction error.
4
Representing Text as Attribute Vectors
Attributes: Words
0 baseball
3 specs
(Word-Stems)
From: xxx@sciences.sdsu.edu
Newsgroups: comp.graphics 0 graphics
Subject: Need specs on Apple QT
1 references
I need to get the specs, or at least a
0 hockey Values: Occurrence-
very verbose interpretation of the specs,
for QuickTime. Technical articles from 0 car Frequencies
magazines and references to books would
be nice, too.
0 clinton
.
I also need the specs in a fromat usable .
on a Unix or MS-Dos system. I can’t .
do much with the QuickTime stuff they
have on ...
1 unix
0 space
2 quicktime
0 computer
==> The ordering of words is ignored!
5
Paradoxon of Text Classification
30,000 10,000
Attributes Training Examples
... but this is not necessarily a problem!
Good News: SVMs can overcome this problem!

Bad News: This does not hold for all high-dimensional problems!
6
Experimental Results
Reuters Newswire WebKB Collection Ohsumed MeSH
• 90 categories • 4 categories • 20 categories
• 9603 training doc. • 4183 training doc. • 10000 training doc.
• 3299 test doc. • 226 test doc. • 10000 test doc.
• ~27000 features • ~38000 features • ~38000 features
microaveraged precision/recall
Reuters WebKB Ohsumed
breakeven-point [0..100]
Naive Bayes 72.3 82.0 62.4
Rocchio Algorithm 79.9 74.1 61.5
C4.5 Decision Tree 79.4 79.1 56.7
k-Nearest Neighbors 82.6 80.5 63.4
SVM 87.5 90.3 71.6

Table from [Joachims, 2002]
7
Part 1 (a): What is an SVM? (linear)
•prediction error vs. training error
•learning by empirical risk minimization
•VC-Dimension and learnability
•linear classification rules
•optimal hyperplane
•soft-margin separation
8
Generative vs. Discriminative Training
Process:
• Generator: Generates descriptions x according to distribution P ( x ) .
• Teacher: Assigns a value y to each description x based on P ( y x ) .
N
=> Training examples ( x 1, y1 ), …, ( x n, yn ) ∼ P ( x, y ) xi ∈ ℜ y i ∈ {1, – 1}
Generative Training Discriminative Training

• make assumptions about the • make assumptions about the
parametric form of P ( x, y ) . set H of classifiers
• estimate the parameters of • estimate error of classifiers in
P ( x, y ) from the training data H from the training data
• derive optimal classifier using • select classifier with lowest
Bayes’ rule error rate
• example: naive Bayes • example: SVM, decision tree
9
True (Prediction) Error
What is a “good” classification rule h ?
P(h( x) ≠ y) = ∫ ∆ ( h ( x ) ≠ y ) dP ( x, y ) = Err P ( h )
Loss function ∆:
• 1 if not equal
• 0 if equal
What is the “optimal” Learner L ?

Finds the classification rule h opt ∈ H for which Err P ( h ) is minimal:
h opt = arg min h ∈ H { Err P ( h ) }
Problem:
P ( x, y ) unknown. Known are training examples ( x 1, y 1 ), …, ( x n, y n ) .
10
Principle: Empirical Risk Minimization (ERM)
Learning Principle:
Find the decision rule h° ∈ H for which the training error is minimal:
h° = arg min h ∈ H { Err S ( h ) }
Training Error:
n
1
Err S ( h ) = ---
n ∑ ∆( yi ≠ h ( xi ) )
i=1
==> Number of misclassifications on training examples.
Central Problem of Statistical Learning Theory:

When does a low training error lead to a low generalization error?
11
When is it Possible to Learn?
Definition [Consistency]: ERM is consistent for
• a hypothesis space H and
• independent of the distribution P ( x, y )
if and only if the sequence

lim Err P ( h° ) = inf h ∈ H Err P ( h )
n→∞
lim Err S ( h° ) = inf h ∈ H Err P ( h )
n→∞
converges in probability.
<==> one-sided uniform convergence [Vapnik]

lim P { sup h ∈ H ( Err P ( h ) – Err S ( h ) ) > ε } = 0
n→∞
<==> VC-dimension of H is finite [Vapnik].
12
Vapnik/Chervonenkis Dimension
Definition: The VC-dimension of H is equal to the maximal number d
of examples that can be split into two sets in all 2d ways using
functions from H (shattering).
x1 x2 x3 ... xd
h1 + + + ... +
h2 - + + ... +
h3 + - + ... +
h4 - - + ... +
... ... ... ... ... ...
hN - - - ... -
13
Linear Classifiers
Rules of the Form: weight vector w , threshold b
 N
N 
h ( x ) = sign ∑

w i x i + b =  1 if ∑ wi xi + b > 0
 i=1
i=1 
 –1 else
Geometric Interpretation (Hyperplane):
14
Linear Classifiers (Example)
Text Classification: Physics (+1) versus Receipes (-1)
nuclear atom salt pepper water heat and

ID y
(x1) (x2) (x3) (x4) (x5) (x6) (x7)
D1 1 2 0 0 2 0 2 +1
D2 0 0 0 3 0 1 1 -1
D3 0 2 1 0 0 0 3 +1
D4 0 0 1 1 1 1 1 -1
w,b 2 3 -1 -3 -1 -1 0 b=1
7
D1: ∑ wi xi + b = [ 2 ⋅ 1 + 3 ⋅ 2 + ( –1 ) ⋅ 0 + ( –3 ) ⋅ 0 + ( –1 ) ⋅ 2 + ( –1 ) ⋅ 0 + 0 ⋅ 2]+1
i=1
7
D2: ∑ wi xi + b = [ 2 ⋅ 0 + 3 ⋅ 0 + ( –1 ) ⋅ 0 + ( –3 ) ⋅ 3 + ( –1 ) ⋅ 0 + ( –1 ) ⋅ 1 + 0 ⋅ 1]+1
i=1
15
VC-Dimension of Hyperplanes in ℜ2
• Three points in ℜ 2 can be shattered with hyperplanes.
• Four points cannot be shattered.
=> Hyperplanes in ℜ 2 -> VCdim=3

General: Hyperplanes in ℜ N -> VCdim=N+1
16
Rate of Convergence
Question: After n training examples, how close is the training error to
the true error?
With probablility η it hold for all h ∈ H :

Err P ( h ) – Err S ( h ) > Φ ( d, n, η )
η
d  ln 2n ------ + 1 – ln ---
1  d  4
Φ ( d, n ) = --
- 4 ---------------------------------------------
2 n
•n number of training examples

•d VC-dimension of hypothesis space H
==> Err P ( h ) ≤ Err S ( h ) + Φ ( d ,n, η )
17
SVM Motivation: Structural Risk Minimization
Err P ( h i ) ≤ Err S ( h i ) + Φ ( VCdim ( H ) ,n, η )
Idea: Structure on
hypothesis space.
h*
Goal: Minimize upper bound on

true error rate.
Err ( h )
P i
Φ ( VCdim ( H ) ,n, η )
Err ( h )
S i
opt VCdim(H)
18
Optimal Hyperplane (SVM Type 1)
Assumption: The training examples are linearly separable.
19
VC-Dimension of “thick” Hyperplanes
Lemma: The VCdim of hyperplanes 〈 w, b〉 with margin δ and
description vectors xi ≤ R is bounded by
2
VCdim ≤ R
------ + 1
2
δ
δ δ
The VC-dimension does not necessarily depend on the number of

attributes or the number of parameters!
20
Maximizing the Margin
The hyperplane with maximum margin

<~ (roughly, see later) ~>
The hypothesis space with minimal VC-dimension according to SRM
Support Vectors: Examples with minimal distance.
21
Computing the Optimal Hyperplane
N
Training Examples: ( x 1, y 1 ), …, ( x n, y n ) xi ∈ ℜ y i ∈ {1, – 1}
Requirement 1: Zero training error!

( y = –1 ) ⇒ [ w ⋅ xi + b ] < 0 
 ⇔ yi [ w ⋅ xi + b ] > 0
( y = 1 ) ⇒ [ w ⋅ xi + b ] > 0 

Distance δ of point x
Requirement 2: Maximum margin! from hyperplane
1 <w,b>:
maximize δ, with δ = min i ---------------
- [ w ⋅ xi + b ]
w⋅w 1
δ = ---------------- [ w ⋅ x + b ]
w⋅w
=> Requirement 1 & Requirement 2:
1 -
maximize δ, with ∀i ∈ [ 1…n ] yi  --------------- [ w ⋅ x i + b ] ≥ δ

w⋅w
22
Primal Optimization Problem
1
maximize δ, with ∀i ∈ [ 1…n ] yi  ---------------
- [ w ⋅ x i + b ] ≥ δ

w⋅w
1
Set ---------------
- = δ:
w⋅w 1 - 1 - 1
=> maximize --------------- , with ∀i ∈ [ 1…n ] y i  --------------- [ w ⋅ x i + b ] ≥ ----------------

w⋅w w⋅w w⋅w
Cancel:
1
=> maximize ---------------
- , with ∀i ∈ [ 1…n ] [ y i [ w ⋅ x i + b ] ≥ 1 ]
w⋅w
Minimize inverse and take square:

y1 [ w ⋅ x1 + b ] ≥ 1
=> minimize P ( w, b ) = 1--- w ⋅ w , with …
2
yn [ w ⋅ xn + b ] ≥ 1
23
Example: Optimal Hyperplane vs. Perceptron
Perceptron with eta=0.1
30
"perceptron_iter_trainerror.dat"
"perceptron_iter_testerror.dat"
25 hard_margin_svm_testerror.dat
Percent Training/Testing Errors
20
15
10
0
1 2 3 4 5 6 7 8 9 10
Iterations
Train on 1000 pos / 1000 neg examples for “acq” (Reuters-21578).
24
Non-Separable Training Samples
• Forsome training samples there is no separating hyperplane!
• Complete separation is suboptimal for many training samples!
=> minimize trade-off between margin and training error.
25
Soft-Margin Separation
Idea: Maximize margin and minimize training error simultanously.
Hard Margin: Soft Margin: n
minimize P ( w, b ) = 1--- w ⋅ w minimize P ( w, b, ξ ) = 1--- w ⋅ w + C ∑ ξ i

2 2
i=1
s. t. y i [ w ⋅ x i + b ] ≥ 1 s. t. yi [ w ⋅ x i + b ] ≥ 1 – ξ i and ξ i ≥ 0
Hard Margin
(separable)
ξi
δ Soft Margin
(training error) ξj
26
Controlling Soft-Margin Separation
n
Soft Margin: minimize P ( w, b, ξ ) = 1--- w ⋅ w + C ∑ ξ i

2
i=1
s. t. y i [ w ⋅ x i + b ] ≥ 1 – ξ i and ξ i ≥ 0
•
∑ ξi is an upper bound on the number of training errors.
•C is a parameter that controls trade-off between margin and error.
Large C
ξi
δ
ξj
Small C
27
Example Reuters “acq”: Varying C
4
"svm_trainerror.dat"
3.5 "svm_testerror.dat"
Percent Training/Testing Errors
2.5
1.5
0.5
hard-margin SVM
0
0.1 1 10
C
Observation: Typically no local optima, but not necessarily...
28
Part 1 (b): What is an SVM? (non-linear)
•quadraticprograms and duality
•properties of the dual
•non-linear classification rules
•kernels and their properties
•kernels for vectorial data
•kernels for non-vectorial data
29
Quadratic Program
 n  n n
1
minimize P(w) = –  k i w i + ---
∑ ∑∑ w i wj H ij
  2
n i = 1  i = 1j = 1 n
(1) (k)
s.t. ∑ wi gi ≤0 … ∑ wi gi ≤0
ni = 1 i n= 1
(1) (m)
∑ wi hi = 0 … ∑ wi hi = 0
i=1 i=1
•k linear inequality constraints

• m linear equality constraints n n
• Hessian H = H ( i, j ) is pos. semi-definite ∀α 1 …α n ; ∑ ∑ αi αj Hij ≥ 0

=> convex, no local optima i = 1j = 1
• α is feasible, if it fulfills constraints
30
Fermat Theorem
Given an unconstrained optimization problem
minimize P ( w )
with P ( w ) convex and differentiable, a necessary and sufficient

conditions for a point w° to be an optimum is that
δP ( w° ) = 0
------------------
δw
31
Lagrange Function
Given an optimization problem
minimize P ( w )
s.t. g 1 ( w ) ≤ 0 … gk ( w ) ≤ 0
h1 ( w ) = 0 … hm ( w ) = 0
the Lagrangian function is defined as

k m
L ( w, α, β ) = P ( w ) + ∑ αi gi ( w ) + ∑ βi hi ( w )
i=1 i=1
•α and β are called Lagrange Multipliers
32
Lagrange Theorem
minimize P ( w )
s.t. h 1 ( w ) = 0 … hm ( w ) = 0
with P ( w ) convex and differentiable and all h affine (w*x+b),

necessary and sufficient conditions for a point w° to be an optimum are
the existence of β° such that
m
δL ( w°, β° ) δL ( w°, β° )
--------------------------- = 0
δw
--------------------------- = 0
δβ
L ( w, β ) = P ( w ) + ∑ βi hi ( w )
i=1
=> L ( w°, β ) ≤ L ( w°, β° ) ≤ L ( w, β° )
33
Karush-Kuhn-Tucker Theorem
minimize P ( w )
s.t. g 1 ( w ) ≤ 0 … gk ( w ) ≤ 0
h1 ( w ) = 0 … hm ( w ) = 0
with P ( w ) convex and differentiable and all g and h affine, necessary

and sufficient conditions for a point w° to be an optimum are the
existence of α° and β° such that
δL ( w°, α°, β° ) δL ( w°, α°, β° )
------------------------------------- = 0 ------------------------------------- = 0
δw δβ
α i °g i ( w° ) = 0, i = 1, …, k
g i ( w° ) ≤ 0, i = 1, …, k
α i ° ≥ 0, i = 1, …, k
Sufficient for convex QP: max [ min L ( w, α, β ) ]

w
β, α ≥ 0
34
Dual Optimization Problem
n
Primal OP: minimize P ( w, b, ξ ) = 1--- w ⋅ w + C ∑ ξ i

2
i=1
s. t. yi [ w ⋅ x i + b ] ≥ 1 – ξ i and ξ i ≥ 0
Lemma: The solution w° can always

n
be written as a linear combination
w° = ∑ αi yi xi αi ≥ 0
of the training data.i = 1
 n  n n
1
Dual OP: maximize D(α) =  α i – ---
∑ ∑∑ α i αj y i y j ( x i ⋅ x j )
  2
n i = 1  i = 1j = 1
s.t. ∑ αi yi = 0 and 0 ≤ αi ≤ C
i=1
==> positive semi-definite quadratic program
35
Primal <=> Dual
Theorem: The primal OP and the dual OP have the same solution.
Given the solution α i ° of the dual OP,
n
1 pos neg
w° = ∑ α i°y i x i b° = --- ( w 0 ⋅ x + w 0 ⋅ x )
2
i=1
is the solution of the primal OP.
Theorem: For any set of feasible points P ( w, b ) ≥ D ( α ) .
=> two alternative ways to represent the learning result

• weight vector and threshold 〈 w, b〉
• vector of “influences” α 1, …, α n
36
Properties of the Soft-Margin Dual OP
 n  n n
Dual OP: maximize D ( α ) =  αi – 1---

∑ ∑∑
α i αj y i y j ( x i ⋅ x j )
2
n i = 1  i = 1j = 1
s. t. ∑ αi yi = 0 und 0 ≤ αi ≤ C
i=1
• typically single solution (i. e. 〈 w, b〉 is unique)

• one factor α i for each training example
• “influence” of single training example
limited by C
• 0 < α i < C <=> SV with ξ i = 0
ξi
• α i = C <=> SV with ξ i > 0
• α i = 0 else ξj
• based
exclusively on inner product
between training examples
37
Non-Linear Problems
==>
Problem:
• some tasks have non-linear structure
• no hyperplane is sufficiently accurate
How can SVMs learn non-linear classification rules?
38
Extending the Hypothesis Space
Idea: Input Space
Φ
Feature Space
==> Find hyperplane in feature space!
Example: a b c
Φ
a b c aa ab ac bb bc cc
==> The separating hyperplane in features space is a degree two

polynomial in input space.
39
Example
Input Space: x = ( x1, x 2 ) (2 Attributes)
Feature Space: Φ ( x ) = ( x 21, x 22, 2x1, 2x 2, 2x1 x2, 1 ) (6 Attributes)
40
Kernels
Problem: Very many Parameters! Polynomials of degree p over N
attributes in input space lead to O ( Np ) attributes in feature space!
Solution: [Boser et al., 1992] The dual OP need only inner products =>
Kernel Functions
K ( x i ,x j ) = Φ ( x i ) ⋅ Φ ( x j )
2 2
Example: For Φ ( x ) = ( x 1, x 2, 2x 1, 2x 2, 2x 1 x 2, 1 ) calculating
2
K ( x i ,x j ) = [ x i ⋅ x j + 1 ] = Φ ( x i ) ⋅ Φ ( x j )
gives inner product in feature space.
We do not need to represent the feature space explicitly!
41
SVM with Kernels
 n  n n
1
Training: maximize D(α) =  α i – ---
∑ ∑∑ α i αj y i y j K ( x i, x j )
  2
i = 1  i = 1j = 1
n
s. t. ∑ αi yi = 0 und 0 ≤ αi ≤ C
i=1
 
Classification: For new example x h ( x ) = sign 
 ∑ α i y i K ( x i ,x ) + b

x i ∈ SV
New hypotheses spaces through new Kernels:

Linear: K ( x i ,x j ) = xi ⋅ xj
d
Polynomial: K ( x i ,x j ) = [ xi ⋅ xj + 1 ]
2
Radial Basis Functions: K ( xi ,x j ) = exp ( – x i – xj ⁄ σ 2 )
Sigmoid: K ( xi ,x j ) = tanh ( γ ( x i – x j ) + c )
42
Example: SVM with Polynomial of Degree 2
2
Kernel: K ( x i ,x j ) = [ xi ⋅ xj + 1 ]
plot by Bell SVM applet
43
Example: SVM with RBF-Kernel
2
Kernel: K ( xi ,xj ) = exp ( – xi – xj ⁄ σ 2 ) plot by Bell SVM applet
44
What is a Valid Kernel?
Mercer’s Theorem (see [Cristianini & Shawe-Taylor, 2000])
Theorem [Saitoh]: Let X be a finite input space of n points ( x 1 ,…, xn ) .
A function K ( xi ,x j ) is a valid kernel in X iff it produces a Gram matrix
G ij = K ( x i ,x j )
that is symmetric
T
G = G
and positive semi-definite

n n
 T 
∀α  αGα = ∑∑ α i α j K ( x i ,x j ) ≥ 0
 
 i = 1j = 1 
45
How to Construct Valid Kernels?
Theorem: Let K1 and K2 be valid Kernels over X × X , X ⊆ ℜ N , a ≥ 0 ,
m
0 ≤ λ ≤ 1 , f a real-valued function on X , φ ;X → ℜ with K 3 a kernel over
m m
ℜ × ℜ , and K a summetric positive semi-definite matrix. Then the
following functions are valid Kernels
K ( x, z ) = λK 1 ( x, z ) + ( 1 – λ )K 2 ( x, z )
K ( x, z ) = aK 1 ( x, z )
K ( x, z ) = K 1 ( x, z )K 2 ( x, z )
K ( x, z ) = f ( x )f ( z )
K ( x, z ) = K 3 ( φ ( x ), φ ( z ) )
T
K ( x, z ) = x Kz
=> Construct complex Kernels from simple Kernels.
46
Kernels for Non-Vectorial Data
Kernels for Sequences: Two sequences are similar, if the have many
common and consecutive subsequences.
Example [Lodhi et al., 2002]: For 0 ≤ λ ≤ 1 consider the following
features space
c-a c-t a-t b-a b-t c-r a-r b-r

2 3 2
φ ( cat ) λ λ λ 0 0 0 0 0
2 3 2
φ ( car ) λ 0 0 0 0 λ λ 0
2 2 3
φ ( bat ) 0 0 λ λ λ 0 0 0
2 2 3
φ ( bar ) 0 0 0 λ 0 0 λ λ
=> K ( car, cat ) = λ 4 , efficient computation via dynamic programming.

=> Fisher Kernels [Jaakkola & Haussler, 1998]
47
Computing String Kernel (I)
Definitions:
n
• Σ : sequences of length n over alphabet Σ
• i = ( i 1, …, i n ) : index sequence (sorted)
• s ( i ) : substring operator
• r ( i ) = i n – i 1 + 1 : range of index sequence
Kernel: Average range of common subsequences of length n

in + jn – i1 – j1 + 2
K n ( s, t ) = ∑ ∑ ∑ λ
n
u ∈ Σ i ;u = s ( i ) j ;u = s ( j )
Auxiliary Function: Average range to end of sequence of common

subsequences of length n
s + t – i1 – j1 + 2
K d ′ ( s, t ) = ∑ ∑ ∑ λ
n
u ∈ Σ i ;u = s ( i ) j ;u = s ( j )
48
Computing String Kernel (II)
Kernel:
K n ( s, t ) = 0 if ( min ( s, t ) < n )
2
K n ( sx, t ) = K n ( s, t ) + ∑ K′ n – 1 ( s, t [ 1…j – 1 ] )λ
j ;t j = x
Auxiliary:
K′ 0 ( s, t ) = 1
K′ d ( s, t ) = 0 if ( min ( s, t ) < d )
t –j+2
K′ d ( sx, t ) = λK′ d ( s, t ) + ∑ K′ d – 1 ( s, t [ 1…j – 1 ] )λ
j ;t j = x
49
Other Kernels for Complex Data
General information on Kernels:
• Introduction to Kernels [Cristianini & Shawe-Taylor, 2000]
• All the details on Kernels + Background [Schoelkopf & Smola, 2002]
Kernels for specific structures:

• Diffusion Kernels for graphs [Kondor & Lafferty, 2002]
• Kernels for grammars [Collins & Duffy, 2002]
• Kernels for trees, lists, etc. [Gaertner et al., 2002]
50
Two Reasons for Using a Kernel
(1) Turn a linear learner into a non-linear learner

(e.g. RBF, polynomial, sigmoid)
(2) Make non-vectorial data accessible to learner

(e.g. string kernels for sequences)
51
Summary
What is an SVM?
Given:
N
• Training examples ( x 1, y 1 ), …, ( x n, y n ) x i ∈ ℜ y i ∈ {1, – 1}
• Hypothesis space according to kernel K ( x i ,x j )
• Parameter C for trading-off training error and margin size
Training:
• Finds hyperplane in feature space generated by kernel.
• The hyperplane has maximum margin in feature space with minimal
training error (upper bound ∑ ξ i ) given C.
• The result of training are α 1, …, α n . They determine 〈 w, b〉 .
 
Classification: For new example h ( x ) = sign 
 ∑ α i y i K ( x i ,x ) + b

x i ∈ SV
52
Part 2: How to use an SVM effectively and
efficiently?
•normalization of the input vectors

•selecting C
•handling unbalanced datasets
•selecting a kernel
•multi-class and multi-label classification
• selecting a training algorithm
53
Design Decisions in Working with SVMs
Setting up the learning task
•multi-class problems
•multi-label problems
Representation of the data (efficiency and effectiveness)

•selecting features
•selecting feature values
•normalizing the data (directional vs. non-directional data)
•selecting a kernel
Selecting a good value for the parameter C and kernel parameters

Selecting a training algorithm that is efficient for the particular QP
•kernel SVM vs. linear SVM
•many sparse features vs. few dense features
54
Handling Multi-Class / Multi-Label Problems
Standard classification SVM addresses binary problems y ∈ { 1, – 1 }
Multi-class classification: y ∈ { 1, …, k }
• one-against-rest decomposition into k binary problems

=  1 if ( y = i )
(i) (i)
• learn one binary SVM h per class with y
(i)  – 1 else
• assign new example to y = arg max [ h ( x ) ]
• pairwise decomposition into k ( k – 1 ) binary problems

=  1 if ( y = i )
(i) ( i, j )
• learn one binary SVM h per class pair y
 – 1 if ( y = j )
• assign new example by majority vote
• reducing number of classifications [Platt et al., 2000]
• multi-class SVM [Weston & Watkins, 1998]
• multi-class SVM via ranking [Crammer & Singer, 2001]
Multi-label classification: y ⊆ { 1, …, k }

=  1 if ( i ∈ y )
(i) (i)
• learn one binary SVM h per class with y
 –1 else
55
Which Features to Choose?
Things to take into consideration:
• if features sparse, then dimensionality of space no efficiency problem
• computations based on inner product between vectors
• consider frequency distribution of features (e.g. many rare features)
• Zipf distribution of words
• see TCat-model
• SVMs can handle redundancy in features
• bag-of-words representation redundant for topic classification
• see TCat-model
• as few irrelevant features as possible
• stopword removal often helps in text classification
• see TCat-model
56
How to Assign Feature Values?
• importance of feature is monotonic in its absolute value
• the larger the absolute value, the more influence the feature gets
• typical problem: number of doors [0-5], price [0-100000]
• want relevant features large / irrelevant features low (e.g. IDF)
• normalization to make features equally important
x – mean ( X )
• by mean and variance: x norm = -------------------------------
var ( X )
• by other distribution
• normalization to bring feature vectors onto the same scale
• directional data: text classification
x
• by normalizing the length of the vector x norm = -------- according to
some norm x
• changes whether a problem is (linearly) separable or not

• scale all vectors to a length that allows numerically stable training
57
Selecting a Kernel
• kernel can be thought of as a similarity measure
• examples in the same class should have high kernel value
• examples in different classes should have low kernel value
• ideal kernel: equivalence relation K ( x i ,x j ) = sign ( y i y j )
• normalization also applies to kernel
• relative weight for implicit features
• normalize per example for directional data
K ( x i ,x j )
K ( x i ,x j ) = ----------------------------------------------
-
K ( x i ,x i ) K ( x j ,x j )
• potential problems with large numbers, for example polynomial
d
kernel K ( xi ,x j ) = [ x i ⋅ x j + 1 ] for large d
58
Selecting Regularization Parameter C
Common Method
1
• a reasonable starting point and/or default value is C def = ---------------------------
• search for C on a log-scale, for example ∑ K ( xi ,xi )
–4 4
C ∈ [ 10 C def, …, 10 C def ]
• selectionvia cross-validation or via approximation of leave-one-out

[Jaakkola&Haussler,1999][Vapnik&Chapelle,2000][Joachims,2000]
Note
• optimal value of C scales with the feature values
• implicit slack variables via infrequent features
• if every example has one unique feature x i , then always separable
• unique features x i act like squared slack variable
n
minimize P ( w, b, ξ ) = 1--- w ⋅ w + 1--- ∑ w2i s. t. y i [ w ⋅ x i + b ] ≥ 1 – wi xi

2 2
i=1
59
Selecting Kernel Parameters
Problem
• results often very sensitive to kernel parameters (e.g. variance γ in
RBF kernel)
• need to simultaneously optimize C, since optimal C typically depends
on kernel parameters
Common Method
• search for combination of parameters via exhaustive search
• selection of kernel parameters typically via cross-validation
Advanced Approach
• avoiding exhaustive search for improved search efficiency [Chapelle
et al, 2002]
60
Handling Unbalanced Datasets
Problem
• often the number of positive examples is much lower than the number
of negative examples
• SVM minimizes error rate
=> always say “no” gives great error rate, but poor recall
Common Methods
• cost model that makes errors on positive examples more expensive
min P ( w, b, ξ ) = 1--- w ⋅ w + JC ∑ξ i + C ∑ ξ i s.t. y i [ w ⋅ x i + b ] ≥ 1– ξ i and ξ i ≥ 0

2
yi = 1 yi = –1
• change threshold b after training to some higher value b′

 
h ( x ) = sign 
 ∑ α i y i K ( x i ,x ) + b′

x i ∈ SV
61
Selecting an SVM Training Algorithm
SVMlight (also SVMtorch, mySVM, BSVM, etc.) [Joachims, 1999b]
• solve dual QP to obtain hyperplane from α -coefficients
• iteratively decompose large QP into a sequence of small QPs
• handles kernels and treats linear SVM as special case
SMO [Platt, 1999]

• special case of working sets of size two
• simple analytical solution of QP subproblems
ASVM [Mangasarian & Musicant, 2000]

• restricted to linear SVMs with quadratic loss
• fast for low dimensional data
Nearest Point Algorithm [Keerthi et al., 1999]

• restricted to quadratic loss
• compute distance between convex hulls
62
Part 3: How to Train SVMs?
•efficiency
of primal vs. dual
•decomposition algorithm
•working set selection
•optimality criteria
•caching
•shrinking
63
How can One Train SVMs Efficiently?
Solve one of the following quadratic optimization problems:
• n + N + 1 variables
n
• n linear inequality
1---
min P ( w, b, ξ ) = w ⋅ w + C ∑ ξ i constraints
2
i=1 • no direct use of kernels
• size scales O ( nN )
<= DUAL =>

 n  n n •n variables
1
max D(α) =  α i – ---
∑ ∑∑ α i αj y i y j K ( x i, x j ) • 1 linear equality, 2n box
  2
n i = 1  i = 1j = 1 constraints
• use of kernels natural
s. t. ∑ αi yi = 0 and 0 ≤ αi ≤ C
2
i=1 • size scales O ( n )
=> positive semi-definite quadratic program with n variables

64
Decomposition
Idea: Solve small subproblems until convergence (Osuna, et al.)!
T T
1 α1 α1 k 11 k 12 k 13 k 14 k 15 k 16 k 17 α 1
1 α2 α2 k 21 k 22 k 23 k 24 k 25 k 26 k 27 α 2
1 α3 α3 k 31 k 32 k 33 k 34 k 35 k 36 k 37 α 3
1
max 1 α 4 – --- α 4 k 41 k 42 k 43 k 44 k 45 k 46 k 47 α 4
2
1 α5 α5 k 51 k 52 k 53 k 54 k 55 k 56 k 57 α 5
1 α6 α6 k 61 k 62 k 63 k 64 k 65 k 66 k 67 α 6
1 α7 α7 k 71 k 73 k 73 k 74 k 75 k 76 k 77 α 7
65
Decomposition
Idea: Solve small subproblems until convergence (Osuna, et al.)!
T T
1 α1 α1 k 11 k 12 k 13 k 14 k 15 k 16 k 17 α 1
1 α2 α2 k 21 k 22 k 23 k 24 k 25 k 26 k 27 α 2
1 α3 α3 k 31 k 32 k 33 k 34 k 35 k 36 k 37 α 3
1
max 1 α 4 – --- α 4 k 41 k 42 k 43 k 44 k 45 k 46 k 47 α 4
2
1 α5 α5 k 51 k 52 k 53 k 54 k 55 k 56 k 57 α 5
1 α6 α6 k 61 k 62 k 63 k 64 k 65 k 66 k 67 α 6
1 α7 α7 k 71 k 73 k 73 k 74 k 75 k 76 k 77 α 7
Time complexity: working set of size 2 ≤ q ≤ 100 and f nonzero features:

2
• extracting subproblem: O ( q f )
3
• solving subproblem: O ( q )
• updating large problem with result of subproblem: O ( nqf )
66
What Working Set to Select Next?
Solution: Select subproblem with q variables that minimizes
T
V(d) = g(α) d
yT d = 0
d i ≥ 0, if ( α i = 0 )
subject to d i ≤ 0, if ( α i = C )
–1 ≤ d ≤ 1
{ di ≠ 0 } = q
Efficiency: Selection linear in number of examples.

Convergence: Proofs by Chi-Chen Lin / Keerthi under mild
assumptions.
67
How to Tell that we Found the Optimal Solution?
Karush-Kuhn-Tucker conditions lead to the following criterion:
 n  n n
1
maximize D(α) =  α i – ---
∑ ∑∑ α i αj y i y j ( x i ⋅ x j )
  2
n i = 1  i = 1j = 1 is optimal
s. t. ∑ αi yi = 0 and 0 ≤ αi ≤ C
i=1
<=>
n

 (α = 0) ⇒ y
 i i ∑
αj y j K ( x i, x j ) + b ≥ 1
 j=1
 n

∀i  ( 0 < α i < C ) ⇒ y i

∑
αj y j K ( x i, x j ) + b = 1
 n
j=1

 (α = C) ⇒ y


i i ∑
αj y j K ( x i, x j ) + b ≤ 1
j=1
68
Demo
The Steps of Solving a 2-d Problem.
69
Caching
Observation: Most CPU-time is spent on computing the Hessian!
Idea: Cache kernel evaluations.
300
250
200
frequency
150
100
50
0
1 10 100 1000 10000
rank by frequency (logscale)
Result: A small cache leads to a large improvement.
70
Shrinking
Idea: If we knew the set of SVs, we could solve a smaller problem!
(complexity per iteration from O ( nqf ) to O ( sqf ) )
Algorithm: • monitor the KKT-conditions in each iteration
• if a variable is “stuck at bound”, remove it
• do final optimality check
71
Summary
How can One Train SVMs Efficiently?
SVMlight (also SVMtorch, mySVM, BSVM, etc.)
• solve dual QP to obtain hyperplane from α -coefficients
• iteratively decompose large QP into a sequence of small QPs
• select working set according to steepest feasible descent criterion
• check optimality using Karush-Kuhn-Tucker conditions
Other training algorithms:

• SMO requires working set of size two => simple analytical solution
of QP subproblems [Platt, 1999]
• ASVM restricted to linear SVMs with quadratic loss => fast for low
dimensional data [Mangasarian & Musicant, 2000]
• Nearest Point Algorithm restricted to quadratic loss => compute
distance between convex hulls [Keerthi et al., 1999]
72
Part 3: Why do SVMs Work?
•worst-case bounds
•bounds on the expected generalization error
•leave-one-out estimation
•necessary criteria for leave-one-out
73
...classifies as well as possible!?
What is a “good” classification rule h ?
P(h(x) ≠ y) = ∫ ∆ ( h ( x ) ≠ y ) dP ( x, y ) = Err P ( h )
What is a “good” learner L ?

“Worst-Case” Learner:
P ( Err P ( h L ) > ε ) < η
“Average-Case” Learner:
E ( Err P ( h L ) ) = ∫ ErrP ( hL ) dP ( x1, y1 )…P ( xn, yn )
74
SVMs as Worst-Case Learner
Goal: Guarantee of the form
P ( Err P ( h L ) > ε ) < η
  R2 
Theorem: P ErrP ( h ) –ErrS ( h ) ≥ Φ  -----2-, n, η < η [Shawe-Taylor et al,1996]
 δ 
So, if
• the training error Err S ( h ) on sample S is low and
• the margin δ is large,
then with probablility η the SVM will output a classification rule with
true error  R2 
Err P ( h i ) ≤ Err S ( h i ) + Φ  2 ,n, η .
-----
-
δ 
Problem: For most practical problems this bound is vacuous,
i.e. ErrP ( h i ) ≤ 1 .
75
SVMs as Average-Case Learner
Theorem: The expected error of an SVM is bounded by
n
R 2  
2 
2E  -----2- + 2CR E ∑ ξ i
δ   
n = 1  1
E ( Err P ( h SVM ) ) ≤ ----------------------------------------------------------------- C ≥ --------2-
n 2R
n
R 2  
2  ξ i
2E  -----2- + 2 ( CR + 1 )E
δ   ∑ 
n = 1  1
E ( Err P ( h SVM ) ) ≤ ------------------------------------------------------------------------------- C < --------2-
n 2R
n
R  2  
with E  -----2- the expected soft margin and E  ∑ ξ i the expected training
δ   
n = 1 
error bound [Joachims, 2001] [Vapnik, 1998].
Problem: The expectations are unknown.
76
Leave-One-Out
Training set: ( x1, y 1 ), ( x2, y 2 ), ( x 3, y 3 ), …, ( xn, y n )
Approach: Repeatedly leave one example out for testing.
train on test on
( x 2, y 2 ), ( x 3, y 3 ), ( x 4, y 4 ), …, ( x n, y n ) ( x 1, y 1 )
( x 1, y 1 ), ( x 3, y 3 ), ( x 4, y 4 ), …, ( x n, y n ) ( x 2, y 2 ) => Error estimate:

n
( x 1, y 1 ), ( x 2, y 2 ), ( x 4, y 4 ), …, ( x n, y n ) ( x 3, y 3 ) 1
Errloo( h ) = --- ∑ h i ( x i )= y i
n
... ... i=1
(x 1, y 1), (x 2, y 2), (x 3, y 3), …, (x n–1,y n–1) ( x n, y n )
Question: Is there a connection between margin and the estimate?
77
Necessary Cond. for Leave-One-Out Error of SVM
Lemma: SVM h i ( xi ) ≠ y i ⇒ 2αi R2 + ξ i ≥ 1 [Joachims, 2000] [Jaakkola
& Haussler, 1999] [Vapnik
& Chapelle, 2000]
Input: Example:
• α i dual variable of example i
• ξ i slack variable of example i ραi R 2+ ξ i leave-one-out error
• x ≤ R bound on length 0.0 OK

0.7 OK
=>
3.5 ERROR
Available after training SVM 0.1 OK
on the full training data 1.3 OK
0.0 OK
0.0 OK
... ...
78
Case 1: Example is no SV
2
( α i = 0 ) ⇒ ( ξ i = 0 ) ⇒ ( 2αi R + ξ i < 1 ) ⇒ no leave – one – out error
79
Case 2: Example is SV with Low Influence
αj αi
αk
 0.5  2
 αi < ------
- < C ⇒ ( ξ = 0 ) ⇒ ( 2αi R + ξ < 1 ) ⇒ no leave – one – out error
i i
 2 
R
80
Case 2: Example is SV with Low Influence
 0.5  2
 αi < ------
- < C ⇒ ( ξ = 0 ) ⇒ ( 2αi R + ξ < 1 ) ⇒ no leave – one – out error
i i
 2 
R
81
Case 3: Example has Small Training Error
2 2
( α i = C ) ∧ ( ξ i < 1 – 2CR ) ⇒ ( 2αi R + ξ i < 1 ) ⇒ no leave – one – out error
α i =C
ξi
82
Case 3: Example has Small Training Error
2 2
( α i = C ) ∧ ( ξ i < 1 – 2CR ) ⇒ ( 2αi R + ξ i < 1 ) ⇒ no leave – one – out error
83
Experiment: Reuters-21578
• 6451 training examples
• 6451 test examples for holdout testing
• ~27,000 features
Average error estimate over 10 random training/test splits:

5
4.5 AvgEstimatedError
AvgHoldoutError
4 Default
3.5
3
Error
2.5
1.5
0.5
0
earn acq money−fx grain crude trade interest ship wheat corn
=> small bias, variance of estimators is approximately equal
84
Fast Leave-One-Out Estimation for SVMs
Lemma: Training errors are always leave-out-out errors.
Algorithm: • ( R, α, ξ )= train_SVM(X,0,0);
• for all training examples, do
• if ξ i > 1 then loo++;
2
• else if ( ρα i R + ξ i < 1 ) then loo=loo;
• else train_SVM( X i, α, ξ );
Experiment:
Training Retraining Steps (%) CPU-Time (sec)
Examples ρ = 1 ρ = 2 ρ = 1 ρ = 2
Reuters 6451 0.20% 0.58% 11.1 32.3
WebKB 2092 6.78% 20.42% 78.5 235.4
Ohsumed 10000 1.07% 2.56% 433.0 1132.3
85
Estimated Error of SVM
n
Leave-One-Out Error Estimate: Errloo( h ) = 1--- ∑ h i ( x i ) = yi

n
i=1
For general SVMs:

2
h i ( x i ) ≠ y i ⇒ 2α i R + ξ i ≥ 1 x ≤R
n
 
=> Errloo( h ) ≤ 1---  i 2α R2 + ξ ≥ 1  ≤ 1--- ∑ 2αi R2 + ξ i
n i i
 n
i=1
For separable problems:

2
hi ( xi ) ≠ yi ⇒ αi R ≥ 1 x ≤R
n
2
1  1 2 R
=> Errloo( h ) ≤ ---  i
n α i R ≥ 1  ≤ --n-
2

∑ α i R ≤ -----2-
δ
i=1
86
Summary
Why do SVMs Work?
If
• the training error Err S ( h ) (on the sample S / on average) is low and
• the margin δ/R (on the sample S / on average) is large
then
• the SVM has learned a classification rule with low error rate with
high probablility (worst-case).
• the SVM learns classification rules that have low error rate on
average.
• the SVM has learned a classification rule for which the (leave-one-
out) estimated error rate is low.
87
Part 4: When do SVMs Work Well?
Successful Use:
•Optical Character Recognition (OCR) [Vapnik, 1998]
•Face Recognition, etc. [Osuna et al., 1997]
•Text Classification [Joachims, 1997] [Dumais et al., 1998]
•...
Open Questions:
What characterizes these problems?
How can the good performance be explained?
What are “sufficient conditions” for using (linear) SVMs successfully?
88
Real-World
Process
Classifier
Learner
Goal:
89
Real-World
Process
Classifier
Learner
Goal:
Obstacle:
• No Free Lunch: There is no learner that does well on every task.
90
Learning Text Classifiers Successfully
Real-World
Process
Classifier
Learner
The learner produces a classifier with low error rate

<=>
The properties of the learner fit the properties of the process.
91
Learning SVM Text Classifiers Successfully
Real-World
Process
Classifier
Learner
SVM
The learner produces a classifier with low error rate
<=>
SVM
The properties of the learner fit the properties of the process.
92
Representing Text As Feature Vectors
Features: words
0 baseball
3 specs
(wordstems)
From: xxx@sciences.sdsu.edu
Newsgroups: comp.graphics 0 graphics
Subject: Need specs on Apple QT
1 references
I need to get the specs, or at least a
0 hockey Values: occurrence
very verbose interpretation of the specs,
for QuickTime. Technical articles from 0 car frequency
magazines and references to books would
be nice, too.
0 clinton
.
I also need the specs in a fromat usable .
on a Unix or MS-Dos system. I can’t .
do much with the QuickTime stuff they
have on ...
1 unix
0 space
2 quicktime
0 computer
==> Ignore ordering of words.
93
Paradox of Text Classification
30,000 10,000
attributes training examples
94
Experimental Results
Reuters Newswire WebKB Collection Ohsumed MeSH
• 90 categories • 4 categories • 20 categories
• 9603 training doc. • 4183 training doc. • 10000 training doc.
• 3299 test doc. • 226 test doc. • 10000 test doc.
• ~27000 features • ~38000 features • ~38000 features
microaveraged precision/recall
Reuters WebKB Ohsumed
breakeven-point [0..100]
Naive Bayes 72.3 82.0 62.4
Rocchio Algorithm 79.9 74.1 61.5
C4.5 Decision Tree 79.4 79.1 56.7
k-Nearest Neighbors 82.6 80.5 63.4
SVM 87.5 90.3 71.6

[Joachims, 2002]
95
Why Do SVMs Work Well for Text Classification?
A statistical learning model of text classification with SVMs:
text 5 low low

2 2
classification properties R ⁄δ E(Errn( h SVM ))
task => and of
model ∑ ξi SVM
96
Margin/Loss Based Bound on the Expected Error
Theorem: The expected error of a soft margin SVM is bounded by
n+1
R 2  
2 
ρE  -----2- + ρCR E ∑ ξ i
δ   
n = 1  1
E ( Err n ( h SVM ) ) ≤ ----------------------------------------------------------------- C ≥ ---------2
n+1 ρR
n+1
R 2  
ρE  -----2- + ρ ( CR + 1 )E  ξ i
2
δ   ∑ 
n = 1  1
E ( Err n ( h SVM ) ) ≤ ------------------------------------------------------------------------------- C < ---------2
n+1 ρR
n+1
R 2  
Where E  -----2- is the expected soft margin and E  ξ i
 ∑ is the
δ  n = 1 
expected training loss on training sets of size n + 1 .
97
First Step Completed
text
classification
? 5
properties
=>
low
2
R ⁄δ
and
2
low
E(Errn( h SVM ))
of
task
model ∑ ξi SVM
98
Properties 1+2: Sparse Examples in High Dimension
• High dimensional feature vectors (30,000 features)
• Sparse document vectors: only a few words of the whole language
occur in each document
Number
Training Distinct Words
of
Examples (Sparsity)
Features
Reuters 9,603 27,658 74
Newswire Articles (0.27%)
Ohsumed 10,000 38,679 100
MeSH Abstracts (0.26%)
WebKB 3,957 38,359 130
WWW-Pages (0.34%)
99
Property 3: Heterogeneous Use Of Words
MODULAIRE BUYS BOISE HOMES PROPERTY USX, CONS. NATURAL END TALKS
Modulaire Industries said it acquired the design USX Corp’s Texas Oil and Gas Corp subsidiary and
library and manufacturing rights of privately-owned Consolidated Natural Gas Co have mutually agreed
Boise Homes for an undisclosed amount of cash. not to pursue further their talks on Consolidated’s
Boise Homes sold commercial and residential
possible purchase of Apollo Gas Co from Texas
prefabricated structures, Modulaire said.
Oil. No details were given.
JUSTICE ASKS U.S. DISMISSAL OF TWA
FILING E.D. And F. MAN TO BUY INTO HONG KONG
The Justice Department told the Transportation FIRM
Department it supported a request by USAir Group The U.K. Based commodity house E.D. And F.
that the DOT dismiss an application by Trans World Man Ltd and Singapore’s Yeo Hiap Seng Ltd jointly
Airlines Inc for approval to take control of USAir. announced that Man will buy a substantial stake in
``Our rationale is that we reviewed the application Yeo’s 71.1 pct held unit, Yeo Hiap Seng Enterprises
for control filed by TWA with the DOT and Ltd. Man will develop the locally listed soft drinks
ascertained that it did not contain sufficient manufacturer into a securities and commodities
information upon which to base a competitive brokerage arm and will rename the firm Man
review,’’ James Weiss, an official in Justice’s Pacific (Holdings) Ltd.
Antitrust Division, told Reuters.
No pair of documents shares any words, but “it”, “the”, “and”, “of”,
“for”, “an”, “a”, “not”, “that”, “in”.
100
Property 4: High Level Of Redundancy
=> Few features are irrelevant!
101
Property 5: “Zipf’s Law”
k
Zipf’s Law: In text, the i-th frequent word occurs f i = ------------------
Θ
- times.
(c + i)
=> Most words occur very infrequently!
102
Text Classification Model
Definition: For the TCat-concept there are s disjoint sets of features.
TCat ( [ p 1 n 1 f 1 ], …, [ p s n s f s ] )
Each positive (negative) example contains p i ( n i ) occurrences from the

f i features in set i .
[20 20 100]
 
 [ 4 1 200 ] 
 [ 1 4 200 ] 
 
Example: TCat  [ 5 5 600 ] 
 
[ 9 1 3000 ]
 
[ 1 9 3000 ]
 
[10 10 4000]
103
TCat-Concept for WebKB “Course”
 [ 77 29 98 ], [ 4 21 52 ]  high frequency
 [ 16 2 431 ], [ 1 12 341 ] 
medium frequency
TCat  
 [ 9 1 5045 ], [ 1 21 24276 ] low frequency
 
 [ 169 191 8116 ]  rest
104
Real Text Classification Tasks as TCat-Concepts
 [ 33 2 65 ], [ 32 65 152 ]  high frequency
 [ 2 1 171 ], [ 3 21 974 ] 
medium frequency
Reuters “Earn”: TCat  
 [ 3 1 3455 ], [ 1 10 17020 ] low frequency
 
 [ 78 52 5821 ]  rest
 [ 77 29 98 ], [ 4 21 52 ]  high frequency
 [ 16 2 431 ], [ 1 12 341 ] 
medium frequency
Webkb “Course”: TCat  
 [ 9 1 5045 ], [ 1 21 24276 ] low frequency
 
 [ 169 191 8116 ]  rest
 [ 2 1 10 ], [ 1 4 22 ]  high frequency
 [ 2 1 92 ], [ 1 2 94 ]  medium frequency
Ohsumed “Pathology”: TCat 
[5 1 4080 ], [ 1 10 20922 ] low frequency
 
 [ 197 190 13459 ]  rest
105
Second Step Completed
text
classification
5
properties
=>
? low
2
R ⁄δ
and
2
low
E(Errn( h SVM ))
of
task
model ∑ ξi SVM
106
2
The Margin δ of TCat-Concepts
Lemma 1: For TCat ( [ p 1 n 1 f 1 ], …, [ p s n s f s ] ) -concepts there is
always a hyperplane passing through the origin with margin δ 2 at least
s
p 2i
a = ∑ -----
fi
i=1
s
2 n 2i
2 ad – b
δ ≥ -------------------------
a + 2b + d
with d = ∑ -----
fi
i=1
s
ni pi
b = ∑ ---------
fi
i=1
Example: The previous example WebKB “course” has a margin of at

least
2
δ ≥ 0.23
107
2
The Length R of Document Vectors
Lemma 2: If the ranked term frequencies f i in a document with l
words have the form of the generalized Zipf’s Law
k
f i = -------------------
Θ
(c + i)
based on their frequency rank i , then the Euclidean length of the

document vector x is bounded by
d d
k - 2
 ------------------ k - = l
x ≤ ∑ 
(c + i)
Θ
with ∑ ------------------
(c + i)
Θ
i=1 i=1
Example: For WebKB “course” with

470000 -
f i = -----------------------
1.25
(5 + i)
follows that R 2 ≤ 1900 .
108
2 2
R , δ , and ∑ ξi for Text Classification
Reuters Newswire Stories
n+1
• 10 most frequent categories R  2  
2 
• 9603 training examples E  ------ +CR E ∑ ξ i
 δ2  
 
• 27658 attributes i = 1
E ( Err P ( h SVM ) ) ≤ -----------------------------------------------------------
n+1
∑ ∑ ξi
2 2 2 2
R ⁄δ ξi R ⁄δ
earn 1143 0 trade 869 9
acq 1848 0 interest 2082 33
money-fx 1489 27 ship 458 0
grain 585 0 wheat 405 2
crude 810 4 corn 378 0
109
Learnability of TCat-Concepts
Theorem: For TCat ( [ p 1 n 1 f 1 ], …, [ p s n s f s ] ) -concepts and
documents with l words that follow the generalized Zipf’s Law
Θ
fi = k ⁄ ( c + i ) the expected generalization error of an unbiased
SVM after training on n examples is bounded by
s
p 2i
a = ∑ -----
fi
i=1
s
n 2i
2 2
d = ∑ -----
fi
R ad – b i=1
E ( Err n ( h SVM ) ) ≤ ------------ ------------------------- with s
n + 1 a + 2b + d ni pi
b = ∑ ---------
fi
i=1
d
2 k - 2
 ------------------
R ≤ ∑ 
(c + i)
Θ
i=1
110
Comparison Theory vs. Experiments
Predicted
Error Rate in
Learning Curve Bound Bound on
Experiment
Error Rate
Reuters 138
“earn” E ( Err n ( h SVM ) ) ≤ ------------ 1.5% 1.3%
n+1
WebKB 443
“course” E ( Err n ( h SVM ) ) ≤ ------------ 11.2% 4.4%
n+1
Ohsumed 9457
“pathology” E ( Err n ( h SVM ) ) ≤ ------------ 94.6% 23.1%
n+1
• Model can differentiate between “difficult” and “easy” tasks

• Predicts and reproduces the effect of information retrieval heuristics
(e.g. TFIDF-weighting)
111
Sensitivity Analysis
What makes a text classification problem suitable for a linear SVM?
High Redundancy: [ 40 40 50 ] high frequency
 
TCat  [ 25 5 1000 ], [ 5 25 1000 ] medium frequency
 
 [ 30 30 30000 ]  low frequency
High Discriminatory Power:

 [ 40 40 50 ]  high frequency
TCat  [ 15 0 500 ], [ 0 15 500 ], [ 15 15 1000 ] medium frequency
 
 [ 30 30 30000 ]  low frequency
High Frequency:
 [ 16 4 10 ], [ 4 16 10 ], [ 20 20 30 ] high frequency
TCat  [ 30 30 2000 ]  medium frequency
 
 [ 30 30 30000 ]  low frequency
112
What does this Model Provide?
Connects the statistical properties of text classification tasks with
generalization error of SVM!
text low low

2 2
classification 5 R ⁄δ E (Errn(h SVM))
task properties and of
∑ ξi SVM
=>
• Explains the behavior of (linear) SVMs on text classification tasks
• Gives guideline for when to apply (linear) SVMs
• Provides formal basis for developing new methods
113
Summary
When do (Linear) SVMs Work Well?
Intuition: If the problem can be cast as a TCat-concept with

• high redundancy,
• strongly discriminating features
• particularly in the high frequency region
then linear SVMs achieve a low generalization error [Joachims, 2002].

Assumptions and Restrictions:
• no noise (attribute and classification)
• no variance (only “average” examples)
• only upper bounds, no lower bounds
114
Part 3: SVM-X?
•common elements of SVMs for other problem
•learning ranking functions from preferences
•novelty and outlier detection
•regression
115
The Receipe for Cooking an SVMs
Ingredients:
• linear prediction rules h ( x ) = w ⋅ x + b
• training problem with objective a la min w ⋅ w + C
∑ ξi and with linear
constraints (=> quadratic program)
Stirr and add flavor:
• Classification
• Ranking [Herbrich et al., 2000][Joachims, 2002c]
• Novelty Detection [Schoelkopf et al., 2000]
• Regression [Vapnik, 1998][Smola & Schoelkopf, 1998]
That makes:
• nice SVM with global optimal solution and duality
• often sparse solution (#SVs < n)
• Hint: garnish the dual with kernel to get non-linear prediction rules
116
SVM Ranking
Query:
• “Support Vector Machine”
Goal:
• “rank the document I want
high in the list”
282,000 hits
117
Training Examples from Clickthrough
Assumption: If a user skips a link a and clicks on a link b ranked
lower, then the user preference reflects rank(b) < rank(a).
Example: (3 < 2) and (7 < 2), (7 < 4), (7 < 5), (7 < 6)
Ranking Presented to User:
1. Kernel Machines
http://svm.first.gmd.de/
2. Support Vector Machine
http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine
http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines
http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References
http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ...
http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet
http://svm.research.bell-labs.com/SVT/SVMsvt.html
8. Royal Holloway Support Vector Machine
http://svm.dcs.rhbnc.ac.uk/
118
Training Examples from Clickthrough
Assumption: If a user skips a link a and clicks on a link b ranked
lower, then the user preference reflects rank(b) < rank(a).
Example: (3 < 2) and (7 < 2), (7 < 4), (7 < 5), (7 < 6)
Ranking Presented to User:
1. Kernel Machines
119
Learning to Rank
Assume:
• distribution of queries P(Q)
• distribution of target rankings for query P(R | Q)
Given:
• collection D of m documents
• i.i.d. training sample ( q 1, r 1 ), …, ( q n, r n )
Design:
D×D
• set of ranking functions F, with elements f: Q → P (weak ordering)
• loss function l ( r a, r b )
• learning algorithm
Goal:
∫
• find f° ∈ F with minimal R P ( f ) = l ( f ( q ), r ) dP ( q, r )
120
A Loss Function for Rankings
For two orderings r a and r b , a pair d i ≠ d j is
• concordant, if r a and r b agree in their ordering
P = number of concordant pairs
• discordant, if r a and r b disagree in their ordering
Q = number of discordant pairs
Loss function: [Wong et al., 88], [Cohen et al., 1999], [Crammer &
Singer, 01], [Herbrich et al., 98] ...
l ( r a, r b ) = Q
Example:
r a = ( a, c, d, b, e, f, g, h )
r b = ( a, b, c, d, e, f, g, h )
=> discordant pairs (c,b), (d,b) => l ( r a, r b ) = 2
121
l ( r a, r b ) = Q
Example:
r a = ( a, c, d, b, e, f, g, h )
r b = ( a, b, c, d, e, f, g, h )
122
l ( r a, r b ) = Q
Example:
r a = ( a, c, d, b, e, f, g, h )
r b = ( a, b, c, d, e, f, g, h )
123
Interpretation of Loss Function
Notation:
• P concordant pairs
• Q discordant pairs
Kendall’s Tau: r° total ordering, uniform sampling of document pairs

P–Q 2Q 2l ( r, r° )
τ ( r, r° ) = -------------- = 1 – -------------- = 1 – --------------------
P+Q  m  m
   
 2   2 
Average Precision: r° ordering with two ranks

2
–1  R 
  
AvgPrec ( r, r° ) ≥ l ( r, r° ) +  R + 1  ∑ i
 2   
i = 1 
124
What does the Ranking Function Look Like?
Sort documents d i by their “retrieval status value” rsv( q , d i ) with query
q [Fuhr, 89]:
rsv( q , d i ) = w1 * #(of query words in title of d i )
+ w2 * #(of query words in H1 headlines of d i )
...
+ w N * PageRank( d i )
= w Φ( q , d i ).
Select F as: di > dj 1 f1(q)

⇔ 2
( d i, d j ) ∈ f ( q )
w
⇔ 3
wΦ ( q, d i ) > wΦ ( q, d j ) f2(q)
4
125
Minimizing Training Loss For Linear Ranking
Functions
Given:
• training sample S = ( q 1, r 1 ), …, ( q n, r n )
Zero training loss on S:

∀( d i, d j ) ∈ r 1 ; wΦ ( q 1, d i ) > wΦ ( q 1, d j )
…
∀( d i, d j ) ∈ r n ; wΦ ( q n, d i ) > wΦ ( q n, d j )
Minimize (bound on) training loss (total ordering) on S:

min ∑ ξl, i, j
∀( d i, d j ) ∈ r 1 ;( wΦ ( q 1, d i ) ≥ wΦ ( q 1, d j ) + 1 – ξ 1, i, j )
…
∀( d i, d j ) ∈ r n ;( wΦ ( q n, d i ) ≥ wΦ ( q n, d j ) + 1 – ξ n, i, j )
126
Ranking Support Vector Machine
Optimization Problem (primal):
1
min --- w ⋅ w + C
2 ∑ ξl, i, j
∀( d i, d j ) ∈ r 1 ;( wΦ ( q 1, d i ) ≥ wΦ ( q 1, d j ) + 1 – ξ 1, i, j )
…
∀( d i, d j ) ∈ r n ;( wΦ ( q n, d i ) ≥ wΦ ( q n, d j ) + 1 – ξ n, i, j )
Properties:
• minimize trade-off between training 1
loss and margin size δ = 1 / ||w|| 2
• quadratic program, similar to
classification SVM (=> SVMlight)
• convex => unique global optimum R 3
• radius of ball containing the training
δ
points R 4 f(q)
127
How is this different from ...
... classification?
f1(q): - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
f2(q): - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
=> both have same error rate (always classify as non-relevant)
=> very different rank loss
... ordinal regression?
Training set S = ( x 1, y 1 ), …, ( x n, yn ) , with Y ordinal (and finite)
=> ranks need to be comparable between examples
1 2 3
ordinal regression
1 2 3
1 2 3
ranking
1 2 3
128
Experiment Setup
Collected training examples with partial feedback about ordering from
• user skipping links Ranking Presented to User:
1. Kernel Machines
=> (3 < 2) and (7 < 2), (7 < 4), (7 < 5), (7 < 6)
• clickedon document should be ranked higher than 50 random

documents
=> S = ( q 1, r′1 ), …, ( q n, r′ n )
129
Query/Document Match Features Φ(q,d)
Rank in other search engine:
• Google, MSNSearch, Altavista, Hotbot, Excite
Query/Content Match:
• cosine between URL-words and query
• cosine between title-words and query
• query contains domain-name
Popularity Attributes:
• length of URL in characters
• country code of URL
• domain of URL
• word “home” appears in title
• URL contains “tilde”
• URL as an atom
130
Experiment I: Learning Curve
Training examples: preferences from 112 queries
25 MSNSearch
Google
Learning
20
Prediction Error (%)
15
10
0
0 10 20 30 40 50 60 70 80
Number of Training Examples
131
Experiment II
Experiment Setup:
• meta-search engine (Google, MSNSearch, Altavista, Hotbot, Excite)
• approx. 20 users
• machine learning students and researchers from University of
Dortmund AI Unit (Prof. Morik)
• asked to use system as any other search engine
• display title and URL of document
October 31st November 20th December 2nd
collected training data trained test ranking

=> 260 training queries Ranking function
(with at least one click) SVM => 139 queries
132
Experiment: Learning vs. Google/MSNSearch
Ranking A Ranking B A better B better Tie Total

Learned Google 29 13 27 69
Learned MSNSearch 18 4 7 29
Learned Toprank 21 9 11 41
~20 users, as of 2nd of December
Toprank: rank by increasing mimium rank over all 5 search engines
=> Result: Learned > Google

Learned > MSNSearch
Learned > Toprank
133
Learned Weights
weight feature
0.60 cosine between query and abstract
0.48 ranked in top 10 from Google
0.24 cosine between query and the words in the URL
0.24 document was ranked at rank 1 by exactly one of the 5 search engines
...
0.17 country code of URL is “.de”
0.16 ranked top 1 by HotBot
...
-0.15 country code of URL is “.fi”
-0.17 length of URL in characters
-0.32 not ranked in top 10 by any of the 5 search engines
-0.38 not ranked top 1 by any of the 5 search engines
134
Summary: SVM Ranking
• An SVM method for learning ranking functions
• Training examples are rankings
=> pairwise preferences like “A should be ranked higher than B”
• Turn training examples into linear inequality constraints
• Results in quadratic program similar to classification
• Rank new examples by sorting according to distance from hyperplane
Applications:
• personalizing search engines
• tuning retrieval functions in XML intranets
• recommender systems
• betting on horses
135
SVM Novelty/Outlier Detection
Assume:
• distribution of feature vectors P(X)
Goal: [Schölkopf et al., 1995] [Tax & Duin, 2001]

• find the region R of ( 1 – ε ) -support for the distribution P(X), i.e.
P(x ∈ R) ≥ 1 – ε
• keep the volume of R as small as possible

=> new points falling outside of R are either outliers, or the distribution
must have changed.
Problem:
• estimate R from unlabeled oberservations x 1, …, x n
136
Example: Small and Large Volume Regions
Assume that we know the distribution P(X).
All following are regions with P ( x ∈ R ) ≥ 1 – ε :
R
P(X)
trivial optimal sub-optimal
137
Find Region using Examples
Problem:
• P(X) cannot be observed directly.
Given:
• training oberservations x 1, …, x n drawn according to P(X).
Approach: [Schoelkopf et al., 1995][Tax & Duin, 2001]

• find smallest ball that includes (most) training observations
Primal: Dual:
n n n n
min P ( c, r, ξ ) = r 2 + C ∑ ξ i max D ( α ) =∑α iK (xi, xi) – ∑∑αi αjK(xi, xj)

i=1 n i=1 i = 1j = 1
2
s. t. [ c – x i ] ≤ r 2 + ξ i s. t. ∑ αi = 1 and 0 ≤ αi ≤ C
ξi ≥ 0 i=1
•c is the center of the ball, r is its radius.
138
Properties of the Primal/Dual
Primal: Dual:
n n n n
min P ( c, r, ξ ) = r 2 + C ∑ ξ i max D ( α ) =∑α iK (xi, xi) – ∑∑αi αjK(xi, xj)

i=1 n i=1 i = 1j = 1
2 2
s. t. [ c – x i ] ≤ r + ξ i s. t. ∑ αi = 1 and 0 ≤ αi ≤ C
ξi ≥ 0 i=1
Properties:
• convex => global optimum
ξi
• ξ i measures distance from ball
• α i = 0 : example lies inside the ball r
c
• 0 < α i < C : example on hull of ball
• α i = C : example is training error
139
One-Class SVM: Separating from the Origin
Observation: [Schölkopf et al., 2000][Schölkopf et al., 2001]
• For kernels depending on the distance between points, the dual is the
same as for classification SVM with
• all training observations in the positive class (with slack)
• one virtual negative example with α = – 1 and K ( x i ,x j ) = 0 .
Dual:
n n n
max D ( α ) =∑α iK (xi, xi) – ∑∑α i αjK(x i, xj)

n i=1 i = 1j = 1
s. t. ∑ αi = 1 and 0 ≤ αi ≤ C
i=1 Constant for kernels depending on
distance between points (e.g. RBF)
2
=> Equivalent for RBF kernel K ( x i ,x j ) = exp ( – x i – x j ⁄ σ 2 ) !
140
Influence of C and RBF-Width σ2
small C
large width σ2
no outliers
small C
large width σ2
some ouliers
large C
large width σ2
small C
small width σ2
(plots courtesy of B. Schoelkopf)
141
Summary: SVM Novelty Detection
• Find small region where most observations fall
• One-Class SVM: separate observations from origin
• Outliers (or new observations after shift in distribution) lie outside of
region
• Training problem similar to classification SVM
Further work:
• Extension to ν -SVMs and error bounds [Schölkopf et al., 2001]
[Schölkopf et al., 2001]
• SVM clustering [Ben-Hur et al., 2001]
Applications:
• Text classification [Manevitz & Yousef, 2001]
• Topic detection
142
SVM Regression
Loss function:
• ε -insensitive region with zero loss
• linear loss beyond the “tube”
Graph taken from [Smola & Schoelkopf, 1998]
143
Primal SVM Optimization Problems
Classification:
n
minimize J ( w, b, ξ ) = 1--- w ⋅ w + C ∑ ξ i
2
i=1
δ
Regression:
n
minimize R ( w, b, ξ, ξ° ) = 1--- w ⋅ w + C ∑ ( ξ i + ξ°i )

2
i=1
s. t. yi – [ w ⋅ x i + b ] ≤ ε + ξ i and ξ i ≥ 0
ε
– y i + [ w ⋅ x i + b ] ≤ ε + ξ °i and ξ°i ≥ 0
144
Dual SVM Optimization Problems
 m  m m
1
maximize L(α) =  p i α i – ---
∑ ∑∑ α i αj u i u j K ( v i, v j )
  2
m i = 1  i = 1j = 1
s.t. ∑ αi ui = 0 and 0 ≤ αi ≤ C
i=1
N
Classification: ( x 1, y 1 ), …, ( x n, y n ) ∼ P ( x, y ) xi ∈ ℜ y i ∈ {1, – 1} m = n
• p i = 1 for 1 ≤ i ≤ n
• u i = y i for 1 ≤ i ≤ n
• v i = x i for 1 ≤ i ≤ n
Regression: ( x1, y1 ), …, ( x n, y n ) ∼ P ( x, y ) xi ∈ ℜN y i ∈ ℜ m = 2n
• p i = ε + y i for 1 ≤ i ≤ n and p i = ε – y i for n + 1 ≤ i ≤ 2n
• u i = 1 for 1 ≤ i ≤ n and u i = – 1 for n + 1 ≤ i ≤ 2n
• v i = x i for 1 ≤ i ≤ n and v i = x i for n + 1 ≤ i ≤ 2n
145
Conclusions
• What! How! Why! When! ...and that SVMs solve any other problem!
Info
• Chris Burges’ tutorial (Classification)
http://www.kernel-machines.org/papers/Burges98.ps.gz
• Smola & Schölkopf’s tutorial (Regression)
http://www.kernel-machines.org/papers/tr-30-1998.ps.gz
• Cristianini& Shawe-Taylor book: Introduction to SVMs, Cambridge
University Press, 2000.
• Schölkopf’ & Smola book: Learning with Kernels, MIT Press, 2002.
• My dissertation: Learning to Classify Text Using Support Vector
Machines, Kluwer.
• Software: SVM
light for Classification, Regression, and Ranking
http://svmlight.joachims.org/
• General: http://www.kernel-machines.org
146
Bibliography
• [Ben-Hur et al., 2001] Ben-Hur, A., Horn, D., Siegelmann, H.,
Vapnik, V. (2001). Support Vector Clustering. Journal of Machine
Learning Research, 2:125-137.
• [Bennet & Demiriz, 1999] K. Bennet and A. Demiriz (1999). Semi-
supervised support vector machines, Advances in Neural Information
Processing Systems, 12, pages 368-374, MIT Press.
• [Boser et al., 1992] Boser, B. E., Guyon, I. M., and Vapnik, V. N.
(1992). A traininig algorithm for optimal margin classifiers. In
Haussler, D., editor, Proceedings of the 5th Annual ACM Workshop
on Computational Learning Theory, pages 144-152.
• [Burges, 1998] C. J. C. Burges. A Tutorial on Support Vector
Machines for Pattern Recognition. Knowledge Discovery and Data
Mining, 2(2), 1998.
147
• [Chapelle et al., 2002] Chapelle, O., V. Vapnik, O. Bousquet and S.
Mukherjee (2002). Choosing Multiple Parameters for Support Vector
Machines. Machine Learning 46(1), pages 131-159.
• [Cohen et al., 1999] W. Cohen, R. Shapire, and Y. Singer (1999).
Learning to order things. Journal of Artificial Intelligence Research,
10:243-270.
• [Collins & Duffy, 2001] Collins, M., and Duffy, N. (2001).
Convolution kernels for natural language. Proceedings of Advances
in Neural Information Processing Systems (NIPS).
• [Cortes & Vapnik, 1995] Cortes, C. and Vapnik, V. N. (1995).
Support-vector networks. Machine Learning Journal, 20:273-297.
• [Crammer & Singer, 2001] K. Crammer and Y. Singer (2001). On the
algorithmic implementation of multiclass kernel-based vector
machines, Journal of Machine Learning Research, 2:265-292.
• [Cristianini & Shawe-Taylor, 2000] Cristianini, N. and Shawe-Taylor,
J. (2000). An Introduction to Support Vector Machines and Other
Kernel-Based Learning Methods. Cambridge University Press.
148
• [Dumais et al., 1998] Dumais, S., Platt, J., Heckerman, D., and
Sahami, M. (1998). Inductive learning algorithms and representations
for text categorization. In Proceedings of ACM-CIKM98.
• [Gaertner et al., 2002] Gaertner, T., Lloyd, J., Flach, P. (2002).
Kernels for structured data. Twelfth International Conference on
Inductive Logic Programming (ILP).
• [Herbrich et al., 2000] R. Herbrich, T. Graepel, K. Obermayer (2000).
Large Margin Rank Boundaries for Ordinal Regression, In Advances
in Large Margin Classifiers, pages 115-132, MIT Press.
• [Jaakkola, and Haussler, 1998] Jaakkola, T. and Haussler, D. (1998).
Exploiting generative models in discriminative classifiers. In
Advances in Neural Information Processing Systems 11, MIT Press.
• [Jaakkola and Haussler, 1999] Jaakkola, T. and Haussler, D. (1999).
Probabilistic kernel regression models. In Conference on AI and
Statistics.
• [Joachims, 1998] Joachims, T. (1998). Text categorization with
support vector machines: Learning with many relevant features. In
149
Proceedings of the European Conference on Machine Learning, pages
137 - 142, Berlin. Springer.
• [Joachims, 1999a] Joachims, T. (1999a). Aktuelles Schlagwort:
Support Vector Machines. Künstliche Intelligenz, 4.
• [Joachims, 1999b] Joachims, T. (1999b). Making Large-Scale SVM
Learning Practical. In Schölkopf, B., Burges, C., and Smola, A.,
editors, Advances in Kernel Methods - Support Vector Learning,
pages 169 - 184. MIT Press, Cambridge.
• [Joachims, 1999c] Joachims, T. (1999c) Transductive Inference for
Text Classification using Support Vector Machines. International
Conference on Machine Learning (ICML).
• [Joachims, 2002] Joachims, T. (2002). Learning to Classify Text
Using Support Vector Machines: Methods, Theory, and Algorithms,
Kluwer.
• [Joachims, 2002c] Joachims, T. (2002). Optimizing Search Engines
using Clickthrough Data, Conference on Knowledge Discovery and
Data Mining (KDD), ACM.
150
• [Keerthi et al., 1999] Keerthi, S., Shevade, S., Bhattacharyya, C., and
Murthy, K. (1999). A fast iterative nearest point algorithm for support
vector machine classifier design. Technical Report TR-ISL-99-03,
Indian Institute of Science, Bangalore, India.
• [Kindermann & Paass, 2000] Kindermann, Jörg and Paass, Gerhard:
Multiclass Classification with Error Correcting Codes, FGML, 2000.
• [Kondor & Lafferty, 2002] Kondor, I., and Lafferty, J. (2002).
Diffusion kernels on graphs and other discrete input spaces,
International Conference on Machine Learning (ICML).
• [Lodhi et al., 2002] Lodhi, H., Saunders, C., Shawe-Taylor, J.,
Cristianini, N., & Watkins, C. (2002). Text Classification using String
Kernels. Journal of Machine Learning Research, 2:419-444.
• [MacKay, 1997] D.J.C. MacKay. Introduction to gaussian processes,
1997. Tutorial at ICANN'97. ftp://wol.ra.phy.cam.ac.uk/pub/mackay/
gpB.ps.gz
151
• [Manevitz & Yousef, 2001] L. Manevitz and M. Yousef (2001). One-
Class SVMs for Document Classification, Journal of Machine
• [Mangasarian and Musicant, 2000] Mangasarian, O. L. and Musicant,
D. R. (2000). Active support vector machine classification. Technical
Report 00-04, Data Mining Institute, Computer Sciences Department,
University of Wisconsin, Madison, Wisconsin. ftp://ftp.cs.wisc.edu/
pub/dmi/tech-reports/00-04.ps.
• [Osuna et al., 1997] Osuna, E., Freund, R., and Girosi, F. (1997b).
Training support vector machines: An application to face detection.
In Proceedings CVPR'97.
• [Platt, 1999] Platt, J. (1999). Fast training of support vector machines
using sequential minimal optimization. In Schölkopf, B., Burges, C.,
and Smola, A., editors, Advances in Kernel Methods - Support Vector
Learning, chapter 12. MIT-Press.
152
• [Plattet al., 2000] Platt, J., Cristianini, N., and Shawe-Taylor, J.
(2000). Large margin dags for multiclass classification. In Advances
in Neural Information Processing Systems 12.
• [Schölkopf et al., 1995] Schölkopf, B., Burges, C., Vapnik, V. (1995).
Extracting support data for a given task. Proceedings of the
International Conference on Knowledge Discovery and Data Mining
(KDD).
• [Schölkopf et al., 1998] Schölkopf, B., Smola, A., and Mueller, K.-R.
(1998). Nonlinear component analysis as a kernel eigenvalue
problem. Neural Computation, 10:1299-1319.
• [Schölkopf et al., 2000] Schölkopf, B., R. C. Williamson, A. J.
Smola, J. Shawe-Taylor, and J. C. Platt (2000). Support vector
method for novelty detection. In S. Solla, T. Leen, and K.-R. Müller
(Eds.), Advances in Neural Information Processing Systems 12, pp.
582-588. MIT Press.
153
• [Schölkopf et al., 2001] Schölkopf, B., J. Platt, J. Shawe-Taylor, A.J.
Smola and R.C. Williamson (2001). Estimating the support of a high-
dimensional distribution. Neural Computation 13(7);1443-1471.
• [Schölkopf & Smola, 2002] Schölkopf, B., Smola, A. (2002).
Learning with Kernels, MIT Press.
• [Shawe-Taylor et al., 1996] Shawe-Taylor, J., Bartlett, P., Williamson,
R., and Anthony, M. (1996). Structural risk minimization over data-
dependent hierarchies. Technical Report NC-TR-96-053,
NeuroCOLT.
• [Smola & Schölkopf, 1998] A. J. Smola and B. Schölkopf. A Tutorial
on Support Vector Regression. NeuroCOLT Technical Report NC-
TR-98-030, Royal Holloway College of London, UK, 1998.
• [Tax & Duin, 1999] D. Tax and R. Duin (1999). Support vector
domain description. Patter Recognition Letters, 20:1991-1999.
• [Tax & Duin, 2001] D. Tax and R. Duin (2001). Uniform Object
Generation for Optimizing One-Class Classifiers, Journal of Machine
154
• [Vapnik, 1998] Vapnik, V. (1998). Statistical Learning Theory. Wiley,
Chichester, GB.
• [Vapnik & Chapelle, 2000] V. Vapnik and O. Chapelle. Bounds on
Error Expectation for Support Vector Machines. In Neural
Computation, 2000, vol 12, 9.
• [Wapnik & Tscherwonenkis, 1979] Wapnik, W. and Tscherwonenkis,
A. (1979). Theorie der Zeichenerkennung. Akademie Verlag, Berlin.
• [Weston & Watkins, 1998] Weston, J. and Watkins, C. (1998). Multi-
class support vector machines. Technical Report CSD-TR-98-04,
Royal Holloway University of London.
155

Sigir03: Support Vector and Kernel Methods

Cargado por

Información del documento

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

Sigir03: Support Vector and Kernel Methods

Cargado por

Copyright:

Formatos disponibles

SIGIR 2003 Tutorial

Support Vector and Kernel Methods

How to use SVMs effectively and efficiently?

Why can SVMs learn?

When do SVMs work well?

SVM-{ranking, novelty detection, regression, ...}?

E.D. And F. MAN TO BUY INTO HONG KONG

About a corportate acquisition?

==> The ordering of words is ignored!

... but this is not necessarily a problem!

Good News: SVMs can overcome this problem!

Naive Bayes 72.3 82.0 62.4

Rocchio Algorithm 79.9 74.1 61.5

C4.5 Decision Tree 79.4 79.1 56.7

k-Nearest Neighbors 82.6 80.5 63.4

SVM 87.5 90.3 71.6

Generative Training Discriminative Training

What is the “optimal” Learner L ?

==> Number of misclassifications on training examples.

Central Problem of Statistical Learning Theory:

if and only if the sequence

<==> one-sided uniform convergence [Vapnik]

<==> VC-dimension of H is finite [Vapnik].

Geometric Interpretation (Hyperplane):

nuclear atom salt pepper water heat and

• Four points cannot be shattered.

=> Hyperplanes in ℜ 2 -> VCdim=3

With probablility η it hold for all h ∈ H :

•n number of training examples

==> Err P ( h ) ≤ Err S ( h ) + Φ ( d ,n, η )

Goal: Minimize upper bound on

The VC-dimension does not necessarily depend on the number of

The hyperplane with maximum margin

Minimize inverse and take square:

Train on 1000 pos / 1000 neg examples for “acq” (Reuters-21578).

=> minimize trade-off between margin and training error.

minimize P ( w, b ) = 1--- w ⋅ w minimize P ( w, b, ξ ) = 1--- w ⋅ w + C ∑ ξ i

Soft Margin: minimize P ( w, b, ξ ) = 1--- w ⋅ w + C ∑ ξ i

Observation: Typically no local optima, but not necessarily...

•k linear inequality constraints

• Hessian H = H ( i, j ) is pos. semi-definite ∀α 1 …α n ; ∑ ∑ αi αj Hij ≥ 0

with P ( w ) convex and differentiable, a necessary and sufficient

the Lagrangian function is defined as

•α and β are called Lagrange Multipliers

with P ( w ) convex and differentiable and all h affine (w*x+b),

=> L ( w°, β ) ≤ L ( w°, β° ) ≤ L ( w, β° )

with P ( w ) convex and differentiable and all g and h affine, necessary

Sufficient for convex QP: max [ min L ( w, α, β ) ]

Primal OP: minimize P ( w, b, ξ ) = 1--- w ⋅ w + C ∑ ξ i

Lemma: The solution w° can always

of the training data.i = 1

is the solution of the primal OP.

Theorem: For any set of feasible points P ( w, b ) ≥ D ( α ) .

=> two alternative ways to represent the learning result

Dual OP: maximize D ( α ) =  αi – 1---

• typically single solution (i. e. 〈 w, b〉 is unique)

How can SVMs learn non-linear classification rules?

==> The separating hyperplane in features space is a degree two

gives inner product in feature space.

We do not need to represent the feature space explicitly!

New hypotheses spaces through new Kernels:

and positive semi-definite

=> Construct complex Kernels from simple Kernels.

c-a c-t a-t b-a b-t c-r a-r b-r