
Artificial Intelligence SS2009
Lecture 15: Support Vector Machines

Stephan Dreiseitl
FH Hagenberg
Software Engineering

Overview

Motivation
Statistical learning theory
Optimal separating hyperplanes
Support vector classification and regression
Kernel functions


Motivation
Given data D = {(xi, ti)} distributed according to P(x, t),
which model is the better representation of the data?
[Figure: two fits to the same data set; left: high bias, low variance; right: low bias, high variance]

Motivation (cont.)
Neural networks model p(t|x); overfitting is controlled by
topology restriction
early stopping
weight decay
Bayesian approach

In support vector machines, this is replaced by capacity control
SVM concepts are based on statistical learning theory

Statistical learning theory


Given: data set {(xi, ti)}, class labels ti ∈ {−1, +1},
classifier output y(α, xi) ∈ {−1, +1}
Find: parameter α such that y(α, xi) = ti
Important questions: is learning consistent (does
performance increase with the size of the training set)?
How to handle limited data (small training sets)?
Can performance on the test set (generalization error) be
inferred from performance on the training set?

Statistical learning theory (cont.)


Empirical error on a data set {(xi, ti)} with distribution
P(x, t) for a classifier with parameter α:

Remp(α) = 1/(2n) Σ_{i=1}^n |y(α, xi) − ti|

Expected error of the same classifier on unseen data with the
same distribution:

R(α) = 1/2 ∫ |y(α, x) − t| dP(x, t)
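As a quick sanity check, Remp is just the scaled count of misclassifications; a minimal numpy sketch with made-up labels and outputs:

```python
import numpy as np

# hypothetical labels and classifier outputs, both in {-1, +1}
t = np.array([+1, -1, +1, +1, -1])
y = np.array([+1, +1, +1, -1, -1])

# Remp = 1/(2n) * sum |y_i - t_i|; each misclassified point contributes |y - t| = 2
R_emp = np.abs(y - t).sum() / (2 * len(t))
print(R_emp)  # 0.4, i.e. two of five points are misclassified
```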

Statistical learning theory (cont.)


Fundamental question of statistical learning theory: How
can we relate Remp and R?
Key result: Generalization error R depends on both Remp
and capacity h of the classifier
The following holds with probability 1 − η:

R(α) ≤ Remp(α) + √( (h(log(2n/h) + 1) − log(η/4)) / n ),

with h the Vapnik-Chervonenkis (VC) dimension of the
classifier, and n the size of the training set
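A small sketch of the right-hand side of this bound (n, h, and η below are arbitrary choices, only to show how the confidence term grows with h):

```python
import numpy as np

def vc_bound(r_emp, h, n, eta=0.05):
    """Right-hand side of the bound, holding with probability 1 - eta."""
    confidence = np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)
    return r_emp + confidence

# the confidence term grows with the VC dimension h and shrinks with n
for h in (3, 10, 100):
    print(h, vc_bound(r_emp=0.1, h=h, n=1000))
```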

Shattering
A classifier shatters data points if, for any labeling, the
points can be correctly classified
Capacity of classifier depends on number of points that
can be shattered by a classifier
VC dimension is largest number of data points for which
there exists an arrangement that can be shattered
Not the same as the number of parameters in the
classifier!


Shattering examples
Straight lines can shatter 3 points in 2-space
Classifier: sign(α · x)


Shattering examples (cont.)


Other classifier: sign(x · x − α)
Classifier for the last case: sign(α − x · x)
[Figure: example labelings of the points and the corresponding decision boundaries]


Shattering examples (cont.)


Extreme example: one parameter, but infinite VC
dimension
Consider classifier y(α, x) = sign(sin(αx))
Surprising fact: for any n there exists an arrangement of
data points {xi} ⊂ R that can be shattered by y(α, x)
Choose data points as xi = 10^−i, i = 1, . . . , n
There is a clever way of encoding the labeling information in
the single number α


Shattering examples (cont.)


For any labeling ti ∈ {−1, +1}, construct α as

α = π ( 1 + 1/2 Σ_{i=1}^n (1 − ti) 10^i )

For n = 5 and ti = (+1, −1, −1, +1, −1), α = 101101 π


[Figure: sign(sin(αx)) evaluated at the points xi = 10^−i reproduces the chosen labels]
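A quick numerical check of this construction, assuming the points xi = 10^−i and the factor π as written above:

```python
import numpy as np

n = 5
t = np.array([+1, -1, -1, +1, -1])        # an arbitrary labeling
x = 10.0 ** -np.arange(1, n + 1)          # data points x_i = 10^-i

factor = 1 + 0.5 * np.sum((1 - t) * 10.0 ** np.arange(1, n + 1))
alpha = np.pi * factor
print(factor)                             # 101101.0, as in the example above

y = np.sign(np.sin(alpha * x))
print(np.array_equal(y, t))               # True: the labeling is reproduced exactly
```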


VC dimension
VC dimension is capacity measure for classifiers
VC dimension is largest number of data points for which
there exists an arrangement that can be shattered
For straight lines in 2-space, VC dimension is 3
For hyperplanes in n-space, VC dimension is n + 1
May be difficult to calculate VC dimension for classifiers


Structural risk minimization


Recall that, with probability 1 − η,

R(α) ≤ Remp(α) + √( (h(log(2n/h) + 1) − log(η/4)) / n )
Induction principle for finding the best classifier:
fix the data set and order classifiers according to their
VC dimension
for each classifier, train and calculate the right-hand side
of the inequality
the best classifier is the one that minimizes the right-hand
side (sketched below)
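A sketch of the selection step, assuming each candidate has already been trained and we know its empirical error and VC dimension (the numbers below are made up):

```python
import numpy as np

def vc_bound(r_emp, h, n, eta=0.05):
    return r_emp + np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

n = 1000
# hypothetical candidate models ordered by increasing VC dimension h;
# more capacity lowers the training error but raises the confidence term
candidates = [("y1", 0.30, 3), ("y2", 0.12, 10), ("y3", 0.08, 50),
              ("y4", 0.05, 200), ("y5", 0.04, 800)]

best = min(candidates, key=lambda m: vc_bound(m[1], m[2], n))
print(best[0])   # 'y2' has the smallest upper bound for these made-up numbers
```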

Structural risk minimization (cont.)


R(α) ≤ Remp(α) + √( (h(log(2n/h) + 1) − log(η/4)) / n )

[Table: candidate models y1(α, x), . . . , y5(α, x) ordered by VC dimension, with columns Remp, VC confidence, and upper bound]

Support vector machines


Algorithmic representation of concepts from statistical
learning theory
SVMs implement hyperplanes, so the VC dimension is known
SVMs calculate optimal hyperplanes: hyperplanes that
maximize the margin between classes
Decision function: sign(w · x + w0)


Geometry of hyperplanes
[Figure: the hyperplane {x | w · x + w0 = 0}; a point z lies at distance |w · z + w0| / ‖w‖ from it, and the hyperplane lies at distance |w0| / ‖w‖ from the origin]

Geometry of hyperplanes (cont.)


Hyperplanes invariant to scaling of parameters:
{x | w · x + w0 = 0} = {x | cw · x + cw0 = 0}

[Figure: the same hyperplane described as {x | 3x − y − 4 = 0} (left) and as {x | 6x − 2y − 8 = 0} (right)]
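A small numerical check of the distance formula and the scaling invariance, using the example line above:

```python
import numpy as np

def distance(w, w0, z):
    """Distance of point z from the hyperplane {x | w.x + w0 = 0}."""
    return abs(w @ z + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, -1.0]), -4.0       # {x | 3x - y - 4 = 0}
z = np.array([1.0, 2.0])                  # an arbitrary test point

print(distance(w, w0, z))                 # distance of z from the line
print(distance(2 * w, 2 * w0, z))         # same value for {x | 6x - 2y - 8 = 0}
print(abs(w0) / np.linalg.norm(w))        # distance of the line from the origin
```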

Optimal separating hyperplanes


We want
w · xi + w0 ≥ +1 for all xi with ti = +1
w · xi + w0 ≤ −1 for all xi with ti = −1

[Figure: separating hyperplane with the margin hyperplanes w · x + w0 = −1 and w · x + w0 = +1]


Optimal separating hyperplanes (cont.)

Points x and o on the dashed lines satisfy w · x + w0 = +1
and w · o + w0 = −1, respectively
The distance between the dashed lines is

|w · x + w0| / ‖w‖ + |w · o + w0| / ‖w‖ = 2 / ‖w‖

Find the largest (optimal) margin by maximizing 2 / ‖w‖
This is equivalent to

minimize 1/2 ‖w‖²
subject to ti (w · xi + w0) ≥ 1
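A minimal sketch of this optimization problem with a general-purpose solver on a tiny made-up data set; a real SVM implementation would solve the dual instead:

```python
import numpy as np
from scipy.optimize import minimize

# tiny linearly separable toy set (made up)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(p):                  # p = (w1, w2, w0)
    return 0.5 * p[:2] @ p[:2]

# one constraint t_i (w . x_i + w0) - 1 >= 0 per training point
cons = [{"type": "ineq", "fun": lambda p, x=x, ti=ti: ti * (p[:2] @ x + p[2]) - 1}
        for x, ti in zip(X, t)]

res = minimize(objective, x0=np.ones(3), constraints=cons)
w, w0 = res.x[:2], res.x[2]
print(w, w0, 2 / np.linalg.norm(w))   # weights, bias, and the resulting margin
```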


Algorithmic aspects
Constrained optimization problem transformed to the
Lagrangian

L(w, w0, α) = 1/2 ‖w‖² − Σ_{i=1}^n αi [ ti (w · xi + w0) − 1 ]

Find saddle point (minimize w.r.t. w, w0, maximize
w.r.t. αi)
Leads to the criteria

Σ_{i=1}^n αi ti = 0   and   w = Σ_{i=1}^n αi ti xi


Algorithmic aspects (cont.)


Substituting the constraints into the Lagrangian results in the dual
problem

maximize   Σ_{i=1}^n αi − 1/2 Σ_{i,j=1}^n αi αj ti tj xi · xj
subject to αi ≥ 0 and Σ_{i=1}^n αi ti = 0

With the expansion w = Σ_{i=1}^n αi ti xi, the decision function
sign(w · x + w0) becomes

f(x) = sign( Σ_{i=1}^n ti αi xi · x + w0 )
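A short sketch with scikit-learn (assuming it is available) showing that only a few training points end up with αi ≠ 0:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[+2, +2], size=(20, 2)),
               rng.normal(loc=[-2, -2], size=(20, 2))])
t = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, t)   # very large C approximates a hard margin

print(clf.support_vectors_)        # the points x_i with alpha_i != 0
print(clf.dual_coef_)              # the corresponding products t_i * alpha_i
print(clf.coef_, clf.intercept_)   # w and w0 recovered from the expansion
```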

Summary algorithmic aspects


Optimal separating hyperplane has the largest margin (SVMs
are large margin classifiers)
Unique solution to the convex constrained optimization
problem is w = Σ αi ti xi over all points xi with αi ≠ 0
Points xi with αi ≠ 0 lie on the margin (support
vectors); all other points are irrelevant for the solution!
Observe that data points enter the calculation only via dot
products


Large margin classifiers


Arguments for the importance of large margins:
[Figure: two-class data separated by a large-margin hyperplane]


Soft margin classifiers


What happens when the data set is not linearly separable?
Introduce slack variables ξi ≥ 0
[Figure: non-separable data; points violating the margin have slack ξi > 0]


Soft margin classifiers (cont.)


Constraints are then
w · xi + w0 ≥ +1 − ξi for all xi with ti = +1
w · xi + w0 ≤ −1 + ξi for all xi with ti = −1
Want slack variables as small as possible; include this in the
objective function
Soft margin classifier minimizes 1/2 ‖w‖² + C Σ_{i=1}^n ξi
Large value of C gives a large penalty to data on the wrong
side
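A sketch of the effect of C on overlapping classes, again with scikit-learn and made-up data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[+1, +1], size=(50, 2)),
               rng.normal(loc=[-1, -1], size=(50, 2))])   # overlapping classes
t = np.array([+1] * 50 + [-1] * 50)

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    margin = 2 / np.linalg.norm(clf.coef_)
    # typically: larger C -> smaller margin and fewer support vectors
    print(C, margin, clf.support_.size)
```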

Soft margin classifiers (cont.)


Little difference in the dual formulation:

maximize   Σ_{i=1}^n αi − 1/2 Σ_{i,j=1}^n αi αj ti tj xi · xj
subject to 0 ≤ αi ≤ C and Σ_{i=1}^n αi ti = 0

Again, data points appear only via dot products


Support vector regression


Difference to classification: targets ti are real-valued
Prediction function for linear regression is
f(x) = w · x + w0
Recall the 0-1 loss in classification: 1/2 |f(xi) − ti|
Need a different loss for regression (ε-insensitive loss):

|f(xi) − ti|_ε := max( 0, |f(xi) − ti| − ε )
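The ε-insensitive loss in a few lines of numpy (the value of ε is chosen arbitrarily):

```python
import numpy as np

def eps_insensitive(residuals, eps=0.5):
    """|f(x_i) - t_i|_eps = max(0, |f(x_i) - t_i| - eps): zero inside the tube."""
    return np.maximum(0.0, np.abs(residuals) - eps)

print(eps_insensitive(np.array([-1.2, -0.3, 0.0, 0.4, 2.0])))
# [0.7 0.  0.  0.  1.5]
```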

Support vector regression (cont.)


ε-insensitive loss results in a tube around the regression
function
[Figure: linear regression function with an ε-tube around it]

Minimize regularization term and error contribution

1/2 ‖w‖² + C Σ_{i=1}^n |f(xi) − ti|_ε


Support vector regression (cont.)


Need slack variables for points outside the tube
[Figure: regression function with ε-tube; points outside the tube get slack ξi or ξi*]

minimize   1/2 ‖w‖² + C Σ_{i=1}^n (ξi + ξi*)
subject to f(xi) − ti ≤ ε + ξi,  ξi ≥ 0
           ti − f(xi) ≤ ε + ξi*, ξi* ≥ 0

Support vector regression (cont.)


Convert to the dual problem statement (omitting details)
Regression estimate is

f(x) = Σ_{i=1}^n (αi − αi*) xi · x + w0

Again, data points enter the calculation only via dot products
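A minimal regression sketch with scikit-learn's SVR on synthetic data; kernel, C, and ε are arbitrary choices:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=80))
t = np.sin(x) + 0.1 * rng.normal(size=80)    # noisy real-valued targets

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x.reshape(-1, 1), t)

print(svr.support_.size)       # number of points on or outside the eps-tube
print(svr.predict([[2.5]]))    # regression estimate f(x) at x = 2.5
```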


Probabilities for SVM outputs


In many applications, we want the output to be P(ti = 1 | xi)
SVMs provide only ±1 classifications
Probabilities can be obtained by fitting a sigmoid to the
raw SVM output w · x + w0
The functional form of the sigmoid can be motivated
theoretically


Probabilities for SVM outputs (cont.)


Details: With 0/1 target encoding ti, SVM output fi and
sigmoid pi = 1/(1 + exp(A·fi + B)), minimize

− Σ_{i=1}^n [ ti log(pi) + (1 − ti) log(1 − pi) ]
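A sketch of this fit, with made-up raw SVM outputs fi and 0/1 targets ti; scikit-learn's SVC(probability=True) performs essentially this calibration (Platt scaling) internally:

```python
import numpy as np
from scipy.optimize import minimize

# made-up raw SVM outputs f_i = w.x_i + w0 and 0/1 targets t_i
f = np.array([-2.3, -1.1, -0.4, 0.2, 0.9, 1.8, 2.5])
t = np.array([0, 0, 0, 1, 0, 1, 1])

def neg_log_likelihood(params):
    A, B = params
    p = np.clip(1.0 / (1.0 + np.exp(A * f + B)), 1e-12, 1 - 1e-12)
    return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

A, B = minimize(neg_log_likelihood, x0=[-1.0, 0.0]).x
print(A, B)                              # fitted sigmoid parameters
print(1.0 / (1.0 + np.exp(A * f + B)))   # calibrated probabilities P(t = 1 | x)
```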

[Figure: sigmoid mapping raw SVM outputs to probabilities]


Nonlinear SVM
Idea: do a nonlinear projection Φ(x): R^m → H of the original
data points x into some higher-dimensional space H
Then, apply the optimal margin hyperplane algorithm in H
[Figure: data that is not linearly separable in input space becomes linearly separable among the projected points Φ(x) in H]

Nonlinear SVM example


Idea: project R² → R³ by

Φ(x1, x2) = ( x1², √2 x1 x2, x2² )

[Figure: the data after projection into R³]


Nonlinear SVM example (cont.)


Do the math (dot product in H):

Φ(x) · Φ(y) = ( x1², √2 x1 x2, x2² ) · ( y1², √2 y1 y2, y2² )
            = x1² y1² + 2 x1 x2 y1 y2 + x2² y2²
            = (x1 y1 + x2 y2)²
            = (x · y)²

This means that the dot product in H can be represented by a
function in the original space R²
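A numerical check of this identity:

```python
import numpy as np

def phi(v):
    """Explicit projection R^2 -> R^3 from the example above."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.5, -0.5])
y = np.array([0.3, 2.0])

print(phi(x) @ phi(y))   # dot product computed in H
print((x @ y) ** 2)      # kernel K(x, y) = (x . y)^2 computed in R^2: same value
```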

The kernel trick


Recall that data enters the maximum margin calculation only
via dot products xi · xj or Φ(xi) · Φ(xj)
Instead of calculating Φ(xi) · Φ(xj), use a kernel function in the
original space:
K(xi, xj) = Φ(xi) · Φ(xj)
Advantage: no need to calculate Φ
Advantage: no need to know H
Raises the question: what are admissible kernel functions?

Kernel functions

Admissible kernel functions: the Gram matrix (K(xi, xj))_{i,j} is
positive definite
Most widely used kernel functions and their parameters:
polynomials (degree)
Gaussians (variance)

Practical importance of kernels: similarity measures on
data sets without dot products!
Great for text analysis, bioinformatics, . . .
Kernel-ization of other algorithms (kernel PCA, LDA, . . . )
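A quick check that these two kernels produce positive semidefinite Gram matrices on a random sample (kernel parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))

def gram(kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

poly = gram(lambda a, b: (a @ b + 1) ** 3)                       # polynomial kernel, degree 3
gauss = gram(lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0))   # Gaussian kernel, sigma = 1

for K in (poly, gauss):
    print(np.min(np.linalg.eigvalsh(K)) >= -1e-9)   # eigenvalues are (numerically) >= 0
```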

Kernel function example


Dot product Φ(x1, x2) · Φ(y1, y2) after the nonlinear
projection Φ(x1, x2) = (x1², √2 x1 x2, x2²) can be achieved
by the kernel function K(x, y) = (x · y)²

SVM examples
[Figure: linearly separable data; C = 100]

SVM examples (cont.)


[Figure: C = 100; C = 1]

SVM examples (cont.)


[Figure: linear function; quadratic polynomial, C = 10]

SVM examples (cont.)


[Figure: quadratic polynomial, C = 10; quadratic polynomial, C = 100]

SVM examples (cont.)


[Figure: cubic polynomial; Gaussian, σ = 1]

SVM examples (cont.)


[Figure: quadratic polynomial; cubic polynomial, C = 10]

SVM examples (cont.)


[Figure: cubic polynomial, C = 10; degree-4 polynomial, C = 10]

SVM examples (cont.)


[Figure: Gaussian, σ = 1; Gaussian, σ = 3]

Summary
SVMs are based on statistical learning theory
Allows calculation of bounds on generalization
performance
Optimal separating hyperplanes
Kernel trick (projection Φ)
Kernel functions are similarity measures
SVMs perform comparably to neural networks

