Perceptron Learning Algorithm



Jia Li
Department of Statistics
The Pennsylvania State University

Email: jiali@stat.psu.edu
http://www.stat.psu.edu/jiali


Separating Hyperplanes

- Construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.
- Good separation is defined mathematically in a specific form.
- Even when the training data can be perfectly separated by hyperplanes, LDA or other linear methods developed under a statistical framework may not achieve perfect separation.


Review of Vector Algebra


- A hyperplane or affine set L is defined by the linear equation
  L = \{x : f(x) = \beta_0 + \beta^T x = 0\}.
- For any two points x_1 and x_2 lying in L, \beta^T (x_1 - x_2) = 0, and hence \beta^* = \beta / \|\beta\| is the unit vector normal to the surface of L.
- For any point x_0 in L, \beta^T x_0 = -\beta_0.
- The signed distance of any point x to L is given by
  \beta^{*T} (x - x_0) = \frac{1}{\|\beta\|} (\beta^T x + \beta_0) = \frac{1}{\|f'(x)\|} f(x).
- Hence f(x) is proportional to the signed distance from x to the hyperplane defined by f(x) = 0.
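
As a quick numerical check of the signed-distance formula, here is a minimal numpy sketch; the function name and the example hyperplane are illustrative choices, not from the lecture.

```python
import numpy as np

def signed_distance(X, beta, beta0):
    """Signed distance of each row of X to the hyperplane beta0 + beta^T x = 0."""
    return (X @ beta + beta0) / np.linalg.norm(beta)

# Example: the line x1 + x2 - 1 = 0 in R^2.
beta = np.array([1.0, 1.0])
beta0 = -1.0
X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(signed_distance(X, beta, beta0))  # approximately [-0.707, 0.707]
```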

Rosenblatt's Perceptron Learning

- Goal: find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.
- Code the two classes by y_i = 1, -1.
- If y_i = 1 is misclassified, \beta^T x_i + \beta_0 < 0. If y_i = -1 is misclassified, \beta^T x_i + \beta_0 > 0.
- Since the signed distance from x_i to the decision boundary is (\beta^T x_i + \beta_0) / \|\beta\|, the distance from a misclassified x_i to the decision boundary is -y_i (\beta^T x_i + \beta_0) / \|\beta\|.
- Denote the set of misclassified points by M.
- The goal is to minimize
  D(\beta, \beta_0) = -\sum_{i \in M} y_i (\beta^T x_i + \beta_0).
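
A small numpy sketch of this criterion, assuming X stores the x_i as rows and y the ±1 labels; the function name and the convention of counting boundary points as misclassified are my choices, not the lecture's.

```python
import numpy as np

def perceptron_criterion(X, y, beta, beta0):
    """D(beta, beta0) = -sum_{i in M} y_i (beta^T x_i + beta0)."""
    scores = X @ beta + beta0
    misclassified = y * scores <= 0  # the set M (boundary points counted as misclassified)
    return -np.sum(y[misclassified] * scores[misclassified])
```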


Stochastic Gradient Descent


- To minimize D(\beta, \beta_0), compute the gradient (assuming M is fixed):
  \partial D(\beta, \beta_0) / \partial \beta = -\sum_{i \in M} y_i x_i,
  \partial D(\beta, \beta_0) / \partial \beta_0 = -\sum_{i \in M} y_i.
- Stochastic gradient descent is used to minimize the piecewise linear criterion.
- Adjustment on \beta, \beta_0 is done after each misclassified point is visited.
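
A sketch of the batch gradient over the current misclassified set, reusing the conventions of the earlier snippets (the helper name is mine):

```python
import numpy as np

def perceptron_gradient(X, y, beta, beta0):
    """Returns (dD/dbeta, dD/dbeta0) with the misclassified set M held fixed."""
    scores = X @ beta + beta0
    m = y * scores <= 0            # misclassified set M
    grad_beta = -X[m].T @ y[m]     # -sum_{i in M} y_i x_i
    grad_beta0 = -np.sum(y[m])     # -sum_{i in M} y_i
    return grad_beta, grad_beta0
```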

- The update is
  (\beta, \beta_0) \leftarrow (\beta, \beta_0) + \rho (y_i x_i, y_i).
- Here \rho is the learning rate, which in this case can be taken to be 1 without loss of generality. (Note: if \beta^T x + \beta_0 = 0 is the decision boundary, \rho \beta^T x + \rho \beta_0 = 0 is also the boundary.)
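
Putting the pieces together, a minimal sketch of Rosenblatt's training loop with \rho = 1; the stopping rule (a cap on the number of passes) and the variable names are my additions, not the lecture's.

```python
import numpy as np

def perceptron_train(X, y, rho=1.0, max_epochs=1000):
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            if y[i] * (X[i] @ beta + beta0) <= 0:  # x_i is misclassified
                beta = beta + rho * y[i] * X[i]    # beta  <- beta  + rho * y_i * x_i
                beta0 = beta0 + rho * y[i]         # beta0 <- beta0 + rho * y_i
                updated = True
        if not updated:                            # no misclassified points remain
            break
    return beta, beta0
```

If the classes are linearly separable, a full pass with no updates eventually occurs and the loop returns a separating hyperplane; otherwise it simply stops after max_epochs passes.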


Issues

- If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.
- A number of problems with the algorithm:
  - When the data are separable, there are many solutions, and which one is found depends on the starting values.
  - The number of steps can be very large. The smaller the gap between the classes, the longer it takes to find a separating hyperplane.
  - When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.


Optimal Separating Hyperplanes


- Suppose the two classes can be linearly separated.
- The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class.
- There is a unique solution.
- It tends to have better classification performance on test data.
- The optimization problem:
  \max_{\beta, \beta_0} C
  subject to \frac{1}{\|\beta\|} y_i (\beta^T x_i + \beta_0) \geq C, \quad i = 1, ..., N.
- Every point is at least C away from the decision boundary \beta^T x + \beta_0 = 0.
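
To make the constraint concrete: for a fixed candidate (\beta, \beta_0), the largest feasible C is the smallest signed margin over the data, as in this small illustrative sketch (the function name is mine).

```python
import numpy as np

def smallest_margin(X, y, beta, beta0):
    """Smallest value of y_i (beta^T x_i + beta0) / ||beta|| over all points."""
    return np.min(y * (X @ beta + beta0) / np.linalg.norm(beta))
```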

- For any solution of the optimization problem, any positively scaled multiple is a solution as well. We can set \|\beta\| = 1/C.
- The optimization problem is equivalent to:
  \min_{\beta, \beta_0} \frac{1}{2} \|\beta\|^2
  subject to y_i (\beta^T x_i + \beta_0) \geq 1, \quad i = 1, ..., N.
- This is a convex optimization problem.
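
A sketch of solving this quadratic program directly. It assumes the data are linearly separable and that the cvxpy package is available; the solver choice and the function name are mine, not part of the lecture.

```python
import cvxpy as cp
import numpy as np

def optimal_separating_hyperplane(X, y):
    """Solve min 0.5*||beta||^2 s.t. y_i (beta^T x_i + beta0) >= 1 for all i."""
    n, p = X.shape
    beta = cp.Variable(p)
    beta0 = cp.Variable()
    constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints).solve()
    # The constraint's dual values are the Lagrange multipliers alpha_i used below.
    return beta.value, beta0.value, constraints[0].dual_value
```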


- The Lagrange (primal) function, to be minimized with respect to \beta and \beta_0, is:
  L_P = \frac{1}{2} \|\beta\|^2 - \sum_{i=1}^{N} \alpha_i [y_i (\beta^T x_i + \beta_0) - 1].
- Setting the derivatives with respect to \beta and \beta_0 to zero, we obtain:
  \beta = \sum_{i=1}^{N} \alpha_i y_i x_i,
  0 = \sum_{i=1}^{N} \alpha_i y_i.


- Substituting into L_P, we obtain the Wolfe dual:
  L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N} \alpha_i \alpha_k y_i y_k x_i^T x_k,
  subject to \alpha_i \geq 0.
- This is a simpler convex optimization problem.
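
For completeness, the substitution step spelled out, using \beta = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0 from the previous slide:

```latex
\begin{aligned}
L_P &= \tfrac{1}{2}\|\beta\|^2
      - \sum_i \alpha_i y_i \beta^T x_i
      - \beta_0 \sum_i \alpha_i y_i
      + \sum_i \alpha_i \\
    &= \tfrac{1}{2}\sum_i\sum_k \alpha_i\alpha_k y_i y_k x_i^T x_k
      - \sum_i\sum_k \alpha_i\alpha_k y_i y_k x_i^T x_k
      + \sum_i \alpha_i \\
    &= \sum_i \alpha_i
      - \tfrac{1}{2}\sum_i\sum_k \alpha_i\alpha_k y_i y_k x_i^T x_k
      \;=\; L_D .
\end{aligned}
```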


- The Karush-Kuhn-Tucker conditions require:
  \alpha_i [y_i (\beta^T x_i + \beta_0) - 1] = 0, \quad \forall i.
- If \alpha_i > 0, then y_i (\beta^T x_i + \beta_0) = 1, that is, x_i is on the boundary of the slab.
- If y_i (\beta^T x_i + \beta_0) > 1, that is, x_i is not on the boundary of the slab, then \alpha_i = 0.
- The points x_i on the boundary of the slab are called support points.
- The solution vector \beta is a linear combination of the support points:
  \beta = \sum_{i: \alpha_i > 0} \alpha_i y_i x_i.
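
A small sketch of this relation, recovering \beta from the multipliers (for instance the dual values returned by the earlier optimal_separating_hyperplane sketch); the numerical tolerance for detecting \alpha_i > 0 is my choice.

```python
import numpy as np

def beta_from_support_points(X, y, alpha, tol=1e-8):
    """beta = sum over support points (alpha_i > 0) of alpha_i * y_i * x_i."""
    support = alpha > tol
    return (alpha[support] * y[support]) @ X[support]
```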
