
# Feed-forward Neural Networks

Ying Wu
Electrical Engineering and Computer Science
Northwestern University
Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
## Connectionism
## History

- in the 1940s
- in the 1950s
- in the 1960s
- again in the 1980s
- Expert systems
- again in the 1990s
- SVM was so hot (Vapnik, 1995)
- Where to go next?
## Outline

- Neuron Model
- Multi-Layer Perceptron
- Radial Basis Function Networks
## Neuron: the Basic Unit

(Figure: a single neuron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$, and one output.)

- Input $\mathbf{x} = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$
- Weights $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T \in \mathbb{R}^{d+1}$
- Net activation
$$\mathrm{net} = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \mathbf{x}$$
- Activation function and output
$$y = f(\mathrm{net}) = f(\mathbf{w}^T \mathbf{x})$$
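As a quick illustrative sketch (not part of the original slides), the neuron computation can be written directly in NumPy; the bias is absorbed by fixing $x_0 = 1$, and all names below are my own:

```python
import numpy as np

def neuron_output(x, w, f):
    """Single neuron: y = f(w^T x). x and w live in R^{d+1}, with x[0] = 1 as the bias input."""
    net = w @ x                        # net = sum_{i=0}^{d} w_i * x_i
    return f(net)

x = np.array([1.0, 0.5, -1.2])         # [1, x_1, x_2]
w = np.array([0.1, 0.8, -0.3])         # [w_0, w_1, w_2]
y = neuron_output(x, w, np.tanh)       # y = f(w^T x)
```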
## Activation Function

- We can use
$$f(x) = \mathrm{sgn}(x) = \begin{cases} 1 & x \geq 0 \\ -1 & x < 0 \end{cases}$$
- Or
$$f(x) = \frac{2}{1 + e^{-2x}} - 1, \qquad f(x) \in (-1, 1)$$
  and its derivative
$$f'(x) = 1 - f^2(x)$$
- Or
$$f(x) = \frac{1}{1 + e^{-x}}, \qquad f(x) \in (0, 1)$$
  and its derivative
$$f'(x) = f(x)\,[1 - f(x)] = \frac{e^{-x}}{(1 + e^{-x})^2}$$
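A minimal NumPy sketch of these three activations and the two derivatives (the function names are mine, not from the slides):

```python
import numpy as np

def sgn(x):
    """Threshold unit: +1 for x >= 0, -1 otherwise."""
    return np.where(x >= 0, 1.0, -1.0)

def tanh_sigmoid(x):
    """f(x) = 2/(1 + e^{-2x}) - 1, which equals tanh(x); range (-1, 1)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def tanh_sigmoid_deriv(x):
    """f'(x) = 1 - f(x)^2."""
    f = tanh_sigmoid(x)
    return 1.0 - f * f

def logistic(x):
    """f(x) = 1/(1 + e^{-x}); range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    """f'(x) = f(x)[1 - f(x)]."""
    f = logistic(x)
    return f * (1.0 - f)
```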
## Outline

- Neuron Model
- Multi-Layer Perceptron
- Radial Basis Function Networks
## Perceptron

(Figure: a two-layer network mapping inputs $x_1, \ldots, x_d$ directly to outputs $z_1, \ldots, z_c$.)

- Two layers (input and linear output)
- Desired output $\mathbf{t} = [t_1, \ldots, t_c]^T \in \mathbb{R}^c$
- Actual output $z_i = \mathbf{w}_i^T \mathbf{x}, \quad i = 1, \ldots, c$
- Learning (Widrow-Hoff)
$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta\,(t_i - z_i)\,\mathbf{x} = \mathbf{w}_i(t) + \eta\,(t_i - \mathbf{w}_i^T \mathbf{x})\,\mathbf{x}$$
- It cannot even solve the simple XOR problem
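A hedged sketch of one Widrow-Hoff (LMS) step for this two-layer perceptron; `W` stacks the $c$ weight vectors as rows and `eta` is the learning rate (names and defaults are assumptions for illustration):

```python
import numpy as np

def widrow_hoff_step(W, x, t, eta=0.1):
    """One LMS update: w_i <- w_i + eta * (t_i - z_i) * x for every output i.
    W: (c, d+1) weight matrix; x: (d+1,) input with x[0] = 1; t: (c,) desired output."""
    z = W @ x                         # actual outputs z_i = w_i^T x
    W = W + eta * np.outer(t - z, x)  # rank-one correction toward the targets
    return W
```

Because every $z_i$ stays linear in $\mathbf{x}$, no sequence of such updates can separate the XOR pattern, which is the limitation the slide points out.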
## Multi-layer Network

(Figure: a three-layer network with input layer $x_1, \ldots, x_d$, a hidden layer, and an output layer.)

- Input layer $\mathbf{x} = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$
- Hidden layer $y_j = f(\mathbf{w}_j^T \mathbf{x}), \quad j = 1, \ldots, n_H$
- Output layer $z_k = f(\mathbf{w}_k^T \mathbf{y}), \quad k = 1, \ldots, c$
- Weight between hidden node $y_j$ and input node $x_i$ is $w_{ji}$
- Weight between output node $z_k$ and hidden node $y_j$ is $w_{kj}$
- May have multiple hidden layers
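A minimal forward-pass sketch for one hidden layer, assuming a logistic activation and a bias unit prepended to both the input and the hidden vector (variable names such as `W_hidden` and `W_out` are mine):

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, W_hidden, W_out, f=logistic):
    """x: (d+1,) with x[0] = 1; W_hidden: (n_H, d+1); W_out: (c, n_H+1).
    Returns the hidden activations y (with bias prepended) and the outputs z."""
    y = f(W_hidden @ x)               # y_j = f(w_j^T x)
    y = np.concatenate(([1.0], y))    # prepend y_0 = 1 so w_{k0} acts as the output bias
    z = f(W_out @ y)                  # z_k = f(w_k^T y)
    return y, z
```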
## Discriminant Function

$$g_k(\mathbf{x}) = z_k = f\!\left( \sum_{j=1}^{n_H} w_{kj}\, f\!\left( \sum_{i=1}^{d} w_{ji}\, x_i + w_{j0} \right) + w_{k0} \right)$$

- Kolmogorov showed that a 3-layer structure is enough to approximate any nonlinear function
- Certainly, the nonlinearity depends on $n_H$, the number of hidden units
- Larger $n_H$ results in overfitting
- Smaller $n_H$ results in underfitting
## Training the Network

- Desired output $\mathbf{t} = [t_1, \ldots, t_c]^T$; network weights $\mathbf{w} = \{\mathbf{w}_j, \mathbf{w}_k\}$
- Training error
$$J(\mathbf{w}) = \frac{1}{2}\, \|\mathbf{t} - \mathbf{z}\|^2$$
- We need to find the best set of $\{\mathbf{w}_j, \mathbf{w}_k\}$ that minimizes $J$
- It can be done through gradient-based optimization. In a general form,
$$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta\, \frac{\partial J}{\partial \mathbf{w}}$$
- To make it clear, let's do it component by component
## Back-propagation (BP): output-hidden $w_{kj}$

- $w_{kj}$ is the weight between output node $k$ and hidden node $j$
$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial \mathrm{net}_k}\, \frac{\partial \mathrm{net}_k}{\partial w_{kj}}$$
- The sensitivity of node $i$ is
$$\delta_i = -\frac{\partial J}{\partial \mathrm{net}_i}$$
- In this case, for the output node $k$,
$$\delta_k = -\frac{\partial J}{\partial \mathrm{net}_k} = -\frac{\partial J}{\partial z_k}\, \frac{\partial z_k}{\partial \mathrm{net}_k} = (t_k - z_k)\, f'(\mathrm{net}_k)$$
- As $\mathrm{net}_k = \sum_{j=1}^{n_H} w_{kj}\, y_j$, it is clear that
$$\frac{\partial \mathrm{net}_k}{\partial w_{kj}} = y_j$$
- So we have
$$\Delta w_{kj} = \eta\, \delta_k\, y_j = \eta\, (t_k - z_k)\, f'(\mathrm{net}_k)\, y_j$$
- This is a generalization of Widrow-Hoff
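In code, and under the same assumptions as the forward-pass sketch above (`net_out = W_out @ y`, and `f_prime` the derivative of the activation), the output-layer sensitivities and update for one sample look roughly like this:

```python
import numpy as np

def output_layer_update(W_out, y, z, t, net_out, f_prime, eta=0.1):
    """delta_k = (t_k - z_k) * f'(net_k);  w_kj <- w_kj + eta * delta_k * y_j."""
    delta_out = (t - z) * f_prime(net_out)        # sensitivities of the output nodes
    W_out = W_out + eta * np.outer(delta_out, y)  # rank-one update of all w_kj at once
    return W_out, delta_out
```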
## Back-propagation (BP): hidden-input $w_{ji}$

- As before,
$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\, \underbrace{\frac{\partial y_j}{\partial \mathrm{net}_j}\, \frac{\partial \mathrm{net}_j}{\partial w_{ji}}}_{\text{easy}}$$
- The first factor expands as
$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k)\, \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, \frac{\partial z_k}{\partial \mathrm{net}_k}\, \frac{\partial \mathrm{net}_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, f'(\mathrm{net}_k)\, w_{kj} = -\sum_{k=1}^{c} \delta_k\, w_{kj}$$
- We can compute the sensitivity for the hidden node $j$
$$\delta_j = -\frac{\partial J}{\partial \mathrm{net}_j} = -\frac{\partial J}{\partial y_j}\, \frac{\partial y_j}{\partial \mathrm{net}_j} = f'(\mathrm{net}_j) \sum_{k=1}^{c} w_{kj}\, \delta_k$$
## Why is it Called Back Propagation?

(Figure: the output sensitivities $\delta_1, \ldots, \delta_c$ flow backward through the weights $w_{kj}$ to hidden node $j$.)

- The sensitivity $\delta_i$ reflects how the error depends on node $i$
- The sensitivity $\delta_j$ of a hidden node $j$ combines two sources of information: the back-propagated output sensitivities $\sum_{k=1}^{c} w_{kj}\, \delta_k$ and the local derivative $f'(\mathrm{net}_j)$
- The learning rule for the hidden-input weights is
$$\Delta w_{ji} = \eta\, \delta_j\, x_i = \eta\, f'(\mathrm{net}_j) \left[ \sum_{k=1}^{c} w_{kj}\, \delta_k \right] x_i$$
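Continuing the same sketch, the hidden sensitivities come from pushing the output sensitivities back through $w_{kj}$; the bias column of `W_out` is skipped because no weight feeds into the constant $y_0 = 1$ (again, all names are assumptions):

```python
import numpy as np

def hidden_layer_update(W_hidden, W_out, x, net_hidden, delta_out, f_prime, eta=0.1):
    """delta_j = f'(net_j) * sum_k w_kj delta_k;  w_ji <- w_ji + eta * delta_j * x_i."""
    back = W_out[:, 1:].T @ delta_out             # sum_k w_kj * delta_k for every hidden j
    delta_hidden = f_prime(net_hidden) * back     # hidden sensitivities
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x)
    return W_hidden, delta_hidden
```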
## Algorithm: Back-propagation (BP)

Algorithm 1: Stochastic Back-propagation

- Init: $n_H$, $\mathbf{w}$, stop criterion $\theta$, learning rate $\eta$, $k = 0$
- Do $k \leftarrow k + 1$
  - randomly pick $\mathbf{x}^k$
  - forward: compute $\mathbf{y}$ and then $\mathbf{z}$
  - backward: compute $\{\delta_k\}$ and then $\{\delta_j\}$
  - $w_{kj} \leftarrow w_{kj} + \eta\, \delta_k\, y_j$
  - $w_{ji} \leftarrow w_{ji} + \eta\, \delta_j\, x_i$
- Until $J(\mathbf{w}) < \theta$
- Return $\mathbf{w}$

It can be easily extended to batch training.
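Putting the pieces together, a compact sketch of Algorithm 1 for a one-hidden-layer network with logistic units; the function name, the random initialization scale, and the defaults for `eta` and `theta` are my assumptions:

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_bp(X, T, n_H=5, eta=0.1, theta=1e-3, max_iter=100000, rng=np.random.default_rng(0)):
    """Stochastic back-propagation. X: (n, d+1) inputs with leading 1s; T: (n, c) targets."""
    n, d1 = X.shape
    c = T.shape[1]
    Wh = 0.1 * rng.standard_normal((n_H, d1))     # hidden weights w_ji
    Wo = 0.1 * rng.standard_normal((c, n_H + 1))  # output weights w_kj (column 0 is the bias)
    for _ in range(max_iter):
        i = rng.integers(n)                        # randomly pick a sample
        x, t = X[i], T[i]
        # forward pass
        y = np.concatenate(([1.0], logistic(Wh @ x)))
        z = logistic(Wo @ y)
        # backward pass: sensitivities (logistic derivative is f(1-f))
        d_out = (t - z) * z * (1.0 - z)                            # delta_k
        d_hid = (y[1:] * (1.0 - y[1:])) * (Wo[:, 1:].T @ d_out)    # delta_j
        # weight updates
        Wo += eta * np.outer(d_out, y)
        Wh += eta * np.outer(d_hid, x)
        if 0.5 * np.sum((t - z) ** 2) < theta:     # J(w) on the current sample
            break
    return Wh, Wo
```

The stopping test follows the slide literally (the error on the current sample); in practice one would monitor the error over the whole training set, which is the batch extension mentioned above.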
## Bayes Discriminant and MLP

- From the linear discriminative models, we know that the MSE and MMSE solutions approximate the Bayes discriminant asymptotically
- Suppose we have $c$ classes and the desired output is $t_k(\mathbf{x}) = 1$ if $\mathbf{x} \in \omega_k$, and $0$ otherwise
$$J(\mathbf{w}) = \sum_{\mathbf{x}} \left[ g_k(\mathbf{x}; \mathbf{w}) - t_k \right]^2 = \sum_{\mathbf{x} \in \omega_k} \left[ g_k(\mathbf{x}; \mathbf{w}) - 1 \right]^2 + \sum_{\mathbf{x} \notin \omega_k} \left[ g_k(\mathbf{x}; \mathbf{w}) - 0 \right]^2$$
- It can be shown that minimizing $\lim_{n \to \infty} J(\mathbf{w})$ is equivalent to minimizing
$$\int \left[ g_k(\mathbf{x}; \mathbf{w}) - P(\omega_k \mid \mathbf{x}) \right]^2 p(\mathbf{x})\, d\mathbf{x}$$
- This means the output units represent the posteriors
$$g_k(\mathbf{x}; \mathbf{w}) \approx P(\omega_k \mid \mathbf{x})$$
## Outputs as Probabilities

- The outputs $g_k(\mathbf{x}; \mathbf{w})$ approximate the posteriors individually, but they may not sum to 1
- We can use a different activation function for the output layer
- Softmax activation
$$z_k = \frac{e^{\mathrm{net}_k}}{\sum_{m=1}^{c} e^{\mathrm{net}_m}}$$
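A small sketch of the softmax output; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(net):
    """z_k = exp(net_k) / sum_m exp(net_m); outputs are positive and sum to 1."""
    e = np.exp(net - np.max(net))   # shift by the max for numerical stability
    return e / np.sum(e)
```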
## Practice: Number of Hidden Units

- The number of hidden units $n_H$ is a key design parameter of the MLP
- It determines the expressive power of the network and the complexity of the decision boundary
- A smaller number leads to a simpler boundary, and a larger number can produce a very complicated one
- $n_H$ is a free parameter that must be chosen
- Many heuristics were proposed
## Practice: Learning Rates

- To speed it up, we need to use 2nd-order gradient information, e.g., Newton's method, in training
## Practice: Plateaus and Momentum

- On plateaus of the error surface, the gradient $\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$ is very small and learning stalls
- Momentum uses the weight change at the previous iteration
$$\mathbf{w}(k+1) = \mathbf{w}(k) + (1 - \alpha)\, \Delta\mathbf{w}_{bp}(k) + \alpha\, \Delta\mathbf{w}(k-1)$$

Algorithm 2: Stochastic Back-propagation with Momentum

- Init: $n_H$, $\mathbf{w}$, stop criterion $\theta$, learning rate $\eta$, momentum $\alpha$, $k = 0$
- Do $k \leftarrow k + 1$
  - randomly pick $\mathbf{x}^k$
  - $b_{kj} \leftarrow (1 - \alpha)\, \eta\, \delta_k\, y_j + \alpha\, b_{kj}$;  $b_{ji} \leftarrow (1 - \alpha)\, \eta\, \delta_j\, x_i + \alpha\, b_{ji}$
  - $w_{kj} \leftarrow w_{kj} + b_{kj}$;  $w_{ji} \leftarrow w_{ji} + b_{ji}$
- Until $J(\mathbf{w}) < \theta$
- Return $\mathbf{w}$
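The momentum-smoothed update only needs one extra array per weight matrix; a sketch matching Algorithm 2 (with `alpha` as the momentum coefficient, a name I am assuming):

```python
import numpy as np

def momentum_step(W, B, delta, inputs, eta=0.1, alpha=0.9):
    """b <- (1 - alpha) * eta * delta * input + alpha * b;  w <- w + b.
    W and B have the same shape; delta are node sensitivities, inputs the node inputs."""
    B = (1.0 - alpha) * eta * np.outer(delta, inputs) + alpha * B
    W = W + B
    return W, B
```

The same routine is applied to both layers: to $w_{kj}$ with `delta` $= \delta_k$ and `inputs` $= \mathbf{y}$, and to $w_{ji}$ with $\delta_j$ and $\mathbf{x}$.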
## Outline

- Neuron Model
- Multi-Layer Perceptron
- Radial Basis Function Networks
## Radial Basis Function Network

(Figure: a three-layer network whose hidden units apply a kernel $K$ to the inputs $x_1, \ldots, x_d$; the outputs are weighted by $w_{kj}$.)

- The activation function for the hidden units is the Radial Basis Function (RBF), e.g.,
$$K(\|\mathbf{x} - \mathbf{x}_c\|) = \exp\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_c\|^2}{2\sigma^2} \right\}$$
- The output
$$z_k(\mathbf{x}) = \sum_{j=0}^{n_H} w_{kj}\, K(\mathbf{x}, \mathbf{x}_j)$$
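A sketch of the RBF forward computation with Gaussian kernels; `centers` holds the $\mathbf{x}_j$ and `sigma` a shared width, and I treat the $j = 0$ term as a constant bias feature (these are assumptions for illustration, not statements from the slides):

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """K(||x - x_j||) = exp(-||x - x_j||^2 / (2 sigma^2)) for every center x_j."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbf_output(x, centers, sigma, W):
    """z_k(x) = sum_j w_kj K(x, x_j); W is (c, n_H + 1), phi[0] = 1 plays the j = 0 role."""
    phi = np.concatenate(([1.0], rbf_features(x, centers, sigma)))
    return W @ phi
```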
## Interpretation

- It can be treated as function approximation: a linear combination of a set of basis functions
- The hidden units transform the original feature space to another (high-dimensional) feature space by using the kernel
- The output is linear in this new feature space
## Learning

Parameters to learn:

- The basis center $\mathbf{x}_j$ for each hidden node
- The weights $W$

Once the RBF parameters are set, $W$ can be found by pseudo-inverse or the Widrow-Hoff rule.
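With the centers and widths fixed, solving for $W$ by pseudo-inverse is an ordinary linear least-squares problem; a sketch with NumPy, where `Phi` collects the hidden-layer responses of all training samples (names are mine):

```python
import numpy as np

def solve_rbf_weights(Phi, T):
    """Least-squares weights: minimize ||Phi @ W.T - T||^2, i.e. W.T = pinv(Phi) @ T.
    Phi: (n, n_H + 1) hidden responses (with bias column); T: (n, c) target outputs."""
    W_T, *_ = np.linalg.lstsq(Phi, T, rcond=None)   # equivalent to np.linalg.pinv(Phi) @ T
    return W_T.T                                     # W: (c, n_H + 1)
```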