CH 4

Arti cial Neural Networks
Read Ch. 4] Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11] Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics
74
lecture slides for textbook
Machine Learning, T. Mitchell, McGraw Hill, 1997
Connectionist Models
Consider humans: Neuron switching time ~ :001 second Number of neurons ~ 10 Connections per neuron ~ 10 ? Scene recognition time ~ :1 second 100 inference steps doesn't seem like enough ! much parallel computation
10 4 5
Properties of arti cial neural nets (ANN's): Many neuron-like threshold switching units Many weighted interconnections among units Highly parallel, distributed process Emphasis on tuning weights automatically
75
When to Consider Neural Networks

Input is high-dimensional discrete or real-valued (e.g. raw sensor input) Output is discrete or real valued Output is a vector of values Possibly noisy data Form of target function is unknown Human readability of result is unimportant Examples: Speech phoneme recognition Waibel] Image classi cation Kanade, Baluja, Rowley] Financial prediction
76
ALVINN drives 70 mph on highways
Sharp Left
Straight Ahead
Sharp Right
30 Output Units
4 Hidden Units
30x32 Sensor Input Retina
77
Perceptron
x1 x2 w1 w2 x0=1 w0
. . .
xn
wn
i=0
wi xi
o=
1 if
i=0 -1 otherwise
wi xi > 0
o(x ; : : : ; xn) = ?1 if w + w x + 1 otherwise.

1 0 1 1
8 > < > :
+ wnxn > 0
Sometimes we'll use simpler vector notation: 8 > 1 if w ~ > 0 ~ x o(~ ) = < ?1 otherwise. x > :
78
Decision Surface of a Perceptron

x2 + + x1 + x1 + + x2
(a)
(b)
Represents some useful functions What weights represent g(x ; x ) = AND(x ; x )?

1 2 1 2
But some functions not representable e.g., not linearly separable Therefore, we'll want networks of these...
79
Perceptron training rule

wi wi + w i
where
Where: t = c(~ ) is target value x o is perceptron output is small constant (e.g., .1) called learning rate
wi = (t ? o)xi
80
Perceptron training rule

Can prove it will converge If training data is linearly separable and su ciently small
81
Gradient Descent
To understand, consider simpler linear unit, where o = w + w x + + w n xn Let's learn wi's that minimize the squared error E w] 2 d2D(td ? od) ~ 1X Where D is set of training examples
0 1 1 2
82
Gradient Descent
25 20 15
E[w]
10 5 0 2 1 -2 0 1 -1 2 3 w1 w0 -1 0
Gradient
Training rule: i.e.,
rE w] ~
2 6 4
@E ; @E ; @w @w
0 1
@E 3 7 5 @wn
w = ? rE w] ~ ~ @E wi = ? @w i
83
Gradient Descent
@E = @ 1 X(t ? o ) @wi @wi 2 d d d 1 X @ (t ? o ) = 2 d @w d d i @ 1 = 2 X 2(td ? od) @w (td ? od) d i @ (t ? w x ) X = d (td ? od) @w d ~ ~d i @E = X(t ? o )(?x ) @wi d d d i;d
2 2
84
Gradient Descent
Gradient-Descent
(training examples; ) Each training example is a pair of the form h~ ; ti, where ~ is the vector of input values, x x and t is the target output value. is the learning rate (e.g., .05). Initialize each wi to some small random value Until the termination condition is met, Do { Initialize each wi to zero. { For each h~ ; ti in training examples, Do x Input the instance ~ to the unit and x compute the output o For each linear unit weight wi, Do wi wi + (t ? o)xi { For each linear unit weight wi, Do wi w i + w i
85
Summary
Perceptron training rule guaranteed to succeed if Training examples are linearly separable Su ciently small learning rate Linear unit training rule uses gradient descent Guaranteed to converge to hypothesis with minimum squared error Given su ciently small learning rate Even when training data contains noise Even when training data not separable by H
86
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satis ed 1. Compute the gradient rED w] ~ 2. w w ? rED w] ~ ~ ~ Incremental mode Gradient Descent: Do until satis ed For each training example d in D 1. Compute the gradient rEd w] ~ 2. w w ? rEd w] ~ ~ ~ 1 X (t ? o ) ED w] 2 d2D d d ~ 1 (t ? o ) Ed w ] 2 d d ~ Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if made small enough
2 2
87
Multilayer Networks of Sigmoid Units
head hid
...
...
whod hood
F1
F2
88
Sigmoid Unit
x1 x2 w1 w2 x0 = 1 w0
. . .
xn
wn
net = wi xi
i=0
o = (net) =
1 1+e
-net
(x) is the sigmoid function 1 1 + e?x Nice property: d dxx = (x)(1 ? (x)) We can derive gradient decent rules to train One sigmoid unit Multilayer networks of sigmoid units ! Backpropagation
( )
89
Error Gradient for a Sigmoid Unit

@E = @ 1 X (t ? o ) @wi @wi 2 d2D d d @ = 1 X @w (td ? od) 2d i 1 X 2(t ? o ) @ (t ? o ) = 2 d d d @w d d i 1 0 @o A @ = X(td ? od) B? @wd C d i @od @netd X = ? d (td ? od) @net @w d i But we know: @od = @ (netd) = o (1 ? o ) d d @netd @netd ~ x @netd = @ (w ~ d) = x i;d @wi @wi So: @E = ? X (t ? o )o (1 ? o )x d d d d i;d @wi d2D
2 2
90
Backpropagation Algorithm
Initialize all weights to small random numbers. Until satis ed, Do For each training example, Do 1. Input the training example to the network and compute the network outputs 2. For each output unit k ok(1 ? ok )(tk ? ok ) k 3. For each hidden unit h oh(1 ? oh) X wh;k k h 4. Update each network weight wi;j wi;j wi;j + wi;j where wi;j = j xi;j
k2outputs
91
More on Backpropagation
Gradient descent over entire network weight vector Easily generalized to arbitrary directed graphs Will nd a local, not necessarily global error minimum { In practice, often works well (can run multiple times) Often include weight momentum wi;j(n) = j xi;j + wi;j (n ? 1) Minimizes error over training examples { Will it generalize well to subsequent examples? Training can take thousands of iterations ! slow! Using network after training is very fast
92
Learning Hidden Layer Representations

Inputs Outputs
A target function: Input 10000000 01000000 00100000 00010000 00001000 00000100 00000010 00000001
! ! ! ! ! ! ! !
Output 10000000 01000000 00100000 00010000 00001000 00000100 00000010 00000001
Can this be learned??

93
Learning Hidden Layer Representations

A network:
Inputs Outputs
Learned hidden layer representation: Input Hidden Output Values 10000000 ! .89 .04 .08 ! 10000000 01000000 ! .01 .11 .88 ! 01000000 00100000 ! .01 .97 .27 ! 00100000 00010000 ! .99 .97 .71 ! 00010000 00001000 ! .03 .05 .02 ! 00001000 00000100 ! .22 .99 .99 ! 00000100 00000010 ! .80 .01 .98 ! 00000010 00000001 ! .60 .94 .01 ! 00000001
94
Training
Sum of squared errors for each output unit 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 500 1000 1500 2000 2500
95
Training
Hidden unit encoding for input 01000000 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 500 1000 1500 2000 2500
96
Training
Weights from inputs to one hidden unit 4 3 2 1 0 -1 -2 -3 -4 -5 0 500 1000 1500 2000 2500
97
Convergence of Backpropagation
Gradient descent to some local minimum Perhaps not global minimum... Add momentum Stochastic gradient descent Train multiple nets with di erent inital weights Nature of convergence Initialize weights near zero Therefore, initial networks near-linear Increasingly non-linear functions possible as training progresses
98
Expressive Capabilities of ANNs

Boolean functions: Every boolean function can be represented by network with single hidden layer but might require exponential (in number of inputs) hidden units Continuous functions: Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer Cybenko 1989; Hornik et al. 1989] Any function can be approximated to arbitrary accuracy by a network with two hidden layers Cybenko 1988].
99
Over tting in ANNs

Error versus weight updates (example 1) 0.01 0.009 0.008 0.007 Training set error Validation set error
Error
0.006 0.005 0.004 0.003 0.002 0 5000 10000 15000 Number of weight updates 20000
Error versus weight updates (example 2) 0.08 0.07 0.06 0.05 Training set error Validation set error
Error
0.04 0.03 0.02 0.01 0 0 1000 2000 3000 4000 Number of weight updates 5000 6000
100
Neural Nets for Face Recognition

left strt rght up
...
...
30x32 inputs
Typical input images 90% accurate learning head pose, and recognizing 1-of-20 faces
101
Learned Hidden Unit Weights

left strt rght up Learned Weights
...
...
30x32 inputs
Typical input images http://www.cs.cmu.edu/ tom/faces.html
102
Alternative Error Functions

Penalize large weights: 1 X X (t ? o ) + X w E (w) 2 d2D k2outputs kd kd ~ ji i;j Train on target slopes as well as values: 2 0 1 3 1 X X 6(t ? o ) + X B @tkd ? @okd C 7 B C 7 6 E (w) 2 d2D k2outputs 6 kd kd ~ @ 4 j jA 7 j 2inputs @xd @xd 5
2 2 2 2
Tie together weights: e.g., in phoneme recognition network
103
Recurrent Networks
y(t + 1) y(t + 1)
x(t) (a) Feedforward network
x(t)
c(t)
(b) Recurrent network
y(t + 1)
x(t)
c(t) y(t)
x(t 1)
c(t 1) y(t 1)
(c) Recurrent network
x(t 2)
c(t 2)
unfolded in time
104

CH 4

Cargado por

Información del documento

Descripción original:

Título original

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

CH 4

Cargado por

Copyright:

Formatos disponibles

Arti cial Neural Networks

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

When to Consider Neural Networks

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

ALVINN drives 70 mph on highways

30x32 Sensor Input Retina

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

o(x ; : : : ; xn) = ?1 if w + w x + 1 otherwise.

8 > < > :

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Decision Surface of a Perceptron

Represents some useful functions What weights represent g(x ; x ) = AND(x ; x )?

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Perceptron training rule

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Perceptron training rule

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Training rule: i.e.,

lecture slides for textbook

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Incremental (Stochastic) Gradient Descent

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Multilayer Networks of Sigmoid Units

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Error Gradient for a Sigmoid Unit

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

Learning Hidden Layer Representations

Output 10000000 01000000 00100000 00010000 00001000 00000100 00000010 00000001

Can this be learned??

lecture slides for textbook

Learning Hidden Layer Representations

lecture slides for textbook

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook

Machine Learning, T. Mitchell, McGraw Hill, 1997

lecture slides for textbook