
Artificial Neural Networks

[Read Ch. 4]
[Recommended exercises: 4.1, 4.2, 4.5, 4.9, 4.11]

Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics


Connectionist Models
Consider humans:
Neuron switching time ~ 0.001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4-5
Scene recognition time ~ 0.1 second
100 inference steps doesn't seem like enough
→ much parallel computation

Properties of artificial neural nets (ANNs):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed processing
Emphasis on tuning weights automatically


When to Consider Neural Networks


Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
Output is discrete or real valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant

Examples:
Speech phoneme recognition [Waibel]
Image classification [Kanade, Baluja, Rowley]
Financial prediction


ALVINN drives 70 mph on highways

[Figure: the ALVINN network — a 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units spanning steering directions from Sharp Left through Straight Ahead to Sharp Right]


Perceptron
[Figure: perceptron unit — inputs x_1 ... x_n with weights w_1 ... w_n, plus a constant input x_0 = 1 with weight w_0, feed a threshold unit that outputs 1 if \sum_{i=0}^{n} w_i x_i > 0 and -1 otherwise]

o(x_1, \ldots, x_n) =
  \begin{cases}
    1  & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\
    -1 & \text{otherwise}
  \end{cases}

Sometimes we'll use simpler vector notation:

o(\vec{x}) =
  \begin{cases}
    1  & \text{if } \vec{w} \cdot \vec{x} > 0 \\
    -1 & \text{otherwise}
  \end{cases}
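A minimal sketch of this threshold unit in Python with NumPy, assuming (as above) that a constant x_0 = 1 is prepended to the input vector; the helper name is illustrative:

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: 1 if w . x > 0 (with x0 = 1 prepended to x), else -1."""
    x = np.concatenate(([1.0], x))      # bias input x0 = 1
    return 1 if np.dot(w, x) > 0 else -1
```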


Decision Surface of a Perceptron


[Figure: (a) a set of positive and negative points in the (x1, x2) plane that a single perceptron's linear decision surface can separate; (b) a set of points (such as XOR) that is not linearly separable]

Represents some useful functions
What weights represent g(x_1, x_2) = AND(x_1, x_2)? (one answer is checked below)

But some functions are not representable
e.g., not linearly separable
Therefore, we'll want networks of these...
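One set of weights that works is w_0 = -0.8, w_1 = w_2 = 0.5; a quick check in Python, reusing the perceptron sketch from the previous slide:

```python
import numpy as np

def perceptron_output(w, x):
    x = np.concatenate(([1.0], x))
    return 1 if np.dot(w, x) > 0 else -1

w_and = np.array([-0.8, 0.5, 0.5])      # w0, w1, w2
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron_output(w_and, np.array([x1, x2])))
# Only (1, 1) pushes the weighted sum above 0, so only it yields +1.
```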


Perceptron training rule


w_i ← w_i + Δw_i

where

Δw_i = η (t − o) x_i

and:
t = c(\vec{x}) is the target value
o is the perceptron output
η is a small constant (e.g., 0.1) called the learning rate
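A hedged sketch of this rule as a training loop in Python with NumPy; the learning-rate default, epoch count, and data format are illustrative assumptions:

```python
import numpy as np

def train_perceptron(examples, eta=0.1, epochs=10):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    n = len(examples[0][0])
    w = np.zeros(n + 1)                      # w[0] is the bias weight w0
    for _ in range(epochs):
        for x, t in examples:                # each example: (input vector, target in {-1, +1})
            x = np.concatenate(([1.0], x))   # prepend x0 = 1
            o = 1 if np.dot(w, x) > 0 else -1
            w += eta * (t - o) * x           # no change when the prediction is already correct
    return w
```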


Perceptron training rule


Can prove it will converge:
if training data is linearly separable
and η is sufficiently small


Gradient Descent
To understand, consider a simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Let's learn w_i's that minimize the squared error

E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where D is the set of training examples
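For concreteness, a minimal sketch of this linear unit and its squared error in Python with NumPy (the function names and data layout are assumptions):

```python
import numpy as np

def linear_output(w, x):
    """o = w0 + w1*x1 + ... + wn*xn."""
    return w[0] + np.dot(w[1:], x)

def squared_error(w, training_examples):
    """E[w] = 1/2 * sum over examples d of (t_d - o_d)^2."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in training_examples)
```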


Gradient Descent

[Figure: the error surface E[w] over weights (w0, w1) is a paraboloid with a single global minimum]

Gradient:

\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \, \nabla E[\vec{w}]

i.e.,

\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}

Gradient Descent
\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2
  = \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
  = \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
  = \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})
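A hedged NumPy sketch of this final expression, computing all components of the gradient at once (X is assumed to carry a leading column of 1s for x_0):

```python
import numpy as np

def batch_gradient(w, X, t):
    """dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d}), vectorized over all weights."""
    o = X @ w                    # linear unit outputs, one per training example
    return -(X.T @ (t - o))      # vector of partial derivatives dE/dw_i
```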


Gradient Descent
Gradient-Descent(training_examples, η)

Each training example is a pair of the form ⟨\vec{x}, t⟩, where \vec{x} is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

Initialize each w_i to some small random value
Until the termination condition is met, Do
  Initialize each Δw_i to zero
  For each ⟨\vec{x}, t⟩ in training_examples, Do
    Input the instance \vec{x} to the unit and compute the output o
    For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t − o) x_i
  For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i
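A runnable sketch of this procedure in Python with NumPy; the fixed epoch count stands in for the unspecified termination condition and, like the default η, is an assumption:

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit o = w . x (with x0 = 1 prepended)."""
    n = len(training_examples[0][0])
    w = 0.01 * np.random.randn(n + 1)          # small random initial weights
    for _ in range(epochs):                    # stands in for "until termination"
        delta_w = np.zeros_like(w)
        for x, t in training_examples:
            x = np.concatenate(([1.0], x))
            o = np.dot(w, x)
            delta_w += eta * (t - o) * x       # accumulate over all examples
        w += delta_w                           # apply the whole batch update at once
    return w
```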


Summary
Perceptron training rule guaranteed to succeed if
  Training examples are linearly separable
  Sufficiently small learning rate η

Linear unit training rule uses gradient descent
  Guaranteed to converge to hypothesis with minimum squared error
  Given sufficiently small learning rate η
  Even when training data contains noise
  Even when training data not separable by H


Incremental (Stochastic) Gradient Descent


Batch mode Gradient Descent:
Do until satisfied
  1. Compute the gradient \nabla E_D[\vec{w}]
  2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]

Incremental mode Gradient Descent:
Do until satisfied
  For each training example d in D
    1. Compute the gradient \nabla E_d[\vec{w}]
    2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]

E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough
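A hedged sketch of the incremental (stochastic) variant for the linear unit, applying an update after every example instead of after a full pass (the epoch count and default η are assumptions):

```python
import numpy as np

def incremental_gradient_descent(training_examples, eta=0.05, epochs=100):
    """Stochastic gradient descent for a linear unit: one weight update per example."""
    n = len(training_examples[0][0])
    w = 0.01 * np.random.randn(n + 1)
    for _ in range(epochs):
        for x, t in training_examples:
            x = np.concatenate(([1.0], x))
            o = np.dot(w, x)
            w += eta * (t - o) * x      # gradient of E_d alone, applied immediately
    return w
```

Contrast with the batch version above, which sums these per-example contributions before changing the weights.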


Multilayer Networks of Sigmoid Units

[Figure: a multilayer network of sigmoid units for speech recognition, mapping two formant-frequency inputs F1 and F2 through a hidden layer to output units for the words "head", "hid", "who'd", and "hood"]


Sigmoid Unit
[Figure: sigmoid unit — inputs x_1 ... x_n with weights w_1 ... w_n, plus x_0 = 1 with weight w_0; net = \sum_{i=0}^{n} w_i x_i and o = \sigma(net) = \frac{1}{1 + e^{-net}}]

\sigma(x) is the sigmoid function:

\sigma(x) = \frac{1}{1 + e^{-x}}

Nice property:

\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))

We can derive gradient descent rules to train:
One sigmoid unit
Multilayer networks of sigmoid units → Backpropagation
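A minimal sketch of the sigmoid, its derivative property, and a sigmoid unit's output in Python with NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # d sigma/dx = sigma(x) * (1 - sigma(x))

def sigmoid_unit_output(w, x):
    x = np.concatenate(([1.0], x))       # x0 = 1
    return sigmoid(np.dot(w, x))         # o = sigma(net), net = sum_i w_i x_i
```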


Error Gradient for a Sigmoid Unit


\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
  = \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2
  = \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
  = \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)
  = - \sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = - \sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}
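A hedged NumPy sketch of this final expression for a single sigmoid unit (X is assumed to include the leading x_0 = 1 column):

```python
import numpy as np

def sigmoid_unit_gradient(w, X, t):
    """dE/dw_i = - sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
    o = 1.0 / (1.0 + np.exp(-(X @ w)))        # sigmoid outputs o_d for all examples
    return -(X.T @ ((t - o) * o * (1.0 - o)))
```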


Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k:
         δ_k ← o_k (1 − o_k)(t_k − o_k)
    3. For each hidden unit h:
         δ_h ← o_h (1 − o_h) \sum_{k \in outputs} w_{h,k} δ_k
    4. Update each network weight w_{i,j}:
         w_{i,j} ← w_{i,j} + Δw_{i,j}
       where Δw_{i,j} = η δ_j x_{i,j}
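A compact sketch of this stochastic update for a single hidden layer, in Python with NumPy; the hidden-layer size, learning rate, epoch count, and weight-initialization scale are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=3, eta=0.3, epochs=5000):
    """Stochastic backpropagation for a one-hidden-layer sigmoid network.
    X: (n_examples, n_in) inputs, T: (n_examples, n_out) targets in [0, 1]."""
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = 0.05 * np.random.randn(n_in + 1, n_hidden)    # input -> hidden (row 0: bias)
    W2 = 0.05 * np.random.randn(n_hidden + 1, n_out)   # hidden -> output (row 0: bias)
    for _ in range(epochs):
        for x, t in zip(X, T):
            xb = np.concatenate(([1.0], x))
            h = sigmoid(xb @ W1)                        # hidden unit outputs o_h
            hb = np.concatenate(([1.0], h))
            o = sigmoid(hb @ W2)                        # output unit outputs o_k
            delta_o = o * (1 - o) * (t - o)             # step 2
            delta_h = h * (1 - h) * (W2[1:] @ delta_o)  # step 3
            W2 += eta * np.outer(hb, delta_o)           # step 4
            W1 += eta * np.outer(xb, delta_h)
    return W1, W2
```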


More on Backpropagation
Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global error minimum
  In practice, often works well (can run multiple times)
Often include weight momentum α:
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n − 1)
Minimizes error over training examples
  Will it generalize well to subsequent examples?
Training can take thousands of iterations → slow!
Using network after training is very fast
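A minimal sketch of how the momentum term changes the weight update above; the function and argument names are assumptions:

```python
def momentum_update(W, grad_term, prev_delta, eta=0.3, alpha=0.9):
    """delta_w(n) = eta * grad_term + alpha * delta_w(n-1).
    grad_term plays the role of delta_j * x_{i,j}; returns the new weights
    and the delta to remember for the next step."""
    delta_w = eta * grad_term + alpha * prev_delta
    return W + delta_w, delta_w
```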


Learning Hidden Layer Representations


[Figure: an 8 x 3 x 8 network — eight input units fully connected to three hidden units, fully connected to eight output units]

A target function:

Input      →  Output
10000000   →  10000000
01000000   →  01000000
00100000   →  00100000
00010000   →  00010000
00001000   →  00001000
00000100   →  00000100
00000010   →  00000010
00000001   →  00000001

Can this be learned??
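A hedged setup for this identity task, reusing the backprop_train sketch from the Backpropagation Algorithm slide; the training parameters are assumptions:

```python
import numpy as np

# Eight one-hot patterns: each input must be reproduced at the eight outputs.
X = np.eye(8)
T = np.eye(8)

# With only 3 hidden units, the network is forced to invent a compact
# (roughly binary) code for the 8 patterns in its hidden layer.
# backprop_train is the one-hidden-layer sketch given earlier.
W1, W2 = backprop_train(X, T, n_hidden=3, eta=0.3, epochs=5000)
```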



Learning Hidden Layer Representations


A network: [Figure: the same 8 x 3 x 8 network of inputs, hidden units, and outputs]

Learned hidden layer representation:

Input      →  Hidden values   →  Output
10000000   →  .89 .04 .08     →  10000000
01000000   →  .01 .11 .88     →  01000000
00100000   →  .01 .97 .27     →  00100000
00010000   →  .99 .97 .71     →  00010000
00001000   →  .03 .05 .02     →  00001000
00000100   →  .22 .99 .99     →  00000100
00000010   →  .80 .01 .98     →  00000010
00000001   →  .60 .94 .01     →  00000001

Training
[Figure: sum of squared errors for each output unit, plotted over training iterations 0 to 2500]


Training

[Figure: the three hidden unit values encoding input 01000000, plotted over training iterations 0 to 2500]


Training
[Figure: weights from the inputs to one hidden unit, plotted over training iterations 0 to 2500]


Convergence of Backpropagation
Gradient descent to some local minimum
  Perhaps not global minimum...
  Add momentum
  Stochastic gradient descent
  Train multiple nets with different initial weights

Nature of convergence
  Initialize weights near zero
  Therefore, initial networks near-linear
  Increasingly non-linear functions possible as training progresses


Expressive Capabilities of ANNs


Boolean functions:
  Every boolean function can be represented by a network with a single hidden layer
  but might require exponential (in number of inputs) hidden units

Continuous functions:
  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]


Overfitting in ANNs


[Figure: error versus number of weight updates for two examples, each showing training set error and validation set error over thousands of weight updates]
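The usual response to the validation-error pattern in these plots is early stopping; a hedged sketch, assuming hypothetical train_step() and validation_error() helpers exist for the network being trained:

```python
def train_with_early_stopping(train_step, validation_error,
                              max_updates=20000, patience=500):
    """Stop once validation error has not improved for `patience` weight updates.
    `train_step` performs one weight update; `validation_error` evaluates the
    current network on held-out data. Both are assumed to exist elsewhere."""
    best_err, best_step, step = float("inf"), 0, 0
    while step < max_updates and step - best_step < patience:
        train_step()
        step += 1
        err = validation_error()
        if err < best_err:
            best_err, best_step = err, step     # remember the best point seen so far
    return best_step, best_err
```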


Neural Nets for Face Recognition


[Figure: face-recognition network — 30x32 image inputs feed a hidden layer, which feeds four output units (left, strt, rght, up); typical input images are shown alongside]

90% accurate learning head pose, and recognizing 1-of-20 faces


Learned Hidden Unit Weights


[Figure: the same network with the learned weights from the 30x32 inputs into each hidden unit displayed as images, alongside the left, strt, rght, up outputs and typical input images]

http://www.cs.cmu.edu/~tom/faces.html


Alternative Error Functions


Penalize large weights:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]

Tie together weights:
  e.g., in phoneme recognition network
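A hedged sketch of how the weight-penalty term alters a single gradient step (the gamma value and function names are assumptions; the error gradient itself is assumed to come from backpropagation as above):

```python
def penalized_update(W, error_gradient, eta=0.3, gamma=0.001):
    """Gradient step on E = squared error + gamma * sum of squared weights.
    The penalty contributes 2 * gamma * W to dE/dW, i.e. weight decay."""
    return W - eta * (error_gradient + 2.0 * gamma * W)
```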


Recurrent Networks
[Figure: (a) a feedforward network mapping input x(t) to output y(t+1); (b) a recurrent network in which context units c(t) feed back into the hidden layer along with x(t); (c) the same recurrent network unfolded in time over inputs x(t), x(t-1), x(t-2), contexts c(t), c(t-1), c(t-2), and outputs y(t+1), y(t), y(t-1)]
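A minimal sketch of the recurrence in panel (b), in the style of an Elman network; the layer shapes and function names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_forward(xs, W_in, W_ctx, W_out):
    """Process a sequence: the hidden state at time t depends on x(t) and on the
    context c(t), a copy of the previous hidden state; each step emits y(t+1)."""
    c = np.zeros(W_ctx.shape[0])            # context units, initially zero
    outputs = []
    for x in xs:
        h = sigmoid(W_in @ x + W_ctx @ c)   # hidden layer sees input and context
        outputs.append(sigmoid(W_out @ h))  # prediction y(t+1)
        c = h                               # copy hidden state into context
    return outputs
```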

