Notes on Backpropagation

Notations

Let’s begin with a notation which lets us refer to weights in the network in an unambiguous way. We’ll use $w^l_{jk}$ to denote the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer. So, for example, the diagram below shows the weight on a connection from the 4th neuron in the 2nd layer to the 2nd neuron in the 3rd layer of a network:

We use a similar notation for the network’s biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j$th neuron in the $l$th layer. And we use $a^l_j$ for the activation of the $j$th neuron in the $l$th layer. The following diagram shows examples of these notations in use.

With these notations, the activation $a^l_j$ of the $j$th neuron in the $l$th layer is related to the activations in the $(l-1)$th layer by the equation:

$$a^l_j = \sigma\left(\sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j\right)$$

where $n_{l-1}$ is the number of neurons in the $(l-1)$th layer, and $\sigma$ is an activation function, such as sigmoid, tanh, or ReLU.

To rewrite this expression in matrix form we define a weight matrix $w^l$ for each layer $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l$th layer of neurons; that is, the entry in the $j$th row and $k$th column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector $b^l$, where $b^l_j$ is the $j$th entry of the bias vector for the $l$th layer. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$.

The last ingredient we need for the matrix form is the idea of vectorizing a function such as $\sigma$. The idea is that we want to apply a function such as $\sigma$ to every element in a vector $v$. We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.

With these notations in mind, the above equation can be rewritten in the beautiful and compact vectorized form:

$$a^l = \sigma(w^l a^{l-1} + b^l)$$

That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we’ve taken until now. Think of it as a way of escaping index hell, while remaining precise about what’s going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization.

When using the above vectorized equation to compute $a^l$, we compute the intermediate quantity

$$z^l = w^l a^{l-1} + b^l$$

We call $z^l$ the weighted input to the neurons in layer $l$. So we also write

$$a^l = \sigma(z^l)$$

$z^l$ has components

$$z^l_j = \sum_k w^l_{jk} \times a^{l-1}_k + b^l_j$$

$z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$.
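
To make the vectorized form concrete, here is a minimal NumPy sketch of one feedforward step. The layer sizes, the random initialization, and the choice of sigmoid for $\sigma$ are illustrative assumptions, not part of the notes themselves.

```python
import numpy as np

def sigma(z):
    # Sigmoid, one possible choice for the activation function sigma.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 neurons in layer l-1, 4 neurons in layer l.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))       # w^l: row j, column k holds w^l_{jk}
b = rng.standard_normal((4, 1))       # b^l: bias vector of layer l
a_prev = rng.standard_normal((3, 1))  # a^{l-1}: activations of layer l-1

z = w @ a_prev + b  # weighted input z^l = w^l a^{l-1} + b^l
a = sigma(z)        # activation a^l = sigma(z^l), applied elementwise
```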

Cost function

To train a neural network you need some measure of error between computed outputs and the desired target outputs of the training data. The most common measure of error is called mean squared error. However, there are some research results suggesting that a different measure, called cross entropy error, is sometimes preferable to mean squared error.

So, which is better for neural network training: mean squared error or mean cross entropy error? The answer is, as usual, that it depends on the particular problem. Research results in this area are rather difficult to compare. If one of the error functions were clearly superior to the other in all situations, there would be no need for articles like this one. The consensus opinion among my immediate colleagues is that it’s best to try mean cross entropy error first; then, if you have time, try mean squared error.
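
For reference, one common convention for these two measures on a single training example, with target values $y_j$ and network outputs $a^L_j$, is

$$C_{MSE} = \frac{1}{2} \sum_j (y_j - a^L_j)^2$$

$$C_{CE} = -\sum_j \left[ y_j \ln a^L_j + (1 - y_j) \ln (1 - a^L_j) \right]$$

(the factor $\frac{1}{2}$ and the binary form of cross entropy are just one convention; other variants exist).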

The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w^l$ and $\partial C / \partial b^l$ of the cost function $C$ with respect to the weights $w^l$ and biases $b^l$ in each layer $l$ of the network. For backpropagation to work we need to make two main assumptions about the form of the cost function.

The first assumption is that the cost function can be written as an average over the cost functions of individual training examples:

$$C = \frac{1}{M} \sum_{m=1}^{M} C_m$$

where $M$ is the number of training examples, and $C_m$ is the cost of the $m$th training example.

What backpropagation actually lets us do is compute the partial derivatives $\partial C_m / \partial w^l$ and $\partial C_m / \partial b^l$ for a single training example. We then recover $\partial C / \partial w^l$ and $\partial C / \partial b^l$ by averaging over training examples.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network. The cost $C$ is a function of the activations $a^L_j$ of the last layer $L$:

$$C = C^L(a^L_1, \cdots, a^L_i, \cdots, a^L_{n_L}) = C^L(\sigma(z^L_1), \cdots, \sigma(z^L_i), \cdots, \sigma(z^L_{n_L})) = f^L(z^L_1, \cdots, z^L_i, \cdots, z^L_{n_L})$$

where $1 \le i \le n_L$ and $n_L$ is the number of neurons in the last layer $L$.


Before we go into the detailed calculation of backpropagation, we need to introduce some related multivariable calculus.

Chain rule for partial derivatives of multivariable functions


Given the following multivariable functions $y$ and $x_i$:

$$y = y(x_1, \cdots, x_i, \cdots, x_n)$$
$$x_i = x_i(t_1, \cdots, t_j, \cdots, t_m); \quad 1 \le i \le n$$

the chain rule for partial derivatives of multivariable functions gives

$$\frac{\partial y}{\partial t_j} = \sum_{i=1}^{n} \frac{\partial y}{\partial x_i} \times \frac{\partial x_i}{\partial t_j}; \quad 1 \le j \le m$$
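
As a quick sanity check of the rule (a made-up example, not part of the derivation that follows): take $y = x_1 \times x_2$ with $x_1 = t$ and $x_2 = t^2$, so $n = 2$ and $m = 1$. The rule gives

$$\frac{\partial y}{\partial t} = x_2 \times 1 + x_1 \times 2t = t^2 + 2t^2 = 3t^2$$

which matches differentiating $y = t^3$ directly.
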
If we slice a neural network and keep the layers from $l$ to $L$, and treat all neurons in these layers as one big black box, then we can also rewrite the cost $C$ as a new function with input $(z^l_1, \cdots, z^l_j, \cdots, z^l_{n_l})$:

$$C = C^l(a^l_1, \cdots, a^l_j, \cdots, a^l_{n_l}) = C^l(\sigma(z^l_1), \cdots, \sigma(z^l_j), \cdots, \sigma(z^l_{n_l})) = f^l(z^l_1, \cdots, z^l_j, \cdots, z^l_{n_l})$$

where $1 \le j \le n_l$ and $n_l$ is the number of neurons in layer $l$.

Note:

$C$ corresponds to $y$ in the previous statement of the multivariable chain rule.

$(z^l_1, \cdots, z^l_j, \cdots, z^l_{n_l})$ corresponds to $(x_1, \cdots, x_i, \cdots, x_n)$.

We can also rewrite $z^l_j$ as a new function with input $(z^{l-1}_1, \cdots, z^{l-1}_k, \cdots, z^{l-1}_{n_{l-1}})$:

$$z^l_j = \sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j = z^l_j(a^{l-1}_1, \cdots, a^{l-1}_k, \cdots, a^{l-1}_{n_{l-1}})$$
$$= \sum_{k=1}^{n_{l-1}} w^l_{jk} \times \sigma(z^{l-1}_k) + b^l_j = z^l_j(\sigma(z^{l-1}_1), \cdots, \sigma(z^{l-1}_k), \cdots, \sigma(z^{l-1}_{n_{l-1}})) = s^l_j(z^{l-1}_1, \cdots, z^{l-1}_k, \cdots, z^{l-1}_{n_{l-1}})$$

where $1 \le k \le n_{l-1}$ and $n_{l-1}$ is the number of neurons in layer $l-1$.

Note:

$z^l_j$ corresponds to $x_i$ in the previous statement of the multivariable chain rule.

$(z^{l-1}_1, \cdots, z^{l-1}_k, \cdots, z^{l-1}_{n_{l-1}})$ corresponds to $(t_1, \cdots, t_j, \cdots, t_m)$.

We will show how to apply this chain rule to the backpropagation algorithm in neural networks soon.

Now we introduce a linear algebraic operation used in backpropagation.

The Hadamard product

Suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j \times t_j$. For example:

$$\begin{bmatrix} s_1 \\ \vdots \\ s_j \\ \vdots \\ s_n \end{bmatrix} \odot \begin{bmatrix} t_1 \\ \vdots \\ t_j \\ \vdots \\ t_n \end{bmatrix} = \begin{bmatrix} s_1 \times t_1 \\ \vdots \\ s_j \times t_j \\ \vdots \\ s_n \times t_n \end{bmatrix}$$
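
In NumPy, for instance, the Hadamard product is simply the elementwise `*` operator (a small illustration, not from the notes above):

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])
print(s * t)  # Hadamard product s ⊙ t: [ 4. 10. 18.]
```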

Calculation of Backpropagation

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. But to compute those, we first introduce an intermediate quantity, $\delta^l_j$, which we call the error in the $j$th neuron in the $l$th layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.

We define the error of neuron $j$ in layer $l$ by

$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$

If we have $\delta^l_j$, then we can easily calculate

$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \times \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j \right)}{\partial b^l_j} = \delta^l_j \times 1 = \delta^l_j$$

The above equation can be written in vectorized form, defining the derivative of $C$ with respect to $b^l$ as

$$\nabla_{b^l} C = \frac{\partial C}{\partial b^l} = \delta^l$$

We can also easily calculate

$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \times \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j \right)}{\partial w^l_{jk}} = \delta^l_j \times a^{l-1}_k$$

The above equation can be written in vectorized form, defining the derivative of $C$ with respect to $w^l$ as

$$\nabla_{w^l} C = \frac{\partial C}{\partial w^l} = \delta^l \times (a^{l-1})^T$$

Note:

$w^l$ and $\nabla_{w^l} C$ are both $n_l \times n_{l-1}$ matrices. $\delta^l$ is a vector, i.e. an $n_l \times 1$ matrix, and $(a^{l-1})^T$ is a $1 \times n_{l-1}$ matrix.

When the activation $a^{l-1}_k$ is small, $a^{l-1}_k \approx 0$, the gradient term $\partial C / \partial w^l_{jk}$ will also tend to be small. In this case, we’ll say the weight learns slowly, meaning that it’s not changing much during gradient descent. In other words, one consequence is that weights associated with low-activation neurons learn slowly.
The key issue is how to calculate $\delta^l_j$ for each layer $l$ ($L \ge l \ge 2$).

First, we can easily calculate $\delta^L_j$ for the last layer $L$:

$$\delta^L_j = \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \frac{\partial \sigma(z^L_j)}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \sigma'(z^L_j)$$

The above equation can be rewritten in vectorized form:

$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$

We can now start from layer $L$ and go backwards layer by layer, calculating the $\delta^{l-1}_k$ of the previous layer $l-1$ from the $\delta^l_j$ of the current layer $l$, using the multivariable chain rule introduced above:

$$\delta^{l-1}_k = \frac{\partial C}{\partial z^{l-1}_k} = \sum_{j=1}^{n_l} \frac{\partial f^l}{\partial z^l_j} \times \frac{\partial s^l_j}{\partial z^{l-1}_k} = \sum_{j=1}^{n_l} \frac{\partial C}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial z^{l-1}_k}$$

We have

$$\frac{\partial z^l_j}{\partial z^{l-1}_k} = \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j \right)}{\partial z^{l-1}_k} = \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times \sigma(z^{l-1}_k) + b^l_j \right)}{\partial z^{l-1}_k}$$
$$= \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times \sigma(z^{l-1}_k) + b^l_j \right)}{\partial \sigma(z^{l-1}_k)} \times \frac{\partial \sigma(z^{l-1}_k)}{\partial z^{l-1}_k} = w^l_{jk} \times \sigma'(z^{l-1}_k)$$

So

$$\delta^{l-1}_k = \sum_{j=1}^{n_l} \delta^l_j \times w^l_{jk} \times \sigma'(z^{l-1}_k)$$

The above equation can be written in vectorized form:

$$\delta^{l-1} = \left( (w^l)^T \delta^l \right) \odot \sigma'(z^{l-1})$$

Note:

From the above equation, we get $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$. Consider the term $\sigma'(z^l_k)$: if the activation function $\sigma$ is the sigmoid function, then $\sigma$ becomes very flat when $\sigma(z^l_k)$ is approximately 0 or 1, and when this occurs we have $\sigma'(z^l_k) \approx 0$, so $\delta^l_k \approx 0$. Since $\partial C / \partial w^l_{kj} = \delta^l_k \times a^{l-1}_j$, the lesson is that the weights into a neuron in layer $l$ will learn slowly if that neuron’s activation $\sigma(z^l_k)$ is either low ($\approx 0$) or high ($\approx 1$). In this case it’s common to say the neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold for the biases as well.

We can summarize the above results in vectorized form as follows:

$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$
$$\delta^{l-1} = \left( (w^l)^T \delta^l \right) \odot \sigma'(z^{l-1})$$
$$\nabla_{b^l} C = \delta^l$$
$$\nabla_{w^l} C = \delta^l \times (a^{l-1})^T$$

Implementation of Backpropagation

The previous section derived backpropagation; let’s now write it out explicitly in the form of an algorithm.

First, we start with only a single training example $x^{(m)}$.

1. Input vector $x^{(m)}$: Assign $x^{(m)}$ to the activation $a^1$ of the input layer.
2. Feedforward: For each layer $l = 2, 3, \cdots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Error in the last layer $\delta^L$: Compute the vector $\delta^L = \nabla_{a^L} C_m \odot \sigma'(z^L)$.
4. Backpropagate the error to previous layers: For each $l = L-1, L-2, \cdots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
5. Output: The gradient of the cost function is given by $\nabla_{b^l} C_m = \delta^l$ and $\nabla_{w^l} C_m = \delta^l \times (a^{l-1})^T$.
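
The sketch below is one way to realize steps 1–5 in NumPy for a single example. It assumes sigmoid activations and a quadratic cost $C_m = \frac{1}{2} \|a^L - y\|^2$ (so that $\nabla_{a^L} C_m = a^L - y$); the function and variable names are hypothetical.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Gradients for one training example. weights[i] and biases[i] hold
    w^(i+2) and b^(i+2) in the notes' numbering (layer 1 is the input)."""
    # 1. Input: assign x to the activation a^1 of the input layer.
    a = x
    activations = [a]  # a^1, ..., a^L
    zs = []            # z^2, ..., z^L

    # 2. Feedforward: z^l = w^l a^(l-1) + b^l and a^l = sigma(z^l).
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigma(z)
        activations.append(a)

    # 3. Error in the last layer (quadratic-cost assumption):
    #    delta^L = (a^L - y) ⊙ sigma'(z^L).
    delta = (activations[-1] - y) * sigma_prime(zs[-1])
    grads_b = [delta]                      # nabla_b C_m = delta^l
    grads_w = [delta @ activations[-2].T]  # nabla_w C_m = delta^l (a^(l-1))^T

    # 4. Backpropagate: delta^l = ((w^(l+1))^T delta^(l+1)) ⊙ sigma'(z^l).
    for i in range(2, len(weights) + 1):
        delta = (weights[-i + 1].T @ delta) * sigma_prime(zs[-i])
        grads_b.insert(0, delta)
        grads_w.insert(0, delta @ activations[-i - 1].T)

    # 5. Output: per-layer gradients, aligned with weights and biases.
    return grads_b, grads_w
```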

In practice, it’s common to combine backpropagation with stochastic gradient descent over a mini-batch of $M$ training examples.

1. Input the $M$ training examples of a mini-batch.

2. For each training example $x^{(m)}$: Assign $x^{(m)}$ to the activation $a^1$ of the input layer. Then:

   Feedforward: For each layer $l = 2, 3, \cdots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.

   Error in the last layer $\delta^L$: Compute the vector $\delta^L = \nabla_{a^L} C_m \odot \sigma'(z^L)$.

   Backpropagate the error to previous layers: For each $l = L-1, L-2, \cdots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.

3. Gradient descent: For each $l = L, L-1, \cdots, 2$ update the weights and biases according to the rules

$$w^l \to w^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{w^l} C_m$$

$$b^l \to b^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{b^l} C_m$$

Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. Those are omitted here for simplicity.
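
A sketch of the mini-batch update (step 3 above), reusing the hypothetical `backprop` function from the previous sketch; `eta` is the learning rate $\eta$:

```python
import numpy as np

def update_mini_batch(weights, biases, batch, eta):
    """One SGD step over a mini-batch of (x, y) pairs, updating in place."""
    M = len(batch)
    # Accumulate the per-example gradients over the mini-batch.
    sum_b = [np.zeros_like(b) for b in biases]
    sum_w = [np.zeros_like(w) for w in weights]
    for x, y in batch:
        grads_b, grads_w = backprop(weights, biases, x, y)
        sum_b = [sb + gb for sb, gb in zip(sum_b, grads_b)]
        sum_w = [sw + gw for sw, gw in zip(sum_w, grads_w)]
    # w^l -> w^l - (eta/M) * sum of nabla_w C_m, and likewise for b^l.
    weights[:] = [w - (eta / M) * sw for w, sw in zip(weights, sum_w)]
    biases[:] = [b - (eta / M) * sb for b, sb in zip(biases, sum_b)]
```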

In what sense is backpropagation a fast algorithm?


You decide to regard the cost as a function of the weights alone, $C = C(w)$ (we’ll get back to the biases in a moment). You number the weights $w_1, w_2, \cdots$ and want to compute $\partial C / \partial w_j$ for some particular weight $w_j$. An obvious way of doing that is to use the approximation

$$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}$$

where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j$th direction. In other words, we can estimate $\partial C / \partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$. The same idea will let us compute the partial derivatives $\partial C / \partial b$ with respect to the biases.
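
A minimal sketch of this finite-difference estimate, treating the cost as a black-box function of a flat weight vector; `cost`, `eps`, and the example below are illustrative stand-ins:

```python
import numpy as np

def numerical_gradient(cost, w, eps=1e-5):
    """Estimate dC/dw_j for every j via (C(w + eps*e_j) - C(w)) / eps."""
    base = cost(w)              # one evaluation of C(w)
    grad = np.zeros_like(w)
    for j in range(w.size):
        w_step = w.copy()
        w_step.flat[j] += eps   # perturb only the j-th weight
        grad.flat[j] = (cost(w_step) - base) / eps
    return grad

# Example: C(w) = ||w||^2 has exact gradient 2w.
w = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda v: np.sum(v ** 2), w))  # ~[ 2., -4.,  1.]
```

Note that each weight requires its own extra evaluation of the cost, which is exactly why this approach scales so badly, as discussed next.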

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w + \epsilon e_j)$ in order to compute $\partial C / \partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that’s a total of a million and one passes through the network.

What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C / \partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as that of the forward pass. (This should be plausible, but it requires some analysis to make a careful statement. It’s plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it’s multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.) And so the total cost of backpropagation is roughly the same as making just two forward passes through the network.

So even though backpropagation appears superficially more complex than the obvious simple approach above, it’s actually much, much faster!

References

How the backpropagation algorithm works
(http://neuralnetworksanddeeplearning.com/chap2.html)

Neural Network Cross Entropy Error
(https://visualstudiomagazine.com/Articles/2014/04/01/Neural-Network-Cross-Entropy-Error.aspx?p=1)
