7. Advanced Optimization Methods
So far in this book, we have only studied and used Stochastic Gradient Descent (SGD) to optimize
our networks, but there are other optimization methods used in deep learning. Specifically,
these more advanced optimization techniques seek to either:
1. Reduce the amount of time (i.e., number of epochs) needed to obtain reasonable classification accuracy.
2. Make the network more "well-behaved" for a larger range of hyperparameters other than the
learning rate.
3. Ideally, obtain higher classification accuracy than what is possible with SGD.
With the latest incarnation of deep learning, there has been an explosion of new optimization
techniques, each seeking to improve on SGD and provide the concept of adaptive learning rates. As
we know, SGD modifies all parameters in a network equally in proportion to a given learning rate.
However, given that the learning rate of a network is (1) the most important hyperparameter to tune
and (2) a hard, tedious hyperparameter to set correctly, deep learning researchers have postulated
that it's possible to adaptively tune the learning rate (and in some cases, per parameter) as the
network trains.
In this chapter, we'll review adaptive learning rate methods. I'll also provide suggestions on
which optimization algorithms you should be using in your own projects.
Adaptive Learning Rate Methods
In order to understand each of the optimization algorithms in this section, we are going to examine
them in terms of pseudocode, specifically the update step. Much of this chapter has been inspired
by the excellent overviews of optimization methods by Karpathy [26] and Ruder [27]. We'll
extend (and in some cases, simplify) their explanations of these methods to make the content more
digestible.
To get started, let's take a look at an algorithm we are already familiar with: the update phase
of vanilla SGD.
W += -lr * dW
Here we have three values:
1. W: Our weight matrix.
2. lr: The learning rate.
3. dW: The gradient of W.
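As a runnable sketch, the vanilla SGD update can be written directly in NumPy. The weight, gradient, and learning rate values below are hypothetical toy numbers chosen purely for illustration:

```python
import numpy as np

# hypothetical toy values: a weight vector and its gradient
W = np.array([0.5, -0.3, 0.8])
dW = np.array([0.1, -0.2, 0.05])
lr = 0.01

# vanilla SGD update: step against the gradient,
# scaled by a fixed learning rate
W += -lr * dW
```

Every parameter is moved by the same fixed fraction of its gradient; nothing in this update adapts the step size per parameter, which is exactly what the methods below change.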
Our learning rate here is fixed and, provided it is small enough, we know our loss will decrease
during training. We've also seen extensions to SGD which incorporate momentum and Nesterov
acceleration in Chapter 7. Given this notation, let's explore common adaptive learning rate
optimizers you will encounter in your deep learning career.
Adagrad
The first adaptive learning rate method we are going to explore is Adagrad, first introduced by
Duchi et al. [28]. Adagrad adapts the learning rate to the network parameters. Larger updates are
performed on parameters that change infrequently, while smaller updates are done on parameters
that change frequently.
Below we can see a pseudocode representation of the Adagrad update:
cache += (dW ** 2)
W += -lr * dW / (np.sqrt(cache) + eps)
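To see this per-parameter behavior concretely, here is a minimal NumPy sketch of the Adagrad update run for a number of iterations on hypothetical toy values (the weights, gradients, and iteration count are illustrative assumptions, not values from the text):

```python
import numpy as np

# hypothetical toy values: three parameters with a constant gradient
W = np.zeros(3)
cache = np.zeros(3)          # per-parameter sum of squared gradients
lr = 0.01
eps = 1e-8                   # avoids division by zero

dW = np.array([1.0, 0.1, 0.0])

for _ in range(100):
    # accumulate the squared gradient for each parameter
    cache += dW ** 2
    # parameters with a larger cache receive smaller effective updates
    W += -lr * dW / (np.sqrt(cache) + eps)
```

With a constant gradient, the cache normalizes away the gradient's magnitude, so the first two parameters end up with nearly identical updates despite gradients that differ by a factor of ten, while the third parameter, whose gradient is always zero, is never touched.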