Chapter 7. Advanced Optimization Methods

So far in this book, we have only studied and used Stochastic Gradient Descent (SGD) to optimize our networks, but there are other optimization methods used in deep learning. Specifically, these more advanced optimization techniques seek to either:

1. Reduce the amount of time (i.e., number of epochs) needed to obtain reasonable classification accuracy.
2. Make the network more "well-behaved" for a larger range of hyperparameters other than the learning rate.
3. Ideally, obtain higher classification accuracy than what is possible with SGD.

With the latest incarnation of deep learning, there has been an explosion of new optimization techniques, each seeking to improve on SGD and provide the concept of adaptive learning rates. As we know, SGD modifies all parameters in a network equally in proportion to a given learning rate. However, given that the learning rate of a network is (1) the most important hyperparameter to tune and (2) a hard, tedious hyperparameter to set correctly, deep learning researchers have postulated that it is possible to adaptively tune the learning rate (and in some cases, tune it per parameter) as the network trains.

In this chapter, we'll review adaptive learning rate methods. I'll also provide suggestions on which optimization algorithms you should be using in your own projects.

Adaptive Learning Rate Methods

In order to understand each of the optimization algorithms in this section, we are going to examine them in terms of pseudocode, specifically the update step. Much of this chapter has been inspired by the excellent overviews of optimization methods by Karpathy [26] and Ruder [27]. We'll extend (and in some cases, simplify) their explanations of these methods to make the content more digestible.

To get started, let's take a look at an algorithm we are already familiar with, the update phase of vanilla SGD:

    W += -lr * dW

Here we have three values:

1. W: Our weight matrix.
2. lr: The learning rate.
3. dW: The gradient of W.

Our learning rate here is fixed and, provided it is small enough, we know our loss will decrease during training. We've also seen extensions to SGD that incorporate momentum and Nesterov acceleration earlier in this book. Given this notation, let's explore common adaptive learning rate optimizers you will encounter in your deep learning career.

Adagrad

The first adaptive learning rate method we are going to explore is Adagrad, first introduced by Duchi et al. [28]. Adagrad adapts the learning rate to the network parameters: larger updates are performed on parameters that change infrequently, while smaller updates are applied to parameters that change frequently.

Below we can see a pseudocode representation of the Adagrad update:

    cache += (dW ** 2)
    W += -lr * dW / (np.sqrt(cache) + eps)
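To make these two update rules concrete, below is a minimal NumPy sketch that applies the vanilla SGD update and the Adagrad update to the same toy problem. It is only an illustration, not the book's code: the quadratic loss, the curvature values in a, the learning rate, and the variable names are all assumptions chosen to make Adagrad's per-parameter scaling visible.

    import numpy as np

    # Toy problem: minimize L(W) = 0.5 * sum(a * W**2), whose gradient is
    # dW = a * W. The vector "a" gives each parameter a different curvature,
    # so the per-parameter behavior of Adagrad stands out. Illustrative only.
    a = np.array([10.0, 0.1])        # badly scaled curvatures
    W_sgd = np.array([1.0, 1.0])     # parameters updated with vanilla SGD
    W_ada = np.array([1.0, 1.0])     # parameters updated with Adagrad

    lr = 0.05                        # shared learning rate
    cache = np.zeros_like(W_ada)     # Adagrad's running sum of squared gradients
    eps = 1e-7                       # small value to avoid division by zero

    for step in range(100):
        # gradients of the toy loss for each copy of the parameters
        dW_sgd = a * W_sgd
        dW_ada = a * W_ada

        # vanilla SGD: every parameter is scaled by the same learning rate
        W_sgd += -lr * dW_sgd

        # Adagrad: accumulate squared gradients, then scale each parameter's
        # update by the inverse square root of its own accumulated history
        cache += dW_ada ** 2
        W_ada += -lr * dW_ada / (np.sqrt(cache) + eps)

    print("SGD result:    ", W_sgd)
    print("Adagrad result:", W_ada)

Running the sketch shows that the parameter with small gradients (small curvature) receives proportionally larger Adagrad updates than it would under vanilla SGD, which is exactly the behavior described above. Note also that the cache only ever grows, so the effective learning rate shrinks monotonically; this property is the main motivation for the Adagrad variants that appear later in the literature.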

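In practice you will usually rely on a library implementation rather than writing the update yourself. As a hedged illustration (assuming TensorFlow 2.x's bundled Keras; the tiny model, loss, and learning rate are placeholders, not the book's code), selecting Adagrad is just a matter of passing the optimizer object when compiling a model:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adagrad

    # placeholder model: a single dense layer for 10-class classification
    model = Sequential([
        Dense(10, activation="softmax", input_shape=(784,))
    ])

    # the optimizer object maintains the per-parameter cache internally,
    # so switching from SGD to Adagrad is a one-line change
    model.compile(optimizer=Adagrad(learning_rate=0.01),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])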