Optimization Algorithms
— for training deep learning networks
What are optimization algorithms?
Optimization algorithms update the weights and biases based on the calculated gradients; they determine how the weights should be adjusted to minimize the loss function. They define the learning rate, the update rules, and the mechanisms used to navigate the weight space efficiently and reach a minimum of the loss.
We’ll learn about different types of optimizers and their advantages and disadvantages:
Gradient Descent
Gradient descent is the fundamental optimization algorithm used in neural networks. It is a first-order method: it relies only on the first derivative (gradient) of the loss function. It updates the weights and biases in the direction of the negative gradient of the loss, adjusting the parameters by taking small steps proportional to the gradient and gradually reducing the loss. However, vanilla GD can be slow to converge, especially in complex networks.
Update rule: θ = θ − α⋅∇J(θ), where α is the learning rate and ∇J(θ) is the gradient of the loss.
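As a minimal sketch of this update rule, assuming a mean-squared-error loss and a linear model purely for illustration (the names grad_mse and gradient_descent are not from the original):

import numpy as np

def grad_mse(theta, X, y):
    # Gradient of a mean-squared-error loss for a linear model y ≈ X @ theta (illustrative choice)
    return 2.0 / len(X) * X.T @ (X @ theta - y)

def gradient_descent(theta, X, y, alpha=0.01, n_iters=1000):
    for _ in range(n_iters):
        theta = theta - alpha * grad_mse(theta, X, y)   # theta = theta - alpha * grad J(theta)
    return theta

Note that every iteration touches the full dataset X, y, which is exactly what makes plain GD expensive on large datasets.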
Advantages:
- Easy computation.
- Easy to implement.
- Easy to understand.
Disadvantages:
- May get stuck at local minima.
- Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is very large, convergence can take an extremely long time.
- Requires large memory to calculate gradient on the whole dataset.
Stochastic Gradient Descent (SGD)
SGD improves upon GD by updating the weights and biases after each individual training example or a small batch of examples. It introduces randomness, which can help avoid getting stuck in local minima and accelerate training. SGD is computationally efficient and is suitable for large datasets. However, it can exhibit high variance in the convergence path.
θ = θ − α⋅∇J(θ; x(i), y(i)), where (x(i), y(i)) is a single training example.
Because the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity from step to step.
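A minimal per-example sketch, assuming grad_fn is a placeholder function that returns the gradient of the loss on whatever examples it is given (for instance, the grad_mse sketch above):

import numpy as np

def sgd(theta, X, y, grad_fn, alpha=0.01, n_epochs=10):
    for _ in range(n_epochs):
        for i in np.random.permutation(len(X)):         # visit examples in random order
            g = grad_fn(theta, X[i:i+1], y[i:i+1])      # gradient on the single example (x(i), y(i))
            theta = theta - alpha * g                   # theta = theta - alpha * grad J(theta; x(i), y(i))
    return theta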
Advantages:
- Frequent updates of the model parameters, hence it converges in less time.
- Requires less memory, as there is no need to compute the gradient over the whole dataset at once.
- The noisy updates may help the optimizer escape local minima and find new ones.
Disadvantages:
- High variance in model parameters.
- May overshoot even after reaching the global minimum.
- To obtain the same convergence as gradient descent, the learning rate must be reduced slowly over time.
Mini-Batch Gradient Descent
Mini-batch gradient descent combines the benefits of GD and SGD by updating the weights and biases using a mini-batch of examples. It strikes a balance between the computational efficiency of SGD and the stability of GD, reducing the variance of the gradient updates and converging faster than vanilla SGD.
θ = θ − α⋅∇J(θ; B(i)), where B(i) is the i-th mini-batch of training examples.
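A minimal sketch of the mini-batch loop, again assuming a placeholder grad_fn that returns the gradient of the loss on the rows it receives; the batch size of 32 is just a common default, not a value from the original:

import numpy as np

def minibatch_gd(theta, X, y, grad_fn, alpha=0.01, batch_size=32, n_epochs=10):
    n = len(X)
    for _ in range(n_epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]     # indices of one mini-batch B(i)
            theta = theta - alpha * grad_fn(theta, X[batch], y[batch])
    return theta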
Advantages:
- Updates the model parameters frequently while exhibiting lower variance than SGD.
- Requires a medium amount of memory.
All types of Gradient Descent have some challenges:
- Choosing an optimal value of the learning rate: if it is too small, gradient descent may take very long to converge; if it is too large, it may oscillate or diverge.
- Using a single, constant learning rate for all parameters, even though some parameters may be better updated at a different rate.
- May get trapped at local minima.
GD with Momentum
Momentum was introduced to reduce the high variance of SGD and speed up convergence. It accelerates movement in the relevant direction and dampens oscillations in irrelevant directions. One more hyperparameter is used in this method, the momentum term, symbolized by ‘γ’.
V(t) = γ⋅V(t−1) + α⋅∇J(θ)
Now, the weights are updated by θ=θ−V(t).
The momentum term γ is usually set to 0.9 or a similar value.
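A minimal sketch of a single momentum update, mirroring the two equations above (the helper name momentum_step is an illustrative assumption):

import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, gamma=0.9):
    # V(t) = gamma * V(t-1) + alpha * grad J(theta)
    v = gamma * v + alpha * grad
    # theta = theta - V(t)
    theta = theta - v
    return theta, v

# Usage: start with v = np.zeros_like(theta) and call momentum_step once per (mini-)batch gradient.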
Advantages:
- Reduces the oscillations and high variance of the parameters.
- Converges faster than gradient descent.
Disadvantages:
- One more hyper-parameter is added which needs to be selected manually and accurately.
Adagrad
Adagrad adapts the learning rate for each parameter based on the historical sum of squared gradients for that parameter. It scales down the learning rate for frequently occurring features and scales up for rarely occurring features. Adagrad is effective for sparse data, but it can have diminishing learning rates over time, making it less appropriate for long training processes.
This optimizer changes the learning rate: it adapts the learning rate ‘η’ for each parameter θ(i) at every time step ‘t’, using only first-order gradient information (the accumulated history of past gradients) rather than any second derivatives.
The gradient of the loss function with respect to a given parameter θ(i) at time step t:
g(t,i) = ∇J(θ(t,i))
The parameter update for θ(i) at time/iteration t:
θ(t+1,i) = θ(t,i) − ( η / √(G(t,ii) + ϵ) ) ⋅ g(t,i)
Here η is a base learning rate that is modified for each parameter θ(i) at every time step based on the previous gradients calculated for θ(i): G(t,ii) stores the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square root operation, the algorithm performs much worse.
It makes big updates for infrequently updated parameters and small updates for frequently updated parameters.
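A minimal sketch of one Adagrad update, with G as the per-parameter accumulator of squared gradients (the helper name adagrad_step is an illustrative assumption):

import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    # G accumulates the sum of squared gradients per parameter (start from zeros)
    G = G + grad ** 2
    # Per-parameter effective learning rate: eta / sqrt(G + eps)
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G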
Advantages:
- The learning rate changes for each training parameter.
- Don’t need to manually tune the learning rate.
- Able to train on sparse data.
Disadvantages:
- Computationally more expensive, since a running sum of squared gradients must be stored and updated for every parameter.
- The learning rate is always decreasing resulting in slow training.
RMSProp (Root Mean Square Propagation)
RMSProp is an optimizer that also adapts the learning rate for each parameter but is based only on the average of past squared gradients. By dividing the learning rate by the square root of the accumulated squared gradients, RMSProp dampens the updates for frequently occurring features and updates more efficiently for rare features, leading to improved convergence.
sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 − decay_rate)
delta = -learning_rate * gradient / (sqrt(sum_of_gradient_squared) + epsilon)
theta += delta
More precisely, the sum of gradient squared here is actually a decayed sum: the decay rate says that only recent values of gradient² matter, while those from long ago are effectively forgotten. As a side note, the term “decay rate” is a bit of a misnomer. Unlike the momentum term γ we saw earlier, the decay rate here also has a scaling effect: it scales each incoming gradient² by a factor of (1 − decay_rate). In other words, if the decay_rate is set at 0.99, the accumulated sum of squared gradients is roughly (1 − 0.99) = 0.01 of AdaGrad’s, so its square root in the denominator is about sqrt(0.01) = 0.1 of AdaGrad’s, and thus the step is on the order of 10x larger for the same learning rate.
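The pseudocode above maps almost line-for-line onto a runnable per-step update; the defaults below (learning rate 0.001, decay rate 0.9) are common choices, not values prescribed by the text:

import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, decay_rate=0.9, eps=1e-8):
    # s is the decayed average of squared gradients (start from zeros)
    s = decay_rate * s + (1 - decay_rate) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s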
ADAM (Adaptive Moment Estimation)
Adam is a popular adaptive optimization algorithm that combines ideas from RMSProp and momentum. It individually adapts the learning rate for each weight parameter based on both the average of past gradients and the average of past squared gradients, making it robust and efficient for a wide range of neural network architectures. The intuition behind Adam is that we don’t want to roll so fast that we jump over the minimum; instead, we decrease the velocity a little for a more careful search.
M(t) and V(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively:
M(t) = β1⋅M(t−1) + (1 − β1)⋅g(t)
V(t) = β2⋅V(t−1) + (1 − β2)⋅g(t)²
Because M(t) and V(t) are initialized at zero, they are biased towards zero early in training, so bias-corrected estimates are used so that E[M̂(t)] matches E[g(t)] (and likewise for the second moment), where E[f(x)] is the expected value of f(x):
M̂(t) = M(t) / (1 − β1^t)
V̂(t) = V(t) / (1 − β2^t)
To update the parameters:
θ(t+1) = θ(t) − α⋅M̂(t) / (√V̂(t) + ϵ)
Typical values are 0.9 for β1, 0.999 for β2, and 10⁻⁸ for ϵ.
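A minimal sketch of one Adam update, following the moment estimates and bias correction above (the helper name adam_step and the explicit step counter t passed in are illustrative choices):

import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decayed estimates of the first moment (mean) and second moment (uncentered variance)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (m and v start at zero, so early estimates are biased towards zero)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: start with m = v = np.zeros_like(theta) and t = 1, and increment t after every step.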
Advantages:
- The method is very fast and converges rapidly.
- Rectifies the vanishing learning rate and reduces the high variance of the updates.
Disadvantages:
- Computationally costly.