
Most used optimizer

What is an optimization function?

An optimization function is used to improve a neural network model by minimizing the value of its cost (loss) function. To reduce the cost or loss we update the weights, and to update the weights we use an optimizer. Optimizers do their work during backpropagation. The target is to reduce the value of the cost or loss function by updating the weights until we find the global minima.

So what an optimizer does is update the weights in a way that reduces the loss or cost function value.

Some optimization functions are:

Gradient descent Optimizer

Gradient descent depends on the first-order derivative of the loss function. It calculates in which direction the weights should be changed so that the function can reach a minimum.
Formula of gradient descent:
W_t = W_{t-1} - η (∂L/∂W_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
∂L/∂W_{t-1} = slope, the gradient of the loss L with respect to the previous weight

In gradient descent, we pass the whole dataset for training at once. Suppose we have 10k records: we pass all 10k records together, and one iteration happens per pass over the whole data. If we train for 10 epochs, then in each epoch one iteration happens, and that iteration uses the whole dataset, in this case all 10k records.
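
To make this concrete, here is a minimal sketch of full-batch gradient descent in Python with NumPy. Everything in it is an assumption for illustration: a made-up 1-D linear model with true weight 3.0, made-up data, and an arbitrary learning rate; it is not code from any specific library.

import numpy as np

# Toy data: 10k records, mirroring the example above (values are made up).
X = np.random.rand(10_000)
y = 3.0 * X + np.random.randn(10_000) * 0.1  # true weight is 3.0

w = 0.0    # initial weight
lr = 0.5   # learning rate (eta), chosen arbitrarily

for epoch in range(10):
    # One iteration per epoch, computed over the WHOLE dataset.
    grad = np.mean(2 * (w * X - y) * X)  # dL/dW for mean squared error
    w = w - lr * grad                    # W_t = W_{t-1} - eta * dL/dW_{t-1}
    print(f"epoch {epoch + 1}: w = {w:.4f}")  # w moves toward 3.0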

Advantage:
This algorithm is easy to compute and implement.

Disadvantage:
1. Gradient descent uses a lot of memory and time to train.
2. It may also get trapped in a local minimum, and it is computationally costly.

Stochastic gradient descent Optimizer

Formula of stochastic gradient descent:
W_t = W_{t-1} - η (∂L/∂W_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
∂L/∂W_{t-1} = slope, the gradient of the loss L with respect to the previous weight

The formula is the same, so what is the difference?
In gradient descent we pass the whole dataset at a time for training, but in stochastic gradient descent we pass one record at a time. If we have 10k records, the model first trains on the first record, then on the second record, then on the third, and so on. Suppose our epoch count is 10 and we have 10k records in our dataset: then in each epoch 10k iterations happen, one weight update per record, so the training cycle runs 10k times per epoch.
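
As a hedged sketch of the difference, the same assumed toy linear model (made-up data, arbitrary learning rate, true weight 3.0) can be trained one record at a time; the inner loop now performs one weight update per record instead of one per epoch.

import numpy as np

X = np.random.rand(10_000)
y = 3.0 * X + np.random.randn(10_000) * 0.1

w, lr = 0.0, 0.01

for epoch in range(2):
    # One epoch = 10k iterations: one weight update per record.
    for xi, yi in zip(X, y):
        grad = 2 * (w * xi - yi) * xi  # gradient from a single record
        w = w - lr * grad              # same update rule, one sample at a time
    print(f"epoch {epoch + 1}: w = {w:.4f}")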

Advantage:
1. Stochastic gradient descent is less costly than gradient descent.
2. It requires less memory.

Disadvantage:
1. It still takes a lot of time to train.
2. The learning rate has to be decreased slowly over time.


Momentum in SGD:

We use momentum to smooth the curve, i.e. to reduce the high variance in SGD. Momentum softens the convergence. If we plot the path to the global minimum, the gradient descent curve is a smooth line, but the stochastic gradient descent curve is not: it jumps up and down in a zigzag. That means the updates in stochastic gradient descent are noisy. Our target is to reduce that noise, smoothing out the zigzag line, and to do this we use momentum.

To smooth the SGD curve we use exponential smoothing.
The equation is:
F_t = β F_{t-1} + (1 - β) A_t
Here,
F_t = new forecasted value (forecasted means smoothed)
β = constant value
F_{t-1} = previous forecasted or smoothed value
A_t = actual value at time t

The formula says that the new forecast value is β times the previous forecast value plus (1 - β) times the actual value at that time. With this technique the weights assigned to past observations decay exponentially, so the most recent value counts more than older ones: for example, the weight on a_3 is larger than the weight on a_2. Here β is a hyperparameter; it defines how much importance you want to give to your previous data.
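
Here is a minimal sketch of that smoothing rule on a handful of made-up actual values (the series, the seed value, and β = 0.9 are all assumptions for illustration):

import numpy as np

actual = np.array([10.0, 12.0, 11.0, 15.0, 14.0])  # made-up actual values
beta = 0.9              # weight given to the past
forecast = actual[0]    # one common choice: seed with the first observation

for t in range(1, len(actual)):
    # F_t = beta * F_{t-1} + (1 - beta) * A_t
    forecast = beta * forecast + (1 - beta) * actual[t]
    print(f"t={t}: smoothed value = {forecast:.3f}")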

So the main formula of optimization is:
W_new = W_old - η (∂L/∂W_old)

But when we use momentum, the formula becomes:
W_new = W_old - η V_dW

Formula of V_dW: V_dW = β V_dW + (1 - β) dW, where dW is the current gradient ∂L/∂W.

What is V_dW?
V_dW is the velocity of our weight: the exponentially smoothed gradient that controls with how much velocity we increase or decrease the weight.
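
A minimal sketch of the momentum update on the same assumed toy linear model (made-up data; η and β chosen arbitrarily):

import numpy as np

X = np.random.rand(1_000)
y = 3.0 * X + np.random.randn(1_000) * 0.1

w, lr, beta = 0.0, 0.5, 0.9
v = 0.0  # V_dW: exponentially smoothed gradient ("velocity")

for epoch in range(10):
    grad = np.mean(2 * (w * X - y) * X)
    v = beta * v + (1 - beta) * grad  # V_dW = beta*V_dW + (1-beta)*dW
    w = w - lr * v                    # W_new = W_old - eta * V_dW
    print(f"epoch {epoch + 1}: w = {w:.4f}")  # smoother path toward 3.0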

Advantage:
It smooths convergence and helps avoid getting stuck in local minima.

Disadvantage:
The hyperparameter β has to be selected manually.

Mini batch gradient descent Optimizer

Formula of mini-batch gradient descent:
W_t = W_{t-1} - η (∂L/∂W_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
∂L/∂W_{t-1} = slope, the gradient of the loss L with respect to the previous weight

Here also the formula is the same. In mini-batch gradient descent we divide the whole dataset into small chunks, called batches, and then use those batches for training. We still use all the batches, but we pass them one by one, so each update takes less time and uses less memory. The model parameters are updated after every batch.

Suppose we have 1000 records in a dataset and we create 10 chunks, so each chunk contains 100 records. Now we pass these chunks or batches one by one. Compared to gradient descent and stochastic gradient descent this takes less time, because it uses less memory, updates the model parameters frequently, and has less variance. In each epoch the number of iterations equals the number of batches or chunks we have created.

Suppose our epoch count is 10 and we have created 10 batches: then in each epoch 10 iterations happen, one per batch. In the first iteration of an epoch the first chunk (the first 100 records) goes in for training, in the second iteration the second chunk (the second 100 records) goes in, and so on. After training with all 10 chunks, one epoch is done, and the same process repeats in every epoch.
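
A minimal sketch of the batching logic, again under the same toy-model assumptions (1000 made-up records split into 10 chunks of 100, arbitrary learning rate):

import numpy as np

X = np.random.rand(1_000)
y = 3.0 * X + np.random.randn(1_000) * 0.1

w, lr, batch_size = 0.0, 0.5, 100  # 10 chunks of 100 records each

for epoch in range(10):
    # 10 iterations per epoch: one weight update per chunk/batch.
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = np.mean(2 * (w * xb - yb) * xb)
        w = w - lr * grad
    print(f"epoch {epoch + 1}: w = {w:.4f}")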

Advantage:
1. Mini-batch gradient descent takes less time, uses less memory, and is less costly.
2. The variance of the updates is lower.

Adagrad Optimizer

How does adagrad work?
We know that the optimizers above follow the formula given below.
The formula: W_t = W_{t-1} - η (∂L/∂W_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
∂L/∂W_{t-1} = gradient of the loss L with respect to the previous weight

In all types of gradient descent the learning rate is the same for every neuron and every hidden layer, but in adagrad we use a different learning rate for each neuron, and it changes from iteration to iteration.

Why do we need different learning rates?
In a dataset we can see two types of features while using an ANN:

Dense:
A representation in the form of a matrix in which most of the values are not zero. One or some values can be zero, but not most of them.
Example:

0 1 0
1 0 1
1 1 1
1 1 1

Sparse:
A representation in the form of a matrix in which most of the values are zero and only a few are one.
For example:

1 0 1
0 1 0
0 0 0
0 0 0

With the same learning rate we cannot update the weights properly for both kinds of features: dense features, where most values are non-zero, need one learning rate, while sparse features, where most values are zero, need a different one.
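
A small sketch that checks this property on the two example matrices above (treating "most values are zero" informally; the cutoff is an assumption, not a formal rule):

import numpy as np

dense = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [1, 1, 1],
                  [1, 1, 1]])
sparse = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [0, 0, 0],
                   [0, 0, 0]])

for name, m in (("dense", dense), ("sparse", sparse)):
    zero_fraction = np.mean(m == 0)  # share of entries that are zero
    print(f"{name}: {zero_fraction:.0%} of the values are zero")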

So now the equation for adagrad is:
W_t = W_{t-1} - [η / √(α_t + ε)] (∂L/∂W_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
∂L/∂W_{t-1} = gradient of the loss with respect to the previous weight
ε = small positive value
α_t = sum of the squares of all previous gradients

Why do we need epsilon (ε)?
Suppose we had no ε. If α_t were zero, we would be dividing η by zero and the weight update would be undefined. To prevent this we add ε, which is always a small positive number, so that the denominator can never be zero.

What is α_t?
In adagrad we don't use a constant learning rate; we change it after every iteration, and α_t is what changes it. α_t is the sum of the squares of all previous gradients (∂L/∂W). Suppose we are on the third iteration: to find α_t we take the gradients from the first and second iterations, square each of them, and sum the squares. That sum is the value of α_t.

As α_t increases, the effective learning rate decreases. After every iteration α_t grows, so the learning rate shrinks; this is how the learning rate changes at each iteration. At first the learning rate is large, but it gets smaller and smaller as training goes on.
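
A minimal sketch of the adagrad update on the assumed toy linear model (made-up data; η and ε chosen arbitrarily); the printout shows the effective learning rate shrinking as α_t accumulates:

import numpy as np

X = np.random.rand(1_000)
y = 3.0 * X + np.random.randn(1_000) * 0.1

w, lr, eps = 0.0, 0.5, 1e-8
alpha = 0.0  # running sum of squared gradients

for t in range(1, 11):
    grad = np.mean(2 * (w * X - y) * X)
    alpha += grad ** 2                          # alpha_t grows every iteration
    w = w - (lr / np.sqrt(alpha + eps)) * grad  # per-step effective learning rate
    print(f"iter {t}: effective lr = {lr / np.sqrt(alpha + eps):.4f}, w = {w:.4f}")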

Limitation of adagrad:
α_t changes at every iteration, and sometimes it grows far too large. As α_t grows, the learning rate shrinks, so the learning rate can become very small compared to earlier iterations. This hurts the model: with a tiny learning rate the difference between the previous and the new weight also becomes tiny, sometimes effectively zero. Since we rely on changing the weights to reduce the error or loss function, if the weights barely change we can no longer reduce the loss.

RMSprop Optimizer

This optimizer is the solution to the adagrad limitation discussed above.

The problem was that the learning rate can become very, very small because of an ever-increasing alpha value. What RMSprop does is not let alpha grow so much. In adagrad, alpha is the sum of the squares of dL/dW (the gradient of the loss with respect to the weight). In RMSprop we instead take an exponentially weighted moving average of these squared gradients, so the alpha value that kept increasing in adagrad no longer grows unboundedly.

The main formula is the same as what we saw in adagrad, but the technique for getting alpha is different.

Main formula:
W_t = W_{t-1} - [η / √(Wavg_t + ε)] (∂L/∂W_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
∂L/∂W_{t-1} = gradient of the loss with respect to the previous weight
ε = small positive value
Wavg_t = weighted average at time t

Formula for finding alpha:
Wavg_t = β Wavg_{t-1} + (1 - β) (∂L/∂W_t)²
Here,
β = constant value
Wavg_{t-1} = previous weighted average

We can see that we are again squaring (∂L/∂W), but this time it enters through a moving average and is multiplied by (1 - β), a small number, so the result does not grow nearly as high as adagrad's alpha.
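
A minimal sketch of the RMSprop update under the same toy-model assumptions (made-up data; η, β, and ε chosen arbitrarily):

import numpy as np

X = np.random.rand(1_000)
y = 3.0 * X + np.random.randn(1_000) * 0.1

w, lr, beta, eps = 0.0, 0.1, 0.9, 1e-8
wavg = 0.0  # exponentially weighted average of squared gradients

for t in range(1, 21):
    grad = np.mean(2 * (w * X - y) * X)
    wavg = beta * wavg + (1 - beta) * grad ** 2  # stays bounded, unlike adagrad's sum
    w = w - (lr / np.sqrt(wavg + eps)) * grad
    print(f"iter {t}: w = {w:.4f}")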

Adadelta Optimizer

Adadelta works with the delta, the difference between the current weight and the newly updated weight. The difference between adadelta and RMSprop is that adadelta removes the learning rate parameter completely, replacing it with D, the exponential moving average of squared deltas.

So the formula will be:
W_t = W_{t-1} - [√(D_t + ε) / √(α_t + ε)] (∂L/∂W_{t-1})

Formulas for finding alpha and D:
α_t = β α_{t-1} + (1 - β) (∂L/∂W_t)²
D_t = β D_{t-1} + (1 - β) (ΔW_t)²
Here ΔW_t = W_t - W_{t-1}
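
A minimal sketch of the adadelta update on the assumed toy model (made-up data; β and ε arbitrary). Because D starts at zero, adadelta typically moves very slowly at first, so this sketch runs more iterations:

import numpy as np

X = np.random.rand(1_000)
y = 3.0 * X + np.random.randn(1_000) * 0.1

w, beta, eps = 0.0, 0.9, 1e-6
alpha = 0.0  # moving average of squared gradients
D = 0.0      # moving average of squared weight updates (replaces the learning rate)

for t in range(1, 201):
    grad = np.mean(2 * (w * X - y) * X)
    alpha = beta * alpha + (1 - beta) * grad ** 2
    delta_w = -(np.sqrt(D + eps) / np.sqrt(alpha + eps)) * grad  # self-scaled step
    D = beta * D + (1 - beta) * delta_w ** 2
    w = w + delta_w
print(f"w after 200 steps = {w:.4f}")  # creeps toward 3.0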

Adam Optimizer

Adam stands for adaptive moment estimation. It is a combination of momentum and RMSprop: with the RMSprop part we adapt the learning rate so that the α_t value does not become too high, and with the momentum part we smooth the updates just as we did for SGD. So the adam optimizer uses the RMSprop technique to scale the weight updates and the momentum technique for smoothing.
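
A minimal sketch of one-parameter Adam under the same toy-model assumptions (made-up data; η, β1, β2, and ε arbitrary). Note that the published Adam update also includes the bias-correction terms shown in the comments, which the prose above does not cover:

import numpy as np

X = np.random.rand(1_000)
y = 3.0 * X + np.random.randn(1_000) * 0.1

w, lr, beta1, beta2, eps = 0.0, 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0  # momentum term and RMSprop term

for t in range(1, 51):
    grad = np.mean(2 * (w * X - y) * X)
    m = beta1 * m + (1 - beta1) * grad        # momentum: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop: smoothed squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / np.sqrt(v_hat + eps)
print(f"w after 50 steps = {w:.4f}")  # heads toward the true weight 3.0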

Advantage:
Fast, and converges rapidly.

Disadvantage:
Computationally costly.
