Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn Github
Learn OpenCV
Introduction
Setup
ANN
Working process of ANN
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model Basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parser
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
The optimization function improves a neural network model by minimizing the value of the cost (loss) function. To reduce the cost or loss we update the weights, and to update the weights we use an optimizer. Optimizers do their work during backpropagation. The target is to reduce the value of the cost or loss function by updating the weights until we reach the global minimum.
So what an optimizer does is update the weights in such a way that the loss or cost function value decreases.
Gradient descent depends on the first-order derivative of the loss function. The derivative tells us in which direction the weights should be changed, and the change is calculated so that the function can reach a minimum.
Formula of gradient descent:
W_t = W_{t-1} - η·(dL/dW_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
dL/dW_{t-1} = slope, i.e. the gradient of the loss L with respect to the previous weight
In gradient descent, we pass our whole dataset for training at once. Suppose we have 10k records; then we pass all 10k records together. One iteration uses the whole dataset, so if we train for 10 epochs, each epoch consists of exactly one iteration over all 10k records.
Advantage:
This algorithm is easy to compute and implement.
Disadvantages:
1. Gradient descent uses a lot of memory and time to train, which makes it costly.
2. It may also get trapped in a local minimum.
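Below is a minimal NumPy sketch of batch gradient descent on a toy linear-regression problem; the dataset, the mean-squared-error loss, and all numbers are illustrative assumptions, not from these notes.

import numpy as np

# Toy data: y = 3x + noise (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.random((10_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(10_000)

w, lr = 0.0, 0.5           # initial weight and learning rate
for epoch in range(10):    # one iteration per epoch, whole dataset at once
    y_pred = w * X[:, 0]
    grad = 2 * np.mean((y_pred - y) * X[:, 0])  # dL/dw for the MSE loss
    w = w - lr * grad                           # W_t = W_{t-1} - η·dL/dW
print(w)  # close to 3.0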
Formula of stochastic gradient descent:
W_t = W_{t-1} - η·(dL/dW_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
dL/dW_{t-1} = slope (gradient of the loss with respect to the previous weight)
We can see the formula is the same, so what is the difference?
In gradient descent we pass the whole dataset at a time for training, but in stochastic gradient descent we pass one record at a time. If we have 10k records, the model first trains on the first record, then on the second record, then on the third, and so on. Suppose our epoch count is 10 and we have 10k records in our dataset: then 10k iterations happen in each epoch. In other words, within each epoch the training cycle runs once per record, so for 10k records the cycle runs 10k times.
Advantages:
1. Stochastic gradient descent is less costly than gradient descent.
2. It requires less memory.
Disadvantages:
1. It still takes a lot of time to train.
2. It reduces the value of the loss function slowly.
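A minimal sketch of the one-record-at-a-time update, reusing the same illustrative toy problem as above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(10_000)

w, lr = 0.0, 0.1
for epoch in range(10):
    for i in rng.permutation(len(X)):              # 10k iterations per epoch
        grad = 2 * (w * X[i, 0] - y[i]) * X[i, 0]  # gradient from one record
        w = w - lr * grad                          # update after every record
print(w)  # noisy path, but still approaches 3.0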
We use momentum to smooth the curve, i.e. to reduce the high variance of SGD. Momentum softens the convergence. If we plot the path toward the global minimum, the curve of gradient descent is a smooth line, but the stochastic gradient descent line is not smooth: it goes up and down too much, a zigzag line. That zigzag means SGD updates are noisy, so our target is to reduce the noise, which means reducing the zigzag and making the line smooth. To do this we use momentum.
For a smooth SGD curve we use exponential smoothing.
The equation is:
F_t = β·F_{t-1} + (1-β)·A_t
Here,
F_t = new forecasted value (the forecasted value is the smoothed value)
β = constant value
F_{t-1} = previous forecasted (smoothed) value
A_t = actual value at time t
So the formula says that the new forecast value is β times the previous forecast value plus (1-β) times the actual value at that time. This technique assigns weights to the observations in such a way that a more recent observation gets more weight than an older one: the weight on a_2 is smaller than the weight on a_3. Here β is a hyperparameter; it defines how much importance you want to give to your previous data.
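A minimal sketch of exponential smoothing on an assumed noisy series:

import numpy as np

rng = np.random.default_rng(0)
actual = np.sin(np.linspace(0, 6, 200)) + 0.3 * rng.standard_normal(200)

beta = 0.9                       # higher beta -> more weight on past values
smoothed = np.empty_like(actual)
smoothed[0] = actual[0]
for t in range(1, len(actual)):
    # F_t = beta * F_{t-1} + (1 - beta) * A_t
    smoothed[t] = beta * smoothed[t - 1] + (1 - beta) * actual[t]
print(smoothed[-5:])             # much less jagged than the raw series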
So the main formula of optimization is:
W_new = W_old - η·(dL/dW_{t-1})
But when we use momentum, the formula looks like:
W_new = W_old - η·V_dW
Formula of V_dW:
V_dW = β·V_dW + (1-β)·dW
What is V_dW?
V_dW is the velocity of our weight: it tells us with how much velocity we want to increase or decrease the weight.
Advantage:
It does not miss the local minima.
Disadvantage:
We have to select the hyperparameter β manually.
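A minimal sketch of the momentum update inside the toy SGD loop (same illustrative dataset assumption):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(10_000)

w, lr, beta, v = 0.0, 0.1, 0.9, 0.0
for epoch in range(10):
    for i in rng.permutation(len(X)):
        grad = 2 * (w * X[i, 0] - y[i]) * X[i, 0]
        v = beta * v + (1 - beta) * grad   # V_dW = β·V_dW + (1-β)·dW
        w = w - lr * v                     # W_new = W_old - η·V_dW
print(w)  # smoother path toward 3.0 than plain SGD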
Formula of mini-batch gradient descent:
W_t = W_{t-1} - η·(dL/dW_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
dL/dW_{t-1} = slope (gradient of the loss with respect to the previous weight)
Here also the formula is the same. In mini-batch gradient descent, we divide the whole dataset into small chunks, called batches, and then use those batches for training. We use all the batches, but we pass them one by one, so training takes less time and uses less memory. The model parameters are updated after every batch.
Suppose we have 1,000 records in a dataset and we create 10 chunks, so each chunk contains 100 records. Now we pass these chunks (batches) one by one. It takes less time compared to gradient descent and stochastic gradient descent because it uses less memory, updates the model parameters frequently, and has less variance. In each epoch the number of iterations equals the number of batches or chunks we have created.
Suppose our epoch count is 10 and we have created 10 batches. Then in each epoch, 10 iterations happen: the 10 batches go in one by one for training. The first chunk holds the first 100 records, so in the first iteration of an epoch the first chunk (first 100 records) goes in for training; in the second iteration the second chunk (second 100 records) goes, and so on, chunk by chunk. After training on all 10 chunks, one epoch is done, and the same process happens in every following epoch.
Advantages:
1. Mini-batch gradient descent takes less time, uses less memory, and is less costly.
2. The variance is lower.
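A minimal sketch of the mini-batch loop; the batch size of 100 and the toy data are assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(1_000)

w, lr, batch_size = 0.0, 0.1, 100   # 10 batches -> 10 iterations per epoch
for epoch in range(10):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]           # one chunk of 100 records
        grad = 2 * np.mean((w * X[b, 0] - y[b]) * X[b, 0])
        w = w - lr * grad                           # update once per batch
print(w)  # close to 3.0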
How does Adagrad work?
We know that the optimizers above follow the formula given below.
The formula:
W_t = W_{t-1} - η·(dL/dW_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
dL/dW_{t-1} = gradient of the loss with respect to the previous weight
In all the types of gradient descent above, the learning rate is always the same for every neuron and hidden layer, but in Adagrad we use a different learning rate for each neuron, and it changes across iterations.
Why do we need different learning rates?
While using an ANN, we can see two types of features in a dataset.
The features are:
Dense:
A dense feature is represented in the form of a matrix where most of the values are non-zero: one or some values can be zero, but not most of them.
Example:
0 1 0
1 0 1
1 1 1
1 1 1
Sparse:
A sparse feature is represented in the form of a matrix where most of the values are zero and only a few are non-zero.
For example:
1 0 1
0 1 0
0 0 0
0 0 0
With the same learning rate we cannot update the weights properly for both kinds of features: dense features contribute gradients on almost every update, while sparse features contribute gradients only rarely because most of their values are zero, so each needs its own effective learning rate. A quick sketch of the two representations follows below.
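A minimal sketch contrasting the two representations; the scipy.sparse library choice is an assumption of these notes, and the matrices mirror the examples above:

import numpy as np
from scipy import sparse

dense = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [1, 1, 1],
                  [1, 1, 1]])            # most values are non-zero

sp = sparse.csr_matrix([[1, 0, 1],
                        [0, 1, 0],
                        [0, 0, 0],
                        [0, 0, 0]])      # most values are zero
print(sp.nnz, "non-zero values out of", sp.shape[0] * sp.shape[1])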
So now the equation for Adagrad is:
W_t = W_{t-1} - [η/√(α_t + ε)]·(dL/dW_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
ε = small positive value
α_t = sum of the squared gradients from the previous iterations
dL/dW_{t-1} = gradient of the loss with respect to the previous weight
Why do we need epsilon (ε)?
Let's think we don't have ε. If our α_t is zero, we would be dividing by zero, and the effective learning rate η/√α_t would be undefined. To prevent this, we add ε, which is always a small positive number, so that the denominator can never be zero.
What is α_t?
In Adagrad we do not use a constant learning rate: we change the learning rate after every iteration, and α_t is what changes it. α_t is the sum of the squares of all previous gradients, i.e. of the derivatives of the loss with respect to the weight:
α_t = Σ (dL/dW_i)², summed over the previous iterations i
Suppose we are now doing the third iteration. To find the α_t value for the third iteration, we take the gradients dL/dW from the first and second iterations, square each of them, and then sum those squared derivatives.
The result is the α_t value.
If the α value increases, the effective learning rate decreases. After every iteration the α_t value increases, so the effective learning rate decreases: the learning rate keeps changing in each iteration. At first the learning rate is big, but it gets smaller and smaller as the process goes ahead.
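A minimal sketch of the Adagrad update on the same illustrative toy problem:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(1_000)

w, lr, eps, alpha = 0.0, 0.5, 1e-8, 0.0
for epoch in range(100):
    grad = 2 * np.mean((w * X[:, 0] - y) * X[:, 0])
    alpha += grad ** 2                        # accumulate squared gradients
    w = w - lr / np.sqrt(alpha + eps) * grad  # effective LR shrinks over time
print(w)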
Limitation of Adagrad:
We change the α value in every iteration, and sometimes it increases too much. We know that as α starts to increase, the learning rate starts to decrease, so if the α value grows too much the learning rate becomes very small compared to the previous learning rate. This affects the model: when the learning rate is tiny, the difference between the previous and the new weight is also tiny, and sometimes there is almost no difference at all. Since we change the weights in order to reduce the error or loss function, if there is no difference, or very little difference, we can no longer reduce the loss.
This optimizer is the solution to the Adagrad limitation we just discussed. The problem was that the learning rate sometimes becomes very, very small because the α value keeps increasing. What Adadelta/RMSprop does is stop α from increasing so much. In Adagrad we find α by squaring dL/dW (the gradient of the loss with respect to the previous weight) and summing; in RMSprop we instead take an exponential moving average of these squared gradients, so the value that grows without bound in Adagrad no longer grows so much.
The main formula is the same as in Adagrad, but the technique for getting α is different.
Main formula:
W_t = W_{t-1} - [η/√(S_t + ε)]·(dL/dW_{t-1})
Here,
W_t = new weight
W_{t-1} = previous weight
η = learning rate
ε = small positive value
S_t = exponentially weighted average of the squared gradients at time t
Formula for finding S_t:
S_t = β·S_{t-1} + (1-β)·(dL/dW_t)²
Here,
β = constant value
S_{t-1} = previous weighted average
We can see that we are again squaring dL/dW, but this time we also multiply it by (1-β), which is a very small number. Because of this multiplication, the accumulated value does not increase too high the way α does in Adagrad.
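A minimal sketch of the RMSprop update; β = 0.9 is a common but assumed choice, and the toy problem is the same illustrative one:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(1_000)

w, lr, beta, eps, s = 0.0, 0.1, 0.9, 1e-8, 0.0
for epoch in range(100):
    grad = 2 * np.mean((w * X[:, 0] - y) * X[:, 0])
    s = beta * s + (1 - beta) * grad ** 2  # S_t: moving average, not a sum
    w = w - lr / np.sqrt(s + eps) * grad
print(w)  # hovers near 3.0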
Adadelta refers to the difference between the current weight and the newly updated weight. The difference between Adadelta and RMSprop is that Adadelta removes the learning rate parameter completely, replacing it with D, the exponential moving average of squared weight deltas.
So the formula is:
W_t = W_{t-1} - [√(D_{t-1} + ε)/√(S_t + ε)]·(dL/dW_{t-1})
Formulas for finding S and D:
S_t = β·S_{t-1} + (1-β)·(dL/dW_t)²
D_t = β·D_{t-1} + (1-β)·(ΔW_t)²
Here ΔW_t = W_t - W_{t-1}.
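A minimal sketch of the Adadelta update on the toy problem; note that no learning rate η appears anywhere. Adadelta starts slowly, so more iterations are used here:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(1_000)

w, beta, eps, s, d = 0.0, 0.9, 1e-4, 0.0, 0.0
for epoch in range(500):
    grad = 2 * np.mean((w * X[:, 0] - y) * X[:, 0])
    s = beta * s + (1 - beta) * grad ** 2             # avg of squared gradients
    delta = -np.sqrt(d + eps) / np.sqrt(s + eps) * grad
    d = beta * d + (1 - beta) * delta ** 2            # avg of squared deltas
    w = w + delta                                     # no η anywhere
print(w)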
The Adam optimizer stands for adaptive moment estimation. It is a combination of momentum and RMSprop. With the RMSprop part we change the effective learning rate slowly, so the accumulated value does not become too high, and with the momentum part we smooth the curve just as we did for SGD. So the Adam optimizer uses the RMSprop technique to scale the weight updates and the momentum technique for smoothing.
Advantage:
Fast; it converges rapidly.
Disadvantage:
It is computationally costly.
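A minimal sketch of the Adam update combining the two ideas; β1 = 0.9 and β2 = 0.999 are common but assumed defaults, and the bias-correction step is part of standard Adam even though these notes do not cover it:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(1_000)

w, lr, b1, b2, eps = 0.0, 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0
for t in range(1, 101):
    grad = 2 * np.mean((w * X[:, 0] - y) * X[:, 0])
    m = b1 * m + (1 - b1) * grad           # momentum part (1st moment)
    v = b2 * v + (1 - b2) * grad ** 2      # RMSprop part (2nd moment)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)  # converges rapidly to about 3.0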