Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn Github
Learn OpenCV
Introduction
Setup
ANN
Working process of ANN
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model Basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parser
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
We have already learned about the vanishing gradient problem: it occurs in backpropagation, when we update the weights. We saw this when we learned about ANN. The same problem occurs in a simple RNN while updating the weights. A simple RNN has no long-term memory, so it cannot use information from the distant past and cannot learn patterns with long dependencies. These are the problems of a simple RNN, and to solve them we use LSTM.
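To make the vanishing gradient intuition concrete, here is a minimal NumPy sketch (the per-step factor 0.9 is an assumed illustration, not a value from the text): when backpropagation through time multiplies many factors smaller than 1, the gradient reaching the earliest time steps shrinks toward zero.

```python
import numpy as np

# Backpropagation through time multiplies one derivative factor per time step.
# If each factor is slightly below 1 (0.9 is just an assumed example value),
# the product shrinks toward zero as sequences get longer, so early time steps
# barely influence the weight update.
factor = 0.9
for steps in (10, 50, 100):
    gradient_scale = np.prod(np.full(steps, factor))
    print(f"{steps:>3} steps -> gradient scale {gradient_scale:.6f}")
```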
In the image:
Red color marked area: the long-term memory cell
Sigma (σ): the sigmoid activation function
tanh: the hyperbolic tangent activation function
Cross inside circle: a gate that multiplies values
Plus inside circle: a gate that adds values
Cyan color marked area: the short-term memory cell
Ct-1: the output of the previous long-term memory cell
Ht-1: the previous main output
Ht: the current main output
Ct: the output of the current long-term memory cell
X1: the current input
Three inputs come into the neuron: Ht-1, Ct-1, and X1. Inside each neuron we have a long-term memory cell. Ct-1 is the previous neuron's long-term memory cell output, which we take into the current neuron's long-term memory cell. Ht-1 (the previous output) and X1 (the current input) come inside the cell and get concatenated.
At first, the concatenated value goes to the purple marked area.
Process that happens in the purple marked area:
After the concatenation, the value goes to the purple marked area, where we run the sigmoid activation function.
Calculation:
Ft=σ(Wf[Ht-1,X1]+Bf)
here
Bf=bias
Wf=weight
Ht-1=Previous output
X1=Current input
σ=Sigmoid activation function
So here we multiply the weight by the concatenated value of Ht-1 and X1 and then add the bias. After that, we run the sigmoid activation function. We know that the sigmoid activation function squashes any value into the range 0 to 1 (negative inputs give outputs below 0.5, approaching 0; positive inputs give outputs of 0.5 or above, approaching 1). Then the result goes to the first gate of the long-term memory cell (the one with the cross sign in the image). We call this gate the forget gate, because it decides which data or patterns the long-term memory should forget, and which it should keep, when a new value comes. We get the new value by combining the new input (X1) with the previous output (Ht-1) and running the sigmoid activation function.
Calculation in the first gate of the long-term memory cell:
Here we multiply the data that is already present in the long-term memory cell by the new data.
Calculation:
B=Ct-1*Ft
Suppose the value of Ft is [1, 0, 1] and the value of Ct-1 is [1, 2, 4].
So the result of the multiplication is [1, 2, 4] * [1, 0, 1] = [1, 0, 4].
Notice that in the result the 2 of Ct-1 became 0, but 1 and 4 stayed as they were. So we can say that among the values 1, 2, 4 the long-term memory will forget the value 2 and remember the values 1 and 4. The sigmoid function converts the Ft values to the range 0 to 1, where 0 means "no" and 1 means "yes". So after the multiplication, some values of Ct-1 become 0 (forget them) and some stay the same (keep them).
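Below is a minimal NumPy sketch of the forget-gate step described above. The weights, bias, and input vectors are made-up illustrative values; only the vectors [1, 2, 4] and [1, 0, 1] repeat the example from the text.

```python
import numpy as np

def sigmoid(x):
    # squashes any value into the range 0..1
    return 1.0 / (1.0 + np.exp(-x))

# Made-up values just to show the data flow (cell size = 3).
Ht_1 = np.array([0.2, -0.4])                 # previous output
X1 = np.array([0.5, 0.1])                    # current input
Ct_1 = np.array([1.0, 2.0, 4.0])             # previous long-term memory (text's example)

concat = np.concatenate([Ht_1, X1])          # [Ht-1, X1]
Wf = np.random.randn(3, concat.size) * 0.1   # forget-gate weights
Bf = np.zeros(3)                             # forget-gate bias

Ft = sigmoid(Wf @ concat + Bf)               # Ft = sigmoid(Wf[Ht-1, X1] + Bf)
B = Ct_1 * Ft                                # element-wise "forget" of old memory

# The idealized example from the text: Ft = [1, 0, 1] keeps 1 and 4, forgets 2.
print(np.array([1.0, 2.0, 4.0]) * np.array([1.0, 0.0, 1.0]))  # [1. 0. 4.]
```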
Process that happens in the blue marked area:
Here the concatenated value of X1 and Ht-1 goes to the blue marked area and passes through two activation functions, sigmoid and tanh.
Calculation:
Im=σ(Wi[Ht-1,X1]+Bi)
here,
Bi=bias
Wi=weight
Ht-1=Previous output
X1=Current input
σ=Sigmoid activation function
So here we multiply the weight by the concatenated value of Ht-1 and X1 and then add the bias. After that, we run the sigmoid activation function, which squashes the values into the range 0 to 1 (negative inputs give outputs below 0.5, positive inputs give outputs of 0.5 or above). On the other side, we run the tanh activation function on the concatenated value of Ht-1 and X1.
Calculation:
Ct'=tanh(Wc[Ht-1,X1]+Bc)
here,
Bc=bias
Wc=weight
Ht-1=Previous output
X1=Current input
tanh=hyperbolic tangent activation function
So here we multiply the weight by the concatenated value of Ht-1 and X1 and then add the bias. After that, we run the tanh activation function, which squashes the values into the range -1 to 1 (negative inputs go toward -1, positive inputs go toward 1).
Now these two outputs (after applying the sigmoid and tanh activation functions) go into a cross sign gate. This gate multiplies the two outputs that came from the sigmoid and tanh activation functions.
Calculation:
A=Ct'*Im
This multiplication works the same way as in the forget gate. After the multiplication, the value goes to the long-term memory cell's second gate. The second gate is a plus sign gate, which does the addition of values.
Now there can be a question: the addition of what?
This gate adds the long-term memory values we got from the forget-gate calculation to the value we got from the calculation we just did.
Another question: why do we do this addition, or why do we use this gate?
Using the forget gate, we determined which values or patterns of our long-term memory should be forgotten when a new value comes; in other words, we removed unnecessary data or patterns. If we remove some data, we may also have to add some data to the long-term memory, if the new data requires it.
Suppose we have a sentence in long-term memory: "I love Jenny and she is a very good girl". Now suppose the new input is Rafsun, which is a male name. For this male name, we need to change the gender-related words "she" and "girl". So according to the new input we have to forget "she" and "girl", which we did using the forget gate, but besides forgetting we also have to add the new words "he" and "boy". This kind of addition happens through the second gate of the long-term memory cell.
Calculation:
Ct=A+B
Suppose the value of B is [1, 0, 4] and the value of A is [0, 3, 0].
So the result of the addition is [1, 0, 4] + [0, 3, 0] = [1, 3, 4].
Notice that in the result the 0 became 3, but 1 and 4 stayed the same. So our new long-term memory values are 1, 3, 4, and we can say that the long-term memory memorized something new, because a 3 now sits in the position of the 0. This is the value of Ct. Now this Ct value will go to the next neuron as Ct-1.
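Continuing the same made-up sketch, the blue-area calculation and the plus-sign gate can be written like this (weights and biases are again illustrative; the vectors [1, 0, 4] and [0, 3, 0] repeat the text's example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

cell_size = 3
Ht_1 = np.array([0.2, -0.4])              # previous output
X1 = np.array([0.5, 0.1])                 # current input
concat = np.concatenate([Ht_1, X1])

Wi, Bi = np.random.randn(cell_size, concat.size) * 0.1, np.zeros(cell_size)
Wc, Bc = np.random.randn(cell_size, concat.size) * 0.1, np.zeros(cell_size)

Im = sigmoid(Wi @ concat + Bi)            # input gate: how much new info to let in
Ct_candidate = np.tanh(Wc @ concat + Bc)  # candidate values Ct'
A = Ct_candidate * Im                     # scaled new information

B = np.array([1.0, 0.0, 4.0])             # forget-gate result (text's example)
Ct = B + A                                # updated long-term memory

# The idealized example from the text: [1, 0, 4] + [0, 3, 0] = [1, 3, 4]
print(np.array([1.0, 0.0, 4.0]) + np.array([0.0, 3.0, 0.0]))  # [1. 3. 4.]
```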
Process that happens in the green marked area:
From this area, we get the output. After the concatenated value goes to the green marked area, we run the sigmoid activation function there.
Calculation:
ol=σ(Wl[Ht-1,X1]+Bl)
Here,
Bl=bias
Wl=weight
Ht-1=Previous output
X1=Current input
σ=Sigmoid activation function
So here we multiply the weight by the concatenated value of Ht-1 and X1 and then add the bias. After that, we run the sigmoid activation function. We know that the sigmoid activation function squashes values into the range 0 to 1 (negative inputs give outputs below 0.5, positive inputs give outputs of 0.5 or above). After this calculation, the value goes to a cross sign gate, which multiplies values. One value in this multiplication is the result we just computed, and the other is the long-term memory cell value that we calculated a moment ago. But before this multiplication, the long-term memory cell value passes through the tanh activation function.
So the calculation to get Ht is:
Ht=ol*tanh(Ct)
After this calculation, the result is Ht. This Ht output will become the input for the next neuron (in the position of Ht-1).
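A matching sketch for the green-area (output) step, again with illustrative weights and the updated long-term memory from the previous example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Ht_1 = np.array([0.2, -0.4])              # previous output
X1 = np.array([0.5, 0.1])                 # current input
concat = np.concatenate([Ht_1, X1])

Wl, Bl = np.random.randn(3, concat.size) * 0.1, np.zeros(3)
Ct = np.array([1.0, 3.0, 4.0])            # updated long-term memory (text's example)

ol = sigmoid(Wl @ concat + Bl)            # output gate
Ht = ol * np.tanh(Ct)                     # current main output / hidden state
print(Ht)                                 # Ht becomes Ht-1 for the next neuron
```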
This is how the whole process happens in an LSTM.
One disadvantage of LSTM is that it can capture dependencies over roughly 100 time steps, but it struggles when sequences stretch to hundreds or thousands of steps.
GRU stands for Gated Recurrent Unit. It is a modified, lightweight version of LSTM. The difference is that an LSTM has two separate cells, a long-term memory cell and a short-term memory cell, whereas a GRU has only one cell, which holds both long- and short-term memory. An LSTM has three gates (input gate, output gate, and forget gate), while a GRU has two gates: the update gate and the reset gate. The update gate controls how much memory to retain, and the reset gate controls how much past memory to forget.
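In practice, swapping an LSTM layer for a GRU layer is usually a one-line change. A minimal sketch, assuming TensorFlow/Keras (the library choice, layer sizes, and input shape are assumptions, not taken from the text):

```python
import tensorflow as tf

# Toy sequence input: 20 time steps with 8 features each (assumed shape).
inputs = tf.keras.Input(shape=(20, 8))

lstm_out = tf.keras.layers.LSTM(32)(inputs)  # 3 gates + separate long-term cell state
gru_out = tf.keras.layers.GRU(32)(inputs)    # 2 gates (update, reset), single state

lstm_model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(lstm_out))
gru_model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(gru_out))

lstm_model.summary()
gru_model.summary()
```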