Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn Github
Learn OpenCV
Introduction
Setup
ANN
Working process of ANN
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model Basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parser
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
In the transformer model we have encoders and decoders. Our input first goes to the encoders, then to the decoders, and at the end of the decoders' work we get our output. We can see in the image that there are six encoders and six decoders.
The flow is: the input first goes to the first encoder, and after the first encoder has finished its process, the information goes to the second encoder, then to the third, and in this way the information moves on to the last encoder. In other words, the information passes from one encoder to the next only after all the processes inside that encoder are complete, and this continues until the last encoder.
After reaching the last encoder, the information goes to all the decoders at the same time. Then some processing happens in each decoder, and the information also flows from one decoder to the next until it reaches the last decoder. The output we get from the last decoder is the final output.
Here we took 6 encoders and 6 decoders. We can change this number according to our needs, but researchers got good accuracy when they used 6 encoders and 6 decoders.
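To make the flow concrete, here is a minimal NumPy sketch of this stacking, assuming hypothetical encoder_block and decoder_block placeholders for the layers explained later in this section:

```python
import numpy as np

# A minimal sketch of the stacking described above. encoder_block and
# decoder_block are hypothetical placeholders for the real sub-layers
# (self-attention, feed-forward, add and normalize) explained later.

def encoder_block(x):
    return x  # placeholder: self-attention + feed-forward would go here

def decoder_block(y, encoder_output):
    return y  # placeholder: masked self-attention, encoder-decoder attention, feed-forward

num_layers = 6
x = np.random.randn(5, 512)          # 5 input words, each a 512-dimensional vector

# The input passes through the encoders one after another.
for _ in range(num_layers):
    x = encoder_block(x)
encoder_output = x                   # output of the last encoder

# Every decoder uses the last encoder's output.
y = np.random.randn(1, 512)          # decoder-side input (previously generated words)
for _ in range(num_layers):
    y = decoder_block(y, encoder_output)
```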
Inside each encoder of the transformer we have two main layers: one is self-attention and the other is a feed-forward neural network. This feed-forward neural network is an example of an ANN. Each sub-layer (self-attention, feed-forward neural network) in each encoder has a residual connection around it, followed by a layer called the add and normalize layer. All the words go into the encoder at the same time: if we have five words, then all five words enter the encoder at the same time.
So first we convert each word into a vector (X1, X2) using the word embedding technique, where the dimension is 512, though we can change the dimension. These vectors are passed to the self-attention layer, and self-attention gives the outputs Z1, Z2. These outputs go to the feed-forward layer, and the feed-forward layer gives new outputs R1, R2. These outputs then go to the second encoder, where the same process happens. The feed-forward layer of the second encoder gives the final outputs of the second encoder, and those outputs go to the third encoder. This is how the work happens inside the encoders.
The order in which the vectors flow is: first the self-attention layer, then the add and normalize layer, then the feed-forward layer, and then again an add and normalize layer.
Process that happens in the add and normalize layer:
In the add and normalize layer we do an addition of vectors. If this layer comes after the self-attention layer, then we add the input vector and the output vector of the self-attention layer. If it comes after the feed-forward layer, then we add the input and output vectors of the feed-forward layer. Suppose the input of the self-attention layer is the matrix X1 and the output is the matrix Z1; then in the add and normalize layer we compute X1 + Z1 and normalize the result.
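A minimal sketch of this add and normalize step, assuming a simple layer normalization after the residual addition (the exact normalization details are not spelled out above):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize the vector to zero mean and unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def add_and_normalize(sublayer_input, sublayer_output):
    # residual connection: add the sub-layer's input to its output, then normalize
    return layer_norm(sublayer_input + sublayer_output)

x1 = np.random.randn(512)           # input of the self-attention layer (placeholder values)
z1 = np.random.randn(512)           # output of the self-attention layer (placeholder values)
out = add_and_normalize(x1, z1)     # X1 + Z1, then normalized
```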
Suppose we have a sentence:
"The cat didn't cross the road because it was injured".
What does 'it' refer to in this sentence?
Is it referring to the road or the cat?
For the human brain the answer is very easy, but algorithms don't understand this on their own; the self-attention layer is what lets the model relate 'it' to the other words in the sentence while encoding it.
So in the image we can see that we have two inputs: shundor (X1) and chobi (X2). At first we convert the words into vector form using the word embedding technique, and this process happens only once in the whole cycle, before the first encoder. Here embedding means we convert these two words (shundor, chobi) into vector form (shundor -> X1, chobi -> X2) using the embedding technique. The dimension of X1 and X2 is 512. After embedding we also do positional encoding.
Suppose we have four words x1, x2, x3 and x4. Positional encoding captures which word is near which word, or which word comes before or after which word, in other words the position of each word in the sentence. After this, we combine the embedding vector matrix with the positional vector matrix (in the original transformer they are simply added element-wise), and the result is called the embedding with time signal.
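A small sketch of how the embedding with time signal can be computed, assuming the sinusoidal positional encoding used in the original transformer paper and placeholder random embeddings:

```python
import numpy as np

def positional_encoding(num_words, d_model=512):
    # sinusoidal position signal, as in the original transformer paper
    positions = np.arange(num_words)[:, None]                 # 0, 1, 2, ... (word positions)
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((num_words, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions use sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions use cos
    return pe

embeddings = np.random.randn(2, 512)    # X1, X2: embeddings of "shundor" and "chobi" (placeholder)
embedding_with_time_signal = embeddings + positional_encoding(2)
```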
The first encoder gets this 'word embedding with time signal' as input, while every other encoder takes the output of the previous encoder as its input.
Step 1:
Creating the three vectors using Wq, Wk, Wv:
The first step is to create three vectors from each of the encoder's input vectors (the vectors we get after using the embedding technique): a Query vector, a Key vector and a Value vector.
We create these three vectors by multiplying the words (already converted into the vectors X1, X2 using embedding) by three matrices that are trained during the training process.
Here we have three weight matrices: Wq (used to get the query vectors), Wk (used to get the key vectors) and Wv (used to get the value vectors). The values of these matrices (weights) are randomly initialized and then learned during training.
Calculation to get q1, q2, k1, k2, v1 and v2:
To get the query vectors (q1 and q2), we multiply X1 and X2 with Wq.
To get the key vectors (k1 and k2), we multiply X1 and X2 with Wk.
To get the value vectors (v1 and v2), we multiply X1 and X2 with Wv.
We will use these query, key and value vectors for the next calculations. Here the dimension of q1, q2, k1, k2, v1 and v2 is 64. This dimension is changeable, but researchers got good accuracy using dimension 64.
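A minimal NumPy sketch of step 1, assuming placeholder random values for the inputs and for the weight matrices Wq, Wk, Wv:

```python
import numpy as np

d_model, d_k = 512, 64
X = np.random.randn(2, d_model)    # rows are X1 (shundor) and X2 (chobi), placeholder values

# Wq, Wk, Wv are learned during training; random values here are only placeholders.
Wq = np.random.randn(d_model, d_k)
Wk = np.random.randn(d_model, d_k)
Wv = np.random.randn(d_model, d_k)

Q = X @ Wq    # rows are q1, q2 (each 64-dimensional)
K = X @ Wk    # rows are k1, k2
V = X @ Wv    # rows are v1, v2
```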
Step 2:
In step two we have to find a score for each word by multiplying its query vector with all the key vectors.
When we encode a word at a certain position, the score determines how much focus to place on the other parts of the input sentence.
For the first word (shundor):
first, we will multiply q1*k1
Second, we will multiply q1*k2
For the second word (chobi):
first, we will multiply q2*k2
Second, we will multiply q2*k1
If there are more than two words, then what will happen?
Suppose we have a third word.
Then for the first word (shundor):
first, we will multiply q1*k1
Second, we will multiply q1*k2
Third, we will multiply q1*k3
For the other words, the same process will happen.
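A small sketch of step 2 for two words, assuming placeholder q and k vectors; every query is multiplied (dot product) with every key:

```python
import numpy as np

Q = np.random.randn(2, 64)   # q1, q2 from step 1 (placeholder values)
K = np.random.randn(2, 64)   # k1, k2 from step 1

scores = Q @ K.T             # scores[i, j] is the dot product of qi and kj
# scores[0, 0] -> q1*k1, scores[0, 1] -> q1*k2   (scores for the first word)
# scores[1, 0] -> q2*k1, scores[1, 1] -> q2*k2   (scores for the second word)
```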
Step 3:
In step 3 we divide each score of each word by the square root of dk.
Here dk is the dimension of the query/key/value vectors.
Here that dimension is 64, and the square root of 64 is 8, so we will divide by 8.
After this, we apply an activation function; here we use the softmax activation function. Softmax normalizes the scores so that the sum of all values is 1. How much each word will be expressed at this position is determined by this softmax score.
For the first word (shundor):
first, we will divide (q1*k1)/8
Second, we will divide (q1*k2)/8
For the second word (chobi):
first, we will divide (q2*k2)/8
Second, we will divide (q2*k1)/8
Running the activation function for the first word (shundor):
first, softmax{(q1*k1)/8}
Second, softmax{(q1*k2)/8}
Running the activation function for the second word (chobi):
first, softmax{(q2*k2)/8}
Second, softmax{(q2*k1)/8}
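A small sketch of step 3, assuming placeholder scores from step 2; each score is divided by the square root of dk and then passed through softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

d_k = 64
scores = np.random.randn(2, 2)        # scores from step 2 (placeholder values)

scaled = scores / np.sqrt(d_k)        # divide every score by sqrt(64) = 8
weights = softmax(scaled)             # each row of softmax scores now sums to 1
```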
Step 4:
In step four we multiply each value vector by its softmax score.
The goal is to keep the values of the words we want to focus on intact and to drown out the irrelevant words.
For the first word (shundor):
first, we will multiply softmax1*v1
Second, we will multiply softmax2*v2
For the second word (chobi):
first, we will multiply softmax1*v1
Second, we will multiply softmax2*v2
(for each word, softmax1 and softmax2 are the softmax scores that were computed for that particular word in step 3)
Step 5:
Here we do the sum of the weighted value vectors.
The result of this summation is the output of the self-attention layer for that word, and this output is the input of the feed-forward layer.
For the first word (shundor):
sum FNV1 + FNV2
For the second word (chobi):
sum SNV1 + SNV2
(FNV1, FNV2 are the weighted value vectors of the first word and SNV1, SNV2 are those of the second word)
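A small sketch of steps 4 and 5 together, assuming placeholder softmax scores and value vectors; weighting each value vector and summing is a single matrix multiplication:

```python
import numpy as np

weights = np.array([[0.88, 0.12],     # softmax scores of the first word (placeholder values)
                    [0.20, 0.80]])    # softmax scores of the second word (placeholder values)
V = np.random.randn(2, 64)            # v1, v2 from step 1

# Step 4: weight each value vector by its softmax score.
# Step 5: sum the weighted value vectors.
Z = weights @ V    # z1 = 0.88*v1 + 0.12*v2,  z2 = 0.20*v1 + 0.80*v2
# Z (z1, z2) is the output of the self-attention layer and the input of the feed-forward layer.
```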
Multi-headed attention:
At first we used one attention layer, but using only one self-attention layer has some drawbacks.
To improve the performance we use a multi-headed attention layer.
Multi-headed attention means that in the encoder we use more than one self-attention layer (head).
It means that each embedded word passes through more than one self-attention layer in each encoder. We can use any number of attention heads, but it must be more than one.
Researchers suggest using eight because they got very good accuracy with eight.
The calculation of each attention head is the same: first find the three vectors, then find the scores, then divide by the square root of dk, then run the activation function, multiply the activation values with the value vectors and then do the summation to get the z value. Everything is the same; we are just using more than one self-attention layer.
If we use one self-attention layer, then we get one output z matrix for each word. But if we use eight attention heads, then we get eight z matrices (z0, z1, z2, z3, z4, z5, z6, z7).
Here we have eight z matrices but we need one z matrix.
To get that, we concatenate these matrices (z0 through z7) and then multiply the result with an additional weight matrix Wo.
After this calculation, we get a single z matrix, which is the final output for a single word.
If we have multiple words, then the same calculation happens for all the words.
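A small sketch of how the eight z matrices can be combined, assuming placeholder head outputs and a placeholder Wo matrix:

```python
import numpy as np

num_heads, d_k, d_model = 8, 64, 512
num_words = 2

# One z matrix per attention head (z0 ... z7); placeholder values here.
heads = [np.random.randn(num_words, d_k) for _ in range(num_heads)]

Z_concat = np.concatenate(heads, axis=-1)         # shape (2, 8*64) = (2, 512)
Wo = np.random.randn(num_heads * d_k, d_model)    # additional weight matrix, learned during training
Z = Z_concat @ Wo                                 # one final z matrix, shape (2, 512)
```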
Let's consider a transformer model with two encoders and two decoders. The output that we get from the second (last) encoder goes to the encoder-decoder attention layer of each decoder.
The encoder-decoder attention layer helps the decoder focus on the correct places in the input sequence.
The self-attention layer in a decoder works a little differently compared to the encoder: the self-attention layer of the decoder is only allowed to attend to earlier positions in the output sequence. We do this by masking future positions before the softmax step in the self-attention calculation.
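A small sketch of this masking step, assuming placeholder decoder self-attention scores for four output positions; future positions get a very large negative score so that softmax pushes their weights towards zero:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)      # decoder self-attention scores for 4 output positions (placeholder)

# Mask future positions: position i may only attend to positions 0..i.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores[mask] = -1e9                 # a very negative score becomes ~0 after softmax

weights = softmax(scores / np.sqrt(64))
```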
The “Encoder-Decoder Attention” layer works the same way as multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes its Keys and Values matrices from the output of the encoder.
The feed-forward and add and normalize layers work the same way as the encoder's feed-forward and add and normalize layers.
Here each decoder takes two inputs: one from the last encoder and one from the previous decoder's output.
So the flow is: the output of the last encoder goes to the encoder-decoder attention layer of each decoder, and the previous decoder's output goes into the self-attention layer of the decoder.
Then the self-attention layer's output goes to the add and normalize layer, and the output of the add and normalize layer goes to the encoder-decoder attention layer.
Now the encoder-decoder attention layer has two inputs: the previous decoder's output, processed by the self-attention and add and normalize layers, and the input coming from the last encoder.
The encoder-decoder attention layer works with these two inputs, and after its work is complete its output goes to an add and normalize layer, then to the feed-forward layer, and then again to an add and normalize layer.
After this, we get the final output from a decoder. The final output that we get from the last decoder is in the form of a vector of floats. To convert it into a word, we apply a linear layer followed by a softmax layer. The output we get from the linear layer after applying softmax is our final result.
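A small sketch of the linear plus softmax step, assuming a placeholder decoder output vector and an assumed vocabulary size of 10000:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_model = 512
vocab_size = 10000                                # assumed vocabulary size, just for illustration

decoder_output = np.random.randn(d_model)         # final vector from the last decoder (placeholder)
W_linear = np.random.randn(d_model, vocab_size)   # weights of the linear (projection) layer

logits = decoder_output @ W_linear                # one score per word in the vocabulary
probs = softmax(logits)                           # probabilities that sum to 1
predicted_word_id = int(np.argmax(probs))         # index of the most likely output word
```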
Now there can be a question: the second decoder gets the previous decoder's output, but how does the first decoder get a previous decoder output?
Suppose the first English output for the word shundor is 'beautiful'. This output goes back to the first decoder as the previous decoder input for the second time step, that is, for the second word.
This happens because, to get the second or any later word, we need the previous words so that the model can maintain the correct sequence and produce the correct sentence.
In the transformer we pass all the inputs at once but get the output one word at a time. All the encoders run only one time: after we get the final output from the encoders, the encoders stop working, but the decoder runs many times. Suppose we have 5 words; then we pass all the words at once into the encoder, and the encoder gives its final output for all the words at once. After this, the decoder produces the output one word at a time: for the first word it runs one time, for the second word it runs again, and it goes on this way until the full output is generated.
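A minimal sketch of this loop, assuming hypothetical encode and decode placeholders that stand for the whole encoder stack and the whole decoder stack (with the linear and softmax layers):

```python
import numpy as np

# encode and decode are hypothetical placeholders: encode stands for the whole
# encoder stack, decode for the whole decoder stack plus linear + softmax.

def encode(source_words):
    return np.random.randn(len(source_words), 512)    # pretend encoder output

def decode(encoder_output, generated_so_far):
    return "word%d" % len(generated_so_far)           # pretend next output word

source = ["shundor", "chobi"]
encoder_output = encode(source)       # the encoders run only once, for all input words together

generated = ["<start>"]
for _ in range(5):                    # the decoder runs again for every new output word
    next_word = decode(encoder_output, generated)
    generated.append(next_word)       # previous outputs are fed back in as decoder input
```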