
Learn everything about the transformer model

How does the transformer model work?




In the transformer model, we have an encoder and a decoder. Our input first goes to the encoder, then to the decoder, and at the end of the decoder's work we get our output. In the image we can see that there are six encoders and six decoders.

The flow is: the input first goes to the first encoder. After the first encoder finishes its processing, the information goes to the second encoder, then to the third, and in this way the information moves from one encoder to the next, each time after all the processing inside the current encoder is complete, until it reaches the last encoder.

After the last encoder, its output goes to all the decoders at the same time. Some processing then happens in each decoder, the information flows through the decoders one after another until it reaches the last decoder, and the output we get from the last decoder is the final output.

Here we took six encoders and six decoders. We can change this number according to our needs, but the researchers got good accuracy when they used six encoders and six decoders.
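As a rough sketch of this flow in Python (encoder_layers and decoder_layers are hypothetical stand-ins for the real layers, not an actual framework API):

# Hypothetical sketch of the stacking described above.
def run_transformer(x, encoder_layers, decoder_layers, decoder_input):
    # The input passes through the encoders one after another.
    for enc in encoder_layers:        # e.g. 6 encoders
        x = enc(x)                    # each encoder's output feeds the next encoder
    encoder_output = x                # the last encoder's output

    # Every decoder receives the same encoder output,
    # plus the output coming from the previous decoder.
    y = decoder_input
    for dec in decoder_layers:        # e.g. 6 decoders
        y = dec(y, encoder_output)
    return y                          # the output of the last decoder is the final output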

Process that happens in the encoder




Inside each encoder of the transformer, we have two main layers: one is self-attention and the other is a feed-forward neural network. This feed-forward neural network is an example of an ANN. Each sub-layer (self-attention, feed-forward neural network) in each encoder has a residual connection around it, handled by a layer called the add-and-normalize layer. All the words go into the encoder at the same time; if we have five words, then all five words enter the encoder at the same time.

So first we convert each word into a vector (X1, X2) using the word embedding technique, where the dimension is 512 (though we can change the dimension). These vectors then pass to the self-attention layer, which gives the outputs Z1, Z2. These outputs go to the feed-forward layer, which gives new outputs R1, R2. These outputs then go to the second encoder, where the same process happens, and the second encoder's feed-forward layer gives that encoder's final outputs. Those outputs go to the third encoder, and this is how the work proceeds through the encoders.

The flow of the vectors through these layers is: first the self-attention layer, then an add-and-normalize layer, then the feed-forward layer, and then again an add-and-normalize layer.

Process that happens in the add-and-normalize layer:
In the add-and-normalize layer we add vectors. If this layer comes after the self-attention layer, we add the input and output vectors of the self-attention layer; if it comes after the feed-forward layer, we add the input and output vectors of the feed-forward layer. Suppose the input of the self-attention layer is the matrix X1 and its output is the matrix Z1; then in the add-and-normalize layer we compute X1 + Z1 and normalize the result.
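As a minimal numpy sketch of this add-and-normalize idea (toy dimension 4 instead of 512, with a hand-written layer normalization; an illustration, not the exact implementation of any particular library):

import numpy as np

def add_and_normalize(layer_input, layer_output, eps=1e-6):
    # "Add": the residual connection, e.g. X1 + Z1 after the self-attention layer.
    added = layer_input + layer_output
    # "Normalize": layer normalization over the feature dimension.
    mean = added.mean(axis=-1, keepdims=True)
    std = added.std(axis=-1, keepdims=True)
    return (added - mean) / (std + eps)

# Toy example: X1 is the input of the self-attention layer, Z1 is its output.
X1 = np.array([1.0, 2.0, 3.0, 4.0])
Z1 = np.array([0.5, -0.5, 0.5, -0.5])
print(add_and_normalize(X1, Z1))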

How does self-attention work?

Suppose we have a sentence:
"The cat didn't cross the road because it was injured".

What does 'it' refer to in this sentence?
Is it referring to the road or the cat?
For the human brain the answer is very easy, but algorithms don't understand this on their own; self-attention lets the model look at the other words in the sentence while encoding each word, so it can associate 'it' with 'cat'.
So in the image, we can see that we have two inputs: shundor (X1) and chobi (X2), the Bengali words for 'beautiful' and 'picture'. At first, we convert the words into vector form using the word embedding technique, and this process happens only one time in the whole cycle.

Here embedding means we convert these two words (shundor, chobi) into vector form (shundor -> X1, chobi -> X2) using the embedding technique. The dimension of X1 and X2 is 512. After embedding we also do positional encoding.
Suppose we have four words x1, x2, x3, and x4. Positional encoding captures which word is near which word, or which word comes before or after which word; in other words, the position of each word in the sentence. After this, we add the positional-encoding matrix to the embedding matrix, and the result is called the embedding with time signal.

The first encoder gets this 'embedding with time signal' as input, while the other encoders take the output of the previous encoder as their input.
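The text does not give the exact positional-encoding formula; one common choice is the sinusoidal encoding from the original transformer paper. Here is a small numpy sketch (toy dimension 8 instead of 512) that builds that encoding and adds it to the embeddings to get the embedding with time signal:

import numpy as np

def positional_encoding(num_words, d_model):
    # Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos.
    pos = np.arange(num_words)[:, None]      # positions 0 .. num_words-1
    i = np.arange(d_model)[None, :]          # embedding dimensions 0 .. d_model-1
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

d_model = 8                                  # 512 in the text, 8 here for readability
embeddings = np.random.randn(4, d_model)     # four words x1..x4 (random stand-in embeddings)
with_time_signal = embeddings + positional_encoding(4, d_model)
print(with_time_signal.shape)                # (4, 8)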




Step 1:
Creating three vectors using Wq, Wk, Wv:
The first step is to create three vectors from each of the encoder's input vectors (the vectors we get after using the embedding technique): a Query vector, a Key vector, and a Value vector.
We create these three vectors by multiplying the embedded words (X1, X2) by three matrices that are learned during the training process.

Here we have three weight matrices: Wq (used to get the query vectors), Wk (used to get the key vectors), and Wv (used to get the value vectors). The values of these matrices are randomly initialized and then learned during training.

Calculation to get q1, q2, k1, k2, v1 and v2:
To get the query vectors (q1 and q2), we multiply X1 and X2 by Wq.
To get the key vectors (k1 and k2), we multiply X1 and X2 by Wk.
To get the value vectors (v1 and v2), we multiply X1 and X2 by Wv.

We will use these query, key, and value vectors for the next calculations. The dimension of q1, q2, k1, k2, v1, v2 is 64. This dimension is changeable, but the researchers got good accuracy using dimension 64.
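A small numpy sketch of step 1 (toy sizes: embedding dimension 4 and q/k/v dimension 3 instead of 512 and 64; the weight values are random stand-ins for the learned matrices):

import numpy as np

d_model, d_k = 4, 3                      # 512 and 64 in the text; smaller here for readability
rng = np.random.default_rng(0)

# The embedded words (shundor -> X1, chobi -> X2) stacked as the rows of X.
X = rng.standard_normal((2, d_model))

# Randomly initialized weight matrices; in a real model they are learned during training.
Wq = rng.standard_normal((d_model, d_k))
Wk = rng.standard_normal((d_model, d_k))
Wv = rng.standard_normal((d_model, d_k))

Q = X @ Wq   # rows are q1, q2
K = X @ Wk   # rows are k1, k2
V = X @ Wv   # rows are v1, v2
print(Q.shape, K.shape, V.shape)         # (2, 3) each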


Step 2:
In step two we have to find a score for each word by multiplying its query vector with all the key vectors.
When we encode a word at a certain position, the score determines how much focus to place on the other parts of the input sentence.

For the first word (shundor):
First, we will multiply q1*k1.
Second, we will multiply q1*k2.

For the second word (chobi):
First, we will multiply q2*k2.
Second, we will multiply q2*k1.

What happens if there are more than two words?

Suppose we have a third word.
Then for the first word (shundor):
First, we will multiply q1*k1.
Second, we will multiply q1*k2.
Third, we will multiply q1*k3.
The same process happens for the other words.


Step 3:
In step 3 we have to divide each score of each word by the square root of dk.

Here dk is the dimension of the query/key/value vectors.
That dimension is 64, and the square root of 64 is 8,
so we will divide by 8.

After this, we have to run an activation function. Here we use the softmax activation function. Softmax normalizes the scores so that they are all positive and sum to 1. This softmax score determines how much each word will be expressed at this position.

For the first word (shundor):
First, we divide (q1*k1)/8.
Second, we divide (q1*k2)/8.

For the second word (chobi):
First, we divide (q2*k2)/8.
Second, we divide (q2*k1)/8.

Running the activation function
for the first word (shundor):
softmax{(q1*k1)/8, (q1*k2)/8} (softmax is applied over both scaled scores of the word at once)

Running the activation function
for the second word (chobi):
softmax{(q2*k2)/8, (q2*k1)/8}


Step 4:
In step four we multiply each value vector by its softmax score.

The goal is to keep the values of the words we want to focus on and to drown out the irrelevant words (their values get multiplied by very small numbers).

For the first word (shundor), using the softmax scores computed from q1:
First, we multiply softmax1*v1.
Second, we multiply softmax2*v2.

For the second word (chobi), using its own softmax scores computed from q2:
First, we multiply softmax1*v1.
Second, we multiply softmax2*v2.


Step 5:
Here we sum the weighted value vectors.

The result of this summation is the output of the self-attention layer, and this output is the input of the feed-forward layer.

For the first word (shundor):
z1 = FNV1 + FNV2, where FNV1 and FNV2 are the first word's weighted value vectors from step 4.

For the second word (chobi):
z2 = SNV1 + SNV2, where SNV1 and SNV2 are the second word's weighted value vectors from step 4.
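Putting steps 2 to 5 together, here is a minimal numpy sketch of the whole self-attention calculation for two words (random toy Q, K, V matrices stand in for the ones computed in step 1):

import numpy as np

rng = np.random.default_rng(0)
d_k = 64
Q = rng.standard_normal((2, d_k))   # rows q1, q2 (shundor, chobi)
K = rng.standard_normal((2, d_k))   # rows k1, k2
V = rng.standard_normal((2, d_k))   # rows v1, v2

# Step 2: score every query against every key (q1*k1, q1*k2, q2*k1, q2*k2).
scores = Q @ K.T                    # shape (2, 2)

# Step 3: divide by sqrt(d_k) = 8 and apply softmax over each word's scores.
scaled = scores / np.sqrt(d_k)
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))   # numerically stable softmax
weights = exp / exp.sum(axis=-1, keepdims=True)

# Step 4: multiply each value vector by its softmax weight.
# Step 5: sum the weighted value vectors, giving z1 and z2.
Z = weights @ V                     # rows are z1, z2
print(Z.shape)                      # (2, 64)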


Multi-headed attention:
At first we used one attention layer, but using a single self-attention layer has some drawbacks.
To improve performance we use a multi-headed attention layer.

Multi-headed attention means that in the encoder we use more than one self-attention layer (head).
Each embedded word passes through more than one self-attention head in each encoder. We can use any number of attention heads, as long as it is more than one.
The researchers used eight because they got very good accuracy with that number.
The calculation in each attention head is the same: first find the three vectors, then find the scores, then divide by the square root of dk, then run the activation function, multiply the activation values by the value vectors, and then sum them to get the z value. Everything is the same; we are just using more than one self-attention head.

If we use one self-attention head, we get one output z matrix for each word. But if we use eight attention heads, we get eight z matrices (z0, z1, z2, z3, z4, z5, z6, z7).

Here we have eight z matrices, but we need one z matrix.
To get that,
we concatenate these matrices (z = [z0, z1, z2, z3, z4, z5, z6, z7] joined side by side) and then multiply the result by an additional weight matrix Wo.
After this calculation, we get a single z matrix, which is the final output for a single word.
If we have multiple words, the same calculation happens for all of them.
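A numpy sketch of combining eight heads into one z matrix (random stand-in weights; the helper attention_head simply repeats the single-head calculation shown above):

import numpy as np

rng = np.random.default_rng(0)
num_heads, d_model, d_k = 8, 512, 64
X = rng.standard_normal((2, d_model))            # two embedded words

def attention_head(X):
    # One self-attention head with its own (random stand-in) Wq, Wk, Wv.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(d_k)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (exp / exp.sum(axis=-1, keepdims=True)) @ V   # this head's z matrix

# Eight heads give eight z matrices z0..z7; concatenate them side by side.
Z_heads = [attention_head(X) for _ in range(num_heads)]
Z_concat = np.concatenate(Z_heads, axis=-1)      # shape (2, 8 * 64)

# Multiply by the additional weight matrix Wo to get a single z matrix per word.
Wo = rng.standard_normal((num_heads * d_k, d_model))
Z = Z_concat @ Wo                                # shape (2, 512)
print(Z.shape)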

Process that happens in the decoder




Let's consider a transformer model with two encoders and two decoders. The output that we get from the second (last) encoder goes to the encoder-decoder attention layer of each decoder.

The encoder-decoder attention layer helps the decoder focus on the correct places in the input sequence.

The self-attention layer in a decoder works a little differently from the one in the encoder. The decoder's self-attention layer is only allowed to attend to earlier positions in the output sequence. We do this by masking the future positions (setting their scores to negative infinity) before the softmax step of the self-attention calculation.
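A small numpy sketch of that masking step (a toy 3-word output sequence with made-up scores; future positions get a score of negative infinity, so softmax gives them zero weight):

import numpy as np

# Made-up raw attention scores for a 3-word output sequence (rows = positions).
scores = np.array([[0.9, 0.4, 0.1],
                   [0.2, 0.8, 0.3],
                   [0.5, 0.6, 0.7]])

# Mask future positions: position i may only attend to positions 0..i.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
masked = np.where(mask, -np.inf, scores)

# After softmax, the masked (future) positions get exactly zero weight.
exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))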

The “Encoder-Decoder Attention” layer works the same way as multi-headed self-attention, except that it creates its queries matrix from the layer below it and takes its keys and values matrices from the output of the encoder.
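Sketched in numpy, the only difference from self-attention is where Q, K, and V come from; all sizes and weights below are toy stand-ins:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
decoder_hidden = rng.standard_normal((3, d_model))   # from the layer below (3 output words so far)
encoder_output = rng.standard_normal((5, d_model))   # from the encoder stack (5 input words)

Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Q = decoder_hidden @ Wq          # queries come from the decoder side
K = encoder_output @ Wk          # keys come from the encoder output
V = encoder_output @ Wv          # values come from the encoder output

scores = (Q @ K.T) / np.sqrt(d_k)
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
out = weights @ V                # each output word attends over the input words
print(out.shape)                 # (3, 64)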

The feed-forward and add-and-normalize layers work the same way as the encoder's feed-forward and add-and-normalize layers. Each decoder takes two inputs: one from the last encoder and one from the previous decoder's output.

So the flow is: the output of the last encoder goes to the encoder-decoder attention layer of each decoder, while the previous decoder's output goes into the decoder's self-attention layer.
The self-attention layer's output then goes to an add-and-normalize layer, and the output of that layer goes to the encoder-decoder attention layer.

Now the encoder-decoder attention layer has two inputs: one is the previous decoder's output, processed by the self-attention and add-and-normalize layers, and the other is the input from the last encoder.

The encoder-decoder attention layer works with these two inputs, and after it completes its work, its output goes to an add-and-normalize layer, then the feed-forward layer, and then again an add-and-normalize layer.

After this, we get the final output from a decoder. The final output that we get from the last decoder is in the form of vectors of floats. To convert it into a word, we apply a linear layer followed by a softmax layer. The output of the linear layer, after applying softmax, is our final result.
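A minimal numpy sketch of that final linear + softmax step (a hypothetical 5-word vocabulary and random weights, just to show the shape of the computation):

import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 5
vocab = ["beautiful", "picture", "cat", "road", "the"]   # hypothetical tiny vocabulary

decoder_output = rng.standard_normal(d_model)            # vector from the last decoder

# Linear layer: project the decoder vector to one score (logit) per vocabulary word.
W = rng.standard_normal((d_model, vocab_size))
logits = decoder_output @ W

# Softmax: turn the logits into probabilities over the vocabulary.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
print(vocab[int(np.argmax(probs))])                      # the most likely word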

Now a question can arise: the second decoder gets the previous decoder's output, but how does the first decoder get a previous decoder output?

Suppose the first English output for the word shundor is 'beautiful'. This output goes back to the first decoder as the previous decoder input for the second time step, i.e. the second word.
This happens because, to get the second word (or any later word), we need the previously generated words so that the model can keep the correct sequence and produce a correct sentence.

In the transformer, we pass all the inputs at once but get the output one word at a time. The encoders run only once: after the final output is obtained from the encoder stack, the encoders stop working, while the decoder runs many times. For example, if we have 5 input words, we pass all 5 words into the encoder at once and the encoder gives its final output for all of them in a single pass. After this, the decoder generates the output one word at a time: it runs once for the first word, runs again for the second word, and so on.
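A rough sketch of this 'encode once, decode word by word' loop; encode and decode_step are hypothetical placeholder functions, not a real library API:

# Hypothetical sketch: the encoder runs once, the decoder runs once per output word.
def translate(source_words, encode, decode_step, max_len=20):
    encoder_output = encode(source_words)    # all input words are encoded in a single pass
    output_words = ["<start>"]               # the decoder starts from a start token
    for _ in range(max_len):
        # Each step sees the words generated so far plus the encoder output.
        next_word = decode_step(output_words, encoder_output)
        if next_word == "<end>":             # stop when the decoder produces the end token
            break
        output_words.append(next_word)       # feed previous outputs back in for the next step
    return output_words[1:]                  # drop the start token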
