Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn GitHub
Learn OpenCV
Introduction
Setup
ANN
Working process of ANN
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model Basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parser
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Here we have three sentences:
Sentence 1 --> He is a good boy
Sentence 2 --> He is a good girl
Sentence 3 --> Boy and girl are good.
First, we have to convert all the words in the sentences to lower case, because the first sentence contains "boy"
and the third sentence contains "Boy". If we don't lower-case the words, "Boy" and "boy" will be treated as two
different words even though they are the same. So we convert the sentence words to lower case.
Sometimes this creates problems: if we have "USA" and "us", we can't apply this technique blindly, because the
two would collapse into the same word. In that case we can keep such words in a separate list, leave them as they
are, and convert all the remaining words to lower case. After that we perform tokenization, stop-word removal,
and lemmatization or stemming.
After performing these steps on our three example sentences, they will look like this:
Sentence 1 --> good boy
Sentence 2 --> good girl
Sentence 3 --> boy girl good
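The preprocessing described above can be done in a few lines of Python. Below is a minimal sketch using NLTK (the English stop-word list and the WordNet lemmatizer are assumptions; any equivalent tools work the same way):

```python
# Minimal preprocessing sketch for the three example sentences.
# Assumes NLTK's English stop-word list and WordNet lemmatizer.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

sentences = [
    "He is a good boy",
    "He is a good girl",
    "Boy and girl are good.",
]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

cleaned = []
for sentence in sentences:
    # lower-case, split on whitespace, strip trailing punctuation
    words = [w.lower().strip(".") for w in sentence.split()]
    # drop stop words and lemmatize whatever is left
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    cleaned.append(" ".join(words))

print(cleaned)  # ['good boy', 'good girl', 'boy girl good']
```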
Now we will create a histogram.
In the histogram we find the frequency of each word, i.e. how many times it occurs. Take 'boy' first: it appears
once in the first sentence, not at all in the second, and once in the third, so in total we have 'boy' two times
and its frequency is 2. Similarly, the frequency of 'girl' is 2 and of 'good' is 3.
good ---> 3
girl ---> 2
boy ---> 2
We always sort these frequencies in descending order.
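A short sketch of this frequency count, reusing the cleaned sentences from the snippet above (`collections.Counter` already returns the words sorted by count in descending order):

```python
# Word-frequency histogram for the cleaned sentences.
from collections import Counter

cleaned = ["good boy", "good girl", "boy girl good"]

freq = Counter()
for sentence in cleaned:
    freq.update(sentence.split())

# most_common() yields (word, count) pairs in descending order of count
for word, count in freq.most_common():
    print(f"{word} ---> {count}")
# good ---> 3
# boy ---> 2
# girl ---> 2
```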
Converting into a vector:

|            | f1 (good) | f2 (girl) | f3 (boy) | O/P |
|------------|-----------|-----------|----------|-----|
| Sentence 1 | 1         | 0         | 1        |     |
| Sentence 2 | 1         | 1         | 0        |     |
| Sentence 3 | 1         | 1         | 1        |     |
Here, for each sentence, we write 1 if the word is present and 0 if it is not. In the first sentence we have
'good' and 'boy', so we write 1 for those, and we don't have 'girl', so we write 0.
If a word appears more than once, we write its count instead: if 'boy' occurred two times in the first sentence,
then where we wrote 1 for 'boy' we would write 2, and if it occurred three times we would write 3.
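Scikit-learn's CountVectorizer builds these vectors directly. A minimal sketch: `binary=True` reproduces the 0/1 table above, while the default counts how many times each word occurs (the case where we would write 2 or 3). Note that CountVectorizer orders its columns alphabetically (boy, girl, good) rather than by frequency:

```python
# Bag of Words vectors with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

cleaned = ["good boy", "good girl", "boy girl good"]

# binary=True: 1 if the word is present in the sentence, else 0
binary_vectorizer = CountVectorizer(binary=True)
X_binary = binary_vectorizer.fit_transform(cleaned)
print(binary_vectorizer.get_feature_names_out())  # ['boy' 'girl' 'good']
print(X_binary.toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]

# default behaviour: raw counts instead of 0/1
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(cleaned)
print(X_counts.toarray())
```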
Disadvantage:
With Bag of Words we can't say which word is more important. In this example 'good' is more important than the
others, but all the values are just 1. So from the bag of words we don't get word importance or an ordering of
the important words.