Everything about bag of words in RNN

What is bag of words and how does it work?

Here we have three sentences:
Sentence 1 --> He is a good boy
Sentence 2 --> He is a good girl
Sentence 3 --> Boy and girl are good.

First, we have to convert the words of the sentences to lower case, because in the first sentence we have "boy" and in the third sentence we have "Boy". If we don't lowercase the words, "Boy" and "boy" will be treated as two different words even though they are the same word. So we convert all the sentence words to lower case.

Sometimes lowercasing creates problems: if the text contains both "USA" and "us", lowercasing turns "USA" into the pronoun "us", so we can't apply the technique blindly. In that case we can keep such words in a separate list, leave them as they are, and convert all the remaining words to lower case, as sketched below. After that we have to perform tokenization, stop-word removal, and lemmatization or stemming.
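
A minimal sketch of lowercasing with an exception list (the set name keep_as_is is just an illustrative choice):

keep_as_is = {"USA"}

def to_lower(words):
    # Lowercase every word except the ones in the exception set.
    return [w if w in keep_as_is else w.lower() for w in words]

print(to_lower(["USA", "Boy", "good"]))
# ['USA', 'boy', 'good']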

For example, take the three sentences above.
After performing these steps, the sentences will look like this:
Sentence 1 --> good boy
Sentence 2 --> good girl
Sentence 3 --> boy girl good
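
A minimal sketch of the stop-word removal that produces these cleaned sentences, assuming NLTK's English stop-word list:

import nltk
from nltk.corpus import stopwords

# nltk.download("stopwords")  # uncomment on first run

stop = set(stopwords.words("english"))
sentence = "He is a good boy"
kept = [w for w in sentence.lower().split() if w not in stop]
print(kept)
# ['good', 'boy']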


Now we will create a histogram:
In the histogram, we find the frequency of each word, i.e. how many times it occurs. Take 'boy' first: it appears once in the first sentence, not at all in the second, and once in the third, so its total frequency is 2. Similarly, the frequency of 'girl' is 2 and of 'good' is 3.
good ---> 3
girl ---> 2
boy ---> 2
We always sort these frequencies in descending order.
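
We can build this frequency table with Python's collections.Counter, for example:

from collections import Counter

# The cleaned sentences from above.
sentences = ["good boy", "good girl", "boy girl good"]

# Count how many times each word appears across all sentences.
freq = Counter(word for s in sentences for word in s.split())

# most_common() returns the counts in descending order
# (ties keep first-seen order, so 'boy' may come before 'girl').
for word, count in freq.most_common():
    print(word, "--->", count)
# good ---> 3
# boy ---> 2
# girl ---> 2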

Converting into vectors:


             f1     f2     f3
             good   girl   boy
Sentence 1   1      0      1
Sentence 2   1      1      0
Sentence 3   1      1      1

Here, if a word is present in a sentence we write 1, and if not we write 0. The first sentence has 'good' and 'boy', so we write 1 for those, and it doesn't have 'girl', so we write 0 there.

Suppose 'boy' appeared twice in the first sentence. In that case, where we wrote 1 for 'boy' we would write 2, and if it appeared three times we would write 3.
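
As a sketch of both variants, scikit-learn's CountVectorizer builds the count version by default and the 0/1 version with binary=True (here the first sentence is modified so 'boy' appears twice, as in the hypothetical above):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["good boy boy", "good girl", "boy girl good"]

cv_counts = CountVectorizer()             # stores word counts
cv_binary = CountVectorizer(binary=True)  # stores presence (1) or absence (0)

# Note: CountVectorizer orders its columns alphabetically
# (boy, girl, good), not by frequency.
print(cv_counts.fit_transform(sentences).toarray())
# [[2 0 1]
#  [0 1 1]
#  [1 1 1]]
print(cv_binary.fit_transform(sentences).toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]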

Disadvantage:
Here we can't say which word is more important. In this example, the word 'good' matters more than the others, but in the vectors all the values are just 1, so from the bag of words we get neither the relative importance of the words nor their order in the sentence.

Example:
Input
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# nltk.download("punkt")      # uncomment on first run
# nltk.download("stopwords")  # uncomment on first run

paragraph = """This website is an awesome website for learning programming. I love this website. Everyone should try this. Here you can learn a lot."""

ps = PorterStemmer()
wordnet = WordNetLemmatizer()
# The lemmatizer is created here so you can swap it in
# for the stemmer below if you prefer lemmatization.

sentences = nltk.sent_tokenize(paragraph)
corpus = []
# We created this empty list because after cleaning
# we will store our sentences here.

for i in range(len(sentences)):
    # We iterate through all the sentences.

    review = re.sub("[^a-zA-Z]", " ", sentences[i])
    # Here we clean the sentence: every character that is not
    # a letter is replaced with a space.

    review = review.lower()
    # Here we convert the words to lower case.

    review = review.split()
    # Here we split the sentence into a list of words.

    review = [ps.stem(word) for word in review if word not in set(stopwords.words("english"))]
    # If a word is not a stop word, we keep it and stem it.

    review = " ".join(review)
    # Here we join the list of words back into one string.

    corpus.append(review)
    # Here we append the cleaned sentence to the corpus list.

# Creating the bag of words
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()
Output

Running this produces x, a 2-D array with one row per cleaned sentence and one column per vocabulary word; each cell holds how many times that word occurs in that sentence.
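
To inspect the result, we can print the learned vocabulary and the matrix (a small sketch; on scikit-learn versions older than 1.0, use cv.get_feature_names() instead of cv.get_feature_names_out()):

print(cv.get_feature_names_out())  # the learned vocabulary
print(x)                           # one row of counts per cleaned sentence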
