Learn Python
Learn Data Structures & Algorithms
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn GitHub
Learn OpenCV
Introduction
Setup
ANN
Working process of ANN
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parser
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
In this technique we have two components: TF (term frequency) and IDF (inverse document frequency). We compute TF and IDF separately and then multiply them together.
Formula of TF:
TF = (number of times the word occurs in the sentence) / (number of words in the sentence)
Formula of IDF:
IDF = log(number of sentences / number of sentences containing the word)
Here we have three sentences.
Sentence 1 --> He is a good boy
Sentence 2 --> He is a good girl
Sentence 3 --> Boy and girl are good.
First, we have to convert the words in the sentences to lower case, because the first sentence contains 'boy' while the third contains 'Boy'. If we don't lowercase, 'Boy' and 'boy' are treated as two different words even though they are the same word. Lowercasing can occasionally cause problems of its own: with 'USA' and 'us', for example, the country name and the pronoun collapse into the same token. In that case we can keep such words in a separate list, leave them as they are, and convert all the remaining words to lower case. After that we perform tokenization, stop-word removal, and lemmatization or stemming. (For this example we took three short sentences rather than a paragraph.)
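As a rough illustration, here is a minimal pure-Python sketch of that preprocessing (lowercasing, tokenization, stop-word removal). The stop-word list is hand-picked for these three sentences; a real pipeline would use NLTK or spaCy:

```python
# Minimal preprocessing sketch: lowercase, tokenize, drop stop words.
# The stop-word set below is hand-picked for this example only.
stop_words = {"he", "is", "a", "and", "are"}

sentences = [
    "He is a good boy",
    "He is a good girl",
    "Boy and girl are good.",
]

cleaned = []
for sentence in sentences:
    tokens = sentence.lower().replace(".", "").split()  # lowercase + tokenize
    cleaned.append([w for w in tokens if w not in stop_words])

print(cleaned)  # [['good', 'boy'], ['good', 'girl'], ['boy', 'girl', 'good']]
```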
After preprocessing, our sentences look like this:
Sentence 1 --> good boy
Sentence 2 --> good girl
Sentence 3 --> boy girl good
Now we will create a histogram.
In the histogram we find the frequency of each word, i.e., how many times it occurs across all sentences. Take 'boy' first: it appears once in the first sentence, not at all in the second, and once in the third, so its total frequency is 2. Similarly, the frequency of 'girl' is 2 and the frequency of 'good' is 3.
good ---> 3
girl ---> 2
boy ---> 2
These frequencies are always sorted in descending order.
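A quick sketch of this frequency count with collections.Counter, assuming the preprocessed word lists from the step above:

```python
from collections import Counter

# Count how often each word occurs across all preprocessed sentences,
# then print the counts in descending order.
sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
freq = Counter(word for sentence in sentences for word in sentence)
for word, count in freq.most_common():
    print(f"{word} ---> {count}")  # good ---> 3, boy ---> 2, girl ---> 2
```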
Using the TF formula we get the following table.

Word | Sentence 1 | Sentence 2 | Sentence 3 |
---|---|---|---|
good | 1/2 | 1/2 | 1/3 |
girl | 0 | 1/2 | 1/3 |
boy | 1/2 | 0 | 1/3 |
In the first sentence there are two words: "good" appears once, so we write 1/2, and "boy" appears once, so we write 1/2; "girl" does not appear in the first sentence, so we write 0. The other sentences work the same way.
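A small sketch computing the same TF values directly from the formula, with the word lists assumed from the preprocessing step above:

```python
# TF = (count of the word in the sentence) / (number of words in the sentence)
sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "girl", "boy"]

for i, words in enumerate(sentences, start=1):
    tf = {w: words.count(w) / len(words) for w in vocab}
    print(f"Sentence {i}: {tf}")
# Sentence 1: {'good': 0.5, 'girl': 0.0, 'boy': 0.5} ...
```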
Using the IDF formula we get the following table.

Words | IDF |
---|---|
good | log(3/3) |
girl | log(3/2) |
boy | log(3/2) |
We have three sentences in total. All three contain the word "good", so we write log(3/3), which equals 0; "girl" and "boy" each appear in two sentences, so we write log(3/2).
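And a sketch of the IDF computation (natural log here; any base works as long as it is used consistently):

```python
import math

# IDF = log(number of sentences / number of sentences containing the word)
sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

n = len(sentences)
for word in ["good", "girl", "boy"]:
    df = sum(1 for s in sentences if word in s)  # sentences containing the word
    print(f"{word}: log({n}/{df}) = {math.log(n / df):.4f}")
# good: log(3/3) = 0.0000, girl: log(3/2) = 0.4055, boy: log(3/2) = 0.4055
```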
Multiplying each TF value by the corresponding IDF value gives the final TF-IDF table.

 | f1: good | f2: boy | f3: girl |
---|---|---|---|
Sentence 1 | (1/2)*log(3/3) = 0 | (1/2)*log(3/2) | 0 |
Sentence 2 | (1/2)*log(3/3) = 0 | 0 | (1/2)*log(3/2) |
Sentence 3 | (1/3)*log(3/3) = 0 | (1/3)*log(3/2) | (1/3)*log(3/2) |
From the TF-IDF table we can see that words occurring in every sentence (here, "good") get a value of 0, i.e., they carry no distinguishing information, while the more important words get non-zero values that differ from sentence to sentence.
Code:
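A minimal sketch using scikit-learn's TfidfVectorizer. Note that scikit-learn's default IDF is smoothed, log((1 + n) / (1 + df)) + 1, and each row is L2-normalized, so its scores will not exactly match the hand-computed table above, but the pattern is the same: words common to every sentence are weighted down, distinctive words are weighted up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "He is a good boy",
    "He is a good girl",
    "Boy and girl are good.",
]

# TfidfVectorizer lowercases and tokenizes internally; stop_words="english"
# drops 'he', 'is', 'and', 'are' (single-letter tokens like 'a' are removed
# by the default token pattern anyway).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['boy' 'girl' 'good']
print(tfidf_matrix.toarray())              # one row of TF-IDF scores per sentence
```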