Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn GitHub
Learn OpenCV
Introduction
Setup
ANN
How ANN works
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model Basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parsing
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Tokenization is the process of slicing a paragraph into sentences and sentences into words; these pieces are
called tokens. Tokens help us understand the context of the text or build a model for NLP. A token can be
either a sentence or a word.
To split text into tokens we can use different techniques, such as:
1. Whitespace Tokenization:
In this technique, we split a paragraph or sentence into tokens whenever we encounter white space.
2. Dictionary-Based Tokenization:
In this technique, tokens are found by matching against entries that already exist in a dictionary.
3. Rule-Based Tokenization:
Here we create a set of rules for the specific problem, and tokenization is done based on those rules.
Besides these, many other techniques exist.
Example: Suppose we have a paragraph:
This website is an awesome website for learning programming. I love this website. Everyone should try this.
If we tokenize it, we get three sentences and 17 words. So we can say that here we have three sentence
tokens and 17 word tokens.
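The example above can be sketched with plain Python. This is a minimal whitespace tokenizer using only the standard library; real projects typically use a library such as NLTK or spaCy, which handle punctuation and edge cases far more robustly.

```python
import re

text = ("This website is an awesome website for learning programming. "
        "I love this website. Everyone should try this.")

# Sentence tokenization: split after sentence-ending punctuation followed by a space.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: split each sentence on whitespace and strip trailing punctuation.
words = [w.strip(".,!?") for s in sentences for w in s.split()]

print(len(sentences))  # 3 sentence tokens
print(len(words))      # 17 word tokens
```

Splitting on white space alone would glue punctuation to words ("programming."), which is why each token is stripped of punctuation here.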
Stop word removal means getting rid of very common words. In this process, we remove words that carry little
or no value in NLP, i.e. recurring terms that are not informative about the text at all. In short, we remove
frequently repeated words such as prepositions, articles, and pronouns, which have little effect on a
particular project or application.
Suppose we have the sentence "He is a good boy". Here 'he', 'is', and 'a' are not that important, so the stop
word technique removes them. But sometimes these words play an important role; in that case, we do not remove
them.
Suppose we have a paragraph:
This website is an awesome website for learning programming. I love this website. Everyone should try this.
Here you can learn a lot.
In this example we have 'is', 'a', 'for', 'you', 'here', 'I', etc. These words do not have much effect on this
particular project, so they will be removed as stop words.
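Stop word removal can be sketched as a simple filter. The stop word set below is a small illustrative list made up for this example; libraries such as NLTK ship curated per-language stop word lists (e.g. `nltk.corpus.stopwords`).

```python
# A small hand-made stop word set for illustration only;
# real NLP libraries provide much fuller, language-specific lists.
stop_words = {"he", "is", "a", "an", "the", "for", "i", "this", "here", "you"}

sentence = "He is a good boy"

# Lowercase before matching so "He" matches the stop word "he".
tokens = sentence.lower().split()
filtered = [w for w in tokens if w not in stop_words]

print(filtered)  # ['good', 'boy']
```

As the notes point out, whether a word should be treated as a stop word depends on the task: for sentiment analysis, words like "not" are usually kept even though generic lists may include them.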