Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn GitHub
Learn OpenCV
Introduction
Setup
ANN
How ANN works
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model Basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parsing
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Tokenization is the process of slicing a paragraph into sentences and sentences into words; these pieces are
called tokens. Tokens help us understand the context of the text or build a model for NLP. A token can be
either a sentence or a word.
To split text into tokens we can use different techniques, such as:
1. Whitespace Tokenization:
In this technique, we split a paragraph or sentence into tokens whenever we encounter white space.
2. Dictionary-Based Tokenization:
In this technique, tokens are found by matching against entries that already exist in a dictionary.
3. Rule-Based Tokenization:
Here we create a set of rules for the specific problem, and tokenization is done based on those rules.
Besides these, many other techniques exist.
Example: Suppose we have a paragraph:
This website is an awesome website for learning programming. I love this website. Everyone should try this.
If we tokenize it, we get three sentences and 17 words. So we can say that here we have three sentence
tokens and 17 word tokens.
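The example above can be sketched with plain Python. This is a minimal whitespace tokenizer using only the standard library; real projects typically use a library such as NLTK or spaCy, which handle punctuation and edge cases far more robustly.

```python
import re

text = ("This website is an awesome website for learning programming. "
        "I love this website. Everyone should try this.")

# Sentence tokenization: split after sentence-ending punctuation followed by a space.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: split each sentence on whitespace and strip trailing punctuation.
words = [w.strip(".,!?") for s in sentences for w in s.split()]

print(len(sentences))  # 3 sentence tokens
print(len(words))      # 17 word tokens
```

Splitting on white space alone would glue punctuation to words ("programming."), which is why each token is stripped of punctuation here.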
Stop word removal means getting rid of very common words. In this process, we remove words that carry little
or no value in NLP, i.e. recurring terms that are not informative about the text at all. In short, we remove
frequently repeated words such as prepositions, articles, and pronouns, which have little effect on a
particular project or application.
Suppose we have the sentence "He is a good boy". Here 'he', 'is', and 'a' are not that important, so the stop
word technique removes them. But sometimes these words play an important role; in that case, we do not remove
them.
Suppose we have a paragraph:
This website is an awesome website for learning programming. I love this website. Everyone should try this.
Here you can learn a lot.
In this example we have 'is', 'a', 'for', 'you', 'here', 'I', etc. These words do not have much effect on this
particular project, so they will be removed as stop words.
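Stop word removal can be sketched as a simple filter. The stop word set below is a small illustrative list made up for this example; libraries such as NLTK ship curated per-language stop word lists (e.g. `nltk.corpus.stopwords`).

```python
# A small hand-made stop word set for illustration only;
# real NLP libraries provide much fuller, language-specific lists.
stop_words = {"he", "is", "a", "an", "the", "for", "i", "this", "here", "you"}

sentence = "He is a good boy"

# Lowercase before matching so "He" matches the stop word "he".
tokens = sentence.lower().split()
filtered = [w for w in tokens if w not in stop_words]

print(filtered)  # ['good', 'boy']
```

As the notes point out, whether a word should be treated as a stop word depends on the task: for sentiment analysis, words like "not" are usually kept even though generic lists may include them.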