Machine learning exercise 5: Fake News Prediction

Dataset Link

About the dataset

1. id: unique id for a news article
2. title: the title of the news article
3. author: the author of the news article
4. text: the text of the article; could be incomplete
5. label: a label that marks whether the news article is real or fake
Here:
1 means fake news
0 means real news

Importing Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Download stopwords from nltk library

import nltk
nltk.download("stopwords")

Getting Dataset

df = pd.read_csv("D:/fake news dataset.csv")
df.head()

Gathering some information about the data

df.shape

df.describe()

df.info()

Let's build a table with the percentage of missing values in each column and sort it in descending order

x=[]  # percentage of missing values per column
z=[]  # column names
for i in df:
    v=df[i].isnull().sum()/df.shape[0]*100
    x.append(v)
    z.append(i)
q={"Feature Name":z,"Percentage of missing values":x}
missingPercentageDataset=pd.DataFrame(q).sort_values(by="Percentage of missing values",ascending=False)
pd.set_option("display.max_rows",None)
missingPercentageDataset
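
The same table can also be built without an explicit loop. The following is a minimal sketch on a small made-up DataFrame (the `sample` frame and its values are illustrative only, not the real dataset). It relies on the fact that the mean of a boolean mask is the fraction of True values.

```python
import pandas as pd

# Hypothetical toy data standing in for the news dataset.
sample = pd.DataFrame({
    "title": ["a", None, "c", "d"],
    "author": [None, None, "x", "y"],
    "label": [1, 0, 1, 0],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction missing.
missing_pct = (
    sample.isnull().mean().mul(100)
    .sort_values(ascending=False)
    .rename_axis("Feature Name")
    .reset_index(name="Percentage of missing values")
)
print(missing_pct)
```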

Replacing the null values with empty string

df=df.fillna('')

Merging the author name and news title

df["content"]=df["author"]+" "+df["title"]
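
The order of these two steps matters. As a small illustration on made-up rows (the `demo` frame is hypothetical), concatenating a string with a missing value yields a missing value, so a row without an author would lose its title as well if the nulls were not replaced first.

```python
import pandas as pd

demo = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Budget approved", "Miracle cure found"],
})

# Without fillna: missing + string stays missing, so the second row has no content.
raw = demo["author"] + " " + demo["title"]
print(raw.isnull().tolist())

# With fillna: the missing author becomes "", and the title survives.
filled = demo.fillna("")
content = filled["author"] + " " + filled["title"]
print(content.tolist())
```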

df.head()

Stemming process

Stemming is the process of stripping affixes from words (the Porter stemmer used here strips suffixes) so that related words are reduced to a common, simpler form called a stem.

For example:
With the Porter stemmer, history becomes histori and historical becomes histor, while final, finally and finalized all become final. Similarly, go and going both become go. Note that a stem does not have to be a dictionary word.
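
To check how particular words are stemmed, a short sketch like the following can be run. It uses the same `PorterStemmer` class imported earlier; the word list is only an illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["history", "historical", "final", "finally", "finalized", "going"]

# Map each word to the stem produced by NLTK's Porter stemmer.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word:>10} -> {stem}")
```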

Let's see the stopwords

Stop words are words that carry little meaning on their own, so removing them does not change what a sentence is about. In short, here we remove very common words such as prepositions and articles, which have little effect on this particular application.

print(stopwords.words("english"))
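
The effect of stop word removal can be sketched without NLTK. The example below uses a tiny hand-picked set, `toy_stop_words`, which is an assumption made for illustration; NLTK's English list is much longer.

```python
# A tiny, hand-picked stop word set used only for illustration.
toy_stop_words = {"the", "is", "a", "of", "in", "on", "and", "to"}

sentence = "the author of the article is a reporter in london"

# Keep only the words that are not in the stop word set.
kept = [word for word in sentence.split() if word not in toy_stop_words]
print(" ".join(kept))
```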

Porter_Stemmer=PorterStemmer()
# Build the stop word set once; looking a word up in a set is much
# faster than reloading and scanning the list for every word.
stop_words=set(stopwords.words("english"))

def stemming(cont):
    # Keep letters only: every digit, punctuation mark or other
    # non-alphabetic character in the content value becomes a space.
    s_content=re.sub("[^a-zA-Z]"," ",cont)

    # Convert to lower case so that words such as "love" and "LOVE"
    # are treated as the same word.
    s_content=s_content.lower()

    # Split the sentence into a list of words so that each word can be
    # checked against the stop words.
    s_content=s_content.split()

    # Drop every stop word and stem each remaining word.
    s_content=[Porter_Stemmer.stem(word) for word in s_content if word not in stop_words]

    # Join the stemmed words back into a single sentence.
    s_content=" ".join(s_content)

    return s_content

# Replace each value of the content column with its stemmed version.
df["content"]=df["content"].apply(stemming)

df.head()

Separating the data and label

X=df["content"]
Y=df["label"]

Converting the text into numerical data

A machine learning model cannot work with text data directly; it can only work with numerical data. The content column holds text, so to use it for training we need to convert it into numerical form. To do this we will use the TF-IDF (term frequency-inverse document frequency) method.

vector_text=TfidfVectorizer()
vector_text.fit(X)
X=vector_text.transform(X)
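
To make the output of TF-IDF more concrete, here is a minimal sketch on a made-up three-sentence corpus (`toy_corpus` is illustrative, not taken from the dataset). It shows that the result has one row per document and one column per distinct word, and that a word occurring in fewer documents gets a larger weight than a more widespread word in the same sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "miracle cure shocks doctors",
    "senate passes budget bill",
    "doctors question miracle cure",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_corpus)

# One row per document, one column per distinct word in the corpus.
print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))

# In the first sentence, "shocks" occurs in one document only, while
# "miracle" occurs in two, so "shocks" receives the larger weight.
vocab = vectorizer.vocabulary_
row0 = tfidf[0].toarray().ravel()
rarer_word_weighs_more = row0[vocab["shocks"]] > row0[vocab["miracle"]]
print(rarer_word_weighs_more)
```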

Let's see the X data after applying TF-IDF

TF-IDF is applied only to X because Y is already in numerical form.

print(X)

Splitting data into train and test data

# stratify keeps the class proportions of Y in both subsets; it is only used in classification problems
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.1,stratify=Y,random_state=51)
print(X.shape, X_train.shape, X_test.shape)
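
The effect of the stratify parameter can be seen on a small made-up example. The labels below (80 real, 20 fake) and the dummy `features` array are assumptions chosen only to make the proportions easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 real (0) and 20 fake (1).
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

_, _, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=51
)

# With stratify, both subsets keep the original 20% share of fake news.
print(y_train.mean(), y_test.mean())
```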

Train the model using Logistic regression algorithm

model= LogisticRegression()
model.fit(X_train, Y_train)
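
Note that the steps above fit the TF-IDF vocabulary on the full dataset before splitting, so the test articles also influence the features. One possible alternative, sketched here on made-up headlines (the `texts` and `labels` values are illustrative), is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that the vocabulary is learned from the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up headlines: 1 = fake, 0 = real.
texts = [
    "miracle cure shocks doctors", "aliens endorse miracle diet",
    "shocking secret cure revealed", "celebrity miracle diet secret",
    "senate passes budget bill", "court reviews budget ruling",
    "city council approves bill", "senate court hearing scheduled",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, then let the pipeline fit TF-IDF on the training part.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=51
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])
clf.fit(text_train, y_train)
test_predictions = clf.predict(text_test)
print(test_predictions)
```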

Model evaluation

X_train_prediction= model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy=accuracy_score(Y_train, X_train_prediction)
print("accuracy on training data:",training_data_accuracy )

X_test_prediction= model.predict(X_test)
testing_data_accuracy=accuracy_score(Y_test, X_test_prediction)
print("accuracy on testing data:",testing_data_accuracy )
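
Accuracy alone does not show which kind of mistake the model makes. As a hedged sketch on made-up labels (not results from this dataset), a confusion matrix separates real news flagged as fake from fake news that slipped through.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = fake, 0 = real.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
print(accuracy)

# Rows are true classes, columns are predicted classes (order: 0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"real flagged as fake: {fp}, fake missed as real: {fn}")
```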

Let's do predictions

#getting the input
inputData=X_test[3]

prediction=model.predict(inputData)
print(prediction)
if prediction[0]==0:
    print("Real news")
else:
    print("Fake news")
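
The prediction above reuses a row that is already vectorized. For a new, raw headline the text has to go through `transform` of the already fitted vectorizer (not `fit` again) before it reaches the model. The sketch below uses a tiny made-up training set and a hypothetical helper, `predict_news`; stemming and stop word removal are skipped to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training headlines: 1 = fake, 0 = real.
train_texts = [
    "miracle cure shocks doctors", "aliens endorse miracle diet",
    "senate passes budget bill", "court reviews budget ruling",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_texts), train_labels)


def predict_news(text):
    """Hypothetical helper: vectorize raw text and return a readable label."""
    features = vectorizer.transform([text.lower()])
    return "Fake news" if model.predict(features)[0] == 1 else "Real news"


print(predict_news("Miracle diet cure"))
print(predict_news("Senate budget ruling"))
```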
