Machine learning exercise 4: Spam Mail Prediction
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Getting Dataset
df = pd.read_csv("D:/spam mail.csv")
df.head()
Gathering some information about the data
df.shape
df.info()
Let's build a table of the percentage of missing values in each column, sorted in descending order
x=[]
z=[]
for i in df:
    v=df[i].isnull().sum()/df.shape[0]*100
    x.append(v)
    z.append(i)
q={"Feature Name":z,"Percentage of missing values":x,}
missingPercentageDataset=pd.DataFrame(q).sort_values(by="Percentage of missing values",ascending=False)
pd.set_option("display.max_rows",None)
missingPercentageDataset
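The same kind of table can be built without an explicit loop. This is a minimal sketch, assuming a small invented frame named toy in place of the real data; it relies on the fact that the mean of the boolean mask returned by isnull() is the fraction of missing entries per column.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Category": ["ham", "spam", None, "ham"],
    "Message": ["hi", "win cash", "ok", "see you"],
})

# isnull() marks missing cells; the mean of those booleans is the missing fraction per column.
missing_pct = (
    (toy.isnull().mean() * 100)
    .rename("Percentage of missing values")
    .rename_axis("Feature Name")
    .reset_index()
    .sort_values(by="Percentage of missing values", ascending=False)
)
print(missing_pct)
```

Swapping toy for df should give the same columns as the loop version.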
Label Encoding
The Category column holds only two values, spam or ham. We encode spam as 0 and ham as 1.
df.replace({"Category":{"spam":0,"ham":1}},inplace=True)
df.head()
Transform the text data to feature vectors that can be used as input to the logistic regression
XX=df["Message"]
feature_extraction= TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
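'''
min_df=1 keeps every term that appears in at least one message, so no word is dropped for being rare.
A larger value such as min_df=2 would ignore terms that appear in fewer than two messages.
'''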
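To see what min_df does in practice, here is a small sketch on an invented three-message corpus (docs is not taken from the spam dataset). It compares the vocabulary learned with min_df=1 against min_df=2; stop words are left at their default here to keep the comparison simple.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus, for illustration only.
docs = ["free prize now", "call me now", "free lunch today"]

# min_df=1: every term that occurs in at least one document is kept.
vocab_all = sorted(TfidfVectorizer(min_df=1).fit(docs).vocabulary_)

# min_df=2: terms found in fewer than two documents are dropped.
vocab_common = sorted(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

print(vocab_all)
print(vocab_common)
```

Only "free" and "now" occur in two documents, so they are the only terms that survive min_df=2.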
Separating the data and label
X=feature_extraction.fit_transform(XX)
Y=df["Category"].astype("int")
df.info()
Splitting data into train and test data
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.1,stratify=Y, random_state=51)
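The stratify argument keeps the class proportions the same in both parts of the split. A minimal sketch, assuming invented labels with 20 spam and 80 ham entries (the real class balance may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: 20 spam (0) and 80 ham (1), mimicking an imbalanced dataset.
labels = np.array([0] * 20 + [1] * 80)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=51
)

# The mean of 0/1 labels is the ham share; with stratify it matches in both parts.
print(l_train.mean(), l_test.mean())
```

Without stratify, a small test set could by chance contain very few spam messages, which would make the test accuracy less informative.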
Train the model using the logistic regression algorithm
model= LogisticRegression()
model.fit(X_train,Y_train)
Model evaluation
X_train_prediction= model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train, X_train_prediction)
print("accuracy on training data:",training_data_accuracy )
X_test_prediction= model.predict(X_test)
testing_data_accuracy=accuracy_score(Y_test, X_test_prediction)
print("accuracy on testing data:",testing_data_accuracy )
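For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below uses invented arrays y_true and y_pred (not results from this model) to show that accuracy_score agrees with a manual calculation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels and predictions, for illustration only.
y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# accuracy_score takes the true labels first, then the predictions.
acc = accuracy_score(y_true, y_pred)

# The same number computed by hand: matching positions divided by the total.
acc_manual = np.mean(y_true == y_pred)
print(acc, acc_manual)
```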
Let's do predictions
input_dt=["WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To "
          "claim call 09061701461. Claim code KL341. Valid 12 hours only."]
input_data=feature_extraction.transform(input_dt)
prediction=model.predict(input_data)
prediction
if prediction[0]==1:
    print("ham mail")
else:
    print("spam mail")
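The prediction steps can be wrapped in a small helper so that any message can be checked with one call. This is a sketch under stated assumptions: the four training messages, the names vectorizer and clf, and the function classify are all invented for illustration, and the label convention (0 = spam, 1 = ham) follows this exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented training messages; 0 = spam, 1 = ham.
texts = [
    "win a free prize now",
    "claim your cash reward",
    "are we meeting today",
    "see you at lunch",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
clf = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

def classify(message):
    """Return 'ham mail' or 'spam mail' for a single message string."""
    # transform (not fit_transform) reuses the vocabulary learned during training.
    label = clf.predict(vectorizer.transform([message]))[0]
    return "ham mail" if label == 1 else "spam mail"

print(classify("free cash prize"))
print(classify("meeting today"))
```

Note that new text must go through transform rather than fit_transform, so that it is mapped onto the same feature columns the model was trained on.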