Machine learning exercise 4: Spam Mail Prediction

Dataset Link

Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Getting Dataset

df = pd.read_csv("D:/spam mail.csv")
df.head()

Gathering some information about the data

df.shape

df.info()

Let's build a table showing the percentage of missing values in each column, sorted from highest to lowest

x=[]
z=[]
for i in df:
    v=df[i].isnull().sum()/df.shape[0]*100
    x.append(v)
    z.append(i)
q={"Feature Name":z,"Percentage of missing values":x}
missingPercentageDataset=pd.DataFrame(q).sort_values(by="Percentage of missing values",ascending=False)
pd.set_option("display.max_rows",None)
missingPercentageDataset
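The same table can be produced without an explicit loop. The following is a minimal sketch on a small made-up DataFrame; the `toy` data and the `missing_percentage` helper are illustrative assumptions, not part of the spam dataset. It relies on `isnull().mean()`, which gives the fraction of missing values per column.

```python
import pandas as pd


def missing_percentage(frame):
    """Return one row per column with its percentage of missing values, highest first."""
    pct = frame.isnull().mean() * 100  # mean of booleans = fraction missing
    return (
        pct.rename("Percentage of missing values")
        .rename_axis("Feature Name")
        .reset_index()
        .sort_values(by="Percentage of missing values", ascending=False)
        .reset_index(drop=True)
    )


# Hypothetical toy data, used only to illustrate the helper
toy = pd.DataFrame({
    "Category": ["spam", "ham", None, "ham"],
    "Message": ["win a prize", "see you soon", "call now", "ok"],
})
print(missing_percentage(toy))
```

On the real data, the call would be `missing_percentage(df)`.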

Label Encoding

The Category column has only two values, spam and ham. We will encode spam as 0 and ham as 1.

df.replace({"Category":{"spam":0,"ham":1}},inplace=True)
df.head()
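An alternative way to do this encoding is `Series.map`, which makes it easy to check that every value was actually converted. This is only a sketch on made-up data: the `toy` frame, the `Label` column, and the whitespace/case clean-up step are assumptions for illustration, and the real dataset may not need the clean-up.

```python
import pandas as pd

# Hypothetical toy data; the last value has stray whitespace and capitals on purpose
toy = pd.DataFrame({"Category": ["spam", "ham", "ham", "Spam "]})

mapping = {"spam": 0, "ham": 1}

# Normalise the text first so "Spam " is treated the same as "spam"
cleaned = toy["Category"].str.strip().str.lower()

# map() returns NaN for any value that is missing from the mapping
toy["Label"] = cleaned.map(mapping)

# Fail early if some category was not covered by the mapping
assert toy["Label"].notna().all(), "unmapped category found"

toy["Label"] = toy["Label"].astype("int")
print(toy)
```

Unlike `replace`, `map` leaves no unmapped string behind silently: anything outside the mapping becomes NaN and is caught by the check.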

Transform the text data to feature vectors that can be used as input to the logistic regression

XX=df["Message"]

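feature_extraction= TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
'''
Here min_df=1 means a word is kept if it appears in at least 1 message, so no word is dropped for being rare.
stop_words="english" removes common English words, and lowercase=True converts all text to lowercase first.
'''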
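To see what `min_df` does in practice, here is a small sketch on three made-up messages (the `docs` list and the variable names are assumptions for illustration only). It compares `min_df=1` with `min_df=2`, where a word must appear in at least two messages to be kept.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-message corpus, used only for illustration
docs = [
    "free prize waiting",
    "lunch meeting tomorrow",
    "free entry to win a prize",
]

# min_df=1 keeps every word that appears in at least one message
vec_all = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
vec_all.fit(docs)

# min_df=2 keeps only words that appear in at least two messages
vec_common = TfidfVectorizer(min_df=2, stop_words="english", lowercase=True)
matrix = vec_common.fit_transform(docs)

print(sorted(vec_all.vocabulary_))
print(sorted(vec_common.vocabulary_))
print(matrix.shape)  # (number of messages, number of kept words)
```

Each row of `matrix` is one message and each column is one kept word, so a larger `min_df` gives fewer columns.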

Separating the data and label

X=feature_extraction.fit_transform(XX)
Y=df["Category"].astype("int")

df.info()

Splitting data into train and test data

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.1,stratify=Y, random_state=51)
# stratify=Y keeps the spam/ham ratio the same in the train and test sets; it is used for classification problems
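One variation worth considering: in the steps above the vectorizer is fitted on every message before the split, so the test messages influence the vocabulary and the IDF weights. Below is a minimal sketch of splitting the raw text first and fitting the vectorizer only on the training part, using scikit-learn's `make_pipeline`. The toy messages and the names `toy`, `pipe`, `msg_train`, and `msg_test` are illustrative assumptions, not the tutorial's dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 0 = spam, 1 = ham
toy = pd.DataFrame({
    "Message": [
        "free prize claim now",
        "win cash prize today",
        "urgent claim your free reward",
        "winner selected for cash reward",
        "are we meeting for lunch",
        "see you at home tonight",
        "please send the report tomorrow",
        "thanks for the birthday wishes",
    ],
    "Category": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Split the raw text, keeping the spam/ham ratio equal in both parts
msg_train, msg_test, y_train, y_test = train_test_split(
    toy["Message"], toy["Category"],
    test_size=0.25, stratify=toy["Category"], random_state=51,
)

# The pipeline fits the vectorizer on the training messages only
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
pipe.fit(msg_train, y_train)

test_accuracy = pipe.score(msg_test, y_test)
print("accuracy on testing data:", test_accuracy)
```

With a pipeline, new messages can be passed to `pipe.predict` as plain strings, because the vectorizing step is applied automatically.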

Train the model using Logistic regression algorithm

model= LogisticRegression()
model.fit(X_train,Y_train)

Model evaluation

X_train_prediction= model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train, X_train_prediction)
print("accuracy on training data:",training_data_accuracy )

X_test_prediction= model.predict(X_test)
testing_data_accuracy=accuracy_score(Y_test, X_test_prediction)
print("accuracy on testing data:",testing_data_accuracy )
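Accuracy alone can look good even when many spam mails are missed, especially if one class is much more common than the other. The sketch below uses made-up labels (the `y_true` and `y_pred` lists are assumptions, not results from this dataset) to show how a confusion matrix and the precision and recall for the spam class give a fuller picture.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels, 0 = spam and 1 = ham
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# Rows are the true classes, columns are the predicted classes (order: spam, ham)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# pos_label=0 makes spam the class being measured
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print("accuracy:", accuracy)
print(cm)
print("spam precision:", spam_precision)
print("spam recall:", spam_recall)
```

Here the accuracy is 0.8, yet only one of the three spam mails is caught, which the spam recall makes visible.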

Let's do predictions

input_dt=["WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."]
input_data=feature_extraction.transform(input_dt)

prediction=model.predict(input_data)
prediction
if prediction[0]==1:
    print("ham mail")
else:
    print("spam mail")
