Machine learning exercise 1: Car Price Prediction
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn import metrics
Getting the data
df = pd.read_csv("D:/car data.csv")
df.head()
Gathering some information about the data
df.shape
df.describe()
df.info()
Let's build a table with the percentage of missing values in each column and sort it in descending order
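x = []  # percentage of missing values per column
z = []  # column names
for i in df:
    # share of null entries in column i, expressed as a percentage
    v = df[i].isnull().sum() / df.shape[0] * 100
    x.append(v)
    z.append(i)
q = {"Feature Name": z, "Percentage of missing values": x}
missingPercentageDataset = pd.DataFrame(q).sort_values(by="Percentage of missing values", ascending=False)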
pd.set_option("display.max_rows",None)
missingPercentageDataset
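The same table can be built without an explicit loop. The sketch below is one possible vectorized version; the `toy` frame and its values are made up for illustration and only stand in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Year": [2014, 2013, np.nan, 2011],
    "Present_Price": [5.59, np.nan, np.nan, 4.15],
    "Kms_Driven": [27000, 43000, 6900, 5200],
})

# isnull() yields a boolean frame; the column-wise mean is the fraction of missing values.
missing_pct = (
    toy.isnull().mean().mul(100)
    .rename("Percentage of missing values")
    .rename_axis("Feature Name")
    .reset_index()
    .sort_values(by="Percentage of missing values", ascending=False)
)
print(missing_pct)
```

Replacing `toy` with `df` should give the same table as the loop above.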
print("Fuel type")
print(df["Fuel_Type"].value_counts())
print("================")
print("Seller_Type")
print(df["Seller_Type"].value_counts())
print("================")
print("Transmission")
print(df["Transmission"].value_counts())
Label encoding
df.replace({"Fuel_Type":{"Petrol":2,"Diesel":1,"CNG":0}},inplace=True)
df.replace({"Seller_Type":{"Dealer":1,"Individual":0}},inplace=True)
df.replace({"Transmission":{"Manual":1,"Automatic":0}},inplace=True)
df.head()
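As an alternative sketch of the same encoding, `Series.map` can be applied column by column. Any label that is missing from the mapping becomes NaN, which makes unexpected categories easy to catch. The `toy` frame and the `encodings` dictionary name are assumptions for illustration; the codes themselves are the ones used above.

```python
import pandas as pd

# Hypothetical toy frame with the three categorical columns; rows are invented.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# Same integer codes as in the replace() calls above.
encodings = {
    "Fuel_Type": {"Petrol": 2, "Diesel": 1, "CNG": 0},
    "Seller_Type": {"Dealer": 1, "Individual": 0},
    "Transmission": {"Manual": 1, "Automatic": 0},
}

encoded = toy.copy()
for column, mapping in encodings.items():
    encoded[column] = toy[column].map(mapping)
    # map() leaves NaN for any label that is not a key of the mapping
    assert encoded[column].notna().all(), f"unmapped label in {column}"

print(encoded)
```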
Scaling the data
independent_features=df.drop(columns=["Car_Name","Selling_Price"])
scaler= StandardScaler()
transform_scale_data=scaler.fit_transform(independent_features)
transform_scale_data
Separating the data and label
X=transform_scale_data
Y=df["Selling_Price"]
X
Splitting data into train and test data
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2, random_state=1)
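Note that the steps above fit the scaler on all rows before splitting, so the test rows also contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `features` and `target` arrays are invented stand-ins, not the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for independent_features and Selling_Price (assumption).
rng = np.random.default_rng(0)
features = rng.normal(loc=10.0, scale=3.0, size=(50, 4))
target = features @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=50)

# Split the raw (unscaled) features first.
X_train_raw, X_test_raw, Y_train, Y_test = train_test_split(
    features, target, test_size=0.2, random_state=1
)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

print(X_train.mean(axis=0))  # close to 0 for every column
print(X_test.mean(axis=0))   # usually not exactly 0, which is expected
```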
Training the model using the linear regression algorithm
model= LinearRegression()
model.fit(X_train,Y_train)
Model evaluation
We can't use the accuracy score for a regression problem. Accuracy counts how many predictions exactly match
the actual labels, which works for classification but is not meaningful when the target is a continuous
number. For regression we instead measure how far the predicted values are from the original values, using
metrics such as R squared and mean absolute error.
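To make the two metrics concrete, here is a small worked sketch that computes them by hand and compares the results with `sklearn.metrics`. R squared is 1 minus the ratio of the residual sum of squares to the total sum of squares, and mean absolute error is the average of |actual - predicted|. The `actual` and `predicted` arrays are made-up numbers, not values from the car data.

```python
import numpy as np
from sklearn import metrics

# Invented example values (assumption), chosen so the arithmetic is easy to follow.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: 0.25 + 0.25 + 0 + 1 = 1.5
ss_res = np.sum((actual - predicted) ** 2)
# Total sum of squares around the mean (6.0): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot                    # 1 - 1.5 / 20 = 0.925
mae_manual = np.mean(np.abs(actual - predicted))   # (0.5 + 0.5 + 0 + 1) / 4 = 0.5

print(r2_manual, metrics.r2_score(actual, predicted))
print(mae_manual, metrics.mean_absolute_error(actual, predicted))
```

An R squared close to 1 and a mean absolute error close to 0 both indicate predictions that sit near the actual values.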
Train data evaluation
training_data_prediction=model.predict(X_train)
training_data_prediction
Visualization of the actual values vs the predicted values for the training data
plt.scatter(Y_train,training_data_prediction)
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.title("Actual value vs Predicted value")
plt.show()
R squared score for training data
training_data_prediction holds the predictions of our model and Y_train holds the actual values of the
Selling_Price column. Let's use r2_score to compare them. R squared is a goodness-of-fit score rather than an
error: the closer it is to 1, the more of the variation in the actual prices the model explains.
score_using_r2_score=metrics.r2_score(Y_train,training_data_prediction)
score_using_r2_score
Mean absolute error for training data
training_data_prediction holds the predictions of our model and Y_train holds the actual values of the
Selling_Price column. Let's use mean_absolute_error to measure the average absolute difference between them.
The lower the value, the closer the predictions are to the actual prices, in the same units as Selling_Price.
score_using_mean_absolute_error=metrics.mean_absolute_error(Y_train,training_data_prediction)
score_using_mean_absolute_error
Prediction on the testing data
testing_data_prediction=model.predict(X_test)
testing_data_prediction
Visualization of the actual values vs the predicted values for the testing data
plt.scatter(Y_test,testing_data_prediction)
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.title("Actual value vs Predicted value")
plt.show()
R squared score for testing data
score_using_r2_score=metrics.r2_score(Y_test,testing_data_prediction)
score_using_r2_score
Mean absolute error for testing data
score_using_mean_absolute_error=metrics.mean_absolute_error(Y_test,testing_data_prediction)
score_using_mean_absolute_error
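Because the same variable names are reused for the training and testing scores, the earlier values are overwritten. One possible way to keep both sets of scores and compare them side by side is a small helper, sketched below. The `evaluate` function is hypothetical, and the data is synthetic rather than the car data.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def evaluate(model, X, Y):
    """Return R squared and mean absolute error for one split (hypothetical helper)."""
    prediction = model.predict(X)
    return {
        "r2": metrics.r2_score(Y, prediction),
        "mae": metrics.mean_absolute_error(Y, prediction),
    }


# Synthetic stand-in for the scaled features and Selling_Price (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
model = LinearRegression().fit(X_train, Y_train)

train_scores = evaluate(model, X_train, Y_train)
test_scores = evaluate(model, X_test, Y_test)
print("train:", train_scores)
print("test: ", test_scores)
```

A training score that is much better than the testing score would suggest that the model is overfitting the training data.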