Machine learning exercise 2: Flight Fare Prediction
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
Getting the data
If the train and test data come in separate files, we first concatenate them into one dataset so that cleaning and feature engineering are applied consistently to both.
Getting train data
train_data = pd.read_excel("D:/Flight_fare_train_data.xlsx")
train_data.head()
train_data.shape
Getting test data
test_data = pd.read_excel("D:/Flight_fare_test_data.xlsx")
test_data.head()
test_data.shape
Concatenating train and test data to create final dataset
df=pd.concat([train_data,test_data],ignore_index=True,sort=False)
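Since df stacks the train rows first and the test rows after them, the original split can be recovered later from the train row count if needed; a minimal sketch (the n_train/train_part/test_part names are just for illustration):
n_train=train_data.shape[0]        # boundary between train and test rows in df
train_part=df.iloc[:n_train]       # the original training rows
test_part=df.iloc[n_train:]        # the original test rows (Price will be NaN here if the test file has no Price column)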
pd.set_option("display.max_rows",None)
df.head()
Gathering some information about the new final dataset
df.shape
df.describe()
df.info()
Data cleaning
Checking the percentage of NaN values in each column and collecting the results in a data frame
x=[]
z=[]
for i in df:
    v=df[i].isnull().sum()/df.shape[0]*100
    x.append(v)
    z.append(i)
q={"Feature Name":z,"Percentage of missing values":x}
missingPercentageDataset=pd.DataFrame(q).sort_values(by="Percentage of missing values",ascending=False)
pd.set_option("display.max_rows",None)
missingPercentageDataset
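Equivalently, pandas can compute the same percentages without an explicit loop; an equivalent one-liner:
missing_pct=df.isnull().mean()*100            # fraction of NaNs per column, as a percentage
missing_pct.sort_values(ascending=False)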
Creating a bar plot of each column's percentage of NaN values
plt.figure(figsize=(15,7),dpi=100)
plt.xticks(rotation=90)
sns.barplot(x="Feature Name",y="Percentage of missing values",data=missingPercentageDataset)
plt.title("Percentage of missing values per feature",fontsize=15)
plt.xlabel("Feature Name",fontsize=12)
plt.ylabel("Percentage of missing values",fontsize=15)
plt.show()
Filling missing values of the categorical columns
Let's print all the categorical columns
categorical_column=df.select_dtypes(include="object")
categorical_column.head(2)
Let's see the percentage of missing values contained by each categorical column.
cmp=categorical_column.isnull().sum()/df.shape[0]*100
cmp
Getting the names of the columns that contain more than 17% missing values
dmvc=cmp[cmp>17].keys()
dmvc
Dropping the columns that contain more than 17% missing values (note that drop returns a new frame, so the result must be assigned back)
categorical_column=categorical_column.drop(columns=dmvc)
categorical_column.head(2)
Let's see the unique values of the Airline column
These values will be used to group the rows when filling the columns that contain missing values
df["Airline"].unique()
Let's fill the missing values. For every airline we compute the most frequent value (mode) of each column and use it to fill that airline's missing entries. A groupby/transform keeps the original row order, so the cleaned columns stay aligned with df when we merge everything back by index later.
fill_cols=["Route","Duration","Total_Stops","Additional_Info"]
categorical_cleaned_data=df[fill_cols].copy()
for col in fill_cols:
    # Fill each airline's missing entries with that airline's most frequent value
    categorical_cleaned_data[col]=df.groupby("Airline")[col].transform(
        lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s
    )
categorical_cleaned_data.head()
Let's check whether all the missing values are gone
cmp=categorical_cleaned_data.isnull().sum()/df.shape[0]*100
cmp
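As a quick illustration of the per-airline mode fill, here is the same mechanism on a tiny hypothetical frame:
toy=pd.DataFrame({"Airline":["A","A","A","B","B"],"Route":["X","X",None,"Y",None]})
toy["Route"]=toy.groupby("Airline")["Route"].transform(
    lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s
)
print(toy)   # A's missing Route becomes "X", B's becomes "Y"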
Filling missing values of the numerical columns
Let's print the names of the numeric columns.
numerical_column=df.select_dtypes(include=["int64","float64"]).columns
print(numerical_column)
Let's print all the numeric columns
numerical_column=df.select_dtypes(include=["int64","float64"])
numerical_column.head(2)
Let's see what percentage of missing values each numeric column contains.
nmp=numerical_column.isnull().sum()/df.shape[0]*100
nmp
Getting the names of the numeric columns that contain more than 17% missing values
num_cols_to_drop=nmp[nmp>17].keys()
num_cols_to_drop
Dropping the columns that contain more than 17% missing values
numerical_col=numerical_column.drop(columns=num_cols_to_drop)
Let's look at the unique values of the Airline column again; they will be used to group the rows when filling the numeric column.
df["Airline"].unique()
Let's fill the missing values. Price is the only numeric column we fill, and we replace each airline's missing prices with that airline's mean price, again using groupby/transform so the rows stay aligned with df.
# Fill each airline's missing prices with that airline's mean price
price_f=df.groupby("Airline")["Price"].transform(lambda s: s.fillna(s.mean()))
numerical_clean_data=price_f.to_frame()
numerical_clean_data.head()
Let's check whether all the missing values are gone
numerical_clean_data.isnull().sum()
Creating the clean dataset
First we take the columns on which we didn't perform any cleaning and store them in a variable. Then we merge the cleaned categorical data with the cleaned numerical data. Finally we merge these two results to create the new dataset.
Let's take the columns from df that don't contain any missing values
get_data_from_df=df[["Airline","Date_of_Journey","Source","Destination","Dep_Time","Arrival_Time"]]
Merging the cleaned categorical columns and the cleaned numerical column
merging_numerical_and_categorical_clean_data=pd.merge(categorical_cleaned_data,numerical_clean_data,right_index=True,left_index=True)
Merging get_data_from_df and merging_numerical_and_categorical_clean_data to get the final dataset
fdf=pd.merge(get_data_from_df,merging_numerical_and_categorical_clean_data,right_index=True,left_index=True)
Checking for NaN values in the final cleaned data
nmpp=fdf.isnull().sum()/df.shape[0]*100
nmpp
Converting all the data into the correct data types
fdf.info()
Creating new features
Splitting and creating a new feature from Total_Stops
fdf["Total_Stops"]=fdf["Total_Stops"].astype(str)
Y=[]
for i in fdf["Total_Stops"]:
if len(i)==7:
Y.append(i[:1])
if len(i)==6:
Y.append(i[:1])
if i=="non-stop":
Y.append(i.replace("non-stop","0"))
if i=="nan":
Y.append(i.replace("nan","0"))
dic={"Stops":Y}
tempDataFrame=pd.DataFrame(dic)
fdf=fdf.join(tempDataFrame)
del fdf["Total_Stops"]
fdf.head(2)
Splitting and creating new features from the Duration column
fdf["Duration"]=fdf["Duration"].astype(str)
hour=[]
mnts=[]
for i in fdf["Duration"]:
    # Durations look like "2h 50m", "19h" or "50m"
    h,m=0,0
    for part in i.split():
        if part.endswith("h"):
            h=int(part[:-1])
        elif part.endswith("m"):
            m=int(part[:-1])
    hour.append(h)
    mnts.append(m)
durationdic={"Duration_Hours":hour,"Duration_minutes":mnts}
durationTempDataFrame=pd.DataFrame(durationdic)
fdf=fdf.join(durationTempDataFrame)
del fdf["Duration"]
fdf.head(2)
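A quick sanity check of the parsing logic on a few sample duration strings:
for s in ["2h 50m","19h","50m"]:
    h=m=0
    for part in s.split():
        if part.endswith("h"):
            h=int(part[:-1])
        elif part.endswith("m"):
            m=int(part[:-1])
    print(s,"->",h,"hours",m,"minutes")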
Splitting and creating new features from the Dep_Time column
fdf["Dep_Time"]=fdf["Dep_Time"].astype(str)
dephour=[]
depmnts=[]
for i in fdf["Dep_Time"]:
    # Departure times look like "22:20"; split on the colon
    h,_,m=i.partition(":")
    dephour.append(h)
    depmnts.append(m if m else "0")
dparturedic={"Departure_Hours":dephour,"Departure_minutes":depmnts}
departureTempDataFrame=pd.DataFrame(dparturedic)
fdf=fdf.join(departureTempDataFrame)
del fdf["Dep_Time"]
fdf.head(2)
Splitting and creating new features from the Arrival_Time column
fdf["Arrival_Time"]=fdf["Arrival_Time"].astype(str)
arrivalHour=[]
arrivalMnts=[]
for i in fdf["Arrival_Time"]:
    # Arrival times look like "04:25" or "04:25 22 Mar"; keep only the clock part
    t=i.split()[0]
    h,_,m=t.partition(":")
    arrivalHour.append(h)
    arrivalMnts.append(m)
arrivaldic={"Arrival_Hours":arrivalHour,"Arrival_minutes":arrivalMnts}
arrivaldicTempDataFrame=pd.DataFrame(arrivaldic)
fdf=fdf.join(arrivaldicTempDataFrame)
del fdf["Arrival_Time"]
fdf.head(2)
Splitting and creating new features from the Date_of_Journey column
fdf["Date_of_Journey"]=pd.to_datetime(fdf["Date_of_Journey"],format="%d/%m/%Y")
# Once the column is a datetime, the .dt accessor extracts the parts directly
journeydic={"journey_year":fdf["Date_of_Journey"].dt.year,
            "journey_Month":fdf["Date_of_Journey"].dt.month,
            "journey_Day":fdf["Date_of_Journey"].dt.day}
journeydicTempDataFrame=pd.DataFrame(journeydic)
fdf=fdf.join(journeydicTempDataFrame)
del fdf["Date_of_Journey"]
fdf.head(2)
Again changing the data types of the features into the correct types
Because we have created new features and removed old ones, we need to check the data types again and convert them into the correct types
fdf["Stops"]=fdf["Stops"].astype(int)
fdf["Duration_Hours"]=fdf["Duration_Hours"].astype(int)
fdf["Duration_minutes"]=fdf["Duration_minutes"].astype(int)
fdf["Departure_Hours"]=fdf["Departure_Hours"].astype(int)
fdf["Departure_minutes"]=fdf["Departure_minutes"].astype(int)
fdf["Arival_Hours"]=fdf["Arival_Hours"].astype(int)
fdf["Arival_minutes"]=fdf["Arival_minutes"].astype(int)
fdf["journey_year"]=fdf["journey_year"].astype(int)
fdf["journey_Month"]=fdf["journey_Month"].astype(int)
fdf["journey_Day"]=fdf["journey_Day"].astype(int)
Checking whether all the columns have been converted to the correct data types
fdf.info()
Dropping features that don't have much effect on the target variable
fdf.drop(columns=["Route","Additional_Info"],inplace=True)
fdf.head()
Selecting features with a feature selection technique
We perform feature selection on the independent variables only, so let's separate them out into their own variable.
independent_feature=fdf.drop("Price",axis=1)
independent_feature.head(2)
Plotting a heatmap of the correlations between the columns
plt.figure(figsize=(12,10))
cor=independent_feature.corr()
sns.heatmap(cor,annot=True,cmap=plt.cm.CMRmap_r)
plt.show()
Getting the highly correlated columns
def correlation(dataset,threshold):
    col_corr=set()
    corr_matrix=dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # If this pair is correlated beyond the threshold, mark one of the two for dropping
            if abs(corr_matrix.iloc[i,j])>threshold:
                colname=corr_matrix.columns[i]
                col_corr.add(colname)
    return col_corr
Getting the column names that are more than 70% correlated with another feature and need to be dropped
corr_features=correlation(independent_feature,0.7)
corr_features
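To see which pairs actually crossed the threshold, a small inspection snippet (a sketch reusing the same correlation matrix):
corr_matrix=independent_feature.corr()
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i,j])>0.7:
            # Print both members of the highly correlated pair and their correlation
            print(corr_matrix.columns[i],"<->",corr_matrix.columns[j],round(corr_matrix.iloc[i,j],2))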
Dropping the columns that are more than 70% correlated
independent_feature.drop(corr_features,axis=1,inplace=True)
independent_feature.head(2)
Creating a new dataset from the remaining columns and the target variable
dependent_feature=fdf["Price"]
fdfn=pd.merge(independent_feature,dependent_feature,right_index=True,left_index=True)
Encoding
encoding=OneHotEncoder(sparse_output=False,drop="first")  # on scikit-learn < 1.2 use sparse=False instead of sparse_output
result_encod=encoding.fit_transform(fdfn[["Airline","Source","Destination"]])
result_encod
Getting the column names after encoding, using dummy variables
pd.get_dummies with drop_first=True produces the same columns as OneHotEncoder with drop="first", so we can use get_dummies just to read off the resulting column names.
pd.get_dummies(fdfn[["Airline","Source","Destination"]],drop_first=True).keys()
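A tiny illustration of that equivalence on a hypothetical single-column frame:
toy=pd.DataFrame({"Source":["Delhi","Kolkata","Delhi"]})
print(pd.get_dummies(toy,drop_first=True).columns.tolist())   # ['Source_Kolkata']
enc=OneHotEncoder(sparse_output=False,drop="first").fit(toy)
print(enc.get_feature_names_out(["Source"]).tolist())         # ['Source_Kolkata']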
Creating a dataframe from the encoded columns
dataFrame_encode_col=pd.DataFrame(result_encod,columns=encoding.get_feature_names_out(["Airline","Source","Destination"]))  # get_feature_names_out needs scikit-learn >= 1.0
fdfn.columns
Getting all the other columns which are not encoded
numeric_col=fdfn[['Price','Stops','Duration_minutes','Departure_Hours','Departure_minutes','Arrival_Hours','Arrival_minutes','journey_year','journey_Month','journey_Day']]  # Duration_Hours is assumed to have been removed by the correlation filter above
Creating a dataframe from the encoded and non-encoded columns
data_after_enco=pd.merge(dataFrame_encode_col,numeric_col,left_index=True,right_index=True)
data_after_enco.head(2)
Feature Scaling
xx=data_after_enco.drop(columns="Price",axis=1)
sc=StandardScaler()
XX_scaled=sc.fit_transform(xx)
Separating the target and the independent columns
X=XX_scaled
y=data_after_enco["Price"]  # use the cleaned Price values, not df["Price"], which still contains NaNs
Model training
# train_test_split was imported earlier but never used; the model needs train and test sets
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.2,random_state=42)
model=RandomForestRegressor(n_estimators=100,criterion="squared_error")  # "mse" was renamed to "squared_error" in newer scikit-learn
model.fit(X_train,Y_train)
model.score(X_test,Y_test)
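A random forest also exposes per-feature importances; a quick look (assuming data_after_enco's column order matches X, which is how xx was built):
importances=pd.Series(model.feature_importances_,index=data_after_enco.drop(columns="Price").columns)
print(importances.sort_values(ascending=False).head(10))   # the ten most influential features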
Model evaluation
Evaluation on training data
training_data_prediction=model.predict(X_train)
training_data_prediction
Visualization of the actual values vs. the predicted values on the training data
plt.scatter(Y_train,training_data_prediction)
plt.xlabel("actual value")
plt.ylabel("Predicted value")
plt.title("Actual value vs Predicted value")
plt.show()
R squared score on the training data
training_data_prediction holds our model's predictions and Y_train holds the actual values of the Price column. Let's use r2_score to quantify how closely the predictions match the actual values.
score_using_r2_score=metrics.r2_score(Y_train,training_data_prediction)
score_using_r2_score
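For reference, r2_score implements R^2 = 1 - SS_res / SS_tot; a quick manual check with numpy:
ss_res=np.sum((Y_train-training_data_prediction)**2)   # residual sum of squares
ss_tot=np.sum((Y_train-Y_train.mean())**2)             # total sum of squares
print(1-ss_res/ss_tot)                                 # should match r2_score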
Evaluation on the testing data
testing_data_prediction=model.predict(X_test)
testing_data_prediction
Visualization of the actual values vs. the predicted values on the testing data
plt.scatter(Y_test,testing_data_prediction)
plt.xlabel("actual value")
plt.ylabel("Predicted value")
plt.title("Actual value vs Predicted value")
plt.show()
R squared score on the testing data
testing_data_prediction holds our model's predictions and Y_test holds the actual values of the Price column. Let's use r2_score again to see how well the model generalizes to unseen data.
score_using_r2_score=metrics.r2_score(Y_test,testing_data_prediction)
score_using_r2_score
Mean absolute error for the testing data
Let's also compute mean_absolute_error, which reports the average absolute difference between the predicted and actual prices, in the same units as Price.
score_using_mean_absolute_error=metrics.mean_absolute_error(Y_test,testing_data_prediction)
score_using_mean_absolute_error
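As with R^2, this metric is easy to verify by hand, since MAE is the mean of the absolute residuals; a minimal check:
print(np.mean(np.abs(Y_test-testing_data_prediction)))   # should match mean_absolute_error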