
Machine learning exercise 2: Flight Fare Prediction

Dataset Link

Importing Libraries

#Basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Feature selection libraries
from sklearn.feature_selection import VarianceThreshold

#Encoding libraries
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

#Feature scaling library
from sklearn.preprocessing import StandardScaler

#Data split library
from sklearn.model_selection import train_test_split

#importing training model
from sklearn.ensemble import RandomForestRegressor

from sklearn import metrics

Getting the data

If the train and test data come as separate files, first concatenate them into one dataset so that all cleaning steps are applied consistently to both.

Getting train data

train_data = pd.read_excel("D:/Flight_fare_train_data.xlsx")
train_data.head()

train_data.shape

Getting test data

test_data = pd.read_excel("D:/Flight_fare_test_data.xlsx")
test_data.head()

test_data.shape

Concatenating train and test data to create the final dataset

df=pd.concat([train_data,test_data],ignore_index=True,sort=False)
pd.set_option("display.max_rows",None)
df.head()

Gathering some information about the new dataset

df.shape

df.describe()

df.info()

Data cleaning

Checking the percentage of NaN values in each column and building a dataframe of the results

x=[]
z=[]
for i in df:
    v=df[i].isnull().sum()/df.shape[0]*100
    x.append(v)
    z.append(i)
q={"Feature Name":z,"Percentage of missing values":x}
missingPercentageDataset=pd.DataFrame(q).sort_values(by="Percentage of missing values",ascending=False)
pd.set_option("display.max_rows",None)
missingPercentageDataset
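
As a side note, pandas can produce the same table in one line; a minimal sketch using the same df as above:

#Equivalent one-liner: the mean of the boolean null mask, expressed as a percentage
missing_pct=(df.isnull().mean()*100).sort_values(ascending=False)
missing_pct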

Creating a bar chart of each column's percentage of NaN values

plt.figure(figsize=(15,7),dpi=100)
plt.xticks(rotation=90)
sns.barplot(x="Feature Name",y="Percentage of missing values",data=missingPercentageDataset)
plt.title("Percentage of missing values per feature",fontsize=15)
plt.xlabel("Feature Name",fontsize=12)
plt.ylabel("Percentage of missing values",fontsize=15)
plt.show()

Filling missing values of categorical column

Let's print all the categorical columns

categorical_column=df.select_dtypes(include="object")
categorical_column.head(2)

Let's see the percentage of missing values contained by each categorical column.

cmp=categorical_column.isnull().sum()/df.shape[0]*100
cmp

Getting the column names which contain more than 17% missing values

dmvc=cmp[cmp>17].keys()
dmvc

Dropping the columns which contain more than 17% missing values

categorical_column=categorical_column.drop(columns=dmvc)
categorical_column.head(2)

Let's see the unique values of the Airline column

These unique values will be used to filter the rows when filling the columns that contain missing values

df["Airline"].unique()

Let's fill the missing values

indiGo=df.Airline=="IndiGo"
air_india=df.Airline=="Air India"
jet_airways=df.Airline=="Jet Airways"
air_asia=df.Airline=="Air Asia"
multiple_carriers=df.Airline=="Multiple carriers"
goAir=df.Airline=="GoAir"
spice_jet=df.Airline=="SpiceJet"
vistara=df.Airline=="Vistara"
vistara_premium_economy=df.Airline=="Vistara Premium economy"
multiple_premium_economy=df.Airline=="Multiple carriers Premium economy"
jet_airways_business=df.Airline=="Jet Airways Business"


#----------------------------------------------------------------
#------------------------------------------------------------------
#---------------Performing Filters according to Airline -----------------
#----------------------------------------------------------------
#------------------------------------------------------------------


indiGo_F=df.loc[indiGo,["Route","Duration","Total_Stops","Additional_Info"]]
air_india_F=df.loc[air_india,["Route","Duration","Total_Stops","Additional_Info"]]
jet_airways_F=df.loc[jet_airways,["Route","Duration","Total_Stops","Additional_Info"]]
air_asia_F=df.loc[air_asia,["Route","Duration","Total_Stops","Additional_Info"]]
multiple_carriers_F=df.loc[multiple_carriers,["Route","Duration","Total_Stops","Additional_Info"]]
goAir_F=df.loc[goAir,["Route","Duration","Total_Stops","Additional_Info"]]
spice_jet_F=df.loc[spice_jet,["Route","Duration","Total_Stops","Additional_Info"]]
vistara_F=df.loc[vistara,["Route","Duration","Total_Stops","Additional_Info"]]
vistara_premium_economy_F=df.loc[vistara_premium_economy,["Route","Duration","Total_Stops","Additional_Info"]]
multiple_premium_economy_F=df.loc[multiple_premium_economy,["Route","Duration","Total_Stops","Additional_Info"]]
jet_airways_business_F=df.loc[jet_airways_business,["Route","Duration","Total_Stops","Additional_Info"]]


#------------------------------------------------
#--------------------------------------------------
#-----------------Getting mode value---------------
#------------------------------------------
#--------------------------------------------


indiGo_M=indiGo_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
air_india_M=air_india_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
jet_airways_M=jet_airways_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
air_asia_M=air_asia_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
multiple_carriers_M=multiple_carriers_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
goAir_M=goAir_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
spice_jet_M=spice_jet_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
vistara_M=vistara_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
vistara_premium_economy_M=vistara_premium_economy_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
multiple_premium_economy_M=multiple_premium_economy_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()
jet_airways_business_M=jet_airways_business_F[["Route","Duration","Total_Stops","Additional_Info"]].mode()


#-------------------------------------------------
#--------------------------------------------------
#---------------Filling Route------------------------
#------------------------------------------
#--------------------------------------------


indiGo_FM=indiGo_F["Route"].fillna(indiGo_M.loc[0][0])
air_india_FM=air_india_F["Route"].fillna(air_india_M.loc[0][0])
jet_airways_FM=jet_airways_F["Route"].fillna(jet_airways_M.loc[0][0])
air_asia_FM=air_asia_F["Route"].fillna(air_asia_M.loc[0][0])
multiple_carriers_FM=multiple_carriers_F["Route"].fillna(multiple_carriers_M.loc[0][0])
goAir_FM=goAir_F["Route"].fillna(goAir_M.loc[0][0])
spice_jet_FM=spice_jet_F["Route"].fillna(spice_jet_M.loc[0][0])
vistara_FM=vistara_F["Route"].fillna(vistara_M.loc[0][0])
vistara_premium_economy_FM=vistara_premium_economy_F["Route"].fillna(vistara_premium_economy_M.loc[0][0])
multiple_premium_economy_FM=multiple_premium_economy_F["Route"].fillna(multiple_premium_economy_M.loc[0][0])
jet_airways_business_FM=jet_airways_business_F["Route"].fillna(jet_airways_business_M.loc[0][0])

route_f=pd.concat([indiGo_FM,air_india_FM,jet_airways_FM,air_asia_FM,multiple_carriers_FM,goAir_FM,spice_jet_FM,vistara_FM,vistara_premium_economy_FM,multiple_premium_economy_FM,jet_airways_business_FM],ignore_index=True)


#-------------------------------------------------
#--------------------------------------------------
#---------------Filling Duration------------------------
#------------------------------------------
#--------------------------------------------
indiGo_FM2=indiGo_F["Duration"].fillna(indiGo_M.loc[0][1])
air_india_FM2=air_india_F["Duration"].fillna(air_india_M.loc[0][1])
jet_airways_FM2=jet_airways_F["Duration"].fillna(jet_airways_M.loc[0][1])
air_asia_FM2=air_asia_F["Duration"].fillna(air_asia_M.loc[0][1])
multiple_carriers_FM2=multiple_carriers_F["Duration"].fillna(multiple_carriers_M.loc[0][1])
goAir_FM2=goAir_F["Duration"].fillna(goAir_M.loc[0][1])
spice_jet_FM2=spice_jet_F["Duration"].fillna(spice_jet_M.loc[0][1])
vistara_FM2=vistara_F["Duration"].fillna(vistara_M.loc[0][1])
vistara_premium_economy_FM2=vistara_premium_economy_F["Duration"].fillna(vistara_premium_economy_M.loc[0][1])
multiple_premium_economy_FM2=multiple_premium_economy_F["Duration"].fillna(multiple_premium_economy_M.loc[0][1])
jet_airways_business_FM2=jet_airways_business_F["Duration"].fillna(jet_airways_business_M.loc[0][1])
duration_f=pd.concat([indiGo_FM2,air_india_FM2,jet_airways_FM2,air_asia_FM2,multiple_carriers_FM2,goAir_FM2,spice_jet_FM2,vistara_FM2,vistara_premium_economy_FM2,multiple_premium_economy_FM2,jet_airways_business_FM2],ignore_index=True)


#-------------------------------------------------
#--------------------------------------------------
#---------------Filling Total_Stops------------------------
#------------------------------------------
#--------------------------------------------


indiGo_FM3=indiGo_F["Total_Stops"].fillna(indiGo_M.loc[0][2])
air_india_FM3=air_india_F["Total_Stops"].fillna(air_india_M.loc[0][2])
jet_airways_FM3=jet_airways_F["Total_Stops"].fillna(jet_airways_M.loc[0][2])
air_asia_FM3=air_asia_F["Total_Stops"].fillna(air_asia_M.loc[0][2])
multiple_carriers_FM3=multiple_carriers_F["Total_Stops"].fillna(multiple_carriers_M.loc[0][2])
goAir_FM3=goAir_F["Total_Stops"].fillna(goAir_M.loc[0][2])
spice_jet_FM3=spice_jet_F["Total_Stops"].fillna(spice_jet_M.loc[0][2])
vistara_FM3=vistara_F["Total_Stops"].fillna(vistara_M.loc[0][2])
vistara_premium_economy_FM3=vistara_premium_economy_F["Total_Stops"].fillna(vistara_premium_economy_M.loc[0][2])
multiple_premium_economy_FM3=multiple_premium_economy_F["Total_Stops"].fillna(multiple_premium_economy_M.loc[0][2])
jet_airways_business_FM3=jet_airways_business_F["Total_Stops"].fillna(jet_airways_business_M.loc[0][2])

totalStops_f=pd.concat([indiGo_FM3,air_india_FM3,jet_airways_FM3,air_asia_FM3,multiple_carriers_FM3,goAir_FM3,spice_jet_FM3,vistara_FM3,vistara_premium_economy_FM3,multiple_premium_economy_FM3,jet_airways_business_FM3],ignore_index=True)


#-------------------------------------------------
#--------------------------------------------------
#---------------Filling Additional_Info------------------------
#------------------------------------------
#--------------------------------------------


indiGo_FM4=indiGo_F["Additional_Info"].fillna(indiGo_M.loc[0][3])
air_india_FM4=air_india_F["Additional_Info"].fillna(air_india_M.loc[0][3])
jet_airways_FM4=jet_airways_F["Additional_Info"].fillna(jet_airways_M.loc[0][3])
air_asia_FM4=air_asia_F["Additional_Info"].fillna(air_asia_M.loc[0][3])
multiple_carriers_FM4=multiple_carriers_F["Additional_Info"].fillna(multiple_carriers_M.loc[0][3])
goAir_FM4=goAir_F["Additional_Info"].fillna(goAir_M.loc[0][3])
spice_jet_FM4=spice_jet_F["Additional_Info"].fillna(spice_jet_M.loc[0][3])
vistara_FM4=vistara_F["Additional_Info"].fillna(vistara_M.loc[0][3])
vistara_premium_economy_FM4=vistara_premium_economy_F["Additional_Info"].fillna(vistara_premium_economy_M.loc[0][3])
multiple_premium_economy_FM4=multiple_premium_economy_F["Additional_Info"].fillna(multiple_premium_economy_M.loc[0][3])
jet_airways_business_FM4=jet_airways_business_F["Additional_Info"].fillna(jet_airways_business_M.loc[0][3])

additionalInfo_f=pd.concat([indiGo_FM4,air_india_FM4,jet_airways_FM4,air_asia_FM4,multiple_carriers_FM4,goAir_FM4,spice_jet_FM4,vistara_FM4,vistara_premium_economy_FM4,multiple_premium_economy_FM4,jet_airways_business_FM4],ignore_index=True)


#--------------------------------------------------------------
#---------------------------------------------------------------
#---------------Merging cleaned column to create a dataset---------
#---------------------------------------------------------
#-----------------------------------------------------------


#Here we have too many columns so merge two columns at a time
cleaned_data_1=pd.merge(route_f,duration_f,right_index=True,left_index=True)
cleaned_data_2=pd.merge(totalStops_f,additionalInfo_f,right_index=True,left_index=True)


#Creating the final cleaned categorical column
categorical_cleaned_data=pd.merge(cleaned_data_1,cleaned_data_2,right_index=True,left_index=True)
categorical_cleaned_data.head()

Let's check whether all the missing values have been filled

cmp=categorical_cleaned_data.isnull().sum()/df.shape[0]*100
cmp
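
For reference, the whole per-airline mode fill above can be written much more compactly with groupby and transform; a minimal sketch, assuming the same df and column names (the result should match the step-by-step version up to row order):

#Compact alternative: fill each column's NaNs with the mode of its Airline group
cols=["Route","Duration","Total_Stops","Additional_Info"]
compact_fill=df.copy()
for col in cols:
    compact_fill[col]=compact_fill.groupby("Airline")[col].transform(
        lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s)
compact_fill[cols].isnull().sum()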

Filling missing values of numerical column

Let's print all the numeric columns name.

numerical_column=df.select_dtypes(include=["int64","float64"]).columns
print(numerical_column)

Let's print all the numeric columns

numerical_column=df.select_dtypes(include=["int64","float64"])
numerical_column.head(2)

Let's see the percentage of missing values contained by each numeric column.

nmp=numerical_column.isnull().sum()/df.shape[0]*100
nmp

Getting the column names which contain more than 17% missing values

bournday_col_missing=nmp[nmp>17].keys()
bournday_col_missing

Dropping the columns which contain more than 17% missing values

numerical_col=numerical_column.drop(columns=bournday_col_missing)

Let's see the unique values of the Airline column

These unique values will be used to filter the rows when filling the Price column's missing values

df["Airline"].unique()

Let's fill the missing values

indiGo=df.Airline=="IndiGo"
air_india=df.Airline=="Air India"
jet_airways=df.Airline=="Jet Airways"
air_asia=df.Airline=="Air Asia"
multiple_carriers=df.Airline=="Multiple carriers"
goAir=df.Airline=="GoAir"
spice_jet=df.Airline=="SpiceJet"
vistara=df.Airline=="Vistara"
vistara_premium_economy=df.Airline=="Vistara Premium economy"
multiple_premium_economy=df.Airline=="Multiple carriers Premium economy"
jet_airways_business=df.Airline=="Jet Airways Business"


#----------------------------------------------------------------
#------------------------------------------------------------------
#---------------Performing Filters according to Airline -----------------
#----------------------------------------------------------------
#------------------------------------------------------------------


indiGo_F=df.loc[indiGo,["Price"]]
air_india_F=df.loc[air_india,["Price"]]
jet_airways_F=df.loc[jet_airways,["Price"]]
air_asia_F=df.loc[air_asia,["Price"]]
multiple_carriers_F=df.loc[multiple_carriers,["Price"]]
goAir_F=df.loc[goAir,["Price"]]
spice_jet_F=df.loc[spice_jet,["Price"]]
vistara_F=df.loc[vistara,["Price"]]
vistara_premium_economy_F=df.loc[vistara_premium_economy,["Price"]]
multiple_premium_economy_F=df.loc[multiple_premium_economy,["Price"]]
jet_airways_business_F=df.loc[jet_airways_business,["Price"]]


#-------------------------------------------------
#--------------------------------------------------
#-----------------Getting mean value---------------
#------------------------------------------
#--------------------------------------------


indiGo_M=indiGo_F[["Price"]].mean()
air_india_M=air_india_F[["Price"]].mean()
jet_airways_M=jet_airways_F[["Price"]].mean()
air_asia_M=air_asia_F[["Price"]].mean()
multiple_carriers_M=multiple_carriers_F[["Price"]].mean()
goAir_M=goAir_F[["Price"]].mean()
spice_jet_M=spice_jet_F[["Price"]].mean()
vistara_M=vistara_F[["Price"]].mean()
vistara_premium_economy_M=vistara_premium_economy_F[["Price"]].mean()
multiple_premium_economy_M=multiple_premium_economy_F[["Price"]].mean()
jet_airways_business_M=jet_airways_business_F[["Price"]].mean()


#-------------------------------------------------
#--------------------------------------------------
#---------------Filling Price column------------------------
#------------------------------------------
#--------------------------------------------


indiGo_FM=indiGo_F.fillna(indiGo_M)
air_india_FM=air_india_F.fillna(air_india_M)
jet_airways_FM=jet_airways_F.fillna(jet_airways_M)
air_asia_FM=air_asia_F.fillna(air_asia_M)
multiple_carriers_FM=multiple_carriers_F.fillna(multiple_carriers_M)
goAir_FM=goAir_F.fillna(goAir_M)
spice_jet_FM=spice_jet_F.fillna(spice_jet_M)
vistara_FM=vistara_F.fillna(vistara_M)
vistara_premium_economy_FM=vistara_premium_economy_F.fillna(vistara_premium_economy_M)
multiple_premium_economy_FM=multiple_premium_economy_F.fillna(multiple_premium_economy_M)
jet_airways_business_FM=jet_airways_business_F.fillna(jet_airways_business_M)
price_f=pd.concat([indiGo_FM,air_india_FM,jet_airways_FM,air_asia_FM,multiple_carriers_FM,goAir_FM,spice_jet_FM,vistara_FM,vistara_premium_economy_FM,multiple_premium_economy_FM,jet_airways_business_FM],ignore_index=True)


#----------------------------------------------------------------
#-----------------------------------------------------------------
#-----------------Creating the final cleaned dataframe---------------
#--------------------------------------------------------------
#-----------------------------------------------------------


numerical_clean_data=pd.DataFrame(price_f)
numerical_clean_data.head()

Let's check whether all the missing values have been filled

numerical_clean_data.isnull().sum()
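
As with the categorical columns, the per-airline mean fill can be expressed as a single groupby/transform; a minimal sketch under the same assumptions:

#Compact alternative: fill Price NaNs with the mean price of the row's airline
price_compact=df["Price"].fillna(df.groupby("Airline")["Price"].transform("mean"))
price_compact.isnull().sum()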

Creating the clean dataset

First we take the columns that didn't need any cleaning and store them in a variable. Then we merge the cleaned categorical data and the cleaned numerical data into a second variable. Finally we merge these two variables to create the new dataset.

Let's take the columns of df which don't contain any missing values

get_data_from_df=df[["Airline","Date_of_Journey","Source","Destination","Dep_Time","Arrival_Time"]]

Merging the cleaned categorical columns and the cleaned numerical column

merging_numerical_and_categorical_clean_data=pd.merge(categorical_cleaned_data,numerical_clean_data,right_index=True,left_index=True)

Merging get_data_from_df and merging_numerical_and_categorical_clean_data to get the final dataset

fdf=pd.merge(get_data_from_df,merging_numerical_and_categorical_clean_data,right_index=True,left_index=True)

Checking for nan values in final cleaned data

nmpp=fdf.isnull().sum()/df.shape[0]*100
nmpp

Converting all the data into the correct data types

fdf.info()
#Here all the data are already in the correct data types, so we don't need to change anything.

Creating new features

Splitting the Total_Stops column and creating a new feature

fdf["Total_Stops"]=fdf["Total_Stops"].astype(str)

Y=[]
for i in fdf["Total_Stops"]:
    if len(i)==7:
        Y.append(i[:1])
    if len(i)==6:
        Y.append(i[:1])
    if i=="non-stop":
        Y.append(i.replace("non-stop","0"))
    if i=="nan":
        Y.append(i.replace("nan","0"))

dic={"Stops":Y}
tempDataFrame=pd.DataFrame(dic)
fdf=fdf.join(tempDataFrame)
del fdf["Total_Stops"]
fdf.head(2)
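
The same mapping can also be done declaratively with a regex; a minimal sketch (it would have to run before Total_Stops is deleted above):

#Hypothetical alternative: take the leading digit of "1 stop"/"2 stops",
#and treat "non-stop" and "nan" (no digit) as 0
stops_compact=(fdf["Total_Stops"].astype(str)
               .str.extract(r"(\d+)")[0]
               .fillna("0")
               .astype(int))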

Splitting the Duration column and creating new features

fdf["Duration"]=fdf["Duration"].astype(str)

hour=[]
mnts=[]
for i in fdf["Duration"]:
    cnt=0
    for k in str(i):
        if k=="h" or k=="m":
            break
        else:
            cnt=cnt+1
    if "h" in i and "m" in i:
        #e.g. "2h 50m": hours sit before the "h", minutes between the space and the trailing "m"
        hour.append(i[:cnt])
        mnts.append(i[i.index(" ")+1:-1])
    elif "h" in i:
        #hour-only durations like "19h" have zero minutes
        hour.append(i[:cnt])
        mnts.append("0")
    else:
        #minute-only durations like "50m" have zero hours
        hour.append("0")
        mnts.append(i[:-1])

durationdic={"Duration_Hours":hour,"Duration_minutes":mnts}
durationTempDataFrame=pd.DataFrame(durationdic)
fdf=fdf.join(durationTempDataFrame)
del fdf["Duration"]
fdf.head(2)
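
For comparison, pandas' regex extraction handles all three duration shapes ("2h 50m", "19h", "50m") in one call; a minimal sketch, to be run before Duration is deleted:

#Hypothetical alternative using str.extract with named groups for hours and minutes
dur=fdf["Duration"].astype(str).str.extract(r"(?:(?P<h>\d+)h)?\s*(?:(?P<m>\d+)m)?")
duration_hours_compact=dur["h"].fillna("0").astype(int)
duration_minutes_compact=dur["m"].fillna("0").astype(int)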

Splitting the Dep_Time column and creating new features

fdf["Dep_Time"]=fdf["Dep_Time"].astype(str)

dephour=[]
depmnts=[]
for i in fdf["Dep_Time"]:
    #Dep_Time values look like "22:20", so everything before the ":" is the hour
    cnt=0
    for k in str(i):
        if k==":":
            break
        else:
            cnt=cnt+1
    dephour.append(i[:cnt])
    depmnts.append(i[cnt+1:])

dparturedic={"Departure_Hours":dephour,"Departure_minutes":depmnts}
durationTempDataFrame=pd.DataFrame(dparturedic)
fdf=fdf.join(durationTempDataFrame)
del fdf["Dep_Time"]
fdf.head(2)
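
For reference, the same "HH:MM" split can be written with pandas string methods; a minimal sketch (it would have to run before Dep_Time is deleted above):

#Hypothetical alternative: split the "HH:MM" departure time on the colon
dep_parts=fdf["Dep_Time"].str.split(":",expand=True)
departure_hours_compact=dep_parts[0].astype(int)
departure_minutes_compact=dep_parts[1].astype(int)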

Splitting the Arrival_Time column and creating new features

fdf["Arrival_Time"]=fdf["Arrival_Time"].astype(str)

arrivalHour=[]
arrivalMnts=[]
for i in fdf["Arrival_Time"]:
    cnt=0
    for k in str(i):
        if k==":":
            break
        else:
            cnt=cnt+1
    arrivalHour.append(i[:cnt])
    arrivalMnts.append(i[cnt+1:5])

arrivaldic={"Arival_Hours":arrivalHour,"Arival_minutes":arrivalMnts}
arrivaldicTempDataFrame=pd.DataFrame(arrivaldic)
fdf=fdf.join(arrivaldicTempDataFrame)
del fdf["Arrival_Time"]
fdf.head(2)

Splitting the Date_of_Journey column and creating new features

fdf["Date_of_Journey"]=pd.to_datetime(fdf["Date_of_Journey"],format="%d/%m/%Y")

journeyDay=[]
journeyMonth=[]
journeyYear=[]

for i in fdf["Date_of_Journey"]:
    cnt=0
    for k in str(i):
        if k=="-":
            break
        else:
            cnt=cnt+1
    k=str(i)
    journeyYear.append(k[:cnt])
    journeyMonth.append(k[cnt+1:len(k)-12])
    journeyDay.append(k[cnt+4:len(k)-9])

journeydic={"journey_year":journeyYear,"journey_Month":journeyMonth,"journey_Day":journeyDay}
journeydicTempDataFrame=pd.DataFrame(journeydic)
fdf=fdf.join(journeydicTempDataFrame)
del fdf["Date_of_Journey"]
fdf.head(2)
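
Since Date_of_Journey was already converted to a datetime column, pandas' dt accessors give the same three features without any string slicing; a minimal sketch, to be run before the column is deleted:

#Hypothetical alternative using datetime accessors
journey_year_compact=fdf["Date_of_Journey"].dt.year
journey_month_compact=fdf["Date_of_Journey"].dt.month
journey_day_compact=fdf["Date_of_Journey"].dt.day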

Again changing the data types of the features into the correct types

Because we have created new features and removed old ones, we need to check the data types again and convert them where necessary.

fdf["Stops"]=fdf["Stops"].astype(int)
fdf["Duration_Hours"]=fdf["Duration_Hours"].astype(int)
fdf["Duration_minutes"]=fdf["Duration_minutes"].astype(int)
fdf["Departure_Hours"]=fdf["Departure_Hours"].astype(int)
fdf["Departure_minutes"]=fdf["Departure_minutes"].astype(int)
fdf["Arival_Hours"]=fdf["Arival_Hours"].astype(int)
fdf["Arival_minutes"]=fdf["Arival_minutes"].astype(int)
fdf["journey_year"]=fdf["journey_year"].astype(int)
fdf["journey_Month"]=fdf["journey_Month"].astype(int)
fdf["journey_Day"]=fdf["journey_Day"].astype(int)

Checking whether all the columns have been changed into the correct data types

fdf.info()

Dropping features which don't have much effect on the target variable

fdf.drop(columns=["Route","Additional_Info"],inplace=True)
fdf.head()

Selecting features using a feature selection technique

Feature selection is performed on the independent variables, so let's separate them into their own variable first.

independent_feature=fdf.drop("Price",axis=1)
independent_feature.head(2)

Plotting heatmap of correlation between columns

plt.figure(figsize=(12,10))
cor=independent_feature.corr()
sns.heatmap(cor,annot=True,cmap=plt.cm.CMRmap_r)
plt.show()

A function for getting the highly correlated columns

def correlation(dataset,threshold):
    col_corr=set()
    #set that will hold the names of the correlated columns
    corr_matrix=dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i,j])>threshold:
                #we are interested in the absolute coefficient value
                colname=corr_matrix.columns[i]
                #getting the name of the column
                col_corr.add(colname)
    return col_corr

Getting the column names which are more than 70% correlated and therefore need to be dropped

corr_features=correlation(independent_feature,0.7)
corr_features

Dropping the columns which are more than 70% correlated

independent_feature.drop(corr_features,axis=1,inplace=True)
independent_feature.head(2)

Creating a new dataset from the remaining independent columns and the target variable

#dependent_feature is the target (Price) column; independent_feature no longer contains it
dependent_feature=fdf["Price"]
fdfn=pd.merge(independent_feature,dependent_feature,right_index=True,left_index=True)

Encoding

encoding=OneHotEncoder(sparse=False,drop="first")  #newer scikit-learn versions use sparse_output=False instead of sparse
result_encod=encoding.fit_transform(fdfn[["Airline","Source","Destination"]])
result_encod

Getting the column names after encoding using dummy variables

We need the column names that the encoding will produce. Here we use pandas' get_dummies to obtain them, because both processes (dummy variables and one-hot encoding with the first category dropped) produce the same columns.

pd.get_dummies(fdfn[["Airline","Source","Destination"]],drop_first=True).keys()
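
Alternatively, the fitted encoder can report its own output column names, which avoids typing them by hand; a minimal sketch (this assumes a reasonably recent scikit-learn):

#The fitted OneHotEncoder lists the column names it produced
encoded_names=encoding.get_feature_names_out(["Airline","Source","Destination"])
encoded_names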

Creating a dataframe of the encoded columns

dataFrame_encode_col=pd.DataFrame(result_encod,columns=['Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo', 'Airline_Jet Airways', 'Airline_Jet Airways Business', 'Airline_Multiple carriers', 'Airline_Multiple carriers Premium economy', 'Airline_SpiceJet', 'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy', 'Source_Chennai', 'Source_Delhi', 'Source_Kolkata', 'Source_Mumbai', 'Destination_Cochin', 'Destination_Delhi', 'Destination_Hyderabad', 'Destination_Kolkata', 'Destination_New Delhi'])
fdfn.columns

Getting all the other columns which are not encoded

numeric_col=fdfn[['Price', 'Stops', 'Duration_minutes', 'Departure_Hours', 'Departure_minutes', 'Arival_Hours', 'Arival_minutes', 'journey_year', 'journey_Month', 'journey_Day']]

Creating dataframe of encoded columns and not encoded columns

data_after_enco=pd.merge(dataFrame_encode_col,numeric_col,left_index=True,right_index=True)
data_after_enco.head(2)

Feature Scaling

xx=data_after_enco.drop(columns="Price",axis=1)
sc=StandardScaler()
XX_scaled=sc.fit_transform(xx)

Separating the independent features and the target column

X=XX_scaled
y=data_after_enco["Price"]  #use the cleaned Price column (df["Price"] still contains NaN for the test rows)

Model training

#Splitting the data into training and testing sets (assuming an 80/20 split)
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.2,random_state=42)

model=RandomForestRegressor(n_estimators=100,criterion="squared_error")  #older scikit-learn versions call this criterion "mse"
model.fit(X_train,Y_train)
model.score(X_test,Y_test)

Model accuracy or evaluation

Evaluation on training data

training_data_prediction=model.predict(X_train)
training_data_prediction

Visualization of the actual value and training data predicted value

plt.scatter(Y_train,training_data_prediction)
plt.xlabel("actual value")
plt.ylabel("Predicted value")
plt.title("Actual value vs Predicted value")
plt.show()

R squared error training data

In training_data_prediction we have our model's predictions, and in Y_train we have the actual values of the Price column. Let's use r2_score to measure the error; it shows how accurate our model's results are.

score_using_r2_score=metrics.r2_score(Y_train,training_data_prediction)
score_using_r2_score

Evaluation on the testing data

testing_data_prediction=model.predict(X_test)
testing_data_prediction

Visualization of the actual value and testing data predicted value

plt.scatter(Y_test,testing_data_prediction)
plt.xlabel("actual value")
plt.ylabel("Predicted value")
plt.title("Actual value vs Predicted value")
plt.show()

R squared error testing data

In testing_data_prediction we have our model's predictions, and in Y_test we have the actual values of the Price column. Let's use r2_score to measure the error; it shows how accurate our model's results are.

score_using_r2_score=metrics.r2_score(Y_test,testing_data_prediction)
score_using_r2_score

Mean absolute error for testing data

In testing_data_prediction we have our model's predictions, and in Y_test we have the actual values of the Price column. Let's use mean_absolute_error to measure the error; it shows, on average, how far the predictions are from the actual prices.

score_using_mean_absolute_error=metrics.mean_absolute_error(Y_test,testing_data_prediction)
score_using_mean_absolute_error
