Learn Python
Learn Data Structures & Algorithms
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Introduction
Setup
Read data
Data preprocessing
Data cleaning
Handle date-time column
Handling outliers
Encoding
Feature_Engineering
Feature selection filter methods
Feature selection wrapper methods
Multicollinearity
Data split
Feature scaling
Supervised Learning
Regression
Classification
Bias and Variance
Overfitting and Underfitting
Regularization
Ensemble learning
Unsupervised Learning
Clustering
Association Rule
Common
Model evaluation
Cross Validation
Parameter tuning
Code Exercise
Car Price Prediction
Flight Fare Prediction
Diabetes Prediction
Spam Mail Prediction
Fake News Prediction
Boston House Price Prediction
Learn GitHub
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Cross-validation is a process for selecting the right model or, if you already have a selected model, for estimating its accuracy reliably. You know that you divide the whole dataset into train and test sets by using the train_test_split method. In train_test_split you pass parameters such as random_state and how many records you want as testing and training data. Suppose you have 1000 records and you want to take 70% of the data as training and 30% as testing. The selection of the records happens randomly, so it is possible that a type of data present in the training dataset is not present in the testing dataset. If this happens, the accuracy will go down, and it can happen because the random selection used for splitting the data is controlled by the random_state parameter. If you pass 0 as the random_state, the data will split one way; if you change the value of the random state, you will get different data for training and testing, and this will have an impact on the accuracy. For different random_state values the accuracy will sometimes increase and sometimes go down, so you can't state the actual accuracy of the model. Cross-validation is used to solve this problem.
The plain train-test split is called holdout validation. Here the data is split into train and test sets randomly according to the random_state parameter. The problem with this technique is that if you change the value of the random_state parameter, the accuracy also changes.
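As a rough illustration, here is a minimal sketch of this problem (the synthetic dataset and logistic regression model are assumptions for the demo, not part of the original example): the same model is trained on holdout splits made with different random_state values, and the printed accuracies typically differ from split to split.

```python
# Minimal sketch: holdout accuracy changes when only random_state changes.
# The dataset and model below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

for rs in [0, 1, 42]:
    # 70% training data, 30% testing data; which records land where
    # depends on random_state
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=rs
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"random_state={rs}: accuracy={model.score(X_test, y_test):.3f}")
```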
This technique is called leave-one-out cross-validation (LOOCV). Suppose you have 1000 records. For the first iteration, you take record number 1 as the testing data and the remaining 999 records as the training data, which means that for the first iteration you train on 999 records and test on 1 record. For the second iteration, you take record number 2 as the testing data and use the remaining 999 records as training data. So if you have 1000 records, 1000 iterations happen, and each time you hold out a new single record: record 1 in the first iteration, record 2 in the second, record 3 in the third, and so on. With only 1000 records, 1000 iterations are not a big deal, but if you have millions of records, millions of iterations (and millions of model fits) will use a lot of time and memory, and the work becomes very costly. This is the disadvantage of this technique.
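The following is a minimal sketch of this technique, which scikit-learn implements as LeaveOneOut; the small synthetic dataset and the k-nearest-neighbors model are illustrative assumptions, not from the original example.

```python
# Minimal sketch of leave-one-out cross-validation with scikit-learn.
# The dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# One iteration per record: each fold tests on exactly one sample.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(len(scores))    # 100 iterations for 100 records
print(scores.mean())  # final accuracy = average over all iterations
```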
Suppose you have 1000 records. K-fold cross-validation takes a k value as input. Here k determines how many iterations will happen and how the train and test data are chosen. Suppose you select a k value of 5; it means that 5 iterations will happen. To select the training and testing data you divide the whole dataset by the k value: with 1000 records and a k value of 5, 1000/5 = 200. In the first iteration, the first 200 records are considered the testing data and the remaining 800 are the training data, which means the model trains on 800 records (record numbers 201-1000) and tests on 200 records (record numbers 1-200). After completing the first iteration, the next 200 records (record numbers 201-400) become the testing data and the remaining 800 records (record numbers 1-200 and 401-1000) become the training data. The remaining iterations happen similarly: the test data shifts to the next 200 records each time and the remaining 800 are considered training data. After completing every iteration you get an accuracy score from that iteration, and the final accuracy is the average of the per-iteration accuracies. There is one disadvantage: you take the testing records in a row (in their original order), so the records in a testing fold can be similar to each other, and this similarity can pull the accuracy down.
In the example, a decision tree classifier model is used, where X holds the independent features and Y the dependent (target) feature. K-fold with k = 10 means you run 10 different experiments and get 10 different results.
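Since the example code itself is not reproduced here, below is a minimal sketch of what such an experiment could look like with scikit-learn: a decision tree classifier evaluated with 10-fold cross-validation, producing 10 scores whose average is the final accuracy. The synthetic dataset is an assumption for the demo.

```python
# Sketch of a decision tree evaluated with 10-fold cross-validation.
# The dataset is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

kf = KFold(n_splits=10)  # 10 experiments -> 10 accuracy scores
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print(scores)         # 10 different results, one per fold
print(scores.mean())  # final accuracy = average of the fold accuracies
```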
Stratified cross-validation was created to remove the disadvantage of k-fold cross-validation. If your dataset is imbalanced, this method can give a much more reliable result. Here you also select a k value: the k value equals the number of iterations, and you divide the total number of records by the k value to find the number of testing and training records, so the main working process is the same. The difference is in how the testing data is taken: the number of testing records is the same, but the selection of the types of records is different. Stratified CV makes sure that the testing data is not all of the same type. Suppose you have three different types of data: good, bad, and very bad. In k-fold CV it can happen that all of the records in a fold are of the same type (for example, all good), or that a few records are good and most are bad; if this happens, the accuracy goes down. In stratified CV, the data in a fold can never be all of one type. Stratified CV chooses the given number of testing records while maintaining balance: it takes some good, some bad, and some very bad records. How many of each type it takes depends on how often that type appears in the dataset; for example, it may choose 25% good, 55% bad, and 20% very bad. So the difference between stratified CV and k-fold cross-validation is that in k-fold cross-validation all (or most) of the testing records can be of the same type, while in stratified CV they can't be; it chooses all the types of data in a ratio that depends on how often each type appears in the dataset. Apart from this, the whole process of stratified CV is the same as k-fold cross-validation.
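The sketch below, assuming scikit-learn's StratifiedKFold and an imbalanced synthetic dataset (the 80/20 class split is chosen just for the demo), prints the class counts in each test fold to show that every fold preserves roughly the same class ratio as the full dataset.

```python
# Minimal sketch: stratified k-fold keeps the class ratio in every fold.
# The imbalanced dataset (80/20) is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold keeps roughly the same 80/20 class ratio as the full data.
    counts = np.bincount(y[test_idx])
    print(f"fold {fold}: test class counts = {counts}")
```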
In this cross-validation technique the dataset is randomly shuffled, and after shuffling the data is divided into a specified number of train-test splits; during each split, the data is divided into a training set and a test set at random. The advantage of this technique is that it allows repeated random sampling of the dataset, so we can get a more robust estimate of the model's performance.
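This matches what scikit-learn calls ShuffleSplit. The sketch below (the dataset and model are illustrative assumptions) runs 10 independent random splits, each holding out 30% of the data for testing, and averages the resulting scores.

```python
# Minimal sketch of shuffle-split cross-validation with scikit-learn.
# The dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 10 independent random train/test splits, each with 30% testing data.
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
print(scores.mean())  # average score over the 10 random splits
```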