Learn Python
Learn Data Structures & Algorithms
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Introduction
Setup
Read data
Data preprocessing
Data cleaning
Handle date-time column
Handling outliers
Encoding
Feature_Engineering
Feature selection filter methods
Feature selection wrapper methods
Multicollinearity
Data split
Feature scaling
Supervised Learning
Regression
Classification
Bias and Variance
Overfitting and Underfitting
Regularization
Ensemble learning
Unsupervised Learning
Clustering
Association Rule
Common
Model evaluation
Cross Validation
Parameter tuning
Code Exercise
Car Price Prediction
Flight Fare Prediction
Diabetes Prediction
Spam Mail Prediction
Fake News Prediction
Boston House Price Prediction
Learn GitHub
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Cross-validation is a process for selecting the right model or, if you already have a selected model, for estimating its accuracy reliably. You know that you divide the whole dataset into train and test sets by using the train_test_split method. In train_test_split you pass parameters such as random_state and how many records you want as testing and training data. Suppose you have 1000 records and you want to take 70% of the data as training and 30% as testing. The selection of the records happens randomly, so it is possible that a type of data present in the training dataset is not present in the testing dataset. If this happens, the accuracy will go down, and it can happen because the random selection used for splitting the data is controlled by the random_state parameter. If you pass 0 as the random_state, the data will split one way; if you change the value of the random state, you will get different data for training and testing, and this will have an impact on the accuracy. For different random_state values the accuracy will sometimes increase and sometimes go down, so you can't state the actual accuracy of the model. Cross-validation is used to solve this problem.
The plain train-test split is called holdout validation. Here the data is split into train and test sets randomly according to the random_state parameter. The problem with this technique is that if you change the value of the random_state parameter, the accuracy also changes.
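As a rough illustration, here is a minimal sketch of this problem (the synthetic dataset and logistic regression model are assumptions for the demo, not part of the original example): the same model is trained on holdout splits made with different random_state values, and the printed accuracies typically differ from split to split.

```python
# Minimal sketch: holdout accuracy changes when only random_state changes.
# The dataset and model below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

for rs in [0, 1, 42]:
    # 70% training data, 30% testing data; which records land where
    # depends on random_state
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=rs
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"random_state={rs}: accuracy={model.score(X_test, y_test):.3f}")
```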
This technique is called leave-one-out cross-validation (LOOCV). Suppose you have 1000 records. For the first iteration, you take record number 1 as the testing data and the remaining 999 records as the training data, which means that for the first iteration you train on 999 records and test on 1 record. For the second iteration, you take record number 2 as the testing data and use the remaining 999 records as training data. So if you have 1000 records, 1000 iterations happen, and each time you hold out a new single record: record 1 in the first iteration, record 2 in the second, record 3 in the third, and so on. With only 1000 records, 1000 iterations are not a big deal, but if you have millions of records, millions of iterations (and millions of model fits) will use a lot of time and memory, and the work becomes very costly. This is the disadvantage of this technique.
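The following is a minimal sketch of this technique, which scikit-learn implements as LeaveOneOut; the small synthetic dataset and the k-nearest-neighbors model are illustrative assumptions, not from the original example.

```python
# Minimal sketch of leave-one-out cross-validation with scikit-learn.
# The dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# One iteration per record: each fold tests on exactly one sample.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(len(scores))    # 100 iterations for 100 records
print(scores.mean())  # final accuracy = average over all iterations
```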
Suppose you have 1000 records. K-fold cross-validation takes a k value as input. Here k determines how many iterations will happen and how the train and test data are chosen. Suppose you select a k value of 5; it means that 5 iterations will happen. To select the training and testing data you divide the whole dataset by the k value: with 1000 records and a k value of 5, 1000/5 = 200. In the first iteration, the first 200 records are considered the testing data and the remaining 800 are the training data, which means the model trains on 800 records (record numbers 201-1000) and tests on 200 records (record numbers 1-200). After completing the first iteration, the next 200 records (record numbers 201-400) become the testing data and the remaining 800 records (record numbers 1-200 and 401-1000) become the training data. The remaining iterations happen similarly: the test data shifts to the next 200 records each time and the remaining 800 are considered training data. After completing every iteration you get an accuracy score from that iteration, and the final accuracy is the average of the per-iteration accuracies. There is one disadvantage: you take the testing records in a row (in their original order), so the records in a testing fold can be similar to each other, and this similarity can pull the accuracy down.
In the example, a decision tree classifier model is used, where X holds the independent features and Y the dependent (target) feature. K-fold with k = 10 means you run 10 different experiments and get 10 different results.
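Since the example code itself is not reproduced here, below is a minimal sketch of what such an experiment could look like with scikit-learn: a decision tree classifier evaluated with 10-fold cross-validation, producing 10 scores whose average is the final accuracy. The synthetic dataset is an assumption for the demo.

```python
# Sketch of a decision tree evaluated with 10-fold cross-validation.
# The dataset is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

kf = KFold(n_splits=10)  # 10 experiments -> 10 accuracy scores
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print(scores)         # 10 different results, one per fold
print(scores.mean())  # final accuracy = average of the fold accuracies
```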
Stratified cross-validation was created to remove the disadvantage of k-fold cross-validation. If your dataset is imbalanced, this method can give a much more reliable result. Here you also select a k value: the k value equals the number of iterations, and you divide the total number of records by the k value to find the number of testing and training records, so the main working process is the same. The difference is in how the testing data is taken: the number of testing records is the same, but the selection of the types of records is different. Stratified CV makes sure that the testing data is not all of the same type. Suppose you have three different types of data: good, bad, and very bad. In k-fold CV it can happen that all of the records in a fold are of the same type (for example, all good), or that a few records are good and most are bad; if this happens, the accuracy goes down. In stratified CV, the data in a fold can never be all of one type. Stratified CV chooses the given number of testing records while maintaining balance: it takes some good, some bad, and some very bad records. How many of each type it takes depends on how often that type appears in the dataset; for example, it may choose 25% good, 55% bad, and 20% very bad. So the difference between stratified CV and k-fold cross-validation is that in k-fold cross-validation all (or most) of the testing records can be of the same type, while in stratified CV they can't be; it chooses all the types of data in a ratio that depends on how often each type appears in the dataset. Apart from this, the whole process of stratified CV is the same as k-fold cross-validation.
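The sketch below, assuming scikit-learn's StratifiedKFold and an imbalanced synthetic dataset (the 80/20 class split is chosen just for the demo), prints the class counts in each test fold to show that every fold preserves roughly the same class ratio as the full dataset.

```python
# Minimal sketch: stratified k-fold keeps the class ratio in every fold.
# The imbalanced dataset (80/20) is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold keeps roughly the same 80/20 class ratio as the full data.
    counts = np.bincount(y[test_idx])
    print(f"fold {fold}: test class counts = {counts}")
```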
In this cross-validation technique the dataset is randomly shuffled, and after shuffling the data is divided into a specified number of train-test splits; during each split, the data is divided into a training set and a test set at random. The advantage of this technique is that it allows repeated random sampling of the dataset, so we can get a more robust estimate of the model's performance.
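This matches what scikit-learn calls ShuffleSplit. The sketch below (the dataset and model are illustrative assumptions) runs 10 independent random splits, each holding out 30% of the data for testing, and averages the resulting scores.

```python
# Minimal sketch of shuffle-split cross-validation with scikit-learn.
# The dataset and model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 10 independent random train/test splits, each with 30% testing data.
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
print(scores.mean())  # average score over the 10 random splits
```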