Learn about multicollinearity in machine learning

What is multicollinearity?

Multicollinearity occurs when two or more independent variables in a regression problem are highly correlated with each other. Because the independent variables are highly correlated, one of them can be predicted from another. The correlation can be positive or negative. This issue arises mainly in regression problems.

Example:
Let's take a linear regression model as an example.
Suppose we have two independent variables X1 and X2 and a dependent variable Y.

So, the equation is:
Y = a0 + a1*X1 + a2*X2

Now, if we increase X1 by one unit while X2 and everything else stay constant, Y changes by a1 units. Likewise, if we increase X2 by one unit while X1 stays constant, Y changes by a2 units. But if multicollinearity is present between X1 and X2, then shifting one of them also shifts the other, so we can no longer see the individual effect of each independent variable on the dependent variable Y.

Multicollinearity may not hurt the predictive accuracy of the model very much, but it makes the estimated coefficients unreliable, so we lose the individual effects of the independent features on the dependent feature while training.
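To see this concretely, here is a minimal sketch (not from the original article; the synthetic variables and noise scales are assumptions) that fits a linear regression on two nearly identical features. Refitting on different random halves of the data shows that the individual coefficients swing widely even though their sum, and hence the predictions, stay stable:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(scale=0.01, size=n)  # X2 is almost a copy of X1 -> multicollinearity
Y = 3 * X1 + 2 * X2 + rng.normal(scale=0.1, size=n)

for trial in range(3):
    idx = rng.choice(n, size=n // 2, replace=False)  # refit on a random half of the data
    model = LinearRegression().fit(np.column_stack([X1[idx], X2[idx]]), Y[idx])
    print("coefficients:", model.coef_)  # unstable individually, but they sum to about 5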

What causes multicollinearity?

1. Dummy variables created by an encoding technique can introduce multicollinearity (the dummy variable trap; see the sketch after this list).
2. Multicollinearity can also occur when new variables are created that depend on other variables.
3. Insufficient data can also create a multicollinearity problem.
4. In feature engineering, if a new feature is derived from other features, it can also create multicollinearity.
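As an illustration of cause 1, the sketch below (the "color" column is a hypothetical example, not from the article) shows the dummy variable trap: if every category gets its own dummy column, the columns always sum to one, which is a perfect linear dependency. Dropping one level with drop_first=True breaks it.

import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

full = pd.get_dummies(colors["color"])                   # one dummy column per category
safe = pd.get_dummies(colors["color"], drop_first=True)  # one category dropped

print(full.sum(axis=1).unique())  # [1]: the dummies always sum to 1 -> perfect multicollinearity
print(safe.columns.tolist())      # ['green', 'red']: the dependency is broken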

How to solve the multicollinearity problem?

1. Increase the sample size.
2. If two variables are creating the multicollinearity problem, drop one of them. If multiple variables are involved, keep one of them and drop the others (see the sketch after this list).
3. Re-code the form of the independent variables, for example by centering them or combining them into one variable.
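For solution 2, one common recipe (a sketch of my own, not code from the article; the 0.9 cut-off is an assumed threshold, not a fixed rule) is to scan the correlation matrix and drop one column from every highly correlated pair:

import numpy as np
import pandas as pd

def drop_correlated(features, threshold=0.9):
    # Absolute correlations; keep only the upper triangle so each pair is inspected once
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

With the iris DataFrame built in the code section below, drop_correlated(df.drop(columns=['target_var'])) would drop the petal width column, whose correlation with petal length is about 0.96.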

Practical code to detect and handle the multicollinearity problem

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Getting the dataset
from sklearn.datasets import load_iris
def sklearn_dataset(sklearn_data):
    #Convert a scikit-learn Bunch into a pandas DataFrame
    df = pd.DataFrame(sklearn_data.data, columns=sklearn_data.feature_names)
    df['target_var'] = pd.Series(sklearn_data.target)
    return df
df = sklearn_dataset(load_iris())
df

#Creating the correlation matrix
correlation = df.corr()
cmap = sns.diverging_palette(230, 20, as_cmap=True)
#Mask the upper triangle so each pair appears only once
mask = np.triu(np.ones_like(correlation, dtype=bool))

sns.heatmap(correlation, cmap=cmap, center=0, linewidths=0.4,
            cbar_kws={"shrink": .7}, annot=True, mask=mask)
plt.title("Multicollinearity Heat Map")
plt.show()

#Calculating the VIF (variance inflation factor) of every column
vif_var = pd.DataFrame()
vif_var["Features_name"] = df.columns
vif_var["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
vif_var.head()

'''
A VIF score above 4 suggests that multicollinearity may be present and further investigation is required. A VIF score above 10 means serious multicollinearity is present. (The 0.25 and 0.1 cut-offs sometimes quoted alongside these are the corresponding thresholds for tolerance, which is 1/VIF.)
'''
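Detecting high VIF scores is only half the job. A common follow-up, sketched below as a continuation of the script above (the helper and the threshold of 10 follow the rule of thumb just quoted, but the function itself is not part of the original code), is to repeatedly drop the feature with the highest VIF until every remaining score is under the threshold:

def drop_high_vif(features, threshold=10.0):
    features = features.copy()
    while features.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(features.values, i)
             for i in range(features.shape[1])],
            index=features.columns,
        )
        if vifs.max() <= threshold:
            break  # every remaining feature is below the threshold
        worst = vifs.idxmax()
        print(f"dropping {worst} (VIF = {vifs[worst]:.1f})")
        features = features.drop(columns=[worst])
    return features

reduced = drop_high_vif(df.drop(columns=["target_var"]))
reduced.columns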
