Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Introduction
Setup
Read data
Data preprocessing
Data cleaning
Handle date-time column
Handling outliers
Encoding
Feature_Engineering
Feature selection filter methods
Feature selection wrapper methods
Multicollinearity
Data split
Feature scaling
Supervised Learning
Regression
Classification
Bias and Variance
Overfitting and Underfitting
Regularization
Ensemble learning
Unsupervised Learning
Clustering
Association Rule
Common
Model evaluation
Cross Validation
Parameter tuning
Code Exercise
Car Price Prediction
Flight Fare Prediction
Diabetes Prediction
Spam Mail Prediction
Fake News Prediction
Boston House Price Prediction
Learn Github
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Feature scaling is a method to bring multiple numeric features onto the same scale or range (like -1 to 1 or 0 to 1). In other words, if your features are on different scales, you have to bring them all into one common range before doing machine learning work. It is also called data normalization.
The scale of raw features differs according to their units. Some features may be in meters, some in kilograms or feet, but machine learning algorithms cannot understand units; they only understand numbers. The units of the features do not matter to the model, only the magnitudes do. So if the features are not in a particular range or scale, this has a bad impact on the machine learning model and, as a result, the model will not be able to give good predictions. That's why feature scaling is needed.
For example, suppose two people's heights are given: the first person's height is 150 cm and the other's is 8.2 feet. If I ask who is taller, you can easily say the person who is 8.2 feet, because you understand the units. A machine learning model, however, will say 150 cm, because it doesn't understand units, only numbers, and 150 is greater than 8.2. To solve this problem you have to bring both values onto one scale or range.
Important Note: Some algorithms require feature scaling and some don't. Algorithms like Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Neural Networks that use gradient descent for optimization require feature scaling. In general, algorithms that rely on distances or on optimization techniques require feature scaling. Decision Trees, Random Forests, Gradient Boosting algorithms, Naive Bayes, and rule-based algorithms like FP-Growth and Association Rules don't require feature scaling.
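As a rough illustration of why distance-based algorithms care about scale, here is a minimal sketch (assuming scikit-learn is installed) that trains KNN on the built-in wine dataset with and without a StandardScaler. The dataset choice, the 70/30 split, and k=5 are arbitrary choices made only for this example.

```python
# Sketch: distance-based KNN with and without feature scaling.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # columns have very different ranges (e.g. proline vs hue)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN on raw features: distances are dominated by the largest-scale columns
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Accuracy without scaling:", knn_raw.score(X_test, y_test))

# Same KNN, but with StandardScaler applied first inside a pipeline
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
print("Accuracy with scaling:   ", knn_scaled.score(X_test, y_test))
```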
Types of feature scaling:
1. Min-max scaler
2. Standard scaler
3. Robust scaler
Standardization rescales a feature so that its mean is 0 and its standard deviation is 1. That is, after scaling, the mean of the data will be 0 and the standard deviation will be 1.
Formula:
z = (x - μ) / σ
Here,
x = the value you want to scale, taken from a particular column
μ = the mean of the column from which you took x
σ = the standard deviation of the column from which you took x
First find the mean of the data, then find the standard deviation. Then subtract the mean from x and divide the result by the standard deviation.
Standardization is typically used when the data follow a normal/Gaussian/bell-curve distribution. There is no rule that it can only be used with normally distributed data, but that is when people use it most of the time. Note that standardization does not change the shape of the distribution: if the original distribution is normal or skewed, the standardized distribution will also be normal or skewed.
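Here is a minimal sketch of standardization, applying the formula above by hand with pandas and then with scikit-learn's StandardScaler. The toy values and the column name "height_cm" are placeholders chosen only for illustration.

```python
# Standardization sketch: z = (x - mean) / std, by hand and with StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0, 190.0]})  # toy data

# Manual standardization using z = (x - mu) / sigma
mu = df["height_cm"].mean()
sigma = df["height_cm"].std(ddof=0)  # population std, which is what StandardScaler uses
df["height_manual"] = (df["height_cm"] - mu) / sigma

# Same thing with scikit-learn
df["height_sklearn"] = StandardScaler().fit_transform(df[["height_cm"]]).ravel()

print(df)
print("mean after scaling:", df["height_sklearn"].mean())        # ~0
print("std after scaling: ", df["height_sklearn"].std(ddof=0))   # ~1
```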
Normalization rescales a feature into a fixed range, usually 0 to 1 or -1 to 1. No matter what unit a feature is in, it will be converted into this fixed range. Normalization is also called Min-Max scaling. It is typically used when the data do not follow a normal/bell-curve/Gaussian distribution. There is no rule that it can only be used in that situation; it can be used any time, but that is when people use it most of the time.
Formula:
X_scaled = (X - Xmin) / (Xmax - Xmin)
Here,
X = the value you want to scale, taken from a particular column
Xmin = the minimum value of that column
Xmax = the maximum value of that column
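Here is a minimal sketch of Min-Max scaling, first computing the formula directly and then using scikit-learn's MinMaxScaler. The toy values and the column name "salary" are placeholders chosen only for illustration.

```python
# Min-Max scaling sketch: X_scaled = (X - Xmin) / (Xmax - Xmin).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"salary": [20000.0, 35000.0, 50000.0, 80000.0, 120000.0]})  # toy data

# Manual normalization with the formula
x_min, x_max = df["salary"].min(), df["salary"].max()
df["salary_manual"] = (df["salary"] - x_min) / (x_max - x_min)

# Same thing with scikit-learn; the default feature_range is (0, 1)
df["salary_sklearn"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()

print(df)  # both scaled columns lie between 0 and 1
```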
| Standardization | Normalization |
|---|---|
| The mean will always be 0 and the standard deviation will be 1 | The data are always scaled between 0 and 1 or -1 and 1 |
| There is no thumb rule for when to use Standardization | There is no thumb rule for when to use Normalization |
| Mostly used for clustering analysis and distance-based algorithms, when the data follow a normal/Gaussian/bell-curve distribution | Mostly used for image preprocessing (pixel intensities range from 0 to 255), neural networks that expect inputs in the 0 to 1 range, K-Nearest Neighbors, and when the data don't follow a normal/Gaussian/bell-curve distribution |
The robust scaler is one of the best scaling techniques to use when we have outliers in our dataset. It scales the data according to the interquartile range (IQR = 75th percentile - 25th percentile). Here the IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). In this method, we subtract the median from every data point and then divide the result by the IQR (Inter Quartile Range).
Use the robust scaler when outliers are present in the dataset.
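Here is a minimal sketch of robust scaling on a toy column that contains an obvious outlier, comparing the manual (x - median) / IQR computation with scikit-learn's RobustScaler. The values and the column name "income" are placeholders chosen only for illustration.

```python
# Robust scaling sketch: (x - median) / IQR, which is less affected by outliers.
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({"income": [30000.0, 32000.0, 35000.0, 38000.0, 40000.0, 500000.0]})  # last value is an outlier

# Manual robust scaling
median = df["income"].median()
q1, q3 = df["income"].quantile(0.25), df["income"].quantile(0.75)
df["income_manual"] = (df["income"] - median) / (q3 - q1)

# Same thing with scikit-learn (default quantile_range is 25th to 75th percentile)
df["income_sklearn"] = RobustScaler().fit_transform(df[["income"]]).ravel()

print(df)  # the outlier no longer stretches the scale of the ordinary values
```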