Everything about data preprocessing in machine learning

What is data preprocessing?

Data preprocessing is the process of converting raw data into meaningful data using different techniques.

Why is data preprocessing important?

Real-world data is dirty: it can be incomplete, missing, unnecessary, noisy, duplicated, and so on. A machine learning model trained on this kind of raw data will not perform well. To get good, accurate performance, you must preprocess the data, which means converting the raw data into meaningful data.

Major steps of data preprocessing:

Data Cleaning:
It means:
1. Remove useless pieces of data from the data set.
2. Remove duplicate values, because duplicates are as useless as any other unneeded data.
3. Convert each column to its correct data type.
4. Fill in the missing values, or delete the columns, rows, or cells that contain missing values.
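
The four steps above can be sketched with pandas. This is only a minimal illustration: the `raw` DataFrame, its `Notes` column, and the `"Unknown"` placeholder are assumptions made up for the example, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: a useless column, a duplicate row,
# numbers stored as text, and one missing value.
raw = pd.DataFrame({
    "Id": ["1", "2", "2", "3"],
    "Name": ["A", "B", "B", np.nan],
    "Notes": ["x", "x", "x", "x"],  # same value everywhere, carries no information
})

cleaned = (
    raw.drop(columns=["Notes"])       # 1. remove useless data
       .drop_duplicates()             # 2. remove duplicate rows
       .astype({"Id": "int64"})       # 3. convert to the correct data type
       .fillna({"Name": "Unknown"})   # 4. fill the missing value (placeholder is an assumption)
       .reset_index(drop=True)
)
print(cleaned)
```

Which columns count as useless, and what a sensible fill value is, depends on the data set, so treat the choices here as placeholders.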

Be careful when removing data, because removing important data will cause problems. Delete data that is not relevant to the task, and remove columns or rows that are mostly empty.

In the data cleaning process, missing values can be handled in two ways:
1. Fill the missing values: do this when only a few values are missing.
2. Remove the column or row: do this when a column or row has too many missing values. Rows can also be removed when the missing values are a tiny share of the data. For example, if there are 1M records and 500 of them have missing values, remove those 500 records, because this will have very little effect.
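
One way to choose between filling and removing is to measure how much of each column is missing. The sketch below assumes a small made-up DataFrame and a cut-off of 50%; both are illustrative assumptions, not a rule from this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Name" is mostly present, "Phone" is mostly missing.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["A", "B", np.nan, "D", "E"],
    "Phone": [np.nan, np.nan, np.nan, "123", np.nan],
})

# Fraction of missing values in each column (0.0 to 1.0).
missing_share = df.isna().mean()
print(missing_share)

# Assumed cut-off for this sketch only; pick one that suits your data.
MAX_MISSING = 0.5

# Drop the columns where more than half of the values are missing.
too_empty = missing_share[missing_share > MAX_MISSING].index
df = df.drop(columns=too_empty)

# Only a few rows are affected now, so removing them loses little data.
df = df.dropna()
print(df)
```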

Example:
Data with missing values:

Id	Name
1	A
2	B
3	nan
4	D
5	nan

Data with no missing values, after cleaning:
Here the data set has only a few missing values, so we fill them. Filling missing values is not easy: filling them with wrong values can cause serious problems, so be careful. There are different techniques for filling missing values.

Id	Name
1	A
2	B
3	C
4	D
5	E
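
A minimal pandas sketch of this filling step follows. It assumes the correct names are available from some trusted source, represented here by the hypothetical `known_names` dictionary; when the true value is not known, a clearly marked placeholder is usually safer than a guess.

```python
import numpy as np
import pandas as pd

# The data with missing values from the example.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["A", "B", np.nan, "D", np.nan],
})

# Hypothetical trusted lookup of the correct names, keyed by Id.
known_names = {3: "C", 5: "E"}

# Fill each missing Name with the value looked up from its Id.
filled = df.copy()
filled["Name"] = filled["Name"].fillna(filled["Id"].map(known_names))
print(filled)

# Fallback when the true value is unknown: use an explicit placeholder.
placeholder = df["Name"].fillna("Unknown")
print(placeholder)
```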

Data with missing values:

Id	Name	Department
1	A	S
2	B	C
3	nan	A
4	D	S
5	nan	S
6	nan	A
7	D	C
8	nan	C

Removing rows for data cleaning:
Here several rows of the data set have a missing Name, so those rows are removed.

Id	Name	Department
1	A	S
2	B	C
4	D	S
7	D	C
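
Row removal like this can be sketched with `DataFrame.dropna`. The DataFrame below is modeled on the example, with the Ids assumed to run from 1 to 8.

```python
import numpy as np
import pandas as pd

# Data modeled on the example; Ids 1 to 8 are an assumption.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Name": ["A", "B", np.nan, "D", np.nan, np.nan, "D", np.nan],
    "Department": ["S", "C", "A", "S", "S", "A", "C", "C"],
})

# Keep only the rows that have a Name; rows with a missing Name are removed.
rows_kept = df.dropna(subset=["Name"])
print(rows_kept)
```

Passing `subset` limits the check to the listed columns, so a missing value in some other column would not cause a row to be dropped.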

Removing a column for data cleaning:
Here the Name column has too many missing values, so that column is removed.

Data with missing values:

Id	Name	Department
1	A	S
2	B	C
3	nan	A
4	D	S
5	nan	S
6	nan	A
7	nan	C
8	nan	C

Data after cleaning:

Id	Department
1	S
2	C
3	A
4	S
5	S
6	A
7	C
8	C
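
Column removal can be sketched in the same way. The example below assumes Ids 1 to 8 and an illustrative rule that a column is kept only if at least half of its values are present; that threshold is an assumption for the sketch.

```python
import numpy as np
import pandas as pd

# Data modeled on the example; Ids 1 to 8 are an assumption.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Name": ["A", "B", np.nan, "D", np.nan, np.nan, np.nan, np.nan],
    "Department": ["S", "C", "A", "S", "S", "A", "C", "C"],
})

# Assumed rule: keep a column only if at least half of its values are present.
min_present = len(df) // 2  # 4 of the 8 rows

# Name has only 3 values present, so it is dropped.
cols_kept = df.dropna(axis="columns", thresh=min_present)
print(cols_kept)
```

When the column to remove is already known, `df.drop(columns=["Name"])` gives the same result more directly.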

There are many techniques for handling missing data. Go to the pandas section to learn about them.
