Data preprocessing is the process of converting raw data into meaningful data using different techniques.
Real-world data is dirty: it can be incomplete, noisy, or duplicated, and it can contain missing or unnecessary values. A machine learning model trained on this kind of raw data will not perform well. To get good, accurate performance, you must preprocess the data first, that is, convert the raw data into meaningful data.
Data Cleaning:
Data cleaning means:
1. Remove useless pieces of data from the data set.
2. Remove duplicate values, because they add no new information.
3. Convert each column to its correct data type.
4. Fill in the missing values, or delete the columns, rows, or cells that contain them.
Be careful when removing data, because removing important data will create problems. Delete data that is not relevant to the task, and remove columns or rows that are mostly empty.
In the data cleaning process, missing values can be handled in two ways:
1. Fill the missing values: do this if only a few values are missing.
2. Remove the column or row: do this if too many values are missing.
If the number of missing values is very small compared to the size of the data set, the affected rows can also simply be removed. For example, if there are 1M records and 500 of them have missing values, remove those 500 records, because this will have almost no effect.
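To decide between filling and removing, it helps to measure how much data is actually missing. The sketch below counts missing values with pandas and drops the affected rows only when they are a small share of the data. The sample data and the threshold `ROW_DROP_THRESHOLD` are assumptions for illustration, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data set: 10 rows, one of them has a missing name.
df = pd.DataFrame({
    "id": range(1, 11),
    "name": ["A", "B", np.nan, "D", "E", "F", "G", "H", "I", "J"],
})

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column
print(missing_count)
print(missing_share)

# Assumed rule of thumb: if only a small share of rows is affected,
# dropping those rows barely changes the data set.
ROW_DROP_THRESHOLD = 0.15  # assumed value, tune it for your own data
rows_with_missing = df.isna().any(axis=1).mean()
if rows_with_missing <= ROW_DROP_THRESHOLD:
    df = df.dropna()
```

In the 1M-record example, 500 affected rows are 0.05% of the data, far below any reasonable threshold.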
Data with missing values:
Id | Name |
---|---|
1 | A |
2 | B |
3 | nan |
4 | D |
5 | nan |
Here the data set has only a few missing values, so we fill them. Filling missing values is not easy: if they are filled with wrong values, it can create a big problem, so be careful. There are different techniques for filling missing values.
Data with no missing values, after cleaning:
Id | Name |
---|---|
1 | A |
2 | B |
3 | C |
4 | D |
5 | E |
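Two common filling techniques can be sketched with pandas on the small table above. Note the assumption: neither technique recovers the real names (C and E in the cleaned table); those can only come from the original source. The placeholder `"Unknown"` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# The example table with two missing names.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["A", "B", np.nan, "D", np.nan],
})

# Option 1: replace missing names with an explicit placeholder.
filled_constant = df.fillna({"Name": "Unknown"})

# Option 2: forward fill, which copies the last known value downwards.
filled_ffill = df.ffill()

print(filled_constant)
print(filled_ffill)
```

Forward fill only makes sense when neighbouring rows are related (for example in time series); for a name column like this one, a placeholder is usually the safer choice.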
Data with missing values:
Id | Name | Department |
---|---|---|
1 | A | S |
2 | B | C |
2 | nan | A |
4 | D | S |
5 | nan | S |
6 | nan | A |
7 | D | C |
8 | nan | C |
Removing rows for data cleaning:
Several rows in the data set above contain a missing value, so those rows are removed:
Id | Name | Department |
---|---|---|
1 | A | S |
2 | B | C |
4 | D | S |
7 | D | C |
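Removing the rows can be sketched with pandas `dropna`. The data frame below reproduces the example table as given; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# The example table: four rows have a missing Name.
df = pd.DataFrame({
    "Id": [1, 2, 2, 4, 5, 6, 7, 8],
    "Name": ["A", "B", np.nan, "D", np.nan, np.nan, "D", np.nan],
    "Department": ["S", "C", "A", "S", "S", "A", "C", "C"],
})

# Keep only the rows where Name is present.
rows_kept = df.dropna(subset=["Name"])
print(rows_kept)
```

Half of the rows are lost here, so in practice this is only reasonable when the remaining data is still large enough for the task.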
Removing columns for data cleaning:
Here one column of the data set has too many missing values, so that column is removed.
Id | Name | Department |
---|---|---|
1 | A | S |
2 | B | C |
2 | nan | A |
4 | D | S |
5 | nan | S |
6 | nan | A |
7 | nan | C |
8 | nan | C |
Data after removing the column:
Id | Department |
---|---|
1 | S |
2 | C |
2 | A |
4 | S |
5 | S |
6 | A |
7 | C |
8 | C |
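Removing a mostly empty column can be sketched in the same way. The cut-off `MAX_MISSING_SHARE` is an assumed value for illustration; what counts as "too many" missing values depends on the data set.

```python
import numpy as np
import pandas as pd

# The example table: 5 of the 8 names are missing.
df = pd.DataFrame({
    "Id": [1, 2, 2, 4, 5, 6, 7, 8],
    "Name": ["A", "B", np.nan, "D", np.nan, np.nan, np.nan, np.nan],
    "Department": ["S", "C", "A", "S", "S", "A", "C", "C"],
})

MAX_MISSING_SHARE = 0.5  # assumed cut-off, not a fixed rule

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Drop every column whose missing share is above the cut-off.
cols_to_drop = missing_share[missing_share > MAX_MISSING_SHARE].index
columns_kept = df.drop(columns=cols_to_drop)
print(columns_kept)
```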
There are many techniques for handling missing data. Go to the pandas section to learn about them.