Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Pandas Introduction
Pandas Series
Pandas DataFrame
Pandas Read Files
Pandas Some functions and properties
Pandas Math Function
Pandas Selection
Pandas Change Type
Pandas Concatenate & Split
Pandas Sorting
Pandas Filter
Pandas Data Cleaning
Pandas Group by
Pandas Time Series
Pandas Analysis 1
Pandas Analysis 2
Pandas Analysis 3
Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn Github
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Missing completely at random:
A variable is missing completely at random when there is no relationship between the missing data and any
other values in the dataset. In other words, the missing data points are a random subset of the data.
Missing not at random:
A variable is missing not at random when there is some relationship between the missing data and other
values in the dataset. In other words, the missing data points are not a random subset of the data.
Drop the row:
Suppose a row contains missing values. In this method, we simply drop (delete) that row.
We do this when the row is not that important.
If the row is important, we never use this method.
So we use this method only on low-priority, less important rows.
Drop the column:
If a column contains too many missing values, we drop that column.
Use global constants:
This means we fill every missing cell with a single value such as "null", "missing", "ignore", etc.
Because all missing values are filled with the same value, that value is called a global constant.
We rarely use this technique, because it can lead to bad accuracy.
Fill missing values using the mean, median, or mode:
Suppose a column named price has missing values.
We take the mean, median, or mode of that column and fill every missing cell in the column with that value.
For a categorical column we can only use the mode.
For a numerical column we can use the mean, median, or mode.
Now, when should we use the mean and when the median?
When the data is normally distributed we use the mean,
and when the data is skewed we use the median.
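A minimal pandas sketch of this idea, using a small hypothetical DataFrame with a numeric price column and a categorical city column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values.
df = pd.DataFrame({
    "price": [100.0, 250.0, np.nan, 180.0, np.nan],    # numeric column
    "city":  ["dhaka", "dhaka", None, "khulna", None], # categorical column
})

# Numeric column: mean when the data is roughly normal, median when it is skewed.
df["price"] = df["price"].fillna(df["price"].mean())
# df["price"] = df["price"].fillna(df["price"].median())  # the skewed-data alternative

# Categorical column: the mode (most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```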
Use of algorithms:
In this method, we use machine learning algorithms to predict the values of the missing cells. This is an
advanced technique.
Note:
It doesn't matter which technique we use; the main thing is that we either fill the missing values or delete
them.
In practice we pick a threshold: if a row or column contains more missing values than the threshold, we drop
that row or column, and if it contains fewer, we try to fill the missing values.
When do we delete missing values?
Here we drop the row or column that contains the missing values.
Suppose a column contains 80% missing values. In this case we delete that column.
Suppose there are 1 million rows and only 100 values are missing. In this case we usually delete the rows
with missing values.
Now suppose a column contains 10% missing values.
In this case, if the data is important, we fill the missing values.
Now suppose a column contains 5% missing values.
In this case we look at the amount of data and its importance. If the data is very important we try to fill
the missing values, and if it is not that important we delete the affected cells.
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | NaN | 10 | 1 | sat | no |
3 | 159.0 | female | NaN | 12 | 4 | sat | NaN |
4 | 250.0 | male | lunch | 3 | 3 | NaN | no |
5 | 222.0 | female | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | female | NaN | 5 | 1 | mon | no |
8 | 150.0 | male | dinner | NaN | 4 | mon | NaN |
9 | NaN | female | lunch | 38 | 4 | mon | no |
10 | 478.0 | female | lunch | 28 | 3 | tue | yes |
11 | 320.0 | nan | NaN | 14 | 3 | tue | yes |
12 | NaN | male | lunch | 32 | 2 | tue | NaN |
13 | 520.0 | male | lunch | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | NaN | 3 | tue | yes |
15 | NaN | female | NaN | 40 | 4 | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | NaN |
17 | 99.0 | Nan | NaN | 40 | 1 | wed | yes |
18 | 199.0 | female | lunch | 35 | 3 | wed | no |
19 | 288.0 | female | dinner | 21 | 2 | wed | NaN |
20 | 120.0 | NaN | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | thu | yes |
22 | 120.0 | female | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | NaN | NaN | 2 | NaN | yes |
By using the replace function you can replace one value with another. The first parameter is the value you want to replace, and the second parameter is the value you want to replace it with.
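A small sketch of how replace can be used, on a toy DataFrame with the same kind of columns as the tables below (the replacement strings are just placeholders):

```python
import pandas as pd

# Toy frame with the same kind of columns as the table below.
df = pd.DataFrame({
    "sex":  ["male", "female", "female"],
    "time": ["lunch", "dinner", "lunch"],
    "day":  ["sat", "sat", "mon"],
})

# Replace a single value wherever it appears in the DataFrame.
df = df.replace("female", "New female")

# Replace several values at once by passing a mapping.
df = df.replace({"lunch": "Lunch new", "sat": "New sat"})

print(df)
```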
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | Lunch | 30 | 2 | New sat | no |
1 | 648.0 | male | dinner | 35 | 3 | New sat | yes |
2 | 75.0 | New female | NaN | 10 | 1 | New sat | no |
3 | 159.0 | New female | NaN | 12 | 4 | New sat | NaN |
4 | 250.0 | NaN | Lunch | 40 | 3 | NaN | no |
5 | NaN | male | Lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | New female | NaN | 5 | 1 | mon | no |
8 | 150.0 | New female | dinner | 15 | 4 | mon | NaN |
9 | NaN | NaN | Lunch | 38 | 4 | mon | no |
10 | 478.0 | NaN | Lunch | 28 | 3 | NaN | yes |
11 | 320.0 | New female | NaN | 14 | 3 | tue | yes |
12 | NaN | male | Lunch | 32 | 2 | tue | NaN |
13 | 520.0 | male | Lunch | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | 40 | 3 | NaN | yes |
15 | NaN | NaN | NaN | 40 | 4 | wed | yes |
16 | 100.0 | New female | dinner | 6 | 1 | wed | NaN |
17 | NaN | New female | NaN | 40 | 1 | wed | yes |
18 | 199.0 | NaN | Lunch | 35 | 3 | NaN | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | NaN |
20 | 120.0 | New female | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | Lunch | 15 | 4 | NaN | yes |
22 | NaN | New female | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | NaN | 35 | 2 | NaN | yes |
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | Lunch new | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | NaN | 10 | 1 | sat | no |
3 | 159.0 | female | NaN | 12 | 4 | sat | NaN |
4 | 250.0 | NaN | Lunch new | 40 | 3 | NaN | no |
5 | NaN | male | Lunch new | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | female | NaN | 5 | 1 | mon | no |
8 | 150.0 | female | dinner | 15 | 4 | mon | NaN |
9 | NaN | NaN | Lunch new | 38 | 4 | mon | no |
10 | 478.0 | NaN | Lunch new | 28 | 3 | NaN | yes |
11 | 320.0 | female | NaN | 14 | 3 | tue | yes |
12 | NaN | male | Lunch new | 32 | 2 | tue | NaN |
13 | 520.0 | male | Lunch new | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | 40 | 3 | NaN | yes |
15 | NaN | NaN | NaN | 40 | 4 | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | NaN |
17 | NaN | female | NaN | 40 | 1 | wed | yes |
18 | 199.0 | NaN | Lunch new | 35 | 3 | NaN | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | NaN |
20 | 120.0 | female | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | Lunch new | 15 | 4 | NaN | yes |
22 | NaN | female | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | NaN | 35 | 2 | NaN | yes |
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | Value change | no |
1 | 648.0 | male | dinner | 35 | 3 | Value change | yes |
2 | 75.0 | Value change | NaN | 10 | 1 | Value change | no |
3 | 159.0 | Value change | NaN | 12 | 4 | Value change | NaN |
4 | 250.0 | NaN | lunch | 40 | 3 | NaN | no |
5 | NaN | male | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | Value change | NaN | 5 | 1 | mon | no |
8 | 150.0 | Value change | dinner | 15 | 4 | mon | NaN |
9 | NaN | NaN | lunch | 38 | 4 | mon | no |
10 | 478.0 | NaN | lunch | 28 | 3 | NaN | yes |
11 | 320.0 | Value change | NaN | 14 | 3 | tue | yes |
12 | NaN | male | lunch | 32 | 2 | tue | NaN |
13 | 520.0 | male | lunch | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | 40 | 3 | NaN | yes |
15 | NaN | NaN | NaN | 40 | 4 | wed | yes |
16 | 100.0 | Value change | dinner | 6 | 1 | wed | NaN |
17 | NaN | Value change | NaN | 40 | 1 | wed | yes |
18 | 199.0 | NaN | lunch | 35 | 3 | NaN | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | NaN |
20 | 120.0 | Value change | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | NaN | yes |
22 | NaN | Value change | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | NaN | 35 | 2 | NaN | yes |
The dropna function is used to delete rows or columns that contain NaN values. The axis parameter plays a very important role here. By default axis=0, which means the function works row-wise and drops any row that contains a NaN; with axis=1 it works column-wise and drops any column that contains a NaN. Called without any parameters, it drops every row (or, with axis=1, every column) that has at least one NaN value. If you pass how="all", dropna drops only those rows or columns in which every value is NaN. With the thresh parameter you can set the minimum number of non-NaN values a row or column must have in order to be kept.
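A short sketch of these dropna variants on a toy DataFrame (the data here is made up, just to show the parameters):

```python
import numpy as np
import pandas as pd

# Toy frame with a few NaN values.
df = pd.DataFrame({
    "total_bill": [500.0, np.nan, 75.0, np.nan],
    "time":       ["lunch", "dinner", None, None],
    "tip":        [30, 35, 10, 12],
})

print(df.dropna())           # drop every row with at least one NaN (axis=0 is the default)
print(df.dropna(axis=1))     # drop every column with at least one NaN
print(df.dropna(how="all"))  # drop only rows in which every value is NaN
print(df.dropna(thresh=2))   # keep rows that have at least 2 non-NaN values
```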
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
20 | 120.0 | female | dinner | 18 | 1 | thu | yes |
| | tip | size |
---|---|---|
0 | 30 | 2 |
1 | 35 | 3 |
2 | 10 | 1 |
3 | 12 | 4 |
4 | 40 | 3 |
5 | 25 | 3 |
6 | 30 | 4 |
7 | 5 | 1 |
8 | 15 | 4 |
9 | 38 | 4 |
10 | 28 | 3 |
11 | 14 | 3 |
12 | 32 | 2 |
13 | 33 | 3 |
14 | 40 | 3 |
15 | 40 | 4 |
16 | 6 | 1 |
17 | 40 | 1 |
18 | 35 | 3 |
19 | 21 | 2 |
20 | 18 | 1 |
21 | 15 | 4 |
22 | 10 | 3 |
23 | 35 | 2 |
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | NaN | 10 | 1 | sat | no |
3 | 159.0 | female | NaN | 12 | 4 | sat | NaN |
4 | 250.0 | NaN | lunch | 40 | 3 | NaN | no |
5 | NaN | male | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | female | NaN | 5 | 1 | mon | no |
8 | 150.0 | female | dinner | 15 | 4 | mon | NaN |
9 | NaN | NaN | lunch | 38 | 4 | mon | no |
10 | 478.0 | NaN | lunch | 28 | 3 | NaN | yes |
11 | 320.0 | female | NaN | 14 | 3 | tue | yes |
12 | NaN | male | lunch | 32 | 2 | tue | NaN |
13 | 520.0 | male | lunch | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | 40 | 3 | NaN | yes |
15 | NaN | NaN | NaN | 40 | nan | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | NaN |
17 | NaN | female | NaN | 40 | 1 | wed | yes |
18 | 199.0 | NaN | lunch | 35 | 3 | NaN | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | NaN |
20 | 120.0 | female | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | NaN | yes |
22 | NaN | female | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | NaN | 35 | 2 | NaN | yes |
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | NaN | 10 | 1 | sat | no |
3 | 159.0 | female | NaN | 12 | 4 | sat | NaN |
4 | 250.0 | NaN | lunch | 40 | 3 | NaN | no |
5 | NaN | male | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | female | NaN | 5 | 1 | mon | no |
8 | 150.0 | female | dinner | 15 | 4 | mon | NaN |
9 | NaN | NaN | lunch | 38 | 4 | mon | no |
10 | 478.0 | NaN | lunch | 28 | 3 | NaN | yes |
11 | 320.0 | female | NaN | 14 | 3 | tue | yes |
12 | NaN | male | lunch | 32 | 2 | tue | NaN |
13 | 520.0 | male | lunch | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | 40 | 3 | NaN | yes |
15 | NaN | NaN | NaN | 40 | nan | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | NaN |
17 | NaN | female | NaN | 40 | 1 | wed | yes |
18 | 199.0 | NaN | lunch | 35 | 3 | NaN | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | NaN |
20 | 120.0 | female | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | NaN | yes |
22 | NaN | female | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | NaN | 35 | 2 | NaN | yes |
fillna is used to fill missing values with a specific value. If you pass method="ffill" to fillna, every missing value is filled with the previous (forward) value in the column. If you pass method="bfill", every missing value is filled with the next (backward) value.
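A quick sketch of fillna on a toy DataFrame; note that newer pandas versions also offer df.ffill() and df.bfill() as shortcuts for the method argument:

```python
import numpy as np
import pandas as pd

# Toy frame with a few NaN values.
df = pd.DataFrame({
    "total_bill": [500.0, np.nan, 75.0, np.nan],
    "time":       ["lunch", None, "dinner", None],
})

print(df.fillna("FFFFF"))          # fill every missing cell with one constant value
print(df.fillna(method="ffill"))   # forward fill: copy the previous valid value downward
print(df.fillna(method="bfill"))   # backward fill: copy the next valid value upward
```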
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | FFFFF | 10 | 1 | sat | no |
3 | 159.0 | female | FFFFF | 12 | 4 | sat | FFFFF |
4 | 250.0 | FFFFF | lunch | 40 | 3 | FFFFF | no |
5 | FFFFF | male | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | FFFFF |
7 | 99.0 | female | FFFFF | 5 | 1 | mon | no |
8 | 150.0 | female | dinner | 15 | 4 | mon | FFFFF |
9 | FFFFF | FFFFF | lunch | 38 | 4 | mon | no |
10 | 478.0 | FFFFF | lunch | 28 | 3 | FFFFF | yes |
11 | 320.0 | female | FFFFF | 14 | 3 | tue | yes |
12 | FFFFF | male | lunch | 32 | 2 | tue | FFFFF |
13 | 520.0 | male | lunch | 33 | 3 | FFFFF | yes |
14 | 367.0 | male | dinner | 40 | 3 | FFFFF | yes |
15 | FFFFF | FFFFF | FFFFF | 40 | 4 | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | FFFFF |
17 | FFFFF | female | FFFFF | 40 | 1 | wed | yes |
18 | 199.0 | FFFFF | lunch | 35 | 3 | FFFFF | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | FFFFF |
20 | 120.0 | female | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | FFFFF | yes |
22 | FFFFF | female | dinner | 10 | 3 | thu | FFFFF |
23 | 284.0 | male | FFFFF | 35 | 2 | FFFFF | yes |
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | dinner | 10 | 1 | sat | no |
3 | 159.0 | female | dinner | 12 | 4 | sat | no |
4 | 250.0 | female | lunch | 40 | 3 | sat | no |
5 | 250.0 | male | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | yes |
7 | 99.0 | female | dinner | 5 | 1 | mon | no |
8 | 150.0 | female | dinner | 15 | 4 | mon | no |
9 | 150.0 | female | lunch | 38 | 4 | mon | no |
10 | 478.0 | female | lunch | 28 | 3 | mon | yes |
11 | 320.0 | female | lunch | 14 | 3 | tue | yes |
12 | 320.0 | male | lunch | 32 | 2 | tue | yes |
13 | 520.0 | male | lunch | 33 | 3 | tue | yes |
14 | 367.0 | male | dinner | 40 | 3 | tue | yes |
15 | 367.0 | male | dinner | 40 | 4 | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | yes |
17 | 100.0 | female | dinner | 40 | 1 | wed | yes |
18 | 199.0 | female | lunch | 35 | 3 | wed | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | no |
20 | 120.0 | female | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | thu | yes |
22 | 100.0 | female | dinner | 10 | 3 | thu | yes |
23 | 284.0 | male | dinner | 35 | 2 | thu | yes |
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | lunch | 10 | 1 | sat | no |
3 | 159.0 | female | lunch | 12 | 4 | sat | no |
4 | 250.0 | male | lunch | 40 | 3 | mon | no |
5 | 356.0 | male | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | no |
7 | 99.0 | female | dinner | 5 | 1 | mon | no |
8 | 150.0 | female | dinner | 15 | 4 | mon | no |
9 | 478.0 | female | lunch | 38 | 4 | mon | no |
10 | 478.0 | female | lunch | 28 | 3 | tue | yes |
11 | 320.0 | female | lunch | 14 | 3 | tue | yes |
12 | 520 | male | lunch | 32 | 2 | tue | yes |
13 | 520.0 | male | lunch | 33 | 3 | wed | yes |
14 | 367.0 | male | dinner | 40 | 3 | wed | yes |
15 | 100 | female | dinner | 40 | 4 | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | yes |
17 | 199 | female | lunch | 40 | 1 | wed | yes |
18 | 199.0 | male | lunch | 35 | 3 | wed | no |
19 | 288.0 | male | dinner | 21 | 2 | wed | yes |
20 | 120.0 | female | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | thu | yes |
22 | 284 | female | dinner | 10 | 3 | thu | yes |
23 | 284.0 | male | NaN | 35 | 2 | NaN | yes |
This function is used to fill NaN values.
Here you create random sample values and fill the NaN cells with them. The random samples are drawn from the other (non-missing) values present in the selected column.
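A minimal sketch of random sample imputation for one column, assuming the data is in a DataFrame called df (a tiny made-up column is used here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Made-up column with a couple of missing cells.
df = pd.DataFrame({"total_bill": [500.0, np.nan, 75.0, 159.0, np.nan, 250.0]})

# Draw as many random samples from the observed values as there are missing cells.
missing_idx = df.index[df["total_bill"].isnull()]
samples = df["total_bill"].dropna().sample(len(missing_idx), replace=True, random_state=0)

# Line the sampled values up with the positions of the missing cells, then fill them.
samples.index = missing_idx
df.loc[missing_idx, "total_bill"] = samples

print(df)
```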
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | NaN | 10 | 1 | sat | no |
3 | 159.0 | female | NaN | 12 | 4 | sat | NaN |
4 | 250.0 | male | lunch | 3 | 3 | NaN | no |
5 | 222.0 | female | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | female | NaN | 5 | 1 | mon | no |
8 | 150.0 | male | dinner | 15 | 4 | mon | NaN |
9 | NaN | female | lunch | 38 | 4 | mon | no |
10 | 478.0 | female | lunch | 28 | 333 | tue | yes |
11 | 320.0 | nan | NaN | 14 | 3 | tue | yes |
12 | NaN | male | lunch | 32 | 2 | tue | NaN |
13 | 520.0 | male | lunch | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | 30 | 3 | tue | yes |
15 | NaN | female | NaN | 40 | 4 | wed | yes |
16 | 100.0 | female | dinner | 356 | 1 | wed | NaN |
17 | 99.0 | Nan | NaN | 40 | 1 | wed | yes |
18 | 199.0 | female | lunch | 35 | 203 | wed | no |
19 | 288.0 | female | dinner | 21 | 102 | wed | NaN |
20 | 120.0 | NaN | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | thu | yes |
22 | 120.0 | female | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | NaN | 26 | 2 | NaN | yes |
| | total_bill | sex | time | tip | size | day | smoker |
---|---|---|---|---|---|---|---|
0 | 500.0 | male | lunch | 30 | 2 | sat | no |
1 | 648.0 | male | dinner | 35 | 3 | sat | yes |
2 | 75.0 | female | dinner | 10 | 1 | sat | no |
3 | 159.0 | female | lunch | 12 | 4 | sat | NaN |
4 | 250.0 | male | lunch | 3 | 3 | NaN | no |
5 | 222.0 | female | lunch | 25 | 3 | mon | yes |
6 | 356.0 | male | dinner | 30 | 4 | mon | NaN |
7 | 99.0 | female | dinner | 5 | 1 | mon | no |
8 | 150.0 | male | dinner | NaN | 4 | mon | NaN |
9 | NaN | female | lunch | 38 | 4 | mon | no |
10 | 478.0 | female | lunch | 28 | 3 | tue | yes |
11 | 320.0 | nan | dinner | 14 | 3 | tue | yes |
12 | NaN | male | lunch | 32 | 2 | tue | NaN |
13 | 520.0 | male | lunch | 33 | 3 | NaN | yes |
14 | 367.0 | male | dinner | NaN | 3 | tue | yes |
15 | NaN | female | dinner | 40 | 4 | wed | yes |
16 | 100.0 | female | dinner | 6 | 1 | wed | NaN |
17 | 99.0 | Nan | lunch | 40 | 1 | wed | yes |
18 | 199.0 | male | lunch | 35 | 3 | wed | no |
19 | 288.0 | female | dinner | 21 | 2 | wed | NaN |
20 | 120.0 | NaN | dinner | 18 | 1 | thu | yes |
21 | 168.0 | male | lunch | 15 | 4 | thu | yes |
22 | 120.0 | female | dinner | 10 | 3 | thu | NaN |
23 | 284.0 | male | dinner | NaN | 2 | NaN | yes |
Checking how many missing values are present in each column.
Checking how many missing values are present in the whole dataset.
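Both checks can be done with isnull and sum; a tiny made-up stand-in DataFrame is built here so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the tips-style dataset shown above.
df = pd.DataFrame({
    "total_bill": [500.0, np.nan, 75.0],
    "sex":        ["male", "male", None],
    "tip":        [30, 35, np.nan],
})

print(df.isnull().sum())        # missing values per column
print(df.isnull().sum().sum())  # total missing values in the whole dataset
```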
Filling missing values in the categorical columns
First, print the names of all the categorical columns.
Let's print the categorical columns.
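One way to do both steps with select_dtypes, sketched on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in with mixed column types.
df = pd.DataFrame({
    "total_bill": [500.0, np.nan, 75.0],
    "sex":        ["male", "male", None],
    "time":       ["lunch", "dinner", None],
    "tip":        [30, 35, 10],
})

# Names of the categorical (object-dtype) columns.
cat_cols = df.select_dtypes(include="object").columns.tolist()
print(cat_cols)          # ['sex', 'time'] for this stand-in

# The categorical part of the dataset.
df_cat = df[cat_cols]
print(df_cat.head())
```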
| | sex | time | day | smoker |
---|---|---|---|---|
0 | male | lunch | sat | no |
1 | male | dinner | sat | yes |
2 | female | dinner | sat | no |
3 | female | lunch | sat | NaN |
4 | NaN | lunch | NaN | no |
Let's see what percentage of missing values each categorical column contains.
If a column contains a lot of missing values, remove it: no matter how you fill such a column, it will hurt accuracy. So pick a boundary for the percentage of missing values a column may contain, and remove every column that exceeds it. You have already printed the missing-value percentage of each column, so now print the names of the columns that contain more than 17% missing values.
Now drop the columns that contain more missing values than the boundary (17% here). You already have their names.
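A sketch of the percentage check and the boundary-based drop, on a small hypothetical categorical slice (the 17% boundary is the one chosen above):

```python
import pandas as pd

# Hypothetical categorical slice of the dataset (the df_cat from the previous step).
df_cat = pd.DataFrame({
    "sex":    ["male", "female", None, "female", "male", "female"],
    "time":   ["lunch", None, None, "dinner", "lunch", None],
    "day":    ["sat", "sat", "mon", None, "tue", "tue"],
    "smoker": [None, "yes", None, None, "no", None],
})

# Percentage of missing values in each categorical column.
missing_pct = df_cat.isnull().mean() * 100
print(missing_pct)

# Columns whose missing percentage crosses the chosen boundary (17% here).
boundary = 17
too_sparse = missing_pct[missing_pct > boundary].index
print(list(too_sparse))

# Drop them; in this toy example only 'sex' and 'day' survive.
df_cat = df_cat.drop(columns=too_sparse)
print(df_cat.head())
```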
| | sex | day |
---|---|---|
0 | male | sat |
1 | male | sat |
2 | female | sat |
3 | female | sat |
4 | NaN | NaN |
Printing the unique values of the time column, because you will use them to filter the data when filling the categorical columns.
How will we fill the missing values?
For example, suppose lunch appears 5 times in the time column and dinner appears 4 times. For those 5 lunch
rows there should be 5 values in the sex column, but suppose only three are present and the other two are
missing. We take the mode of those 3 sex values whose time value is lunch, and fill the other 2 missing sex
cells (the ones whose time value is lunch) with that mode value.
For dinner we do the same.
So for lunch we find the mode of the sex and day columns among the lunch rows and fill their missing values,
and for dinner we find the mode of the sex and day columns among the dinner rows and fill their missing values.
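A sketch of this group-wise mode fill using groupby and transform; the tiny DataFrame and the helper function fill_with_group_mode are made up for illustration:

```python
import pandas as pd

# Made-up frame: 'time' is the grouping column, 'sex' and 'day' still have gaps.
df = pd.DataFrame({
    "time": ["lunch", "lunch", "lunch", "dinner", "dinner", "dinner"],
    "sex":  ["male", None, "male", "female", "female", None],
    "day":  ["sat", "sat", None, "mon", None, "mon"],
})

def fill_with_group_mode(s):
    """Fill the missing cells of one group with that group's most frequent value."""
    mode = s.mode()
    return s.fillna(mode.iloc[0]) if not mode.empty else s

for col in ["sex", "day"]:
    df[col] = df.groupby("time")[col].transform(fill_with_group_mode)

print(df)
```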
| | sex | day |
---|---|---|
0 | male | sat |
1 | male | sat |
2 | female | sat |
3 | female | sat |
4 | male | mon |
5 | female | mon |
6 | male | mon |
7 | female | mon |
8 | male | mon |
9 | female | mon |
10 | female | tue |
11 | male | tue |
12 | male | tue |
13 | male | mon |
14 | male | tue |
15 | female | wed |
16 | female | wed |
17 | male | wed |
18 | female | wed |
19 | female | wed |
20 | male | thu |
21 | male | thu |
22 | female | thu |
23 | male | mon |
Filling missing values in the numeric columns
Let's print the names of all the numeric columns.
Let's print the numeric columns.
| | total_bill | tip | size |
---|---|---|---|
0 | 500.0 | 30 | 2 |
1 | 648.0 | 35 | 3 |
2 | 75.0 | 10 | 1 |
3 | 159.0 | 12 | 4 |
4 | 250.0 | NaN | 3 |
Let's see what percentage of missing values each numeric column contains.
This is the same thing we did for the categorical columns in step 6.
Drop the columns that contain more missing values than the boundary. You already have their names.
| | total_bill | size |
---|---|---|
0 | 500.0 | 2 |
1 | 648.0 | 3 |
2 | 75.0 | 1 |
3 | 159.0 | 4 |
4 | 250.0 | 3 |
Printing the unique values of the time column, because you will use them to filter the data when filling the numerical columns.
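The exact fill behind the table below is not shown, but following the idea above of filtering by the time column, a group-wise mean fill on made-up data might look roughly like this sketch (a plain column mean or median would also work):

```python
import numpy as np
import pandas as pd

# Made-up frame: numeric columns filled group-wise using the 'time' column.
df = pd.DataFrame({
    "time":       ["lunch", "lunch", "lunch", "dinner", "dinner", "dinner"],
    "total_bill": [500.0, np.nan, 250.0, 648.0, np.nan, 356.0],
    "size":       [2, 3, np.nan, 3, 4, 4],
})

for col in ["total_bill", "size"]:
    # Each missing cell gets the mean of its own time group;
    # the median would be used instead if the column were skewed.
    df[col] = df.groupby("time")[col].transform(lambda s: s.fillna(s.mean()))

print(df)
```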
| | total_bill | size |
---|---|---|
0 | 500.0 | 2 |
1 | 648.0 | 3 |
2 | 75.0 | 1 |
3 | 159.0 | 4 |
4 | 250.0 | 3 |
5 | 222.0 | 3 |
6 | 356.0 | 4 |
7 | 99.0 | 1 |
8 | 150.0 | 4 |
9 | 363.2 | 4 |
10 | 478.0 | 3 |
11 | 320.0 | 3 |
12 | 363.2 | 2 |
13 | 520.0 | 3 |
14 | 367.0 | 3 |
15 | 363.2 | 4 |
16 | 100.0 | 1 |
17 | 99.0 | 1 |
18 | 199.0 | 3 |
19 | 288.0 | 2 |
20 | 120.0 | 1 |
21 | 168.0 | 4 |
22 | 120.0 | 3 |
23 | 284.0 | 2 |
Now merge the recently cleaned categorical and numerical columns.
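One way to merge the two cleaned pieces is a column-wise concat; the two small frames here are hypothetical stand-ins for the cleaned categorical and numeric parts:

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned categorical and numeric parts.
df_cat_clean = pd.DataFrame({"sex": ["male", "male"], "day": ["sat", "sat"]})
df_num_clean = pd.DataFrame({"total_bill": [500.0, 648.0], "size": [2, 3]})

# axis=1 concatenates column-wise, lining the pieces up by index.
df_clean = pd.concat([df_cat_clean, df_num_clean], axis=1)
print(df_clean)
```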
| | sex | day | total_bill | size |
---|---|---|---|---|
0 | male | sat | 500.0 | 2 |
1 | male | sat | 648.0 | 3 |
2 | female | sat | 75.0 | 1 |
3 | female | sat | 159.0 | 4 |
4 | male | mon | 250.0 | 3 |
5 | female | mon | 222.0 | 3 |
6 | male | mon | 356.0 | 4 |
7 | female | mon | 99.0 | 1 |
8 | male | mon | 150.0 | 4 |
9 | female | mon | 363.2 | 4 |
10 | female | tue | 478.0 | 3 |
11 | male | tue | 320.0 | 3 |
12 | male | tue | 363.2 | 2 |
13 | male | mon | 520.0 | 3 |
14 | male | tue | 367.0 | 3 |
15 | female | wed | 363.2 | 4 |
16 | female | wed | 100.0 | 1 |
17 | male | wed | 99.0 | 1 |
18 | female | wed | 199.0 | 3 |
19 | female | wed | 288.0 | 2 |
20 | male | thu | 120.0 | 3 |
21 | male | thu | 168.0 | 4 |
22 | female | thu | 120.0 | 3 |
23 | male | mon | 284.0 | 2 |
Now merge this recently cleaned and merged data with the remaining columns of the dataset, which we did not clean, to get the final clean dataset.
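A final sketch of this merge, with hypothetical stand-ins for the original dataset and the already-merged cleaned columns:

```python
import pandas as pd

# Hypothetical stand-ins: the original dataset and the already-merged cleaned columns.
df = pd.DataFrame({
    "total_bill": [500.0, 648.0],
    "sex":        ["male", "male"],
    "tip":        [30, 35],          # a column that was never cleaned
})
df_clean = pd.DataFrame({
    "sex":        ["male", "male"],
    "total_bill": [500.0, 648.0],
})

# Take the untouched columns from the original and concatenate them onto the cleaned part.
remaining_cols = df.columns.difference(df_clean.columns)
final_df = pd.concat([df_clean, df[remaining_cols]], axis=1)
print(final_df)
```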