Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Introduction
Setup
Read data
Data preprocessing
Data cleaning
Handle date-time column
Handling outliers
Encoding
Feature_Engineering
Feature selection filter methods
Feature selection wrapper methods
Multicollinearity
Data split
Feature scaling
Supervised Learning
Regression
Classification
Bias and Variance
Overfitting and Underfitting
Regularization
Ensemble learning
Unsupervised Learning
Clustering
Association Rule
Common
Model evaluation
Cross Validation
Parameter tuning
Code Exercise
Car Price Prediction
Flight Fare Prediction
Diabetes Prediction
Spam Mail Prediction
Fake News Prediction
Boston House Price Prediction
Learn Github
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Feature selection is done because not every feature is equally important to the ML model, and unimportant features can hurt accuracy. So, to get good accuracy, select only the features that play an important role.
Suppose we have three independent features and one dependent feature (the target variable). Using feature selection techniques, we try to find the correlation between each independent feature and the target variable. If an independent feature is correlated with the target variable, we keep that feature; if not, we remove it.
Correlated means that when an independent feature increases, the target variable also consistently increases or decreases, and when the feature decreases, the target moves accordingly. If this happens, we can say the independent feature is correlated with the target variable: when one goes up or down, the other moves with it (positive correlation) or against it (negative correlation).
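As a minimal sketch of this idea, pandas' corr() can be used to check how each independent feature moves with the target (the column names and values below are hypothetical, not the course dataset):

```python
import pandas as pd

# Hypothetical toy data: three independent features and one target column.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [2, 1, 4, 3, 6],
    "feature_3": [9, 7, 5, 3, 1],
    "target":    [10, 14, 22, 26, 33],
})

# Pearson correlation of every column with the target;
# values near +1 or -1 indicate a strong positive or negative relationship.
print(df.corr()["target"].sort_values(ascending=False))
```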
There are many techniques for finding the important features, and they are discussed below.
For categorical columns, encoding needs to be done before applying the feature selection technique. Encoding will be discussed in the next lecture.
This technique removes features that contain constant values, in other words low-variance features. Features that contain a constant value are not useful to a machine learning algorithm.
Low variance or constant feature: not good for the model
High variance or non-constant feature: good for the model
If there are constant or low-variance features, we can remove them with the variance threshold feature selection technique.
Use this technique on numerical columns that contain constant values.
For categorical columns, first convert them into numeric form and then perform variance-based feature selection.
Let's see an example:
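A minimal sketch using scikit-learn's VarianceThreshold (the column names and values below are hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: "constant_col" has zero variance.
df = pd.DataFrame({
    "feature_1":    [1, 2, 3, 4, 5],
    "feature_2":    [10, 20, 10, 30, 20],
    "constant_col": [7, 7, 7, 7, 7],
})

# threshold=0 keeps only features whose variance is greater than 0,
# i.e. it drops constant columns.
selector = VarianceThreshold(threshold=0)
selector.fit(df)

# Columns that survive the filter.
kept = df.columns[selector.get_support()]
print(kept.tolist())  # ['feature_1', 'feature_2']
```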
Let's implement it on a real dataset:
See this real-data implementation if you are able to create a project. If not, watch all the previous lectures first and then come back here.
Normally, independent features should be correlated with the dependent feature and uncorrelated with each other. If an independent feature is highly correlated with another independent feature, then one can be predicted from the other. So if two or more independent features are highly correlated with each other, the model needs only one of them, because highly correlated features behave like duplicates.
Suppose three features are highly correlated with each other. In the model, we should keep only one of these three features.
Now, how do we choose which one to keep and which to remove?
The answer is: among these three highly correlated features, we keep the one that is most correlated with the target variable, and the other two are removed.
To find the correlation between independent features, we use threshold values like 0.7, 0.8, etc. Here 0.7 means 70% and 0.8 means 80%. With a threshold, if multiple independent features are correlated with each other above the threshold value, we keep only one of them. Suppose three features are more than 80% correlated with each other. In this case, only one of them will be kept (the one most correlated with the target feature) and the others will be dropped.
Let's see an example:
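A minimal sketch of this idea (the helper function drop_highly_correlated and the data are hypothetical, not part of any library): from each pair of features correlated above the threshold, it drops the one that is less correlated with the target.

```python
import pandas as pd

def drop_highly_correlated(X, y, threshold=0.8):
    """From each pair of features correlated above `threshold`,
    keep the one more correlated with the target and drop the other."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    cols = list(X.columns)
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                # Drop whichever of the pair is less correlated with the target.
                weaker = cols[i] if target_corr[cols[i]] < target_corr[cols[j]] else cols[j]
                to_drop.add(weaker)
    return X.drop(columns=list(to_drop))

# Hypothetical toy data: feature_1 and feature_2 are almost duplicates.
X = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [2, 4, 6, 8, 10],
    "feature_3": [5, 3, 8, 1, 7],
})
y = pd.Series([10, 19, 31, 42, 50], name="target")

print(drop_highly_correlated(X, y, threshold=0.8).columns.tolist())
```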
See this real-data implementation if you are able to create a project. If not, watch all the previous lectures first and then come back here.
In feature selection, we try to find the relationship between the independent and dependent features. If the mutual information between an independent feature and the dependent feature is 0, the two are independent (no relationship); if the value is greater than 0, they are related, and how strongly related depends on the value. Mutual information is calculated based on entropy estimation.
Formula: I(X;Y) = H(X) - H(X|Y)
Here,
I(X;Y) = mutual information of X and Y
H(X) = entropy of X
H(X|Y) = conditional entropy of X given Y
The result is measured in bits when the logarithm is taken base 2 (or in nats with the natural logarithm).
This tells us which features are more important. How many features we use for training is up to us, but take the more important ones to get better accuracy.
Apply this technique on numerical independent features. If your independent columns contain categorical values, don't use this technique. If the dependent column contains categorical data, it is not necessary to convert it into numerical form, but you can do so for better results.
Let's see an example:
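A minimal sketch using scikit-learn's mutual_info_classif for a classification target (the data below is hypothetical; note that scikit-learn reports mutual information in nats, i.e. natural-log units, rather than bits):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical toy data with a binary target.
X = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5, 6, 7, 8],
    "feature_2": [5, 3, 8, 1, 7, 2, 9, 4],
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# Mutual information of each feature with the target;
# 0 means independent, larger values mean a stronger relationship.
scores = mutual_info_classif(X, y, random_state=0)
print(pd.Series(scores, index=X.columns).sort_values(ascending=False))
```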
See this real-data implementation if you are able to create a project. If not, watch all the lectures first and then come back here.
This technique is used on categorical variables in classification tasks, i.e. it is used to see the relationship between the categorical features and the target feature. The chi-square test gives two values: the chi-square score (statistic) and the p-value. The score should be higher: a feature with a higher score is more important. The p-value should be lower: a feature with a lower p-value is more important.
Because we perform the chi-square test on categorical data, first convert the categories into numerical form using encoding techniques and then apply this feature selection method.
Let's see an example:
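A minimal sketch using scikit-learn's chi2 together with SelectKBest (the data below is hypothetical and already label-encoded; chi2 requires non-negative values):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: categorical features already encoded as integers.
X = pd.DataFrame({
    "feature_1": [0, 1, 0, 1, 0, 1, 0, 1],
    "feature_2": [2, 2, 1, 0, 2, 1, 0, 0],
})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

# chi2 returns the chi-square statistic and the p-value for each feature.
scores, p_values = chi2(X, y)
print(pd.DataFrame({"chi2": scores, "p_value": p_values}, index=X.columns))

# SelectKBest keeps the k features with the highest chi-square scores.
best = SelectKBest(score_func=chi2, k=1).fit(X, y)
print(X.columns[best.get_support()].tolist())
```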
Watch this real-data implementation if you are able to create a project. If not, watch all the lectures first and then come back here.
This technique gives a score for each feature of our data: the higher the score, the more relevant the feature. In other words, a feature with a high score is more important and a feature with a low score is less important.
Let's see an example:
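The exact scoring method is not named here; as one common example, a tree ensemble's feature_importances_ gives a per-feature score where higher means more relevant. A minimal sketch with hypothetical data:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical toy data with a binary target.
X = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5, 6, 7, 8],
    "feature_2": [5, 3, 8, 1, 7, 2, 9, 4],
    "feature_3": [0, 0, 1, 1, 0, 1, 1, 0],
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# Fit a tree ensemble and read one importance score per feature;
# a higher score means the feature is more relevant.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

scores = pd.Series(model.feature_importances_, index=X.columns)
print(scores.sort_values(ascending=False))
```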
Watch this real-data implementation if you are able to create a project. If not, watch all the lectures first and then come back here.