Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Introduction
Setup
Read data
Data preprocessing
Data cleaning
Handle date-time column
Handling outliers
Encoding
Feature Engineering
Feature selection filter methods
Feature selection wrapper methods
Multicollinearity
Data split
Feature scaling
Supervised Learning
Regression
Classification
Bias and Variance
Overfitting and Underfitting
Regularization
Ensemble learning
Unsupervised Learning
Clustering
Association Rule
Common
Model evaluation
Cross Validation
Parameter tuning
Code Exercise
Car Price Prediction
Flight Fare Prediction
Diabetes Prediction
Spam Mail Prediction
Fake News Prediction
Boston House Price Prediction
Learn Github
Learn OpenCV
Learn Deep Learning
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
Suppose there is a car: its color is blue, it has four wheels and one engine. Color, wheels, and engine are features of the car. These features can be used to predict the car's price, because the price depends on them. So here price is a dependent feature, since it depends on the color and engine features. Color and engine are independent features because they do not depend on any other feature.
Sometimes data has many columns or features, but not all of them are required or strongly correlated with the target. You have to keep the features that are correlated with, i.e. have an impact on, the target. Removing the unnecessary features is called feature reduction. Sometimes creating new features is also required to get better accuracy. New features can be created by splitting a feature, combining multiple features, or changing the type of a feature. All of these tasks come under feature engineering.
Sometimes the existing features don't give good accuracy or results. To get better performance, we sometimes create or develop new features.
Feature and variable are the same thing. In machine learning we commonly say feature, and in statistics we commonly say variable, but the meaning and use of both are the same.
Let's see an example:
In the table below, Aircraft boarding time and Aircraft reach at station are both features (or variables).

Aircraft boarding time | Aircraft reach at station |
---|---|
01:00 | 01:00 |
03:00 | 03:15 |
04:00 | 04:10 |
Numerical variable:
There are two types of numerical variable:
Discrete: Example: integer values, such as a count.
Continuous: Example: float values, such as a price.
Categorical variable:
There are two types of categorical variable:
Ordinal: the categories have an order, like Sunday, Monday, Tuesday,
or grades like A, B, C.
Nominal: the categories have no order, like colors red, blue, green.
Date and Time
Mixed
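The variable types above map directly onto pandas dtypes. As a minimal sketch, here is a hypothetical dress dataset (the column names are invented for illustration) showing how each type looks when you inspect a DataFrame:

```python
import pandas as pd

# Hypothetical dress dataset covering the variable types described above
df = pd.DataFrame({
    "price": [49.99, 25.50, 80.00],        # numerical, continuous (float)
    "num_pockets": [2, 0, 4],              # numerical, discrete (integer)
    "grade": ["A", "C", "B"],              # categorical, ordinal
    "color": ["red", "blue", "green"],     # categorical, nominal
    "made_on": pd.to_datetime(
        ["2021-01-05", "2021-03-10", "2021-06-20"]
    ),                                     # date and time
})

# dtypes shows which columns are numerical, object (categorical), or datetime
print(df.dtypes)
```

Checking `df.dtypes` like this is usually the first step before deciding how to clean or encode each column.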
An independent variable/feature is a variable in a dataset that doesn't depend on the other variables or features in the dataset. A dependent variable/feature is one that depends on other features in the dataset. Suppose we have four features of a dress: 1. Price 2. Name 3. Color 4. Country. Here price is the dependent variable because the price of a dress depends on the other features like name, color, and country; without these features, we can't predict the price. Name, color, and country are independent variables because they don't depend on any other variable or feature.
It's important because without understanding the type of a variable we can't perform the work correctly.
For example:
1. Different techniques are used to fill missing values in categorical and numerical variables. So
without knowing the type of a variable, filling the missing values is not possible.
2. A machine learning model doesn't understand categorical variables. To use a categorical variable
in a machine learning model, it needs to be converted into a numerical variable. So without
knowing the variable types, this is not possible.
3. In feature engineering we need to create new features. Without knowing the types of the variables, it is quite difficult to
create new features.
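Point 1 above can be sketched in pandas: missing values in a numerical column are commonly filled with the mean (or median), while a categorical column is filled with its most frequent value. The toy data below is invented for illustration:

```python
import pandas as pd
import numpy as np

# Toy dataset with one missing numerical value and one missing categorical value
df = pd.DataFrame({
    "price": [100.0, np.nan, 120.0, 110.0],   # numerical
    "color": ["red", "blue", None, "blue"],   # categorical
})

# Numerical column: fill with the mean of the observed values
df["price"] = df["price"].fillna(df["price"].mean())

# Categorical column: fill with the most frequent value (the mode)
df["color"] = df["color"].fillna(df["color"].mode()[0])

print(df)
```

The mean of the observed prices (100, 120, 110) is 110, and the most frequent color is "blue", so those are the values that replace the gaps.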
1. Test the existing features first. If they don't give good results, go for new features.
2. Decide which features should be created.
3. Create the features.
4. Check how the new features work with the model.
5. Improve the features if needed.
Example:
In the table below there are two features: the boarding time and the time the aircraft reached the station. From
these two features, two new features can be created: whether the flight was delayed or on time, and the delay time in
minutes.

Aircraft boarding time | Aircraft reached at station |
---|---|
01:00 | 01:00 |
03:00 | 03:15 |
04:00 | 04:10 |

Creating new features:

Aircraft boarding time | Aircraft reached at station | Delayed or On time | Delay time in minutes |
---|---|---|---|
01:00 | 01:00 | On time | 00:00 |
03:00 | 03:15 | Delayed | 00:15 |
04:00 | 04:10 | Delayed | 00:10 |
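The two new columns in the table can be derived in pandas. This is a minimal sketch with the table's own values; the column names are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "boarding_time": ["01:00", "03:00", "04:00"],
    "arrival_time":  ["01:00", "03:15", "04:10"],
})

# Parse the HH:MM strings into timestamps so they can be subtracted
board = pd.to_datetime(df["boarding_time"], format="%H:%M")
arrive = pd.to_datetime(df["arrival_time"], format="%H:%M")

# New feature 1: delay in minutes
df["delay_minutes"] = (arrive - board).dt.total_seconds() / 60

# New feature 2: delayed or on time, derived from the delay
df["status"] = df["delay_minutes"].apply(
    lambda m: "Delayed" if m > 0 else "On time"
)

print(df)
```

This reproduces the table above: delays of 0, 15, and 10 minutes, with the first flight on time and the other two delayed.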
Feature selection is done because not all features are equally important for the ML model, and
unimportant features can be a cause of bad accuracy. So to get good accuracy, select
the features that play an important role.
Suppose we have three independent features and one dependent feature (the target variable). Using feature
selection techniques, we try to find the correlation between each independent feature and the target
variable. If an independent feature is correlated with the target variable, we keep that feature; if not,
we remove it.
Correlated means that when an independent feature increases, the target variable also consistently
increases (positive correlation) or consistently decreases (negative correlation), and likewise when the
feature decreases. If this happens, we can say that the feature is correlated with the target variable,
because when one moves, the other moves with it in a predictable direction.
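One simple filter based on this idea is to compute the correlation of each independent feature with the target and keep only the strongly correlated ones. Below is a sketch on an invented car dataset, where `engine_size` moves together with `price` but `num_doors` does not; the 0.5 threshold is an arbitrary choice for illustration:

```python
import pandas as pd

# Hypothetical car data: engine_size rises with price, num_doors does not
df = pd.DataFrame({
    "engine_size": [1.0, 1.4, 1.6, 2.0, 2.5],
    "num_doors":   [4, 2, 4, 4, 2],
    "price":       [10000, 13500, 15000, 21000, 26000],
})

# Pearson correlation of every independent feature with the target
corr_with_price = df.corr()["price"].drop("price")
print(corr_with_price)

# Keep features whose absolute correlation exceeds the chosen threshold
selected = corr_with_price[corr_with_price.abs() > 0.5].index.tolist()
print(selected)
```

On this data, `engine_size` has a correlation with `price` above 0.9 and is kept, while `num_doors` falls below the threshold and is dropped.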
There are many techniques to find the important features, and they are going to be discussed.
For categorical columns, encoding needs to be done before applying a feature selection technique.
Encoding will be discussed in the next lecture.