Learn Python
Learn Data Structure & Algorithm
Learn Numpy
Learn Pandas
Learn Matplotlib
Learn Seaborn
Learn Statistics
Learn Math
Learn MATLAB
Learn Machine learning
Learn GitHub
Learn OpenCV
Introduction
Setup
ANN
How ANN works
Propagation
Bias parameter
Activation function
Loss function
Overfitting and Underfitting
Optimization function
Chain rule
Minima
Gradient problem
Weight initialization
Dropout
ANN Regression Exercise
ANN Classification Exercise
Hyperparameter tuning
CNN
CNN basics
Convolution
Padding
Pooling
Data augmentation
Flattening
Create Custom Dataset
Binary Classification Exercise
Multiclass Classification Exercise
Transfer learning
Transfer model Basic template
RNN
How RNN works
LSTM
Bidirectional RNN
Sequence to sequence
Attention model
Transformer model
Bag of words
Tokenization & Stop words
Stemming & Lemmatization
TF-IDF
N-Gram
Word embedding
Normalization
POS tagging
Parser
Semantic analysis
Regular expression
Learn MySQL
Learn MongoDB
Learn Web scraping
Learn Excel
Learn Power BI
Learn Tableau
Learn Docker
Learn Hadoop
We use a linear activation function when our data is linear in nature; in other words, we use it for regression problems. The output is not limited to any range, so the range is negative infinity to positive infinity. We use the linear activation function only in the output layer.
Formula:
f(x) = x
Range: -infinity to infinity
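As a minimal sketch in plain NumPy (the function name linear is only illustrative, not part of any framework), the linear activation simply returns its input unchanged:

```python
import numpy as np

def linear(x):
    """Linear (identity) activation: output equals input, range (-inf, +inf)."""
    return x

y = np.array([-3.0, 0.0, 2.5])   # example weighted sums
print(linear(y))                 # [-3.   0.   2.5]
```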
Sigmoid is the most widely used non-linear activation function. We use it for binary classification (0 or 1), in the output layer. The graph of the sigmoid function looks like a curve or S shape. No matter how large or small the weighted sum y is, the sigmoid function converts it to a value between 0 and 1. Here 0.5 is the threshold: if the sigmoid output is less than 0.5, we decide 0, meaning the neuron will not get activated, and if it is greater than or equal to 0.5, we decide 1, meaning the neuron will get activated.
Formula: sigmoid(y) = 1/(1 + e^(-y))
In a binary classification problem, we use the sigmoid activation function in the output node because the target there is 0 or 1 and sigmoid transforms values into the range 0 to 1.
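A minimal NumPy sketch of this (the helper name sigmoid and the explicit 0.5 threshold decision are written out here only for illustration):

```python
import numpy as np

def sigmoid(y):
    """Squash the weighted sum y into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

y = np.array([-4.0, 0.0, 3.0])        # example weighted sums
probs = sigmoid(y)
print(probs)                           # approximately [0.018 0.5   0.953]
print((probs >= 0.5).astype(int))      # 0.5 threshold decision -> [0 1 1]
```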
Tanh is a hyperbolic trigonometric function. We also use it for binary classification, but usually in the hidden layers. It is very similar to the sigmoid activation function and almost always works better than sigmoid; tanh is a mathematically shifted version of the sigmoid activation function. The range of values in this case is from -1 to 1. With sigmoid we saw that the range is 0 to 1 and the midpoint or threshold is 0.5 (an output less than 0.5 is decided as 0, and greater than or equal to 0.5 as 1), but with tanh the range is -1 to 1 and the midpoint or threshold is 0. The shape is also a curve or S shape.
Formula: tanh(z) = (1 - e^(-2z)) / (1 + e^(-2z))
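A minimal NumPy sketch of the formula above (np.tanh gives the same result and is shown only as a cross-check):

```python
import numpy as np

def tanh(z):
    """Tanh activation: S-shaped like sigmoid, but with output range (-1, 1)."""
    return (1.0 - np.exp(-2.0 * z)) / (1.0 + np.exp(-2.0 * z))

z = np.array([-2.0, 0.0, 2.0])
print(tanh(z))        # approximately [-0.964  0.     0.964]
print(np.tanh(z))     # NumPy's built-in tanh matches
```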
ReLU is a non-linear activation function. We use this function in the hidden layers. The main advantage of this function is that it does not activate all the neurons at the same time: the output is either 0 or the input itself. If the weighted sum y is less than or equal to 0, it is transformed into 0, meaning the output is always 0 and the neuron is not activated; if it is greater than 0 (like 1, 2, 3, 4, 10, 500, etc.), the output is that same positive value.
Formula: f(x) = max(0, x)
So the range is: 0 to infinity
Suppose y is 2, then the output is 2, meaning activated; if it is 500, the output is 500, meaning activated; and if y is 0, -4, -1, -2, -10, or -500, the output will always be 0. This means a neuron is deactivated only when the output of the linear transformation is less than or equal to 0, and activated when it is greater than 0. The ReLU activation function learns much faster than the sigmoid and tanh activation functions.
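The same behaviour as a minimal NumPy sketch:

```python
import numpy as np

def relu(y):
    """ReLU: 0 for y <= 0 (neuron stays off), y itself for y > 0."""
    return np.maximum(0.0, y)

y = np.array([-500.0, -4.0, 0.0, 2.0, 500.0])
print(relu(y))   # [  0.   0.   0.   2. 500.]
```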
Leaky ReLU is nothing but an improved version of the ReLU function. As we saw, for the ReLU function the gradient is 0 whenever y is less than or equal to 0 (y <= 0), which makes the neurons in that region die (they stop getting activated). Leaky ReLU was created to overcome this problem. Instead of defining the function as 0 for negative inputs, we define it as a small linear component of x. In ReLU, a negative value like -1, -1000, or -100000044 always becomes 0, but in Leaky ReLU it does not become 0 or positive: it stays slightly below 0, and the distance from 0 grows as the negative value grows.
Formula:
f(x) = ax, for x < 0 (a is a small constant, commonly 0.01)
f(x) = x, otherwise
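A minimal NumPy sketch, assuming the common default a = 0.01 for the small slope:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """Leaky ReLU: a small slope a for negative inputs instead of a hard 0."""
    return np.where(x > 0, x, a * x)

x = np.array([-1000.0, -1.0, 0.0, 3.0])
print(leaky_relu(x))   # [-10.   -0.01   0.     3.  ]
```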
Swish shows better performance than ReLU on deeper models. It was developed by Google and is as computationally efficient as ReLU. The Swish function is not monotonic; the curve of the function is smooth and the function is differentiable at all points.
Range: approximately -0.28 to infinity (bounded below, unbounded above)
Formula:
f(x) = x * sigmoid(x) = x / (1 + e^(-x))
In Swish, the value of the function may decrease even when the input values are increasing.
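A minimal NumPy sketch; the negative inputs below also show the non-monotonic dip mentioned above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    """Swish: x * sigmoid(x); smooth and differentiable everywhere."""
    return x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))   # approximately [-0.033 -0.269  0.     0.731  4.967]
# Note the dip: from x = -5 to x = -1 the output decreases, so Swish is not monotonic.
```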
We use the softmax function for multiclass classification, in the output layer. We can say that softmax is a combination of multiple sigmoid functions. Sigmoid returns values between 0 and 1, which we can treat as the probability of a data point belonging to a particular class; softmax squeezes the output of each class between 0 and 1 and also divides by the sum of the outputs. In a multiclass classification problem, the output layer can have any number of neurons, but more than one: the number of neurons equals the number of classes in the target. If you have three classes, you will have three neurons in the output layer.
Suppose we have four neurons in the output layer and the outputs from those neurons are 1.4, 0.33, 0.8, and 0.77. After applying the softmax function we get approximately 0.41, 0.14, 0.23, and 0.22.
If you sum the softmax results, you get 1. This happens because the outputs of softmax are interrelated: if one class's value increases, the others automatically decrease so that the total stays 1.
Formula: softmax(y_i) = exp(y_i) / Σ_j exp(y_j)
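A minimal NumPy sketch reproducing the four-neuron example above (subtracting the max before exponentiating is a standard numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def softmax(y):
    """Softmax: exponentiate each output and divide by the sum, so the results add up to 1."""
    exp_y = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return exp_y / exp_y.sum()

logits = np.array([1.4, 0.33, 0.8, 0.77])   # raw outputs of the four output neurons
probs = softmax(logits)
print(probs.round(2))   # [0.41 0.14 0.23 0.22]
print(probs.sum())      # 1.0 (up to floating-point rounding)
```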