Activation Function
• An activation function is a mathematical function that determines the output of a neural network.
• It is attached to each neuron in the network and determines whether the neuron should be
activated ("fired") or not.
• In neural networks, we usually use the Sigmoid activation function for binary classification tasks.
Multi-Class Classification
• In multi-class classification, each input will have only one output class. For example, if we are making an
animal classifier that classifies between Dog, Rabbit, Cat, and Tiger, it makes sense for only one of these
classes to be selected each time.
• By contrast, if we are building a model which predicts all the clothing articles a person is wearing, we can
use a multi-label classification model, since there can be more than one possible option at once.
Binary Step Function
Used in: Hidden layer and output layer for binary classification problems
Limitation
• It can't provide multi-valued outputs, i.e., it is not suitable for multi-class classification problems.
Signum Function
Used in: Hidden layer and output layer for binary classification problems
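Below is a minimal sketch of the signum function, assuming the standard definition that maps positive inputs to +1, negative inputs to -1, and zero to 0.

def signum(x):
    # Map any real input to one of three discrete values: +1, -1, or 0.
    if x > 0:
        return 1
    elif x < 0:
        return -1
    else:
        return 0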
Linear Activation Function
• Also called no activation or identity function
• Activation is proportional to the input.
Range
-∞ to +∞
Limitation
• It is not possible to use backpropagation, as the derivative of the function is a constant and has
no relation to the input x.
• All the layers of the neural network will collapse into one if a linear activation is used: no matter
how many layers the network has, the last layer will still be a linear function of the first layer.
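As a minimal sketch (assuming the identity form f(x) = x), the function and its constant derivative can be written as:

def linear(x):
    # Identity activation: the output is simply the input.
    return x

def linear_derivative(x):
    # The derivative is a constant (1) with no relation to x, which is why
    # backpropagation cannot use it to learn anything useful.
    return 1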
Sigmoid / Logistic Activation Function
• This function takes any real value as input and outputs values in the range of 0 to 1.
• The larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to 0.0, as shown below.
Advantages:
• It is commonly used for models where we have to predict a probability as the output. Since the
probability of anything exists only in the range of 0 to 1, sigmoid is the right choice because of its range.
• Used in hidden layers and in the output layer for classification; it gives the likelihood of a class
rather than a hard classification.
• The function is differentiable and provides a smooth gradient, preventing jumps in output values.
This is reflected in the S-shape of the sigmoid activation function.
Limitation
• The derivative of the function is f'(x) = sigmoid(x) * (1 - sigmoid(x)).
• The gradient values are only significant for inputs in the range -3 to 3; the graph gets much flatter in other regions.
This implies that for values greater than 3 or less than -3, the function will have very small gradients. As the
gradient approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.
• The output of the logistic function is not symmetric around zero, so the outputs of all the neurons will be of the
same sign. This makes training the neural network more difficult and unstable.
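A minimal sketch of the sigmoid and the derivative quoted above, assuming the standard form f(x) = 1 / (1 + e^(-x)):

import math

def sigmoid(x):
    # Squash any real input into the range (0, 1).
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = sigmoid(x) * (1 - sigmoid(x)); it peaks at 0.25 when x = 0 and
    # approaches zero for |x| > 3, which is the source of vanishing gradients.
    s = sigmoid(x)
    return s * (1 - s)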
Tanh Function (Hyperbolic Tangent)
• Tanh function is very similar to the sigmoid/logistic activation function, and even has the
same S-shape with the difference in output range of -1 to 1. In Tanh, the larger the
input (more positive), the closer the output value will be to 1.0, whereas the smaller the
input (more negative), the closer the output will be to -1.0.
Advantages of using this activation function are:
• The output of the tanh activation function is Zero centered; hence we can easily map
the output values as strongly negative, neutral, or strongly positive.
• Used in hidden layers and in the output layer for classification; it gives the likelihood of a class
rather than a hard classification.
• Usually used in hidden layers of a neural network.
Limitation
• It also faces the problem of vanishing gradients, similar to the sigmoid activation function. In
addition, the gradient of the tanh function is much steeper than that of the sigmoid.
• Note: Although both sigmoid and tanh face vanishing gradient issue, tanh is
zero centered, and the gradients are not restricted to move in a certain direction.
Therefore, in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity.
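A minimal sketch of tanh and its derivative, assuming the standard hyperbolic tangent; note the zero-centered output in (-1, 1):

import math

def tanh(x):
    # Zero-centered S-shaped curve with outputs in (-1, 1).
    return math.tanh(x)

def tanh_derivative(x):
    # 1 - tanh(x)^2: equal to 1 at x = 0 (steeper than sigmoid's peak of 0.25)
    # and close to 0 for large |x|, so vanishing gradients still occur.
    return 1 - math.tanh(x) ** 2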
ReLU Function
• ReLU stands for Rectified Linear Unit.
• Although it gives an impression of a linear function, ReLU has a derivative function and allows for
backpropagation while simultaneously making it computationally efficient.
• Used in hidden layers of CNNs or vision applications, and in the output layer where the dependent
variable is always positive.
• The main catch here is that the ReLU function does not activate all the neurons at the same time.
• The neurons will only be deactivated if the output of the linear transformation is less than 0.
def ReLU(x):
    # Output the input unchanged when it is positive; otherwise output 0,
    # which deactivates the neuron for negative inputs.
    if x > 0:
        return x
    else:
        return 0
• Since only a certain number of neurons are activated, the ReLU function is far more computationally
efficient when compared to the sigmoid and tanh functions.
• ReLU accelerates the convergence of gradient descent towards the global minimum of the loss
function due to its linear, non-saturating property.
Leaky ReLU Function
Range
-∞ to +∞
• The advantages of Leaky ReLU are the same as those of ReLU; in addition, it does enable
backpropagation even for negative input values. The gradient is a small non-zero value, so there are no dead neurons.
• Mostly used in hidden layers of CNNs.
Limitations
• Predictions may not be consistent for negative input values
• The gradient for negative values is small, which makes learning the model parameters time-consuming.
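A minimal sketch of Leaky ReLU, assuming the commonly used fixed negative slope of 0.01:

def leaky_relu(x, alpha=0.01):
    # Pass positive inputs through unchanged; scale negative inputs by a small
    # fixed slope so the gradient is never exactly zero (no dead neurons).
    return x if x > 0 else alpha * x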
Parametric ReLU Function
• Solves the problem of the gradient becoming zero for the left half of the axis.
• This function provides the slope of the negative part of the function as an argument a. By performing
backpropagation, the most appropriate value of a is learnt.
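A minimal sketch of Parametric ReLU; here the negative slope a is passed in as an argument, with the understanding that in a real network it is a learnable parameter updated by backpropagation:

def parametric_relu(x, a):
    # Same shape as Leaky ReLU, but the negative slope 'a' is learnt during
    # training instead of being fixed in advance.
    return x if x > 0 else a * x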
Softmax Function
Suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with them?
The above values don't make sense on their own, as the sum of all the class/output probabilities should be equal to 1.
• The Softmax function is described as a combination of multiple sigmoids.
• It is most commonly used as an activation function for the last layer of the neural network
in the case of multi-class classification.
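A minimal sketch of softmax applied to the five raw scores above (0.8, 0.9, 0.7, 0.8, 0.6); after normalization the outputs sum to 1 and can be read as class probabilities:

import math

def softmax(scores):
    # Exponentiate each score and divide by the total so the outputs form a
    # probability distribution that sums to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([0.8, 0.9, 0.7, 0.8, 0.6]))
# Roughly [0.21, 0.23, 0.19, 0.21, 0.17], which sums to 1.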
The activation function used in hidden layers is typically chosen based on the type of
neural network architecture.
• A neural network will almost always have the same activation function in all hidden layers. This
activation function should be differentiable so that the parameters of the network are learned in
backpropagation.
• ReLU is the most commonly used activation function for hidden layers.
• While selecting an activation function, you must consider the problems it might face: vanishing
and exploding gradients.
• Regarding the output layer, we must always consider the expected value range of the predictions.
If it can be any numeric value (as in the case of a regression problem), you can use the linear
activation function or ReLU.
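As a small illustrative summary of the guidelines above (the mapping below is just an example, not a fixed rule for every architecture):

# Typical output-layer activation choices, following the guidelines above.
OUTPUT_ACTIVATION = {
    "regression": "linear or ReLU",           # any numeric value
    "binary_classification": "sigmoid",       # probability in (0, 1)
    "multi_class_classification": "softmax",  # probabilities summing to 1
}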