Lec08-1 Activation Functions
Activation functions
• They introduce non-linear properties to the network.
• A linear function is a polynomial of degree one and always forms a
straight line.
• If we add more dimensions, it forms a plane or hyperplane, but the
shape is still perfectly straight, never curved.
• Polynomials of higher degree are non-linear and produce curves.
• Linear equations are easy to solve, but they are too limited for a NN to
represent arbitrary functions, i.e. to act as a “universal function approximator”.
• If we do not use any non-linear fn., the NN behaves as a single-layer
network no matter how many layers are used, because composing linear
layers just produces another linear network (as sketched below).
Such a network is not powerful enough to model arbitrary data.
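A minimal NumPy sketch of this point (the layer sizes and random weights are arbitrary assumptions): two stacked linear layers with no activation compute exactly the same mapping as one linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # a batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))       # weights of the first "layer"
W2 = rng.normal(size=(5, 2))       # weights of the second "layer"

two_layers = x @ W1 @ W2           # two linear layers, no activation in between
one_layer = x @ (W1 @ W2)          # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))   # True: the extra layer added nothing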
Activation function
• With a non-linear activation, the mapping from input
to output becomes non-linear.
• The activation fn. should be differentiable so that its
derivative can be computed during back-propagation, the
optimization strategy that follows the gradient in order to
learn complex, non-linear behaviour.
• The idea behind activation is to model how neurons
communicate with each other in the brain.
• Each neuron is activated through its action potential: it fires
only if its input reaches a certain threshold, otherwise it stays
inactive (a minimal sketch follows).
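A minimal sketch of the action-potential idea (the weights and threshold are made-up values): the unit fires, i.e. outputs 1, only if its weighted input reaches the threshold.

import numpy as np

def threshold_neuron(x, w, threshold=0.5):
    # Fire (return 1) if the weighted sum reaches the threshold, else stay silent (0).
    return 1 if np.dot(w, x) >= threshold else 0

w = np.array([0.4, 0.6])
print(threshold_neuron(np.array([1.0, 1.0]), w))   # 1: the neuron fires
print(threshold_neuron(np.array([0.2, 0.1]), w))   # 0: below threshold, no firing

Note that this hard threshold is not differentiable at the threshold itself, which is why smooth activations such as the sigmoid are preferred when training with back-propagation.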
Why Activation fn.s in Neural Nets
• They are used to keep the outputs within given bounds, usually 0 to 1.
• They impart non-linearity, which is an important factor for the
effectiveness and accuracy of the model.
• So we must know about them.
Activation functions (AF)
• Various threshold functions can be considered as AFs.
• Identity: f(x) = x
• Threshold: f(x) = 0 for x < 0 and f(x) = 1 for x >= 0 (useful in classifiers)
• The most popular are (sketched below):
• Sigmoid
• tanh
• ReLU
• Leaky ReLU
• Maxout
• Softmax (also used for classification by computing class probabilities)
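Minimal NumPy sketches of the listed functions (Maxout is omitted since it needs extra learned weights rather than a fixed formula; the leaky slope 0.01 is a common default assumed here):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                        # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                # 0 for negatives, identity for positives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope instead of 0 for negatives

def softmax(x):
    e = np.exp(x - np.max(x))                # shift for numerical stability
    return e / e.sum()                       # outputs sum to 1, usable as probabilities

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), softmax(z), sep="\n")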
tanh
• Squashes its input to the range -1 to 1, so unlike the Sigmoid its output is zero-centered.
Sigmoid
• Takes any real number and squashes it into the range 0 to 1 (even a
very large positive value is bounded, which avoids unbounded growth of
activations through the NN), so the output can be interpreted as how
strongly a neuron fires.
• 0 means no firing, and
• 1 means fully saturated firing.
• Easy to understand, and thus popular.
• But it has two problems:
1. It causes the gradient to vanish.
• When the neuron's activation saturates close to either 0 or 1, the gradient
becomes very close to 0.
• During back-propagation this local gradient is multiplied by the gradient
of the gate's output with respect to the whole objective.
• If this local gradient is very small, it makes the overall gradient slowly vanish,
and almost no signal flows through the neuron to its weights and, recursively,
to its data (see the numeric sketch after this list).
2. Its output is not zero-centered.
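A small numeric sketch of the vanishing-gradient point (the layer count of 10 is an arbitrary assumption): the sigmoid's local gradient, sigmoid(x) * (1 - sigmoid(x)), is at most 0.25 and nearly 0 once the unit saturates, so repeated multiplication during back-propagation drives the signal towards 0.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the largest possible local gradient
print(sigmoid_grad(10.0))   # ~4.5e-05, a saturated unit passes almost no gradient

# Even the best-case local gradient, multiplied across 10 layers:
print(0.25 ** 10)           # ~9.5e-07, the signal has effectively vanished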
Problems with Sigmoid
• Its output always lies between 0 and 1.
• That means the values after the fn. are always positive, which makes the
gradients of the weights become either all positive or all negative.
• This forces the gradient updates to zig-zag rather than move directly towards
the optimum, which makes optimization harder (a small demonstration follows).
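A tiny sketch of this, assuming a single unit whose inputs are outputs of a previous sigmoid layer and hence all positive (the values of x and the upstream gradient delta are made up): each weight's gradient is delta * x_i, so all weight gradients share the sign of delta and the update can only move in an all-positive or all-negative direction.

import numpy as np

x = np.array([0.2, 0.7, 0.9])   # all positive, as sigmoid outputs always are
delta = -1.3                    # upstream gradient flowing into the unit

grad_w = delta * x
print(grad_w)                   # every entry is negative: same sign as delta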
1. But if a lot of neurons die, then consider Leaky ReLU, Maxout, or other variants (see the sketch below).
2. But don’t consider Sigmoid or tanh.
3. There is still much room for improvement.
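A minimal sketch of why Leaky ReLU helps with dying units (the slope 0.01 is a common choice assumed here): for a negative input, ReLU's gradient is exactly 0, so the unit stops learning, while Leaky ReLU keeps a small gradient flowing.

import numpy as np

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 2.0])
print(relu_grad(z))          # [0.   0.   1.]   no gradient for negative inputs
print(leaky_relu_grad(z))    # [0.01 0.01 1.]   a small gradient keeps flowing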