Lec08-1: Activation Functions

Activation functions
• They introduce non-linear properties to a network.
• A linear function is a polynomial of degree one and always forms a straight line.
• In higher dimensions, linear functions form planes or hyperplanes; the shapes stay perfectly flat, never curved.
• Polynomials of higher degree are non-linear and produce curves.
• Linear equations are easy to solve, but they are too limited for a NN that is meant to be a "universal function approximator".
• If we do not use any non-linear fn., the NN behaves like a single-layer network no matter how many layers are used, because composing linear layers just produces another linear network. Such a network is too weak to model arbitrary data (see the sketch below).
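A minimal NumPy sketch of the last point (shapes and random weights here are purely illustrative): stacking two linear layers with no activation in between collapses into a single linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # an input vector (illustrative size)
W1 = rng.normal(size=(5, 4))     # first "layer" weights
W2 = rng.normal(size=(3, 5))     # second "layer" weights

# Two linear layers applied in sequence, no activation in between.
deep_linear = W2 @ (W1 @ x)

# The same mapping as ONE linear layer with weights W2 @ W1.
single_linear = (W2 @ W1) @ x

print(np.allclose(deep_linear, single_linear))  # True: the extra layer added nothing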
Activation function
• With a non-linear activation, the mapping from input to output becomes non-linear.
• An activation fn. should be differentiable, so that back-propagation can compute its derivative and follow the gradient to learn complex behaviour (a small check is sketched below).
• The idea behind activation is to model how neurons communicate with each other in the brain.
• A biological neuron fires an action potential only if its input reaches a certain threshold.
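As a hedged illustration of the differentiability point (helper names chosen here, not from the slides): the sketch defines the sigmoid, its analytic derivative sigma'(x) = sigma(x)(1 - sigma(x)) used by back-propagation, and checks it against a finite-difference estimate.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative used during back-propagation.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central finite differences

print(np.allclose(sigmoid_grad(x), numeric, atol=1e-6))  # True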
Why Activation fn.s in Neural Nets
• They keep the outputs within a given range, usually 0–1.
• They impart the non-linearity that is essential for an effective and accurate model.
• So we must know them well.
Activation functions (AF)
• Various threshold functions can be used as AFs, e.g.:
• Identity: f(x) = x
• Threshold: f(x) = 0 for x < 0, f(x) = 1 for x >= 0 (useful in classifiers)
• The most popular are (definitions sketched below):
• Sigmoid
• tanh
• ReLU
• Leaky ReLU
• Maxout
• Softmax (also used as a classifier output by computing class probabilities)
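A hedged NumPy sketch of the functions listed above, in their standard textbook forms (the 0.01 leaky slope and the max-subtraction in softmax are common conventions, not taken from the slides):

import numpy as np

def identity(x):            return x
def threshold(x):           return (x >= 0).astype(float)        # step function
def sigmoid(x):             return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1)
def tanh(x):                return np.tanh(x)                    # squashes to (-1, 1)
def relu(x):                return np.maximum(0.0, x)            # max(0, x)
def leaky_relu(x, a=0.01):  return np.where(x > 0, x, a * x)     # small negative slope

def softmax(x):
    # Subtracting the max is a standard trick for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()                                           # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), softmax(z).sum())   # [0. 0. 3.]  1.0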
Sigmoid
• Takes any real number and squashes it into the range 0 to 1 (even very large +ve values are capped, avoiding an exponential blow-up of +ve values inside the NN), which can be read as the firing rate of a neuron:
• 0 means no firing, and
• 1 means fully saturated firing.
• Easy to understand, thus popular.
• But it has two problems:
1. It causes the gradient to vanish.
• When a neuron's activation saturates near either 0 or 1, its local gradient shrinks to nearly 0.
• During back-propagation this local gradient is multiplied by the gradient of the gate's output with respect to the whole objective.
• If the local gradient is very small, the product vanishes and almost no signal flows through the neuron to its weights, and recursively back to its inputs (see the saturation sketch below).
2. Its output is not zero-centered.
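A small numeric illustration of the saturation point (restating the sigmoid helpers so the snippet is self-contained): once the pre-activation grows, the local gradient collapses toward zero.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"pre-activation {z:5.1f}  sigmoid {sigmoid(z):.5f}  local grad {sigmoid_grad(z):.5f}")

# pre-activation   0.0  sigmoid 0.50000  local grad 0.25000
# pre-activation   2.0  sigmoid 0.88080  local grad 0.10499
# pre-activation   5.0  sigmoid 0.99331  local grad 0.00665
# pre-activation  10.0  sigmoid 0.99995  local grad 0.00005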
Problems with Sigmoid
• Its output ranges from 0 to 1.
• That means the values after the fn. are always +ve, which makes the gradients of the weights in the next layer become either all +ve or all –ve (a small demonstration is sketched below).
• This makes the gradient updates swing too far in different directions, which makes optimization harder.

So it is difficult to optimize.
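A hedged sketch of the all-positive-input effect: for one neuron y = w . x + b whose inputs x come from sigmoid units, back-propagation gives dL/dw_i = (dL/dy) * x_i, and since every x_i > 0, all weight gradients share the sign of dL/dy (the upstream gradient value below is illustrative).

import numpy as np

rng = np.random.default_rng(1)
x = 1.0 / (1.0 + np.exp(-rng.normal(size=5)))  # sigmoid outputs: all strictly positive
w = rng.normal(size=5)

# For y = w . x + b, back-prop gives dL/dw_i = (dL/dy) * x_i.
dL_dy = -0.7                # some upstream gradient (illustrative value)
dL_dw = dL_dy * x           # elementwise

print(x > 0)                # [ True  True  True  True  True ]
print(np.sign(dL_dw))       # all -1: every weight is pushed in the same direction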
Hyperbolic Tangent fn. tanh
• It squashes the real numbers to between -1 and +1.
• Its output is zero-centered, which makes optimization easier.
• Always preferred over Sigmoid.
• But like Sigmoid, it also suffers from vanishing gradients (compare the gradients below).
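A short NumPy comparison (helper name chosen here) showing that tanh output is symmetric around zero, yet its gradient tanh'(x) = 1 - tanh(x)^2 still collapses when the unit saturates:

import numpy as np

def tanh_grad(x):
    t = np.tanh(x)
    return 1.0 - t ** 2          # derivative of tanh

z = np.array([0.0, 2.0, 5.0])
print(np.tanh(-z))               # [-0.     -0.964  -0.9999]: symmetric around 0 (zero-centered)
print(tanh_grad(z))              # [ 1.      0.0707  0.00018]: saturated units still kill the gradient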
Rectified Linear Unit ReLU
• It is the most popular activation fn.
• It is the simplest and most elegant solution.
• It gives a significant improvement in convergence over tanh, according to Krizhevsky et al., 2012.
• It is just max(0, x):
• the value is 0 when x < 0, and
• linear with slope 1 when x > 0.
• It involves no expensive operations like the exponentials in tanh and Sigmoid.
ReLU
• Almost all deep networks these days use ReLU, but only for the hidden layers.
• The output layer uses
• Softmax for classification, to give a probability for each class, and
• a linear function for regression, so the signal goes through unchanged (see the sketch below).
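A minimal sketch of that pattern (hypothetical layer sizes and random weights): two ReLU hidden layers with a softmax head for classification; for regression the head would simply return the linear scores unchanged.

import numpy as np

rng = np.random.default_rng(2)
relu = lambda x: np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical layer sizes: 8 inputs -> 16 -> 16 hidden (ReLU) -> 3 classes.
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(3, 16)), np.zeros(3)

def forward(x):
    h1 = relu(W1 @ x + b1)          # hidden layer 1: ReLU
    h2 = relu(W2 @ h1 + b2)         # hidden layer 2: ReLU
    logits = W3 @ h2 + b3           # output layer: linear scores
    return softmax(logits)          # classification head: class probabilities

probs = forward(rng.normal(size=8))
print(probs, probs.sum())           # three probabilities summing to 1.0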
ReLU Sometimes Gives a Problem
• Some units can be fragile during training and can "die".
• A big gradient flowing through a ReLU neuron can cause a weight update that makes the neuron never activate on any data point again.
• From that point on, the gradient flowing through it will always be 0.
• A variant, known as Leaky ReLU, was introduced to fix this problem.
Another Problem with ReLU
Leaky ReLU
• Instead of the activation fn. being 0 when x < 0, it uses a small negative slope (illustrated below).
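A hedged sketch of the contrast: for x < 0 a ReLU passes back zero gradient (a dead unit learns nothing), while a Leaky ReLU with slope alpha (0.01 is a common choice, assumed here) keeps a small gradient alive.

import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)            # 0 for every negative input: a "dead" unit learns nothing

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)      # small but non-zero gradient for negative inputs

z = np.array([-3.0, -0.5, 2.0])
print(relu_grad(z))        # [0.   0.   1.  ]
print(leaky_relu_grad(z))  # [0.01 0.01 1.  ]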
Other Popular Variants
• Maxout is a generalized form of both ReLU and Leaky ReLU (sketched below). Its trade-off:
• it doubles the number of parameters of each neuron.
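A minimal sketch of a Maxout unit with two linear pieces, max(w1 . x + b1, w2 . x + b2) (variable names chosen here): ReLU is the special case where the second piece is fixed at zero, and carrying two full parameter sets per neuron is exactly the doubling mentioned above.

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=4)

# Two full sets of parameters per neuron: this is the doubling trade-off.
w1, b1 = rng.normal(size=4), 0.1
w2, b2 = rng.normal(size=4), -0.2

def maxout(x):
    return max(w1 @ x + b1, w2 @ x + b2)    # take the larger of the two linear pieces

def relu_as_maxout(x):
    return max(w1 @ x + b1, 0.0)            # ReLU = Maxout with the second piece fixed at 0

print(maxout(x), relu_as_maxout(x))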
Which Activation fn. Should Be Considered?

1. Use ReLU as the default choice.
2. But if a lot of neurons die, then consider Leaky ReLU, Maxout or other variants.
3. Don't consider Sigmoid or tanh.
4. There is still much room for improvement.
