Backpropagation Math
Activation Functions:
1. Step (threshold) function: Mathematically, it can be defined as
f(z) = 0 if z < 0
       1 if z ≥ 0
It provides possible outputs = {0, 1}. It cannot provide multi-valued outputs – for example, it cannot be used for a multi-class classification problem.
2. Signum function: Mathematically, it can be defined as
f(z) = −1 if z < 0
        1 if z > 0
It provides possible outputs = {−1, 1}.
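As a quick illustration of the step and signum functions above, here is a minimal sketch in Python (the function names are my own):

import numpy as np

def step(z):
    # Binary step: 0 for z < 0, 1 for z >= 0
    return np.where(z < 0, 0, 1)

def signum(z):
    # Signum: -1 for z < 0, +1 for z > 0 (np.sign returns 0 at exactly z = 0)
    return np.sign(z)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(step(z))    # [0 0 1 1]
print(signum(z))  # [-1. -1.  1.  1.]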
3. Linear function: It simply returns its input, f(x) = x, so the output is unbounded and proportional to the input.
4. ReLU function: ReLU stands for Rectified Linear Unit.
Mathematically, it can be defined as,
f(x) = max(0, x)
Although it gives the impression of a linear function, ReLU has a derivative and allows backpropagation, while remaining computationally efficient.
The main advantage of the ReLU activation function is that it does not activate all the neurons at the same time, so it is far more computationally efficient than the sigmoid and tanh functions. ReLU also accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.
The drawback of the ReLU function is the dying ReLU problem. At any given time some neurons are active and some are inactive, and in certain situations a neuron that has become inactive will never become active again; this is known as the dying ReLU problem.
On the negative side of the graph the gradient is zero, so during backpropagation the weights and biases of those neurons are not updated. This can create dead neurons that never get activated.
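A minimal sketch of ReLU and its gradient in Python (names are my own), showing why negative pre-activations pass no gradient back:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x <= 0, so negative
    # pre-activations block the gradient entirely (dying ReLU).
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # approximately [0. 0. 0. 0.5 3.]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]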
5. Leaky ReLU function: Leaky ReLU is an improved version of the ReLU function that solves the dying ReLU problem by giving the negative region a small positive slope.
Mathematically, it can be defined as,
f(x) = max(0.1x, x)
The leaky ReLU function enables backpropagation even for negative input values.
However, predictions may not be consistent for negative input values, and the gradient there is small, which makes learning the model parameters time-consuming.
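A corresponding sketch of leaky ReLU, using the 0.1 slope given above (names are my own):

import numpy as np

def leaky_relu(x, slope=0.1):
    # Unlike plain ReLU, negative inputs keep a small slope,
    # so their gradient is small but non-zero.
    return np.maximum(slope * x, x)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))  # approximately [-0.3 -0.05 0.5 3.]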
6. Tanh function: It is a non-linear activation function. Its output ranges from -1 to 1.
Mathematically, it can be defined as
f(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Where z is the weighted sum of the neuron.
The output of the tanh activation function is zero-centered, so we can easily map the output values as strongly negative, neutral or strongly positive. The drawback of this function is the vanishing gradient problem.
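A small sketch of tanh applied to a weighted sum (names are my own):

import numpy as np

def tanh(z):
    # (e^z - e^-z) / (e^z + e^-z); output range is (-1, 1) and zero-centered
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(tanh(z))  # approximately [-0.964  0.  0.964]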
7. Sigmoid function: This function takes any real value as input and outputs values in the
range of 0 to 1. Mathematically it can be defined as
f(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))
Where z is the weighted sum.
It is commonly used in models where we have to predict a probability as the output. Since a probability exists only in the range 0 to 1, sigmoid is the right choice because of its range. The function is differentiable and provides a smooth gradient, preventing jumps in output values, and it can be used in the backpropagation algorithm. Because the output of the sigmoid function lies in the range 0 to 1, it can be thought of as a probability.
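A minimal sketch of the sigmoid applied to a weighted sum (names are my own):

import numpy as np

def sigmoid(z):
    # 1 / (1 + e^-z); maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(0.755))  # ~0.680, the hidden-unit output used in Problem 01 below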
Suppose we have five output values of 0.8, 0.9, 0.7, 0.8 and 0.6 respectively. We cannot move forward with the sigmoid activation function here, because these values do not make sense as class probabilities – the probabilities of all the classes should sum to 1. This is where the softmax activation function comes in.
8. Softmax function: It is the combination of multiple sigmoid activation functions. Mathematically, it can be defined as
f(z_i) = e^(z_i) / Σ_j e^(z_j)
so that all outputs lie between 0 and 1 and sum to 1.
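A small sketch applying softmax to the five scores mentioned above (names are my own; note that the outputs now sum to 1):

import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([0.8, 0.9, 0.7, 0.8, 0.6])
probs = softmax(scores)
print(probs)        # roughly [0.207 0.229 0.187 0.207 0.170]
print(probs.sum())  # 1.0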
Learning Rule:
1. Perceptron Rule:
Problem – 01: Train a perceptron with w1 = 1.2, w2 = 0.6, threshold T = 1 and learning rate η = 0.5 on the AND function:
A   B   A AND B
0   0   0
0   1   0
1   0   0
1   1   1
Solution:
For input (A, B) = (0, 1): Σ wi xi = 1.2 × 0 + 0.6 × 1 = 0.6 < 1, so output O = 0 (actual output equals the target, no update needed).
For input (A, B) = (1, 0): Σ wi xi = 1.2 × 1 + 0.6 × 0 = 1.2 > 1, so output O = 1 (actual output differs from the target 0, so the weights are updated).
w1(new) = w1 + Δw1 = w1 + η(t − O) x1 = 1.2 + 0.5 × (0 − 1) × 1 = 0.7
w2 is unchanged because x2 = 0 for this input.
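A minimal sketch of the perceptron rule for this AND problem, using the values above (w1 = 1.2, w2 = 0.6, T = 1, η = 0.5); the training loop and names are my own:

# Perceptron learning rule for the AND gate (threshold activation).
w = [1.2, 0.6]     # initial weights
T = 1.0            # firing threshold
eta = 0.5          # learning rate
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

for epoch in range(10):
    updated = False
    for (x1, x2), target in data:
        s = w[0] * x1 + w[1] * x2
        out = 1 if s >= T else 0        # fires when the weighted sum reaches T
        if out != target:
            # w_i <- w_i + eta * (target - out) * x_i
            w[0] += eta * (target - out) * x1
            w[1] += eta * (target - out) * x2
            updated = True
    if not updated:
        break

print(w)  # [0.7, 0.6] – matches the hand computation above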
The same rule can be applied to the OR function:
A   B   A OR B
0   0   0
0   1   1
1   0   1
1   1   1
Problem: For a network with inputs x1 = 0.35 and x2 = 0.9, hidden neurons H3 and H4, output neuron O5, sigmoid activations, initial weights w13 = 0.1, w23 = 0.8, w14 = 0.4, w24 = 0.6, w35 = 0.3, w45 = 0.9, target output 0.5 and learning rate η = 1, perform a forward pass and a backward pass, then another forward pass.
Solution:
Forward pass: compute outputs for y3, y4 and y5.
For H3: a1 = w13 x1 + w23 x2 = 0.1 × 0.35 + 0.8 × 0.9 = 0.755
y3 = 1 / (1 + e^(−0.755)) = 0.68
For H4: a2 = w14 x1 + w24 x2 = 0.4 × 0.35 + 0.6 × 0.9 = 0.68
y4 = 1 / (1 + e^(−0.68)) = 0.66
For O5: a3 = w35 y3 + w45 y4 = 0.3 × 0.68 + 0.9 × 0.66 ≈ 0.801 (keeping the unrounded values of y3 and y4)
y5 = 1 / (1 + e^(−0.801)) = 0.69 (network output)
Error = y_target − y5 = 0.5 − 0.69 = −0.19
Backward pass: compute the error terms (δ) and update the weights.
For the output neuron: δ5 = y5 (1 − y5)(y_target − y5) = 0.69 × 0.31 × (−0.19) ≈ −0.0406
For the hidden neurons: δ3 = y3 (1 − y3) w35 δ5 = 0.68 × 0.32 × 0.3 × (−0.0406) ≈ −0.00265
δ4 = y4 (1 − y4) w45 δ5 = 0.66 × 0.34 × 0.9 × (−0.0406) ≈ −0.0082
Weight updates (η = 1):
Δw23 = η δ3 x2 = 1 × (−0.00265) × 0.9 = −2.385 × 10^−3
w23(new) = w23(old) + Δw23 = 0.8 − 2.385 × 10^−3 = 0.7976
Δw13 = η δ3 x1 = 1 × (−0.00265) × 0.35 = −9.275 × 10^−4
w13(new) = w13(old) + Δw13 = 0.1 − 9.275 × 10^−4 = 0.0991
Δw24 = η δ4 x2 = 1 × (−0.0082) × 0.9 = −7.38 × 10^−3
w24(new) = w24(old) + Δw24 = 0.6 − 7.38 × 10^−3 = 0.5926
Δw14 = η δ4 x1 = 1 × (−0.0082) × 0.35 = −2.87 × 10^−3
w14(new) = w14(old) + Δw14 = 0.4 − 2.87 × 10^−3 = 0.3971
Δw35 = η δ5 y3 = 1 × (−0.0406) × 0.68 ≈ −0.0276
w35(new) = w35(old) + Δw35 = 0.3 − 0.0276 = 0.2724
Δw45 = η δ5 y4 = 1 × (−0.0406) × 0.66 ≈ −0.0268
w45(new) = w45(old) + Δw45 ≈ 0.9 − 0.0268 ≈ 0.8731
Perform another forward pass:
Forward pass: compute outputs for y3, y4 and y5 using the updated weights.
For H3: a1 = w13 x1 + w23 x2 = 0.0991 × 0.35 + 0.7976 × 0.9 = 0.7525
y3 = 1 / (1 + e^(−0.7525)) = 0.6797
For H4: a2 = w14 x1 + w24 x2 = 0.3971 × 0.35 + 0.5926 × 0.9 = 0.6723
y4 = 1 / (1 + e^(−0.6723)) = 0.6620
For O5: a3 = w35 y3 + w45 y4 = 0.2724 × 0.6797 + 0.8731 × 0.6620 = 0.7631
y5 = 1 / (1 + e^(−0.7631)) = 0.6820 (network output)
Error = y_target − y5 = 0.5 − 0.6820 = −0.1820
The error has decreased in magnitude from −0.19 to −0.182, so the weight update moved the output towards the target.
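A minimal sketch in Python that reproduces the numbers of Problem 01 (inputs x1 = 0.35, x2 = 0.9, target 0.5, η = 1, and the initial weights listed in the solution; the variable names are my own):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs, target and learning rate from Problem 01
x1, x2, target, eta = 0.35, 0.9, 0.5, 1.0
# Initial weights: input -> hidden (w13, w23, w14, w24), hidden -> output (w35, w45)
w13, w23, w14, w24, w35, w45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

def forward():
    y3 = sigmoid(w13 * x1 + w23 * x2)   # hidden unit H3
    y4 = sigmoid(w14 * x1 + w24 * x2)   # hidden unit H4
    y5 = sigmoid(w35 * y3 + w45 * y4)   # output unit O5
    return y3, y4, y5

y3, y4, y5 = forward()
print(y3, y4, y5)        # ~0.680, ~0.664, ~0.690
print(target - y5)       # ~ -0.19

# Backward pass: delta terms for sigmoid units
d5 = y5 * (1 - y5) * (target - y5)      # ~ -0.0406
d3 = y3 * (1 - y3) * w35 * d5           # ~ -0.00265
d4 = y4 * (1 - y4) * w45 * d5           # ~ -0.0082

# Weight updates: w <- w + eta * delta * input feeding that weight
w13 += eta * d3 * x1; w23 += eta * d3 * x2
w14 += eta * d4 * x1; w24 += eta * d4 * x2
w35 += eta * d5 * y3; w45 += eta * d5 * y4

y3, y4, y5 = forward()
print(y5, target - y5)   # ~0.682, ~ -0.182 (the error is smaller)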
Problem – 02:
For a network with three inputs x1, x2 and x3, hidden neurons 4 and 5, and output neuron 6 (each neuron with a bias θ), assume that the neurons have a sigmoid activation function and perform a forward pass and a backward pass on the network. Assume that the target output y is 1 and the learning rate is 0.9. Perform another forward pass.
Solution:
Forward pass: compute outputs for y4, y5 and y6.
a4 = w14 x1 + w24 x2 + w34 x3 + θ4
δ4 = y4 (1 − y4) w46 δ6
The weights w35 and w15 feed neuron 5, so their updates use δ5 = y5 (1 − y5) w56 δ6 ≈ −0.0065 (not δ4):
Δw35 = η δ5 x3 = 0.9 × (−0.0065) × 1 = −5.85 × 10^−3
w35(new) = w35(old) + Δw35 = 0.2 − 5.85 × 10^−3 ≈ 0.194
Δw15 = η δ5 x1 = 0.9 × (−0.0065) × 1 = −5.85 × 10^−3
w15(new) = w15(old) + Δw15 = −0.3 − 5.85 × 10^−3 ≈ −0.306
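Since the network diagram and initial values for Problem 02 are not reproduced above, here is a generic sketch of the same computation pattern (3 inputs, 2 hidden neurons, 1 output, each neuron with a bias θ). The numeric values below are assumptions taken from a common textbook example that is consistent with the δ4 ≈ −0.0087 shown above, not data confirmed by this document:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed placeholder values (not confirmed as the actual Problem 02 data)
x = [1.0, 0.0, 1.0]                     # inputs x1, x2, x3
w4 = [0.2, 0.4, -0.5]; theta4 = -0.4    # weights/bias into hidden neuron 4
w5 = [-0.3, 0.1, 0.2]; theta5 = 0.2     # weights/bias into hidden neuron 5
w46, w56, theta6 = -0.3, -0.2, 0.1      # weights/bias into output neuron 6
target, eta = 1.0, 0.9

# Forward pass: a = sum(w * x) + theta, y = sigmoid(a)
y4 = sigmoid(sum(wi * xi for wi, xi in zip(w4, x)) + theta4)
y5 = sigmoid(sum(wi * xi for wi, xi in zip(w5, x)) + theta5)
y6 = sigmoid(w46 * y4 + w56 * y5 + theta6)

# Backward pass: output delta first, then hidden deltas
d6 = y6 * (1 - y6) * (target - y6)
d4 = y4 * (1 - y4) * w46 * d6
d5 = y5 * (1 - y5) * w56 * d6

# Weights feeding neuron 5 are updated with d5 and the inputs x
w5 = [wi + eta * d5 * xi for wi, xi in zip(w5, x)]
theta5 += eta * d5
print(y6, d4, d5, w5)   # with these assumed values: y6 ~ 0.474, d4 ~ -0.0087, d5 ~ -0.0065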