Backpropagation Math


Activation Functions:

1. Step function: Mathematically it can be defined as

f(z) = 0 if z < 0, and f(z) = 1 if z ≥ 0

It provides possible outputs = {0, 1}. It can't provide multi-valued outputs; for example, it can't be used for a multi-class classification problem.
2. Signum function: Mathematically it can be defined as

f(z) = −1 if z < 0, and f(z) = +1 if z > 0

It provides possible outputs = {−1, 1}.
3. Linear function: Mathematically it can be defined as f(z) = cz (often simply f(z) = z), so the output is proportional to the input and the function cannot introduce any non-linearity by itself.
4. ReLU function: ReLU stands for Rectified Linear Unit.
Mathematically, it can be defined as,
f(x) = max(0, x)

Although it gives an impression of a linear function, ReLU has a derivative function and
allows for backpropagation while simultaneously making it computationally efficient.
The main advantage of the ReLU activation function is that it does not activate all the neurons at the same time, so it is far more computationally efficient than the sigmoid and tanh functions. ReLU also accelerates the convergence of gradient descent towards the minimum of the loss function because of its linear, non-saturating behaviour for positive inputs.
The drawback of the ReLU function is the Dying ReLU problem. At any given time some neurons are active and some are inactive, and in some situations a neuron that has become inactive never becomes active again; this is known as the Dying ReLU problem.
On the negative side of the graph the gradient is zero, so during backpropagation the weights and biases of such neurons are not updated. This can create dead neurons that never get activated.
5. Leaky ReLU function: Leaky ReLU is an improved version of the ReLU function that addresses the Dying ReLU problem by giving the function a small positive slope in the negative region.
Mathematically, it can be defined as,
f(x) = max(0.1x, x)

The Leaky ReLU function enables backpropagation even for negative input values.

However, the predictions may not be consistent for negative input values, and because the gradient for negative values is small, learning the model parameters can be time-consuming.
6. Tanh function: It is a non-linear activation function. Its output ranges from -1 to 1.
Mathematically, it can be defined as
f(z) = (e^z − e^(−z)) / (e^z + e^(−z))
where z is the weighted sum of the neuron's inputs.
The output of the tanh activation function is zero-centered, so we can easily map the output values as strongly negative, neutral, or strongly positive. The drawback of this function is the vanishing gradient problem.
7. Sigmoid function: This function takes any real value as input and outputs values in the
range of 0 to 1. Mathematically it can be defined as
f(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))
where z is the weighted sum.
It is commonly used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range 0 to 1, the sigmoid is the right choice because of its range. The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in the output values, so it can also be used in the backpropagation algorithm.
The output of the sigmoid function lies in the range 0 to 1 and can therefore be interpreted as a probability. Suppose we have five output values of 0.8, 0.9, 0.7, 0.8 and 0.6 respectively. We can't move forward with the sigmoid activation function here, because these values don't make sense as class probabilities: the probabilities of all the classes should sum to 1. This is where the softmax activation function comes in.
8. Softmax function: It is a combination of multiple sigmoid functions: it converts a vector of scores into class probabilities that are non-negative and sum to 1, which is exactly what the multi-class situation above requires. (A short code sketch of these activation functions follows this list.)
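
To make the definitions above concrete, here is a minimal Python sketch of these activation functions. It assumes NumPy is available; the function names and the example scores are illustrative choices, not taken from the notes.

import numpy as np

def step(z):                       # item 1: outputs in {0, 1}
    return np.where(z >= 0, 1.0, 0.0)

def signum(z):                     # item 2: outputs in {-1, 1}
    return np.where(z < 0, -1.0, 1.0)

def relu(z):                       # item 4: max(0, x)
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.1):      # item 5: small slope on the negative side
    return np.maximum(slope * z, z)

def tanh(z):                       # item 6: (e^z - e^-z) / (e^z + e^-z)
    return np.tanh(z)

def sigmoid(z):                    # item 7: 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                    # item 8: scores -> probabilities summing to 1
    e = np.exp(z - np.max(z))      # subtracting the max improves numerical stability
    return e / e.sum()

scores = np.array([0.8, 0.9, 0.7, 0.8, 0.6])
print(softmax(scores), softmax(scores).sum())   # probabilities that sum to 1.0

Unlike the raw per-class sigmoid values 0.8, 0.9, 0.7, 0.8, 0.6 discussed above, the softmax outputs form a proper probability distribution.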
Learning Rule:

1. Perceptron Rule:
Problem – 01: w1 = 1.2, w2 = 0.6, threshold T = 1, learning rate η = 0.5. Train a perceptron to realize the AND function (A^B) given by the truth table below.

A B A^B
0 0 0
0 1 0
1 0 0
1 1 1

Solution:

Step – 01:
For input (0, 0): w1·x1 + w2·x2 = 1.2×0 + 0.6×0 = 0 < 1, so O = 0 (O = target, no update)
For input (0, 1): 1.2×0 + 0.6×1 = 0.6 < 1, so O = 0 (O = target, no update)
For input (1, 0): 1.2×1 + 0.6×0 = 1.2 > 1, so O = 1 (O ≠ target, update the weights)
w1 = w1 + Δw1 = w1 + η(t − O)×x1 = 1.2 + 0.5×(0 − 1)×1 = 0.7
w2 = w2 + Δw2 = w2 + η(t − O)×x2 = 0.6 + 0.5×(0 − 1)×0 = 0.6
(here t is the target output for the pattern, not the threshold T)

w1 = 0.7, w2 = 0.6

Step – 02:
For input (0, 0): 0.7×0 + 0.6×0 = 0 < 1, so O = 0 (= target)
For input (0, 1): 0.7×0 + 0.6×1 = 0.6 < 1, so O = 0 (= target)
For input (1, 0): 0.7×1 + 0.6×0 = 0.7 < 1, so O = 0 (= target)
For input (1, 1): 0.7×1 + 0.6×1 = 1.3 > 1, so O = 1 (= target)
All four patterns are classified correctly, so training stops with w1 = 0.7, w2 = 0.6.

Problem – 02: w1 = 0.6, w2 = 0.6, threshold T = 1, learning rate η = 0.5. Train a perceptron to realize the OR function (A OR B) given by the truth table below.

A B A OR B
0 0 0
0 1 1
1 0 1
1 1 1

Solution:

Step – 01:
For input (0, 0): w1·x1 + w2·x2 = 0.6×0 + 0.6×0 = 0 < 1, so O = 0 (= target)
For input (0, 1): 0.6×0 + 0.6×1 = 0.6 < 1, so O = 0 (≠ target 1, update the weights)
w1 = w1 + Δw1 = w1 + η(t − O)×x1 = 0.6 + 0.5×(1 − 0)×0 = 0.6
w2 = w2 + Δw2 = w2 + η(t − O)×x2 = 0.6 + 0.5×(1 − 0)×1 = 1.1

w1 = 0.6, w2 = 1.1

Step – 02:
For input (0, 0): 0.6×0 + 1.1×0 = 0 < 1, so O = 0 (= target)
For input (0, 1): 0.6×0 + 1.1×1 = 1.1 > 1, so O = 1 (= target)
For input (1, 0): 0.6×1 + 1.1×0 = 0.6 < 1, so O = 0 (≠ target 1, update the weights)
w1 = w1 + Δw1 = w1 + η(t − O)×x1 = 0.6 + 0.5×(1 − 0)×1 = 1.1
w2 = w2 + Δw2 = w2 + η(t − O)×x2 = 1.1 + 0.5×(1 − 0)×0 = 1.1

w1 = 1.1, w2 = 1.1

Step – 03:
For input (0, 0): 1.1×0 + 1.1×0 = 0 < 1, so O = 0 (= target)
For input (0, 1): 1.1×0 + 1.1×1 = 1.1 > 1, so O = 1 (= target)
For input (1, 0): 1.1×1 + 1.1×0 = 1.1 > 1, so O = 1 (= target)
For input (1, 1): 1.1×1 + 1.1×1 = 2.2 > 1, so O = 1 (= target)
All four patterns are classified correctly, so training stops with w1 = 1.1, w2 = 1.1. (A runnable sketch of this training procedure is given after the delta rule below.)
2. Delta Rule: Unlike the perceptron rule, the delta rule is derived from gradient descent on the squared error of the unit's continuous (unthresholded) output o = Σ wi·xi, giving the update Δwi = η(t − o)xi; it therefore adjusts the weights even for patterns that the thresholded output already classifies correctly.
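
The perceptron training loop used in Problems 01 and 02 above can be written as a short Python sketch (NumPy assumed; function and variable names are illustrative). Note that this sketch keeps sweeping through the remaining patterns after an update instead of restarting the step as the hand calculation does; for these two problems the final weights come out the same. Replacing the thresholded output o with the raw sum w·x in the update line would give the delta rule just described.

import numpy as np

def train_perceptron(patterns, targets, w, threshold=1.0, eta=0.5, max_epochs=20):
    w = np.array(w, dtype=float)
    for _ in range(max_epochs):
        updated = False
        for x, t in zip(patterns, targets):
            x = np.asarray(x, dtype=float)
            o = 1.0 if w @ x > threshold else 0.0   # step output against the threshold
            if o != t:
                w += eta * (t - o) * x              # perceptron rule: w <- w + eta*(t - o)*x
                updated = True
        if not updated:                             # stop once a full pass makes no errors
            break
    return w

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(train_perceptron(patterns, [0, 0, 0, 1], w=[1.2, 0.6]))   # AND: [0.7, 0.6], as in Problem 01
print(train_perceptron(patterns, [0, 1, 1, 1], w=[0.6, 0.6]))   # OR:  [1.1, 1.1], as in Problem 02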

Back Propagation Algorithm on Multi-Layer Perceptron Network
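
For reference, here is the standard derivation behind the δ formulas used in the worked problems below (added here as a reminder; it is not part of the original problem statements). For a sigmoid unit y = 1 / (1 + e^(−a)) and squared error E = ½(t − y)², the sigmoid derivative is dy/da = y(1 − y), so:

δ_output = y(1 − y)(t − y)
δ_hidden j = yj(1 − yj) × Σk (δk × wjk), summing over the units k that j feeds into
Δwij = η × δj × (input along the connection), and θj(new) = θj(old) + η × δj for a bias

These are exactly the update formulas applied in Problems 01 and 02.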


Problem – 01: A network has inputs x1 = 0.35 and x2 = 0.9, hidden neurons H3 and H4, an output neuron O5, and initial weights w13 = 0.1, w23 = 0.8, w14 = 0.4, w24 = 0.6, w35 = 0.3, w45 = 0.9. Assume that the neurons have a sigmoid activation function; perform a forward pass and a backward pass on the network. Assume that the target output y is 0.5 and the learning rate is 1. Perform another forward pass.

Solution: Forward Pass:

For H3: a3 = Σ wi3·xi = w13·x1 + w23·x2 = 0.1×0.35 + 0.8×0.9 = 0.755
y3 = 1 / (1 + e^(−0.755)) = 0.68

For H4: a4 = w14·x1 + w24·x2 = 0.4×0.35 + 0.6×0.9 = 0.68
y4 = 1 / (1 + e^(−0.68)) = 0.6637 ≈ 0.66

For O5: a5 = w35·y3 + w45·y4 = 0.3×0.68 + 0.9×0.6637 = 0.801
y5 = 1 / (1 + e^(−0.801)) = 0.69 (network output)

Error = y_target − y5 = 0.5 − 0.69 = −0.19

Backward pass: Compute δ3, δ4 and δ5.

For the output unit,
δ5 = y5(1 − y5)(y_target − y5) = 0.69(1 − 0.69)(0.5 − 0.69) = −0.0406

For the hidden units,
δ3 = y3(1 − y3)·δ5·w35 = 0.68(1 − 0.68)(−0.0406)(0.3) = −0.00265
δ4 = y4(1 − y4)·δ5·w45 = 0.66(1 − 0.66)(−0.0406)(0.9) = −0.0082

Compute new weights:

Δw45 = η·δ5·y4 = 1 × (−0.0406) × 0.6637 = −0.0269
w45(new) = w45(old) + Δw45 = 0.9 − 0.0269 = 0.8731

Δw14 = η·δ4·x1 = 1 × (−0.0082) × 0.35 = −0.00287
w14(new) = w14(old) + Δw14 = 0.4 − 0.00287 = 0.3971

Δw35 = η·δ5·y3 = 1 × (−0.0406) × 0.68 = −0.027608
w35(new) = w35(old) + Δw35 = 0.3 − 0.027608 = 0.2724

Δw23 = η·δ3·x2 = 1 × (−0.00265) × 0.9 = −2.385 × 10^−3
w23(new) = w23(old) + Δw23 = 0.8 − 2.385 × 10^−3 = 0.7976

Δw13 = η·δ3·x1 = 1 × (−0.00265) × 0.35 = −9.275 × 10^−4
w13(new) = w13(old) + Δw13 = 0.1 − 9.275 × 10^−4 = 0.0991

Δw24 = η·δ4·x2 = 1 × (−0.0082) × 0.9 = −7.38 × 10^−3
w24(new) = w24(old) + Δw24 = 0.6 − 7.38 × 10^−3 = 0.5926
Perform another forward pass:
Forward pass: compute the outputs y3, y4 and y5.

For H3: a3 = w13·x1 + w23·x2 = 0.0991×0.35 + 0.7976×0.9 = 0.7525
y3 = 1 / (1 + e^(−0.7525)) = 0.6797

For H4: a4 = w14·x1 + w24·x2 = 0.3971×0.35 + 0.5926×0.9 = 0.6723
y4 = 1 / (1 + e^(−0.6723)) = 0.6620

For O5: a5 = w35·y3 + w45·y4 = 0.2724×0.6797 + 0.8731×0.6620 = 0.7631
y5 = 1 / (1 + e^(−0.7631)) = 0.6820 (network output)

Error = y_target − y5 = 0.5 − 0.6820 = −0.1820, which is smaller in magnitude than the previous error of −0.19.
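
The whole of Problem 01 can be checked with a short Python sketch (NumPy assumed; the variable names are illustrative, not from the notes). The two printed errors are approximately −0.19 and −0.182, matching the hand calculation.

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x1, x2, target, eta = 0.35, 0.9, 0.5, 1.0
w13, w23, w14, w24, w35, w45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

def forward():
    y3 = sigmoid(w13 * x1 + w23 * x2)
    y4 = sigmoid(w14 * x1 + w24 * x2)
    y5 = sigmoid(w35 * y3 + w45 * y4)
    return y3, y4, y5

y3, y4, y5 = forward()
print("error before:", target - y5)      # about -0.19

# Backward pass: delta for the output unit, then the hidden units.
d5 = y5 * (1 - y5) * (target - y5)
d3 = y3 * (1 - y3) * d5 * w35
d4 = y4 * (1 - y4) * d5 * w45

# Weight updates: w <- w + eta * delta * (input along the connection).
w45 += eta * d5 * y4
w35 += eta * d5 * y3
w14 += eta * d4 * x1
w24 += eta * d4 * x2
w13 += eta * d3 * x1
w23 += eta * d3 * x2

y3, y4, y5 = forward()
print("error after:", target - y5)       # about -0.182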
Problem – 02:
A network has inputs x1 = 1, x2 = 0, x3 = 1, hidden neurons H4 and H5, an output neuron O6, initial weights w14 = 0.2, w24 = 0.4, w34 = −0.5, w15 = −0.3, w25 = 0.1, w35 = 0.2, w46 = −0.3, w56 = −0.2, and biases θ4 = −0.4, θ5 = 0.2, θ6 = 0.1. Assume that the neurons have a sigmoid activation function; perform a forward pass and a backward pass on the network. Assume that the target output y is 1 and the learning rate is 0.9. Perform another forward pass.
Solution:
Forward pass: Compute the outputs y4, y5 and y6.

a4 = w14·x1 + w24·x2 + w34·x3 + θ4 = (0.2×1) + (0.4×0) + (−0.5×1) + (−0.4) = −0.7
O(H4) = y4 = 1 / (1 + e^(0.7)) = 0.332

a5 = w15·x1 + w25·x2 + w35·x3 + θ5 = (−0.3×1) + (0.1×0) + (0.2×1) + 0.2 = 0.1
O(H5) = y5 = 1 / (1 + e^(−0.1)) = 0.525

a6 = w46·y4 + w56·y5 + θ6 = (−0.3×0.332) + (−0.2×0.525) + 0.1 = −0.105
O(O6) = y6 = 1 / (1 + e^(0.105)) = 0.474

Error = y_target − y6 = 1 − 0.474 = 0.526

Backward Pass: Compute δ4, δ5 and δ6.

For the output unit: δ6 = y6(1 − y6)(y_target − y6) = 0.474 × (1 − 0.474) × (1 − 0.474) = 0.1311

For the hidden units:
δ5 = y5(1 − y5)·w56·δ6 = 0.525 × (1 − 0.525) × (−0.2) × 0.1311 = −0.0065
δ4 = y4(1 − y4)·w46·δ6 = 0.332 × (1 − 0.332) × (−0.3) × 0.1311 = −0.0087

Compute new weights:

Δw46 = η·δ6·y4 = 0.9 × 0.1311 × 0.332 = 0.03917
w46(new) = w46(old) + Δw46 = −0.3 + 0.03917 = −0.261

Δw56 = η·δ6·y5 = 0.9 × 0.1311 × 0.525 = 0.0619
w56(new) = w56(old) + Δw56 = −0.2 + 0.0619 = −0.138

Δw35 = η·δ5·x3 = 0.9 × (−0.0065) × 1 = −5.85 × 10^−3
w35(new) = w35(old) + Δw35 = 0.2 − 5.85 × 10^−3 = 0.194

Δw25 = η·δ5·x2 = 0.9 × (−0.0065) × 0 = 0
w25(new) = w25(old) + Δw25 = 0.1 + 0 = 0.1

Δw15 = η·δ5·x1 = 0.9 × (−0.0065) × 1 = −5.85 × 10^−3
w15(new) = w15(old) + Δw15 = −0.3 − 5.85 × 10^−3 = −0.306

Δw24 = η·δ4·x2 = 0.9 × (−0.0087) × 0 = 0
w24(new) = w24(old) + Δw24 = 0.4 + 0 = 0.4

Δw34 = η·δ4·x3 = 0.9 × (−0.0087) × 1 = −7.83 × 10^−3
w34(new) = w34(old) + Δw34 = −0.5 − 7.83 × 10^−3 = −0.508

Δw14 = η·δ4·x1 = 0.9 × (−0.0087) × 1 = −7.83 × 10^−3
w14(new) = w14(old) + Δw14 = 0.2 − 7.83 × 10^−3 = 0.192

Compute the new bias weights:

θ6(new) = θ6(old) + η·δ6 = 0.1 + (0.9 × 0.1311) = 0.218
θ5(new) = θ5(old) + η·δ5 = 0.2 + (0.9 × (−0.0065)) = 0.194
θ4(new) = θ4(old) + η·δ4 = −0.4 + (0.9 × (−0.0087)) = −0.408

Now, perform another forward pass:

Compute the outputs y4, y5 and y6.

a4 = w14·x1 + w24·x2 + w34·x3 + θ4 = (0.192×1) + (0.4×0) + (−0.508×1) + (−0.408) = −0.724
O(H4) = y4 = 1 / (1 + e^(0.724)) = 0.327

a5 = w15·x1 + w25·x2 + w35·x3 + θ5 = (−0.306×1) + (0.1×0) + (0.194×1) + 0.194 = 0.082
O(H5) = y5 = 1 / (1 + e^(−0.082)) = 0.520

a6 = w46·y4 + w56·y5 + θ6 = (−0.261×0.327) + (−0.138×0.520) + 0.218 = 0.061
O(O6) = y6 = 1 / (1 + e^(−0.061)) = 0.515 (network output)

Error = y_target − y6 = 1 − 0.515 = 0.485, which is smaller than the previous error of 0.526.
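
Problem 02, including the bias updates, can likewise be checked with a short Python sketch (NumPy assumed; the variable names are illustrative, not from the notes). The two printed errors are approximately 0.526 and 0.485, matching the hand calculation.

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 0.0, 1.0])                 # inputs x1, x2, x3
target, eta = 1.0, 0.9
w4 = np.array([0.2, 0.4, -0.5]); b4 = -0.4    # weights and bias into H4
w5 = np.array([-0.3, 0.1, 0.2]); b5 = 0.2     # weights and bias into H5
w6 = np.array([-0.3, -0.2]);     b6 = 0.1     # weights and bias into O6

def forward():
    y4 = sigmoid(w4 @ x + b4)
    y5 = sigmoid(w5 @ x + b5)
    y6 = sigmoid(w6 @ np.array([y4, y5]) + b6)
    return y4, y5, y6

y4, y5, y6 = forward()
print("error before:", target - y6)           # about 0.526

# Backward pass: delta for the output unit, then the hidden units.
d6 = y6 * (1 - y6) * (target - y6)
d5 = y5 * (1 - y5) * w6[1] * d6
d4 = y4 * (1 - y4) * w6[0] * d6

# Weight and bias updates: w <- w + eta * delta * input, theta <- theta + eta * delta.
w6 += eta * d6 * np.array([y4, y5]); b6 += eta * d6
w5 += eta * d5 * x;                  b5 += eta * d5
w4 += eta * d4 * x;                  b4 += eta * d4

y4, y5, y6 = forward()
print("error after:", target - y6)            # about 0.485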

Convolutional Neural Network (CNN)
