8-10. Backpropagation Algorithm
CSE Department
National Institute of Technology Rourkela
References:
The Slides are prepared from the following major source:
NIT Rourkela Sibarama Panigrahi & Puneet Kumar Jain “Deep Learning: Course Introduction”
• Feedforward Neural Networks (a.k.a. multilayered network of neurons)
• The input to the network is an n-dimensional vector
• The network contains L−1 hidden layers (2, in this case) having n neurons each
• Finally, there is one output layer containing k neurons (say, corresponding to k classes)
• Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (a_i and h_i are vectors)
• The input layer can be called the 0-th layer and the output layer can be called the L-th layer
• W_i ∈ R^{n×n} and b_i ∈ R^n are the weight matrix and bias vector between layers i−1 and i (0 < i < L)
• The output of the network is h_L = ŷ = f(x)

[Figure: a feedforward network with inputs x_1, x_2, ..., x_n; pre-activation/activation pairs (a_1, h_1), (a_2, h_2); output pre-activation a_3; and parameters (W_1, b_1), (W_2, b_2), (W_3, b_3)]
• The pre-activation at layer i is given by

  a_i = b_i + W_i h_{i−1}

• The activation at layer i is given by

  h_i = g(a_i)

  where g is called the activation function (for example, logistic, tanh, linear, etc.)

• The activation at the output layer is given by

  f(x) = h_L = O(a_L)

  where O is the output activation function (for example, softmax, linear, etc.)
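The layer recurrence above maps directly to code. A minimal NumPy sketch of the forward pass, assuming a logistic activation g and a pluggable output activation O (the layer sizes and random initialization are illustrative, not from the slides):

```python
import numpy as np

def forward(x, weights, biases, g, O):
    """Forward pass: a_i = b_i + W_i h_{i-1}, h_i = g(a_i), h_L = O(a_L)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                        # pre-activation at layer i
        h = g(a)                             # activation at layer i
    a_L = biases[-1] + weights[-1] @ h       # pre-activation at the output layer
    return O(a_L)                            # h_L = y_hat = f(x)

logistic = lambda a: 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 4, 3                                  # input size, number of classes
sizes = [n, n, n, k]                         # L-1 = 2 hidden layers of n neurons each
weights = [rng.standard_normal((m, p)) for p, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

y_hat = forward(rng.standard_normal(n), weights, biases, logistic, lambda a: a)
print(y_hat.shape)  # (3,)
```

Here the output activation is left as the identity (a linear O); the choice of O is discussed below.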
• Data: {x_i, y_i}_{i=1}^N
• Model (for the network shown, with 2 hidden layers):

  ŷ_i = f(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3)
The story so far...
• Recall our gradient descent algorithm

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize w_0, b_0;
  while t++ < max_iterations do
    w_{t+1} ← w_t − η∇w_t;
    b_{t+1} ← b_t − η∇b_t;
  end

• We can write it more concisely as

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ_0 = [w_0, b_0];
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η∇θ_t;
  end

• where ∇θ_t = [∂L(θ)/∂w_t, ∂L(θ)/∂b_t]^T
• Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L], so we initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0] and use

  ∇θ_t = [∂L(θ)/∂W_{1,t}, ..., ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, ..., ∂L(θ)/∂b_{L,t}]^T

• We can still use the same algorithm for learning the parameters of our model
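The concise update θ_{t+1} ← θ_t − η∇θ_t is the same regardless of how many parameters θ contains. A sketch where θ is a list of NumPy arrays and `grad` is assumed to return ∇θ (the quadratic toy objective is illustrative only):

```python
import numpy as np

def gradient_descent(theta, grad, eta=0.1, max_iterations=1000):
    """theta_{t+1} <- theta_t - eta * grad(theta_t), applied to every parameter."""
    for t in range(max_iterations):
        g = grad(theta)
        theta = [p - eta * gp for p, gp in zip(theta, g)]
    return theta

# Toy objective L(theta) = sum of squared entries; its gradient is 2 * theta.
grad = lambda theta: [2.0 * p for p in theta]
theta0 = [np.array([3.0, -2.0]), np.array([[1.0, 4.0]])]  # a "b" and a "W"
theta_star = gradient_descent(theta0, grad, eta=0.1, max_iterations=100)
print(theta_star[0])  # close to [0, 0]
```

For the real network, `grad` is exactly what backpropagation (derived below) computes.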
• Except that now our ∇θ looks much more nasty
We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ, which is composed of
  ∇W_1, ∇W_2, ..., ∇W_{L−1} ∈ R^{n×n} and ∇W_L ∈ R^{k×n},
  ∇b_1, ∇b_2, ..., ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?
• Output Functions and Loss Functions
• The choice of loss function depends on the problem at hand
• We will illustrate this with the help of two examples
• Consider our movie example again, but this time we are interested in predicting ratings

  y_i = {7.5  8.2  7.7}  (imdb Rating, Critics Rating, RT Rating)

• Here y_i ∈ R^3
• The loss function should capture how much ŷ_i deviates from y_i
• If y_i ∈ R^n then the squared error loss can capture this deviation:

  L(θ) = (1/N) Σ_{i=1}^N Σ_{j=1}^3 (ŷ_ij − y_ij)^2

[Figure: a neural network with L−1 hidden layers; the input x_i contains features such as isActor Damon, isDirector Nolan]
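The squared error loss above is a one-liner. A sketch with hypothetical rating data (the 1/N factor and the sum over the three outputs follow the formula on this slide):

```python
import numpy as np

def squared_error_loss(y_hat, y):
    """L(theta) = (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2."""
    N = y.shape[0]
    return np.sum((y_hat - y) ** 2) / N

y = np.array([[7.5, 8.2, 7.7]])        # imdb, Critics, RT ratings for one movie
y_hat = np.array([[7.0, 8.0, 8.0]])    # model predictions (illustrative)
print(squared_error_loss(y_hat, y))    # (0.5^2 + 0.2^2 + 0.3^2) / 1 = 0.38
```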
• A related question: what should the output function 'O' be if y_i ∈ R?
• More specifically, can it be the logistic function?
• No, because it restricts ŷ_i to a value between 0 & 1, but we want ŷ_i ∈ R
• So, in such cases it makes sense to have 'O' as a linear function:

  f(x) = h_L = O(a_L) = W_O a_L + b_O

• ŷ_i = f(x_i) is no longer bounded between 0 and 1
• Now let us consider another problem for which a different loss function would be appropriate
• Suppose we want to classify an image into 1 of k classes, e.g.

  y = [1 0 0 0]  (Apple, Mango, Orange, Banana)
• Notice that y is a probability distribution
• Therefore we should also ensure that ŷ is a probability distribution
• What choice of the output activation 'O' will ensure this?

  a_L = W_L h_{L−1} + b_L

  ŷ_j = O(a_L)_j = e^{a_{L,j}} / Σ_{i=1}^k e^{a_{L,i}}
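This output activation can be implemented directly; subtracting max(a_L) before exponentiating is a standard trick to avoid overflow and does not change the result, since the function is invariant to shifting all inputs (the input vector below is illustrative):

```python
import numpy as np

def softmax(a_L):
    """y_hat_j = exp(a_L_j) / sum_i exp(a_L_i), computed stably."""
    e = np.exp(a_L - np.max(a_L))   # shift invariance: same result, no overflow
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, 0.1, -1.0])
y_hat = softmax(a_L)
print(y_hat.sum())          # 1.0 -- a valid probability distribution
print(bool(np.all(y_hat > 0)))  # True -- every entry is positive
```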
• Now that we have ensured that both y and ŷ are probability distributions, can you think of a function which captures the difference between them?
• Cross-entropy:

  L(θ) = −Σ_{c=1}^k y_c log ŷ_c

• Notice that y_c = 1 only when c = l (the true class label) and 0 otherwise, so the loss reduces to L(θ) = −log ŷ_l
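With y one-hot, the cross-entropy sum collapses to a single term, as noted above. A sketch (the class labels and predicted distribution are illustrative):

```python
import numpy as np

def cross_entropy(y, y_hat):
    """L(theta) = -sum_c y_c * log(y_hat_c); equals -log(y_hat_l) for one-hot y."""
    return -np.sum(y * np.log(y_hat))

y = np.array([1.0, 0.0, 0.0, 0.0])        # Apple is the true class
y_hat = np.array([0.7, 0.1, 0.1, 0.1])    # predicted distribution
print(cross_entropy(y, y_hat))            # -log(0.7), about 0.357
```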
• So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function:

  minimize  L(θ) = −(1/N) Σ_{i=1}^N log ŷ_{i,l}

  where l is the true class of the i-th training example
• But wait! Is ŷ_l a function of θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]?
• Yes, it is indeed a function of θ
Outputs             Real Values      Probabilities
Output Activation   Linear           Softmax
Loss Function       Squared Error    Cross Entropy

• Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often
• For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
• Backpropagation (Intuition)
Recall the two questions we set out to answer: we have now chosen the loss function L(θ); it remains to compute ∇θ, which is composed of ∇W_1, ..., ∇W_L and ∇b_1, ..., ∇b_L.
• Let us focus on this one weight (W_112)

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ_0;
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η∇θ_t;
  end

[Figure: the network with the weight W_112, in the first layer, highlighted]
• First let us take the simple case when we have a deep but thin network
• In this case it is easy to find the derivative by the chain rule:

  ∂L(θ)/∂W_111 = (∂L(θ)/∂ŷ)(∂ŷ/∂a_L1)(∂a_L1/∂h_21)(∂h_21/∂a_21)(∂a_21/∂h_11)(∂h_11/∂a_11)(∂a_11/∂W_111)

  ∂L(θ)/∂W_111 = (∂L(θ)/∂h_11)(∂h_11/∂W_111)   (just compressing the chain rule)

  ∂L(θ)/∂W_211 = (∂L(θ)/∂h_21)(∂h_21/∂W_211)

  ∂L(θ)/∂W_L11 = (∂L(θ)/∂a_L1)(∂a_L1/∂W_L11)

[Figure: a deep, thin network x_1 → a_11 → h_11 → a_21 → h_21 → ... → a_L1 → ŷ, with weights W_111, W_211, ..., W_L11 and loss L(θ)]
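The chain-rule product for the thin network can be checked numerically. A sketch, assuming logistic activations, one scalar weight per layer, and (for illustration only) a squared error loss; the hand-derived product of local derivatives is compared against a finite-difference estimate:

```python
import numpy as np

def sigma(a):                 # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w1, w2, w3):
    a1 = w1 * x;  h1 = sigma(a1)
    a2 = w2 * h1; h2 = sigma(a2)
    a3 = w3 * h2; y_hat = sigma(a3)
    return a1, h1, a2, h2, a3, y_hat

x, y, w1, w2, w3 = 0.5, 1.0, 0.3, -0.7, 1.2  # illustrative values
a1, h1, a2, h2, a3, y_hat = forward(x, w1, w2, w3)

# Chain rule, factor by factor, for dL/dW111 with L = (y_hat - y)^2:
dL_dyhat = 2 * (y_hat - y)
ds = lambda a: sigma(a) * (1 - sigma(a))     # derivative of the logistic function
grad_w1 = dL_dyhat * ds(a3) * w3 * ds(a2) * w2 * ds(a1) * x

# Finite-difference check of the same derivative
eps = 1e-6
L = lambda w: (forward(x, w, w2, w3)[-1] - y) ** 2
fd = (L(w1 + eps) - L(w1 - eps)) / (2 * eps)
print(abs(grad_w1 - fd) < 1e-8)  # the chain rule matches the numerical estimate
```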
Let us see an intuitive explanation of backpropagation before we get into the mathematical details.
• We get a certain loss at the output and we try to figure out who is responsible for this loss
• So, we talk to the output layer and say "Hey! You are not producing the desired output, better take responsibility."
• The output layer says "Well, I take responsibility for my part, but please understand that I am only as good as the hidden layer and weights below me." After all ...

  f(x) = ŷ = O(W_L h_{L−1} + b_L)

• So, we talk to W_L, b_L and h_L and ask them "What is wrong with you?"
• W_L and b_L take full responsibility, but h_L says "Well, please understand that I am only as good as the pre-activation layer"
• The pre-activation layer in turn says that it is only as good as the hidden layer and weights below it
• We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e., all the parameters of the model)
• But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do):

  ∂L(θ)/∂W_111 = (∂L(θ)/∂ŷ · ∂ŷ/∂a_3)(∂a_3/∂h_2 · ∂h_2/∂a_2)(∂a_2/∂h_1 · ∂h_1/∂a_1)(∂a_1/∂W_111)

  (the left-hand side "talks to the weight directly"; the factors on the right successively talk to the output layer, the previous hidden layer, the previous hidden layer again, and finally to the weights)
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases
• Backpropagation: Computing Gradients w.r.t. the Output Units
Let us first consider the partial derivative of the loss L(θ) = −log ŷ_l w.r.t. the i-th output.
∂L(θ)/∂ŷ_i = ∂(−log ŷ_l)/∂ŷ_i = −1_{(l=i)}/ŷ_l

We can now talk about the gradient w.r.t. the vector ŷ:

∇_ŷ L(θ) = [∂L(θ)/∂ŷ_1, ..., ∂L(θ)/∂ŷ_k]^T = −(1/ŷ_l)[1_{l=1}, 1_{l=2}, ..., 1_{l=k}]^T = −(1/ŷ_l) e_l

where e_l is the k-dimensional one-hot vector whose l-th element is 1.
What we are actually interested in is

∂L(θ)/∂a_{L,i} = ∂(−log ŷ_l)/∂a_{L,i} = (∂(−log ŷ_l)/∂ŷ_l)(∂ŷ_l/∂a_{L,i})

Does ŷ_l depend on a_{L,i}? Indeed, it does:

ŷ_l = exp(a_{L,l}) / Σ_i exp(a_{L,i})

Using the softmax derivative ∂ŷ_l/∂a_{L,i} = ŷ_l (1_{(l=i)} − ŷ_i), we get

∂(−log ŷ_l)/∂a_{L,i} = −(1/ŷ_l) · ŷ_l (1_{(l=i)} − ŷ_i) = −(1_{(l=i)} − ŷ_i)
So far we have derived the partial derivative w.r.t. the i-th element of a_L:

∂L(θ)/∂a_{L,i} = −(1_{l=i} − ŷ_i)

We can now write the gradient w.r.t. the vector a_L:

∇_{a_L} L(θ) = [∂L(θ)/∂a_{L,1}, ..., ∂L(θ)/∂a_{L,k}]^T = [−(1_{l=1} − ŷ_1), −(1_{l=2} − ŷ_2), ..., −(1_{l=k} − ŷ_k)]^T = −(e_l − ŷ)
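The closed form ∇_{a_L} L(θ) = −(e_l − ŷ) = ŷ − e_l can be verified numerically against the definition L(θ) = −log ŷ_l (the logits and true class below are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a_L = np.array([1.0, 2.0, 0.5, -1.0])
l = 1                                   # index of the true class
e_l = np.eye(len(a_L))[l]               # one-hot vector for class l

y_hat = softmax(a_L)
grad = -(e_l - y_hat)                   # the result derived above: y_hat - e_l

# Finite-difference check of dL/da_L for the loss L = -log(y_hat_l)
eps = 1e-6
fd = np.zeros_like(a_L)
for i in range(len(a_L)):
    d = np.zeros_like(a_L); d[i] = eps
    fd[i] = (-np.log(softmax(a_L + d)[l]) + np.log(softmax(a_L - d)[l])) / (2 * eps)
print(np.max(np.abs(grad - fd)) < 1e-8)  # True
```

This is why softmax plus cross-entropy is such a convenient pairing: the gradient at the output pre-activations is simply "prediction minus target".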
• Backpropagation: Computing Gradients w.r.t. Hidden Units
Chain rule along multiple paths: if a function p(z) can be written as a function of intermediate results q_m(z), then we have:

∂p(z)/∂z = Σ_m (∂p(z)/∂q_m(z))(∂q_m(z)/∂z)

In our case:
• p(z) is the loss function L(θ)
• z = h_{i,j}
• q_m(z) = a_{i+1,m}
Now consider ∂L(θ)/∂h_{ij}:

∂L(θ)/∂h_{ij} = Σ_{m=1}^{k} ( ∂L(θ)/∂a_{i+1,m} ) · ( ∂a_{i+1,m}/∂h_{ij} )
              = Σ_{m=1}^{k} ( ∂L(θ)/∂a_{i+1,m} ) · W_{i+1,m,j}

Now consider these two vectors,

∇_{a_{i+1}} L(θ) = [ ∂L(θ)/∂a_{i+1,1} , … , ∂L(θ)/∂a_{i+1,k} ]^T ;  W_{i+1,·,j} = [ W_{i+1,1,j} , … , W_{i+1,k,j} ]^T

W_{i+1,·,j} is the j-th column of W_{i+1}; see that,

(W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ) = Σ_{m=1}^{k} ( ∂L(θ)/∂a_{i+1,m} ) · W_{i+1,m,j}

[using a_{i+1} = W_{i+1} h_i + b_{i+1}]
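A quick numeric sketch of the dot-product identity above (the shapes and values are made up for illustration):

```python
# The sum over m of dL/da_{i+1,m} * W_{i+1,m,j} equals the dot product
# between the j-th column of W_{i+1} and the gradient vector ∇_{a_{i+1}} L.
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 3                      # a_{i+1} has k units, h_i has n units
W = rng.standard_normal((k, n))  # plays the role of W_{i+1}
grad_a = rng.standard_normal(k)  # plays the role of ∇_{a_{i+1}} L(θ)

j = 1
elementwise_sum = sum(grad_a[m] * W[m, j] for m in range(k))
dot_product = W[:, j] @ grad_a   # (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

# Stacking over all j gives the whole gradient w.r.t. h_i at once:
grad_h = W.T @ grad_a            # shape (n,)
```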
We have,

∂L(θ)/∂h_{ij} = (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

Stacking these entries for j = 1, …, n gives

∇_{h_i} L(θ) = [ ∂L(θ)/∂h_{i1} , … , ∂L(θ)/∂h_{in} ]^T
             = [ (W_{i+1,·,1})^T ∇_{a_{i+1}} L(θ) , … , (W_{i+1,·,n})^T ∇_{a_{i+1}} L(θ) ]^T
             = (W_{i+1})^T ( ∇_{a_{i+1}} L(θ) )
∇a i L (θ)
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∇a i L (θ) =
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇a i L (θ) =
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) W3 b3
h2
∂a ij
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∇a i L (θ) a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∇a i L (θ) = a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇a i L (θ) = a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇ aLi (θ) = .. a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇ aLi (θ) = .. a1
∂L (θ) ′
W1 b1
∂h
g (a i n )
in
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇ aLi (θ) = .. a1
∂L (θ) ′
W1 b1
∂h
g (a i n )
in
′ x1 x2 xn
= ∇ hLi (θ) Ⓢ[. . . , g (a i ),
k ...]
43
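The two results above, ∇_{h_i} L(θ) = (W_{i+1})^T ∇_{a_{i+1}} L(θ) and ∇_{a_i} L(θ) = ∇_{h_i} L(θ) ⊙ g′(a_i), can be sanity-checked against finite differences; the toy linear loss and the shapes below are made up for illustration, with g the logistic function:

```python
# Check ∇_{h_i} L = W_{i+1}^T ∇_{a_{i+1}} L and ∇_{a_i} L = ∇_{h_i} L ⊙ g'(a_i)
# on a single layer with a simple made-up loss L = v^T a_{i+1}.
import numpy as np

def g(z):                              # logistic activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, k = 3, 4
W_next = rng.standard_normal((k, n))   # W_{i+1}
b_next = rng.standard_normal(k)        # b_{i+1}
v = rng.standard_normal(k)             # defines the toy loss

def loss_from_a_i(a_i):
    h_i = g(a_i)                       # h_i = g(a_i)
    a_next = W_next @ h_i + b_next     # a_{i+1} = W_{i+1} h_i + b_{i+1}
    return v @ a_next                  # L = v^T a_{i+1}

a_i = rng.standard_normal(n)
grad_a_next = v                        # ∇_{a_{i+1}} L for this toy loss
grad_h = W_next.T @ grad_a_next        # ∇_{h_i} L(θ)
grad_a = grad_h * g(a_i) * (1 - g(a_i))  # ∇_{h_i} L(θ) ⊙ g'(a_i)
```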
• Backpropagation: Computing Gradients w.r.t. Parameters
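This step follows the same chain-rule argument as the activation gradients above; as a sketch, the standard result, written in the same notation, is:

```latex
% Each entry a_{ik} = \sum_j W_{i,kj}\, h_{i-1,j} + b_{ik} depends on
% W_{i,kj} only through h_{i-1,j}, so
\frac{\partial L(\theta)}{\partial W_{i,kj}}
  = \frac{\partial L(\theta)}{\partial a_{ik}}\; h_{i-1,j}
\quad\Longrightarrow\quad
\nabla_{W_i} L(\theta) = \nabla_{a_i} L(\theta)\; h_{i-1}^{\top},
\qquad
\nabla_{b_i} L(\theta) = \nabla_{a_i} L(\theta).
```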
• Backpropagation: Pseudo code
(The backward loop runs over the layers from L−1 down to 1.)
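The pseudo code can be sketched as a runnable implementation. The sketch below assumes logistic hidden units and a softmax output with cross-entropy loss, so ∇_{a_L} L(θ) = ŷ − e(y); all function and variable names are illustrative, not from the slides:

```python
# Backpropagation for an L-layer network: forward pass stores a_i and h_i,
# backward pass applies the gradient rules derived above, layer by layer.
import numpy as np

def g(z):                              # logistic activation
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, W, b):
    """Return pre-activations a[1..L] and activations h[0..L]."""
    L = len(W)
    h, a = {0: x}, {}
    for i in range(1, L + 1):
        a[i] = W[i] @ h[i - 1] + b[i]  # a_i = W_i h_{i-1} + b_i
        h[i] = g(a[i]) if i < L else softmax(a[i])
    return a, h

def backward(a, h, y, W):
    """Gradients of the cross-entropy loss w.r.t. all W_k, b_k."""
    L = len(W)
    e_y = np.zeros_like(h[L]); e_y[y] = 1.0
    grad_a = h[L] - e_y                    # ∇_{a_L} L = ŷ − e(y)
    gW, gb = {}, {}
    for k in range(L, 0, -1):              # k = L down to 1
        gW[k] = np.outer(grad_a, h[k - 1]) # ∇_{W_k} L = ∇_{a_k} L · h_{k-1}^T
        gb[k] = grad_a                     # ∇_{b_k} L = ∇_{a_k} L
        if k > 1:                          # the "L-1 to 1" part of the loop
            grad_h = W[k].T @ grad_a       # ∇_{h_{k-1}} L
            grad_a = grad_h * g(a[k - 1]) * (1 - g(a[k - 1]))  # ⊙ g'(a_{k-1})
    return gW, gb

# Tiny usage example: 3 inputs, 4 hidden units, 2 classes
rng = np.random.default_rng(0)
W = {1: rng.standard_normal((4, 3)), 2: rng.standard_normal((2, 4))}
b = {1: rng.standard_normal(4), 2: rng.standard_normal(2)}
x = rng.standard_normal(3)
a, h = forward(x, W, b)
gW, gb = backward(a, h, 1, W)
```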
Module 4.9: Derivative of the activation function
Now, the only thing we need to figure out is how to compute g′.

Logistic function:

g(z) = σ(z) = 1 / (1 + e^{−z})

g′(z) = (−1) · ( 1 / (1 + e^{−z})² ) · d/dz (1 + e^{−z})
      = (−1) · ( 1 / (1 + e^{−z})² ) · (−e^{−z})
      = ( 1 / (1 + e^{−z}) ) · ( (1 + e^{−z} − 1) / (1 + e^{−z}) )
      = g(z)(1 − g(z))
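The identity just derived, g′(z) = σ(z)(1 − σ(z)), can be verified against a finite-difference derivative:

```python
# Compare the closed-form logistic derivative with a central difference.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1 - sigma(z))   # g'(z) = g(z)(1 - g(z))

eps = 1e-6
checks = []
for z in (-2.0, 0.0, 0.5, 3.0):
    numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
    checks.append(abs(sigma_prime(z) - numeric))
```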