8-10. Backpropagation Algorithm
CSE Department
National Institute of Technology Rourkela
References:
The Slides are prepared from the following major source:
NIT Rourkela Sibarama Panigrahi & Puneet Kumar Jain “Deep Learning: Course Introduction”
• Feedforward Neural Networks (a.k.a. multilayered network of neurons)
• The input to the network is an n-dimensional vector
• The network contains L−1 hidden layers (2, in this case) having n neurons each
• Finally, there is one output layer containing k neurons (say, corresponding to k classes)
• Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (a_i and h_i are vectors)
• The input layer can be called the 0-th layer and the output layer can be called the L-th layer
• W_i ∈ R^{n×n} and b_i ∈ R^n are the weight matrix and bias vector between layers i−1 and i (0 < i < L)
• The output of the network is h_L = ŷ = f(x)

[Figure: a feedforward network with inputs x_1, x_2, ..., x_n; pre-activation/activation pairs (a_1, h_1), (a_2, h_2); output pre-activation a_3; and parameters (W_1, b_1), (W_2, b_2), (W_3, b_3)]
• The pre-activation at layer i is given by

  a_i = b_i + W_i h_{i−1}

• The activation at layer i is given by

  h_i = g(a_i)

  where g is called the activation function (for example, logistic, tanh, linear, etc.)

• The activation at the output layer is given by

  f(x) = h_L = O(a_L)

  where O is the output activation function (for example, softmax, linear, etc.)
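The layer recurrence above maps directly to code. A minimal NumPy sketch of the forward pass, assuming a logistic activation g and a pluggable output activation O (the layer sizes and random initialization are illustrative, not from the slides):

```python
import numpy as np

def forward(x, weights, biases, g, O):
    """Forward pass: a_i = b_i + W_i h_{i-1}, h_i = g(a_i), h_L = O(a_L)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                        # pre-activation at layer i
        h = g(a)                             # activation at layer i
    a_L = biases[-1] + weights[-1] @ h       # pre-activation at the output layer
    return O(a_L)                            # h_L = y_hat = f(x)

logistic = lambda a: 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 4, 3                                  # input size, number of classes
sizes = [n, n, n, k]                         # L-1 = 2 hidden layers of n neurons each
weights = [rng.standard_normal((m, p)) for p, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

y_hat = forward(rng.standard_normal(n), weights, biases, logistic, lambda a: a)
print(y_hat.shape)  # (3,)
```

Here the output activation is left as the identity (a linear O); the choice of O is discussed below.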
• Data: {x_i, y_i}_{i=1}^N
• Model (for the network shown, with 2 hidden layers):

  ŷ_i = f(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3)
The story so far...
• Recall our gradient descent algorithm

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize w_0, b_0;
  while t++ < max_iterations do
    w_{t+1} ← w_t − η∇w_t;
    b_{t+1} ← b_t − η∇b_t;
  end

• We can write it more concisely as

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ_0 = [w_0, b_0];
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η∇θ_t;
  end

• where ∇θ_t = [∂L(θ)/∂w_t, ∂L(θ)/∂b_t]^T
• Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L], so we initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0] and use

  ∇θ_t = [∂L(θ)/∂W_{1,t}, ..., ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, ..., ∂L(θ)/∂b_{L,t}]^T

• We can still use the same algorithm for learning the parameters of our model
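The concise update θ_{t+1} ← θ_t − η∇θ_t is the same regardless of how many parameters θ contains. A sketch where θ is a list of NumPy arrays and `grad` is assumed to return ∇θ (the quadratic toy objective is illustrative only):

```python
import numpy as np

def gradient_descent(theta, grad, eta=0.1, max_iterations=1000):
    """theta_{t+1} <- theta_t - eta * grad(theta_t), applied to every parameter."""
    for t in range(max_iterations):
        g = grad(theta)
        theta = [p - eta * gp for p, gp in zip(theta, g)]
    return theta

# Toy objective L(theta) = sum of squared entries; its gradient is 2 * theta.
grad = lambda theta: [2.0 * p for p in theta]
theta0 = [np.array([3.0, -2.0]), np.array([[1.0, 4.0]])]  # a "b" and a "W"
theta_star = gradient_descent(theta0, grad, eta=0.1, max_iterations=100)
print(theta_star[0])  # close to [0, 0]
```

For the real network, `grad` is exactly what backpropagation (derived below) computes.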
• Except that now our ∇θ looks much more nasty
We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ, which is composed of
  ∇W_1, ∇W_2, ..., ∇W_{L−1} ∈ R^{n×n} and ∇W_L ∈ R^{k×n},
  ∇b_1, ∇b_2, ..., ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?
• Output Functions and Loss Functions
• The choice of loss function depends on the problem at hand
• We will illustrate this with the help of two examples
• Consider our movie example again, but this time we are interested in predicting ratings

  y_i = {7.5  8.2  7.7}  (imdb Rating, Critics Rating, RT Rating)

• Here y_i ∈ R^3
• The loss function should capture how much ŷ_i deviates from y_i
• If y_i ∈ R^n then the squared error loss can capture this deviation:

  L(θ) = (1/N) Σ_{i=1}^N Σ_{j=1}^3 (ŷ_ij − y_ij)^2

[Figure: a neural network with L−1 hidden layers; the input x_i contains features such as isActor Damon, isDirector Nolan]
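The squared error loss above is a one-liner. A sketch with hypothetical rating data (the 1/N factor and the sum over the three outputs follow the formula on this slide):

```python
import numpy as np

def squared_error_loss(y_hat, y):
    """L(theta) = (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2."""
    N = y.shape[0]
    return np.sum((y_hat - y) ** 2) / N

y = np.array([[7.5, 8.2, 7.7]])        # imdb, Critics, RT ratings for one movie
y_hat = np.array([[7.0, 8.0, 8.0]])    # model predictions (illustrative)
print(squared_error_loss(y_hat, y))    # (0.5^2 + 0.2^2 + 0.3^2) / 1 = 0.38
```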
• A related question: what should the output function 'O' be if y_i ∈ R?
• More specifically, can it be the logistic function?
• No, because it restricts ŷ_i to a value between 0 & 1, but we want ŷ_i ∈ R
• So, in such cases it makes sense to have 'O' as a linear function:

  f(x) = h_L = O(a_L) = W_O a_L + b_O

• ŷ_i = f(x_i) is no longer bounded between 0 and 1
• Now let us consider another problem for which a different loss function would be appropriate
• Suppose we want to classify an image into 1 of k classes, e.g.

  y = [1 0 0 0]  (Apple, Mango, Orange, Banana)
• Notice that y is a probability distribution
• Therefore we should also ensure that ŷ is a probability distribution
• What choice of the output activation 'O' will ensure this?

  a_L = W_L h_{L−1} + b_L

  ŷ_j = O(a_L)_j = e^{a_{L,j}} / Σ_{i=1}^k e^{a_{L,i}}
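This output activation can be implemented directly; subtracting max(a_L) before exponentiating is a standard trick to avoid overflow and does not change the result, since the function is invariant to shifting all inputs (the input vector below is illustrative):

```python
import numpy as np

def softmax(a_L):
    """y_hat_j = exp(a_L_j) / sum_i exp(a_L_i), computed stably."""
    e = np.exp(a_L - np.max(a_L))   # shift invariance: same result, no overflow
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, 0.1, -1.0])
y_hat = softmax(a_L)
print(y_hat.sum())          # 1.0 -- a valid probability distribution
print(bool(np.all(y_hat > 0)))  # True -- every entry is positive
```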
• Now that we have ensured that both y and ŷ are probability distributions, can you think of a function which captures the difference between them?
• Cross-entropy:

  L(θ) = −Σ_{c=1}^k y_c log ŷ_c

• Notice that y_c = 1 only when c = l (the true class label) and 0 otherwise, so the loss reduces to L(θ) = −log ŷ_l
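With y one-hot, the cross-entropy sum collapses to a single term, as noted above. A sketch (the class labels and predicted distribution are illustrative):

```python
import numpy as np

def cross_entropy(y, y_hat):
    """L(theta) = -sum_c y_c * log(y_hat_c); equals -log(y_hat_l) for one-hot y."""
    return -np.sum(y * np.log(y_hat))

y = np.array([1.0, 0.0, 0.0, 0.0])        # Apple is the true class
y_hat = np.array([0.7, 0.1, 0.1, 0.1])    # predicted distribution
print(cross_entropy(y, y_hat))            # -log(0.7), about 0.357
```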
• So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function:

  minimize  L(θ) = −(1/N) Σ_{i=1}^N log ŷ_{i,l}

  where l is the true class of the i-th training example
• But wait! Is ŷ_l a function of θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]?
• Yes, it is indeed a function of θ
Outputs             Real Values      Probabilities
Output Activation   Linear           Softmax
Loss Function       Squared Error    Cross Entropy

• Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often
• For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
• Backpropagation (Intuition)
Recall the two questions we set out to answer: we have now chosen the loss function L(θ); it remains to compute ∇θ, which is composed of ∇W_1, ..., ∇W_L and ∇b_1, ..., ∇b_L.
• Let us focus on this one weight (W_112)

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ_0;
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η∇θ_t;
  end

[Figure: the network with the weight W_112, in the first layer, highlighted]
• First let us take the simple case when we have a deep but thin network
• In this case it is easy to find the derivative by the chain rule:

  ∂L(θ)/∂W_111 = (∂L(θ)/∂ŷ)(∂ŷ/∂a_L1)(∂a_L1/∂h_21)(∂h_21/∂a_21)(∂a_21/∂h_11)(∂h_11/∂a_11)(∂a_11/∂W_111)

  ∂L(θ)/∂W_111 = (∂L(θ)/∂h_11)(∂h_11/∂W_111)   (just compressing the chain rule)

  ∂L(θ)/∂W_211 = (∂L(θ)/∂h_21)(∂h_21/∂W_211)

  ∂L(θ)/∂W_L11 = (∂L(θ)/∂a_L1)(∂a_L1/∂W_L11)

[Figure: a deep, thin network x_1 → a_11 → h_11 → a_21 → h_21 → ... → a_L1 → ŷ, with weights W_111, W_211, ..., W_L11 and loss L(θ)]
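The chain-rule product for the thin network can be checked numerically. A sketch, assuming logistic activations, one scalar weight per layer, and (for illustration only) a squared error loss; the hand-derived product of local derivatives is compared against a finite-difference estimate:

```python
import numpy as np

def sigma(a):                 # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w1, w2, w3):
    a1 = w1 * x;  h1 = sigma(a1)
    a2 = w2 * h1; h2 = sigma(a2)
    a3 = w3 * h2; y_hat = sigma(a3)
    return a1, h1, a2, h2, a3, y_hat

x, y, w1, w2, w3 = 0.5, 1.0, 0.3, -0.7, 1.2  # illustrative values
a1, h1, a2, h2, a3, y_hat = forward(x, w1, w2, w3)

# Chain rule, factor by factor, for dL/dW111 with L = (y_hat - y)^2:
dL_dyhat = 2 * (y_hat - y)
ds = lambda a: sigma(a) * (1 - sigma(a))     # derivative of the logistic function
grad_w1 = dL_dyhat * ds(a3) * w3 * ds(a2) * w2 * ds(a1) * x

# Finite-difference check of the same derivative
eps = 1e-6
L = lambda w: (forward(x, w, w2, w3)[-1] - y) ** 2
fd = (L(w1 + eps) - L(w1 - eps)) / (2 * eps)
print(abs(grad_w1 - fd) < 1e-8)  # the chain rule matches the numerical estimate
```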
Let us see an intuitive explanation of backpropagation before we get into the mathematical details.
• We get a certain loss at the output and we try to figure out who is responsible for this loss
• So, we talk to the output layer and say "Hey! You are not producing the desired output, better take responsibility."
• The output layer says "Well, I take responsibility for my part, but please understand that I am only as good as the hidden layer and weights below me." After all ...

  f(x) = ŷ = O(W_L h_{L−1} + b_L)

• So, we talk to W_L, b_L and h_L and ask them "What is wrong with you?"
• W_L and b_L take full responsibility, but h_L says "Well, please understand that I am only as good as the pre-activation layer"
• The pre-activation layer in turn says that it is only as good as the hidden layer and weights below it
• We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e., all the parameters of the model)
• But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do):

  ∂L(θ)/∂W_111 = (∂L(θ)/∂ŷ · ∂ŷ/∂a_3)(∂a_3/∂h_2 · ∂h_2/∂a_2)(∂a_2/∂h_1 · ∂h_1/∂a_1)(∂a_1/∂W_111)

  (the left-hand side "talks to the weight directly"; the factors on the right successively talk to the output layer, the previous hidden layer, the previous hidden layer again, and finally to the weights)
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases
• Backpropagation: Computing Gradients w.r.t. the Output Units
Let us first consider the partial derivative of the loss L(θ) = −log ŷ_l w.r.t. the i-th output.
∂L(θ)/∂ŷ_i = ∂(−log ŷ_l)/∂ŷ_i = −1_{(l=i)}/ŷ_l

We can now talk about the gradient w.r.t. the vector ŷ:

∇_ŷ L(θ) = [∂L(θ)/∂ŷ_1, ..., ∂L(θ)/∂ŷ_k]^T = −(1/ŷ_l)[1_{l=1}, 1_{l=2}, ..., 1_{l=k}]^T = −(1/ŷ_l) e_l

where e_l is the k-dimensional one-hot vector whose l-th element is 1.
What we are actually interested in is

∂L(θ)/∂a_{L,i} = ∂(−log ŷ_l)/∂a_{L,i} = (∂(−log ŷ_l)/∂ŷ_l)(∂ŷ_l/∂a_{L,i})

Does ŷ_l depend on a_{L,i}? Indeed, it does:

ŷ_l = exp(a_{L,l}) / Σ_i exp(a_{L,i})

Using the softmax derivative ∂ŷ_l/∂a_{L,i} = ŷ_l (1_{(l=i)} − ŷ_i), we get

∂(−log ŷ_l)/∂a_{L,i} = −(1/ŷ_l) · ŷ_l (1_{(l=i)} − ŷ_i) = −(1_{(l=i)} − ŷ_i)
So far we have derived the partial derivative w.r.t. the i-th element of a_L:

∂L(θ)/∂a_{L,i} = −(1_{l=i} − ŷ_i)

We can now write the gradient w.r.t. the vector a_L:

∇_{a_L} L(θ) = [∂L(θ)/∂a_{L,1}, ..., ∂L(θ)/∂a_{L,k}]^T = [−(1_{l=1} − ŷ_1), −(1_{l=2} − ŷ_2), ..., −(1_{l=k} − ŷ_k)]^T = −(e_l − ŷ)
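The closed form ∇_{a_L} L(θ) = −(e_l − ŷ) = ŷ − e_l can be verified numerically against the definition L(θ) = −log ŷ_l (the logits and true class below are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a_L = np.array([1.0, 2.0, 0.5, -1.0])
l = 1                                   # index of the true class
e_l = np.eye(len(a_L))[l]               # one-hot vector for class l

y_hat = softmax(a_L)
grad = -(e_l - y_hat)                   # the result derived above: y_hat - e_l

# Finite-difference check of dL/da_L for the loss L = -log(y_hat_l)
eps = 1e-6
fd = np.zeros_like(a_L)
for i in range(len(a_L)):
    d = np.zeros_like(a_L); d[i] = eps
    fd[i] = (-np.log(softmax(a_L + d)[l]) + np.log(softmax(a_L - d)[l])) / (2 * eps)
print(np.max(np.abs(grad - fd)) < 1e-8)  # True
```

This is why softmax plus cross-entropy is such a convenient pairing: the gradient at the output pre-activations is simply "prediction minus target".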
• Backpropagation: Computing Gradients w.r.t. Hidden Units
Chain rule along multiple paths: if a function p(z) can be written as a function of intermediate results q_m(z), then we have:

∂p(z)/∂z = Σ_m (∂p(z)/∂q_m(z))(∂q_m(z)/∂z)

In our case:
• p(z) is the loss function L(θ)
• z = h_{i,j}
• q_m(z) = a_{i+1,m}
Now consider ∂L(θ)/∂h_{ij}:

∂L(θ)/∂h_{ij} = Σ_{m=1}^{k} ( ∂L(θ)/∂a_{i+1,m} ) · ( ∂a_{i+1,m}/∂h_{ij} )
              = Σ_{m=1}^{k} ( ∂L(θ)/∂a_{i+1,m} ) · W_{i+1,m,j}

Now consider these two vectors,

∇_{a_{i+1}} L(θ) = [ ∂L(θ)/∂a_{i+1,1} , … , ∂L(θ)/∂a_{i+1,k} ]^T ;  W_{i+1,·,j} = [ W_{i+1,1,j} , … , W_{i+1,k,j} ]^T

W_{i+1,·,j} is the j-th column of W_{i+1}; see that,

(W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ) = Σ_{m=1}^{k} ( ∂L(θ)/∂a_{i+1,m} ) · W_{i+1,m,j}

[using a_{i+1} = W_{i+1} h_i + b_{i+1}]
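A quick numeric sketch of the dot-product identity above (the shapes and values are made up for illustration):

```python
# The sum over m of dL/da_{i+1,m} * W_{i+1,m,j} equals the dot product
# between the j-th column of W_{i+1} and the gradient vector ∇_{a_{i+1}} L.
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 3                      # a_{i+1} has k units, h_i has n units
W = rng.standard_normal((k, n))  # plays the role of W_{i+1}
grad_a = rng.standard_normal(k)  # plays the role of ∇_{a_{i+1}} L(θ)

j = 1
elementwise_sum = sum(grad_a[m] * W[m, j] for m in range(k))
dot_product = W[:, j] @ grad_a   # (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

# Stacking over all j gives the whole gradient w.r.t. h_i at once:
grad_h = W.T @ grad_a            # shape (n,)
```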
We have,

∂L(θ)/∂h_{ij} = (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

Stacking these entries for j = 1, …, n gives

∇_{h_i} L(θ) = [ ∂L(θ)/∂h_{i1} , … , ∂L(θ)/∂h_{in} ]^T
             = [ (W_{i+1,·,1})^T ∇_{a_{i+1}} L(θ) , … , (W_{i+1,·,n})^T ∇_{a_{i+1}} L(θ) ]^T
             = (W_{i+1})^T ( ∇_{a_{i+1}} L(θ) )
∇a i L (θ)
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∇a i L (θ) =
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇a i L (θ) =
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
W3 b3
h2
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) W3 b3
h2
∂a ij
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
a2
W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∇a i L (θ) a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∇a i L (θ) = a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇a i L (θ) = a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇ aLi (θ) = .. a1
W1 b1
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇ aLi (θ) = .. a1
∂L (θ) ′
W1 b1
∂h
g (a i n )
in
x1 x2 xn
43
—log ŷ l
∂L (θ)
∂a i 1
∇ aLi (θ) = ..
∂L (θ)
∂ain a3
∂L (θ) ∂L (θ) ∂h i j W3 b3
= h2
∂a ij ∂h i j ∂a ij
∂L (θ) ′ a2
= g (a i j ) [∵ h i j = g(ai j )]
∂h i j W2 b2
h1
∂L (θ) ′
∂ hi1
g (a i 1)
∇ aLi (θ) = .. a1
∂L (θ) ′
W1 b1
∂h
g (a i n )
in
′ x1 x2 xn
= ∇ hLi (θ) Ⓢ[. . . , g (a i ),
k ...]
43
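The two results above, ∇_{h_i} L(θ) = (W_{i+1})^T ∇_{a_{i+1}} L(θ) and ∇_{a_i} L(θ) = ∇_{h_i} L(θ) ⊙ g′(a_i), can be sanity-checked against finite differences; the toy linear loss and the shapes below are made up for illustration, with g the logistic function:

```python
# Check ∇_{h_i} L = W_{i+1}^T ∇_{a_{i+1}} L and ∇_{a_i} L = ∇_{h_i} L ⊙ g'(a_i)
# on a single layer with a simple made-up loss L = v^T a_{i+1}.
import numpy as np

def g(z):                              # logistic activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, k = 3, 4
W_next = rng.standard_normal((k, n))   # W_{i+1}
b_next = rng.standard_normal(k)        # b_{i+1}
v = rng.standard_normal(k)             # defines the toy loss

def loss_from_a_i(a_i):
    h_i = g(a_i)                       # h_i = g(a_i)
    a_next = W_next @ h_i + b_next     # a_{i+1} = W_{i+1} h_i + b_{i+1}
    return v @ a_next                  # L = v^T a_{i+1}

a_i = rng.standard_normal(n)
grad_a_next = v                        # ∇_{a_{i+1}} L for this toy loss
grad_h = W_next.T @ grad_a_next        # ∇_{h_i} L(θ)
grad_a = grad_h * g(a_i) * (1 - g(a_i))  # ∇_{h_i} L(θ) ⊙ g'(a_i)
```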
• Backpropagation: Computing Gradients w.r.t. Parameters
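This step follows the same chain-rule argument as the activation gradients above; as a sketch, the standard result, written in the same notation, is:

```latex
% Each entry a_{ik} = \sum_j W_{i,kj}\, h_{i-1,j} + b_{ik} depends on
% W_{i,kj} only through h_{i-1,j}, so
\frac{\partial L(\theta)}{\partial W_{i,kj}}
  = \frac{\partial L(\theta)}{\partial a_{ik}}\; h_{i-1,j}
\quad\Longrightarrow\quad
\nabla_{W_i} L(\theta) = \nabla_{a_i} L(\theta)\; h_{i-1}^{\top},
\qquad
\nabla_{b_i} L(\theta) = \nabla_{a_i} L(\theta).
```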
• Backpropagation: Pseudo code
(The backward loop runs over the layers from L−1 down to 1.)
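The pseudo code can be sketched as a runnable implementation. The sketch below assumes logistic hidden units and a softmax output with cross-entropy loss, so ∇_{a_L} L(θ) = ŷ − e(y); all function and variable names are illustrative, not from the slides:

```python
# Backpropagation for an L-layer network: forward pass stores a_i and h_i,
# backward pass applies the gradient rules derived above, layer by layer.
import numpy as np

def g(z):                              # logistic activation
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, W, b):
    """Return pre-activations a[1..L] and activations h[0..L]."""
    L = len(W)
    h, a = {0: x}, {}
    for i in range(1, L + 1):
        a[i] = W[i] @ h[i - 1] + b[i]  # a_i = W_i h_{i-1} + b_i
        h[i] = g(a[i]) if i < L else softmax(a[i])
    return a, h

def backward(a, h, y, W):
    """Gradients of the cross-entropy loss w.r.t. all W_k, b_k."""
    L = len(W)
    e_y = np.zeros_like(h[L]); e_y[y] = 1.0
    grad_a = h[L] - e_y                    # ∇_{a_L} L = ŷ − e(y)
    gW, gb = {}, {}
    for k in range(L, 0, -1):              # k = L down to 1
        gW[k] = np.outer(grad_a, h[k - 1]) # ∇_{W_k} L = ∇_{a_k} L · h_{k-1}^T
        gb[k] = grad_a                     # ∇_{b_k} L = ∇_{a_k} L
        if k > 1:                          # the "L-1 to 1" part of the loop
            grad_h = W[k].T @ grad_a       # ∇_{h_{k-1}} L
            grad_a = grad_h * g(a[k - 1]) * (1 - g(a[k - 1]))  # ⊙ g'(a_{k-1})
    return gW, gb

# Tiny usage example: 3 inputs, 4 hidden units, 2 classes
rng = np.random.default_rng(0)
W = {1: rng.standard_normal((4, 3)), 2: rng.standard_normal((2, 4))}
b = {1: rng.standard_normal(4), 2: rng.standard_normal(2)}
x = rng.standard_normal(3)
a, h = forward(x, W, b)
gW, gb = backward(a, h, 1, W)
```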
Module 4.9: Derivative of the activation function
Now, the only thing we need to figure out is how to compute g′.

Logistic function:

g(z) = σ(z) = 1 / (1 + e^{−z})

g′(z) = (−1) · ( 1 / (1 + e^{−z})² ) · d/dz (1 + e^{−z})
      = (−1) · ( 1 / (1 + e^{−z})² ) · (−e^{−z})
      = ( 1 / (1 + e^{−z}) ) · ( (1 + e^{−z} − 1) / (1 + e^{−z}) )
      = g(z)(1 − g(z))
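The identity just derived, g′(z) = σ(z)(1 − σ(z)), can be verified against a finite-difference derivative:

```python
# Compare the closed-form logistic derivative with a central difference.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1 - sigma(z))   # g'(z) = g(z)(1 - g(z))

eps = 1e-6
checks = []
for z in (-2.0, 0.0, 0.5, 3.0):
    numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
    checks.append(abs(sigma_prime(z) - numeric))
```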