UNIT-5-Modern Recurrent Neural Networks
The previous chapter introduced the key ideas behind recurrent neural networks (RNNs).
However, just as with convolutional neural networks, there has been a tremendous amount
of innovation in RNN architectures, culminating in several complex designs that have proven
successful in practice. In particular, the most popular designs feature mechanisms to mitigate
the notorious numerical instability faced by RNNs, as typified by vanishing and exploding
gradients. Recall that in Chapter 9 we dealt with exploding gradients by applying a blunt
gradient clipping heuristic. Despite the efficacy of this hack, it leaves open the problem of
vanishing gradients.
In this chapter, we introduce the key ideas behind the most successful RNN architectures
for sequence learning, which stem from two papers published in 1997. The first paper, Long Short-
Term Memory (Hochreiter and Schmidhuber, 1997), introduces the memory cell, a unit of
computation that replaces traditional nodes in the hidden layer of a network. With these
memory cells, networks are able to overcome difficulties with training encountered by earlier
recurrent networks. Intuitively, the memory cell avoids the vanishing gradient problem by
keeping values in each memory cell’s internal state cascading along a recurrent edge with
weight 1 across many successive time steps. A set of multiplicative gates help the network
to determine both which inputs to allow into the memory state, and when the content of the
memory state should influence the model’s output.
The second paper, Bidirectional Recurrent Neural Networks (Schuster and Paliwal, 1997), in-
troduces an architecture in which information from both the future (subsequent time steps)
and the past (preceding time steps) is used to determine the output at any point in the
sequence. This is in contrast to previous networks, in which only past input can affect the
output. Bidirectional RNNs have become a mainstay for sequence labeling tasks in natural
language processing, among myriad other tasks. Fortunately, the two innovations are not mu-
tually exclusive, and have been successfully combined for phoneme classification (Graves and
Schmidhuber, 2005) and handwriting recognition (Graves et al., 2008).
The first sections in this chapter will explain the LSTM architecture, a lighter-weight ver-
sion called the gated recurrent unit (GRU), the key ideas behind bidirectional RNNs and
a brief explanation of how RNN layers are stacked together to form deep RNNs. Subse-
quently, we will explore the application of RNNs in sequence-to-sequence tasks, introducing
machine translation along with key ideas such as encoder-decoder architectures and beam
search.
Shortly after the first Elman-style RNNs were trained using backpropagation (Elman, 1990),
the problems of learning long-term dependencies (owing to vanishing and exploding gra-
dients) became salient, with Bengio and Hochreiter discussing the problem (Bengio et al.,
1994, Hochreiter et al., 2001). Hochreiter had articulated this problem as early as his
1991 master's thesis, although the results were not widely known because the thesis was writ-
ten in German. While gradient clipping helps with exploding gradients, handling vanishing
gradients appears to require a more elaborate solution. One of the first and most successful
techniques for addressing vanishing gradients came in the form of the long short-term mem-
ory (LSTM) model due to Hochreiter and Schmidhuber (1997). LSTMs resemble standard
recurrent neural networks but here each ordinary recurrent node is replaced by a memory
cell. Each memory cell contains an internal state, i.e., a node with a self-connected recurrent
edge of fixed weight 1, ensuring that the gradient can pass across many time steps without
vanishing or exploding.
The term “long short-term memory” comes from the following intuition. Simple recurrent
neural networks have long-term memory in the form of weights. The weights change slowly
during training, encoding general knowledge about the data. They also have short-term mem-
ory in the form of ephemeral activations, which pass from each node to successive nodes. The
LSTM model introduces an intermediate type of storage via the memory cell. A memory cell
is a composite unit, built from simpler nodes in a specific connectivity pattern, with the novel
inclusion of multiplicative nodes.
Each memory cell is equipped with an internal state and a number of multiplicative gates that
determine whether (i) a given input should impact the internal state (the input gate), (ii) the
internal state should be flushed to 0 (the forget gate), and (iii) the internal state of a given
neuron should be allowed to impact the cell’s output (the output gate).
The key distinction between vanilla RNNs and LSTMs is that the latter support gating of
the hidden state. This means that we have dedicated mechanisms for when a hidden state
should be updated and also when it should be reset. These mechanisms are learned and they
address the concerns listed above. For instance, if the first token is of great importance we
will learn not to update the hidden state after the first observation. Likewise, we will learn to
skip irrelevant temporary observations. Last, we will learn to reset the latent state whenever
needed. We discuss this in detail below.
The data feeding into the LSTM gates are the input at the current time step and the hidden
state of the previous time step, as illustrated in Fig.10.1.1. Three fully connected layers with
sigmoid activation functions compute the values of the input, forget, and output gates. As
a result of the sigmoid activation, all values of the three gates are in the range of (0, 1).
Additionally, we require an input node, typically computed with a tanh activation function.
Intuitively, the input gate determines how much of the input node’s value should be added
to the current memory cell internal state. The forget gate determines whether to keep the
current value of the memory or flush it. And the output gate determines whether the memory
cell should influence the output at the current time step.
Mathematically, suppose that there are h hidden units, the batch size is n, and the number
of inputs is d. Thus, the input is Xt ∈ Rn×d and the hidden state of the previous time step
is Ht−1 ∈ Rn×h . Correspondingly, the gates at time step t are defined as follows: the input
gate is It ∈ Rn×h , the forget gate is Ft ∈ Rn×h , and the output gate is Ot ∈ Rn×h . They
are calculated as follows:
It = σ(Xt Wxi + Ht−1 Whi + bi ),
Ft = σ(Xt Wxf + Ht−1 Whf + bf ), (10.1.1)
Ot = σ(Xt Wxo + Ht−1 Who + bo ),
Figure 10.1.1 Computing the input gate, the forget gate, and the output gate in an LSTM model.
where Wxi , Wxf , Wxo ∈ Rd×h and Whi , Whf , Who ∈ Rh×h are weight parameters
and bi , bf , bo ∈ R1×h are bias parameters. Note that broadcasting (see Section 2.1.4) is
triggered during the summation. We use sigmoid functions (as introduced in Section 5.1) to
map the input values to the interval (0, 1).
Input Node
Next we design the memory cell. Since we have not specified the action of the various gates
yet, we first introduce the input node C̃t ∈ Rn×h . Its computation is similar to that of the
three gates described above, but using a tanh function with a value range of (−1, 1) as the
activation function. This leads to the following equation at time step t:

C̃t = tanh(Xt Wxc + Ht−1 Whc + bc ),   (10.1.2)
where Wxc ∈ Rd×h and Whc ∈ Rh×h are weight parameters and bc ∈ R1×h is a bias
parameter.
Figure 10.1.2 Computing the input node in an LSTM model.
In LSTMs, the input gate It governs how much we take new data into account via C̃t and
the forget gate Ft addresses how much of the old cell internal state Ct−1 ∈ Rn×h we retain.
Using the Hadamard (elementwise) product operator ⊙ we arrive at the following update
equation:

Ct = Ft ⊙ Ct−1 + It ⊙ C̃t .   (10.1.3)
If the forget gate is always 1 and the input gate is always 0, the memory cell internal state Ct−1
will remain constant forever, passing unchanged to each subsequent time step. However, input
gates and forget gates give the model the flexibility to learn when to keep this value unchanged
and when to perturb it in response to subsequent inputs. In practice, this design alleviates the
vanishing gradient problem, resulting in models that are much easier to train, especially when
facing datasets with long sequence lengths.
Figure 10.1.3 Computing the memory cell internal state in an LSTM model.
Hidden State
Last, we need to define how to compute the output of the memory cell, i.e., the hidden state
Ht ∈ Rn×h , as seen by other layers. This is where the output gate comes into play. In
LSTMs, we first apply tanh to the memory cell internal state and then apply another point-
wise multiplication, this time with the output gate. This ensures that the values of Ht are
always in the interval (−1, 1):
Ht = Ot ⊙ tanh(Ct ). (10.1.4)
Whenever the output gate is close to 1, we allow the memory cell internal state to impact
the subsequent layers uninhibited, whereas for output gate values close to 0, we prevent the
current memory from impacting other layers of the network at the current time step. Note
that a memory cell can accrue information across many time steps without impacting the rest
of the network (so long as the output gate takes values close to 0), and then suddenly impact
the network at a subsequent time step as soon as the output gate flips from values close to 0
to values close to 1.
Now let's implement an LSTM from scratch. As with the experiments in Section 9.5, we
first load The Time Machine dataset.
Figure 10.1.4 Computing the hidden state in an LSTM model.
import torch
from torch import nn
from d2l import torch as d2l
Next, we need to define and initialize the model parameters. As previously, the hyperpa-
rameter num_hiddens dictates the number of hidden units. We initialize weights following
a Gaussian distribution with 0.01 standard deviation, and we set the biases to 0.
class LSTMScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()

        init_weight = lambda *shape: nn.Parameter(torch.randn(*shape) * sigma)
        triple = lambda: (init_weight(num_inputs, num_hiddens),
                          init_weight(num_hiddens, num_hiddens),
                          nn.Parameter(torch.zeros(num_hiddens)))
        self.W_xi, self.W_hi, self.b_i = triple()  # Input gate parameters
        self.W_xf, self.W_hf, self.b_f = triple()  # Forget gate parameters
        self.W_xo, self.W_ho, self.b_o = triple()  # Output gate parameters
        self.W_xc, self.W_hc, self.b_c = triple()  # Input node parameters
The actual model is defined as described above, consisting of three gates and an input node.
Note that only the hidden state is passed to the output layer.
@d2l.add_to_class(LSTMScratch)
def forward(self, inputs, H_C=None):
if H_C is None:
# Initial state with shape: (batch_size, num_hiddens)
H = torch.zeros((inputs.shape[1], self.num_hiddens),
device=inputs.device)
C = torch.zeros((inputs.shape[1], self.num_hiddens),
device=inputs.device)
else:
H, C = H_C
outputs = []
for X in inputs:
I = torch.sigmoid(torch.matmul(X, self.W_xi) +
torch.matmul(H, self.W_hi) + self.b_i)
F = torch.sigmoid(torch.matmul(X, self.W_xf) +
torch.matmul(H, self.W_hf) + self.b_f)
O = torch.sigmoid(torch.matmul(X, self.W_xo) +
torch.matmul(H, self.W_ho) + self.b_o)
C_tilde = torch.tanh(torch.matmul(X, self.W_xc) +
torch.matmul(H, self.W_hc) + self.b_c)
C = F * C + I * C_tilde
H = O * torch.tanh(C)
outputs.append(H)
return outputs, (H, C)
Let’s train an LSTM model by instantiating the RNNLMScratch class as introduced in Section
9.5.
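A minimal training sketch along these lines (the exact hyperparameter values below are illustrative) is:

data = d2l.TimeMachine(batch_size=1024, num_steps=32)
lstm = LSTMScratch(num_inputs=len(data.vocab), num_hiddens=32)
# Wrap the LSTM cell in the language model from Section 9.5 and train it
model = d2l.RNNLMScratch(lstm, vocab_size=len(data.vocab), lr=4)
trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)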
Using high-level APIs, we can directly instantiate an LSTM model. This encapsulates all the
configuration details that we made explicit above. The code is significantly faster as it uses
compiled operators rather than Python for many details that we spelled out before.
class LSTM(d2l.RNN):
def __init__(self, num_inputs, num_hiddens):
d2l.Module.__init__(self)
self.save_hyperparameters()
self.rnn = nn.LSTM(num_inputs, num_hiddens)
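Training the concise model proceeds analogously; a sketch (with illustrative hyperparameters, reusing the data object from above) that also generates a sample continuation after training:

lstm = LSTM(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLM(lstm, vocab_size=len(data.vocab), lr=4)
trainer.fit(model, data)
# Generate 20 characters following the prefix 'it has'
model.predict('it has', 20, data.vocab, d2l.try_gpu())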
LSTMs are the prototypical latent variable autoregressive model with nontrivial state con-
trol. Many variants thereof have been proposed over the years, e.g., multiple layers, residual
connections, and different types of regularization. However, training LSTMs and other sequence
models (such as GRUs) is quite costly due to the long-range dependencies of the sequences.
Later we will encounter alternative models such as Transformers that can be used in some
cases.
10.1.4 Summary
While LSTMs were published in 1997, they rose to greater prominence with some victories
in prediction competitions in the mid-2000s, and became the dominant models for sequence
learning from 2011 until more recently with the rise of Transformer models, starting in 2017.
Even Transformers owe some of their key ideas to architecture design innovations introduced
by the LSTM. LSTMs have three types of gates: input gates, forget gates, and output gates
that control the flow of information. The hidden layer output of LSTM includes the hidden
state and the memory cell internal state. Only the hidden state is passed into the output layer
while the memory cell internal state is entirely internal. LSTMs can alleviate vanishing and
exploding gradients.
10.1.5 Exercises
1. Adjust the hyperparameters and analyze their influence on running time, perplexity, and
the output sequence.
2. How would you need to change the model to generate proper words as opposed to se-
quences of characters?
3. Compare the computational cost for GRUs, LSTMs, and regular RNNs for a given hidden
dimension. Pay special attention to the training and inference cost.
4. Since the candidate memory cell ensures that the value range is between −1 and 1 by
using the tanh function, why does the hidden state need to use the tanh function again to
ensure that the output value range is between −1 and 1?
5. Implement an LSTM model for time series prediction rather than character sequence
prediction.
Discussions
As RNNs and particularly the LSTM architecture (Section 10.1) rapidly gained popularity
during the 2010s, a number of papers began to experiment with simplified architectures in
hopes of retaining the key idea of incorporating an internal state and multiplicative gating
mechanisms but with the aim of speeding up computation. The gated recurrent unit (GRU)
(Cho et al., 2014) offered a streamlined version of the LSTM memory cell that often achieves
comparable performance but with the advantage of being faster to compute (Chung et al.,
2014).
Here, the LSTM’s three gates are replaced by two: the reset gate and the update gate. As with
LSTMs, these gates are given sigmoid activations, forcing their values to lie in the interval
(0, 1). Intuitively, the reset gate controls how much of the previous state we might still want
to remember. Likewise, an update gate would allow us to control how much of the new state
is just a copy of the old state. Fig.10.2.1 illustrates the inputs for both the reset and update
gates in a GRU, given the input of the current time step and the hidden state of the previous
time step. The outputs of two gates are given by two fully connected layers with a sigmoid
activation function.
Figure 10.2.1 Computing the reset gate and the update gate in a GRU model.
Mathematically, for a given time step t, suppose that the input is a minibatch Xt ∈ Rn×d
(number of examples: n, number of inputs: d) and the hidden state of the previous time step
is Ht−1 ∈ Rn×h (number of hidden units: h). Then, the reset gate Rt ∈ Rn×h and update
gate Zt ∈ Rn×h are computed as follows:
Rt = σ(Xt Wxr + Ht−1 Whr + br ),
Zt = σ(Xt Wxz + Ht−1 Whz + bz ),   (10.2.1)
where Wxr , Wxz ∈ Rd×h and Whr , Whz ∈ Rh×h are weight parameters and br , bz ∈
R1×h are bias parameters.
Next, we integrate the reset gate Rt with the regular updating mechanism in (9.4.5), leading
to the following candidate hidden state H̃t ∈ Rn×h at time step t:

H̃t = tanh(Xt Wxh + (Rt ⊙ Ht−1 ) Whh + bh ),   (10.2.2)
where Wxh ∈ Rd×h and Whh ∈ Rh×h are weight parameters, bh ∈ R1×h is the bias, and
the symbol ⊙ is the Hadamard (elementwise) product operator. Here we use a tanh activation
function.
The result is a candidate, since we still need to incorporate the action of the update gate.
Comparing with (9.4.5), now the influence of the previous states can be reduced with the
elementwise multiplication of Rt and Ht−1 in (10.2.2). Whenever the entries in the reset
gate Rt are close to 1, we recover a vanilla RNN such as in (9.4.5). For all entries of the
reset gate Rt that are close to 0, the candidate hidden state is the result of an MLP with Xt
as input. Any pre-existing hidden state is thus reset to defaults.
Fig.10.2.2 illustrates the computational flow after applying the reset gate.
Figure 10.2.2 Computing the candidate hidden state in a GRU model.
Finally, we need to incorporate the effect of the update gate Zt . This determines the extent
to which the new hidden state Ht ∈ Rn×h matches the old state Ht−1 versus how much
it resembles the new candidate state H̃t . The update gate Zt can be used for this purpose,
simply by taking elementwise convex combinations of Ht−1 and H̃t . This leads to the final
update equation for the GRU:

Ht = Zt ⊙ Ht−1 + (1 − Zt ) ⊙ H̃t .   (10.2.3)
Whenever the update gate Zt is close to 1, we simply retain the old state. In this case the
information from Xt is ignored, effectively skipping time step t in the dependency chain.
In contrast, whenever Zt is close to 0, the new latent state Ht approaches the candidate
latent state H̃t . Fig.10.2.3 illustrates the computational flow after the update gate is in ac-
tion.
In summary, GRUs have the following two distinguishing features:
• Reset gates help capture short-term dependencies in sequences.
• Update gates help capture long-term dependencies in sequences.
To gain a better understanding of the GRU model, let’s implement it from scratch.
import torch
from torch import nn
from d2l import torch as d2l
Figure 10.2.3 Computing the hidden state in a GRU model.
The first step is to initialize the model parameters. We draw the weights from a Gaussian
distribution with standard deviation sigma and set the biases to 0. The hyperparame-
ter num_hiddens defines the number of hidden units. We instantiate all weights and biases
relating to the update gate, the reset gate, and the candidate hidden state.
class GRUScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()

        init_weight = lambda *shape: nn.Parameter(torch.randn(*shape) * sigma)
        triple = lambda: (init_weight(num_inputs, num_hiddens),
                          init_weight(num_hiddens, num_hiddens),
                          nn.Parameter(torch.zeros(num_hiddens)))
        self.W_xz, self.W_hz, self.b_z = triple()  # Update gate parameters
        self.W_xr, self.W_hr, self.b_r = triple()  # Reset gate parameters
        self.W_xh, self.W_hh, self.b_h = triple()  # Candidate state parameters
Now we are ready to define the GRU forward computation. Its structure is the same as that
of the basic RNN cell, except that the update equations are more complex.
@d2l.add_to_class(GRUScratch)
def forward(self, inputs, H=None):
if H is None:
# Initial state with shape: (batch_size, num_hiddens)
H = torch.zeros((inputs.shape[1], self.num_hiddens),
device=inputs.device)
outputs = []
for X in inputs:
Z = torch.sigmoid(torch.matmul(X, self.W_xz) +
torch.matmul(H, self.W_hz) + self.b_z)
R = torch.sigmoid(torch.matmul(X, self.W_xr) +
torch.matmul(H, self.W_hr) + self.b_r)
H_tilde = torch.tanh(torch.matmul(X, self.W_xh) +
torch.matmul(R * H, self.W_hh) + self.b_h)
H = Z * H + (1 - Z) * H_tilde
outputs.append(H)
return outputs, H
Training
Training a language model on The Time Machine dataset works in exactly the same manner
as in Section 9.5.
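For example, a training sketch mirroring the LSTM experiment above (hyperparameter values are illustrative):

data = d2l.TimeMachine(batch_size=1024, num_steps=32)
gru = GRUScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLMScratch(gru, vocab_size=len(data.vocab), lr=4)
trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)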
In high-level APIs, we can directly instantiate a GRU model. This encapsulates all the con-
figuration details that we made explicit above.
class GRU(d2l.RNN):
def __init__(self, num_inputs, num_hiddens):
d2l.Module.__init__(self)
self.save_hyperparameters()
self.rnn = nn.GRU(num_inputs, num_hiddens)
The code is significantly faster in training as it uses compiled operators rather than Python.
After training, we print out the perplexity on the training set and the predicted sequence
following the provided prefix.
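A sketch of that training and prediction step (again with illustrative hyperparameters, reusing the data object from above):

gru = GRU(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=4)
trainer.fit(model, data)
# Predict the continuation of a prefix after training
model.predict('it has', 20, data.vocab, d2l.try_gpu())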
10.2.6 Summary
Compared with LSTMs, GRUs achieve similar performance but tend to be lighter computa-
tionally. Generally, compared with simple RNNs, gated RNNs like LSTMs and GRUs can
better capture dependencies for sequences with large time step distances. GRUs contain ba-
sic RNNs as their extreme case whenever the reset gate is switched on. They can also skip
subsequences by turning on the update gate.
10.2.7 Exercises
1. Assume that we only want to use the input at time step t′ to predict the output at time step
t > t′ . What are the best values for the reset and update gates for each time step?
2. Adjust the hyperparameters and analyze their influence on running time, perplexity, and
the output sequence.
3. Compare runtime, perplexity, and the output strings for rnn.RNN and rnn.GRU imple-
mentations with each other.
4. What happens if you implement only parts of a GRU, e.g., with only a reset gate or only
an update gate?
Discussions
Up until now, we have focused on defining networks consisting of a sequence input, a single
hidden RNN layer, and an output layer. Despite having just one hidden layer between the
input at any time step and the corresponding output, there is a sense in which these networks
are deep. Inputs from the first time step can influence the outputs at the final time step T (often
100s or 1000s of steps later). These inputs pass through T applications of the recurrent layer
before reaching the final output. However, we often also wish to retain the ability to express
complex relationships between the inputs at a given time step and the outputs at that same
time step. Thus we often construct RNNs that are deep not only in the time direction but also
in the input-to-output direction. This is precisely the notion of depth that we have already
encountered in our development of MLPs and deep CNNs.
The standard method for building this sort of deep RNN is strikingly simple: we stack the
RNNs on top of each other. Given a sequence of length T , the first RNN produces a sequence
of outputs, also of length T . These, in turn, constitute the inputs to the next RNN layer. In
this short section, we illustrate this design pattern and present a simple example for how to
code up such stacked RNNs. Below, in Fig.10.3.1, we illustrate a deep RNN with L hidden
layers. Each hidden state operates on a sequential input and produces a sequential output.
Moreover, any RNN cell (white box in Fig.10.3.1) at each time step depends on both the
same layer’s value at the previous time step and the previous layer’s value at the same time
step.
Figure 10.3.1 Architecture of a deep RNN.
Formally, suppose that the hidden state of the l-th hidden layer (l = 1, . . . , L) at time step t is Ht(l) ∈ Rn×h , with Ht(0) = Xt denoting the minibatch input. The hidden state of layer l depends on the output of layer l − 1 at the same time step and on its own state at the previous time step:

Ht(l) = ϕl (Ht(l−1) Wxh(l) + Ht−1(l) Whh(l) + bh(l) ),   (10.3.1)

where ϕl is the activation function and Wxh(l) , Whh(l) , and bh(l) are the weight and bias parameters of layer l. The output layer only uses the hidden state of the final layer L:

Ot = Ht(L) Whq + bq ,   (10.3.2)

where the weight Whq ∈ Rh×q and the bias bq ∈ R1×q are the model parameters of the
output layer.
Just as with MLPs, the number of hidden layers L and the number of hidden units h are hy-
perparameters that we can tune. Common RNN layer widths (h) are in the range (64, 2056),
and common depths (L) are in the range (1, 8). In addition, we can easily get a deep gated
RNN by replacing the hidden state computation in (10.3.1) with that from an LSTM or a
GRU.
import torch
from torch import nn
from d2l import torch as d2l
To implement a multi-layer RNN from scratch, we can treat each layer as an RNNScratch
instance with its own learnable parameters.
class StackedRNNScratch(d2l.Module):
def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
super().__init__()
self.save_hyperparameters()
        self.rnns = nn.Sequential(*[d2l.RNNScratch(
            num_inputs if i==0 else num_hiddens, num_hiddens, sigma)
            for i in range(num_layers)])
The multi-layer forward computation simply performs forward computation layer by layer.
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
outputs = inputs
if Hs is None: Hs = [None] * self.num_layers
for i in range(self.num_layers):
outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
outputs = torch.stack(outputs, 0)
return outputs, Hs
As an example, we train a deep GRU model on The Time Machine dataset (same as in Section
9.5). To keep things simple we set the number of layers to 2.
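A sketch of that experiment (hyperparameter values are illustrative):

data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
                              num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)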
Fortunately many of the logistical details required to implement multiple layers of an RNN
are readily available in high-level APIs. Our concise implementation will use such built-in
functionalities. The code generalizes the one we used previously in Section 10.2, allowing
specification of the number of layers explicitly rather than picking the default of a single
layer.
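A sketch of such a generalized GRU class, assuming it simply forwards num_layers (and optionally dropout) to nn.GRU:

class GRU(d2l.RNN):  #@save
    """The multi-layer GRU model."""
    def __init__(self, num_inputs, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens, num_layers,
                          dropout=dropout)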
The architectural decisions such as choosing hyperparameters are very similar to those of
Section 10.2. We pick the same number of inputs and outputs as we have distinct tokens,
i.e., vocab_size. The number of hidden units is still 32. The only difference is that we now
select a nontrivial number of hidden layers by specifying the value of num_layers.
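For instance (a sketch with an illustrative learning rate, reusing the data object from above):

gru = GRU(num_inputs=len(data.vocab), num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)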
10.3.3 Summary
In deep RNNs, the hidden state information is passed to the next time step of the current
layer and the current time step of the next layer. There exist many different flavors of deep
RNNs, such as LSTMs, GRUs, or vanilla RNNs. Conveniently, these models are all available
as parts of the high-level APIs of deep learning frameworks. Initialization of models requires
care. Overall, deep RNNs require a considerable amount of work (such as tuning the learning rate and
clipping) to ensure proper convergence.
10.3.4 Exercises
1. Replace the GRU by an LSTM and compare the accuracy and training speed.
2. Increase the training data to include multiple books. How low can you go on the perplexity
scale?
3. Would you want to combine sources of different authors when modeling text? Why is this
a good idea? What could go wrong?
Discussions
So far, our working example of a sequence learning task has been language modeling, where
we aim to predict the next token given all previous tokens in a sequence. In this scenario,
we wish only to condition upon the leftward context, and thus the unidirectional chaining
of a standard RNN seems appropriate. However, there are many other sequence learning
contexts where it is perfectly fine to condition the prediction at every time step on both
the leftward and the rightward context. Consider, for example, part of speech detection. Why
shouldn’t we take the context in both directions into account when assessing the part of speech
associated with a given word?
Another common task—often useful as a pretraining exercise prior to fine-tuning a model on
an actual task of interest—is to mask out random tokens in a text document and then to train
a sequence model to predict the values of the missing tokens. Note that depending on what
comes after the blank, the likely value of the missing token changes dramatically:
• I am ___.
• I am ___ hungry.
• I am ___ hungry, and I can eat half a pig.
In the first sentence “happy” seems to be a likely candidate. The words “not” and “very”
seem plausible in the second sentence, but “not” seems incompatible with the third sen-
tence.
Fortunately, a simple technique transforms any unidirectional RNN into a bidirectional RNN
(Schuster and Paliwal, 1997). We simply implement two unidirectional RNN layers chained
together in opposite directions and acting on the same input (Fig.10.4.1). For the first RNN
layer, the first input is x1 and the last input is xT , but for the second RNN layer, the first
input is xT and the last input is x1 . To produce the output of this bidirectional RNN layer, we
simply concatenate together the corresponding outputs of the two underlying unidirectional
RNN layers.
Figure 10.4.1 Architecture of a bidirectional RNN.
Formally for any time step t, we consider a minibatch input Xt ∈ Rn×d (number of exam-
ples: n, number of inputs in each example: d) and let the hidden layer activation function be
ϕ. In the bidirectional architecture, the forward and backward hidden states for this time step are →Ht ∈ Rn×h and ←Ht ∈ Rn×h , respectively, where h is the number of hidden units. The forward and backward hidden state updates are as follows:

→Ht = ϕ(Xt Wxh(f) + →Ht−1 Whh(f) + bh(f) ),
←Ht = ϕ(Xt Wxh(b) + ←Ht+1 Whh(b) + bh(b) ),   (10.4.1)

where the weights Wxh(f) ∈ Rd×h , Whh(f) ∈ Rh×h , Wxh(b) ∈ Rd×h , and Whh(b) ∈ Rh×h , and the biases bh(f) ∈ R1×h and bh(b) ∈ R1×h are all the model parameters.

Next, we concatenate the forward and backward hidden states →Ht and ←Ht to obtain the
hidden state Ht ∈ Rn×2h to be fed into the output layer. In deep bidirectional RNNs with
multiple hidden layers, such information is passed on as input to the next bidirectional layer.
Last, the output layer computes the output Ot ∈ Rn×q (number of outputs: q):
Ot = Ht Whq + bq . (10.4.2)
Here, the weight matrix Whq ∈ R2h×q and the bias bq ∈ R1×q are the model parame-
ters of the output layer. While technically, the two directions can have different numbers of
hidden units, this design choice is seldom made in practice. We now demonstrate a simple
implementation of a bidirectional RNN.
import torch
from torch import nn
from d2l import torch as d2l
To implement a bidirectional RNN from scratch, we can include two unidirectional RNNScratch
instances with separate learnable parameters.
class BiRNNScratch(d2l.Module):
def __init__(self, num_inputs, num_hiddens, sigma=0.01):
super().__init__()
self.save_hyperparameters()
self.f_rnn = d2l.RNNScratch(num_inputs, num_hiddens, sigma)
self.b_rnn = d2l.RNNScratch(num_inputs, num_hiddens, sigma)
self.num_hiddens *= 2 # The output dimension will be doubled
States of forward and backward RNNs are updated separately, while outputs of these two
RNNs are concatenated.
@d2l.add_to_class(BiRNNScratch)
def forward(self, inputs, Hs=None):
f_H, b_H = Hs if Hs is not None else (None, None)
f_outputs, f_H = self.f_rnn(inputs, f_H)
b_outputs, b_H = self.b_rnn(reversed(inputs), b_H)
outputs = [torch.cat((f, b), -1) for f, b in zip(
f_outputs, reversed(b_outputs))]
return outputs, (f_H, b_H)
Using the high-level APIs, we can implement bidirectional RNNs more concisely. Here we
take a GRU model as an example.
class BiGRU(d2l.RNN):
def __init__(self, num_inputs, num_hiddens):
d2l.Module.__init__(self)
self.save_hyperparameters()
self.rnn = nn.GRU(num_inputs, num_hiddens, bidirectional=True)
self.num_hiddens *= 2
10.4.3 Summary
In bidirectional RNNs, the hidden state for each time step is simultaneously determined by
the data prior to and after the current time step. Bidirectional RNNs are mostly useful for se-
quence encoding and the estimation of observations given bidirectional context. Bidirectional
RNNs are very costly to train due to long gradient chains.
10.4.4 Exercises
1. If the different directions use a different number of hidden units, how will the shape of
Ht change?
3. Polysemy is common in natural languages. For example, the word “bank” has different
meanings in contexts “i went to the bank to deposit cash” and “i went to the bank to sit
down”. How can we design a neural network model such that given a context sequence and
a word, a vector representation of the word in the context will be returned? What type of
neural architecture is preferred for handling polysemy?
Discussions
Among the major breakthroughs that prompted widespread interest in modern RNNs was
a major advance in the applied field of statistical machine translation. Here, the model is
presented with a sentence in one language and must predict the corresponding sentence in
another language. Note that here the sentences may be of different lengths, and that corre-
sponding words in the two sentences may not occur in the same order, owing to differences
in the two languages' grammatical structures.
Many problems have this flavor of mapping between two such “unaligned” sequences. Exam-
ples include mapping from dialog prompts to replies or from questions to answers. Broadly,
such problems are called sequence-to-sequence (seq2seq) problems and they are our focus for
both the remainder of this chapter and much of Chapter 11.
In this section, we introduce the machine translation problem and an example dataset that
we will use in the subsequent examples. For decades, statistical formulations of translation
between languages had been popular (Brown et al., 1990, Brown et al., 1988), even before
researchers got neural network approaches working (methods often lumped together under
the term neural machine translation).
First we will need some new code to process our data. Unlike the language modeling that
we saw in Section 9.3, here each example consists of two separate text sequences, one in
the source language and another (the translation) in the target language. The following code
snippets will show how to load the preprocessed data into minibatches for training.
import os
import torch
from d2l import torch as d2l
class MTFraEng(d2l.DataModule):  #@save
    """The English-French machine translation dataset."""
    def _download(self):
        # Download and extract the English-French archive from the d2l data
        # hub, then return the raw text (the checksum below should match the
        # one registered for 'fra-eng.zip' in the d2l data hub)
        d2l.extract(d2l.download(
            d2l.DATA_URL+'fra-eng.zip', self.root,
            '94646ad1522d915e7b0f9296181140edcf86a4f5'))
        with open(self.root + '/fra-eng/fra.txt', encoding='utf-8') as f:
            return f.read()
data = MTFraEng()
raw_text = data._download()
print(raw_text[:75])
Go. Va !
Hi. Salut !
Run! Cours !
Run! Courez !
Who? Qui ?
Wow! Ça alors !
After downloading the dataset, we proceed with several preprocessing steps for the raw text
data. For instance, we replace non-breaking space with space, convert uppercase letters to
lowercase ones, and insert space between words and punctuation marks.
@d2l.add_to_class(MTFraEng) #@save
def _preprocess(self, text):
# Replace non-breaking space with space
text = text.replace('\u202f', ' ').replace('\xa0', ' ')
# Insert space between words and punctuation marks
no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '
out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
for i, char in enumerate(text.lower())]
return ''.join(out)
text = data._preprocess(raw_text)
print(text[:80])
go . va !
hi . salut !
run ! cours !
run ! courez !
who ? qui ?
wow ! ça alors !
10.5.2 Tokenization
Unlike the character-level tokenization in Section 9.3, for machine translation we prefer
word-level tokenization here (today’s state-of-the-art models use more complex tokeniza-
tion techniques). The following _tokenize method tokenizes the first max_examples text
sequence pairs, where each token is either a word or a punctuation mark. We append the spe-
cial “<eos>” token to the end of every sequence to indicate the end of the sequence. When a
model is predicting by generating a sequence token after token, the generation of the “<eos>”
token can suggest that the output sequence is complete. In the end, the method below returns
two lists of token lists: src and tgt. Specifically, src[i] is a list of tokens from the ith
text sequence in the source language (English here) and tgt[i] is that in the target language
(French here).
@d2l.add_to_class(MTFraEng) #@save
def _tokenize(self, text, max_examples=None):
src, tgt = [], []
for i, line in enumerate(text.split('\n')):
if max_examples and i > max_examples: break
parts = line.split('\t')
if len(parts) == 2:
# Skip empty tokens
src.append([t for t in f'{parts[0]} <eos>'.split(' ') if t])
tgt.append([t for t in f'{parts[1]} <eos>'.split(' ') if t])
return src, tgt
Let’s plot the histogram of the number of tokens per text sequence. In this simple English-
French dataset, most of the text sequences have fewer than 20 tokens.
#@save
def show_list_len_pair_hist(legend, xlabel, ylabel, xlist, ylist):
"""Plot the histogram for list length pairs."""
d2l.set_figsize()
_, _, patches = d2l.plt.hist(
[[len(l) for l in xlist], [len(l) for l in ylist]])
d2l.plt.xlabel(xlabel)
d2l.plt.ylabel(ylabel)
for patch in patches[1].patches:
patch.set_hatch('/')
d2l.plt.legend(legend)
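The histogram itself can then be produced by calling this helper on the tokenized corpus, for example:

src, tgt = data._tokenize(text)
show_list_len_pair_hist(['source', 'target'], '# tokens per sequence',
                        'count', src, tgt)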
Recall that in language modeling each example sequence, either a segment of one sentence
or a span over multiple sentences, had a fixed length. This was specified by the num_steps
(number of time steps or tokens) argument in Section 9.3. In machine translation, each ex-
ample is a pair of source and target text sequences, where the two text sequences may have
different lengths.
For computational efficiency, we can still process a minibatch of text sequences at one time
by truncation and padding. Suppose that every sequence in the same minibatch should have
the same length num_steps. If a text sequence has fewer than num_steps tokens, we will
keep appending the special “<pad>” token to its end until its length reaches num_steps.
Otherwise, we will truncate the text sequence by only taking its first num_steps tokens and
discarding the remaining. In this way, every text sequence will have the same length to be
loaded in minibatches of the same shape. Besides, we also record the length of the source se-
quence excluding padding tokens. This information will be needed by some models that we
will cover later.
Since the machine translation dataset consists of pairs of languages, we can build two vo-
cabularies for both the source language and the target language separately. With word-level
tokenization, the vocabulary size will be significantly larger than that using character-level
tokenization. To alleviate this, here we treat infrequent tokens that appear less than 2 times
as the same unknown (“<unk>”) token. As we will explain later (Fig.10.7.1), when training
with target sequences, the decoder output (label tokens) can be the same decoder input (tar-
get tokens), shifted by one token; and the special beginning-of-sequence “<bos>” token will
be used as the first input token for predicting the target sequence (Fig.10.7.3).
@d2l.add_to_class(MTFraEng) #@save
def __init__(self, batch_size, num_steps=9, num_train=512, num_val=128):
super(MTFraEng, self).__init__()
self.save_hyperparameters()
self.arrays, self.src_vocab, self.tgt_vocab = self._build_arrays(
self._download())
@d2l.add_to_class(MTFraEng) #@save
def _build_arrays(self, raw_text, src_vocab=None, tgt_vocab=None):
def _build_array(sentences, vocab, is_tgt=False):
pad_or_trim = lambda seq, t: (
seq[:t] if len(seq) > t else seq + ['<pad>'] * (t - len(seq)))
sentences = [pad_or_trim(s, self.num_steps) for s in sentences]
if is_tgt:
sentences = [['<bos>'] + s for s in sentences]
if vocab is None:
vocab = d2l.Vocab(sentences, min_freq=2)
array = torch.tensor([vocab[s] for s in sentences])
valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
return array, vocab, valid_len
src, tgt = self._tokenize(self._preprocess(raw_text),
self.num_train + self.num_val)
src_array, src_vocab, src_valid_len = _build_array(src, src_vocab)
tgt_array, tgt_vocab, _ = _build_array(tgt, tgt_vocab, True)
return ((src_array, tgt_array[:,:-1], src_valid_len, tgt_array[:,1:]),
src_vocab, tgt_vocab)
@d2l.add_to_class(MTFraEng) #@save
def get_dataloader(self, train):
idx = slice(0, self.num_train) if train else slice(self.num_train, None)
return self.get_tensorloader(self.arrays, train, idx)
data = MTFraEng(batch_size=3)
src, tgt, src_valid_len, label = next(iter(data.train_dataloader()))
print('source:', src.type(torch.int32))
print('decoder input:', tgt.type(torch.int32))
print('source len excluding pad:', src_valid_len.type(torch.int32))
print('label:', label.type(torch.int32))
Below we show a pair of source and target sequences that are processed by the above _build_arrays
method (in the string format).
@d2l.add_to_class(MTFraEng) #@save
def build(self, src_sentences, tgt_sentences):
raw_text = '\n'.join([src + '\t' + tgt for src, tgt in zip(
src_sentences, tgt_sentences)])
arrays, _, _ = self._build_arrays(
raw_text, self.src_vocab, self.tgt_vocab)
return arrays
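For example, a call along the following lines produces the string-format pair shown below (reusing the vocabularies built above):

src, tgt, _, _ = data.build(['hi .'], ['salut .'])
print('source:', data.src_vocab.to_tokens(src[0].type(torch.int32)))
print('target:', data.tgt_vocab.to_tokens(tgt[0].type(torch.int32)))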
source: ['hi', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
target: ['<bos>', 'salut', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
10.5.5 Summary
In natural language processing, machine translation refers to the task of automatically map-
ping from a sequence representing a string of text in a source language to a string represent-
ing a plausible translation in a target language. Using word-level tokenization, the vocabulary
size will be significantly larger than that using character-level tokenization, but the sequence
lengths will be much shorter. To mitigate the large vocabulary size, we can treat infrequent
tokens as some “unknown” token. We can truncate and pad text sequences so that all of them
will have the same length to be loaded in minibatches. Modern implementations often bucket
sequences with similar lengths to avoid wasting excessive computation on padding.
10.5.6 Exercises
1. Try different values of the max_examples argument in the _tokenize method. How does
this affect the vocabulary sizes of the source language and the target language?
2. Text in some languages such as Chinese and Japanese does not have word boundary in-
dicators (e.g., space). Is word-level tokenization still a good idea for such cases? Why or
why not?
Discussions
In general seq2seq problems like machine translation (Section 10.5), inputs and outputs are
of varying lengths that are unaligned. The standard approach to handling this sort of data is to
design an encoder-decoder architecture (Fig.10.6.1) consisting of two major components: an
encoder that takes a variable-length sequence as input, and a decoder that acts as a conditional
language model, taking in the encoded input and the leftwards context of the target sequence
and predicting the subsequent token in the target sequence.
Figure 10.6.1 The encoder-decoder architecture.
Let’s take machine translation from English to French as an example. Given an input sequence
in English: “They”, “are”, “watching”, “.”, this encoder-decoder architecture first encodes the
variable-length input into a state, then decodes the state to generate the translated sequence,
token by token, as output: “Ils”, “regardent”, “.”. Since the encoder-decoder architecture
forms the basis of different seq2seq models in subsequent sections, this section will convert
this architecture into an interface that will be implemented later.
10.6.1 Encoder
In the encoder interface, we just specify that the encoder takes variable-length sequences as
input X. The implementation will be provided by any model that inherits this base Encoder
class.
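A minimal sketch of such a base class (the interface is deliberately empty; concrete encoders supply the forward computation):

from torch import nn
from d2l import torch as d2l


class Encoder(nn.Module):  #@save
    """The base encoder interface for the encoder-decoder architecture."""
    def __init__(self):
        super().__init__()

    # Concrete encoders implement the encoding of a variable-length input X;
    # *args allows extra inputs such as valid lengths
    def forward(self, X, *args):
        raise NotImplementedError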
10.6.2 Decoder
In the following decoder interface, we add an additional init_state function to convert the
encoder output (enc_all_outputs) into the encoded state. Note that this step may require
extra inputs, such as the valid length of the input, which was explained in Section 10.5. To
generate a variable-length sequence token by token, at each time step the decoder maps an input
(e.g., the token generated at the previous time step) and the encoded state to an output token
for the current time step.
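A corresponding sketch of the decoder interface:

class Decoder(nn.Module):  #@save
    """The base decoder interface for the encoder-decoder architecture."""
    def __init__(self):
        super().__init__()

    # Convert the encoder output (enc_all_outputs) into the encoded state;
    # extra inputs such as valid lengths can be passed via *args
    def init_state(self, enc_all_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError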
In the forward propagation, the output of the encoder is used to produce the encoded state,
and this state will be further used by the decoder as one of its inputs.
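Putting the two interfaces together, a sketch of the combined architecture (assuming the d2l.Classifier base class used elsewhere in the book) might look like this:

class EncoderDecoder(d2l.Classifier):  #@save
    """The base class for the encoder-decoder architecture."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_all_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_all_outputs, *args)
        # Return only the decoder output
        return self.decoder(dec_X, dec_state)[0]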
In the next section, we will see how to apply RNNs to design seq2seq models based on this
encoder-decoder architecture.
10.6.4 Summary
Encoder-decoder architectures can handle inputs and outputs that both consist of variable-
length sequences and thus are suitable for seq2seq problems such as machine translation.
The encoder takes a variable-length sequence as input and transforms it into a state with a
fixed shape. The decoder maps the encoded state of a fixed shape to a variable-length se-
quence.
10.6.5 Exercises
2. Besides machine translation, can you think of another application where the encoder-
decoder architecture can be applied?
Discussions
In so-called seq2seq problems like machine translation (as discussed in Section 10.5), where
inputs and outputs both consist of variable-length unaligned sequences, we generally rely
on encoder-decoder architectures (Section 10.6). In this section, we will demonstrate the
application of an encoder-decoder architecture, where both the encoder and decoder are im-
plemented as RNNs, to the task of machine translation (Cho et al., 2014, Sutskever et al.,
2014).
Here, the encoder RNN will take a variable-length sequence as input and transform it into
a fixed-shape hidden state. Later, in Chapter 11, we will introduce attention mechanisms,
which allow us to access encoded inputs without having to compress the entire input into a
single fixed-length representation.
Then to generate the output sequence, one token at a time, the decoder model, consisting of
a separate RNN, will predict each successive target token given both the input sequence and
the preceding tokens in the output. During training, the decoder will typically be conditioned
upon the preceding tokens in the official “ground-truth” label. However, at test time, we will
want to condition each output of the decoder on the tokens already predicted. Note that if we
ignore the encoder, the decoder in a seq2seq architecture behaves just like a normal language
model. Fig.10.7.1 illustrates how to use two RNNs for sequence to sequence learning in
machine translation.
Figure 10.7.1 Sequence to sequence learning with an RNN encoder and an RNN decoder.
In Fig.10.7.1, the special “<eos>” token marks the end of the sequence. Our model can stop
making predictions once this token is generated. At the initial time step of the RNN decoder,
there are two special design decisions to be aware of: First, we begin every input with a
special beginning-of-sequence “<bos>” token. Second, we may feed the final hidden state of
the encoder into the decoder at every single decoding time step (Cho et al., 2014). In some
other designs, such as Sutskever et al. (2014), the final hidden state of the RNN encoder is
used to initiate the hidden state of the decoder only at the first decoding step.
While running the encoder on the input sequence is relatively straightforward, how to handle
the input and output of the decoder requires more care. The most common approach is some-
times called teacher forcing. Here, the original target sequence (token labels) is fed into the
decoder as input. More concretely, the special beginning-of-sequence token and the original
target sequence, excluding the final token, are concatenated as input to the decoder, while
the decoder output (labels for training) is the original target sequence, shifted by one token:
“<bos>”, “Ils”, “regardent”, “.” → “Ils”, “regardent”, “.”, “<eos>” (Fig.10.7.1).
Our implementation in Section 10.5.3 prepared training data for teacher forcing, where shift-
ing tokens for self-supervised learning is similar to the training of language models in Section
9.3. An alternative approach is to feed the predicted token from the previous time step as the
current input to the decoder.
In the following, we explain the design depicted in Fig.10.7.1 in greater detail. We will train
this model for machine translation on the English-French dataset as introduced in Section
10.5.
import collections
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
10.7.2 Encoder
Recall that the encoder transforms an input sequence of variable length into a fixed-shape
context variable c (see Fig.10.7.1).
Consider a single sequence example (batch size 1). Suppose that the input sequence is x1 , . . . , xT ,
such that xt is the tth token. At time step t, the RNN transforms the input feature vector xt
for xt and the hidden state ht−1 from the previous time step into the current hidden state ht .
We can use a function f to express the transformation of the RNN's recurrent layer:

ht = f(xt , ht−1 ).   (10.7.1)
In general, the encoder transforms the hidden states at all time steps into a context variable
through a customized function q:
c = q(h1 , . . . , hT ). (10.7.2)
For example, in Fig.10.7.1, the context variable is just the hidden state hT correspond-
ing to the encoder RNN’s representation after processing the final token of the input se-
quence.
In this example, we have used a unidirectional RNN to design the encoder, where the hidden
state only depends on the input subsequence at and before the time step of the hidden state. We
can also construct encoders using bidirectional RNNs. In this case, a hidden state depends on
the subsequence before and after the time step (including the input at the current time step),
which encodes the information of the entire sequence.
Now let’s implement the RNN encoder. Note that we use an embedding layer to obtain the
feature vector for each token in the input sequence. The weight of an embedding layer is a ma-
trix, where the number of rows corresponds to the size of the input vocabulary (vocab_size)
and number of columns corresponds to the feature vector’s dimension (embed_size). For
any input token index i, the embedding layer fetches the ith row (starting from 0) of the
weight matrix to return its feature vector. Here we implement the encoder with a multilayer
GRU.
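A sketch of the encoder (together with a Xavier-style weight initialization helper, consistent with the init_seq2seq call used by the decoder below):

def init_seq2seq(module):  #@save
    """Initialize weights for sequence-to-sequence learning."""
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)
    if type(module) == nn.GRU:
        for param in module._flat_weights_names:
            if "weight" in param:
                nn.init.xavier_uniform_(module._parameters[param])


class Seq2SeqEncoder(d2l.Encoder):  #@save
    """The RNN encoder for sequence-to-sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = d2l.GRU(embed_size, num_hiddens, num_layers, dropout)
        self.apply(init_seq2seq)

    def forward(self, X, *args):
        # X shape: (batch_size, num_steps)
        embs = self.embedding(X.t().type(torch.int64))
        # embs shape: (num_steps, batch_size, embed_size)
        outputs, state = self.rnn(embs)
        # outputs shape: (num_steps, batch_size, num_hiddens)
        # state shape: (num_layers, batch_size, num_hiddens)
        return outputs, state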
Let’s use a concrete example to illustrate the above encoder implementation. Below, we in-
stantiate a two-layer GRU encoder whose number of hidden units is 16. Given a minibatch of
sequence inputs X (batch size: 4, number of time steps: 9), the hidden states of the last layer
at all the time steps (enc_outputs returned by the encoder’s recurrent layers) are a tensor
of shape (number of time steps, batch size, number of hidden units).
Since we are using a GRU here, the shape of the multilayer hidden states at the final time
step is (number of hidden layers, batch size, number of hidden units).
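To make this concrete, here is a sketch of that instantiation with shape checks (the vocabulary and embedding sizes are placeholder values):

vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2
batch_size, num_steps = 4, 9
encoder = Seq2SeqEncoder(vocab_size, embed_size, num_hiddens, num_layers)
X = torch.zeros((batch_size, num_steps))
enc_outputs, enc_state = encoder(X)
d2l.check_shape(enc_outputs, (num_steps, batch_size, num_hiddens))
d2l.check_shape(enc_state, (num_layers, batch_size, num_hiddens))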
10.7.3 Decoder
Given a target output sequence y1 , y2 , . . . , yT ′ for each time step t′ (we use t′ to differenti-
ate from the input sequence time steps), the decoder assigns a predicted probability to each
possible token yt′ +1 occurring at time step t′ + 1, conditioned upon the previous tokens in the target
y1 , . . . , yt′ and the context variable c, i.e., P (yt′ +1 | y1 , . . . , yt′ , c).
To predict the subsequent token t′ + 1 in the target sequence, the RNN decoder takes the
previous step’s target token yt′ , the hidden RNN state from the previous time step st′ −1 , and
the context variable c as its input, and transforms them into the hidden state st′ at the current
time step. We can use a function g to express the transformation of the decoder’s hidden
layer:

st′ = g(yt′ , c, st′ −1 ).   (10.7.3)
After obtaining the hidden state of the decoder, we can use an output layer and the softmax
operation to compute the predictive distribution p(yt′ +1 | y1 , . . . , yt′ , c) over the subsequent
output token t′ + 1.
Following Fig.10.7.1, when implementing the decoder as follows, we directly use the hidden
state at the final time step of the encoder to initialize the hidden state of the decoder. This
requires that the RNN encoder and the RNN decoder have the same number of layers and
hidden units. To further incorporate the encoded input sequence information, the context
variable is concatenated with the decoder input at all the time steps. To predict the probability
distribution of the output token, we use a fully connected layer to transform the hidden state
at the final layer of the RNN decoder.
class Seq2SeqDecoder(d2l.Decoder):
"""The RNN decoder for sequence to sequence learning."""
def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
dropout=0):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_size)
self.rnn = d2l.GRU(embed_size+num_hiddens, num_hiddens,
num_layers, dropout)
self.dense = nn.LazyLinear(vocab_size)
self.apply(init_seq2seq)
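The constructor above needs to be complemented by the state initialization and the forward computation. A sketch of these methods, following the description in this section (and the add_to_class pattern used earlier), might look like this:

@d2l.add_to_class(Seq2SeqDecoder)
def init_state(self, enc_all_outputs, *args):
    # Use the encoder's (outputs, state) tuple directly as the decoder state
    return enc_all_outputs

@d2l.add_to_class(Seq2SeqDecoder)
def forward(self, X, state):
    # X shape: (batch_size, num_steps)
    embs = self.embedding(X.t().type(torch.int64))
    # embs shape: (num_steps, batch_size, embed_size)
    enc_output, hidden_state = state
    # context shape: (batch_size, num_hiddens); broadcast it to every step
    context = enc_output[-1].repeat(embs.shape[0], 1, 1)
    # Concatenate at the feature dimension
    embs_and_context = torch.cat((embs, context), -1)
    outputs, hidden_state = self.rnn(embs_and_context, hidden_state)
    outputs = self.dense(outputs).swapaxes(0, 1)
    # outputs shape: (batch_size, num_steps, vocab_size)
    # hidden_state shape: (num_layers, batch_size, num_hiddens)
    return outputs, hidden_state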
To illustrate the implemented decoder, below we instantiate it with the same hyperparameters
from the aforementioned encoder. As we can see, the output shape of the decoder becomes
(batch size, number of time steps, vocabulary size), where the last dimension of the tensor
stores the predicted token distribution.
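Continuing the toy example from the encoder above, a quick shape check might look as follows:

decoder = Seq2SeqDecoder(vocab_size, embed_size, num_hiddens, num_layers)
state = decoder.init_state(encoder(X))
dec_outputs, new_state = decoder(X, state)
d2l.check_shape(dec_outputs, (batch_size, num_steps, vocab_size))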
To summarize, the layers in the above RNN encoder-decoder model are illustrated in Fig.10.7.2.
Figure 10.7.2 Layers in an RNN encoder-decoder model.
Combining the encoder and decoder gives the complete sequence-to-sequence model, which also specifies its optimizer:

class Seq2Seq(d2l.EncoderDecoder):  #@save
    """The RNN encoder-decoder for sequence to sequence learning."""
    def __init__(self, encoder, decoder, tgt_pad, lr):
        super().__init__(encoder, decoder)
        self.save_hyperparameters()

    def configure_optimizers(self):
        # Adam optimizer is used here
        return torch.optim.Adam(self.parameters(), lr=self.lr)
At each time step, the decoder predicts a probability distribution for the output tokens. As
with language modeling, we can apply softmax to obtain the distribution and calculate the
cross-entropy loss for optimization. Recall from Section 10.5 that the special padding tokens are
appended to the end of sequences so sequences of varying lengths can be efficiently loaded in
minibatches of the same shape. However, prediction of padding tokens should be excluded
from loss calculations. To this end, we can mask irrelevant entries with zero values so that
multiplication of any irrelevant prediction with zero equals zero.
@d2l.add_to_class(Seq2Seq)
def loss(self, Y_hat, Y):
l = super(Seq2Seq, self).loss(Y_hat, Y, averaged=False)
mask = (Y.reshape(-1) != self.tgt_pad).type(torch.float32)
return (l * mask).sum() / mask.sum()
10.7.6 Training
Now we can create and train an RNN encoder-decoder model for sequence to sequence learn-
ing on the machine translation dataset.
data = d2l.MTFraEng(batch_size=128)
embed_size, num_hiddens, num_layers, dropout = 256, 256, 2, 0.2
encoder = Seq2SeqEncoder(
len(data.src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(
len(data.tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = Seq2Seq(encoder, decoder, tgt_pad=data.tgt_vocab['<pad>'],
lr=0.005)
trainer = d2l.Trainer(max_epochs=30, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
10.7.7 Prediction
To predict the output sequence at each step, the predicted token from the previous time step
is fed into the decoder as an input. One simple strategy is to sample whichever token the
decoder has assigned the highest probability when predicting at each step. As in training, at
the initial time step the beginning-of-sequence (“<bos>”) token is fed into the decoder. This
prediction process is illustrated in Fig.10.7.3. When the end-of-sequence (“<eos>”) token is
predicted, the prediction of the output sequence is complete.
Figure 10.7.3 Predicting the output sequence token by token using an RNN encoder-decoder.
In the next section, we will introduce more sophisticated strategies based on beam search
(Section 10.8).
@d2l.add_to_class(d2l.EncoderDecoder) #@save
def predict_step(self, batch, device, num_steps,
save_attention_weights=False):
batch = [a.to(device) for a in batch]
src, tgt, src_valid_len, _ = batch
enc_all_outputs = self.encoder(src, src_valid_len)
dec_state = self.decoder.init_state(enc_all_outputs, src_valid_len)
outputs, attention_weights = [tgt[:, 0].unsqueeze(1), ], []
for _ in range(num_steps):
Y, dec_state = self.decoder(outputs[-1], dec_state)
outputs.append(Y.argmax(2))
# Save attention weights (to be covered later)
if save_attention_weights:
attention_weights.append(self.decoder.attention_weights)
return torch.cat(outputs[1:], 1), attention_weights
We can evaluate a predicted sequence by comparing it with the target sequence (the ground-
truth). But what precisely is the appropriate measure for comparing similarity between two
sequences?
BLEU (Bilingual Evaluation Understudy), though originally proposed for evaluating machine
translation results (Papineni et al., 2002), has been extensively used in measuring the quality
of output sequences for different applications. In principle, for any n-gram in the predicted
sequence, BLEU evaluates whether this n-gram appears in the target sequence.
Denote by pn the precision of n-grams, which is the ratio of the number of matched n-grams
in the predicted and target sequences to the number of n-grams in the predicted sequence.
To explain, given a target sequence A, B, C, D, E, F , and a predicted sequence A, B, B, C,
D, we have p1 = 4/5, p2 = 3/4, p3 = 1/3, and p4 = 0. Besides, let lenlabel and lenpred be
the numbers of tokens in the target sequence and the predicted sequence, respectively. Then,
BLEU is defined as
exp(min(0, 1 − lenlabel / lenpred )) ∏n=1..k pn^(1/2^n) ,   (10.7.4)

where k is the longest n-gram considered for matching.
Based on the definition of BLEU in (10.7.4), whenever the predicted sequence is the same as
the target sequence, BLEU is 1. Moreover, since matching longer n-grams is more difficult,
BLEU assigns a greater weight to a longer n-gram precision. Specifically, when pn is fixed,
pn^(1/2^n) increases as n grows (the original paper uses pn^(1/n) ). Furthermore, since predicting
shorter sequences tends to obtain a higher pn value, the coefficient before the multiplication
term in (10.7.4) penalizes shorter predicted sequences. For example, when k = 2, given the
target sequence A, B, C, D, E, F and the predicted sequence A, B, although p1 = p2 = 1,
the penalty factor exp(1 − 6/2) ≈ 0.14 lowers the BLEU.
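The bleu helper used in the evaluation code below is not shown elsewhere; a sketch that directly implements (10.7.4) is:

def bleu(pred_seq, label_seq, k):  #@save
    """Compute the BLEU."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, min(k, len_pred) + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        # Count the n-grams of the label sequence
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        # Clip matches of predicted n-grams by the label counts
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score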
In the end, we use the trained RNN encoder-decoder to translate a few English sentences into
French and compute the BLEU of the results.
engs = ['go .', 'i lost .', 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
preds, _ = model.predict_step(
    data.build(engs, fras), d2l.try_gpu(), data.num_steps)
for en, fr, p in zip(engs, fras, preds):
    translation = []
    for token in data.tgt_vocab.to_tokens(p):
        if token == '<eos>':
            break
        translation.append(token)
    print(f'{en} => {translation}, bleu,'
          f'{bleu(" ".join(translation), fr, k=2):.3f}')
10.7.9 Summary
Following the design of the encoder-decoder architecture, we can use two RNNs to design
a model for sequence-to-sequence learning. In encoder-decoder training, the teacher forcing
approach feeds the original output sequences (rather than the model's predictions) into the decoder. When
implementing the encoder and the decoder, we can use multilayer RNNs. We can use masks
to filter out irrelevant computations, such as when calculating the loss. As for evaluating output
sequences, BLEU is a popular measure that works by matching n-grams between the predicted sequence
and the target sequence.
10.7.10 Exercises
2. Rerun the experiment without using masks in the loss calculation. What results do you
observe? Why?
3. If the encoder and the decoder differ in the number of layers or the number of hidden
units, how can we initialize the hidden state of the decoder?
4. In training, replace teacher forcing with feeding the prediction at the previous time step
into the decoder. How does this influence the performance?
6. Are there any other ways to design the output layer of the decoder?
Discussions
10.8 Beam Search
In Section 10.7, we introduced the encoder-decoder architecture, and the standard techniques
for training them end-to-end. However, when it came to test-time prediction, we mentioned
only the greedy strategy, where we select at each time step the token given the highest pre-
dicted probability of coming next, until, at some time step, we find that we have predicted
the special end-of-sequence “<eos>” token. In this section, we will begin by formalizing this
greedy search strategy and identifying some problems that practitioners tend to run into. Sub-
sequently, we compare this strategy with two alternatives: exhaustive search (illustrative but
not practical) and beam search (the standard method in practice).
Let's begin by setting up our mathematical notation, borrowing conventions from Section
10.7. At any time step $t'$, the decoder outputs predictions representing the probability of each
token in the vocabulary coming next in the sequence (the likely value of $y_{t'+1}$), conditioned
on the previous tokens $y_1, \ldots, y_{t'}$ and the context variable $\mathbf{c}$ produced by the encoder to
represent the input sequence. To quantify computational cost, denote by $\mathcal{Y}$ the output vocabulary
(including the special end-of-sequence token “<eos>”). Let's also specify the maximum
number of tokens of an output sequence as $T'$. Our goal is to search for an ideal output from
all $O(|\mathcal{Y}|^{T'})$ possible output sequences. Note that this slightly overestimates the number of
distinct outputs because there are no subsequent tokens after the “<eos>” token occurs. However,
for our purposes, this number roughly captures the size of the search space.
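As a concrete reference point for the comparison that follows, here is a minimal sketch of the greedy strategy in this notation. The helper step_fn, which maps the tokens generated so far to a probability distribution over the vocabulary for the next token, is an assumption for illustration rather than anything defined in the text.

import torch

def greedy_decode(step_fn, bos_idx, eos_idx, num_steps):
    # Start from <bos> and repeatedly pick the most probable next token
    outputs = [bos_idx]
    for _ in range(num_steps):
        probs = step_fn(outputs)            # distribution over the vocabulary
        next_token = int(torch.argmax(probs))
        outputs.append(next_token)
        if next_token == eos_idx:           # stop once <eos> has been predicted
            break
    return outputs[1:]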
If you read the book in sequence up to this point you already used a number of optimization
algorithms to train deep learning models. They were the tools that allowed us to continue up-
dating model parameters and to minimize the value of the loss function, as evaluated on the
training set. Indeed, anyone content to treat optimization as a black-box device for minimizing
objective functions in a simple setting might well settle for the knowledge
that there exists an array of incantations of such procedures (with names such as “SGD” and
“Adam”).
To do well, however, some deeper knowledge is required. Optimization algorithms are im-
portant for deep learning. On the one hand, training a complex deep learning model can take
hours, days, or even weeks. The performance of the optimization algorithm directly affects
the model’s training efficiency. On the other hand, understanding the principles of differ-
ent optimization algorithms and the role of their hyperparameters will enable us to tune the
hyperparameters in a targeted manner to improve the performance of deep learning mod-
els.
In this chapter, we explore common deep learning optimization algorithms in depth. Almost
all optimization problems arising in deep learning are nonconvex. Nonetheless, the design and
analysis of algorithms in the context of convex problems have proven to be very instructive. It
is for that reason that this chapter includes a primer on convex optimization and the proof for
a very simple stochastic gradient descent algorithm on a convex objective function.
12.1 Optimization and Deep Learning
In this section, we will discuss the relationship between optimization and deep learning as well
as the challenges of using optimization in deep learning. For a deep learning problem, we will
usually define a loss function first. Once we have the loss function, we can use an optimization
algorithm in an attempt to minimize the loss. In optimization, a loss function is often referred
to as the objective function of the optimization problem. By tradition and convention most
optimization algorithms are concerned with minimization. If we ever need to maximize an
objective there is a simple solution: just flip the sign on the objective.
Although optimization provides a way to minimize the loss function for deep learning, in
essence, the goals of optimization and deep learning are fundamentally different. The for-
mer is primarily concerned with minimizing an objective whereas the latter is concerned
with finding a suitable model, given a finite amount of data. In Section 3.6, we discussed the
difference between these two goals in detail. For instance, training error and generalization
error generally differ: since the objective function of the optimization algorithm is usually a
loss function based on the training dataset, the goal of optimization is to reduce the training
error. However, the goal of deep learning (or more broadly, statistical inference) is to reduce
the generalization error. To accomplish the latter we need to pay attention to overfitting in
addition to using the optimization algorithm to reduce the training error.
%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l
To illustrate the aforementioned different goals, let’s consider the empirical risk and the risk.
As described in Section 4.7.3, the empirical risk is an average loss on the training dataset
while the risk is the expected loss on the entire population of data. Below we define two
functions: the risk function f and the empirical risk function g. Suppose that we have only a
finite amount of training data. As a result, here g is less smooth than f.
def f(x):
    return x * torch.cos(np.pi * x)

def g(x):
    return f(x) + 0.2 * torch.cos(5 * np.pi * x)
The graph below illustrates that the minimum of the empirical risk on a training dataset may
be at a different location from the minimum of the risk (generalization error).
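The plotting code is not included in this excerpt; the sketch below is one way to produce such a graph. The annotate helper and the exact coordinates of the labels are assumptions chosen to roughly mark the two minima of f and g defined above.

def annotate(text, xy, xytext):  # small helper for labeling points on the plot
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))

x = torch.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))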
In this chapter, we are going to focus specifically on the performance of optimization algo-
rithms in minimizing the objective function, rather than a model’s generalization error. In
Section 3.1 we distinguished between analytical solutions and numerical solutions in opti-
mization problems. In deep learning, most objective functions are complicated and do not
have analytical solutions. Instead, we must use numerical optimization algorithms. The opti-
mization algorithms in this chapter all fall into this category.
There are many challenges in deep learning optimization. Some of the most vexing ones are
local minima, saddle points, and vanishing gradients. Let’s have a look at them.
Local Minima
For any objective function f (x), if the value of f (x) at x is smaller than the values of f (x)
at any other points in the vicinity of x, then f (x) could be a local minimum. If the value of
f (x) at x is the minimum of the objective function over the entire domain, then f (x) is the
global minimum.
For example, given the function $f(x) = x \cdot \cos(\pi x)$ for $-1.0 \leq x \leq 2.0$, we can approximate the local minimum and global minimum of this function.
The objective function of deep learning models usually has many local optima. When the numerical
solution of an optimization problem is near a local optimum, the solution
obtained by the final iteration may only minimize the objective function locally, rather than
globally, because the gradient of the objective function approaches or becomes zero there.
Only some degree of noise might knock the parameter out of the local minimum. In fact,
this is one of the beneficial properties of minibatch stochastic gradient descent where the
natural variation of gradients over minibatches is able to dislodge the parameters from local
minima.
Saddle Points
Besides local minima, saddle points are another reason for gradients to vanish. A saddle point
is any location where all gradients of a function vanish but which is neither a global nor a local
minimum. Consider the function f (x) = x3 . Its first and second derivative vanish for x = 0.
Optimization might stall at this point, even though it is not a minimum.
Saddle points in higher dimensions are even more insidious, as the example below shows.
Consider the function f (x, y) = x2 − y 2 . It has its saddle point at (0, 0). This is a maximum
with respect to y and a minimum with respect to x. Moreover, it looks like a saddle, which
is where this mathematical property got its name.
x, y = torch.meshgrid(
    torch.linspace(-1.0, 1.0, 101), torch.linspace(-1.0, 1.0, 101))
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');
We assume that the input of a function is a k-dimensional vector and its output is a scalar,
so its Hessian matrix will have k eigenvalues. The solution of the function could be a local
minimum, a local maximum, or a saddle point at a position where the function gradient is
zero:
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all
positive, we have a local minimum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all
negative, we have a local maximum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are
both negative and positive, we have a saddle point for the function (as the sketch below illustrates).
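A minimal sketch of this test on the saddle $f(x, y) = x^2 - y^2$ from above; using torch.autograd.functional.hessian is simply one convenient way to obtain the Hessian here.

import torch
from torch.autograd.functional import hessian

def f(v):  # f(x, y) = x^2 - y^2, with a zero gradient at (0, 0)
    return v[0] ** 2 - v[1] ** 2

H = hessian(f, torch.zeros(2))       # Hessian at the candidate point (0, 0)
eigvals = torch.linalg.eigvalsh(H)   # eigenvalues of the symmetric Hessian
print(eigvals)                       # tensor([-2., 2.]): mixed signs, so (0, 0) is a saddle point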
For high-dimensional problems the likelihood that at least some of the eigenvalues are negative
is quite high. This makes saddle points more likely than local minima. We will discuss some
exceptions to this situation in the next section when introducing convexity. In short, convex
functions are those where the eigenvalues of the Hessian are never negative. Sadly, though,
most deep learning problems do not fall into this category. Nonetheless, convexity is a great
tool for studying optimization algorithms.
Vanishing Gradients
Probably the most insidious problem to encounter is the vanishing gradient. Recall our commonly-
used activation functions and their derivatives in Section 5.1.2. For instance, assume that we
want to minimize the function f (x) = tanh(x) and we happen to get started at x = 4. As
we can see, the gradient of f is close to nil. More specifically, f ′ (x) = 1 − tanh2 (x) and
thus f ′ (4) = 0.0013. Consequently, optimization will get stuck for a long time before we
make progress. This turns out to be one of the reasons that training deep learning models was
quite tricky prior to the introduction of the ReLU activation function.
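A quick check of this value with automatic differentiation (a small sketch; any framework's autograd would do):

import torch

x = torch.tensor(4.0, requires_grad=True)
y = torch.tanh(x)
y.backward()
print(f"f'(4) = {x.grad.item():.4f}")  # roughly 0.0013, i.e., the gradient has nearly vanished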
As we saw, optimization for deep learning is full of challenges. Fortunately there exists a
robust range of algorithms that perform well and that are easy to use even for beginners. Fur-
thermore, it is not really necessary to find the best solution. Local optima or even approximate
solutions thereof are still very useful.
12.1.3 Summary
• Minimizing the training error does not guarantee that we find the best set of parameters to
minimize the generalization error.
• The problem may have even more saddle points, as generally the problems are not convex.
12.1.4 Exercises
1. Consider a simple MLP with a single hidden layer of, say, d dimensions in the hidden
layer and a single output. Show that for any local minimum there are at least d! equivalent
solutions that behave identically.
2. Assume that we have a symmetric random matrix M where the entries Mij = Mji are
each drawn from some probability distribution pij . Furthermore assume that pij (x) =
pij (−x), i.e., that the distribution is symmetric (see e.g., Wigner (1958) for details).
1. Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvec-
tor v the probability that the associated eigenvalue λ satisfies P (λ > 0) = P (λ < 0).
2. Why does the above not imply P (λ > 0) = 0.5?
3. What other challenges involved in deep learning optimization can you think of?
4. Assume that you want to balance a (real) ball on a (real) saddle.
1. Why is this hard?
2. Can you exploit this effect also for optimization algorithms?
Discussions
12.2 Convexity
Convexity plays a vital role in the design of optimization algorithms. This is largely due to
the fact that it is much easier to analyze and test algorithms in such a context. In other words,
if the algorithm performs poorly even in the convex setting, typically we should not hope
to see great results otherwise. Furthermore, even though the optimization problems in deep
learning are generally nonconvex, they often exhibit some properties of convex ones near
local minima. This can lead to exciting new optimization variants such as stochastic weight
averaging (Izmailov et al., 2018).
%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l
12.2.1 Definitions
Before convex analysis, we need to define convex sets and convex functions. They lead to
mathematical tools that are commonly applied to machine learning.
Convex Sets
Sets are the basis of convexity. Simply put, a set $\mathcal{X}$ in a vector space is convex if for any
$a, b \in \mathcal{X}$ the line segment connecting $a$ and $b$ is also in $\mathcal{X}$. In mathematical terms this means
that for all $\lambda \in [0, 1]$ we have

$$\lambda a + (1 - \lambda) b \in \mathcal{X} \quad \text{whenever } a, b \in \mathcal{X}. \tag{12.2.1}$$
This sounds a bit abstract. Consider Fig.12.2.1. The first set is not convex since there exist
line segments that are not contained in it. The other two sets suffer no such problem.
Figure 12.2.1 The first set is nonconvex and the other two are convex.
Definitions on their own are not particularly useful unless you can do something with them.
In this case we can look at intersections as shown in Fig.12.2.2. Assume that X and Y are
convex sets. Then X ∩ Y is also convex. To see this, consider any a, b ∈ X ∩ Y. Since X and
Y are convex, the line segments connecting a and b are contained in both X and Y. Given
that, they also need to be contained in X ∩ Y, thus proving our theorem.
Figure 12.2.2 The intersection between two convex sets is convex.
We can strengthen this result with little effort: given convex sets Xi , their intersection ∩i Xi
is convex. To see that the converse is not true, consider two disjoint sets X ∩ Y = ∅. Now
pick a ∈ X and b ∈ Y. The line segment in Fig.12.2.3 connecting a and b needs to contain
some part that is neither in X nor in Y, since we assumed that X ∩ Y = ∅. Hence the line
segment is not in X ∪ Y either, thus proving that in general unions of convex sets need not
be convex.
Figure 12.2.3 The union of two convex sets need not be convex.
Typically the problems in deep learning are defined on convex sets. For instance, Rd , the set
of d-dimensional vectors of real numbers, is a convex set (after all, the line between any two
points in Rd remains in Rd ). In some cases we work with variables of bounded length, such
as balls of radius r as defined by {x|x ∈ Rd and ∥x∥ ≤ r}.
Convex Functions
Now that we have convex sets we can introduce convex functions $f$. Given a convex set $\mathcal{X}$, a
function $f: \mathcal{X} \to \mathbb{R}$ is convex if for all $x, x' \in \mathcal{X}$ and for all $\lambda \in [0, 1]$ we have

$$\lambda f(x) + (1 - \lambda) f(x') \geq f(\lambda x + (1 - \lambda) x'). \tag{12.2.2}$$
To illustrate this let’s plot a few functions and check which ones satisfy the requirement.
Below we define a few functions, both convex and nonconvex.
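One possible set of such functions is sketched below; the particular choices (a parabola, a cosine, and an exponential) and the sampled segment are assumptions for illustration. Note that x and segment defined here are reused by a later plot in this section.

f = lambda x: 0.5 * x**2            # Convex
g = lambda x: torch.cos(np.pi * x)  # Nonconvex
h = lambda x: torch.exp(0.5 * x)    # Convex

x, segment = torch.arange(-2, 2, 0.01), torch.tensor([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)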
As expected, the cosine function is nonconvex, whereas the parabola and the exponential
function are. Note that the requirement that X is a convex set is necessary for the condition
to make sense. Otherwise the outcome of f (λx+(1−λ)x′ ) might not be well defined.
Jensen’s Inequality
Given a convex function f , one of the most useful mathematical tools is Jensen’s inequality.
It amounts to a generalization of the definition of convexity:
$$\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right) \quad \text{and} \quad E_X[f(X)] \geq f(E_X[X]), \tag{12.2.3}$$

where $\alpha_i$ are nonnegative real numbers such that $\sum_i \alpha_i = 1$ and $X$ is a random variable. In
other words, the expectation of a convex function is no less than the convex function of an
expectation, where the latter is usually a simpler expression. To prove the first inequality we
repeatedly apply the definition of convexity to one term in the sum at a time.
One of the common applications of Jensen's inequality is to bound a more complicated expression
by a simpler one. For example, its application can be with regard to the log-likelihood
of partially observed random variables. That is, we use

$$E_{Y \sim P(Y)}[-\log P(X \mid Y)] \geq -\log P(X), \tag{12.2.4}$$

since $\int P(Y) P(X \mid Y)\, dY = P(X)$.
12.2.2 Properties
Convex functions have many useful properties. We describe a few commonly-used ones be-
low.
First and foremost, the local minima of convex functions are also the global minima. We can
prove it by contradiction as follows.
Consider a convex function f defined on a convex set X . Suppose that x∗ ∈ X is a local
minimum: there exists a small positive value p so that for x ∈ X that satisfies 0 < |x − x∗ | ≤
p we have f (x∗ ) < f (x).
Assume that the local minimum $x^*$ is not the global minimum of $f$: there exists $x' \in \mathcal{X}$
for which $f(x') < f(x^*)$. There also exists $\lambda \in [0, 1)$, such as $\lambda = 1 - \frac{p}{|x^* - x'|}$, so that
$0 < |\lambda x^* + (1 - \lambda) x' - x^*| \leq p$. However, by the definition of convex functions,

$$f(\lambda x^* + (1 - \lambda) x') \leq \lambda f(x^*) + (1 - \lambda) f(x') < \lambda f(x^*) + (1 - \lambda) f(x^*) = f(x^*),$$

which contradicts our assumption that $x^*$ is a local minimum. Therefore, the local minimum $x^*$ is also the global minimum.
f = lambda x: (x - 1) ** 2
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')
The fact that the local minima for convex functions are also the global minima is very convenient.
It means that if we minimize functions we cannot “get stuck”. Note, though, that this
does not mean that there cannot be more than one global minimum, or even that a global minimum
is guaranteed to exist. For instance, the function f (x) = max(|x| − 1, 0) attains its minimum value over
the entire interval [−1, 1]. Conversely, the function f (x) = exp(x) does not attain a minimum
value on R: for x → −∞ it asymptotes to 0, but there is no x for which f (x) = 0.
We can conveniently define convex sets via below sets of convex functions. Concretely, given
a convex function $f$ defined on a convex set $\mathcal{X}$, any below set

$$\mathcal{S}_b := \{x \mid x \in \mathcal{X} \text{ and } f(x) \leq b\} \tag{12.2.5}$$

is convex.
Let's prove this quickly. Recall that for any $x, x' \in \mathcal{S}_b$ we need to show that $\lambda x + (1 - \lambda) x' \in \mathcal{S}_b$
as long as $\lambda \in [0, 1]$. Since $f(x) \leq b$ and $f(x') \leq b$, by the definition of convexity we
have

$$f(\lambda x + (1 - \lambda) x') \leq \lambda f(x) + (1 - \lambda) f(x') \leq b. \tag{12.2.6}$$
Whenever the second derivative of a function $f: \mathbb{R} \to \mathbb{R}$ exists, $f$ is convex if and only if $f'' \geq 0$
(and, in the multivariate case, if and only if the Hessian is positive semidefinite).
First, we need to prove the one-dimensional case. To see that convexity of $f$ implies $f'' \geq 0$
we use the fact that
$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \geq f\left(\frac{x + \epsilon}{2} + \frac{x - \epsilon}{2}\right) = f(x). \tag{12.2.8}$$
Since the second derivative is given by the limit over finite differences it follows that
$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) + f(x - \epsilon) - 2 f(x)}{\epsilon^2} \geq 0. \tag{12.2.9}$$
To see that f ′′ ≥ 0 implies that f is convex we use the fact that f ′′ ≥ 0 implies that f ′
is a monotonically nondecreasing function. Let a < x < b be three points in R, where
x = (1 − λ)a + λb and λ ∈ (0, 1). According to the mean value theorem, there exist
α ∈ [a, x] and β ∈ [x, b] such that
$$f'(\alpha) = \frac{f(x) - f(a)}{x - a} \quad \text{and} \quad f'(\beta) = \frac{f(b) - f(x)}{b - x}. \tag{12.2.10}$$
By monotonicity f ′ (β) ≥ f ′ (α), hence
$$\frac{x - a}{b - a} f(b) + \frac{b - x}{b - a} f(a) \geq f(x). \tag{12.2.11}$$
Since $x = (1 - \lambda) a + \lambda b$, we have

$$\lambda f(b) + (1 - \lambda) f(a) \geq f((1 - \lambda) a + \lambda b), \tag{12.2.12}$$

thus proving that $f$ is convex.
Second, we need a lemma before proving the multi-dimensional case: $f: \mathbb{R}^n \to \mathbb{R}$ is convex
if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ the one-dimensional function $g(z) \stackrel{\textrm{def}}{=} f(z \mathbf{x} + (1 - z) \mathbf{y})$,
where $z \in [0, 1]$, is convex. To prove that convexity of $f$ implies that $g$ is convex, we can show
that for all $a, b, \lambda \in [0, 1]$ (thus $0 \leq \lambda a + (1 - \lambda) b \leq 1$)
$$\begin{aligned} g(\lambda a + (1 - \lambda) b) &= f\left((\lambda a + (1 - \lambda) b)\, \mathbf{x} + (1 - \lambda a - (1 - \lambda) b)\, \mathbf{y}\right) \\ &= f\left(\lambda (a \mathbf{x} + (1 - a) \mathbf{y}) + (1 - \lambda)(b \mathbf{x} + (1 - b) \mathbf{y})\right) \\ &\leq \lambda f(a \mathbf{x} + (1 - a) \mathbf{y}) + (1 - \lambda) f(b \mathbf{x} + (1 - b) \mathbf{y}) \\ &= \lambda g(a) + (1 - \lambda) g(b). \end{aligned} \tag{12.2.14}$$
To prove the converse, we can show that for all $\lambda \in [0, 1]$,

$$\begin{aligned} f(\lambda \mathbf{x} + (1 - \lambda) \mathbf{y}) &= g(\lambda \cdot 1 + (1 - \lambda) \cdot 0) \\ &\leq \lambda g(1) + (1 - \lambda) g(0) \\ &= \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}). \end{aligned} \tag{12.2.15}$$
Finally, using the lemma above and the result of the one-dimensional case, the multi-dimensional
case can be proven as follows. A multi-dimensional function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if and
only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ the function $g(z) \stackrel{\textrm{def}}{=} f(z \mathbf{x} + (1 - z) \mathbf{y})$, where $z \in [0, 1]$, is convex. According
to the one-dimensional case, this holds if and only if $g'' = (\mathbf{x} - \mathbf{y})^\top \mathbf{H} (\mathbf{x} - \mathbf{y}) \geq 0$
(where $\mathbf{H} \stackrel{\textrm{def}}{=} \nabla^2 f$) for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, which is equivalent to $\mathbf{H} \succeq 0$ per the definition of positive
semidefinite matrices.
12.2.3 Constraints
One of the nice properties of convex optimization is that it allows us to handle constraints
efficiently. That is, it allows us to solve constrained optimization problems of the form:
$$\begin{aligned} \mathop{\mathrm{minimize}}_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & c_i(\mathbf{x}) \leq 0 \text{ for all } i \in \{1, \ldots, n\}, \end{aligned} \tag{12.2.16}$$
where f is the objective and the functions ci are constraint functions. To see what this does
consider the case where c1 (x) = ∥x∥2 − 1. In this case the parameters x are constrained to
the unit ball. If a second constraint is c2 (x) = v⊤ x + b, then this corresponds to all x lying
on a half-space. Satisfying both constraints simultaneously amounts to selecting a slice of a
ball.
Lagrangian
Skipping over the derivation of the Lagrangian L, the above reasoning can be expressed via
the following saddle point optimization problem:
$$L(\mathbf{x}, \alpha_1, \ldots, \alpha_n) = f(\mathbf{x}) + \sum_{i=1}^{n} \alpha_i c_i(\mathbf{x}) \quad \text{where } \alpha_i \geq 0. \tag{12.2.17}$$
Here the variables αi (i = 1, . . . , n) are the so-called Lagrange multipliers that ensure that
constraints are properly enforced. They are chosen just large enough to ensure that ci (x) ≤ 0
for all i. For instance, for any x where ci (x) < 0 naturally, we’d end up picking αi = 0.
Moreover, this is a saddle point optimization problem where one wants to maximize L with
respect to all αi and simultaneously minimize it with respect to x. There is a rich body of
literature explaining how to arrive at the function L(x, α1 , . . . , αn ). For our purposes it is
sufficient to know that the saddle point of L is where the original constrained optimization
problem is solved optimally.
Penalties
Rather than enforcing the constraints exactly, we can simply add a penalty $\alpha_i c_i(\mathbf{x})$ to the
objective function. In fact, we have been using this trick all along. Consider weight decay in Section 3.7. In it
we add $\frac{\lambda}{2} \|\mathbf{w}\|^2$ to the objective function to ensure that $\mathbf{w}$ does not grow too large. From the
constrained optimization point of view we can see that this will ensure that $\|\mathbf{w}\|^2 - r^2 \leq 0$
for some radius $r$. Adjusting the value of $\lambda$ allows us to vary the size of $\mathbf{w}$.
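A minimal sketch of this penalty trick; the squared loss, the parameter tensor, and the value of lambd are placeholders chosen for illustration.

import torch

lambd = 3e-3  # penalty strength; larger values keep w closer to the origin

def penalized_loss(y_hat, y, w):
    # Squared loss plus (lambda / 2) * ||w||^2, i.e., the constraint
    # ||w||^2 - r^2 <= 0 enforced softly via a penalty
    return ((y_hat - y) ** 2).mean() + (lambd / 2) * torch.sum(w ** 2)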
Projections
Another strategy for satisfying constraints is to use projections. We encountered them before,
for example when clipping gradients: ensuring that a gradient has length bounded by $\theta$ via
$\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, \theta / \|\mathbf{g}\|)$ turns out to be a projection of $\mathbf{g}$ onto the ball of radius $\theta$. More generally,
a projection onto a convex set $\mathcal{X}$ is defined as

$$\mathrm{Proj}_{\mathcal{X}}(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{x}' \in \mathcal{X}} \|\mathbf{x} - \mathbf{x}'\|, \tag{12.2.19}$$

which is the point in $\mathcal{X}$ closest to $\mathbf{x}$.
Figure 12.2.4 Convex Projections.
The mathematical definition of projections may sound a bit abstract. Fig.12.2.4 explains it
somewhat more clearly. In it we have two convex sets, a circle and a diamond. Points inside
both sets (yellow) remain unchanged during projections. Points outside both sets (black) are
projected to the points inside the sets (red) that are closest to the original points (black). While
for ℓ2 balls this leaves the direction unchanged, this need not be the case in general, as can
be seen in the case of the diamond.
One of the uses for convex projections is to compute sparse weight vectors. In this case we
project weight vectors onto an ℓ1 ball, which is a generalized version of the diamond case in
Fig.12.2.4.
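As a concrete illustration, here is a minimal sketch of the ℓ2-ball projection mentioned above; the radius and the test vector are arbitrary choices.

import torch

def project_onto_ball(x, theta=1.0):
    # Projection onto the l2 ball of radius theta, cf. (12.2.19):
    # points inside the ball are left unchanged, points outside are rescaled
    norm = torch.norm(x)
    return x if norm <= theta else theta * x / norm

g = torch.tensor([3.0, 4.0])
print(project_onto_ball(g, theta=1.0))  # tensor([0.6000, 0.8000])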
12.2.4 Summary
In the context of deep learning the main purpose of convex functions is to motivate opti-
mization algorithms and help us understand them in detail. In the following we will see how
gradient descent and stochastic gradient descent can be derived accordingly.
• The expectation of a convex function is no less than the convex function of an expectation
(Jensen’s inequality).
• Convex constraints can be added via the Lagrangian. In practice we may simply add them
with a penalty to the objective function.
• Projections map to points in the convex set closest to the original points.
12.2.5 Exercises
1. Assume that we want to verify convexity of a set by drawing all lines between points within
the set and checking whether the lines are contained.
3. Given convex functions f and g, show that max(f, g) is convex, too. Prove that min(f, g)
is not convex.
4. Prove that the normalization of the softmax function is convex. More specifically prove
the convexity of $f(\mathbf{x}) = \log \sum_i \exp(x_i)$.
5. Prove that linear subspaces, i.e., X = {x|Wx = b}, are convex sets.
6. Prove that in the case of linear subspaces with b = 0 the projection ProjX can be written
as Mx for some matrix M.
Discussions
12.3 Gradient Descent
In this section we are going to introduce the basic concepts underlying gradient descent. Al-
though it is rarely used directly in deep learning, an understanding of gradient descent is key
to understanding stochastic gradient descent algorithms. For instance, the optimization prob-
lem might diverge due to an overly large learning rate. This phenomenon can already be seen
in gradient descent. Likewise, preconditioning is a common technique in gradient descent
and carries over to more advanced algorithms. Let’s start with a simple special case.
12.3.1 One-Dimensional Gradient Descent
Gradient descent in one dimension is an excellent example to explain why the gradient descent
algorithm may reduce the value of the objective function. Consider some continuously
differentiable real-valued function $f: \mathbb{R} \to \mathbb{R}$. Using a Taylor expansion we obtain

$$f(x + \epsilon) = f(x) + \epsilon f'(x) + O(\epsilon^2). \tag{12.3.1}$$

That is, in first-order approximation $f(x + \epsilon)$ is given by the function value $f(x)$ and the first
derivative $f'(x)$ at $x$. It is not unreasonable to assume that for small $\epsilon$ moving in the direction
of the negative gradient will decrease $f$. To keep things simple we pick a fixed step size $\eta > 0$
and choose $\epsilon = -\eta f'(x)$. Plugging this into the Taylor expansion above we get

$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + O(\eta^2 f'^2(x)). \tag{12.3.2}$$
If the derivative $f'(x) \neq 0$ does not vanish we make progress since $\eta f'^2(x) > 0$. Moreover,
we can always choose $\eta$ small enough for the higher-order terms to become irrelevant. Hence
we arrive at

$$f(x - \eta f'(x)) \lessapprox f(x). \tag{12.3.3}$$

This means that, if we use

$$x \leftarrow x - \eta f'(x) \tag{12.3.4}$$

to iterate $x$, the value of the function $f(x)$ might decline. Therefore, in gradient descent we first
choose an initial value x and a constant η > 0 and then use them to continuously iterate x
until the stop condition is reached, for example, when the magnitude of the gradient |f ′ (x)|
is small enough or the number of iterations has reached a certain value.
For simplicity we choose the objective function f (x) = x2 to illustrate how to implement
gradient descent. Although we know that x = 0 is the solution to minimize f (x), we still use
this simple function to observe how x changes.
%matplotlib inline
import numpy as np
import torch
from d2l import torch as d2l
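The helper functions referenced by the calls that follow (the objective, its gradient, gd, and show_trace) are not included in this excerpt; the sketch below is one way to define them so that the subsequent cells run. The plotting details are incidental choices.

def f(x):  # Objective function
    return x ** 2

def f_grad(x):  # Gradient (derivative) of the objective function
    return 2 * x

def gd(eta, f_grad):
    # Run 10 gradient descent steps from the initial value x = 10
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    print(f'epoch 10, x: {x:f}')
    return results

def show_trace(results, f):
    # Plot the objective together with the visited points
    n = max(abs(min(results)), abs(max(results)))
    f_line = torch.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, results], [[f(x) for x in f_line], [f(x) for x in results]],
             'x', 'f(x)', fmts=['-', '-o'])

results = gd(0.2, f_grad)  # the run described in the text: eta = 0.2, 10 steps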
Next, we use x = 10 as the initial value and assume η = 0.2. Using gradient descent to
iterate x for 10 times we can see that, eventually, the value of x approaches the optimal
solution.
show_trace(results, f)
Learning Rate
The learning rate η can be set by the algorithm designer. If we use a learning rate that is too
small, it will cause x to update very slowly, requiring more iterations to get a better solution.
To show what happens in such a case, consider the progress in the same optimization problem
for η = 0.05. As we can see, even after 10 steps we are still very far from the optimal
solution.
show_trace(gd(0.05, f_grad), f)
Conversely, if we use an excessively high learning rate, |ηf ′ (x)| might be too large for the
first-order Taylor expansion formula. That is, the term O(η 2 f ′2 (x)) in (12.3.2) might become
significant. In this case, we cannot guarantee that the iteration of x will be able to lower the
value of f (x). For example, when we set the learning rate to η = 1.1, x overshoots the
optimal solution x = 0 and gradually diverges.
show_trace(gd(1.1, f_grad), f)
Local Minima
To illustrate what happens for nonconvex functions consider the case of f (x) = x·cos(cx) for
some constant c. This function has infinitely many local minima. Depending on our choice of
the learning rate and depending on how well conditioned the problem is, we may end up with
one of many solutions. The example below illustrates how an (unrealistically) high learning
rate will lead to a poor local minimum.
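The objective f(x) = x·cos(cx) and its gradient are again assumed; a sketch consistent with the cell below (which sets the constant c) is:

def f(x):  # Objective function: f(x) = x * cos(c * x); c is set in the next cell
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient: cos(c x) - c x sin(c x)
    return torch.cos(c * x) - c * x * torch.sin(c * x)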
c = torch.tensor(0.15 * np.pi)
show_trace(gd(2, f_grad), f)
12.3.2 Multivariate Gradient Descent
Now that we have a better intuition of the univariate case, let's consider the situation where
$\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top$. That is, the objective function $f: \mathbb{R}^d \to \mathbb{R}$ maps vectors into scalars,
and its gradient $\nabla f(\mathbf{x})$ is a vector of $d$ partial derivatives. As in the univariate case, a Taylor
expansion gives $f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + O(\|\boldsymbol{\epsilon}\|^2)$.
In other words, up to second-order terms in $\boldsymbol{\epsilon}$ the direction of steepest descent is given by the
negative gradient $-\nabla f(\mathbf{x})$. Choosing a suitable learning rate $\eta > 0$ yields the prototypical
gradient descent algorithm:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}). \tag{12.3.7}$$
To see how the algorithm behaves in practice let’s construct an objective function f (x) =
x21 + 2x22 with a two-dimensional vector x = [x1 , x2 ]⊤ as input and a scalar as output. The
gradient is given by ∇f (x) = [2x1 , 4x2 ]⊤ . We will observe the trajectory of x by gradient
descent from the initial position [−5, −2].
To begin with, we need two more helper functions. The first uses an update function and
applies it 20 times to the initial value. The second helper visualizes the trajectory of x.
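A sketch of the objective, its gradient, and the update rule used below. The two helpers described above are assumed to be the ones provided by the d2l package (d2l.train_2d and d2l.show_trace_2d, which the later stochastic gradient descent examples also use); the aliases simply keep the calls below short.

def f_2d(x1, x2):  # Objective function: f(x) = x1^2 + 2 * x2^2
    return x1 ** 2 + 2 * x2 ** 2

def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)

def gd_2d(x1, x2, s1, s2, f_grad):  # One gradient descent update; s1, s2 are unused state
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)

# Assumed helpers from the d2l package: train_2d applies the update for 20 steps
# from the initial position [-5, -2], show_trace_2d plots the trajectory.
train_2d, show_trace_2d = d2l.train_2d, d2l.show_trace_2d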
Next, we observe the trajectory of the optimization variable x for learning rate η = 0.1. We
can see that after 20 steps the value of x approaches its minimum at [0, 0]. Progress is fairly
well-behaved albeit rather slow.
eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))
12.3.3 Adaptive Methods
As we could see in Section 12.3.1, getting the learning rate η “just right” is tricky. If we
pick it too small, we make little progress. If we pick it too large, the solution oscillates and
in the worst case it might even diverge. What if we could determine η automatically or get
rid of having to select a learning rate at all? Second-order methods that look not only at the
value and gradient of the objective function but also at its curvature can help in this case.
While these methods cannot be applied to deep learning directly due to the computational
cost, they provide useful intuition into how to design advanced optimization algorithms that
mimic many of the desirable properties of the algorithms outlined below.
Newton’s Method
Reviewing the Taylor expansion of some function f : Rd → R there is no need to stop after
the first term. In fact, we can write it as
$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x})\, \boldsymbol{\epsilon} + O(\|\boldsymbol{\epsilon}\|^3). \tag{12.3.8}$$
To avoid cumbersome notation we define $\mathbf{H} \stackrel{\textrm{def}}{=} \nabla^2 f(\mathbf{x})$ to be the Hessian of $f$, which is
a d × d matrix. For small d and simple problems H is easy to compute. For deep neural
networks, on the other hand, H may be prohibitively large, due to the cost of storing O(d2 )
entries. Furthermore it may be too expensive to compute via backpropagation. For now let’s
ignore such considerations and look at what algorithm we would get.
After all, the minimum of $f$ satisfies $\nabla f = 0$. Following the calculus rules in Section 2.4.3,
by taking derivatives of (12.3.8) with regard to $\boldsymbol{\epsilon}$ and ignoring higher-order terms we arrive
at

$$\nabla f(\mathbf{x}) + \mathbf{H} \boldsymbol{\epsilon} = 0 \quad \text{and hence} \quad \boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x}). \tag{12.3.9}$$

That is, we need to invert the Hessian $\mathbf{H}$ as part of the optimization problem.
As a simple example, for $f(x) = \frac{1}{2} x^2$ we have $\nabla f(x) = x$ and $\mathbf{H} = 1$. Hence for any $x$
we obtain $\epsilon = -x$. In other words, a single step is sufficient to converge perfectly without the
need for any adjustment! Alas, we got a bit lucky here: the Taylor expansion was exact since
$f(x + \epsilon) = \frac{1}{2} x^2 + \epsilon x + \frac{1}{2} \epsilon^2$.
Let’s see what happens in other problems. Given a convex hyperbolic cosine function f (x) =
cosh(cx) for some constant c, we can see that the global minimum at x = 0 is reached after
a few iterations.
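The objective, gradient, and Hessian for this example are not shown in the excerpt; a sketch consistent with the cell below (which sets c) is:

def f(x):  # Objective function: f(x) = cosh(c * x); c is set in the next cell
    return torch.cosh(c * x)

def f_grad(x):  # Gradient of the objective function
    return c * torch.sinh(c * x)

def f_hess(x):  # Hessian (second derivative) of the objective function
    return c**2 * torch.cosh(c * x)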
c = torch.tensor(0.5)

def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results

show_trace(newton(), f)
Now let’s consider a nonconvex function, such as f (x) = x cos(cx) for some constant c.
After all, note that in Newton’s method we end up dividing by the Hessian. This means that
if the second derivative is negative we may walk into the direction of increasing the value of
f . That is a fatal flaw of the algorithm. Let’s see what happens in practice.
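As before, the objective and its derivatives are assumed; one sketch matching f(x) = x·cos(cx), with c set in the cell below, is:

def f(x):  # Objective function: f(x) = x * cos(c * x)
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return -2 * c * torch.sin(c * x) - x * c**2 * torch.cos(c * x)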
c = torch.tensor(0.15 * np.pi)
show_trace(newton(), f)
This went spectacularly wrong. How can we fix it? One way would be to “fix” the Hessian
by taking its absolute value instead. Another strategy is to bring back the learning rate. This
seems to defeat the purpose, but not quite. Having second-order information allows us to
be cautious whenever the curvature is large and to take longer steps whenever the objective
function is flatter. Let’s see how this works with a slightly smaller learning rate, say η = 0.5.
As we can see, we have quite an efficient algorithm.
show_trace(newton(0.5), f)
Convergence Analysis
We only analyze the convergence rate of Newton’s method for some convex and three times
differentiable objective function f , where the second derivative is nonzero, i.e., f ′′ > 0. The
multivariate proof is a straightforward extension of the one-dimensional argument below and
omitted since it does not help us much in terms of intuition.
Denote by $x^{(k)}$ the value of $x$ at the $k$-th iteration and let $e^{(k)} \stackrel{\textrm{def}}{=} x^{(k)} - x^*$ be the distance from
optimality at the $k$-th iteration. By Taylor expansion we have that the condition $f'(x^*) = 0$
can be written as
$$0 = f'(x^{(k)} - e^{(k)}) = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \frac{1}{2} (e^{(k)})^2 f'''(\xi^{(k)}), \tag{12.3.10}$$
which holds for some ξ (k) ∈ [x(k) − e(k) , x(k) ]. Dividing the above expansion by f ′′ (x(k) )
yields
$$e^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})} = \frac{1}{2} (e^{(k)})^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})}. \tag{12.3.11}$$
Recall that we have the update x(k+1) = x(k) − f ′ (x(k) )/f ′′ (x(k) ). Plugging in this update
equation and taking the absolute value of both sides, we have
$$e^{(k+1)} = \frac{1}{2} (e^{(k)})^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})}. \tag{12.3.12}$$
Consequently, whenever we are in a region of bounded $f'''(\xi^{(k)}) / (2 f''(x^{(k)})) \leq c$, we
have a quadratically decreasing error

$$e^{(k+1)} \leq c\, (e^{(k)})^2. \tag{12.3.13}$$
As an aside, optimization researchers call this linear convergence, whereas a condition such
as e(k+1) ≤ α e(k) would be called a constant rate of convergence. Note that this analysis
comes with a number of caveats. First, we do not really have much of a guarantee when
we will reach the region of rapid convergence. Instead, we only know that once we reach it,
convergence will be very quick. Second, this analysis requires that f is well-behaved up to
higher-order derivatives. It comes down to ensuring that f does not have any “surprising”
properties in terms of how it might change its values.
Preconditioning
Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus desir-
able to find alternatives. One way to improve matters is preconditioning. It avoids computing
the Hessian in its entirety but only computes the diagonal entries. This leads to update algorithms
of the form

$$\mathbf{x} \leftarrow \mathbf{x} - \eta\, \mathrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x}). \tag{12.3.14}$$
While this is not quite as good as the full Newton’s method, it is still much better than not
using it. To see why this might be a good idea consider a situation where one variable de-
notes height in millimeters and the other one denotes height in kilometers. Assuming that
for both the natural scale is in meters, we have a terrible mismatch in parameterizations.
Fortunately, using preconditioning removes this. Effectively preconditioning with gradient
descent amounts to selecting a different learning rate for each variable (coordinate of vector
x). As we will see later, preconditioning drives some of the innovation in stochastic gradient
descent optimization algorithms.
One of the key problems in gradient descent is that we might overshoot the goal or make
insufficient progress. A simple fix for the problem is to use line search in conjunction with
gradient descent. That is, we use the direction given by ∇f (x) and then perform binary
search as to which learning rate η minimizes f (x − η∇f (x)).
This algorithm converges rapidly (for an analysis and proof see e.g., Boyd and Vandenberghe
(2004)). However, for the purpose of deep learning this is not quite so feasible, since each
step of the line search would require us to evaluate the objective function on the entire dataset.
This is way too costly to accomplish.
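A minimal sketch of one such line-search step, assuming that the one-dimensional function η ↦ f(x − η∇f(x)) is convex so that bisection on its slope finds the minimizing learning rate; the interval bound eta_max and the finite-difference probe are assumptions for illustration.

def line_search_gd_step(f, f_grad, x, eta_max=1.0, num_iters=30):
    # Bisection over eta in [0, eta_max] for h(eta) = f(x - eta * g)
    g = f_grad(x)
    lo, hi = 0.0, eta_max
    for _ in range(num_iters):
        mid = (lo + hi) / 2
        # If h is still decreasing at mid, the minimizer lies to the right
        if f(x - (mid + 1e-6) * g) < f(x - mid * g):
            lo = mid
        else:
            hi = mid
    eta = (lo + hi) / 2
    return x - eta * g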
12.3.4 Summary
• Learning rates matter. Too large and we diverge, too small and we do not make progress.
• Newton’s method is a lot faster once it has started working properly in convex problems.
• Beware of using Newton’s method without any adjustments for nonconvex problems.
12.3.5 Exercises
1. Experiment with different learning rates and objective functions for gradient descent.
2. Implement line search to minimize a convex function in the interval [a, b].
1. Do you need derivatives for binary search, i.e., to decide whether to pick [a, (a + b)/2]
or [(a + b)/2, b]?
2. Use the absolute values of that rather than the actual (possibly signed) values.
Discussions
12.4 Stochastic Gradient Descent
In earlier chapters we kept using stochastic gradient descent in our training procedure, how-
ever, without explaining why it works. To shed some light on it, we just described the basic
principles of gradient descent in Section 12.3. In this section, we go on to discuss stochastic
gradient descent in greater detail.
%matplotlib inline
import math
import torch
from d2l import torch as d2l
In deep learning, the objective function is usually the average of the loss functions for each
example in the training dataset. Given a training dataset of n examples, we assume that fi (x)
is the loss function with respect to the training example of index i, where x is the parameter
vector. Then we arrive at the objective function
$$f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}). \tag{12.4.1}$$

The gradient of the objective function at $\mathbf{x}$ is computed as

$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\mathbf{x}). \tag{12.4.2}$$
If gradient descent is used, the computational cost for each independent variable iteration is
O(n), which grows linearly with n. Therefore, when the training dataset is larger, the cost of
gradient descent for each iteration will be higher.
Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each it-
eration of stochastic gradient descent, we uniformly sample an index i ∈ {1, . . . , n} for data
examples at random, and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}), \tag{12.4.3}$$

where $\eta$ is the learning rate. We can see that the computational cost for each iteration drops
from O(n) of the gradient descent to the constant O(1). Moreover, we want to emphasize
that the stochastic gradient ∇fi (x) is an unbiased estimate of the full gradient ∇f (x) be-
cause
$$E_i[\nabla f_i(\mathbf{x})] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}). \tag{12.4.4}$$
This means that, on average, the stochastic gradient is a good estimate of the gradient.
Now, we will compare it with gradient descent by adding random noise with a mean of 0 and
a variance of 1 to the gradient to simulate a stochastic gradient descent.
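The objective, its gradient, and the noisy sgd update used in the cell below are not shown in this excerpt; one sketch consistent with those calls is given here. The update reads the learning-rate schedule lr() and the base rate eta, both of which are set in the following cell.

def f(x1, x2):  # Objective function: f(x) = x1^2 + 2 * x2^2
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate a stochastic gradient by perturbing the true gradient
    # with noise of mean 0 and variance 1
    g1 += torch.normal(0.0, 1.0, (1,)).item()
    g2 += torch.normal(0.0, 1.0, (1,)).item()
    eta_t = eta * lr()  # lr() supplies the (possibly time-dependent) learning rate
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)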
def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate

d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
As we can see, the trajectory of the variables in the stochastic gradient descent is much
more noisy than the one we observed in gradient descent in Section 12.3. This is due to the
stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still
subject to the uncertainty injected by the instantaneous gradient via η∇fi (x). Even after 50
steps the quality is still not so good. Even worse, it will not improve after additional steps (we
encourage you to experiment with a larger number of steps to confirm this). This leaves us
with the only alternative: change the learning rate η. However, if we pick this too small, we
will not make any meaningful progress initially. On the other hand, if we pick it too large, we
will not get a good solution, as seen above. The only way to resolve these conflicting goals is
to reduce the learning rate dynamically as optimization progresses.
This is also the reason for adding a learning rate function lr into the sgd step function. In
the example above any functionality for learning rate scheduling lies dormant as we set the
associated lr function to be constant.
Replacing η with a time-dependent learning rate η(t) adds to the complexity of controlling
convergence of an optimization algorithm. In particular, we need to figure out how rapidly
η should decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too
slowly, we waste too much time on optimization. The following are a few basic strategies that
are used in adjusting $\eta$ over time (we will discuss more advanced strategies later):

$$\begin{aligned} \eta(t) &= \eta_i \text{ if } t_i \leq t \leq t_{i+1} && \text{piecewise constant} \\ \eta(t) &= \eta_0 \cdot e^{-\lambda t} && \text{exponential decay} \\ \eta(t) &= \eta_0 \cdot (\beta t + 1)^{-\alpha} && \text{polynomial decay} \end{aligned} \tag{12.4.5}$$
In the first piecewise constant scenario we decrease the learning rate, e.g., whenever progress
in optimization stalls. This is a common strategy for training deep networks. Alternatively
we could decrease it much more aggressively by an exponential decay. Unfortunately this
often leads to premature stopping before the algorithm has converged. A popular choice is
polynomial decay with α = 0.5. In the case of convex optimization there are a number of
proofs that show that this rate is well behaved.
def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
As expected, the variance in the parameters is significantly reduced. However, this comes
at the expense of failing to converge to the optimal solution x = (0, 0). Even after 1000
iteration steps we are still very far away from the optimal solution. Indeed, the algorithm
fails to converge at all. On the other hand, if we use a polynomial decay where the learning
rate decays with the inverse square root of the number of steps, convergence gets better after
only 50 steps.
def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
There exist many more choices for how to set the learning rate. For instance, we could start
with a small rate, then rapidly ramp up and then decrease it again, albeit more slowly. We
could even alternate between smaller and larger learning rates. There exists a large variety of
such schedules. For now let’s focus on learning rate schedules for which a comprehensive the-
oretical analysis is possible, i.e., on learning rates in a convex setting. For general nonconvex
problems it is very difficult to obtain meaningful convergence guarantees, since in general
minimizing nonlinear nonconvex problems is NP hard. For a survey see e.g., the excellent
lecture notes of Tibshirani (2015).
The following convergence analysis of stochastic gradient descent for convex objective func-
tions is optional and primarily serves to convey more intuition about the problem. We limit
ourselves to one of the simplest proofs (Nesterov and Vial, 2000). Significantly more ad-
vanced proof techniques exist, e.g., whenever the objective function is particularly well be-
haved.
Suppose that the objective function $f(\boldsymbol{\xi}, \mathbf{x})$ is convex in $\mathbf{x}$ for all $\boldsymbol{\xi}$. More concretely, we
consider the stochastic gradient descent update:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}_t), \tag{12.4.6}$$

where $f(\boldsymbol{\xi}_t, \mathbf{x})$ is the objective function with respect to the training example $\boldsymbol{\xi}_t$ drawn from
some distribution at step $t$ and $\mathbf{x}$ is the model parameter. Denote by

$$R(\mathbf{x}) = E_{\boldsymbol{\xi}}[f(\boldsymbol{\xi}, \mathbf{x})] \tag{12.4.7}$$

the expected risk and by $R^*$ its minimum with regard to $\mathbf{x}$. Last let $\mathbf{x}^*$ be the minimizer
(we assume that it exists within the domain where x is defined). In this case we can track
the distance between the current parameter xt at time t and the risk minimizer x∗ and see
whether it improves over time:
$$\begin{aligned} \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 &= \|\mathbf{x}_t - \eta_t \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}) - \mathbf{x}^*\|^2 \\ &= \|\mathbf{x}_t - \mathbf{x}^*\|^2 + \eta_t^2 \|\partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 - 2 \eta_t \langle \mathbf{x}_t - \mathbf{x}^*, \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}) \rangle. \end{aligned} \tag{12.4.8}$$
We assume that the $\ell_2$ norm of the stochastic gradient $\partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x})$ is bounded by some constant
$L$, hence we have that

$$\eta_t^2 \|\partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 \leq \eta_t^2 L^2. \tag{12.4.9}$$
We are mostly interested in how the distance between xt and x∗ changes in expectation.
In fact, for any specific sequence of steps the distance might well increase, depending on
whichever ξ t we encounter. Hence we need to bound the dot product. Since for any convex
function f it holds that f (y) ≥ f (x) + ⟨f ′ (x), y − x⟩ for all x and y, by convexity we
have
f (ξ t , x∗ ) ≥ f (ξ t , xt ) + ⟨x∗ − xt , ∂x f (ξ t , xt )⟩ . (12.4.10)
Plugging both inequalities (12.4.9) and (12.4.10) into (12.4.8) we obtain a bound on the
distance between parameters at time $t + 1$ as follows:

$$\|\mathbf{x}_t - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \geq 2 \eta_t \left(f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)\right) - \eta_t^2 L^2. \tag{12.4.11}$$
This means that we make progress as long as the difference between the current loss and the
optimal loss outweighs $\eta_t L^2 / 2$. Since this difference is bound to converge to zero it follows
that the learning rate $\eta_t$ also needs to vanish. Taking expectations (and using that
$E[f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)] = E[R(\mathbf{x}_t)] - R^*$) turns (12.4.11) into

$$E\left[\|\mathbf{x}_t - \mathbf{x}^*\|^2\right] - E\left[\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2\right] \geq 2 \eta_t \left(E[R(\mathbf{x}_t)] - R^*\right) - \eta_t^2 L^2. \tag{12.4.12}$$
The last step involves summing over the inequalities for t ∈ {1, . . . , T }. Since the sum
telescopes and by dropping the lower term we obtain
$$\|\mathbf{x}_1 - \mathbf{x}^*\|^2 \geq 2 \sum_{t=1}^{T} \eta_t \left[E[R(\mathbf{x}_t)] - R^*\right] - L^2 \sum_{t=1}^{T} \eta_t^2. \tag{12.4.13}$$
Note that we exploited that $\mathbf{x}_1$ is given and thus the expectation can be dropped. Last define

$$\bar{\mathbf{x}} \stackrel{\textrm{def}}{=} \frac{\sum_{t=1}^{T} \eta_t \mathbf{x}_t}{\sum_{t=1}^{T} \eta_t}. \tag{12.4.14}$$
Since

$$E\left[\frac{\sum_{t=1}^{T} \eta_t R(\mathbf{x}_t)}{\sum_{t=1}^{T} \eta_t}\right] = \frac{\sum_{t=1}^{T} \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^{T} \eta_t} = E[R(\mathbf{x}_t)], \tag{12.4.15}$$

by Jensen's inequality (setting $i = t$, $\alpha_i = \eta_t / \sum_{t=1}^{T} \eta_t$ in (12.2.3)) and convexity of $R$ it
follows that $E[R(\mathbf{x}_t)] \geq E[R(\bar{\mathbf{x}})]$, thus

$$\sum_{t=1}^{T} \eta_t E[R(\mathbf{x}_t)] \geq \sum_{t=1}^{T} \eta_t E[R(\bar{\mathbf{x}})]. \tag{12.4.16}$$
So far we have played a bit fast and loose when it comes to talking about stochastic gra-
dient descent. We posited that we draw instances xi , typically with labels yi from some
distribution p(x, y) and that we use this to update the model parameters in some man-
ner. In particular, for a finite sample size we simply argued that the discrete distribution
$p(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x) \delta_{y_i}(y)$ for some functions $\delta_{x_i}$ and $\delta_{y_i}$ allows us to perform stochastic
gradient descent over it.
However, this is not really what we did. In the toy examples in the current section we simply
added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs (xi , yi ).
It turns out that this is justified here (see the exercises for a detailed discussion). More trou-
bling is that in all previous discussions we clearly did not do this. Instead we iterated over all
instances exactly once. To see why this is preferable consider the converse, namely that we
are sampling n observations from the discrete distribution with replacement. The probability
of choosing an element i at random is 1/n. Thus to choose it at least once is
A similar reasoning shows that the probability of picking some sample (i.e., training example)
exactly once is given by
$$\binom{n}{1} \frac{1}{n} \left(1 - \frac{1}{n}\right)^{n-1} = \frac{n}{n-1} \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.37. \tag{12.4.19}$$
Sampling with replacement leads to an increased variance and decreased data efficiency rel-
ative to sampling without replacement. Hence, in practice we perform the latter (and this is
the default choice throughout this book). Last note that repeated passes through the training
dataset traverse it in a different random order.
12.4.5 Summary
• For convex problems we can prove that for a wide choice of learning rates stochastic gra-
dient descent will converge to the optimal solution.
• For deep learning this is generally not the case. However, the analysis of convex problems
gives us useful insight into how to approach optimization, namely to reduce the learning
rate progressively, albeit not too quickly.
• Problems occur when the learning rate is too small or too large. In practice a suitable
learning rate is often found only after multiple experiments.
• When there are more examples in the training dataset, it costs more to compute each
iteration for gradient descent, so stochastic gradient descent is preferred in these cases.
• Optimality guarantees for stochastic gradient descent are in general not available in non-
convex cases since the number of local minima that require checking might well be
exponential.
12.4.6 Exercises
1. Experiment with different learning rate schedules for stochastic gradient descent and with
different numbers of iterations. In particular, plot the distance from the optimal solution
(0, 0) as a function of the number of iterations.
2. Prove that for the function f (x1 , x2 ) = x21 + 2x22 adding normal noise to the gradient is
equivalent to minimizing a loss function f (x, w) = (x1 − w1 )2 + 2(x2 − w2 )2 where x
is drawn from a normal distribution.
3. Compare convergence of stochastic gradient descent when you sample from {(x1 , y1 ), . . . , (xn , yn )}
with replacement and when you sample without replacement.
4. How would you change the stochastic gradient descent solver if some gradient (or rather
some coordinate associated with it) was consistently larger than all the other gradients?
5. Assume that f (x) = x2 (1 + sin x). How many local minima does f have? Can you
change f in such a way that to minimize it one needs to evaluate all the local minima?
Discussions
12.5 Minibatch Stochastic Gradient Descent
So far we encountered two extremes in the approach to gradient-based learning: Section 12.3
uses the full dataset to compute gradients and to update parameters, one pass at a time.