Convolutional Neural Networks
Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
wujx2001@gmail.com
Contents
1 Preliminaries
1.1 Tensor and vectorization
1.2 Vector calculus and the chain rule
2 CNN overview
2.1 The architecture
2.2 The forward run
2.3 Stochastic gradient descent (SGD)
2.4 Error back propagation
7 A case study: the VGG-16 net
7.1 VGG-Verydeep-16
7.2 Receptive field
Exercises
We focus on the image classification problem (also called image categorization) in this chapter. In image categorization, every image has a major object that occupies a large portion of the image. An image is classified into
one of the classes based on the identity of its main object—e.g., dog, airplane,
bird, etc.
1 Preliminaries
We start with a discussion of some background knowledge that is necessary in
order to understand how a CNN runs. The reader can ignore this section if
he/she is familiar with these basics.
Given a tensor, we can arrange all the numbers inside it into a long vec-
tor, following a pre-specified order. For example, in Matlab/Octave, the (:)
operator converts a matrix into a column vector in the column-first order. An
example is:
A = [1 2; 3 4] ,   A(:) = (1, 3, 2, 4)^T .   (1)
In mathematics, we use the notation “vec” to represent this vectorization
operator. That is, vec(A) = (1, 3, 2, 4)T in the example in Equation 1. In order
to vectorize an order 3 tensor, we could vectorize its first channel (which is a
matrix and we already know how to vectorize it), then the second channel, . . . ,
till all channels are vectorized. The vectorization of the order 3 tensor is then
the concatenation of the vectorization of all the channels in this order.
The vectorization of an order 3 tensor is a recursive process, which utilizes
the vectorization of order 2 tensors. This recursive process can be applied to
vectorize an order 4 (or even higher order) tensor in the same manner.
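For readers who like to see this in code, the following NumPy sketch (ours, not part of the original text) reproduces the column-first vec operator for a matrix and for an order 3 tensor.

```python
import numpy as np

def vec(x):
    """Vectorize a tensor in the column-first (Fortran) order,
    matching Matlab/Octave's A(:) operator."""
    # For a matrix this stacks the columns; for an order 3 tensor it
    # vectorizes channel 0 first, then channel 1, and so on.
    return x.reshape(-1, order="F")

A = np.array([[1, 2],
              [3, 4]])
print(vec(A))            # [1 3 2 4], i.e., vec(A) in Equation 1

T = np.arange(24).reshape(2, 3, 4, order="F")    # an order 3 tensor
v = vec(T)
# The first 6 entries equal the vectorization of the first channel T[:, :, 0].
assert np.array_equal(v[:6], vec(T[:, :, 0]))
```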
A sanity check for Equation 4 is to check the matrix/vector dimensions. Note that ∂z/∂y^T is a row vector with H elements, or a 1 × H matrix. (Be reminded that ∂z/∂y is a column vector.) Since ∂y/∂x^T is an H × W matrix, the vector/matrix multiplication between them is valid, and the result should be a row vector with W elements, which matches the dimensionality of ∂z/∂x^T.
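As a concrete, purely illustrative check of this dimension argument, the small NumPy snippet below assumes that Equation 4 is the vector chain rule ∂z/∂x^T = (∂z/∂y^T)(∂y/∂x^T); the linear map and the loss chosen here are arbitrary.

```python
import numpy as np

H, W_dim = 3, 5                       # y has H elements, x has W_dim elements
rng = np.random.default_rng(0)

M = rng.standard_normal((H, W_dim))   # y = M x, so dy/dx^T = M (an H x W matrix)
x = rng.standard_normal(W_dim)
y = M @ x
# Take z = 0.5 * ||y||^2, so dz/dy = y (a column vector with H elements).
dz_dyT = y.reshape(1, H)              # dz/dy^T: a 1 x H row vector
dz_dxT = dz_dyT @ M                   # chain rule (assumed Equation 4): a 1 x W row vector
print(dz_dxT.shape)                   # (1, 5)

# Numerical check of one component by central finite differences.
eps = 1e-6
e0 = np.zeros(W_dim); e0[0] = eps
z = lambda x_: 0.5 * np.sum((M @ x_) ** 2)
print(dz_dxT[0, 0], (z(x + e0) - z(x - e0)) / (2 * eps))   # the two values agree
```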
2 CNN overview
In this section, we will see how a CNN trains and predicts at the abstract level,
with the details left for later sections.
The last layer is a loss layer. Suppose t is the corresponding target (groundtruth) value for the input x^1; then a cost or loss function can be used to measure the discrepancy between the CNN prediction x^L and the target t.
For example, a simple loss function could be
z = (1/2) ‖t − x^L‖² ,   (6)

although more complex loss functions are usually used. This squared ℓ2 loss can be used in a regression problem.
In a classification problem, the cross entropy (cf. Chapter 10) loss is often
used. The groundtruth in a classification problem is a categorical variable t. We
first convert the categorical variable t to a C dimensional vector t (cf. Chapter
9). Now both t and xL are probability mass functions, and the cross entropy
loss measures the distance between them. Hence, we can minimize the cross
entropy loss. Equation 5 explicitly models the loss function as a loss layer,
whose processing is modeled as a box with parameters wL , although in many
cases a loss layer does not involve any parameter—i.e., wL = ∅.
Note that there are other layers that do not have any parameter—that is,
wi may be empty for some i < L. The softmax layer is one such example.
This layer can convert a vector into a probability mass function. The input to
a softmax layer is a vector, whose values may be positive, zero, or negative.
Suppose layer l is a softmax layer and its input is a vector x^l ∈ R^d. Then, its output is a vector x^{l+1} ∈ R^d, which is computed as

x^{l+1}_i = exp(x^l_i) / ∑_{j=1}^{d} exp(x^l_j) ,   (7)
that is, a softmax transformed version of the input. After the softmax layer’s
processing, values in xl+1 form a probability mass function, and can be used as
input to the cross entropy loss.
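A minimal sketch (ours) of Equation 7, followed by the cross entropy loss it usually feeds into; the subtraction of the maximum is our own numerical-stability addition and does not change the output.

```python
import numpy as np

def softmax(x):
    """Equation 7: turn a real-valued vector into a probability mass function."""
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow; the result is unchanged
    return e / np.sum(e)

def cross_entropy(p, t):
    """Cross entropy between the predicted p.m.f. p and the target p.m.f. t."""
    return -np.sum(t * np.log(p + 1e-12))

x_l = np.array([2.0, -1.0, 0.5])     # input to the softmax layer
x_next = softmax(x_l)                # x^{l+1}: all entries positive, summing to 1
t = np.array([1.0, 0.0, 0.0])        # one-hot target vector t
print(x_next, cross_entropy(x_next, t))
```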
Note that the loss layer is not needed in prediction. It is only useful when
we try to learn CNN parameters using a set of training examples. Now, the
problem is: how do we learn the model parameters?
2.3 Stochastic gradient descent (SGD)
As in many other learning systems, the parameters of a CNN model are opti-
mized to minimize the loss z—i.e., we want the predictions of a CNN model to
match the groundtruth labels on the training set.
Let’s suppose one training example x1 is given for training such parameters.
The training process involves running the CNN network in both directions. We
first run the network in the forward direction to get xL to achieve a predic-
tion using the current CNN parameters. Instead of outputting this prediction,
however, we need to compare it with the target t corresponding to x1 —that is,
continue running the forward pass till the last loss layer. Finally, we achieve a
loss z.
The loss z is then a supervision signal, guiding how the parameters of the
model should be modified (updated). And the Stochastic Gradient Descent
(SGD) way of modifying the parameters is
w^i ← w^i − η ∂z/∂w^i .   (9)
Figure 1: Illustration of the gradient descent method, in which η is the learning
rate.
When all the training examples have been used to update the parameters, we say one epoch has been
processed.
One epoch will in general reduce the average loss on the training set until the
learning system overfits the training data. Hence, we can repeat the gradient
descent updating for many epochs and terminate at some point to obtain the
CNN parameters (e.g., we can terminate when the average loss on a validation
set increases).
Gradient descent may seem simple in its math form (Equation 9), but it is
a very tricky operation in practice. For example, if we update the parameters
using the gradient calculated from only one training example, we will observe
an unstable loss function: the average loss of all training examples will bounce
up and down at very high frequency. This is because the gradient is estimated
using only one training example instead of the entire training set—the gradient
computed using only one example can be very unstable.
In contrast to single-example-based parameter updating, we can compute the gradient using all training examples and then update the parameters. However,
this batch processing strategy requires a lot of computations because the param-
eters are updated only once in an epoch, and is hence impractical, especially
when the number of training examples is large.
A compromise is to use a mini-batch of training examples, to compute the
gradient using this mini-batch, and to update the parameters correspondingly.
Updating the parameters using the gradient estimated from a (usually) small
subset of training examples is called stochastic gradient descent. For example,
we can set 32 or 64 examples as a mini-batch. Stochastic gradient descent
(using the mini-batch strategy) is the mainstream method used to learn a CNN’s
parameters. We also want to note that when mini-batches are used, the input
of the CNN becomes a 4th order tensor—e.g., H × W × 3 × 32 if the mini-batch
size is 32.
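A generic mini-batch SGD loop might look as follows. This is a sketch of ours: grad_fn stands for whatever routine returns ∂z/∂w for a mini-batch (in a CNN, the back propagation described next) and is hypothetical here, and the least-squares usage below is only a toy check.

```python
import numpy as np

def sgd(w, data, labels, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD: w <- w - lr * dz/dw (Equation 9)."""
    n = data.shape[0]
    for epoch in range(epochs):
        order = np.random.permutation(n)            # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # one mini-batch
            g = grad_fn(w, data[idx], labels[idx])  # gradient estimated on the mini-batch
            w = w - lr * g                          # the SGD update
    return w

# Toy usage: fit w in a least-squares problem z = 0.5 * ||Xw - y||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)); w_true = rng.standard_normal(5)
y = X @ w_true
grad = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)
w_hat = sgd(np.zeros(5), X, y, grad, lr=0.1, epochs=50)
print(np.allclose(w_hat, w_true, atol=1e-2))
```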
A new problem now becomes apparent: how do we compute the gradient, which seems a very complex task?
This step is only needed when w^L is not empty. In the same spirit, it is also easy to compute ∂z/∂x^L. For example, if the squared ℓ2 loss is used, we have an empty ∂z/∂w^L, and

∂z/∂x^L = x^L − t .
In fact, for every layer, we compute two sets of gradients: the partial derivatives of z with respect to the layer parameters w^i, and that layer's input x^i.

• The term ∂z/∂w^i, as seen in Equation 9, can be used to update the current (i-th) layer's parameters;

• The term ∂z/∂x^i can be used to update parameters backwards—e.g., to the (i−1)-th layer.

Using the chain rule, we have

∂z/∂(vec(w^i)^T) = ∂z/∂(vec(x^{i+1})^T) · ∂vec(x^{i+1})/∂(vec(w^i)^T) ,   (11)

∂z/∂(vec(x^i)^T) = ∂z/∂(vec(x^{i+1})^T) · ∂vec(x^{i+1})/∂(vec(x^i)^T) .   (12)

Since ∂z/∂(vec(x^{i+1})^T) is already computed and stored in memory, it requires just a matrix multiplication with the corresponding Jacobian to obtain each of the two gradients above.
used. In that case, x^l becomes an order 4 tensor in R^{H^l × W^l × D^l × N}, where N is
the mini-batch size. For simplicity we assume that N = 1 in this chapter. The
results in this chapter, however, are easy to adapt to mini-batch versions.
In order to simplify the notations that will appear later, we follow the zero-
based indexing convention, which specifies that 0 ≤ il < H l , 0 ≤ j l < W l , and
0 ≤ dl < Dl .
In the l-th layer, a function will transform the input x^l to an output y, which is also the input to the next layer. Thus, we notice that y and x^{l+1} in fact refer to the same object, and it is very helpful to keep this point in mind. We assume the output has size H^{l+1} × W^{l+1} × D^{l+1}, and an element in the output is indexed by a triplet (i^{l+1}, j^{l+1}, d^{l+1}), 0 ≤ i^{l+1} < H^{l+1}, 0 ≤ j^{l+1} < W^{l+1}, 0 ≤ d^{l+1} < D^{l+1}.
where ⟦·⟧ is the indicator function, being 1 if its argument is true, and 0 otherwise.
Hence, we have

[∂z/∂x^l]_{i,j,d} = [∂z/∂y]_{i,j,d}  if x^l_{i,j,d} > 0,  and  [∂z/∂x^l]_{i,j,d} = 0  otherwise.   (15)
Figure 2: The ReLU function.
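The forward pass and the back propagation rule of Equation 15 take only a few lines; the sketch below is our own illustration.

```python
import numpy as np

def relu_forward(x):
    """y = max(0, x), applied elementwise to the input tensor x^l."""
    return np.maximum(x, 0.0)

def relu_backward(x, dz_dy):
    """Equation 15: pass the gradient through only where x^l > 0."""
    return dz_dy * (x > 0)

x = np.array([[-1.0, 2.0], [0.5, -3.0]])
y = relu_forward(x)
dz_dy = np.ones_like(y)            # pretend gradient arriving from the next layer
print(relu_backward(x, dz_dy))     # [[0. 1.] [1. 0.]]
```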
However, the logistic sigmoid performs significantly worse than ReLU in CNN learning. Note that 0 < y < 1 if a sigmoid function is used, and since

dy/dx = y(1 − y) ,

we have

0 < dy/dx ≤ 1/4 .

Hence, in the error back propagation process, the gradient ∂z/∂x = (∂z/∂y)(dy/dx) will have a much smaller magnitude than ∂z/∂y (at most 1/4 of it). In other words, a sigmoid
layer will cause the magnitude of the gradient to significantly reduce, and after
several sigmoid layers, the gradient will vanish (i.e., all its components will be
close to 0). A vanishing gradient makes gradient based learning (e.g., SGD) very
difficult. Another major drawback of the sigmoid is that it saturates: when the magnitude of x is large—e.g., when x > 6 or x < −6—the corresponding gradient is almost 0.
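This shrinkage is easy to observe numerically. The sketch below (ours) chains five scalar sigmoid units and multiplies their local derivatives, showing the gradient magnitude shrinking by at least a factor of 4 per layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

y = np.linspace(-3, 3, 7)       # values entering the first sigmoid layer
grad = np.ones_like(y)          # gradient arriving from the loss side
for layer in range(1, 6):       # five stacked sigmoid layers
    y = sigmoid(y)
    grad = grad * y * (1.0 - y) # local derivative dy/dx = y(1 - y) <= 1/4
    print(layer, np.abs(grad).max())
# After 5 layers the largest magnitude is below (1/4)^5, roughly 1e-3.
```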
On the other hand, the ReLU layer sets the gradient to 0 only for those features in the l-th layer that are not activated (i.e., we are not interested in them). For those activated features, the gradient is back propagated without
any change, which is beneficial for SGD learning. The introduction of ReLU to
replace sigmoid is an important change in CNN, which significantly reduces the
difficulty in learning CNN parameters and improves its accuracy. There are also
more complex variants of ReLU, for example, parametric ReLU and exponential
linear unit, which we do not touch on in this chapter.
1 × 1 + 1 × 4 + 1 × 2 + 1 × 5 = 12 .
We then move the kernel down by one pixel and get the next convolution result
as
1 × 4 + 1 × 7 + 1 × 5 + 1 × 8 = 24 .
Figure 3: Illustration of the convolution operation: (a) a 2 × 2 kernel of all ones; (b) the 3 × 4 input matrix [1 2 3 1; 4 5 6 1; 7 8 9 1] and its 2 × 3 convolution result [12 16 11; 24 28 17].
We keep moving the kernel down till it reaches the bottom border of the input
matrix (image). Then, we return the kernel to the top, and move the kernel to
its right by one element (pixel). We repeat the convolution for every possible
pixel location until we have moved the kernel to the bottom right corner of the
input image, as shown in Figure 3.
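A direct NumPy transcription of this sliding-window procedure (our sketch; note that, as in the text, the kernel is not flipped) reproduces the numbers of Figure 3.

```python
import numpy as np

def conv2d(x, k):
    """Convolve a 2-D input with a 2-D kernel, stride 1, no padding."""
    H_out = x.shape[0] - k.shape[0] + 1
    W_out = x.shape[1] - k.shape[1] + 1
    y = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            # elementwise product of the kernel and the overlaid region, then sum
            y[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return y

x = np.array([[1, 2, 3, 1],
              [4, 5, 6, 1],
              [7, 8, 9, 1]], dtype=float)
k = np.ones((2, 2))
print(conv2d(x, k))
# [[12. 16. 11.]
#  [24. 28. 17.]]
```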
For order 3 tensors, the convolution operation is defined similarly. Suppose
the input in the l-th layer is an order 3 tensor with size H l × W l × Dl . One
convolution kernel is also an order 3 tensor with size H × W × Dl . When we
overlay the kernel on top of the input tensor at the spatial location (0, 0, 0),
we compute the products of corresponding elements in all the Dl channels and
sum the HW Dl products to get the convolution result at this spatial location.
Then, we move the kernel from top to bottom and from left to right to complete
the convolution.
In a convolution layer, multiple convolution kernels are usually used. As-
suming D kernels are used and each kernel is of spatial span H × W , we denote
all the kernels as f. f is an order 4 tensor in R^{H×W×D^l×D}. Similarly, we use
index variables 0 ≤ i < H, 0 ≤ j < W , 0 ≤ dl < Dl and 0 ≤ d < D to pinpoint
a specific element in the kernels. Also note that the set of kernels f refers to the same object as the notation w^l in Equation 5. We change the notation
slightly to make the derivation a little simpler. It is also clear that even if the
mini-batch strategy is used, the kernels remain unchanged.
As shown in Figure 3, the spatial extent of the output is smaller than that
of the input so long as the convolution kernel is larger than 1 × 1. Sometimes
we need the input and output images to have the same height and width, and a
simple padding trick can be used. If the input is H l × W l × Dl and the kernel
size is H × W × Dl × D, the convolution result has size
(H l − H + 1) × (W l − W + 1) × D .
For every channel of the input, if we pad (i.e., insert) ⌊(H−1)/2⌋ rows above the first row and ⌊H/2⌋ rows below the last row, and pad ⌊(W−1)/2⌋ columns to the left of the first column and ⌊W/2⌋ columns to the right of the last column of the input, the convolution output will be H^l × W^l × D in size—i.e., having the same spatial extent as the input (⌊·⌋ is the floor function). Elements of the padded rows and columns are usually set to 0, but other values are also possible.
Stride is another important concept in convolution. In Figure 3, we convolve
the kernel with the input at every possible spatial location, which corresponds
to the stride s = 1. However, if s > 1, every movement of the kernel skips
s − 1 pixel locations (i.e., the convolution is performed once every s pixels both
horizontally and vertically). When s > 1, a convolution’s output will be much
smaller than that of the input—H l+1 and W l+1 will be roughly 1/s of H l and
W l , respectively.
In this section, we consider the simple case when the stride is 1 and no padding is used. Hence, we have y (or x^{l+1}) in R^{H^{l+1}×W^{l+1}×D^{l+1}}, with H^{l+1} = H^l − H + 1, W^{l+1} = W^l − W + 1, and D^{l+1} = D.
In precise mathematics, the convolution procedure can be expressed as an equation:

y_{i^{l+1}, j^{l+1}, d} = ∑_{i=0}^{H−1} ∑_{j=0}^{W−1} ∑_{d^l=0}^{D^l−1} f_{i, j, d^l, d} × x^l_{i^{l+1}+i, j^{l+1}+j, d^l} .   (16)

Equation 16 is repeated for all 0 ≤ d < D = D^{l+1}, and for any spatial location (i^{l+1}, j^{l+1}) satisfying

0 ≤ i^{l+1} < H^l − H + 1 = H^{l+1} ,   (17)

0 ≤ j^{l+1} < W^l − W + 1 = W^{l+1} .   (18)

In this equation, x^l_{i^{l+1}+i, j^{l+1}+j, d^l} refers to the element of x^l indexed by the triplet (i^{l+1}+i, j^{l+1}+j, d^l).

A bias term b_d is usually added to y_{i^{l+1}, j^{l+1}, d}. We omit this term in this chapter for clearer presentation.
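For completeness, here is a literal (and deliberately slow) NumPy transcription of Equation 16 with stride 1 and no padding. It is our own sketch; in practice the im2col expansion described in the following sections is used instead.

```python
import numpy as np

def conv_layer(x, f):
    """x: H^l x W^l x D^l input; f: H x W x D^l x D kernels.
    Returns y of size H^{l+1} x W^{l+1} x D (Equation 16)."""
    Hl, Wl, Dl = x.shape
    H, W, _, D = f.shape
    H_out, W_out = Hl - H + 1, Wl - W + 1
    y = np.zeros((H_out, W_out, D))
    for d in range(D):
        for i_out in range(H_out):
            for j_out in range(W_out):
                patch = x[i_out:i_out + H, j_out:j_out + W, :]
                # sum of H * W * D^l elementwise products, as in Equation 16
                y[i_out, j_out, d] = np.sum(patch * f[:, :, :, d])
    return y

x = np.random.default_rng(0).standard_normal((5, 6, 3))
f = np.random.default_rng(1).standard_normal((2, 2, 3, 4))
print(conv_layer(x, f).shape)   # (4, 5, 4)
```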
Figure 4: (a) Input image.
considers the combined effect of many features in layer l. These more complex
patterns will be further assembled by deeper layers to activate for semantically
meaningful object parts or even a particular type of object—e.g., dog, cat, tree,
beach, etc.
One more benefit of the convolution layer is that all spatial locations share
the same convolution kernel, which greatly reduces the number of parameters
needed for a convolution layer. For example, if multiple dogs appear in an input
image, the same “dog-head-like pattern” feature might be activated at multiple
locations, corresponding to heads of different dogs.
In a deep neural network setup, convolution also encourages parameter shar-
ing. For example, suppose “dog-head-like pattern” and “cat-head-like pattern”
are two features learned by a deep convolutional network. The CNN does not
need to devote two sets of disjoint parameters (e.g., convolution kernels in multi-
ple layers) to them. The CNN’s bottom layers can learn “eye-like pattern” and
“animal-fur-texture pattern,” which are shared by both these more abstract
features. In short, the combination of convolution kernels and deep and hier-
archical structures is very effective in learning good representations (features)
from images for visual recognition tasks.
We want to add a note here. Although we have used phrases such as “dog-
head-like pattern,” the representation or feature learned by a CNN may not
correspond exactly to semantic concepts such as “dog’s head.” A CNN feature
may activate frequently for dogs’ heads and often be deactivated for other types
of patterns. However, there are also possible false activations at other locations,
and possible deactivations at dogs’ heads.
In fact, a key concept in CNN (or more generally deep learning) is distributed
representation. For example, suppose our task is to recognize N different types
of objects, and a CNN extracts M features from any input image. It is most
likely that any one of the M features is useful for recognizing all N object
categories, and to recognize one object type requires the joint effort of all M
features.
a B matrix that is an expanded version of A:

B = [ 1 4 2 5 3 6
      4 7 5 8 6 9
      2 5 3 6 1 1
      5 8 6 9 1 1 ] .
to index an element in this matrix. The expansion operator copies the element
at (il , j l , dl ) in xl to the (p, q)-th entry in φ(xl ).
From the description of the expansion process, it is clear that given a fixed
(p, q), we can calculate its corresponding (il , j l , dl ) triplet, because obviously
In Equation 22, dividing q by HW and taking the integer part of the quo-
tient, we can determine which channel (dl ) it belongs to. Similarly, we can
get the offsets inside the convolution kernel as (i, j), in which 0 ≤ i < H and
0 ≤ j < W . q completely determines one specific location inside the convolution
kernel by the triplet (i, j, dl ).
Note that the convolution result is xl+1 , whose spatial extent is H l+1 =
H l − H + 1 and W l+1 = W l − W + 1. Thus, in Equation 21, the remainder
and quotient of dividing p by H l+1 = H l − H + 1 will give us the offset in the
convolved result (il+1 , j l+1 ), or, the top-left spatial location of the region in xl
(which is to be convolved with the kernel).
Based on the definition of convolution, it is clear that we can use Equa-
tions 23 and 24 to find the offset in the input xl as il = il+1 +i and j l = j l+1 +j.
That is, the mapping from (p, q) to (il , j l , dl ) is one-to-one. However, we want
to emphasize that the reverse mapping from (il , j l , dl ) to (p, q) is one-to-many, a
fact that is useful in deriving the back propagation rules in a convolution layer.
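The expansion just described can be sketched for a single-channel input as follows (our code, with patch positions and patch entries both taken in the column-first order used above). Applied to the 3 × 4 input of Figure 3 with 2 × 2 patches it reproduces B^T, i.e., each row of φ(x^l) is one vectorized patch.

```python
import numpy as np

def im2col(x, H, W):
    """phi(x^l) for a single-channel input: one row per H x W patch,
    patches and patch entries both taken in column-first order."""
    Hl, Wl = x.shape
    H_out, W_out = Hl - H + 1, Wl - W + 1
    rows = []
    for j_out in range(W_out):          # column-first over output positions
        for i_out in range(H_out):
            patch = x[i_out:i_out + H, j_out:j_out + W]
            rows.append(patch.reshape(-1, order="F"))  # column-first vec of the patch
    return np.array(rows)               # (H_out * W_out) x (H * W)

A = np.array([[1, 2, 3, 1],
              [4, 5, 6, 1],
              [7, 8, 9, 1]])
phi = im2col(A, 2, 2)       # 6 x 4; its transpose is the matrix B above
print(phi.T)
# [[1 4 2 5 3 6]
#  [4 7 5 8 6 9]
#  [2 5 3 6 1 1]
#  [5 8 6 9 1 1]]
```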
Now we use the standard vec operator to convert the set of convolution
kernels f (order 4 tensor) into a matrix. Let's start from one kernel, which can be vectorized into a vector in R^{HWD^l}. Thus, all convolution kernels can be reshaped into a matrix with HWD^l rows and D columns (remember that D^{l+1} = D). Let's call this matrix F.
Finally, with all these notations, we have a beautiful equation to calculate convolution results (cf. Equation 20, in which φ(x^l) is B^T):

Y = φ(x^l) F .
Table 1: Variables, their sizes and meanings. Note that “alias” means a variable
has a different name or can be reshaped into another form.
to the previous ((l − 1)-th) layer, and the second term will determine how the
parameters of the current (l-th) layer will be updated. A friendly reminder
is to remember that f , F , and wi refer to the same thing (modulo reshaping
of the vector or matrix or tensor). Similarly, we can reshape y into a matrix
Y ∈ R^{(H^{l+1} W^{l+1})×D}; then y, Y, and x^{l+1} refer to the same object (again modulo reshaping).
From the chain rule (Equation 11), it is easy to compute ∂z/∂(vec(F)^T) as

∂z/∂(vec(F)^T) = ∂z/∂(vec(Y)^T) · ∂vec(y)/∂(vec(F)^T) .   (31)

The first term in the RHS is already computed in the (l+1)-th layer as (equivalently) ∂z/∂(vec(x^{l+1})^T). The second term, based on Equation 29, is pretty straightforward:

∂vec(y)/∂(vec(F)^T) = ∂((I ⊗ φ(x^l)) vec(F))/∂(vec(F)^T) = I ⊗ φ(x^l) .   (32)

Note that we have used the fact ∂(Xa)/∂a = X^T or ∂(Xa)/∂(a^T) = X, so long as the matrix multiplications are well defined. This equation leads to

∂z/∂(vec(F)^T) = ∂z/∂(vec(y)^T) (I ⊗ φ(x^l)) .   (33)

Note that both Equation 28 (from RHS to LHS) and Equation 27 are used in the above derivation.

Thus, we conclude that

∂z/∂F = φ(x^l)^T (∂z/∂Y) ,   (38)

which is a simple rule to update the parameters in the l-th layer: the gradient with respect to the convolution parameters is the product between φ(x^l)^T (the im2col expansion) and ∂z/∂Y (the supervision signal transferred from the (l+1)-th layer).
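Equation 38 can be checked numerically. The following self-contained sketch (ours) uses a single-channel input, a single 2 × 2 kernel, and a locally linear loss, and compares the im2col-based gradient with a finite-difference estimate.

```python
import numpy as np

# Single-channel, single-kernel check of Equation 38: dz/dF = phi(x)^T dz/dY.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))         # H^l x W^l input (D^l = 1)
F = rng.standard_normal((4, 1))         # one 2 x 2 kernel, vectorized (HWD^l x D)

def phi(x, H=2, W=2):
    """im2col: one row per patch, column-first patch order and entries."""
    rows = [x[i:i + H, j:j + W].reshape(-1, order="F")
            for j in range(x.shape[1] - W + 1)
            for i in range(x.shape[0] - H + 1)]
    return np.array(rows)

Y = phi(x) @ F                           # forward pass: Y = phi(x) F
dz_dY = rng.standard_normal(Y.shape)     # pretend supervision signal from layer l+1
dz_dF = phi(x).T @ dz_dY                 # Equation 38

# Finite-difference check of one kernel entry (z is taken linear in Y here).
eps = 1e-6
E = np.zeros_like(F); E[2, 0] = eps
z = lambda F_: np.sum(dz_dY * (phi(x) @ F_))
print(dz_dF[2, 0], (z(F + E) - z(F - E)) / (2 * eps))   # the two numbers agree
```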
5.6 Even higher dimensional indicator matrices
The function φ(·) has been very useful in our analysis. It is pretty high dimensional—
e.g., φ(xl ) has H l+1 W l+1 HW Dl elements. From the above, we know that an
element in φ(xl ) is indexed by a pair p and q.
A quick recap about φ(xl ): 1) from q we can determine dl , which channel
of the convolution kernel is used; and can also determine i and j, the spatial
offsets inside the kernel; 2) from p we can determine il+1 and j l+1 , the spatial
offsets inside the convolved result xl+1 ; and, 3) the spatial offsets in the input
xl can be determined as il = il+1 + i and j l = j l+1 + j.
That is, the mapping m : (p, q) 7→ (il , j l , dl ) is one-to-one, and thus is
a valid function. The inverse mapping, however, is one-to-many (thus not a
valid function). If we use m−1 to represent the inverse mapping, we know that
m−1 (il , j l , dl ) is a set S, where each (p, q) ∈ S satisfies that m(p, q) = (il , j l , dl ).
Now we take a look at φ(xl ) from a different perspective. In order to fully
specify φ(xl ), what information is required? It is obvious that the following
three types of information are needed (and only those). For every element of
φ(xl ), we need to know
(A) Which region does it belong to—i.e., what is the value of p (0 ≤ p <
H l+1 W l+1 )?
(B) Which element is it inside the region (or equivalently inside the convolution
kernel)—i.e., what is the value of q (0 ≤ q < HW Dl )?
The above two pieces of information determine a location (p, q) inside φ(xl ).
The only missing information is
(C) What is the value in that position—i.e., [φ(x^l)]_{p,q}?
Then, we can use the “indicator” method to encode the function m(p, q) =
(il , j l , dl ) into M . That is, for any possible element in M , its row index x
determines a (p, q) pair, and its column index y determines a (il , j l , dl ) triplet,
and M is defined as

M(x, y) = 1 if m(p, q) = (i^l, j^l, d^l), and M(x, y) = 0 otherwise.   (39)
• M , which uses information [A, B, C.1], only encodes the one-to-one cor-
respondence between any element in φ(xl ) and any element in xl ; it does
not encode any specific value in xl ;
• Most importantly, putting together the one-to-one correspondence information in M and the value information in x^l, obviously we have

vec(φ(x^l)) = M vec(x^l) .   (40)

The gradient of z with respect to the layer input then follows from the chain rule:

∂z/∂(vec(x^l)^T) = ∂z/∂(vec(y)^T) · ∂vec(y)/∂(vec(x^l)^T) .
We will start by studying the second term in the RHS (utilizing Equations 30 and 40):

∂vec(y)/∂(vec(x^l)^T) = ∂((F^T ⊗ I) vec(φ(x^l)))/∂(vec(x^l)^T) = (F^T ⊗ I) M .   (41)

Thus,

∂z/∂(vec(x^l)^T) = ∂z/∂(vec(y)^T) (F^T ⊗ I) M .   (42)

Since (using Equation 28 from right to left)

∂z/∂(vec(y)^T) (F^T ⊗ I) = ((F ⊗ I) ∂z/∂vec(y))^T   (43)
                         = ((F ⊗ I) vec(∂z/∂Y))^T   (44)
                         = (vec(I (∂z/∂Y) F^T))^T   (45)
                         = (vec((∂z/∂Y) F^T))^T ,   (46)

we have

∂z/∂(vec(x^l)^T) = (vec((∂z/∂Y) F^T))^T M ,   (47)

or equivalently

∂z/∂(vec(x^l)) = M^T vec((∂z/∂Y) F^T) .   (48)
Let's have a closer look at the RHS. (∂z/∂Y) F^T ∈ R^{(H^{l+1}W^{l+1})×(HWD^l)}, and vec((∂z/∂Y) F^T) is a vector in R^{H^{l+1}W^{l+1}HWD^l}. On the other hand, M^T is an indicator matrix in R^{(H^l W^l D^l)×(H^{l+1}W^{l+1}HWD^l)}.

In order to pinpoint one element in vec(x^l) or one row in M^T, we need an index triplet (i^l, j^l, d^l), with 0 ≤ i^l < H^l, 0 ≤ j^l < W^l, and 0 ≤ d^l < D^l. Similarly, to locate a column in M^T or an element in (∂z/∂Y) F^T, we need an index pair (p, q), with 0 ≤ p < H^{l+1}W^{l+1} and 0 ≤ q < HWD^l.

Thus, the (i^l, j^l, d^l)-th entry of ∂z/∂(vec(x^l)) equals the multiplication of two vectors: the row of M^T (or the column of M) that is indexed by (i^l, j^l, d^l), and vec((∂z/∂Y) F^T).

Furthermore, since M^T is an indicator matrix, in the row vector indexed by (i^l, j^l, d^l), only those entries whose index (p, q) satisfies m(p, q) = (i^l, j^l, d^l) have a value 1; all other entries are 0. Thus, the (i^l, j^l, d^l)-th entry of ∂z/∂(vec(x^l)) equals the sum of these corresponding entries in vec((∂z/∂Y) F^T).

Transferring the above description into precise mathematical form, we get the following succinct equation:

[∂z/∂X]_{(i^l, j^l, d^l)} = ∑_{(p,q) ∈ m^{-1}(i^l, j^l, d^l)} [(∂z/∂Y) F^T]_{(p,q)} .   (49)

In other words, to compute ∂z/∂X, we do not need to explicitly use the extremely high dimensional matrix M. Instead, Equation 49 and Equations 21 to 24 can be used to efficiently find ∂z/∂X.
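The same bookkeeping can be written without ever forming M: every entry of (∂z/∂Y)F^T is simply scattered back, and accumulated, at the input location it came from. Below is a single-channel sketch of ours in the spirit of Equation 49.

```python
import numpy as np

def conv_backward_input(dz_dY, F, Hl, Wl, H=2, W=2):
    """Equation 49 for D^l = 1: accumulate (dz/dY F^T)[p, q] into dz/dX at
    (i^l, j^l) = (i^{l+1} + i, j^{l+1} + j), where p -> (i^{l+1}, j^{l+1})
    and q -> (i, j)."""
    H_out, W_out = Hl - H + 1, Wl - W + 1
    G = dz_dY @ F.T                    # (H_out * W_out) x (H * W)
    dz_dX = np.zeros((Hl, Wl))
    for p in range(H_out * W_out):
        i_out, j_out = p % H_out, p // H_out      # column-first output position
        for q in range(H * W):
            i, j = q % H, q // H                  # column-first kernel offset
            dz_dX[i_out + i, j_out + j] += G[p, q]
    return dz_dX

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 1))        # one 2 x 2 kernel, vectorized
dz_dY = rng.standard_normal((6, 1))    # gradient w.r.t. Y for a 3 x 4 input
print(conv_backward_input(dz_dY, F, Hl=3, Wl=4))
```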
We use the simple convolution example in Figure 3 to illustrate the inverse mapping m^{-1}, which is shown in Figure 5.

In the right half of Figure 5, the 6 × 4 matrix is (∂z/∂Y) F^T. In order to compute the partial derivative of z with respect to one element in the input X, we need to find which elements in (∂z/∂Y) F^T are involved and add them. In the left half of Figure 5, we show that the input element 5 (shown in larger font) is involved in four convolution operations, shown by the red, green, blue, and black boxes, respectively.
Figure 5: Illustration of how to compute ∂z/∂X, using the 3 × 4 input of Figure 3. (This operation is performed in a row2im manner, although the name row2im is not explicitly used.)
with size H l × W l × Dl × D. The output is y ∈ RD . It is obvious that to
compute any element in y, we need to use all elements in the input xl . Hence,
this layer is a fully connected layer, but it can be implemented as a convolution layer. Therefore, we do not need to derive learning rules for a fully connected layer separately.
where 0 ≤ il+1 < H l+1 , 0 ≤ j l+1 < W l+1 , and 0 ≤ d < Dl+1 = Dl .
Pooling is a local operator, and its forward computation is pretty straight-
forward. Now we focus on the back propagation. Only max pooling is discussed,
and we can resort to the indicator matrix again. Average pooling can be dealt
with using a similar idea.
All we need to encode in this indicator matrix is: for every element in y,
where does it come from in xl ?
We need a triplet (il , j l , dl ) to pinpoint one element in the input xl , and
another triplet (i^{l+1}, j^{l+1}, d^{l+1}) to locate one element in y. The pooling output y_{i^{l+1}, j^{l+1}, d^{l+1}} comes from x^l_{i^l, j^l, d^l}, if and only if the following conditions are met:
4 That is, the strides in the vertical and horizontal direction are H and W , respectively.
• They are in the same channel;
• The (il , j l )-th spatial entry belongs to the (il+1 , j l+1 )-th subregion;
• The (il , j l )-th spatial entry is the largest one in that subregion.
Translating these conditions into equations, we get

d^{l+1} = d^l ,   (53)

⌊i^l / H⌋ = i^{l+1} ,   ⌊j^l / W⌋ = j^{l+1} ,   (54)

x^l_{i^l, j^l, d^l} ≥ x^l_{i + i^{l+1}×H, j + j^{l+1}×W, d^l} ,  ∀ 0 ≤ i < H, 0 ≤ j < W ,   (55)

where ⌊·⌋ is the floor function. If the stride is not H (W) in the vertical (horizontal) direction, Equation 54 must be changed accordingly.
Given a (il+1 , j l+1 , dl+1 ) triplet, there is only one (il , j l , dl ) triplet that sat-
isfies all these conditions. Thus, we define an indicator matrix

S(x^l) ∈ R^{(H^{l+1} W^{l+1} D^{l+1}) × (H^l W^l D^l)} .   (56)
One triplet of indexes (il+1 , j l+1 , dl+1 ) specifies a row in S, while (il , j l , dl )
specifies a column. These two triplets together pinpoint one element in S(xl ).
We set that element to 1 if Equations 53 to 55 are simultaneously satisfied, and
0 otherwise. One row of S(xl ) corresponds to one element in y, and one column
corresponds to one element in xl .
With the help of this indicator matrix, we have vec(y) = S(x^l) vec(x^l), and, in the back propagation, ∂z/∂(vec(x^l)) = S(x^l)^T · ∂z/∂(vec(y)).
If the strides are H and W, respectively, one column of S(x^l) contains at most one nonzero element. In the above example, the columns of S(x^l) indexed by (0, 0, d^l), (1, 0, d^l), and (1, 1, d^l) are all zero vectors. The column corresponding to (0, 1, d^l) contains only one nonzero entry, whose row index is determined by (0, 0, d^l). Hence, in the back propagation, we have

[∂z/∂ vec(x^l)]_{(0,1,d^l)} = [∂z/∂ vec(y)]_{(0,0,d^l)} ,

and

[∂z/∂ vec(x^l)]_{(0,0,d^l)} = [∂z/∂ vec(x^l)]_{(1,0,d^l)} = [∂z/∂ vec(x^l)]_{(1,1,d^l)} = 0 .
However, if the pooling strides are smaller than H and W in the vertical
and horizontal directions, respectively, one element in the input tensor may be
the largest element in several pooling subregions. Hence, there can be more
than one nonzero entry in one column of S(xl ). Let us consider the example
input in Figure 5. If a 2 × 2 max pooling is applied to it and the stride is 1 in
both directions, the element 9 is the largest in two pooling regions: [5 6; 8 9] and [6 1; 9 1]. Hence, in the column of S(x^l) corresponding to the element 9 (indexed by (2, 2, d^l) in the input tensor), there are two nonzero entries, whose row indexes correspond to (i^{l+1}, j^{l+1}, d^{l+1}) = (1, 1, d^l) and (1, 2, d^l). Thus, in this example, we have

[∂z/∂ vec(x^l)]_{(2,2,d^l)} = [∂z/∂ vec(y)]_{(1,1,d^l)} + [∂z/∂ vec(y)]_{(1,2,d^l)} .
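A sketch (ours) of single-channel max pooling with a configurable stride; the backward pass accumulates gradients with '+=', so an input element that is the maximum of several subregions receives the sum of the corresponding output gradients, exactly as in the example above.

```python
import numpy as np

def max_pool_forward(x, H=2, W=2, stride=1):
    """x: H^l x W^l (one channel). Returns y and the argmax locations."""
    H_out = (x.shape[0] - H) // stride + 1
    W_out = (x.shape[1] - W) // stride + 1
    y = np.zeros((H_out, W_out))
    argmax = {}
    for i_out in range(H_out):
        for j_out in range(W_out):
            patch = x[i_out * stride:i_out * stride + H,
                      j_out * stride:j_out * stride + W]
            y[i_out, j_out] = patch.max()
            di, dj = np.unravel_index(patch.argmax(), patch.shape)
            argmax[(i_out, j_out)] = (i_out * stride + di, j_out * stride + dj)
    return y, argmax

def max_pool_backward(dz_dy, argmax, x_shape):
    """Route each output gradient back to the input element that was the max."""
    dz_dx = np.zeros(x_shape)
    for (i_out, j_out), (i, j) in argmax.items():
        dz_dx[i, j] += dz_dy[i_out, j_out]   # '+=' handles overlapping subregions
    return dz_dx

x = np.array([[1, 2, 3, 1],
              [4, 5, 6, 1],
              [7, 8, 9, 1]], dtype=float)
y, argmax = max_pool_forward(x, stride=1)
dz_dx = max_pool_backward(np.ones_like(y), argmax, x.shape)
print(dz_dx[2, 2])   # 2.0: element 9 is the maximum of two pooling regions
```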
7.1 VGG-Verydeep-16
The VGG-Verydeep-16 CNN model is a pretrained CNN model released by the
Oxford VGG group.5 We use it as an example to study the detailed structure
of CNN networks. The VGG-16 model architecture is listed in Table 2.
There are six types of layers in this model.
Convolution A convolution layer is abbreviated as “Conv.” Its description
includes four parts: number of channels; kernel spatial extent (kernel size);
padding (‘p’); and stride (‘st’) size.
ReLU No description is needed for a ReLU layer.
5 http://www.robots.ox.ac.uk/~vgg/research/very_deep/
Table 2: The VGG-Verydeep-16 architecture and receptive field
We want to add a few notes about this example deep CNN architecture.
• A convolution layer is always followed by a ReLU layer in VGG-16. The
ReLU layers increase the nonlinearity of the CNN model.
• The convolution layers between two pooling layers have the same number
of channels, kernel size, and stride. In fact, stacking two 3 × 3 convolution layers has the same receptive field as one 5 × 5 convolution layer, and stacking three 3 × 3 convolution layers covers the receptive field of a 7 × 7 convolution layer. Stacking a
few (2 or 3) smaller convolution kernels, however, computes faster than
a large convolution kernel. In addition, the number of parameters is also
reduced—e.g., 2 × 3 × 3 = 18 < 25 = 5 × 5. The ReLU layers inserted in
between small convolution layers are also helpful.
• The input to VGG-16 is an image with size 224 × 224 × 3. Because the
padding is one in the convolution kernels (meaning one row or column is
added outside of the four edges of the input), convolution will not change
the spatial extent. The pooling layers will reduce the input size by a factor
of 2. Hence, the output after the last (5th) pooling layer has spatial extent
7 × 7 (and 512 channels). We may interpret this tensor as 7 × 7 × 512 =
25088 “features.” The first fully connected layer converts these into 4096
features. The number of features remains at 4096 after the second fully
connected layer.
• The VGG-16 is trained for the ImageNet classification challenge, which is
an object recognition problem with 1000 classes. The last fully connected
layer (4096 × 1000) outputs a length 1000 vector for every input image,
and the softmax layer converts this length 1000 vector into the estimated
posterior probability for the 1000 classes.
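The size bookkeeping in these notes can be traced with a short script; the stage list below is our own summary of the VGG-16 configuration and is not a transcription of Table 2.

```python
# VGG-16 feature extractor: (number of 3x3 pad-1 convolutions, output channels)
# per stage, with a 2x2 stride-2 max pooling after each stage.
stages = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

h = w = 224          # input spatial size
channels = 3
for n_convs, out_channels in stages:
    # 3x3 convolutions with padding 1 keep the spatial extent unchanged.
    channels = out_channels
    # 2x2 max pooling with stride 2 halves the spatial extent.
    h, w = h // 2, w // 2
    print(f"after stage with {n_convs} convs: {h} x {w} x {channels}")

print("features entering the first fully connected layer:", h * w * channels)  # 7*7*512 = 25088
```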
8 Hands-on CNN experiences
We hope this introductory chapter on CNN is clear, self-contained, and easy to
understand to our readers.
Once a reader is confident in his/her understanding of CNN at the math-
ematical level, in the next step it is very helpful to get some hands-on CNN
experiences. For example, one can validate what has been talked about in this
chapter using the MatConvNet software package if you prefer the Matlab envi-
ronment.6 For C++ lovers, Caffe is a widely used tool.7 The Theano package
is a python package for deep learning.8 Many more resources for deep learn-
ing (not only CNN) are available—e.g., Torch,9 PyTorch,10 MXNet,11 Keras,12
TensorFlow,13 and more. The exercise problems in this chapter are offered as
appropriate first-time CNN programming practice.
6 http://www.vlfeat.org/matconvnet/
7 http://caffe.berkeleyvision.org/
8 http://deeplearning.net/software/theano/
9 http://torch.ch/
10 http://pytorch.org/
11 https://mxnet.incubator.apache.org/
12 https://keras.io/
13 https://www.tensorflow.org/
Exercises
1. Dropout is a very useful technique in training neural networks, which is
proposed by Srivastava et al. in a paper titled “Dropout: A Simple Way
to Prevent Neural Networks from Overfitting” in JMLR .14 Carefully read
this paper and answer the following questions (please organize your answer
to every question in one brief sentence).
(a) How does dropout operate during training?
(b) How does dropout operate during testing?
(c) What is the benefit of dropout?
(d) Why can dropout achieve this benefit?
2. The VGG16 CNN model (also called VGG-Verydeep-16) was publicized
by Karen Simonyan and Andrew Zisserman in a paper titled “Very Deep
Convolutional Networks for Large-Scale Image Recognition” in the arXiv
preprint server .15 And, the GoogLeNet model was publicized by Szegedy
et al. in a paper titled “Going Deeper with Convolutions” in the arXiv
preprint server .16 These two papers were publicized around the same time
and share some similar ideas. Carefully read both papers and answer the
following questions (please organize your answer to every question in one
brief sentence).
(a) Why do they use small convolution kernels (mainly 3 × 3) rather than
larger ones?
(b) Why are both networks quite deep (i.e., with many layers, around 20)?
(c) What difficulties are caused by the large depth? How are they solved
in these two networks?
3. Batch Normalization (BN) is another very useful technique in training
deep neural networks, which is proposed by Sergey Ioffe and Christian
Szegedy, in a paper titled “Batch Normalization: Accelerating Deep Net-
work Training by Reducing Internal Covariate Shift” in ICML 2015 .17
Carefully read this paper and answer the following questions (please or-
ganize your answer to every question in one brief sentence).
(a) What is internal covariate shift?
(b) How does BN deal with this?
(c) How does BN operate in a convolution layer?
(d) What is the benefit of using BN?
14 Available at http://jmlr.org/papers/v15/srivastava14a.html
15 Available at https://arxiv.org/abs/1409.1556, later published in ICLR 2015 as a con-
ference track paper.
16 Available at https://arxiv.org/abs/1409.4842, later published in CVPR 2015.
17 Available at http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
4. ResNet is a very deep neural network learning technique proposed by He
et al. in a paper titled “Deep Residual Learning for Image Recognition”
in CVPR 2016 .18 Carefully read this paper and answer the following
questions (please organize your answer to every question in one brief sen-
tence).
(a) Although VGG16 and GoogLeNet have encountered difficulties in
training networks of around 20–30 layers, what enables ResNet to train
networks as deep as 1000 layers?
(b) VGG16 is a feed-forward network, where each layer has only one in-
put and only one output. Conversely, GoogLeNet and ResNet are DAGs
(directed acyclic graphs), where one layer can have multiple inputs and
multiple outputs, so long as the data flow in the network structure does
not form a cycle. What is the benefit of DAGs vs. feed-forward?
(c) VGG16 has two fully connected layers (fc6 and fc7), while ResNet and
GoogLeNet do not have fully connected layers (except the last layer for
classification). What is used to replace FC layers in them? What is the
benefit?
5. AlexNet refers to the deep convolutional neural network trained on the
ILSVRC challenge data, which is a groundbreaking work of deep CNN
for computer vision tasks. The technical details of AlexNet are reported
in the paper “ImageNet Classification with Deep Convolutional Neural
Networks” by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton
in NIPS 25 .19 It proposed the ReLU activation function and creatively
used GPUs to accelerate the computations. Carefully read this paper
and answer the following questions (please organize your answer to every
question in one brief sentence).
(a) Describe your understanding of how ReLU helps its success. And,
how do the GPUs help out?
(b) Using the average of predictions from several networks helps reduce
the error rates. Why?
(c) Where is the dropout technique applied? How does it help? And what
is the cost of using dropout?
(d) How many parameters are there in AlexNet? Why is the dataset size
(1.2 million images) important for the success of AlexNet?
6. We will try different CNN structures on the MNIST dataset. We denote
the “baseline” network in the MNIST example in MatConvNet as BASE
in this question.20 In this question, a convolution layer is denoted as
18 Availableat https://arxiv.org/pdf/1512.03385.pdf
19 This paper is available at http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
20 MatConvNet version 1.0-beta20. Please refer to MatConvNet for all the details of BASE,
“x × y × nIn × nOut,” whose kernel size is x × y, with nIn input and nOut
output channels, with stride 1 and pad 0. The pooling layers are 2 × 2
max pooling with stride equal 2. The BASE network has four blocks. The
first consists of a 5 × 5 × 1 × 20 convolution and a max pooling; the second
block is composed of a 5 × 5 × 20 × 50 convolution and a max pooling; the
third block is a 4 × 4 × 50 × 500 convolution (FC) plus a ReLU layer; and
the final block is the classification layer (1 × 1 × 500 × 10 convolution).
(a) The MNIST dataset is available at http://yann.lecun.com/exdb/
mnist/. Read the instructions on that page, and write a program to
transform the data to formats that suit your favorite deep learning soft-
ware.
(b) Learning deep learning models often involves random numbers. Before
the training starts, set the random number generator’s seed to 0. Then,
use the BASE network structure and the first 10000 training examples to
learn its parameters. What is the test set error rate (on the 10000 test
examples) after 20 training epochs?
(c) From now on, if not otherwise specified, we assume the first 10000
training examples and 20 epochs are used. Now we define the BN network
structure, which adds a batch normalization layer after every convolution
layer in the first three blocks. What is its error rate? What can you say
about BN vs. BASE?
(d) If you add a dropout layer after the classification layer in the 4th
block, what is the new error rate of BASE and BN? Comment on the use
of dropout?
(e) Now we define the SK network structure, which refers to small kernel
size. SK is based on BN. The first block (5 × 5 convolution plus pooling)
is now changed to two 3 × 3 convolutions, and BN + ReLU is applied after
every convolution. For example, block 1 is now 3 × 3 × 1 × 20 convolution
+ BN + ReLU + 3 × 3 × 20 × 20 convolution + BN + ReLU + pool.
What is SK’s error rate? Comment on that (e.g., how and why the error
rate changes).
(f) Now we define the SK-s network structure. The notation ‘s’ refers to
a multiplier that changes the number of channels in convolution layers.
For example, SK is the same as SK-1. And, SK-2 means the number of
channels in all convolution layers (except the one in block 4) are multiplied
by 2. Train networks for SK-2, SK-1.5, SK-1, SK-0.5, and SK-0.2. Report
their error rates and comment on them.
(g) Now we experiment with different training set sizes using the SK-0.2
network structure. Using the first 500, 1000, 2000, 5000, 10000, 20000, and
60000 (all) training examples, what error rates do you achieve? Comment
on your observations.
(h) Using the SK-0.2 network structure, study how different training sets
affect its performance. Train 6 networks, and use the (10000 × (i − 1) + 1)-
th to (i × 10000)-th training examples in training the i-th network. Are
CNNs stable in terms of different training sets?
(i) Now we study how randomness affects CNN learning. Instead of setting
the random number generator’s seed to 0, use 1, 12, 123, 1234, 12345, and
123456 as the seed to train 6 different SK-0.2 networks. What are their
error rates? Comment on your observations.
(j) Finally, in SK-0.2, change all ReLU layers to sigmoid layers. Com-
ment on the comparison of error rates between using ReLU and sigmoid
activation functions.