Unit 4a - Convolutional Neural Networks

Convolutional neural networks emerged from studies of the visual cortex and have achieved superhuman performance on complex visual tasks. CNNs use techniques like convolutional layers and pooling layers, which apply concepts like sparse interactions, parameter sharing, and equivariant representations to process spatial information in images. CNNs have become very successful in applications like image recognition, natural language processing, and more.

Convolutional Neural Networks

(Chapter 9 from the DL book)
(Chapter 14 from the Hands-on ML book)
(Chapter 8 from Chollet's book, 2e)
(Chapter 5 from Weidman's book)
1
Convolutional Neural Networks
 CNNs emerged from the study of the
brain’s visual cortex.
 CNNs have managed to achieve superhuman
performance on complex visual tasks.
 They power image search services, self-driving
cars, and automatic video classification systems.
 CNNs are also successful at many other tasks,
such as voice recognition and natural
language processing.

2
Convolutional Neural Networks
 Hubel and Wiesel discovered that the neurons
that receive visual input from the eye are in
general most responsive to simple, straight
edges at particular orientations.
 Fittingly, they named these cells simple cells.
 A large group of simple cells together is able
to represent all 360 degrees of orientation.
 These edge-orientation-detecting simple cells
then pass along information to a large number of
so-called complex cells.
 Complex cells are capable of detecting more
complex shapes, like a corner or a curve.
3
Convolutional Neural Networks

4
Convolutional Neural Networks
 The studies of the visual cortex inspired the
neocognitron, introduced in 1980, which
gradually evolved into what we now call
convolutional neural networks.
 In 1998, Yann LeCun et al. introduced the
famous LeNet-5 architecture, widely used
by banks to recognize handwritten check
numbers.
 Introduced two new building blocks:
convolutional layers and pooling layers.
5
Convolutional Neural Networks
 Convolution is an operation on two
functions.
 In CNN convolutions:
 The first function is the network input x, the
second is the kernel w.
 The convolution kernel corresponds to a sparse
weight matrix, in contrast to the usual
fully connected weight matrix.

6
Convolution operation
Input (3 x 4):            Kernel (2 x 2):
a b c d                   w x
e f g h                   y z
i j k l

Output (2 x 3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz

7
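A minimal NumPy sketch of the operation in this figure (no zero padding; what deep learning libraries call "convolution" is really this sliding cross-correlation). The function name and example values are illustrative, not from the slides:

import numpy as np

def conv2d_valid(inp, kernel):
    """Slide the kernel over the input with no padding and sum the
    elementwise products at each position."""
    kh, kw = kernel.shape
    out_h = inp.shape[0] - kh + 1
    out_w = inp.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out

inp = np.arange(12, dtype=float).reshape(3, 4)  # stands in for a..l
kernel = np.array([[1., 2.], [3., 4.]])         # stands in for w, x, y, z
print(conv2d_valid(inp, kernel))                # shape (2, 3), as in the figure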
Convolutional Neural Networks
 Convolution leverages three important
ideas that help improve machine learning
systems:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
 CNNs take advantage of spatial
information:
 Local patterns that are translation-invariant.
 Spatial hierarchies of these patterns.
8
Sparse Interactions
 In fully connected traditional networks
 with m neurons in a layer and n neurons in the
next layer,
 matrix multiplication requires O(m x n) runtime (per example).
 Sparse interactions
 Also called sparse connectivity or sparse weights.
 Accomplished by making the kernel smaller than the
input.
 With k << m, it requires only O(k x n) runtime (per example).
 k is typically several orders of magnitude smaller
than m.
9
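To make the O(m x n) versus O(k x n) difference concrete (illustrative numbers, not from the slides): with a 28 x 28 input flattened to m = 784 and a layer of n = 100 units, a fully connected layer needs 784 x 100 = 78,400 weights, while a convolutional layer with a 3 x 3 kernel slides the same k = 9 weights over the input, regardless of the input size.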
Sparse Connectivity

[Figure: top, the sparse connections from inputs x1-x5 to units s1-s5 that result
from a small convolution kernel, viewed from below; bottom, the dense connections
of a fully connected layer.]
10
Sparse Connectivity

[Figure: the same comparison viewed from above, showing receptive fields: each
unit s1-s5 of the convolutional layer sees only a few inputs, while each unit of
the fully connected layer sees all of x1-x5.]
11
Growing Receptive Fields

[Figure: stacking layers x -> h -> g; even though direct connections are sparse,
units in the deeper layer g are indirectly connected to most or all of the input
x, so the effective receptive field grows with depth.]
Parameter Sharing
 In traditional neural networks
 each element of the weight matrix is unique.
 Parameter sharing means using the same
value for more than one parameter.
 The network has tied weights.
 Sharing reduces storage requirements to k
parameters.
 Forward propagation runtime stays O(k x n).

13
Equivariant Representations
 For an equivariant function, if the input
changes, the output changes in the same way.
 For convolution, the particular form of
parameter sharing causes equivariance to
translation.
 For example, as a dog moves in the input
image, the detected edges move in the same way.
 In image processing, detecting edges is useful
in the first layer, and edges appear more or
less everywhere in the image.

14
Problem of Equivariance

 Convolution is not naturally equivariant to some other
transformations, such as changes in the scale or
rotation of an image.
 Solution: Capsule networks!

15
Receptive Field
 A neuron located in row i, column j of a given layer is
connected to the outputs of the neurons in the previous
layer located in rows i to i + fh – 1, columns j to j + fw – 1.
 Zero padding: In order for a layer to have the same height
and width as the previous layer, it is common to add
zeros around the inputs.

16
Stride greater than 1
 It is also possible to connect a large input
layer to a much smaller layer by spacing
out the receptive fields.

17
Padding valid/same

18
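A small Keras sketch (illustrative shapes and filter counts) showing how padding and stride change a Conv2D layer's output shape:

import numpy as np
from tensorflow.keras import layers

images = np.random.rand(1, 28, 28, 1).astype("float32")  # one 28 x 28 grayscale image

valid = layers.Conv2D(8, kernel_size=3, padding="valid")(images)   # no zero padding
same = layers.Conv2D(8, kernel_size=3, padding="same")(images)     # zero padding
strided = layers.Conv2D(8, kernel_size=3, strides=2, padding="same")(images)

print(valid.shape)    # (1, 26, 26, 8): the output shrinks
print(same.shape)     # (1, 28, 28, 8): same height and width as the input
print(strided.shape)  # (1, 14, 14, 8): stride 2 halves the spatial dimensions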
Stacking Multiple Feature Maps

 Typically, a convolutional
layer has multiple filters
and outputs one feature
map per filter.
 All neurons in a feature map
share the same parameters.
 Neurons in different feature
maps use different parameters.

19
Pooling Layers
 The pooling function replaces the output of
the net at a certain location with a
summary statistic of the nearby outputs
 Max pooling reports the maximum output
within a rectangular neighborhood
 Average pooling reports the average output
 Pooling helps make the representation
approximately invariant to small input
translations.
 The max pooling layer is the most commonly
used and generally performs better.
20
Pooling Layers
 People mostly use max pooling layers
instead of average pooling layers because:
 Max pooling generally performs better.
 Max pooling preserves only the strongest
features, getting rid of all the meaningless
ones, so the next layers get a cleaner signal to
work with.
 Max pooling offers stronger translation
invariance than average pooling, and it
requires slightly less computing.

21
Pooling Layers
 Pooling layers subsample (i.e., shrink) the
input image in order to reduce the
computational load, the memory usage,
and the number of parameters.
 Thereby limiting the risk of overfitting.

22
Pooling Layers
 Other than reducing computations, memory
usage, and the number of parameters, a max
pooling layer also introduces some level of
invariance to small translations.

23
Pooling Layers - Depthwise
 Max pooling and average pooling can be
performed along the depth dimension rather than
the spatial dimensions.
 This can allow the CNN to learn to be invariant to
various features.
 For example, learn multiple filters, each detecting
a different rotation of the same pattern.
 The depthwise max pooling layer would ensure that
the output is the same regardless of the rotation.
 The CNN could similarly learn to be invariant to
anything else: thickness, brightness, skew, color,
and so on.
24
Pooling Layers - Depthwise

25
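Keras has no built-in depthwise max pooling layer; a sketch along the lines of the Hands-on ML book's approach wraps tf.nn.max_pool in a Lambda layer (the pool size of 3 along the channel axis is illustrative, and the number of channels must be divisible by it):

import tensorflow as tf

depth_pool = tf.keras.layers.Lambda(
    lambda X: tf.nn.max_pool(X, ksize=(1, 1, 1, 3), strides=(1, 1, 1, 3),
                             padding="VALID"))
# Input shape is (batch, height, width, channels); only the channel axis is pooled.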
Pooling Layers
 One last type of pooling layer that you will
often see in modern architectures is the
global average pooling layer.
 Computes the mean of each entire feature map (it’s
like an average pooling layer using a pooling kernel
with the same spatial dimensions as the inputs).
 This means that it just outputs a single number per
feature map and per instance.
 Used as the output layer in many well-known CNN
architectures (e.g., GoogLeNet, Xception, SENet).

26
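A minimal sketch of global average pooling in Keras (the shapes are illustrative):

import numpy as np
from tensorflow.keras import layers

feature_maps = np.random.rand(2, 5, 5, 512).astype("float32")  # (batch, h, w, channels)
pooled = layers.GlobalAveragePooling2D()(feature_maps)
print(pooled.shape)  # (2, 512): one number per feature map and per instance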
Convolutional Filter Hyperparameters

 Kernel size
 Padding
 Stride length

27
Convolutional Neural Networks

(Chapter 8 from Chollet's book, 2e)

28
CNN for mnist
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

29
CNN for mnist
Calculate the number of parameters:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
conv2d (Conv2D) (None, 26, 26, 32) 320
max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
conv2d_1 (Conv2D) (None, 11, 11, 64) 18496
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64) 0
conv2d_2 (Conv2D) (None, 3, 3, 128) 73856
flatten (Flatten) (None, 1152) 0
dense (Dense) (None, 10) 11530
=================================================================
Total params: 104,202
30
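As a check on the table: each Conv2D filter has (kernel height x kernel width x input channels) weights plus one bias, and the Dense layer has (inputs + 1) x units parameters:

conv2d:   (3 x 3 x 1 + 1) x 32 = 320
conv2d_1: (3 x 3 x 32 + 1) x 64 = 18,496
conv2d_2: (3 x 3 x 64 + 1) x 128 = 73,856
dense:    (1,152 + 1) x 10 = 11,530
Total:    320 + 18,496 + 73,856 + 11,530 = 104,202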
CNN for mnist
 Convolutions operate over rank-3 tensors called
feature maps, with two spatial axes (height and
width) as well as a depth axis (or channels axis).
 The convolution operation extracts patches from
its input feature map and applies the same
transformation to all of these patches, producing
an output feature map.
 Each of the 32 output channels contains a 26 ×
26 grid of values, which is a response map of
the filter over the input, indicating the response of
that filter pattern at different locations in the input.
31
CNN for mnist
>>> test_loss, test_acc = model.evaluate(test_images, test_labels)
>>> print(f"Test accuracy: {test_acc:.3f}")
Test accuracy: 0.991

 Whereas the densely connected model


from Lab Experiment 1 had a test accuracy
of 97.8%, the basic convnet has a test
accuracy of 99.1%.
 We decreased the error rate by about 60%
(from 2.2% to 0.9%).
32
Training a CNN on small dataset
 Classification of 5000 dogs and cats
 Training set: 1000 dogs and 1000 cats
 Validation set: 500 dogs and 500 cats
 Test set: 1000 dogs and 1000 cats
 4 tools in our deep learning toolbox
 Training from scratch on a small dataset
 Data augmentation to increase dataset size
 Feature extraction using a pretrained model
 Fine-tuning a pretrained model
33
Training a CNN on small dataset

34
Training a CNN on small dataset
inputs = keras.Input(shape=(180, 180, 3))
x = layers.Rescaling(1./255)(inputs)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

A total of 991,041 parameters.
35
Training a CNN on small dataset
Data preprocessing

1. Read the picture files.
2. Decode the JPEG content to RGB grids of pixels.
3. Convert these into floating-point tensors.
4. Resize them to a shared size (we’ll use 180 × 180).
5. Pack them into batches (we’ll use batches of 32 images).

Keras' utility function image_dataset_from_directory()
handles all of these preprocessing steps, as sketched below.

36
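A sketch of how this utility might be called for the three splits (the directory names are assumptions; any folder layout with one subfolder per class works):

from tensorflow.keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    "cats_vs_dogs_small/train",   # one subfolder per class
    image_size=(180, 180),        # resize to a shared size
    batch_size=32)                # pack into batches of 32 images
validation_dataset = image_dataset_from_directory(
    "cats_vs_dogs_small/validation", image_size=(180, 180), batch_size=32)
test_dataset = image_dataset_from_directory(
    "cats_vs_dogs_small/test", image_size=(180, 180), batch_size=32)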
Training a CNN on small dataset
callbacks = [
keras.callbacks.ModelCheckpoint(
filepath="convnet_from_scratch.keras",
save_best_only=True, monitor="val_loss")
]

history = model.fit(
train_dataset, epochs=30,
validation_data=validation_dataset,
callbacks=callbacks)

37
Training a CNN on small dataset
 Overfitting starts within 10 epochs.
 Validation accuracy peaks at 75%.
 We get a test accuracy of 69.5%.
 Expected due to random sampling on a small
dataset.
 We can try many techniques to mitigate
overfitting, such as dropout and weight
decay (L2 regularization).
 We try the data augmentation technique next.
38
Training a CNN on small dataset

data_augmentation = keras.Sequential(
[
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.1),
layers.RandomZoom(0.2),
]
)

39
Training a CNN on small dataset
inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)
x = layers.Rescaling(1./255)(x)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
40
Training a CNN on small dataset
 After training for 100 epochs, we found:
 Overfitting occurring around the 60th epoch (much
better than the 10th epoch).
 Validation accuracy in the 80–85% range
(again a big improvement over our first try).
 We got a test accuracy of 83.5% (pretty decent
compared to 69.5%).

41
Feature extraction - pretrained model
 A common and highly effective approach to deep
learning on small image datasets is to use a
pretrained model that was previously trained on a
large dataset, typically on a large-scale
image-classification task.
 If this original dataset is large enough and
general enough, the spatial hierarchy of features
learned by the pretrained model can effectively
act as a generic model of the visual world, and
can prove useful for many different computer
vision problems.
42
Feature extraction - pretrained model

 Let’s consider a large convnet trained on
the ImageNet dataset (1.4 million labeled
images and 1,000 different classes).
 ImageNet contains many animal classes,
including different species of cats and
dogs, and you can thus expect it to perform
well on the dogs-versus-cats classification
problem.
 Two ways to use a pretrained model:
feature extraction and fine-tuning.
43
Feature extraction - pretrained model
 CNNs used for image classification comprise two
parts:
 a series of pooling and convolution layers,
 and a densely connected classifier.
 The first part is called the convolutional base of
the model.
 Feature extraction consists of taking the
convolutional base of a previously trained
network, running the new data through it, and
training a new classifier on top of the output.

44
Feature extraction - pretrained model
[Figure: keeping the same convolutional base while swapping classifiers. Left and
middle: a trained convolutional base with its trained classifier. Right: the same
trained convolutional base (frozen) with a new classifier trained on top. Each
column maps an input to a prediction.]

45
Feature extraction - pretrained model

 Let’s put this into practice by using the
convolutional base of the VGG16 network,
trained on ImageNet, to extract interesting
features from cat and dog images, and
then train a dogs-versus-cats classifier on
top of these features.
 The VGG16 model, among others, comes
prepackaged with Keras.
 Import it from the keras.applications module.

46
Feature extraction - pretrained model
There are two ways we could proceed:
 Run the convolutional base over our dataset, record its
output to a NumPy array on disk, and then use this data
as input to a standalone, densely connected classifier.
 This solution is fast and cheap to run, because it only
requires running the convolutional base once for every
input image. But for the same reason, this technique
won’t allow us to use data augmentation.
 Extend the conv_base by adding Dense layers on top,
and run the whole thing from end to end on the input
data. This will allow us to use data augmentation,
because every input image goes through the
convolutional base every time it’s seen by the model.
47
Feature extraction - pretrained model
Extracting the VGG16 features and labels

conv_base = keras.applications.vgg16.VGG16(weights="imagenet",
    include_top=False, input_shape=(180, 180, 3))

def get_features_and_labels(dataset):
    all_features = []
    all_labels = []
    for images, labels in dataset:
        preprocessed_images = keras.applications.vgg16.preprocess_input(images)
        features = conv_base.predict(preprocessed_images)
        all_features.append(features)
        all_labels.append(labels)
    return np.concatenate(all_features), np.concatenate(all_labels)

train_features, train_labels = get_features_and_labels(train_dataset)


val_features, val_labels = get_features_and_labels(validation_dataset)
test_features, test_labels = get_features_and_labels(test_dataset)

48
Feature extraction - pretrained model
Defining and training the output layer

>>> train_features.shape
(2000, 5, 5, 512)

inputs = keras.Input(shape=(5, 5, 512))


x = layers.Flatten()(inputs)
x = layers.Dense(256)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

history = model.fit(
train_features, train_labels, epochs=20,
validation_data=(val_features, val_labels),
callbacks=callbacks)

49
Feature extraction - pretrained model

 We reach a validation accuracy of about
97% — much better than what we
achieved in the previous section with the
small model trained from scratch.

 Now, let us look at feature extraction
together with data augmentation:
 creating a model that chains the conv_base
with a new dense classifier, and training it end
to end on the inputs.
50
Feature extraction - pretrained model
conv_base = keras.applications.vgg16.VGG16(weights="imagenet",
include_top=False)
conv_base.trainable = False
inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)
x = keras.applications.vgg16.preprocess_input(x)
x = conv_base(x)
x = layers.Flatten()(x)
x = layers.Dense(256)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
51
Feature extraction - pretrained model

 Now, we reach a validation accuracy of
over 98%.
 This is a strong improvement over the
previous model.
 We got a test accuracy of 97.5%.

52
Fine-tuning a pretrained model

1. Add our custom network on top
of an already-trained base network.
2. Freeze the base network.
3. Train the part we added.
4. Unfreeze some layers in the base
network.
5. Jointly train both these layers and
the part we added.
53
Fine-tuning a pretrained model
conv_base.trainable = True
for layer in conv_base.layers[:-4]:
    layer.trainable = False
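Before jointly training, the model is recompiled with a very low learning rate so that the newly unfrozen VGG16 layers are only nudged rather than overwritten; a sketch along the lines of the book's setup (treat the exact optimizer and learning rate as an assumption):

model.compile(loss="binary_crossentropy",
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),  # very low learning rate
    metrics=["accuracy"])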

history = model.fit(train_dataset, epochs=30,
    validation_data=validation_dataset,
    callbacks=callbacks)

 Here, we got a test accuracy of 98.5%.

54
Chapter Summary
 Convnets are the best type of machine learning
models for computer vision tasks.
 Convnets work by learning a hierarchy of
modular patterns and concepts to represent the
visual world.
 It’s easy to reuse an existing convnet on a new
dataset via feature extraction.
 A valuable technique for small image datasets.
 As a complement to feature extraction, we can
use fine-tuning.
 This pushes performance a bit further.
55
Convolutional Neural Networks

(Chapter 5 from Weidman's book)

56
Representation Learning
 Learning process in ANNs starts by creating
initially random combinations of the original
features via multiplication by a random weight
matrix;
 Through training, the neural network learns to
refine combinations that are helpful and discard
those that aren’t.
 e.g., x1 being higher than average, x139 being lower
than average, and x237 also being lower than average
strongly predicts that an image will be of digit 9.

57
Representation Learning
 This process of learning which
combinations of features are important is
known as representation learning, and
it’s the main reason why neural networks
are successful across different domains.

58
Spatial Patterns in images
 In images, the interesting “combinations of
features” (pixels) tend to come from pixels
that are close together in the image.
 In an image, it is simply much less likely that
an interesting feature will result from a
combination of 9 randomly selected pixels
throughout the image than from a 3 × 3 patch
of adjacent pixels.
 We want to exploit this fundamental fact
about image data.
59
Spatial Patterns in images
 How to exploit spatial patterns in machine
learning for computer vision?
 A solution, at a high level, is to create an
order of magnitude more combinations of
features, and have each one be only a
combination of the pixels from a small
rectangular patch in the input image.

60
Spatial Patterns in images

61
Convolution operation
Input (3 x 4):            Kernel (2 x 2):
a b c d                   w x
e f g h                   y z
i j k l

Output (2 x 3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz

62
Convolution operation
 It turns out that features computed in this way
have a special interpretation: they represent
whether a visual pattern defined by the weights is
present at that location of the image.
 Kernels are essentially “pattern detectors.”

 The same set of weights W is used to detect
whether the visual pattern defined by the kernel
W exists at each location in the input image.
 The result is a “feature map” showing the
locations in the input image where the pattern
defined by W is present.
63
Multichannel Convolution Operation
 CNNs create an order of magnitude more features,
and each feature is a function of just a small patch
from the input image.

[Figure: an f x f convolutional filter applied across an n x n input image.]
64
Multichannel Convolution Operation
 The first hidden layer with m1 convolutional filters
transforms an input image into m1 feature maps.
 m1 feature maps represent presence/absence of m1
visual patterns at each location in the input image.
 Output of next layer with m2 filters represents
presence/absence of pattern of patterns at each
location in the input image.
 m2 feature maps of the second layer represent a
combination of the m1 visual features already learned
in the prior convolutional layer.

65
Multichannel Convolution Operation
 Each convolutional layer has
1. Input shape (batch size x input channels x
image height x image width)
2. Output shape (batch size x output channels x
image height x image width)
3. The convolutional filters have shape (input
channels x output channels x filter height x
filter width)
 We’ll keep all of this in mind when we
implement this convolution operation.
66
Convolutional vs Dense layers

67
Convolutional vs Dense layers
 One last difference between the two kinds of
layers is the way in which the individual neurons
themselves are interpreted:
 The interpretation of each neuron of a fully connected
layer is that it detects whether or not a particular
combination of the features learned by the prior layer
is present in the current observation.
 The interpretation of each neuron of a convolutional
layer is that it detects whether or not a particular
combination of visual patterns learned by the prior
layer is present at the given location of the input
image.

68
The Flatten Layer
 The last convolutional layer outputs a 3D
array of shape (channels × image height ×
image width) for each input image.
 This needs to be converted into 1D array to
be fed to the output layer to make a final
prediction.
 We do this with a flatten layer.

69
Pooling Layers
 Pooling layers simply downsample each of the
feature maps created by a convolution operation;
 for the most typically used pooling size of 2, a 2n × 2n
image would be downsampled to size n × n.

70
Pooling Layers
 The main advantage of pooling is
computational: by down-sampling the
image to contain one-fourth as many pixels
as the prior layer, pooling decreases both
the number of weights and the number of
computations needed to train the network
by a factor of 4;
 This can be further compounded if multiple
pooling layers are used in the network, as they
were in many architectures in the early days.
71
Pooling Layers
 The downside of pooling, of course, is that only one fourth
as much information can be extracted from the down-
sampled image.
 However, the strong performance in CV proved the trade-offs in
terms of increased computational speed were worth it.
 Nevertheless, pooling was considered by many to be a
trick that just happened to work but should probably be
done away with.
“The pooling operation used in convolutional neural networks is a
big mistake and the fact that it works so well is a disaster.”
---Geoffrey Hinton 2014.
 Most recent CNN architectures (such as “ResNets”) use
pooling minimally or not at all.
72
Pooling Layers
 A much more widely accepted way to do down-sampling
is to modify the stride of the convolution operation.
 With a stride of 2, the filter would be convolved with every
other element of the input image, so that the output would
be half the size of the input.
 This means that, using a stride of 2 would result in the
same output size and thus much the same reduction in
computation we would get from pooling with size 2, but
without as much loss of information:
 with pooling of size 2, only one-fourth of the elements
in the input have any effect on the output, whereas
with a stride of 2, every element of the input has some
effect on the output.
73
Applying CNNs beyond images
 Organizing data into “channels” and then processing
that data using a CNN goes beyond just images.
 For example, this data representation was a key to
DeepMind’s series of AlphaGo programs showing
that neural networks could learn to play Go.
 The input to the neural network is a 19 × 19 × 17
image stack comprising 17 binary feature planes.
 8 planes for white stones in the 8 prior moves
 8 planes for black stones in the 8 prior moves
 1 plane to represent the color to play (player turn)

74
Board of GO

75
Implementing the MCO – 1D case
Implementing the Multichannel Convolution Operation
 The convolution in one dimension is conceptually
identical to the convolution in 2D: we take in a
one-dimensional input and a one-dimensional
convolutional filter as inputs and then create the
output by sliding the filter along the input.
 Building up to the full operation from that starting
point will turn out mostly to be a matter of adding
a bunch of for loops.

76
Implementing the MCO – 1D case
 Padding in 1D
import numpy as np
from numpy import ndarray

def _pad_1d(inp: ndarray, num: int) -> ndarray:
    z = np.array([0])
    z = np.repeat(z, num)
    return np.concatenate([z, inp, z])

input_1d = np.array([1,2,3,4,5])
param_1d = np.array([1,1,1])
_pad_1d(input_1d, 1)
>>> array([0, 1, 2, 3, 4, 5, 0])

77
Implementing the MCO – 1D case

Convolutions: The Forward Pass


def conv_1d(inp: ndarray, param: ndarray) -> ndarray:
    # assert correct dimensions (assert_dim and assert_same_shape are the
    # book's helper functions)
    assert_dim(inp, 1)
    assert_dim(param, 1)
    # pad the input
    param_len = param.shape[0]
    param_mid = param_len // 2
    input_pad = _pad_1d(inp, param_mid)
    # initialize the output
    out = np.zeros(inp.shape)
    # perform the 1d convolution
    for o in range(out.shape[0]):
        for p in range(param_len):
            out[o] += param[p] * input_pad[o+p]
    # ensure shapes didn't change
    assert_same_shape(inp, out)
    return out
78
Implementing the MCO – 1D case
input_1d = np.array([1,2,3,4,5])
param_1d = np.array([1,1,1])
conv_1d(input_1d, param_1d)
>>> array([ 3., 6., 9., 12., 9.])

Convolutions: The Forward Pass


79
Implementing the MCO – 1D case
Convolutions: The Backward Pass
 We want to compute:
 The partial derivative of the loss with respect
to each element of the input to the convolution
operation.
 The partial derivative of the loss with respect

to each element of the filter.


 We need to write a function that takes in an
output_grad with the same shape as the input and
produces an input_grad and a param_grad.
80
Implementing the MCO – 1D case
 Computing an input_grad

For illustration purposes only!


def conv_1d_sum(inp: ndarray, param: ndarray) -> ndarray:
    out = conv_1d(inp, param)
    return np.sum(out)

# randomly choose to increase the 5th element by 1
input_1d = np.array([1,2,3,4,5])
input_1d_2 = np.array([1,2,3,4,6])
param_1d = np.array([1,1,1])
print(conv_1d_sum(input_1d, param_1d))
print(conv_1d_sum(input_1d_2, param_1d))
>>> 39.0
>>> 41.0

 So, the gradient of the fifth element of the
input should be 41 – 39 = 2.
81
Implementing the MCO – 1D case
 Intuitively, the gradient of the fifth element
of the input is 2 as t5 appears twice in the
output of the convolution operation:
 O1 = t0w1 + t1w2 + t2w3
 O2 = t1w1 + t2w2 + t3w3

 O3 = t2w1 + t3w2 + t4w3

 O4 = t3w1 + t4w2 + t5w3

 O5 = t4w1 + t5w2 + t6w3

 What is the gradient of the fourth element?

82
Implementing the MCO – 1D case
 Notice the pattern in:
∂L/∂t5 = o4_grad * w3 + o5_grad * w2 + o6_grad * w1
∂L/∂t4 = o3_grad * w3 + o4_grad * w2 + o5_grad * w1
∂L/∂t3 = o2_grad * w3 + o3_grad * w2 + o4_grad * w1
 The indices on the output increase at the same
time the indices on the weights decrease.
83
Implementing the MCO – 1D case
 The indices on the output increase at the same
time the indices on the weights decrease.

input_grad = np.zeros_like(inp)
for o in range(inp.shape[0]):
    for p in range(param.shape[0]):
        input_grad[o] += output_pad[o+param_len-p-1] * param[p]

84
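Putting the padding and the loop together, a self-contained sketch of the 1D input gradient (the function name is ours; _pad_1d and conv_1d_sum are the helpers defined earlier):

def conv_1d_input_grad(inp: ndarray, param: ndarray,
                       output_grad: ndarray) -> ndarray:
    param_len = param.shape[0]
    # pad the output gradient the same way the input was padded
    output_pad = _pad_1d(output_grad, param_len // 2)
    input_grad = np.zeros_like(inp)
    for o in range(inp.shape[0]):
        for p in range(param_len):
            input_grad[o] += output_pad[o + param_len - p - 1] * param[p]
    return input_grad

# With an output_grad of all ones (the "sum" loss above) this gives [2, 3, 3, 3, 2]:
# 2 for the fifth element, matching the perturbation check, and 3 for the fourth.
print(conv_1d_input_grad(input_1d, param_1d, np.ones_like(input_1d)))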
Implementing the MCO – 1D case
 Computing the parameter gradient

For illustration purposes only!


input_1d = np.array([1,2,3,4,5])
# randomly choose to increase first element by 1
param_1d = np.array([1,1,1])
param_1d_2 = np.array([2,1,1])
print(conv_1d_sum(input_1d, param_1d))
print(conv_1d_sum(input_1d, param_1d_2))
>>> 39.0
>>> 49.0
 So, the gradient of the first parameter
should be 49 – 39 = 10.

85
Implementing the MCO – 1D case
 Just as we did for the input, by closely examining
the output and seeing which elements of the filter
affect it, we can clearly see the pattern:
w1_grad = t0 * o1_grad + t1 * o2_grad + t2 * o3_grad + t3 * o4_grad + t4 * o5_grad

 And since, for the sum, all of the o_grad elements
are just 1, and t0 is 0 (it is padding), we have:

w1_grad = t1 + t2 + t3 + t4 = 1 + 2 + 3 + 4 = 10

86
Implementing the MCO – 1D case
 Coding this is easier, since “the indices are
moving in the same direction.”
param_grad = np.zeros_like(param)
for o in range(inp.shape[0]):
    for p in range(param.shape[0]):
        param_grad[p] += input_pad[o+p] * output_grad[o]

87
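A matching self-contained sketch for the parameter gradient (again, the function name is ours):

def conv_1d_param_grad(inp: ndarray, param: ndarray,
                       output_grad: ndarray) -> ndarray:
    param_len = param.shape[0]
    input_pad = _pad_1d(inp, param_len // 2)
    param_grad = np.zeros_like(param)
    for o in range(inp.shape[0]):
        for p in range(param_len):
            param_grad[p] += input_pad[o + p] * output_grad[o]
    return param_grad

# With an output_grad of all ones this gives [10, 15, 14];
# the 10 for the first weight matches the perturbation check above.
print(conv_1d_param_grad(input_1d, param_1d, np.ones_like(input_1d)))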
Implementing the Multichannel Convolution Operation

Batches of inputs

88
Implementing the MCO – batches
 Let’s now add the capability for these
convolution functions to work with batches
of inputs — 2D inputs whose first
dimension represents the batch size of the
input and whose second dimension
represents the length of the 1D sequence:

input_1d_batch = np.array([[0,1,2,3,4,5,6],
[1,2,3,4,5,6,7]])

89
Implementing the MCO – batches
 The only difference in implementing the
forward pass with batches is that we have
to pad and compute the output for each
observation individually and then stack
the results to get a batch of outputs.

def conv_1d_batch(inp: ndarray, param: ndarray) -> ndarray:
    outs = [conv_1d(obs, param) for obs in inp]
    return np.stack(outs)

90
Implementing the MCO – batches
 The backward pass is similar for computing
the input gradients.

# "input_grad" is the function containing the for loop from earlier:


# it takes in a 1d input, a 1d filter, and a 1d output_gradient and
# computes the input grad
grads = [input_grad(inp[i], param, out_grad[i]) for i in range(batch_size)]
return np.stack(grads)

91
Implementing the MCO – batches
 The backward pass involves adding an outer for loop
to the code to compute the parameter gradients.

param_grad = np.zeros_like(param)
for i in range(inp.shape[0]):  # inp.shape[0] = 2
    for o in range(inp.shape[1]):  # inp.shape[1] = 5
        for p in range(param.shape[0]):  # param.shape[0] = 3
            param_grad[p] += input_pad[i][o+p] * output_grad[i][o]
return param_grad

92
Implementing the Multichannel Convolution Operation

2D Convolutions

93
Implementing the MCO – 2D case
1. On the forward pass, we:
 Appropriately pad the input.
 Use the padded input and the parameters to compute the output.
2. On the backward pass, to compute the input gradient
and the parameter gradient we:
 Appropriately pad the output gradient.
 Use this padded output gradient, along with the input and the
parameters, to compute both the input gradient and the
parameter gradient.

94
Implementing the MCO – 2D case
 Coding the forward pass
out = np.zeros_like(inp)
for o_w in range(img_size):            # loop through the image width
    for o_h in range(img_size):        # loop through the image height
        for p_w in range(param_size):  # loop through the param width
            for p_h in range(param_size):  # loop through the param height
                out[o_w][o_h] += param[p_w][p_h] * input_pad[o_w+p_w][o_h+p_h]

Replacing the 1D loops:

for o in range(out.shape[0]):
    for p in range(param_len):
        out[o] += param[p] * input_pad[o+p]
95
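For reference, a self-contained version of the fragment above (the names _pad_2d and conv_2d are ours, mirroring the 1D helpers):

def _pad_2d(inp: ndarray, num: int) -> ndarray:
    # zero-pad a 2D array by `num` rows/columns on every side
    return np.pad(inp, num, mode="constant")

def conv_2d(inp: ndarray, param: ndarray) -> ndarray:
    param_size = param.shape[0]
    input_pad = _pad_2d(inp, param_size // 2)
    out = np.zeros_like(inp)
    for o_w in range(inp.shape[0]):
        for o_h in range(inp.shape[1]):
            for p_w in range(param_size):
                for p_h in range(param_size):
                    out[o_w][o_h] += param[p_w][p_h] * input_pad[o_w + p_w][o_h + p_h]
    return out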
Implementing the MCO – 2D case
 Coding the backward pass (input)

input_grad = np.zeros_like(inp)
for i_w in range(img_width):
    for i_h in range(img_height):
        for p_w in range(param_size):
            for p_h in range(param_size):
                input_grad[i_w][i_h] += (
                    output_pad[i_w+param_size-p_w-1][i_h+param_size-p_h-1] *
                    param[p_w][p_h])

96
Implementing the MCO – 2D case
 Coding the backward pass (parameter)

param_grad = np.zeros_like(param)
for i in range(batch_size):  # equal to inp.shape[0]
    for o_w in range(img_size):
        for o_h in range(img_size):
            for p_w in range(param_size):
                for p_h in range(param_size):
                    param_grad[p_w][p_h] += (
                        input_pad[i][o_w+p_w][o_h+p_h] *
                        output_grad[i][o_w][o_h])

97
Implementing the MCO – channels
 So far, our code convolves filters over a
two-dimensional input and produces a two-
dimensional output.
 Now we modify it to account for cases
where both the input and the output are
multichannel.
 All we need to do is to add two outer for
loops to the code we’ve already seen —
one loop for the input channels and
another for the output channels.
98
Implementing the MCO – channels
 Forward pass
def _compute_output_obs(obs: ndarray, param: ndarray) -> ndarray:
    out = np.zeros((out_channels,) + obs.shape[1:])
    for c_in in range(in_channels):
        for c_out in range(out_channels):
            for o_w in range(img_size):
                for o_h in range(img_size):
                    for p_w in range(param_size):
                        for p_h in range(param_size):
                            out[c_out][o_w][o_h] += (
                                param[c_in][c_out][p_w][p_h] *
                                obs_pad[c_in][o_w+p_w][o_h+p_h])
    return out

99
Implementing the MCO – channels
 Forward pass

def _output(inp: ndarray, param: ndarray) -> ndarray:
    '''
    obs: [batch_size, channels, img_width, img_height]
    param: [in_channels, out_channels, param_width, param_height]
    '''
    outs = [_compute_output_obs(obs, param) for obs in inp]
    return np.stack(outs)

100
Implementing the MCO – channels
Backward pass
 The backward pass is similar and follows the
same conceptual principles as the backward
pass in the simple 2D case:
1. For the input gradients, we compute the gradients of
each observation individually—padding the output
gradient to do so—and then stack the gradients.
2. We also use the padded output gradient for the
parameter gradient, but we loop through the
observations as well and use the appropriate values
from each one to update the parameter gradient.

101
Implementing the MCO – channels
 Backward pass
def _compute_grads_obs(input_obs: ndarray, output_grad_obs: ndarray,
                       param: ndarray) -> ndarray:
    for c_in in range(in_channels):
        for c_out in range(out_channels):
            for i_w in range(input_obs.shape[1]):
                for i_h in range(input_obs.shape[2]):
                    for p_w in range(param_size):
                        for p_h in range(param_size):
                            input_grad[c_in][i_w][i_h] += (
                                output_obs_pad[c_out][i_w+param_size-p_w-1]
                                              [i_h+param_size-p_h-1] *
                                param[c_in][c_out][p_w][p_h])
    return input_grad

102
Implementing the MCO – channels
 Backward pass

def _input_grad(inp: ndarray, output_grad: ndarray,
                param: ndarray) -> ndarray:
    grads = [_compute_grads_obs(inp[i], output_grad[i], param)
             for i in range(output_grad.shape[0])]
    return np.stack(grads)

103
Implementing the MCO – channels
 Backward pass
def _param_grad(inp: ndarray, output_grad: ndarray,
                param: ndarray) -> ndarray:
    for i in range(inp.shape[0]):
        for c_in in range(in_channels):
            for c_out in range(out_channels):
                for o_w in range(img_shape[0]):
                    for o_h in range(img_shape[1]):
                        for p_w in range(param_size):
                            for p_h in range(param_size):
                                param_grad[c_in][c_out][p_w][p_h] += (
                                    inp_pad[i][c_in][o_w+p_w][o_h+p_h] *
                                    output_grad[i][c_out][o_w][o_h])
    return param_grad

104
The Flatten Operation
 The output of a convolution operation is a 3D
ndarray for each observation, of dimension
(channels, img_height, img_width).
 We flatten this 3D ndarray into a 1D vector.
class Flatten(Operation):
    def __init__(self):
        super().__init__()

    def _output(self) -> ndarray:
        return self.input.reshape(self.input.shape[0], -1)

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return output_grad.reshape(self.input.shape)
105
The Full Conv2D Layer

106
Summary
 We learnt
 What CNNs are
 Their similarities and differences from fully
connected neural networks
 How they work at the lowest level
 How to implement the core multichannel
convolution operation from scratch in Python.
 Forward pass: output method
 Backward pass: input_grad and param_grad methods

107
