Unit 4a - Convolutional Neural Networks
(Chapter 9 from the DL book)
(Chapter 14 from the Hands-on ML book)
(Chapter 8 from Chollet’s book, 2e)
(Chapter 5 from Weidman’s book)
Convolutional Neural Networks
CNNs emerged from the study of the
brain’s visual cortex.
CNNs have managed to achieve superhuman
performance on complex visual tasks.
They power image search services, self-
driving cars, and automatic video
classification systems.
CNNs are also successful at many other tasks,
such as voice recognition and natural
language processing.
2
Convolutional Neural Networks
Hubel and Wiesel discovered that the neurons
that receive visual input from the eye are in
general most responsive to simple, straight
edges at particular orientations.
Fittingly, they named these cells simple cells.
A large group of simple cells together is able
to represent all 360 degrees of orientation.
These edge-orientation-detecting simple cells
then pass along information to a large number of
so-called complex cells, which are capable of
detecting more complex shapes such as corners
or curves.
3
Convolutional Neural Networks
4
Convolutional Neural Networks
The studies of visual cortex inspired the
neocognitron, introduced in 1980, which
gradually evolved into what we now call
convolutional neural networks.
In 1998, Yann LeCun et al. introduced the
famous LeNet-5 architecture, widely used
by banks to recognize handwritten check
numbers.
Introduced two new building blocks:
convolutional layers and pooling layers.
5
Convolutional Neural Networks
Convolution is an operation on two
functions.
CNN convolutions
The first function is the network input x, the
second is the kernel w.
Because the kernel is much smaller than the
input, convolution corresponds to a very sparse
weight matrix, in contrast to the dense weight
matrix of a fully connected layer.
6
Convolution operation
Input (3 × 4):
a b c d
e f g h
i j k l
Kernel (2 × 2):
w x
y z
Output (2 × 3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz
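The same computation in NumPy, as a sanity check: a minimal sketch of a "valid" 2D convolution (really a cross-correlation, as is conventional in CNNs); the function name conv2d_valid is an illustrative choice, not from the slides.

import numpy as np

def conv2d_valid(inp, kernel):
    # Slide the kernel over every position where it fits fully inside the input
    ih, iw = inp.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(inp[r:r+kh, c:c+kw] * kernel)
    return out

# A 3 x 4 input (playing the roles of a..l) and a 2 x 2 kernel (w, x, y, z)
inp = np.arange(12.0).reshape(3, 4)
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])
print(conv2d_valid(inp, kernel))   # 2 x 3 output, matching the table above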
7
Convolutional Neural Networks
Convolution leverages three important
ideas that help improve machine learning
systems
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
CNNs take advantage of spatial
information:
local patterns that are translation invariant,
and spatial hierarchies of these patterns.
8
Sparse Interactions
Fully connected traditional networks
with m neurons in a layer and n neurons in the
next layer
require O(m × n) runtime (per example).
Sparse interactions
Also called sparse connectivity or sparse weights.
Accomplished by making the kernel smaller than
the input.
With k << m, this requires only O(k × n) runtime (per
example).
k is typically several orders of magnitude smaller
than m.
9
Sparse Connectivity
[Figure: sparse connectivity viewed from below. With a small convolution kernel, one input unit x affects only a few output units s; with a fully connected (dense) layer it affects all of them.]
10
Sparse Connectivity
[Figure: sparse connectivity viewed from above (receptive fields). With a small convolution kernel, one output unit s receives input from only a few input units x; in a fully connected layer it receives input from all of them.]
11
Growing Receptive Fields
[Figure: stacking convolutional layers grows the receptive field. A unit in the deeper layer g is directly connected to only three units of h, but is indirectly connected to five units of the input x.]
Parameter Sharing
In traditional neural networks,
each element of the weight matrix is unique.
Parameter sharing means using the same
value for more than one parameter.
The network has tied weights.
Reduces storage requirements to k
parameters.
Forward propagation runtime is still O(k × n).
13
Equivariant Representations
For an equivariant function, if the input
changes, the output changes in the same way.
For convolution, the particular form of
parameter sharing causes equivariance to
translation.
For example, as a dog moves in the input
image, the detected edges move in the same way.
In image processing, detecting edges is useful
in the first layer, and edges appear more or
less everywhere in the image.
14
Problem of Equivariance
15
Receptive Field
A neuron located in row i, column j of a given layer is
connected to the outputs of the neurons in the previous
layer located in rows i to i + f_h − 1, columns j to j + f_w − 1,
where f_h and f_w are the height and width of the kernel.
Zero padding: in order for a layer to have the same height
and width as the previous layer, it is common to add
zeros around the inputs.
16
Stride greater than 1
It is also possible to connect a large input
layer to a much smaller layer by spacing
out the receptive fields.
17
Padding valid/same
18
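For an n × n input, kernel size f, padding p, and stride s, the output spatial size is ⌊(n + 2p − f) / s⌋ + 1.
A minimal Keras sketch contrasting the two padding modes (the layer and input sizes here are illustrative assumptions, not from the slides):

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))
# "valid": no zero padding, output size floor((28 - 3) / 2) + 1 = 13
v = layers.Conv2D(8, kernel_size=3, strides=2, padding="valid")(inputs)   # (None, 13, 13, 8)
# "same": zero padding so that the output size is ceil(28 / 2) = 14
s = layers.Conv2D(8, kernel_size=3, strides=2, padding="same")(inputs)    # (None, 14, 14, 8)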
Stacking Multiple Feature Maps
Typically, a convolutional
layer has multiple filters
and outputs one feature
map per filter.
All neurons in a given feature map share
the same parameters (weights and bias).
19
Pooling Layers
The pooling function replaces the output of
the net at a certain location with a
summary statistic of the nearby outputs
Max pooling reports the maximum output
within a rectangular neighborhood
Average pooling reports the average output
Pooling helps make the representation
approximately invariant to small input
translations.
The max pooling layer is the most commonly
used and generally performs better.
20
Pooling Layers
People mostly use max pooling layers
instead of average pooling layers because
Max pooling generally performs better.
Max pooling preserves only the strongest
features, getting rid of all the meaningless
ones, so the next layers get a cleaner signal to
work with.
Max pooling offers stronger translation
invariance than average pooling, and it
requires slightly less computing.
21
Pooling Layers
Pooling layers subsample (i.e., shrink) the
input image in order to reduce the
computational load, the memory usage,
and the number of parameters.
Thereby limiting the risk of overfitting.
22
Pooling Layers
Other than reducing computations, memory
usage, and the number of parameters, a max
pooling layer also introduces some level of
invariance to small translations.
23
Pooling Layers - Depthwise
Max pooling and average pooling can be
performed along the depth dimension rather than
the spatial dimensions.
This can allow the CNN to learn to be invariant to
various features.
For example, learn multiple filters, each detecting
a different rotation of the same pattern.
The depthwise max pooling layer would ensure that
the output is the same regardless of the rotation.
The CNN could similarly learn to be invariant to
anything else: thickness, brightness, skew, color,
and so on (a depthwise-pooling sketch follows below). 24
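Keras has no built-in depthwise pooling layer, but the lower-level tf.nn.max_pool can pool along the channel axis. A hedged sketch (the shapes are illustrative; whether depthwise pooling is supported can depend on the TensorFlow version and device):

import tensorflow as tf

images = tf.random.normal([2, 28, 28, 9])   # hypothetical batch of 9-channel feature maps
# Pool over groups of 3 channels; the channel count must be divisible by the pool depth
output = tf.nn.max_pool(images, ksize=(1, 1, 1, 3), strides=(1, 1, 1, 3), padding="VALID")
print(output.shape)   # (2, 28, 28, 3)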
Pooling Layers - Depthwise
25
Pooling Layers
One last type of pooling layer that you will
often see in modern architectures is the
global average pooling layer.
It computes the mean of each entire feature map (it’s
like an average pooling layer using a pooling kernel
with the same spatial dimensions as the inputs).
This means that it just outputs a single number per
feature map and per instance.
Used near the output (typically just before the final
prediction layer) in many well-known CNN
architectures (e.g., GoogLeNet, Xception, SENet); a
minimal usage example follows.
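A minimal usage sketch, with illustrative shapes:

import tensorflow as tf
from tensorflow.keras import layers

feature_maps = tf.random.normal([4, 7, 7, 512])        # hypothetical batch of feature maps
pooled = layers.GlobalAveragePooling2D()(feature_maps)
print(pooled.shape)                                    # (4, 512): one number per feature map
# Equivalent to tf.reduce_mean(feature_maps, axis=[1, 2])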
26
Convolutional Filter Hyperparameters
Kernel size
Padding
Stride length
27
Convolutional Neural Networks
28
CNN for MNIST
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
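The slides omit the training step; a sketch of how this model is typically compiled and trained (the MNIST loading and scaling below are assumptions, following the usual Keras setup):

from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype("float32") / 255

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size=64)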
29
CNN for MNIST
Calculate the number of parameters (worked out below):
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
conv2d (Conv2D) (None, 26, 26, 32) 320
max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0
conv2d_1 (Conv2D) (None, 11, 11, 64) 18496
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64) 0
conv2d_2 (Conv2D) (None, 3, 3, 128) 73856
flatten (Flatten) (None, 1152) 0
dense (Dense) (None, 10) 11530
=================================================================
Total params: 104,202
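Each Conv2D layer has (kernel height × kernel width × input channels) × filters weights, plus one bias per filter; pooling, flatten, and input layers have no parameters:
conv2d: (3 × 3 × 1) × 32 + 32 = 320
conv2d_1: (3 × 3 × 32) × 64 + 64 = 18,496
conv2d_2: (3 × 3 × 64) × 128 + 128 = 73,856
dense: 1,152 × 10 + 10 = 11,530
Total: 104,202.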
30
CNN for MNIST
Convolutions operate over rank-3 tensors called
feature maps, with two spatial axes (height and
width) as well as a depth axis (or channels axis).
The convolution operation extracts patches from
its input feature map and applies the same
transformation to all of these patches, producing
an output feature map.
Each of the 32 output channels contains a 26 ×
26 grid of values, which is a response map of
the filter over the input, indicating the response of
that filter pattern at different locations in the input.
31
CNN for MNIST
>>> test_loss, test_acc = model.evaluate(test_images, test_labels)
>>> print(f"Test accuracy: {test_acc:.3f}")
Test accuracy: 0.991
34
Training a CNN on a small dataset
inputs = keras.Input(shape=(180, 180, 3))
x = layers.Rescaling(1./255)(inputs)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
A total of 991,041 parameters.
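The compilation step is not shown on the slide; a sketch of the usual setup for this binary (cats vs. dogs) classifier:

model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])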
35
Training a CNN on a small dataset
Data preprocessing
36
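A sketch of the preprocessing step using keras.utils.image_dataset_from_directory; the directory names below are assumptions about how the small cats-vs-dogs subset is laid out:

from tensorflow.keras.utils import image_dataset_from_directory

# Assumed layout: <base>/train/{cat,dog}/..., <base>/validation/..., <base>/test/...
train_dataset = image_dataset_from_directory(
    "cats_vs_dogs_small/train", image_size=(180, 180), batch_size=32)
validation_dataset = image_dataset_from_directory(
    "cats_vs_dogs_small/validation", image_size=(180, 180), batch_size=32)
test_dataset = image_dataset_from_directory(
    "cats_vs_dogs_small/test", image_size=(180, 180), batch_size=32)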
Training a CNN on a small dataset
callbacks = [
keras.callbacks.ModelCheckpoint(
filepath="convnet_from_scratch.keras",
save_best_only=True, monitor="val_loss")
]
history = model.fit(
train_dataset, epochs=30,
validation_data=validation_dataset,
callbacks=callbacks)
37
Training a CNN on a small dataset
Overfitting starts within 10 epochs.
Validation accuracy peaks at 75%.
We get a test accuracy of 69.5%.
Expected, due to random sampling on a small
dataset.
We can try many techniques to mitigate
overfitting, such as dropout and weight
decay (L2 regularization).
Here we try the data augmentation technique.
38
Training a CNN on a small dataset
data_augmentation = keras.Sequential(
[
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.1),
layers.RandomZoom(0.2),
]
)
39
Training a CNN on a small dataset
inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)
x = layers.Rescaling(1./255)(x)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
40
Training a CNN on a small dataset
After training for 100 epochs, we found:
Overfitting occurring around the 60th epoch (much
better than the 10th epoch).
Validation accuracy in the 80–85% range
(again a big improvement over our first try).
A test accuracy of 83.5% (pretty decent
compared to 69.5%).
41
Feature extraction - pretrained model
A common and highly effective approach to deep
learning on small image datasets is to use a
pretrained model that was previously trained on a
large dataset, typically on a large-scale image-
classification task.
If this original dataset is large enough and
general enough, the spatial hierarchy of features
learned by the pretrained model can effectively
act as a generic model of the visual world, and
can prove useful for many different computer
vision problems.
42
Feature extraction - pretrained model
44
Feature extraction - pretrained model
[Figure: swapping classifiers while keeping the same convolutional base. The trained convolutional base (frozen) is reused, and only the prediction layers on top change.]
46
Feature extraction - pretrained model
There are two ways we could proceed. The first:
Run the convolutional base over our dataset, record its
output to a NumPy array on disk, and then use this data
as input to a standalone, densely connected classifier.
This solution is fast and cheap to run, because it only
requires running the convolutional base once for every
input image:
def get_features_and_labels(dataset):
    all_features = []
    all_labels = []
    for images, labels in dataset:
        preprocessed_images = keras.applications.vgg16.preprocess_input(images)
        features = conv_base.predict(preprocessed_images)
        all_features.append(features)
        all_labels.append(labels)
    return np.concatenate(all_features), np.concatenate(all_labels)
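The code above assumes a pretrained convolutional base; a sketch of one common choice, VGG16 without its densely connected top:

conv_base = keras.applications.vgg16.VGG16(
    weights="imagenet",
    include_top=False,
    input_shape=(180, 180, 3))
# For 180 x 180 inputs, the final feature maps have shape (5, 5, 512).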
48
Feature extraction - pretrained model
Defining and training the output layer
>>> train_features.shape
(2000, 5, 5, 512)
history = model.fit(
train_features, train_labels, epochs=20,
validation_data=(val_features, val_labels),
callbacks=callbacks)
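The model being fit here is a small densely connected classifier on top of the extracted 5 × 5 × 512 features; a sketch of one reasonable definition (the 256-unit layer and the dropout rate are assumptions):

inputs = keras.Input(shape=(5, 5, 512))
x = layers.Flatten()(inputs)
x = layers.Dense(256)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])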
49
Feature extraction - pretrained model
52
Fine-tuning a pretrained model
54
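A sketch of the usual fine-tuning recipe: unfreeze only the top layers of the convolutional base (which must be part of an end-to-end model for this to work) and retrain with a very low learning rate. The number of unfrozen layers here is an assumption.

conv_base.trainable = True
for layer in conv_base.layers[:-4]:
    layer.trainable = False                      # keep all but the last few layers frozen

model.compile(loss="binary_crossentropy",
              optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),   # small learning rate
              metrics=["accuracy"])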
Chapter Summary
Convnets are the best type of machine learning
models for computer vision tasks.
Convnets work by learning a hierarchy of
modular patterns and concepts to represent the
visual world.
It’s easy to reuse an existing convnet on a new
dataset via feature extraction.
A valuable technique for small image datasets.
As a complement to feature extraction, we can
use fine-tuning.
This pushes performance a bit further.
55
Convolutional Neural Networks
56
Representation Learning
The learning process in ANNs starts by creating
initially random combinations of the original
features via multiplication by a random weight
matrix;
Through training, the neural network learns to
refine combinations that are helpful and discard
those that aren’t.
e.g., x1 being higher than average, x139 being lower
than average, and x237 also being lower than average
strongly predicts that an image will be of digit 9.
57
Representation Learning
This process of learning which
combinations of features are important is
known as representation learning, and
it’s the main reason why neural networks
are successful across different domains.
58
Spatial Patterns in images
In images, the interesting “combinations of
features” (pixels) tend to come from pixels
that are close together in the image.
In an image, it is simply much less likely that
an interesting feature will result from a
combination of 9 randomly selected pixels
throughout the image than from a 3 × 3 patch
of adjacent pixels.
We want to exploit this fundamental fact
about image data.
59
Spatial Patterns in images
How to exploit spatial patterns in machine
learning for computer vision?
A solution, at a high level, is to create an
order of magnitude more combinations of
features, and have each one be only a
combination of the pixels from a small
rectangular patch in the input image.
60
Spatial Patterns in images
61
Convolution operation
Input (3 × 4):
a b c d
e f g h
i j k l
Kernel (2 × 2):
w x
y z
Output (2 × 3):
aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz
62
Convolution operation
It turns out that features computed in this way
have a special interpretation: they represent
whether a visual pattern defined by the weights is
present at that location of the image.
Kernels are essentially “pattern detectors.”
64
Multichannel Convolution Operation
The first hidden layer with m1 convolutional filters
transforms an input image into m1 feature maps.
m1 feature maps represent presence/absence of m1
visual patterns at each location in the input image.
The output of the next layer, with m2 filters,
represents the presence or absence of patterns
of patterns at each location in the input image.
The m2 feature maps of the second layer represent
combinations of the m1 visual features already learned
in the prior convolutional layer.
65
Multichannel Convolution Operation
Each convolutional layer has
1. Input shape (batch size x input channels x
image height x image width)
2. Output shape (batch size x output channels x
image height x image width)
3. The convolutional filters have shape (input
channels x output channels x filter height x
filter width)
We’ll keep all of this in mind when we
implement this convolution operation.
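For example (illustrative numbers): a first layer with m1 = 16 filters of size 3 × 3,
applied to a batch of 32 single-channel 28 × 28 images with "same" padding, has
input shape (32, 1, 28, 28), filter shape (1, 16, 3, 3), and output shape (32, 16, 28, 28).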
66
Convolutional vs Dense layers
67
Convolutional vs Dense layers
One last difference between the two kinds of
layers is the way in which the individual neurons
themselves are interpreted:
The interpretation of each neuron of a fully connected
layer is that it detects whether or not a particular
combination of the features learned by the prior layer
is present in the current observation.
The interpretation of each neuron of a convolutional
layer is that it detects whether or not a particular
combination of visual patterns learned by the prior
layer is present at the given location of the input
image.
68
The Flatten Layer
The last convolutional layer outputs a 3D
array of shape (channels × image height ×
image width) for each input image.
This needs to be converted into a 1D array to
be fed to the output layer to make a final
prediction.
We do this with a flatten layer.
69
Pooling Layers
Pooling layers simply downsample each of the
feature maps created by a convolution operation;
for the most typically used pooling size of 2, a 2n × 2n
image would be downsampled to size n × n.
70
Pooling Layers
The main advantage of pooling is
computational: by down-sampling the
image to contain one-fourth as many pixels
as the prior layer, pooling decreases both
the number of weights and the number of
computations needed to train the network
by a factor of 4;
This can be further compounded if multiple
pooling layers are used in the network, as they
were in many architectures in the early days.
71
Pooling Layers
The downside of pooling, of course, is that only one fourth
as much information can be extracted from the down-
sampled image.
However, the strong performance in CV proved the trade-offs in
terms of increased computational speed were worth it.
Nevertheless, pooling was considered by many to be a
trick that just happened to work but should probably be
done away with.
“The pooling operation used in convolutional neural networks is a
big mistake and the fact that it works so well is a disaster.”
---Geoffrey Hinton 2014.
Most recent CNN architectures (such as “ResNets”) use
pooling minimally or not at all.
72
Pooling Layers
A much more widely accepted way to do down-sampling
is to modify the stride of the convolution operation.
With a stride of 2, the filter would be convolved with every
other element of the input image, so that the output would
be half the size of the input.
This means that using a stride of 2 would result in the
same output size, and thus much the same reduction in
computation we would get from pooling with size 2, but
without as much loss of information:
with pooling of size 2, only one-fourth of the elements of the
input end up being used in the output, whereas with a stride-2
convolution every input element contributes to the output.
74
Board of Go
75
Implementing the Multichannel Convolution Operation (MCO) – the 1D case
The convolution in one dimension is conceptually
identical to the convolution in 2D: we take in a
one-dimensional input and a one-dimensional
convolutional filter as inputs and then create the
output by sliding the filter along the input.
Building up to the full operation from that starting
point will turn out mostly to be a matter of adding
a bunch of for loops.
76
Implementing the MCO – 1D case
Padding in 1D
def _pad_1d(inp: ndarray, num: int) -> ndarray:
    z = np.array([0])
    z = np.repeat(z, num)
    return np.concatenate([z, inp, z])

input_1d = np.array([1, 2, 3, 4, 5])
param_1d = np.array([1, 1, 1])
_pad_1d(input_1d, 1)
>>> array([0, 1, 2, 3, 4, 5, 0])
77
Implementing the MCO – 1D case
82
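A minimal sketch of the 1D forward pass in the spirit of this implementation, reusing _pad_1d from above (the name conv_1d and the "same"-padding choice are assumptions):

import numpy as np
from numpy import ndarray

def conv_1d(inp: ndarray, param: ndarray) -> ndarray:
    # Pad so the output has the same length as the input, then slide the filter
    param_len = param.shape[0]
    inp_pad = _pad_1d(inp, param_len // 2)
    out = np.zeros_like(inp)
    for o in range(out.shape[0]):          # each output position
        for p in range(param_len):         # each filter element
            out[o] += param[p] * inp_pad[o + p]
    return out

conv_1d(input_1d, param_1d)   # array([ 3,  6,  9, 12,  9]) for the input and filter above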
Implementing the MCO – 1D case
Notice the pattern in:

∂L/∂t5 = o4_grad · w3 + o5_grad · w2 + o6_grad · w1
∂L/∂t4 = o3_grad · w3 + o4_grad · w2 + o5_grad · w1
∂L/∂t3 = o2_grad · w3 + o3_grad · w2 + o4_grad · w1

The indices on the output increase at the same
time the indices on the weights decrease.
83
Implementing the MCO – 1D case
The indices on the output increase at the same
time the indices on the weights decrease.
input_grad = np.zeros_like(inp)
for o in range(inp.shape[0]):          # each input position
    for p in range(param.shape[0]):    # each filter element
        # output_pad is the zero-padded output gradient; param_len = param.shape[0]
        input_grad[o] += output_pad[o + param_len - p - 1] * param[p]
84
Implementing the MCO – 1D case
Computing the parameter gradient
85
Implementing the MCO – 1D case
Just as we did for the input, by closely examining
the output and seeing which elements of the filter
affect it, we can clearly see the pattern:
w1_grad = t0 · o1_grad + t1 · o2_grad + t2 · o3_grad + t3 · o4_grad + t4 · o5_grad
86
Implementing the MCO – 1D case
Coding this is easier, since “the indices are
moving in the same direction.”
param_grad = np.zeros_like(param)
for o in range(inp.shape[0]):          # each output position
    for p in range(param.shape[0]):    # each filter element
        # input_pad is the zero-padded input; both indices move in the same direction
        param_grad[p] += input_pad[o + p] * output_grad[o]
87
Implementing the Multichannel
Convolution Operation
Batches of inputs
88
Implementing the MCO – batches
Let’s now add the capability for these
convolution functions to work with batches
of inputs — 2D inputs whose first
dimension represents the batch size of the
input and whose second dimension
represents the length of the 1D sequence:
input_1d_batch = np.array([[0,1,2,3,4,5,6],
[1,2,3,4,5,6,7]])
89
Implementing the MCO – batches
The only difference in implementing the
forward pass with batches is that we have
to pad and compute the output for each
observation individually and then stack
the results to get a batch of outputs.
90
Implementing the MCO – batches
The backward pass is similar for computing
the input gradients.
91
Implementing the MCO – batches
The backward pass involves adding an outer for loop
to the code to compute the parameter gradients.
param_grad = np.zeros_like(param)
for i in range(inp.shape[0]):            # inp.shape[0] = 2 (batch size)
    for o in range(inp.shape[1]):        # inp.shape[1] = 5 (sequence length)
        for p in range(param.shape[0]):  # param.shape[0] = 3 (filter length)
            param_grad[p] += input_pad[i][o+p] * output_grad[i][o]
return param_grad
92
Implementing the Multichannel
Convolution Operation
2D Convolutions
93
Implementing the MCO – 2D case
1. On the forward pass, we:
Appropriately pad the input.
Use the padded input and the parameters to compute the output.
2. On the backward pass, to compute the input gradient
and the parameter gradient we:
Appropriately pad the output gradient.
Use this padded output gradient, along with the input and the
parameters, to compute both the input gradient and the
parameter gradient.
94
Implementing the MCO – 2D case
Coding the forward pass:

out = np.zeros_like(inp)
for o_w in range(img_size):                  # loop through the image width
    for o_h in range(img_size):              # loop through the image height
        for p_w in range(param_size):        # loop through the param width
            for p_h in range(param_size):    # loop through the param height
                out[o_w][o_h] += param[p_w][p_h] * input_pad[o_w+p_w][o_h+p_h]

Coding the backward pass (input gradient):

input_grad = np.zeros_like(inp)
for i_w in range(img_width):
    for i_h in range(img_height):
        for p_w in range(param_size):
            for p_h in range(param_size):
                input_grad[i_w][i_h] += output_pad[i_w+param_size-p_w-1][i_h+param_size-p_h-1] * param[p_w][p_h]
96
Implementing the MCO – 2D case
Coding the backward pass (parameter)
param_grad = np.zeros_like(param)
for i in range(batch_size):                  # equal to inp.shape[0]
    for o_w in range(img_size):
        for o_h in range(img_size):
            for p_w in range(param_size):
                for p_h in range(param_size):
                    param_grad[p_w][p_h] += input_pad[i][o_w+p_w][o_h+p_h] * output_grad[i][o_w][o_h]
97
Implementing the MCO – channels
So far, our code convolves filters over a
two-dimensional input and produces a two-
dimensional output.
Now we modify it to account for cases
where both the input and the output are
multichannel.
All we need to do is to add two outer for
loops to the code we’ve already seen —
one loop for the input channels and
another for the output channels.
98
Implementing the MCO – channels
Forward pass
def _compute_output_obs(obs: ndarray, param: ndarray) -> ndarray:
    …
    out = np.zeros((out_channels,) + obs.shape[1:])
    for c_in in range(in_channels):
        for c_out in range(out_channels):
            for o_w in range(img_size):
                for o_h in range(img_size):
                    for p_w in range(param_size):
                        for p_h in range(param_size):
                            out[c_out][o_w][o_h] += param[c_in][c_out][p_w][p_h] * obs_pad[c_in][o_w+p_w][o_h+p_h]
    return out
99
Implementing the MCO – channels
Forward pass
100
Implementing the MCO – channels
Backward pass
The backward pass is similar and follows the
same conceptual principles as the backward
pass in the simple 2D case:
1. For the input gradients, we compute the gradients of
each observation individually—padding the output
gradient to do so—and then stack the gradients.
2. We also use the padded output gradient for the
parameter gradient, but we loop through the
observations as well and use the appropriate values
from each one to update the parameter gradient.
101
Implementing the MCO – channels
Backward pass
def _compute_grads_obs(input_obs: ndarray, output_grad_obs: ndarray,
                       param: ndarray) -> ndarray:
    …
    for c_in in range(in_channels):
        for c_out in range(out_channels):
            for i_w in range(input_obs.shape[1]):
                for i_h in range(input_obs.shape[2]):
                    for p_w in range(param_size):
                        for p_h in range(param_size):
                            input_grad[c_in][i_w][i_h] += output_obs_pad[c_out][i_w+param_size-p_w-1][i_h+param_size-p_h-1] * param[c_in][c_out][p_w][p_h]
    return input_grad
102
Implementing the MCO – channels
Backward pass
103
Implementing the MCO – channels
Backward pass
def _param_grad(inp: ndarray, output_grad: ndarray,
                param: ndarray) -> ndarray:
    …
    for i in range(inp.shape[0]):
        for c_in in range(in_channels):
            for c_out in range(out_channels):
                for o_w in range(img_shape[0]):
                    for o_h in range(img_shape[1]):
                        for p_w in range(param_size):
                            for p_h in range(param_size):
                                param_grad[c_in][c_out][p_w][p_h] += inp_pad[i][c_in][o_w+p_w][o_h+p_h] * output_grad[i][c_out][o_w][o_h]
    return param_grad
104
The Flatten Operation
The output of a convolution operation is a 3D
ndarray for each observation, of dimension
(channels, img_height, img_width).
We flatten this 3D ndarray into a 1D vector.
class Flatten(Operation):
    def __init__(self):
        super().__init__()

    def _output(self) -> ndarray:
        return self.input.reshape(self.input.shape[0], -1)

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return output_grad.reshape(self.input.shape)
105
The Full Conv2D Layer
106
Summary
We learnt
What CNNs are
Their similarities and differences from fully
connected neural networks
How they work at the lowest level
How to implement the core multichannel
convolution operation from scratch in Python.
Forward pass: output method
Backward pass: input_grad and param_grad methods
107