
Chapter 14: Deep Computer Vision

Using Convolutional Neural Networks

Tsz-Chiu Au
chiu@unist.ac.kr

Ulsan National Institute of Science and Technology (UNIST)


South Korea
Convolutional Neural Networks
• Convolutional neural networks (CNNs) emerged from the study of
the brain’s visual cortex.
» Used in image recognition since the 1980s
» Today, CNNs have managed to achieve superhuman performance on some
complex visual tasks.
» They power image search services, self-driving cars, automatic video
classification systems, and more.
» CNNs are also successful at many other tasks, such as voice recognition
and natural language processing.
• In this chapter,
» Describe the architecture of CNNs.
» Discuss how to implement CNNs using TensorFlow and Keras.
» Present some of the best CNN architectures.
» Discuss visual tasks such as object detection and semantic segmentation.
The Architecture of the Visual Cortex
• David H. Hubel and Torsten Wiesel showed that
» Many neurons in the visual cortex have a small local receptive field
§ react only to visual stimuli located in a limited region of the visual field.
» Some neurons react only to images of horizontal lines, while others
react only to lines with different orientations.
» Some neurons have larger receptive fields, and they react to more
complex patterns that are combinations of the lower-level patterns.
The Architecture of the Visual Cortex (cont.)
• The studies of visual cortex inspired the neocognitron,
introduced in 1980, which gradually evolved into what we
now call convolutional neural networks.
• In 1998, Yann LeCun et al. introduced the famous LeNet-5
architecture, widely used by banks to recognize handwritten
check numbers.
» introduced two new building blocks: convolutional layers and pooling
layers.
• We cannot simply use a deep neural network with fully
connected layers for image recognition tasks.
» Too many connections: a layer of just 1,000 neurons fully connected to a
100 × 100-pixel image already requires 10 million connections.
» CNNs solve this problem using partially connected layers and weight
sharing.
Convolutional Layers
• A convolution is a mathematical operation that slides one function over
another and measures the integral of their pointwise multiplication.
• Convolutional layers with rectangular local receptive fields.
» Only connected to pixels in their receptive fields.
» Each neuron in the second convolutional layer is connected only to neurons
located within a small rectangle in the first layer.

• This architecture allows the network to concentrate on small low-level
features in the first hidden layer, then assemble them into larger higher-
level features in the next hidden layer, and so on.
» This hierarchical structure is common in real-world images.
Convolutional Layers (cont.)
• A neuron located in row i, column j of a given layer is connected to the
outputs of the neurons in the previous layer located in rows i to i + fh – 1,
columns j to j + fw – 1, where fh and fw are the height and width of the
receptive field.

• Zero padding: In order for a layer to have the same height and width as
the previous layer, it is common to add zeros around the inputs.
Convolutional Layers (cont.)
• It is also possible to connect a large input layer to a much smaller layer by
spacing out the receptive fields.
• Stride: The shift from one receptive field to the next.
• A 5 × 7 input layer (plus zero padding) is connected to a 3 × 4 layer, using 3 × 3
receptive fields and a stride of 2.

• In this example the stride is the same in both directions, but it does not have
to be so.
• A neuron located in row i, column j in the upper layer is connected to the
outputs of the neurons in the previous layer located in rows i × sh to i × sh + fh –
1, columns j × sw to j × sw + fw – 1, where sh and sw are the vertical and
horizontal strides.
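To make the arithmetic concrete, here is a small helper of our own (the name conv_output_size is made up, not from the slides) that computes the spatial output size of a convolutional layer along one dimension:

import math

def conv_output_size(input_size, kernel_size, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    if padding == "same":
        # Enough zero padding is added so the size only depends on the stride.
        return math.ceil(input_size / stride)
    elif padding == "valid":
        # No zero padding: the receptive field must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

# The slide's example: a 5 x 7 input, 3 x 3 receptive fields, stride 2, zero padding.
print(conv_output_size(5, 3, 2, "same"), conv_output_size(7, 3, 2, "same"))  # 3 4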
Filters
• Convolutional layers can act as filters (or convolution kernels) that output feature
maps, which highlight the areas in an image that activate the filter the most.

• The vertical filter is represented as a black square with a vertical white line in the
middle (it is a 7 × 7 matrix full of 0s except for the central column, which is full of 1s).
» Neurons using this filter ignore everything in their receptive field except for the central vertical line.
• If all neurons in a layer use the same vertical-line filter (and the same bias term), the
output is a feature map such as Feature map 1 in the figure.
» The vertical white lines get enhanced while the rest gets blurred.
» All neurons within a given feature map share the same parameters.
• During training the convolutional layer will automatically learn the filters.
Stacking Multiple Feature Maps
• Typically, a convolutional layer has
multiple filters and outputs one
feature map per filter.
» All neurons in a feature map
share the same parameters
§ Dramatically reduces the number of
parameters in the model.
§ Once the CNN has learned to
recognize a pattern in one location, it
can recognize it in any other location.
» Neurons in different feature maps
use different parameters.
» A neuron’s receptive field is the
same as described earlier, but it
extends across all the previous
layer’s feature maps.
Stacking Multiple Feature Maps (cont.)
• Input images are composed of multiple sublayers: one per color channel.
» There are typically three: red, green, and blue (RGB).
» Grayscale images have just one channel.
» But some images may have much more—for example, satellite images that
capture extra light frequencies (such as infrared).
• Specifically, a neuron located in row i, column j of the feature map k in a
given convolutional layer l is connected to the outputs of the neurons in
the previous layer l – 1, located in rows i × sh to i × sh + fh – 1 and columns j
× sw to j × sw + fw – 1, across all feature maps (in layer l – 1).
» All neurons located in the same row i and column j but in different feature
maps are connected to the outputs of the exact same neurons in the previous
layer.
TensorFlow Implementation of Convolutional Layers
• Each input image is typically represented as a 3D tensor of shape [height,
width, channels]
• A mini-batch is represented as a 4D tensor of shape [mini-batch size,
height, width, channels].
• The weights of a convolutional layer are represented as a 4D tensor of
shape [fh, fw, fnʹ, fn], where fnʹ is the number of feature maps in the previous
layer and fn is the number of feature maps (filters) in this layer.
• The bias terms of a convolutional layer are simply represented as a 1D
tensor of shape [fn].
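A sketch that makes these shapes concrete (random images stand in for real ones; the two hand-made filters mirror the vertical/horizontal line filters discussed earlier):

import numpy as np
import tensorflow as tf

# A dummy mini-batch of two 70 x 120 RGB images: shape [batch, height, width, channels].
images = np.random.rand(2, 70, 120, 3).astype(np.float32)

# Two hand-made 7 x 7 filters: the weight tensor has shape [fh, fw, fn', fn].
filters = np.zeros((7, 7, 3, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1.0  # filter 0: vertical line through the central column
filters[3, :, :, 1] = 1.0  # filter 1: horizontal line through the central row

# One feature map per filter: the output has shape [batch, height, width, 2].
outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")
print(outputs.shape)  # (2, 70, 120, 2)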
Padding
Keras Implementation of Convolutional Layers
• We don’t want to manually define the filters.
» Filters should be trainable variables so the neural net can learn which filters work best
• In Keras, we can create a Conv2D layer with 32 filters, each 3 × 3, using a stride of 1
(both horizontally and vertically) and "same" padding, and applying the ReLU
activation function to its outputs, as shown below.
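A minimal sketch of such a layer:

from tensorflow import keras

conv = keras.layers.Conv2D(filters=32, kernel_size=3, strides=1,
                           padding="same", activation="relu")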

• Hyperparameters: the number of filters, their height and width, the strides,
and the padding type.
» But using cross-validation to find the right hyperparameter values is very time-consuming.
• The convolutional layers require a huge amount of RAM.
» The reverse pass of backpropagation requires all the intermediate values computed during the
forward pass.
» If training crashes because of an out-of-memory error, you can try
§ reducing the mini-batch size.
§ reducing dimensionality using a stride, or removing a few layers.
§ using 16-bit floats instead of 32-bit floats.
§ Distributing the CNN across multiple devices.
Pooling Layers
• Pooling layers subsample (i.e., shrink) the input image in order to reduce
the computational load, the memory usage, and the number of
parameters
» Thereby limiting the risk of overfitting.
• A neuron in a pooling layer is just like a neuron in a convolutional layer,
except it has no weights.
» All it does is aggregate the inputs using an aggregation function such as the
max or mean.
• Max pooling layer is the most common type of pooling layer.
» 2 × 2 pooling kernel with a stride of 2 and no padding.
» Only the max input value in each receptive field makes it to the next layer, while the
other inputs are dropped.
Pooling Layers (cont.)
• A pooling layer typically works on every input channel independently, so the
output depth is the same as the input depth.
• Max pooling offers a small amount of rotational invariance and a slight scale
invariance.
» Such invariance (even if it is limited) can be useful in cases where the prediction should not
depend on these details, such as in classification tasks.

• The downside of max pooling is that it is very destructive.


• In some applications (e.g., semantic segmentation), invariance is not desirable.
» For semantic segmentation, the goal is equivariance, not invariance: a small change to the
inputs should lead to a corresponding small change in the output.
Pooling Layers (cont.)
• Implementing max pooling in Keras:
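The figure's code is not reproduced here; a minimal sketch is a 2 × 2 max pooling kernel (the stride defaults to the kernel size, and there is no padding):

from tensorflow import keras

max_pool = keras.layers.MaxPool2D(pool_size=2)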

• Average pooling layer:
» computes the mean rather than the max
» In Keras, use AvgPool2D instead of MaxPool2D
• People mostly use max pooling layers instead of average pooling layers
» Max pooling generally performs better.
» Max pooling preserves only the strongest features, getting rid of all the meaningless
ones, so the next layers get a cleaner signal to work with.
» Max pooling offers stronger translation invariance than average pooling, and it requires
slightly less compute.
Depthwise Max Pooling Layer
• Max pooling and average pooling can be performed along the depth
dimension rather than the spatial dimensions
» This can allow the CNN to learn to be invariant to various features.
• For example, learn multiple filters, each detecting a different rotation of
the same pattern
» The depthwise max pooling layer would ensure that the output is the same regardless
of the rotation.
• The CNN could similarly learn to be invariant to anything else: thickness,
brightness, skew, color, and so on.
Global Average Pooling Layer
• Keras does not include a depthwise max pooling layer, but TensorFlow’s low-
level Deep Learning API does.
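A sketch using the low-level API, assuming images is a 4D tensor of shape [batch, height, width, channels] whose channel count is a multiple of 3:

import tensorflow as tf
from tensorflow import keras

# Pool over groups of 3 feature maps: kernel size and stride are 1 along the
# batch, height, and width axes, and 3 along the channel axis.
output = tf.nn.max_pool(images, ksize=(1, 1, 1, 3), strides=(1, 1, 1, 3),
                        padding="VALID")

# To include it in a Keras model, wrap it in a Lambda layer:
depth_pool = keras.layers.Lambda(
    lambda X: tf.nn.max_pool(X, ksize=(1, 1, 1, 3), strides=(1, 1, 1, 3),
                             padding="VALID"))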

• Global average pooling layer: computes the mean of each entire feature map
» it’s like an average pooling layer using a pooling kernel with the same spatial dimensions as
the inputs.
» it can be useful as the output layer
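In Keras this is a one-liner:

from tensorflow import keras

# Outputs one number per feature map (and per instance), dropping all spatial information.
global_avg_pool = keras.layers.GlobalAvgPool2D()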
CNN Architectures
• Typical CNN architectures stack a few convolutional layers (each one generally
followed by a ReLU layer), then a pooling layer, then another few convolutional
layers (+ReLU), then another pooling layer, and so on.
» The image gets smaller and smaller as it progresses through the network, but it also typically
gets deeper and deeper (i.e., with more feature maps)
• At the top of the stack, a few fully connected layers are added, and then the
final layer outputs the prediction.

• A common mistake is to use convolution kernels that are too large.
» E.g., instead of using a convolutional layer with a 5 × 5 kernel, stack two layers with 3 × 3 kernels
§ it will use fewer parameters, require fewer computations, and usually perform better.
» One exception is for the first convolutional layer: it can typically have a large kernel (e.g., 5 × 5),
usually with a stride of 2 or more
§ This will reduce the spatial dimension of the image without losing too much information,
and since the input image only has three channels in general, it will not be too costly.
Implementing a Simple CNN in Keras
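The slide's code figure is not reproduced here; the following is a sketch of a small CNN for Fashion MNIST of the kind described (the filter counts and dense-layer sizes are illustrative):

from functools import partial
from tensorflow import keras

DefaultConv2D = partial(keras.layers.Conv2D, kernel_size=3,
                        activation="relu", padding="same")

model = keras.models.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    keras.layers.MaxPool2D(),        # 28 x 28 -> 14 x 14
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    keras.layers.MaxPool2D(),        # 14 x 14 -> 7 x 7
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    keras.layers.MaxPool2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=10, activation="softmax"),
])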
Beyond Simple CNNs

• This simple CNN reaches over 92% accuracy on the Fashion MNIST dataset
» Much better than what we achieved with dense networks in Chapter 10.
• But variants of this fundamental architecture have been developed.
• A good measure of this progress is the error rate in competitions such as
the ILSVRC ImageNet challenge.
» In this competition the top-five error rate for image classification fell from over 26% to
less than 2.3% in just six years.
» The top-five error rate is the number of test images for which the system’s top five
predictions did not include the correct answer.
» The images are large (256 pixels high) and there are 1,000 classes, some of which are
really subtle (try distinguishing 120 dog breeds).
• We will first look at the classical LeNet-5 architecture (1998), then four of
the winners of the ILSVRC challenge: AlexNet (2012), GoogLeNet (2014),
ResNet (2015), and SENet (2017)
LeNet-5
• The LeNet-5 architecture was created by Yann LeCun in 1998 and has been
widely used for handwritten digit recognition (MNIST).
» It introduces most of the elements in modern CNN architectures.
» But the choice of average pooling layers and the use of the square of the Euclidean
distance between each output unit's input vector and its weight vector are now obsolete.
AlexNet
• The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by
a large margin.
» Achieved a top-five error rate of 17%, while the second best achieved only 26%.
• Similar to LeNet-5 but larger.
• It was the first to stack convolutional layers directly on top of one another,
instead of stacking a pooling layer on top of each convolutional layer.
AlexNet (cont.)
• Introduced two regularization techniques:
» Dropout with a 50% dropout rate during training to the outputs of layers F9 and F10.
» Data augmentation by randomly shifting the training images by various offsets,
flipping them horizontally, and changing the lighting conditions.
§ Increases the size of the training set by generating many realistic variants of each training instance.
§ Forces the model to be more tolerant to variations in the position, orientation, and size of the objects in
the pictures.
AlexNet (cont.)
• AlexNet also uses a competitive normalization step immediately after the
ReLU step of layers C1 and C3, called local response normalization (LRN)
» The most strongly activated neurons inhibit other neurons located at the same
position in neighboring feature maps.
» Such competitive activation has been observed in biological neurons
» This encourages different feature maps to specialize, pushing them apart and
forcing them to explore a wider range of features, ultimately improving
generalization.
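Keras has no built-in LRN layer, but TensorFlow's low-level API does; it can be wrapped in a Lambda layer (the hyperparameter values below are illustrative, not AlexNet's exact ones):

import tensorflow as tf
from tensorflow import keras

lrn_layer = keras.layers.Lambda(
    lambda X: tf.nn.local_response_normalization(
        X, depth_radius=2, bias=1.0, alpha=0.00002, beta=0.75))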

• A variant of AlexNet called ZF Net was developed by Matthew Zeiler and
Rob Fergus and won the 2013 ILSVRC challenge.
» It is essentially AlexNet with a few tweaked hyperparameters (number of feature maps,
kernel size, stride, etc.).
GoogLeNet
• The GoogLeNet architecture won the ILSVRC 2014 challenge by pushing
the top-five error rate below 7%.
» much deeper than previous CNNs
» Inception modules allow GoogLeNet to use parameters much more efficiently
than previous architectures.
§ GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million
instead of 60 million).
Inception Modules
• The notation “3 × 3 + 1(S)” means that the layer uses a 3 × 3 kernel, stride 1,
and "same" padding.
• The input signal is first copied and fed to four different layers.
• All convolutional layers use the ReLU activation function.
• The second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3,
and 5 × 5), allowing them to capture patterns at different scales.
• Every single layer uses a stride of 1 and "same" padding (even the max pooling
layer), so their outputs all have the same height and width as their inputs.
• This makes it possible to concatenate all
the outputs along the depth dimension
in the final depth concatenation layer
(i.e., stack the feature maps from all four
top convolutional layers).
• This concatenation layer can be
implemented in TensorFlow using the
tf.concat() operation, with axis=3 (the
axis is the depth).
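A rough sketch of such a module using the Keras functional API (the function name and filter-count arguments are ours, not from the slides):

import tensorflow as tf
from tensorflow import keras

def inception_module(inputs, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    conv = lambda filters, kernel: keras.layers.Conv2D(
        filters, kernel, strides=1, padding="same", activation="relu")

    branch1 = conv(f1, 1)(inputs)                      # 1 x 1
    branch2 = conv(f3, 3)(conv(f3_reduce, 1)(inputs))  # 1 x 1 then 3 x 3
    branch3 = conv(f5, 5)(conv(f5_reduce, 1)(inputs))  # 1 x 1 then 5 x 5
    branch4 = conv(pool_proj, 1)(                      # 3 x 3 max pool then 1 x 1
        keras.layers.MaxPool2D(pool_size=3, strides=1, padding="same")(inputs))

    # Depth concatenation: stack the feature maps of all four branches along axis 3.
    return tf.concat([branch1, branch2, branch3, branch4], axis=3)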
Convolutional Layers with 1 × 1 Kernels
• Although convolutional layers with 1 × 1 kernels cannot capture spatial
patterns, they can capture patterns along the depth dimension.
• They are configured to output fewer feature maps than their inputs, so
they serve as bottleneck layers
» They reduce dimensionality, cutting the computational cost and the number of
parameters, speeding up training and improving generalization.
• Each pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]) acts like a
single powerful convolutional layer, capable of capturing more complex
patterns.
» This pair of convolutional layers sweeps a two-layer neural network across the image.
• In summary, you can think of the whole inception module as a
convolutional layer on steroids, able to output feature maps that capture
complex patterns at various scales.
The Architecture of GoogLeNet
• The number of feature maps output
by each convolutional layer and
each pooling layer is shown before
the kernel size.
• The six numbers in the inception
modules represent the number of
feature maps output by each
convolutional layer in the module.
• All the convolutional layers use the
ReLU activation function.
• The first two layers divide the
image’s height and width by 4 (so its
area is divided by 16), to reduce the
computational load.
» The first layer uses a large kernel size so
that much of the information is
preserved.
• Then the local response
normalization layer ensures that the
previous layers learn a wide variety
of features.
The Architecture of GoogLeNet (cont.)
• Two convolutional layers follow,
where the first acts like a
bottleneck layer.
• A local response normalization
layer ensures that the previous
layers capture a wide variety of
patterns.
• Next, a max pooling layer reduces
the image height and width by 2,
again to speed up computations.
• The tall stack of nine inception
modules interleaves with a couple
max pooling layers to reduce
dimensionality and speed up the
net.
The Architecture of GoogLeNet (cont.)
• Next, the global average pooling
layer outputs the mean of each
feature map
» This drops any remaining spatial
information, which is fine because there
was not much spatial information left at
that point.
» Thanks to the dimensionality reduction
brought by this layer, there is no need to
have several fully connected layers at the
top of the CNN, considerably reducing
the number of parameters.
• Dropout for regularization, then a
fully connected layer with 1,000
units (since there are 1,000 classes)
and a softmax activation function to
output estimated class probabilities.
• The original GoogLeNet architecture
also included two auxiliary
classifiers plugged on top of the
third and sixth inception modules.
» But their effect was relatively minor.
VGGNet
• The runner-up in the ILSVRC 2014 challenge was VGGNet.
• It had a very simple and classical architecture
» 2 or 3 convolutional layers and a pooling layer, then again 2 or 3 convolutional
layers and a pooling layer, and so on
§ reaching a total of just 16 or 19 convolutional layers, depending on the VGG variant
» A final dense network with 2 hidden layers and the output layer. It used only 3
× 3 filters, but many filters.
• Top-five error: 7.3% (vs. 6.7% for GoogLeNet)
ResNet
• From AlexNet to GoogLeNet, the performance increases as the network gets
deeper.
• Residual Network is the winner of the ILSVRC 2015 challenge.
» delivered an astounding top-five error rate under 3.6%.
• The winning variant used an extremely deep CNN composed of 152 layers
(other variants had 34, 50, and 101 layers).
» It confirmed the general trend: models are getting deeper and deeper, with fewer and fewer
parameters.
• The key to being able to train such a deep network is to use skip connections
(also called shortcut connections)
» The signal feeding into a layer is also added to the output of a layer located a bit higher up the
stack.
Residual Units
• A deep residual network can be seen as a stack of residual units (or residual
blocks), where each residual unit is a small neural network with a skip
connection.
• Thanks to skip connections, the signal can easily make its way across the whole
network.
» When training a neural network, the goal is to make it model a target function h(x).
» If you add the input x to the output of the network (i.e., you add a skip connection),
then the network will be forced to model f(x) = h(x) – x rather than h(x).
» Initially, a residual unit outputs a copy of its inputs (i.e., the identity function)
The Architecture of ResNet
• ResNet starts and ends exactly like GoogLeNet (except without a dropout layer),
and in between is just a very deep stack of simple residual units.
• Each residual unit is composed of two convolutional layers (and no pooling layer!),
with Batch Normalization (BN) and ReLU activation, using 3 × 3 kernels and
preserving spatial dimensions (stride 1, "same" padding).
• The number of feature maps is
doubled every few residual units, at
the same time as their height and
width are halved (using a conv. layer
with stride 2).
» The inputs cannot be added directly to
the outputs of the residual unit
because they don’t have the same
shape
» To solve this problem, the inputs are
passed through a 1 × 1 convolutional
layer with stride 2 and the right
number of output feature maps
• ResNet-34 contains 34 layers (only
counting the convolutional layers and
the fully connected layer), including 3
residual units that output 64 feature
maps, 4 RUs with 128 maps, 6 RUs with
256 maps, and 3 RUs with 512 maps.
Xception
• Xception (stands for Extreme Inception) significantly outperformed
Inception-v3 on a huge vision task (350 million images and 17,000 classes).
• Just like Inception-v4, it merges the ideas of GoogLeNet and ResNet, but it
replaces the inception modules with depthwise separable convolution
layers (or separable convolution layer).
• A separable convolutional layer makes the strong assumption that spatial
patterns and cross-channel patterns can be modeled separately.
» The first part applies a single spatial filter for each input feature map
» The second part looks exclusively for cross-channel patterns—it is just a regular
convolutional layer with 1 × 1 filters.
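In Keras, such a layer is available as SeparableConv2D (the filter count and kernel size below are illustrative):

from tensorflow import keras

sep_conv = keras.layers.SeparableConv2D(filters=128, kernel_size=3,
                                        strides=1, padding="same",
                                        activation="relu")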
Xception (cont.)
• An inception module can be considered as an intermediate between a
regular convolutional layer (which considers spatial patterns and cross-
channel patterns jointly) and a separable convolutional layer (which
considers them separately).
» Therefore, Xception is considered a variant of GoogLeNet
» In practice, it seems that separable convolutional layers generally perform better than
inception modules.
• Avoid using separable convolutional layers after layers that have too few
channels, such as the input layer.
• The Xception architecture starts with 2 regular convolutional layers, but
then the rest of the architecture uses only separable convolutions (34 in
all), plus a few max pooling layers and the usual final layers (a global
average pooling layer and a dense output layer).
SENet
• The winning architecture in the ILSVRC 2017 challenge was the Squeeze-
and-Excitation Network (SENet).
» 2.25% top-five error rate
• SENet uses the extended versions of inception networks and ResNets,
called SE-Inception and SE-ResNet, respectively.
• SENet adds a small neural network, called an SE block, to every unit.
SE Blocks
• An SE block focuses on the depth dimension: it learns which features are
usually most active together, and then recalibrates the feature maps.
» Idea: An SE block may learn that mouths, noses, and eyes usually appear together in pictures.
So if the block sees a strong activation in the mouth and nose feature maps but not the eye
feature map, it will boost the eye feature map.
• An SE block has three layers:
» A global average pooling layer
» A hidden dense layer using the ReLU activation
function
» A dense output layer using the sigmoid
activation function
• The middle dense layer “squeezes” the
input vector into a smaller vector.
» Learn a low-dimensional vector representation
(i.e., an embedding)
» This bottleneck step forces the SE block to learn
a general representation of the feature
combinations
• The output layer takes the embedding and
outputs a recalibration vector.
• The feature maps are then multiplied by
this recalibration vector
» Irrelevant features get scaled down while
relevant features (with a recalibration score
close to 1) are left alone.
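A sketch of an SE block in Keras (the function name and the reduction ratio are ours, not from the slides):

from tensorflow import keras

def se_block(inputs, ratio=16):
    n_channels = inputs.shape[-1]
    # Squeeze: global average pooling reduces each feature map to a single number.
    squeeze = keras.layers.GlobalAvgPool2D()(inputs)
    # Excitation: a two-layer network outputs one recalibration score per feature map.
    excite = keras.layers.Dense(n_channels // ratio, activation="relu")(squeeze)
    excite = keras.layers.Dense(n_channels, activation="sigmoid")(excite)
    excite = keras.layers.Reshape((1, 1, n_channels))(excite)
    # Recalibrate: scale each feature map by its score (broadcast over height and width).
    return inputs * excite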
Implementing a ResNet-34 CNN Using Keras
• Implement a ResNet-34 from scratch using Keras.
• First, define a ResidualUnit layer.
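A sketch of the ResidualUnit layer, along the lines of the book's implementation:

from functools import partial
from tensorflow import keras

DefaultConv2D = partial(keras.layers.Conv2D, kernel_size=3, strides=1,
                        padding="same", use_bias=False)

class ResidualUnit(keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)
        self.main_layers = [
            DefaultConv2D(filters, strides=strides),
            keras.layers.BatchNormalization(),
            self.activation,
            DefaultConv2D(filters),
            keras.layers.BatchNormalization()]
        self.skip_layers = []
        if strides > 1:
            # When the spatial dimensions are halved, the inputs must be resized
            # (1 x 1 convolution with the same stride) before they can be added.
            self.skip_layers = [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                keras.layers.BatchNormalization()]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)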
Implementing a ResNet-34 CNN Using Keras
• Second, build the ResNet-34 using a Sequential model.
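A sketch reusing the ResidualUnit and DefaultConv2D defined above (the 10-unit output layer is illustrative; use 1,000 units for ImageNet):

model = keras.models.Sequential()
model.add(DefaultConv2D(64, kernel_size=7, strides=2,
                        input_shape=[224, 224, 3]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"))
prev_filters = 64
# 3 residual units with 64 maps, 4 with 128, 6 with 256, 3 with 512.
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters
model.add(keras.layers.GlobalAvgPool2D())
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(10, activation="softmax"))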

• With just a few dozen lines of code, this builds a variant of the architecture that won the ILSVRC 2015 challenge.
Using Pretrained Models from Keras
• Load the ResNet-50 model, pretrained on ImageNet.
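For example:

from tensorflow import keras

model = keras.applications.resnet50.ResNet50(weights="imagenet")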

• To use it, you first need to ensure that the images have the right size.
» A ResNet-50 model expects 224 × 224-pixel images (other models may expect other
sizes, such as 299 × 299),
» Use TensorFlow’s tf.image.resize() function to resize the images

• Each model provides a preprocess_input() function that you can use to
preprocess your images.
» These functions assume that the pixel values range from 0 to 255, so if your
image's pixel values range from 0 to 1, you must multiply them by 255 first.

• Display the top K predictions:
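A sketch of the whole prediction pipeline, assuming images is a batch of float images with pixel values in the 0-1 range and model is the pretrained ResNet-50 loaded above:

import tensorflow as tf
from tensorflow import keras

images_resized = tf.image.resize(images, [224, 224])
inputs = keras.applications.resnet50.preprocess_input(images_resized * 255)

y_proba = model.predict(inputs)
top_k = keras.applications.resnet50.decode_predictions(y_proba, top=3)
for image_index in range(len(top_k)):
    print(f"Image #{image_index}")
    for class_id, name, proba in top_k[image_index]:
        print(f"  {class_id} - {name:12s} {proba:.2%}")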


Pretrained Models for Transfer Learning
• Train a model to classify pictures of flowers, reusing a pretrained Xception
model.
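A sketch of the transfer-learning setup (the dataset pipeline is omitted; the flowers dataset has 5 classes, and the learning rates and epoch counts are illustrative):

from tensorflow import keras

base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
output = keras.layers.Dense(5, activation="softmax")(avg)
model = keras.Model(inputs=base_model.input, outputs=output)

# Phase 1: train only the new top layers, keeping the pretrained weights frozen.
for layer in base_model.layers:
    layer.trainable = False
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=0.2),
              metrics=["accuracy"])
# model.fit(train_set, epochs=5, validation_data=valid_set)

# Phase 2: unfreeze the base model and fine-tune with a much lower learning rate.
for layer in base_model.layers:
    layer.trainable = True
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=0.01),
              metrics=["accuracy"])
# model.fit(train_set, epochs=10, validation_data=valid_set)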

• This model should reach around 95% accuracy on the test set.
Classification and Localization
• Localizing an object in a picture can be expressed as a regression task that predicts a
bounding box around the object
» horizontal and vertical coordinates of the object’s center, as well as its height and width.
• Just add a second dense output layer with four units (typically on top of the global
average pooling layer), and train the model using the MSE loss.
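A sketch of such a two-headed model (the class count and loss weights are illustrative):

from tensorflow import keras

n_classes = 5  # hypothetical number of classes
base_model = keras.applications.xception.Xception(weights="imagenet",
                                                  include_top=False)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
class_output = keras.layers.Dense(n_classes, activation="softmax")(avg)
loc_output = keras.layers.Dense(4)(avg)  # center x, center y, height, width
model = keras.Model(inputs=base_model.input,
                    outputs=[class_output, loc_output])
model.compile(loss=["sparse_categorical_crossentropy", "mse"],
              loss_weights=[0.8, 0.2], optimizer="sgd", metrics=["accuracy"])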

• To annotate images with bounding boxes:
» Open source image labeling tools like VGG Image Annotator, LabelImg, OpenLabeler, or ImgLab, or
perhaps a commercial tool like LabelBox or Supervisely.
» Crowdsourcing platforms such as Amazon Mechanical Turk
• The bounding boxes should be normalized so that the horizontal and vertical
coordinates, as well as the height and width, all range from 0 to 1.
• It is common to predict the square root of the height and width rather than the
height and width directly
» A 10-pixel error for a large bounding box will not be penalized as much as a 10-pixel error for a small
bounding box.
Intersection over Union (IoU)
• The MSE often works fairly well as a cost function to train the model
» But it is not a great metric to evaluate how well the model can predict
bounding boxes.
• The most common metric for object localization is the Intersection over
Union (IoU)
» The area of overlap between the predicted bounding box and the target bounding box,
divided by the area of their union.
» In tf.keras, it is implemented by the tf.keras.metrics.MeanIoU class.
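For a single pair of boxes, IoU can also be computed directly; a small helper of our own, with boxes given as [x_min, y_min, x_max, y_max]:

import tensorflow as tf

def bounding_box_iou(box1, box2):
    x1 = tf.maximum(box1[0], box2[0])
    y1 = tf.maximum(box1[1], box2[1])
    x2 = tf.minimum(box1[2], box2[2])
    y2 = tf.minimum(box1[3], box2[3])
    intersection = tf.maximum(x2 - x1, 0.0) * tf.maximum(y2 - y1, 0.0)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return intersection / (area1 + area2 - intersection)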
Object Detection
• The task of classifying and localizing multiple objects in an image is called
object detection.
• An old approach was to take a CNN that was trained to classify and locate
a single object, then slide it across the image
» Chop an image into a grid and slide a CNN across the grid
» Since objects can have varying sizes, you would also slide the CNN across regions of
different sizes.
Non-max Suppression
• The above approach can detect the same object multiple times
at slightly different positions.
• Use non-max suppression to get rid of all the unnecessary
bounding boxes.
» Add an extra objectness output to your CNN, to estimate the probability that an
object is indeed present in the image
» Get rid of all the bounding boxes for which the objectness score is below some
threshold
» Find the bounding box with the highest objectness score, and get rid of all the
other bounding boxes that overlap a lot with it (e.g., with an IoU greater than
60%).
§ For example, the bounding box with the max objectness score is the thick
bounding box over the topmost rose
» Repeat step two until there are no more bounding boxes to get rid of.
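TensorFlow ships a ready-made op for this step; a toy sketch with made-up boxes and objectness scores:

import tensorflow as tf

boxes = tf.constant([[0.10, 0.10, 0.50, 0.50],   # boxes as [y1, x1, y2, x2]
                     [0.12, 0.11, 0.52, 0.50],   # overlaps heavily with the first box
                     [0.60, 0.60, 0.90, 0.90]])
scores = tf.constant([0.90, 0.75, 0.80])         # objectness scores

# Greedily keeps the highest-scoring boxes, discarding any box whose IoU with an
# already-kept box exceeds the threshold.
selected = tf.image.non_max_suppression(boxes, scores, max_output_size=10,
                                        iou_threshold=0.6)
kept_boxes = tf.gather(boxes, selected)  # the first and third boxes survive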
Fully Convolutional Networks
• If we replace the dense layers at the top of a CNN with equivalent convolutional
layers, the network computes the same outputs; only their shape changes.
» But now the network can be trained and executed on images of any size.
• Fully Convolutional Networks (FCN) contain only convolutional layers (and
pooling layers).
• When the input image is larger, the output is exactly like
taking the original CNN and sliding it across the image.
» The FCN approach is much more efficient, since the network only looks at the image once.
You Only Look Once (YOLO)
• YOLO is an extremely fast and accurate object detection architecture.
» YOLOv3 is so fast that it can run in real time on a video.
• YOLOv3’s architecture is based on FCN with some differences:
» It outputs five bounding boxes for each grid cell (instead of just one), and each bounding
box comes with an objectness score.
» It also outputs 20 class probabilities per grid cell, one for each class in PASCAL VOC dataset.
» YOLOv3 predicts an offset relative to the coordinates of the grid cell, where (0, 0) means the
top left of that cell and (1, 1) means the bottom right.
» For each grid cell, only predict bounding boxes whose center lies in that cell.
» Before training, YOLOv3 finds five representative bounding box dimensions, called anchor
boxes (or bounding box priors) in the training set by the K-Means algorithm.
» When YOLOv3 predicts five bounding boxes per grid cell, it actually predicts how much to
rescale each of the anchor boxes.
» The network is trained using images of different scales.
» The use of skip connections to recover some of the spatial resolution that is lost in the CNN.
» The model predicts a probability for each node in a visual hierarchy called WordTree.
• Other object detection models are SSD and Faster-RCNN.
Mean Average Precision (mAP)
• One way to get a fair idea of the object detection model’s performance is to
compute the maximum precision you can get with at least 0% recall, then
10% recall, 20%, and so on up to 100%, and then calculate the mean of these
maximum precisions.
» This is called the Average Precision (AP) metric.
• When there are more than two classes, we can compute the AP for each
class, and then compute the mean AP (mAP).
• What if the system detected the correct class, but at the wrong location?
» Define an IOU threshold (e.g., we may consider that a prediction is correct only if the IOU is
greater than, say, 0.5, and the predicted class is correct.)
» The corresponding mAP is generally noted mAP@0.5 (or mAP@50%, or sometimes just
AP50).
» This performance measure is used in some competitions such as the PASCAL VOC challenge.
• In others (such as the COCO competition), the mAP is computed for different
IOU thresholds (0.50, 0.55, 0.60, ..., 0.95), and the final metric is the mean of
all these mAPs (noted AP@[.50:.95] or AP@[.50:0.05:.95]). Yes, that’s a
mean mean average.
Semantic Segmentation
• In semantic segmentation, each pixel is classified according to the class of
the object it belongs to (e.g., road, car, pedestrian, building, etc.)
• Note that different objects of the same class are not distinguished.
Upsampling by Transposed Convolutional Layers
• One simple approach is to use an upsampling layer.
» Take a pretrained CNN and turn it into an FCN.
» Use an upsampling layer to increase the resolution of the output.
• Upsampling by transposed convolutional layers: first stretching the image
by inserting empty rows and columns (full of zeros), then performing a
regular convolution.
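In Keras, this is provided by Conv2DTranspose (the filter count and kernel size below are illustrative); with a stride of 2, the layer doubles the height and width of its input:

from tensorflow import keras

upsample = keras.layers.Conv2DTranspose(filters=32, kernel_size=3,
                                        strides=2, padding="same")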
Super-resolution
• Recover some of the spatial resolution that was lost in earlier pooling
layers by adding skip connections from lower layers to the transposed
convolutional (upsampling) layers.

• It is even possible to scale up beyond the size of the original image
» Super-resolution: increase the resolution of an image
• Instance segmentation is similar to semantic segmentation, but instead of
merging all objects of the same class into one big lump, each object is
distinguished from the others.
» The Mask R-CNN architecture: extends the Faster R-CNN model by additionally
producing a pixel mask for each bounding box.
§ Not only get a bounding box around each object, with a set of estimated class probabilities, but also get a pixel
mask that locates pixels in the bounding box that belong to the object.
Advanced Topics in Computer Vision
• Adversarial learning
» Attempt to make the network more resistant to images
designed to fool it
• Explainability
» Understand why the network makes a specific classification
• Realistic image generation
» E.g., generative adversarial networks (GANs)
• Single-shot learning
» a system that can recognize an object after it has seen it just
once.
• Novel architectures such as Geoffrey Hinton’s capsule
networks
