L3 - UCLxDeepMind DL2020

The document describes a lecture on convolutional neural networks for image recognition. It provides background on CNNs and how they take advantage of the topological structure of images. It then discusses the basic building blocks of CNNs, including convolutional layers, pooling layers, and how they are stacked to create hierarchical representations of images.


WELCOME TO THE

UCL x DeepMind
lecture series
In this lecture series, research scientists from DeepMind, a leading AI research lab, will give 12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals of training neural networks, via advanced ideas around memory, attention and generative modelling, to the important topic of responsible innovation. Please join us for a deep dive into Deep Learning!

#UCLxDeepMind

General information

Exits: at the back, the way you came in
Wifi: UCL guest
TODAY’S SPEAKER

Sander Dieleman
Sander Dieleman is a Research Scientist at DeepMind in London, UK, where he has worked on the development of AlphaGo and WaveNet. He was previously a PhD student at Ghent University, where he conducted research on feature learning and deep learning techniques for learning hierarchical representations of musical audio signals. During his PhD he also developed the deep learning library Lasagne and won solo and team gold medals in Kaggle's "Galaxy Zoo" competition and the first National Data Science Bowl, respectively. In the summer of 2014, he interned at Spotify in New York, where he worked on implementing audio-based music recommendation using deep learning on an industrial scale.
TODAY’S LECTURE

Convolutional Neural Networks for Image Recognition
Sander Dieleman

In the past decade, convolutional neural networks have revolutionised computer vision. In this lecture, we will take a closer look at convolutional network architectures through several case studies, ranging from the early 90's to the current state of the art. We will review some of the building blocks that are in common use today, discuss the challenges of training deep models and strategies for finding effective architectures, with a focus on image recognition.

UCL x DeepMind Lectures


Plan for this lecture

01 Background
02 Building blocks
03 Convolutional neural networks
04 Going deeper: case studies
05 Advanced topics
06 Beyond image recognition
1 Background
Last week:
neural networks
[Diagram: a simple feedforward network with linear, sigmoid, linear and softmax nodes and a cross-entropy loss, mapping data to a target]
How can we feed
images to a neural
network?
Neural networks for images

A digital image is a 2D grid of pixels.

A neural network expects a vector of numbers as input.
Locality and translation invariance

Locality: nearby pixels are more strongly correlated

Translation invariance: meaningful patterns can occur anywhere in the image


Taking advantage of topological structure

Weight sharing: use the same network parameters to detect local patterns at many locations in the image

Hierarchy: local low-level features are composed into larger, more abstract features

edges and textures → object parts → objects

Data drives
research
The ImageNet challenge

Major computer vision benchmark
Ran from 2010 to 2017
1.4M images, 1000 classes
Image classification

Want to learn more?
Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115.3 (2015)
[Chart: top-5 classification error rate of the competition winners, annotated with traditional computer vision techniques, AlexNet, VGGNet and GoogLeNet, and ResNet]
2 Building
blocks

UCL x DeepMind Lectures


From fully connected to locally connected

fully-connected unit
locally-connected units, 3✕3 receptive field
From locally connected to convolutional

convolutional units, 3✕3 receptive field

[Diagram: a receptive field in the input produces one value in the feature map]
Implementation: the convolution operation

The kernel slides across the image and produces an output value at each position
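To make the sliding-window picture concrete, here is a minimal NumPy sketch of a single-channel "valid" convolution (strictly speaking a cross-correlation, which is what most deep learning libraries actually compute; the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and produce one output value per position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1          # "valid" output size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # elementwise product of the kernel with the image patch it covers
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                 # a simple 3x3 averaging kernel
print(conv2d_valid(image, kernel).shape)       # (3, 3)
```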
Implementation: the convolution operation

We convolve multiple kernels and obtain multiple feature maps or channels
Inputs and outputs are tensors

channels ✕ height ✕ width
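Extending the sketch above to tensors: the input has shape (channels, height, width), each kernel spans all input channels, and each of the C_out kernels produces one output feature map. Again an illustrative sketch rather than the lecture's own code:

```python
import numpy as np

def conv2d_multichannel(x, kernels):
    """x: (C_in, H, W); kernels: (C_out, C_in, kH, kW).
    Returns a (C_out, H - kH + 1, W - kW + 1) stack of feature maps."""
    c_in, h, w = x.shape
    c_out, _, kh, kw = kernels.shape
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((c_out, oh, ow))
    for o in range(c_out):                     # one feature map per kernel
        for i in range(oh):
            for j in range(ow):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * kernels[o])
    return out

x = np.random.randn(3, 8, 8)                   # e.g. an RGB image: 3 channels
k = np.random.randn(16, 3, 3, 3)               # 16 kernels spanning all 3 channels
print(conv2d_multichannel(x, k).shape)         # (16, 6, 6)
```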
Variants of the convolution operation

Valid convolution: output size = input size - kernel size + 1
Full convolution: output size = input size + kernel size - 1
Same convolution: output size = input size
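The three size rules can be checked directly with SciPy's `convolve2d`, which supports all three modes (SciPy computes a true convolution with a flipped kernel, but the output sizes are the same either way):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.randn(10, 10)
kernel = np.random.randn(3, 3)

print(convolve2d(image, kernel, mode='valid').shape)  # (8, 8):   10 - 3 + 1
print(convolve2d(image, kernel, mode='full').shape)   # (12, 12): 10 + 3 - 1
print(convolve2d(image, kernel, mode='same').shape)   # (10, 10): same as the input
```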


Variants of the convolution operation

Strided convolution: kernel slides along the image with a step > 1
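With a stride s the kernel only visits every s-th position, so for a valid convolution the output size becomes floor((input size - kernel size) / s) + 1. This is a standard formula, stated here as background rather than taken from the slide:

```python
def strided_valid_output_size(input_size, kernel_size, stride):
    # Number of positions the kernel visits when stepping by `stride` (no padding)
    return (input_size - kernel_size) // stride + 1

print(strided_valid_output_size(10, 3, 1))  # 8
print(strided_valid_output_size(10, 3, 2))  # 4
```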
Variants of the convolution operation

Dilated convolution: kernel is spread out, step > 1 between kernel elements
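A kernel with k taps and dilation d covers an effective extent of d·(k - 1) + 1 input positions while keeping the same number of weights; the small helper below illustrates this (again a standard formula, not from the slide):

```python
def dilated_effective_size(kernel_size, dilation):
    # A kernel with `kernel_size` taps and gaps of (dilation - 1) between them
    return dilation * (kernel_size - 1) + 1

print(dilated_effective_size(3, 1))  # 3: ordinary convolution
print(dilated_effective_size(3, 2))  # 5: same number of weights, wider coverage
```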
Variants of the convolution operation

Depthwise convolution: each output channel is connected only to one input channel
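In other words, each input channel gets its own spatial kernel and maps to its own output channel, with no mixing across channels. A hedged NumPy/SciPy sketch (function names are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

def depthwise_conv2d(x, kernels):
    """x: (C, H, W); kernels: (C, kH, kW), one spatial kernel per channel.
    Output channel c depends only on input channel c."""
    return np.stack([correlate2d(x[c], kernels[c], mode='valid')
                     for c in range(x.shape[0])])

x = np.random.randn(3, 8, 8)
k = np.random.randn(3, 3, 3)
print(depthwise_conv2d(x, k).shape)  # (3, 6, 6)
```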
Pooling

Pooling: compute mean or max over small windows to reduce resolution
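A minimal sketch of 2✕2 max-pooling in NumPy, assuming the spatial dimensions divide evenly by the window size (illustrative, not the lecture's code):

```python
import numpy as np

def max_pool2d(x, window=2):
    """x: (H, W) feature map, with H and W divisible by `window`.
    Each non-overlapping window-by-window block is reduced to its maximum."""
    h, w = x.shape
    blocks = x.reshape(h // window, window, w // window, window)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool2d(fmap))
# [[ 5.  7.]
#  [13. 15.]]
```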


3 Convolutional
neural networks

UCL x DeepMind Lectures


Stacking the building blocks

CNNs or “convnets”
Up to 100s of layers
Alternate convolutions and pooling to create a hierarchy
Recap: neural networks as computational graphs
[Diagram: computation nodes connecting the input and parameters to the loss]

Simplified diagram: implicit parameters and loss
[Diagram: only the input and computation nodes are shown]
Computational building blocks of convnets: input, convolution, nonlinearity, pooling, fully connected
4 Going deeper:
Case studies

UCL x DeepMind Lectures


LeNet-5 (1998)

Architecture of LeNet-5, a convnet for handwritten digit recognition

Want to learn more?
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (1998)
[Figure: the LeNet-5 pipeline: input image → convolution → nonlinearity → pooling → convolution → nonlinearity → pooling → fully connected → nonlinearity → fully connected → nonlinearity]
AlexNet (2012)

Figure from Krizhevsky et al. (2012)

Architecture: 8 layers, ReLU, dropout, weight decay
Infrastructure: large dataset, trained 6 days on 2 GPUs

Want to learn more?
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems (2012)
AlexNet (2012)

Input image: 224✕224✕3
Layer 1 convolution: kernel 11✕11, 96 channels, stride 4 → 56✕56✕96
ReLU
Max-pooling: window 2✕2 → 28✕28✕96
...
Layer 8 fully connected: → 1000
Softmax
[Figure: the full AlexNet stack with feature-map sizes: 224✕224✕3 input, then 56✕56✕96, 28✕28✕96, 28✕28✕256, 14✕14✕256, 14✕14✕384, 14✕14✕384, 14✕14✕256 and 7✕7✕256 feature maps, ReLU after each convolution, fully connected layers of 4096, 4096 and 1000 units, and a softmax output]
Deeper is better

Each layer is a linear classifier by itself
More layers – more nonlinearities
What limits the number of layers in convnets?
VGGNet (2014): building very deep convnets

Stack many convolutional layers before pooling
Use “same” convolutions to avoid resolution reduction

Want to learn more?
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (2015)
VGGNet (2014): stacking 3✕3 kernels

[Figure: a 1st and a 2nd 3✕3 conv. layer stacked, together covering a 5✕5 region of the input]

Architecture: up to 19 layers, 3✕3 kernels only, “same” convolutions
Infrastructure: trained for 2-3 weeks on 4 GPUs (data parallelism)
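A back-of-the-envelope way to see why stacking 3✕3 kernels pays off: n stacked 3✕3 layers (stride 1) see a (2n + 1)✕(2n + 1) region of their input but use fewer weights than one layer with that kernel size. The comparison below assumes C channels in and out and ignores biases; it is our illustration, not a calculation from the slides:

```python
def stacked_3x3_params(n_layers, channels):
    # n stacked 3x3 convolutions, `channels` in and out at every layer
    return n_layers * 3 * 3 * channels * channels

def single_layer_params(kernel_size, channels):
    return kernel_size * kernel_size * channels * channels

C = 256
# Two 3x3 layers cover a 5x5 region; three cover a 7x7 region.
print(stacked_3x3_params(2, C), single_layer_params(5, C))  # ~1.18M vs ~1.64M weights
print(stacked_3x3_params(3, C), single_layer_params(7, C))  # ~1.77M vs ~3.21M weights
```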


VGGNet (2014): error plateaus after 16 layers
Challenges of depth

Computational complexity
Optimisation difficulties

Improving optimisation

Careful initialisation
Sophisticated optimisers
Normalisation layers
Network design
GoogLeNet (2014)

[Figure from Szegedy et al. (2015): the Inception module, with parallel 1✕1, 3✕3 and 5✕5 convolution branches (using 1✕1 convolutions for dimensionality reduction) and a 3✕3 pooling branch followed by a 1✕1 convolution]

Want to learn more?
Szegedy, C. et al. Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition (2015)
Batch normalisation

Figure from Ioffe et al. (2015)

Reduces sensitivity to initialisation
Introduces stochasticity and acts as a regulariser

Want to learn more?
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning (2015)
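For reference, a minimal sketch of what a batch-normalisation layer computes at training time: normalise each channel using the statistics of the current batch, then rescale with learned parameters gamma and beta. The variable names are ours, and the running statistics used at test time are omitted:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W) batch of feature maps; gamma, beta: (C,) learned parameters."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel batch mean
    var = x.var(axis=(0, 2, 3), keepdims=True)     # per-channel batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 32, 32)
out = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(round(out.mean(), 3), round(out.std(), 3))   # close to 0 and 1
```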
ResNet (2015): residual connections

[Figure: a residual block: two convolution → batch norm → ReLU stages, with a residual connection adding the block's input back onto its output]

Residual connections facilitate training deeper networks

Want to learn more?
He, K. et al. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (2016)
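The core idea fits in a couple of lines: the stacked layers learn a residual F(x) that is added back onto the input, so gradients have a direct path through the addition. A hedged sketch with a toy residual function (the two-stage linear/ReLU block below is illustrative, not the exact block from the paper):

```python
import numpy as np

def residual_block(x, block_fn):
    """Apply `block_fn` (e.g. conv -> batch norm -> ReLU -> conv -> batch norm)
    and add the result back onto the input via the skip connection."""
    return x + block_fn(x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 16)) * 0.1
W2 = rng.normal(size=(16, 16)) * 0.1
block = lambda x: np.maximum(x @ W1, 0.0) @ W2     # toy stand-in for the conv stages

x = rng.normal(size=(4, 16))
print(residual_block(x, block).shape)              # (4, 16): shape is preserved
```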
ResNet (2015): different flavours

[Figure: residual-block variants built from 3✕3 and 1✕1 convolutions, each ending in a residual addition (+)]

ResNet V2 (bottom) avoids all nonlinearities in the residual pathway

Want to learn more?
He, K. et al. Identity mappings in deep residual networks. European Conference on Computer Vision (2016)
ResNet (2015): up to 152 layers

Table from He et al. (2015)


DenseNet (2016): connect layers to all previous layers

Figures from Huang et al. (2017)

Want to learn more?
Huang, G. et al. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition (2017)

Squeeze-and-excitation networks (2017)

Figure from Hu et al. (2018)

Features can incorporate global context

Want to learn more?
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. IEEE Conference on Computer Vision and Pattern Recognition (2018)
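A sketch of the squeeze-and-excitation idea: global-average-pool each channel ("squeeze"), pass the channel summary through a small bottleneck network ("excitation"), and use the resulting per-channel gates to rescale the feature maps, so every feature can incorporate global context. The reduction ratio and weight shapes below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, W1, W2):
    """x: (C, H, W) feature maps; W1: (C, C//r), W2: (C//r, C) for reduction ratio r."""
    s = x.mean(axis=(1, 2))                        # squeeze: per-channel global average
    gates = sigmoid(np.maximum(s @ W1, 0.0) @ W2)  # excitation: FC -> ReLU -> FC -> sigmoid
    return x * gates[:, None, None]                # rescale each channel by its gate

C, r = 16, 4
x = np.random.randn(C, 8, 8)
W1 = np.random.randn(C, C // r) * 0.1
W2 = np.random.randn(C // r, C) * 0.1
print(squeeze_excite(x, W1, W2).shape)             # (16, 8, 8)
```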
AmoebaNet (2018): neural architecture search

Figure from Real et al. (2019)

Architecture found by evolution
Search acyclic graphs composed of predefined layers

Want to learn more?
Real, E. et al. Regularized evolution for image classifier architecture search. AAAI Conference on Artificial Intelligence (2019)
Reducing complexity

Depthwise convolutions
Separable convolutions
Inverted bottlenecks (MobileNetV2, MNasNet, EfficientNet)
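To see where the savings come from: a standard k✕k convolution from C_in to C_out channels needs k·k·C_in·C_out weights, whereas a depthwise-separable convolution (a k✕k depthwise convolution followed by a 1✕1 pointwise convolution) needs k·k·C_in + C_in·C_out. A rough illustrative comparison, ignoring biases:

```python
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # k x k depthwise convolution + 1 x 1 pointwise convolution
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256
print(standard_conv_params(k, c_in, c_out))         # 589824
print(depthwise_separable_params(k, c_in, c_out))   # 67840, roughly 8.7x fewer
```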
5 Advanced
topics

UCL x DeepMind Lectures


Data augmentation

By design, convnets are only robust against translation

Data augmentation makes them robust against other transformations: rotation, scaling, shearing, warping, ...
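A minimal sketch of label-preserving augmentation applied on the fly: a random horizontal flip followed by a random crop, both easy to express in NumPy (the crop size and function name are illustrative choices):

```python
import numpy as np

def augment(image, crop=24, rng=np.random.default_rng()):
    """image: (H, W, C). Randomly flip left-right, then take a random crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                  # horizontal flip
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

img = np.random.rand(32, 32, 3)
print(augment(img).shape)                          # (24, 24, 3), a new view each call
```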
Visualising what a convnet learns

Figures from Zeiler et al. (2014)

Want to learn more?
Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. European Conference on Computer Vision (2014)
Visualising what a convnet learns

Figure from Simonyan et al. (2013)


Visualising what a convnet learns

Figure from Nguyen et al. (2016)


https://distill.pub/2017/feature-visualization/ by Chris Olah, Alexander Mordvintsev and Ludwig Schubert
Other topics to explore

Pre-training and fine-tuning
Group equivariant convnets: invariance to e.g. rotation
Recurrence and attention: other building blocks to exploit topological structure
6 Beyond image
recognition

UCL x DeepMind Lectures


What else can we do
with convnets?
Figures from Lin et al. (2015)
Generative models of images

Generative adversarial nets
Variational autoencoders
Autoregressive models (PixelCNN)
More convnets

Representation learning and self-supervised learning
Convnets for video, audio, text, graphs, ...
Convolutional neural networks replaced handcrafted features with handcrafted architectures.

Prior knowledge is not obsolete: it is merely incorporated at a higher level of abstraction.
Thank you
Questions
