On Deep Learning
Outline
Apps:
● Drug discovery
● Gmail
● Image understanding
● Maps
● Natural language understanding
● Photos
● Robotics research
● Speech
● Translation
● YouTube
● … many others ...
The promise (or wishful dream) of Deep Learning
[Figure: Speech, Text, Search Queries, Images, Videos, Labels, Entities, Words, Audio, and Features flow in and out of simple, reconfigurable, high-capacity, trainable end-to-end building blocks]
Common representations across domains.
Language Modeling
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson
Parsing
Grammar as a Foreign Language
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton
Neural Networks
What is Deep Learning?
● A powerful class of machine learning model
● Modern reincarnation of artificial neural networks
● Collection of simple, trainable mathematical functions
● Compatible with many variants of machine learning
[Figure: a neural network labels an image “cat”]
What is Deep Learning?
● Loosely based on (what little) we know about the brain
The Neuron
[Figure: a single neuron computes output y = F(w₁x₁ + … + wₙxₙ) from inputs x₁ … xₙ, weights w₁ … wₙ, and a nonlinearity F]
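As a concrete sketch (not from the talk), a single neuron in Python with numpy; the choice of ReLU for the nonlinearity F is an illustrative assumption:

import numpy as np

# A single neuron: output y is a nonlinearity F applied to the
# weighted sum of the inputs.
def neuron(x, w, F=lambda z: np.maximum(z, 0.0)):
    return F(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs x1 ... xn
w = np.array([0.1, 0.4, 0.3])    # weights w1 ... wn
y = neuron(x, w)                 # output y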
ConvNets
Learning algorithm
While not done:
Pick a random training example “(input, label)”
Run neural network on “input”
Adjust weights on edges to make output closer to “label”
Backpropagation
Use partial derivatives along the paths in the neural net
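A minimal numpy sketch of this loop for a one-layer linear model with squared loss; the model, toy data, and learning rate are illustrative assumptions, not from the talk:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))  # weights on edges
lr = 0.01

def sgd_step(W, x, label):
    output = W @ x                      # run neural network on "input"
    err = output - label                # distance from "label"
    grad = np.outer(err, x)             # partial derivative of squared loss w.r.t. W
    return W - lr * grad                # adjust weights toward "label"

for _ in range(1000):                   # while not done
    x = rng.normal(size=3)              # pick a random training example
    label = np.array([x.sum(), x[0]])   # toy (input, label) pair
    W = sgd_step(W, x, label)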
[Figure: Acoustic Input → Deep Recurrent Neural Network → Text Output: “How cold is it outside?”]
Given an image, predict one of 1000 different classes.
Image credit: www.cs.toronto.edu/~fritz/absps/imagenet.pdf
The Inception Architecture (GoogLeNet, 2014)
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
[Figure: Your Photo → Deep Convolutional Neural Network → Automatic Tag: “ocean”]
Document 1
… car parking available for a small fee.
… parts of our floor model inventory for sale.
Document 2
Selling all kinds of automobile and pickup truck parts,
engines, and transmissions.
How to deal with Sparse Data?
Mikolov, Sutskever, Chen, Corrado and Dean. Distributed Representations of Words and
Phrases and Their Compositionality, NIPS 2013.
Nearest Neighbors are Closely Related Semantically
Trained a language model on Wikipedia.
[Table: nearest neighbors of “tiger shark”, “car”, and “new york” in the embedding space]
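A sketch of the nearest-neighbor lookup: cosine similarity over an embedding table. The vectors (and the extra vocabulary entries) are hypothetical stand-ins for embeddings trained on Wikipedia:

import numpy as np

vocab = ["tiger shark", "car", "new york", "hammerhead", "truck", "brooklyn"]
E = np.random.default_rng(0).normal(size=(len(vocab), 50))
E /= np.linalg.norm(E, axis=1, keepdims=True)      # unit-normalize rows

def neighbors(word, k=2):
    sims = E @ E[vocab.index(word)]                # cosine similarity to all rows
    return [vocab[i] for i in np.argsort(-sims)[1:k + 1]]  # skip the word itself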
RankBrain: launched in 2015
Third most important search ranking signal (of 100s)
Bloomberg, Oct 2015: “Google Turning Its Lucrative Web Search Over to AI Machines”
Recurrent Neural Networks
[Figure: an RNN consumes input Xₜ at each step t ← t+1 through recurrent connections (trainable weights); unrolled over X₁ X₂ X₃, the weights are tied across timesteps]
Recurrent Neural Networks
RNNs are very difficult to train for more than a few timesteps: numerically
unstable gradients (vanishing / exploding).
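A small numpy sketch of why (linear case, illustrative dimensions and scale; nonlinearity derivatives would shrink the gradient further): backprop through time multiplies the gradient by the transpose of the same tied matrix at every step, so its norm changes geometrically.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))   # tied recurrent weights
g = rng.normal(size=8)                   # gradient arriving at the final timestep
for t in range(50):
    g = W.T @ g                          # one step of backprop through time
print(np.linalg.norm(g))                 # ≈ 0: the gradient has vanished
                                         # (scale W up and it explodes instead)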
[Figure: a memory cell M with instructions WRITE X, M / READ M, Y / FORGET M, each gated by a discrete WRITE? / READ? / FORGET? decision]
Key Idea: Make Your Program Differentiable
[Figure: the same cell with the discrete decisions replaced by sigmoid gates W (WRITE?), R (READ?), and F (FORGET?) between X, M, and Y]
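A minimal sketch of the differentiable-memory idea in Python (the spirit of an LSTM cell, not its exact equations): each discrete decision becomes a sigmoid gate in (0, 1). The gate pre-activations w, r, f stand in for values a trained network would compute from its inputs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_step(M, X, w, r, f):
    M = sigmoid(f) * M          # FORGET? -- softly decay the memory
    M = M + sigmoid(w) * X      # WRITE?  -- softly add the new input
    Y = sigmoid(r) * M          # READ?   -- softly expose the memory
    return M, Y

Because every operation is smooth, gradients flow through the gates and the whole "program" can be trained end-to-end by backpropagation.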
Sequence-to-Sequence Model
[Sutskever & Vinyals & Le, NIPS 2014]
[Figure: a Deep LSTM reads the input sequence A B C D, then emits the target sequence X Y Z Q one symbol at a time, feeding each output back in as the next input]
Sequence-to-Sequence Model: Machine Translation
[Sutskever & Vinyals & Le, NIPS 2014]
[Figure: the model reads the input sentence, then emits the target sentence word by word: “How”, “How tall”, “How tall are”, “How tall are you?”]
At inference time: beam search to choose the most probable over possible output sequences.
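A sketch of beam-search decoding. Here next_logprobs(prefix) is a hypothetical stand-in for the trained decoder: given the output words so far, it returns a {token: log-probability} map for the next token.

import heapq

def beam_search(next_logprobs, beam_size=4, max_len=20, eos="</s>"):
    beams = [(0.0, [])]                            # (log-prob, partial output)
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            if prefix and prefix[-1] == eos:       # finished hypothesis: keep as-is
                candidates.append((logp, prefix))
                continue
            for tok, tok_lp in next_logprobs(prefix).items():
                candidates.append((logp + tok_lp, prefix + [tok]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]       # most probable output sequence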
Sequence-to-Sequence
● Active area of research
● Many groups actively pursuing RNN/LSTM
○ Montreal
○ Stanford
○ U of Toronto
○ Berkeley
○ Google
○ ...
● Further Improvements
○ Attention
○ NTM / Memory Nets
○ ...
Sequence-to-Sequence
● Translation: [Kalchbrenner et al., EMNLP 2013][Cho et al., EMNLP 2014][Sutskever & Vinyals & Le, NIPS
2014][Luong et al., ACL 2015][Bahdanau et al., ICLR 2015]
● Image captions: [Mao et al., ICLR 2015][Vinyals et al., CVPR 2015][Donahue et al., CVPR 2015][Xu et al.,
ICML 2015]
Smart Reply - Nov 2015
Google Research Blog
[Figure: Incoming Email → small feed-forward neural network → “Activate Smart Reply?” (yes/no); if yes, a deep recurrent neural network produces the Generated Replies]
How to do Image Captions?
P(English | French) → P(English | Image)
How?
[Vinyals et al., CVPR 2015]
[Figure: an image model conditions a decoder that emits the caption word by word: “A young girl asleep …”]
Human: A young girl asleep on the sofa cuddling a stuffed bear.
Grammar as a Foreign Language, Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav
Petrov, Ilya Sutskever, and Geoffrey Hinton (NIPS 2015)
http://arxiv.org/abs/1412.7449
Turnaround Time and Effect on Research
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days:
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks:
○ High value experiments only
○ Progress stalls
● >1 month:
○ Don’t even try
Train in a day what would take a single GPU card 6 weeks
How Can We Train Large, Powerful Models Quickly?
● Exploit many kinds of parallelism
○ Model parallelism
○ Data parallelism
Model Parallelism
[Figure: one large model partitioned across multiple devices, communicating activations at the partition boundaries]
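A minimal sketch of the idea, with numpy arrays standing in for per-device computation (sizes illustrative): one large layer's weight matrix is split by columns, each "device" computes its half, and the partial activations are concatenated.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=512)                 # layer input (sent to both devices)
W_a = rng.normal(size=(512, 1024))       # first half of the layer, device A
W_b = rng.normal(size=(512, 1024))       # second half of the layer, device B
y = np.concatenate([x @ W_a, x @ W_b])   # halves computed in parallel, then joined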
Data Parallelism
[Figure: Parameter Servers hold the parameters; each of many Model Replicas reads the current parameters p′, computes an update ∆p′ on its own shard of the Data, and sends it back; the servers apply p′′ = p′ + ∆p]
Data Parallelism Choices
Can do this synchronously:
● N replicas equivalent to an N times larger batch size
● Pro: No noise
● Con: Less fault tolerant (requires some recovery if any single machine fails)
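A sketch of the synchronous variant, with a toy objective and illustrative sizes; the parameter-server update follows the p′′ = p′ + ∆p rule from the figure above, and summing updates from N replicas behaves like one replica with an N-times-larger batch.

import numpy as np

rng = np.random.default_rng(0)
p = rng.normal(size=10)                                 # parameters on the server
shards = [rng.normal(size=(32, 10)) for _ in range(4)]  # one data shard per replica
lr = 0.1

def replica_update(p, batch):
    return lr * (batch - p).mean(axis=0)                # toy gradient step toward the data

for step in range(100):
    deltas = [replica_update(p, s) for s in shards]     # computed in parallel on replicas
    p = p + np.sum(deltas, axis=0)                      # server applies p'' = p' + ∆p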
http://tensorflow.org/
and
https://github.com/tensorflow/tensorflow
Whitepaper: http://tensorflow.org/whitepaper2015.pdf
Source on GitHub: https://github.com/tensorflow/tensorflow
Motivations
DistBelief (our 1st system) was great for scalability and production
training of basic kinds of models, but less flexible than we wanted for research.
● Core in C++
○ Very low overhead
● Different front ends for specifying/driving the computation
○ Python and C++ today, easy to add more
Computation is a dataflow graph … with tensors
[Figure: a graph of ops (MatMul, Xent) fed by examples and labels; tensors flow along the edges]
Computation is a dataflow graph … with state
[Figure: the same graph with stateful nodes (biases) updated in place using the learning rate]
Computation is a dataflow graph … distributed
[Figure: the graph partitioned across Device A and Device B, with the biases and learning rate nodes placed on devices]
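To make the idea concrete, a toy dataflow graph in Python (an illustrative invention, not the TensorFlow API): ops are nodes, tensors flow along edges, and evaluating the loss node pulls values through MatMul and Xent. A real system also places stateful nodes and partitions the graph across devices.

import numpy as np

class Node:
    def __init__(self, op=None, *inputs):
        self.op, self.inputs = op, inputs
    def run(self, feeds):
        if self in feeds:                         # placeholder node: fed externally
            return feeds[self]
        return self.op(*[n.run(feeds) for n in self.inputs])

def xent(logits, onehot):                         # softmax cross-entropy
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -(onehot * np.log(p)).sum(axis=1).mean()

rng = np.random.default_rng(0)
examples, labels = Node(), Node()                 # graph inputs
weights = rng.normal(size=(4, 3))                 # state held in the graph
matmul = Node(lambda x: x @ weights, examples)    # MatMul op
loss = Node(xent, matmul, labels)                 # Xent op

x = rng.normal(size=(8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]
print(loss.run({examples: x, labels: y}))         # pulls data through the graph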
If you’re not considering how to use deep neural nets to solve your search or understanding problems, you almost certainly should be.
Questions?