On Deep Learning
Outline
Apps:
● Drug discovery
● Gmail
● Image understanding
● Maps
● Natural language understanding
● Photos
● Robotics research
● Speech
● Translation
● YouTube
● … many others ...
The promise (or wishful dream) of Deep Learning
[Figure: Speech, Text, Search Queries, Images, Videos, Labels, Entities, Words, Audio, and Features flow in and out of simple, reconfigurable, high-capacity, trainable end-to-end building blocks]
Common representations across domains.
Language Modeling
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson
Parsing
Grammar as a Foreign Language
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton
Neural Networks
What is Deep Learning?
● A powerful class of machine learning model
● Modern reincarnation of artificial neural networks
● Collection of simple, trainable mathematical functions
● Compatible with many variants of machine learning
[Figure: a neural network labels an image “cat”]
What is Deep Learning?
● Loosely based on (what little) we know about the brain
The Neuron
[Figure: a single neuron computes output y = F(w₁x₁ + … + wₙxₙ) from inputs x₁ … xₙ, weights w₁ … wₙ, and a nonlinearity F]
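As a concrete sketch (not from the talk), a single neuron in Python with numpy; the choice of ReLU for the nonlinearity F is an illustrative assumption:

import numpy as np

# A single neuron: output y is a nonlinearity F applied to the
# weighted sum of the inputs.
def neuron(x, w, F=lambda z: np.maximum(z, 0.0)):
    return F(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs x1 ... xn
w = np.array([0.1, 0.4, 0.3])    # weights w1 ... wn
y = neuron(x, w)                 # output y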
ConvNets
Learning algorithm
While not done:
Pick a random training example “(input, label)”
Run neural network on “input”
Adjust weights on edges to make output closer to “label”
Backpropagation
Use partial derivatives along the paths in the neural net
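A minimal numpy sketch of this loop for a one-layer linear model with squared loss; the model, toy data, and learning rate are illustrative assumptions, not from the talk:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))  # weights on edges
lr = 0.01

def sgd_step(W, x, label):
    output = W @ x                      # run neural network on "input"
    err = output - label                # distance from "label"
    grad = np.outer(err, x)             # partial derivative of squared loss w.r.t. W
    return W - lr * grad                # adjust weights toward "label"

for _ in range(1000):                   # while not done
    x = rng.normal(size=3)              # pick a random training example
    label = np.array([x.sum(), x[0]])   # toy (input, label) pair
    W = sgd_step(W, x, label)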
[Figure: Acoustic Input → Deep Recurrent Neural Network → Text Output: “How cold is it outside?”]
Given an image, predict one of 1000 different classes.
Image credit: www.cs.toronto.edu/~fritz/absps/imagenet.pdf
The Inception Architecture (GoogLeNet, 2014)
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
[Figure: Your Photo → Deep Convolutional Neural Network → Automatic Tag: “ocean”]
Document 1
… car parking available for a small fee.
… parts of our floor model inventory for sale.
Document 2
Selling all kinds of automobile and pickup truck parts,
engines, and transmissions.
How to deal with Sparse Data?
Mikolov, Sutskever, Chen, Corrado and Dean. Distributed Representations of Words and
Phrases and Their Compositionality, NIPS 2013.
Nearest Neighbors are Closely Related Semantically
Trained a language model on Wikipedia.
[Table: nearest neighbors of “tiger shark”, “car”, and “new york” in the embedding space]
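A sketch of the nearest-neighbor lookup: cosine similarity over an embedding table. The vectors (and the extra vocabulary entries) are hypothetical stand-ins for embeddings trained on Wikipedia:

import numpy as np

vocab = ["tiger shark", "car", "new york", "hammerhead", "truck", "brooklyn"]
E = np.random.default_rng(0).normal(size=(len(vocab), 50))
E /= np.linalg.norm(E, axis=1, keepdims=True)      # unit-normalize rows

def neighbors(word, k=2):
    sims = E @ E[vocab.index(word)]                # cosine similarity to all rows
    return [vocab[i] for i in np.argsort(-sims)[1:k + 1]]  # skip the word itself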
RankBrain: launched in 2015
Third most important search ranking signal (of 100s)
Bloomberg, Oct 2015: “Google Turning Its Lucrative Web Search Over to AI Machines”
Recurrent Neural Networks
[Figure: an RNN consumes input Xₜ at each step t ← t+1 through recurrent connections (trainable weights); unrolled over X₁ X₂ X₃, the weights are tied across timesteps]
Recurrent Neural Networks
RNNs are very difficult to train for more than a few timesteps: numerically
unstable gradients (vanishing / exploding).
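A small numpy sketch of why (linear case, illustrative dimensions and scale; nonlinearity derivatives would shrink the gradient further): backprop through time multiplies the gradient by the transpose of the same tied matrix at every step, so its norm changes geometrically.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))   # tied recurrent weights
g = rng.normal(size=8)                   # gradient arriving at the final timestep
for t in range(50):
    g = W.T @ g                          # one step of backprop through time
print(np.linalg.norm(g))                 # ≈ 0: the gradient has vanished
                                         # (scale W up and it explodes instead)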
[Figure: a memory cell M with instructions WRITE X, M / READ M, Y / FORGET M, each gated by a discrete WRITE? / READ? / FORGET? decision]
Key Idea: Make Your Program Differentiable
[Figure: the same cell with the discrete decisions replaced by sigmoid gates W (WRITE?), R (READ?), and F (FORGET?) between X, M, and Y]
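A minimal sketch of the differentiable-memory idea in Python (the spirit of an LSTM cell, not its exact equations): each discrete decision becomes a sigmoid gate in (0, 1). The gate pre-activations w, r, f stand in for values a trained network would compute from its inputs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_step(M, X, w, r, f):
    M = sigmoid(f) * M          # FORGET? -- softly decay the memory
    M = M + sigmoid(w) * X      # WRITE?  -- softly add the new input
    Y = sigmoid(r) * M          # READ?   -- softly expose the memory
    return M, Y

Because every operation is smooth, gradients flow through the gates and the whole "program" can be trained end-to-end by backpropagation.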
Sequence-to-Sequence Model
[Sutskever & Vinyals & Le, NIPS 2014]
[Figure: a Deep LSTM reads the input sequence A B C D, then emits the target sequence X Y Z Q one symbol at a time, feeding each output back in as the next input]
Sequence-to-Sequence Model: Machine Translation
[Sutskever & Vinyals & Le, NIPS 2014]
[Figure: the model reads the input sentence, then emits the target sentence word by word: “How”, “How tall”, “How tall are”, “How tall are you?”]
At inference time: beam search to choose the most probable over possible output sequences.
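A sketch of beam-search decoding. Here next_logprobs(prefix) is a hypothetical stand-in for the trained decoder: given the output words so far, it returns a {token: log-probability} map for the next token.

import heapq

def beam_search(next_logprobs, beam_size=4, max_len=20, eos="</s>"):
    beams = [(0.0, [])]                            # (log-prob, partial output)
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            if prefix and prefix[-1] == eos:       # finished hypothesis: keep as-is
                candidates.append((logp, prefix))
                continue
            for tok, tok_lp in next_logprobs(prefix).items():
                candidates.append((logp + tok_lp, prefix + [tok]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]       # most probable output sequence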
Sequence-to-Sequence
● Active area of research
● Many groups actively pursuing RNN/LSTM
○ Montreal
○ Stanford
○ U of Toronto
○ Berkeley
○ Google
○ ...
● Further Improvements
○ Attention
○ NTM / Memory Nets
○ ...
Sequence-to-Sequence
● Translation: [Kalchbrenner et al., EMNLP 2013][Cho et al., EMNLP 2014][Sutskever & Vinyals & Le, NIPS
2014][Luong et al., ACL 2015][Bahdanau et al., ICLR 2015]
● Image captions: [Mao et al., ICLR 2015][Vinyals et al., CVPR 2015][Donahue et al., CVPR 2015][Xu et al.,
ICML 2015]
Smart Reply - Nov 2015
Google Research Blog
[Figure: Incoming Email → small feed-forward neural network → “Activate Smart Reply?” (yes/no); if yes, a deep recurrent neural network produces the Generated Replies]
How to do Image Captions?
P(English | French) → P(English | Image)
How?
[Vinyals et al., CVPR 2015]
[Figure: an image model conditions a decoder that emits the caption word by word: “A young girl asleep …”]
Human: A young girl asleep on the sofa cuddling a stuffed bear.
Grammar as a Foreign Language, Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav
Petrov, Ilya Sutskever, and Geoffrey Hinton (NIPS 2015)
http://arxiv.org/abs/1412.7449
Turnaround Time and Effect on Research
● Minutes, Hours:
○ Interactive research! Instant gratification!
● 1-4 days:
○ Tolerable
○ Interactivity replaced by running many experiments in parallel
● 1-4 weeks:
○ High value experiments only
○ Progress stalls
● >1 month:
○ Don’t even try
Train in a day what would take a single GPU card 6 weeks
How Can We Train Large, Powerful Models Quickly?
● Exploit many kinds of parallelism
○ Model parallelism
○ Data parallelism
Model Parallelism
[Figure: one large model partitioned across multiple devices, communicating activations at the partition boundaries]
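A minimal sketch of the idea, with numpy arrays standing in for per-device computation (sizes illustrative): one large layer's weight matrix is split by columns, each "device" computes its half, and the partial activations are concatenated.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=512)                 # layer input (sent to both devices)
W_a = rng.normal(size=(512, 1024))       # first half of the layer, device A
W_b = rng.normal(size=(512, 1024))       # second half of the layer, device B
y = np.concatenate([x @ W_a, x @ W_b])   # halves computed in parallel, then joined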
Data Parallelism
[Figure: Parameter Servers hold the parameters; each of many Model Replicas reads the current parameters p′, computes an update ∆p′ on its own shard of the Data, and sends it back; the servers apply p′′ = p′ + ∆p]
Data Parallelism Choices
Can do this synchronously:
● N replicas equivalent to an N times larger batch size
● Pro: No noise
● Con: Less fault tolerant (requires some recovery if any single machine fails)
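A sketch of the synchronous variant, with a toy objective and illustrative sizes; the parameter-server update follows the p′′ = p′ + ∆p rule from the figure above, and summing updates from N replicas behaves like one replica with an N-times-larger batch.

import numpy as np

rng = np.random.default_rng(0)
p = rng.normal(size=10)                                 # parameters on the server
shards = [rng.normal(size=(32, 10)) for _ in range(4)]  # one data shard per replica
lr = 0.1

def replica_update(p, batch):
    return lr * (batch - p).mean(axis=0)                # toy gradient step toward the data

for step in range(100):
    deltas = [replica_update(p, s) for s in shards]     # computed in parallel on replicas
    p = p + np.sum(deltas, axis=0)                      # server applies p'' = p' + ∆p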
http://tensorflow.org/
and
https://github.com/tensorflow/tensorflow
Whitepaper: http://tensorflow.org/whitepaper2015.pdf
Source on GitHub: https://github.com/tensorflow/tensorflow
Motivations
DistBelief (our 1st system) was great for scalability and production
training of basic kinds of models, but less flexible than we wanted for research.
● Core in C++
○ Very low overhead
● Different front ends for specifying/driving the computation
○ Python and C++ today, easy to add more
Computation is a dataflow graph … with tensors
[Figure: a graph of ops (MatMul, Xent) fed by examples and labels; tensors flow along the edges]
Computation is a dataflow graph … with state
[Figure: the same graph with stateful nodes (biases) updated in place using the learning rate]
Computation is a dataflow graph … distributed
[Figure: the graph partitioned across Device A and Device B, with the biases and learning rate nodes placed on devices]
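To make the idea concrete, a toy dataflow graph in Python (an illustrative invention, not the TensorFlow API): ops are nodes, tensors flow along edges, and evaluating the loss node pulls values through MatMul and Xent. A real system also places stateful nodes and partitions the graph across devices.

import numpy as np

class Node:
    def __init__(self, op=None, *inputs):
        self.op, self.inputs = op, inputs
    def run(self, feeds):
        if self in feeds:                         # placeholder node: fed externally
            return feeds[self]
        return self.op(*[n.run(feeds) for n in self.inputs])

def xent(logits, onehot):                         # softmax cross-entropy
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -(onehot * np.log(p)).sum(axis=1).mean()

rng = np.random.default_rng(0)
examples, labels = Node(), Node()                 # graph inputs
weights = rng.normal(size=(4, 3))                 # state held in the graph
matmul = Node(lambda x: x @ weights, examples)    # MatMul op
loss = Node(xent, matmul, labels)                 # Xent op

x = rng.normal(size=(8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]
print(loss.run({examples: x, labels: y}))         # pulls data through the graph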
If you’re not considering how to use deep neural nets to solve your search or understanding problems, you almost certainly should be.
Questions?