DL Lecture 8: Autoencoder


Deep Learning Basics

Lecture 8: Autoencoder & DBM


Princeton University COS 495
Instructor: Yingyu Liang
Autoencoder
Autoencoder
• A neural network trained to attempt to copy its input to its output

• Contains two parts:

• Encoder: maps the input to a hidden representation
• Decoder: maps the hidden representation to the output
Autoencoder

[Figure: input $x$ → hidden representation $h$ (the code) → reconstruction $r$]
Autoencoder

[Figure: encoder $f(\cdot)$ maps input $x$ to code $h$; decoder $g(\cdot)$ maps $h$ to reconstruction $r$]

$h = f(x), \qquad r = g(h) = g(f(x))$
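A minimal sketch of this structure in code (PyTorch assumed; the layer sizes and the single-layer encoder/decoder are illustrative choices, not prescribed by the lecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder f maps x to code h; decoder g maps h back to a reconstruction r."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)   # h = f(x)
        r = self.decoder(h)   # r = g(h) = g(f(x))
        return r

x = torch.randn(16, 784)                 # dummy batch standing in for real data
model = Autoencoder()
r = model(x)
loss = nn.functional.mse_loss(r, x)      # L(x, r)
```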
Why copy the input to the output?
• We do not really care about the copying itself

• Interesting case: the model is NOT able to copy exactly but strives to do so


• The autoencoder is forced to select which aspects of the input to preserve, and thus can hopefully learn useful properties of the data

• Historical note: goes back to (LeCun, 1987; Bourlard and Kamp, 1988;
Hinton and Zemel, 1994).
Undercomplete autoencoder
• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function
$L(x, r) = L\big(x, g(f(x))\big)$

[Figure: $x$ → $h$ → $r$]
Undercomplete autoencoder
• Constrain the code to have smaller dimension than the input
• Training: minimize a loss function
$L(x, r) = L\big(x, g(f(x))\big)$

• Special case: $f$, $g$ linear, $L$ the mean squared error


• Reduces to Principal Component Analysis
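A small numerical check of this claim (a sketch on synthetic data; the dimensions, optimizer, and step count are arbitrary choices): train a linear autoencoder with squared error and compare its reconstruction error to projecting onto the top principal components.

```python
import torch

torch.manual_seed(0)
n, d, k = 500, 10, 3
X = torch.randn(n, d) @ torch.randn(d, d)   # correlated synthetic data
X = X - X.mean(dim=0)                       # center, as PCA assumes

# Linear autoencoder: f(x) = A x, g(h) = B h, trained with mean squared error
A = (0.1 * torch.randn(d, k)).requires_grad_()
B = (0.1 * torch.randn(k, d)).requires_grad_()
opt = torch.optim.Adam([A, B], lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    R = (X @ A) @ B                         # reconstruction g(f(x))
    loss = ((X - R) ** 2).mean()
    loss.backward()
    opt.step()

# PCA reconstruction: project onto the top-k right singular vectors
U, S, Vt = torch.linalg.svd(X, full_matrices=False)
X_pca = X @ Vt[:k].T @ Vt[:k]
print("linear AE error:", (((X @ A) @ B - X) ** 2).mean().item())
print("PCA error:      ", ((X_pca - X) ** 2).mean().item())
```

The two errors should come out nearly equal, since the optimal linear encoder/decoder pair spans the same subspace as the top principal components.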
Undercomplete autoencoder
• What about nonlinear encoder and decoder?

• Capacity should not be too large


• Suppose we are given data $x_1, x_2, \ldots, x_n$
• Encoder maps $x_i$ to the index $i$
• Decoder maps $i$ back to $x_i$
• A one-dimensional code $h$ then suffices for perfect reconstruction, yet nothing useful about the data is learned
Regularization
• Regularization here typically does NOT mean
• keeping the encoder/decoder shallow, or
• using a small code size

• Regularized autoencoders: add a regularization term that encourages the model to have other properties, e.g.
• Sparsity of the representation (sparse autoencoder)
• Robustness to noise or to missing inputs (denoising autoencoder)
• Smallness of the derivative of the representation
Sparse autoencoder
• Constrain the code to have sparsity
• Training: minimize a loss function
$L_R = L\big(x, g(f(x))\big) + R(h)$

[Figure: $x$ → $h$ → $r$]
Probabilistic view of regularizing ℎ
• Suppose we have a probabilistic model 𝑝(ℎ, 𝑥)
• MLE on $x$:
$\log p(x) = \log \sum_{h'} p(h', x)$

• Hard to sum over $h'$


Probabilistic view of regularizing ℎ
• Suppose we have a probabilistic model 𝑝(ℎ, 𝑥)
• MLE on $x$:
$\max \log p(x) = \max \log \sum_{h'} p(h', x)$

• Approximation: suppose $h = f(x)$ gives the most likely hidden representation, and $\sum_{h'} p(h', x)$ can be approximated by $p(h, x)$
Probabilistic view of regularizing ℎ
• Suppose we have a probabilistic model 𝑝(ℎ, 𝑥)
• Approximate MLE on $x$, with $h = f(x)$:
$\max \log p(h, x) = \max\, [\log p(x|h) + \log p(h)]$

where $-\log p(x|h)$ plays the role of the loss and $-\log p(h)$ the role of the regularization
Sparse autoencoder
• Constrain the code to have sparsity
• Laplacian prior: $p(h) = \frac{\lambda}{2} \exp\!\left(-\frac{\lambda}{2}\|h\|_1\right)$

• Training: minimize a loss function

$L_R = L\big(x, g(f(x))\big) + \lambda \|h\|_1$
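Written as code, the regularized objective is just the reconstruction loss plus an $\ell_1$ penalty on the code; a sketch (PyTorch assumed, layer sizes and $\lambda$ are placeholder values):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)
lam = 1e-3                                  # sparsity weight, a hypothetical value

x = torch.randn(16, 784)                    # dummy batch
h = encoder(x)                              # code h = f(x)
r = decoder(h)                              # reconstruction g(f(x))
loss = nn.functional.mse_loss(r, x) + lam * h.abs().sum(dim=1).mean()  # L + lambda * ||h||_1
loss.backward()
```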
Denoising autoencoder
• Traditional autoencoder: encourages $g(f(\cdot))$ to be the identity

• Denoising autoencoder: minimize a loss function

$L(x, r) = L\big(x, g(f(\tilde{x}))\big)$
where $\tilde{x}$ is $x$ + noise
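A sketch of one denoising training step under these definitions (Gaussian corruption and its scale are illustrative; the slide only specifies $\tilde{x} = x + \text{noise}$):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(16, 784)                     # dummy clean batch
x_tilde = x + 0.3 * torch.randn_like(x)      # corrupted input: x~ = x + noise
r = decoder(encoder(x_tilde))                # reconstruct from the corrupted input
loss = nn.functional.mse_loss(r, x)          # compare against the CLEAN x
opt.zero_grad()
loss.backward()
opt.step()
```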
Boltzmann machine
Boltzmann machine
• Introduced by Ackley et al. (1985)

• General “connectionist” approach to learning arbitrary probability distributions over binary vectors

• Special case of energy model: $p(x) = \frac{\exp(-E(x))}{Z}$
Boltzmann machine
• Energy model:
$p(x) = \frac{\exp(-E(x))}{Z}$

• Boltzmann machine: special case of energy model with
$E(x) = -x^T U x - b^T x$
where $U$ is the weight matrix and $b$ is the bias parameter
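For intuition, the distribution can be evaluated by brute force on a tiny model (numpy, random parameters); this is only feasible because enumerating all $2^d$ binary states is cheap for small $d$, which is exactly why $Z$ becomes intractable at realistic sizes:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # number of binary units (tiny on purpose)
U = rng.normal(size=(d, d)); U = (U + U.T) / 2   # symmetric weight matrix
b = rng.normal(size=d)                       # bias vector

def energy(x):
    return -x @ U @ x - b @ x                # E(x) = -x^T U x - b^T x

states = [np.array(s) for s in itertools.product([0, 1], repeat=d)]
unnorm = np.array([np.exp(-energy(x)) for x in states])
Z = unnorm.sum()                             # partition function: a sum of 2^d terms
p = unnorm / Z                               # p(x) = exp(-E(x)) / Z
print(p.sum())                               # sanity check: probabilities sum to 1
```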
Boltzmann machine with latent variables
• Some variables are not observed
$x = (x_v, x_h)$, where $x_v$ is visible and $x_h$ is hidden
$E(x) = -x_v^T R\, x_v - x_v^T W x_h - x_h^T S\, x_h - b^T x_v - c^T x_h$

• Universal approximator of probability mass functions


Maximum likelihood
• Suppose we are given data $X = \{x_v^1, x_v^2, \ldots, x_v^n\}$
• Maximum likelihood is to maximize
$\log p(X) = \sum_i \log p(x_v^i)$
where
$p(x_v) = \sum_{x_h} p(x_v, x_h) = \frac{1}{Z} \sum_{x_h} \exp(-E(x_v, x_h))$

• $Z = \sum_{x_v, x_h} \exp(-E(x_v, x_h))$: the partition function, difficult to compute
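For reference (a standard identity, not stated on the slide): differentiating the log-likelihood splits it into a data-driven term and a term requiring expectations under the model itself, and it is this second term that makes the intractable $Z$ (equivalently, sampling from the model) the bottleneck:

$\frac{\partial}{\partial \theta} \log p(x_v) = \mathbb{E}_{p(x_h \mid x_v)}\!\left[-\frac{\partial E(x_v, x_h)}{\partial \theta}\right] - \mathbb{E}_{p(x_v, x_h)}\!\left[-\frac{\partial E(x_v, x_h)}{\partial \theta}\right]$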


Restricted Boltzmann machine
• Invented under the name harmonium (Smolensky, 1986)
• Popularized by Hinton and collaborators under the name Restricted Boltzmann machine
Restricted Boltzmann machine
• Special case of Boltzmann machine with latent variables:
$p(v, h) = \frac{\exp(-E(v, h))}{Z}$
where the energy function is
$E(v, h) = -v^T W h - b^T v - c^T h$
with the weight matrix $W$ and the biases $b, c$
• Partition function
$Z = \sum_v \sum_h \exp(-E(v, h))$
Restricted Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
Restricted Boltzmann machine
• Conditional distribution is factorial
$p(h|v) = \frac{p(v, h)}{p(v)} = \prod_j p(h_j|v)$
and
$p(h_j = 1 \mid v) = \sigma(c_j + v^T W_{:,j})$
where $\sigma$ is the logistic function
Restricted Boltzmann machine
• Similarly,
$p(v|h) = \frac{p(v, h)}{p(h)} = \prod_i p(v_i|h)$
and
$p(v_i = 1 \mid h) = \sigma(b_i + W_{i,:}\, h)$
where $\sigma$ is the logistic function
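Because both conditionals factorize, block Gibbs sampling simply alternates between them; a numpy sketch with random parameters (toy sizes and iteration count):

```python
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 6, 4                        # visible and hidden sizes (toy values)
W = 0.1 * rng.normal(size=(nv, nh))
b = np.zeros(nv)                     # visible bias
c = np.zeros(nh)                     # hidden bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = rng.integers(0, 2, size=nv).astype(float)   # random initial visible state
for _ in range(1000):                # block Gibbs sampling
    ph = sigmoid(c + v @ W)          # p(h_j = 1 | v) = sigma(c_j + v^T W[:, j])
    h = (rng.random(nh) < ph).astype(float)
    pv = sigmoid(b + W @ h)          # p(v_i = 1 | h) = sigma(b_i + W[i, :] h)
    v = (rng.random(nv) < pv).astype(float)
print(v, h)                          # an approximate sample (v, h) from the RBM
```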
Deep Boltzmann machine
• Special case of energy model. Take 3 hidden layers and ignore the bias terms:
$p(v, h^1, h^2, h^3) = \frac{\exp(-E(v, h^1, h^2, h^3))}{Z}$
• Energy function
$E(v, h^1, h^2, h^3) = -v^T W^1 h^1 - (h^1)^T W^2 h^2 - (h^2)^T W^3 h^3$
with the weight matrices $W^1, W^2, W^3$
• Partition function
$Z = \sum_{v, h^1, h^2, h^3} \exp(-E(v, h^1, h^2, h^3))$
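A direct transcription of this energy function (numpy, toy layer sizes and random parameters, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
nv, n1, n2, n3 = 6, 5, 4, 3                  # toy layer sizes
W1 = rng.normal(size=(nv, n1))
W2 = rng.normal(size=(n1, n2))
W3 = rng.normal(size=(n2, n3))

def dbm_energy(v, h1, h2, h3):
    # E(v, h1, h2, h3) = -v^T W1 h1 - h1^T W2 h2 - h2^T W3 h3 (biases ignored)
    return -(v @ W1 @ h1) - (h1 @ W2 @ h2) - (h2 @ W3 @ h3)

v  = rng.integers(0, 2, size=nv).astype(float)
h1 = rng.integers(0, 2, size=n1).astype(float)
h2 = rng.integers(0, 2, size=n2).astype(float)
h3 = rng.integers(0, 2, size=n3).astype(float)
print(dbm_energy(v, h1, h2, h3))             # energy of one joint configuration
```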
Deep Boltzmann machine

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
