Chapter 7 - Autoencoders
6.1. Introduction to Autoencoders
• Autoencoders are an unsupervised learning technique in which we
leverage neural networks for the task of representation learning.
• We design the network architecture to impose a bottleneck that forces a
compressed knowledge representation of the original input.
• If the input features were each independent of one another, this
compression and subsequent reconstruction would be a very
difficult task.
• However, if some sort of structure exists in the data (i.e.
correlations between input features), this structure can be learned
and consequently leveraged when forcing the input through the
network's bottleneck.
• Autoencoders are artificial neural networks capable of learning
dense representations of the input data, called latent
representations or codings, without any supervision (i.e., the
training set is unlabeled).
• These codings typically have a much lower dimensionality than
the input data, making autoencoders useful for dimensionality
reduction, especially for visualization purposes.
• An autoencoder first takes the input and compresses it into a
low-dimensional vector.
• This part of the network is called the encoder because it is
responsible for producing the low-dimensional embedding or
code.
• The second part of the network tries to invert the computation of
the first half of the network and reconstruct the original input.
• This piece is known as the decoder.
• The overall architecture is illustrated in the figure below.
• The autoencoder architecture attempts to compress a
high-dimensional input into a low-dimensional embedding.
• It then uses that low-dimensional embedding to reconstruct the
input.
• An autoencoder consists of three components:
• i. Encoder: An encoder is a feedforward, fully connected neural
network that compresses the input into a latent space representation
• It encodes the input image as a compressed representation in a
reduced dimension.
• The compressed image is a distorted version of the original image.
• ii. Code: This part of the network contains the reduced representation
of the input that is fed into the decoder.
• It is also called bottleneck.
• iii. Decoder: The decoder is also a feedforward network, with a structure
similar to the encoder's.
• It decodes the encoded image back to the original dimensions,
reconstructing the input from the code.
• The decoded image is reconstructed from the latent space representation
and is a lossy reconstruction of the original image.
• First, the input goes through the encoder where it is compressed and
stored in the layer called Code, then the decoder decompresses the
original input from the code.
• The main objective of the autoencoder is to get an output identical to
the input.
• Note that the decoder architecture is the mirror image of the encoder.
• This is not a requirement but it’s typically the case.
• The only requirement is that the dimensionality of the input and the
output must be the same.
• The architecture as a whole looks something like this:
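• In code, a minimal sketch of this encoder-code-decoder layout might look as follows (PyTorch, layer sizes, and activations are illustrative assumptions):

    from torch import nn

    class Autoencoder(nn.Module):
        def __init__(self, input_dim=784, code_dim=32):
            super().__init__()
            # Encoder: compresses the input into the low-dimensional code (the bottleneck)
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 128), nn.ReLU(),
                nn.Linear(128, code_dim),
            )
            # Decoder: reconstructs the input from the code
            self.decoder = nn.Sequential(
                nn.Linear(code_dim, 128), nn.ReLU(),
                nn.Linear(128, input_dim), nn.Sigmoid(),
            )

        def forward(self, x):
            code = self.encoder(x)        # encode: input -> code
            return self.decoder(code)     # decode: code -> reconstruction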
• An autoencoder will encode the input distribution into a
low-dimensional tensor, which usually takes the form of a vector.
• This will approximate the hidden structure that is commonly
referred to as the latent representation, code, or vector.
• This process constitutes the encoding part.
• The latent vector will then be decoded by the decoder part to
recover the original input.
• As a result of the latent vector being a low-dimensional
compressed representation of the input distribution, it should be
expected that the output recovered by the decoder can only
approximate the input.
• The dissimilarity between the input and the output can be
measured by a loss function.
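• For example, a common choice is the mean squared error between the input and its reconstruction; a minimal sketch (assuming a PyTorch autoencoder such as the one above, with 'model' as a hypothetical instance):

    import torch
    from torch import nn

    loss_fn = nn.MSELoss()
    x = torch.rand(64, 784)      # a batch of flattened inputs (illustrative)
    x_hat = model(x)             # 'model' stands for an autoencoder like the sketch above
    loss = loss_fn(x_hat, x)     # small when the output closely approximates the input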
Figure: An autoencoder.
• Autoencoders are mainly a dimensionality reduction (or
compression) algorithm with a couple of important properties:
• Data-specific: Autoencoders are only able to meaningfully
compress data similar to what they have been trained on.
• Since they learn features specific to the given training data, they
are different from a standard data compression algorithm like gzip.
• So, we can’t expect an autoencoder trained on handwritten digits
to compress landscape photos.
• Lossy: The output of the autoencoder will not be exactly the same
as the input; it will be a close but degraded representation.
• If you want lossless compression, autoencoders are not the way to
go.
• Unsupervised: To train an autoencoder we don’t need to label the
data, just throw the raw input data at it.
• Autoencoders are considered an unsupervised learning technique
since they don’t need explicit labels to train on.
• But to be more precise they are self-supervised because they
generate their own labels from the training data.
Stacked Autoencoders
• Just like other neural networks, autoencoders can have multiple
hidden layers.
• Such autoencoders are called stacked autoencoders (or deep
autoencoders).
• Adding more layers helps the autoencoder learn more complex
codings.
• That said, one must be careful not to make the autoencoder too deep.
• Imagine an encoder so powerful that it just learns to map each input
to a single arbitrary number (and the decoder learns the reverse
mapping).
• Obviously such an autoencoder will reconstruct the training data
perfectly, but it will not have learned any useful data representation
in the process and it is unlikely to generalize well to new instances.
• The architecture of a stacked autoencoder is typically symmetrical
with regard to the central hidden layer (the coding layer).
• To put it simply, it looks like a sandwich.
• For example, an autoencoder for MNIST may have 784 inputs (28
x 28), followed by a hidden layer with 100 neurons, then a central
hidden layer of 30 neurons, then another hidden layer with 100
neurons, and an output layer with 784 neurons.
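• As a sketch, this 784-100-30-100-784 architecture could be written in PyTorch as follows (the choice of ReLU and sigmoid activations is an assumption):

    from torch import nn

    # Stacked (deep) autoencoder for MNIST: 784 -> 100 -> 30 -> 100 -> 784
    stacked_ae = nn.Sequential(
        # Encoder
        nn.Linear(28 * 28, 100), nn.ReLU(),
        nn.Linear(100, 30), nn.ReLU(),          # central coding layer with 30 neurons
        # Decoder (mirror image of the encoder)
        nn.Linear(30, 100), nn.ReLU(),
        nn.Linear(100, 28 * 28), nn.Sigmoid(),  # 784 outputs, matching the input dimension
    )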
6.4. Variational Autoencoders
Figure: A VAE maps an image to two vectors, a mean and a variance, which define a
probability distribution over the latent space from which a latent point is sampled and decoded.
• In a VAE, we encode the input as a distribution over the latent
space instead of as a single point.
• This encoded distribution is chosen to be normal, so the encoder is
trained to return the mean and the variance that describe it.
• Instead of compressing its input image into a fixed code in the
latent space, VAE turns the image into the parameters of a
statistical distribution: a mean and a variance.
• Essentially, this means we are assuming the input image has been
generated by a statistical process, and that the randomness of this
process should be taken into account during encoding and
decoding.
• The VAE then uses the mean and variance parameters to
randomly sample one element of the distribution, and decodes that
element back to the original input.
• The stochasticity of this process improves robustness and forces
the latent space to encode meaningful representations everywhere:
every point sampled in the latent space is decoded to a valid
output.
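• A minimal sketch of such an encoder in PyTorch, returning a mean vector and a log-variance vector for each input (the layer sizes, latent dimension, and use of log-variance are assumptions):

    from torch import nn

    class VAEEncoder(nn.Module):
        def __init__(self, input_dim=784, latent_dim=2):
            super().__init__()
            self.hidden = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
            # Two output heads: the mean and the (log-)variance of the distribution over z
            self.mu = nn.Linear(256, latent_dim)
            self.log_var = nn.Linear(256, latent_dim)

        def forward(self, x):
            h = self.hidden(x)
            return self.mu(h), self.log_var(h)   # parameters of a distribution, not a single point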
Figure: The reparameterization trick enables backpropagation through the random sampling
step: instead of sampling z directly from N(μ, σ²), we sample ε ~ N(0, 1) and compute z = μ + σ · ε.
• After reparametrization, the produced latent vector z will be the
same as before.
• But making the change allows the gradients to flow back through
to the encoder part of the VAE.
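• A sketch of the reparameterization step, assuming the encoder returns a mean and a log-variance as above:

    import torch

    def reparameterize(mu, log_var):
        # z = mu + sigma * epsilon, with epsilon ~ N(0, I)
        # The randomness is moved into epsilon, so gradients can flow back through mu and log_var
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps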
• Now, we need a method to compute the difference between two
probability distributions.
• For this, we use KL divergence (relative entropy).
• The KL divergence produces a number indicating how close two
distributions are to each other.
• The closer the two distributions get to each other, the lower the KL
divergence, and hence the loss, becomes.
• In the following graph, the blue distribution is trying to model the
green distribution.
• As the blue distribution comes closer and closer to the green one,
the KL divergence loss will get closer to zero.
Figure: A distribution q(x) being fitted to a target distribution p(x); as q approaches p,
the KL divergence approaches zero.
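• As a small numerical illustration (the distributions below are made up), the KL divergence between two discrete distributions shrinks as q(x) approaches p(x):

    import numpy as np

    def kl_divergence(p, q):
        # KL(p || q) = sum_x p(x) * log(p(x) / q(x))
        return np.sum(p * np.log(p / q))

    p      = np.array([0.5, 0.3, 0.2])     # target distribution p(x)
    q_far  = np.array([0.2, 0.3, 0.5])     # q(x) far from p(x)
    q_near = np.array([0.45, 0.33, 0.22])  # q(x) close to p(x)

    print(kl_divergence(p, q_far))   # larger value
    print(kl_divergence(p, q_near))  # much closer to zero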
• Now we can proceed to the formulation of the loss function.
• The loss function of the VAE is the negative log-likelihood with a
regularizer.
• The loss function of VAE is a combination of two terms:
– Reconstruction loss: This term measures how well the VAE can
reconstruct the input data from the latent representation.
– KL divergence loss: This term measures how close the latent
representation is to a standard normal distribution. A
commonly used loss is the Kullback–Leibler divergence
between the latent representation and a standard normal
distribution.
• For the regularization, we use an expression meant to nudge the
distribution of the encoder output towards a standard normal
distribution centered around 0.
• This provides the encoder with a sensible assumption about the
structure of the latent space it is modeling.
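• Putting the two terms together, a VAE loss could be sketched as follows (binary cross-entropy as the reconstruction term and the closed-form KL against N(0, 1) are common choices, assumed here):

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_hat, mu, log_var):
        # Reconstruction loss: how well the decoder reproduces the input
        recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
        # KL divergence between N(mu, sigma^2) and the standard normal N(0, 1), in closed form
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl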
• The second term is a regularizer: the KL divergence between the
encoder's distribution q_φ(z∣x) and the prior p_θ(z).
• In a VAE, p_θ(z) is specified as the standard normal distribution with
mean 0 and variance 1, i.e. p_θ(z) = N(0, 1).
• This divergence measures how much information is lost when using q
to represent p; it is one measure of how close q is to p.
• If the encoder outputs representations z that are different from a
standard normal distribution, it will receive a penalty in the loss.
• This regularizer term means ‘keep the representations z of each
digit sufficiently diverse’.
• If we didn’t include the regularizer, the encoder could learn to
cheat and give each datapoint a representation in a different region
of Euclidean space.
• This is bad, because then two images of the same number, say a
digit 2 written by two different people, 2_bob and 2_alice, could end up
with very different representations z_bob and z_alice.
• We want the representation space of z to be meaningful, so we
penalize this behavior.
• This has the effect of keeping similar numbers' representations
close together (so the representations z_alice, z_bob, z_ali of the
digit 2 remain sufficiently close).
Figure: Sampling from nearby points in the VAE latent space produces similar output images.
• There are plenty of further improvements that can be made over
the variational autoencoder.
• For example, we could replace the fully connected encoder-decoder
pair with a convolutional encoder and a deconvolutional
(transposed-convolution) decoder to produce convincing synthetic
human face photos.