Unit 2 Notes
Convolutional neural networks (CNNs) are inspired by the human brain's visual cortex, which is
responsible for processing visual stimuli. CNNs are a type of artificial neural network that use
convolution operations to identify patterns and extract features from images. Their architecture is
similar to the ventral stream of the human visual system, which includes a hierarchical sequence
of stages, increasing receptive field size, and increasingly complex neural responses.
Hierarchical Processing
Human Vision: Visual processing begins with simple features (edges in V1) and
progresses to more complex patterns (shapes and objects in IT cortex).
Neural Networks: Early layers detect low-level features (edges), and deeper layers
combine these to recognize high-level features (faces, objects).
Receptive Fields
Human Neurons: Each neuron in the visual cortex responds to a specific region of the
visual field (its receptive field).
Neural Networks: Convolutional layers mimic this by applying filters to small regions of
the input image.
Feature Detection
Biological Neurons: Simple cells in V1 detect oriented edges, while complex cells
respond to movement and specific patterns.
Neural Networks: Filters in convolutional layers detect similar features like edges and
textures.
Parallel Processing
Human Vision: Different pathways process visual information in parallel (e.g., dorsal for
"where" and ventral for "what").
Neural Networks: Parallel layers and multiple filters allow simultaneous processing of
various features.
Applications
1. Image Classification
o Recognizing objects or scenes (e.g., dogs, cars) similar to how humans identify
objects.
2. Object Detection
o Locating and classifying multiple objects in a scene, akin to how humans
recognize and focus on specific items.
3. Facial Recognition
o Identifying and verifying faces, similar to how humans recognize individuals.
4. Autonomous Vehicles
o Using neural networks to interpret visual data for navigation, much like how
human drivers rely on vision.
Feature selection involves identifying and retaining the most relevant features in a dataset to
improve model performance. While feature selection is beneficial in traditional machine
learning, it poses several challenges and limitations in the context of deep learning.
1. Loss of Information
Eliminating features based on traditional statistical methods may result in the loss of
information that could have contributed to deep learning models.
2. Non-Linear Feature Interactions
Non-linear interactions between features are hard to assess without training, and removed
features may have been significant after transformation.
3. Unnecessary Dimensionality Reduction
Deep learning thrives on high-dimensional data, especially in tasks like image or text
processing. Feature selection could reduce the dimensionality unnecessarily, potentially
limiting the model's ability to capture complex patterns.
4. High Computational Cost
Effective feature selection methods like Recursive Feature Elimination (RFE) or LASSO
require high computational resources. When applied to large datasets or deep models, this
adds complexity and computational overhead without guaranteed benefits (see the RFE
sketch at the end of this section).
5. Poor Generalization to New Data
Feature selection might yield a subset of features that performs well on training and
validation data but fails to generalize effectively to unseen data.
Deep models rely on diverse and robust feature learning from raw data, which may be
hampered by pre-filtered features.
6. Interdependence of Features
In deep learning, the interdependence of features (i.e., how features interact to produce
complex patterns) is critical.
Removing features independently might ignore these dependencies, reducing model
performance.
7. Unreliable Feature Importance Estimates
Feature importance methods (e.g., mutual information, correlation) often struggle in high-
dimensional, non-linear datasets typical in deep learning.
Features that are initially considered "unimportant" may become significant after several
layers of transformation in a deep network.
8. Application-Specific Limitations
In fields like computer vision or natural language processing, manual feature selection is
rarely effective due to the complexity and high-dimensional nature of the raw data (e.g.,
pixels or words).
Feature selection in these domains is likely to introduce bias or remove potentially
valuable data.
Despite these shortcomings, feature selection can still be useful in specific cases:
Small Datasets: When the dataset is small and contains many irrelevant features,
reducing feature dimensionality can help avoid overfitting.
Interpretable Models: If interpretability is important, feature selection can help identify
which inputs most influence the output.
Hybrid Models: In scenarios where deep learning is combined with traditional models,
feature selection may improve the performance of traditional components.
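For reference, here is a minimal sketch of Recursive Feature Elimination (RFE), the wrapper method mentioned above, using scikit-learn on synthetic data; the dataset sizes and estimator choice are illustrative assumptions, not part of these notes.

```python
# Minimal RFE sketch on synthetic data; all sizes are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 200 samples, 20 features, only 5 of which are informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# RFE repeatedly refits the estimator and drops the weakest features,
# which is why its cost grows with the number of features eliminated.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Feature ranking:", selector.ranking_)
```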
Vanilla Deep Neural Networks don’t Scale
A vanilla neural network works quite similarly to the regression model above, the difference
being that there exists a third layer between our inputs (x) and our output (y). This third layer is
referred to as a "hidden layer" (h). The hidden layer is connected to the output layer by another
set of weight vectors.
The individual units of the hidden layer, denoted by (h0, h1, h2), are known as neurons. We
could create a hidden layer with as few or as many neurons as we require.
Without anything further, however, this stack of linear layers is still equivalent to linear
regression.
The trick that makes neural networks distinctive is that they apply a nonlinear
"activation function" to the output of each layer.
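A minimal NumPy sketch of such a network, with made-up shapes and weights, showing why the nonlinearity matters:

```python
import numpy as np

# Toy forward pass of a "vanilla" network with one hidden layer (h0, h1, h2).
# Shapes and values are illustrative assumptions.
x = np.array([1.0, 2.0, 3.0])          # input vector
W1 = np.random.randn(3, 3) * 0.1       # input -> hidden weights
W2 = np.random.randn(3, 1) * 0.1       # hidden -> output weights

def relu(z):
    return np.maximum(0.0, z)

h = relu(x @ W1)                        # hidden layer with nonlinear activation
y = h @ W2                              # output layer

# Without relu(), y = x @ W1 @ W2 = x @ (W1 @ W2): a single linear map,
# i.e. the model collapses back to linear regression.
print(y)
```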
Vanilla deep neural networks (DNNs) refer to basic, fully connected feedforward networks
without any architectural enhancements or optimization techniques. While they are conceptually
simple and foundational, they face several limitations when applied to large-scale problems.
1. Computational Inefficiency
Vanishing Gradients:
In deep networks, gradients diminish as they propagate backward through layers,
especially when using activation functions like sigmoid or tanh. This slows or even halts
learning in earlier layers.
Exploding Gradients:
Conversely, gradients can grow exponentially large, destabilizing the learning process.
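A small numerical illustration of the vanishing-gradient point (assuming sigmoid activations and using the fact that the sigmoid derivative never exceeds 0.25):

```python
import numpy as np

# Illustration of the vanishing-gradient effect with sigmoid activations.
# Backpropagating through many layers multiplies many factors <= 0.25 together.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad_per_layer = sigmoid_grad(0.0)   # 0.25, the maximum possible value
for depth in (2, 5, 10, 20):
    # Upper bound on the gradient factor reaching the first layer
    print(depth, "layers -> at most", max_grad_per_layer ** depth)
# At 20 layers the bound is ~9e-13: earlier layers receive almost no learning signal.
```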
4. Lack of Specialized Structures
High-Dimensional Data:
Vanilla DNNs cannot efficiently process data with high input dimensions, such as images
or long text sequences, without a significant increase in model size.
Feature Redundancy:
The fully connected nature of vanilla DNNs leads to redundant processing of features,
making them less efficient in handling large-scale data.
6. Training Instability
Sensitivity to Hyperparameters:
Vanilla DNNs require careful tuning of hyperparameters like learning rate, weight
initialization, and network depth. Without fine-tuning, the training process becomes
unstable.
Slow Convergence:
Gradient descent in vanilla DNNs can converge slowly, requiring many iterations to
reach an optimal solution.
Vanilla DNNs are not well-suited for transfer learning, which is crucial for scaling
models across tasks. Advanced architectures like CNNs and Transformers excel in this
area by allowing pre-trained feature extraction.
Modern deep learning addresses these limitations through:
1. Architectural Innovations:
o Convolutional Neural Networks (CNNs) for images.
o Recurrent Neural Networks (RNNs) and Transformers for sequential data.
o Graph Neural Networks (GNNs) for graph-structured data.
2. Optimization Techniques:
o Batch normalization, dropout, and weight regularization improve training
stability and generalization.
o Adam and other advanced optimizers accelerate convergence.
3. Efficient Training Methods:
o Mini-batch training reduces memory requirements and speeds up training.
o Gradient clipping prevents exploding gradients.
4. Scalable Frameworks:
o Distributed training and model parallelism using frameworks like TensorFlow,
PyTorch, and JAX enable scaling across multiple GPUs and TPUs.
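A hedged sketch tying a few of these techniques together in PyTorch (layer sizes and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch combining dropout, batch normalization, the Adam optimizer, gradient
# clipping, and mini-batch training, as listed above. All sizes are illustrative.
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 100)                 # one mini-batch of 32 samples
y = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # prevent exploding gradients
optimizer.step()
```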
Vanilla deep neural networks, while foundational, fail to scale efficiently due to computational
inefficiency, overfitting, and limitations in handling complex data. Modern deep learning
leverages specialized architectures, optimization techniques, and hardware advancements to
address these challenges.
Filters and feature maps are fundamental concepts in convolutional neural networks (CNNs).
Filters, also known as kernels, are small matrices that slide over the input data to detect specific
features, such as edges or textures. Each filter generates a feature map, which is the output that
highlights the presence of the detected features in the input data.
1) Filters (Kernels)
Filters are small matrices used in the convolution operation to extract features from the
input data.
Functionality:
o They slide over the input image or feature map, performing element-wise
multiplication with the corresponding values.
o The result of this multiplication is summed to produce a single value in the output
feature map.
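A minimal NumPy sketch of this slide-multiply-sum operation, assuming a made-up 5x5 grayscale input and a simple edge filter:

```python
import numpy as np

# Slide a 3x3 filter over a grayscale image, multiply element-wise, and sum.
# Stride 1, no padding; input values are illustrative.
image = np.random.rand(5, 5)             # 5x5 input
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])          # simple vertical-edge filter

out_h = image.shape[0] - kernel.shape[0] + 1   # (H - F)/S + 1 with S=1, P=0
out_w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum

print(feature_map.shape)   # (3, 3)
```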
2) Feature Maps
A feature map is the output produced when a filter is convolved over the input; it highlights
where in the input the filter's feature is present.
Key Elements:
Stride:
The number of pixels by which the filter moves across the input at each step; larger strides
produce smaller feature maps.
Padding:
Adding extra pixels (usually zeros) around the input to maintain spatial dimensions.
Types: Valid padding (no padding) vs. same padding (output size matches input size).
Receptive Field:
The area of the input image that influences a particular value in the feature map.
As depth increases, receptive fields grow, allowing the network to detect larger, more
abstract features.
Feature Extraction: Filters and feature maps enable CNNs to automatically learn and
extract meaningful features from raw image data, which is crucial for tasks like image
classification, object detection, and segmentation.
Efficiency: By discarding irrelevant data and focusing on important features, CNNs can
process large datasets more effectively.
Generalization: The learned features allow CNNs to generalize well to new and unseen
data, making them powerful tools in various applications, including medical imaging and
autonomous vehicles.
Full Description of the Convolutional Layer
1. Components of a Convolutional Layer
a. Input Volume
Definition: The input to a convolutional layer is typically a multi-dimensional array
(tensor) representing the input data. For image data, this is often a 3D tensor with
dimensions corresponding to width, height, and depth (channels).
Example: A color image of size 32x32 pixels has a shape of (32, 32, 3), where 3
represents the RGB color channels.
b. Filters (Kernels)
Definition: Filters are small matrices (kernels) that slide over the input volume to
perform convolution operations. Each filter is designed to detect specific features in the
input.
Example: A 3x3 filter for a color image would have dimensions (3, 3, 3).
c. Stride
Definition: Stride refers to the number of pixels by which the filter moves across the
input volume during the convolution operation.
Effect: A larger stride results in a smaller output feature map, as the filter skips more
pixels.
d. Padding
Definition: Padding involves adding extra pixels (usually zeros) around the border of the
input volume to control the spatial dimensions of the output feature map.
Types:
o Valid Padding: No padding is applied; the filter only convolves over valid input
regions.
o Same Padding: Padding is applied so that the output feature map has the same
spatial dimensions as the input.
e. Activation Function
Definition: After the convolution, a non-linear activation function is applied element-wise
to the resulting feature map, introducing non-linearity into the network.
Common Functions: ReLU (Rectified Linear Unit), Sigmoid, and Tanh are commonly
used activation functions.
2. Operations of a Convolutional Layer
a. Convolution Operation
The core operation of the convolutional layer involves sliding the filter over the input
volume and performing the following steps:
1. Positioning: Place the filter at the top-left corner of the input volume.
2. Element-wise Multiplication: Multiply each filter value by the corresponding input value
it overlaps.
3. Summation: Sum all the multiplied values to obtain a single output value.
4. Sliding: Move the filter by the specified stride to the next position and repeat the
process until the entire input volume has been processed.
The result of the convolution operation is a 2D feature map (for each filter) that
represents the presence of the feature detected by the filter across the spatial dimensions
of the input.
If H is the height and W is the width of the input, F is the height and width of the filter,
S is the stride, and P is the padding, the output dimensions can be calculated as:
\[ \text{Output Height} = \frac{H + 2P - F}{S} + 1, \qquad \text{Output Width} = \frac{W + 2P - F}{S} + 1 \]
When multiple filters are applied in a convolutional layer, each filter generates its own
feature map. These feature maps are stacked together to form a 3D output volume, where
the depth corresponds to the number of filters used.
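A quick check of the output-size formula against an actual convolutional layer (PyTorch; the input size, stride, padding, and filter count are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Verify the formula (H + 2P - F)/S + 1 against nn.Conv2d.
H = W = 32; F = 3; S = 1; P = 1; num_filters = 8

conv = nn.Conv2d(in_channels=3, out_channels=num_filters,
                 kernel_size=F, stride=S, padding=P)

x = torch.randn(1, 3, H, W)              # one RGB image, shape (N, C, H, W)
out = conv(x)

expected = (H + 2 * P - F) // S + 1
print(out.shape)                          # torch.Size([1, 8, 32, 32])
print(expected)                           # 32; output depth 8 = number of filters
```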
Image preprocessing pipelines are essential components in the development of robust machine
learning models, particularly in the field of computer vision. These pipelines consist of a series
of steps designed to enhance the quality of input images, making them more suitable for analysis
and improving the performance of models. By systematically applying various preprocessing
techniques, practitioners can ensure that their models are not only accurate but also resilient to
variations in input data.
One of the primary goals of an image preprocessing pipeline is to normalize the input data.
Normalization involves scaling pixel values to a common range, often between 0 and 1 or -1 and
1. This step is crucial because it helps models converge faster during training and reduces the
risk of numerical instability. For instance, in deep learning frameworks, large variations in input
values can lead to issues like exploding or vanishing gradients. By normalizing the data, we
create a more stable environment for the model to learn from, ultimately leading to better
generalization on unseen data.
Another key aspect of image preprocessing is data augmentation. This technique involves
artificially expanding the training dataset by applying various transformations to the original
images, such as rotations, flips, translations, and color adjustments. Data augmentation helps
models become more robust by exposing them to a wider variety of input conditions. For
example, a model trained on augmented images is less likely to overfit to specific features of the
training set and is better equipped to handle real-world variations, such as changes in lighting or
orientation. This increased diversity in training data can significantly enhance the model's ability
to generalize, leading to improved performance on test datasets.
In addition to normalization and augmentation, other preprocessing steps like resizing, cropping,
and filtering can also play a critical role in refining image data. Resizing images to a consistent
size ensures that they can be processed uniformly by the model, while cropping can focus the
model's attention on specific areas of interest within an image. Filtering techniques, such as
Gaussian blurring or sharpening, can enhance important features or reduce noise, further
improving the quality of input data. By incorporating these steps into an image preprocessing
pipeline, practitioners can create a more effective input representation that captures relevant
information while minimizing distractions.
Moreover, preprocessing pipelines can also include techniques for handling imbalanced datasets.
In scenarios where certain classes of images are underrepresented, methods such as
oversampling the minority class or undersampling the majority class can be employed. This
ensures that the model receives a balanced view of the data, which is crucial for training robust
classifiers. Additionally, implementing techniques like class weighting during model training can
further mitigate the effects of class imbalance, leading to better performance across all classes.
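A minimal sketch of class weighting with a weighted loss in PyTorch; the class counts are made up for illustration:

```python
import torch
import torch.nn as nn

# Class weighting for an imbalanced dataset: weights inversely proportional
# to class frequency. The counts below are illustrative.
class_counts = torch.tensor([900.0, 100.0])        # class 0 is 9x more common
weights = class_counts.sum() / (len(class_counts) * class_counts)

# The weighted loss penalizes mistakes on the rare class more heavily.
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(16, 2)                         # a mini-batch of predictions
labels = torch.randint(0, 2, (16,))
print(loss_fn(logits, labels))
```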
In summary, image preprocessing pipelines are vital for developing robust machine learning
models in computer vision. By normalizing data, employing data augmentation, and applying
various image enhancement techniques, these pipelines improve the quality and diversity of
input data, ultimately leading to better model performance. Furthermore, addressing issues like
class imbalance ensures that models are well-equipped to handle real-world scenarios. As the
field of computer vision continues to evolve, the importance of effective preprocessing cannot be
overstated, as it lays the groundwork for creating reliable and accurate models capable of
tackling complex visual tasks.
+------------------+
| Input Images |
| (Raw Images of |
| Various Sizes) |
+------------------+
|
v
+------------------+
| Resizing |
| (Standardized |
| Size: 128x128) |
+------------------+
|
v
+------------------+
| Normalization |
| (Scale to [0, 1]) |
+------------------+
|
v
+------------------+
| Data Augmentation |
| (Rotate, Flip, |
| Zoom, Shift) |
+------------------+
|
v
+------------------+
| Cropping |
| (Focus on areas |
| of interest) |
+------------------+
|
v
+------------------+
| Filtering |
| (Gaussian Blur, |
| Sharpening) |
+------------------+
|
v
+------------------+
| Output Images |
| (Ready for Model) |
+------------------+
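A rough code equivalent of this pipeline, assuming a recent torchvision; the specific transforms and parameters are illustrative choices, not a prescribed recipe:

```python
from torchvision import transforms

# Mirrors the diagram above: resize, augment, crop, filter, then convert
# (ToTensor also scales pixel values to [0, 1]).
train_pipeline = transforms.Compose([
    transforms.Resize((128, 128)),                              # standardize size
    transforms.RandomRotation(degrees=15),                      # augmentation: rotate
    transforms.RandomHorizontalFlip(p=0.5),                     # augmentation: flip
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # augmentation: shift
    transforms.CenterCrop(112),                                 # focus on central region
    transforms.GaussianBlur(kernel_size=3),                     # light filtering
    transforms.ToTensor(),                                      # convert and scale to [0, 1]
])

# Usage (assuming a PIL image `img`): tensor_img = train_pipeline(img)
```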
Batch normalization is a powerful technique introduced to improve the training of deep neural
networks. It addresses several challenges associated with training deep learning models,
including the internal covariate shift, vanishing/exploding gradients, and the need for careful
initialization. By normalizing the inputs to each layer of the network, batch normalization can
significantly accelerate training, improve convergence rates, and enhance model performance.
Batch normalization involves normalizing the output of a layer by adjusting and scaling the
activations. Specifically, for each mini-batch during training, the mean and variance of the
activations are computed. The activations are then normalized by subtracting the mean and
dividing by the standard deviation. After normalization, the output is scaled and shifted using
learned parameters (gamma and beta), allowing the network to maintain the capacity to represent
complex functions.
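A minimal sketch of this computation for one mini-batch (PyTorch tensors; shapes and epsilon are illustrative assumptions):

```python
import torch

# Manual batch normalization: normalize with the batch mean/variance,
# then scale and shift with learned parameters gamma and beta.
x = torch.randn(32, 64)                  # mini-batch of 32 samples, 64 features
gamma = torch.ones(64)                   # learnable scale (initialized to 1)
beta = torch.zeros(64)                   # learnable shift (initialized to 0)
eps = 1e-5                               # for numerical stability

mean = x.mean(dim=0)                     # per-feature batch mean
var = x.var(dim=0, unbiased=False)       # per-feature batch variance
x_hat = (x - mean) / torch.sqrt(var + eps)
out = gamma * x_hat + beta               # scale and shift

# torch.nn.BatchNorm1d(64) implements the same computation and also keeps
# running averages of the mean/variance for use at inference time.
```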
Implementation Considerations
While batch normalization offers many advantages, there are some considerations to keep in
mind:
Batch Size: The effectiveness of batch normalization can depend on the batch size.
Smaller batches may lead to noisier estimates of the mean and variance, potentially
affecting performance. However, this can be mitigated by using techniques like virtual
batch normalization or group normalization.
Inference Mode: During inference, the running averages of the mean and variance
(computed during training) are used instead of the mini-batch statistics. This ensures that
the model behaves consistently during evaluation.
Placement in the Network: Batch normalization can be applied after the activation
function or before it. The placement can affect the performance, and it may require
experimentation to determine the best configuration for a specific model.
Batch normalization is a transformative technique that accelerates the training of deep neural
networks while improving their stability and performance. By normalizing activations and
reducing internal covariate shift, it allows for faster convergence, enhanced generalization, and
reduced sensitivity to hyperparameters. As deep learning continues to evolve, batch
normalization remains a fundamental component in the training of robust and efficient neural
networks, making it a standard practice in modern deep learning workflows.
Embeddings are dense vector representations of data points in a continuous vector space. They
are particularly useful for representing categorical variables, such as words, items, or users, in a
way that captures semantic relationships. For example, in NLP, word embeddings map words to
vectors such that words with similar meanings are closer together in the vector space.
1. Word Embeddings:
o Word2Vec: A popular algorithm that uses either the Continuous Bag of Words (CBOW)
or Skip-Gram model to learn word representations based on their context in large text
corpora.
o GloVe (Global Vectors for Word Representation): A method that captures global
statistical information of word occurrences in a corpus to learn embeddings.
o FastText: An extension of Word2Vec that considers subword information, allowing it to
generate embeddings for out-of-vocabulary words.
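A minimal Word2Vec sketch, assuming gensim 4.x is available; the toy corpus is made up and far too small to produce meaningful embeddings, it only shows the API shape:

```python
from gensim.models import Word2Vec

# Toy corpus (illustrative only).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 selects the Skip-Gram model; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vector = model.wv["cat"]                       # 50-dimensional dense embedding
similar = model.wv.most_similar("cat", topn=3)
print(vector.shape, similar)
```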
Representation Learning
Natural Language Processing: Embeddings are widely used for tasks such as sentiment analysis,
machine translation, and question answering.
Recommendation Systems: Item and user embeddings help in generating personalized
recommendations based on user preferences and behaviors.
Image Recognition: Deep learning models use embeddings to represent images, enabling tasks
such as object detection and image classification.
Graph Analysis: Graph embeddings facilitate tasks like node classification, link prediction, and
community detection.
Embedding and representation learning are crucial components of modern machine learning,
enabling models to capture the underlying structure of data in compact, informative vector
representations.
1. Linear Techniques
o Principal Component Analysis (PCA) is the canonical linear technique; its mathematical
steps are walked through later in this section.
2. Non-Linear Techniques
o Examples include t-SNE, UMAP, and autoencoders, which capture non-linear structure in
the data.
3. Other Techniques
Factor Analysis:
o Factor analysis is a statistical method used to model observed variables as linear
combinations of unobserved latent variables (factors). It is often used in psychology and
social sciences.
Kernel Methods:
o Kernel PCA is an extension of PCA that applies the kernel trick to allow for non-linear
dimensionality reduction. It maps the data into a higher-dimensional space where linear
separation is possible.
1. Data Visualization: Techniques like PCA, t-SNE, and UMAP are commonly used to
visualize high-dimensional data in 2D or 3D, helping to identify patterns, clusters, and
anomalies.
2. Preprocessing for Machine Learning: Lower-dimensional representations can serve as
input features for machine learning models, improving efficiency and performance.
3. Image Compression: Techniques like autoencoders can be used to compress images into
lower-dimensional representations while retaining key features.
4. Natural Language Processing: Word embeddings (like Word2Vec and GloVe) are
examples of lower-dimensional representations of words that capture semantic meanings.
5. Anomaly Detection: Lower-dimensional representations can help identify outliers by
analyzing the distribution of the data in a reduced space.
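A short sketch of application 1 using scikit-learn's PCA and t-SNE on the built-in digits dataset (the dataset choice is illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Project high-dimensional data to 2D for visualization.
X, y = load_digits(return_X_y=True)      # 64-dimensional digit images

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X.shape, X_pca.shape, X_tsne.shape)   # (1797, 64) (1797, 2) (1797, 2)
```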
Principal component analysis (PCA) is a dimensionality reduction and machine learning method
used to simplify a large data set into a smaller set while still maintaining significant patterns and
trends.
Principal component analysis can be broken down into five steps. I’ll go through each step,
providing logical explanations of what PCA is doing and simplifying mathematical concepts
such as standardization, covariance, eigenvectors and eigenvalues without focusing on how to
compute them.
Principal component analysis, or PCA, is a dimensionality reduction method that is often used to
reduce the dimensionality of large data sets, by transforming a large set of variables into a
smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but
the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller
data sets are easier to explore and visualize, and thus make analyzing data points much easier
and faster for machine learning algorithms without extraneous variables to process.
What Are Principal Components?
Principal components are new variables that are constructed as linear combinations or mixtures
of the initial variables. These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components. So the idea is that 10-dimensional data gives
you 10 principal components, but PCA tries to put the maximum possible information in the first
component, then the maximum remaining information in the second, and so on, until you have
something like what is shown in the scree plot below.
Scree plot: percentage of variance (information) explained by each principal component.
Organizing information in principal components this way will allow you to reduce
dimensionality without losing much information, and this by discarding the components with
low information and considering the remaining components as your new variables.
An important thing to realize here is that the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.
Geometrically speaking, principal components represent the directions of the data that explain a
maximal amount of variance, that is to say, the lines that capture most information of the data.
The relationship between variance and information here is that the larger the variance carried by
a line, the larger the dispersion of the data points along it; and the larger the dispersion along a
line, the more information it carries. To put all this simply, just think of principal components as
new axes that provide the best angle to see and evaluate the data, so that the differences between
the observations are better visible.
Step-by-Step Explanation of PCA
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each one
of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA is that
PCA is quite sensitive to the variances of the initial variables. That is, if there are large
differences between the ranges of initial variables, those variables with larger ranges will
dominate over those with small ranges (for example, a variable that ranges between 0 and 100
will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So,
transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation
for each value of each variable.
Once the standardization is done, all the variables will be transformed to the same scale.
Step 2: Covariance Matrix Computation
The aim of this step is to understand how the variables of the input data set are varying from the
mean with respect to each other, or in other words, to see if there is any relationship between
them. Because sometimes, variables are highly correlated in such a way that they contain
redundant information. So, in order to identify these correlations, we compute the covariance
matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that
has as entries the covariances associated with all possible pairs of the initial variables. For
example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3
matrix of this form:
\[ \begin{pmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z) \end{pmatrix} \]
Since the covariance of a variable with itself is its variance (Cov(x,x) = Var(x)), the main
diagonal contains the variances of the initial variables, and because covariance is commutative
(Cov(x,y) = Cov(y,x)), the matrix is symmetric with respect to the main diagonal.
What do the covariances that we have as entries of the matrix tell us about the correlations
between the variables? In short, it is the sign of the covariance that matters: if it is positive, the
two variables increase or decrease together (they are correlated); if it is negative, one increases
when the other decreases (they are inversely correlated).
Now that we know that the covariance matrix is no more than a table that summarizes the
correlations between all the possible pairs of variables, let's move to the next step.
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the
covariance matrix in order to determine the principal components of the data.
What you first need to know about eigenvectors and eigenvalues is that they always come in
pairs, so that every eigenvector has an eigenvalue. Also, their number is equal to the number of
dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables, therefore
there are 3 eigenvectors with 3 corresponding eigenvalues.
Eigenvectors and eigenvalues are what lie behind all the magic of principal components, because
the eigenvectors of the covariance matrix are actually the directions of the axes where there is
the most variance (most information), and these are what we call the principal components.
Eigenvalues are simply the coefficients attached to the eigenvectors, giving the amount of
variance carried in each principal component.
By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the
principal components in order of significance.
Let's suppose that our data set is 2-dimensional with 2 variables x and y, and that the covariance
matrix has eigenvectors v1 and v2 with corresponding eigenvalues λ1 and λ2 (the numerical
values are omitted here).
If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector
that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the
second principal component (PC2) is v2.
After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum of
eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry respectively
96 percent and 4 percent of the variance of the data.
Step 4: Create a Feature Vector
As we saw in the previous step, computing the eigenvectors and ordering them by their
eigenvalues in descending order allows us to find the principal components in order of
significance. In this step, what we do is choose whether to keep all of these components or
discard those of lesser significance (those with low eigenvalues), and form with the remaining
ones a matrix of vectors that we call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components
that we decide to keep. This makes it the first step towards dimensionality reduction, because if
we choose to keep only p eigenvectors (components) out of n, the final data set will have
only p dimensions.
Continuing with the example from the previous step, we can either form a feature vector with
both of the eigenvectors v1 and v2 as its columns, or discard the eigenvector v2, which is the one
of lesser significance, and form a feature vector with v1 only.
Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a
loss of information in the final data set. But given that v2 was carrying only 4 percent of the
information, the loss will be therefore not important and we will still have 96 percent of the
information that is carried by v1.
So, as we saw in the example, it’s up to you to choose whether to keep all the components or
discard the ones of lesser significance, depending on what you are looking for. Because if you
just want to describe your data in terms of new variables (principal components) that are
uncorrelated without seeking to reduce dimensionality, leaving out the less significant
components is not needed.
Step 5: Recast the Data Along the Principal Component Axes
In the previous steps, apart from standardization, you do not make any changes on the data; you
just select the principal components and form the feature vector, but the input data set remains
always in terms of the original axes (i.e, in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Components Analysis). This
can be done by multiplying the transpose of the original data set by the transpose of the feature
vector.
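The five steps above can be sketched in a few lines of NumPy; the random data and the decision to keep two components are illustrative assumptions:

```python
import numpy as np

# NumPy walkthrough of the five PCA steps described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 samples, 5 variables

# Step 1: standardization (zero mean, unit variance per variable)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (5 x 5)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]                  # sort by eigenvalue, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()                # fraction of variance per component

# Step 4: feature vector = the eigenvectors we decide to keep (here, the top 2)
feature_vector = eigvecs[:, :2]

# Step 5: recast the data along the principal component axes
X_projected = X_std @ feature_vector               # shape (100, 2)
print(explained.round(3), X_projected.shape)
```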
Autoencoders
Autoencoders are a special type of unsupervised feedforward neural network (no labels needed!).
The main application of Autoencoders is to accurately capture the key aspects of the provided
data to provide a compressed version of the input data, generate realistic synthetic data, or flag
anomalies.
Autoencoders are composed of 2 key fully connected feedforward neural networks (Figure 1):
Encoder: compresses the input data to remove any form of noise and generates a latent
space/bottleneck. The dimensionality of the encoder's output is therefore smaller than that of the
input and can be adjusted as a hyperparameter in order to decide how lossy our compression
should be.
Decoder: making use of only the compressed data representation from the latent space, tries to
reconstruct with as much fidelity as possible the original input data (the architecture of this
neural network is, therefore, generally a mirror image of the encoder). The “goodness” of the
prediction can then be measured by calculating the reconstruction error between the input and
output data using a loss function.
By iteratively passing data through the encoder and decoder, measuring the error, and tuning the
parameters through backpropagation, the autoencoder can, with time, correctly work with
extremely difficult forms of data.
If an Autoencoder is provided with a set of input features completely independent of each other,
then it would be really difficult for the model to find a good lower-dimensional representation
without losing a great deal of information (lossy compression).
Additionally, compared to standard data compression algorithms like gzip, autoencoders cannot
be used as general-purpose compression algorithms; they are handcrafted to work best only on
data similar to what they have been trained on.
Some of the most common hyperparameters that can be tuned when optimizing your
Autoencoder are:
The number of layers for the Encoder and Decoder neural networks
The number of nodes for each of these layers
The loss function to use for the optimization process (e.g., binary cross-entropy or mean
squared error)
The size of the latent space (the smaller it is, the higher the compression, so it also acts as a
regularization mechanism)
Finally, Autoencoders can be designed to work with different types of data, such as tabular, time-
series, or image data, and can, therefore, be designed to use a variety of layers, such as
convolutional layers, for image analysis.
Ideally, a well-trained autoencoder should be sensitive enough to the input data to provide a
tailor-made response, but not so sensitive that it simply mimics the input data and fails to
generalize to unseen data (i.e., overfitting).
Types of Autoencoders
Undercomplete Autoencoder
Sparse Autoencoder
Contractive Autoencoder
Denoising Autoencoder
Convolutional Autoencoder
Variational Autoencoder
1. Undercomplete autoencoders
Undercomplete autoencoders constrain the bottleneck (latent space) to be smaller than the
input, forcing the network to learn a compressed representation of the data rather than
simply copying it. The input image itself serves as the ground truth for reconstruction.
2. Sparse autoencoders
Sparse autoencoders are similar to the undercomplete autoencoders in that they use the
same image as input and ground truth. However—
The means via which encoding of information is regulated is significantly different.
While undercomplete autoencoders are regulated and fine-tuned by regulating the size of
the bottleneck, the sparse autoencoder is regulated by changing the number of nodes at
each hidden layer.
Since it is not possible to design a neural network that has a flexible number of nodes at
its hidden layers, sparse autoencoders work by penalizing the activation of some neurons
in hidden layers.
In other words, the loss function has a term that calculates the number of neurons that
have been activated and provides a penalty that is directly proportional to that.
This penalty, called the sparsity function, prevents the neural network from activating
more neurons and serves as a regularizer.
While typical regularizers work by creating a penalty on the size of the weights at the
nodes, the sparsity regularizer works by creating a penalty on the number of nodes activated.
This form of regularization allows the network to have nodes in hidden layers dedicated
to finding specific features in images during training, treating the regularization problem
as separate from the latent space problem.
We can thus set latent space dimensionality at the bottleneck without worrying about
regularization.
There are two primary ways in which the sparsity regularizer term can be incorporated
into the loss function.
L1 Loss: Here, we add the magnitude of the sparsity regularizer to the loss, as we do for
general regularizers:
\[ \mathcal{L} = \mathcal{L}_{\text{reconstruction}} + \lambda \sum_{i} \left| a_i^{(h)} \right| \]
where h represents the hidden layer, i indexes the images in the minibatch, and a
represents the activation.
KL-Divergence: In this case, we consider the activations over a collection of samples at
once rather than summing them as in the L1 loss method. We constrain the average
activation of each neuron over this collection.
Considering the ideal distribution to be a Bernoulli distribution with parameter ( \rho ), we
include the KL divergence within the loss to reduce the difference between the current
distribution of the activations and this ideal distribution:
\[ \sum_{j} \left[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j} \right] \]
where ( \hat{\rho}_j ) denotes the average activation of neuron j in the hidden layer,
averaged over the collection of samples.
3. Contractive autoencoders
Similar to other autoencoders, contractive autoencoders perform task of learning a
representation of the image while passing it through a bottleneck and reconstructing it in
the decoder.
The contractive autoencoder also has a regularization term to prevent the network from
learning the identity function and mapping input into the output.
Contractive autoencoders work on the basis that similar inputs should have similar
encodings and a similar latent space representation. It means that the latent space should
not vary by a huge amount for minor variations in the input.
To train a model that works along with this constraint, we have to ensure that the
derivatives of the hidden layer activations are small with respect to the input data.
Mathematically, this is expressed by requiring the Frobenius norm of the Jacobian of the
hidden-layer activations with respect to the input to be small:
\[ \left\| \frac{\partial h}{\partial x} \right\|_F^2 = \sum_{i,j} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 \]
where h represents the hidden layer and x represents the input.
An important thing to note in the loss function (formed from the norm of the derivatives
and the reconstruction loss) is that the two terms contradict each other.
While the reconstruction loss wants the model to tell differences between two inputs and
observe variations in the data, the Frobenius norm of the derivatives says that the model
should be able to ignore variations in the input data.
Putting these two contradictory conditions into one loss function enables us to train a
network where the hidden layers now capture only the most essential information. This
information is necessary to separate images and ignore information that is non-
discriminatory in nature, and therefore, not important.
The total loss function can be mathematically expressed as:
\[ \mathcal{L} = \mathcal{L}_{\text{reconstruction}}(x, \hat{x}) + \lambda \left\| \frac{\partial h}{\partial x} \right\|_F^2 \]
where h is the hidden layer for which the gradient is calculated with respect to the input x.
The gradient is summed over all training samples, and the Frobenius norm of the resulting
Jacobian is taken.
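A minimal sketch of computing this penalty for a tiny encoder with torch.autograd; the encoder architecture and lambda value are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Contractive penalty: squared Frobenius norm of the Jacobian of the hidden
# activations with respect to the input, for a single example.
encoder = nn.Sequential(nn.Linear(10, 4), nn.Sigmoid())
x = torch.randn(10)                                   # a single input example
lam = 1e-3

jacobian = torch.autograd.functional.jacobian(encoder, x)   # shape (4, 10)
contractive_penalty = lam * (jacobian ** 2).sum()            # lambda * ||dh/dx||_F^2

# Total loss would be: reconstruction_loss + contractive_penalty
print(contractive_penalty)
```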
4. Denoising autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from
an image.
As opposed to the autoencoders we've already covered, this is the first of its kind that does
not have the input image as its ground truth.
In denoising autoencoders, we feed a noisy version of the image, where noise has been
added via digital alterations. The noisy image is fed to the encoder-decoder architecture,
and the output is compared with the ground truth image.
The denoising autoencoder gets rid of noise by learning a representation of the input
where the noise can be filtered out easily.
While removing noise directly from the image seems difficult, the autoencoder performs
this by mapping the input data into a lower-dimensional manifold (like in undercomplete
autoencoders), where filtering of noise becomes much easier.
Essentially, denoising autoencoders work with the help of non-linear dimensionality
reduction. The loss function generally used in these types of networks is L2 or L1 loss.
5. Variational autoencoders
Standard autoencoders learn to represent the input just in a compressed form called the
latent space or the bottleneck.
The latent space formed after training such a model is therefore not necessarily continuous
and, in effect, might not be easy to interpolate.
For example, a standard autoencoder would learn a single fixed value for each latent
attribute of the input.
While these attributes explain the image and can be used in reconstructing the image
from the compressed latent space, they do not allow the latent attributes to be
expressed in a probabilistic fashion.
Variational autoencoders deal with this specific topic and express their latent attributes as
a probability distribution, leading to the formation of a continuous latent space that can
be easily sampled and interpolated.
When fed the same input, a variational autoencoder would instead construct each latent
attribute as a probability distribution (typically a mean and a variance).
The latent attributes are then sampled from the latent distributions formed and fed to the
decoder, reconstructing the input.
The motivation behind expressing the latent attributes as a probability distribution can be
very easily understood via statistical expressions.
Autoencoder Architecture
Autoencoders are a class of artificial neural networks designed to learn efficient representations
of data through unsupervised learning. They are particularly useful for dimensionality reduction,
feature extraction, and noise reduction. The basic architecture of an autoencoder consists of two
main components: the encoder and the decoder.
1. Encoder: The encoder compresses the input data into a lower-dimensional representation
(also known as the bottleneck or latent space). It maps the input (x) to a hidden
representation (h): \( h = f(x) = \text{Encoder}(x) \), where f is a function (often a neural
network) that transforms the input data into a lower-dimensional space.
2. Bottleneck: This is the layer that contains the compressed representation of the input
data. It is the smallest layer in the network and captures the most salient features of the
input.
3. Decoder: The decoder reconstructs the original input from the compressed
representation. It maps the hidden representation (h) back to the original input space:
\( \hat{x} = g(h) = \text{Decoder}(h) \), where g is another function (often a neural
network) that reconstructs the input from the latent representation.
4. Loss Function: The performance of an autoencoder is evaluated based on how well it can
reconstruct the original input. A common loss function used is the Mean Squared Error
(MSE) between the input x and the reconstructed output \( \hat{x} \):
\( \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2 \).
Example Architecture (for MNIST)
Input Layer: 784 neurons (for MNIST images, which are 28x28 pixels).
Encoder:
o Fully connected layer with 256 neurons (activation: ReLU).
o Fully connected layer with 64 neurons (latent space).
Bottleneck: 64 neurons (compressed representation).
Decoder:
o Fully connected layer with 256 neurons (activation: ReLU).
o Fully connected layer with 784 neurons (output layer, activation: Sigmoid).
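One way to write this example architecture as a PyTorch module (a sketch, not the only possible implementation):

```python
import torch.nn as nn

# The example architecture above: 784 -> 256 -> 64 -> 256 -> 784.
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 64),                  # 64-neuron bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),   # outputs in [0, 1] like the inputs
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```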
Training an Autoencoder
1. Data Preparation: Normalize the input data to ensure that the model learns effectively.
2. Model Definition: Build the autoencoder architecture using a deep learning framework
(e.g., TensorFlow, PyTorch).
3. Loss Function: Choose an appropriate loss function (e.g., MSE) for reconstruction.
4. Optimizer: Use an optimizer (such as Adam) and train the network to minimize the
reconstruction loss.
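A minimal training-loop sketch following these steps, reusing the Autoencoder sketch above and random tensors as a stand-in for normalized MNIST batches:

```python
import torch
import torch.nn as nn

# Training sketch: MSE reconstruction loss with the Adam optimizer.
model = Autoencoder()
loss_fn = nn.MSELoss()                                       # step 3: reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # step 4: optimizer

for epoch in range(5):
    batch = torch.rand(128, 784)             # step 1: inputs already scaled to [0, 1]
    optimizer.zero_grad()
    reconstruction = model(batch)
    loss = loss_fn(reconstruction, batch)    # compare output to the input itself
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```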
Sparsity in Autoencoders
Sparsity in autoencoders is a training criterion that encourages only a few neurons to activate in a
hidden layer. This is achieved by:
Penalizing the loss function: The loss function is constructed so that activations are
penalized within a layer.
Using L1 regularization: L1 regularization is used to encourage sparsity by
penalizing the absolute values of the hidden-layer activations.
The goal of sparsity in autoencoders is to achieve an information bottleneck, which means
representing the same information with fewer active neurons. This is important because it
encourages the autoencoder to learn meaningful latent representations instead of redundant
information.
Sparse autoencoders are a type of artificial neural network that are used for unsupervised
learning. They are designed to be sensitive to specific types of high-level features in the data,
while being insensitive to most other features.
Sparsity in autoencoders refers to the constraint or regularization applied to the hidden layer (the
bottleneck) of the network that encourages only a small number of neurons to be active (i.e., to
have non-zero outputs) at any given time. This concept is particularly useful for feature
extraction and representation learning, as it helps the model learn more meaningful and
interpretable features from the input data.
1. Sparsity Constraint: This approach adds a penalty to the loss function based on the
average activation of the hidden neurons. The penalty encourages the model to keep the
average activation of the hidden layer below a certain threshold.
o Loss Function Modification: \( \text{Loss} = \text{Reconstruction Loss} + \lambda \cdot \text{Sparsity Penalty} \)
o The sparsity penalty can be defined using the Kullback-Leibler (KL) divergence
between the average activation of the hidden neurons and a target sparsity level \( \rho \):
\[ \text{Sparsity Penalty} = \sum_{j=1}^{n_h} \left( \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j} \right) \]
o Here, \( n_h \) is the number of neurons in the hidden layer, and \( \hat{\rho}_j \) is
the average activation of neuron \( j \).
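A small sketch of this KL sparsity penalty as a function of a batch of hidden activations (PyTorch; the target sparsity and tensor shapes are illustrative assumptions):

```python
import torch

# KL sparsity penalty as defined above. `activations` stands in for the
# hidden-layer outputs of a sparse autoencoder over one mini-batch.
def kl_sparsity_penalty(activations, rho=0.05, eps=1e-8):
    rho_hat = activations.mean(dim=0)            # average activation of each hidden neuron
    return torch.sum(
        rho * torch.log(rho / (rho_hat + eps))
        + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps))
    )

activations = torch.sigmoid(torch.randn(32, 64))  # batch of 32, 64 hidden neurons in (0, 1)
penalty = kl_sparsity_penalty(activations)
# total_loss = reconstruction_loss + lam * penalty
print(penalty)
```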