Unit 4 Notes
The encoder’s job is to compress the data into a simpler form (latent
space), and the decoder tries to reconstruct the original data from that
compressed version.
1. Linear Encoder and Decoder
• When both the encoder and decoder are linear, the
transformation is simple, like rotating and scaling the data, without
adding any curves or complex mappings.
• PCA also works through linear transformations, finding the best
"directions" (principal components) to project the data onto.
• This linear setup ensures that the autoencoder behaves similarly
to PCA because both are trying to find the best linear way to
represent the data.
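The claim above can be made concrete in a few lines. Below is a minimal sketch (assuming PyTorch; the layer sizes and data are illustrative) of a linear autoencoder trained with squared error, which is the setting in which it recovers the same subspace that PCA finds.

```python
# A linear autoencoder: no non-linearities, squared-error reconstruction loss.
import torch
import torch.nn as nn

class LinearAutoencoder(nn.Module):
    def __init__(self, input_dim=100, latent_dim=10):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim, bias=False)  # linear projection only
        self.decoder = nn.Linear(latent_dim, input_dim, bias=False)  # linear reconstruction

    def forward(self, x):
        z = self.encoder(x)      # compress to the latent space
        return self.decoder(z)   # reconstruct from the latent code

model = LinearAutoencoder()
x = torch.randn(32, 100)                  # a batch of toy data
loss = nn.MSELoss()(model(x), x)          # squared reconstruction error to minimize
```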
2. Squared Error Loss Function
The squared error measures how far the reconstructed data is from the
original data.
• The autoencoder aims to minimize this error, making sure the
reconstructed version is as close as possible to the original.
• In PCA, we also want to project the data in such a way that the
variance is maximized, which indirectly minimizes reconstruction
error when projecting back.
• So, using squared error aligns the autoencoder’s goal with PCA’s
goal of preserving data structure.
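As a small worked illustration (toy numbers, not taken from the notes), the squared reconstruction error is just the sum of squared differences between the input and its reconstruction:

```python
# Squared error between an input and its reconstruction.
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # original input
x_hat = np.array([0.9, 2.1, 2.7])    # reconstruction produced by the decoder

squared_error = np.sum((x - x_hat) ** 2)   # (0.1)^2 + (0.1)^2 + (0.3)^2 = 0.11
print(squared_error)
```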
3. Normalizing Inputs
We will now see a practical application in which AEs are used and then
compare Denoising Autoencoders with regular autoencoders.
Sparse Autoencoders
A Sparse Autoencoder is a type of autoencoder where the key idea is to
force most of the neurons in the hidden layer to be inactive (or
“sparse”).
This helps the model learn more meaningful features from the data,
especially when dealing with high-dimensional data like images, text, or
audio.
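One common way to impose sparsity is sketched below, assuming PyTorch and an L1 penalty on the hidden activations (a KL-divergence penalty toward a target activation is another standard choice); the layer sizes and penalty weight are illustrative.

```python
# Sparse autoencoder loss: reconstruction error plus an L1 penalty on the hidden code.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())

x = torch.rand(32, 784)               # e.g. a batch of flattened images
h = encoder(x)                        # hidden activations (most should stay near zero)
x_hat = decoder(h)

sparsity_weight = 1e-3
loss = nn.MSELoss()(x_hat, x) + sparsity_weight * h.abs().mean()  # reconstruction + sparsity
```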
Contractive Autoencoders (CAE):
A Contractive Autoencoder (CAE) is a variation of the traditional autoencoder,
designed to learn robust, invariant features by adding a contractive penalty to
the loss function.
This penalty encourages the model to be less sensitive to small changes in the
input, making it excellent for tasks where data can be noisy or slightly distorted.
The model learns features that are stable even if the input data has slight
variations (like noise, distortions, or small transformations).
Unlike regular autoencoders that might overfit to noise, CAEs focus on the
underlying structure of the data.
This idea mimics how the brain works—neurons respond to features consistently,
even if the environment changes slightly.
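A minimal sketch of the contractive penalty is given below, assuming PyTorch and a single sigmoid hidden layer; in that case the squared Frobenius norm of the Jacobian of the hidden layer with respect to the input has a simple closed form. Sizes and the penalty weight are illustrative.

```python
# Contractive autoencoder: reconstruction loss plus the squared Frobenius norm
# of the Jacobian of the hidden layer h = sigmoid(Wx + b) with respect to x.
import torch
import torch.nn as nn

W = nn.Parameter(torch.randn(128, 784) * 0.01)
b = nn.Parameter(torch.zeros(128))
decoder = nn.Linear(128, 784)

x = torch.rand(32, 784)
h = torch.sigmoid(x @ W.t() + b)                 # hidden activations
x_hat = decoder(h)

# ||J||_F^2 = sum_j (h_j * (1 - h_j))^2 * sum_i W_ji^2, summed over the batch
jacobian_norm = ((h * (1 - h)) ** 2 @ (W ** 2).sum(dim=1)).sum()
loss = nn.MSELoss()(x_hat, x) + 1e-4 * jacobian_norm
```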
Recurrent Neural Networks: Vanishing and Exploding
Gradients, GRU, LSTMs. Encoder Decoder Models,
Attention Mechanism.
Sequence Learning Problems
Recurrent Neural Networks
How do we model such tasks involving sequences?
RNN Structure
After the model is prepared, we need to train and test it.
• Exploding gradients: weight updates become too big, leading the model to diverge instead of converge.
• Solution:
• Gradient clipping: Limits the size of gradients during training.
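A short sketch of gradient clipping inside a training step (assuming PyTorch; the model, data, and threshold are illustrative):

```python
# Gradient clipping: rescale gradients so their overall norm never exceeds max_norm.
import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 15, 10)                       # a batch of sequences
output, _ = model(x)
loss = output.pow(2).mean()                      # dummy loss for illustration

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the update
optimizer.step()
```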
Again, we apply the sigmoid function over it. As a result, the value of the input gate It at timestamp t will be
between 0 and 1.
New Information
Here, Ct-1 is the cell state at the previous timestamp t-1, and the others are the values
we have calculated previously.
• Output Gate
• “Bob single-handedly fought the enemy and died for his country.
For his contributions, brave______.”
• Now to calculate the current hidden state, we will use Ot and tanh
of the updated cell state.
• It turns out that the hidden state is a function of Long term memory Ct and the
current output.
• If you need to take the output of the current timestamp, just apply the SoftMax
activation on hidden state Ht.
Here the token with the maximum score in the output is the prediction.
The LSTM architecture includes a memory cell that stores information from previous time steps and uses it to
influence the output of the cell at the current time step.
The output of each LSTM cell is passed to the next cell in the network, allowing the
LSTM to process and analyze sequential data over multiple time steps.
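The gate equations described above can be written out directly. The sketch below (assuming PyTorch; the weight shapes and names are illustrative) shows one LSTM step: sigmoid gates bounded between 0 and 1, an additive cell-state update, and Ht = Ot * tanh(Ct).

```python
# One LSTM step written out with plain tensor ops.
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the input, forget, output gates and the candidate.
    i_t = torch.sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])   # input gate, in (0, 1)
    f_t = torch.sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])   # forget gate
    o_t = torch.sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])   # output gate
    c_tilde = torch.tanh(x_t @ W["c"] + h_prev @ U["c"] + b["c"])  # new information
    c_t = f_t * c_prev + i_t * c_tilde     # selectively forget old and write new information
    h_t = o_t * torch.tanh(c_t)            # current hidden state
    return h_t, c_t

d_in, d_h = 10, 20
W = {k: torch.randn(d_in, d_h) for k in "ifoc"}
U = {k: torch.randn(d_h, d_h) for k in "ifoc"}
b = {k: torch.zeros(d_h) for k in "ifoc"}
h_t, c_t = lstm_step(torch.randn(1, d_in), torch.zeros(1, d_h), torch.zeros(1, d_h), W, U, b)
```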
How LSTMs avoid the problem of vanishing gradients
Because the cell state Ct is updated additively (scaled by the forget and input gates) rather than being repeatedly squashed through a non-linearity, gradients can flow back through many time steps without shrinking towards zero.
Limitations of Standard RNN
• At each timestamp t, it takes an input Xt and the hidden state Ht-1 from the
previous timestamp t-1.
• Later it outputs a new hidden state Ht, which is again passed to the next
timestamp.
• There are two gates in a GRU as opposed to three gates in an LSTM cell.
• i) Reset Gate (Short term memory)
• ii) Update Gate (Long term memory)
The Reset gate can discard irrelevant past information, and the Update gate controls the balance
between keeping past information and incorporating new information.
How the GRU Works
•The GRU takes two inputs as vectors: the current input X_t and the
previous hidden state (h_(t-1)).
• Element-wise multiplication (multiplying corresponding elements of two vectors) is used to apply the gate outputs to the previous hidden state and the candidate state.
• Step 1: The candidate state takes in the current input and the hidden state from the
previous timestamp t-1, which is multiplied by the reset gate output rt.
• This is how the value of the reset gate controls how much influence the previous
hidden state has on the candidate state (selective read).
• If rt is equal to 1, the entire information from the previous hidden state Ht-1 is
considered; if rt is 0, it is ignored.
• Step 2: Once we have the candidate state, it is used to generate the current
hidden state Ht using the update gate.
Instead of using separate gates as in the LSTM architecture, the GRU uses a
single update gate to control both the historical information Ht-1
and the new information that comes from the candidate state.
Selective forget and Selective write
• Now assume the value of ut is around 0; then the first term in the equation will
vanish, which means the new hidden state will not have much information from
the previous hidden state.
• On the other hand, the second part becomes almost 1, which essentially means
the hidden state at the current timestamp will consist of the information from
the candidate state only.
• If the value of ut is around 1, the second term will become almost 0 and
the current hidden state will depend entirely on the first term,
i.e. the information from the hidden state at the previous timestamp
t-1.
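Putting the pieces together, the sketch below (assuming PyTorch; the weight names are illustrative) follows the convention used in these notes, Ht = ut * Ht-1 + (1 - ut) * candidate.

```python
# One GRU step: reset gate for selective read, update gate for selective forget/write.
import torch

def gru_step(x_t, h_prev, Wr, Ur, Wu, Uu, Wc, Uc):
    r_t = torch.sigmoid(x_t @ Wr + h_prev @ Ur)          # reset gate (selective read)
    u_t = torch.sigmoid(x_t @ Wu + h_prev @ Uu)          # update gate
    h_cand = torch.tanh(x_t @ Wc + (r_t * h_prev) @ Uc)  # candidate state
    h_t = u_t * h_prev + (1 - u_t) * h_cand              # selective forget and selective write
    return h_t
```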
Encoder
•Image classification
•Speech recognition:
•Decoder
•Text generation:
•Image creation:
•Encoder-decoder
•Text translation from one language to another
•Summarisation:
Encoder-decoder models, particularly in the context of recurrent neural networks (RNNs), are a
framework commonly used for sequence-to-sequence tasks, such as machine translation, text
summarization, and speech recognition. Here's a breakdown of how they work:
Structure
1.Encoder:
- The encoder processes the input sequence (e.g., a sentence in a source language) and converts it
into a fixed-size context vector (also called the state vector).
- It typically consists of one or more RNN layers (e.g., LSTM or GRU) that read the input sequence
step-by-step. Each input element is passed through the RNN, updating its hidden state.
- The final hidden state of the encoder represents the entire input sequence.
2.Decoder:
- The decoder is another RNN that generates the output sequence (e.g., a translated sentence) from
the context vector produced by the encoder.
- It starts with the context vector and generates one token at a time. Each output token is fed back into
the decoder along with the previous hidden state to produce the next token.
- The decoder can also use techniques like attention mechanisms to focus on different parts of the input
sequence during generation, improving performance.
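A condensed sketch of this structure, assuming PyTorch and a GRU on both sides (vocabulary sizes and dimensions are illustrative): the encoder compresses the input sequence into its final hidden state (the context vector), and the decoder generates output tokens one at a time from that context.

```python
# Encoder-decoder skeleton for sequence-to-sequence tasks.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, src):
        _, h = self.rnn(self.embed(src))   # final hidden state = context vector
        return h

class Decoder(nn.Module):
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_token, h):
        out, h = self.rnn(self.embed(prev_token), h)   # one step of generation
        return self.out(out), h                        # scores over the vocabulary, new state
```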
Training
•The model is typically trained using pairs of input and output sequences. During training, the decoder
often uses the true output sequence (teacher forcing) to predict the next token, which helps in learning
the sequence generation better.
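A sketch of a teacher-forcing training step using the Encoder and Decoder sketch above (the token tensors here are random stand-ins for a real parallel corpus): at each step the decoder is fed the true previous target token instead of its own prediction.

```python
# Teacher forcing: condition each decoding step on the ground-truth previous token.
import torch
import torch.nn as nn

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, 1000, (4, 12))          # source sentences (batch, length)
tgt = torch.randint(0, 1000, (4, 10))          # target sentences

h = encoder(src)                               # context vector from the encoder
loss, criterion = 0.0, nn.CrossEntropyLoss()
for t in range(tgt.size(1) - 1):
    scores, h = decoder(tgt[:, t:t + 1], h)    # feed the true token at step t
    loss = loss + criterion(scores.squeeze(1), tgt[:, t + 1])
```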
Applications
•Machine Translation: Translating sentences from one language to another.
•Text Summarization: Generating concise summaries of longer texts.
•Speech Recognition: Converting audio sequences into text sequences.
Advantages
•Handling Variable-Length Sequences: Encoder-decoder models can process input and output
sequences of varying lengths, making them suitable for many NLP tasks.
•Context Awareness: The context vector captures information from the entire input sequence, allowing
the decoder to generate coherent outputs.
Limitations
•Fixed-Length Context Vector: Traditional implementations use a fixed-size context vector, which can
struggle with longer sequences. This limitation has led to the development of attention mechanisms and
transformer models, which can better manage longer dependencies.
Drawbacks of the Encoder-Decoder Approach (RNN/LSTM):
1.Dependence on Encoder Summary:
1. If the encoder produces a poor summary, the translation will also be poor.
2. This issue worsens with longer sentences.
2.Long-Range Dependency Problem:
1. RNNs/LSTMs struggle to understand and remember long sentences.
2. This happens due to the vanishing/exploding gradient problem, making it hard to retain
distant information.
3. They perform better on recent inputs but forget earlier parts.
3.Performance Degradation with Longer Inputs:
1. Even the original creators (Cho et al., 2014) showed that translation quality drops as
sentence length increases.
4.LSTM Limitations:
1. While LSTMs handle long-range dependencies better than RNNs, they still fail in certain
cases.
2. They can become “forgetful” over long sequences.
5.Lack of Selective Attention:
1. The model cannot focus more on important words in the input while translating.
2. All words are treated equally, even if some are more critical for translation.
Attention Mechanism.
(Example image question used to motivate attention: “What is the color of the soccer ball? Also, which Georgetown player, the guys in white, is wearing the captaincy band?”)
• Attention mechanisms enhance deep
learning models by selectively focusing on
important input elements, improving
prediction accuracy and computational
efficiency.
• They prioritize and emphasize relevant
information, acting as a spotlight to enhance
overall model performance.
• In psychology, attention is the cognitive
process of selectively concentrating on one
or a few things while ignoring others.
How the Attention Mechanism Works
1.Breaking Down the Input: Letʼs say you have a bunch of words (or any kind of
data) that you want the computer to understand. First, it breaks down this input
into smaller pieces, like individual words.
2.Picking Out Important Bits: Then, it looks at these pieces and decides which
ones are the most important. It does this by comparing each piece to a question
or ‘queryʼ it has in mind.
3.Assigning Importance: Each piece gets a score based on how well it matches
the question. The higher the score, the more important that piece is.
4.Focusing Attention: After scoring each piece, it figures out how much attention
to give to each one. Pieces with higher scores get more attention, while less
important ones get less attention.
5.Putting It All Together: Finally, it adds up all the pieces, but gives more weight
to the important ones. This way, the computer gets a clearer picture of whatʼs
most important in the input.
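The scoring-and-weighting steps above correspond to scaled dot-product attention. A minimal sketch (assuming PyTorch; the shapes are illustrative): each query is compared to every key, the scores become weights via softmax, and the values are summed with those weights.

```python
# Scaled dot-product attention over a set of input "pieces".
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    d_k = keys.size(-1)
    scores = query @ keys.transpose(-2, -1) / d_k ** 0.5   # how well each piece matches the query
    weights = F.softmax(scores, dim=-1)                     # importance assigned to each piece
    return weights @ values, weights                        # weighted sum of the pieces

q = torch.randn(1, 1, 64)       # one query
k = v = torch.randn(1, 10, 64)  # ten input pieces
context, w = attention(q, k, v)
```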
Advantages of Attention Mechanism in Deep Learning Models
The attention mechanism in deep learning models has multiple benefits, including enhanced
performance and versatility across a variety of tasks. The following are some of the primary benefits
of attention mechanisms:
1.Selective Information Processing: The attention mechanism enables the model to concentrate on
select parts of the input sequence, emphasizing critical information while potentially ignoring less
significant bits.
2.Improved Model Interpretability: Through attention weights, the Attention Mechanism reveals
which elements of the input data are considered relevant for a given prediction, improving model
interpretability and assisting practitioners and stakeholders in understanding and believing model
judgments.
3.Capturing Long-Range Dependencies: It tackles the challenge of capturing long-term
dependencies in sequential data by allowing the model to connect distant pieces, boosting the
model’s ability to recognize context and relationships between elements separated by substantial
distances.
4.Transfer Learning Capabilities: It aids in knowledge transfer by allowing the model to focus on
relevant aspects when adapting information from one task to another. This improves the model’s
adaptability and generalizability across domains.
5.Efficient Information Processing: It enables the model to process relevant information
selectively, decreasing computational waste and enabling more scalable and efficient learning,
improving the model’s performance on large datasets and computationally expensive tasks.
Drawbacks:
1.Computational Complexity: Attention processes can greatly increase a model’s
computational complexity, particularly when dealing with long input sequences.
Because of the increasing complexity, training and inference periods may be longer,
making attention-based models more demanding of resources.
2.Dependency on Model Architecture: The overall model design and the job at hand
can influence the effectiveness of attention mechanisms. Attention mechanisms do not
benefit all models equally, and their influence varies among architectures.
3.Overfitting Risks: Overfitting can also affect attention mechanisms, especially when
the number of attention heads is significant. When there are too many attention heads
in the model, it may begin to memorize the training data rather than generalize to new
data. As a result, performance on unseen data may suffer.
4.Attention to Noise: Attention mechanisms may pay attention to noisy or irrelevant
sections of the input, particularly when the data contains distracting information. This
can result in inferior performance and necessitates careful model adjustment.
THE END