
Unit 4 Notes

The document discusses various types of autoencoders, including their relationship to PCA, regularization techniques, and specific types like denoising, sparse, and contractive autoencoders. It also covers Recurrent Neural Networks (RNNs), addressing issues like vanishing and exploding gradients, and introduces advanced architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). The text emphasizes the importance of these models in processing sequential data and their mechanisms for maintaining long-term dependencies.

Uploaded by

nainalashalini
Copyright
© © All Rights Reserved

UNIT-IV

Autoencoders: relation to PCA, regularization in autoencoders, denoising autoencoders, sparse autoencoders, contractive autoencoders.

Recurrent Neural Networks: vanishing and exploding gradients, GRU, LSTMs, encoder-decoder models, attention mechanism.
Introduction to Autoencoders
Link between PCA and Autoencoders
1. Linear Encoder and Linear Decoder

The encoder’s job is to compress the data into a simpler form (latent
space), and the decoder tries to reconstruct the original data from that
compressed version.
• When both the encoder and decoder are linear, the
transformation is simple, like rotating and scaling the data, without
adding any curves or complex mappings.
• PCA also works through linear transformations, finding the best
"directions" (principal components) to project the data onto.
• This linear setup ensures that the autoencoder behaves similarly
to PCA because both are trying to find the best linear way to
represent the data.
2. Squared Error Loss Function

The squared error measures how far the reconstructed data is from the
original data.
• The autoencoder aims to minimize this error, making sure the
reconstructed version is as close as possible to the original.
• In PCA, we project the data onto the directions of maximum
variance, which is equivalent to minimizing the squared
reconstruction error when projecting back.
• So, using squared error aligns the autoencoder’s goal with PCA’s
goal of preserving data structure.
3. Normalizing Inputs

Data often comes with different scales: some features might range
from 0 to 1, while others might range from 0 to 1000.
• Normalization (centering each feature to zero mean and, where scales
differ widely, scaling it to unit variance) ensures that no single feature
dominates the variance simply because of its scale.
• PCA works best when the data is centered because it focuses on
how the data varies rather than its absolute values.
• Without normalization, both PCA and the autoencoder could
produce misleading results, as the variance could be biased by the
scale of the features.
In PCA, the principal components are the directions in which the data has the
most variance.
• When you train the autoencoder under the conditions above, the encoder
learns to find these same directions because it’s trying to compress the
data while keeping as much important information as possible.
• This is why, after training, the encoder’s weights span the same subspace as
the top principal eigenvectors from PCA (they match the eigenvectors up to a
rotation and scaling of the latent axes).

PCA and Autoencoders are Trying to Solve the Same Problem:


Both methods aim to reduce the data’s dimensionality without losing important
information.
•PCA does this by projecting data onto new axes (principal components) that
capture the most variance.
•Autoencoders do this by learning how to compress the data and still be able to
reconstruct it accurately.
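The link above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic dataset (the data, the latent size `k`, and the names `W_enc` and `W_dec` are illustrative choices, not from the original notes). Rather than training an autoencoder, it plugs the PCA directions in as linear encoder and decoder weights and confirms that the squared reconstruction error equals the variance left in the discarded directions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features with very different scales.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)               # center the inputs (condition 3)

k = 2                                 # size of the latent space
# PCA via SVD: the rows of Vt are the principal directions.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W_enc = Vt[:k].T                      # linear encoder weights, shape (5, k)
W_dec = Vt[:k]                        # linear decoder weights, shape (k, 5)

Z = Xc @ W_enc                        # compress into the latent space
X_hat = Z @ W_dec                     # reconstruct from the latent space
sq_error = np.sum((Xc - X_hat) ** 2)  # squared error loss (condition 2)

# Energy left in the directions that PCA discards.
discarded = np.sum(s[k:] ** 2)
print(sq_error, discarded)
```

A linear autoencoder trained with squared error cannot do better than this reconstruction error, which is why its learned weights end up describing the same subspace.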
Regularization in autoencoders

• The purpose of regularization is to improve generalization.
• Overfitting leads to poor generalization.
• Overfitting becomes more likely when the model has a large number of
parameters relative to the amount of training data.
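One common way to write this down, shown here as an illustrative sketch rather than the formulation used in the original slides, is to add an L2 (weight decay) penalty to the reconstruction loss. The symbols $\theta$ (all weights), $\lambda$ (penalty strength), $m$ (number of training examples) and $\hat{x}_i$ (reconstruction of $x_i$) are notation assumed for this example.

```latex
\min_{\theta} \; \frac{1}{m} \sum_{i=1}^{m} \lVert \hat{x}_i - x_i \rVert^2 \; + \; \lambda \lVert \theta \rVert^2
```

A larger $\lambda$ pushes the weights toward zero and restricts what the model can memorize; $\lambda = 0$ recovers the unregularized autoencoder.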
Denoising Autoencoders

We will now see a practical application in which AEs are used and then
compare Denoising Autoencoders with regular autoencoders
Sparse Autoencoders
A Sparse Autoencoder is a type of autoencoder where the key idea is to
force most of the neurons in the hidden layer to be inactive (or
“sparse”).
This helps the model learn more meaningful features from the data,
especially when dealing with high-dimensional data like images, text, or
audio.
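A minimal sketch of how sparsity can be encouraged, assuming NumPy, a sigmoid hidden layer, and an L1 penalty on the hidden activations. The L1 penalty is one common choice (a KL-divergence penalty on the average activation is another); the function name, the weight shapes, and `lam` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W_enc, W_dec, lam):
    """Reconstruction error plus an L1 penalty on the hidden activations."""
    H = sigmoid(X @ W_enc)                       # hidden activations in (0, 1)
    X_hat = H @ W_dec                            # linear reconstruction
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    sparsity = lam * np.mean(np.sum(np.abs(H), axis=1))
    return recon + sparsity, recon, sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))                     # hypothetical batch: 10 samples, 6 features
W_enc = rng.normal(scale=0.1, size=(6, 12))      # hidden layer with 12 units
W_dec = rng.normal(scale=0.1, size=(12, 6))
total, recon, sparsity = sparse_autoencoder_loss(X, W_enc, W_dec, lam=1e-2)
print(total, recon, sparsity)
```

Minimizing the penalty term drives most hidden activations toward zero, so only a few neurons stay active for any given input.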
Contractive Autoencoders (CAE):
A Contractive Autoencoder (CAE) is a variation of the traditional autoencoder,
designed to learn robust, invariant features by adding a contractive penalty to
the loss function.
This penalty encourages the model to be less sensitive to small changes in the
input, making it excellent for tasks where data can be noisy or slightly distorted.

The model learns features that are stable even if the input data has slight
variations (like noise, distortions, or small transformations).

Unlike regular autoencoders that might overfit to noise, CAEs focus on the
underlying structure of the data.

This idea mimics how the brain works—neurons respond to features consistently,
even if the environment changes slightly.
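The contractive penalty is usually taken to be the squared Frobenius norm of the Jacobian of the hidden layer with respect to the input. The sketch below assumes NumPy and a single sigmoid encoder layer h = sigmoid(W x + b); the names `W`, `b`, and `contractive_penalty` are illustrative, not from the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W @ x + b)."""
    h = sigmoid(W @ x + b)
    # The Jacobian is diag(h * (1 - h)) @ W, so its squared norm
    # factorizes into one term per hidden unit.
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # 3 hidden units, 4 input features
b = rng.normal(size=3)
x = rng.normal(size=4)
penalty = contractive_penalty(x, W, b)
print(penalty)
```

Adding this value (scaled by a small coefficient) to the reconstruction loss penalizes hidden features that change sharply when the input is perturbed slightly.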
Recurrent Neural Networks: Vanishing and Exploding
Gradients, GRU, LSTMs. Encoder Decoder Models,
Attention Mechanism.
Sequence Learning Problems
Recurrent Neural Networks
How do we model such tasks involving sequences?
RNN Structure
After a model is prepared, we need to train and test it.

Backpropagation through time


The problem of Exploding and Vanishing Gradients
• When training RNNs using Backpropagation Through Time (BPTT), we often run into two
major issues:
•Vanishing gradients
•Exploding gradients
Vanishing gradients
•During backpropagation, the gradients (used to update the weights) become smaller and
smaller as they move backward through time. Eventually, they shrink so much that the
earliest time steps stop contributing to learning.
•The model forgets long-term dependencies.
•It becomes very hard for the RNN to learn relationships between distant inputs and outputs in
a sequence.
•Solution:
•Use ReLU activation (instead of sigmoid/tanh).
•Use Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU), which are designed to
preserve long-term dependencies.
•Use layer normalization.
Exploding gradients
• Sometimes, gradients become extremely large during backpropagation.
• This happens when the recurrent weights are large or the network is deep in time (many time steps).
• This causes numerical instability.
• Weight updates become too big, leading the model to diverge instead of converge.
• The loss becomes NaN or training fails.
• Solution:
• Gradient clipping: limits the size of gradients during training.
• Use smaller learning rates.
• Use proper weight initialization.
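Gradient clipping, listed above as a remedy for exploding gradients, can be sketched as follows. This is a minimal NumPy version of clipping by global norm; the function name and the `max_norm` threshold are assumptions for illustration, and deep learning frameworks ship their own equivalents.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # leave small gradients untouched
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Because every array is scaled by the same factor, the direction of the update is preserved and only its size is limited.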


• Summary:
• Recurrent Neural Networks (RNNs) are a type of neural network
designed to process sequential data.
• They can analyze data with a temporal dimension, such as time series,
speech, and text.
• RNNs do this by using a hidden state passed from one timestep to
the next.
• They remember the previous information and use it for processing the current input.
• The hidden state is updated at each timestep based on the input and
the previous hidden state.
• RNNs are able to capture short-term dependencies in sequential data,
but they struggle with capturing long-term dependencies.
• This is due to the vanishing gradient problem, where gradients diminish as they
propagate backward through many time steps.
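Why gradients shrink or blow up over many time steps can be seen in a deliberately simplified case. The sketch below assumes a scalar, linear recurrence h_t = w * h_{t-1} (no input and no nonlinearity, which is a simplification of a real RNN); the gradient of the last state with respect to the first is then just w multiplied by itself once per step.

```python
def gradient_factor(w, steps):
    """Magnitude of d h_T / d h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    return abs(w) ** steps

vanishing = gradient_factor(0.9, 50)   # factor below 1 repeated 50 times
exploding = gradient_factor(1.1, 50)   # factor above 1 repeated 50 times
print(vanishing, exploding)
```

With 50 steps the first value falls below 0.01 while the second exceeds 100, which mirrors the vanishing and exploding behaviour described above.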
Selective Read, Selective Write, Selective Forget - The
Whiteboard Analogy
Like a whiteboard, an RNN has a finite state size, so we need to figure out a way to
allow it to selectively read, write and forget.
• Long Short-Term Memory (LSTM) is a type of recurrent neural
network (RNN) architecture designed to address the vanishing
gradient problem, enabling it to effectively learn and retain
information over long sequences, making it suitable for tasks
like natural language processing and time series analysis.

• The structure of an LSTM network consists of a series of LSTM cells,
each of which has a set of gates (input, output, and forget gates)
that control the flow of information into and out of the cell.
• [LSTM= Memory + Gate Mechanisms]
• The gates are used to selectively forget or retain information from
the previous time steps, allowing the LSTM to maintain long-term
dependencies in the input data.
• The first part chooses whether the information coming from the previous
timestamp is to be remembered or is irrelevant and can be forgotten.
• In the second part, the cell tries to learn new information from the input to
this cell.
• At last, in the third part, the cell passes the updated information from the
current timestamp to the next timestamp.
• Just like a simple RNN, an LSTM also has a hidden state, where
H(t-1) represents the hidden state of the previous timestamp and H(t)
is the hidden state of the current timestamp.
• In addition to that, an LSTM also has a cell state, represented by C(t-1)
and C(t) for the previous and current timestamps, respectively.
• Here the hidden state is known as short-term memory, and the cell
state is known as long-term memory. It is interesting to note that the cell state
carries information across all the timestamps.
• Consider a two-sentence example in which the first sentence is about Bob and the
second is about Dan. As we move from the first sentence to the second sentence, our
network should realize that we are no longer talking about
Bob. Now our subject is Dan. Here, the forget gate of the
network allows it to forget about Bob.
Forget Gate
In a cell of the LSTM neural network, the first step is to decide
whether we should keep the information from the previous time step
or forget it.
Here is the equation for the forget gate:
$$f_t = \sigma(X_t U_f + H_{t-1} W_f)$$
•X_t: input to the current timestamp
•U_f: weight matrix associated with the input
•H_{t-1}: the hidden state of the previous timestamp
•W_f: weight matrix associated with the hidden state
The sigmoid function $\sigma$ makes each entry of f_t a number between 0 and 1.
This f_t is later multiplied element-wise with the cell state of the previous
timestamp, $C_{t-1} \odot f_t$: where f_t is 0 the old cell state is forgotten, and
where f_t is 1 it is kept unchanged.
• Input Gate
• Let’s take another example.
• “Bob knows swimming. He told me over the phone that he had served the
navy for four long years.”
• Now just think about it: based on the context given in the first sentence,
which information in the second sentence is critical?
• The fact that he was in the navy is important information, and this is something we want
our model to remember for future computation.

• The input gate is used to quantify the importance of the new information carried by the
input. Here is the equation of the input gate:
$$i_t = \sigma(X_t U_i + H_{t-1} W_i)$$
•X_t: input at the current timestamp t
•U_i: weight matrix of the input
•H_{t-1}: hidden state at the previous timestamp
•W_i: weight matrix associated with the hidden state

Again the sigmoid function is applied, so the value of i_t at timestamp t will be
between 0 and 1.
New Information

• The new information that needs to be passed to the cell state is a
function of the hidden state at the previous timestamp t-1 and the input X
at timestamp t. With its own weight matrices, written here as U_c and W_c:
$$N_t = \tanh(X_t U_c + H_{t-1} W_c)$$
• The activation function here is tanh.
• Due to the tanh function, the value of the new information will be
between -1 and 1.
• If the value of N_t is negative, the information is subtracted from the
cell state, and if the value is positive, the information is added to the
cell state at the current timestamp.
However, N_t won’t be added directly to the cell state. Here is the
updated equation:
$$C_t = f_t \odot C_{t-1} + i_t \odot N_t$$
Here, C_{t-1} is the cell state at the previous timestamp, C_t is the cell state at the
current timestamp, and the others are the values we have calculated previously.
• Output Gate
• “Bob single-handedly fought the enemy and died for his country.
For his contributions, brave ______.”

• Based on the current expectation, we have to give a relevant word
to fill in the blank. That word is our output, and producing it is the function
of the output gate. With its own weight matrices, written here as U_o and W_o:
$$o_t = \sigma(X_t U_o + H_{t-1} W_o)$$
• Its value will also lie between 0 and 1 because of the sigmoid
function.

• Now, to calculate the current hidden state, we use o_t and the tanh
of the updated cell state:
$$H_t = o_t \odot \tanh(C_t)$$
• It turns out that the hidden state is a function of the long-term memory C_t and the
current output.
• If you need the output of the current timestamp, just apply the softmax
activation to the hidden state H_t.
Here the token with the maximum score in the output is
the prediction.
Each LSTM cell contains a memory cell that stores information from previous time steps
and uses it to influence the output of the cell at the current time step.
The output of each LSTM cell is passed to the next cell in the network, allowing the
LSTM to process and analyze sequential data over multiple time steps.
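The gate descriptions above can be put together into a single LSTM time step. This is a minimal NumPy sketch under stated assumptions: bias terms are omitted, the weight names `Uf`/`Wf`, `Ui`/`Wi`, `Uc`/`Wc`, `Uo`/`Wo` follow the U (input) and W (hidden state) naming used in these notes, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps weight names to matrices (biases omitted)."""
    f_t = sigmoid(x_t @ p["Uf"] + h_prev @ p["Wf"])   # forget gate, in (0, 1)
    i_t = sigmoid(x_t @ p["Ui"] + h_prev @ p["Wi"])   # input gate, in (0, 1)
    n_t = np.tanh(x_t @ p["Uc"] + h_prev @ p["Wc"])   # new information, in (-1, 1)
    o_t = sigmoid(x_t @ p["Uo"] + h_prev @ p["Wo"])   # output gate, in (0, 1)
    c_t = f_t * c_prev + i_t * n_t                    # long-term memory update
    h_t = o_t * np.tanh(c_t)                          # short-term memory (hidden state)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                                    # hypothetical sizes
p = {}
for gate in "fico":
    p["U" + gate] = rng.normal(scale=0.5, size=(n_in, n_hid))
    p["W" + gate] = rng.normal(scale=0.5, size=(n_hid, n_hid))

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):                # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, p)
print(h, c)
```

Note that the cell state is updated only by element-wise scaling and addition, which is what lets information (and gradients) pass along it across many time steps.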
How LSTMs avoid the problem of vanishing gradients
Limitations of Standard RNN

•Vanishing Gradient problem: occurs when processing long sequences.
•The gradients used to update the network weights become very small (vanish).
•This makes it difficult for the network to learn long-term dependencies in the
data.
•Exploding Gradients: occur when the gradients become very large during
backpropagation.
•This can lead to unstable training and prevent the network from converging to
an optimal solution.
•Limited Memory: Standard RNNs rely solely on the hidden state to capture
information from previous time steps.
•This hidden state has a limited capacity, making it difficult for the network to
remember information over long sequences.
•Difficulty in Training: Due to vanishing/exploding gradients and limited
memory, standard RNNs can be challenging to train, especially for complex
tasks involving long sequences.
The GRU, or Gated Recurrent Unit, is an advancement of the standard
RNN.
• GRUs are very similar to Long Short-Term Memory (LSTM).
• Like LSTM, GRU uses gates to control the flow of information.
• They are relatively new compared to LSTM.
• They have a simpler architecture, which can offer some improvement over LSTM.
• A GRU does not have a separate cell state (Ct). It only has a hidden state (Ht), so GRUs
are faster to train.
The Architecture of Gated Recurrent Unit
A GRU cell is more or less similar to an LSTM cell or RNN cell.

• At each timestamp t, it takes an input X_t and the hidden state H_{t-1} from the
previous timestamp t-1.
• It then outputs a new hidden state H_t, which is again passed to the next
timestamp.
• There are two gates in a GRU, as opposed to three gates in an LSTM cell:
• i) Reset Gate (short-term memory)
• ii) Update Gate (long-term memory)
• The Reset gate can discard past irrelevant information, and the Update gate controls
the balance between keeping past information and incorporating new information.

• i) Reset Gate: responsible for the short-term memory of the network, i.e. the
hidden state (H_t).
$$r_t = \sigma(X_t U_r + H_{t-1} W_r)$$
The value of r_t will range from 0 to 1 because of the sigmoid function.
Here U_r and W_r are weight matrices for the reset gate.
• ii) Update Gate: responsible for the long-term memory of the network. With its own
weight matrices, written here as U_u and W_u:
$$u_t = \sigma(X_t U_u + H_{t-1} W_u)$$

How GRU Works

•The GRU takes two inputs as vectors: the current input X_t and the
previous hidden state H_{t-1}.
• For each gate, the current input and the previous hidden state are each
multiplied by that gate’s own weight matrices, and the two results are
added together.

• This is done separately for each gate, essentially creating
“parameterized” versions of the inputs specific to each
gate.

• We then apply an activation function (a function that transforms the
values) element-wise to each element of these parameterized
vectors.

• For the gates, this activation function is the sigmoid, which outputs values
between 0 and 1, so each gate value acts as a soft switch on the information it controls.


To find the hidden state H_t in a GRU, we follow a two-step process.

• Step 1: The cell takes in the current input and the hidden state from the
previous timestamp t-1, which is multiplied by the reset gate output
r_t.

• This information is then passed to the tanh function; the
resulting value is the candidate hidden state. With its own weight matrices,
written here as U_g and W_g:
$$\hat{H}_t = \tanh(X_t U_g + (r_t \odot H_{t-1}) W_g)$$
• Note how the value of the reset gate controls how much influence
the previous hidden state can have on the candidate state (selective read).
• If r_t is equal to 1, the entire information from the previous hidden state H_{t-1} is
considered; if r_t is 0, it is ignored completely.
• Step 2: Once we have the candidate state, it is used to generate the current
hidden state H_t using the update gate:
$$H_t = u_t \odot H_{t-1} + (1 - u_t) \odot \hat{H}_t$$
• Instead of using separate gates as in the LSTM, the GRU uses a
single update gate to control both the historical information, which is H_{t-1},
and the new information, which comes from the candidate state
(selective forget and selective write).

• Now assume the value of u_t is around 0. Then the first term in the equation will
vanish, which means the new hidden state will not have much information from
the previous hidden state.
• On the other hand, the weight (1 - u_t) on the second term becomes almost 1, which
essentially means the hidden state at the current timestamp will consist of the
information from the candidate state only.
• If the value of u_t is 1, the second term will become entirely 0 and
the current hidden state will depend entirely on the first term,
i.e. the information from the hidden state at the previous timestamp
t-1.

• The value of u_t is therefore very critical in this equation, and it can range
from 0 to 1.
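The two-step process above can be sketched as one GRU time step. This is a minimal NumPy version under stated assumptions: biases are omitted, the weight names `Ur`/`Wr`, `Uu`/`Wu`, `Ug`/`Wg` are illustrative labels for the reset gate, update gate, and candidate state, and the weights are random rather than trained. It uses the convention described in the text, where the update gate multiplies the previous hidden state and (1 - u_t) multiplies the candidate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p maps weight names to matrices (biases omitted)."""
    r_t = sigmoid(x_t @ p["Ur"] + h_prev @ p["Wr"])              # reset gate
    u_t = sigmoid(x_t @ p["Uu"] + h_prev @ p["Wu"])              # update gate
    h_cand = np.tanh(x_t @ p["Ug"] + (r_t * h_prev) @ p["Wg"])   # step 1: candidate state
    return u_t * h_prev + (1.0 - u_t) * h_cand                   # step 2: blend old and new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                                               # hypothetical sizes
p = {}
for gate in "rug":
    p["U" + gate] = rng.normal(scale=0.5, size=(n_in, n_hid))
    p["W" + gate] = rng.normal(scale=0.5, size=(n_hid, n_hid))

h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):                           # a sequence of 5 inputs
    h = gru_step(x_t, h, p)
print(h)
```

Some references and libraries swap the roles of u_t and (1 - u_t); the behaviour is equivalent as long as the convention is applied consistently.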
How GRUs Solve the Limitations of Standard RNNs
• GRUs use special gates (the Update gate and the Reset gate) to control the flow of information within
the network.
• These gates act as filters, deciding what information from the past to keep, forget, or update.
• By selectively allowing relevant information through the gates, GRUs keep gradients from vanishing as quickly as in a standard RNN.
• This allows the network to learn long-term dependencies even in long sequences.
Encoder-Decoder Model
• It is a deep learning model composed of two neural
networks.
• These two neural networks usually have the same structure.
• The first one is used normally, but the second one works in
reverse.
• The first neural network takes a sentence as input and outputs a
sequence of numbers.
• The second network takes this sequence of numbers as input and
outputs a sentence.
• In effect, these two networks perform the same mapping, but one is used
in the normal direction and the other in the opposite direction.
• If we translated a sentence directly with the classical
approach, the network would translate word by word
without caring about the global meaning of the
sentence.
• Example: if we take the sentence “prendre une expression au pied de la
lettre” and translate it word by word, we obtain “take an
expression at the foot of the letter”, whereas the phrase actually means
to take an expression literally.
• The structure of the encoder allows it to extract the meaning
of a sentence.
• It stores the extracted information in a vector (the result of the
encoder).
• Then, the decoder analyzes the vector to produce its own version of
the sentence.
Different Forms of Encoder-Decoder Models
What is the difference between encoder, decoder and
encoder-decoder in a neural network?

• You would use an encoder when you want to compress the input
data, and a decoder when you need to produce some output based
on the input.

• You would use an encoder-decoder when you need to change the format
of a sequence of data.
Applications of encoder and decoder architecture in a neural network

Encoder
•Image classification
•Speech recognition

Decoder
•Text generation
•Image creation

Encoder-decoder
•Text translation from one language to another
•Summarization
Encoder-decoder models, particularly in the context of recurrent neural networks (RNNs), are a
framework commonly used for sequence-to-sequence tasks, such as machine translation, text
summarization, and speech recognition. Here's a breakdown of how they work:
Structure
1.Encoder:
- The encoder processes the input sequence (e.g., a sentence in a source language) and converts it
into a fixed-size context vector (also called the state vector).
- It typically consists of one or more RNN layers (e.g., LSTM or GRU) that read the input sequence
step-by-step. Each input element is passed through the RNN, updating its hidden state.
- The final hidden state of the encoder represents the entire input sequence.
2.Decoder:
- The decoder is another RNN that generates the output sequence (e.g., a translated sentence) from
the context vector produced by the encoder.
- It starts with the context vector and generates one token at a time. Each output token is fed back into
the decoder along with the previous hidden state to produce the next token.
- The decoder can also use techniques like attention mechanisms to focus on different parts of the input
sequence during generation, improving performance.
Training
•The model is typically trained using pairs of input and output sequences. During training, the decoder
often uses the true output sequence (teacher forcing) to predict the next token, which helps in learning
the sequence generation better.
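The encoder and decoder roles described above can be sketched in a few lines. This is a toy NumPy example under stated assumptions: plain tanh RNN cells stand in for LSTM or GRU layers, tokens are integer IDs, the weights are random and untrained (so the generated tokens are meaningless), and all names and sizes are hypothetical. It only illustrates the data flow: input sequence, fixed-size context vector, then one output token at a time fed back into the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_hid = 6, 8                                   # hypothetical sizes
E = rng.normal(scale=0.5, size=(vocab, n_hid))        # token embeddings
W_enc = rng.normal(scale=0.5, size=(n_hid, n_hid))    # encoder recurrent weights
W_dec = rng.normal(scale=0.5, size=(n_hid, n_hid))    # decoder recurrent weights
W_out = rng.normal(scale=0.5, size=(n_hid, vocab))    # hidden state -> token scores

def encode(tokens):
    """Read the input step by step; the final hidden state is the context vector."""
    h = np.zeros(n_hid)
    for t in tokens:
        h = np.tanh(E[t] + h @ W_enc)
    return h

def decode(context, start_token, max_len):
    """Generate tokens greedily, feeding each output back in as the next input."""
    h, token, out = context, start_token, []
    for _ in range(max_len):
        h = np.tanh(E[token] + h @ W_dec)
        token = int(np.argmax(h @ W_out))             # pick the highest-scoring token
        out.append(token)
    return out

context = encode([1, 4, 2, 5, 3])                     # input of length 5
output = decode(context, start_token=0, max_len=3)    # output of length 3
print(context.shape, output)
```

The input and output lengths differ here, which is the variable-length property mentioned in these notes. With teacher forcing, the `token` fed into the decoder at each step during training would be the true previous output token instead of the model's own prediction.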
Applications
•Machine Translation: Translating sentences from one language to another.
•Text Summarization: Generating concise summaries of longer texts.
•Speech Recognition: Converting audio sequences into text sequences.
Advantages
•Handling Variable-Length Sequences: Encoder-decoder models can process input and output
sequences of varying lengths, making them suitable for many NLP tasks.
•Context Awareness: The context vector captures information from the entire input sequence, allowing
the decoder to generate coherent outputs.
Limitations
•Fixed-Length Context Vector: Traditional implementations use a fixed-size context vector, which can
struggle with longer sequences. This limitation has led to the development of attention mechanisms and
transformer models, which can better manage longer dependencies.
Drawbacks of the Encoder-Decoder Approach (RNN/LSTM):
1.Dependence on Encoder Summary:
1. If the encoder produces a poor summary, the translation will also be poor.
2. This issue worsens with longer sentences.
2.Long-Range Dependency Problem:
1. RNNs/LSTMs struggle to understand and remember long sentences.
2. This happens due to the vanishing/exploding gradient problem, making it hard to retain
distant information.
3. They perform better on recent inputs but forget earlier parts.
3.Performance Degradation with Longer Inputs:
1. Even the original creators (Cho et al., 2014) showed that translation quality drops as
sentence length increases.
4.LSTM Limitations:
1. While LSTMs handle long-range dependencies better than RNNs, they still fail in certain
cases.
2. They can become “forgetful” over long sequences.
5.Lack of Selective Attention:
1. The model cannot focus more on important words in the input while translating.
2. All words are treated equally, even if some are more critical for translation.
Attention Mechanism.
Example (questions about a photo of a soccer match): What is the color of the soccer
ball? Also, which Georgetown player (the team in white) is
wearing the captain’s armband?
• Attention mechanisms enhance deep
learning models by selectively focusing on
important input elements, improving
prediction accuracy and computational
efficiency.
• They prioritize and emphasize relevant
information, acting as a spotlight to enhance
overall model performance.
• In psychology, attention is the cognitive
process of selectively concentrating on one
or a few things while ignoring others.
How Attention Mechanism Works?

1.Breaking Down the Input: Let’s say you have a bunch of words (or any kind of
data) that you want the computer to understand. First, it breaks down this input
into smaller pieces, like individual words.
2.Picking Out Important Bits: Then, it looks at these pieces and decides which
ones are the most important. It does this by comparing each piece to a question
or ‘query’ it has in mind.
3.Assigning Importance: Each piece gets a score based on how well it matches
the question. The higher the score, the more important that piece is.
4.Focusing Attention: After scoring each piece, it figures out how much attention
to give to each one. Pieces with higher scores get more attention, while less
important ones get less attention.
5.Putting It All Together: Finally, it adds up all the pieces, but gives more weight
to the important ones. This way, the computer gets a clearer picture of what’s
most important in the input.
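The five steps above map directly onto a few lines of code. This is a minimal NumPy sketch of dot-product attention, one common scoring choice; the `keys`, `values`, and `query` arrays are made-up toy data, and the function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(query, keys, values):
    scores = keys @ query              # steps 2-3: compare each piece to the query
    weights = softmax(scores)          # step 4: turn scores into attention weights
    context = weights @ values         # step 5: weighted sum of the pieces
    return context, weights

# Step 1: the input broken into three pieces, each with a key and a value.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query = np.array([0.0, 3.0])

context, weights = attend(query, keys, values)
print(weights, context)
```

The first piece does not match the query, so it receives a much smaller weight than the other two, and the resulting context vector is dominated by their values.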
Advantages of Attention Mechanism in Deep Learning Models
The attention mechanism in deep learning models has multiple benefits, including enhanced
performance and versatility across a variety of tasks. The following are some of the primary benefits
of attention mechanisms:
1.Selective Information Processing: The attention mechanism enables the model to concentrate on
select parts of the input sequence, emphasizing critical information while potentially ignoring less
significant bits.
2.Improved Model Interpretability: Through attention weights, the attention mechanism reveals
which elements of the input data are considered relevant for a given prediction, improving model
interpretability and helping practitioners and stakeholders understand and trust the model’s
decisions.
3.Capturing Long-Range Dependencies: It tackles the challenge of capturing long-term
dependencies in sequential data by allowing the model to connect distant pieces, boosting the
model’s ability to recognize context and relationships between elements separated by substantial
distances.
4.Transfer Learning Capabilities: It aids in knowledge transfer by allowing the model to focus on
relevant aspects when adapting information from one task to another. This improves the model’s
adaptability and generalizability across domains.
5.Efficient Information Processing: It enables the model to process relevant information
selectively, decreasing computational waste and enabling more scalable and efficient learning,
improving the model’s performance on large datasets and computationally expensive tasks.
Drawbacks :
1.Computational Complexity: Attention processes can greatly increase a model’s
computational complexity, particularly when dealing with long input sequences.
Because of the increasing complexity, training and inference periods may be longer,
making attention-based models more demanding of resources.
2.Dependency on Model Architecture: The overall model design and the job at hand
can influence the effectiveness of attention mechanisms. Attention mechanisms do not
benefit all models equally, and their influence varies among architectures.
3.Overfitting Risks: Overfitting can also affect attention mechanisms, especially when
the number of attention heads is significant. When there are too many attention heads
in the model, it may begin to memorize the training data rather than generalize to new
data. As a result, performance on unseen data may suffer.
4.Attention to Noise: Attention mechanisms may pay attention to noisy or irrelevant
sections of the input, particularly when the data contains distracting information. This
can result in inferior performance and necessitates careful model adjustment.
THE END
