RNN
An RNN is analogous to an autoregressive model: the current output y_t is a function of the
current input and the lagged hidden state. The gist is that parameter sharing enables inputs
and outputs of arbitrary length, and the autoregressive hidden state allows context to be taken
into account when forming the output. Note that parameter sharing implicitly assumes a
stationary distribution over the sequence input/output space.
For example, as per Goodfellow, a one-to-one RNN classifier can be represented as:
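Below is a minimal sketch in Goodfellow's notation: U, W, V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, and b, c are biases.

$$
\begin{aligned}
a_t &= b + W h_{t-1} + U x_t \\
h_t &= \tanh(a_t) \\
o_t &= c + V h_t \\
\hat{y}_t &= \mathrm{softmax}(o_t)
\end{aligned}
$$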
(See the visual from assignment 4 and the class slides for accompanying diagrams.)
RNN architectures support end-to-end processing of inputs and outputs regardless of their
shapes and lengths, allowing them to solve tasks of various kinds:
Many-to-one tasks, e.g. sentiment analysis
One-to-many tasks, e.g. image captioning
Many-to-many (one-to-one correspondence between input and output), e.g. part-of-speech
tagging (whether a word is a noun/verb/adjective, etc.)
Many-to-many (no one-to-one correspondence per se), e.g. translation
Challenges of RNN:
1. Long-term dependencies. To the extent that the output depends only on the final hidden
state (e.g. many-to-one tasks), RNNs have difficulty retaining long-term memory such as
context from early in the sequence.
LSTM adds a cell state to store long-term memory.
2. Vanishing and exploding gradients. As we see in the backpropagation through time
algorithm, the chain rule dictates that at each time step we backpropagate through, we
multiply the gradient by the transpose of W (the hidden-to-hidden weight matrix). Depending
on the magnitude of its singular values, we get either exploding or vanishing gradients
(vanishing gradients imply weak dependence on long-term memory).
A remedy for exploding gradients is gradient clipping (a minimal sketch follows this list).
There is no direct solution for vanishing gradients (some indirect remedies include
weight-initialization schemes or changing the activation function); an LSTM is usually
preferred when implementing an RNN.
3. Not parallelizable. The autoregressive computational graph dictates that the hidden states
within a sequence must be computed one time step at a time, so training cannot be
parallelized across time steps.
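A minimal sketch of gradient clipping by global norm, in PyTorch; the threshold max_norm is an arbitrary illustrative choice, and the function assumes gradients have already been computed.

```python
import torch

def clip_gradients(parameters, max_norm=5.0):
    """Rescale all gradients so their global L2 norm is at most max_norm.
    This bounds the size of the update step without changing its direction."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)  # in-place rescale of each parameter's gradient
    return total_norm

# Equivalent built-in: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```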
LSTM
Long Short-Term Memory (LSTM) is an RNN architecture that accounts for long-term memory in
sequences, enhancing the RNN's ability to process long sequences.
Intuitively, the cell state stores (long-term) memory, and several gates control whether to
forget or update it. The cell state in turn acts as a control signal for the hidden state h, which
is analogous to the hidden state in a vanilla RNN and drives the output.
Interpretation of the four gates (“IFOG”):
Note: the sigmoid lies between 0 and 1, so it is well suited to deciding how much of the input
or of the past state to pass on to the next output.
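A minimal sketch of the standard formulation (the weight/bias names are my own labels; i = input gate, f = forget gate, o = output gate, g = candidate cell update, \odot = elementwise product):

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$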
To see how LSTM alleviates exploding and vanishing gradients from the mechanical
perspective, note that the benefit of using separate state (h) and control signals (c) is that
LSTMs have the ability to remember information across many time steps. Without the control
signals, simple RNNs tend to forget information across time steps due to the vanishing
gradient problem.
In terms of calculus and gradient flow, note that backpropagation through time can now flow
through the cell state (c) alone - the red BP path / "gradient highway" in the slides. The
accumulated matrix multiplication with W is avoided.
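Concretely, using the cell update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ from the sketch above, the direct path through the cell state has Jacobian

$$
\frac{\partial c_t}{\partial c_{t-1}} = \mathrm{diag}(f_t),
$$

so backpropagating over many steps multiplies by elementwise forget-gate values rather than repeatedly by $W^{\top}$; as long as the forget gate stays close to 1, the gradient along this path neither explodes nor vanishes. (This ignores the indirect paths through the gates themselves, which is why the problem is alleviated rather than fully removed.)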
Note: compare this to ResNet - isn't this a weaker version of a skip connection? If the forget
gate equals one, this path works exactly like a skip connection.
Challenges of LSTM
1. Training remains inefficient: it is not parallelizable and is memory expensive. The network
is still recurrent in nature, and it requires storing both the hidden and cell states in memory.
2. Prone to overfitting and it is difficult to apply the dropout algorithm to curb this issue.
Recall that dropout is a regularization method where input and recurrent connections
to LSTM units are probabilistically excluded from activation and weight updates
while training a network.
3. LSTMs became popular because they could mitigate the problem of vanishing gradients.
But it turns out they fail to remove it completely: information still has to flow from cell to
cell across time steps to be evaluated.
Moreover, the cell has become quite complex with the additional machinery (such as the
forget gate) brought into the picture.
Conditional language models
$$
P(w_1, \dots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$
Perplexity: another measure of how well a probability model predicts a sample, defined as the
inverse probability of the word sequence normalized by its length (the geometric mean of the
inverse per-word conditional probabilities):
$$
\mathrm{PP}(w_1, \dots, w_n) = p(w_1, \dots, w_n)^{-1/n} = \bigl( p(w_1)\, p(w_2 \mid w_1) \cdots p(w_n \mid w_1, \dots, w_{n-1}) \bigr)^{-1/n}
$$
Note that the perplexity of a discrete uniform distribution over K events is exactly K. The
intuition is that a model with perplexity K is, at each step, as uncertain as if it were choosing
uniformly among K outcomes (e.g. a fair coin has perplexity 2). Another view: since entropy is
the average number of bits needed to encode the information contained in a random variable,
exponentiating the entropy gives the effective number of equally likely outcomes, which is the
perplexity.
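A minimal sketch of the formula above in Python; token_probs stands for the per-token conditional probabilities $p(w_i \mid w_1, \dots, w_{i-1})$ assigned by the model (the name is mine).

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities.
    Computed in log space for numerical stability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)  # log p(w_1, ..., w_n)
    return math.exp(-log_prob / n)                    # p(w_1, ..., w_n)^(-1/n)

# A uniform model over K = 4 outcomes has perplexity exactly 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0
```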
BLEU: The Bilingual Evaluation Understudy (BLEU) score is a metric to assess the quality of
machine translation. For each source sentence, BLEU compares the machine translation
output against one or more established reference translations. It rewards n-grams of the
output that appear in a reference, clips the credit for each n-gram at the number of times it
occurs in the reference (penalizing an incorrect number of occurrences), and applies a brevity
penalty to outputs that are too short.
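A simplified sketch of the unigram case; real BLEU combines clipped precisions for n = 1 to 4 with a geometric mean and supports multiple references, so the function below is illustrative only.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision times a brevity penalty.
    candidate and reference are non-empty lists of tokens."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Credit each candidate token at most as many times as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

print(bleu1("the cat sat on the mat".split(), "the cat is on the mat".split()))  # ~0.83
```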
Knowledge Distillation
Knowledge distillation is a technique to compress models and speed up inference. The premise
is that we have a pre-trained large network - the Teacher model - that performs very well but
might be too slow or expensive to run. The goal is to train a smaller network - the Student
model - to perform a similar task so that it can be deployed under limited resources, such as
on mobile phones.
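A minimal sketch of a common distillation loss in PyTorch; the temperature T, the mixing weight alpha, and the KL soft-target term are standard choices of mine, not prescribed by the notes.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Mix of (1) KL divergence between temperature-softened teacher and student
    distributions and (2) ordinary cross-entropy on the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 keeps the soft-target gradients comparable in scale across temperatures.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```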
Word Embeddings
Skip-gram
Train a neural network (input → hidden layer with only a linear projection, no activation →
output) to predict the probability that a word appears in the context around the input word.
This is a "fake" task: we do not care about the predictions themselves. We only care about the
hidden layer as an encoding of the input word.
The input word is coded via one-hot encoding. If we have 10,000 words, then the input is
10,000-dimensional. If the hidden layer has 300 neurons, as suggested in the Google paper,
each weight matrix of the skip-gram network has 10,000 × 300 = 3M weights, which makes
gradient descent difficult and the learning process very data hungry. The Word2Vec algorithm
utilizes two tricks to facilitate the training of this skip-gram model (a sketch of negative
sampling follows the list below):
Subsampling
Negative sampling
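A minimal sketch of skip-gram with negative sampling in PyTorch; class and variable names are mine, and negative_ids are assumed to be drawn from a noise distribution such as the unigram distribution raised to the 3/4 power.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Skip-gram with negative sampling: two embedding tables, one for center
    words and one for context words; only the sampled rows are updated."""
    def __init__(self, vocab_size=10_000, dim=300):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)   # the embeddings we keep
        self.context = nn.Embedding(vocab_size, dim)

    def forward(self, center_ids, context_ids, negative_ids):
        v = self.center(center_ids)                               # (B, dim)
        u_pos = self.context(context_ids)                         # (B, dim)
        u_neg = self.context(negative_ids)                        # (B, K, dim)
        pos_score = (v * u_pos).sum(dim=1)                        # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)   # (B, K)
        # Pull true (center, context) pairs together; push sampled negatives apart.
        loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=1)
        return loss.mean()
```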
Intrinsic/Extrinsic Evaluation
Intrinsic evaluation of word vectors is evaluation on specific, intermediate subtasks (e.g.
analogy completion for question-answering systems, or the "fake task" of predicting
neighboring word pairs in the word2vec skip-gram). These subtasks are typically simple and
fast to compute, and they help us understand the system used to generate the word vectors.
An intrinsic evaluation should typically return a number (e.g. based on cosine distance) that
indicates the performance of the word vectors on the evaluation subtask. A few observations
with intrinsically evaluated embeddings (an analogy-scoring sketch follows this list):
Performance is heavily dependent on the model used for word embedding.
Performance increases with larger corpus sizes.
Performance is lower for extremely low as well as for extremely high dimensional word
vectors because of under-/over-fitting.
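A minimal sketch of an intrinsic analogy test with cosine similarity; emb is an assumed dict mapping each word to a 1-D NumPy vector, and the function name is mine.

```python
import numpy as np

def analogy(emb, a, b, c, topk=1):
    """Answer 'a is to b as c is to ?' by ranking words by cosine similarity
    to the vector b - a + c."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    scores = {}
    for word, vec in emb.items():
        if word in (a, b, c):  # exclude the query words themselves
            continue
        scores[word] = float(vec @ query) / np.linalg.norm(vec)
    return sorted(scores, key=scores.get, reverse=True)[:topk]

# e.g. analogy(emb, "man", "king", "woman") should rank "queen" near the top.
```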
Extrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by
an embedding technique on the real task at hand. These tasks are typically elaborate (e.g.
evaluating answers to questions) and slow to compute. Absent computation-cost
considerations, extrinsic evaluation would be the "first best", with intrinsic evaluation serving
as the cheaper proxy.
t-SNE
t-SNE is an unsupervised, non-linear technique primarily used for data exploration and
visualizing high-dimensional data, which aims at providing a feel/intuition of how the data
is arranged in a high-dimensional space.
It can be considered a non-linear generalization of PCA. Recall that PCA is a linear
dimension-reduction technique that seeks to maximize variance and preserves large
pairwise distances. This can lead to poor visualization, especially when dealing with non-
linear manifold structures (think of manifolds such as a cylinder, ball, or curve).
For example, PCA is not able to separate red dots and blue dots from two swiss roll
manifolds:
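A minimal sketch of how one might set up such a comparison with scikit-learn (the dataset construction, offsets, and t-SNE parameters are illustrative choices of mine, not from the notes):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Two swiss-roll manifolds, one offset from the other, labeled red/blue.
X1, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)
X2, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=1)
X2 = X2 + np.array([5.0, 0.0, 5.0])
X = np.vstack([X1, X2])
labels = np.array([0] * 500 + [1] * 500)

# Linear projection (PCA) vs. non-linear embedding (t-SNE).
pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, cmap="bwr", s=5)
axes[0].set_title("PCA")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, cmap="bwr", s=5)
axes[1].set_title("t-SNE")
plt.show()
```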