
RNN

RNN
An RNN is analogous to an autoregressive model: the current output y_t is a function of the current input and the lagged hidden state. The gist is that parameter sharing enables inputs and outputs of arbitrary length, and the autoregressive hidden state allows context to be taken into account when forming the output. Note that parameter sharing implicitly assumes a stationary distribution over the sequence input/output space.
For example, as per Goodfellow, a one-to-one RNN classifier can be represented as:
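For reference, a standard formulation of these update equations (Goodfellow et al., ch. 10 notation; U, W, V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weights, b and c the biases):
h_t = \tanh(b + W h_{t-1} + U x_t)
o_t = c + V h_t
\hat{y}_t = \mathrm{softmax}(o_t)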
(Figure omitted: a visual of this from assignment 4, plus the version from the class slides.)

RNN architectures handle end-to-end inputs and outputs of varying shapes and lengths, allowing them to solve tasks of various natures (a minimal forward-pass sketch follows this list):
Many-to-one tasks, e.g. sentiment analysis
One-to-many tasks, e.g. image captioning
Many-to-many with a one-to-one correspondence between input and output, e.g. part-of-speech tagging (whether a word is a noun/verb/adjective, etc.)
Many-to-many without a per-step correspondence, e.g. translation
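A minimal NumPy sketch of the shared-parameter forward pass over an arbitrary-length sequence; the toy dimensions and random weights are placeholders, not from the notes:

```python
import numpy as np

def rnn_forward(xs, h0, U, W, V, b, c):
    """Run a vanilla RNN over a sequence of arbitrary length.

    The same parameters (U, W, V, b, c) are reused at every time step,
    which is what lets the network accept inputs of any length.
    """
    h, outputs = h0, []
    for x in xs:                                # one step per input token
        h = np.tanh(b + W @ h + U @ x)          # autoregressive hidden state
        outputs.append(V @ h + c)               # per-step output (logits)
    return outputs, h

# Toy dimensions: 4-dim inputs, 8-dim hidden state, 3 output classes.
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 4, 8, 3, 6
U = rng.normal(size=(d_h, d_in))
W = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(d_out, d_h))
b, c = np.zeros(d_h), np.zeros(d_out)

xs = [rng.normal(size=d_in) for _ in range(T)]   # a length-6 sequence
outputs, h_T = rnn_forward(xs, np.zeros(d_h), U, W, V, b, c)
# Many-to-one: use only h_T (or outputs[-1]); many-to-many: use all outputs.
```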
Challenges of RNN:
1. Long-term dependencies. To the extent that the output depends only on the final hidden state (e.g. many-to-one tasks), RNNs have difficulty accounting for long-term memory such as context.
LSTM adds a cell state to store long-term memory.
2. Vanishing and exploding gradients. As seen in the backpropagation-through-time algorithm, the chain rule dictates that for each time step we backpropagate through, the gradient is multiplied by W transpose (the hidden-to-hidden weight matrix). Depending on the magnitude of its singular values, we get either exploding or vanishing gradients (vanishing gradients imply weak dependence on long-term memory).
A remedy for exploding gradients is gradient clipping (see the sketch after this list).
There is no direct solution for vanishing gradients (indirect remedies include weight-initialization schemes or changing the activation function). LSTM is usually preferred when implementing an RNN.
3. Not parallelizable. The autoregressive computational graph dictates that computation within a sequence must proceed step by step during training, and therefore cannot be parallelized across time steps.
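A sketch of the gradient-clipping remedy from item 2, using PyTorch; the model, data, loss, and the max-norm value of 1.0 are placeholders chosen only for illustration:

```python
import torch

# Hypothetical model/optimizer/data; only the clipping call is the point here.
model = torch.nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 15, 10)            # (batch, time, features)

output, h_n = model(x)
loss = output.pow(2).mean()            # stand-in loss for illustration
loss.backward()

# Rescale gradients so their global norm does not exceed max_norm,
# preventing a single update from blowing up when gradients explode.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```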

LSTM
Long Short-Term Memory is an RNN architecture that accounts for long-term memory in sequences, enhancing the RNN's ability to process long sequences.
Intuitively, the cell state stores (long-term) memory, and several gates regulate whether to forget it or update it. The cell state acts as a control signal to the hidden state h, which is analogous to the hidden state in a vanilla RNN and drives the output.
Interpretation of the four gates (“IFOG”):
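One common way to write the four gates (biases omitted; each W acts on the concatenation [h_{t-1}, x_t]; this formulation is standard but not spelled out elsewhere in these notes):
i_t = \sigma(W_i [h_{t-1}, x_t])   (input gate: how much of the new candidate to write into the cell)
f_t = \sigma(W_f [h_{t-1}, x_t])   (forget gate: how much of the previous cell state to keep)
o_t = \sigma(W_o [h_{t-1}, x_t])   (output gate: how much of the cell to expose as the hidden state)
g_t = \tanh(W_g [h_{t-1}, x_t])   (candidate, or "gate gate": the new content itself)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)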

Note: the sigmoid lies between 0 and 1, so it is well suited for gating how much of the input or of the past to carry into the next output.
To see how the LSTM alleviates exploding and vanishing gradients from a mechanical perspective, note that the benefit of keeping a separate state (h) and control signal (c) is that LSTMs can remember information across many time steps. Without the control signal, simple RNNs tend to forget information across time steps due to the vanishing-gradient problem.
In terms of calculus and gradient flow, backpropagation through time now mainly involves the cell state (c), indicated by the red BP path / "gradient highway". The accumulated matrix multiplications with W are avoided.

Note: compare this to ResNet. Isn't this a weaker version of a skip connection? If the forget gate is one, then it works the same as a skip connection.
Challenges of LSTM
1. Training remains inefficient: it is un-parallelizable and memory expensive. The network is still recurrent in nature, and it requires storing the hidden and cell states in memory.
2. Prone to overfitting, and it is difficult to apply the dropout algorithm to curb this issue.
Recall that dropout is a regularization method in which input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training the network.
3. LSTMs became popular because they could mitigate the problem of vanishing gradients. But it turns out they do not remove it completely: information still has to move from cell to cell to be evaluated.
Moreover, the cell has become quite complex, with additional features (such as forget gates) brought into the picture.
Conditional language models
P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}, \ldots, w_1)

How to train them (teacher/student forcing)?


After the model is trained, a "start-of-sequence" token can be used to start the process, and the word generated at each step of the output sequence is used as input at the subsequent time step, perhaps along with other inputs such as an image or a source text.
This same recursive output-as-input process can be used when training the model, but it can
result in problems such as:
Slow convergence.
Model instability.
Poor skill.
Teacher forcing is an approach to improve model skill and stability when training these types
of models.
Teacher forcing is a strategy for training recurrent neural networks that uses ground truth as
input, instead of model output from a prior time step as an input.
Despite being effective and simple, teacher forcing has downsides when deployed, since the ground truth may not be readily available. In such cases we approximate the ground truth with the model's own output at time step t and feed it back into the model, but the inputs the network then sees can be quite different from, or even diverge from, the training distribution. One remedy is to have the network predict a few steps ahead during training (e.g. beam-search heuristics), so it gets accustomed to consuming model outputs as inputs.
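A minimal PyTorch sketch contrasting teacher forcing with feeding the model's own predictions back in; decoder_cell, embed, and out_proj are hypothetical stand-ins for whatever decoder is used, and the toy sizes at the bottom are arbitrary:

```python
import torch

def decode_step(decoder_cell, embed, out_proj, token, h):
    """One decoder step: embed the input token, update the state, emit logits."""
    h = decoder_cell(embed(token), h)
    return out_proj(h), h

def train_step(decoder_cell, embed, out_proj, target_ids, h0, teacher_forcing=True):
    """target_ids: (T,) ground-truth token ids, with position 0 playing the role of <sos>."""
    loss, h = 0.0, h0
    inp = target_ids[0:1]                       # start token
    for t in range(1, target_ids.size(0)):
        logits, h = decode_step(decoder_cell, embed, out_proj, inp, h)
        loss = loss + torch.nn.functional.cross_entropy(logits, target_ids[t:t + 1])
        if teacher_forcing:
            inp = target_ids[t:t + 1]           # feed the ground-truth token
        else:
            inp = logits.argmax(dim=-1)         # feed the model's own prediction
    return loss / (target_ids.size(0) - 1)

# Example wiring with toy sizes (vocabulary 100, embedding 32, hidden 64).
V, E, H = 100, 32, 64
embed = torch.nn.Embedding(V, E)
decoder_cell = torch.nn.GRUCell(E, H)
out_proj = torch.nn.Linear(H, V)
tgt = torch.randint(0, V, (7,))                 # toy target sequence
loss = train_step(decoder_cell, embed, out_proj, tgt, torch.zeros(1, H))
```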
Language metrics
Cross Entropy: a measure of how well the estimated probability distribution over words matches a reference (empirical conditional) distribution. It can also be read as the expected number of bits needed to encode the data when using a code based on the estimated distribution.

Perplexity: Another measurement of how well a probability model predicts a sample, defined as the geometric mean of the inverse probability of a sequence of words:
(p(w_1, \ldots, w_n))^{-1/n} = (p(w_1) \cdot p(w_2 \mid w_1) \cdots p(w_n \mid w_{n-1}, \ldots, w_1))^{-1/n}

Note that perplexity is the exponential of the per-token cross entropy of the sequence.

Note that the perplexity of a discrete uniform distribution over K events is exactly K. The intuition is that, on average, it takes on the order of K trials to correctly guess the outcome of an event from this distribution (e.g. a fair coin gives perplexity 2). Another view: since entropy is the average number of bits needed to encode the information contained in a random variable, exponentiating the entropy gives the effective number of equally likely outcomes.
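A small NumPy check of the relationship between per-token cross entropy and perplexity; the probabilities below are made up:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability) = geometric mean of 1/p."""
    nll = -np.mean(np.log(token_probs))         # per-token cross entropy (in nats)
    return np.exp(nll)

# Hypothetical model probabilities for a 4-token sequence.
print(perplexity([0.25, 0.5, 0.1, 0.4]))

# Sanity check: a uniform distribution over K outcomes has perplexity exactly K.
K = 6
print(perplexity([1.0 / K] * 10))               # -> 6.0 (fair die)
```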
BLEU: The Bi-Lingual Evaluation Understudy (BLEU) score is a metric for assessing the quality of machine translation. For each source sentence, BLEU compares the machine-translation output against an established reference translation for that sentence. It accounts for the n-grams of the output that are covered by the reference, and penalizes, via clipped counts, tokens that occur more often in the output than in the reference.
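A rough pure-Python sketch of the clipped-count (modified n-gram precision) idea described above; it omits the brevity penalty and multi-reference handling of the full BLEU metric, and the example sentences are invented:

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Count candidate n-grams, but clip each count by its count in the reference,
    so repeating a correct token many times is not rewarded."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

reference = "the cat is on the mat".split()
candidate = "the the the cat mat".split()
print(modified_ngram_precision(candidate, reference, n=1))  # 0.8: 'the' counts at most twice
```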
Knowledge Distillation
Knowledge distillation is a technique to compress models and speed up inference. The premise is that we have a pre-trained large network - the Teacher model - that performs very well but may be too slow or expensive to run. The goal is to train a smaller network - the Student model - to perform similar tasks, so that it can be deployed under limited resources, such as mobile phones.
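A sketch of one common distillation loss: a softened KL term that matches the Student to the Teacher, plus the usual hard-label cross entropy. The temperature T, weight alpha, and the random logits are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: match the student's softened distribution to the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # T^2 rescaling as in Hinton et al. (2015)
    # Hard-label term: ordinary cross entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10)              # batch of 8, 10 classes (placeholders)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```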
Word Embeddings
Skip-gram
Train a shallow neural network (an input-to-hidden layer that is purely linear, with no activation, followed by a softmax output layer) to predict the probability that a word appears in a window around the input word. This is a fake task; we do not care about its predictions. We only care about the hidden layer as an encoding of the input word.

The input word is coded via one-hot encoding. If we have 10,000 words, then the input is 10,000-dimensional. If the hidden layer has 300 neurons, as suggested in the Google paper, the skip-gram network has 10,000 x 300 = 3M weights in the input-to-hidden matrix (and as many again in the output layer), which makes gradient descent difficult and the learning process very data hungry. The Word2Vec algorithm uses two tricks to make training this skip-gram feasible (a sketch of negative sampling follows this list):
Subsampling
Negative sampling
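A compact PyTorch sketch of the skip-gram objective with negative sampling (uniform negative sampling for simplicity; the batch of ids and the number of negatives are placeholders, the 10,000 x 300 sizes mirror the discussion above, and subsampling is not shown):

```python
import torch
import torch.nn.functional as F

V, D = 10_000, 300                     # vocabulary size, embedding dimension
in_embed = torch.nn.Embedding(V, D)    # "hidden layer" = input word vectors we keep
out_embed = torch.nn.Embedding(V, D)   # context (output) vectors

def sgns_loss(center_ids, context_ids, num_neg=5):
    """Negative sampling: pull the true (center, context) pair together and push
    a few randomly drawn 'negative' words apart, instead of a full softmax over V."""
    center = in_embed(center_ids)                                        # (B, D)
    pos = out_embed(context_ids)                                         # (B, D)
    neg = out_embed(torch.randint(0, V, (center_ids.size(0), num_neg)))  # (B, K, D)
    pos_score = (center * pos).sum(-1)                                   # (B,)
    neg_score = torch.bmm(neg, center.unsqueeze(-1)).squeeze(-1)         # (B, K)
    return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())

loss = sgns_loss(torch.randint(0, V, (32,)), torch.randint(0, V, (32,)))
```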
Intrinsic/Extrinsic Evaluation
Intrinsic evaluation of word vectors (with the aim of forming word embeddings) is evaluation on specific, intermediate tasks (e.g. analogy completion for question-answering systems, or the "fake task" of predicting neighboring word pairs in the word2vec skip-gram). These subtasks are typically simple and fast to compute, and thereby help us understand the system used to generate the word vectors. An intrinsic evaluation should typically return a number (e.g. cosine distance) that indicates the performance of the word vectors on the evaluation subtask. A few observations on intrinsically evaluated embeddings:
Performance is heavily dependent on the model used for word embedding.
Performance increases with larger corpus sizes.
Performance is lower for extremely low as well as for extremely high dimensional word
vectors because of under-/over-fitting.
Extrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by an embedding technique on the real task at hand. These tasks are typically elaborate (e.g. evaluating answers to questions) and slow to compute. Absent computational-cost considerations, extrinsic evaluation would be the "first best", with intrinsic evaluation serving as a cheaper proxy.
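A toy NumPy illustration of an intrinsic check (cosine similarity on an analogy query); the 2-D "embeddings" are invented purely to show the mechanics:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical tiny embeddings, purely for illustration.
emb = {
    "king":  np.array([0.8, 0.6]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.9]),
    "queen": np.array([0.1, 1.3]),
    "apple": np.array([-0.7, 0.2]),
}

# Analogy query: king - man + woman ~= ?
query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print(best)   # with these toy vectors, "queen" scores highest
```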
t-SNE
t-SNE is an unsupervised, non-linear technique primarily used for data exploration and for visualizing high-dimensional data; it aims to give a feel/intuition for how the data is arranged in a high-dimensional space.
It can be loosely viewed as a non-linear counterpart to PCA. Recall that PCA is a linear dimensionality-reduction technique that seeks to maximize variance and preserves large pairwise distances. This can lead to poor visualizations, especially when dealing with non-linear manifold structures (think: cylinder, ball, curve, etc.).
For example, PCA is not able to separate red dots and blue dots from two swiss roll
manifolds:

The performances of PCA vs t-SNE on MNIST:


Intuitively, PCA seeks to preserve large pairwise distances (dotted line), while t-SNE is
concerned with preserving small distances (solid line).
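A minimal scikit-learn comparison in the same spirit, run on the small digits dataset; the parameter choices (perplexity=30, init="pca") are common defaults rather than anything prescribed by the notes:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 8x8 digit images as 64-dim vectors

X_pca = PCA(n_components=2).fit_transform(X)   # linear: preserves large-scale variance
X_tsne = TSNE(n_components=2, perplexity=30,
              init="pca", random_state=0).fit_transform(X)  # non-linear: preserves local neighborhoods

# X_pca and X_tsne are (n_samples, 2) arrays ready for a scatter plot,
# e.g. plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y).
```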

Encoder-decoder & seq2seq


Beam Search
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. The input parameter is the width - the maximum number of beams kept at each iteration. Because the width restricts the number of candidates, it is still a greedy, approximate procedure (with width = 1 it reduces to plain greedy decoding). Nevertheless it is considered an improvement over simply taking the argmax of the softmax output layer at each time step.
An example of beam search with width = 2:
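Since the width-2 figure is not reproduced here, a generic beam-search sketch over a step function that returns next-token log-probabilities; step_fn, the token ids, and the toy bigram table are placeholders:

```python
import math

def beam_search(step_fn, start_token, end_token, width=2, max_len=20):
    """Keep the `width` highest-scoring partial sequences at each step.

    step_fn(sequence) -> dict mapping next_token -> log-probability.
    """
    beams = [([start_token], 0.0)]                       # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:                     # finished beams are kept as-is
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Prune: keep only the `width` best candidates (this is the greedy part).
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams

# Toy step function: a fixed bigram table with start token 0 and end token 3.
table = {0: {1: math.log(0.6), 2: math.log(0.4)},
         1: {2: math.log(0.7), 3: math.log(0.3)},
         2: {1: math.log(0.2), 3: math.log(0.8)}}
print(beam_search(lambda seq: table[seq[-1]], start_token=0, end_token=3, width=2))
```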
Neural attention
Self-Attention is an attention mechanism relating different positions of a single
sequence in order to compute a representation of the same sequence. It has been shown
to be very useful in machine reading, abstractive summarization, or image description
generation.
Cross-Attention: the source comes from the encoder, and the input comes from the decoder. For each decoder time step t, attention is evaluated between h_t and each step of the encoder source s. For example, in the Transformer model with an encoder-decoder structure, the key and value vectors in the decoder's cross-attention layer are taken from the encoder.
Masked attention: used in the Transformer decoder. For the query at time step t, attention should only consider keys at positions up to t.
Using attention as a substitute for recurrent, autoregressive architectures greatly reduces the number of sequential operations and improves the parallelizability of training.
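A NumPy sketch of scaled dot-product attention with an optional causal mask, the common building block behind the self-, cross-, and masked variants above; the sequence lengths and dimensions are toy values:

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (n_q, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (n_q, n_k) similarity matrix
    if causal:                                    # masked attention: query t may only see keys <= t
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # decoder states, length 5
enc = rng.normal(size=(7, 16))                    # encoder states, length 7

self_attn = attention(x, x, x, causal=True)       # masked self-attention (decoder)
cross_attn = attention(x, enc, enc)               # cross-attention: queries from decoder, K/V from encoder
```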
Soft Attention: the alignment weights are learned and placed “softly” over all patches in
the source image; essentially the same type of attention as in Bahdanau et al., 2015.
Pro: the model is smooth and differentiable.
Con: expensive when the source input is large.
Hard Attention: only selects one patch of the image to attend to at a time.
Pro: less calculation at the inference time.
Con: the model is non-differentiable and requires more complicated techniques such
as variance reduction or reinforcement learning to train. (Luong, et al., 2015)
Transformer
Byte pair encoding
Byte pair encoding (BPE) is a technique to reduce the size of (output) vocabularies in
machine translation tasks, so as to decrease training time and computational costs (e.g. avoid
large matrices). It is an example of a subword algorithm, which aims at extracting common
substring patterns in words to model language morphology (how words are formed). The
extreme case for subwording would be at the alphabet level, which is not practical because
this creates exceedingly long sequences.
Instead, BPE looks for frequent patterns in words. Starting from the word represented as a byte/character sequence, it iteratively finds the most frequently occurring adjacent pair and merges it into a single token, repeating until no adjacent pair repeats in the sequence (or a target vocabulary size is reached).
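A toy sketch of the merge loop on a single character sequence, matching the description above (real BPE implementations work over a corpus vocabulary with counts and stop at a target number of merges; this only shows the core idea):

```python
from collections import Counter

def bpe_compress(seq):
    """Repeatedly merge the most frequent adjacent pair until no pair repeats."""
    seq = list(seq)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:                      # every adjacent pair is now unique: stop
            break
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])   # replace the pair with one token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

print(bpe_compress("lower lowest low"))
# e.g. -> ['lowe', 'r', ' ', 'lowe', 's', 't', ' ', 'low']
# (merging 'l'+'o', 'lo'+'w', 'low'+'e'; ties broken by first occurrence)
```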

Transformers (architecture details, positional encoding, computational complexity)


Byte pair encoding.
Computation questions:
Please be familiar with how to calculate the sizes of the different self-attention matrices in the encoder, in the decoder, and between the encoder and decoder (often called cross-attention), based on "Attention Is All You Need".
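As an illustrative worked example of such a calculation (symbols are mine: n = source length, m = target length, h heads, per-head dimensions d_k = d_v = d_model / h, following the paper's conventions):
Encoder self-attention, per head: Q and K are n x d_k and V is n x d_v; the score matrix QK^T is n x n; the per-head output is n x d_v.
Decoder masked self-attention, per head: Q and K are m x d_k and V is m x d_v; the score matrix is m x m, with positions above the diagonal masked.
Cross-attention, per head: Q is m x d_k (from the decoder), while K is n x d_k and V is n x d_v (from the encoder); the score matrix is m x n and the per-head output is m x d_v.
For instance, with d_model = 512 and h = 8 we get d_k = d_v = 64, so for a source of n = 10 tokens the encoder score matrix is 10 x 10 per head.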
