RNN
An RNN is analogous to an autoregressive model: the current output y_t is a function of the
current input and the lagged hidden state. The gist is that parameter sharing enables inputs
and outputs of arbitrary length, and the autoregressive hidden state allows context to be taken
into account when forming the output. Note that parameter sharing implicitly assumes a
stationary distribution over the sequence input/output space.
For example, as per Goodfellow, a one-to-one RNN classifier can be represented as:
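Below is a minimal sketch in Goodfellow's notation: U, W, V are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, and b, c are biases.

$$
\begin{aligned}
a_t &= b + W h_{t-1} + U x_t \\
h_t &= \tanh(a_t) \\
o_t &= c + V h_t \\
\hat{y}_t &= \mathrm{softmax}(o_t)
\end{aligned}
$$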
(See the visual from assignment 4 and the class slides for accompanying diagrams.)
RNN architectures support end-to-end processing of inputs and outputs regardless of their
shapes and lengths, allowing them to solve tasks of various kinds:
Many-to-one tasks, e.g. sentiment analysis
One-to-many tasks, e.g. image captioning
Many-to-many (one-to-one correspondence between input and output), e.g. part-of-speech
tagging (whether a word is a noun/verb/adjective, etc.)
Many-to-many (no one-to-one correspondence per se), e.g. translation
Challenges of RNN:
1. Long-term dependencies. To the extent that the output depends only on the final hidden
state (e.g. many-to-one tasks), RNNs have difficulty retaining long-term memory such as
context from early in the sequence.
LSTM adds a cell state to store long-term memory.
2. Vanishing and exploding gradients. As we see in the backpropagation through time
algorithm, the chain rule dictates that at each time step we backpropagate through, we
multiply the gradient by the transpose of W (the hidden-to-hidden weight matrix). Depending
on the magnitude of its singular values, we get either exploding or vanishing gradients
(vanishing gradients imply weak dependence on long-term memory).
A remedy for exploding gradients is gradient clipping (a minimal sketch follows this list).
There is no direct solution for vanishing gradients (some indirect remedies include
weight-initialization schemes or changing the activation function); an LSTM is usually
preferred when implementing an RNN.
3. Not parallelizable. The autoregressive computational graph dictates that the hidden states
within a sequence must be computed one time step at a time, so training cannot be
parallelized across time steps.
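A minimal sketch of gradient clipping by global norm, in PyTorch; the threshold max_norm is an arbitrary illustrative choice, and the function assumes gradients have already been computed.

```python
import torch

def clip_gradients(parameters, max_norm=5.0):
    """Rescale all gradients so their global L2 norm is at most max_norm.
    This bounds the size of the update step without changing its direction."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)  # in-place rescale of each parameter's gradient
    return total_norm

# Equivalent built-in: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```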
LSTM
Long Short-Term Memory (LSTM) is an RNN architecture that accounts for long-term memory in
sequences, enhancing the RNN's ability to process long sequences.
Intuitively, the cell state stores (long-term) memory, and several gates control whether to
forget or update it. The cell state in turn acts as a control signal for the hidden state h, which
is analogous to the hidden state in a vanilla RNN and drives the output.
Interpretation of the four gates (“IFOG”):
Note: the sigmoid lies between 0 and 1, so it is well suited to deciding how much of the input
or of the past state to pass on to the next output.
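A minimal sketch of the standard formulation (the weight/bias names are my own labels; i = input gate, f = forget gate, o = output gate, g = candidate cell update, \odot = elementwise product):

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$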
To see how LSTM alleviates exploding and vanishing gradients from the mechanical
perspective, note that the benefit of using separate state (h) and control signals (c) is that
LSTMs have the ability to remember information across many time steps. Without the control
signals, simple RNNs tend to forget information across time steps due to the vanishing
gradient problem.
In terms of calculus and gradient flow, note that backpropagation through time can now flow
through the cell state (c) alone - the red BP path / "gradient highway" in the slides. The
accumulated matrix multiplication with W is avoided.
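Concretely, using the cell update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ from the sketch above, the direct path through the cell state has Jacobian

$$
\frac{\partial c_t}{\partial c_{t-1}} = \mathrm{diag}(f_t),
$$

so backpropagating over many steps multiplies by elementwise forget-gate values rather than repeatedly by $W^{\top}$; as long as the forget gate stays close to 1, the gradient along this path neither explodes nor vanishes. (This ignores the indirect paths through the gates themselves, which is why the problem is alleviated rather than fully removed.)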
Note: compare this to ResNet - isn't this a weaker version of a skip connection? If the forget
gate equals one, this path works exactly like a skip connection.
Challenges of LSTM
1. Training remains inefficient: it is not parallelizable and is memory expensive. The network
is still recurrent in nature, and it requires storing both the hidden and cell states in memory.
2. Prone to overfitting and it is difficult to apply the dropout algorithm to curb this issue.
Recall that dropout is a regularization method where input and recurrent connections
to LSTM units are probabilistically excluded from activation and weight updates
while training a network.
3. LSTMs became popular because they could mitigate the problem of vanishing gradients.
But it turns out they fail to remove it completely: information still has to flow from cell to
cell across time steps to be evaluated.
Moreover, the cell has become quite complex with the additional machinery (such as the
forget gate) brought into the picture.
Conditional language models
$$
P(w_1, \dots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$
Perplexity: another measure of how well a probability model predicts a sample, defined as the
inverse probability of the word sequence normalized by its length (the geometric mean of the
inverse per-word conditional probabilities):
$$
\mathrm{PP}(w_1, \dots, w_n) = p(w_1, \dots, w_n)^{-1/n} = \bigl( p(w_1)\, p(w_2 \mid w_1) \cdots p(w_n \mid w_1, \dots, w_{n-1}) \bigr)^{-1/n}
$$
Note that the perplexity of a discrete uniform distribution over K events is exactly K. The
intuition is that a model with perplexity K is, at each step, as uncertain as if it were choosing
uniformly among K outcomes (e.g. a fair coin has perplexity 2). Another view: since entropy is
the average number of bits needed to encode the information contained in a random variable,
exponentiating the entropy gives the effective number of equally likely outcomes, which is the
perplexity.
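A minimal sketch of the formula above in Python; token_probs stands for the per-token conditional probabilities $p(w_i \mid w_1, \dots, w_{i-1})$ assigned by the model (the name is mine).

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities.
    Computed in log space for numerical stability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)  # log p(w_1, ..., w_n)
    return math.exp(-log_prob / n)                    # p(w_1, ..., w_n)^(-1/n)

# A uniform model over K = 4 outcomes has perplexity exactly 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0
```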
BLEU: The Bilingual Evaluation Understudy (BLEU) score is a metric to assess the quality of
machine translation. For each source sentence, BLEU compares the machine translation
output against one or more established reference translations. It rewards n-grams of the
output that appear in a reference, clips the credit for each n-gram at the number of times it
occurs in the reference (penalizing an incorrect number of occurrences), and applies a brevity
penalty to outputs that are too short.
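A simplified sketch of the unigram case; real BLEU combines clipped precisions for n = 1 to 4 with a geometric mean and supports multiple references, so the function below is illustrative only.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision times a brevity penalty.
    candidate and reference are non-empty lists of tokens."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Credit each candidate token at most as many times as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

print(bleu1("the cat sat on the mat".split(), "the cat is on the mat".split()))  # ~0.83
```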
Knowledge Distillation
Knowledge distillation is a technique to compress models and speed up inference. The premise
is that we have a pre-trained large network - the Teacher model - that performs very well but
might be too slow or expensive to run. The goal is to train a smaller network - the Student
model - to perform a similar task so that it can be deployed under limited resources, such as
on mobile phones.
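A minimal sketch of a common distillation loss in PyTorch; the temperature T, the mixing weight alpha, and the KL soft-target term are standard choices of mine, not prescribed by the notes.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Mix of (1) KL divergence between temperature-softened teacher and student
    distributions and (2) ordinary cross-entropy on the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 keeps the soft-target gradients comparable in scale across temperatures.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```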
Word Embeddings
Skip-gram
Train a neural network (input → hidden layer with only a linear projection, no activation →
output) to predict the probability that a word appears in the context around the input word.
This is a "fake" task: we do not care about the predictions themselves. We only care about the
hidden layer as an encoding of the input word.
The input word is coded via one-hot encoding. If we have 10,000 words, then the input is
10,000-dimensional. If the hidden layer has 300 neurons, as suggested in the Google paper,
each weight matrix of the skip-gram network has 10,000 × 300 = 3M weights, which makes
gradient descent difficult and the learning process very data hungry. The Word2Vec algorithm
utilizes two tricks to facilitate the training of this skip-gram model (a sketch of negative
sampling follows the list below):
Subsampling
Negative sampling
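A minimal sketch of skip-gram with negative sampling in PyTorch; class and variable names are mine, and negative_ids are assumed to be drawn from a noise distribution such as the unigram distribution raised to the 3/4 power.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Skip-gram with negative sampling: two embedding tables, one for center
    words and one for context words; only the sampled rows are updated."""
    def __init__(self, vocab_size=10_000, dim=300):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)   # the embeddings we keep
        self.context = nn.Embedding(vocab_size, dim)

    def forward(self, center_ids, context_ids, negative_ids):
        v = self.center(center_ids)                               # (B, dim)
        u_pos = self.context(context_ids)                         # (B, dim)
        u_neg = self.context(negative_ids)                        # (B, K, dim)
        pos_score = (v * u_pos).sum(dim=1)                        # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)   # (B, K)
        # Pull true (center, context) pairs together; push sampled negatives apart.
        loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=1)
        return loss.mean()
```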
Intrinsic/Extrinsic Evaluation
Intrinsic evaluation of word vectors is evaluation on specific, intermediate subtasks (e.g.
analogy completion for question-answering systems, or the "fake task" of predicting
neighboring word pairs in the word2vec skip-gram). These subtasks are typically simple and
fast to compute, and they help us understand the system used to generate the word vectors.
An intrinsic evaluation should typically return a number (e.g. based on cosine distance) that
indicates the performance of the word vectors on the evaluation subtask. A few observations
with intrinsically evaluated embeddings (an analogy-scoring sketch follows this list):
Performance is heavily dependent on the model used for word embedding.
Performance increases with larger corpus sizes.
Performance is lower for extremely low as well as for extremely high dimensional word
vectors because of under-/over-fitting.
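A minimal sketch of an intrinsic analogy test with cosine similarity; emb is an assumed dict mapping each word to a 1-D NumPy vector, and the function name is mine.

```python
import numpy as np

def analogy(emb, a, b, c, topk=1):
    """Answer 'a is to b as c is to ?' by ranking words by cosine similarity
    to the vector b - a + c."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    scores = {}
    for word, vec in emb.items():
        if word in (a, b, c):  # exclude the query words themselves
            continue
        scores[word] = float(vec @ query) / np.linalg.norm(vec)
    return sorted(scores, key=scores.get, reverse=True)[:topk]

# e.g. analogy(emb, "man", "king", "woman") should rank "queen" near the top.
```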
Extrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by
an embedding technique on the real task at hand. These tasks are typically elaborate (e.g.
evaluating answers to questions) and slow to compute. Absent computation-cost
considerations, extrinsic evaluation would be the "first best", with intrinsic evaluation serving
as the cheaper proxy.
t-SNE
t-SNE is an unsupervised, non-linear technique primarily used for data exploration and
visualizing high-dimensional data, which aims at providing a feel/intuition of how the data
is arranged in a high-dimensional space.
It can be considered a non-linear generalization of PCA. Recall that PCA is a linear
dimension-reduction technique that seeks to maximize variance and preserves large
pairwise distances. This can lead to poor visualization, especially when dealing with non-
linear manifold structures (think of manifolds such as a cylinder, ball, or curve).
For example, PCA is not able to separate red dots and blue dots from two swiss roll
manifolds:
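A minimal sketch of how one might set up such a comparison with scikit-learn (the dataset construction, offsets, and t-SNE parameters are illustrative choices of mine, not from the notes):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Two swiss-roll manifolds, one offset from the other, labeled red/blue.
X1, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)
X2, _ = make_swiss_roll(n_samples=500, noise=0.05, random_state=1)
X2 = X2 + np.array([5.0, 0.0, 5.0])
X = np.vstack([X1, X2])
labels = np.array([0] * 500 + [1] * 500)

# Linear projection (PCA) vs. non-linear embedding (t-SNE).
pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, cmap="bwr", s=5)
axes[0].set_title("PCA")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, cmap="bwr", s=5)
axes[1].set_title("t-SNE")
plt.show()
```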