Sequence Models
This is the fifth and final course of the deep learning specialization at Coursera which is moderated by deeplearning.ai. The course is taught by Andrew
Ng.
Table of contents
• Sequence Models
o Table of contents
o Course summary
o Recurrent Neural Networks
▪ Why sequence models
▪ Notation
▪ Recurrent Neural Network Model
▪ Backpropagation through time
▪ Different types of RNNs
▪ Language model and sequence generation
▪ Sampling novel sequences
▪ Vanishing gradients with RNNs
▪ Gated Recurrent Unit (GRU)
▪ Long Short Term Memory (LSTM)
▪ Bidirectional RNN
▪ Deep RNNs
▪ Back propagation with RNNs
o Natural Language Processing & Word Embeddings
▪ Introduction to Word Embeddings
▪ Word Representation
▪ Using word embeddings
▪ Properties of word embeddings
▪ Embedding matrix
▪ Learning Word Embeddings: Word2vec & GloVe
▪ Learning word embeddings
▪ Word2Vec
▪ Negative Sampling
▪ GloVe word vectors
▪ Applications using Word Embeddings
▪ Sentiment Classification
▪ Debiasing word embeddings
o Sequence models & Attention mechanism
▪ Various sequence to sequence architectures
▪ Basic Models
▪ Picking the most likely sentence
▪ Beam Search
▪ Refinements to Beam Search
▪ Error analysis in beam search
▪ BLEU Score
▪ Attention Model Intuition
▪ Attention Model
▪ Speech recognition - Audio data
▪ Speech recognition
▪ Trigger Word Detection
o Extras
▪ Machine translation attention model (From notebooks)
Course summary
Here is the course summary as it's given on the course link:
This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms
are working far better than just two years ago, and this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots,
machine translation, natural language understanding, and many others.
You will:
• Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants such as GRUs and LSTMs.
• Be able to apply sequence models to natural language problems, including text synthesis.
• Be able to apply sequence models to audio applications, including speech recognition and music synthesis.
This is the fifth and final course of the Deep Learning Specialization.
• Sequence Models like RNN and LSTMs have greatly transformed learning on sequences in the past few years.
• Examples of sequence data in applications:
o Speech recognition (sequence to sequence):
▪ X: wave sequence
▪ Y: text sequence
o Music generation (one to sequence):
▪ X: nothing or an integer
▪ Y: wave sequence
o Sentiment classification (sequence to one):
▪ X: text sequence
▪ Y: integer rating from one to five
o DNA sequence analysis (sequence to sequence):
▪ X: DNA sequence
▪ Y: DNA Labels
o Machine translation (sequence to sequence):
▪ X: text sequence (in one language)
▪ Y: text sequence (in other language)
o Video activity recognition (sequence to one):
▪ X: video frames
▪ Y: label (activity)
o Named entity recognition (sequence to sequence):
▪ X: text sequence
▪ Y: label sequence
▪ Can be used by search engines to index different types of words inside a text.
• All of these problems with different inputs and outputs (sequence or not) can be addressed as supervised learning with labeled data X, Y as the training set.
Notation
• In this section we will discuss the notation that we will use throughout the course.
• Motivating example:
• We will index the first element of x by x<1>, the second x<2> and so on.
o x<1> = Harry
o x<2> = Potter
• Similarly, we will index the first element of y by y<1>, the second y<2> and so on.
o y<1> = 1
o y<2> = 1
• Tx is the size of the input sequence and Ty is the size of the output sequence.
• x(i)<t> is the t-th element of the input sequence of the i-th training example. Similarly, y(i)<t> is the t-th element of the output sequence of the i-th training example.
• Tx(i) is the input sequence length for training example i. It can be different across examples. Similarly, Ty(i) is the length of the output sequence of the i-th training example.
• Representing words:
o We will now work in this course with NLP which stands for natural language processing. One of the challenges of NLP is how can we
represent a word?
i. We need a vocabulary list that contains all the words in our target sets.
▪ Example:
▪ [a ... And ... Harry ... Potter ... Zulu]
▪ Each word will have a unique index that it can be represented with.
▪ The sorting here is in alphabetical order.
▪ Vocabulary sizes in modern applications are from 30,000 to 50,000. 100,000 is not uncommon. Some of the bigger companies
use even a million.
▪ To build the vocabulary list, you can read all the texts you have and take the m most frequent words, or search online for the m most common words.
ii. Create a one-hot encoding sequence for each word in your dataset given the vocabulary you have created.
▪ While converting, what if we meet a word that's not in your vocabulary?
▪ We can add a token to the vocabulary named <UNK>, which stands for unknown word, and use its index for the one-hot vector.
o Full example:
• The goal, given this representation for x, is to learn a mapping to the target output y using a sequence model, treating it as a supervised learning problem.
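• As a small illustration, here is a minimal sketch (assuming numpy and a made-up toy vocabulary) of converting a sentence into a sequence of one-hot vectors, mapping out-of-vocabulary words to <UNK>:
```python
import numpy as np

# Toy vocabulary, assumed for illustration; real vocabularies have 30,000+ words.
vocab = ["a", "and", "harry", "potter", "zulu", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a one-hot column vector for a word; out-of-vocabulary words map to <UNK>."""
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vec = np.zeros((len(word_to_index), 1))
    vec[idx] = 1.0
    return vec

# x<1>, x<2>, ... for a sentence; "Hermione" is not in the toy vocabulary, so it becomes <UNK>
sentence = "Harry Potter and Hermione".split()
x = [one_hot(w, word_to_index) for w in sentence]
```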
• Why not use a standard network for sequence tasks? There are two problems:
o Inputs, outputs can be different lengths in different examples.
▪ This could be worked around in standard NNs by padding to the maximum length, but it's not a good solution.
o Doesn't share features learned across different positions of text/sequence.
▪ Using feature sharing like in CNNs can significantly reduce the number of parameters in your model. That's what we will do in RNNs.
• A recurrent neural network doesn't have either of the two mentioned problems.
• Let's build an RNN that solves the named entity recognition task:
o In this problem Tx = Ty. In other problems where they aren't equal, the RNN architecture may be different.
o a<0> is usually initialized with zeros, but some others may initialize it randomly in some cases.
o There are three weight matrices here: Wax, Waa, and Wya with shapes:
▪ Wax: (NoOfHiddenNeurons, nx)
▪ Waa: (NoOfHiddenNeurons, NoOfHiddenNeurons)
▪ Wya: (ny, NoOfHiddenNeurons)
• The weight matrix Waa governs the memory the RNN is trying to maintain from the previous time steps.
• A lot of papers and books write the same architecture this way:
o It's harder to interpret. It's easier to unroll this drawing into the unrolled version.
• In the discussed RNN architecture, the current output ŷ<t> depends on the previous inputs and activations.
• Let's take the example 'He said, "Teddy Roosevelt was a great president"'. In this example Teddy is a person's name, but we only know that from the word president that comes after Teddy, not from He and said that come before it.
• So a limitation of the discussed architecture is that it cannot use information from elements that appear later in the sequence. To address this problem we will later discuss Bidirectional RNNs (BRNNs).
• Now let's discuss the forward propagation equations of the discussed architecture:
o The activation function for a is usually tanh or ReLU, and the one for y depends on your task (e.g. sigmoid or softmax). In the named entity recognition task we will use sigmoid because we only have two classes.
• In order to help us develop complex RNN architectures, the last equations need to be simplified a bit.
• Simplified RNN notation:
o Where wa, ba, wy, and by are shared across each element in a sequence.
• We will use the cross-entropy loss function:
o Where the first equation is the loss for one example and the loss for the whole sequence is given by the summation over all the
calculated single example losses.
• Graph with losses:
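• A minimal numpy sketch of these forward equations and the summed cross-entropy loss (toy shapes, random weights, and dummy labels; a softmax output is assumed for a multi-class task):
```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y_hat<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

# Toy sizes: n_x input features, n_a hidden units, n_y output classes, T_x time steps
n_x, n_a, n_y, T_x = 5, 4, 3, 6
rng = np.random.default_rng(0)
Wax = rng.normal(scale=0.1, size=(n_a, n_x))
Waa = rng.normal(scale=0.1, size=(n_a, n_a))
Wya = rng.normal(scale=0.1, size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

a = np.zeros((n_a, 1))                        # a<0> initialized with zeros
loss = 0.0
for t in range(T_x):
    x_t = rng.normal(size=(n_x, 1))           # stand-in for the one-hot input x<t>
    a, y_hat = rnn_cell_forward(x_t, a, Wax, Waa, Wya, ba, by)
    true_class = t % n_y                      # dummy label, just for the sketch
    loss += -np.log(y_hat[true_class, 0])     # cross-entropy summed over the time steps
```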
• The backpropagation here is called backpropagation through time because we pass gradients from one sequence element back to the previous one, as if going backwards in time.
• So far we have seen only one RNN architecture, in which Tx equals Ty. In some other problems they may not be equal, so we need different architectures.
• The ideas in this section were inspired by Andrej Karpathy's blog. Mainly this image has all the types:
o Note that starting from the second time step we feed the generated output back into the network.
• There is another interesting Many-to-Many architecture. In applications like machine translation, the input and output sequences have different lengths in most cases, so an alternative Many-to-Many architecture that fits translation would be as follows:
o There are encoder and decoder parts in this architecture. The encoder encodes the input sequence into a single representation and feeds it to the decoder to generate the outputs. Encoder and decoder have different weight matrices.
• Summary of RNN types:
• There is another architecture which is the attention architecture which we will talk about in chapter 3.
• RNNs do very well in language model problems. In this section, we will build a language model using RNNs.
• What is a language model
o Let's say we are solving a speech recognition problem and someone says a sentence that can be interpreted as one of two sentences:
▪ The apple and pair salad
▪ The apple and pear salad
o Pair and pear sound exactly the same, so how would a speech recognition application choose between the two?
o That's where the language model comes in. It gives a probability for the two sentences and the application decides the best based on
this probability.
• The job of a language model is to give a probability of any given sequence of words.
• How to build language models with RNNs?
o The first thing is to get a training set: a large corpus of target language text.
o Then tokenize this training set by getting the vocabulary and then one-hot each word.
o Add an end-of-sentence token <EOS> to the vocabulary and append it to each converted sentence. Also, use the <UNK> token for unknown words.
• Given the sentence "Cats average 15 hours of sleep a day. <EOS>"
o In training time we will use this:
o The loss function is defined by cross-entropy loss:
• After a sequence model is trained as a language model, you can check what it has learned by sampling novel sequences from it.
• Let's see the steps of how we can sample a novel sequence from a trained language model (a sketch follows the list):
i. Given this model:
ii. We first pass a<0> = zeros vector, and x<1> = zeros vector.
iii. Then we choose a prediction randomly from the distribution obtained by ŷ<1>. For example it could be "The".
▪ In numpy this can be implemented using: numpy.random.choice(...)
▪ This is the step that gives you a random beginning of the sentence each time you sample a novel sequence.
iv. We pass the last predicted word together with the calculated a<1> to the next time step.
v. We keep repeating steps 3 and 4 for a fixed length or until we get the <EOS> token.
vi. You can reject any <UNK> token if you don't want it in your output.
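• A minimal sketch of this sampling loop, assuming a hypothetical forward_step(x, a) function that runs one step of the trained RNN and returns the softmax distribution ŷ and the next activation:
```python
import numpy as np

def sample_sequence(forward_step, n_a, vocab_size, eos_index, unk_index, max_len=50):
    """Sample word indices from a trained RNN language model.

    forward_step(x, a) -> (y_hat, a_next) is assumed to run one step of the trained
    network, where y_hat is the softmax distribution over the vocabulary (hypothetical API).
    """
    a = np.zeros((n_a, 1))              # a<0> = zeros
    x = np.zeros((vocab_size, 1))       # x<1> = zeros
    indices = []
    while len(indices) < max_len:
        y_hat, a = forward_step(x, a)
        idx = np.random.choice(vocab_size, p=y_hat.ravel())      # sample from the distribution
        while idx == unk_index:                                  # optionally reject <UNK>
            idx = np.random.choice(vocab_size, p=y_hat.ravel())
        indices.append(idx)
        if idx == eos_index:            # stop at <EOS>
            break
        x = np.zeros((vocab_size, 1))   # feed the sampled word back as the next input
        x[idx] = 1.0
    return indices
```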
• So far we have built a word-level language model. It's also possible to implement a character-level language model.
• In the character-level language model, the vocabulary will contain [a-zA-Z0-9], punctuation, special characters and possibly special tokens like <EOS>.
• Character-level language model has some pros and cons compared to the word-level language model
o Pros:
a. There will be no <UNK> token - it can create any word.
o Cons:
a. The main disadvantage is that you end up with much longer sequences.
b. Character-level language models are not as good as word-level language models at capturing long-range dependencies between the earlier and later parts of the sentence.
c. They are also more computationally expensive and harder to train.
• The trend Andrew has seen in NLP is that for the most part a word-level language model is still used, but as computers get faster there are more and more applications where people are, at least in some special cases, starting to look at character-level models. They are also used in specialized applications where you need to deal with unknown or out-of-vocabulary words a lot, or where you have a more specialized vocabulary.
• One of the problems with naive RNNs is that they run into the vanishing gradient problem.
• An RNN that processes a sequence with 10,000 time steps effectively has 10,000 layers, which is very hard to optimize.
• Let's take an example. Suppose we are working on a language modeling problem and there are two sequences that the model tries to learn:
• What we need to learn here is that "was" goes with "cat" and that "were" goes with "cats". The naive RNN is not very good at capturing very long-term dependencies like this.
• As we discussed for deep neural networks, deeper networks run into the vanishing gradient problem. That also happens with RNNs over long sequences.
o To compute the gradient at the word "was", we need to backpropagate through everything before it. Multiplying many small numbers (fractions) makes the gradient vanish, while multiplying many large numbers makes it explode.
o Therefore some of your weights may not be updated properly.
• In the problem we described, this means it's hard for the network to carry the information about "cat" all the way to where "was" is predicted. So in this case, the network won't identify singular/plural nouns well enough to choose the right verb form, was/were.
• In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve
toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. http://colah.github.io/posts/2015-08-Understanding-
LSTMs/
• Vanishing gradients problem tends to be the bigger problem with RNNs than the exploding gradients problem. We will discuss how to solve it
in next sections.
• Exploding gradients can be easily detected when your weight values become NaN. One of the ways to solve the exploding gradient problem is to apply gradient clipping: if your gradient exceeds some threshold, re-scale the gradient vector so that it is not too big, i.e. clip it according to some maximum value (see the sketch below).
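• A minimal sketch of element-wise gradient clipping with numpy (clipping by the global norm is another common variant):
```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every gradient element into [-max_value, max_value] to avoid exploding gradients."""
    return {name: np.clip(grad, -max_value, max_value) for name, grad in gradients.items()}

# Hypothetical usage: gradients is a dict of parameter name -> gradient array
gradients = {"dWax": np.array([[10.0, -20.0]]), "dWaa": np.array([[0.5, -0.1]])}
clipped = clip_gradients(gradients, max_value=5.0)   # dWax becomes [[5., -5.]]
```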
• Extra:
o Solutions for the Exploding gradient problem:
▪ Truncated backpropagation.
▪ Don't propagate gradients all the way back through the sequence.
▪ Not optimal - you won't update all the weights properly.
▪ Gradient clipping.
o Solution for the Vanishing gradient problem:
▪ Weight initialization.
▪ Like He initialization.
▪ Echo state networks.
▪ Use LSTM/GRU networks.
▪ Most popular.
▪ We will discuss it next.
• GRU is an RNN type that can help solve the vanishing gradient problem and can remember the long-term dependencies.
• The basic RNN unit can be visualized to be like this:
• Each GRU unit has a new variable C, the memory cell. It decides whether to memorize something or not.
• Let's take the cat sentence example and use it to understand these equations:
o We will suppose that U is 0 or 1 and is a bit that tells us if the singular word needs to be memorized.
Word      Update gate (U)   Cell memory (C)
the       0                 val
cat       1                 new_val
which     0                 new_val
already   0                 new_val
...       0                 new_val
full      ..                ..
• Because the update gate U can be very close to 0 (like 0.00001), c<t> ≈ c<t-1>, so the memory is preserved across many time steps and GRUs don't suffer from the vanishing gradient problem.
• Shapes:
• What has been described so far is the simplified GRU unit. Let's now describe the full one:
o The full GRU contains a new gate that is used when calculating the candidate C~<t>. The gate tells you how relevant C<t-1> is to C<t>.
o Equations:
o Shapes are the same
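• A minimal numpy sketch of one full GRU step following these equations (parameter names and shapes are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, params):
    """One full GRU step: c<t> = Gu * c~<t> + (1 - Gu) * c<t-1>, with a<t> = c<t>."""
    concat = np.vstack([c_prev, x_t])                           # [c<t-1>, x<t>]
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])     # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])     # relevance gate
    c_tilde = np.tanh(params["Wc"] @ np.vstack([gamma_r * c_prev, x_t]) + params["bc"])  # candidate
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev          # keep the old memory or overwrite it
    return c_t                                                  # in the GRU, a<t> = c<t>

# Toy usage with made-up sizes
n_x, n_c = 5, 4
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=(n_c, n_c + n_x)) for k in ("Wu", "Wr", "Wc")}
params.update({b: np.zeros((n_c, 1)) for b in ("bu", "br", "bc")})
c = gru_cell_forward(rng.normal(size=(n_x, 1)), np.zeros((n_c, 1)), params)
```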
• So why do we use these architectures? Why don't we change them? How do we know they will work? Why not add another gate? Why not use the simpler GRU instead of the full GRU? Well, researchers have experimented over the years with many different versions of these architectures, also addressing the vanishing gradient problem, and they have found that full GRUs are among the best RNN architectures for many different problems. You can make your own design, but keep in mind that GRUs and LSTMs are the standards.
• LSTM - the other type of RNN that can enable you to account for long-term dependencies. It's more powerful and general than the GRU.
• In an LSTM, C<t> != a<t>.
• Here are the equations of an LSTM unit:
• In the GRU we have an update gate U, a relevance gate r, and a candidate cell variable C~<t>, while in the LSTM we have an update gate U (sometimes called the input gate I), a forget gate F, an output gate O, and a candidate cell variable C~<t>.
• Drawings (inspired by http://colah.github.io/posts/2015-08-Understanding-LSTMs/):
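• A minimal numpy sketch of one LSTM step following these equations (parameter names are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, params):
    """One LSTM step with update (input), forget, and output gates; note that c<t> != a<t>."""
    concat = np.vstack([a_prev, x_t])                         # [a<t-1>, x<t>]
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])   # update (input) gate
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])   # forget gate
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])   # output gate
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])   # candidate memory
    c_t = gamma_u * c_tilde + gamma_f * c_prev                # new memory cell
    a_t = gamma_o * np.tanh(c_t)                              # new hidden state
    return a_t, c_t

# Toy usage with made-up sizes
n_x, n_a = 5, 4
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=(n_a, n_a + n_x)) for k in ("Wu", "Wf", "Wo", "Wc")}
params.update({b: np.zeros((n_a, 1)) for b in ("bu", "bf", "bo", "bc")})
a, c = lstm_cell_forward(rng.normal(size=(n_x, 1)), np.zeros((n_a, 1)), np.zeros((n_a, 1)), params)
```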
Bidirectional RNN
• There are still some ideas to let you build much more powerful sequence models. One of them is bidirectional RNNs and another is Deep
RNNs.
• As we saw before, here is an example of the named entity recognition task:
• The name Teddy cannot be learned from He and said, but can be learned from bears.
• BiRNNs fix this issue.
• Here is the BRNN architecture:
Deep RNNs
• In a lot of cases the standard one-layer RNN will solve your problem. But in some problems it's useful to stack RNN layers to make a deeper network.
• For example, a deep RNN with 3 layers would look like this:
• In feed-forward deep nets, there could be 100 or even 200 layers. In deep RNNs stacking 3 layers is already considered deep and expensive to
train.
• In some cases you might see some feed-forward network layers connected after the recurrent cells.
Back propagation with RNNs
• In modern deep learning frameworks, you only have to implement the forward pass, and the framework takes care of the backward pass, so
most deep learning engineers do not need to bother with the details of the backward pass. If however you are an expert in calculus and want
to see the details of backprop in RNNs, you can work through this optional portion of the notebook.
• The quote is taken from this notebook. If you want the details of the back propagation with programming notes look at the linked notebook.
Word Representation
• NLP has been revolutionized by deep learning and especially by RNNs and deep RNNs.
• Word embeddings are a way of representing words. They let your algorithm automatically understand analogies between words like "king" and "queen".
• So far we have defined our language by a vocabulary, and then represented each word with a one-hot vector that marks its position in the vocabulary.
o An image example would be:
o We will use the notation O_idx for any word that is represented with a one-hot vector like in the image.
o One of the weaknesses of this representation is that it treats each word as a thing unto itself, so it doesn't allow an algorithm to generalize across words.
▪ For example: "I want a glass of orange ______"; a model should predict the next word as juice.
▪ In the similar example "I want a glass of apple ______", a model won't easily predict juice if it wasn't trained on it, because with one-hot vectors the two examples aren't related, although orange and apple are similar.
o The inner product between any two different one-hot vectors is zero, and the distances between them are all the same.
• So, instead of a one-hot representation, won't it be nice if we can learn a featurized representation for each of these words: man, woman, king, queen, apple, and orange?
o Each word will have, for example, 300 floating point features.
o Each word column will be a 300-dimensional vector which will be the representation.
o We will use the notation e5391 to describe the feature vector of the word man (whose index is 5391).
o Now, if we return to the examples we described again:
▪ "I want a glass of orange ______"
▪ I want a glass of apple ______
o Orange and apple now share a lot of similar features which makes it easier for an algorithm to generalize between them.
o We call this representation Word embeddings.
• To visualize word embeddings we use the t-SNE algorithm to reduce the features to 2 dimensions, which makes them easy to visualize:
o You will get a sense that more related words are closer to each other.
• The name word embedding comes from the idea that each word is embedded as a point (a unique vector) in an n-dimensional space.
• Let's see how we can take the feature representation we have extracted for each word and apply it to the named entity recognition problem.
• Given this example (from named entity recognition):
o Compare this with face encoding: in that problem we encode each face into a vector and then check how similar these vectors are.
o The terms encoding and embedding have a similar meaning here.
• In the word embeddings task, we are learning a representation for each word in our (fixed) vocabulary, unlike in image encoding where we have to map each new image to some n-dimensional vector. We will discuss the algorithms in the next sections.
• One of the most fascinating properties of word embeddings is that they can also help with analogy reasoning. While analogy reasoning may not be the most important NLP application by itself, it can help convey a sense of what these word embeddings can do.
• Analogies example:
o Given this word embeddings table:
o It turns out that eQueen is the best solution here, i.e. the vector most similar to eKing - eMan + eWoman.
• Cosine similarity - the most commonly used similarity function:
o Equation:
▪ CosineSimilarity(u, v) = (u · v) / (||u||2 ||v||2) = cos(θ)
▪ The numerator is the inner product of u and v. It will be large if the vectors are very similar.
• You can also use Euclidean distance as a similarity function (but it rather measures a dissimilarity, so you should take it with negative sign).
• We can use this equation to calculate the similarities between word embeddings and on the analogy problem where u = ew and v = eking -
eman + ewoman
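• A minimal sketch of cosine similarity and of solving an analogy by searching for the most similar word vector (the embeddings dictionary here is an assumed input):
```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u||2 * ||v||2)"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(word_a, word_b, word_c, embeddings):
    """Solve 'a is to b as c is to ?' by maximizing similarity with e_b - e_a + e_c.

    embeddings is assumed to be a dict mapping words to 1-D numpy vectors.
    """
    target = embeddings[word_b] - embeddings[word_a] + embeddings[word_c]
    best_word, best_sim = None, -np.inf
    for word, e_w in embeddings.items():
        if word in (word_a, word_b, word_c):
            continue
        sim = cosine_similarity(target, e_w)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word   # with good embeddings, ("man", "king", "woman") should return "queen"
```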
Embedding matrix
• When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix E.
• Let's take an example:
o Suppose we are using 10,000 words as our vocabulary (plus an <UNK> token).
o The algorithm should create a matrix E of the shape (300, 10000) in case we are extracting 300 features.
o If O6257 is the one hot encoding of the word orange of shape (10000, 1), then
np.dot(E,O6257) = e6257 which shape is (300, 1).
o Generally np.dot(E, Oj) = ej
• In the next sections, you will see that we first initialize E randomly and then try to learn all the parameters of this matrix.
• In practice it's not efficient to use matrix multiplication to extract the embedding of a specific word; instead, we use slicing to select the specific column. In Keras there is an Embedding layer that extracts this column without any multiplication.
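• A tiny numpy sketch contrasting the dot-product view with the efficient column-slicing lookup (the matrix here is random, just for illustration):
```python
import numpy as np

vocab_size, n_features = 10000, 300
E = np.random.randn(n_features, vocab_size)     # embedding matrix of shape (300, 10000), random here

j = 6257                                        # example index for the word "orange"
o_j = np.zeros((vocab_size, 1)); o_j[j] = 1.0   # one-hot vector O_6257 of shape (10000, 1)

e_dot = E @ o_j        # (300, 1) embedding via matrix multiplication (inefficient)
e_slice = E[:, [j]]    # (300, 1) same embedding via column slicing (what is done in practice)
assert np.allclose(e_dot, e_slice)
```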
• Let's start learning some algorithms that can learn word embeddings.
• At the start, word embeddings algorithms were complex but then they got simpler and simpler.
• We will start by learning the complex examples to make more intuition.
• Neural language model:
o Let's start with an example:
o We want to build a language model so that we can predict the next word.
o So we use this neural network to learn the language model
▪ We get ej by np.dot(E, Oj)
▪ NN layer has parameters W1 and b1 while softmax layer has parameters W2 and b2
▪ Input dimension is (300*6, 1) if the window size is 6 (six previous words).
▪ Here we are optimizing E matrix and layers parameters. We need to maximize the likelihood to predict the next word given the
context (previous words).
o This model was built in 2003 and tends to work pretty decently for learning word embeddings.
• In the last example we took a window of 6 words that fall behind the word that we want to predict. There are other choices when we are trying
to learn word embeddings.
o Suppose we have an example: "I want a glass of orange juice to go along with my cereal"
o To learn juice, choices of context are:
a. Last 4 words.
▪ We use a window of last 4 words (4 is a hyperparameter), "a glass of orange" and try to predict the next word from it.
b. 4 words on the left and on the right.
▪ "a glass of orange" and "to go along with"
c. Last 1 word.
▪ "orange"
d. Nearby 1 word.
▪ "glass" word is near juice.
▪ This is the idea of skip grams model.
▪ The idea is much simpler and works remarkably well.
▪ We will talk about this in the next section.
• Researchers found that if you really want to build a language model, it's natural to use the last few words as the context. But if your main goal is to learn a word embedding, then you can use any of these other contexts and they will result in very meaningful word embeddings as well.
• To summarize, language modeling poses a machine learning problem where you input the context (like the last four words) and predict some target word. Posing that problem allows you to learn good word embeddings.
Word2Vec
o For example, we have the sentence: "I want a glass of orange juice to go along with my cereal"
o We choose a context word and then a target word within some window around it, e.g.:
Context   Target   How far
orange    juice    +1
orange    glass    -2
orange    my       +6
o This is not an easy learning problem, because predicting a word within a window of -10/+10 words (10 is just an example) is hard.
• Word2Vec model:
o Vocabulary size = 10,000 words
o Let's say that the context word is c and the target word is t.
o We want to learn a mapping from c to t.
o We get ec by np.dot(E, Oc).
o We then use a softmax layer to get P(t|c), which is ŷ.
o Also we will use the cross-entropy loss function.
o This model is called skip-grams model.
• The last model has a problem with the softmax layer:
o Here we are summing 10,000 numbers which corresponds to the number of words in our vocabulary.
o If this number is larger say 1 million, the computation will become very slow.
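• A minimal numpy sketch of the skip-gram softmax P(t|c), with randomly initialized E and Θ for illustration; the sum over the whole vocabulary in the denominator is exactly the expensive part:
```python
import numpy as np

vocab_size, n_features = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(n_features, vocab_size))       # embedding matrix
Theta = rng.normal(scale=0.01, size=(vocab_size, n_features))   # softmax weights, one theta_t per word

def skipgram_forward(c_index, t_index):
    """P(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c), plus its cross-entropy loss."""
    e_c = E[:, c_index]                    # embedding of the context word
    logits = Theta @ e_c                   # one score per vocabulary word
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # the sum over all 10,000 words is the slow part
    loss = -np.log(probs[t_index])         # cross-entropy for the true target word
    return probs, loss

probs, loss = skipgram_forward(c_index=6257, t_index=4834)   # made-up indices for "orange" -> "juice"
```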
• One of the solutions for the last problem is to use "Hierarchical softmax classifier" which works as a tree classifier.
• In practice, the hierarchical softmax classifier doesn't use a balanced tree like the drawn one. Common words are at the top and less common
are at the bottom.
• How to sample the context c?
o One way is to choose the context by random from your corpus.
o If you do it that way, there will be frequent words like "the, of, a, and, to, ..." that dominate other words like "orange, apple, durian, ..."
o In practice, we don't take the context uniformly random, instead there are some heuristics to balance the common words and the non-
common words.
• The word2vec paper includes two ideas for learning word embeddings: one is the skip-gram model and the other is CBOW (continuous bag-of-words).
Negative Sampling
• Negative sampling allows you to do something similar to the skip-gram model, but with a much more efficient learning algorithm. We will
create a different learning problem.
Context   Word    Target
orange    juice   1
orange    king    0
orange    book    0
orange    the     0
orange    of      0
• We get a positive example by using the same skip-gram technique, with a fixed window around the context word; the negative examples are random words from the vocabulary.
• Notice that we got the word "of" as a negative example although it appeared in the same sentence.
• So the steps to generate the samples are:
• We will have a ratio of k negative examples to 1 positive example in the data we are collecting.
• Now let's define the model that will learn this supervised learning problem:
o Let's say that the context word is c, the candidate word is t, and y is the label (1 for positive, 0 for negative).
o We will apply a simple logistic regression model.
o So it is as if we have 10,000 binary classification problems, but we only train k+1 of them in each iteration.
o To pick the negative words, we could sample according to the empirical frequencies in the corpus, i.e. according to how often different words appear. But the problem with that is that we will mostly get very frequent words like the, of, and, ...
o The best, according to the authors, is to sample with this equation: P(wi) = f(wi)^(3/4) / Σj f(wj)^(3/4), where f(wi) is the observed frequency of word wi.
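• A minimal sketch of sampling k negative words from that distribution (the word counts here are made up):
```python
import numpy as np

def negative_sampling_distribution(word_freqs):
    """P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4), the heuristic proposed by the word2vec authors."""
    freqs = np.asarray(word_freqs, dtype=float)
    weights = freqs ** 0.75
    return weights / weights.sum()

# Made-up unigram counts for a tiny vocabulary
counts = {"the": 50000, "of": 30000, "orange": 300, "durian": 5}
probs = negative_sampling_distribution(list(counts.values()))
k = 4
negatives = np.random.choice(list(counts.keys()), size=k, p=probs)   # k negative words per positive pair
```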
• GloVe (global vectors for word representation) is another algorithm for learning word embeddings. It's the simplest of them.
• It is not used as much as the word2vec/skip-gram models, but it has some enthusiasts because of its simplicity.
• Let's use our previous example: "I want a glass of orange juice to go along with my cereal".
• We will choose a context and a target from the choices we have mentioned in the previous sections.
• Then we will calculate this for every pair: Xct = # times t appears in context of c
• Xct = Xtc if we use a symmetric window, but they will not be equal if, for example, the context is only the previous words. In GloVe a symmetric window is used, which means they are equal.
• The model is defined by minimizing: Σt Σc f(Xct) (θt^T ec + bt + b'c - log(Xct))^2
• f(x) - the weighting term, used for many reasons which include:
o The log(0) problem, which might occur if there are no pairs for the given target and context values.
o Not giving too much weight to stop words like "is", "the", and "this", which occur many times.
o Not giving too little weight to infrequent words.
• θ and e play symmetric roles, so the final word embedding can be taken as their average: e_w(final) = (e_w + θ_w) / 2.
o If this is your first try, you should download a pre-trained set of embeddings; this usually works best.
o If you have enough data, you can try to implement one of the available algorithms.
o Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-trained set of
embeddings.
o A final note: you can't guarantee that the axes used to represent the features will be well-aligned with easily interpretable axes like gender, royalty, or age.
Applications using Word Embeddings
Sentiment Classification
• As we have discussed before, sentiment classification is the task of deciding whether a piece of text expresses a positive or a negative opinion. It's very useful in NLP and is used in many applications. An example would be:
• One of the challenges is that you might not have a huge labeled training set for it, but using word embeddings can help make up for that.
• Common dataset sizes vary from 10,000 to 100,000 words.
• A simple sentiment classification model would be like this:
o The embedding matrix may have been trained on say 100 billion words.
o Number of features in word embedding is 300.
o We can sum or average the embeddings of all the words and then pass the result to a softmax classifier. That makes the classifier work for short or long sentences (see the sketch below).
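• A minimal sketch of the feature-extraction part of this simple model, assuming a dictionary of pre-trained embeddings; the averaged vector would then be fed to a softmax classifier:
```python
import numpy as np

def average_embedding(words, embeddings, n_features=300):
    """Average the embeddings of all words in a review; works for short or long sentences.

    embeddings is assumed to be a dict mapping words to 300-dimensional vectors
    (e.g. loaded from a pre-trained set); unknown words are simply skipped here.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(n_features)
    return np.mean(vectors, axis=0)

# The averaged vector would then go into a softmax (or logistic) classifier
# that predicts the 1-5 star rating.
```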
• One of the problems with this simple model is that it ignores word order. For example "Completely lacking in good taste, good service, and good ambience" has the word good 3 times but it's a negative review.
• A better model uses an RNN for solving this problem:
o And so if you train this algorithm, you end up with a pretty decent sentiment classification algorithm.
o Also, it will generalize better, even to words that weren't in your labeled dataset. For example, take the sentence "Completely absent of good taste, good service, and good ambience". Even if the word "absent" is not in your labeled training set, if it was in the 1-billion or 100-billion word corpus used to train the word embeddings, the model can still get this right and generalize to words that appear in the embedding training corpus but not in the labeled training set you had for the sentiment classification problem specifically.
• We want to make sure that our word embeddings are free from undesirable forms of bias, such as gender bias, ethnicity bias and so on.
• Horrifying results on the trained word embeddings in the context of Analogies:
o Man : Computer_programmer as Woman : Homemaker
o Father : Doctor as Mother : Nurse
• Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of text used to train the model.
• Learning algorithms in general are making important decisions, so they mustn't be biased.
• Andrew thinks we actually have better ideas for quickly reducing the bias in AI than for quickly reducing the bias in the human race, although a lot of work still needs to be done.
• Addressing bias in word embeddings steps:
o Idea from the paper: https://arxiv.org/abs/1607.06520
o Given these learned embeddings:
o We need to solve the gender bias here. The steps we will discuss can help solve any bias problem but we are focusing here on gender
bias.
o Here are the steps:
a. Identify the direction:
▪ Calculate the difference between:
▪ ehe - eshe
▪ emale - efemale
▪ ....
▪ Choose some k differences and average them.
▪ This will help you find this:
▪ By that we have found the bias direction, which is a 1D vector, and the non-bias direction, which is a 299D subspace.
b. Neutralize: For every word that is not definitional, project it onto the non-bias axis to get rid of the bias component.
▪ Babysitter and doctor need to be neutral, so we project them onto the non-bias axis along the direction of the bias (see the sketch below).
c. Equalize pairs: we want pairs like grandfather and grandmother to differ only along the bias direction, so we move them to points that are symmetric around the non-bias axis.
▪ There are some word pairs you need to do this for; the number of these pairs is relatively small.
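• A minimal sketch of the neutralize step (projecting out the bias direction) using the standard vector-projection formula; the vectors here are tiny toy examples:
```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e along the bias direction g (e.g. the gender axis)."""
    e_bias_component = (np.dot(e, g) / np.dot(g, g)) * g   # projection of e onto g
    return e - e_bias_component                            # e now lies in the non-bias subspace

# Hypothetical 4-D example: g is the bias direction, e is the embedding of "babysitter"
g = np.array([1.0, 0.0, 0.0, 0.0])
e = np.array([0.3, 1.2, -0.5, 0.7])
e_debiased = neutralize(e, g)          # the component along g becomes 0
```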
Basic Models
• In this section we will learn about sequence to sequence - Many to Many - models which are useful in various applications including machine
translation and speech recognition.
• Let's start with the basic model:
o Given this machine translation problem, in which X is a French sequence and Y is an English sequence, the architecture uses an encoder RNN to read the input sentence and a decoder RNN to generate the translation.
o A similar architecture works for image captioning: it uses a pretrained CNN (like AlexNet) as an encoder for the image, and the decoder is an RNN.
o Ideas are from the following papers (they share similar ideas):
▪ Maoet et. al., 2014. Deep captioning with multimodal recurrent neural networks
▪ Vinyals et. al., 2014. Show and tell: Neural image caption generator
▪ Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions
• There are some similarities between the language model we have learned previously and the machine translation model we have just discussed, but there are some differences as well.
• The language model we have learned is very similar to the decoder part of the machine translation model, except that the language model starts from a<0> = 0 while the decoder starts from the encoder's representation of the input. That is why the machine translation model is sometimes called a conditional language model: it models P(y | x).
• To translate, we want the single most likely sentence, i.e. the y that maximizes P(y | x). The most common algorithm for this is beam search, which we will explain in the next section.
• Why not use greedy search? Why not get the best choices each time?
o It turns out that this approach doesn't really work!
o Let's explain it with an example:
▪ The best output for the example we talked about is "Jane is visiting Africa in September."
▪ Suppose that when choosing with the greedy approach, the first two words were "Jane is"; the word most likely to come next is "going", since "going" is a very common word after "is". So the result may look like "Jane is going to be visiting Africa in September.", which isn't the best/optimal translation.
• So, rather than the greedy approach, it is better to find an approximate solution that tries to maximize P(y | x).
Beam Search
• Beam search is the most widely used algorithm to get the best output sequence. It's a heuristic search algorithm.
• To illustrate the algorithm we will stick with the example from the previous section. We need Y = "Jane is visiting Africa in September."
• The algorithm has a parameter B which is the beam width. Lets take B = 3 which means the algorithm will get 3 outputs at a time.
• For the first step you will get ["in", "jane", "september"] words that are the best candidates.
• Then, for each word in the first output, get the B next (second) words and select the top B combinations, where the best are those that give the highest value of the product of both probabilities - P(y<1>|x) * P(y<2>|x, y<1>). So we will then have ["in september", "jane is", "jane visit"]. Notice that we automatically discarded "september" as a first word.
• Repeat the same process and get the best B words for ["september", "is", "visit"] and so on.
• In this algorithm, we keep only B copies of the network at any time.
• If B = 1 this will become the greedy search.
o But there's another problem. The optimization objective we have mentioned prefers short sequences over long ones, because multiplying more probabilities (numbers smaller than 1) gives a smaller value. In practice we also sum log probabilities instead of multiplying probabilities, to avoid numerical underflow.
o So there's another refinement, length normalization: divide the summed log probability by the number of elements in the sequence (in practice by Ty^α, where α is a hyperparameter, e.g. 0.7). A sketch of the whole procedure follows below.
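• A minimal sketch of beam search with length normalization, assuming a hypothetical step_logprobs(prefix) function that runs the decoder and returns log probabilities for the next word:
```python
import numpy as np

def beam_search(step_logprobs, eos_index, B=3, max_len=20, alpha=0.7):
    """Beam search sketch. step_logprobs(prefix) is a hypothetical function that runs the
    decoder on the given prefix and returns log P(next word | x, prefix) for every word."""
    beams = [([], 0.0)]                                   # (sequence, summed log probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_logprobs(seq)
            for w in np.argsort(log_probs)[-B:]:          # B best next words for this beam
                candidates.append((seq + [int(w)], score + log_probs[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:B]:                 # keep only the best B overall
            (completed if seq[-1] == eos_index else beams).append((seq, score))
        if not beams:
            break
    completed = completed or beams
    # Length normalization: pick the sequence with the highest score / Ty^alpha
    return max(completed, key=lambda c: c[1] / (len(c[0]) ** alpha))[0]
```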
• We have talked before about error analysis in the "Structuring Machine Learning Projects" course. We will apply those concepts to improve our beam search algorithm.
• We will use error analysis to figure out whether the problem is the beam width B of the beam search (it doesn't find the optimal solution) or our RNN model.
• Let's take an example:
o Initial info:
▪ x = "Jane visite l’Afrique en septembre."
▪ y* = "Jane visits Africa in September." - right answer
▪ ŷ = "Jane visited Africa last September." - answer produced by model
o Our model has produced a poor result.
o We now want to know what to blame - the RNN or the beam search.
o To do that, we calculate P(y* | X) and P(ŷ | X). There are two cases:
▪ Case 1 (P(y* | X) > P(ŷ | X)):
▪ Conclusion: Beam search is at fault.
▪ Case 2 (P(y* | X) <= P(ŷ | X)):
▪ Conclusion: RNN model is at fault.
• The error analysis process is as follows:
o You choose N error examples and make the following table:
BLEU Score
• One of the challenges of machine translation is that, given a sentence in one language, there can be one or more equally good translations in another language. So how do we evaluate our results?
• The way we do this is by using BLEU score. BLEU stands for bilingual evaluation understudy.
• The intuition is: as long as the machine-generated translation is pretty close to any of the references provided by humans, then it will get a
high BLEU score.
• Let's take an example:
o X = "Le chat est sur le tapis."
o Y1 = "The cat is on the mat." (human reference 1)
o Y2 = "There is a cat on the mat." (human reference 2)
o Suppose that the machine outputs: "the the the the the the the."
o One way to evaluate the machine output is to look at each word in the output and check if it is in the references. This is called precision:
▪ precision = 7/7 because "the" appeared in Y1 or Y2
o This is not a useful measure!
o We can use a modified precision in which we look, across the references, for the maximum number of times a particular word appears and clip the count of this word to that number. So:
▪ modified precision = 2/7 because the max count of "the" in a single reference is 2 (in Y1).
▪ We clipped the count of 7 down to the max, which is 2.
o Here we are looking at one word at a time (unigrams); we can look at n-grams too.
• BLEU score on bigrams
o The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram".
o Suppose that the machine outputs: "the cat the cat on the mat."
Pairs     Count   Count clip
the cat   2       1 (Y1)
cat the   1       0
cat on    1       1 (Y2)
on the    1       1 (Y1)
the mat   1       1 (Y1)
Totals    6       4
• So here is the equation for the modified precision in the n-gram case:
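• A minimal sketch of the clipped (modified) n-gram precision used by BLEU, reproducing the numbers from the examples above:
```python
from collections import Counter

def modified_ngram_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is clipped by its max count in any single reference."""
    def ngrams(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand_counts = ngrams(candidate, n)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_ngram_precision(["the"] * 7, refs, n=1))                           # 2/7
print(modified_ngram_precision("the cat the cat on the mat".split(), refs, n=2))  # 4/6
```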
• So far we have been using sequence to sequence models with an encoder and a decoder. There is a technique called attention which makes these models even better.
• The attention idea has been one of the most influential ideas in deep learning.
• The problem of long sequences:
o Given this model, inputs, and outputs.
o The encoder should memorize this long sequence into one vector, and the decoder has to process this vector to generate the
translation.
o If a human were to translate this sentence, they wouldn't read the whole sentence, memorize it, and then translate it. They would translate a part at a time.
o The performance of this model decreases if a sentence is long.
o We will discuss the attention model, which works like a human who looks at parts of the sentence at a time. That will significantly increase the accuracy even with longer sequences:
▪ Blue is the normal model, while green is the model with the attention mechanism.
• In this section we will give just some intuition about the attention model, and in the next section we will discuss its details.
• At first the attention model was developed for machine translation, but it was then adopted in other applications such as computer vision and in new architectures like the Neural Turing Machine.
• The attention model was described in this paper:
o Bahdanau et. al., 2014. Neural machine translation by jointly learning to align and translate
• Now for the intuition:
o Suppose that our encoder is a bidirectional RNN:
o We give the French sentence to the encoder and it should generate a vector that represents the inputs.
o Now to generate the first word in English which is "Jane" we will make another RNN which is the decoder.
o Attention weights are used to specify which input words are needed when generating each output word. So to generate "jane" we will look at "jane", "visite", "l'Afrique".
o alpha<1,1>, alpha<1,2>, and alpha<1,3> are the attention weights being used.
o And so to generate any word there will be a set of attention weights that controls which words we are looking at right now.
Attention Model
• Let's formalize the intuition from the last section into the exact details of how this can be implemented.
• First we will have a bidirectional RNN (most commonly LSTMs) that encodes the French sentence:
• For learning purposes, let's assume that a<t'> includes the activations of both directions at time step t'.
• We will have a unidirectional RNN that produces the output using a context c, which is computed using the attention weights. The weights denote how much the output at step t needs to look at a<t'>.
• The attention weights for each output step should sum to 1 over the input sequence:
o Now we need to know how to calculate e<t, t'>. We compute e using a small neural network (usually 1 layer, because we need to compute it a lot):
▪ s<t-1> is the hidden state of the decoder RNN s, and a<t'> is the activation of the bidirectional encoder RNN.
• One of the disadvantages of this algorithm is that it takes quadratic time or quadratic cost to run.
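• A minimal numpy sketch of computing the attention weights α<t, t'> (a softmax over the energies e<t, t'>) and the context vector; the score function here is a toy stand-in for the small 1-layer network:
```python
import numpy as np

def attention_context(a, s_prev, score_fn):
    """Compute attention weights alpha<t, t'> and the context vector c<t> for one decoder step.

    a        : array of shape (Tx, n_a) holding the encoder activations a<t'>
    s_prev   : previous decoder hidden state s<t-1>
    score_fn : stand-in for the small network that returns the scalar energy e<t, t'>
    """
    energies = np.array([score_fn(s_prev, a_tp) for a_tp in a])   # e<t, t'> for every t'
    energies -= energies.max()                                    # numerical stability
    alphas = np.exp(energies) / np.exp(energies).sum()            # softmax: weights sum to 1
    context = (alphas[:, None] * a).sum(axis=0)                   # c<t> = sum_t' alpha<t, t'> a<t'>
    return alphas, context

# Toy usage: a dot-product score stands in for the learned 1-layer network
rng = np.random.default_rng(0)
Tx, n_a = 5, 8
score_fn = lambda s_prev, a_tp: float(np.dot(s_prev, a_tp))
alphas, c = attention_context(rng.normal(size=(Tx, n_a)), rng.normal(size=n_a), score_fn)
```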
• One fun way to see how attention works is by visualizing the attention weights:
Speech recognition
• One of the most exciting developments using sequence-to-sequence models has been the rise of very accurate speech recognition.
• Let's define the speech recognition problem:
o X: audio clip
o Y: transcript
o If you plot an audio clip it will look like this:
▪ The horizontal axis is time while the vertical is changes in air pressure.
o What really is an audio recording? A microphone records little variations in air pressure over time, and it is these little variations in air
pressure that your ear perceives as sound. You can think of an audio recording is a long list of numbers measuring the little air pressure
changes detected by the microphone. We will use audio sampled at 44100 Hz (or 44100 Hertz). This means the microphone gives us
44100 numbers per second. Thus, a 10 second audio clip is represented by 441000 numbers (= 10 * 44100).
o It is quite difficult to work with the "raw" representation of audio.
o Even the human ear doesn't process raw waveforms; it extracts different frequencies from them.
o There's a common preprocessing step for audio - generating a spectrogram, which works similarly to human ears.
▪ The horizontal axis is time while the vertical is frequencies. Intensity of different colors shows the amount of energy - how loud
is the sound for different frequencies (a human ear does a very similar preprocessing step).
o A spectrogram is computed by sliding a window over the raw audio signal and calculating the most active frequencies in each window using a Fourier transform.
o In the past, speech recognition systems were built using phonemes, which are hand-engineered basic units of sound. Linguists used to hypothesize that writing down audio in terms of these basic units of sound called phonemes would be the best way to do speech recognition.
o With end-to-end deep learning, phonemes are no longer needed. One of the things that made this possible is large audio datasets.
o Research papers have around 300 - 3000 hours of training data while the best commercial systems are now trained on over 100,000
hours of audio.
• You can build an accurate speech recognition system using the attention model that we described in the previous section:
• One of the methods that seems to work well is the CTC cost, which stands for "Connectionist temporal classification".
o To explain this, let's say that Y = "the quick brown fox".
o We are going to use an RNN with an input/output structure in which the number of outputs equals the (much larger) number of input time steps; the basic CTC rule is then to collapse repeated characters that are not separated by a "blank" token to obtain the final transcript:
• With the rise of deep learning speech recognition, there are a lot of devices that can be woken up by saying certain words. These systems are called trigger word detection systems.
• For example, Alexa - a smart device made by Amazon - will respond when you say "Alexa, what time is it?".
• Trigger word detection systems include:
• For now, the trigger word detection literature is still evolving, so there isn't a single universally agreed-upon algorithm for trigger word detection yet. But let's discuss an algorithm that can be used.
• Let's now build a model that can solve this problem:
o X: audio clip
o X is preprocessed into spectrogram features:
▪ x<1>, x<2>, ... , x<t>
o Y will be labels 0 or 1: 0 means the trigger word has not just been said, while 1 marks that the trigger word we need to detect was just said.
o The model architecture can be like this:
▪ The vertical lines in the audio clip represent the moments just after the trigger word ends. The corresponding labels will be 1.
o One disadvantage of this is that it creates a very imbalanced training set: there will be a lot of zeros and few ones.
o A hack to mitigate this is to output ones for several time steps (a fixed period of time) after the trigger word before reverting back to zero, as in the sketch below.
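• A minimal sketch of generating such labels, assuming spectrogram features and a known time step where the trigger word ends (the specific numbers are illustrative):
```python
import numpy as np

def insert_trigger_labels(y, trigger_end_step, ones_duration=50):
    """Set labels to 1 for `ones_duration` time steps right after the trigger word ends.

    y is a 1-D array of per-time-step labels (initially all zeros). Outputting a stretch
    of ones instead of a single 1 is the hack used to reduce the label imbalance.
    """
    end = min(trigger_end_step + 1 + ones_duration, len(y))
    y[trigger_end_step + 1:end] = 1
    return y

T_y = 1375                                            # example number of output time steps
y = np.zeros(T_y)
y = insert_trigger_labels(y, trigger_end_step=700)    # steps 701..750 become 1
```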
Extras