Unit 5 NLP
Topics covered: Introduction to Probabilistic Approaches, Statistical Approaches to NLP Tasks, Sequence Labeling,
Problems - Similarity Measures, Word Embeddings, CBOW, Skip-gram, Sentence Embeddings, Recurrent Neural
Networks (RNN), Long Short-Term Memory (LSTM)
A popular idea in computational linguistics is to create a probabilistic model of language. Such a model
assigns a probability to every sentence in English in such a way that more likely sentences (in some sense) get
higher probability. If you are unsure between two possible sentences, pick the higher probability one.
Comment: A "perfect" language model is only attainable with true intelligence. However, approximate
language models are often easy to create and good enough for many applications.
Some models:
● Unigram: words generated one at a time, drawn from a fixed distribution.
● Bigram: probability of a word depends on the previous word (a minimal sketch follows this list).
● Tag bigram: probability of part of speech depends on previous part of speech; probability of word
depends on part of speech.
● Maximum entropy: lots of other random features can contribute.
● Stochastic context free: words generated by a context-free grammar augmented with probabilistic
rewrite rules.
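To make the bigram idea concrete, here is a minimal sketch of a bigram model estimated by counting over a tiny made-up corpus (the corpus and the <s>, </s> boundary markers are assumptions for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> the dog barks </s>",
    "<s> the cat sleeps </s>",
    "<s> the dog sleeps </s>",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def bigram_prob(prev, cur):
    """P(cur | prev) estimated by relative frequency (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

def sentence_prob(sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    tokens = sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob("<s> the dog sleeps </s>"))  # seen bigrams -> higher probability
print(sentence_prob("<s> the cat barks </s>"))   # unseen bigram -> probability 0
```

With no smoothing, a single unseen bigram drives a sentence's probability to zero, which is why practical models add smoothing.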
Statistical NLP aims to perform statistical inference for the field of NLP. Statistical inference consists
of taking some data generated in accordance with some unknown probability distribution and making
inferences.
A central example of statistical NLP is language modeling: predicting the next word in a sequence given the
words that precede it. Statistical modeling helps to:
● Suggest auto-completes
● Recognize handwriting, with the help of lexical knowledge, even in poorly written text
● Detect and correct spelling errors
● Recognize speech
● Recognize multi-token named entities
● Caption images
● Summarize texts
● Detect primitive acoustic features
● Perform text categorization
These are general applications of statistical models and the fundamentals of statistical natural language processing.
3. Sequence Labeling
In machine learning, sequence labeling is a type of pattern recognition task that involves the
algorithmic assignment of a categorical label to each member of a sequence of observed values. A
common example of a sequence labeling task is part of speech tagging, which seeks to assign a part of
speech to each word in an input sentence or document. Sequence labeling can be treated as a set of
independent classification tasks, one per member of the sequence. However, accuracy is generally
improved by making the optimal label for a given element dependent on the choices of nearby elements,
using special algorithms to choose the globally best set of labels for the entire sequence at once.
Part-Of-Speech Tagging
Part-Of-Speech Tagging (POS Tagging) is a sequence labeling task. It is the process of marking up a
word in a text (corpus) as corresponding to a particular part of speech (syntactic tag), based on both its
definition and its context. POS tagging is a helper task for many other NLP tasks: Word Sense
Disambiguation, Dependency Parsing, etc.
A sequence is a series of tokens that are not independent of each other. Series in mathematics and
sentences in linguistics are both sequences, because in both of them the next token depends on the
previous ones (or vice versa).
Determining POS tags is much more complicated than simply mapping words to their tags. Consider the
word back (a quick tagger example follows these sentences):
● Earnings growth took a back/JJ seat.
● A small building in the back/NN.
● A clear majority of senators back/VBP the bill.
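Off-the-shelf taggers resolve such ambiguities from context. A quick usage sketch with the NLTK library, assuming NLTK and its tokenizer/tagger resources are installed:

```python
import nltk

# One-time resource downloads may be required, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Earnings growth took a back seat")
print(nltk.pos_tag(tokens))
# Prints a list of (token, tag) pairs; in this context "back" should be tagged
# as an adjective (JJ) rather than a noun (NN) or a verb (VBP).
```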
Word sense disambiguation: Given a dictionary that gives one or more senses for each word
(for example, a bank can be either a financial institution or the sloped ground next to a river), and given
a sentence, guess the sense of each word in the sentence.
Named entity detection: Given a sentence, identify all the proper names (Notre Dame, Apple, etc.)
and classify them as persons, organizations, places, etc. The typical way to set this up as a
sequence-labeling problem is called BIO tagging. Each word is labeled B (beginning) if it is the
first word in a named entity, I (inside) if it is a subsequent word in a named entity, and O (outside)
otherwise. Other encodings are possible as well.
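As a small, hypothetical illustration of BIO tagging (the sentence and entity types are made up for this example):

```python
# One possible BIO labeling of a toy sentence with ORG and LOC entities.
tagged = [
    ("Notre", "B-ORG"), ("Dame", "I-ORG"),
    ("beat", "O"),
    ("Navy", "B-ORG"),
    ("in", "O"),
    ("South", "B-LOC"), ("Bend", "I-LOC"),
    (".", "O"),
]
for token, tag in tagged:
    print(f"{token}\t{tag}")
```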
Word segmentation: Given a representation of a sentence without any word boundaries, reconstruct
the word boundaries. In some languages, like Chinese, words are written without any spaces in
between them. (Indeed, it can be difficult to settle on the definition of a “word” in such languages.)
This situation also occurs with any spoken language.
Sequence labeling can be done with various methods. While traditional models are based on corpus
statistics (Hidden Markov Models, Maximum Entropy Markov Models, Conditional Random Field, etc.),
recent models are based on neural networks (Recurrent Neural Networks, Long Short-Term Memory,
BERT, etc.).
Hidden Markov Models
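An HMM treats the tags as hidden states and the words as observations; decoding with the Viterbi algorithm finds the most probable tag sequence. Below is a minimal sketch with hand-picked toy probabilities (the numbers are assumptions, not corpus estimates):

```python
# Toy HMM for POS tagging: illustrative, hand-picked probabilities.
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.8, "barks": 0.2},
    "VB": {"dog": 0.1, "barks": 0.9},
}

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p][0] * trans_p[p][t])
            prob = V[-1][best_prev][0] * trans_p[best_prev][t] * emit_p[t].get(w, 0.0)
            col[t] = (prob, V[-1][best_prev][1] + [t])
        V.append(col)
    return max(V[-1].values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VB']
```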
4. Similarity Measures
In data science, a similarity measure is a way of measuring how closely data samples are related to
each other, while a dissimilarity measure tells how distinct the data objects are.
It is also used in classification (e.g. KNN), where the data objects are labeled based on the features’
similarity. Another example is when we talk about dissimilar outliers compared to other data samples
(e.g., anomaly detection).
The similarity measure is usually expressed as a numerical value: It gets higher when the data samples
are more alike. It is often expressed as a number between zero and one by conversion: zero means low
similarity (the data objects are dissimilar). One means high similarity (the data objects are very
similar).
Metric:
A given distance (i.e., dissimilarity) is said to be a metric if and only if it satisfies the following four
conditions:
1- Non-negativity: d(p, q) ≥ 0 for any two distinct observations p and q.
2- Symmetry: d(p, q) = d(q, p) for all p and q.
3- Triangle inequality: d(p, q) ≤ d(p, r) + d(r, q) for all p, q, r.
4- Identity of indiscernibles: d(p, q) = 0 only if p = q.
Distance measures are fundamental to classifiers such as the k-nearest neighbors (KNN) algorithm,
which measures the dissimilarity between given data samples.
Distance Functions:
The technique used to measure distances depends on the particular situation you are working on. For
instance, in some areas the Euclidean distance can be optimal and useful for computing distances. To
compute it, subtract the corresponding coordinates of the two vectors, square the differences, sum
them, and take the square root of the sum, as in the sketch below.
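A minimal sketch of that computation, assuming NumPy is available:

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```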
Cosine similarity measures the similarity between two vectors of an inner product space. It is measured
by the cosine of the angle between two vectors and determines whether two vectors are pointing in
roughly the same direction. It is often used to measure document similarity in text analysis.
The cosine similarity between two non-zero vectors A and B is cos(θ) = (A · B) / (||A|| ||B||).
How does cosine similarity work?
● For non-negative vectors (such as term-frequency vectors), cosine similarity is bounded between 0 and 1; in general, it ranges from -1 to 1.
● The measurement is the cosine of the angle between the two non-zero vectors A and B.
● If the angle between the two vectors is 90 degrees, the cosine similarity is 0; the two vectors are orthogonal (perpendicular) to each other.
● As the cosine similarity gets closer to 1, the angle between the two vectors A and B gets smaller (see the sketch below).
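A minimal sketch of the computation, assuming NumPy is available:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||) for non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
print(cosine_similarity([1, 2], [2, 4]))  # 1.0 (same direction)
```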
How is cosine similarity used?
1. Document Similarity
A scenario that involves identifying the similarity between pairs of documents is a good use case for
cosine similarity as a quantification of the similarity between two objects.
Quantification of the similarity between two documents can be obtained by converting the words or
phrases into a vectorized form of representation.
The vector representations of the documents can then be used within the cosine similarity formula to
obtain a quantification of similarity.
In the scenario described above, a cosine similarity of 1 implies that the two documents are exactly
alike, while a cosine similarity of 0 indicates that the two documents share no similarity.
2. Pose Matching
Pose matching involves comparing the poses containing critical points of joint locations.
Pose estimation is a computer vision task, and it’s typically solved using Deep Learning approaches
such as Convolutional Pose Machine, Stacked hourglass, PoseNet, etc.
Jaccard Similarity
Jaccard Similarity is the ratio of common words to total unique words, i.e., the intersection of the word
sets divided by their union across the two documents. Its score ranges between 0 and 1: 1 represents
high similarity and 0 represents no similarity. The formula for Jaccard similarity is J(A, B) = |A ∩ B| / |A ∪ B|.
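A minimal sketch over two made-up sentences, treating each document as a set of unique words:

```python
def jaccard_similarity(doc_a, doc_b):
    """|intersection| / |union| over the sets of unique words."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# 3 shared words out of 7 unique words overall -> about 0.43
print(jaccard_similarity("the cat sat on the mat", "the cat lay on the rug"))
```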
5. Word Embeddings
In natural language processing (NLP), word embedding is a term used for the representation of words for
text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that
the words that are closer in the vector space are expected to be similar in meaning.
Word embeddings can be obtained using a set of language modeling and feature learning techniques
where words or phrases from the vocabulary are mapped to vectors of real numbers.
Features: Anything that relates words to one another. Eg: Age, Sports, Fitness, Employed etc. Each word
vector has values corresponding to these features.
Goal of Word Embeddings
● To reduce dimensionality
● To use a word to predict the words around it
● Inter-word semantics must be captured
Let's take an example to understand how word vectors are generated: take emoticons that are most
frequently used in certain conditions, transform each emoji into a vector, and let the conditions be
our features.
[Table: each emoji (column) is scored against the conditions Happy, Sad, Excited, and Sick (rows); the column of scores for an emoji forms its word vector.]
1) Word2Vec:
In Word2Vec, every word is assigned a vector. We start with either a random vector or a one-hot vector.
One-hot vector: a representation where only one bit in the vector is 1. If there are 500 words in the corpus,
the vector length will be 500. After assigning a vector to each word, we take a window size and iterate
through the entire corpus. While doing this, two neural embedding methods are used:
CBOW (Continuous Bag of Words)
In this model, we try to fit the neighboring words in the window to the central word.
Skip Gram
In this model, we use the central word to predict the neighboring words; it is the complete opposite of
the CBOW model. It has been shown that this method produces more meaningful embeddings.
After applying the above neural embedding methods, we get trained vectors for each word after many
iterations through the corpus. These trained vectors preserve syntactic and semantic information and
have lower dimensionality than one-hot vectors. Vectors with similar meaning or semantic information
are placed close to each other in space.
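In practice, both methods are available off the shelf. A usage sketch, assuming the gensim library (version 4.x API) is installed; the corpus is a toy example:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (illustrative only).
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "sleeps", "while", "the", "fox", "runs"],
]

# sg=0 selects CBOW, sg=1 selects Skip-gram; window controls the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["fox"][:5])           # first few dimensions of the trained vector
print(model.wv.most_similar("dog"))  # nearest neighbours in the embedding space
```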
2) GloVe:
This is another method for creating word embeddings. In this method, we iterate through the corpus and
record the co-occurrence of each word with the other words, which gives a co-occurrence matrix. Words
that occur next to each other get a value of 1; if they are one word apart, 1/2; if two words apart, 1/3;
and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:
It is a nice evening.
Good evening!
Is it a nice evening?
The upper half of the matrix will be a reflection of the lower half. We can also use a window frame to
calculate the co-occurrences, shifting the frame until the end of the corpus. This helps gather
information about the context in which a word is used.
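To make the construction concrete, here is a small sketch that builds the weighted co-occurrence counts for the toy corpus above (the window size of 3 is an assumption for illustration):

```python
from collections import defaultdict

corpus = ["it is a nice evening", "good evening", "is it a nice evening"]
window = 3  # assumed window size

cooc = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            d = j - i                      # distance between the two words
            pair = tuple(sorted((w, tokens[j])))
            cooc[pair] += 1.0 / d          # adjacent -> 1, one apart -> 1/2, ...

for pair, value in sorted(cooc.items()):
    print(pair, round(value, 2))
```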
Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how
close they are to each other in space. If two words occur together more often (i.e., have a higher value
in the co-occurrence matrix) but their vectors are far apart in space, the vectors are brought closer to
each other. If their vectors are close to each other but the words are rarely used together, the vectors
are moved further apart in space.
After many iterations of the above process, we’ll get a vector space representation that approximates
the information from the co-occurrence matrix. The performance of GloVe is better than Word2Vec in
terms of both semantic and syntactic capturing.
6. CBOW
In the CBOW model, the distributed representations of context (or surrounding words) are combined
to predict the word in the middle. While in the Skip-gram model, the distributed representation of the
input word is used to predict the context.
CBOW is several times faster to train than skip-gram and gives slightly better accuracy for frequent words. The
CBOW model architecture tries to predict the current target word (the center word) based on the
source context words (surrounding words). Considering a simple sentence, “the quick brown fox
jumps over the lazy dog”, this can be pairs of (context_window, target_word) where if we consider
a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick),
([the, dog], lazy) and so on. Thus the model tries to predict the target_word based on the
context_window words.
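A small sketch that enumerates these (context_window, target_word) pairs for the sentence above (here the "window of size 2" is read as one word on each side, matching the examples):

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
half = 1  # one context word on each side of the target

pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - half):i] + sentence[i + 1:i + 1 + half]
    pairs.append((context, target))

print(pairs[1])  # (['the', 'brown'], 'quick')
print(pairs[2])  # (['quick', 'fox'], 'brown')
```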
1. These methods (CBOW and Skip-gram) are prediction-based: they assign probabilities to words.
2. They proved to be state of the art for tasks like word analogies and word similarities.
3. They were also able to perform algebraic operations such as King - Man + Woman = Queen, a result
that was considered almost magical.
The following model architectures are used for word representations, with the objective of maximizing
accuracy while minimizing computational complexity.
Word pairs are constructed by sliding the window over the text; the target word is the word for which we
want to find pairs. Here, we do not care how far apart two words within the window are: as long as words
are inside the window, we do not differentiate between words that are one word away or several.
Step-1: Initially, we assign a vector of random numbers to each word in the corpus.
Step-2: Then, we iterate through each word of the document, grab the vectors of the nearest n words on
either side of the target word, concatenate all these vectors, forward propagate the concatenated vector
through a linear layer + softmax function, and try to predict the target word.
Step-3: In this step, we compute the error between our estimate and the actual target word, backpropagate
the error, and update not only the weights of the linear layer but also the vectors (embeddings) of the
neighboring words.
Step-4: Finally, we extract the weights from the hidden layer and use these weights to encode the meaning
of the words in the vocabulary (a minimal sketch of these steps follows).
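A minimal sketch of Steps 1-4, assuming PyTorch; the vocabulary size, dimensions, and word ids are made up for illustration:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Concatenate the context embeddings, then predict the target word."""
    def __init__(self, vocab_size, embed_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)          # Step-1: random vectors
        self.linear = nn.Linear(context_size * embed_dim, vocab_size)  # linear layer (softmax in the loss)

    def forward(self, context_ids):
        e = self.embeddings(context_ids)   # (batch, context_size, embed_dim)
        h = e.view(e.size(0), -1)          # Step-2: concatenate the context vectors
        return self.linear(h)              # logits over the vocabulary

vocab_size, embed_dim, context_size = 20, 10, 4
model = CBOW(vocab_size, embed_dim, context_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()            # softmax + negative log-likelihood

# One toy training step on made-up word ids (Step-3: backpropagate the error).
context = torch.tensor([[2, 3, 5, 6]])     # ids of the n words on either side
target = torch.tensor([4])                 # id of the target word
loss = loss_fn(model(context), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()                           # updates the linear weights AND the embeddings

# Step-4: the embedding weights are the learned word vectors.
print(model.embeddings.weight.shape)       # torch.Size([20, 10])
```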
The Word2Vec model is not a single algorithm; it is composed of the following two techniques:
Continuous Bag of Words (CBOW) and Skip-gram.
Semantic Regularities: These regularities refer to the meaning of the vocabulary symbols arranged in
that structure.
It was found that the similarity of word representations produced by this technique goes beyond syntactic
regularities and works surprisingly well for algebraic operations on word vectors.
For Example,
Vector("King") - Vector("Man") + Vector("Woman") = Vector("Queen")
7. Skip-gram
Skip-gram is an unsupervised learning technique used to find the most related words for a given word.
Skip-gram is used to predict the context words for a given target word; it is the reverse of the CBOW
algorithm. Here, the target word is the input while the context words are the output. Since there is more
than one context word to predict, the problem is harder than in CBOW.
Advantages
Working steps
1. The words are converted into vectors using one-hot encoding. The dimension of these vectors
is [1, |v|].
2. The word w(t) is passed from the |v| input neurons to the hidden layer.
3. The hidden layer performs the dot product between the weight matrix W[|v|, N] and the input
vector w(t).
From this, we can conclude that the (t)th row of W[|v|, N] will be the hidden-layer output H[1, N] (see the sketch below).
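A small NumPy sketch of this row-selection effect (sizes are made up):

```python
import numpy as np

V, N = 6, 4                      # vocabulary size |v| and hidden size N
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))      # input weight matrix W[|v|, N]

t = 2                            # index of the input word w(t)
one_hot = np.zeros(V)
one_hot[t] = 1.0                 # one-hot vector of shape [1, |v|]

h = one_hot @ W                  # hidden layer H[1, N]
print(np.allclose(h, W[t]))      # True: the product simply selects row t of W
```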
Probability function
The output layer applies a softmax over the vocabulary: the hidden vector H[1, N] is multiplied by a second
weight matrix of shape [N, |v|], and the softmax of the resulting scores gives the probability of each
context word given the input word w(t).
8. Sentence Embeddings
Sentence embedding is the collective name for a set of techniques in natural language processing
(NLP) where sentences are mapped to vectors of real numbers.
Application
● Sentence embedding is used by deep learning software libraries such as PyTorch and TensorFlow.
● Popular embeddings are based on the hidden layer outputs of transformer models like BERT.
SIF: This approach introduces a model called smooth inverse frequency (SIF). It takes a weighted
average (weighted by a term related to the inverse document frequency) of the word embeddings
in a sentence and then removes the projection onto the first singular vector. This is derived from the
assumption that the sentence was generated by a random walk of a discourse vector on a
latent word-embedding space, with smoothing terms included for frequent words.
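A rough NumPy sketch of the SIF recipe under these assumptions (pre-trained word vectors and word frequencies are given; a is the usual smoothing parameter, here set to 1e-3; the vectors below are random placeholders):

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """Weighted average of word vectors, then remove the first singular vector."""
    total = sum(word_freq.values())
    emb = []
    for sent in sentences:
        vs = np.array([word_vectors[w] for w in sent if w in word_vectors])
        ws = np.array([a / (a + word_freq.get(w, 0) / total)
                       for w in sent if w in word_vectors])
        emb.append((ws[:, None] * vs).mean(axis=0))
    emb = np.array(emb)
    # Remove the projection onto the first singular vector (shared component).
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)

# Toy usage with random 5-dimensional word vectors (illustrative only).
rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "sat", "ran"]
word_vectors = {w: rng.normal(size=5) for w in vocab}
word_freq = {"the": 100, "cat": 5, "dog": 5, "sat": 2, "ran": 2}
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
print(sif_embeddings(sentences, word_vectors, word_freq).shape)  # (2, 5)
```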
9. Recurrent Neural Networks (RNN)
An RNN has a "memory" which remembers information about what has been calculated so far. It uses
the same parameters for each input, as it performs the same task on all inputs or hidden
layers to produce the output. This reduces the number of parameters, unlike other neural
networks.
How does an RNN work?
The working of an RNN can be understood with the help of the following example:
Example: Suppose there is a deep network with one input layer, three hidden layers, and one
output layer. Like other neural networks, each hidden layer would have its own set of weights
and biases: say (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer, and (w3, b3) for
the third hidden layer. This means that each of these layers is independent of the others, i.e., they
do not memorize the previous outputs.
● An RNN converts these independent activations into dependent activations by providing the same
weights and biases to all the layers, thus reducing the number of parameters and memorizing each
previous output by giving each output as input to the next hidden layer.
● Hence these three layers can be joined together, with the weights and biases of all the hidden
layers being the same, into a single recurrent layer.
Formula for calculating the current state:
ht = f(ht-1, xt)
where:
● ht -> current state
● ht-1 -> previous state
● xt -> input state
Formula for calculating the output:
Yt = Why · ht
where:
● Yt -> output
● Why -> weight at the output layer
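A minimal NumPy sketch of these computations, assuming tanh as the activation f and randomly initialized shared weights (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2

# The same weights are shared across every time step.
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_hy = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output (Why)

def rnn_step(x_t, h_prev):
    """ht = tanh(W_hh @ ht-1 + W_xh @ xt);  Yt = Why @ ht"""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))  # five time steps of toy input
for x_t in sequence:
    h, y = rnn_step(x_t, h)
print(h.shape, y.shape)  # (4,) (2,)
```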
Applications of RNN:
Time Series Prediction: Any time-series problem, such as predicting the prices of stocks in a
particular month, can be solved using an RNN.
Natural Language Processing: Text mining and sentiment analysis can be carried out using an RNN.
10. Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial
intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has
feedback connections. Such a recurrent neural network (RNN) can process not only single data
points (such as images) but also entire sequences of data (such as speech or video). For
example, LSTM is applicable to tasks such as unsegmented, connected handwriting
recognition, speech recognition, machine translation, robot control, video games, and
healthcare. LSTM has become the most cited neural network of the 20th century.
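A short usage sketch with PyTorch's built-in LSTM layer (shapes and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# A single-layer LSTM processing a batch of toy sequences.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)          # (batch=4, sequence length=10, features=8)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 10, 16]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 4, 16])  - final hidden state
print(c_n.shape)     # torch.Size([1, 4, 16])  - final cell state (the LSTM "memory")
```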
Applications of LSTM:
1. Language Modelling
2. Machine Translation
3. Image Captioning
4. Handwriting generation
5. Question Answering Chatbots