Unit 5 NLP
Topics covered: Introduction to Probabilistic Approaches, Statistical Approaches to NLP Tasks, Sequence Labeling,
Problems - Similarity Measures, Word Embeddings, CBOW, Skip-gram, Sentence Embeddings, Recurrent Neural
Networks (RNN), Long Short-Term Memory (LSTM)
A popular idea in computational linguistics is to create a probabilistic model of language. Such a model
assigns a probability to every sentence in English in such a way that more likely sentences (in some sense) get
higher probability. If you are unsure between two possible sentences, pick the higher probability one.
Comment: A "perfect" language model is only attainable with true intelligence. However, approximate
language models are often easy to create and good enough for many applications.
Some models:
● Unigram: words generated one at a time, drawn from a fixed distribution.
● Bigram: probability of a word depends on the previous word (a minimal sketch follows this list).
● Tag bigram: probability of part of speech depends on previous part of speech; probability of word
depends on part of speech.
● Maximum entropy: lots of other random features can contribute.
● Stochastic context free: words generated by a context-free grammar augmented with probabilistic
rewrite rules.
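To make the bigram idea concrete, here is a minimal sketch of a bigram model estimated by counting over a tiny made-up corpus (the corpus and the <s>, </s> boundary markers are assumptions for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> the dog barks </s>",
    "<s> the cat sleeps </s>",
    "<s> the dog sleeps </s>",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def bigram_prob(prev, cur):
    """P(cur | prev) estimated by relative frequency (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

def sentence_prob(sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    tokens = sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob("<s> the dog sleeps </s>"))  # seen bigrams -> higher probability
print(sentence_prob("<s> the cat barks </s>"))   # unseen bigram -> probability 0
```

With no smoothing, a single unseen bigram drives a sentence's probability to zero, which is why practical models add smoothing.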
Statistical NLP aims to perform statistical inference for the field of NLP. Statistical inference consists
of taking some data generated in accordance with some unknown probability distribution and making
inferences.
A central example of statistical NLP is language modeling: predicting the next word in a sequence given the
words that precede it. Statistical modeling helps to:
● Suggest auto-completes
● Recognize handwriting, with the help of lexical knowledge, even in poorly written text
● Detect and correct spelling errors
● Recognize speech
● Recognize multi-token named entities
● Caption images
● Summarize texts
● Detect primitive acoustic features
● Perform text categorization
These are general applications of statistical models and the fundamentals of statistical natural language processing.
3. Sequence Labeling
In machine learning, sequence labeling is a type of pattern recognition task that involves the
algorithmic assignment of a categorical label to each member of a sequence of observed values. A
common example of a sequence labeling task is part of speech tagging, which seeks to assign a part of
speech to each word in an input sentence or document. Sequence labeling can be treated as a set of
independent classification tasks, one per member of the sequence. However, accuracy is generally
improved by making the optimal label for a given element dependent on the choices of nearby elements,
using special algorithms to choose the globally best set of labels for the entire sequence at once.
Part-Of-Speech Tagging
Part-Of-Speech Tagging (POS Tagging) is a sequence labeling task. It is the process of marking up a
word in a text (corpus) as corresponding to a particular part of speech (syntactic tag), based on both its
definition and its context. POS tagging is a helper task for many other NLP tasks: Word Sense
Disambiguation, Dependency Parsing, etc.
A sequence is a series of tokens that are not independent of each other. Series in mathematics and
sentences in linguistics are both sequences, because in both of them the next token depends on the
previous ones (or vice versa).
Determining POS tags is much more complicated than simply mapping words to their tags. Consider the
word back (a quick tagger example follows these sentences):
● Earnings growth took a back/JJ seat.
● A small building in the back/NN.
● A clear majority of senators back/VBP the bill.
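Off-the-shelf taggers resolve such ambiguities from context. A quick usage sketch with the NLTK library, assuming NLTK and its tokenizer/tagger resources are installed:

```python
import nltk

# One-time resource downloads may be required, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Earnings growth took a back seat")
print(nltk.pos_tag(tokens))
# Prints a list of (token, tag) pairs; in this context "back" should be tagged
# as an adjective (JJ) rather than a noun (NN) or a verb (VBP).
```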
Word sense disambiguation: Given a dictionary that gives one or more senses for each word
(for example, a bank can be either a financial institution or the sloped ground next to a river), and given
a sentence, guess the sense of each word in the sentence.
Named entity detection: Given a sentence, identify all the proper names (Notre Dame, Apple, etc.)
and classify them as persons, organizations, places, etc. The typical way to set this up as a
sequence-labeling problem is called BIO tagging. Each word is labeled B (beginning) if it is the
first word in a named entity, I (inside) if it is a subsequent word in a named entity, and O (outside)
otherwise. Other encodings are possible as well.
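As a small, hypothetical illustration of BIO tagging (the sentence and entity types are made up for this example):

```python
# One possible BIO labeling of a toy sentence with ORG and LOC entities.
tagged = [
    ("Notre", "B-ORG"), ("Dame", "I-ORG"),
    ("beat", "O"),
    ("Navy", "B-ORG"),
    ("in", "O"),
    ("South", "B-LOC"), ("Bend", "I-LOC"),
    (".", "O"),
]
for token, tag in tagged:
    print(f"{token}\t{tag}")
```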
Word segmentation: Given a representation of a sentence without any word boundaries, reconstruct
the word boundaries. In some languages, like Chinese, words are written without any spaces in
between them. (Indeed, it can be difficult to settle on the definition of a “word” in such languages.)
This situation also occurs with any spoken language.
Sequence labeling can be done with various methods. While traditional models are based on corpus
statistics (Hidden Markov Models, Maximum Entropy Markov Models, Conditional Random Field, etc.),
recent models are based on neural networks (Recurrent Neural Networks, Long Short-Term Memory,
BERT, etc.).
Hidden Markov Models
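An HMM treats the tags as hidden states and the words as observations; decoding with the Viterbi algorithm finds the most probable tag sequence. Below is a minimal sketch with hand-picked toy probabilities (the numbers are assumptions, not corpus estimates):

```python
# Toy HMM for POS tagging: illustrative, hand-picked probabilities.
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.8, "barks": 0.2},
    "VB": {"dog": 0.1, "barks": 0.9},
}

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p][0] * trans_p[p][t])
            prob = V[-1][best_prev][0] * trans_p[best_prev][t] * emit_p[t].get(w, 0.0)
            col[t] = (prob, V[-1][best_prev][1] + [t])
        V.append(col)
    return max(V[-1].values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VB']
```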
4. Similarity Measures
In data science, a similarity measure is a way of measuring how closely data samples are related to
each other, while a dissimilarity measure tells how distinct the data objects are.
It is also used in classification (e.g. KNN), where the data objects are labeled based on the features’
similarity. Another example is when we talk about dissimilar outliers compared to other data samples
(e.g., anomaly detection).
The similarity measure is usually expressed as a numerical value: It gets higher when the data samples
are more alike. It is often expressed as a number between zero and one by conversion: zero means low
similarity (the data objects are dissimilar). One means high similarity (the data objects are very
similar).
Metric:
A given distance (i.e., dissimilarity) is said to be a metric if and only if it satisfies the following four
conditions:
1- Non-negativity: d(p, q) ≥ 0 for any two distinct observations p and q.
2- Symmetry: d(p, q) = d(q, p) for all p and q.
3- Triangle inequality: d(p, q) ≤ d(p, r) + d(r, q) for all p, q, r.
4- Identity of indiscernibles: d(p, q) = 0 only if p = q.
Distance measures are fundamental to classifiers such as the k-nearest neighbors (KNN) algorithm,
which measures the dissimilarity between given data samples.
Distance Functions:
The technique used to measure distances depends on the particular situation you are working on. For
instance, in some areas the Euclidean distance can be optimal and useful for computing distances. To
compute it, subtract the corresponding coordinates of the two vectors, square the differences, sum
them, and take the square root of the sum, as in the sketch below.
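A minimal sketch of that computation, assuming NumPy is available:

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```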
Cosine similarity measures the similarity between two vectors of an inner product space. It is measured
by the cosine of the angle between two vectors and determines whether two vectors are pointing in
roughly the same direction. It is often used to measure document similarity in text analysis.
The cosine similarity between two non-zero vectors A and B is cos(θ) = (A · B) / (||A|| ||B||).
How does cosine similarity work?
● For non-negative vectors (such as term-frequency vectors), cosine similarity is bounded between 0 and 1; in general, it ranges from -1 to 1.
● The measurement is the cosine of the angle between the two non-zero vectors A and B.
● If the angle between the two vectors is 90 degrees, the cosine similarity is 0; the two vectors are orthogonal (perpendicular) to each other.
● As the cosine similarity gets closer to 1, the angle between the two vectors A and B gets smaller (see the sketch below).
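A minimal sketch of the computation, assuming NumPy is available:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||) for non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
print(cosine_similarity([1, 2], [2, 4]))  # 1.0 (same direction)
```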
How is cosine similarity used?
1. Document Similarity
A scenario that involves identifying the similarity between pairs of documents is a good use case for
cosine similarity as a quantification of the similarity between two objects.
Quantification of the similarity between two documents can be obtained by converting the words or
phrases into a vectorized form of representation.
The vector representations of the documents can then be used within the cosine similarity formula to
obtain a quantification of similarity.
In the scenario described above, a cosine similarity of 1 implies that the two documents are exactly
alike, while a cosine similarity of 0 indicates that the two documents share no similarity.
2. Pose Matching
Pose matching involves comparing the poses containing critical points of joint locations.
Pose estimation is a computer vision task, and it’s typically solved using Deep Learning approaches
such as Convolutional Pose Machine, Stacked hourglass, PoseNet, etc.
Jaccard Similarity
Jaccard Similarity is the ratio of common words to total unique words, i.e., the intersection of the word
sets divided by their union across the two documents. Its score ranges between 0 and 1: 1 represents
high similarity and 0 represents no similarity. The formula for Jaccard similarity is J(A, B) = |A ∩ B| / |A ∪ B|.
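A minimal sketch over two made-up sentences, treating each document as a set of unique words:

```python
def jaccard_similarity(doc_a, doc_b):
    """|intersection| / |union| over the sets of unique words."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# 3 shared words out of 7 unique words overall -> about 0.43
print(jaccard_similarity("the cat sat on the mat", "the cat lay on the rug"))
```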
5. Word Embeddings
In natural language processing (NLP), word embedding is a term used for the representation of words for
text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that
the words that are closer in the vector space are expected to be similar in meaning.
Word embeddings can be obtained using a set of language modeling and feature learning techniques
where words or phrases from the vocabulary are mapped to vectors of real numbers.
Features: Anything that relates words to one another. Eg: Age, Sports, Fitness, Employed etc. Each word
vector has values corresponding to these features.
Goal of Word Embeddings
● To reduce dimensionality
● To use a word to predict the words around it
● Inter-word semantics must be captured
Let's take an example to understand how word vectors are generated: take emoticons that are most
frequently used in certain conditions, transform each emoji into a vector, and let the conditions be
our features.
[Table: each emoji (column) is scored against the conditions Happy, Sad, Excited, and Sick (rows); the column of scores for an emoji forms its word vector.]
1) Word2Vec:
In Word2Vec, every word is assigned a vector. We start with either a random vector or a one-hot vector.
One-hot vector: a representation where only one bit in the vector is 1. If there are 500 words in the corpus,
the vector length will be 500. After assigning a vector to each word, we take a window size and iterate
through the entire corpus. While doing this, two neural embedding methods are used:
CBOW (Continuous Bag of Words)
In this model, we try to fit the neighboring words in the window to the central word.
Skip Gram
In this model, we use the central word to predict the neighboring words; it is the complete opposite of
the CBOW model. It has been shown that this method produces more meaningful embeddings.
After applying the above neural embedding methods, we get trained vectors for each word after many
iterations through the corpus. These trained vectors preserve syntactic and semantic information and
have lower dimensionality than one-hot vectors. Vectors with similar meaning or semantic information
are placed close to each other in space.
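In practice, both methods are available off the shelf. A usage sketch, assuming the gensim library (version 4.x API) is installed; the corpus is a toy example:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (illustrative only).
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "sleeps", "while", "the", "fox", "runs"],
]

# sg=0 selects CBOW, sg=1 selects Skip-gram; window controls the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["fox"][:5])           # first few dimensions of the trained vector
print(model.wv.most_similar("dog"))  # nearest neighbours in the embedding space
```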
2) GloVe:
This is another method for creating word embeddings. In this method, we iterate through the corpus and
record the co-occurrence of each word with the other words, which gives a co-occurrence matrix. Words
that occur next to each other get a value of 1; if they are one word apart, 1/2; if two words apart, 1/3;
and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:
It is a nice evening.
Good evening!
Is it a nice evening?
The upper half of the matrix will be a reflection of the lower half. We can also use a window frame to
calculate the co-occurrences, shifting the frame until the end of the corpus. This helps gather
information about the context in which a word is used.
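To make the construction concrete, here is a small sketch that builds the weighted co-occurrence counts for the toy corpus above (the window size of 3 is an assumption for illustration):

```python
from collections import defaultdict

corpus = ["it is a nice evening", "good evening", "is it a nice evening"]
window = 3  # assumed window size

cooc = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            d = j - i                      # distance between the two words
            pair = tuple(sorted((w, tokens[j])))
            cooc[pair] += 1.0 / d          # adjacent -> 1, one apart -> 1/2, ...

for pair, value in sorted(cooc.items()):
    print(pair, round(value, 2))
```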
Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how
close they are to each other in space. If two words occur together more often (i.e., have a higher value
in the co-occurrence matrix) but their vectors are far apart in space, the vectors are brought closer to
each other. If their vectors are close to each other but the words are rarely used together, the vectors
are moved further apart in space.
After many iterations of the above process, we’ll get a vector space representation that approximates
the information from the co-occurrence matrix. The performance of GloVe is better than Word2Vec in
terms of both semantic and syntactic capturing.
6. CBOW
In the CBOW model, the distributed representations of context (or surrounding words) are combined
to predict the word in the middle. While in the Skip-gram model, the distributed representation of the
input word is used to predict the context.
CBOW is several times faster to train than skip-gram and gives slightly better accuracy for frequent words. The
CBOW model architecture tries to predict the current target word (the center word) based on the
source context words (surrounding words). Considering a simple sentence, “the quick brown fox
jumps over the lazy dog”, this can be pairs of (context_window, target_word) where if we consider
a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick),
([the, dog], lazy) and so on. Thus the model tries to predict the target_word based on the
context_window words.
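A small sketch that enumerates these (context_window, target_word) pairs for the sentence above (here the "window of size 2" is read as one word on each side, matching the examples):

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
half = 1  # one context word on each side of the target

pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - half):i] + sentence[i + 1:i + 1 + half]
    pairs.append((context, target))

print(pairs[1])  # (['the', 'brown'], 'quick')
print(pairs[2])  # (['quick', 'fox'], 'brown')
```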
1. These methods (CBOW and Skip-gram) are prediction-based: they assign probabilities to words.
2. They proved to be state of the art for tasks like word analogies and word similarities.
3. They were also able to perform algebraic operations such as King - Man + Woman = Queen, a result
that was considered almost magical.
The following model architectures are used for word representations, with the objective of maximizing
accuracy while minimizing computational complexity.
Word pairs are constructed by sliding the window over the text; the target word is the word for which we
want to find pairs. Here, we do not care how far apart two words within the window are: as long as words
are inside the window, we do not differentiate between words that are one word away or several.
Step-1: Initially, we assign a vector of random numbers to each word in the corpus.
Step-2: Then, we iterate through each word of the document, grab the vectors of the nearest n words on
either side of the target word, concatenate all these vectors, forward propagate the concatenated vector
through a linear layer + softmax function, and try to predict the target word.
Step-3: In this step, we compute the error between our estimate and the actual target word, backpropagate
the error, and update not only the weights of the linear layer but also the vectors (embeddings) of the
neighboring words.
Step-4: Finally, we extract the weights from the hidden layer and use these weights to encode the meaning
of the words in the vocabulary (a minimal sketch of these steps follows).
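A minimal sketch of Steps 1-4, assuming PyTorch; the vocabulary size, dimensions, and word ids are made up for illustration:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Concatenate the context embeddings, then predict the target word."""
    def __init__(self, vocab_size, embed_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)          # Step-1: random vectors
        self.linear = nn.Linear(context_size * embed_dim, vocab_size)  # linear layer (softmax in the loss)

    def forward(self, context_ids):
        e = self.embeddings(context_ids)   # (batch, context_size, embed_dim)
        h = e.view(e.size(0), -1)          # Step-2: concatenate the context vectors
        return self.linear(h)              # logits over the vocabulary

vocab_size, embed_dim, context_size = 20, 10, 4
model = CBOW(vocab_size, embed_dim, context_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()            # softmax + negative log-likelihood

# One toy training step on made-up word ids (Step-3: backpropagate the error).
context = torch.tensor([[2, 3, 5, 6]])     # ids of the n words on either side
target = torch.tensor([4])                 # id of the target word
loss = loss_fn(model(context), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()                           # updates the linear weights AND the embeddings

# Step-4: the embedding weights are the learned word vectors.
print(model.embeddings.weight.shape)       # torch.Size([20, 10])
```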
The Word2Vec model is not a single algorithm; it is composed of the following two techniques:
Continuous Bag of Words (CBOW) and Skip-gram.
Semantic Regularities: These regularities refer to the meaning of the vocabulary symbols arranged in
that structure.
It was found that the similarity of word representations produced by this technique goes beyond syntactic
regularities and works surprisingly well for algebraic operations on word vectors.
For Example,
Vector("King") - Vector("Man") + Vector("Woman") = Vector("Queen")
7. Skip-gram
Skip-gram is an unsupervised learning technique used to find the most related words for a given word.
Skip-gram is used to predict the context words for a given target word; it is the reverse of the CBOW
algorithm. Here, the target word is the input while the context words are the output. Since there is more
than one context word to predict, the problem is harder than in CBOW.
Advantages
Working steps
1. The words are converted into vectors using one-hot encoding. The dimension of these vectors
is [1, |v|].
2. The word w(t) is passed from the |v| input neurons to the hidden layer.
3. The hidden layer performs the dot product between the weight matrix W[|v|, N] and the input
vector w(t).
From this, we can conclude that the (t)th row of W[|v|, N] will be the hidden-layer output H[1, N] (see the sketch below).
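A small NumPy sketch of this row-selection effect (sizes are made up):

```python
import numpy as np

V, N = 6, 4                      # vocabulary size |v| and hidden size N
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))      # input weight matrix W[|v|, N]

t = 2                            # index of the input word w(t)
one_hot = np.zeros(V)
one_hot[t] = 1.0                 # one-hot vector of shape [1, |v|]

h = one_hot @ W                  # hidden layer H[1, N]
print(np.allclose(h, W[t]))      # True: the product simply selects row t of W
```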
Probability function
The output layer applies a softmax over the vocabulary: the hidden vector H[1, N] is multiplied by a second
weight matrix of shape [N, |v|], and the softmax of the resulting scores gives the probability of each
context word given the input word w(t).
8. Sentence Embeddings
Sentence embedding is the collective name for a set of techniques in natural language processing
(NLP) where sentences are mapped to vectors of real numbers.
Application
● Sentence embedding is used by deep learning software libraries such as PyTorch and TensorFlow.
● Popular embeddings are based on the hidden layer outputs of transformer models like BERT.
SIF: This approach introduces a model called smooth inverse frequency (SIF). It takes a weighted
average (weighted by a term related to the inverse document frequency) of the word embeddings
in a sentence and then removes the projection onto the first singular vector. This is derived from the
assumption that the sentence was generated by a random walk of a discourse vector on a
latent word-embedding space, with smoothing terms included for frequent words.
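A rough NumPy sketch of the SIF recipe under these assumptions (pre-trained word vectors and word frequencies are given; a is the usual smoothing parameter, here set to 1e-3; the vectors below are random placeholders):

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """Weighted average of word vectors, then remove the first singular vector."""
    total = sum(word_freq.values())
    emb = []
    for sent in sentences:
        vs = np.array([word_vectors[w] for w in sent if w in word_vectors])
        ws = np.array([a / (a + word_freq.get(w, 0) / total)
                       for w in sent if w in word_vectors])
        emb.append((ws[:, None] * vs).mean(axis=0))
    emb = np.array(emb)
    # Remove the projection onto the first singular vector (shared component).
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)

# Toy usage with random 5-dimensional word vectors (illustrative only).
rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "sat", "ran"]
word_vectors = {w: rng.normal(size=5) for w in vocab}
word_freq = {"the": 100, "cat": 5, "dog": 5, "sat": 2, "ran": 2}
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
print(sif_embeddings(sentences, word_vectors, word_freq).shape)  # (2, 5)
```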
9. Recurrent Neural Networks (RNN)
An RNN has a "memory" which remembers information about what has been calculated so far. It uses
the same parameters for each input, as it performs the same task on all inputs or hidden
layers to produce the output. This reduces the number of parameters, unlike other neural
networks.
How does an RNN work?
The working of an RNN can be understood with the help of the following example:
Example: Suppose there is a deep network with one input layer, three hidden layers, and one
output layer. Like other neural networks, each hidden layer would have its own set of weights
and biases: say (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer, and (w3, b3) for
the third hidden layer. This means that each of these layers is independent of the others, i.e., they
do not memorize the previous outputs.
● An RNN converts these independent activations into dependent activations by providing the same
weights and biases to all the layers, thus reducing the number of parameters and memorizing each
previous output by giving each output as input to the next hidden layer.
● Hence these three layers can be joined together, with the weights and biases of all the hidden
layers being the same, into a single recurrent layer.
Formula for calculating the current state:
ht = f(ht-1, xt)
where:
● ht -> current state
● ht-1 -> previous state
● xt -> input state
Formula for calculating the output:
Yt = Why · ht
where:
● Yt -> output
● Why -> weight at the output layer
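A minimal NumPy sketch of these computations, assuming tanh as the activation f and randomly initialized shared weights (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2

# The same weights are shared across every time step.
W_xh = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_hy = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output (Why)

def rnn_step(x_t, h_prev):
    """ht = tanh(W_hh @ ht-1 + W_xh @ xt);  Yt = Why @ ht"""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))  # five time steps of toy input
for x_t in sequence:
    h, y = rnn_step(x_t, h)
print(h.shape, y.shape)  # (4,) (2,)
```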
Applications of RNN:
Time Series Prediction: Any time-series problem, such as predicting the prices of stocks in a
particular month, can be solved using an RNN.
Natural Language Processing: Text mining and sentiment analysis can be carried out using an RNN.
10. Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is an artificial neural network used in the fields of artificial
intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has
feedback connections. Such a recurrent neural network (RNN) can process not only single data
points (such as images) but also entire sequences of data (such as speech or video). For
example, LSTM is applicable to tasks such as unsegmented, connected handwriting
recognition, speech recognition, machine translation, robot control, video games, and
healthcare. LSTM has become the most cited neural network of the 20th century.
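A short usage sketch with PyTorch's built-in LSTM layer (shapes and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# A single-layer LSTM processing a batch of toy sequences.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)          # (batch=4, sequence length=10, features=8)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 10, 16]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 4, 16])  - final hidden state
print(c_n.shape)     # torch.Size([1, 4, 16])  - final cell state (the LSTM "memory")
```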
Applications of LSTM:
1. Language Modelling
2. Machine Translation
3. Image Captioning
4. Handwriting generation
5. Question Answering Chatbots