
Unit-5

Topics covered: Introduction to Probabilistic Approaches, Statistical Approaches to NLP Tasks, Sequence Labeling,
Problems - Similarity Measures, Word Embeddings, CBOW, Skip-gram, Sentence Embeddings, Recurrent Neural
Networks (RNN), Long Short-Term Memory (LSTM)

1. Introduction to Probabilistic Approaches

A popular idea in computational linguistics is to create a probabilistic model of language. Such a model assigns a probability to every sentence in English in such a way that more likely sentences (in some sense) get higher probability. If you are unsure between two possible sentences, pick the one with higher probability.
Comment: A "perfect" language model is only attainable with true intelligence. However, approximate language models are often easy to create and good enough for many applications.

Some models:
● Unigram: words are generated one at a time, drawn from a fixed distribution.
● Bigram: the probability of a word depends on the previous word (a minimal sketch follows this list).
● Tag bigram: the probability of a part of speech depends on the previous part of speech; the probability of a word depends on its part of speech.
● Maximum entropy: many other arbitrary features can contribute.
● Stochastic context-free: words are generated by a context-free grammar augmented with probabilistic rewrite rules.
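As an illustration of the bigram model above, here is a minimal sketch in Python (the toy corpus is invented for illustration) that estimates bigram probabilities by relative frequency and scores candidate sentences; a real model would also need smoothing for unseen bigrams.

from collections import Counter

# Toy corpus (assumed for illustration); a real model needs far more data.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # P(word | prev) estimated by relative frequency (no smoothing).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("the cat sat on the rug"))   # small but non-zero probability
print(sentence_prob("rug the on sat cat the"))   # zero: contains unseen bigrams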

In spoken language use, we have a joint distribution

P(A, W, T, M)

over the acoustic signal A, the word sequence W, the syntactic structure (parse tree) T, and the meaning M.
Speech people have usually looked at P(W|A); the rest of the hidden structure is ignored.
NLP people are interested in the 'more hidden' structure, T and often M, while W is usually observed.
E.g., there is much work on the parsing problem P(T|W). Language generation is P(W|M).

Probability models

Building a probability model consists of two steps:

1. Defining the model

2. Estimating the model's parameters (= training/learning)

Probability models (almost) always make independence assumptions.

— Even though X and Y are not actually independent, our model may treat them as independent.
— This can drastically reduce the number of parameters to estimate. For example, with a vocabulary of 10,000 words, a full joint model over 10-word sentences has on the order of 10,000^10 parameters, while a bigram model has only about 10,000^2 = 10^8.
— Models without independence assumptions have (way) too many parameters to estimate reliably from the data we have.
— But since independence assumptions are often incorrect, models that make them are often incorrect as well: they assign probability mass to events that cannot occur.

2. Statistical Approaches to NLP Tasks

Statistical NLP aims to perform statistical inference for the field of NLP. Statistical inference consists of taking some data generated in accordance with some unknown probability distribution and making inferences about that distribution.

Motivations for Statistical NLP


a. Cognitive modeling of human language processing has not reached a stage where we have a complete mapping between the language signal and the information content.
b. A complete mapping is not always required.
c. The statistical approach provides the flexibility required for making the modeling of a language more accurate.

Idea behind Statistical NLP

a. View language processing as information transmission over a noisy channel.
b. The approach requires a model that characterizes the transmission by giving, for every message, the probability of the observed output.
Statistical Modeling and Classification
● Primitive acoustic features
● Quantization
● Maximum likelihood and related rules
● Class conditional density function
● Hidden Markov Model Methodology

Statistical language modeling is the process of predicting the next word in a sequence given the words that precede it. Statistical modeling helps to:
● Suggest auto-completes
● Recognize handwriting, even in poorly written text, with the help of lexical acquisition
● Detect and correct spelling errors
● Recognize speech
● Recognize multi-token named entities
● Caption images
● Summarize texts
● Detect primitive acoustic features
● Perform text categorization

Statistical models are used in NLP for two reasons:


● To make algorithms for processing language able to learn from observations of language (and other contextual clues). This is called machine learning. There is an alternative, expert systems, but it does not scale: it is not feasible for engineers to write down all of the "rules" for "understanding" text.
● Natural language relies on referential and prototypical context to disambiguate its precise meaning. Compared with rule-based systems, statistical models are better suited to this kind of situation-dependent inference and corpus-based work.

General applications of the statistical models and fundamentals of statistical natural language processing:

● Spatial models: A co-variation of properties within geographic space.


● Time-series: Frequency-domain methods and time-domain methods.
● Survival analysis: Analysis of the expected duration of time until one or more events happen, such as death in biological organisms or failure in mechanical systems.
● Market segmentation: Dividing a broad market into groups of customers with similar characteristics.
● Recommendation systems: The filtering system predicts the ‘rating’ or ‘preference’ that a user
would give to an item.
● Association Rule Learning: A method for discovering interesting relations between variables
in large databases.
● Attribution Modeling: A rule that determines how credit for sales and conversions is assigned to
touch points in conversion paths.
● Scoring: Statistics processing to predict the outcome and assign it a corresponding score.

3. Sequence Labeling

In machine learning, sequence labeling is a type of pattern recognition task that involves the
algorithmic assignment of a categorical label to each member of a sequence of observed values. A
common example of a sequence labeling task is part of speech tagging, which seeks to assign a part of
speech to each word in an input sentence or document. Sequence labeling can be treated as a set of
independent classification tasks, one per member of the sequence. However, accuracy is generally
improved by making the optimal label for a given element dependent on the choices of nearby elements,
using special algorithms to choose the globally best set of labels for the entire sequence at once.

Words and Their Roles


Words are the sequential building blocks of sentences. Each word contributes syntactic and semantic properties to a sentence. For example, a word can be an adjective that contributes positive semantics (delicate), and that adjective can then describe a noun (a tender boy). This sequential relationship can recurse without bound, which shows that words are related to each other.

Part-Of-Speech Tagging
Part-Of-Speech Tagging (POS Tagging) is a sequence labeling task. It is the process of marking up a
word in a text (corpus) as corresponding to a particular part of speech (syntactic tag), based on both its
definition and its context. POS Tagging is a helper task for many other NLP tasks: Word Sense Disambiguation, Dependency Parsing, etc.
A sequence is a series of tokens in which the tokens are not independent of each other. Series in mathematics and sentences in linguistics are both sequences, because in both the next token depends on the previous ones (or vice versa).
Determining POS tags is much more complicated than simply mapping words to their tags. Consider the word back:
● Earnings growth took a back/JJ seat.
● A small building in the back/NN.
● A clear majority of senators back/VBP the bill.

Word sense disambiguation: Given a dictionary that gives one or more word senses for each word (for example, a bank can be either a financial institution or the sloped ground next to a river), and given a sentence, guess the sense of each word in the sentence.

Named entity detection: Given a sentence, identify all the proper names (Notre Dame, Apple, etc.) and classify them as persons, organizations, places, etc. The typical way to set this up as a sequence-labeling problem is called BIO tagging. Each word is labeled B (beginning) if it is the first word in a named entity, I (inside) if it is a subsequent word in a named entity, and O (outside) otherwise; for example, Notre/B Dame/I beat/O Purdue/B ./O. Other encodings are possible as well.

Word segmentation: Given a representation of a sentence without any word boundaries, reconstruct the word boundaries. In some languages, like Chinese, words are written without any spaces between them. (Indeed, it can be difficult to settle on the definition of a "word" in such languages.) This situation also occurs with any spoken language.

Sequence Labeling Models

Sequence labeling can be done with various methods. While traditional models are based on corpus
statistics (Hidden Markov Models, Maximum Entropy Markov Models, Conditional Random Field, etc.),
recent models are based on neural networks (Recurrent Neural Networks, Long Short-Term Memory,
BERT, etc.).
Hidden Markov Models
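As a concrete illustration of HMM-based sequence labeling, here is a minimal Python sketch of Viterbi decoding for POS tagging; the tag set, transition and emission probabilities below are made-up toy values, not estimates from a real corpus.

import math

# Toy HMM for POS tagging (all probabilities are invented illustrative values).
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {
    "DT": {"DT": 0.05, "NN": 0.90, "VB": 0.05},
    "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
    "VB": {"DT": 0.50, "NN": 0.40, "VB": 0.10},
}
emit = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.8, "barks": 0.1},
    "VB": {"dog": 0.1, "barks": 0.8},
}

def viterbi(words):
    # delta[i][t] = log-probability of the best tag sequence ending in tag t at position i
    delta = [{t: math.log(start[t] + 1e-12) + math.log(emit[t].get(words[0], 0) + 1e-12)
              for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[i - 1][p] + math.log(trans[p][t] + 1e-12))
            delta[i][t] = (delta[i - 1][best_prev]
                           + math.log(trans[best_prev][t] + 1e-12)
                           + math.log(emit[t].get(words[i], 0) + 1e-12))
            back[i][t] = best_prev
    # Trace back the globally best label sequence for the whole sentence.
    last = max(tags, key=lambda t: delta[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # expected: ['DT', 'NN', 'VB']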

4. Problems - Similarity Measures

In data science, a similarity measure is a way of measuring how related, or close, data samples are to each other, while a dissimilarity measure tells how distinct the data objects are.

Similarity is also used in classification (e.g. KNN), where data objects are labeled based on the similarity of their features. Another example is when we talk about outliers that are dissimilar from the other data samples (e.g., anomaly detection).

The similarity measure is usually expressed as a numerical value: it gets higher when the data samples are more alike. It is often normalized to a number between zero and one, where zero means low similarity (the data objects are dissimilar) and one means high similarity (the data objects are very similar).

Metric:
A given distance (e.g. dissimilarity) is meant to be a metric if and only if it satisfies the following four
conditions:
1- Non-negativity: d(p, q) ≥ 0, for any two observations p and q.
2- Symmetry: d(p, q) = d(q, p) for all p and q.
3- Triangle inequality: d(p, q) ≤ d(p, r) + d(r, q) for all p, q, r.
4- Identity: d(p, q) = 0 if and only if p = q.

Distance measures are the fundamental principle behind classifiers like the k-nearest neighbors algorithm, which measures the dissimilarity between given data samples.
Distance Functions:

The technique used to measure distances depends on the particular situation you are working on. For instance, in some areas the Euclidean distance can be optimal and useful for computing distances. To compute it, you subtract the corresponding coordinates of the two vectors, square the differences, add them up, and take the square root of the sum.
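A minimal NumPy sketch of that computation (the example vectors are arbitrary):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Subtract, square, sum, then take the square root.
euclidean = np.sqrt(np.sum((p - q) ** 2))
print(euclidean)  # 5.0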

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between the two vectors and determines whether they point in roughly the same direction. It is often used to measure document similarity in text analysis.
The cosine similarity between two non-zero vectors A and B is:

cos(θ) = (A · B) / (||A|| ||B||)

How does cosine similarity work?
● For non-negative vectors (such as term-count vectors), cosine similarity is bounded between 0 and 1.
● The measurement is the cosine of the angle between the two non-zero vectors A and B.
● Suppose the angle between the two vectors is 90 degrees. In that case, the cosine similarity is 0; this means that the two vectors are orthogonal (perpendicular) to each other.
● As the cosine similarity gets closer to 1, the angle between the two vectors A and B gets smaller.
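A minimal NumPy sketch of the formula above (the example vectors are arbitrary):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0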
How is cosine similarity used?
1. Document Similarity

A scenario that involves identifying the similarity between pairs of documents is a good use case for cosine similarity as a quantification of the similarity between two objects.

Quantification of the similarity between two documents can be obtained by converting the words or
phrases into a vectorized form of representation.
The vector representations of the documents can then be used within the cosine similarity formula to
obtain a quantification of similarity.

In the scenario described above, a cosine similarity of 1 implies that the two documents are exactly alike (in their vector representations), while a cosine similarity of 0 means there are no similarities between the two documents.
2. Pose Matching

Pose matching involves comparing the poses containing critical points of joint locations.
Pose estimation is a computer vision task, and it’s typically solved using Deep Learning approaches
such as Convolutional Pose Machine, Stacked hourglass, PoseNet, etc.
Jaccard Similarity

Jaccard similarity is the ratio of common words to total unique words, i.e. the intersection of the word sets of the two documents divided by their union. Its score ranges between 0 and 1, where 1 represents high similarity and 0 represents no similarity. The formula for Jaccard similarity is:

J(A, B) = |A ∩ B| / |A ∪ B|
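A minimal set-based sketch of this measure (the example sentences are arbitrary):

def jaccard_similarity(doc1, doc2):
    # Treat each document as a set of unique lowercase tokens.
    a, b = set(doc1.lower().split()), set(doc2.lower().split())
    return len(a & b) / len(a | b)

print(jaccard_similarity("the cat sat on the mat",
                         "the cat lay on the rug"))  # 3 common / 7 unique ≈ 0.43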

5. Word Embeddings

In natural language processing (NLP), word embedding is a term used for the representation of words for
text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that
the words that are closer in the vector space are expected to be similar in meaning.
Word embeddings can be obtained using a set of language modeling and feature learning techniques
where words or phrases from the vocabulary are mapped to vectors of real numbers.
Features: Anything that relates words to one another. Eg: Age, Sports, Fitness, Employed etc. Each word
vector has values corresponding to these features.
Goal of Word Embeddings
● To reduce dimensionality
● To use a word to predict the words around it
● Inter-word semantics must be captured

How are Word Embeddings used?


They are used as input to machine learning models:
take the words -> give their numeric representation -> use in training or inference.
They are also used to represent or visualize underlying patterns of usage in the corpus that was used to train them.

Implementations of Word Embeddings:


Word embeddings are a method of extracting features out of text so that we can feed those features into a machine learning model that works with text data. They try to preserve syntactic and semantic information. Methods such as Bag of Words (BOW), CountVectorizer and TF-IDF rely on the word count in a sentence but do not preserve any syntactic or semantic information. In these algorithms, the size of the vector is the number of elements in the vocabulary, and the resulting vectors are sparse since most of the elements are zero. Large input vectors mean a huge number of weights, which results in high computation for training. Word embeddings give a solution to these problems.

Let's take an example to understand how word vectors are generated. Take emoticons that are most frequently used in certain conditions, transform each emoji into a vector, and let the conditions be our features.

[Table: each emoji is marked against the features Happy, Sad, Excited and Sick.]

The vector for each emoji, in the order [happy, sad, excited, sick], will be:

???? = [1, 0, 1, 0]
???? = [0, 1, 0, 1]
???? = [0, 0, 1, 1]
.....
Two different approaches to get Word Embeddings:

1) Word2Vec:
In Word2Vec every word is assigned a vector. We start with either a random vector or a one-hot vector.
One-hot vector: a representation where only one bit in the vector is 1. If there are 500 words in the corpus, then the vector length will be 500. After assigning vectors to each word we take a window size and iterate through the entire corpus. While we do this, there are two neural embedding methods that are used:

Continuous Bag of Words (CBOW)

In this model we try to predict the central word from the neighboring words in the window.

Skip Gram

In this model, we try to predict the neighboring words from the central word, i.e. to make the central word's vector close to those of its neighbors. It is the complete opposite of the CBOW model, and it has been shown to produce more meaningful embeddings.

After applying the above neural embedding methods we get trained vectors for each word after many iterations through the corpus. These trained vectors preserve syntactic and semantic information and have lower dimensionality. Vectors with similar meaning or semantic information are placed close to each other in space.

2) GloVe:
This is another method for creating word embeddings. In this method, we take the corpus and iterate
through it and get the co-occurrence of each word with other words in the corpus. We get a co-
occurrence matrix through this. The words which occur next to each other get a value of 1, if they are
one word apart then 1/2, if two words apart then 1/3 and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:

Corpus:
It is a nice evening. Good evening!
Is it a nice evening?

            it         is         a          nice     evening   good
it          0
is          1+1        0
a           1/2+1      1+1/2      0
nice        1/3+1/2    1/2+1/3    1+1        0
evening     1/4+1/3    1/3+1/4    1/2+1/2    1+1      0
good        0          0          0          0        1         0

The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as
well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather
information about the context in which the word is used.
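A minimal sketch of building such a weighted co-occurrence table for the toy corpus above, assuming whole sentences as the window and 1/distance weighting as described:

from collections import defaultdict

# Toy corpus from the example above, lowercased and tokenized.
sentences = [
    ["it", "is", "a", "nice", "evening"],
    ["good", "evening"],
    ["is", "it", "a", "nice", "evening"],
]

# Co-occurrence weighted by 1/distance, accumulated within each sentence.
cooc = defaultdict(float)
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(i + 1, len(sent)):
            pair = tuple(sorted((w, sent[j])))
            cooc[pair] += 1.0 / (j - i)

print(cooc[("is", "it")])         # 1 + 1 = 2.0
print(cooc[("a", "it")])          # 1/2 + 1 = 1.5
print(cooc[("evening", "good")])  # 1.0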
Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how close they are to each other in space. If two words occur together more often (i.e. have a higher value in the co-occurrence matrix) but are far apart in space, they are brought closer to each other. If they are close to each other in space but rarely used together, they are moved further apart.
After many iterations of the above process, we’ll get a vector space representation that approximates
the information from the co-occurrence matrix. The performance of GloVe is better than Word2Vec in
terms of both semantic and syntactic capturing.

Pre-trained Word Embedding Models:


People generally use pre-trained models for word embeddings. Few of them are:
● SpaCy
● fastText
● Flair etc.
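For instance, pre-trained vectors can be loaded through the gensim library's downloader (gensim is used here only as one convenient option and is not in the list above; the model name "glove-wiki-gigaword-100" is one commonly available choice):

import gensim.downloader as api

# Downloads (once) and loads 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
vectors = api.load("glove-wiki-gigaword-100")

print(vectors.most_similar("language", topn=3))  # nearest neighbours in embedding space
print(vectors.similarity("cat", "dog"))          # cosine similarity of the two word vectors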

Common Errors made:


● You need to use the exact same pipeline when deploying your model as was used to create the training data for the word embedding. If you use a different tokenizer or a different method of handling white space, punctuation etc., you might end up with incompatible inputs.
● Words in your input that don't have a pre-trained vector are known as Out-of-Vocabulary (OOV) words. What you can do is replace those words with "UNK", meaning unknown, and then handle them separately.
● Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same dimensions throughout.

Benefits of using Word Embeddings:


● It is much faster to train than hand-built models like WordNet (which uses graph embeddings)
● Almost all modern NLP applications start with an embedding layer
● It stores an approximation of meaning

Drawbacks of Word Embeddings:


● It can be memory intensive
● It is corpus dependent: any underlying bias will have an effect on your model
● It cannot distinguish between homophones, e.g. brake/break, cell/sell, weather/whether

6. CBOW
In the CBOW model, the distributed representations of context (or surrounding words) are combined
to predict the word in the middle. While in the Skip-gram model, the distributed representation of the
input word is used to predict the context.

CBOW is several times faster to train than skip-gram, with slightly better accuracy for frequent words. The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words). Considering a simple sentence, "the quick brown fox
jumps over the lazy dog”, this can be pairs of (context_window, target_word) where if we consider
a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick),
([the, dog], lazy) and so on. Thus the model tries to predict the target_word based on the
context_window words.

Single context word


Word Embedding
Word embedding is a way of representing words as vectors. The main goal of word embedding is to convert the high-dimensional feature space of words into low-dimensional feature vectors while preserving the contextual similarity in the corpus.
These models are widely used for all NLP problems. A model first generates a vocabulary from a training corpus and then learns the word embedding representations. In simple words, these models take a text corpus as input and produce word vectors as output.
The vectors can be used as features for a machine learning model, to measure text similarity using cosine similarity, and for word clustering and text classification.

Prediction-based Word Embedding


So far, we have discussed deterministic methods to determine vector representations of words, but these methods proved to be limited in their word representations until the word embedding technique named word2vec came to the NLP community.
The popular pre-trained models used to create word embeddings of a text are as follows:
● Word2Vec — From Google
● fastText — From Facebook
● GloVe — From Stanford

Why these models are called Prediction-based?

1. These methods are called prediction-based because they assign probabilities to words.
2. They proved to be state of the art for tasks like word analogies and word similarities.
3. They were also able to perform algebraic operations such as King - Man + Woman = Queen, which was considered an almost magical result.

Different Model Architectures for Word representation

The following model architectures are used for word representations with an objective to maximize
the accuracy and minimize the computation complexity:

● FeedForward Neural Net Language Model (NNLM)


● Recurrent Neural Net Language Model (RNNLM)
For training of the above-mentioned models, we use Stochastic gradient descent as an optimizer
and backpropagation.

FeedForward Neural Net Language Model (NNLM)


This model consists of the following layers:
Input layer, Projection layer, Hidden layer, and Output layer.
This architecture becomes complex for computation between the projection and the hidden layer, as
values in the projection layer are dense.

Problem with these models


These models perform well on huge word datasets, but their main problem is computational complexity. So, to overcome this, Word2Vec uses the CBOW and Skip-gram architectures in order to maximize accuracy and minimize computational complexity.

What is Word2Vec Model?


The Word2Vec model is used for learning word representations in vector space; it was developed by Tomas Mikolov and a research team at Google in 2013. It is a neural network model that learns word embeddings from a text corpus.
These models work using context: to learn an embedding, the model looks at nearby words; if a group of words is always found close to the same words, they will end up having similar embeddings.
To define which words are similar or close to each other, we first fix the window size, which determines which nearby words we want to pick.
For example, a window size of 2 implies that for every word, we'll pick the 2 words before and the 2 words after it. Let's see the following example:

Sentence: the pink horse is eating

For each target word, we pair it with every other word inside its window; the sketch below lists the word pairs constructed with this method for the sentence above. Here, we don't care how far apart the words in the window are: as long as words are inside the window, we don't differentiate between words that are 1 word away or more.
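A minimal sketch that generates these (target, context) pairs for the example sentence, assuming a window size of 2:

def context_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Every word within `window` positions of the target is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the pink horse is eating".split()
for target, context in context_pairs(sentence):
    print(target, "->", context)
# e.g. the -> pink, the -> horse, pink -> the, pink -> horse, pink -> is, ...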

The General Flow of the Algorithm

Step-1: Initially, we assign a vector of random numbers to each word in the corpus.
Step-2: Then we iterate through each word of the document, grab the vectors of the nearest n words on either side of our target word, concatenate these vectors, forward propagate them through a linear layer + softmax function, and try to predict what our target word was.
Step-3: We compute the error between our estimate and the actual target word, then back-propagate the error and modify not only the weights of the linear layer but also the vectors (embeddings) of the neighboring words.
Step-4: Finally, we extract the weights from the hidden layer and use these weights to encode the meaning of the words in the vocabulary.
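A compact sketch of this training loop in Python using PyTorch (the toy sentence, window size and embedding dimension are arbitrary choices; the context vectors are averaged before the linear + softmax step, a common variant of CBOW):

import torch
import torch.nn as nn

# Toy corpus and vocabulary (illustrative only).
tokens = "the pink horse is eating the green grass".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
window, emb_dim = 2, 16

# (context indices, target index) training pairs.
data = []
for i, target in enumerate(tokens):
    ctx = [vocab[tokens[j]] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
    data.append((torch.tensor(ctx), torch.tensor(vocab[target])))

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # the word vectors we want to learn
        self.out = nn.Linear(emb_dim, vocab_size)      # linear layer before the softmax

    def forward(self, context):
        return self.out(self.emb(context).mean(dim=0))  # average context vectors, then project

model = CBOW(len(vocab), emb_dim)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally

for epoch in range(100):
    for context, target in data:
        opt.zero_grad()
        loss = loss_fn(model(context).unsqueeze(0), target.unsqueeze(0))
        loss.backward()   # gradients flow to both the linear layer and the embeddings
        opt.step()

word_vectors = model.emb.weight.data  # one learned vector per vocabulary word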

Word2Vec model is not a single algorithm but is composed of the following two preprocessing modules
or techniques:

● Continuous Bag of Words (CBOW)


● Skip-Gram.
Both of the mentioned models are basically shallow neural networks that map word(s) to a target variable which is also a word (or words). These techniques learn weights that act as word vector representations. Both techniques can be used to implement word embeddings with word2vec. Before diving deeper into the two techniques of Word2Vec, let's first try to understand the question below:

Why was the Word2Vec technique created?


As we know, most NLP systems treat words as atomic units. Existing systems with the same purpose as word2vec have the disadvantage that there is no notion of similarity between words, and they work only on smaller, simpler data, typically no more than a few billion tokens.
So, in order to train with larger datasets and more complex models, these techniques use neural network architectures that scale to huge datasets with billions of words and vocabularies of millions of words. The quality of the resulting vector representations is measured by the expectation that similar words tend to be close to each other and that words can have multiple degrees of similarity.
Syntactic regularities: These regularities refer to grammatical relationships between word forms (for example, singular/plural or verb tense).

Semantic Regularities: These regularities refer to the meaning of the vocabulary symbols arranged in
that structure.
It was found that the similarity of word representations learned by the proposed technique goes beyond syntactic regularities and works surprisingly well for algebraic operations on word vectors.
For example,

Vector("King") - Vector("Man") + Vector("Woman") ≈ Vector("Queen")

where "Queen" is the closest resulting word vector.

The two newly proposed models in Word2Vec, i.e. CBOW and Skip-Gram, use distributed architectures that try to minimize computational complexity.
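Assuming pre-trained vectors loaded via gensim as in the earlier sketch, the analogy can be reproduced like this:

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)]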

7. Skip-gram
Skip-gram is one of the unsupervised learning techniques used to find the most related words for a given word. Skip-gram is used to predict the context words for a given target word; it is the reverse of the CBOW algorithm. Here, the target word is the input while the context words are the output. Since there is more than one context word to be predicted, this problem is harder.
Advantages

1. It is unsupervised, hence it can work on any raw text given.
2. It requires less memory compared with other word-to-vector representations.
3. It requires two weight matrices, of dimensions [|v|, N] and [N, |v|], instead of one of dimension [|v|, |v|]. Usually N is around 300 while |v| is in the millions, so we can see the advantage of using this algorithm.
Disadvantages

1. Finding the best values for N (the embedding size) and c (the context window size) is difficult.
2. The softmax function is computationally expensive.
3. The time required for training this algorithm is high.

Working steps
1. The words are converted into a vector using one hot encoding. The dimension of these vectors
is [1,|v|].

2. The word w(t) is passed to the hidden layer from |v| neurons.

3. Hidden layer performs the dot product between weight vector W[|v|, N] and the input
vector w(t).
In this, we can conclude that the (t)th row of W[|v|, N] will be the output(H[1, N]).

4. Remember there is no activation function used at the hidden layer so the


H[1,k]will be passed directly to the output layer.
5. Output layer will apply dot product between H[1, N] and W’[N, |v|] and will give
us the vector U.
6. Now, to find the probability of each vector we’ll use the softmax function. As each
iteration gives output vector U which is of one hot encoding type.
7. The word with the highest probability is the result and if the predicted word for a
given context position is wrong then we’ll use back propagation to modify our
weight vectors W and W’.
This steps will be executed for each word w(t) present in vocabulary. And each word w(t) will
be passed k times. So, we can see that forward propagation will be processed |v|*k times in
each epoch.

Probability function
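The probability function used by skip-gram is the standard softmax over the output scores U: the probability of a context (output) word w_O given the input word w(t) is

P(w_O | w(t)) = exp(u_{w_O} · v_{w(t)}) / Σ_{w=1..|v|} exp(u_w · v_{w(t)})

where v_{w(t)} is the row of W[|v|, N] for the input word and u_w is the column of W'[N, |v|] for word w. The sum over all |v| words is what makes the softmax computationally expensive, as noted in the disadvantages above.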

8. Sentence Embeddings
Sentence embedding is the collective name for a set of techniques in natural language processing (NLP) where sentences are mapped to vectors of real numbers.

Application

● Sentence embedding is used by the deep learning software libraries PyTorch and TensorFlow.
● Popular embeddings are based on the hidden-layer outputs of transformer models like BERT (see the sketch below).
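As an illustration of transformer-based sentence embeddings, here is a minimal sketch using the sentence-transformers library; the model name "all-MiniLM-L6-v2" is simply one commonly used pre-trained choice, and the sentences are arbitrary examples.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small pre-trained transformer encoder

sentences = ["The cat sits on the mat.", "A cat is resting on a rug."]
emb = model.encode(sentences)  # one dense vector per sentence

# Cosine similarity between the two sentence vectors.
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(cos)  # close to 1 for sentences with similar meaning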

Evaluated sentence embedding models


Sentence embedding models: This part introduces the sentence embedding models under study and the evaluation setup.

Introduction to sentence embedding


A crucial component in most natural language processing (NLP) applications is finding an
expressive representation for text. Modern methods are typically based on sentence embeddings
that map a sentence onto a numerical vector. The vector attempts to capture the semantic content
of the text.
If two sentences express a similar idea using different words, their representations (embedding
vectors) should still be similar to each other.
Several methods for constructing these embeddings have been proposed in the literature.
Interestingly, learned embeddings tend to generalize quite well to other material and NLP tasks besides the ones they were trained on. This is fortunate, because it allows us to use pre-trained models and avoid expensive training. (A single training run on a modern, large-scale NLP model can cost up to tens of thousands of dollars.)
Sentence embeddings are applied in almost all NLP application areas. In information retrieval they are used for comparing the meanings of text snippets; machine translation uses sentence embeddings as an "intermediate language" when translating between two human languages; and many classification and tagging applications are based on embeddings. With better representations it is possible to build applications that react more naturally to the sentiment and topic of written text.

Evaluated sentence embedding models


Bag-of-words (BoW): TF-IDF. TF-IDF (term frequency-inverse document frequency) vectors serve as a baseline. A TF-IDF vector is a sparse vector with one dimension per unique word in the vocabulary. The value of an element is the occurrence count of the corresponding word in the sentence, multiplied by a factor that is inversely proportional to the overall frequency of that word in the whole corpus. The latter factor is meant to diminish the effect of very common words (but, and, therefore, ...), which are unlikely to tell much about the actual content of the sentence. The vectors are L2-normalized to reduce the effect of differing sentence lengths.
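A minimal sketch of this baseline using scikit-learn (the example sentences are arbitrary; TfidfVectorizer L2-normalizes its output by default):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the cat sits on the mat",
    "a cat is resting on a rug",
    "stock prices fell sharply today",
]

vectorizer = TfidfVectorizer()           # norm='l2' by default
X = vectorizer.fit_transform(sentences)  # sparse matrix: one row per sentence

print(cosine_similarity(X[0], X[1]))  # higher: the sentences share words ("cat", "on")
print(cosine_similarity(X[0], X[2]))  # zero: no words in common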

Pooled word embeddings: A word embedding is a vector representation of a word. Words that often occur in similar contexts (a horse and a pony) are assigned vectors that are close to each other, and words that rarely occur in similar contexts (a horse and a monocle) are assigned dissimilar vectors. The embedding vectors are dense, relatively low-dimensional (typically 50-300 dimensions) vectors.

SIF (smooth inverse frequency). This model proposes taking a weighted average (weighted by a term related to the inverse word frequency) of the word embeddings in a sentence and then removing the projection onto the first singular vector. This is derived from an assumption that the sentence has been generated by a random walk of a discourse vector on a latent word embedding space, with smoothing terms for frequent words.
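A rough NumPy sketch of the SIF recipe described above (the word vectors and frequencies below are random/invented placeholders; a is the smoothing constant, commonly around 1e-3):

import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    # sentences: list of token lists; word_vecs: dict word -> vector;
    # word_freq: dict word -> relative frequency in a large corpus.
    emb = []
    for sent in sentences:
        words = [w for w in sent if w in word_vecs]
        # Weighted average: frequent words get small weights a / (a + p(w)).
        weights = np.array([a / (a + word_freq.get(w, 1e-6)) for w in words])
        emb.append(np.average([word_vecs[w] for w in words], axis=0, weights=weights))
    emb = np.vstack(emb)

    # Remove the projection onto the first singular vector (the common component).
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran", "a"]
word_vecs = {w: rng.normal(size=20) for w in vocab}
word_freq = {"the": 0.05, "a": 0.04, "cat": 0.001, "dog": 0.001, "sat": 0.0005, "ran": 0.0005}

sents = [["the", "cat", "sat"], ["a", "dog", "ran"], ["the", "dog", "sat"]]
print(sif_embeddings(sents, word_vecs, word_freq).shape)  # (3, 20)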

9. Recurrent Neural Networks (RNN)


Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases where we need to predict the next word of a sentence, the previous words are required, and hence there is a need to remember them. Thus RNNs came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about the sequence.

An RNN has a "memory" which remembers information about what has been calculated so far. It uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the number of parameters compared with other neural networks.
How does an RNN work?
The working of an RNN can be understood with the help of the example below.

Example: Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Then, like other neural networks, each hidden layer will have its own set of weights and biases; say the weights and biases are (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer. This means that each of these layers is independent of the others, i.e. they do not memorize the previous outputs.

● The RNN converts the independent activations into dependent activations by providing the same weights and biases to all the layers, thus reducing the number of parameters and memorizing each previous output by giving it as input to the next hidden layer.
● Hence these three layers can be joined together such that the weights and biases of all the hidden layers are the same, forming a single recurrent layer.
Formula for calculating the current state:

h_t = f(h_(t-1), x_t)

where:
● h_t -> current state
● h_(t-1) -> previous state
● x_t -> input state

Formula for applying the activation function (tanh):

h_t = tanh(W_hh · h_(t-1) + W_xh · x_t)

where:
● W_hh -> weight at the recurrent neuron
● W_xh -> weight at the input neuron

Formula for calculating the output:

y_t = W_hy · h_t

where:
● y_t -> output
● W_hy -> weight at the output layer
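A minimal NumPy sketch of a forward pass through these equations (layer sizes and random weights are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 8, 16, 4

# Parameters shared across all time steps.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))

def rnn_forward(inputs):
    h = np.zeros(hidden_size)               # initial hidden state
    outputs = []
    for x_t in inputs:                       # one step per element of the sequence
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # h_t = tanh(W_hh h_(t-1) + W_xh x_t)
        outputs.append(W_hy @ h)             # y_t = W_hy h_t
    return outputs, h

sequence = [rng.normal(size=input_size) for _ in range(5)]
ys, h_final = rnn_forward(sequence)
print(len(ys), ys[0].shape)  # 5 outputs, each of shape (4,)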

Training through RNN

1. A single time step of the input is provided to the network.
2. Its current state is then calculated using the current input and the previous state.
3. The current h_t becomes h_(t-1) for the next time step.
4. One can go through as many time steps as the problem requires and join the information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual output, i.e. the target output, and the error is generated.
The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained.

Advantages of Recurrent Neural Network


1. An RNN remembers information over time. This is useful in time series prediction because of its ability to remember previous inputs as well (an ability extended in Long Short-Term Memory networks).
2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.

Disadvantages of Recurrent Neural Network


1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation function.

Applications of Recurrent Neural Networks


Image Captioning: RNNs are used to caption an image by analyzing the activities present.

Time Series Prediction: Any time series problem, like predicting the prices of stocks in a
particular month, can be solved using an RNN.

Natural Language Processing: Text mining and Sentiment analysis can be carried out using an
RNN for Natural Language Processing (NLP).

Types of Recurrent Neural Networks


There are four types of Recurrent Neural Networks:
1. One to One
2. One to Many
3. Many to One
4. Many to Many

1. One to One RNN


This type of neural network is known as a Vanilla Neural Network. It is used for general machine learning problems that have a single input and a single output.
2. One to Many RNN
This type of neural network has a single input and multiple outputs. An example of this is image captioning.

3. Many to One RNN


This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good example of this kind of network, where a given sentence can be classified as expressing positive or negative sentiments.

4. Many to Many RNN


This RNN takes a sequence of inputs and generates a sequence of outputs. Machine translation
is one of the examples.

Two Issues of Standard RNNs

Recurrent Neural Networks enable you to model time-dependent and sequential data problems, such as stock market prediction, machine translation, and text generation. However, training them runs into two issues:

1. Vanishing Gradient Problem

While training a neural network, if the slope (gradient) tends to decay exponentially as it is propagated back through the time steps, this is called a vanishing gradient. The earlier layers then receive very small updates, and the network struggles to learn long-range dependencies.

2. Exploding Gradient Problem

While training a neural network, if the slope tends to grow exponentially instead of decaying, this is called an exploding gradient. This problem arises when large error gradients accumulate, resulting in very large updates to the neural network model weights during the training process. Long training times, poor performance, and bad accuracy are the major issues in gradient problems.
10. Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) is an artificial neural network architecture used in the fields of artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. Such a recurrent neural network (RNN) can process not only single data points (such as images), but also entire sequences of data (such as speech or video). For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition, machine translation, robot control, video games, and healthcare. LSTM has become the most cited neural network of the 20th century.

LSTM with a forget gate
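A standard formulation of the LSTM cell with a forget gate is given by the following equations, where x_t is the input at time t, h_t the hidden state, c_t the cell state, σ the logistic sigmoid and ⊙ element-wise multiplication:

f_t = σ(W_f x_t + U_f h_(t-1) + b_f)        (forget gate)
i_t = σ(W_i x_t + U_i h_(t-1) + b_i)        (input gate)
o_t = σ(W_o x_t + U_o h_(t-1) + b_o)        (output gate)
c~_t = tanh(W_c x_t + U_c h_(t-1) + b_c)    (candidate cell state)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ c~_t            (cell state update)
h_t = o_t ⊙ tanh(c_t)                       (hidden state / output)

The forget gate f_t decides how much of the previous cell state to keep, the input gate i_t how much of the new candidate to write, and the output gate o_t how much of the cell state is exposed as the hidden state.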

Applications of LSTM include:

1. Language Modelling
2. Machine Translation
3. Image Captioning
4. Handwriting generation
5. Question Answering Chatbots
