IMPROVING WORD EMBEDDING ON MALAYALAM CORPUS
A THESIS REPORT
Submitted by
FATHIMA MURSHIDA K
to
the APJ Abdul Kalam Technological University
in partial fulfillment of the requirements for the award of the Degree
of
Master of Technology
in
Computer Science and Engineering
JULY 2020
DECLARATION
I, the undersigned, hereby declare that the thesis report “IMPROVING WORD EMBEDDING ON MALAYALAM CORPUS”, submitted in partial fulfillment of the
requirements for the award of degree of Master of Technology of the APJ Abdul
Kalam Technological University, Kerala is a bonafide work done by me under the
supervision of Dr. Anil K Jacob, Assistant Professor, Department of Computer Sci-
ence and Engineering. This submission represents ideas in my own words and where
ideas or words of others have been included, I have adequately and accurately cited
and referenced the original sources. I also declare that I have adhered to ethics of
academic honesty and integrity and have not misrepresented or fabricated any data or
idea or fact or source in submission. I understand that any violation of the above will
be a cause for disciplinary action by the institute and/or the University and can also
evoke penal action from the sources which have thus not been properly cited or from
whom proper permission has not been obtained. This report has not previously formed the basis for the award of any degree, diploma or similar title of any other
University.
Place: Kuttippuram
FATHIMA MURSHIDA K
Date: 27-07-2020
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
MES COLLEGE OF ENGINEERING, KUTTIPPURAM
CERTIFICATE
This is to certify that the thesis report entitled “IMPROVING WORD EMBEDDING ON MALAYALAM CORPUS” submitted by FATHIMA MURSHIDA K in partial fulfillment of the requirements for the award of the Degree of Master of Technology in
Computer Science and Engineering is a bonafide record of the project work carried
out under my guidance and supervision. This report in any form has not been sub-
mitted to any other University or Institute for any purpose.
Dr. Anil K Jacob
Assistant Professor
Department of Computer Science and Engineering
ABSTRACT
Natural Language Processing (NLP) deals with the automatic processing of natural languages. Natural languages like Malayalam are highly inflectional and agglutinative in nature. This is problematic when dealing with NLP based Malayalam applications. So, in order to improve the performance of Malayalam NLP based applications, word embedding improvement on the Malayalam corpus is used. The improvement is based on converting the words contained in the Malayalam corpus into a standardised form. The standardised form here means removing all inflectional parts of the words in the existing Malayalam corpus, that is, taking only the root words. All that is needed is a stemmer. In this project, a Malayalam morphological analyser is used for taking the root words of all words in the existing Malayalam corpus. The main advantage of removing the inflectional parts from all words is that it reduces the sparsity in the existing Malayalam corpus. There will also be a large increase in the frequency of words in the resulting corpus, so the space and time complexity of the word embedding representation of the existing corpus will decrease. According to Zipf's law, by increasing the frequency of words, the performance of neural word embedding will increase. Here, using fastText, word embeddings are computed and capture a dense word vector representation of the Malayalam corpus with dimensionality reduction from the sparse word co-occurrence matrix. The improvement is mainly intended for wordnet, analogy and ontology based Malayalam applications.
CONTENTS
ACKNOWLEDGEMENT i
ABSTRACT ii
LIST OF FIGURES v
ABBREVIATIONS vi
Chapter 1. INTRODUCTION 1
1.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Natural Language Processing . . . . . . . . . . . . . . . . . 2
1.2.1 Working of NLP . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Components of NLP . . . . . . . . . . . . . . . . . 3
1.3 Hardness of Natural language . . . . . . . . . . . . . . . . 4
1.3.1 Need to perform Morphological Analysis . . . . . . 4
1.4 Word embedding . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2. LITERATURE SURVEY 6
2.1 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 GloVe . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 FastText . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.4 Paragram . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 3. METHODOLOGY 10
3.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 PROPOSED SYSTEM . . . . . . . . . . . . . . . . . . . . 11
3.4 IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . 12
3.5 Module Description . . . . . . . . . . . . . . . . . . . . . . 12
3.5.1 CORPUS CREATION . . . . . . . . . . . . . . . . 12
3.5.2 DATA PREPROCESSING . . . . . . . . . . . . . . 12
3.5.3 ZIPF'S LAW . . . . . . . . . . . . . . . . . . . . . 14
3.5.4 Training Using Neural Networks . . . . . . . . . . . 18
3.5.5 MODEL EVALUATION . . . . . . . . . . . . . . . 21
3.5.6 MODEL VISUALIZATION . . . . . . . . . . . . . 23
3.6 DATAFLOW DIAGRAM . . . . . . . . . . . . . . . . . . . 23
Chapter 4. Conclusion and Future Work 24
REFERENCES
LIST OF FIGURES
ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 Objective
Natural languages like Malayalam are highly inflectional and agglutinative in nature. This is problematic when dealing with NLP based Malayalam applications. So, in order to improve the performance of Malayalam NLP based applications, word embedding improvement on the Malayalam corpus is used. The improvement is based on converting the words contained in the Malayalam corpus into a standardised form. The standardised form is obtained by removing all inflectional parts of the words in the existing Malayalam corpus, thereby retaining only root words. A stemmer is used for this purpose.
In this project, a Malayalam morphological analyser has been used for find-
ing root words of all words in the existing Malayalam corpus. The main advantage
of removing inflectional parts from all words is that we can reduce the sparsity in
the existing Malayalam corpus. Also, there will be a hike in frequency of words in
the resulting corpus. As a result, the space and time complexity of word embedding
representation of the existing corpus will decrease. According to Zipf's law, by increasing the frequency of words, the performance of neural word embedding will increase. Zipf's law is a discrete probability distribution that tells the probability of encountering a word in a given corpus. Here, using fastText, word embeddings are performed and capture a dense word vector representation of the Malayalam corpus with dimen-
sionality reduction from the sparse word co-occurrence matrix. The improvement is mainly intended for wordnet, analogy and ontology based Malayalam applications.

1.2 Natural Language Processing

NLP is the subfield of artificial intelligence that deals with the interaction between computers and human languages; in other words, learning NLP is like learning the language of your own mind. Computers can't yet truly understand natural languages in the way that
humans do, but they can already do a lot. We might be able to save a lot of time by letting machines handle much of our language related work. Human language (whether spoken or written) carries huge amounts of information. The topic we choose, our tone, our
selection of words, everything adds some type of information that can be interpreted
and value extracted from it. In theory, we can understand and even predict human
behaviour using that information. NLP helps developers to organize and structure knowledge to perform tasks such as translation, automatic summarization, named entity recognition and sentiment analysis.

1.2.1 Working of NLP

Every day, we say thousands of words that other people interpret to do countless things. We consider it as simple communication, but words run much deeper than that. There is always some context that we derive from what we say and how we say it. NLP never focuses on voice modulation; it does draw on contextual patterns. In
NLP, we learn things through experience. An important aspect of these techniques is that relationships between words can be captured and manipulated with simple arithmetic. However, the main question is how the computer knows about the same. We need to provide enough data for machines to learn through experience.
We can feed details like Her Majesty the Queen; the Queen’s speech during the State
visit; the crown of Queen Elizabeth; the Queen's mother; the queen is generous etc.
With the above examples, the machine understands the entity Queen. The machine
then creates word vectors. A word vector is built using the surrounding words; the machine learns it from multiple datasets using machine learning (e.g., deep learning algorithms).
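To make the idea of a word vector concrete, the following minimal Python sketch (not part of the thesis implementation; it assumes the gensim library and a hypothetical pretrained vector file pretrained.vec in word2vec text format) queries the neighbours of a word and the classic king/queen analogy.

# Minimal sketch: querying word vectors. File name and example words are placeholders.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pretrained.vec", binary=False)

# Words used in similar contexts end up with similar vectors.
print(vectors.most_similar("queen", topn=5))

# The classic analogy: king - man + woman is expected to be close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))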
1.2.2 Components of NLP

• Morphological and Lexical Analysis: Individual words are analysed into their components, and non-word tokens such as punctuation are separated from the words.

• Syntactic Analysis: Linear sequences of words are transformed into structures that show how the words relate to each other.

• Semantic Analysis: The structures created by the syntactic analyzer are assigned meanings.

• Discourse Integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meanings of the sentences that follow it.
1.3 Hardness of Natural language
Malayalam is a Dravidian language spoken in the Indian state of Kerala. It is one of the 22 scheduled languages
of India and was designated a classical Language in India in 2013. The earliest
script used to write Malayalam was the Vatteluttu script, and later the Kolezhuttu,
which was derived from it. The oldest literary works in Malayalam, distinct from the
Tamil tradition, are the Paattus, folk songs, dated between the 9th and 11th centuries.
Grantha script letters were adopted to write Sanskrit loanwords, which resulted in the modern Malayalam script. The rich inflectional morphology of the language is the main problem for NLP. The word order is generally subject–object–verb, although other orders are
often employed for reasons such as emphasis. Nouns are inflected for case and num-
ber, whilst verbs are conjugated for tense, mood and causativity (and also in archaic
language for person, gender, number and polarity). Being the linguistic successor of old Tamil, Malayalam retains a highly inflectional and agglutinative grammar.
1.3.1 Need to perform Morphological Analysis

Morphological analyzer and morphological generator are two essential and basic tools for building any language processing application. Morphological analysis is the process of segmenting a word into its constituent morphemes and identifying their grammatical roles. A morphological analyzer is a computer program which takes a word as input and produces its root word along with its grammatical information depending upon its word category. For
nouns it will provide gender, number, and case information and for verbs, it will be
tense, aspect, and modality. In this project, inflections are removed from each
word in the corpus so that the frequency of words in the resulting corpus will in-
crease. So, the resulting corpus will contain only the root words. By increasing the frequency of root words, the sparsity of the corpus decreases, since sparsity is higher with inflected words. Also, with limited resources, we will get better output provided we do not use morphosyntactic information. In this project, only conceptual similarity is captured, that is, conceptually similar words occur together, so that Malayalam NLP based wordnet, ontology and analogy applications can improve their performance.
1.4 Word embedding

NLP enables computers to understand, generate and, more generally speaking, work with human languages. But there's a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers. For this reason, we have to map those words (sometimes even the sentences) to vectors: just a bunch of real numbers that a machine can work with. The two main ways to convert text into numbers are Sparse Vector Representations and Dense Vector Representations.
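As a small illustration of the difference between the two representations (an illustrative sketch only; the toy vocabulary and the numbers in the dense vector are invented for demonstration), consider:

# Sparse (one-hot) versus dense word representations, with a toy Malayalam vocabulary.
import numpy as np

vocab = ["രാജാവ്", "രാജ്ഞി", "പുസ്തകം"]          # toy vocabulary: king, queen, book
vocab_index = {w: i for i, w in enumerate(vocab)}

# Sparse representation: dimension equals the vocabulary size, almost all zeros.
one_hot = np.zeros(len(vocab))
one_hot[vocab_index["രാജ്ഞി"]] = 1.0
print(one_hot)                                   # [0. 1. 0.]

# Dense representation: a short real-valued vector learned from context.
dense = np.array([0.21, -0.47, 0.88, 0.05])      # hypothetical learned values
print(dense)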
CHAPTER 2
LITERATURE SURVEY
A word embedding is a learned representation for text where words that have the
same meaning have a similar representation. This chapter gives an overview of the state of the art techniques for learning word embeddings.

2.1 State of the art

2.1.1 Word2Vec
Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus. It was developed by Tomas Mikolov et al. [12] at Google in 2013 to make neural network based training of embeddings more efficient and since then has become the de facto standard for developing pre-
trained word embedding. The model is, in contrast to other deep learning models,
a shallow model of only one layer without non-linearities. The paper by Mikolov et
al.[12] introduced two architectures for unsupervised learning word embeddings from
a large corpus of text. The first architecture is called CBOW; it tries to predict the
center word from the summation of the context vectors within a specific window. The
second, and more successful, architecture is called skip-gram. The pretrained word2vec embeddings used here were trained using the skip-gram algorithm. This algorithm is
also the inspiration for the algorithms behind the GloVe and fastText embeddings.
These embeddings were the first to be trained on a significantly large corpus of 100
billion tokens from English news articles and have a dimensionality of 300. These
articles originated from different media outlets and were bundled together in a news
search engine from Google called Google News [12].
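The two architectures can be tried with off-the-shelf tools. The following minimal sketch (an assumption for illustration, not the original word2vec code; it uses the gensim library and a toy two-sentence corpus) shows that switching between CBOW and skip-gram is a single parameter:

# Training toy CBOW and skip-gram models with gensim. sg=0 selects CBOW, sg=1 skip-gram.
from gensim.models import Word2Vec

sentences = [["the", "queen", "gave", "a", "speech"],
             ["the", "king", "gave", "a", "speech"]]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["queen"][:5])       # first few components of the CBOW vector
print(skipgram.wv["queen"][:5])   # first few components of the skip-gram vector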
2.1.2 GloVe
Global Vectors for Word Representation (GloVe) by Pennington, Socher, and Manning [14] was inspired by the skip-gram algorithm and tries to approach the problem from a different direction. Pennington, Socher, and Manning [14] show that the ratio of word co-occurrence probabilities can encode semantic meaning. The idea is similar to TF-IDF (term frequency–inverse document frequency) but for
the weighing of the importance of a context word during the training of word embed-
dings. Classical vector space model representations of words were developed using matrix factorization techniques such as latent semantic analysis. These methods do a good job of using global text statistics but are not as good as the learned methods such as word2vec at capturing meaning and analogies. GloVe starts by constructing a large sparse matrix X, wherein each element represents the number of times word i co-occurs with word j within a window, similar to skip-gram. The word embeddings are then learned from this co-occurrence matrix. The authors published pretrained embeddings of several dimensionalities trained on Wikipedia articles and tweets from Twitter. Besides these embeddings, they also published embeddings trained on a dataset from Common Crawl [14]. This dataset contains 840 billion tokens, which is sig-
nificantly more than the 100 billion tokens the Word2vec embeddings were trained
on. The published Common Crawl trained embeddings have a dimensionality of 300.
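The co-occurrence matrix X described above can be sketched in a few lines of Python (a toy illustration with an invented corpus and window size, not the GloVe implementation itself):

# Building a word co-occurrence matrix X: X[i][j] counts how often word j
# appears within a fixed window around word i.
from collections import defaultdict

corpus = ["the queen gave a speech", "the king gave a speech"]
window = 2
X = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[word][tokens[j]] += 1

print(dict(X["queen"]))   # context counts observed for "queen"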
2.1.3 FastText
FastText is a library for efficient learning of word representations and sentence classification. The learned representations can be used for numerous applications: from data compression, as features into additional models, for candidate selection, or as initializers for transfer learning. Unlike word2vec, fastText treats words not as atomic units but as bags of character n-grams. One of the main advantages of this approach is that
word meaning can now be transferred between words, and thus embeddings of new
words can be extrapolated from embeddings of the n-grams already learned. The
length of n-grams you use can be controlled by the -minn and -maxn flags for mini-
mum and maximum number of characters to use respectively. These control the range
of values to get n-grams. The model is considered to be a Bag of Words model be-
cause, aside from the sliding window of n-gram selection, there is no internal structure of a word that is taken into account for featurization, i.e., as long as the characters
fall under the window, the order of the character n-grams does not matter. We can
also turn n-gram embeddings completely off as well by setting them both to 0. This
can be useful when the ‘words’ in our model aren’t words for a particular language,
and character level n-grams would not make sense. During the model update, fastText
learns weights for each of the n-grams as well as the entire word token. The authors
published pretrained word vectors for 294 different languages, all trained from the
Wikipedia dumps of the different languages. All the pretrained word embeddings
have a dimensionality of 300.
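The character n-gram decomposition described above can be illustrated with a short sketch (the boundary markers < and > follow the fastText convention; the word and the n-gram range are chosen only for demonstration):

# Enumerating the character n-grams of a word, as fastText does internally.
def char_ngrams(word, minn=3, maxn=6):
    token = f"<{word}>"                  # add boundary symbols
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams

print(char_ngrams("queen", minn=3, maxn=4))
# ['<qu', 'que', 'uee', 'een', 'en>', '<que', 'quee', 'ueen', 'een>']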
2.1.4 Paragram
Wieting et al [11] introduced a method to tune existing word embeddings using para-
phrasing data. The focus of their paper is not on creating entirely new word embed-
dings from a large corpus. Instead, the authors are taking existing pretrained GloVe
embeddings and tuning them so that words in similar sentences are able to compose
in the same manner. Their training data consists of a set P of phrase pairs. The objective function they use focuses on increasing the cosine similarity, i.e. the similarity of the vectors of the two phrases in each pair. Important to mention is that Wieting et al. [11] express similarity in terms of angle and not in terms of actual distance. Additionally, Wieting et al. [11] only explored one source of paraphrase data, PPDB; they used version XXL, which contains 86 million paraphrase pairs. An example of a short paraphrase is: “thrown into jail”, which is semantically similar to “taken into custody”. The published embeddings are called Paragram-Phrase XXL [11], which are in fact tuned GloVe embeddings. These embeddings also have a dimensionality of 300.
CHAPTER 3
METHODOLOGY
The increasing accuracy of pre-trained word embeddings has a great impact on ontology and analogy representation based NLP applications. Improved word vectors in the Malayalam language can increase the accuracy of such applications. This project aims to improve word embeddings in a way that does not need any labeled data. The improvement is achieved by removing the inflections from each word in the Malayalam corpus. Any large volume of text can be used to get the word embeddings by feeding it to the model, without any kind of labeling. In this project we are using FastText word embedding, and the word embeddings are derived from this unlabeled corpus.
3.1 Requirements
– RAM: 16 GB or higher
– Hard Disk: 2 TB
– TensorFlow
– Python
– Matplotlib
3.2 Steps
• Collect the pretrained word vectors from fastText for the Malayalam language (see the sketch after this list).
• Visualise the pretrained Malayalam word vectors using t-SNE and PCA.
• Visualise the new word embedding of the Malayalam dataset using t-SNE and PCA.
• Calculate the Euclidean distance and cosine similarity of similar words in the new dataset.
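A minimal sketch of the first step (assuming the fasttext Python package and the officially published pretrained Malayalam model cc.ml.300.bin; the query word is only an example) is given below:

# Fetching and querying the pretrained Malayalam fastText vectors.
# Note: the download is several gigabytes in size.
import fasttext
import fasttext.util

fasttext.util.download_model("ml", if_exists="ignore")   # fetches cc.ml.300.bin
model = fasttext.load_model("cc.ml.300.bin")

print(model.get_dimension())                       # 300
print(model.get_nearest_neighbors("രാജാവ്", k=5))   # words closest to "king"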
3.3 PROPOSED SYSTEM

The increasing accuracy of pre-trained word embeddings has a great impact on sentiment analysis research and on all NLP based applications. The improvement is done by removing all inflectional parts from each word in the corpus. By using a Malayalam morphological analyser, the inflections are removed and the existing corpus is converted into a new corpus that contains only root words. As the frequency of words in the corpus increases, the complexity of the word embedding in the Malayalam language decreases. In this project, fastText embedding is used for training, and it results in a dense vector representation of each word with dimensionality reduction.
3.4 IMPLEMENTATION
• Corpus Creation
• Data Preprocessing
• Zipf's Law
• Training Using Neural Networks
• Model Evaluation
• Dataset Visualisation
3.5 Module Description

3.5.1 CORPUS CREATION

Data is collected from Malayalam news articles and a Malayalam corpus is created from it.

3.5.2 DATA PREPROCESSING

Real-world data is typically dirty, for example:

• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
Malayalam is an inflectional and agglutinative language. An inflectional language means a language that changes the form or ending of some words when the way in which they are used in sentences changes. An agglutinative language is a type of synthetic language with morphology that primarily uses agglutination. Words may contain different morphemes to determine their meanings, but all of these morphemes (including stems and affixes) remain, in every aspect, unchanged after their unions. This results in generally more easily deducible word meanings if compared to fusional languages, which allow modifications in the spelling of one or more morphemes within a word. This usually results in a shortened form of the original words. In natural language processing, data sparsity (also known by terms such as
data sparseness, data paucity, etc.) is the term used to describe the phenomenon of
not observing enough data in a corpus to model language accurately. True observa-
tions about the distribution and pattern of language cannot be made because there
is not enough data to see the true distribution. So by removing inflections, sparsity
will decrease and also complexity of the model decreases and frequency of words
will increase. Then the complexity of the Malayalam word embedding model will also decrease. Morphological analysis is one of the fundamental tasks in the processing of natural languages. It is the study of the rules of word construction by analysing
the syntactic properties and morphological information. In order to perform this task,
morphemes have to be separated from the original word. This process is termed morphological segmentation. Word formation in Malayalam involves agglutination and sandhi. Due to sandhi, many morphological changes occur at the conjoining points of the morphemes, which makes morphological analysis a tough task, especially in languages like Malayalam. In this project, only the roots of the words are used, and a new corpus is created from them; a small preprocessing sketch is given after the list of sandhi rules below. Various sandhi rules are defined in Malayalam for joining two words to form a new one. On applying these rules, the original appearance of the words taking part in this process is altered. Rules are applied by observing the ‘sounds’ of the end syllable of the first word and the beginning syllable of the second word. The classification of the rules is done based on whether a word ends with a vowel (swaram) or a consonant. The main types of sandhi are:
• Elision
• Augmentation
• Substitution
• Reduplication (Gemination)
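The preprocessing step can be sketched as follows. The Malayalam morphological analyser is represented here by a hypothetical analyse_root() function with a one-entry dictionary; in the actual project the analyser supplies the root of every word.

# Replacing every word of the corpus with its root form.
def analyse_root(word):
    # Placeholder for the Malayalam morphological analyser: a real analyser
    # would strip inflections and sandhi changes and return the root word.
    root_map = {"മരങ്ങൾ": "മരം"}        # e.g. "trees" -> "tree" (illustrative)
    return root_map.get(word, word)      # unknown words are returned unchanged

def preprocess_corpus(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            roots = [analyse_root(tok) for tok in line.split()]
            fout.write(" ".join(roots) + "\n")

# preprocess_corpus("malayalam_corpus.txt", "malayalam_roots.txt")  # hypothetical paths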
Figure 3.1: Example of elision rule
Figure 3.4: Example of reduplication rule

3.5.3 ZIPF'S LAW

Zipf's Law was proposed by the linguist George Kingsley Zipf [15] to describe the analogous pattern that appears in language. Given a large corpus of natural language occurrences, the frequency of any word is inversely proportional to its rank in the frequency table. If we are given a text file, the frequency of any word in that text file is how many times that word appears in that file. This law was proposed in relation to patterns seen in natural language. By increasing the frequency of words, the performance of neural word embedding also increases. Zipf's Law is a discrete probability
distribution that tells the probability of encountering a word in a given corpus. Zipf's law is a law about the frequency distribution of words in a language. In equation (3.1), k is the rank of the word whose probability of appearing in the corpus we are interested in, s is the exponent characterising the distribution, and N is the number of words in the vocabulary.
Figure 3.6: Input representation of data to the Malayalam morphological analyser
f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}          (3.1)
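Equation (3.1) and the rank-frequency pattern can be checked on the preprocessed corpus with a short sketch (standard library only; the corpus file name is a placeholder carried over from the preprocessing sketch):

# Comparing observed word frequencies with the Zipf probability of equation (3.1).
from collections import Counter

def zipf_probability(k, s, N):
    # f(k; s, N) = (1 / k**s) / sum_{n=1}^{N} (1 / n**s)
    return (1.0 / k ** s) / sum(1.0 / n ** s for n in range(1, N + 1))

with open("malayalam_roots.txt", encoding="utf-8") as f:
    counts = Counter(f.read().split())

total = sum(counts.values())
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    observed = freq / total
    expected = zipf_probability(rank, s=1.0, N=len(counts))
    print(rank, word, round(observed, 4), round(expected, 4))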
Figure 3.8: Preprocessed data format
Figure 3.10: Zipf's law result of the preprocessed Malayalam corpus

3.5.4 TRAINING USING NEURAL NETWORKS

FastText is a word embedding method that is an extension of the word2vec model. Instead of learning vectors for words directly, fastText represents each word as an n-gram of characters. This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes. Once the word has been represented using character n-grams, a skip-gram model is trained to learn the embeddings. FastText can be considered a bag of words model with a sliding window over a word because no internal structure
of the word is taken into account. As long as the characters are within this window,
the order of the n-grams doesn’t matter. fastText works well with rare words. So
even if a word wasn’t seen during training, it can be broken down into n-grams to
get its embeddings. Word2vec and GloVe both fail to provide any vector representation for words that are not in the model dictionary; this is a huge advantage of the fastText method. FastText can be used both for classification and word embedding creation. The word embedding can be created by using both skip-gram and CBOW. Both
are architectures to learn the underlying word representations for each word by using
neural networks. Given a set of sentences (also called a corpus), the model loops over the words of each sentence and either tries to use the current word w in order to predict
its neighbors (i.e., its context). This approach is called “SkipGram”. If it uses each of
these contexts to predict the current word w, the method is called “Continuous Bag Of
Words” (CBOW). To limit the number of words in each context, a parameter called
“window size” is used. Continuous Bag Of Words (CBOW) uses the context to predict a target word, whereas using a word to predict a target context is called skip-gram.
The latter method is used in this project because it produces more accurate results
on large datasets. When the feature vector assigned to a word cannot be used to ac-
curately predict that word's context, the components of the vector are adjusted. The
vectors of words judged similar by their context are nudged closer together by ad-
justing the numbers in the vector. The Skip-gram model takes in a corpus of text and
creates a hot-vector for each word. A hot vector is a vector representation of a word
where the vector is the size of the vocabulary (total unique words). All dimensions
are set to 0 except the dimension representing the word that is used as an input at
that point in time. After training the model, we get two output files, model.bin and
model.vec (Figure 3.15). The latter is a text file containing the word vectors, one per line. The former is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.

Figure 3.12: The skip-gram model
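A minimal training sketch consistent with the description above is given below; it assumes the fasttext Python package and a preprocessed corpus file (placeholder name malayalam_roots.txt). The equivalent command line call ./fasttext skipgram -input malayalam_roots.txt -output model produces model.bin and model.vec directly.

# Training fastText skip-gram embeddings and writing model.bin / model.vec.
import fasttext

model = fasttext.train_unsupervised(
    "malayalam_roots.txt",
    model="skipgram",   # use CBOW with model="cbow"
    dim=300,            # dimensionality of the word vectors
    ws=5,               # window size
    minn=3, maxn=6,     # character n-gram range
)
model.save_model("model.bin")

# Write a model.vec-style text file: one word and its vector per line.
with open("model.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(model.words)} {model.get_dimension()}\n")
    for w in model.words:
        vec = " ".join(str(round(x, 4)) for x in model.get_word_vector(w))
        f.write(f"{w} {vec}\n")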
3.5.5 MODEL EVALUATION

Word embeddings should capture the relationship between words in natural language. In the Word Similarity and Relatedness Task, word embeddings are evaluated by comparing word similarity scores computed from a pair of words with human labels for the similarity or relatedness of the pair. Cosine similarity measures the similarity between two vectors. It is measured by the cosine of the angle between
two vectors and determines whether two vectors are pointing in roughly the same direction.

Figure 3.15: model.vec output file
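The two measures can be computed directly from the trained vectors; the following sketch (the word pair and the model file are placeholders) implements cosine similarity from the angle definition together with the Euclidean distance:

# Evaluating a trained model with cosine similarity and Euclidean distance.
import numpy as np
import fasttext

model = fasttext.load_model("model.bin")

def cosine_similarity(a, b):
    # cosine of the angle between vectors a and b
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

v1 = model.get_word_vector("രാജാവ്")   # "king"  (example pair)
v2 = model.get_word_vector("രാജ്ഞി")   # "queen"
print("cosine similarity:", cosine_similarity(v1, v2))
print("euclidean distance:", euclidean_distance(v1, v2))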
3.5.6 MODEL VISUALIZATION
PCA and t-SNE are two techniques for visualising a dataset in 2D or 3D space. PCA (Principal Component Analysis) is a linear dimensionality reduction technique, while t-SNE is a non-linear technique that tends to preserve local neighbourhood structure, which makes it useful for inspecting clusters of similar words.
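A visualisation sketch (assuming scikit-learn and matplotlib; the vocabulary sample size is arbitrary) that projects the learned 300-dimensional vectors to 2D with both techniques:

# Projecting trained word vectors to 2D with PCA and t-SNE.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import fasttext

model = fasttext.load_model("model.bin")
words = model.words[:200]                                  # a sample of the vocabulary
vectors = np.array([model.get_word_vector(w) for w in words])

points_pca = PCA(n_components=2).fit_transform(vectors)
points_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(vectors)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(points_pca[:, 0], points_pca[:, 1], s=5)
ax1.set_title("PCA")
ax2.scatter(points_tsne[:, 0], points_tsne[:, 1], s=5)
ax2.set_title("t-SNE")
plt.show()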
CHAPTER 4
Conclusion and Future Work
Using the TensorFlow embedding projector, it can be visualised that similar Malayalam words or synonyms are located close to each other. The Euclidean distance between similar words is smaller (and their cosine similarity correspondingly higher) than with the fastText pre-trained word vectors of the Malayalam language. It can be predicted that by using Zipf's law the Malayalam word embedding can be improved. Also, the sparsity problem of the existing Malayalam corpus is mitigated and the time complexity is reduced in the new Malayalam corpus. The improved word vectors can be used in all Malayalam NLP applications to improve their efficiency. In future work, this model can be used as a base model for learning morphosyntactic information on top of it.
REFERENCES
[1] Abdulaziz M. Alayba, Vasile Palade, Matthew England and Rahat Iqbal, “Improving Sentiment Analysis in Arabic Using Word Representation”, 2nd IEEE International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), 2018.
[2] Shaosheng Cao and Wei Lu, “Improving Word Embeddings with Convolutional Feature Learning and Subword Information”, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[3] “Improving Word Embeddings Using Kernel PCA”, Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019.
[4] Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019.
[6] Miguel Ballesteros, Chris Dyer, and Noah A. Smith, “Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs”, Proceedings of EMNLP, 2015.
[7] Marco Baroni and Alessandro Lenci, “Distributional Memory: A General Framework for Corpus-Based Semantics”, Computational Linguistics, 2010.
[8] Jan A. Botha and Phil Blunsom, “Compositional Morphology for Word Representations and Language Modelling”, Proceedings of ICML, 2014.
[9] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan, “Joint Learning of Character and Word Embeddings”, Proceedings of IJCAI, 2015.
[10] Cicero Nogueira dos Santos and Bianca Zadrozny, “Learning Character-Level Representations for Part-of-Speech Tagging”, Proceedings of ICML, 2014.
[11] John Wieting, Mohit Bansal, Kevin Gimpel and Karen Livescu, “Towards Universal Paraphrastic Sentence Embeddings”, Proceedings of ICLR, 2016.
[12] Tomas Mikolov and Ilya Sutskever, “Distributed Representations of Words and Phrases and their Compositionality”, Advances in Neural Information Processing Systems (NIPS), 2013.
[14] Jeffrey Pennington, Richard Socher and Christopher D. Manning, “GloVe: Global Vectors for Word Representation”, Proceedings of EMNLP, 2014.
[15] Bernardo Huberman, “Zipf's Law and the Internet”, Glottometrics 3, 2002, 19-26.