
IMPROVING WORD EMBEDDING ON MALAYALAM CORPUS

A THESIS REPORT
Submitted by

FATHIMA MURSHIDA K (MES18CSCE04)

to
the APJ Abdul Kalam Technological University
in partial fulfillment of the requirements for the award of the Degree
of
Master of Technology
in
Computer Science and Engineering

(ISO 9001:2008 Certified)

Department of Computer Science and Engineering


MES College of Engineering Kuttippuram
Thrikkanapuram P.O., Malappuram Dt., Kerala, India 679582

JULY 2020
DECLARATION

I, the undersigned, hereby declare that the thesis report "IMPROVING WORD EMBEDDING ON MALAYALAM CORPUS", submitted in partial fulfillment of the requirements for the award of the degree of Master of Technology of the APJ Abdul Kalam Technological University, Kerala, is a bonafide work done by me under the supervision of Dr. Anil K Jacob, Assistant Professor, Department of Computer Science and Engineering. This submission represents ideas in my own words, and where ideas or words of others have been included, I have adequately and accurately cited and referenced the original sources. I also declare that I have adhered to the ethics of academic honesty and integrity and have not misrepresented or fabricated any data, idea, fact or source in this submission. I understand that any violation of the above will be cause for disciplinary action by the institute and/or the University and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been obtained. This report has not previously formed the basis for the award of any degree, diploma or similar title of any other University.

Place: Kuttippuram
FATHIMA MURSHIDA K
Date: 27-07-2020
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
MES COLLEGE OF ENGINEERING, KUTTIPPURAM

(ISO 9001:2008 Certified)

CERTIFICATE

This is to certify that the thesis report entitled "IMPROVING WORD EMBEDDING ON MALAYALAM CORPUS", submitted by FATHIMA MURSHIDA K in partial fulfillment of the requirements for the award of the Degree of Master of Technology in Computer Science and Engineering, is a bonafide record of the project work carried out under my guidance and supervision. This report in any form has not been submitted to any other University or Institute for any purpose.

Internal Examiner: External Examiner:

Dr. Anil K Jacob


Assistant Professor
Dept. of Computer Science
and Engineering
MES College of Engineering

PG Coordinator: Head of the Department:


ACKNOWLEDGEMENT
At the outset, I would like to thank the Almighty for all his blessings that led me
here.
I am grateful to Dr. A. S. Varadarajan, Principal, MES College of Engineering,
Kuttippuram, for providing the right ambiance to complete this project. I would
also like to extend my sincere gratitude to Dr. Sasidharan Sreedharan, Head of
the Department, Computer Science and Engineering, MES College of Engineering,
Kuttippuram.
I am deeply indebted to the project coordinator, Dr. Anil K Jacob, Assistant Pro-
fessor, Department of Computer Science and Engineering for his continued support.
It is with great pleasure that I express my deep sense of gratitude to my project guide,
Dr. Anil K Jacob, Assistant Professor, Department of Computer Science and Engi-
neering, for his guidance, supervision, encouragement and valuable advice in each
and every phase.
I would also like to express my sincere thanks and gratitude to all staff members
of the department, my friends and family members for their cooperation, positive
criticism, consistent support and consideration during the preparation of this work.

FATHIMA MURSHIDA K

ABSTRACT
NLP stands for Natural Language Processing. Natural languages like Malayalam are highly inflectional and agglutinative in nature, which is problematic when building NLP based Malayalam applications. In order to improve the performance of such applications, this work improves word embeddings on a Malayalam corpus. The improvement is based on converting the words contained in the Malayalam corpus into a standardised form: all inflectional parts are removed from the words in the existing corpus so that only root words remain. All that is needed is a stemmer; in this project, a Malayalam morphological analyser is used to obtain the root of every word in the existing corpus. The main advantage of removing the inflectional parts from all words is that it reduces the sparsity of the existing Malayalam corpus. The frequency of words in the resulting corpus also rises sharply, so the space and time complexity of the word embedding representation of the corpus decreases. According to Zipf's law, increasing the frequency of words increases the performance of neural word embeddings. Word embeddings are trained here using fastText, which captures a dense word vector representation of the Malayalam corpus with dimensionality reduction from the sparse word co-occurrence matrix. The improvement is mainly intended for wordnet, analogy and ontology based Malayalam applications.

CONTENTS

Contents Page No.

ACKNOWLEDGEMENT i
ABSTRACT ii
LIST OF FIGURES v
ABBREVIATIONS vi
Chapter 1. INTRODUCTION 1
1.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Natural Language Processing . . . . . . . . . . . . . . . . . 2
1.2.1 Working of NLP . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Components of NLP . . . . . . . . . . . . . . . . . 3
1.3 Hardness of Natural language . . . . . . . . . . . . . . . . 4
1.3.1 Need to perform Morphological Analysis . . . . . . 4
1.4 Word embedding . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2. LITERATURE SURVEY 6
2.1 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 GloVe . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 FastText . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.4 Paragram . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 3. METHODOLOGY 10
3.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 PROPOSED SYSTEM . . . . . . . . . . . . . . . . . . . . 11
3.4 IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . 12
3.5 Module Description . . . . . . . . . . . . . . . . . . . . . . 12
3.5.1 CORPUS CREATION . . . . . . . . . . . . . . . . 12
3.5.2 DATA PREPROCESSING . . . . . . . . . . . . . . 12
3.5.3 ZIPF'S LAW . . . . . . . . . . . . . . . . . . . . . 14
3.5.4 Training Using Neural Networks . . . . . . . . . . . 18
3.5.5 MODEL EVALUATION . . . . . . . . . . . . . . . 21
3.5.6 MODEL VISUALIZATION . . . . . . . . . . . . . 23
3.6 DATAFLOW DIAGRAM . . . . . . . . . . . . . . . . . . . 23
Chapter 4. Conclusion and Future Work 24
REFERENCES
LIST OF FIGURES

No. Title Page No.

3.1 Example of elision rule . . . . . . . . . . . . . . . . . . . . . . . . 15


3.2 Example of augmentation rule . . . . . . . . . . . . . . . . . . . . 15
3.3 Example of substitution rule . . . . . . . . . . . . . . . . . . . . . 15
3.4 Example of Reduplication rule . . . . . . . . . . . . . . . . . . . . 16
3.5 Initial model of dataset . . . . . . . . . . . . . . . . . . . . . . . . 16
3.6 Input representation of data to Malayalam morphological analyser . 17
3.7 Malayalam morphological analyser output . . . . . . . . . . . . . . 17
3.8 Preprocessed data format . . . . . . . . . . . . . . . . . . . . . . . 18
3.9 Zipf's law result of initial Malayalam corpus . . . . . . . . . . . . . 18
3.10 Zipf's law result of preprocessed Malayalam corpus . . . . . . . . . 19
3.11 The Common Bag Of Words model . . . . . . . . . . . . . . . . . 20
3.12 The skipgram model . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.13 Training window of the model . . . . . . . . . . . . . . . . . . . . 21
3.14 Output files of the model . . . . . . . . . . . . . . . . . . . . . . . 21
3.15 model.vec output file . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.16 model evaluation result . . . . . . . . . . . . . . . . . . . . . . . . 22
3.17 model evaluation result . . . . . . . . . . . . . . . . . . . . . . . . 22
3.18 model visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.19 Dataflow diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

ABBREVIATIONS

NLP Natural Language Processing


CBOW Continuous Bag-of-Words
TF-IDF Term Frequency–Inverse Document Frequency
LSA Latent Semantic Analysis
GloVe Global Vectors for Word Representation
PCA Principal Component Analysis
t-SNE t-Distributed Stochastic Neighbor Embedding
2D Two Dimension
3D Three Dimension

CHAPTER 1
INTRODUCTION

1.1 Objective

NLP stands for Natural Language Processing. Natural languages like Malayalam are highly inflectional and agglutinative in nature, which is problematic when building NLP based Malayalam applications. Therefore, in order to improve the performance of Malayalam NLP based applications, word embeddings on a Malayalam corpus are improved. The improvement is based on converting the words contained in the Malayalam corpus into a standardised form. The standardised form is obtained by removing all inflectional parts of the words in the existing Malayalam corpus, thereby retaining only root words. A stemmer is used to make this happen.

In this project, a Malayalam morphological analyser has been used for finding the root words of all words in the existing Malayalam corpus. The main advantage of removing inflectional parts from all words is that the sparsity of the existing Malayalam corpus is reduced. Also, there will be a hike in the frequency of words in the resulting corpus. As a result, the space and time complexity of the word embedding representation of the existing corpus will decrease. According to Zipf's law, by increasing the frequency of words, the performance of neural word embeddings will increase. Zipf's law is a discrete probability distribution that gives the probability of encountering a word in a given corpus. Here, word embeddings are trained using fastText, which captures a dense word vector representation of the Malayalam corpus with dimensionality reduction from the sparse word co-occurrence matrix. The improvement is mainly intended for wordnet, analogy and ontology based Malayalam applications.

1.2 Natural Language Processing

NLP stands for Natural Language Processing. It is the subfield of AI that is focused on enabling computers to understand and process human languages. Computers cannot yet truly understand natural languages in the way that humans do, but they can already do a lot, and we might be able to save a lot of time by applying NLP techniques to projects. Everything we express (either verbally or in written form) carries huge amounts of information. The topic we choose, our tone, our selection of words: everything adds some type of information that can be interpreted and from which value can be extracted. In theory, we can understand and even predict human behaviour using that information. NLP helps developers to organize and structure knowledge to perform tasks like translation, summarization, named entity recognition, relationship extraction, speech recognition, topic segmentation, etc.

1.2.1 Working of NLP

Every day, we say thousands of words that other people interpret to do countless things. We consider it simple communication, but words run much deeper than that. There is always some context that we derive from what we say and how we say it. NLP does not focus on voice modulation; it draws on contextual patterns. In NLP, we learn things through experience. An important aspect of these techniques is the ability to solve word analogies of the form "A is to B what C is to X" using simple arithmetic. The main question, however, is how the computer knows this. We need to provide enough data for machines to learn through experience. We can feed details like: Her Majesty the Queen; the Queen's speech during the State visit; the crown of Queen Elizabeth; the Queen's mother; the queen is generous; etc. With the above examples, the machine understands the entity Queen. The machine then creates word vectors. A word vector is built using surrounding words, learned from multiple datasets, or learned using machine learning (e.g., deep learning algorithms).
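To make the analogy arithmetic mentioned above concrete, the sketch below uses a tiny dictionary of made-up three-dimensional vectors; real embeddings have hundreds of dimensions and come from a trained model, but the arithmetic is the same.

import numpy as np

# Toy vectors for illustration only; real word vectors are learned from a corpus.
vectors = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "man":   np.array([0.6, 0.2, 0.1]),
    "woman": np.array([0.7, 0.3, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "man is to king what woman is to X": solve by vector arithmetic.
target = vectors["king"] - vectors["man"] + vectors["woman"]

# The answer is the most similar word, excluding the three query words.
best = max(
    (w for w in vectors if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(vectors[w], target),
)
print(best)  # "queen"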

1.2.2 Components of NLP

• Morphological Analysis: Individual words are analyzed into their components, and nonword tokens such as punctuation are separated from the words.

• Syntactic Analysis: Linear sequences of words are transformed into structures that show how the words relate to each other.

• Semantic Analysis: The structures created by the syntactic analyzer are assigned meanings.

• Discourse Integration: The meaning of an individual sentence may depend on the sentences that precede it and may influence the meanings of the sentences that follow it.

• Pragmatic Analysis: The structure representing what was said is reinterpreted to determine what was actually meant.

1.3 Hardness of Natural language

In this project, the Malayalam language is used. Malayalam is a Dravidian language spoken in the Indian state of Kerala. It is one of the 22 scheduled languages of India and was designated a classical language of India in 2013. The earliest script used to write Malayalam was the Vatteluttu script, and later the Kolezhuttu, which was derived from it. The oldest literary works in Malayalam, distinct from the Tamil tradition, are the Paattus, folk songs, dated between the 9th and 11th centuries. Grantha script letters were adopted to write Sanskrit loanwords, which resulted in the modern Malayalam script.

Malayalam is a highly inflectional and agglutinative language. Algorithmic interpretation of Malayalam's words and their formation rules continues to be an open problem. The word order is generally subject-object-verb, although other orders are often employed for reasons such as emphasis. Nouns are inflected for case and number, whilst verbs are conjugated for tense, mood and causativity (and also, in archaic language, for person, gender, number and polarity). Being the linguistic successor of the macaronic Manipravalam, Malayalam grammar is based on Sanskrit too. Because of this high complexity, it is challenging to work on the Malayalam language.

1.3.1 Need to perform Morphological Analysis

A morphological analyzer and a morphological generator are two essential and basic tools for building any language processing application. Morphological analysis is the process of providing grammatical information about a word given its suffix. A morphological analyzer is a computer program which takes a word as input and produces its grammatical structure as output. It will return the root/stem of the word along with its grammatical information, depending on the word category: for nouns it will provide gender, number and case information, and for verbs it will be tense, aspect and modality. In this project, inflections are removed from each word in the corpus so that the frequency of words in the resulting corpus increases; the resulting corpus contains only the root words. By increasing the frequency of words, the complexity of the word embedding decreases, because sparsity is higher for inflected words. Also, with limited resources, better output is obtained even though morphosyntactic information is not used. In this project only conceptual similarity, that is, conceptually similar words, is captured, so that Malayalam NLP based wordnet, ontology and analogy applications can improve their performance.

1.4 Word embedding

In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages. But there is a challenge: we humans communicate with words and sentences, while computers only understand numbers. For this reason, we have to map those words (sometimes even whole sentences) to vectors, that is, to a bunch of numbers. This is called text vectorization, and it is also termed feature extraction. The two main ways to convert text into numbers are sparse vector representations and dense vector representations.
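As a small illustration of these two families, the sketch below contrasts a sparse one-hot vector with a dense vector; the vocabulary and the dense values are toy examples, not the output of any trained model.

import numpy as np

vocabulary = ["cat", "dog", "mat", "sat", "the"]  # toy vocabulary

# Sparse representation: a one-hot vector the size of the vocabulary,
# with a single 1 at the index of the word.
def one_hot(word):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(word)] = 1.0
    return vec

# Dense representation: a short real-valued vector learned by a model
# (the numbers below are made up for illustration).
dense = {
    "cat": np.array([0.21, -0.45, 0.88, 0.03]),
    "dog": np.array([0.19, -0.40, 0.91, 0.05]),
}

print(one_hot("cat"))  # [1. 0. 0. 0. 0.]  -> mostly zeros (sparse)
print(dense["cat"])    # four numbers, all informative (dense)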

CHAPTER 2
LITERATURE SURVEY

2.1 State of the art

A word embedding is a learned representation for text in which words that have the same meaning have a similar representation. This chapter gives an overview of four popular algorithms for creating word embeddings.

2.1.1 Word2Vec

Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus. It was developed by Tomas Mikolov et al. [12] at Google in 2013 as a response to make neural-network-based training of embeddings more efficient, and since then it has become the de facto standard for developing pretrained word embeddings. The model is, in contrast to other deep learning models, a shallow model of only one layer without non-linearities. The paper by Mikolov et al. [12] introduced two architectures for unsupervised learning of word embeddings from a large corpus of text. The first architecture is called CBOW; it tries to predict the center word from the summation of the context vectors within a specific window. The second, and more successful, architecture is called skip-gram. The widely used pretrained word2vec embeddings are trained using the skip-gram algorithm, which is also the inspiration for the algorithms behind the GloVe and fastText embeddings. These embeddings were the first to be trained on a significantly large corpus of 100 billion tokens from English news articles, and they have a dimensionality of 300. The articles originated from different media outlets and were bundled together in a news search engine from Google called Google News [12].

2.1.2 GloVe

Global Vectors for Word Representation (GloVe) by Pennington, Socher, and Manning [14] was inspired by the skip-gram algorithm and tries to approach the problem from a different direction. Pennington, Socher, and Manning [14] show that the ratio of co-occurrence probabilities of two specific words contains semantic information. The idea is similar to TF-IDF (term frequency-inverse document frequency), but for weighing the importance of a context word during the training of word embeddings. Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using global text statistics but are not as good as learned methods like Word2Vec at capturing meaning and demonstrating it on tasks like calculating analogies. Their algorithm [14] works by gathering all co-occurrence statistics in a large sparse matrix X, in which each element represents the number of times word i co-occurs with word j within a window, similar to skip-gram. The word embeddings are defined in terms of this co-occurrence matrix. There are embeddings with varying dimensionalities trained on Wikipedia articles and tweets from Twitter. Besides these embeddings, Pennington, Socher, and Manning also published embeddings trained on a dataset from Common Crawl [14]. This dataset contains 840 billion tokens, which is significantly more than the 100 billion tokens the Word2vec embeddings were trained on. The published Common Crawl embeddings have a dimensionality of 300 and an original vocabulary of 2.2 million.
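As a rough illustration of the co-occurrence statistics described above, the sketch below counts how often word i co-occurs with word j within a symmetric window; GloVe then fits word vectors to such counts, which is not shown here.

from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    # Count how often each (word, context word) pair occurs within the window.
    counts = defaultdict(float)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1.0
    return counts

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
X = cooccurrence_counts(corpus)
print(X[("cat", "sat")])  # 1.0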

2.1.3 FastText

FastText is a library for efficient learning of word representations and sentence classification. It is written in C++ and supports multiprocessing during training. FastText allows us to train supervised and unsupervised representations of words and sentences. These representations (embeddings) can be used for numerous applications: for data compression, as features for additional models, for candidate selection, or as initializers for transfer learning. Bojanowski et al. [13] introduced the fastText embeddings by extending the skip-gram algorithm to treat words not as atomic units but as bags of character n-grams. One of the main advantages of this approach is that word meaning can now be transferred between words, and thus embeddings of new words can be extrapolated from the embeddings of n-grams already learned. The length of the n-grams used can be controlled by the -minn and -maxn flags, which set the minimum and maximum number of characters respectively; together they control the range of n-gram sizes. The model is considered a bag-of-words model because, aside from the sliding window of n-gram selection, no internal structure of a word is taken into account for featurization; that is, as long as the characters fall under the window, the order of the character n-grams does not matter. We can also turn n-gram embeddings off completely by setting both flags to 0. This can be useful when the 'words' in our model are not words of a particular language and character-level n-grams would not make sense. During the model update, fastText learns weights for each of the n-grams as well as for the entire word token. The authors published pretrained word vectors for 294 different languages, all trained on the Wikipedia dumps of those languages. All the pretrained word embeddings have a dimensionality of 300.
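The character n-gram idea bounded by -minn and -maxn can be illustrated with the following sketch; wrapping each word in '<' and '>' boundary markers mirrors what fastText does, but this is only an illustration of subword extraction, not the library's internal code.

def char_ngrams(word, minn=3, maxn=6):
    # Extract all character n-grams of length minn..maxn from the word,
    # after adding the boundary markers used by fastText.
    token = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams

print(char_ngrams("where", minn=3, maxn=4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']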

2.1.4 Paragram

Wieting et al. [11] introduced a method to tune existing word embeddings using paraphrase data. The focus of their paper is not on creating entirely new word embeddings from a large corpus. Instead, the authors take existing pretrained GloVe embeddings and tune them so that words in similar sentences compose in the same manner. Their training data consists of a set of P phrase pairs. The objective function they use aims to increase cosine similarity, i.e. the similarity of the angles between the composed semantic representations of two paraphrases. It is important to mention that Wieting et al. [11] express similarity in terms of angle and not in terms of actual distance. Additionally, Wieting et al. [11] explored only one algebraic composition function, namely averaging of the word vectors. Specifically, they used version XXL, which contains 86 million paraphrase pairs. An example of a short paraphrase is "thrown into jail", which is semantically similar to "taken into custody". Wieting et al. published their pretrained embeddings, called Paragram-Phrase XXL [11], which are in fact tuned GloVe embeddings. These embeddings also have a dimensionality of 300 and a limited vocabulary of 50,000.

CHAPTER 3
METHODOLOGY

Increasing the accuracy of the word embedding representation of the Malayalam language has a great impact on ontology based, analogy based and other NLP based applications. Improved word vectors for Malayalam can increase the accuracy of pretrained Malayalam word embeddings in NLP applications. The main aim is to improve word embeddings without needing any labeled data. The improvement is made by converting the Malayalam corpus into a standardised format by removing inflections from each word. Any large volume of text can be used to obtain the word embeddings by feeding it to the model, without any kind of labeling. In this project, fastText word embeddings are used. The word embeddings are derived by training a model on a large text corpus.

3.1 Requirements

• Hardware requirements are:

– Processor: Quad core Intel Core i7 Skylake or higher

– RAM: 16 GB or higher

– GPU: NVidia TitanX Pascal (12 GB VRAM)

– Hard Disk: 2 TB

• Software requirements are:

– OS: Ubuntu or other Linux, or Microsoft Windows 7 or above

– TensorFlow

– Python

– Matplotlib

3.2 Steps

These are the steps involved in this project:

• Collect the pretrained word vectors from fastText for the Malayalam language.

• Visualise the pretrained Malayalam word vectors using t-SNE and PCA.

• Calculate the Euclidean distance and cosine similarity of similar words (a minimal sketch follows this list).

• Collect data from other sources and create a new dataset.

• Preprocess the new Malayalam dataset.

• Apply subword algorithms to improve Malayalam NLP model performance.

• Train on the new Malayalam dataset using a neural network.

• Visualise the new word embeddings of the Malayalam dataset using t-SNE and PCA.

• Calculate the Euclidean distance and cosine similarity of similar words in the new dataset.

• Evaluate the new model and calculate the improvement.
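A minimal sketch of the distance and similarity step above is given below, assuming the pretrained Malayalam vectors are available in fastText's .vec text format; the file name and the two words are placeholders to be replaced with actual values.

import numpy as np

def load_vec(path, limit=50000):
    # Load word vectors from a fastText .vec text file
    # (one "count dim" header line, then one word and its vector per line).
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Placeholder file name and words: substitute the actual pretrained file
# and a pair of similar Malayalam words.
vecs = load_vec("pretrained_malayalam.vec")
a, b = vecs["word_a"], vecs["word_b"]
print("cosine similarity:", cosine(a, b))
print("Euclidean distance:", euclidean(a, b))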

3.3 PROPOSED SYSTEM

Increasing the accuracy of pre-trained word embeddings has a great impact on sentiment analysis research and on all NLP based applications. The improvement is made by removing all inflectional parts from each word in the corpus. Using the Malayalam morphological analyser, inflections are removed and the existing corpus is converted into a standardised format. After this preprocessing, sparsity is reduced and the frequency of words in the corpus increases, so the complexity of Malayalam word embedding decreases. In this project, fastText embeddings are used for training, which results in a dense vector representation of each word with dimensionality reduction from the sparse word co-occurrence matrix.

3.4 IMPLEMENTATION

These are the modules included in this project:

• Corpus Creation

• Data Preprocessing

• Training Using Neural Networks

• Model Evaluation

• Dataset Visualisation

3.5 Module Description

3.5.1 CORPUS CREATION

Data was collected from Malayalam news articles to create a Malayalam corpus containing thousands of sentences.

3.5.2 DATA PREPROCESSING

Data in the real world is generally:

• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

• Noisy: containing errors

• Inconsistent: containing discrepancies in codes or names

Malayalam is a highly inflectional and agglutinative language. An inflected language is one that changes the form or ending of some words when the way in which they are used in sentences changes. An agglutinative language is a type of synthetic language whose morphology primarily uses agglutination: words may contain different morphemes that determine their meanings, but all of these morphemes (including stems and affixes) remain, in every aspect, unchanged after their union. This results in generally more easily deducible word meanings compared to fusional languages, which allow modifications to the phonetics or spelling of one or more morphemes within a word, usually resulting in a shortening of the word or easier pronunciation.

Sparsity is higher for inflected words. In natural language processing, data sparsity (also known as data sparseness or data paucity) is the term used to describe the phenomenon of not observing enough data in a corpus to model language accurately: true observations about the distribution and pattern of language cannot be made because there is not enough data to see the true distribution. By removing inflections, sparsity decreases, the complexity of the model decreases, and the frequency of words increases. The complexity of the Malayalam word embedding model therefore decreases.

Data preprocessing is done using the Malayalam morphological analyser (a minimal sketch follows the sandhi rules listed below). Morphological analysis is one of the fundamental tasks in computational processing of natural languages. It is the study of the rules of word construction through analysing syntactic properties and morphological information. In order to perform this task, morphemes have to be separated from the original word; this process is termed sandhi splitting. Sandhi splitting is important in the morphological analysis of agglutinative languages like Malayalam because of their richness in morphology, inflections and sandhi. Due to sandhi, many morphological changes occur at the joining position of morphemes, so determining morpheme boundaries becomes a tough task, especially in languages like Malayalam. In this project, only the roots of the words are used, and a new corpus is created from them.

Various sandhi rules are defined in Malayalam for joining two words to form a new one. On applying these rules, the original appearance of the words taking part in the process is altered. Rules are applied by observing the 'sounds' of the end syllable of the first word and the start syllable of the second word. In Malayalam grammar, a classification of sandhi rules is made based on whether a word ends with a vowel (swaram) or a consonant (vyanjanam). The rules are:

• Elision

• Augmentation

• Substitution

• Reduplication (Gemination)
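A minimal sketch of this preprocessing step is given below; the get_root function is only a stand-in for the Malayalam morphological analyser used in this project (its real interface differs), and the file names are placeholders.

def get_root(word):
    # Placeholder: in the actual project the Malayalam morphological
    # analyser returns the root/stem; here the word is returned unchanged.
    return word

def standardise_corpus(in_path, out_path):
    # Rewrite the corpus with every word replaced by its root form.
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            roots = [get_root(w) for w in line.split()]
            dst.write(" ".join(roots) + "\n")

# Example call (placeholder paths):
# standardise_corpus("malayalam_corpus.txt", "malayalam_corpus_roots.txt")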

3.5.3 ZIPF'S LAW

Zipf's Law was proposed by linguist George Kingsley Zipf [15] to describe a pattern that appears in language: given a large corpus of natural language occurrences, the frequency of any word is inversely proportional to its rank in the frequency table. Frequency is the number of times a word appears in a given sample.

Figure 3.1: Example of elision rule

Figure 3.2: Example of augmentation rule

Figure 3.3: Example of substitution rule

Figure 3.4: Example of reduplication rule

Figure 3.5: Initial model of dataset

For example, if we are given a text file, then the frequency of any word in that text file is how many times that word appears in the file. This law has been observed many times in natural language corpora. By removing inflections, the frequency of words increases, and according to Zipf's law [15], by increasing the frequency of words, the performance of neural word embeddings also increases. Zipf's law is a discrete probability distribution that gives the probability of encountering a word in a given corpus; it describes the frequency distribution of words in a language. In Equation (3.1), k is the rank of the word whose probability of appearing in the corpus we want to find, N is the corpus vocabulary size, and s is a parameter of the probability distribution.

Figure 3.6: Input representation of data to Malayalam morphological analyser

Figure 3.7: Malayalam morphological analyser output

f(k; s, N) = (1/k^s) / (∑_{n=1}^{N} 1/n^s)        (3.1)
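The following sketch computes the empirical rank-frequency table of a corpus and, for comparison, the probability predicted by Equation (3.1); the corpus path is a placeholder.

from collections import Counter

def zipf_probability(k, s, N):
    # f(k; s, N) from Equation (3.1): probability of the word of rank k.
    denom = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / denom

# Empirical rank-frequency table of a (placeholder) corpus file.
with open("malayalam_corpus_roots.txt", encoding="utf-8") as f:
    counts = Counter(f.read().split())

total = sum(counts.values())
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq,
          freq / total,                              # observed probability
          zipf_probability(rank, 1.0, len(counts)))  # Zipf prediction, s = 1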

Figure 3.8: Preprocessed data format

Figure 3.9: Zipf's law result of the initial Malayalam corpus

3.5.4 Training Using Neural Networks

In this project, fastText word embeddings are used. fastText is another word embedding method that is an extension of the word2vec model. Instead of learning vectors for words directly, fastText represents each word as a bag of character n-grams. This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes. Once the word has been represented using character n-grams, a skip-gram model is trained to learn the embeddings. This model is considered a bag-of-words model with a sliding window over a word, because no internal structure of the word is taken into account: as long as the characters are within the window, the order of the n-grams does not matter.

Figure 3.10: Zipf's law result of the preprocessed Malayalam corpus

fastText works well with rare words. Even if a word was not seen during training, it can be broken down into n-grams to obtain its embedding; Word2vec and GloVe both fail to provide any vector representation for words that are not in the model dictionary, so this is a huge advantage of the method. fastText can be used both for classification and for word embedding creation. The word embeddings can be created using either skip-gram or CBOW; both are architectures that learn the underlying word representations for each word using neural networks. Given a set of sentences (also called a corpus), the model loops over the words of each sentence and either uses the current word w to predict its neighbors (i.e., its context), an approach called skip-gram, or uses each of these contexts to predict the current word w, a method called Continuous Bag of Words (CBOW). To limit the number of words in each context, a parameter called the window size is used. In short, CBOW uses the context to predict a target word, while skip-gram uses a word to predict a target context. The latter method is used in this project because it produces more accurate results on large datasets. When the feature vector assigned to a word cannot be used to accurately predict that word's context, the components of the vector are adjusted; the vectors of words judged similar by their context are nudged closer together by adjusting the numbers in the vector. The skip-gram model takes in a corpus of text and creates a one-hot vector for each word. A one-hot vector is a vector representation of a word where the vector is the size of the vocabulary (the total number of unique words); all dimensions are set to 0 except the dimension representing the word that is used as input at that point in time. After training the model, we get two output files, model.bin and model.vec (Figure 3.15). The latter is a text file containing the word vectors, one per line. The former is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
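A minimal sketch of the training step using the fasttext Python package's unsupervised mode is shown below; the corpus path and hyperparameter values are placeholders, and the equivalent command-line skipgram command is what writes both model.bin and model.vec.

import fasttext

# Placeholder path: the preprocessed (root-word) Malayalam corpus as plain text.
model = fasttext.train_unsupervised(
    "malayalam_corpus_roots.txt",
    model="skipgram",   # skip-gram architecture, as described above
    dim=300,            # embedding dimensionality
    minn=3, maxn=6,     # character n-gram length range
    ws=5,               # context window size
)
model.save_model("model.bin")

# Vectors are available even for words not seen during training,
# because they are built from character n-grams.
vec = model.get_word_vector("ഉദാഹരണം")
neighbours = model.get_nearest_neighbors("ഉദാഹരണം", k=5)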

Figure 3.11: The Common Bag Of Words model

Figure 3.12: The skipgram model

Figure 3.13: Training window of the model

Figure 3.14: Output files of the model

3.5.5 MODEL EVALUATION

Word embeddings should capture the relationships between words in natural language. In the Word Similarity and Relatedness Task, word embeddings are evaluated by comparing word similarity scores computed from a pair of words with human labels for the similarity or relatedness of the pair. Cosine similarity (3.2) measures the similarity between two vectors. It is the cosine of the angle between the two vectors and determines whether they point in roughly the same direction; it is often used to measure document similarity in text analysis.

Figure 3.15: model.vec output file

Similarity(A, B) = (A · B) / (|A| ∗ |B|)        (3.2)
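A minimal sketch of the cosine similarity computation in Equation (3.2) is given below; the two vectors are toy values standing in for rows read from model.vec.

import numpy as np

def cosine_similarity(a, b):
    # Equation (3.2): cos(theta) = (A . B) / (|A| * |B|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the embeddings of two similar words.
v1 = np.array([0.12, 0.80, -0.31, 0.05])
v2 = np.array([0.10, 0.75, -0.28, 0.07])
print(cosine_similarity(v1, v2))  # close to 1.0 for similar words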

Figure 3.16: model evaluation result

Figure 3.17: model evaluation result

3.5.6 MODEL VISUALIZATION

PCA and t-SNE are two techniques for visualising a dataset in 2D or 3D space. PCA performs a linear mapping of the data to a lower-dimensional space. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique for dimensionality reduction that is well suited for the visualization of high-dimensional datasets.

Figure 3.18: model visualization
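A minimal sketch of this visualization step using scikit-learn and Matplotlib is shown below; the words and the embedding matrix are placeholders, and in practice the vectors are read from model.vec.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder data: each row is a 300-dimensional word vector.
words = ["w1", "w2", "w3", "w4", "w5"]
embeddings = np.random.rand(len(words), 300)

# Linear projection to 2D with PCA, then a non-linear projection with t-SNE.
pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=2, init="pca",
               random_state=0).fit_transform(embeddings)

for title, points in [("PCA", pca_2d), ("t-SNE", tsne_2d)]:
    plt.figure()
    plt.scatter(points[:, 0], points[:, 1])
    for word, (x, y) in zip(words, points):
        plt.annotate(word, (x, y))
    plt.title(title)
plt.show()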

3.6 DATAFLOW DIAGRAM

Figure 3.19: Dataflow diagram

CHAPTER 4
Conclusion and Future Work

Using the TensorFlow embedding projector, it can be visualised that similar Malayalam words, or synonyms, lie close to each other. The Euclidean distance and cosine similarity of similar words will be lower than with the fastText pre-trained word vectors of the Malayalam language. It can be predicted that, by using Zipf's law, Malayalam word embeddings can be improved. The sparsity problem in the existing Malayalam corpus is also reduced while using fewer resources, that is, only the conceptual similarity of words. Space and time complexity are reduced in the new Malayalam corpus. The improved word vectors can be used in all Malayalam NLP applications to improve their efficiency, accuracy and speed. In future, this model can be used as a base model for improving machine translation for Malayalam and for learning morphosyntactic information on top of it.

REFERENCES

[1] Abdulaziz M. Alayba, Vasile Palade, Matthew England and Rahat Iqbal, "Improving Sentiment Analysis in Arabic Using Word Representation", IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), 2018.

[2] Shaosheng Cao and Wei Lu, "Improving Word Embeddings with Convolutional Feature Learning and Subword Information", 2017.

[3] Yuval Pinter, Robert Guthrie and Jacob Eisenstein, "Mimicking Word Embeddings using Subword RNNs", Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

[4] Vishwani Gupta, Sven Giesselbach, Stefan Ruping and Christian Bauckhage, "Improving Word Embeddings Using Kernel PCA", Proceedings of the 4th Workshop on Representation Learning for NLP.

[5] Procheta Sen and Debasis Ganguly, "Word-Node2Vec: Improving Word Embedding with Document-Level Non-Local Word Co-occurrences", Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[6] Miguel Ballesteros, Chris Dyer and Noah A. Smith, "Improved transition-based parsing by modeling characters instead of words with LSTMs", In Proc. EMNLP, 2015.

[7] Marco Baroni and Alessandro Lenci, "Distributional memory: A general framework for corpus-based semantics", Computational Linguistics, 36(4):673-721, 2010.

[8] Jan A. Botha and Phil Blunsom, "Compositional morphology for word representations and language modelling", In Proc. ICML, 2014.

[9] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun and Huanbo Luan, "Joint learning of character and word embeddings", In Proc. IJCAI, 2015.

[10] Cicero Nogueira dos Santos and Bianca Zadrozny, "Learning character-level representations for part-of-speech tagging", In Proc. ICML, 2014.

[11] John Wieting, Mohit Bansal, Kevin Gimpel and Karen Livescu, "Towards Universal Paraphrastic Sentence Embeddings", In Proc. ICLR, 2016.

[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean, "Distributed Representations of Words and Phrases and their Compositionality", In Proc. NIPS, 2013.

[13] Armand Joulin, Edouard Grave, Piotr Bojanowski and Tomas Mikolov, "Bag of Tricks for Efficient Text Classification", In Proc. EACL, 2017.

[14] Jeffrey Pennington, Richard Socher and Christopher D. Manning, "GloVe: Global Vectors for Word Representation", Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[15] Bernardo Huberman, "Zipf's Law and the Internet", Glottometrics 3, 2002, 19-26.
