Bhawini NLP File
PRACTICAL FILE
FOR
CONTENTS
3 Introduction to NLTK 40
12 Perform an experiment to calculate emission and transition 77
matrix which will be helpful for tagging Parts of Speech
using Hidden Markov Model.
15 93
EXPERIMENT NO 1
INTRODUCTION TO NATURAL LANGUAGE PROCESSING
1. What is NLP?
NLP is an interdisciplinary field concerned with the interactions between computers and natural
human languages (e.g., English) — speech or text. NLP-powered software helps us in our daily lives
in various ways, for example:
The Computer Science side is concerned with applying linguistic knowledge, by transforming it
into computer programs with the help of sub-fields such as Artificial Intelligence (Machine
Learning & Deep Learning).
There are two main phases to natural language processing: data preprocessing and algorithm
development.
Data preprocessing involves preparing and "cleaning" text data for machines to be able to
analyze it. Preprocessing puts data in workable form and highlights features in the text that
an algorithm can work with. There are several ways this can be done, including:
● Tokenization. This is when text is broken down into smaller units to work with.
● Stop word removal. This is when common words are removed from text so unique
words that offer the most information about the text remain.
● Lemmatization and stemming. This is when words are reduced to their root forms to
process.
● Part-of-speech tagging. This is when words are marked based on the part of speech they
are, such as nouns, verbs and adjectives.
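The first two preprocessing steps listed above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the stop-word list and the function names `tokenize` and `remove_stop_words` are assumptions made for this example and are not taken from NLTK or any other toolkit.

```python
import re

# Illustrative stop-word list (an assumption); real toolkits ship much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "in", "on", "of", "and", "to"}


def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = tokenize("The dogs are barking in the park.")
content_tokens = remove_stop_words(tokens)
print(tokens)
print(content_tokens)
```

Lowercasing before matching is a design choice that keeps the stop-word lookup simple; it would lose information in tasks where capitalization matters, such as named entity recognition.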
Once the data has been preprocessed, an algorithm is developed to process it. There are many
different natural language processing algorithms, but two main types are commonly used:
● Rules-based system. This system uses carefully designed linguistic rules. This approach
was used early on in the development of natural language processing, and is still used.
● Machine learning-based system. Machine learning algorithms use statistical methods.
They learn to perform tasks based on training data they are fed, and adjust their methods
as more data is processed. Using a combination of machine learning, deep learning and
neural networks, natural language processing algorithms hone their own rules through
repeated processing and learning.
3. Phases of NLP:-
1. Lexical and Morphological Analysis
The first phase of NLP is lexical analysis. This phase scans the input text as a stream
of characters and converts it into meaningful lexemes. It divides the whole text into
paragraphs, sentences, and words.
2. Syntactic Analysis (Parsing)
Syntactic analysis is used to check grammar and word arrangement, and to show the relationships
among the words.
For example, the sentence "Agra goes to the Poonam" does not make sense in the real world, so it
is rejected by the syntactic analyzer.
3. Semantic Analysis
Semantic analysis is concerned with the meaning representation. It mainly focuses on the
literal meaning of words, phrases, and sentences.
4. Discourse Integration
In discourse integration, the meaning of a sentence depends upon the sentences that precede it
and can also influence the meaning of the sentences that follow it.
5. Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by
applying a set of rules that characterize cooperative dialogues.
Businesses use massive quantities of unstructured, text-heavy data and need a way to
efficiently process it. A lot of the information created online and stored in databases is
natural human language, and until recently, businesses could not effectively analyze this data.
This is where natural language processing is useful.
The advantage of natural language processing can be seen when considering the following
two statements: "Cloud computing insurance should be part of every service-level
agreement," and, "A good SLA ensures an easier night's sleep -- even in the cloud." If a user
relies on natural language processing for search, the program will recognize that cloud
computing is an entity, that cloud is an abbreviated form of cloud computing and that SLA is
an industry acronym for service-level agreement.
These are some of the key areas in which a business can use natural language
processing (NLP).
These are the types of vague elements that frequently appear in human language and that
machine learning algorithms have historically been bad at interpreting. Now, with
improvements in deep learning and machine learning methods, algorithms can effectively
interpret them. These improvements expand the breadth and depth of data that can be
analyzed.
Syntax and semantic analysis are two main techniques used with natural language processing.
Semantics involves the use of and meaning behind words. Natural language processing applies
algorithms to understand the meaning and structure of sentences.
into its component parts and describe their syntactic roles.”
That actually nailed it but it could be a little more comprehensive. Parsing refers to the
formal analysis of a sentence by a computer into its constituents, which results in a parse tree
showing their syntactic relation to one another in visual form, which can be used for further
processing and understanding.
Figure 1.5 parse tree for the sentence "The thief robbed the apartment." Included is a
description of the three different information types conveyed by the sentence.
The letters directly above the single words show the parts of speech for each word (noun,
verb and determiner). One level higher is some hierarchical grouping of words into phrases.
For example, "the thief" is a noun phrase, "robbed the apartment" is a verb phrase and when
put together the two phrases form a sentence, which is marked one level higher.
But what is actually meant by a noun or verb phrase? Noun phrases are one or more words
that contain a noun and maybe some descriptors, verbs or adverbs. The idea is to group
nouns with words that are in relation to them.
A parse tree also provides us with information about the grammatical relationships of the
words due to the structure of their representation. For example, we can see in the structure
that "the thief" is the subject of "robbed."
By structure, I mean that we have the verb ("robbed"), which is marked with a "V" above it
and a "VP" above that, which is linked by an "S" to the subject ("the thief"), which has an
"NP" above it. This is like a template for a subject-verb relationship, and there are many
others for other types of relationships.
➔ Word segmentation. This is the act of taking a string of text and deriving word forms from
it. Example: A person scans a handwritten document into a computer. The algorithm would
be able to analyze the page and recognize that the words are divided by white spaces.
➔ Sentence breaking. This places sentence boundaries in large texts. Example: A natural
language processing algorithm is fed the text, "The dog barked. I woke up." The algorithm
can recognize the period that splits up the sentences using sentence breaking.
➔ Morphological segmentation. This divides words into smaller parts called morphemes.
Example: The word untestably would be broken into [[un[[test]able]]ly], where the algorithm
recognizes "un," "test," "able" and "ly" as morphemes. This is especially useful in machine
translation and speech recognition.
➔ Named Entity Recognition (NER). This is one of the most popular and useful techniques in
semantic analysis, where semantics is the meaning conveyed by the text. Under this technique,
the algorithm takes a phrase or paragraph as input and identifies all the named entities
(such as the names of people, places and organizations) present in that input.
➔ Tokenization. Tokenization is the splitting of the whole text into a list of tokens. Tokens
can be words, sentences, characters, numbers, punctuation, etc. Tokenization has two main
advantages: it reduces search effort to a significant degree, and it makes effective use of
storage space.
➔ Stemming and Lemmatization. The amount of data and information on the web has been at an
all-time high for the past couple of years. This huge volume of data demands tools and
techniques that can extract inferences with ease.
“Stemming is the process of reducing inflected (or sometimes derived) words to their word
stem, base or root form - generally a written form of the word.” In practice, stemming
cuts off suffixes. After stemming, the word “playing” becomes “play”, and “asked” becomes “ask”.
Figure 1.6 Stemming and Lemmatization
Lemmatization usually refers to doing things with the proper use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and to
return the base or dictionary form of a word, which is known as the lemma. In simple words,
lemmatization reduces a word to its lemma after taking into account the part of speech (POS)
or the context of the word in the document.
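The contrast between the two techniques can be sketched as follows. This is a deliberately naive illustration, not the Porter stemmer or any library lemmatizer: the suffix list, the tiny `LEMMA_TABLE` dictionary and both function names are assumptions made for this example.

```python
# Suffixes tried in order by the toy stemmer (an illustrative assumption).
SUFFIXES = ("ing", "ed", "s")

# Hypothetical mini-dictionary mapping (word, part of speech) to its lemma.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("was", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
}


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word, pos):
    """Look up the dictionary form using the word and its part of speech."""
    return LEMMA_TABLE.get((word, pos), word)


print(naive_stem("playing"), naive_stem("asked"))
print(naive_stem("mice"), naive_lemmatize("mice", "NOUN"))
```

The stemmer only manipulates the spelling, so it leaves "mice" unchanged, while the dictionary lookup can return "mouse" because it uses vocabulary knowledge and the part of speech.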
➔ Bag of Words. Bag of words technique is used to pre-process text and to extract all the
features from a text document to use in Machine Learning modeling. It is also a
representation of any text that elaborates/explains the occurrence of the words within a
corpus (document). It is also called “Bag” due to its mechanism, i.e. it is only concerned with
whether known words occur in the document, not the location of the words.
Let’s take an example to understand bag-of-words in more detail. Consider the following two
text documents:
“Neha was angry on Sunil and he was angry on Ramesh”
“Neha love animals”
We treat the two documents as separate entities and make a list of all the distinct words
present in both documents, ignoring punctuation:
“Neha”, “was”, “angry”, “on”, “Sunil”, “and”, “he”, “Ramesh”, “love”, “animals”
Then we convert these documents into vectors (turning text into numbers is called
vectorization in ML) for further modelling.
“Neha was angry on Sunil and he was angry on Ramesh” is represented in vector form as
[1,1,1,1,1,1,1,1,0,0], and “Neha love animals” has the vector form
[1,0,0,0,0,0,0,0,1,1]. So, the bag-of-words technique is mainly used for feature generation
from text data.
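The vectorization of the two example sentences can be reproduced with a short sketch. It assumes a binary bag of words (1 if the word occurs, 0 otherwise) and a vocabulary ordered by first occurrence; the function names are chosen for this example only.

```python
import re


def tokenize(text):
    """Return the alphabetic words of a document, ignoring punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(documents):
    """Collect the distinct words of all documents in order of first occurrence."""
    vocabulary = []
    for document in documents:
        for word in tokenize(document):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_binary_vector(document, vocabulary):
    """Mark each vocabulary word with 1 if it occurs in the document, else 0."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]


documents = [
    "Neha was angry on Sunil and he was angry on Ramesh",
    "Neha love animals",
]
vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(document, vocabulary) for document in documents]
print(vocabulary)
print(vectors)
```

Because only presence is recorded, the repeated words "was", "angry" and "on" in the first sentence still contribute a single 1 each, and word order is discarded entirely.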
➔ Natural Language Generation. Natural language generation (NLG) is a technique that
converts raw structured data into plain English (or any other language). We also call it
data storytelling. This technique is very helpful in organizations where a large amount
of data is used, because it converts structured data into natural language for a better
understanding of patterns or detailed insights into any business. NLG typically involves the
following stages:
1. Content Determination: Deciding what main content or information should be
presented in the text.
2. Document Structuring: Deciding the overall structure of the information to convey.
3. Aggregation: Merging of sentences to improve sentence understanding and
readability.
4. Lexical Choice: Putting appropriate words to convey the meaning of the sentence
more clearly.
5. Referring Expression Generation: Creating references to identify main objects and
regions of the text properly.
6. Realization: Creating and optimizing text that should follow all the norms of
grammar (like syntax, morphology, orthography).
➔ Sentiment Analysis. This is one of the most common natural language processing techniques.
With sentiment analysis, we can understand the emotion or feeling of a written text.
Sentiment analysis is also known as Emotion AI or Opinion Mining.
The basic task of sentiment analysis is to find whether the opinions expressed in a document,
sentence, social media post or film review are positive, negative, or neutral. This is also called
finding the polarity of the text.
For example, Twitter is full of sentiment: users share their reactions and express their
opinions on every possible topic. To access users' tweets in a real-time scenario, there is a
powerful Python library called “Tweepy”.
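Polarity detection can be illustrated with a very small lexicon-based sketch. This is not how production sentiment systems or Tweepy work; the two word sets and the `polarity` function are hypothetical and exist only to show the positive / negative / neutral decision.

```python
import re

# Hypothetical toy lexicons; real systems use far larger, curated resources.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "angry"}


def polarity(text):
    """Classify text by counting positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(token in POSITIVE for token in tokens)
    score -= sum(token in NEGATIVE for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ("I love this great film", "The service was terrible", "The film starts at nine"):
    print(sentence, "->", polarity(sentence))
```

A word-counting approach like this ignores negation ("not good") and sarcasm, which is one reason machine learning-based classifiers are usually preferred in practice.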
➔ Sentence Segmentation. The most fundamental task of this technique is to divide text
into meaningful sentences or phrases. This task involves identifying sentence boundaries
between words in text documents. Almost all languages have punctuation marks that appear at
sentence boundaries. Sentence segmentation is also referred to as sentence boundary detection,
sentence boundary disambiguation or sentence boundary recognition.
There are many libraries available for sentence segmentation, such as NLTK, spaCy and Stanford
CoreNLP, that provide specific functions for the task.
Three tools commonly used for natural language processing are the Natural Language
Toolkit (NLTK), Gensim and Intel NLP Architect. NLTK is an open
source Python module with data sets and tutorials. Gensim is a Python library for topic
modeling and document indexing. Intel NLP Architect is another Python library for deep
learning topologies and techniques.
6. What is Natural Language Processing Used for?
Some of the main functions that natural language processing algorithms perform are:
● Text classification. This involves assigning tags to texts to put them in categories. This
can be useful for sentiment analysis, which helps the natural language processing
algorithm determine the sentiment, or emotion behind a text. For example, when brand A
is mentioned in X number of texts, the algorithm can determine how many of those
mentions were positive and how many were negative. It can also be useful for intent
detection, which helps predict what the speaker or writer may do based on the text they
are producing.
● Text extraction. This involves automatically summarizing text and finding important
pieces of data. One example of this is keyword extraction, which pulls the most
important words from the text, which can be useful for search engine optimization. Doing
this with natural language processing requires some programming -- it is not completely
automated. However, there are plenty of simple keyword extraction tools that automate
most of the process -- the user just has to set parameters within the program. For
example, a tool might pull out the most frequently used words in the text. Another
example is named entity recognition, which extracts the names of people, places and
other entities from text.
● Machine translation. This is the process by which a computer translates text from one
language, such as English, to another language, such as French, without human
intervention.
● Natural language generation. This involves using natural language processing
algorithms to analyze unstructured data and automatically produce content based on that
data. One example of this is in language models such as GPT-3, which are able to analyze
an unstructured text and then generate believable articles based on the text.
The main benefit of NLP is that it improves the way humans and computers communicate
with each other. The most direct way to manipulate a computer is through code -- the
computer's language. By enabling computers to understand human language, interacting with
computers becomes much more intuitive for humans.
There are a number of challenges of natural language processing and most of them boil down to the
fact that natural language is ever-evolving and always somewhat ambiguous. They include:
● Evolving use of language. Natural language processing is also challenged by the fact
that language -- and the way people use it -- is continually changing. Although there are
rules to language, none are written in stone, and they are subject to change over time.
Hard computational rules that work now may become obsolete as the characteristics of
real-world language change over time.
NLP draws from a variety of disciplines, including computer science and computational linguistics
developments dating back to the mid-20th century. Its evolution included the following major
milestones:
● 1950s. Natural language processing has its roots in this decade, when Alan Turing
developed the Turing Test to determine whether or not a computer is truly intelligent. The
test involves automated interpretation and the generation of natural language as a criterion
of intelligence.
● 1950s-1990s. NLP was largely rules-based, using handcrafted rules developed by
linguists to determine how computers would process language.
● 1990s. The top-down, language-first approach to natural language processing was
replaced with a more statistical approach, because advancements in computing made this
a more efficient way of developing NLP technology. Computers were becoming faster
and could be used to develop rules based on linguistic statistics without a linguist
creating all of the rules. Data-driven natural language processing became mainstream
during this decade. Natural language processing shifted from a linguist-based approach to
an engineer-based approach, drawing on a wider variety of scientific disciplines instead
of delving into linguistics.
● 2000s-2020s. Natural language processing saw dramatic growth in popularity as a term.
With advances in computing power, natural language processing has also gained
numerous real-world applications. Today, approaches to NLP involve a combination of
classical linguistics and statistical methods.
Natural language processing plays a vital part in technology and the way humans interact with it. It is used
in many real-world applications in both the business and consumer spheres, including chatbots,
cybersecurity, search engines and big data analytics. Though not without its challenges, NLP is expected to
continue to be an important part of both industry and everyday life.
EXPERIMENT NO 2
INTRODUCTION TO GRAMMARS, PARSERS, POS TAGS
1. What is Grammar?
Grammar is defined as the rules for forming well-structured sentences.
While describing the syntactic structure of well-formed programs, Grammar plays a very essential and
important role. In simple words, Grammar denotes syntactical rules that are used for conversation in
natural languages.
For Example, in the ‘C’ programming language, the precise grammar rules state how functions are made
with the help of lists and statements.
2. Types of Grammar:-
A. Context Free Grammar - A context-free grammar, which is in short represented
as CFG, is a notation used for describing the languages and it is a superset of Regular
grammar which you can see from the following diagram:
CFG consists of a finite set of grammar rules having the following four components:
● Set of Non-Terminals
● Set of Terminals
● Set of Productions
● Start Symbol
Set of Non-terminals
It is represented by V. The non-terminals are syntactic variables that denote the sets of
strings, which helps in defining the language that is generated with the help of
grammar.
Set of Terminals
Terminals are also known as tokens and are represented by Σ. Strings are formed from
these basic symbols.
Set of Productions
It is represented by P. The set gives an idea about how the terminals and non-terminals
can be combined. Every production consists of the following components:
● A non-terminal,
● An arrow,
● A sequence of terminals and/or non-terminals.
The left side of a production is a single non-terminal, while the right side is a
sequence of terminals and/or non-terminals.
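Start Symbol
It is represented by S. Every derivation begins from the start symbol, which is one of the
non-terminals.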
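The four components of a CFG can be written down directly as a data structure. The sketch below is an assumption-laden illustration: the `Grammar` class, the toy grammar `TOY` for the sentence "the thief robbed the apartment" and the helper `leftmost_derivation` are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # V
    terminals: frozenset     # Sigma
    productions: tuple       # P: pairs of (non-terminal, tuple of symbols)
    start: str               # S


# Toy grammar (illustrative assumption) for one English sentence.
TOY = Grammar(
    nonterminals=frozenset({"S", "NP", "VP", "DT", "NN", "V"}),
    terminals=frozenset({"the", "thief", "apartment", "robbed"}),
    productions=(
        ("S", ("NP", "VP")),      # 0
        ("NP", ("DT", "NN")),     # 1
        ("VP", ("V", "NP")),      # 2
        ("DT", ("the",)),         # 3
        ("NN", ("thief",)),       # 4
        ("NN", ("apartment",)),   # 5
        ("V", ("robbed",)),       # 6
    ),
    start="S",
)


def leftmost_derivation(grammar, choices):
    """Apply the productions given by index to the leftmost non-terminal each step."""
    form = [grammar.start]
    steps = [tuple(form)]
    for index in choices:
        head, body = grammar.productions[index]
        position = next(i for i, symbol in enumerate(form) if symbol in grammar.nonterminals)
        if form[position] != head:
            raise ValueError(f"production {index} does not apply to {form[position]}")
        form[position:position + 1] = body
        steps.append(tuple(form))
    return steps


steps = leftmost_derivation(TOY, [0, 1, 3, 4, 2, 6, 1, 3, 5])
for step in steps:
    print(" ".join(step))
```

Each printed line is one sentential form; the derivation starts at the start symbol and ends when only terminals remain.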
B. Constituency Grammar
Before diving into the discussion of constituency grammar (CG), let’s see some fundamental
points about constituency grammar and the constituency relation.
For example, constituency grammar can organize any sentence into its three
constituents: a subject, a context, and an object.
These three constituents can take different values and as a result, they can generate
different sentences. For example, suppose we have the following constituents:
Example sentences that can be generated with the help of the above constituents
are:
Another view of constituency grammar is to define the grammar in
terms of part-of-speech tags.
noun]
which corresponds to the same sentence – “The dogs are barking in the park”
< DT NN > < JJ VB > < PRP DT NN > -------------> The dogs
C. Dependency Grammar
Before diving into the discussion of dependency grammar (DG), let’s see some fundamental
points about dependency grammar and the dependency relation.
1. Dependency grammar states that the words of a sentence are dependent upon other words of the sentence.
For example, in the phrase “barking dog” from the sentence discussed under CG, “dog” is
modified by “barking”, so an adjective-modifier dependency exists between the two.
2. It organizes the words of a sentence according to their dependencies. One of the words in a sentence
behaves as a root and all the other words except that word itself are linked directly or indirectly with the
root using their dependencies. These dependencies represent relationships among the words in a sentence
and dependency grammars are used to infer the structure and semantic dependencies between the words.
In the above tree, the root word is “community” having NN as the part of speech tag
and every other word of this tree is connected to root, directly or indirectly, with the
help of dependency relation such as a direct object, direct subject, modifiers, etc.
These relationships define the roles and functions of each word in the sentence and
how multiple words are connected together.
which implies that a dependent is connected to the governor with the help of relation,
or in other words, they are considered the subject, verb, and object respectively.
scientists
<Analyticsvidhya> <is> <the largest community of data scientists>
Introduction to Parsers
1. Introduction to Parsing
Parsing is defined as "the analysis of an input to organize the data according to the rule of a
grammar."
There are a few ways to define parsing. However, the gist remains the same: parsing means to find
the underlying structure of the data we are given.
In a way, parsing can be considered the inverse of templating: identifying the structure and
extracting the data. In templating, instead, we have a structure and we fill it with data. In the case of
parsing, you have to determine the model from the raw representation, while for templating, you
have to combine the data with the model to create the raw representation. Raw representation is
usually text, but it can also be binary data.
Fundamentally, parsing is necessary because different entities need the data to be in different forms.
Parsing allows transforming data in a way that can be understood by a specific software. The
obvious example is programs — they are written by humans, but they must be executed by
computers. So, humans write them in a form that they can understand, then a software transforms
them in a way that can be used by a computer.
2. Role of Parser
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by the
lexical analyzer are grouped according to the syntactic rules of the language. This is done by a
parser. The parser obtains a string of tokens from the lexical analyzer and verifies that the string can
be generated by the grammar for the source language. It detects and reports any syntax errors and produces a parse
tree from which intermediate code can be generated.
3. Structure of Parser
Having clarified the role of regular expressions, we can look at the general structure of a
parser. A complete parser is usually composed of two parts: a lexer, also known as scanner
or tokenizer, and the proper parser. The parser needs the lexer because it does not work
directly on the text but on the output produced by the lexer. Not all parsers adopt this
two-step schema; some parsers do not depend on a separate lexer and they combine the two
steps. They are called scannerless parsers.
A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens;
the parser then scans the tokens and produces the parsing result.
Let’s look at the following example and imagine that we are trying to parse addition.
437 + 734
The lexer scans the text and finds 4, 3, and 7, and then a space ( ). The job of the lexer is to
recognize that the characters 437 constitute one token of type NUM. Then the lexer finds a +
symbol, which corresponds to the second token of type PLUS, and lastly, it finds another token of
type NUM.
The parser will typically combine the tokens produced by the lexer and group them.
The definitions used by lexers and parsers are called rules or productions. In our example, a lexer
rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule
will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to a sum expression.
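The lexer and parser rules for the addition example can be sketched as follows. This is a minimal illustration built on Python's `re` module; the names `TOKEN_SPEC`, `lex` and `parse_sum`, and the extra SKIP rule for whitespace, are assumptions of this example rather than the output of any parser generator.

```python
import re

# Lexer rules: token type and the regular expression that recognizes it.
TOKEN_SPEC = [("NUM", r"\d+"), ("PLUS", r"\+"), ("SKIP", r"\s+")]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(text):
    """Turn the input characters into (type, text) tokens, discarding whitespace."""
    tokens = []
    position = 0
    while position < len(text):
        match = TOKEN_RE.match(text, position)
        if match is None:
            raise SyntaxError(f"unexpected character {text[position]!r}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


def parse_sum(tokens):
    """Parser rule: the token sequence NUM PLUS NUM forms a sum expression."""
    kinds = [kind for kind, _ in tokens]
    if kinds != ["NUM", "PLUS", "NUM"]:
        raise SyntaxError(f"expected NUM PLUS NUM, found {kinds}")
    return ("sum", int(tokens[0][1]), int(tokens[2][1]))


tokens = lex("437 + 734")
print(tokens)
print(parse_sum(tokens))
```

The parser never sees individual characters or spaces; it only works on the token types produced by the lexer.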
It is now typical to find suites that can generate both a lexer and parser. In the past, it was instead
more common to combine two different tools: one to produce the lexer and one to produce the
parser. For example, this was the case of the venerable lex and yacc couple: using lex, it was possible
to generate a lexer, while using yacc, it was possible to generate a parser.
Lexers are also known as scanners or tokenizers. Lexers play a role in parsing because they
transform the initial input in a form that is more manageable by the proper parser, which works at a
later stage. Typically lexers are easier to write than parsers, although there are special cases when
both are quite complicated; for instance, in the case of C.
A very important part of the job of the lexer is dealing with whitespace. Most of the time, you want
the lexer to discard whitespace. That is because otherwise, the parser would have to check for the
presence of whitespace between every single token, which would quickly become annoying.
There are two terms that are related and sometimes they are used interchangeably: parse tree and
abstract syntax tree (AST). Technically, the parse tree could also be called a concrete syntax tree
(CST) because it should reflect more concretely the actual syntax of the input, at least compared to
the AST.
Conceptually, they are very similar. They are both trees: there is a root node representing
the whole source code. The root has child nodes that contain subtrees representing smaller and
smaller portions of code, until single tokens (terminals) appear in the tree.
The difference is in the level of abstraction. A parse tree might contain all the tokens that appeared
in the program and possibly, a set of intermediate rules. The AST, instead, is a polished version of
the parse tree, in which only the information relevant to understanding the code is maintained. We
are going to see an example of an intermediate rule in the next section.
Some information might be absent both in the AST and the parse tree. For instance, comments and
grouping symbols (i.e. parentheses) are usually not represented. Things like comments are
superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.
In the AST the indication of the specific operator has disappeared and all that remains is the
operation to be performed. The specific operator is an example of an intermediate rule.
Graphical Representation of a Tree
The output of a parser is a tree, but the tree can also be represented in graphical ways. This
makes it easier for the developer to understand. Some parser generator tools can output a file in the
DOT language, a language designed to describe graphs (a tree is a particular kind of graph). Then
this file is fed to a program that can create a graphical representation starting from this textual
description.
digraph sum {
    sum -> 10;
    sum -> 21;
}
6. Parsing Algorithms
Overview
Let’s start with a global overview of the features and strategies of all parsers.
Two Strategies
There are two strategies for parsing: top-down parsing and bottom-up parsing. Both terms are
defined in relation to the parse tree generated by the parser. Explained in a simple way:
● A top-down parser tries to identify the root of the parse tree first, then moves down the
subtrees until it finds the leaves of the tree.
● A bottom-up parser instead starts from the lowest part of the tree, the leaves, and rises up
until it determines the root of the tree.
Figure 2.7 Example parse tree
The same tree would be generated in a different order by a top-down and a bottom-up parser. In the
following images, the number indicates the order in which the nodes are created.
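The difference in node creation order can be approximated with two tree traversals. This is a simplification and not a real parser: it assumes the tree is already known and treats top-down creation as a pre-order walk and bottom-up creation as a post-order walk. The tree and both function names are illustrative.

```python
# A tree node is a pair (label, list of child nodes); this tree is illustrative.
TREE = ("S", [("NP", [("DT", []), ("NN", [])]), ("VP", [("V", [])])])


def top_down_order(node):
    """Pre-order walk: a node is created before any of its children."""
    label, children = node
    order = [label]
    for child in children:
        order.extend(top_down_order(child))
    return order


def bottom_up_order(node):
    """Post-order walk: a node is created only after all of its children."""
    label, children = node
    order = []
    for child in children:
        order.extend(bottom_up_order(child))
    order.append(label)
    return order


print(top_down_order(TREE))
print(bottom_up_order(TREE))
```

In the first list the root comes first and the leaves follow; in the second the leaves come first and the root is created last.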
Tables of Parsing Algorithms
We provide a table below to offer a summary of the main information needed to understand and
implement a specific parser algorithm. The table lists:
To understand how a parsing algorithm works, you can also look at the syntax analytic toolkit. It is
an educational parser generator that describes the steps that a generated parser takes to accomplish
its objective. It implements an LL and an LR algorithm.
The second table shows a summary of the main features of the different parsing algorithms and for
what they are generally used.
Figure 2.11 Table of features of parsing algorithms
1. Top-Down Algorithms
The top-down strategy is the most widespread of the two strategies and there are several successful
algorithms applying it.
LL Parser
LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without
backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide
which rule to apply. The parsing table uses nonterminals as rows and terminals as columns.
1. Firstly, the parser looks at the current token and the appropriate amount of lookahead tokens.
2. Then, it tries to apply the different rules until it finds the correct match.
The concept of the LL parser does not refer to a specific algorithm, but more to a class of parsers.
They are defined in relation to grammars. That is to say, an LL parser is one that can parse an LL
grammar. In turn, LL grammars are defined in relation to the number of lookahead tokens that are
needed to parse them. This number is indicated between parentheses next to LL, so in the form
LL(k).
An LL(k) parser uses k tokens of lookahead and thus it can parse, at most, a grammar that needs k
tokens of lookahead to be parsed. Effectively, the concept of the LL(k) grammar is more widely
employed than the corresponding parser — which means that LL(k) grammars are used as a meter
when comparing different algorithms. For instance, you would read that PEG parsers can handle
LL(*) grammars.
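A table-driven LL(1) parser can be sketched in a few lines. The grammar below (SUM → NUM REST, REST → PLUS NUM REST | ε) and its hand-written table are assumptions chosen for this example; a real tool would compute the table from the grammar.

```python
# Parsing table: (nonterminal row, lookahead terminal column) -> rule body.
TABLE = {
    ("SUM", "NUM"): ["NUM", "REST"],
    ("REST", "PLUS"): ["PLUS", "NUM", "REST"],
    ("REST", "EOF"): [],  # REST may derive the empty string at end of input
}
NONTERMINALS = {"SUM", "REST"}


def ll1_accepts(token_kinds):
    """Return True if the token types form a valid sum, using one token of lookahead."""
    tokens = list(token_kinds) + ["EOF"]
    stack = ["EOF", "SUM"]  # start symbol on top, end marker at the bottom
    position = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[position]
        if top in NONTERMINALS:
            rule = TABLE.get((top, lookahead))
            if rule is None:
                return False  # empty table cell: syntax error
            stack.extend(reversed(rule))  # push the body, leftmost symbol on top
        elif top == lookahead:
            position += 1  # terminal matches the input: consume it
        else:
            return False
    return position == len(tokens)


print(ll1_accepts(["NUM", "PLUS", "NUM"]))
print(ll1_accepts(["NUM", "PLUS"]))
```

The parser never backtracks: each decision is made by a single table lookup on the current nonterminal and the next token.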
Earley Parser
The Earley parser is a chart parser named after its inventor Jay Earley. The algorithm is usually
compared to CYK, another chart parser, that is simpler but also usually worse in performance and
memory. The distinguishing feature of the Earley algorithm is that, in addition to storing partial
results, it implements a prediction step to decide which rule it is going to try to match next.
The Earley parser fundamentally works by dividing a rule in segments, like in the following
example.
Then, working on the segments, which are connected at the dot (.), the parser tries to reach a
completed state; that is to say, one with the dot at the end.
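The idea of a rule divided at a dot can be sketched with a small data class. This shows only the dotted item, not a full Earley parser (a real item also records where in the input it started); the class name `Item` and the rule SUM → NUM PLUS NUM are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    head: str     # left-hand side of the rule
    body: tuple   # right-hand side symbols
    dot: int = 0  # how many body symbols have been matched so far

    def next_symbol(self):
        """Symbol just after the dot, or None when the item is complete."""
        return self.body[self.dot] if self.dot < len(self.body) else None

    def advance(self):
        """Return a new item with the dot moved one symbol to the right."""
        return Item(self.head, self.body, self.dot + 1)

    def is_complete(self):
        return self.dot == len(self.body)

    def __str__(self):
        symbols = list(self.body)
        symbols.insert(self.dot, ".")
        return f"{self.head} -> {' '.join(symbols)}"


item = Item("SUM", ("NUM", "PLUS", "NUM"))
while True:
    print(item, "(complete)" if item.is_complete() else "")
    if item.is_complete():
        break
    item = item.advance()
```

Symbols to the left of the dot have already been recognized, and the symbol to its right is what the parser expects next.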
The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages,
while other famous algorithms (e.g., LL, LR) can parse only a subset of them. For instance, it has no
problem with left-recursive grammars. More generally, an Earley parser can also deal with
nondeterministic and ambiguous grammars.
It can do that at the risk of worse performance, O(n^3) in the worst case. However, it has linear
time performance for normal grammars. The catch is that the set of languages parsed by more
traditional algorithms is the one we are usually interested in.
There is also a side effect of the lack of limitations: forcing a developer to write the grammar in
a certain way can make parsing more efficient. For example, building an LL(1) grammar might be harder for
the developer, but the parser can apply it very efficiently. With Earley, you do less work, so the
parser does more of it.
In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in
terms of performance.
A recursive descent parser is a parser that works with a set of (mutually) recursive procedures,
usually one for each rule of the grammar. Thus, the structure of the parser mirrors the structure of
the grammar.
The term predictive parser is used in a few different ways: some people mean it as a synonym for a
top-down parser, some as a recursive descent parser that never backtracks.
Typically, recursive descent parsers have problems parsing left-recursive rules because the algorithm
would end up calling the same function again and again. A possible solution to this problem is using
tail recursion. Parsers that use this method are called tail recursive parsers.
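To make the idea concrete, here is a minimal sketch of a recursive descent parser for arithmetic expressions. The grammar, the function names and the tuple-based tree are assumptions made for illustration. Note that this sketch avoids the left-recursion problem by replacing the rule expr -> expr '+' term with a loop, which is a different, commonly used workaround.

```python
import re


def tokenize(text):
    # Split the input into numbers, operators and parentheses.
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr -> term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        # term -> factor (('*' | '/') factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor -> NUMBER | '(' expr ')'
        token = self.take()
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is None or not token.isdigit():
            raise SyntaxError(f"unexpected token: {token!r}")
        return int(token)


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token: {parser.peek()!r}")
    return tree
```

Each grammar rule has its own method, so the structure of the parser mirrors the structure of the grammar.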
Pratt Parser
A Pratt parser is a widely unused, but much appreciated (by the few who know it), parsing algorithm
defined by Vaughan Pratt in a paper called Top Down Operator Precedence. The paper itself starts
with a polemic on BNF grammars, which the author argues have wrongly become the exclusive concern
of parsing studies. This is one of the reasons for its lack of success. In fact, the algorithm does
not rely on a grammar but works directly on tokens, which makes it unusual to parsing experts.
The second reason is that traditional top-down parsers work great if you have a meaningful prefix
that helps distinguish between different rules. For example, if you get the token FOR, you are
looking at a for statement. Since this essentially applies to all programming languages and their
statements, it is easy to understand why the Pratt parser did not change the parsing world.
Parser Combinator
A parser combinator is a higher-order function that accepts parser functions as input and returns a
new parser function as output. A parser function usually means a function that accepts a string and
outputs a parse tree.
Parser combinators are modular and easy to build, but they are also slower (they have O(n^4)
complexity in the worst case) and less sophisticated. They are typically adopted for easier parsing
tasks or for prototyping. In a sense, the user of a parser combinator builds the parser partially by
hand but relies on the hard work done by whoever created the parser combinator.
The most basic example is the Maybe monad. This is a wrapper around a normal type, like integer,
that returns the value itself when the value is valid (e.g., 567), but a special value, Nothing, when it is
not (e.g., undefined or the result of a division by zero). Thus, you can avoid using a null value and
unceremoniously crashing the program. Instead, the Nothing value is handled normally, like any
other value.
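A minimal sketch of parser combinators in Python follows. All names (char, alt, seq, many) are hypothetical. Each parser takes the text and a position and returns a (value, new position) pair, or None on failure, with None standing in for the Nothing value described above.

```python
def char(expected):
    # Parser that matches one specific character.
    def parser(text, pos):
        if pos < len(text) and text[pos] == expected:
            return expected, pos + 1
        return None  # failure, playing the role of Nothing
    return parser


def alt(first, second):
    # Combinator: try the first parser, fall back to the second.
    def parser(text, pos):
        result = first(text, pos)
        return result if result is not None else second(text, pos)
    return parser


def seq(first, second):
    # Combinator: run two parsers one after the other.
    def parser(text, pos):
        left = first(text, pos)
        if left is None:
            return None
        right = second(text, left[1])
        if right is None:
            return None
        return (left[0], right[0]), right[1]
    return parser


def many(inner):
    # Combinator: apply a parser zero or more times.
    def parser(text, pos):
        values = []
        while True:
            result = inner(text, pos)
            if result is None:
                return values, pos
            values.append(result[0])
            pos = result[1]
    return parser


# A binary number is one digit followed by zero or more digits.
digit = alt(char("0"), char("1"))
binary_number = seq(digit, many(digit))
```

The new parser binary_number is built purely by combining smaller parsers, which is what makes the approach modular.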
2. Bottom-Up Algorithms
The bottom-up strategy's main success is the family of many different LR parsers.
The reason for their relative unpopularity is that, historically, they have been harder to build, although
LR parsers are more powerful than traditional LL(1) parsers. So, we mostly concentrate on them,
apart from a brief description of CYK parsers.
This means that we avoid talking about the more generic class of shift-reduce parsers, which also
includes LR parsers. Shift-reduce algorithms work with two steps:
1. Shift: Read one token from the input, which will become a new (momentarily isolated) node.
2. Reduce: Once the proper rule is matched, join the resulting tree with a previously existing
subtree.
Basically, the Shift step reads the input until completion, while the Reduce step joins the subtrees
until the final parse tree is built.
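The two steps can be sketched as follows. This is a deliberately naive illustration: the toy grammar E -> E + n | n is an assumption, and the parser reduces greedily, whereas real LR parsers consult tables to decide when to shift and when to reduce.

```python
# Hypothetical toy grammar: E -> E + n | n (longer rule bodies listed first).
RULES = [
    ("E", ("E", "+", "n")),
    ("E", ("n",)),
]


def shift_reduce(tokens):
    stack = []  # each entry is (symbol, subtree)
    for token in tokens:
        # Shift: the token becomes a new, momentarily isolated node.
        stack.append((token, token))
        # Reduce: while the top of the stack matches a rule body, join it.
        reduced = True
        while reduced:
            reduced = False
            for head, body in RULES:
                top = stack[-len(body):]
                if tuple(symbol for symbol, _ in top) == body:
                    children = [tree for _, tree in top]
                    del stack[-len(body):]
                    stack.append((head, (head, children)))
                    reduced = True
                    break
    if len(stack) == 1 and stack[0][0] == "E":
        return stack[0][1]
    return None  # the input is not a sentence of the grammar
```

For the input n + n, the first n is shifted and reduced to E, then + and n are shifted, and the three nodes are joined into a single E subtree.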
CYK Parser
The Cocke-Younger-Kasami (CYK) algorithm was formulated independently by three authors. Its
notability is due to a great worst-case performance, O(n^3), although it is hampered by
comparatively bad performance in most common scenarios.
However, the real disadvantage of the algorithm is that it requires grammars to be expressed in
Chomsky normal form.
The CYK algorithm is used mostly for specific problems; for instance, the membership problem: to
determine if a string is compatible with a certain grammar. It can also be used in natural language
processing to find the most probable parse among many options.
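The membership problem can be sketched in Python as below. The grammar is a hypothetical toy grammar already written in Chomsky normal form, and the function only answers yes or no; it does not build parse trees or handle probabilities.

```python
def cyk(words, unary_rules, binary_rules, start="S"):
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive words[i..j].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = {head for head, terminal in unary_rules if terminal == word}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two halves
                for head, (left, right) in binary_rules:
                    if left in table[i][k] and right in table[k + 1][j]:
                        table[i][j].add(head)
    return start in table[0][n - 1]


# Hypothetical toy grammar in Chomsky normal form.
UNARY = [("NP", "she"), ("V", "eats"), ("NP", "fish"), ("Det", "the"), ("N", "fish")]
BINARY = [("S", ("NP", "VP")), ("VP", ("V", "NP")), ("NP", ("Det", "N"))]

accepted = cyk("she eats the fish".split(), UNARY, BINARY)
```

The three nested loops over span, start position and split point are the source of the O(n^3) behaviour.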
LR Parser
LR (Left-to-right read of the input; Rightmost derivation) parsers are bottom-up parsers that can
handle deterministic context-free languages in linear time with lookahead and without backtracking.
The invention of LR parsers is credited to the renowned Donald Knuth.
Traditionally, they have been compared to and have competed with LL parsers. There's a similar
analysis related to the number of lookahead tokens necessary to parse a language. An LR(k) parser
can parse grammars that need k tokens of lookahead to be parsed. However, LR grammars are less
restrictive, and thus more powerful, than the corresponding LL grammars. For example, there is no
need to exclude left-recursive rules.
Technically, LR grammars are a superset of LL grammars. Moreover, every language that has an
LR(k) grammar also has an LR(1) grammar, so in practice you need only LR(1) grammars and the (k) is
usually omitted.
They are also table-based, just like LL parsers, but they need two complicated tables. In very simple
terms:
1. One table tells the parser what to do depending on the current token, the state it's in, and the
tokens that could possibly follow the current one (lookahead sets).
2. The other table tells the parser which state to move to next.
1. Part-of-Speech Tagging
Part-of-Speech (POS) tagging is the process of assigning labels, known as POS tags, to the words in
a sentence; each tag tells us the part of speech of the word.
It converts a sentence from a list of words into a list of tuples, where each tuple has the
form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun,
adjective, verb, and so on.
It is a popular Natural Language Processing task that categorizes the words in a text (corpus)
according to a particular part of speech, depending on the definition of the word and its context.
1. Universal POS Tags: These tags are used in the Universal Dependencies (UD) (latest version 2), a
project that is developing cross-linguistically consistent treebank annotation for many languages.
These tags are based on the type of words, e.g., NOUN (common noun), ADJ (adjective), ADV (adverb).
2. Detailed POS Tags: These tags are the result of dividing the universal POS tags into finer tags,
like NNS for plural common nouns and NN for singular common nouns, compared to NOUN for
common nouns in English. These tags are language-specific.
The format for a tagged corpus is word/tag: each word is followed by a tag denoting its POS. For example,
nn refers to a noun and vb to a verb.
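As a small illustration of the word/tag format, the sketch below turns a tagged string into the list of (word, tag) tuples described earlier. The function name parse_tagged and the sample sentence are hypothetical. NLTK offers a similar helper, nltk.tag.str2tuple, for a single token.

```python
def parse_tagged(text, sep="/"):
    # Split on whitespace, then split each token at its last separator.
    pairs = []
    for token in text.split():
        if sep in token:
            word, _, tag = token.rpartition(sep)
        else:
            word, tag = token, None  # token without a tag
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("the/at dog/nn barks/vb")
```

Splitting at the last separator keeps words that themselves contain a slash intact.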
Example 2 of Part-of-speech (POS) tagged corpus
In Figure 1, we can see that each word has its own lexical term written underneath. However, having to
constantly write out these full terms when we perform text analysis can very quickly become cumbersome,
especially as the size of the corpus grows. Instead, we use a short representation referred to as “tags” to
represent the categories.
As mentioned earlier, the process of assigning a specific tag to a word in our corpus is referred to as
part-of-speech tagging (POS tagging for short), since the POS tags are used to describe the lexical terms that
we have within our text.
Figure 2.15: Grid displaying different types of lexical terms, their tags, and random examples
Part-of-speech tags describe the characteristic structure of lexical terms within a sentence or text; therefore,
we can use them for making assumptions about semantics. Other applications of POS tagging include:
Markov Chains
Taking the example text we used in Figure 1, “Why not tell someone?”, imagine the sentence is truncated to
“Why not tell … ” and we want to determine whether the following word in the sentence is a noun, verb,
adverb, or some other part of speech.
Now, if you are familiar with English, you’d instantly identify the verb and assume that it is more likely
to be followed by a noun than by another verb. Therefore, the idea shown in this example is that the
POS tag assigned to the next word depends on the POS tag of the previous word.
By associating a number with each arrow, which gives the likelihood of the next word's tag given the
current one, we can say that if we are currently on a verb, the next word in our sentence is more likely
to be a noun than another verb. The figure is a great example of how a Markov model works on a very
small scale.
Given this example, we may now describe Markov models as “a stochastic model used to model randomly
changing systems. It is assumed that future states depend only on the current state, not on the events that
occurred before it (that is, it assumes the Markov property)”. Therefore, to get the probability of the next
event, we need only the current state.
We can depict a Markov chain as a directed graph:
The lines with arrows indicate the direction, hence the name “directed graph”, and the circles
may be regarded as the states of the model; a state is simply the condition of the present moment.
We could use this Markov model to perform POS tagging. If we view a sentence as a sequence of words,
we can represent the sequence as a graph in which the POS tags are the events that occur, illustrated
by the states of our model graph.
For example, q1 in the figure would become NN, indicating a noun, q2 would be VB, which is short for verb,
and q3 would be O, signifying all other tags that are not NN or VB. As in Figure 3, the directed lines
would be given a transition probability that defines the probability of going from one state to the next.
A more compact way to store the transition and state probabilities is using a table, better known as a
“transition matrix”.
Figure 2.18: Transition Matrix (Image by Author)
Notice this model only tells us the transition probability of one state to the next when we know the previous
word. Hence, this model does not show us what to do when there is no previous word. To handle this case,
we add what is known as the “initial state”.
Figure 2.19: Adding an Initial State to deal with beginning of word matrix
You may now be wondering, how did we populate the transition matrix? Great question. I will use 3
sentences for our corpus: “<s> in a station of the metro”, “<s> the apparition of these faces in the
crowd”, and “<s> petals on a wet, black bough.” (Note these are the same sentences used in the course.) Next,
we will break down how to populate the matrix into steps:
At the end of step one, our table would look something like this…
Figure 2.20: applying step one with our corpus.
Applying the above formula to the table in Figure 2.20, our new table would look as follows…
You may notice that there are many 0’s in our transition matrix which would result in our model being
incapable of generalizing to other text that may contain verbs. To overcome this problem, we add
smoothing.
Adding smoothing requires that we slightly adjust the formula by adding a small value, epsilon, to each of
the counts in the numerator, and N * epsilon to the denominator, so that each row still sums to
1.
Figure 2.22: New probabilities with smoothing added. N is the number of tags and epsilon is some
very small number.
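The counting and smoothing steps can be sketched in Python as follows. The tag sequences are hypothetical (they are not the tags of the three corpus sentences), and N is taken to be the number of tags so that every row sums to 1.

```python
from collections import defaultdict


def transition_matrix(tag_sequences, tags, epsilon=0.001, start="<s>"):
    # Count how often each tag follows the previous tag (or the start marker).
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in tag_sequences:
        previous = start
        for tag in sequence:
            counts[previous][tag] += 1
            previous = tag
    # Smoothed probability: (count + epsilon) / (row total + N * epsilon).
    matrix = {}
    for previous in [start] + tags:
        row_total = sum(counts[previous].values())
        matrix[previous] = {
            tag: (counts[previous][tag] + epsilon) / (row_total + len(tags) * epsilon)
            for tag in tags
        }
    return matrix


# Hypothetical tag sequences; O stands for every tag that is not NN or VB.
TAGS = ["NN", "VB", "O"]
sequences = [["O", "O", "NN", "O", "O", "NN"], ["O", "NN", "VB", "O", "NN"]]
A = transition_matrix(sequences, TAGS)
```

The row for the start marker plays the role of the initial state, and thanks to epsilon no entry of A is exactly zero.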
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed
to be a Markov process with unobservable (“hidden”) states. In our case, the unobservable states are the
POS tags of the words.
If we rewind back to our Markov model, we see that the model has states for parts of speech, such as VB for
verb and NN for noun. We may now think of these as hidden states since they are not directly observable
from the corpus. Though a human may be capable of deciphering which POS applies to a specific word, a
machine only sees the text, which is therefore observable; it is unaware of whether a word's POS tag is
noun, verb, or something else, which in turn means the tags are unobservable.
The emission probabilities describe the transitions from the hidden states in the model (the POS tags)
to the observable states (the words).
In Figure 2.23 we see that for the hidden VB state we have observable states. The emission probability from
the hidden state VB to the observable eat is 0.5, hence there is a 50% chance that the model would output
this word when the current hidden state is VB.
We can also represent the emission probabilities as a table…
Figure 2.24: Emission matrix expressed as a table. The numbers are not accurate representations,
they are just random.
Similar to the transition probability matrix, the row values must sum to 1. Also, the reason all of our POS
tags' emission probabilities are greater than 0 is that words can have a different POS tag depending on the
context.
To populate the emission matrix, we’d follow a procedure very similar to the way we’d populate the
transition matrix. We’d first count how often a word is tagged with a specific tag.
Figure 2.25: Calculating the counts of a word and how often it is tagged with a specific tag.
Since the process is so similar to calculating the transition matrix, I will instead provide you with the
formula with smoothing applied to see how it would be calculated.
Formula 4: Formula for calculating transition probabilities where N is the number of tags and epsilon is
a very small number
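A minimal sketch of the emission-matrix calculation is given below. The tagged words are hypothetical, and the sketch assumes that the smoothing term in the denominator is epsilon times the vocabulary size (written V in the code), which is what makes each row sum to 1.

```python
from collections import defaultdict


def emission_matrix(tagged_words, epsilon=0.001):
    # Count how often each word is tagged with each tag.
    counts = defaultdict(lambda: defaultdict(int))
    vocabulary = set()
    for word, tag in tagged_words:
        counts[tag][word] += 1
        vocabulary.add(word)
    words = sorted(vocabulary)
    V = len(words)  # assumption: smoothing uses the vocabulary size
    matrix = {}
    for tag in list(counts):
        tag_total = sum(counts[tag].values())
        matrix[tag] = {
            word: (counts[tag][word] + epsilon) / (tag_total + V * epsilon)
            for word in words
        }
    return matrix


# Hypothetical tagged words, chosen so that VB emits "eat" about half of the time.
data = [("eat", "VB"), ("run", "VB"), ("food", "NN"), ("run", "NN")]
B = emission_matrix(data)
```

The word run appears under both tags, which illustrates why a word can have a non-zero emission probability for more than one POS tag.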
EXPERIMENT NO 3
INTRODUCTION TO NLTK
Natural language processing is about building applications and services that can understand human
languages. It is a field concerned with the interaction between computers and humans. It is mainly
used for text analysis, which gives computers a way to recognize human language.
Moreover, NLP is the technology behind the chatbots, voice assistants, predictive text and other
text applications that have emerged in recent years. There is a wide variety of open-source NLP
tools available.
With the help of NLP tools and techniques, most NLP tasks can be performed. A few examples
of NLP tasks are speech recognition, summarization, topic segmentation, understanding what
the content is about, and sentiment analysis.
Understanding NLTK
NLTK is a leading platform for developing Python programs that work with human language data.
It is a suite of open-source program modules, tutorials and problem sets that provides
ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical Natural
Language Processing and is interfaced to annotated corpora, which makes it especially suitable
for teachers and students.
have intense knowledge in programming.
4. NLTK is a combination of three factors: first, it was deliberately designed as
courseware and gives pedagogical goals primary status; second, its target audience
comprises both linguists and computer scientists, and it is accessible and challenging
at many levels of prior computational skill; and third, it is based on an object-oriented
scripting language that supports rapid prototyping and literate programming.
Requirements of NLTK
1. Ease of use: One of the main objectives of the toolkit is to enable users to
focus on developing NLP components and systems. The more time students must spend
learning to use the toolkit, the less useful it is.
2. Consistency: The toolkit must use consistent data structures and interfaces.
3. Extensibility: The toolkit should easily accommodate new components, whether those components
replicate or extend the existing functionality and performance of the toolkit. The toolkit should be
structured in such a way that new extensions fit naturally into the toolkit’s
existing infrastructure.
4. Documentation: The toolkit, its data structures and its implementation all need to be
documented carefully. All nomenclature must be chosen carefully and applied
consistently.
5. Simplicity: The toolkit should structure the complexities of building NLP systems, not
hide them. So, every class defined by the toolkit should be simple enough that users
could implement it by the time they finish an introductory course in computational linguistics.
6. Modularity: The interaction between the various components of the toolkit should be
kept to a minimum, using simple, well-defined interfaces. In particular, it should be possible to
complete individual projects using small parts of the toolkit, without worrying about how they
interact with the rest of the toolkit.
Uses of NLTK
1. Assignments: NLTK can be used to create student assignments of varying difficulty and
scope. After becoming familiar with the toolkit, users can make small changes or
extensions to an existing module in NLTK. When developing a new module, NLTK gives
some useful starting points: pre-defined interfaces and data structures, and existing modules
that implement the same interface.
2. Class demonstrations: NLTK offers graphical tools that can be used in class
demonstrations to help explain elementary NLP concepts and algorithms. These
interactive tools can be used to display the relevant data structures and to show the
step-by-step execution of algorithms.
3. Advanced Projects: NLTK provides users with a flexible framework for advanced
projects. Typical projects include the development of entirely new functionality for a previously
unsupported NLP task, or the development of a complete system from existing and new
modules.
NLTK is a powerful Python package that provides a diverse set of natural language algorithms. It is
free, open source, easy to use, well documented, and backed by a large community. NLTK includes the
most common algorithms, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic
segmentation, and named entity recognition. NLTK helps the computer to analyze, preprocess, and
understand written text.
First we install nltk and import it in our system. Open the terminal and type the following
command:
pip install nltk
Then we import nltk using the following command and start with the operations.
#Loading NLTK
import nltk
1. Tokenization
Tokenization is the first step in text analytics. The process of breaking down a text paragraph
into smaller chunks, such as words or sentences, is called tokenization. A token is a single entity
that is a building block of a sentence or paragraph.
2. Sentence Tokenization
Output -
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is
awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
3. Word Tokenization
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',',
'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat',
'cardboard']
4. Frequency Distribution
fdist.most_common(2)
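The idea behind a frequency distribution can be sketched with the standard library alone. NLTK's FreqDist behaves like collections.Counter, so the sketch below uses Counter directly. The token list is a shortened, hypothetical excerpt of the word-tokenization output shown above.

```python
from collections import Counter

# Shortened token list, used only for illustration.
tokens = ['The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
          'The', 'sky', 'is', 'pinkish-blue', '.']

# Count how many times each token occurs.
counts = Counter(tokens)

# The two most frequent tokens with their counts.
top_two = counts.most_common(2)
```

Here the token is occurs three times, so it is the first entry of top_two.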
5. Stopwords
Stopwords are considered noise in the text. Text may contain stop words such as is, am, are,
this, a, an, the, etc.
To remove stopwords in NLTK, you need to create a list of stopwords and filter your
list of tokens against these words.
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)
output:-
{'their', 'then', 'not', 'ma', 'here', 'other', 'won', 'up', 'weren', 'being', 'we', 'those', 'an', 'them',
'which', 'him', 'so', 'yourselves', 'what', 'own', 'has', 'should', 'above', 'in', 'myself', 'against',
'that', 'before', 't', 'just', 'into', 'about', 'most', 'd', 'where', 'our', 'or', 'such', 'ours', 'of', 'doesn',
'further', 'needn', 'now', 'some', 'too', 'hasn', 'more', 'the', 'yours', 'her', 'below', 'same', 'how',
'very', 'is', 'did', 'you', 'his', 'when', 'few', 'does', 'down', 'yourself', 'i', 'do', 'both', 'shan', 'have',
'itself', 'shouldn', 'through', 'themselves', 'o', 'didn', 've', 'm', 'off', 'out', 'but', 'and', 'doing', 'any',
'nor', 'over', 'had', 'because', 'himself', 'theirs', 'me', 'by', 'she', 'whom', 'hers', 're', 'hadn', 'who',
'he', 'my', 'if', 'will', 'are', 'why', 'from', 'am', 'with', 'been', 'its', 'ourselves', 'ain', 'couldn', 'a',
'aren', 'under', 'll', 'on', 'y', 'can', 'they', 'than', 'after', 'wouldn', 'each', 'once', 'mightn', 'for',
'this', 'these', 's', 'only', 'haven', 'having', 'all', 'don', 'it', 'there', 'until', 'again', 'to', 'while', 'be',
'no', 'during', 'herself', 'as', 'mustn', 'between', 'was', 'at', 'your', 'were', 'isn', 'wasn'}
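Removing Stopwords
from nltk.tokenize import word_tokenize
# Tokenize the first sentence of the example text
tokenized_sent=word_tokenize("Hello Mr. Smith, how are you doing today?")
filtered_sent=[]
# Keep only the tokens that are not stopwords
for w in tokenized_sent:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:",tokenized_sent)
print("Filtered Sentence:",filtered_sent)
output:-
Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?']
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']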
6. Lexicon Normalization
Lexicon normalization deals with another type of noise in the text. For example, the words
connection, connected and connecting all reduce to the common word "connect". It reduces
derivationally related forms of a word to a common root word.
7. Stemming
Stemming is a process of linguistic normalization, which reduces words to their root
word or chops off the derivational affixes. For example, the words connection, connected and
connecting all reduce to the common word "connect".
# Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
stemmed_words=[]
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
print("Filtered Sentence:",filtered_sent)
print("Stemmed Sentence:",stemmed_words)
Output:-
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']
Stemmed Sentence: ['hello', 'mr.', 'smith', ',', 'today', '?']
8. Lemmatization
Lemmatization reduces words to their base word, which is a linguistically correct lemma. It
transforms a word to its root with the use of vocabulary and morphological analysis. Lemmatization
is usually more sophisticated than stemming. A stemmer works on an individual word without
knowledge of the context. For example, the word "better" has "good" as its lemma. Stemming
will miss this because it requires a dictionary look-up.
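#Lexicon Normalization
#performing stemming and Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))
output-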
Lemmatized Word: fly
Stemmed Word: fli
9. POS Tagging
sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens=nltk.word_tokenize(sent)
print(tokens)
output:-
['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']
nltk.pos_tag(tokens)
Output-
[('Albert', 'NNP'),
('Einstein', 'NNP'),
('was', 'VBD'),
('born', 'VBN'),
('in', 'IN'),
('Ulm', 'NNP'),
(',', ','),
('Germany', 'NNP'),
('in', 'IN'),
('1879', 'CD'),
('.', '.')]
EXPERIMENT NO 4
WRITE A PYTHON PROGRAM TO REMOVE “STOPWORDS” FROM A GIVEN
TEXT AND GENERATE WORD TOKENS AND FILTERED TEXT
To remove stopwords in NLTK, you need to create a list of stopwords and filter your
list of tokens against these words.
Code:-
import nltk
# nltk.download('stopwords')  # needed once, if the stopword list is not installed yet
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
Output:-
Removing Stopwords
Code:-
Output
EXPERIMENT NO 5
WRITE A PYTHON PROGRAM TO GENERATE “TOKENS” AND ASSIGN “POS
TAGS” FOR A GIVEN TEXT USING NLTK PACKAGE
Code -
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
# Parentheses join the adjacent string literals into one text
text = ("Tokenization is one of the least glamorous parts of NLP. How do we split our text "
        "so that we can do interesting things on it. "
        "Despite its lack of glamour, it’s super important. "
        "Tokenization defines what our NLP models can express. "
        "Even though tokenization is super important, it’s not always top of mind. "
        "In the rest of this article, I’d like to give you a high-level overview of tokenization, "
        "where it came from, "
        "what forms it takes, and when and how tokenization is important ")
# Split the text into sentences, then tokenize and tag each sentence
tokenized = sent_tokenize(text)
for i in tokenized:
    wordsList = nltk.word_tokenize(i)
    pos_tag = nltk.pos_tag(wordsList)
    print("Pos-tags",pos_tag)
Output:-
EXPERIMENT NO 6
WRITE A PYTHON PROGRAM TO GENERATE “WORDCLOUD” WITH
MAXIMUM WORDS USED = 100, IN DIFFERENT SHAPES AND SAVE AS
A .PNG FILE FOR A GIVEN TEXT FILE.
Wordcloud 1
Code:-
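from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import numpy as np
# Read the given text file ('sample.txt' is a placeholder name)
text = open('sample.txt', encoding='utf-8').read()
custom_mask = np.array(Image.open('like.png'))
wc = WordCloud(background_color = 'black',
               stopwords = STOPWORDS,
               max_words = 100,
               mask = custom_mask,
               contour_width = 3,
               contour_color = 'black')
wc.generate(text)
image_colors = ImageColorGenerator(custom_mask)
wc.recolor(color_func = image_colors)
#Saving as a .png file
wc.to_file('like_cloud.png')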
The Image :-
Output :-
Wordcloud 2
Code:-
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
# Read the given text file ('sample.txt' is a placeholder name)
text = open('sample.txt', encoding='utf-8').read()
custom_mask = np.array(Image.open('girl.png'))
wc = WordCloud(background_color = 'white',
               stopwords = STOPWORDS,
               max_words = 100,
               mask = custom_mask,
               contour_width = 3,
               contour_color = 'black')
wc.generate(text)
image_colors = ImageColorGenerator(custom_mask)
wc.recolor(color_func = image_colors)
wc.to_file('girl_cloud.png')
The Image : -
Output :-
EXPERIMENT NO 7
PERFORM AN EXPERIMENT TO LEARN ABOUT MORPHOLOGICAL
FEATURES OF A WORD BY ANALYZING IT.
A word can be simple or complex. For example, the word 'cat' is simple because one cannot further
decompose the word into smaller parts. On the other hand, the word 'cats' is complex, because the word is
made up of two parts: the root 'cat' and the plural suffix '-s'.
Theory
Analysis of a word into root and affix(es) is called morphological analysis of a word. It is
mandatory to identify the root of a word for any natural language processing task. A root word can
have various forms. For example, the word 'play' in English has the following forms: 'play', 'plays',
'played' and 'playing'. Hindi shows a greater number of forms for the word 'खेल' (khela), which is
equivalent to 'play'. The forms of 'खेल' (khela) are the following:
For the Telugu root ఆడడం (Adadam), the forms are the following:
Thus we understand that morphological richness might vary from one language
to another. Indian languages are generally morphologically rich languages and therefore
morphological analysis of words becomes a very significant task for Indian languages.
Types of Morphology
1. Inflectional morphology
Deals with word forms of a root, where there is no change in lexical category. For example, 'played'
is an inflection of the root word 'play'. Here, both 'played' and 'play' are verbs.
2. Derivational morphology
Deals with word forms of a root, where there is a change in the lexical category. For example, the
word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness' is a derived noun form of
the adjective 'happy'.
Morphological Features:
All words will have their lexical category attested during morphological analysis.
A noun and pronoun can take suffixes of the following features: gender, number, person, case
A verb can take suffixes of the following features: tense, aspect, modality, gender, number, person.
◆ 'rt' stands for root.
◆ 'cat' stands for lexical category. The value of lexical category can be noun, verb,
adjective, pronoun, adverb, preposition.
◆ 'gen' stands for gender. The value of gender can be masculine or feminine.
◆ 'num' stands for number. The value of number can be singular (sg) or plural (pl).
◆ 'per' stands for person. The value of person can be 1, 2 or 3.
◆ The value of tense can be present, past or future. This feature is applicable for verbs.
◆ The value of aspect can be perfect (pft), continuous (cont) or habitual (hab). This feature is
applicable for verbs.
◆ 'case' can be direct or oblique. This feature is applicable for nouns. A case is an oblique case
when a postposition occurs after the noun. If no postposition can occur after the noun, then the case
is a direct case. This is applicable for Hindi but not English, as English doesn't have any
postpositions. Some of the postpositions in Hindi are: का(kaa), की(kii), के(ke), को(ko), में (meM)
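The feature names above map naturally onto a Python dictionary. The sketch below is a toy analyser, not the tool used in the experiment: the suffix rules are assumptions that cover only a few regular English endings, and the function name analyse is hypothetical.

```python
# Toy suffix rules (assumptions for illustration, regular English forms only).
SUFFIX_RULES = [
    ("ing", {"cat": "verb", "aspect": "cont"}),
    ("ed", {"cat": "verb", "tense": "past"}),
    ("s", {"cat": "noun", "num": "pl"}),
]


def analyse(word):
    # Try each suffix; the first one that leaves a plausible root wins.
    for suffix, features in SUFFIX_RULES:
        root = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(root) >= 3:
            return {"rt": root, **features}
    return {"rt": word}  # no known suffix: the word is treated as its own root


cats = analyse("cats")
played = analyse("played")
```

A real analyser would also need a lexicon, since a form such as 'plays' can be analysed either as a noun or as a verb.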
Objective :- The objective of the experiment is to learn about morphological features of a word by
analysing it.
STEP3: Select the features.
OUTPUT: Right features are marked by tick and wrong features are marked by cross.
EXPERIMENT NO 8
PERFORM AN EXPERIMENT TO GENERATE WORD FORMS FROM
ROOT AND SUFFIX INFORMATION
A word can be simple or complex. For example, the word 'cat' is simple because one cannot further
decompose the word into smaller parts. On the other hand, the word 'cats' is complex, because the word is
made up of two parts: the root 'cat' and the plural suffix '-s'.
Theory:- Given the root and suffix information, a word can be generated. For example,
- Analysis may involve non-determinism, since more than one analysis is possible.
- Generation is a deterministic process. In case a language allows spelling variation, then to that
extent generation would also involve non-determinism.
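Generation from root and suffix information can be sketched as a simple table lookup. The suffix table below is an assumption that covers only regular English forms, and the function name generate is hypothetical.

```python
# Hypothetical suffix table for regular English forms: (category, feature) -> suffix.
SUFFIXES = {
    ("noun", "sg"): "",
    ("noun", "pl"): "s",
    ("verb", "past"): "ed",
    ("verb", "cont"): "ing",
}


def generate(root, category, feature):
    # Generation is deterministic here: one root plus one feature gives one form.
    return root + SUFFIXES[(category, feature)]


plural = generate("cat", "noun", "pl")
past = generate("play", "verb", "past")
```

Because each (category, feature) pair maps to exactly one suffix, the same input always produces the same word form.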
Objective : The objective of the experiment is to generate word forms from root and suffix information
Procedure :-
OUTPUT: Drop downs for selecting root and other features will appear.
STEP3: After selecting all the features, select the word corresponding to the features selected above.
STEP4: Click the check button to see whether the right word is selected or not.
EXPERIMENT NO 9
PERFORM AN EXPERIMENT TO UNDERSTAND THE MORPHOLOGY OF
A WORD BY THE USE OF ADD-DELETE TABLE
Introduction : Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e.,
morphemes. A morpheme is the smallest meaningful linguistic unit. For example:
● बच्चों(bachchoM) consists of two morphemes: बच्चा(bachchaa) carries the information of the root
noun "बच्चा"(bachchaa), and ओं(oM) carries the information of plural and oblique case.
● played has two morphemes, play and -ed, carrying the information verb "play" and "past tense", so the
given word is the past tense form of the verb "play".
Words can be analysed morphologically if we know all variants of a given root word. We can use an
'Add-Delete' table for this analysis.
Theory :-
Morph Analyser
Definition
Morphemes are considered the smallest meaningful units of language. A morpheme can either
be a root word (play) or an affix (-ed). The combination of these morphemes is called a morphological
process. So, the word "played" is made out of 2 morphemes, "play" and "-ed". Finding all parts of a
word (morphemes), and thus describing the properties of the word, is called "Morphological Analysis". For
example, "played" has the information verb "play" and "past tense", so the given word is the past tense
form of the verb "play".
Analysis of a word :
बच्चों (bachchoM) = बच्चा(bachchaa)(root) + ओं(oM)(suffix)
A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be
classified according to shared inflectional categories (eg: number, case etc) and arranged into tables.
Delete: आ(aa), Add: ए(e), Number: plu, Case: dr, Variant: बच्चे(bachche)
Paradigm Class
Words in the same paradigm class behave similarly. For example, लड़क is in the same paradigm class
as बच्च, so लड़का would behave similarly to बच्चा, as they share the same paradigm class.
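An Add-Delete table can be applied mechanically, as the following sketch shows. Transliterated forms are used so that plain string slicing works. The singular direct row and the transliteration laDakaa for लड़का are assumptions, while the other two rows follow the analysis of बच्चे and बच्चों given above.

```python
# Each row: (delete, add, number, case). The ("", "", "sg", "dr") row is an assumption.
ADD_DELETE_TABLE = [
    ("", "", "sg", "dr"),
    ("aa", "e", "pl", "dr"),
    ("aa", "oM", "pl", "ob"),
]


def variants(root_word, table=ADD_DELETE_TABLE):
    # Delete the given ending from the root word, then add the new ending.
    forms = {}
    for delete, add, number, case in table:
        if not root_word.endswith(delete):
            continue
        stem = root_word[: len(root_word) - len(delete)]
        forms[(number, case)] = stem + add
    return forms


def analyse(word, root_word, table=ADD_DELETE_TABLE):
    # Return the (number, case) features of every row that produces the word.
    return [features for features, form in variants(root_word, table).items()
            if form == word]


bachchaa_forms = variants("bachchaa")
laDakaa_forms = variants("laDakaa")  # same paradigm class, same table
```

Because both words share one paradigm class, a single table is enough to generate or analyse the variants of either word.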
Procedure :-
Wrong output:-
Right output:-
EXPERIMENT NO 10
PERFORM AN EXPERIMENT TO LEARN TO CALCULATE BIGRAMS
FROM A GIVEN CORPUS AND CALCULATE PROBABILITY OF A
SENTENCE.
Introduction :- N - Grams
The probability of a sentence can be calculated from the probability of the sequence of words occurring
in it. We can use the Markov assumption that the probability of a word in a sentence depends only on the
word occurring just before it. Such a model is called a first-order Markov model or
the bigram model.
Here, Wn refers to the word token corresponding to the nth word in a sequence.
Theory
A combination of words forms a sentence. However, such a formation is meaningful only when the
words are arranged in some order.
A sentence whose words are arranged in an arbitrary order is not grammatically acceptable. However, some
perfectly grammatical sentences can be nonsensical too!
One easy way to handle such unacceptable sentences is to assign probabilities to strings of
words, i.e., how likely the sentence is.
Probability of a sentence
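If we consider each word occurring in its correct location as a separate event, the probability of
the sentence is: P(w(1), w(2), ..., w(n-1), w(n))
Using chain rule:
P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1) w(2)) * ... * P(w(n) | w(1) w(2) ... w(n-1))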
Bigrams
We can avoid this very long calculation by approximating that the probability of a given word
depends only on its previous word. This assumption is called the Markov assumption,
and such a model is called a Markov model (bigrams). Bigrams can be generalized to the n-gram, which
looks at (n-1) words in the past. A bigram is a first-order Markov model.
Therefore,
P(w(1), w(2), ..., w(n)) ≈ P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(2)) * ... * P(w(n) | w(n-1))
A bigram table for a given corpus can be generated and used as a lookup table for calculating
probability of sentences.
Eg: Corpus – (eos) You book a flight (eos) I read a book (eos) You read (eos)
Bigram Table (row = previous word, column = next word):
         (eos)  You  book  a    flight  I    read
a        0      0    0.5   0    0.5     0    0
flight   1      0    0     0    0       0    0
I        0      0    0     0    0       0    1
read     0.5    0    0     0.5  0       0    0
=.020625
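The lookup-table idea can be sketched in Python on the same corpus. This is a minimal illustration, not the lab's own implementation: the names `bigram_table` and `sequence_probability` are hypothetical, and each row is normalized by the number of bigrams that start with the previous word, which is an assumption about how the boundary token (eos) is counted.

```python
from collections import Counter, defaultdict

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)".split()

# Count each adjacent word pair (previous word, next word).
bigram_counts = Counter(zip(corpus, corpus[1:]))

# Number of bigrams that start with each word; used to normalize a row.
row_totals = Counter(prev for prev, _ in bigram_counts.elements())

# bigram_table[prev][nxt] = P(nxt | prev)
bigram_table = defaultdict(dict)
for (prev, nxt), count in bigram_counts.items():
    bigram_table[prev][nxt] = count / row_totals[prev]


def sequence_probability(words):
    """Multiply P(w_i | w_{i-1}) over consecutive pairs; unseen bigrams give 0."""
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= bigram_table[prev].get(nxt, 0.0)
    return prob


print(bigram_table["a"])
print(sequence_probability("I read a book".split()))
```

Only non-zero cells are stored, so a missing entry stands for a zero in the table. The caller decides whether to include the (eos) markers in the word list passed to `sequence_probability`.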
Objective :- The objective of this experiment is to learn to calculate bigrams from a given corpus and
calculate probability of a sentence.
Procedure:-
STEP1: Select a corpus and click on Submit.
STEP3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step 2.
STEP4: If correct (green), click on take a quiz and fill the correct answer
EXPERIMENT NO 11
PERFORM AN EXPERIMENT TO LEARN HOW TO APPLY ADD-ONE
SMOOTHING ON SPARSE BIGRAM TABLE.
Some bigrams never occur in a training corpus and therefore receive zero probability. There are
techniques that can be used for assigning a non-zero probability to these 'zero probability bigrams'.
This task of re-evaluating some of the zero-probability and low-probability
N-grams, and assigning them non-zero values, is called smoothing.
Theory :-
The standard N-gram models are trained from some corpus. The finiteness of the training corpus
leads to the absence of some perfectly acceptable N-grams. This results in sparse bigram matrices.
Such models tend to underestimate the probability of strings that do not occur in their training
corpus.
There are some techniques that can be used for assigning a non-zero probability to these 'zero
probability bigrams'. This task of re-evaluating some of the zero-probability and low-probability
N-grams, and assigning them non-zero values, is called smoothing. Some of the techniques are:
Add-One Smoothing, Witten-Bell Discounting, Good-Turing Discounting.
Add-One Smoothing
In add-one smoothing, we add one to all the bigram counts before normalizing them into
probabilities.
Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram probability can be computed by
dividing the count of the word by the total number of word tokens N:
P(w_x) = c(w_x) / sum_i c(w_i)
       = c(w_x) / N
Application on bigrams
Normal bigram probabilities are computed by normalizing each row of counts by the unigram count:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
For add-one smoothed bigram probabilities, we need to augment the unigram count by the total number of
word types in the vocabulary, V:
P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
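A short Python sketch of the smoothed estimate, using the small corpus from Experiment 10 as example data. The function name `smoothed_probability` is hypothetical, and two choices here are assumptions rather than something the text fixes: (eos) is counted as a word type in V, and C(w_{n-1}) is taken as the number of bigrams that start with the word.

```python
from collections import Counter

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)".split()

vocabulary = sorted(set(corpus))
V = len(vocabulary)  # number of word types, counting (eos) as a type

bigram_counts = Counter(zip(corpus, corpus[1:]))
# C(w_{n-1}): how many bigrams start with the word.
prev_counts = Counter(corpus[:-1])


def smoothed_probability(prev, nxt):
    """Add-one estimate P*(nxt | prev) = (C(prev nxt) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, nxt)] + 1) / (prev_counts[prev] + V)


print(V)
print(smoothed_probability("a", "flight"))  # seen bigram
print(smoothed_probability("a", "You"))     # unseen bigram, now non-zero
```

Every row of the smoothed table still sums to 1, because the V ones added to the numerators are matched by the V added to the denominator.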
Objective:- The objective of this experiment is to learn how to apply add-one smoothing on sparse bigram
table.
Procedure :-
STEP1: Select a corpus
STEP2: Apply add-one smoothing and calculate bigram probabilities using the given bigram
counts, N and V. Fill the table and hit Submit.
STEP3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step 2
EXPERIMENT NO 12
PERFORM AN EXPERIMENT TO CALCULATE EMISSION AND TRANSITION
MATRIX WHICH WILL BE HELPFUL FOR TAGGING PARTS OF SPEECH
USING HIDDEN MARKOV MODEL.
Introduction:-
POS TAGGING - Hidden Markov Model
POS tagging or part-of-speech tagging is the procedure of assigning a grammatical category like
noun, verb, adjective etc. to a word. In this process both the lexical information and the context play
an important role as the same lexical form can behave differently in a different context.
For example the word "Park" can have two different lexical categories based on the context.
Assigning part of speech to words by hand is a common exercise one can find in an elementary
grammar class. But here we wish to build an automated tool which can assign the appropriate
part-of-speech tag to the words of a given sentence. One can think of creating hand-crafted rules by
observing patterns in the language, but this would limit the system's performance to the quality and
number of patterns identified by the rule crafter. Thus, this approach is not adopted in practice for
building a POS tagger. Instead, a large corpus annotated with correct POS tags for each word is given
to the computer, and algorithms then learn the patterns automatically from the data and store them in
the form of a trained model. Later, this model can be used to POS tag new sentences.
In this experiment we will explore how such a model can be learned from the data.
Theory : -
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled
is assumed to be a Markov process with unobserved (hidden) states. In a regular Markov model
(Ref: http://en.wikipedia.org/wiki/Markov_model), the state is directly visible to
the observer, and therefore the state transition probabilities are the only parameters. In a hidden
Markov model, the state is not directly visible, but output, dependent on the state, is visible.
1) Transition Probabilities: The one-step transition probability is the probability of transitioning from
one state to another in a single step.
2) Emission Probabilities: The output probabilities for an observation from a state. Emission
probabilities B = { b_{i,k} = b_i(o_k) = P(o_k | q_i) }, where o_k is an observation. Informally, B is the
probability that the output is o_k given that the current state is q_i.
For POS tagging, it is assumed that POS tags are generated by a random process, and each tag
randomly generates a word. Hence, the transition matrix denotes the transition probability from one POS tag
to another, and the emission matrix denotes the probability of a given word being generated by a particular POS tag.
Words act as the observations. Some of the basic assumptions are:
P(t_{i+1} = t_k | t_i = t_j) = P(t_2 = t_k | t_1 = t_j) = P(t_j -> t_k)
2. Output probabilities:
Probability of getting word w_k for tag t_j: P(w_k | t_j) is independent of other tags or words.
EOS/eos They/pronoun cut/verb the/determiner paper/noun
EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun
EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun
EOS/eos
Count the number of times a specific word occurs with a specific POS tag in the corpus.
Here, say for "cut":
count(cut,verb)=1
count(cut,noun)=2
count(cut,determiner)=0
Probability to be filled in the matrix cell at the intersection of cut and verb, where count(verb)=3 is
the number of words tagged verb:
P(cut/verb)=count(cut,verb)/count(verb)=1/3=0.33
Similarly,
P(cut/determiner)=count(cut,determiner)/count(determiner)=0/3=0
Repeat the same for all the word-tag combinations and fill the emission matrix.
Count the number of times a specific tag comes after other POS tags in the corpus.
count(verb,determiner)=2
count(preposition,determiner)=1
count(determiner,determiner)=0
count(eos,determiner)=0
count(noun,determiner)=0
Probability to be filled in the cell at the intersection of determiner (in the column) and verb (in the
row), where count(verb)=3 is the number of transitions that start from verb:
P(determiner/verb)=count(verb,determiner)/count(verb)=2/3=0.67
Similarly,
Probability to be filled in the cell at the intersection of determiner (in the column) and noun (in the
row):
P(determiner/noun)=count(noun,determiner)/count(noun)=0/4=0
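The counting steps can be sketched in Python on the same tagged corpus. This is a minimal sketch under assumptions: it follows the definitions P(o_k | q_i) and P(t_{i+1} | t_i) from the Theory, so each count is divided by the count of the conditioning tag (for the cells worked above this gives the same 1/3, 0 and 2/3 values). The function names and the lowercasing of words are illustrative choices, not part of the lab.

```python
from collections import Counter

tagged_corpus = (
    "EOS/eos They/pronoun cut/verb the/determiner paper/noun "
    "EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun "
    "EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun "
    "EOS/eos"
)

# Lowercase every token so that "Put" and "put" would share one row.
pairs = [tuple(token.lower().rsplit("/", 1)) for token in tagged_corpus.split()]
tags = [tag for _, tag in pairs]

word_tag_counts = Counter(pairs)                  # count(word, tag)
tag_counts = Counter(tags)                        # count(tag)
tag_bigram_counts = Counter(zip(tags, tags[1:]))  # count(previous tag, next tag)
outgoing_counts = Counter(tags[:-1])              # transitions leaving each tag


def emission(word, tag):
    """P(word | tag) = count(word, tag) / count(tag)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]


def transition(prev_tag, next_tag):
    """P(next_tag | prev_tag) = count(prev_tag, next_tag) / count(prev_tag)."""
    return tag_bigram_counts[(prev_tag, next_tag)] / outgoing_counts[prev_tag]


print(round(emission("cut", "verb"), 2))
print(round(transition("verb", "determiner"), 2))
```

Calling `emission` and `transition` for every word-tag and tag-tag pair fills the two matrices cell by cell.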
Objective - The objective of the experiment is to calculate emission and transition matrix which will be
helpful for tagging Parts of Speech using Hidden Markov Model.
Procedure :-
STEP2: For the given corpus fill the emission and transition matrix. Answers are rounded to 2 decimal
digits.
STEP3: Press Check to check your answer.
EXPERIMENT NO 13
PERFORM AN EXPERIMENT TO KNOW THE IMPORTANCE OF CONTEXT
AND SIZE OF TRAINING CORPUS IN LEARNING PARTS OF SPEECH
Introduction-
Building POS Tagger
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or
word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a
particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent
and related words in a phrase, sentence, or paragraph. A simplified form of this is identification of words as
nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of
computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of
speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups:
rule-based and stochastic.
Theory:-
In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs) to disambiguate
parts of speech. HMMs involve counting cases, and making a table of the probabilities of certain sequences.
For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an
adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more
likely to be a noun than a verb or a modal. The same method can of course be used to benefit from
knowledge about the following words.
More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger
sequences. So, for example, if you've just seen an article and a verb, the next item may be very likely a
preposition, article, or noun, but much less likely another verb.
When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate
every combination and to assign a relative probability to each one, by multiplying together the probabilities
of each choice in turn.
It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language
parsing, that merely assigning the most common tag to each known word and the tag "proper noun" to all
unknowns, will approach 90% accuracy because many words are unambiguous.
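The baseline described above can be sketched in a few lines of Python. This is an illustration only: it trains on the small tagged corpus from Experiment 12, the function name `baseline_tag` is hypothetical, and "Delhi" is just an example of a word that does not occur in that corpus.

```python
from collections import Counter, defaultdict

tagged_corpus = (
    "EOS/eos They/pronoun cut/verb the/determiner paper/noun "
    "EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun "
    "EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun "
    "EOS/eos"
)

# Count how often each word carries each tag, ignoring case and the EOS markers.
tag_counts_by_word = defaultdict(Counter)
for token in tagged_corpus.split():
    word, tag = token.rsplit("/", 1)
    if tag != "eos":
        tag_counts_by_word[word.lower()][tag] += 1


def baseline_tag(words, unknown_tag="proper noun"):
    """Give each known word its most frequent tag and each unknown word `unknown_tag`."""
    tagged = []
    for word in words:
        counts = tag_counts_by_word.get(word.lower())
        tag = counts.most_common(1)[0][0] if counts else unknown_tag
        tagged.append((word, tag))
    return tagged


print(baseline_tag("They cut the paper in Delhi".split()))
```

In this example "cut" is tagged as a noun, its most frequent tag in the corpus, although it is a verb in the sentence. This is the kind of error that a context-aware model such as an HMM can correct.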
HMMs underlie the functioning of stochastic taggers and are used in various algorithms. Accuracies for one
such algorithm (TnT) on various training data are shown here.
Conditional random fields (CRFs) are a class of statistical modelling methods often applied in machine
learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a
single sample without regard to "neighboring" samples, a CRF can take context into account. Because it can
consider context, a CRF can be used in Natural Language Processing, including Parts of Speech
tagging. It predicts the POS tag using the lexical items (words) as the context.
If only one neighbour is considered as context, the model is called a bigram model. Similarly, using two
neighbours as context gives a trigram model. In this experiment, the size of the training corpus and the
context were varied to learn their importance.
Objective - The objective of the experiment is to know the importance of context and size of training
corpus in learning Parts of Speech
Procedure :-
STEP1: Select the language.
OUTPUT: Drop down to select size of corpus, algorithm and features will appear.
STEP4:
OUTPUT: Corresponding accuracy will be shown.
EXPERIMENT NO 14
PERFORM AN EXPERIMENT TO UNDERSTAND THE CONCEPT OF
CHUNKING AND GET FAMILIAR WITH THE BASIC CHUNK TAGSET.
Introduction : - Chunking
Chunking of text involves dividing a text into groups of syntactically correlated words. For example, the
sentence 'He ate an apple.' can be divided as follows:
Each chunk has an open boundary and close boundary that delimit the word groups as a minimal
non-recursive unit. This can be formally expressed by using IOB prefixes.
Theory : -
Eg: दरवाज़ा खुल गया
[NP दरवाज़ा] [VP खुल गया]
Chunk Types
NP Noun Chunks
Eg:
'this' 'book' 'in'
Verb Chunks
For English:
The types of verb chunks and their tags are described below.
3. VGNN Gerunds
Eg:
वह लड़की है (सुन्दर/JJ)JJP
Note: Adjectives appearing before a noun will be grouped
together within the noun chunk.
Eg:
He walks (slowly/ADV)ADVP
PP Prepositional Chunk
Eg:
(with/IN)PP a pen
IOB prefixes
Tokens   POS   Chunk-Tags
He       PRP   B-NP
ate      VBD   B-VP
an       DT    B-NP
apple    NN    I-NP
to       TO    B-VP
satiate  VB    I-VP
his      PRP$  B-NP
hunger   NN    I-NP
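To see how the B- and I- prefixes delimit chunks, the following Python sketch groups the tokens of the table back into chunks. The function name `iob_to_chunks` is hypothetical, and the handling of an "O" (outside) tag is an assumption taken from the usual IOB convention, since the table above contains only B- and I- tags.

```python
def iob_to_chunks(tokens, chunk_tags):
    """Group tokens into (chunk_type, [words]) pairs using the B-/I- prefixes."""
    chunks = []
    open_type = None  # type of the chunk currently being built, if any
    for token, tag in zip(tokens, chunk_tags):
        prefix, _, chunk_type = tag.partition("-")
        if prefix == "I" and chunk_type == open_type:
            chunks[-1][1].append(token)           # continue the open chunk
        elif prefix in ("B", "I"):
            chunks.append((chunk_type, [token]))  # start a new chunk
            open_type = chunk_type
        else:
            open_type = None                      # e.g. "O": outside any chunk
    return chunks


tokens = ["He", "ate", "an", "apple", "to", "satiate", "his", "hunger"]
chunk_tags = ["B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "I-VP", "B-NP", "I-NP"]

for chunk_type, words in iob_to_chunks(tokens, chunk_tags):
    print(chunk_type, " ".join(words))
```

A B- tag always opens a new chunk, so two adjacent chunks of the same type stay separate. This is why the scheme needs both prefixes.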
Objective : - The objective of this experiment is to understand the concept of chunking and get familiar
with the basic chunk tagset.
Procedure : -
STEP3: Select the corresponding chunk-tag for each word in the sentence and click the Submit button.
OUTPUT1: The submitted answer will be checked.
EXPERIMENT NO 15
THE OBJECTIVE OF THIS EXPERIMENT IS TO FIND POS TAGS OF
WORDS IN A SENTENCE USING VITERBI DECODING.
Theory - Viterbi decoding is based on dynamic programming. The algorithm takes the emission and
transition matrices as input. The emission matrix gives the probability of a word given a POS tag,
and the transition matrix gives the probability of a transition from one POS tag to another POS
tag. It observes a sequence of words and returns the most probable sequence of POS tags along with its probability.
Here "s" denotes words and "t" denotes tags. "a" is the transition matrix and "b" is the emission matrix.
Using the above algorithm, we have to fill the Viterbi table column by column.
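The column-by-column filling and the final backtracking can be sketched in Python. This is a hedged sketch, not the lab's implementation: the matrices "a" and "b" are estimated from the small tagged corpus of Experiment 12 using the standard HMM normalization (an assumption), words are lowercased, the first column starts from the eos tag, and no transition into a closing eos is applied.

```python
from collections import Counter

tagged_corpus = (
    "EOS/eos They/pronoun cut/verb the/determiner paper/noun "
    "EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun "
    "EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun "
    "EOS/eos"
)
pairs = [tuple(token.lower().rsplit("/", 1)) for token in tagged_corpus.split()]
tags = [tag for _, tag in pairs]
states = sorted(set(tags) - {"eos"})

word_tag = Counter(pairs)
tag_count = Counter(tags)
tag_bigram = Counter(zip(tags, tags[1:]))
outgoing = Counter(tags[:-1])


def a(prev_tag, tag):
    """Transition probability P(tag | prev_tag)."""
    return tag_bigram[(prev_tag, tag)] / outgoing[prev_tag]


def b(tag, word):
    """Emission probability P(word | tag)."""
    return word_tag[(word, tag)] / tag_count[tag]


def viterbi(words):
    """Return (best tag sequence, its probability) for a list of lowercase words."""
    # First column: start from the sentence boundary tag "eos".
    column = {t: a("eos", t) * b(t, words[0]) for t in states}
    backpointers = []
    for word in words[1:]:
        new_column, pointers = {}, {}
        for t in states:
            # Best previous tag for reaching tag t at this position.
            best_prev = max(states, key=lambda p: column[p] * a(p, t))
            new_column[t] = column[best_prev] * a(best_prev, t) * b(t, word)
            pointers[t] = best_prev
        backpointers.append(pointers)
        column = new_column
    # Backtrack from the most probable tag in the last column.
    last = max(states, key=column.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, column[last]


path, probability = viterbi("they cut the paper".split())
print(path, round(probability, 3))
```

Each pass of the outer loop fills one column of the Viterbi table, and the stored back-pointers are what the final backtracking step follows to recover the tag of every word.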
Objective :- The objective of this experiment is to find POS tags of words in a sentence using Viterbi
decoding.
Procedure : -
STEP2: Fill the column with the probability of possible POS tags given the word (i.e. form the Viterbi
matrix by filling a column for each observation). Answers submitted are rounded off to 3 digits after the decimal
and are then checked.
Wrong answers are indicated by a red background in a cell.
STEP4: Repeat steps 2 and 3 until all words of a sentence are covered.
STEP5: Finally, check the POS tag for each word obtained from backtracking.