Natural Language Processing
UNIT - I:
PART 1: Finding the Structure of Words (chapter 1, txtbk 1)
1. Words and Their Components
2. Issues and Challenges
3. Morphological Models
PART 2: Finding the Structure of Documents (chapter 2, txtbk 1)
1. Introduction
2. Methods
3. Complexity of the Approaches
4. Performances of the Approaches
UNIT - V:
PART 1: Discourse Processing (chapter 6, txtbk 2)
1. Cohesion
2. Reference Resolution
3. Discourse Cohesion and Structure
PART 2: Language Modelling (chapter 5, txtbk 1)
1. Introduction
2. N-Gram Models
3. Language Model Evaluation
4. Parameter Estimation
5. Language Model Adaptation
6. Types of Language Models
7. Language-Specific Modeling Problems
8. Multilingual and Cross-Lingual Language Modeling
TEXTBOOK 1:
Multilingual Natural Language Processing Applications: From Theory to Practice –
Daniel M. Bikel and Imed Zitouni, Pearson Publication
UNIT - I:
PART 1: Finding the Structure of Words (chapter 1, txtbk 1)
1. Words and Their Components
2. Issues and Challenges
3. Morphological Models
One common way to find the structure of words in NLP is to break them down into morphemes, their smallest meaningful components. By finding the structure of words in text, NLP systems can perform a wide range of tasks, such as machine translation, text classification, sentiment analysis, and information extraction. Morphemes fall into two broad classes:
1. Free Morphemes: These are standalone words that can convey meaning on
their own, such as "book," "dog," or "happy."
2. Bound Morphemes: These are units of meaning that cannot stand alone but
must be attached to a free morpheme to convey meaning. There are two
types of bound morphemes:
● Prefixes: These are morphemes that are attached to the beginning of a free
morpheme, such as "un-" in "unhappy" or "pre-" in "preview."
● Suffixes: These are morphemes that are attached to the end of a free
morpheme, such as "-ness" in "happiness" or "-ed" in "jumped."
For example, the word "unhappily" has three morphemes: "un-" (a prefix meaning
"not"), "happy" (a free morpheme meaning "feeling or showing pleasure or
contentment"), and "-ly" (a suffix that changes the word into an adverb). By analyzing
the morphemes in a word, NLP systems can better understand its meaning and how
it relates to other words in a sentence.
In addition to morphemes, words can also be analyzed by their part of speech, such
as noun, verb, adjective, or adverb. By identifying the part of speech of each word in a
sentence, NLP systems can better understand the relationships between words and
the structure of the sentence.
1.1 Tokens:
In natural language processing (NLP), a token refers to a sequence of characters that
represents a meaningful unit of text. This could be a word, punctuation mark,
number, or other entity that serves as a basic unit of analysis in NLP.
For example, in the sentence "The quick brown fox jumps over the lazy dog," the
tokens are "The," "quick," "brown," "fox," "jumps," "over," "the," "lazy," and "dog." Each of
these tokens represents a separate unit of meaning that can be analyzed and
processed by an NLP system.
Tokens are often used as the input for various NLP tasks, such as text classification,
sentiment analysis, and named entity recognition. In these tasks, the NLP system
analyzes the tokens to identify patterns and relationships between them, and uses
this information to make predictions or draw insights about the text.
In order to analyze and process text effectively, NLP systems must be able to identify
and distinguish between different types of tokens, and understand their relationships
to one another. This can involve tasks such as tokenization, where the text is divided
into individual tokens, and part-of-speech tagging, where each token is assigned a
grammatical category (such as noun, verb, or adjective). By accurately identifying
and processing tokens, NLP systems can better understand the meaning and
structure of a text.
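As a quick illustration (not from the textbook), here is a small Python sketch of tokenization and part-of-speech tagging using the NLTK library; it assumes the 'punkt' tokenizer and the perceptron tagger models have been downloaded with nltk.download().

# A minimal sketch of tokenization and part-of-speech tagging using NLTK.
import nltk

sentence = "The quick brown fox jumps over the lazy dog."

# Split the raw string into tokens (words and punctuation).
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

# Assign a part-of-speech tag to each token.
tagged = nltk.pos_tag(tokens)
print(tagged)
# [('The', 'DT'), ('quick', 'JJ'), ...]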
1.2 Lexemes:
In natural language processing (NLP), a lexeme is a unit of vocabulary that
represents a single concept, regardless of its inflected forms or grammatical
variations. It can be thought of as the abstract representation of a word, with all its
possible inflections and variations.
For example, the word "run" has many inflected forms, such as "ran," "running," and
"runs." These inflections are not considered separate lexemes because they all
represent the same concept of running or moving quickly on foot.
In contrast, words that have different meanings, even if they are spelled the same
way, are considered separate lexemes. For example, the word "bank" can refer to a
financial institution or the edge of a river. These different meanings are considered
separate lexemes because they represent different concepts.
● "Walk" and "walked" are inflected forms of the same lexeme, representing the
concept of walking.
● "Cat" and "cats" are inflected forms of the same lexeme, representing the
concept of a feline animal.
● "Bank" and "banking" are derived forms of the same lexeme, representing the
concept of finance and financial institutions.
Lexical analysis is used to identify and analyze the morphological and
syntactic features of a word, such as its part of speech, inflection, and derivation.
This information is important for tasks such as stemming, lemmatization, and
part-of-speech tagging, which involve reducing words to their base or root forms and
identifying their grammatical functions.
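For illustration, here is a small Python sketch contrasting stemming and lemmatization with NLTK; it assumes the WordNet data has been downloaded for the lemmatizer.

# A small sketch contrasting stemming and lemmatization with NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "ran", "cats", "banking"]:
    # Stemming chops off affixes heuristically (e.g. "running" -> "run", "banking" -> "bank").
    stem = stemmer.stem(word)
    # Lemmatization maps a word to its lexeme's base form, here using a verb POS hint.
    lemma = lemmatizer.lemmatize(word, pos="v")
    print(f"{word:10s} stem={stem:10s} lemma={lemma}")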
1.3 Morphemes:
In natural language processing (NLP), morphemes are the smallest units of meaning
in a language. A morpheme is a sequence of phonemes (the smallest units of sound
in a language) that carries meaning. Morphemes can be divided into two types: free
morphemes and bound morphemes.
Free morphemes are words that can stand alone and convey meaning. Examples of
free morphemes include "book," "cat," "happy," and "run."
Bound morphemes are units of meaning that cannot stand alone but must be
attached to a free morpheme to convey meaning. Bound morphemes can be further
divided into two types: prefixes and suffixes.
Here are some examples of words broken down into their morphemes:
● "unhappily" = "un-" (prefix meaning "not") + "happy" + "-ly" (suffix meaning "in a
manner of")
● "rearrangement" = "re-" (prefix meaning "again") + "arrange" + "-ment" (suffix
indicating the act of doing something)
● "cats" = "cat" (free morpheme) + "-s" (suffix indicating plural form)
By analyzing the morphemes in a word, NLP systems can better understand its
meaning and how it relates to other words in a sentence. This can be helpful for
tasks such as part-of-speech tagging, sentiment analysis, and language translation.
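For illustration, here is a toy rule-based morpheme splitter in Python. The prefix and suffix lists are assumptions for the demo; real systems use full morphological analyzers or statistical segmentation.

# A toy morpheme splitter: check a small list of known prefixes and suffixes.
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ment", "ness", "ly", "ed", "s"]

def split_morphemes(word: str) -> list[str]:
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p + "-")
            word = word[len(p):]
            break
    suffix_parts = []
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix_parts.insert(0, "-" + s)
            word = word[: -len(s)]
            break
    return parts + [word] + suffix_parts

print(split_morphemes("unhappily"))      # ['un-', 'happi', '-ly']  (stem is approximate)
print(split_morphemes("rearrangement"))  # ['re-', 'arrange', '-ment']
print(split_morphemes("cats"))           # ['cat', '-s']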
1.4 Typology:
In natural language processing (NLP), typology refers to the classification of
languages based on their structural and functional features. This can include
features such as word order, morphology, tense and aspect systems, and syntactic
structures.
There are many different approaches to typology in NLP, but a common one is the
distinction between analytic and synthetic languages. Analytic languages have a
relatively simple grammatical structure and tend to rely on word order and
prepositions to convey meaning. In contrast, synthetic languages have a more
complex grammatical structure and use inflections and conjugations to indicate
tense, number, and other grammatical features.
By understanding the typology of a language, NLP systems can better model its
grammatical and structural features, and improve their performance in tasks such as
language modelling, parsing, and machine translation.
2. Issues and Challenges:
2.1 Irregularity:
Irregularity is a challenge in natural language processing (NLP) because it refers to
words that do not follow regular patterns of formation or inflection. Many languages
have irregular words that are exceptions to the standard rules, making it difficult for
NLP systems to accurately identify and categorize these words.
For example, in English, irregular verbs such as "go," "do," and "have" do not follow the
regular pattern of adding "-ed" to the base form to form the past tense. Instead, they
have their unique past tense forms ("went," "did," "had") that must be memorized.
Similarly, in English, there are many irregular plural nouns, such as "child" and "foot,"
that do not follow the standard rule of adding "-s" to form the plural. Instead, these
words have their unique plural forms ("children," "feet") that must be memorized.
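For illustration, irregular forms are often handled with an exception lexicon that is consulted before the regular rule is applied; the small tables in this sketch are illustrative, not exhaustive.

# Sketch: look up an exception lexicon first, otherwise apply the regular rule.
IRREGULAR_PAST = {"go": "went", "do": "did", "have": "had"}
IRREGULAR_PLURAL = {"child": "children", "foot": "feet", "mouse": "mice"}

def past_tense(verb: str) -> str:
    # Irregular verbs are listed explicitly; regular verbs take "-ed".
    return IRREGULAR_PAST.get(verb, verb + "ed")

def plural(noun: str) -> str:
    # Irregular nouns are listed explicitly; regular nouns take "-s".
    return IRREGULAR_PLURAL.get(noun, noun + "s")

print(past_tense("walk"), past_tense("go"))   # walked went
print(plural("cat"), plural("child"))         # cats children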
2.2 Ambiguity:
Ambiguity is a challenge in natural language processing (NLP) because it refers to
situations where a word or phrase can have multiple possible meanings, making it
difficult for NLP systems to accurately identify the intended meaning. Ambiguity can
arise in various forms, such as homonyms, polysemous words, and syntactic
ambiguity.
Homonyms are words that have the same spelling and pronunciation but different
meanings. For example, the word "bank" can refer to a financial institution or the side
of a river. This can create ambiguity in NLP tasks, such as named entity recognition,
where the system needs to identify the correct entity based on the context.
Polysemous words are words that have multiple related meanings. For example, the
word "book" can refer to a physical object or the act of reserving something. In this
case, the intended meaning of the word can be difficult to identify without
considering the context in which the word is used.
Syntactic ambiguity occurs when a sentence can be parsed in multiple ways. For
example, the sentence "I saw her duck" can be interpreted as "I saw the bird she
owns" or "I saw her lower her head to avoid something." In this case, the meaning of
the sentence can only be determined by considering the context in which it is used.
Ambiguity can also occur due to cultural or linguistic differences. For example, the
phrase "kick the bucket" means "to die" in English, but its meaning may not be
apparent to non-native speakers or speakers of other languages.
2.3 Productivity:
Productivity is a challenge in natural language processing (NLP) because it refers to
the ability of a language to generate new words or forms based on existing patterns
or rules. This can create a vast number of possible word forms that may not be
present in dictionaries or training data, which makes it difficult for NLP systems to
accurately identify and categorize words.
For example, in English, new words can be created by combining existing words,
such as "smartphone," "cyberbully," or "workaholic." These words are formed by
combining two or more words to create a new word with a specific meaning.
Another example is the use of prefixes and suffixes to create new words. For
instance, in English, the prefix "un-" can be added to words to create their opposite
meaning, such as "happy" and "unhappy." The suffix "-er" can be added to a verb to
create a noun indicating the person who performs the action, such as "run" and
"runner."
Productivity can also occur in inflectional morphology, where different forms of a
word are created by adding inflectional affixes. For example, in English, the verb
"walk" can be inflected to "walked" to indicate the past tense. Similarly, the adjective
"big" can be inflected to "bigger" to indicate a comparative degree.
These examples demonstrate how productivity can create a vast number of possible
word forms, making it challenging for NLP systems to accurately identify and
categorize words. To address this challenge, NLP researchers have developed
various techniques, including morphological analysis algorithms that use statistical
models to predict the likely structure of a word based on its context. Additionally,
machine learning algorithms can be trained on large datasets to learn to recognize
and categorize new word forms.
3. Morphological Models:
In natural language processing (NLP), morphological models refer to computational
models that are designed to analyze the morphological structure of words in a
language. Morphology is the study of the internal structure and the forms of words,
including their inflectional and derivational patterns. Morphological models are used
in a wide range of NLP applications, including part-of-speech tagging, named entity
recognition, machine translation, and text-to-speech synthesis.
There are several types of morphological models used in NLP, including rule-based
models, statistical models, and neural models.
Neural models, such as recurrent neural networks (RNNs) and transformers, use
deep learning techniques to learn the morphological structure of words. These
models have achieved state-of-the-art results in many NLP tasks and are particularly
effective in languages with complex morphological systems, such as Arabic and
Turkish.
In addition to these models, there are also morphological analyzers, which are tools
that can automatically segment words into their constituent morphemes and provide
additional information about the inflectional and derivational properties of each
morpheme. Morphological analyzers are widely used in machine translation and
information retrieval applications, where they can improve the accuracy of these
systems by providing more precise linguistic information about the words in a text.
Finite-state morphology models the analysis and generation of word forms with finite-state transducers that map between surface strings and their morphological analyses.
One of the main advantages of finite-state morphology is that it is efficient and fast,
since it can handle large vocabularies and morphological paradigms using compact
and optimized finite-state transducers. It is also transparent and interpretable, since
the rules and transformations used by the transducers can be easily inspected and
understood by linguists and language experts.
Finite-state morphology has been used in various NLP applications, such as machine
translation, speech recognition, and information retrieval, and it has been shown to
be effective for many languages and domains. However, it may be less effective for
languages with irregular or non-productive morphological systems, or for languages
with complex syntactic or semantic structures that require more sophisticated
linguistic analysis.
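For illustration, here is a toy Python sketch of the finite-state idea: a word is analyzed by matching a stem and then a suffix, emitting morphological features along the way. Real systems compile such lexicons and rules into finite-state transducers with toolkits such as foma or HFST; the entries below are assumptions for the demo.

# A miniature illustration of finite-state morphological analysis.
STEMS = {"walk": "V", "cat": "N", "book": "N"}
SUFFIXES = {
    "V": {"ed": "+Past", "s": "+3sg", "": "+Base"},
    "N": {"s": "+Pl", "": "+Sg"},
}

def analyze(word: str) -> list[str]:
    analyses = []
    for stem, pos in STEMS.items():
        if word.startswith(stem):
            rest = word[len(stem):]
            feature = SUFFIXES[pos].get(rest)
            if feature is not None:
                analyses.append(f"{stem}+{pos}{feature}")
    return analyses

print(analyze("walked"))  # ['walk+V+Past']
print(analyze("cats"))    # ['cat+N+Pl']
print(analyze("book"))    # ['book+N+Sg']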
Functional morphology is an approach to morphological modeling in natural language processing
(NLP) that is based on the principles of functional and cognitive linguistics. It is a usage-based
approach that emphasizes the functional and communicative aspects of language, and seeks to
model the ways in which words are used and interpreted in context.
In functional morphology, words are modeled as units of meaning, or lexemes, which are
associated with a set of functions and communicative contexts. Each lexeme is composed of a
set of abstract features that describe its semantic, pragmatic, and discursive properties.
The functional morphology model seeks to capture the relationship between the form and
meaning of words, by analyzing the ways in which the morphological and syntactic structures of
words reflect their communicative and discourse functions. It emphasizes the role of context and
discourse in the interpretation of words, and seeks to explain the ways in which words are used
and modified in response to the communicative needs of the speaker and the listener.
Functional morphology is particularly effective for modeling the ways in which words are
inflected, derived, or modified in response to the communicative and discourse context, such as
in the case of argument structure alternations or pragmatic marking. It can handle the
complexity and variability of natural language, by focusing on the functional and communicative
properties of words, and by using a set of flexible and adaptive rules and constraints.
One of the main advantages of functional morphology is that it is usage-based and
corpus-driven, since it is based on the analysis of natural language data and usage patterns. It is
also compatible with other models of language and cognition, such as construction grammar
and cognitive linguistics, and can be integrated with other NLP techniques, such as discourse
analysis.
Functional morphology has been used in various NLP applications, such as text classification,
sentiment analysis, and language generation, and it has been shown to be effective for many
languages and domains. However, it may require large amounts of annotated data and
computational resources, in order to model the complex and variable patterns of natural
language.
Morphology induction is an unsupervised approach that learns the morphological structure of a
language directly from raw text, typically by discovering recurring stems and affixes from
statistical patterns. Morphology induction has been used in various NLP applications, such as
machine translation, information retrieval, and language modeling, and it has been shown to
be effective for many languages and domains. However, it may produce less
accurate and interpretable results than other morphological models, since it relies on
statistical patterns and does not capture the full range of morphological and
syntactic structures in the language.
PART 2: Finding the Structure of Documents (chapter 2, txtbk 1)
1. Introduction:
Finding the structure of documents means dividing raw text into meaningful units such as
sentences, topics, and sections, so that downstream NLP components can work on well-defined
pieces of text. There are several approaches to finding the structure of documents in NLP, and
two core sub-tasks are sentence boundary detection and topic boundary detection.
Some of the specific techniques and tools used in sentence boundary detection
include:
1. Regular expressions: These are patterns that can be used to match specific
character sequences in a text, such as periods followed by whitespace
characters, and can be used to identify the end of a sentence (a small sketch follows this list).
2. Hidden Markov Models: These are statistical models that can be used to
identify the most likely sequence of sentence boundaries in a text, based on
the probabilities of different sentence boundary markers.
3. Deep learning models: These are neural network models that can learn
complex patterns and features of sentence boundaries from a large corpus of
text, and can be used to achieve state-of-the-art performance in sentence
boundary detection.
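For illustration, here is a minimal regular-expression sentence splitter of the kind described in point 1 above; the pattern is deliberately simple and will mis-handle abbreviations such as "Dr." or "a.m.".

# A minimal regular-expression sentence splitter.
import re

text = "Dr. Smith went to Washington. He arrived at 10 a.m.! Was he late?"

# Split after sentence-final punctuation (., ! or ?) followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)
for s in sentences:
    print(s)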
Topic boundary detection relies on similar techniques, such as lexical cohesion measures, language models, and classifiers trained on topic-annotated text.
2. Methods:
There are several methods and techniques used in NLP to find the structure of
documents, which include:
1. Sentence boundary detection: This involves identifying the boundaries
between sentences in a document, which is important for tasks like parsing,
machine translation, and text-to-speech synthesis.
2. Part-of-speech tagging: This involves assigning a part of speech (noun, verb,
adjective, etc.) to each word in a sentence, which is useful for tasks like
parsing, information extraction, and sentiment analysis.
3. Named entity recognition: This involves identifying and classifying named
entities (such as people, organizations, and locations) in a document, which is
important for tasks like information extraction and text categorization.
4. Coreference resolution: This involves identifying all the expressions in a text
that refer to the same entity, which is important for tasks like information
extraction and machine translation.
5. Topic boundary detection: This involves identifying the points in a document
where the topic or theme of the text shifts, which is useful for organizing and
summarizing large amounts of text.
6. Parsing: This involves analyzing the grammatical structure of sentences in a
document, which is important for tasks like machine translation,
text-to-speech synthesis, and information extraction.
7. Sentiment analysis: This involves identifying the sentiment (positive, negative,
or neutral) expressed in a document, which is useful for tasks like brand
monitoring, customer feedback analysis, and market research.
There are several tools and techniques used in NLP to perform these tasks, including
machine learning algorithms, rule-based systems, and statistical models. These
tools can be used in combination to build more complex NLP systems that can
accurately analyze and understand the structure and content of large amounts of
text.
Generative sequence classification methods are one family of NLP methods used to find
the structure of documents. These methods use probabilistic models to
classify sequences of words into predefined categories or labels. Hidden Markov Models
(HMMs) are the classic generative sequence model, while Conditional Random Fields (CRFs)
are a closely related discriminative alternative.
Both HMMs and CRFs can be used for tasks like part-of-speech tagging, named
entity recognition, and chunking, which involve classifying sequences of words into
predefined categories or labels. These methods have been shown to be effective in a
variety of NLP applications and are widely used in industry and academia.
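For illustration, here is a toy generative HMM tagger decoded with the Viterbi algorithm; the probabilities are hand-picked for the example, whereas a real system would estimate them from a labeled corpus.

# A toy hidden Markov model tagger decoded with the Viterbi algorithm.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.8, "a": 0.2},
    "NOUN": {"cat": 0.5, "dog": 0.5},
    "VERB": {"saw": 1.0},
}

def viterbi(words):
    # best[i][t] = (probability of the best path ending in tag t at word i, backpointer)
    best = [{t: (START[t] * EMIT[t].get(words[0], 1e-6), None) for t in TAGS}]
    for i in range(1, len(words)):
        row = {}
        for t in TAGS:
            prob, prev = max(
                (best[i - 1][p][0] * TRANS[p][t] * EMIT[t].get(words[i], 1e-6), p)
                for p in TAGS
            )
            row[t] = (prob, prev)
        best.append(row)
    # Trace back the highest-probability tag sequence.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["the", "cat", "saw", "the", "dog"]))
# ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']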
Overall, discriminative local classification methods are useful for tasks where it is
necessary to classify each individual word or token in a document based on its
features and context. These methods are often used in conjunction with other NLP
techniques, such as sentence boundary detection and parsing, to build more
complex NLP systems for document analysis and understanding.
Overall, discriminative sequence classification methods are useful for tasks where it
is necessary to predict the label or category for a sequence of words in a document,
based on the features of the sequence and the context in which it appears. These
methods have been shown to be effective in a variety of NLP applications and are
widely used in industry and academia.
One example of a hybrid approach is the use of Conditional Random Fields (CRFs)
and Support Vector Machines (SVMs) for named entity recognition. CRFs are used to
model the dependencies between neighboring labels in the sequence, while SVMs
are used to model the relationship between the input features and the labels.
Overall, hybrid approaches are useful for tasks where a single method may not be
sufficient to achieve high accuracy. By combining multiple methods, hybrid
approaches can take advantage of the strengths of each method and achieve better
performance than any one method alone.
One example of an extension for global modeling for sentence segmentation is the
use of Hidden Markov Models (HMMs). HMMs are statistical models that can be
used to identify patterns in a sequence of observations. In the case of sentence
segmentation, the observations are the words in the document, and the model tries
to identify patterns that correspond to the beginning and end of sentences. HMMs
can take into account context beyond just the current sentence, which can improve
accuracy in cases where sentence boundaries are not clearly marked.
Additionally, there are also neural network-based approaches, such as the use of
convolutional neural networks (CNNs) or recurrent neural networks (RNNs) for
sentence boundary detection. These models can learn to recognize patterns in the
text by analyzing larger contexts, and can be trained on large corpora of text to
improve their accuracy.
Overall, extensions for global modeling for sentence segmentation can be more
effective than local models when dealing with more complex or ambiguous text, and
can lead to more accurate results in certain situations.
3. Complexity of the Approaches:
Overall, the complexity of these approaches depends on the level of accuracy and
precision desired, the size and complexity of the documents being analyzed, and the
amount of labeled data available for training. In general, more complex approaches
tend to be more accurate but also require more resources and expertise to
implement.
4. Performances of the Approaches:
In general, the performance of these approaches will depend on factors such as the
quality and quantity of the training data, the complexity and variability of the
document structure, and the specific metrics used to evaluate performance (e.g.
accuracy, precision, recall, F1-score). It's also worth noting that different approaches
may be better suited for different sub-tasks within document structure analysis, such
as identifying headings, lists, tables, or section breaks.
Syntax Analysis:
Syntax analysis, also known as parsing, refers to the process of analyzing the grammatical
structure of a sentence, based on the rules of the language's syntax, in order to determine
its constituent parts, such as phrases and clauses, their relationships to each other, and
their functions within the sentence. This involves breaking the sentence down into its
individual components, such as nouns, verbs, adjectives, and phrases, and then analyzing
how these components are related to each other.
There are two main approaches to syntax analysis in NLP: rule-based parsing and
statistical parsing. Rule-based parsing involves the use of a set of pre-defined rules
that dictate how the different parts of speech and phrases in a sentence should be
structured and related to each other. Statistical parsing, on the other hand, uses
machine learning algorithms to learn patterns and relationships in large corpora of
text in order to generate parse trees for new sentences.
Consider the sentence "The cat sat on the mat."
Step 1: Tokenization
The first step is to break the sentence down into its individual words, or tokens:
["The", "cat", "sat", "on", "the", "mat", "."]
Step 2: Part-of-Speech Tagging
Next, each token is assigned a part-of-speech tag, which indicates its grammatical
function in the sentence:
"The" (determiner), "cat" (noun), "sat" (verb), "on" (preposition), "the" (determiner),
"mat" (noun), "." (punctuation)
Step 3: Dependency Parsing
Finally, the relationships between the words in the sentence are analyzed using a
dependency parser to create a parse tree. In this example, the parse tree might look
something like this:
        sat
       /   \
     cat    on
      |      |
     The    mat
             |
            the
This parse tree shows that "cat" is the subject of the verb "sat," and "mat" is the
object of the preposition "on." By analyzing the grammatical structure of a sentence,
NLP models can more accurately interpret its meaning.
Here's an example of a parse tree for the sentence "The cat sat on the mat":
            sat(V)
           /      \
      cat(N)      on(PREP)
        |             |
      The(D)       mat(N)
                      |
                   the(D)
This parse tree is drawn around the heads of the sentence: the verb "sat" is the root,
with the noun "cat" as its subject and the preposition "on" as its modifier; the noun
"mat" is the object of the preposition "on", and the determiners "The" and "the" attach
to the nouns "cat" and "mat" respectively.
Treebanks, which are corpora of sentences manually annotated with parse trees, can be
used to train statistical parsers, which can then automatically
analyze new sentences and generate their own parse trees. These parsers work by
identifying patterns in the treebank data and using these patterns to make
predictions about the structure of new sentences. For example, a statistical parser
might learn that a noun phrase is usually followed by a verb phrase and use this
pattern to generate a parse tree for a new sentence.
Constituency-Based Representations:
Here is a constituency-based (bracketed) representation of the sentence "The cat sat on the mat":
(S
(NP (DT The) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat)))))
This representation shows that the sentence is composed of a noun phrase ("The
cat") and a verb phrase ("sat on the mat"), with the verb phrase consisting of a verb
("sat") and a prepositional phrase ("on the mat"), and the prepositional phrase
consisting of a preposition ("on") and a noun phrase ("the mat").
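For illustration, this bracketed representation can be loaded and inspected with NLTK's Tree class (a sketch, assuming NLTK is installed):

# Load and display the bracketed constituency representation with NLTK.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
parse.pretty_print()          # draws the tree as ASCII art
print(parse.label())          # S
print([" ".join(st.leaves()) for st in parse.subtrees(lambda t: t.label() == "NP")])
# ['The cat', 'the mat']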
Dependency-Based Representations:
Here is a dependency-based representation of the same sentence, with each dependent
attached to its head:
            sat-V
           /     \
      cat-N      on-PREP
                    |
                 mat-N
This representation shows that the subject "cat" depends on the verb "sat," and the
object "mat" depends on the preposition "on."
Here's an example of a dependency graph for the sentence "The cat sat on the mat":
sat ──nsubj──► cat ──det──► The
sat ──prep──► on ──pobj──► mat ──det──► the
In this graph, the word "cat" depends on the word "sat" with a subject relationship,
and the word "mat" depends on the word "on" with a prepositional relationship.
Dependency graphs are useful for a variety of NLP tasks, including named entity
recognition, relation extraction, and sentiment analysis. They can also be used for
parsing and syntactic analysis, as they provide a compact and expressive way to
represent the structure of a sentence.
One advantage of dependency graphs is that they are simpler and more efficient than
phrase structure trees, which can be computationally expensive to build and
manipulate. Dependency graphs also provide a more flexible representation of
syntactic structure, as they can easily capture non-projective dependencies and other
complex relationships between words.
Here's another example of a dependency graph for the sentence "I saw the man with
the telescope":
saw ──nsubj──► I
saw ──dobj──► man ──det──► the
man ──prep──► with ──pobj──► telescope ──det──► the
This graph shows that the subject "I" depends on the verb "saw," and that the noun
phrase "the man" depends on the verb "saw" with an object relationship. The
prepositional phrase "with the telescope" modifies the noun phrase "the man," with
the word "telescope" being the object of the preposition "with."
In summary, dependency graphs provide a flexible and efficient way to represent the
syntactic structure of a sentence in NLP. They can be used for a variety of tasks and
are a key component of many state-of-the-art NLP models.
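For illustration, here is a small sketch of dependency parsing with the spaCy library; it assumes the en_core_web_sm model has been installed separately (python -m spacy download en_core_web_sm).

# Print each token's dependency label and head using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    # Each token points to its syntactic head and carries a dependency label.
    print(f"{token.text:6s} --{token.dep_:6s}--> {token.head.text}")
# e.g. "cat --nsubj--> sat", "mat --pobj--> on"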
Syntax analysis, also known as parsing, is the process of analyzing the grammatical
structure of a sentence to identify its constituent parts and the relationships
between them. In natural language processing (NLP), phrase structure trees are
often used to represent the syntactic structure of a sentence.
A phrase structure tree, also known as a parse tree or a syntax tree, is a graphical
representation of the syntactic structure of a sentence. It consists of a hierarchical
structure of nodes, where each node represents a phrase or a constituent of the
sentence.
Here's an example of a phrase structure tree for the sentence "The cat sat on the
mat":
              S
         _____|_____
        |           |
        NP          VP
     ___|___     ___|______
    |       |   |          |
   Det      N   V          PP
    |       |   |       ___|___
   The     cat sat     |       |
                       P       NP
                       |    ___|___
                      on   |       |
                          Det      N
                           |       |
                          the     mat
In this tree, the top-level node represents the entire sentence (S), which is divided
into two subparts: the noun phrase (NP) "The cat" and the verb phrase (VP) "sat on
the mat". The NP is further divided into a determiner (Det) "The" and a noun (N) "cat".
The VP is composed of a verb (V) "sat" and a prepositional phrase (PP) "on the mat",
which itself consists of a preposition (P) "on" and another noun phrase (NP) "the
mat".
Here's another example of a phrase structure tree for the sentence "John saw the
man with the telescope":
               S
           ____|____
          |         |
          NP        VP
          |      ___|___________
          N     |               |
          |     V               NP
        John    |      _________|_________
               saw    |         |         |
                     Det        N         PP
                      |         |      ___|______
                     the       man    |          |
                                      P          NP
                                      |       ___|______
                                    with     |          |
                                            Det         N
                                             |          |
                                            the     telescope
In this tree, the top-level node represents the entire sentence (S), which is divided
into a noun phrase (NP) "John" and a verb phrase (VP) "saw the man with the
telescope". The NP is simply a single noun (N) "John". The VP is composed of a verb
(V) "saw" and a noun phrase (NP) "the man with the telescope". The latter is composed
of a determiner (Det) "the" and a noun (N) "man", which is modified by a prepositional
phrase (PP) "with the telescope", consisting of a preposition (P) "with" and a noun
phrase (NP) "the telescope".
Phrase structure trees can be used in NLP for a variety of tasks, such as machine
translation, text-to-speech synthesis, and natural language understanding. By
identifying the syntactic structure of a sentence, computers can more accurately
understand its meaning and generate appropriate responses.
4. Parsing Algorithms:
There are several algorithms used in natural language processing (NLP) for syntax
analysis or parsing, each with its own strengths and weaknesses. Here are three
common parsing algorithms and their examples:
1. Recursive descent parsing: This is a top-down parsing algorithm that expands
grammar rules recursively, starting from the start symbol. Consider the following
grammar for arithmetic expressions:
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | num
Suppose we want to parse the expression "3 + 4 * (5 - 2)" using recursive descent
parsing. The algorithm would start with the top-level symbol E and apply the first
production rule E -> E + T. It would then recursively apply the production rules for E, T,
and F until it reaches the terminals "3", "+", "4", "*", "(", "5", "-", "2", and ")". The resulting
parse tree would look like this:
E
├─ E
│   └─ T
│       └─ F
│           └─ num (3)
├─ +
└─ T
    ├─ T
    │   └─ F
    │       └─ num (4)
    ├─ *
    └─ F
        ├─ (
        ├─ E
        │   ├─ E
        │   │   └─ T
        │   │       └─ F
        │   │           └─ num (5)
        │   ├─ -
        │   └─ T
        │       └─ F
        │           └─ num (2)
        └─ )
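For illustration, here is a compact recursive-descent parser for this grammar in Python. Because recursive descent cannot call left-recursive rules such as E -> E + T directly, they are handled with loops, as is standard; the tuple-based tree format is just for display.

# A compact recursive-descent parser for the expression grammar above.
import re

def tokenize(expr):
    return re.findall(r"\d+|[()+\-*/]", expr)

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok=None):
        nonlocal pos
        t = tokens[pos]
        if tok is not None and t != tok:
            raise SyntaxError(f"expected {tok}, got {t}")
        pos += 1
        return t

    def parse_E():                      # E -> T (('+'|'-') T)*
        node = parse_T()
        while peek() in ("+", "-"):
            node = (eat(), node, parse_T())
        return node

    def parse_T():                      # T -> F (('*'|'/') F)*
        node = parse_F()
        while peek() in ("*", "/"):
            node = (eat(), node, parse_F())
        return node

    def parse_F():                      # F -> '(' E ')' | num
        if peek() == "(":
            eat("(")
            node = parse_E()
            eat(")")
            return node
        return ("num", eat())

    tree = parse_E()
    if pos != len(tokens):
        raise SyntaxError("unconsumed input")
    return tree

print(parse(tokenize("3 + 4 * (5 - 2)")))
# ('+', ('num', '3'), ('*', ('num', '4'), ('-', ('num', '5'), ('num', '2'))))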
2. Shift-reduce parsing: This is a bottom-up parsing algorithm that shifts input tokens
onto a stack and reduces them to non-terminals whenever the top of the stack matches
the right-hand side of a grammar rule. Consider the following grammar:
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> the | a
N -> man | ball | woman
V -> saw | liked
P -> with | in
Suppose we want to parse the sentence "the man saw a woman with a ball" using
shift-reduce parsing. The algorithm would start with an empty stack and shift the
tokens "the", "man", "saw", "a", "woman", "with", "a", and "ball" onto the stack. It would
then reduce the symbols "Det N" to NP, "NP PP" to NP, "V NP" to VP, and "NP PP" to
PP. The resulting parse tree would look like this:
S
├─ NP
│   ├─ Det: the
│   └─ N: man
└─ VP
    ├─ V: saw
    └─ NP
        ├─ NP
        │   ├─ Det: a
        │   └─ N: woman
        └─ PP
            ├─ P: with
            └─ NP
                ├─ Det: a
                └─ N: ball
3. Earley parsing: This is a chart parsing algorithm that uses dynamic programming to store
partial parses in a chart, which can be combined to form complete parses.
Here is another example of how shift-reduce parsing can be used to parse the sentence
"the cat chased the mouse" using a simple grammar:
S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased
S
├─ NP
│   ├─ Det: the
│   └─ N: cat
└─ VP
    ├─ V: chased
    └─ NP
        ├─ Det: the
        └─ N: mouse
Note that this example uses a simple grammar and a straightforward parsing
process, but more complex grammars and sentences may require additional steps or
different strategies to achieve a successful parse.
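For illustration, here is a sketch of the same parse using NLTK's ShiftReduceParser, with the lexical rules written out as quoted terminals (it assumes NLTK is installed):

# Shift-reduce parsing of "the cat chased the mouse" with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'cat' | 'mouse'
    V -> 'chased'
""")

parser = nltk.ShiftReduceParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    tree.pretty_print()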
Here is an example of how chart parsing can be used to parse the sentence "the cat
chased the mouse" using a simple grammar:
S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased
Chart parsing can be more efficient than other parsing algorithms, such as recursive
descent or shift-reduce parsing, because it stores all possible partial parses in the
chart and avoids redundant parsing of the same span multiple times. Hypergraphs
can also be used in chart parsing to represent more complex structures and enable
more efficient parsing algorithms.
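For illustration, here is a sketch of chart parsing the same sentence with NLTK's ChartParser; the chart object exposes the edges (partial parses) stored during parsing (it assumes NLTK is installed):

# Chart parsing of "the cat chased the mouse" with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'cat' | 'mouse'
    V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
chart = parser.chart_parse("the cat chased the mouse".split())
print(chart.num_edges(), "edges in the chart")   # partial parses stored over each span
for tree in chart.parses(grammar.start()):
    print(tree)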
Minimum spanning tree (MST) algorithms are often used for dependency parsing, as
they provide an efficient way to find the most likely parse for a sentence given a set
of syntactic dependencies.
Here's an example of how an MST algorithm can be used for dependency parsing:
Consider the sentence "The cat chased the mouse". We can represent this sentence
as a graph with nodes for each word and edges representing the syntactic
dependencies between them.
We can use an MST algorithm to find the most likely parse for this graph. One popular
algorithm for this is the Chu-Liu/Edmonds algorithm:
1. We first remove all self-loops and multiple edges in the graph. This is because
a valid dependency tree must be acyclic and have only one edge between any
two nodes.
2. We then choose a node to be the root of the tree. In this example, we can
choose "chased" to be the root since it is the main verb of the sentence.
3. We then compute the scores for each edge in the graph based on a scoring
function that takes into account the probability of each edge being a valid
dependency. The score function can be based on various linguistic features,
such as part-of-speech tags or word embeddings.
4. We use the MST algorithm to find the tree that maximizes the total score of its
edges. The MST algorithm starts with a set of edges that connect the root
node to each of its immediate dependents, and iteratively adds edges that
connect other nodes to the tree. At each iteration, we select the edge with the
highest score that does not create a cycle in the tree.
5. Once the MST algorithm has constructed the tree, we can assign a label to
each edge in the tree based on the type of dependency it represents (e.g.,
subject, object, etc.).
The resulting dependency tree for the example sentence is shown below:
chased ──nsubj──► cat ──det──► The
chased ──dobj──► mouse ──det──► the
In this tree, each node represents a word in the sentence, and each edge represents a syntactic
dependency between a head and its dependent.
Dependency parsing can be useful for many NLP tasks, such as information extraction, machine
translation, and question answering.
One advantage of dependency parsing is that it captures more fine-grained syntactic information
than phrase-structure parsing, as it represents the relationships between individual words rather
than just the hierarchical structure of phrases. However, dependency parsing can be more
sensitive to attachment errors, since a single incorrect edge changes the analysis of the whole
sentence.
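For illustration, here is a small sketch of the MST step using networkx's implementation of the Chu-Liu/Edmonds algorithm; the edge scores below are invented for the example, whereas a real parser would derive them from a trained scoring model.

# MST-based dependency parsing sketch using networkx (Chu-Liu/Edmonds).
import networkx as nx

words = ["ROOT", "The", "cat", "chased", "the", "mouse"]
# (head, dependent, score): higher scores mean a more plausible dependency.
scored_edges = [
    ("ROOT", "chased", 10.0),
    ("chased", "cat", 9.0),
    ("chased", "mouse", 8.0),
    ("cat", "The", 7.0),
    ("mouse", "the", 7.0),
    ("ROOT", "cat", 2.0),      # some competing, lower-scored edges
    ("mouse", "chased", 1.0),
    ("cat", "mouse", 1.5),
]

G = nx.DiGraph()
G.add_nodes_from(words)
for head, dep, score in scored_edges:
    G.add_edge(head, dep, weight=score)

# Find the maximum spanning arborescence, i.e. the highest-scoring dependency tree.
tree = nx.maximum_spanning_arborescence(G, attr="weight")
for head, dep in sorted(tree.edges()):
    print(f"{head} -> {dep}")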
5. Models for Ambiguity Resolution in Parsing:
There are many models for ambiguity resolution in parsing, each with its own
strengths and weaknesses. The choice of model depends on the specific application
and the available resources, such as training data and computational power.
5.1 Probabilistic Context-Free Grammars (PCFGs):
PCFGs can be used to compute the probability of a parse tree for a given sentence,
which can then be used to select the most likely parse. The probability of a parse
tree is computed by multiplying the probabilities of its constituent production rules,
from the root symbol down to the leaves. The probability of a sentence is computed
by summing the probabilities of all parse trees that generate the sentence.
Here is an example of a PCFG for the sentence "the cat saw the dog":
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
In this PCFG, each production rule is annotated with a probability. For example, the
rule NP -> Det N [0.6] has a probability of 0.6, indicating that a noun phrase can be
generated by first generating a determiner, followed by a noun, with a probability of
0.6.
To parse the sentence "the cat saw the dog" using this PCFG, we can use the CKY
algorithm to generate all possible parse trees and compute their probabilities. The
algorithm starts by filling in the table of all possible subtrees for each span of the
sentence, and then combines these subtrees using the production rules of the PCFG.
The final cell in the table represents the probability of the best parse tree for the
entire sentence.
Using the probabilities from the PCFG, the CKY algorithm generates the following
parse tree for the sentence "the cat saw the dog":
S
├─ NP
│   ├─ Det: the
│   └─ N: cat
└─ VP
    ├─ V: saw
    └─ NP
        ├─ Det: the
        └─ N: dog
Multiplying the probabilities of the production rules used (1.0 × 0.6 × 0.9 × 0.5 × 0.8 ×
1.0 × 0.6 × 0.9 × 0.5), the probability of the best parse tree for the sentence "the cat
saw the dog" is 0.05832. This probability can be used to select the most likely parse
among all possible parse trees for the sentence.
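For illustration, the same grammar and sentence can be run through NLTK, whose ViterbiParser performs the dynamic-programming search for the most probable tree (a sketch, assuming NLTK is installed):

# Most-probable parse with a PCFG using NLTK's ViterbiParser.
import nltk
from nltk.parse import ViterbiParser

pcfg = nltk.PCFG.fromstring("""
    S  -> NP VP   [1.0]
    NP -> Det N   [0.6]
    NP -> N       [0.4]
    VP -> V NP    [0.8]
    VP -> V       [0.2]
    Det -> 'the'  [0.9]
    Det -> 'a'    [0.1]
    N  -> 'cat'   [0.5]
    N  -> 'dog'   [0.5]
    V  -> 'saw'   [1.0]
""")

parser = ViterbiParser(pcfg)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)
    print("probability =", tree.prob())   # 0.05832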
5.2 Generative Models for Parsing:
Generative models for parsing are a family of models that generate a sentence's
parse tree by generating each node in the tree according to a set of probabilistic
rules. One such model is the probabilistic Earley parser.
The Earley parser uses a chart data structure to store all possible parse trees for a
sentence. The parser starts with an empty chart, and then adds new parse trees to
the chart as it progresses through the sentence. The parser consists of three main
stages: prediction, scanning, and completion.
In the prediction stage, the parser generates new items in the chart by applying
grammar rules that can generate non-terminal symbols. For example, if the grammar
has a rule S -> NP VP, the parser would predict the presence of an S symbol in the
current span of the sentence by adding a new item to the chart that indicates that an
S symbol can be generated by an NP symbol followed by a VP symbol.
In the scanning stage, the parser checks whether a word in the sentence can be
assigned to a non-terminal symbol in the chart. For example, if the parser has
predicted an NP symbol in the current span of the sentence, and the word "dog"
appears in that span, the parser would add a new item to the chart that indicates that
the NP symbol can be generated by the word "dog".
In the completion stage, the parser combines items in the chart that have the same
end position and can be combined according to the grammar rules. For example, if
the parser has added an item to the chart that indicates that an NP symbol can be
generated by the word "dog", and another item that indicates that a VP symbol can
be generated by the word "saw" and an NP symbol, the parser would add a new item
to the chart that indicates that an S symbol can be generated by an NP symbol
followed by a VP symbol.
Here is an example of a probabilistic Earley parser applied to the sentence "the cat
saw the dog":
Grammar:
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> "the" [0.9]
Det -> "a" [0.1]
N -> "cat" [0.5]
N -> "dog" [0.5]
V -> "saw" [1.0]
Initial chart:
0: [S -> * NP VP [1.0], 0, 0]
0: [NP -> * Det N [0.6], 0, 0]
0: [NP -> * N [0.4], 0, 0]
0: [VP -> * V NP [0.8], 0, 0]
0: [VP -> * V [0.2], 0, 0]
0: [Det -> * "the" [0.9], 0, 0]
0: [Det -> * "a" [0.1], 0, 0]
0: [N -> * "cat" [0.5], 0, 0]
0: [N -> * "dog" [0.5], 0, 0]
0: [V -> * "saw" [1.0], 0, 0]
Predicting S:
0: [S -> * NP VP [1.0], 0, 0]
1: [NP -> * Det N [0.6], 0, 0]
1: [NP -> * N [0.4], 0, 0]
1: [VP -> * V NP [0.8], 0, 0]
...
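For illustration, a (non-probabilistic) version of this grammar can be parsed with NLTK's Earley chart parser, which prints the predictor, scanner, and completer steps when tracing is enabled (a sketch, assuming NLTK is installed):

# Earley chart parsing with NLTK, tracing the chart as it is built.
import nltk
from nltk.parse import EarleyChartParser

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | N
    VP -> V NP | V
    Det -> 'the' | 'a'
    N -> 'cat' | 'dog'
    V -> 'saw'
""")

# trace=1 prints the chart edges as they are predicted, scanned, and completed.
parser = EarleyChartParser(grammar, trace=1)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)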
The maximum entropy Markov model (MEMM) is a discriminative model that models
the conditional probability of a parse tree given a sentence. The model is trained on a
corpus of labeled sentences and their corresponding parse trees. During training, the
model learns a set of feature functions that map the current state of the parser (i.e.,
the current span of the sentence and the partial parse tree constructed so far) to a
set of binary features that are indicative of a particular parse tree. The model then
learns the weight of each feature function using maximum likelihood estimation.
During testing, the MEMM uses the learned feature functions and weights to score
each possible parse tree for the input sentence. The model then selects the parse
tree with the highest score as the final parse tree for the sentence.
Here is an example of a MEMM applied to the sentence "the cat saw the dog":
Features:
F1: current word is "the"
F2: current word is "cat"
F3: current word is "saw"
F4: current word is "dog"
F5: current span is "the cat"
F6: current span is "cat saw"
F7: current span is "saw the"
F8: current span is "the dog"
F9: partial parse tree is "S -> NP VP"
Weights:
F1: 1.2
F2: 0.5
F3: 0.9
F4: 1.1
F5: 0.8
F6: 0.6
F7: 0.7
F8: 0.9
F9: 1.5
Candidate parse tree 1:
S -> NP VP
- NP -> N
- - N -> "cat"
- VP -> V NP
- - V -> "saw"
- - NP -> Det N
- - - Det -> "the"
- - - N -> "dog"
Score: 4.9
Candidate parse tree 2:
S -> NP VP
- NP -> Det N
- - Det -> "the"
- - N -> "cat"
- VP -> V
- - V -> "saw"
- NP -> Det N
- - Det -> "the"
- - N -> "dog"
Score: 3.5
In this example, the MEMM generates a score for each possible parse tree and
selects the parse tree with the highest score as the final parse tree for the sentence.
The selected parse tree corresponds to the correct parse for the sentence.
6. Multilingual Issues:
6.1 Tokenization, Case, and Encoding:
The definition of what constitutes a token can vary depending on the
language being analyzed, because different languages have different rules for
how words are constructed, how they are written, and how they are used in context.
For example, in English, words are typically separated by spaces, making it relatively
easy to tokenize a sentence into individual words. However, in some languages, such
as Chinese or Japanese, there are no spaces between words, and the text must be
segmented into individual units of meaning based on other cues, such as syntax or
context.
Furthermore, even within a single language, there can be variation in how words are
spelled or written. For example, in English, words can be spelled with or without
hyphens or apostrophes, and there can be differences in spelling between American
English and British English.
1. ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."].
Case:
2. Case refers to the use of upper and lower case letters in text. In NLP, it is often
important to standardize the case of words to avoid treating the same word
as different simply because it appears in different case. For example, the
words "apple" and "Apple" should be treated as the same word.
Encoding:
3. Encoding refers to the process of representing text data in a way that can be
processed by machine learning algorithms. One common encoding method
used in NLP is Unicode, which is a character encoding standard that can
represent a wide range of characters from different languages.
Text: "The quick brown fox jumps over the lazy dog."
Tokenization: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
Case: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
Encoding: [0x74, 0x68, 0x65, 0x20, 0x71, 0x75, 0x69, 0x63, 0x6b, 0x20, 0x62, 0x72,
0x6f, 0x77, 0x6e, 0x20, 0x66, 0x6f, 0x78, 0x20, 0x6a, 0x75, 0x6d, 0x70, 0x73, 0x20,
0x6f, 0x76, 0x65, 0x72, 0x20, 0x74, 0x68, 0x65, 0x20, 0x6c, 0x61, 0x7a, 0x79, 0x20,
0x64, 0x6f, 0x67, 0x2e]
Note that the encoding is represented in hexadecimal to show the underlying bytes
that represent the text.
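As a small illustration, the pipeline above can be reproduced in a few lines of Python; the whitespace tokenizer here is a deliberate simplification.

# Tokenize, case-fold, and encode the example sentence.
text = "The quick brown fox jumps over the lazy dog."

tokens = text.rstrip(".").split() + ["."]          # naive whitespace tokenization
lowered = [t.lower() for t in tokens]              # case normalization
encoded = text.lower().encode("utf-8")             # bytes under the UTF-8 encoding

print(tokens)
print(lowered)
print(" ".join(f"0x{b:02x}" for b in encoded[:12]))  # 0x74 0x68 0x65 0x20 ...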
6.2 Word Segmentation:
Languages written without spaces must first be segmented into words, and the
segmentation itself can be ambiguous; a short segmentation sketch follows the list of
language-specific challenges below. In Chinese, for example, a sentence like "我喜欢中文" (which means "I like Chinese")
could be segmented in different ways, such as "我 / 喜欢 / 中文" or "我喜欢 / 中文".
Similarly, in Japanese, a sentence like "私は日本語が好きです" (which also means "I
like Japanese") could be segmented in different ways, such as "私は / 日本語が / 好
きです" or "私は日本語 / が好きです".
● Chinese: In addition to the lack of spacing between words, Chinese also has a
large number of homophones, which are words that sound the same but have
different meanings. For example, "他" (he), "她" (she), and "它" (it) are all pronounced
"tā" in Mandarin, but they are written with different characters.
● Japanese: Japanese also has a large number of homophones, but it also has
different writing systems, including kanji (Chinese characters), hiragana, and
katakana. Kanji can often have multiple readings, which makes word
segmentation more complex.
● Thai: Thai has no spaces between words, and it also has no capitalization or
punctuation. In addition, Thai has a unique script with many consonants that
can be combined with different vowel signs to form words.
● Vietnamese: Vietnamese uses the Latin alphabet, but it also has many
diacritics (accent marks) that can change the meaning of a word. In addition,
Vietnamese words can be formed by combining smaller words, which makes
word segmentation more complex.
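For illustration, here is a small sketch of Chinese word segmentation using the jieba library (one of several possible segmenters); Japanese, Thai, and Vietnamese require their own language-specific tools.

# Segment a Chinese sentence into words with jieba.
import jieba

sentence = "我喜欢中文"
print(list(jieba.cut(sentence)))   # e.g. ['我', '喜欢', '中文']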
6.3 Morphology:
Morphology is the study of the structure of words and how they are formed from
smaller units called morphemes. Morphological analysis is important in many
natural language processing tasks, such as machine translation and speech
recognition, because it helps to identify the underlying structure of words and to
disambiguate their meanings.
Here are some examples of the challenges of morphology in different languages:
● Turkish: Turkish has a rich morphology, with a complex system of affixes that
can be added to words to convey different meanings. For example, the word
"kitap" (book) can be modified with different suffixes to indicate things like
possession, plurality, or tense.
● Arabic: Arabic also has a rich morphology, with a complex system of prefixes,
suffixes, and infixes that can be added to words to convey different meanings.
For example, the root "k-t-b" (meaning "write") can be modified with different
affixes to form words like "kitab" (book) and "kataba" (he wrote).
● Finnish: Finnish has a complex morphology, with a large number of cases,
suffixes, and vowel harmony rules that can affect the form of a word. For
example, the word "käsi" (hand) can be modified with different suffixes to
indicate things like possession, location, or movement.
● Swahili: Swahili has a complex morphology, with a large number of prefixes
and suffixes that can be added to words to convey different meanings. For
example, the word "kutaka" (to want) can be modified with different prefixes
and suffixes to indicate things like tense, negation, or subject agreement.