AIM-502: UNIT-1 INTRODUCTION TO NATURAL LANGUAGE PROCESSING

1.1 Describe the Origins and challenges of NLP


Natural Language Processing (NLP) is a field of Artificial Intelligence (AI)
that deals with the interaction between computers and human languages. NLP
is used to analyze, understand, and generate natural language text and
speech. The goal of NLP is to enable computers to understand and interpret
human language in a way that is similar to how humans process language.
1. Natural Language Processing (NLP) is a field of computer science and
artificial intelligence that focuses on the interaction between computers and
humans using natural language. It involves analyzing, understanding, and
generating human language data, such as text and speech.
2. NLP has a wide range of applications, including sentiment analysis,
machine translation, text summarization, chatbots, and more. Some
common tasks in NLP include:
3. Text Classification: Classifying text into different categories based on its
content, such as spam filtering, sentiment analysis, and topic modeling.
4. Named Entity Recognition (NER): Identifying and categorizing named
entities in text, such as people, organizations, and locations.
5. Part-of-Speech (POS) Tagging: Assigning a part of speech to each word in
a sentence, such as noun, verb, adjective, and adverb.
6. Sentiment Analysis: Analyzing the sentiment of a piece of text, such as
positive, negative, or neutral.
7. Machine Translation: Translating text from one language to another.

Origins of NLP:
1. Early Beginnings: The origins of NLP can be traced back to the early days
of computer science and linguistics. In the 1950s and 1960s, researchers
began to explore ways to enable computers to process and understand human
language. One of the earliest efforts was the development of machine
translation systems.
2. Rule-Based Approaches: During the 1960s and 1970s, researchers
focused on rule-based approaches to NLP. These approaches involved
manually creating linguistic rules and grammatical structures to teach
computers how to understand and generate text. While these methods were
limited in their capabilities, they laid the foundation for later advancements.
3. Statistical and Machine Learning Era: In the 1980s and 1990s, statistical
and machine learning approaches gained prominence in NLP. Researchers
started using large corpora of text to train models that could automatically learn
patterns in language. This led to the development of techniques like Hidden
Markov Models, n-grams, and later, neural networks.
Natural Language Processing (NLP) Challenges
NLP is a powerful tool with huge benefits, but it still has a number of
limitations and open problems.
1. Ambiguity: Human language is inherently ambiguous, with words and
phrases often having multiple meanings depending on context. Resolving
this ambiguity is a significant challenge in NLP, as it requires understanding
context and intent.
2. Syntax and Semantics: Capturing the syntax (structure) and semantics
(meaning) of language is complex. NLP models need to comprehend not
only the grammatical structure of sentences but also the underlying
meaning, which may involve cultural nuances and idiomatic expressions.
3. Named Entity Recognition (NER): Identifying entities such as names of
people, places, and organizations within text is crucial for many NLP tasks.
However, NER is challenging due to variations in naming conventions,
misspellings, and the presence of novel entities.
4. Sentiment Analysis and Tone Detection : Determining the sentiment or
emotional tone of a text is challenging, as it requires understanding the
nuances of human emotions, sarcasm, and irony.
5. Contextual Understanding: NLP models need to understand and maintain
context over longer passages of text. Discerning coreference (when a word
or phrase refers to another word or phrase) and maintaining a consistent
understanding of context is essential for accurate language comprehension.
6. Low-Resource Languages: Many NLP advancements have focused on
languages with abundant data, leaving behind low-resource languages that
lack sufficient training data. Developing effective NLP models for such
languages remains a challenge.
7. Ethical and Bias Concerns: NLP models trained on large datasets may
inadvertently learn and perpetuate biases present in the data. Ensuring
fairness, mitigating bias, and addressing ethical concerns in NLP are
ongoing challenges.
8. Real-World Variability: NLP models often struggle with handling informal
language, dialects, and colloquialisms that are prevalent in real-world
communication.
9. Multilingual and Cross-Lingual Understanding : Developing NLP models
that can seamlessly process and understand multiple languages is a
complex challenge due to linguistic differences and varying grammatical
structures.
OR
 Language Differences: The majority of people in the US speak English,
but if you want to reach a global and/or diverse audience, you’ll have to
support various languages. Not only do various languages have

2|Page
AIM-502: UNIT-1 INTRODUCTION TO NATURAL LANGUAGE PROGRAMMING

substantially diverse collections of vocabulary, but they also have different


forms of phrasing, inflections, and cultural norms. You can overcome this
problem by using “universal” models that can move at least part of what
you’ve learned to other languages. You will, however, have to spend time
updating your NLP system for each additional language. Employing
a certified language translation service is always the best bet when working
with different languages.
 Training Data: NLP is all about studying language in order to better
comprehend it. To become proficient in a language, a person must be
immersed in it continually for years; even the greatest AI must spend a
substantial amount of time reading, listening to, and using the language.
The training data given to an NLP system determines its capabilities. If you
feed the system inaccurate or skewed data, it will learn the incorrect things
or learn inefficiently.
 Innate Biases: In certain situations, NLP systems might carry the biases of
their programmers, as well as biases in the data sets used to develop them.
Depending on the application, an NLP system might exploit and/or
perpetuate certain social prejudices, or give a superior experience to some
types of users over others. It is difficult to create a solution that operates in
all situations and with all people.
 Words with Multiple Meanings: There is no such thing as flawless
language, and most languages include words that can have several
meanings depending on the situation. A user who inquires, “How are you?”
has a very different aim than a user who inquires, “How do I add the new
debit card?” Good NLP tools should be able to discriminate between these
utterances with the help of context.
 Phrases with Multiple Intentions: Because certain words and queries
have many meanings, your NLP system won’t be able to oversimplify the
issue by understanding simply one of them. A user may say to your chatbot,
“I have to cancel my prior order and change my card on file,” for instance.
Your AI must be able to discern between these intentions.
 Uncertainty and False Positives: When an NLP system detects a term that
should be intelligible and/or addressable but cannot be adequately
responded to, it is called a false positive. The idea is to create an NLP
system that can identify its own limits and clear up uncertainty using
questions or hints.
 Keeping a Conversation Moving: Many current NLP applications are
based on human-machine communication. As a result, your NLP AI must be
able to keep the dialogue going by asking more questions to gather more
data and constantly pointing to a solution.

1.2 Classification of Language Modeling


A simple definition of a Language Model is an AI model that has been trained to
predict the next word or words in a text based on the preceding words. It is part
of the technology that predicts the next word you want to type on
your mobile phone allowing you to complete the message faster.
The task of predicting the next word(s) is referred to as self-supervised learning:
it does not need labels, it just needs lots of text, because the process derives its
own labels from the text.
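
To make the idea of self-supervised labels concrete, here is a minimal Python sketch (the sentence and variable names are invented for illustration) that turns raw text into (context, next-word) training pairs without any manual annotation:

# Build (context, next-word) pairs from raw text: each prefix is an input and
# the word that follows it is the label the model must learn to predict.
text = "language models predict the next word"
tokens = text.split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, label in pairs:
    print(" ".join(context), "->", label)
# language -> models
# language models -> predict
# language models predict -> the
# ...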

1.2.1 Explain Grammar-based LM


A grammar-based language model (LM) is a type of language model that uses
a grammar to generate text. The grammar is a set of rules that define how
words can be combined to form sentences.
The LM uses the grammar to generate a probability distribution over all
possible sentences. The sentence with the highest probability is then
generated.
Grammar-based LMs are often used in natural language processing (NLP)
applications, such as machine translation and text summarization.
They are also used in speech recognition and generation systems. There are a
number of different types of grammar-based LMs.
The most common type is the context-free grammar (CFG). A CFG is a
grammar that does not take into account the context of the words in a
sentence. This means that a CFG can generate sentences that are
grammatically correct, but that do not make sense.
Another type of grammar-based LM is the context-sensitive grammar (CSG). A
CSG is a grammar that takes into account the context of the words in a
sentence. This means that a CSG can generate sentences that are both
grammatically correct and meaningful.
Grammar-based LMs are often used in combination with other types of LMs,
such as statistical LMs. This allows the model to take advantage of the strengths
of both types of LMs.
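
As a rough illustration of how a grammar can drive text generation, the sketch below encodes a tiny context-free grammar as a Python dictionary and expands it randomly. The grammar, symbol names, and words are invented for the example; a real grammar-based LM would also attach probabilities to the rules:

import random

# A toy context-free grammar: each non-terminal maps to a list of possible
# expansions; an expansion is a sequence of non-terminals and/or terminal words.
grammar = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["chased"], ["saw"]],
}

def generate(symbol="S"):
    """Expand a symbol by recursively choosing one of its rules at random."""
    if symbol not in grammar:          # terminal word: return it as-is
        return [symbol]
    expansion = random.choice(grammar[symbol])
    words = []
    for sym in expansion:
        words.extend(generate(sym))
    return words

print(" ".join(generate()))            # e.g. "the dog chased a cat"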

1.2.2 Explain Statistical LM


A statistical language model (LM) is a type of language model that uses
statistical techniques to predict the next word or sequence of words in a
sentence.
The LM uses a corpus of text to learn the probability distribution of words and
sentences. Statistical LMs are often used in natural language processing (NLP)
applications, such as machine translation and text summarization.
They are also used in speech recognition and generation systems. There are
several types of statistical LMs:

1. N-gram: N-grams are a relatively simple approach to language models. They
create a probability distribution for a sequence of n words. The n can be any
number, and defines the size of the "gram", or sequence of words being
assigned a probability. For example, if n = 5, a gram might look like this: "can you please
call me." The model then assigns probabilities using sequences of length n.
Basically, n can be thought of as the amount of context the model is told to
consider. Some types of n-grams are unigrams, bigrams, trigrams, and so on (a
minimal bigram sketch is given after this list).
2. Unigram: The unigram is the simplest type of language model. It doesn't look
at any conditioning context in its calculations. It evaluates each word or term
independently. Unigram models commonly handle language processing tasks
such as information retrieval. The unigram is the foundation of a more specific
model variant called the query likelihood model, which uses information
retrieval to examine a pool of documents and match the most relevant one to a
specific query.
3. Bidirectional: Unlike n-gram models, which analyze text in one direction
(backwards), bidirectional models analyze text in both directions, backwards
and forwards. These models can predict any word in a sentence or body of
text by using every other word in the text. Examining text bi-directionally
increases result accuracy. This type is often utilized in machine learning and
speech generation applications. For example, Google uses a bidirectional
model to process search queries.
4. Exponential: Also known as maximum entropy models, this type is more
complex than n-grams. Simply put, the model evaluates text using an equation
that combines feature functions and n-grams. Basically, this type specifies
features and parameters of the desired results, and unlike n-grams, leaves
analysis parameters more ambiguous -- it doesn't specify individual gram
sizes, for example. The model is based on the principle of entropy, which
states that the probability distribution with the most entropy is the best choice.
In other words, the model with the most chaos, and least room for
assumptions, is the most accurate. Exponential models are designed to
maximize cross-entropy, which minimizes the number of statistical assumptions
that can be made. This enables users to better trust the results they get from
these models.
5. Continuous space: This type of model represents words as a non-linear
combination of weights in a neural network. The process of assigning a weight
to a word is also known as word embedding. This type becomes especially
useful as data sets get increasingly large, because larger datasets often
include more unique words. The presence of a lot of unique or rarely used
words can cause problems for a linear model like an n-gram. This is because the
number of possible word sequences increases, and the patterns that inform
results become weaker. By weighting words in a non-linear, distributed way,
this model can "learn" to approximate words and therefore not be misled by
any unknown values. Its "understanding" of a given word is not as tightly
tethered to the immediate surrounding words as it is in n-gram models.
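
The following minimal sketch (referred to in the n-gram item above) estimates bigram probabilities from a toy corpus by simple counting; the corpus and words are invented for illustration, and a real statistical LM would be trained on far more text and use smoothing:

from collections import Counter, defaultdict

# Toy corpus; a real statistical LM would be trained on a large text collection.
corpus = "i like nlp . i like language models . nlp models predict words ."
tokens = corpus.split()

# Count how often each word follows each preceding word (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, curr in zip(tokens, tokens[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """P(curr | prev) estimated by maximum likelihood from the counts."""
    context_total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / context_total if context_total else 0.0

def predict_next(prev):
    """Most likely next word given the previous word."""
    return bigram_counts[prev].most_common(1)[0][0] if bigram_counts[prev] else None

print(bigram_prob("i", "like"))   # 1.0 in this toy corpus
print(predict_next("like"))       # "nlp" (first among the tied candidates)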

1.3 Describe the role of Regular Expressions


Regular expressions (regex) are a sequence of characters that define a search
pattern. They are commonly used to search for specific text patterns in strings.
Regular expressions are used in a wide range of applications, including text
editing, text processing, and text analysis.

The basic syntax of a regex is a pattern that matches a set of characters. The
pattern can be simple, such as a single character or a set of characters, or
complex, such as a sequence of characters or a set of characters that can be
repeated.
Regexes are commonly used in programming languages to search for specific
patterns in strings; in text editors and text-processing tools to find and replace
specific text patterns; in data analysis to extract specific information from text
files; in web scraping to extract specific information from web pages; and in
NLP to find specific patterns in text.
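
The short sketch below shows typical regex use with Python's re module for finding and replacing text patterns; the example string and patterns are deliberately simple illustrations, not production-grade patterns:

import re

text = "Contact us at support@example.com or call 040-1234567 before 5 pm."

# Find e-mail-like substrings (a deliberately simple, illustrative pattern).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Find digit sequences (e.g. phone-number fragments).
numbers = re.findall(r"\d+", text)

# Replace every digit with '#', a common anonymisation step in text processing.
masked = re.sub(r"\d", "#", text)

print(emails)   # ['support@example.com']
print(numbers)  # ['040', '1234567', '5']
print(masked)   # Contact us at support@example.com or call ###-####### before # pm.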
1.4 Define Finite-State Automata

A finite state machine or finite state automaton is a model of behaviour
composed of states, transitions and actions. A finite state automaton is a device
that can be in one of a finite number of states. If the automaton is in a final state
when it stops working, it is said to accept its input, where the input is a
sequence of symbols. An FSA is used to accept or reject a string in a given
language and uses regular expressions. When the automaton is switched on, it
will be in the initial state and starts working. In the final state, it will accept or
reject the given string. In between the initial state and the final state there are
transitions, a process of switching over from one state to another state. FSA
can be used to represent a morphological lexicon and for recognition.
Finite Automata (FA) is the simplest machine to recognize patterns.
FSA are defined by a set of states and a set of transitions between those
states. The states represent the various states that the automaton can be in,
and the transitions represent the possible transitions that the automaton can
make. It takes a string of symbols as input and changes its state accordingly.
When the expected symbol is found, the transition occurs.
At the time of transition, the automaton can either move to the next state or stay
in the same state.
A finite automaton has two possible outcomes for an input string: accept or
reject. When the input string is processed successfully and the automaton
reaches a final state, the string is accepted.

A finite automaton is defined by a 5-tuple (Q, ∑, δ, q0, F), where:
Q: finite set of states
∑: finite set of input symbols
q0: initial state
F: set of final states
δ: transition function
Finite automata can be represented by an input tape and a finite control.
Input tape: a linear tape having some number of cells. Each input symbol is
placed in a cell.
Finite control: the finite control decides the next state on receiving a particular
input from the input tape. The tape reader reads the cells one by one from left to
right, and at a time only one input symbol is read.

FSA are commonly used in natural language processing (NLP) applications
such as text analysis, speech recognition, speech synthesis, and natural
language generation.
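
A minimal sketch of the 5-tuple definition in Python follows; the particular automaton (accepting binary strings with an even number of 1s) is an invented example used only to show how Q, ∑, δ, q0 and F fit together:

# A minimal DFA following the 5-tuple (Q, sigma, delta, q0, F) described above.
# This illustrative automaton accepts binary strings with an even number of 1s.
Q = {"q0", "q1"}                 # states
sigma = {"0", "1"}               # input alphabet
delta = {                        # transition function delta(state, symbol) -> state
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}
q0 = "q0"                        # initial state
F = {"q0"}                       # final (accepting) states

def accepts(string):
    """Run the DFA over the input tape, one symbol at a time, left to right."""
    state = q0
    for symbol in string:
        if symbol not in sigma:
            return False         # reject symbols outside the alphabet
        state = delta[(state, symbol)]
    return state in F            # accept only if we stop in a final state

print(accepts("1010"))  # True  (two 1s)
print(accepts("111"))   # False (three 1s)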
1.5 State the importance of English Morphology
English morphology plays a crucial role in understanding and effectively
using the English language.
Morphology is the study of the structure and formation of words, including
their roots, prefixes, suffixes, and inflections.
Here are some key reasons why English morphology is important:
1. Natural Language Processing (NLP):
In the field of NLP, morphological analysis is a fundamental step in tasks
such as part-of-speech tagging, text generation, and sentiment analysis.
2. Vocabulary Expansion:
Understanding English morphology allows individuals to decipher the
meanings of unfamiliar words based on their roots, prefixes, and suffixes.
This ability facilitates vocabulary acquisition and comprehension, enabling
effective communication and reading comprehension.
3. Word Formation:
Morphology explains how words are formed by combining different
elements.
This knowledge helps in creating new words, understanding the meanings
of compound words, and deciphering the etymology of terms.
4. Grammar and Syntax:
Morphological knowledge is closely linked to grammar and sentence
structure.
Recognizing the different forms of words (inflections) helps in constructing
grammatically correct sentences and understanding the roles of words in
sentences.
5. Inflectional Endings:
English morphology involves studying inflections, such as verb tense,
pluralization, and possessive forms.
Mastery of these inflections is essential for producing accurate and coherent
sentences.
6. Derivational Processes:
Morphology explains how words change in meaning and part of speech
through derivational processes.
For example, transforming a noun to an adjective by adding a suffix (e.g.,
"friend" to "friendly") or changing a verb to a noun (e.g., "sing" to "singer"); see
the small suffix-based sketch after this list.
7. Language Learning and Teaching:
Morphology plays a significant role in language education.
Teachers use morphological analysis to enhance vocabulary instruction and
help students decode new words and understand their meanings in context.
8. Language Processing and Comprehension:
Morphological awareness aids in the quick recognition and comprehension
of words.
It helps readers break down complex words into familiar components,
making it easier to infer meanings and understand texts.
9. Text Analysis and Stylistics: Morphology contributes to the analysis of
literary and linguistic texts.
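
As a toy illustration of the inflectional and derivational processes mentioned in points 5 and 6 above, the sketch below strips a few common English suffixes; it is a deliberately naive heuristic, not a real morphological analyser, and the suffix labels are simplified assumptions:

# A minimal, illustrative suffix-stripping sketch: guess a stem and a
# grammatical/derivational label from a few common English endings.
SUFFIXES = [
    ("ing", "verb, progressive"),
    ("ed",  "verb, past"),
    ("ly",  "adjective or adverb (derived)"),
    ("er",  "noun, agent (derived from verb)"),
    ("s",   "noun, plural / verb, 3rd person singular"),
]

def analyse(word):
    for suffix, label in SUFFIXES:
        # require a reasonably long remaining stem before stripping
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix, label
    return word, "", "base form"

for w in ["friendly", "singer", "cats", "walked"]:
    print(w, "->", analyse(w))
# friendly -> ('friend', 'ly', 'adjective or adverb (derived)')
# singer   -> ('sing', 'er', 'noun, agent (derived from verb)')
# cats     -> ('cat', 's', 'noun, plural / verb, 3rd person singular')
# walked   -> ('walk', 'ed', 'verb, past')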

1.6 Explain Transducers for lexicon and rules


A lexicon is a repository for words. The simplest one would consist of an
explicit list of every word of the language - inconvenient, or even impossible!
Computational lexicons are usually structured with a list of each of the stems
and affixes of the language together with a representation of morphotactics
telling us how they can fit together. The most common way of modeling
morphotactics is the finite-state automaton.
A transducer is a mathematical object that can be used to represent a variety
of different types of systems. In the context of natural language processing
(NLP), transducers are often used to represent lexicons and rules.
A lexicon is a list of words and their associated meanings. A transducer can be
used to represent a lexicon by having each word in the lexicon be associated
with a state. When the transducer is in a particular state, it can output the
corresponding word.
Rules are a set of instructions that can be used to generate new words or
phrases. A transducer can be used to represent rules by having each rule be
associated with a state. When the transducer is in a particular state, it can
output the corresponding rule.
Transducers are a powerful tool for representing lexicons and rules in NLP.
They can be used to generate new words and phrases, and they can be used
to identify errors in text.
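
A very simplified sketch of what a lexical transducer computes is shown below: it maps a surface form such as "cats" to a lexical analysis such as "cat +N +PL" using a toy lexicon and one plural rule. The lexicon, tags, and rule are invented for illustration, and real systems implement this as finite-state transitions rather than Python conditionals:

# Toy lexicon: stems and their category.
LEXICON = {"cat": "N", "dog": "N", "fox": "N"}

def transduce(surface):
    """Map a surface word like 'cats' to a lexical string like 'cat +N +PL'."""
    # Rule: a known stem on its own is singular.
    if surface in LEXICON:
        return f"{surface} +{LEXICON[surface]} +SG"
    # Rule: a known stem followed by 's' realises the plural morpheme.
    if surface.endswith("s") and surface[:-1] in LEXICON:
        stem = surface[:-1]
        return f"{stem} +{LEXICON[stem]} +PL"
    return None   # reject words not covered by the lexicon and rules

print(transduce("cat"))   # cat +N +SG
print(transduce("dogs"))  # dog +N +PL
print(transduce("runs"))  # None (not in this toy lexicon)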
1.7 State the importance of Tokenization
Tokenization is the process of breaking down a text into smaller units, called
tokens. These tokens can be individual words, punctuation marks, or other
elements. Tokenization is an important step in natural language processing
(NLP) because it allows computers to understand the structure of text.
There are a number of different ways to tokenize text. The most common
approach is to use a regular expression to identify the different types of tokens.
For example, a regular expression could be used to identify words, punctuation
marks, and numbers. Once the text has been tokenized, it can be processed
by NLP algorithms. These algorithms can be used to perform a variety of tasks,
such as sentiment analysis, machine translation, and information retrieval.
Tokenization is an important step in NLP because it allows computers to
understand the structure of text. This understanding is essential for performing
a variety of NLP tasks.
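
A common way to tokenize with a regular expression is sketched below; the pattern is a simple illustrative one (words with optional apostrophes, numbers with optional currency sign and decimals, and single punctuation marks), not a full tokenizer:

import re

text = "Dr. Smith's NLP course costs $99.50, isn't that great?"

# Words (with optional internal apostrophe), numbers (with optional '$' and
# decimals), or any single non-space, non-word symbol.
pattern = r"\w+(?:'\w+)?|\$?\d+(?:\.\d+)?|[^\w\s]"
tokens = re.findall(pattern, text)

print(tokens)
# ['Dr', '.', "Smith's", 'NLP', 'course', 'costs', '$99.50', ',',
#  "isn't", 'that', 'great', '?']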
1.8 Explain Detecting and Correcting Spelling Errors
Spelling Correction is a very important task in Natural Language Processing. It
is used in various tasks like search engines, sentiment analysis, text
summarization, etc. As the name suggests, we try to detect and correct
spelling errors in spelling correction. In real-world NLP tasks, we often deal
with data having typos, and their spelling correction comes to the rescue to
improve model performance. For example, if we want to search for “apple” and
type “aple,” we want the search engine to suggest “apple” instead of returning
no results.

Different approaches to spelling correction


In the simplest brute-force approach, we consider all possible candidate words
that can be formed using edits - insert, delete, replace, split and transpose.
For example, for the word abc, some of the possible words after one edit are -
ab ac bc bac cba acb a_bc ab_c aabc abbc acbc adbc aebc etc.
These words are then added to a list. We repeat the same thing for the words
still having errors. We then score all the candidates using a unigram language
model, whose word frequencies are pre-calculated from a large corpus. The
candidate with the highest frequency is chosen (a minimal sketch of this
candidate-plus-unigram approach is given after the list of refinements below).
 Adding more context
We can use n-gram models with n>1, instead of the unigram language
model, providing better accuracy. But, at the same time, the model becomes
heavy due to a higher number of calculations.
 Increasing speed: SymSpell Approach (Symmetric Delete Spelling Correction)
Here we pre-calculate all delete typos, unlike extracting all possible edits
every time we encounter any error. This will cost additional memory
consumption.
 Improving memory consumption
We require large datasets (at least some GBs) to obtain good accuracy. If
we train the n-gram language models on small datasets like a few MBs, it
leads to high memory consumption. The SymSpell index and the language
model each occupy roughly half of that size. This increased usage is
because we store frequencies instead of simple text. We can compress the
n-gram model by using a perfect-hash approach. A perfect hash is a hash
that can never have collisions, so we can use it to store the counts of
n-grams: as we don’t have any collisions, we store the count frequencies
instead of the original n-grams. We need to make sure that the hash of
unknown words does not match with those of known words, so for that, we
use a bloom filter of known words. A bloom filter utilizes space efficiently to
check whether an element belongs to a set or not. We can further reduce
memory usage by using nonlinear quantization to pack the 32-bit count
frequencies into 16-bit values.
 Improving accuracy
We can use machine learning algorithms to improve accuracy further. We
can start with a machine learning classifier that decides whether a given
word has an error or not. Then we can use a regression model to rank the
candidate words. This technique mimics the role of smoothing in language
models as it uses all the grams as input, and a classifier then decides the
rank, meaning how powerful each gram is. For word ranking, we can train a
CatBoost (gradient-boosted decision trees) ranking model with multiple
features such as word frequency, word length, edit distance between the
source word and the modified word (i.e., the number of changes required to
convert the source to the modified word), n-gram language model prediction,
the frequencies of neighboring words, etc.
 Further thoughts
We can gather a large corpus with errors and their corresponding corrected
text and then train a model to improve accuracy.
We can also have dynamic learning options. We can learn on the way while
making corrections. We can also have two pass corrections by learning
some statistics in the first pass and then making the actual correction in the
second pass.
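
The sketch below illustrates the brute-force candidate-generation approach described at the start of this subsection. For brevity it generates only insert, delete, replace and transpose edits, and it ranks candidates with a toy unigram frequency table; the words and counts are invented for illustration:

from collections import Counter

# Toy unigram frequencies; in practice these come from a large corpus.
WORD_FREQ = Counter({"apple": 120, "ample": 15, "able": 40, "apply": 60})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts    = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the known candidate with the highest unigram frequency."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ] or [word]
    return max(candidates, key=lambda w: WORD_FREQ[w])

print(correct("aple"))   # 'apple' (more frequent than the one-edit candidates 'ample' and 'able')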

Spelling errors are common in written text. They can be caused by a variety of
factors, such as typos, homophones, and unfamiliar words. Spelling errors can
make it difficult for readers to understand the meaning of a text.
There are a number of different ways to detect and correct spelling errors.
One common approach is to use a spell checker. Spell checkers are software
programs that can identify words that are misspelled or incorrectly capitalized.
Spell checkers can be used to check the spelling of individual words or entire
documents. Another approach to detecting and correcting spelling errors is to
use a grammar checker. Grammar checkers are software programs that can
identify grammatical errors, such as incorrect punctuation, sentence fragments,
and run-on sentences. Grammar checkers can also be used to check the
spelling of individual words.
1.9 Describe Minimum Edit Distance
Many NLP tasks are concerned with measuring how similar two strings are.
The minimum edit distance between intention and execution can be visualized
using their alignment.
Minimum edit distance is the minimum number of editing operations, i.e.,
insertion, deletion and substitution, needed to transform one string into another.
Example: let us consider the two words Intention and Execution.

INTE$NTION
$EXECUTION
d s s   i s
Reading the alignment columns left to right, the edit operations are d, s, s, i, s
(d = deletion, s = substitution, i = insertion; $ marks a gap, and unmarked
columns are matches).
A cost or weight can be assigned to each of these operations.
Levenshtein distance between two strings incorporates the simplest weighting
factor, i.e., 1 for each operation of insertion, deletion and substitution.
In the above example, the minimum distance between these two strings is 5.
There is another form of Levenshtein distance in which the cost of the
substitution operation is 2, since a substitution is nothing but a deletion followed
by an insertion.
In the above example, the minimum distance between the two strings is 8,
considering substitution to be an operation of weight 2.
It is a technique used to find how similar the two strings are.
Applications of Minimum Edit Distance
 Spelling Correction
 Plagiarism Detection
 DNA Analysis i.e. Computational Biology.
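
The standard dynamic-programming computation of minimum edit distance is sketched below; with a substitution cost of 1 it gives 5 for intention → execution, and with a cost of 2 it gives 8, matching the values above:

def min_edit_distance(source, target, sub_cost=2):
    """Dynamic-programming minimum edit distance.
    Insertion and deletion cost 1; substitution costs sub_cost
    (2 for the Levenshtein variant described above, 1 for the simple one)."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete all of source[:i]
    for j in range(1, m + 1):
        D[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                              # deletion
                D[i][j - 1] + 1,                              # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return D[n][m]

print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8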

