AIM-502: UNIT-1 INTRODUCTION TO NATURAL LANGUAGE PROCESSING
Origins of NLP:
1. Early Beginnings: The origins of NLP can be traced back to the early days
of computer science and linguistics. In the 1950s and 1960s, researchers
began to explore ways to enable computers to process and understand human
language. One of the earliest efforts was the development of machine
translation systems.
2. Rule-Based Approaches: During the 1960s and 1970s, researchers
focused on rule-based approaches to NLP. These approaches involved
manually creating linguistic rules and grammatical structures to teach
computers how to understand and generate text. While these methods were
limited in their capabilities, they laid the foundation for later advancements.
3. Statistical and Machine Learning Era: In the 1980s and 1990s, statistical
and machine learning approaches gained prominence in NLP. Researchers
started using large corpora of text to train models that could automatically learn
patterns in language. This led to the development of techniques like Hidden
Markov Models, n-grams, and later, neural networks.
Natural Language Processing (NLP) Challenges
NLP is a powerful tool with huge benefits, but it still faces a number of
limitations and problems.
1. Ambiguity: Human language is inherently ambiguous, with words and
phrases often having multiple meanings depending on context. Resolving
this ambiguity is a significant challenge in NLP, as it requires understanding
context and intent.
2. Syntax and Semantics: Capturing the syntax (structure) and semantics
(meaning) of language is complex. NLP models need to comprehend not
only the grammatical structure of sentences but also the underlying
meaning, which may involve cultural nuances and idiomatic expressions.
3. Named Entity Recognition (NER): Identifying entities such as names of
people, places, and organizations within text is crucial for many NLP tasks.
However, NER is challenging due to variations in naming conventions,
misspellings, and the presence of novel entities.
4. Sentiment Analysis and Tone Detection: Determining the sentiment or
emotional tone of a text is challenging, as it requires understanding the
nuances of human emotions, sarcasm, and irony.
5. Contextual Understanding: NLP models need to understand and maintain
context over longer passages of text. Discerning coreference (when a word
or phrase refers to another word or phrase) and maintaining a consistent
understanding of context is essential for accurate language comprehension.
6. Low-Resource Languages: Many NLP advancements have focused on
languages with abundant data, leaving behind low-resource languages that
lack sufficient training data. Developing effective NLP models for such
languages remains a challenge.
7. Ethical and Bias Concerns: NLP models trained on large datasets may
inadvertently learn and perpetuate biases present in the data. Ensuring
fairness, mitigating bias, and addressing ethical concerns in NLP are
ongoing challenges.
8. Real-World Variability: NLP models often struggle with handling informal
language, dialects, and colloquialisms that are prevalent in real-world
communication.
9. Multilingual and Cross-Lingual Understanding: Developing NLP models
that can seamlessly process and understand multiple languages is a
complex challenge due to linguistic differences and varying grammatical
structures.
OR
Language Differences: The majority of people in the US speak English,
but if you want to reach a global and/or diverse audience, you’ll have to
support various languages. Not only do various languages have different
vocabularies, they also differ in phrasing, inflection, and cultural conventions.
A familiar example of a language model is the technology that predicts the next
word you want to type on your mobile phone, allowing you to complete the
message faster. The task of predicting the next word(s) is referred to as
self-supervised learning: it does not need labels, it just needs lots of text,
and the process applies its own labels to the text.
Common types of language models include the following.
1. N-gram: An n-gram model assigns probabilities to sequences of n words, for
example the 4-gram "can you please call me." Basically, n can be thought of as
the amount of context the model is told to consider. Common sizes of n-grams
are unigrams, bigrams, trigrams and so on (a minimal bigram sketch in code
appears after this list).
2. Unigram: The unigram is the simplest type of language model. It doesn't look
at any conditioning context in its calculations. It evaluates each word or term
independently. Unigram models commonly handle language processing tasks
such as information retrieval. The unigram is the foundation of a more specific
model variant called the query likelihood model, which uses information
retrieval to examine a pool of documents and match the most relevant one to a
specific query.
3. Bidirectional: Unlike n-gram models, which analyze text in one direction
(backwards), bidirectional models analyze text in both directions, backwards
and forwards. These models can predict any word in a sentence or body of
text by using every other word in the text. Examining text bi-directionally
increases result accuracy. This type is often utilized in machine learning and
speech generation applications. For example, Google uses a bidirectional
model to process search queries.
4. Exponential: Also known as maximum entropy models, this type is more
complex than n-grams. Simply put, the model evaluates text using an equation
that combines feature functions and n-grams. Basically, this type specifies
features and parameters of the desired results and, unlike n-grams, leaves the
analysis parameters more open-ended; it doesn't specify individual gram
sizes, for example. The model is based on the principle of maximum entropy,
which states that the probability distribution with the most entropy is the best
choice. In other words, the model with the most chaos, and least room for
assumptions, is the most accurate. Exponential models are designed to
maximize entropy, which minimizes the number of statistical assumptions
that have to be made. This enables users to better trust the results they get
from these models.
5. Continuous space: This type of model represents words as a non-linear
combination of weights in a neural network. The process of assigning a weight
to a word is also known as word embedding. This type becomes especially
useful as data sets get increasingly large, because larger datasets often
include more unique words. The presence of many unique or rarely used
words can cause problems for a linear model like an n-gram: the number of
possible word sequences increases, and the patterns that inform results
become weaker. By weighting words in a non-linear, distributed way, this
model can "learn" to approximate words and therefore not be misled by
unknown values. Its "understanding" of a given word is not as tightly
tethered to the immediate surrounding words as it is in n-gram models.
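To make the n-gram idea concrete, here is a minimal Python sketch (not taken from the text above) of a bigram model: it counts adjacent word pairs in a toy corpus and turns the counts into conditional probabilities. The corpus, tokens and function name are illustrative assumptions.

from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count adjacent word pairs and estimate P(next_word | previous_word)."""
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            bigram_counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities.
    return {
        prev: {word: count / sum(counter.values()) for word, count in counter.items()}
        for prev, counter in bigram_counts.items()
    }

corpus = ["can you please call me", "please call me later", "can you help"]
model = train_bigram_model(corpus)
print(model["call"])  # {'me': 1.0}
print(model["you"])   # {'please': 0.5, 'help': 0.5}

A larger n (trigram, 4-gram and so on) conditions on more context, at the cost of needing far more data to observe each sequence often enough.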
The basic syntax of a regex is a pattern that matches a set of characters. The
pattern can be simple, such as a single character or a set of characters, or
complex, such as a sequence of characters or a set of characters that can be
repeated.
Regexes are commonly used in programming languages to search for specific
patterns in strings; in text editors and text-processing tools to find and replace
text patterns; in data analysis to extract specific information from text files; in
web scraping to extract information from web pages; and in NLP to find
specific patterns in text.
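As a small illustration (a sketch using Python's standard re module; the sample text and patterns are made-up examples, not taken from this unit), the following shows searching, extracting, and find-and-replace with regular expressions.

import re

text = "Order #1234 was shipped on 2023-05-17 to alice@example.com."

# Find all runs of digits.
print(re.findall(r"\d+", text))              # ['1234', '2023', '05', '17']

# Extract an ISO-style date (YYYY-MM-DD).
date = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(date.group())                          # 2023-05-17

# Find and replace: mask the email address.
masked = re.sub(r"\b[\w.+-]+@[\w-]+\.\w+\b", "<email>", text)
print(masked)  # Order #1234 was shipped on 2023-05-17 to <email>.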
1.4 Define Finite-State Automata
If a finite-state automaton (FSA) halts in a final (accepting) state after it
finishes working, it is said to accept its input, where the input is a sequence
of symbols. An FSA is used to accept or reject a string in a given language
and can be described by regular expressions. When the automaton is switched
on, it is in the initial state and starts working; in the final state, it accepts
or rejects the given string. In between the initial state and the final state,
the automaton moves through intermediate states according to its transition
rules.
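Below is a minimal sketch of a deterministic FSA in Python, assuming a toy language of binary strings that end in '0'; the states and transition table are illustrative and not part of the unit text.

# A tiny deterministic finite-state automaton (DFA) that accepts
# binary strings ending in '0'.
TRANSITIONS = {
    ("q0", "0"): "q1",
    ("q0", "1"): "q0",
    ("q1", "0"): "q1",
    ("q1", "1"): "q0",
}
START_STATE = "q0"
ACCEPT_STATES = {"q1"}

def accepts(string):
    """Run the DFA over the input symbols and report accept/reject."""
    state = START_STATE
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False          # symbol outside the alphabet: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPT_STATES

print(accepts("1010"))   # True  (ends in 0)
print(accepts("1011"))   # False (ends in 1)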
There are a number of different ways to tokenize text. The most common
approach is to use a regular expression to identify the different types of tokens.
For example, a regular expression could be used to identify words, punctuation
marks, and numbers. Once the text has been tokenized, it can be processed
by NLP algorithms. These algorithms can be used to perform a variety of tasks,
such as sentiment analysis, machine translation, and information retrieval.
Tokenization is an important step in NLP because it allows computers to
understand the structure of text. This understanding is essential for performing
a variety of NLP tasks.
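As an illustration of regex-based tokenization (a sketch; the pattern below is one reasonable choice among many, not a prescribed one), the following splits a sentence into word and punctuation tokens.

import re

# One possible pattern: words (with an optional internal apostrophe)
# or single non-space punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't panic: NLP costs $3.50, right?"))
# ["Don't", 'panic', ':', 'NLP', 'costs', '$', '3', '.', '50', ',', 'right', '?']

Real tokenizers add further rules for contractions, URLs, hyphenated words, and language-specific conventions.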
17.4 Explain Detecting and Correcting Spelling Errors
Spelling correction is a very important task in Natural Language Processing. It
is used in various tasks like search engines, sentiment analysis, text
summarization, etc. As the name suggests, spelling correction is about
detecting and correcting spelling errors. In real-world NLP tasks we often deal
with data containing typos, and spelling correction comes to the rescue to
improve model performance. For example, if we want to search for apple and
type “aple,” we would want the search engine to suggest “apple” instead of
returning no results.
Memory usage can be reduced by storing hashed n-gram counts instead of the
original n-grams. We need to make sure that the hash of an unknown word
does not collide with that of a known word, so we use a bloom filter built
from the known words. A bloom filter uses space efficiently to check whether
an element belongs to a set or not. We can reduce memory usage further by
using non-linear quantization to pack the 32-bit count frequencies into
16-bit values.
Improving accuracy
We can use machine learning algorithms to improve accuracy further. We
can start with a machine learning classifier that decides whether a given
word has an error or not. Then we can use a regression model to rank the
candidate words. This technique mimics the role of smoothing in language
models: it uses all the grams as input, and a classifier then decides the rank,
meaning how much weight each gram carries. For word ranking, we can train a
CatBoost (gradient-boosted decision trees) ranking model with multiple
features such as word frequency, word length, edit distance between the
source word and the modified word (i.e., the number of changes required to
convert the source into the modified word), an n-gram language model
prediction, the frequencies of neighboring words, etc.
Further thoughts
We can gather a large corpus with errors and their corresponding corrected
text and then train a model to improve accuracy.
We can also have dynamic learning options, learning on the way while
making corrections. Another option is a two-pass correction: learn some
statistics in the first pass and then make the actual corrections in the
second pass.
Spelling errors are common in written text. They can be caused by a variety of
factors, such as typos, homophones, and unfamiliar words. Spelling errors can
make it difficult for readers to understand the meaning of a text.
There are a number of different ways to detect and correct spelling errors.
One common approach is to use a spell checker. Spell checkers are software
programs that can identify words that are misspelled or incorrectly capitalized.
Spell checkers can be used to check the spelling of individual words or entire
documents. Another approach to detecting and correcting spelling errors is to
use a grammar checker. Grammar checkers are software programs that can
identify grammatical errors, such as incorrect punctuation, sentence fragments,
and run-on sentences. Grammar checkers can also be used to check the
spelling of individual words.
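As a minimal sketch of dictionary-based detection and correction (using Python's standard difflib for similarity matching; the small vocabulary here is just an assumed example, not a real dictionary):

import difflib

# Toy vocabulary; a real spell checker would use a much larger dictionary.
VOCABULARY = {"apple", "banana", "search", "engine", "language", "processing"}

def check_spelling(word):
    """Flag a word not in the vocabulary and suggest close matches."""
    if word.lower() in VOCABULARY:
        return word, []
    suggestions = difflib.get_close_matches(word.lower(), VOCABULARY, n=3, cutoff=0.6)
    return word, suggestions

print(check_spelling("aple"))      # ('aple', ['apple'])
print(check_spelling("language"))  # ('language', [])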
17.5 Describe Minimum Edit Distance
Many NLP tasks are concerned with measuring how similar two strings are.
Minimum edit distance is the minimum number of editing operations, i.e.
insertion, deletion, and substitution, needed to transform one string into
another.
Example: consider the two words "intention" and "execution". The minimum
edit distance between them can be visualized using their alignment:
I N T E $ N T I O N
$ E X E C U T I O N
d s s   i s
(Here $ marks a gap; d = deletion, s = substitution, i = insertion.)
A cost or weight can be assigned to each of these operations.
The Levenshtein distance between two strings uses the simplest weighting: a
cost of 1 for each insertion, deletion, and substitution.
In the above example, the minimum distance between the two strings is 5.
There is another form of Levenshtein distance in which the cost of the
substitution operation is 2, since a substitution is nothing but a deletion
followed by an insertion. Under this weighting, the minimum distance between
the two strings in the above example is 8.
Minimum edit distance is thus a technique for measuring how similar two
strings are.
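The minimum edit distance is usually computed with dynamic programming. Here is a minimal Python sketch in which the substitution cost is a parameter, so both weightings discussed above (1 and 2) can be reproduced.

def min_edit_distance(source, target, sub_cost=1):
    """Levenshtein distance via dynamic programming.
    Insertion and deletion cost 1; substitution cost is configurable."""
    n, m = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if source[i - 1] == target[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,              # deletion
                    dp[i][j - 1] + 1,              # insertion
                    dp[i - 1][j - 1] + sub_cost,   # substitution
                )
    return dp[n][m]

print(min_edit_distance("intention", "execution"))             # 5
print(min_edit_distance("intention", "execution", sub_cost=2)) # 8

The table has (n+1) x (m+1) cells, so the computation takes O(nm) time and space.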
Applications of Minimum Edit Distance
Spelling Correction
Plagiarism Detection
DNA Analysis (i.e., Computational Biology)