NLP Module 1
➢ Introduction
➢ Human languages
➢ Models
➢ Regular Expressions
➢ Patterns
➢ Finite State Automata
➢ Inflectional Morphology
➢ Derivational Morphology
➢ Finite-State Morphological Parsing
➢ Porter Stemmer
Natural Language Processing
• What is NLP?
• Natural language processing (NLP) is a machine learning technology that gives
computers the ability to interpret, manipulate, and comprehend human language.
• Examples include machine translation, summarization, ticket classification, and spell
check.
Introduction-Human Languages
The primary objectives of NLP are
• Enable human-machine communication.
• Improve human-human communication.
• Perform useful processing of text or speech.
• Some example tasks are
• Conversational Agents
• Definition: Programs that converse with humans via natural language.
• Famous Example: HAL 9000 from "2001: A Space Odyssey".
• Components: Language input (automatic speech recognition, natural language
understanding), language output (natural language generation, speech synthesis).
• Machine Translation
• Purpose: Automatically translate documents from one language to another.
• Challenges: Machine translation is not fully solved; it involves complex algorithms and
mathematical tools.
• Question Answering
• Definition: Goes beyond simple web search to answer complete questions.
• Examples of Questions: "What does 'divergent' mean?", "What year was Abraham Lincoln
born?"
• Capabilities: Answers range from simple definitions to complex inferences and synthesis of
information.
Introduction-Human Languages
Knowledge in Speech and Language Processing
Distinctive Feature: Language processing applications use knowledge of language, setting them
apart from other data processing systems.
Example: The Unix wc program counts bytes and lines without language knowledge, but counting
words requires understanding what constitutes a word.
Simple vs. Sophisticated: Basic systems like wc contrast with complex systems such as
conversational agents and machine translation, which require extensive language knowledge.
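The wc contrast above can be made concrete. A minimal sketch (the sample sentence and the word-regex are illustrative assumptions, not part of wc itself): counting bytes or whitespace chunks needs no language knowledge, but deciding that "doesn't" is one word while "--" is none requires a notion of what a word is.

```python
import re

text = "Dr. Smith doesn't live in New York -- he's moved."

# wc-style counts: no language knowledge needed
n_bytes = len(text.encode("utf-8"))
n_whitespace_tokens = len(text.split())  # counts "--" and "moved." as words

# Counting words needs a (here, very simple) definition of a word:
# alphabetic runs, optionally with one internal apostrophe ("doesn't", "he's"),
# punctuation and the "--" dash dropped.
words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
print(n_whitespace_tokens, len(words))
```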
Resolving Ambiguity
• Part-of-Speech Tagging: Deciding whether "duck" is a verb or a noun.
• Word Sense Disambiguation: Deciding whether "make" means "create" or "cook".
• Syntactic Disambiguation: Determining if "her" and "duck" are part of the same entity (1.5, 1.8)
or different entities (1.6). Method: Probabilistic Parsing.
• Text-to-Speech Synthesis Example: Pronouncing "lead" as in "lead pipe" vs. "lead me on".
• Other Types of Ambiguity: Speech Act Interpretation, e.g. determining whether a sentence is a
statement or a question.
Models
Models and Algorithms
• Various types of linguistic knowledge can be captured using a small number of formal models
and algorithms.
• Main Models are State machines, rule systems, logic, probabilistic models, vector-space models.
• Formal models and algorithms are essential in speech and language processing.
• These tools help capture and resolve linguistic ambiguities.
Models
Models and Algorithms
• Vector-Space Models
• Definition: Based on linear algebra, used for information retrieval and word meanings.
• Applications: Underlie many treatments of word meanings and information retrieval systems.
• Search Algorithms
• Description: State space search, dynamic programming, heuristic search (best-first, A* search).
• Applications: Speech recognition, parsing, machine translation.
• Machine Learning Tools
• Classifiers: Decision trees, support vector machines, Gaussian mixture models, logistic
regression.
• Sequence Models: Hidden Markov models, maximum entropy Markov models, conditional
random fields.
• Applications: Spelling correction, part-of-speech tagging, named entity recognition.
• Methodological Tools
• Training and Test Sets: Use of distinct sets for training and evaluation.
• Statistical Techniques: Cross-validation for model evaluation.
• Evaluation Metrics: Careful evaluation of system performance.
• Example Applications: Spelling correction, speech recognition, machine translation.
Regular Expressions
Introduction
• Regular Expression is a language for specifying text search strings.
• Widely used in UNIX, Microsoft Word, and web search engines.
• An important theoretical tool throughout computer science and linguistics.
• First developed by Kleene in 1956
• RE is a formula in a special language that is used for specifying simple classes of strings.
• A string is a sequence of symbols; for the purpose of most text-based search techniques,
a string is any sequence of alphanumeric characters (letters, numbers, spaces, tabs, and
punctuation).
• Facilitates text search and manipulation in numerous applications
• Regular Expressions as Algebraic Notation for characterizing a set of strings
• Used to specify search strings and define a language formally
• REs require a pattern to search for and a corpus of texts to search through.
• Standardized syntax across various platforms.
Regular Expressions
Basic Regular Expression Patterns
• A simple regular expression is a sequence of simple characters, e.g. /word/.
• REs are case sensitive:
• lowercase /s/ is distinct from uppercase /S/ (/s/ matches a lowercase s but not an uppercase S).
• We can solve this problem with the square brackets [ and ].
• The string of characters inside the brackets specifies a disjunction of characters to match.
• The regular expression /[1234567890]/ specifies any single digit.
• /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ any capital letter
Regular Expressions
Basic Regular Expression Patterns
• The brackets can be used with the dash (-) to specify any one character in a range.
• To specify what a single character cannot be, use the caret ^.
• If the caret ^ is the first symbol inside the brackets, the pattern is negated; otherwise the caret
is treated as a normal character.
Finite-State Automata
Comparing Depth-First and Breadth-First Search:
Depth-First Search (DFS):
Pros: Efficient use of memory; suitable for deep, narrow spaces.
Cons: Risk of infinite loops; may explore unfruitful paths deeply.
Breadth-First Search (BFS):
Pros: Guaranteed to find the shortest path; explores all paths uniformly.
Cons: High memory usage; may not be practical for large state-spaces.
Choosing a Strategy:
Small Problems: Either DFS or BFS may be adequate.
Large Problems: Consider advanced techniques like dynamic programming or A*.
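The two strategies can be sketched side by side; the only structural difference is the frontier (a FIFO queue for BFS, a LIFO stack for DFS). The tiny example graph is a hypothetical state-space for illustration.

```python
from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search: explores paths uniformly, finds a shortest path."""
    frontier = deque([[start]])        # FIFO queue of paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

def dfs(graph, start, goal):
    """Depth-first search: memory-light, but may return a non-shortest path."""
    stack = [[start]]                  # LIFO stack of paths
    visited = {start}
    while stack:
        path = stack.pop()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                stack.append(path + [nxt])
    return None

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(graph, "a", "d"))
print(dfs(graph, "a", "d"))
```

The visited set guards against the DFS infinite-loop risk mentioned above; without it, a cyclic state-space would loop forever.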
Finite-State Automata
Relating Deterministic and Non-Deterministic Automata:
• For any NFSA, there exists an equivalent DFSA.
• Conversion Algorithm which Converts an NFSA to a DFSA, potentially increasing the number
of states exponentially.
• NFSAs can follow multiple paths simultaneously.
• If states q_a and q_b are reached by the same input, they form a new state q_ab.
• The new DFSA can have up to 2^N states, where N is the number of states in the original NFSA.
• Conversion Process from NFSA to DFSA
• Initial State: Start with the initial state of the NFSA.
• State Grouping: For each input, group reachable states into a new state.
• State Transition: Define transitions for these new grouped states.
• Repeat: Continue for every possible input and state group until no new states are formed.
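The four conversion steps above can be sketched as the standard subset construction. This is a minimal version (no epsilon transitions, no accept-state bookkeeping); the sample NFSA is hypothetical.

```python
from collections import deque

def nfsa_to_dfsa(nfsa, start, alphabet):
    """Subset construction: each DFSA state is a frozenset of NFSA states."""
    start_set = frozenset([start])
    dfsa = {}                      # DFSA state -> {symbol: DFSA state}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfsa:        # already expanded
            continue
        dfsa[current] = {}
        for sym in alphabet:
            # group all NFSA states reachable on this input into one new state
            target = frozenset(q2 for q in current
                               for q2 in nfsa.get((q, sym), []))
            dfsa[current][sym] = target
            if target and target not in dfsa:
                queue.append(target)   # repeat until no new states form
    return dfsa

# hypothetical NFSA: on 'a', state q0 may stay in q0 OR move to q1
nfsa = {("q0", "a"): ["q0", "q1"],
        ("q0", "b"): ["q0"],
        ("q1", "b"): ["q2"]}
dfsa = nfsa_to_dfsa(nfsa, "q0", "ab")
print(len(dfsa))   # number of DFSA states actually constructed
```

Note that the worst-case 2^N blow-up rarely occurs in practice; here only 3 of the 2^3 possible subsets are ever reached.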
Morphology
• Regular Expressions are used for finding both "woodchuck" and "woodchucks" using a single
search string.
• The main challenges occur with plural forms like "fox" to "foxes", "peccary" to "peccaries",
"goose" to "geese", and "fish" remaining unchanged.
• It takes two kinds of knowledge to correctly search for singulars and plurals of these forms.
• Orthographic rules tell us that English words ending in -y are pluralized by changing the -y to -
i- and adding an -es.
• Morphological rules tell us that fish has a null plural, and that the plural of goose is formed by
changing the vowel.
• Morphological Parsing: Recognizing and breaking down words into morphemes.
• Example: Parsing "foxes" into "fox" and "-es", or "going" into VERB-go + GERUND-ing.
• Applications:
• Web Search: Handling morphologically complex languages like Russian.
• Part-of-Speech Tagging: Crucial for accurate tagging in languages with rich morphology.
• Spell-Checking: Essential for creating large dictionaries.
• Machine Translation: Translating inflected forms accurately.
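The orthographic and morphological rules above can be sketched as a toy pluralizer; the irregular and null-plural word lists are illustrative only, not exhaustive.

```python
# Morphological rules: irregulars and null plurals are lexically listed.
IRREGULAR = {"goose": "geese", "mouse": "mice", "ox": "oxen"}
NULL_PLURAL = {"fish", "sheep"}

def pluralize(noun):
    if noun in NULL_PLURAL:
        return noun                        # fish -> fish (null plural)
    if noun in IRREGULAR:
        return IRREGULAR[noun]             # goose -> geese (vowel change)
    # Orthographic rules:
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"           # peccary -> peccaries (y -> i + es)
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                 # fox -> foxes, thrush -> thrushes
    return noun + "s"                      # cat -> cats (default -s)

for w in ["fox", "peccary", "goose", "fish", "cat"]:
    print(w, "->", pluralize(w))
```

A single RE search string cannot encode this mix of lexical exceptions and spelling changes, which is why morphological parsing needs more machinery (FSTs) than plain regular expressions.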
Morphology
• To solve the morphological parsing problem, why couldn’t we just store all the plural forms of
English nouns and -ing forms of English verbs in a dictionary and do parsing by lookup?
• Dictionary lookup alone is not enough; the key algorithms for morphological parsing are:
• Finite-State Transducers (FSTs) are used for morphological parsing throughout speech and
language processing.
• Stemming: Stripping off word endings.
• Example: "foxes" to "fox".
• Lemmatization: Mapping words to their root forms.
• Example: "sang", "sung", "sings" to "sing".
• Tokenization: Separating words from running text.
• Challenges: Handling multi-word expressions like "New York" and contractions like "I'm".
• Minimum Edit Distance: Measuring orthographic similarity between words.
• Applications: Important for spell-checking and comparing word forms
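Minimum edit distance can be computed with the standard dynamic-programming table; this sketch uses unit costs for insertion, deletion, and substitution (some textbook variants charge 2 for substitution).

```python
def min_edit_distance(source, target, ins=1, dele=1, sub=1):
    """DP table d[i][j] = cost of editing source[:i] into target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele       # delete everything
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins        # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or copy
    return d[n][m]

print(min_edit_distance("intention", "execution"))
```

For spell-checking, candidate corrections can be ranked by their distance to the misspelled form, e.g. "graffe" is one insertion away from "giraffe".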
Morphology
• Understanding Morphemes
• Morphology is the study of the way words are built up from smaller meaning-bearing
units, morphemes.
• A morpheme is often defined as the minimal meaning-bearing unit in a language
• Examples:
• Fox: Single morpheme.
• Cats: Two morphemes (cat + -s).
• Types of Morphemes
• Stems: Main morphemes supplying the primary meaning.
• Affixes: Add additional meanings; further divided into prefixes, suffixes, infixes, and
circumfixes.
• Examples:
• Prefix: un- in "unbuckle".
• Suffix: -s in "eats".
• Circumfix: ge-...-t in German "gesagt".
• Infix: the plural -s inside "editors-in-chief" (editor-in-chief → editors-in-chief).
Morphology
• Understanding Morphemes
• Multiple Affixes: Words with Multiple Affixes
• Examples:
• Rewrites: Prefix (re-) + Stem (write) + Suffix (-s).
• Unbelievably: Stem (believe) + Prefix (un-) + Suffixes (-able, -ly).
• Methods of Combining Morphemes
• Inflection: Combining stem with grammatical morphemes, resulting in the same word class
• Examples:
• Plural: cat + -s = cats.
• Past tense: walk + -ed = walked.
• Function: Often fills syntactic functions like agreement.
• Derivation: Combination of a word stem with a grammatical morpheme resulting in a different
word class
• Examples:
• Computerize + -ation = Computerization.
• Characteristics: Often has a meaning that is harder to predict exactly.
Morphology
• Methods of Combining Morphemes
• Compounding: Combination of multiple word stems.
• Examples:
• Doghouse: Dog + House.
• Bookshelf: Book + shelf
• Cliticization: Combination of a word stem with a clitic.
• Examples:
• I’ve: I + ’ve.
• L’opera: Le + Opera (French).
Inflectional Morphology
• Inflection: Modifying words to express different grammatical categories
• English has a relatively simple inflectional system.
• English nouns have only two kinds of inflection
• Plural Inflection:
• Regular plural suffix: -s or –es
• Examples:
• Regular Nouns: cat → cats, thrush → thrushes
• Irregular Nouns: mouse → mice, ox → oxen
• Possessive Inflection:
• Apostrophe + -s for singular and irregular plural nouns.
• Examples:
• Singular: llama → llama's
• Irregular plural: children → children's
• Lone apostrophe for regular plural nouns.
• Example: llamas → llamas’
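The possessive rules above fit in a few lines; the regular_plural flag is an assumption standing in for the lexical knowledge of whether a plural is regular.

```python
def possessive(noun, regular_plural=False):
    """Apostrophe + s for singulars and irregular plurals; lone apostrophe
    for regular plurals that already end in -s."""
    if regular_plural and noun.endswith("s"):
        return noun + "'"      # llamas -> llamas'
    return noun + "'s"         # llama -> llama's, children -> children's

print(possessive("llama"))
print(possessive("children"))
print(possessive("llamas", regular_plural=True))
```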
Inflectional Morphology
• English verbal inflection is more complicated than nominal inflection.
• First, English has three kinds of verbs.
• main verbs, (eat, sleep, impeach),
• modal verbs (can, will, should),
• primary verbs (be, have, do)
• we will mostly be concerned with the main and primary verbs, because it is these that have
inflectional endings.
• Of these verbs, a large class are regular; that is to say, all verbs of this class have the
same endings marking the same functions. These regular verbs (e.g., walk or inspect) have four
morphological forms, as follows:
Inflectional Morphology
• The irregular verbs are those that have some more or less idiosyncratic forms of inflection.
• Irregular verbs in English often have five different forms, but can have as many as eight (e.g., the
verb be) or as few as three (e.g. cut or hit).
• The table below shows some sample irregular forms
• Adjectives can also be derived from nouns and verbs. Here are examples of a few suffixes
deriving adjectives from nouns or verbs
Derivational Morphology
• Derivation in English is more complex than inflection for a number of reasons.
• Less productive compared to inflection.
• Not all suffixes can be added to every base word.
• Subtle and complex meaning differences among suffixes.
• Examples:
• -ation: can be added to many verbs but not all (e.g., *eatation, *spellation).
• Sincerity vs. sincereness: subtle meaning differences.
• Comparison with Inflection
• Inflection:
• Regular and predictable changes.
• Limited number of affixes.
• Derivation:
• Irregular and less predictable.
• Complex meaning changes.
Finite-State Morphological Parsing
• Our goal will be to take input forms like those in the first and third columns of the figure and
produce output forms like those in the second and fourth columns.
• The second column contains the stem of each word as well as assorted morphological features.
These features specify additional information about the stem. For example, the feature
+N means that the word is a noun; +Sg means it is singular, +Pl that it is plural.
Finite-State Morphological Parsing
• Note that some of the input forms (like caught, goose, canto, or vino) will be ambiguous
between different morphological parses.
• For now, we will consider the goal of morphological parsing merely to list all possible parses.
• In order to build a morphological parser, we’ll need at least the following
• lexicon: the list of stems and affixes, together with basic information about them (whether a
stem is a Noun stem or a Verb stem, etc.)
• morphotactics: the model of morpheme ordering that explains which classes of morphemes can
follow other classes of morphemes inside a word. For example, the fact that the English plural
morpheme follows the noun rather than preceding it is a morphotactic fact.
• orthographic rules: these spelling rules are used to model the changes that occur in a word,
usually when two morphemes combine (e.g., the y to ie spelling rule discussed above that
changes city + -s to cities rather than citys).
Porter Stemmer
• Rule: AL → "" (the suffix is deleted)
• Rule: ANCE → ""
• Rule: ENCE → ""
• And more...
• "run" does not match any rule here.
• Step 5a: Remove -e if the measure (m) > 1, or (m=1 and not *o):
• Rule: (m>1) E → ""
• Rule: (m=1 and not *o) E → ""
• "run" does not end in "e", so no change.
• Step 5b: Remove the terminal l if m > 1 and the stem ends in a double l (*d and *L):
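Step 5a above can be sketched as code. This is a simplified fragment of the Porter algorithm, not the full stemmer: the measure m counts vowel-consonant (VC) sequences in the stem (ignoring Porter's special treatment of y), and *o is the condition that the stem ends consonant-vowel-consonant with the final consonant not w, x, or y.

```python
import re

def measure(stem):
    """Porter's m: number of VC sequences (simplified: y treated as consonant)."""
    # collapse consonant runs to 'C' first, then vowel runs to 'V'
    forms = re.sub(r"[aeiou]+", "V", re.sub(r"[^aeiou]+", "C", stem))
    return forms.count("VC")

def step5a(word):
    """(m>1) E -> ''   and   (m=1 and not *o) E -> ''"""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        # *o: stem ends cvc, where the second c is not w, x, or y
        star_o = bool(re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem))
        if m > 1 or (m == 1 and not star_o):
            return stem
    return word

print(step5a("probate"))  # m > 1, so the -e is removed -> "probat"
print(step5a("rate"))     # m = 1 but *o holds, so "rate" is kept
print(step5a("run"))      # does not end in e, no change
```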