
Natural Language Processing

Course Code: DS311


Module-1
Module-1 Contents

➢ Introduction
➢ Human languages
➢ Models
➢ Regular Expressions
➢ Patterns
➢ Finite State Automata
➢ Inflectional Morphology
➢ Derivational Morphology
➢ Finite-State Morphological Parsing
➢ Porter Stemmer
Natural Language Processing
• What is NLP?
• Natural language processing (NLP) is a machine learning technology that gives
computers the ability to interpret, manipulate, and comprehend human language.
• Examples include machine translation, summarization, ticket classification, and spell check.
Introduction-Human Languages
The primary objectives of NLP are
• Enable human-machine communication.
• Improve human-human communication.
• Perform useful processing of text or speech.
• Some example tasks are

Conversational Agents
• Definition: Programs that converse with humans via natural language.
• Famous Example: HAL 9000 from "2001: A Space Odyssey".
• Components: Language input (automatic speech recognition, natural language understanding); language output (natural language generation, speech synthesis).

Machine Translation
• Purpose: Automatically translate documents from one language to another.
• Challenges: Machine translation is not fully solved; it involves complex algorithms and mathematical tools.

Web-Based Question Answering
• Definition: Goes beyond simple web search to answer complete questions.
• Examples of Questions: "What does 'divergent' mean?", "What year was Abraham Lincoln born?"
• Capabilities: Answers range from simple definitions to complex inferences and synthesis of information.
Introduction-Human Languages
Knowledge in Speech and Language Processing
Distinctive Feature: Language processing applications use knowledge of language, setting them
apart from other data processing systems.
Example: Unix wc program - counts bytes and lines without language knowledge, but counting
words requires understanding what constitutes a word
Simple vs. Sophisticated: Basic systems like wc contrast with complex systems such as
conversational agents and machine translation, which require extensive language knowledge.

Phonetics and Phonology


Definition: Knowledge about linguistic sounds and their acoustic realization.
Application: Speech recognition and speech synthesis.
Example: HAL must recognize and produce words accurately, understanding sounds and
pronunciation.
Recognizing the sound pattern of "door".
Introduction-Human Languages
Morphology
Definition: Knowledge of the meaningful components of words.
Example:
Recognizing contractions: "I’m" for "I am" and "can’t" for "cannot".
Recognizing plurals: "doors" as the plural form of "door".
Syntax
Definition: Knowledge of the structural relationships between words.
Example: Correctly ordering words in a sentence:
Incorrect: "I’m I do, sorry that afraid Dave I’m can’t."
Correct: "I’m sorry, Dave. I’m afraid I can’t do that."
Lexical and Compositional Semantics
Lexical Semantics: Meaning of individual words.
Example: Understanding "export" and "silk".
Compositional Semantics: Meaning derived from combining words in sentences.
Example: Understanding the phrase "by the end of the 18th century" as a temporal endpoint.
Introduction-Human Languages
Pragmatics and Dialogue Knowledge
Pragmatics: Relationship of meaning to speaker’s goals and intentions.
Example: HAL's responses to Dave:
Polite refusal: "I’m sorry, Dave. I’m afraid I can’t do that."
Direct refusal: "No, I won’t open the door."
Discourse Knowledge
Definition: Knowledge about linguistic units larger than a single utterance.
Application: Coreference resolution.
Example:
Question: "How many states were in the United States that year?"
Requires context from previous question about the year Lincoln was born.
Introduction-Human Languages
Ambiguity
Ambiguity occurs when there are multiple alternative linguistic structures that can be built for an
input. Ambiguity is a common challenge in speech and language processing. Various models and
algorithms are used to resolve different types of ambiguities.

Example of Ambiguity
• Sentence: "I made her duck."
• Multiple Meanings:
• (1.5) I cooked waterfowl for her.
• (1.6) I cooked waterfowl belonging to her.
• (1.7) I created the (plaster?) duck she owns.
• (1.8) I caused her to quickly lower her head or body.
• (1.9) I waved my magic wand and turned her into undifferentiated waterfowl.

Causes of Ambiguity
• Morphological/Syntactic Ambiguity:
• Duck: verb or noun.
• Her: dative pronoun or possessive pronoun.
• Semantic Ambiguity:
• Make: create or cook.
• Syntactic Ambiguity:
• Make can be transitive, ditransitive, or take a direct object and a verb.

Deeper Ambiguities in Spoken Sentences
• Phonetic Ambiguity:
• "I" vs. "eye"
• "maid" vs. "made"
Introduction-Human Languages
Ambiguity

Resolving Ambiguity
• Part-of-Speech Tagging: Deciding whether "duck" is a verb or a noun.
• Word Sense Disambiguation: Deciding whether "make" means "create" or "cook".
• Syntactic Disambiguation:
• Example: Determining whether "her" and "duck" are part of the same entity (1.5, 1.8) or different entities (1.6).
• Method: Probabilistic parsing.
• Text-to-Speech Synthesis:
• Example: Pronouncing "lead" as in "lead pipe" vs. "lead me on".
• Other Types of Ambiguity:
• Speech Act Interpretation: Determining whether a sentence is a statement or a question.
Models
Models and Algorithms
• Various types of linguistic knowledge can be captured using a small number of formal models
and algorithms.
• Main Models are State machines, rule systems, logic, probabilistic models, vector-space models.
• Formal models and algorithms are essential in speech and language processing.
• These tools help capture and resolve linguistic ambiguities.
Models
Models and Algorithms

State Machines
• Definition: Formal models consisting of states, transitions, and input representations.
• Types: Deterministic and non-deterministic finite-state automata, finite-state transducers.
• Applications: Used for knowledge of phonology, morphology, and syntax.

Formal Rule Systems
• Description: Declarative counterparts to state machines.
• Examples: Regular grammars, regular relations, context-free grammars, feature-augmented grammars.
• Applications: Also used for phonology, morphology, and syntax.

Logic-Based Models
• Definition: Models based on first-order logic (predicate calculus), lambda calculus, feature structures, and semantic primitives.
• Applications: Traditionally used for semantics and pragmatics.
• Recent Focus: Shift towards non-logical lexical semantics for robustness.

Probabilistic Models
• Importance: Crucial for all types of linguistic knowledge.
• Examples: Weighted automata, Markov models, hidden Markov models (HMMs).
• Applications: Part-of-speech tagging, speech recognition, dialogue understanding, text-to-speech, machine translation.
Models
Models and Algorithms

Vector-Space Models
• Definition: Based on linear algebra; used for information retrieval and word meanings.
• Applications: Underlie many treatments of word meanings and information retrieval systems.

Search Algorithms
• Description: State-space search, dynamic programming, heuristic search (best-first, A* search).
• Applications: Speech recognition, parsing, machine translation.

Machine Learning Tools
• Classifiers: Decision trees, support vector machines, Gaussian mixture models, logistic regression.
• Sequence Models: Hidden Markov models, maximum-entropy Markov models, conditional random fields.
• Applications: Spelling correction, part-of-speech tagging, named entity recognition.

Methodological Tools & Example Applications
• Training and Test Sets: Use of distinct sets for training and evaluation.
• Statistical Techniques: Cross-validation for model evaluation.
• Evaluation Metrics: Careful evaluation of system performance.
• Example Applications: Spelling correction, speech recognition, machine translation.
Regular Expressions
Introduction
• Regular Expression is a language for specifying text search strings.
• Widely used in UNIX, Microsoft Word, and web search engines.
• An important theoretical tool throughout computer science and linguistics.
• First developed by Kleene in 1956
• RE is a formula in a special language that is used for specifying simple classes of strings.
• A string is a sequence of symbols; for the purposes of most text-based search techniques, a string is any sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).
• Facilitates text search and manipulation in numerous applications.
• Regular expressions serve as an algebraic notation for characterizing a set of strings.
• Used to specify search strings and define a language formally.
• REs require a pattern to search for and a corpus of texts to search through.
• Standardized syntax across various platforms.
Regular Expressions
Basic Regular Expression Patterns
• The simplest regular expression is a sequence of simple characters, e.g. /word/.
• REs are case sensitive: lowercase /s/ is distinct from uppercase /S/ (/s/ matches a lowercase s but not an uppercase S).
• We can solve this problem with the square brackets [ and ]: the string of characters inside the brackets specifies a disjunction of characters to match.
• The regular expression /[1234567890]/ specifies any single digit.
• /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/ any capital letter
Regular Expressions
Basic Regular Expression Patterns
• The brackets can be used with the dash (-) to specify any one character in a range, e.g. /[0-9]/ for any digit or /[a-z]/ for any lowercase letter.
• The caret ^ specifies what a single character cannot be: if ^ is the first symbol inside the brackets, the pattern is negated; otherwise it is treated as an ordinary caret character.
• The question mark /?/ means "the preceding character or nothing"; we can think of it as meaning "zero or one instances of the previous character".
Regular Expressions
Basic Regular Expression Patterns
• The Kleene Star (*) Means "zero or more occurrences" of the previous character or regular
expression.
• Example: /a*/ matches "", "a", "aa", "aaa", etc.
• Usage: In the language of sheep, /baa*!/ matches "baa!", "baaa!", "baaaa!", etc.
• The Kleene Plus (+) Means "one or more occurrences" of the previous character
• Example: /a+/ matches "a", "aa", "aaa", etc.
• Usage: In the language of sheep, /baa+!/ matches "baa!", "baaa!", "baaaa!", etc.
• Wildcard Character (.) Matches any single character except a newline
• Example: beg.n matches "begin", "begun", "beg'n".
• Combining Wildcards with Kleene Star ( .*) means "any sequence of characters".
• Example: /aardvark.*aardvark/ matches lines where "aardvark" appears twice.
• Anchors in Regular Expressions Caret (^): Matches the start of a line.
• Example: ^The matches "The" at the start of a line.
Regular Expressions
Basic Regular Expression Patterns
• Dollar Sign ($): Matches the end of a line.
• Example: dog\.$ matches "dog." at the end of a line.
• Word Boundary (\b): Matches the start or end of a word
• Example: \bthe\b matches "the" but not "other".
• Non-Word Boundary (\B): Matches any position that is not a word boundary
• Example: \B99\B matches the "99" embedded in "1999" but not the standalone "99" in "99 bottles".

• Example 1: Matching Prices


• Regex: /[0-9]+/ matches sequences of digits.
• Example 2: Matching Specific Patterns
• Regex: /^The dog\.$/ matches the exact phrase "The dog." at the start and end of a line.
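The basic patterns above can be checked with Python's `re` module; the sample strings below are invented for illustration.

```python
import re

# Character classes, ranges, and negation
assert re.search(r"[0-9]", "plays 5 sets")            # any single digit
assert re.search(r"[A-Z]", "see Figure 3")            # any capital letter
assert not re.search(r"[^a-zA-Z]", "OnlyLetters")     # ^ inside [] negates

# Optionality and the Kleene operators
assert re.fullmatch(r"colou?r", "color")              # ? : zero or one 'u'
assert re.fullmatch(r"baa*!", "baaaa!")               # * : zero or more
assert not re.fullmatch(r"ba+!", "b!")                # + : one or more
assert re.fullmatch(r"beg.n", "begun")                # . : any one character

# Anchors and boundaries
assert re.search(r"^The", "The dog barked")           # start of line
assert re.search(r"dog\.$", "It was the dog.")        # end of line
assert re.search(r"\bthe\b", "in the end")            # word boundary
assert not re.search(r"\bthe\b", "other")             # no match inside a word
```

Each assertion passes, so running the script silently succeeds.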
Regular Expressions
Disjunction, Grouping, and Precedence operators
• Disjunction Operator (|) Matches either of the specified patterns.
• Example: /cat|dog/ matches either "cat" or "dog".
• Use Case: Searching for texts about pets, particularly interested in cats and dogs
• Grouping with Parentheses () ensures that operators apply to specific parts of the pattern.
• Example: /gupp(y|ies)/ matches both "guppy" and "guppies".
• Operator Precedence Hierarchy: Highest to Lowest Precedence:
• Parentheses ()
• Counters * + ? {}
• Sequences and Anchors ^ $
• Disjunction |
• Example: /the*/ matches "theeee" but not "thethe"; to match repetitions of the whole sequence, use /(the)*/.
• Example: Matching repeated instances.
• Incorrect Pattern: /Column [0-9]+ */ matches only a single column followed by spaces.
• Correct Pattern: /(Column [0-9]+ *)*/ matches repeated instances of "Column" followed by numbers
and spaces.
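The precedence rules above can be demonstrated in Python; the "Column" line is an invented sample.

```python
import re

# Counters bind tighter than sequence: /the*/ repeats only the final 'e'
assert re.fullmatch(r"the*", "theeee")
assert not re.fullmatch(r"the*", "thethe")
# Parentheses make the counter apply to the whole sequence
assert re.fullmatch(r"(the)*", "thethe")

# Disjunction has the lowest precedence; grouping restricts its scope
assert re.fullmatch(r"gupp(y|ies)", "guppies")
assert re.fullmatch(r"cat|dog", "dog")

# Repeating a multi-token unit needs the group around the whole unit
row = "Column 1 Column 2 Column 3 "
assert re.fullmatch(r"(Column [0-9]+ *)*", row)
```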
Regular Expressions
Simple Example
• Suppose we wanted to write a RE to find cases of the English article the.
• Initial Pattern: /the/
• Problem: Misses "The" and matches substrings within other words.
• Improved Pattern: /\b[tT]he\b/ matches "the" with word boundaries.
• Advanced Pattern: /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/ requires a non-letter (or the start or end of the line) on either side of "the".
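The progression of patterns for finding "the" can be compared directly in Python (the sample sentence is invented):

```python
import re

text = "The other day the dog theorized."

# The naive pattern overmatches: it finds "the" inside "other" and
# "theorized", and misses the capitalized "The".
assert len(re.findall(r"the", text)) == 3

# Word boundaries plus a character class fix both problems.
assert re.findall(r"\b[tT]he\b", text) == ["The", "the"]
```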
A complex Example
• Suppose the user might want to buy a PC: any machine with at least 500 MHz, 32 Gb of disk space, a Compaq or a Mac, for less than $999.99.
• Price Pattern: \b$[0-9]+(\.[0-9][0-9])?\b
• Processor Speed: \b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b
• Disk Space: \b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b
• Operating Systems:
• /\b(Windows *(Vista|XP)?)\b/
• /\b(Mac|Macintosh|Apple|OS X)\b/
Regular Expressions
Advanced Operators
Substitution: Replace matched patterns.
Example: s/colour/color/
Memory: Refer to matched groups.
Example: s/([0-9]+)/<\1>/ adds brackets around digits.
Advanced Example: /the (.*)er they were, the \1er they will be/ uses the memory reference \1 to ensure the same word appears in both positions.
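Substitution and memory can be exercised with Python's `re.sub` and `re.search`; the replacement syntax `\1` works the same way as in the examples above.

```python
import re

# Substitution
assert re.sub(r"colour", "color", "my colour") == "my color"

# Memory: \1 in the replacement refers back to capture group 1
assert re.sub(r"([0-9]+)", r"<\1>", "pi is 3 point 14") == "pi is <3> point <14>"

# A backreference in the pattern itself forces the two slots to match
pat = r"the (.*)er they were, the \1er they will be"
assert re.search(pat, "the bigger they were, the bigger they will be")
assert not re.search(pat, "the bigger they were, the faster they will be")
```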
Finite-State Automata
Regular Expressions, FSAs, and Regular Languages
Regular Expressions (RE): Describe patterns for text searching.
Example: /cat|dog/ matches "cat" or "dog".
Finite-State Automata (FSA): Theoretical models that can implement REs.
Any RE can be implemented as an FSA (excluding those with memory features).
Regular Languages: Formal languages describable by REs or FSAs.
Interconnections:
• Any RE can be converted into an FSA and vice versa.
• Regular languages can be described by REs, FSAs, and regular grammars
Finite-State Automata
Using an FSA to Recognize Sheeptalk
Let’s begin with the “sheep language”
Defined by strings: baa!, baaa!, baaaa!, etc.
Regular Expression: /baa+!/
• A finite-state automaton for talking sheep.
• If the machine never gets to the final state, either because it runs out of input, or it gets some input that doesn’t match an arc, or it just happens to get stuck in some non-final state, we say the machine rejects or fails to accept the input.
Finite-State Automata
Using an FSA to Recognize Sheeptalk
• More formally, a finite automaton is defined by the following five parameters:
• Q = q0, q1, . . . , qN−1: a finite set of N states
• Σ: a finite input alphabet of symbols
• q0: the start state
• F: the set of final states, F ⊆ Q
• δ(q, i): the transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q, i) returns a new state q′ ∈ Q. δ is thus a function from Q × Σ to Q.
• For the sheeptalk automaton, Q = {q0, q1, q2, q3, q4}, Σ = {a, b, !}, F = {q4}, and δ(q, i) is defined by the transition table above.
Finite-State Automata
The algorithm D-RECOGNIZE for “deterministic recognizer”.
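The deterministic recognizer can be sketched in Python for the sheeptalk automaton /baa+!/; the dictionary encoding of the transition table is an illustrative assumption of this sketch, with states q0–q4 written as 0–4.

```python
# Transition table for the sheeptalk DFA: (state, symbol) -> next state
delta = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of extra a's
    (3, "!"): 4,
}
FINAL = {4}

def d_recognize(tape: str) -> bool:
    """Return True iff the DFA accepts the input tape."""
    state = 0
    for symbol in tape:
        if (state, symbol) not in delta:
            return False           # no matching arc: reject
        state = delta[(state, symbol)]
    return state in FINAL          # accept only if in a final state at end of tape

assert d_recognize("baa!") and d_recognize("baaaa!")
assert not d_recognize("ba!")      # too few a's: no arc for '!' at state 2
assert not d_recognize("baa")      # input exhausted in a non-final state
```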
Finite-State Automata
Formal Languages
• A set of strings made of symbols from a finite alphabet.
• Alphabet for sheeptalk: Σ = {a, b, !}
• Model m: An FSA defining a language.
• Language L(m): The set of strings recognized or generated by m.
• Example for sheeptalk: L(m) = {baa!, baaa!, baaaa!, baaaaa!, . . .}
• Utility of Automata:
• Expresses infinite sets in a closed form.
• Can define formal languages for computational models.
• Formal vs. Natural Languages:
• Formal Language: Used in models like phonology, morphology, syntax.
• Natural Language: Real languages spoken by people.
• Generative Grammar: Defines a formal language by generating all possible strings.
Finite-State Automata
Example:
Construct a formal language that models the subset of English
consisting of phrases like
ten cents, three dollars, one dollar thirty-five cents,
and so on.
An FSA for the words for English numbers 1–99.
FSA for the simple dollars and cents
Finite-State Automata
Non-Deterministic FSAs
• FSAs where multiple transitions are possible for a given state and input.
• Example NFSA: Self-loop on state 2 instead of state 3. At state 2, on seeing an ‘a’ the machine can either stay in state 2 or move to state 3.
• NFSA Characteristics:
• Multiple paths for the same input. Decision points introduce non-determinism.
• Comparison with DFSA:
• DFSA: Deterministic behavior based on current state and input.
• NFSA: Non-deterministic behavior, choices at certain states.
• Epsilon (ε)-Transitions: Transitions that do not consume any input symbol.
• Example NFSA with ε-Transition:
• Moves from state 3 to state 2 without consuming input.
• Can either: Follow the ε-transition to state 2, or follow the ! arc to state 4.
• Implications: Adds another layer of non-determinism.
• Flexibility in state transitions without input consumption
Finite-State Automata
Using an NFSA to Accept Strings
• The main challenge with NFSAs is multiple Choices, Potential to follow the wrong arc, leading
to rejection when the string should be accepted.
• Solutions to Non-Determinism:
• Backup: Mark position and state, backtrack if necessary.
• Look-ahead: Check future input to decide the path.
• Parallelism: Explore all paths simultaneously.
• Backup Approach:
• Make tentative choices, backtrack if needed.
• Remember alternatives at choice points.
• Search-State: Combination of current state and tape position.
• Transition Table Adjustments:
• Epsilon (ε)-Column: Represents transitions without consuming input.
• Lists of Nodes: For states with multiple transitions from the same input.
Finite-State Automata
ND-RECOGNIZE Algorithm for NFSAs
• function ND-RECOGNIZE(tape, machine) returns accept or reject
• agenda ← {(Initial state of machine, beginning of tape)}
• current-search-state ← NEXT(agenda)
• loop
• if ACCEPT-STATE?(current-search-state) returns true then
• return accept
• else
• agenda ← agenda ∪ GENERATE-NEW-STATES(current-search-state)
• if agenda is empty then
• return reject
• else
• current-search-state ← NEXT(agenda)
• end
Finite-State Automata
ND-RECOGNIZE Algorithm for NFSAs
• function GENERATE-NEW-STATES(current-search-state) returns a set of search-states
• current-node ← the node the current search-state is in
• index ← the point on the tape the current search-state is looking at
• return a list of search states from the transition table as follows:
• (transition-table[current-node, ε], index)
• ∪ (transition-table[current-node, tape[index]], index + 1)
• function ACCEPT-STATE?(search-state) returns true or false
• current-node ← the node the search-state is in
• index ← the point on the tape the search-state is looking at
• if index is at the end of the tape and current-node is an accept state of machine then
• return true
• else
• return false
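The pseudocode above can be sketched as a short Python function; the explicit agenda of (node, tape-index) search states follows the algorithm, while the NFSA encoding (lists of successor states, ε-transitions omitted) is an assumption of this sketch.

```python
# Sheeptalk NFSA with a nondeterministic choice on 'a' at state 2
trans = {
    (0, "b"): [1],
    (1, "a"): [2],
    (2, "a"): [2, 3],   # two arcs for the same input: nondeterminism
    (3, "!"): [4],
}
FINAL = {4}

def nd_recognize(tape: str) -> bool:
    agenda = [(0, 0)]                       # (current node, tape position)
    while agenda:
        node, i = agenda.pop()              # stack agenda -> depth-first order
        if i == len(tape) and node in FINAL:
            return True                     # ACCEPT-STATE? check
        if i < len(tape):
            for nxt in trans.get((node, tape[i]), []):
                agenda.append((nxt, i + 1)) # GENERATE-NEW-STATES
    return False                            # agenda exhausted: reject

assert nd_recognize("baaa!")
assert not nd_recognize("baa")              # no path reaches a final state
```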
Finite-State Automata
Example
Finite-State Automata
Recognition as Search: Depth-First and Breadth-First Search Strategies
• State Ordering in ND-RECOGNIZE:
• Undefined Order: Unexplored states are added to the agenda as they are created.
• NEXT Function: Returns an unexplored state from the agenda.
• Depth-First Search (DFS):
• Implementation: Agenda as a stack (LIFO).
• Characteristics: Follows new leads deeply.
• Backtracks upon hitting dead ends.
• Pitfall: Can enter infinite loops.
• Example Trace:
• Input: baaa!
• DFS explores one path deeply before backtracking.
• Breadth-First Search (BFS):
• Implementation: Agenda as a queue (FIFO).
• Characteristics: Explores all possible choices at each level.
• Expands one ply of the search tree at a time.
• Pitfall: Requires large memory for large state-spaces.
• Example Trace:
• Input: baaa!
• BFS explores all paths at each level before moving deeper.
Finite-State Automata
Example

BFS DFS
Finite-State Automata
Comparing Depth-First and Breadth-First Search:
Depth-First Search (DFS):
Pros:
Efficient use of memory.
Suitable for deep, narrow search spaces.
Cons:
Risk of infinite loops.
May explore unfruitful paths deeply.
Breadth-First Search (BFS):
Pros:
Guaranteed to find the shortest path.
Explores all paths uniformly.
Cons:
High memory usage.
May not be practical for large state spaces.
Choosing a Strategy:
Small Problems: Either DFS or BFS may be adequate.
Large Problems: Consider advanced techniques like dynamic programming or A*
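The only change between the two strategies in ND-RECOGNIZE is the agenda policy, which can be shown in a few lines (the search-state names are hypothetical):

```python
from collections import deque

# A stack (LIFO) gives depth-first search; a queue (FIFO) gives breadth-first.
def next_state(agenda: deque, strategy: str):
    if strategy == "dfs":
        return agenda.pop()        # LIFO: pursue the newest search state
    return agenda.popleft()        # FIFO: expand one ply at a time

# Hypothetical search states, added in the order s1, s2, s3
agenda = deque(["s1", "s2", "s3"])
assert next_state(deque(agenda), "dfs") == "s3"   # newest first
assert next_state(deque(agenda), "bfs") == "s1"   # oldest first
```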
Finite-State Automata
Relating Deterministic and Non-Deterministic Automata:
• For any NFSA, there exists an equivalent DFSA.
• Conversion Algorithm which Converts an NFSA to a DFSA, potentially increasing the number
of states exponentially.
• NFSAs can follow multiple paths simultaneously.
• If states qa and qb are reached by the same input, they form a new state qab.
• The new DFSA can have up to 2^N states, where N is the number of states in the original NFSA.
• Conversion Process from NFSA to DFSA
• Initial State: Start with the initial state of the NFSA.
• State Grouping: For each input, group reachable states into a new state.
• State Transition: Define transitions for these new grouped states.
• Repeat: Continue for every possible input and state group until no new states are formed.
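The conversion process above is the classic subset construction; a hedged sketch follows (ε-transitions omitted for brevity, and the function name nfsa_to_dfsa is illustrative). Each DFSA state is a frozenset grouping the NFSA states reachable on the same input.

```python
def nfsa_to_dfsa(trans, start, finals, alphabet):
    start_set = frozenset([start])
    dfa_trans, todo, seen = {}, [start_set], {start_set}
    while todo:
        group = todo.pop()
        for sym in alphabet:
            # Group all NFSA states reachable from this group on sym
            nxt = frozenset(q for s in group for q in trans.get((s, sym), []))
            if not nxt:
                continue
            dfa_trans[(group, sym)] = nxt
            if nxt not in seen:        # a new grouped state was formed
                seen.add(nxt)
                todo.append(nxt)
    dfa_finals = {g for g in seen if g & finals}   # any member final -> final
    return dfa_trans, start_set, dfa_finals

# Nondeterministic sheeptalk: two 'a' arcs from state 2
trans = {(0, "b"): [1], (1, "a"): [2], (2, "a"): [2, 3], (3, "!"): [4]}
dfa, q0, fin = nfsa_to_dfsa(trans, 0, {4}, "ba!")
assert frozenset({2, 3}) in {state for (state, _) in dfa}   # merged state q23
assert fin == {frozenset({4})}
```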
Morphology
• Regular Expressions are used for finding both "woodchuck" and "woodchucks" using a single
search string.
• The main challenges occur with plural forms like "fox" → "foxes" and "peccary" → "peccaries", irregular plurals like "goose" → "geese", and "fish" remaining unchanged.
• It takes two kinds of knowledge to correctly search for singulars and plurals of these forms.
• Orthographic rules tell us that English words ending in -y are pluralized by changing the -y to -
i- and adding an -es.
• Morphological rules tell us that fish has a null plural, and that the plural of goose is formed by
changing the vowel.
• Morphological Parsing: Recognizing and breaking down words into morphemes.
• Example: Parsing "foxes" into "fox" and "-es"; "going" into VERB-go + GERUND-ing.
• Applications:
• Web Search: Handling morphologically complex languages like Russian.
• Part-of-Speech Tagging: Crucial for accurate tagging in languages with rich morphology.
• Spell-Checking: Essential for creating large dictionaries.
• Machine Translation: Translating inflected forms accurately.
Morphology
• To solve the morphological parsing problem, why couldn’t we just store all the plural forms of
English nouns and -ing forms of English verbs in a dictionary and do parsing by lookup?
• This is not enough; the key techniques for morphological parsing are:
• Finite-State Transducers (FSTs) are used for morphological parsing throughout speech and
language processing.
• Stemming: Stripping off word endings.
• Example: "foxes" to "fox".
• Lemmatization: Mapping words to their root forms.
• Example: "sang", "sung", "sings" to "sing".
• Tokenization: Separating words from running text.
• Challenges: Handling multi-word expressions like "New York" and contractions like "I'm".
• Minimum Edit Distance: Measuring orthographic similarity between words.
• Applications: Important for spell-checking and comparing word forms
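Minimum edit distance, mentioned above, is typically computed with dynamic programming. A standard unit-cost sketch follows (some formulations instead charge 2 for substitution, which changes the numbers but not the algorithm):

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit costs for insert, delete, substitute."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions from source
    for j in range(m + 1):
        d[0][j] = j                      # j insertions into source
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or copy
    return d[n][m]

assert min_edit_distance("graffe", "giraffe") == 1   # one insertion
```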
Morphology
• Understanding Morphemes
• Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.
• A morpheme is often defined as the minimal meaning-bearing unit in a language
• Examples:
• Fox: Single morpheme.
• Cats: Two morphemes (cat + -s).
• Types of Morphemes
• Stems: Main morphemes supplying the primary meaning.
• Affixes: Add additional meanings; further divided into prefixes, suffixes, infixes, and
circumfixes.
• Examples:
• Prefix: un- in "unbuckle".
• Suffix: -s in "eats".
• Circumfix: ge-...-t in German "gesagt".
• Infix (rare in English): the plural -s inside compounds like editor-in-chief → editors-in-chief.
Morphology
• Understanding Morphemes
• Multiple Affixes: Words with Multiple Affixes
• Examples:
• Rewrites: Prefix (re-) + Stem (write) + Suffix (-s).
• Unbelievably: Stem (believe) + Prefix (un-) + Suffixes (-able, -ly).
• Methods of Combining Morphemes
• Inflection: Combining stem with grammatical morphemes, resulting in the same word class
• Examples:
• Plural: cat + -s = cats.
• Past tense: walk + -ed = walked.
• Function: Often fills syntactic functions like agreement.
• Derivation: Combination of a word stem with a grammatical morpheme resulting in a different
word class
• Examples:
• Computerize + -ation = Computerization.
• Characteristics: Often has a meaning that is harder to predict exactly.
Morphology
• Methods of Combining Morphemes
• Compounding: Combination of multiple word stems.
• Examples:
• Doghouse: Dog + House.
• Bookshelf: Book + shelf
• Cliticization: Combination of a word stem with a clitic.
• Examples:
• I’ve: I + ’ve.
• l’opéra: le + opéra (French).
Inflectional Morphology
• Inflection: Modifying words to express different grammatical categories
• English has a relatively simple inflectional system.
• English nouns have only two kinds of inflection
• Plural Inflection:
• Regular plural suffix: -s or –es
• Examples:
• Regular Nouns: cat → cats, thrush → thrushes
• Irregular Nouns: mouse → mice, ox → oxen
• Possessive Inflection:
• Apostrophe + -s for singular and irregular plural nouns.
• Examples:
• Singular: llama → llama’s
• Irregular Plural: children → children’s
• Lone apostrophe for regular plural nouns.
• Example: llamas → llamas’
Inflectional Morphology
• English verbal inflection is more complicated than nominal inflection.
• First, English has three kinds of verbs.
• main verbs, (eat, sleep, impeach),
• modal verbs (can, will, should),
• primary verbs (be, have, do)
• we will mostly be concerned with the main and primary verbs, because it is these that have
inflectional endings.
• Of these verbs, a large class are regular; that is to say, all verbs of this class have the same endings marking the same functions. These regular verbs (e.g., walk or inspect) have four morphological forms, as follows.
Inflectional Morphology
• The irregular verbs are those that have some more or less idiosyncratic forms of inflection.
• Irregular verbs in English often have five different forms, but can have as many as eight (e.g., the
verb be) or as few as three (e.g. cut or hit).
• The table below shows some sample irregular forms

• Spelling Changes in Inflections


• Doubling Consonants:
• beg → begging, begged
• picnic → picnicking, picnicked
• Silent -e Deletion:
• merge → merging, merged
• -s Spelling Rules:
• toss → tosses
• waltz → waltzes
• try → tries
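The -s spelling rules above can be sketched as a hypothetical toy pluralizer; the function `pluralize` and its rule set are illustrative, not a complete account of English plurals (irregulars like goose → geese are deliberately out of scope).

```python
import re

def pluralize(noun: str) -> str:
    """Toy regular-plural speller: -es after sibilants, -y -> -ies, else -s."""
    if re.search(r"(s|z|sh|ch|x)$", noun):
        return noun + "es"            # toss -> tosses, waltz -> waltzes
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"      # try -> tries (consonant + y)
    return noun + "s"                 # default: just add -s

assert pluralize("toss") == "tosses"
assert pluralize("waltz") == "waltzes"
assert pluralize("try") == "tries"
assert pluralize("cat") == "cats"
```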
Derivational Morphology
• English derivation is complex compared to its inflectional system.
• Derivation involves combining a word stem with a grammatical morpheme.
• This process often results in a word of a different class with unpredictable meaning.
• Inflection changes grammatical function; derivation creates new words.
• Nominalization: Formation of new nouns from verbs or adjectives.

• Adjectives can also be derived from nouns and verbs. Here are examples of a few suffixes
deriving adjectives from nouns or verbs
Derivational Morphology
• Derivation in English is more complex than inflection for a number of reasons.
• Less productive compared to inflection.
• Not all suffixes can be added to every base word.
• Subtle and complex meaning differences among suffixes.
• Examples:
• -ation: can be added to many verbs but not all (e.g., *eatation, *spellation).
• Sincerity vs. sincereness: subtle meaning differences.
• Comparison with Inflection
• Inflection:
• Regular and predictable changes.
• Limited number of affixes.
• Derivation:
• Irregular and less predictable.
• Complex meaning changes.
Finite-State Morphological Parsing
• Our goal will be to take input forms like those in the first and third columns of the figure and produce output forms like those in the second and fourth columns.
• The second column contains the stem of each word as well as assorted morphological features. These features specify additional information about the stem. For example, the feature +N means that the word is a noun; +Sg means it is singular, +Pl that it is plural.
Finite-State Morphological Parsing
• Note that some of the input forms (like caught, goose, canto, or vino) will be ambiguous
between different morphological parses.
• For now, we will consider the goal of morphological parsing merely to list all possible parses.
• In order to build a morphological parser, we’ll need at least the following
• lexicon: the list of stems and affixes, together with basic information about them (whether a
stem is a Noun stem or a Verb stem, etc.)
• morphotactics: the model of morpheme ordering that explains which classes of morphemes can
follow other classes of morphemes inside a word. For example, the fact that the English plural
morpheme follows the noun rather than preceding it is a morphotactic fact.
• orthographic rules: these spelling rules are used to model the changes that occur in a word,
usually when two morphemes combine (e.g., the y to ie spelling rule discussed above that
changes city + -s to cities rather than citys).
Finite-State Morphological Parsing

Finite State Transducer


Finite State Automata
Porter Stemmer
• The Porter Stemmer is a popular algorithm used in natural language processing for stemming, the process of reducing words to their base or root form.
• Developed by Martin Porter in 1980, it is designed to remove common morphological and
inflectional endings from words in English.
• The Porter Stemmer is widely used in information retrieval and text mining applications.
• Key Concepts
• Stemming: The process of reducing words to their root form.
• For example, "running" and "runs" are both reduced to "run" (irregular forms like "ran" are not handled by suffix stripping).
• Suffix Removal: The algorithm removes suffixes based on a set of rules to produce the stem of
the word.
• Rules and Steps: The Porter Stemmer operates in a series of steps, each applying specific rules
for suffix removal.
Porter Stemmer
• Advantages and Limitations
• Advantages:
• Efficiency: The Porter Stemmer is fast and simple to implement.
• Commonly Used: It is widely used in many NLP applications and frameworks.
• Limitations:
• Aggressive Stemming: Sometimes the Porter Stemmer may be too aggressive, removing parts of
the word that are actually meaningful.
• Language Specific: The Porter Stemmer is specifically designed for English and may not work
well with other languages.
• The Porter Stemmer is a fundamental tool in text processing and information retrieval.
Porter Stemmer
• Example: Stemming the word "running"
• Steps in the Porter Stemmer
• Step 1a: Remove plural suffixes:
• Rule: SSES → SS
• Rule: IES → I
• Rule: SS → SS
• Rule: S → ε (delete)
• "running" does not match any rule here, so it remains "running."
• Step 1b: Remove suffix -ed or -ing if the stem has a vowel:
• Rule: (m>0) EED → EE
• Rule: (*v*) ED → ε
• Rule: (*v*) ING → ε
• "running" matches (*v*) ING, giving "runn"; the Step 1b cleanup rules then reduce the doubled consonant, giving "run".
Porter Stemmer
• Step 1c: Turn terminal y to i if there is another vowel in the stem:
• Rule: (*v*) Y → I
• "run" does not end in "y", so no change.
• Step 2: Map double suffixes to single ones:
• Rule: ATIONAL → ATE
• Rule: TIONAL → TION
• And many more...
• "run" does not match any rule here.
• Step 3: Deal with -ic-, -full, -ness etc.:
• Rule: ICATE → IC
• Rule: ATIVE → ε (delete)
• Rule: ALIZE → AL
• And more...
• "run" does not match any rule here.
• Using NLTK’s implementation:

from nltk.stem import PorterStemmer

# Initialize the stemmer
porter = PorterStemmer()

# Example words
words = ["running", "runner", "ran", "runs", "easily", "fairly"]

# Stem each word
stems = [porter.stem(word) for word in words]
print(stems)
Porter Stemmer
• Step 4: Remove -ant, -ence, -ment etc.:
• Rule: AL → ε (delete)
• Rule: ANCE → ε
• Rule: ENCE → ε
• And more...
• "run" does not match any rule here.
• Step 5a: Remove -e if the measure m > 1, or if m = 1 and the stem does not end in cvc (not *o):
• Rule: (m>1) E → ε
• Rule: (m=1 and not *o) E → ε
• "run" does not end in "e", so no change.
• Step 5b: Reduce a terminal double l to a single l if m > 1:
• Rule: (m>1 and *d and *L) → single letter
• "run" does not end in "ll", so no change.
• Final Result
• The word "running" is reduced to the stem "run" using the Porter Stemmer.
