Chapter Three


Words and Transducers

Survey of English Morphology


• Morphology is the study of the way words are built up
from smaller meaning-bearing units: morphemes.
• A morpheme is often defined as the minimal meaning-
bearing unit in a language.
– Main Entry: mor·pheme
Pronunciation: 'mor-"fEm
Function: noun
Etymology: French morphème, from Greek morphE form
: a distinctive collocation of phonemes (as the free form pin
or the bound form -s of pins) having no smaller meaningful
parts
Survey of English Morphology
• Example:
– fox consists of a single morpheme: fox.
– cats consists of two morphemes: cat and –s.
– Two broad classes of morphemes:
1. Stems - main morpheme of a word, and
2. Affixes – add additional meaning to the word.
i. Prefixes – preceding the stem: unbuckle
ii. Suffixes – following the stem: eats
iii. Infixes – inserted in the stem: humingi (Philippine language
Tagalog)
iv. Circumfixes – precede and follow the stem. gesagt (German
past participle of sagen)
Survey of English Morphology
• A word can have more than one affix:
– rewrites:
• Prefix - re
• Stem - write
• Suffix - s
– unbelievably:
• Prefix - un
• Stem - believe
• Suffix - able, ly
• English does not tend to stack more than four or five affixes
• Turkish can have words with nine or ten affixes – languages like Turkish are
called agglutinative languages.
– Main Entry: ag·glu·ti·na·tive
– Pronunciation: \ə-ˈglü-tən-ˌā-tiv, -ə-tiv\
– Function: adjective
– Date: 1634
– 1 : adhesive
2 : characterized by linguistic agglutination
Survey of English Morphology
• There are many ways to combine morphemes to create a word.
• Four methods are common and play important role in speech and language
processing:

1. Inflection
• Combination of a word stem with a grammatical morpheme, usually
resulting in a word of the same class as the original stem, and usually
filling some syntactic function like agreement.
• Example:
– -s: plural of nouns
– -ed: past tense of verbs.
Survey of English Morphology
2. Derivation
• Combination of word stem with a grammatical morpheme, usually
resulting in a word of a different class, often with a meaning hard to
predict.
• Example:
– Computerize – verb
– Computerization – noun.
3. Compounding
• Combination of multiple word stems together.
• Example:
– Doghouse: dog + house.
4. Cliticization
• Combination of a word stem with a clitic. A clitic is a morpheme that
acts syntactically like a word, but is reduced in form and attached
(phonologically and sometimes orthographically) to another word.
• Example:
– I’ve = I + ‘ve = I + have
Inflectional Morphology
• English has a relatively simple
inflectional system; only
– Nouns
– Verbs
– Adjectives (sometimes)
• Number of possible inflectional affixes is
quite small.
Inflectional Morphology: Nouns
• Nouns (English):
– Plural
– Possessive
– Many (but not all) nouns can either
• appear in the bare stem (singular) form, or
• take a plural suffix

            Regular Nouns       Irregular Nouns

Singular    cat      thrush     mouse    ox

Plural      cats     thrushes   mice     oxen

Inflectional Morphology
• Regular plural spelled:
– -s
– -es after words ending in
• –s (ibis/ibises)
• -z (waltz/waltzes)
• -sh (thrush/thrushes)
• -ch (finch/finches)
• -x (box/boxes); sometimes
• Nouns ending in –y preceded by a consonant change the –y to –i
(butterfly/butterflies).
• The possessive suffix is realized by apostrophe + -s for
– Regular singular nouns (llama’s), and
– Plural nouns not ending in –s (children’s), and often
• Lone apostrophe after
– Regular plural nouns (llamas’), and some
– Names ending in –s or –z (Euripides’ comedies).
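The regular plural spelling rules above can be sketched as a small Python function (a minimal illustration only; irregular nouns like mouse/mice and further exceptions are ignored):

```python
import re

def pluralize(noun):
    """Apply the regular English plural spelling rules described above.
    Irregular nouns (mouse/mice, ox/oxen) are not handled."""
    # -y preceded by a consonant changes to -ies: butterfly -> butterflies
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    # -es after -s, -z, -sh, -ch, and sometimes -x: thrush -> thrushes
    if re.search(r"(s|z|sh|ch|x)$", noun):
        return noun + "es"
    # default case: plain -s
    return noun + "s"
```

For example, pluralize("box") yields boxes and pluralize("butterfly") yields butterflies.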
Inflectional Morphology: Verbs
• English verbal inflection is more
complicated than nominal inflection.
1. English has three kinds of verbs
i. Main verbs (eat, sleep, impeach)
ii. Modal verbs (can, will, should)
iii. Primary verbs (be, have, do)
• Concerned with main and primary verbs because these
have inflectional endings.
• Of these verbs a large class are regular (all verbs in this
class have the same endings marking the same functions)
Regular Verbs
• Regular Verbs have four morphological forms.
• For regular verbs we know the other forms by adding one of
three predictable endings and making some regular spelling
changes.

Morphological Form Class      Regularly Inflected Verbs

Stem                          walk      merge     try       map

-s form                       walks     merges    tries     maps

-ing participle               walking   merging   trying    mapping

Past form or –ed participle   walked    merged    tried     mapped
Regular Verbs

• Since regular verbs
1. cover the majority of verbs and forms, and
2. form a productive class,
they are significant in the morphology of the
English language.
– A productive class is one that automatically
includes any new words that enter the
language.
Irregular Verbs
• Irregular Verbs are those that have some more or less idiosyncratic forms of
inflection.
• English irregular verbs
– often have five different forms, but can have
– as many as eight (e.g., the verb be), or
– as few as three (e.g., cut or hit)
• They constitute a smaller class of verbs estimated to be about 250

Morphological Form Class   Irregularly Inflected Verbs

Stem              eat       catch      cut
-s form           eats      catches    cuts
-ing participle   eating    catching   cutting
Past form         ate       caught     cut
–ed participle    eaten     caught     cut

Usage of Morphological Forms for
Irregular Verbs
• The –s form:
– Used in “habitual present” form to distinguish the third-person singular
ending: “She jogs every Tuesday” from the other choices of person and
number “I/you/we/they jog every Tuesday”.
• The stem form:
– Used in the infinitive form, and also after certain other verbs: “I’d rather
walk home, I want to walk home”
• The –ing participle is used in the progressive construction to mark a
present or ongoing activity “It is raining”, or when the verb is treated as a
noun (this particular kind of nominal use of a verb is called gerund use:
“Fishing is fine if you live near water”)
• The –ed participle is used in the perfect construction “He’s eaten lunch
already”, or passive construction “The verdict was overturned yesterday”
Spelling Changes
• A number of regular spelling changes occur at morpheme boundaries.
• Example:
– A single consonant letter is doubled before adding the –ing and –ed suffixes:
beg/begging/begged
– If the final letter is “c”, the doubling is spelled “ck”:
picnic/picnicking/picnicked
– If the base ends in a silent –e, it is deleted before adding –ing and –ed:
merge/merging/merged
– Just as for nouns, the –s ending is spelled
• –es after verb stems ending in –s (toss/tosses)
• -z (waltz/waltzes)
• -sh (wash/washes)
• -ch (catch/catches)
• -x (tax/taxes) sometimes.
– Also like nouns, verbs ending in –y preceded by a consonant change the –y to
–i (try/tries).
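The spelling changes above can be sketched as one function (a sketch covering only the rules listed on this slide; real English has further exceptions):

```python
import re

def inflect(stem, suffix):
    """Attach -s, -ing, or -ed to a regular verb stem, applying the
    spelling changes listed above."""
    if suffix == "s":
        if re.search(r"[^aeiou]y$", stem):            # try -> tries
            return stem[:-1] + "ies"
        if re.search(r"(s|z|sh|ch|x)$", stem):        # toss -> tosses
            return stem + "es"
        return stem + "s"
    # -ing and -ed
    if stem.endswith("c"):                            # picnic -> picnicking
        return stem + "k" + suffix
    if stem.endswith("e"):                            # merge -> merging/merged
        return stem[:-1] + suffix
    if suffix == "ed" and re.search(r"[^aeiou]y$", stem):   # try -> tried
        return stem[:-1] + "ied"
    if re.search(r"[^aeiou][aeiou][bdgmnprt]$", stem):      # beg -> begging
        return stem + stem[-1] + suffix
    return stem + suffix
```

For example, inflect("map", "ed") yields mapped, while inflect("walk", "ed") yields walked with no doubling.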
Derivational Morphology
• Derivation is combination of a word stem
with a grammatical morpheme
– Usually resulting in a word of a different class,
– Often with a meaning hard to predict exactly
• English inflection is relatively simple
compared to other languages.
• Derivation in English language is quite
complex.
Derivational Morphology
• A common kind of derivation in English is the formation of
– new nouns,
– From verbs, or
– Adjectives, called nominalization.
• Example:
– The suffix –ation produces nouns from verbs, which often end in the suffix –ize
(computerize → computerization)

Suffix Base Verb/Adjective Derived Noun


-ation Computerize (V) Computerization
-ee Appoint (V) Appointee
-er Kill (V) Killer
-ness Fuzzy (A) Fuzziness
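The –ize → –ation pattern in the table can be sketched directly; as discussed later in this chapter, such a naive rule overgenerates, since –ation cannot attach to every verb:

```python
def nominalize(verb):
    """Sketch of the -ize -> -ation nominalization pattern above.
    Returns None for verbs the pattern does not cover."""
    if verb.endswith("ize"):
        return verb[:-3] + "ization"
    return None
```

For example, nominalize("computerize") yields computerization.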
Derivational Morphology

• Adjectives can also be derived from nouns and verbs

Suffix Base Noun/Verb Derived Adjective


-al Computation (N) Computational
-able Embrace (V) Embraceable
-less Clue (N) Clueless
Complexity of Derivation in English
Language
• There are a number of reasons for the complexity
of derivation in English:
1. Generally less productive:
 Nominalizing suffix like –ation, which can be added to
almost any verb ending in –ize, cannot be added to
absolutely every verb.
 Example: we can’t say *eatation or *spellation (the asterisk
marks forms that are not actual English words)
2. There are subtle and complex meaning differences
among nominalizing suffixes
 Example: sincerity vs sincereness
Cliticization
• Clitic is a unit whose status lies in between that of an
affix and a word.
• Phonological behavior:
– Short
– Unaccented
• Syntactic behavior:
– Words, acting as:
• Pronouns,
• Articles,
• Conjunctions
– Verbs
Cliticization
 Proclitics – clitics preceding a word
 Enclitics – clitics following a word

Full Form Clitic Full Form Clitic


am ‘m have ‘ve
are ‘re has ‘s
is ‘s had ‘d
will ‘ll would ‘d

• Ambiguity
– She’s → she is or she has
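The table above can be turned into a small clitic-expansion sketch; the ’s and ’d ambiguities are left unresolved by returning every reading (a hypothetical helper, not a full tokenizer; the possessive ’s, for instance, is ignored):

```python
# Enclitic expansions from the table above.
CLITICS = {
    "'m": ["am"], "'re": ["are"], "'ve": ["have"], "'ll": ["will"],
    "'s": ["is", "has"], "'d": ["had", "would"],
}

def expand(token):
    """Return all full-form readings of a token carrying an enclitic."""
    for clitic, full_forms in CLITICS.items():
        if token.endswith(clitic):
            host = token[: -len(clitic)]
            return [host + " " + full for full in full_forms]
    return [token]
```

For example, expand("she's") returns both readings, she is and she has.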
Non-Concatenative Morphology
• Morphology discussed so far is called
concatenative morphology.
• Other (than English) languages have extensive
non-concatenative morphology:
– Morphemes are combined in more complex ways -
Tagalog
– Arabic, Hebrew and other Semitic languages exhibit
templatic morphology or root-and-pattern morphology.
Agreement

• In English, the plural is marked on
both nouns and verbs.
• Consequently the subject noun and the
main verb have to agree in number:
– Both must be either singular or plural.
Finite-State Morphological Parsing

• The goal of morphological parsing is to take the input
forms in the first column of the following table and produce
output forms like those in the second column.
– Example link-grammar: http://www.link.cs.cmu.edu/link/

Input    Morphologically Parsed Output    Input      Morphologically Parsed Output

cats     cat +N +PL                       goose      goose +V

cat      cat +N +SG                       gooses     goose +V +3SG

cities   city +N +PL                      merging    merge +V +PresPart

geese    goose +N +PL                     caught     catch +V +PastPart

goose    goose +N +SG                     caught     catch +V +Past


Finite-State Morphological Parsing
• The parsed-output columns of the previous slide’s table contain the stem
of each word as well as assorted morphological features. These
features provide additional information about the stem.
• Example:
– +N – word is a noun
– +SG – word is singular
• Some of the input forms may be ambiguous:
– Caught
– Goose
• For now we will consider the goal of morphological parsing to be
merely listing all possible parses.
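Listing all possible parses can be illustrated with a plain lookup table built from the example analyses above (a toy stand-in for the finite-state machinery developed in the following slides):

```python
# Toy parse table using the example analyses from the slides.
PARSES = {
    "cats":   ["cat +N +Pl"],
    "cities": ["city +N +Pl"],
    "geese":  ["goose +N +Pl"],
    "goose":  ["goose +N +Sg", "goose +V"],
    "caught": ["catch +V +PastPart", "catch +V +Past"],
}

def parse(word):
    """Return every possible parse of `word`, or [] if none is known."""
    return PARSES.get(word, [])
```

Ambiguous inputs such as caught simply yield more than one parse.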
Requirements of Morphological Parser
1. Lexicon: the list of stems and affixes, together with basic
information about them.
 Example: whether a stem is a Noun or a Verb, etc.

2. Morphotactics: the model of morpheme ordering that explains


which classes of morphemes can follow other classes of morphemes
inside a word.
 Example: English plural morpheme follows the noun and does
not precede it.

3. Orthographic rules: these are spelling rules that are used to model
the changes that occur in a word, usually when two morphemes
combine.
 Example: the y → ie spelling rule, as in city + -s → cities.
Requirements of Morphological Parser

• In the next section we will present:


– Representation of a simple lexicon for the sub-
problem of morphological recognition
– Build FSAs to model morphotactic knowledge
– Finite-State Transducer (FST) is introduced as
a way of modeling morphological features in
the lexicon
Lexicon
• A lexicon is a repository for words.
– The simplest possible lexicon would consist of an explicit list of every word of
the language
– Every word means including
• Abbreviations: AAA
• Proper Names: Jane, Beijing, Lydra, Saranda, Zana, Gent, etc.
– Example:
• a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, …
• For various reasons it will in general be inconvenient or even
impossible to list every word in a language.
• Computational lexicons are usually structured with
– a list of each of the stems and affixes of the language.
– Representation of the morphotactics
• One of the most common ways to model morphotactics is with the
finite-state automaton.
Example of FSA for English nominal
inflection
[FSA diagram: start state q0; arc reg-noun from q0 to q1; arc plural -s
from q1 to q2; arcs irreg-sg-noun and irreg-pl-noun also leave q0]

• This FSA assumes that the lexicon includes:
– regular nouns (reg-noun) that take the
regular –s plural: cat, dog, aardvark
• ignoring for now that the plural of words like fox has an inserted e: foxes.
– irregular noun forms that don’t take –s, both:
• singular (irreg-sg-noun): goose, mouse, sheep
• plural (irreg-pl-noun): geese, mice, sheep
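The behavior of this FSA can be sketched in Python by plugging the three sub-lexicons into a recognizer (set membership stands in for the labeled arcs; the word lists are the examples above):

```python
LEXICON = {
    "reg-noun":      {"cat", "dog", "aardvark", "fox"},
    "irreg-sg-noun": {"goose", "mouse", "sheep"},
    "irreg-pl-noun": {"geese", "mice", "sheep"},
}

def accepts(word):
    """Recognize a noun form the way the q0-q1-q2 automaton does."""
    if word in LEXICON["irreg-sg-noun"] or word in LEXICON["irreg-pl-noun"]:
        return True                      # irregular forms accepted directly
    if word in LEXICON["reg-noun"]:
        return True                      # bare regular stem
    # regular stem plus plural -s; like the FSA, this wrongly accepts "foxs"
    return word.endswith("s") and word[:-1] in LEXICON["reg-noun"]
```

Note that, exactly like the FSA, the sketch accepts the ill-formed foxs, motivating the orthographic rules introduced later.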
Examples of Nominal Inflections

reg-noun    irreg-pl-noun    irreg-sg-noun    plural

fox         geese            goose            -s
cat         sheep            sheep
aardvark    mice             mouse
Example for English verbal inflection
[FSA diagram: start state q0, final state q3; arcs reg-verb-stem
(q0→q1 and q0→q2), irreg-verb-stem (q0→q2), irreg-past-verb-form
(q0→q3); past (-ed) and past participle (-ed) from q1 to q3;
pres part (-ing) and 3sg (-s) from q2 to q3]

• This lexicon has three stem classes:
– reg-verb-stem
– irreg-verb-stem
– irreg-past-verb-form
• Four affix classes:
– -ed: past
– -ed: participle
– -ing: participle
– -s: third person singular
Examples of Verb Inflections

reg-verb-stem   irreg-verb-stem   irreg-past-verb   past   past-part   pres-part   3sg

walk            cut               caught            -ed    -ed         -ing        -s
fry             speak             ate
talk            sing              eaten
impeach                           sang
English Derivational Morphology
• As it has been discussed earlier in this chapter,
English derivational morphology is significantly more
complex than English inflectional morphology.
• FSAs that model derivational morphology thus
tend to be quite complex.
• Some models of English derivation are based on the
more complex context-free grammars.
Morphotactics of English Adjectives
• Example of Simple Case of Derivation from Antworth (1990):

 big, bigger, biggest  cool, cooler, coolest, coolly


 happy, happier, happiest,  red, redder, reddest
happily  real, unreal, really
 unhappy unhappier,
unhappiest, unhappily
 clear, clearer, clearest, clearly,
unclear, unclearly
[FSA diagram: q0 --(un- or ε)--> q1 --adj-root--> q2 --(-er, -est, -ly, or ε)--> q3]
Problem Issues
• While previous slide FSA will
– recognize all the adjectives in the table presented earlier,
– it will also recognize ungrammatical forms like:
• unbig, unfast, oranger, or smally
• adj-root would include adjectives that:
1. can occur with un- and –ly: clear, happy and real
2. cannot: big, small, etc.

 This simple example gives an idea of the complexity
to be expected from English derivation.
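The overgeneration can be reproduced with a regular expression standing in for the FSA (roots taken from the table above; a sketch only, which also ignores spelling changes such as the consonant doubling in bigger):

```python
import re

# (un-)? adj-root (-er | -est | -ly)? with one undifferentiated root class
ADJ_ROOTS = ["big", "clear", "cool", "happy", "real", "red"]
PATTERN = re.compile(r"^(un)?(" + "|".join(ADJ_ROOTS) + r")(er|est|ly)?$")

def recognize(word):
    """True if the word matches the (overgenerating) adjective FSA."""
    return PATTERN.match(word) is not None
```

As the slide notes, ungrammatical forms like unbig are accepted because the single adj-root class does not distinguish which roots take un-.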
Derivational Morphology Example 2
• FSA models a number of derivational facts:
– generalization that any verb ending in –ize can be followed by the nominalizing suffix –ation
– -al or –able → -ity or -ness
– Exercise 3.1 asks the reader to discover some of the individual exceptions to many of these constructs.

• Example:
– fossil → fossilize → fossilization
– equal → equalize → equalization
– formal → formalize → formalization
– realize → realizable → realization
– natural → naturalness
– casual → casualness

[FSA diagram: models another fragment of English derivational
morphology, with labeled arcs such as nouni → -ize/V → -ation/N,
adj-al → -ity/N or -ness/N, adj-al → -able/A, adj-ous → -ly/Adv or
-ness/N, verbj → -ive/A, verbk → -ative/A, and nounl → -ful/A,
over states q0–q11]
Solving the Problem of Morphological
Recognition
• Using the FSAs above, one could solve the problem of
morphological recognition:
– Given an input string of letters, does it constitute a legitimate
English word or not?
• Taking morphotactic FSAs and plugging in each “sub-
lexicon” into the FSA.
– Expanding each arc (e.g., reg-noun-stem arc) with all
morphemes that make up the stem of reg-noun-stem.
– The resulting FSA can then be defined at the level of the
individual letter.
Solving the Problem of Morphological
Recognition
• Noun-recognition FSA produced by expanding the Nominal Inflection FSA of
previous slide with sample regular and irregular nouns for each class.
• We can use the FSA in the figure below to recognize strings like aardvarks by simply
starting at the initial state, and comparing the input letter by letter with each word on
each outgoing arc, and so on, just as we saw in Ch. 2.

[Figure: the nominal-inflection FSA with each class arc expanded letter
by letter, e.g. f-o-x and c-a-t followed by the plural -s, and the
irregular paths g-o-o-s-e and g-e-e-s-e]

Expanded FSA for a few English nouns with their inflection. Note that this automaton will
incorrectly accept the input foxs. We will see, beginning on page 62 (ref book), how to correctly
deal with the inserted e in foxes.
Finite-State Transducers (FST)
• An FSA can represent the morphotactic structure of a lexicon and thus can be
used for word recognition.
• In this section we will introduce finite-state transducers and we will
show how they can be applied to morphological parsing.
• A transducer maps between one representation and another;
– A finite-state transducer or FST is a type of finite automaton which maps
between two sets of symbols.
– An FST can be visualized as a two-tape automaton which recognizes or generates
pairs of strings.
– Intuitively, we can do this by labeling each arc in the FSM (finite-state
machine) with two symbol strings, one from each tape.
– In the figure in the next slide an FST is depicted where each arc is labeled by
an input and an output string, separated by a colon.
A Finite-State Transducer (FST)

[FST diagram: two states q0 and q1, with arcs labeled by input:output
pairs such as a:b, b:ε, b:b, and a:ba]

• An FST has a more general function than an FSA:
– Where an FSA defines a formal language by defining a set of strings,
an FST defines a relation between sets of strings.
– An FST can be thought of as a machine that reads one string and
generates another.
A Finite-State Transducer (FST)
• FST as recognizer:
– A transducer that takes a pair of strings as input, and outputs:
• accept if the string-pair is in the string-pair language, and
• reject if it is not.
• FST as a generator:
– A machine that outputs pairs of strings of the language; its output is:
• a yes or no, and
• a pair of output strings.
• FST as translator:
– A machine that reads a string and outputs another string.
• FST as a set relater:
– A machine that computes relations between sets.
A Finite-State Transducer (FST)

• All four categories of FST in previous slide


have applications in speech and language
processing.
– Morphological parsing (and for many other
NLP applications):
• Apply FST translator metaphor:
– Input: a string of letters
– Output: a string of morphemes.
Formal Definition of FST
Q         A finite set of N states q0, q1, …, qN-1

Σ         A finite set corresponding to the input alphabet

Δ         A finite set corresponding to the output alphabet

q0 ∈ Q    The start state

F ⊆ Q     The set of final states

δ(q,w)    The transition function or transition matrix between states. Given a
          state q ∈ Q and a string w ∈ Σ*, δ(q,w) returns a set of new states
          Q′ ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (the power set of Q).
          δ returns a set of states rather than a single state because a given
          input may be ambiguous in which state it maps to.

σ(q,w)    The output function giving the set of possible output strings for each
          state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q,w) gives
          a set of output strings, each a string o ∈ Δ*. σ is thus a function
          from Q × Σ* to 2^(Δ*).
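The definition above can be encoded directly; the dictionary representation below is one illustrative choice, keying δ and σ on (state, symbol) pairs (single symbols rather than arbitrary strings w ∈ Σ*, for simplicity):

```python
class FST:
    """A minimal nondeterministic FST following the definition above."""

    def __init__(self, delta, sigma, q0, finals):
        self.delta = delta      # (state, symbol) -> set of next states
        self.sigma = sigma      # (state, symbol) -> set of output strings
        self.q0 = q0
        self.finals = finals

    def transduce(self, word):
        """Return every output produced along an accepting path for word."""
        configs = {(self.q0, "")}               # (state, output so far)
        for ch in word:
            step = set()
            for state, out in configs:
                for nxt in self.delta.get((state, ch), ()):
                    for o in self.sigma.get((state, ch), {""}):
                        step.add((nxt, out + o))
            configs = step
        return {out for state, out in configs if state in self.finals}
```

For instance, a one-state machine with δ(q0, a) = {q0} and σ(q0, a) = {b} transduces aa to {"bb"}.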
Properties of FST
• FSAs are isomorphic to regular languages ⇔ FSTs are isomorphic
to regular relations.
• FSTs and regular relations are closed under union.
• Generally FSTs are not closed under difference, complementation
and intersection.
• In addition to union, FSTs have two closure properties that turn out
to be extremely useful:
– Inversion: The inversion of a transducer T, written T⁻¹, simply switches the
input and output labels. Thus if T maps from the input alphabet I to
the output alphabet O, T⁻¹ maps from O to I.
– Composition: If T1 is a transducer from I1 to O1 and T2 a
transducer from O1 to O2, then T1∘T2 maps from I1 to O2.
Properties of FST
• Inversion is useful because it makes it easy to convert
a FST-as-parser into an FST-as-generator.
• Composition is useful because it allows us to take two
transducers that run in series and replace them with
one more complex transducer.
– Composition works as in algebra:
• Applying T1∘T2 to an input sequence S is identical to
applying T1 to S and then T2 to the resulting sequence; thus
T1∘T2(S) = T2(T1(S))
FST Composition Example

[Figure: an FST for [a:b]+ composed with an FST for [b:c]+ equals an
FST for [a:c]+, each drawn as a two-state machine q0 → q1]
• The composition of [a:b]+ with [b:c]+ to produce [a:c]+
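This composition can be sketched with transducers-as-functions; each toy transducer below is total on its own input language, so plain function composition in the order T1∘T2(S) = T2(T1(S)) suffices:

```python
def t1(s):
    """[a:b]+ : map each a to b."""
    return "b" * len(s) if s and set(s) == {"a"} else None

def t2(s):
    """[b:c]+ : map each b to c."""
    return "c" * len(s) if s and set(s) == {"b"} else None

def compose(f, g):
    """(T1 . T2)(S) = T2(T1(S)), matching the definition above."""
    return lambda s: g(f(s))
```

Applying compose(t1, t2) to "aaa" yields "ccc", i.e. the behavior of [a:c]+.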


Projection

• The projection of an FST is the FSA that is
produced by extracting only one side of the
relation.
– We refer to the projection to the left or upper
side of the relation as the upper or first
projection, and the projection to the lower or
right side of the relation as the lower or second
projection.
Sequential Transducers and Determinism
• Transducers as have been described may be
nondeterministic; given an input there may be many
possible output symbols.
– Thus using general FSTs requires the kinds of search
algorithms discussed in Chapter 1, which makes FSTs quite
slow in the general case.
– This fact implies that it would be nice to have an algorithm
to convert a nondeterministic FST to a deterministic one.
• Every non-deterministic FSA is equivalent to some
deterministic FSA
• Not all FSTs can be determinized.
Sequential Transducers
• Sequential transducers are a subtype of transducers that
are deterministic on their input.
– At any state of a sequential transducer, each given symbol of
the input alphabet Σ can label at most one transition out of
that state.

[FST diagram (a sequential transducer): states q0 and q1, with arcs
labeled a:b, b:ε, b:b, and a:ba]
Sequential Transducers
• Sequential transducers are not necessarily sequential on their
output. In example of FST in previous slide, two distinct transitions
leaving from state q0 have the same output (b).
– Inverse of a sequential transducer may thus not be sequential, thus we
always need to specify the direction of the transduction when
discussing sequentiality.
• Formally, the definition of sequential transducers modifies the δ
and σ functions slightly:
 δ becomes a function from Q × Σ* to Q (rather than to 2^Q), and
 σ becomes a function from Q × Σ* to Δ* (rather than to 2^(Δ*)).
• Subsequential transducer is one generalization of sequential
transducer which generates an additional output string at the final
states, concatenating it onto the output produced so far.
Importance of Sequential and
Subsequential Transducers
• Sequential and Subsequential FSTs are:
– efficient because they are deterministic on input →
they can be processed in time proportional to the
number of symbols in the input (linear in their input
length) rather than proportional to some much
larger number which is a function of the number of
states.

– Efficient algorithms for determinization exist for
subsequential transducers (Mohri, 1997) and for their
minimization (Mohri, 2000).
2-subsequential FSA (Mohri 1997)

[Figure (Mohri, 1997): a 2-subsequential FST with states q0, q1, q2, q3;
arcs a:a (q0→q1), a:a (q1→q3), b:a (q0→q2), b:b (q2→q3), and final
output strings a and b at q3]

• Mohri (1996, 1997) shows a number of tasks whose ambiguity can be


limited in this way, including the representation of dictionaries, the
compilation of morphological and phonological rules, and local syntactic
constraints. For each of these kinds of problems, he and others have shown
that they are p-subsequentializable, and thus can be determinized and
minimized. This class of transducers includes many, although not necessarily
all, morphological rules.
Task of Morphological Parsing
• Example: cats → cat+N+PL
• In the finite-state morphology paradigm, we represent a word as a
correspondence between:
– A lexical level, which represents a concatenation of the morphemes making up a
word, and
– The surface level, which represents the concatenation of letters which
make up the actual spelling of the word

Lexical c a t +N +PL

Surface c a t s
Schematic examples of the lexical and surface tapes; the actual transducers will involve
intermediate tapes as well.
Lexical Tape
• For finite-state morphology it is convenient to view an FST as having two tapes.
– The upper or lexical tape is composed of characters from one (input) alphabet
Σ.
– The lower or surface tape is composed of characters from another (output)
alphabet Δ.

• In the two-level morphology of Koskenniemi (1983), each arc is allowed to have only a
single symbol from each alphabet.
– A two-symbol alphabet can be obtained by combining alphabets Σ and Δ to make a
new alphabet Σ′.
– The new alphabet Σ′ makes the relationship to FSAs clear:
 Σ′ is a finite alphabet of complex symbols:
– Each complex symbol is composed of an input-output pair i:o;
» i – one symbol from the input alphabet Σ
» o – one symbol from the output alphabet Δ
 Σ′ ⊆ Σ × Δ
 Σ and Δ may each also include the epsilon symbol ε.
Lexical Tape
• Comparing FSA to FST modeling
morphological aspect of a language the
following can be observed:
– An FSA accepts a language stated over a finite
alphabet of single symbols; e.g. the sheep
language:
 Σ = {b, a, !}
– An FST as defined in previous slides accepts a
language stated over pairs of symbols, as in:
 Σ′ = {a:a, b:b, !:!, a:!, a:ε, ε:!}
Feasible Pairs
• In two-level morphology, the pairs of symbols in Σ′
are also called feasible pairs.
– Each feasible pair symbol a:b in the transducer alphabet Σ′
expresses how the symbol a from one tape is mapped to the
symbol b on the other tape.
– Example: a:ε means that an a on the upper tape will
correspond to nothing on the lower tape.
• We can write regular expressions in the complex
alphabet Σ′ just as in the case of FSAs.
Default Pairs
• Since it is most common for symbols to map to themselves (in
two-level morphology) we call pairs like a:a default pairs, and
just refer to them by the single letter a.
FST Morphological Parser
• From the morphotactic FSAs covered earlier, by adding a lexical
tape and the appropriate morphological features, we can build an
FST morphological parser.

• The next slide presents a figure that augments the
FSA for English nominal inflection with
nominal morphological features (+Sg and +Pl) that correspond
to each morpheme.
– The symbol ^ indicates a morpheme boundary, while
– The symbol # indicates a word boundary.
A Schematic Transducer for English
Nominal Number Inflection Tnum
[Transducer diagram Tnum: q0 --reg-noun--> q1 --+N:ε--> q4, then
+Sg:# or +Pl:^s# to the final state q7; q0 --irreg-sg-noun--> q2
--+N:ε--> q5 --+Sg:#--> q7; q0 --irreg-pl-noun--> q3 --+N:ε--> q6
--+Pl:#--> q7]

• The symbols above each arc represent elements of the
morphological parse on the lexical tape.
• The symbols below each arc represent the surface tape (or the
intermediate tape, to be described later), using the morpheme-boundary
symbol ^ and the word-boundary marker #.
• The arcs need to be expanded by the individual words in the lexicon.
Transducer & Lexicon
• In order to use the Transducer in the previous slide as a morphological
noun parser it needs to be expanded with all the individual regular and
irregular noun stems: replacing the labels reg-noun etc.
• This expansion can be done by updating the lexicon for this transducer,
so that irregular plurals like geese will parse into the correct stem
goose +N +Pl.
• This is achieved by allowing the lexicon to also have two levels:
– Surface geese maps to lexical goose → new lexical entry:
“g:g o:e o:e s:s e:e”.
• Regular forms are simpler: two level entry for fox will now be “f:f o:o
x:x”
• Relying on the orthographic convention that f stands for f:f and so on,
we can simply refer to it as fox and the form for geese as “g o:e o:e s
e”.
Lexicon
reg-noun irreg-pl-noun irreg-sg-noun
fox g o:e o:e s e goose
cat sheep sheep
aardvark m o:i u:e s:c e mouse
[Figure Tlex: the Tnum transducer with each class arc expanded letter
by letter; e.g. f-o-x and c-a-t lead to the +N:ε arc and then +Sg:# or
+Pl:^s#, while the irregular entries follow paths such as
g o:e o:e s e for geese]
Expanded FST Tlex, from Tnum, for a few English nouns with their inflection. Note that this automaton will
incorrectly accept the input foxs. We will see later how to correctly deal with the inserted e in foxes.
FST
• The resulting transducer shown in previous slide will map:
– Plural nouns into the stem plus the morphological marker +Pl
– Singular nouns into the stem plus the morphological marker +Sg.
• Example:
– cats → cat +N +Pl
– c:c a:a t:t +N:ε +Pl:^s#
– The output symbols include the morpheme and word boundary markers ^
and # respectively. Thus the lower labels do not correspond exactly to
the surface level.
– Hence the output tape with these morpheme-boundary markers is
referred to as intermediate, as shown in the next figure.
Lexical and Intermediate Tapes

Lexical f o x +N +PL

Inter-
f o x ^ s #
mediate
A schematic view of the lexical and intermediate tapes
Transducers and Orthographic Rules
• The method described in previous section will
successfully recognize words like aardvarks and mice.
• However, concatenating morphemes won’t work for
cases where there is a spelling change:
– Incorrect rejection of the input like foxes, and
– Incorrect acceptance of the input like foxs.
• This is because English often
requires spelling changes at morpheme boundaries:
– hence the introduction of spelling (or orthographic) rules
Notations for Writing Orthographic Rules
• Implementing a rule in a transducer is important in general for
speech and language processing.
• The table below introduces a number of rules.
Name Description of Rule Example

Consonant
1-letter consonant doubled before –ing/-ed beg/begging
doubling

E deletion Silent -e dropped before –ing and –ed make/making

E insertion -e added after –s, -z, -x, -ch, -sh before -s watch/watches

Y replacement -y changes to –ie before –s, -i before -ed try/tries

K insertion Verbs ending with vowel + -c add –k panic/panicked


Lexical → Intermediate → Surface
• An example of the lexical, intermediate, and surface tapes. Between each
pair of tapes is a two-level transducer; the lexical transducer between
lexical and intermediate levels, and the E-insertion spelling rule between
the intermediate and surface levels. The E-insertion spelling rule inserts
an e on the surface tape when the intermediate tape has a morpheme
boundary ^ followed by the morpheme -s

Lexical f o x +N +PL

Inter-
f o x ^ s #
mediate

Surface f o x e s
Orthographic Rule Example
• The E-insertion rule from the table might be formulated as:
– Insert an e on the surface tape just when the lexical tape has a morpheme
ending in x (or z, etc.) and the next morpheme is –s. Below is a formalization
of this rule:

 ε → e / { x, s, z } ^ __ s#

 This rule notation is due to Chomsky and Halle (1968), and is of the
form:
 a → b / c __ d: “rewrite a as b when it occurs between c and d”.
Orthographic Rule Example
• The symbol ε means the empty string;
replacing ε means inserting something.
• Morpheme boundaries ^ are deleted by
default on the surface level (^:ε)
• Since # symbol marks a word boundary, the
rule in the previous slide means:
– Insert an e after a morpheme-final x, s, or z, and
before the morpheme s.
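Stated as string rewriting over the intermediate tape, the rule (plus the default deletion of the ^ and # markers) can be sketched with a regular expression; this is a shortcut for the full state-by-state transducer shown on the next slide:

```python
import re

def e_insertion(intermediate):
    """Map an intermediate-tape string like 'fox^s#' to the surface tape:
    insert e between a morpheme-final x/s/z and the -s morpheme, then
    delete the boundary markers ^ and #."""
    surface = re.sub(r"([xsz])\^s#", r"\1es", intermediate)
    return surface.replace("^", "").replace("#", "")
```

For example, e_insertion("fox^s#") yields foxes, while e_insertion("cat^s#") yields cats with no insertion.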
Transducer for E-insertion rule
[Transducer diagram for the E-insertion rule: states q0–q5 with arcs
labeled z/s/x, ^:ε, e:e, s, #, and other; its transition table is given
below]
State/Input   s:s   x:x   z:z   ^:ε   e:e   #     other
q0:           1     1     1     0     -     0     0
q1:           1     1     1     2     -     0     0
q2:           5     1     1     0     3     0     0
q3            4     -     -     -     -     -     -
q4            -     -     -     -     -     0     -
q5            1     1     1     2     -     -     0
Combining FST Lexicon and Rules
Combining FST Lexicon and Rules

• Ready now to combine Lexicon and Rule Transducers for parsing


and generating.
• The figure below depicts the architecture of two-level cascade of
FSTs with lexicon and rules
Lexical:       f o x +N +Pl
        ↓  LEXICON-FST
Intermediate:  f o x ^ s #
        ↓  orthographic rule FSTs (FST1 … FSTN)
Surface:       f o x e s
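The generation direction of this cascade can be sketched end to end; the lexical-to-intermediate mapping below is a hypothetical stand-in for the LEXICON-FST, and the orthographic step reuses the E-insertion rule as a rewrite:

```python
import re

# Hypothetical lexical -> intermediate mapping standing in for LEXICON-FST.
LEXICON_FST = {
    ("fox", "+N", "+Pl"):   "fox^s#",
    ("fox", "+N", "+Sg"):   "fox#",
    ("goose", "+N", "+Pl"): "geese#",
}

def generate(stem, pos, num):
    """Lexical tape -> intermediate tape -> surface tape."""
    intermediate = LEXICON_FST[(stem, pos, num)]
    surface = re.sub(r"([xsz])\^s#", r"\1es", intermediate)  # E-insertion
    return surface.replace("^", "").replace("#", "")         # drop ^ and #
```

For example, generate("fox", "+N", "+Pl") passes through the intermediate form fox^s# and yields the surface form foxes.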
Tlex FST Combined with Te-insert FST
[Figure: accepting foxes through the cascade. Tlex maps the lexical
tape f o x +N +Pl to the intermediate tape f o x ^ s # through the
state sequence 0 1 2 5 6 7; Te-insert (the E-insertion transducer of
the previous slide) maps the intermediate tape to the surface tape
f o x e s through the state sequence 0 0 0 1 2 3 4 0]
Finite-State Transducers
• The exact same cascade with the same state sequences is used when the
machine is
– Generating the surface tape from the lexical tape
– Parsing the lexical tape from the surface tape.
• Parsing is slightly more complicated than generation, because of the
problem of ambiguity.
– Example: foxes can also be a verb (meaning “to baffle or confuse”), and hence
lexical parse for foxes could be
• fox +V +3Sg: That trickster foxes me every time, as well as
• fox +N +Pl: I saw two foxes yesterday.
– For ambiguous cases of this sort the transducer is not capable of deciding.
– Disambiguation will require some external evidence such as surrounding
words. Disambiguation algorithms are discussed in Ch. 5 and Ch. 20.
– In the absence of such external evidence the best an FST can do is enumerate all
possible choices – transducing fox^s# into both fox +V +3Sg and fox +N +Pl.
FSTs and Ambiguity
• There is a kind of ambiguity that FSTs need to handle:
– Local ambiguity that occurs during the process of parsing.
– Example: Parsing input verb assess.
• After seeing ass, the E-insertion transducer may propose that the e that follows is
inserted by the spelling rule – as far as the FST is concerned, we might have been parsing
the word asses.
• It is not until we fail to see the # after asses, but instead run into another s,
that we realize we have gone down an incorrect path.
• Because of this non-determinism, FST-parsing algorithms need to
incorporate some sort of search algorithm.
• Note that many possible spurious segmentations of the input, such as
parsing assess as ^a^s^ses^s will be ruled out since no entry in the
lexicon will match this string.
