Chapter Three
Survey of English Morphology
1. Inflection
• Combination of a word stem with a grammatical morpheme, usually
resulting in a word of the same class as the original stem, and usually
filling some syntactic function like agreement.
• Example:
– -s: plural of nouns
– -ed: past tense of verbs.
2. Derivation
• Combination of word stem with a grammatical morpheme, usually
resulting in a word of a different class, often with a meaning hard to
predict.
• Example:
– Computerize – verb
– Computerization – noun.
3. Compounding
• Combination of multiple word stems together.
• Example:
– Doghouse: dog + house.
4. Cliticization
• Combination of a word stem with a clitic. A clitic is a morpheme that
acts syntactically like a word, but is reduced in form and attached
(phonologically and sometimes orthographically) to another word.
• Example:
– I’ve = I + ‘ve = I + have
Inflectional Morphology
• English has a relatively simple
inflectional system; only the following are inflected:
– Nouns
– Verbs
– Adjectives (sometimes)
• The number of possible inflectional affixes is
quite small.
Inflectional Morphology: Nouns
• Nouns (English):
– Plural
– Possessive
– Many (but not all) nouns can either
• appear in the bare stem (singular) form, or
• take a plural suffix
• Ambiguity
– She’s → she is or she has
Non-Concatenative Morphology
• The morphology discussed so far is called
concatenative morphology.
• Languages other than English can have extensive
non-concatenative morphology:
– Morphemes are combined in more complex ways, as in
Tagalog.
– Arabic, Hebrew and other Semitic languages exhibit
templatic morphology or root-and-pattern morphology.
Agreement
3. Orthographic rules: spelling rules used to model the changes that
occur in a word, usually when two morphemes combine.
Example: the y → ie spelling rule, as in city + -s → cities.
Requirements of Morphological Parser
[Figure: FSA for English nominal inflection (states q0–q2, q0 the start state; arcs include irreg-pl-noun and irreg-sg-noun and a plural -s arc).]
[Figure: FSA for English verbal inflection (states q0–q3; arcs include reg-verb-stem and irreg-verb-stem, with suffix arcs for past (-ed), past participle (-ed), present participle (-ing), and 3sg (-s)).]
Problem Issues
• While the FSA on the previous slide will
– recognize all the adjectives in the table presented earlier,
– it will also recognize ungrammatical forms like:
• unbig, unfast, oranger, or smally
• This is because adj-root lumps together adjectives that:
1. can occur with un- and -ly: clear, happy, and real
2. cannot occur with them: big, small, etc.
(A minimal two-class check is sketched below.)
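Below is a minimal Python sketch of the two-class fix; the word lists and class names are illustrative assumptions, not from the slides. un- and -ly attach only to roots in the first class, so forms like unbig and smally are rejected.

# Minimal sketch (illustrative word lists): a morphotactic check for the
# adjective FSA, with adj-root split into two classes so that ungrammatical
# forms such as "unbig" or "smally" are rejected.
ADJ_ROOT_1 = {"clear", "happy", "real"}   # may take un- and -ly
ADJ_ROOT_2 = {"big", "small"}             # may not take un- or -ly

def accepts(morphemes):
    """morphemes: a pre-segmented tuple such as ('un', 'happy', 'ly')."""
    parts = list(morphemes)
    has_un = bool(parts) and parts[0] == "un"
    if has_un:
        parts = parts[1:]
    if not parts:
        return False
    root, suffixes = parts[0], tuple(parts[1:])
    if root in ADJ_ROOT_1:
        allowed = {(), ("er",), ("est",), ("ly",)}
    elif root in ADJ_ROOT_2:
        if has_un:
            return False                   # *unbig, *unsmall
        allowed = {(), ("er",), ("est",)}  # no -ly: *smally
    else:
        return False
    return suffixes in allowed

assert accepts(("un", "happy", "ly"))   # unhappily (before spelling rules apply)
assert not accepts(("un", "big"))       # *unbig
assert not accepts(("small", "ly"))     # *smally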
[Figure: expanded FSA for a few English nouns, with the classes reg-noun, irreg-sg-noun, and irreg-pl-noun spelled out letter by letter (e.g., f-o-x, c-a-t, g-o-o-s-e / g-e-e-s-e) plus the plural -s arc.]
Expanded FSA for a few English nouns with their inflection. Note that this automaton will
incorrectly accept the input foxs. We will see, beginning on page 62 (ref book), how to correctly
deal with the inserted e in foxes.
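A minimal sketch of the nominal-inflection FSA as a Python transition table; the word lists and the choice of accepting states are illustrative assumptions. It operates on pre-segmented morphemes, so the foxs/foxes spelling issue does not arise here.

# Minimal sketch (illustrative word lists): the nominal-inflection FSA as a
# transition table over morpheme classes; q1 and q2 are taken to be accepting
# so that bare singular stems are recognized.
REG_NOUN      = {"fox", "cat", "dog", "aardvark"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}

TRANSITIONS = {
    "q0": {"reg-noun": "q1", "irreg-sg-noun": "q2", "irreg-pl-noun": "q2"},
    "q1": {"plural-s": "q2"},
    "q2": {},
}
ACCEPTING = {"q1", "q2"}

def classify(morpheme):
    """Map a morpheme onto the arc label used in the FSA, or None."""
    if morpheme in REG_NOUN:
        return "reg-noun"
    if morpheme in IRREG_SG_NOUN:
        return "irreg-sg-noun"
    if morpheme in IRREG_PL_NOUN:
        return "irreg-pl-noun"
    if morpheme == "s":
        return "plural-s"
    return None

def recognize(morphemes):
    """morphemes: a pre-segmented sequence such as ('fox', 's')."""
    state = "q0"
    for m in morphemes:
        label = classify(m)
        if label is None or label not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][label]
    return state in ACCEPTING

assert recognize(("cat", "s"))        # cats
assert recognize(("geese",))          # geese
assert not recognize(("geese", "s"))  # *geeses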
Finite-State Transducers (FST)
• An FSA can represent the morphotactic structure of a lexicon and thus can be
used for word recognition.
• In this section we introduce finite-state transducers and show how
they can be applied to morphological parsing.
• A transducer maps between one representation and another;
– A finite-state transducer or FST is a type of finite automaton which maps
between two sets of symbols.
– FST can be visualized as a two-tape automaton which recognizes or generates
pairs of strings.
– Intuitively, we can do this by labeling each arc in the FSM (finite-state
machine) with two symbol strings, one from each tape.
– In the figure in the next slide an FST is depicted where each arc is labeled by
an input and output string, separated by a colon.
A Finite-State Transducer (FST)
[Figure: a simple FST with states q0 and q1; each arc is labeled with an input:output pair such as b:b, a:ba, a:b, and b:ε.]
Sequential Transducers
• Sequential transducers are not necessarily sequential on their
output. In the FST example on the previous slide, two distinct transitions
leaving state q0 have the same output (b).
– The inverse of a sequential transducer may thus not be sequential, so we
always need to specify the direction of the transduction when
discussing sequentiality.
• Formally, the definition of sequential transducers modifies the δ
and σ functions slightly:
– δ becomes a function from Q × Σ* to Q (rather than to 2^Q), and
– σ becomes a function from Q × Σ* to Δ* (rather than to 2^(Δ*)).
• A subsequential transducer is a generalization of the sequential
transducer that generates an additional output string at its final
states, concatenating it onto the output produced so far.
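A minimal Python sketch of these definitions; the machine itself is hypothetical. δ returns exactly one next state and σ exactly one output string per (state, symbol) pair, and a subsequential final-output table appends a string at the end.

# Minimal sketch (hypothetical machine): sequential delta/sigma are
# deterministic on the input, so each (state, symbol) pair has exactly one
# next state and one output string.
delta = {("q0", "b"): "q1", ("q1", "a"): "q1", ("q1", "b"): "q0"}
sigma = {("q0", "b"): "b",  ("q1", "a"): "ba", ("q1", "b"): ""}
final_output = {"q0": "", "q1": "#"}   # subsequential: extra output at the end

def transduce(word, start="q0"):
    state, out = start, []
    for symbol in word:
        if (state, symbol) not in delta:
            raise ValueError(f"no transition from {state} on {symbol!r}")
        out.append(sigma[(state, symbol)])
        state = delta[(state, symbol)]
    return "".join(out) + final_output[state]

# Deterministic on its input, so processing time is linear in the input length.
print(transduce("baab"))   # 'bbaba'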
Importance of Sequential and
Subsequential Transducers
• Sequential and subsequential FSTs are:
– efficient because they are deterministic on their input →
they can be processed in time proportional to the
number of symbols in the input (linear in their input
length) rather than being proportional to some much
larger number that is a function of the number of
states.
[Figure: a subsequential transducer with states q0–q3; arcs labeled a:a and b:a/b:b, with an additional output symbol (a or b) emitted at the final states.]
Lexical:  c a t +N +PL
Surface:  c a t s
Schematic examples of the lexical and surface tapes; the actual transducers will involve
intermediate tapes as well.
Lexical Tape
• For finite-state morphology it is convenient to view an FST as having two tapes.
– The upper or lexical tape is composed of characters from one (input) alphabet
Σ.
– The lower or surface tape is composed of characters from another (output)
alphabet Δ.
• In the two-level morphology of Koskenniemi (1983), each arc is allowed to have only a
single symbol from each alphabet.
– A two-symbol alphabet can be obtained by combining the alphabets Σ and Δ to make a
new alphabet Σ′.
– The new alphabet Σ′ makes the relationship to FSAs clear:
Σ′ is a finite alphabet of complex symbols:
– Each complex symbol is composed of an input-output pair i:o;
» i – one symbol from the input alphabet Σ
» o – one symbol from the output alphabet Δ
Σ′ ⊆ Σ × Δ
Σ and Δ may each also include the epsilon symbol ε.
Lexical Tape
• Comparing an FSA to an FST for modeling the
morphology of a language, the following can be
observed:
– FSA – accepts a language stated over a finite
alphabet of single symbols; e.g., the sheep
language:
Σ = {b, a, !}
– FST – as defined in the previous slides, accepts a
language stated over pairs of symbols, as in:
Σ′ = {a:a, b:b, !:!, a:!, a:ε, ε:!}
Feasible Pairs
• In two-level morphology, the pairs of symbols in Σ′
are also called feasible pairs.
– Each feasible pair symbol a:b in the transducer alphabet Σ′
expresses how the symbol a from one tape is mapped to the
symbol b on the other tape.
– Example: a:ε means that an a on the upper tape will
correspond to nothing on the lower tape.
• We can write regular expressions in the complex
alphabet Σ′ just as in the case of FSAs.
Default Pairs
• Since it is most common for symbols to map to themselves (in
two-level morphology) we call pairs like a:a default pairs, and
just refer to them by the single letter a.
FST Morphological Parser
• Starting from the morphotactic FSAs covered earlier, and adding a lexical
tape and the appropriate morphological features, we can build an
FST morphological parser.
[Figure: fragment of the expanded nominal-inflection FST Tlex, with an irreg-pl-noun arc and the features +N and +Pl mapping to ε and the boundary markers.]
Expanded FST Tlex, built from Tnum, for a few English nouns with their inflection. Note that this automaton will
incorrectly accept the input foxs. We will see later how to correctly deal with the inserted e in foxes.
FST
• The resulting transducer shown in the previous slide will map:
– plural nouns into the stem plus the morphological marker +Pl,
– singular nouns into the stem plus the morphological marker +Sg.
• Example:
– cats → cat +N +Pl
– c:c a:a t:t +N:ε +Pl:^s#
– The output symbols include the morpheme and word boundary markers ^
and # respectively. Thus the lower labels do not correspond exactly to
the surface level.
– Hence the (output) tape with these morpheme boundary markers is
referred to as intermediate, as shown in the next figure.
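A small sketch of the cats example as a sequence of i:o pair symbols and its projections onto the lexical and intermediate tapes; using the empty string for ε is an illustrative choice.

# Minimal sketch: the analysis of "cats" as a sequence of i:o pair symbols.
# "" stands for epsilon; "^" and "#" are the morpheme and word boundary markers.
pairs = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+Pl", "^s#")]

lexical      = "".join(i for i, o in pairs)   # 'cat+N+Pl'
intermediate = "".join(o for i, o in pairs)   # 'cat^s#'
print(lexical, "<->", intermediate)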
Lexical and Intermediate Tapes
Lexical:       f o x +N +PL
Intermediate:  f o x ^ s #
A schematic view of the lexical and intermediate tapes
Transducers and Orthographic Rules
• The method described in the previous section will
successfully recognize words like aardvarks and mice.
• However, simply concatenating morphemes won't work for
cases where there is a spelling change:
– incorrect rejection of inputs like foxes, and
– incorrect acceptance of inputs like foxs.
• This is because English often
requires spelling changes at morpheme boundaries:
– Introduction of spelling (or orthographic) rules
Notations for Writing Orthographic Rules
• Implementing a rule in a transducer is important in general for
speech and language processing.
• The table below introduces a number of rules.
Name                 Description of Rule                              Example
Consonant doubling   1-letter consonant doubled before -ing/-ed       beg/begging
E insertion          e added after -s, -z, -x, -ch, -sh before -s     watch/watches
Lexical:       f o x +N +PL
Intermediate:  f o x ^ s #
Surface:       f o x e s
Orthographic Rule Example
• The E-insertion rule from the table might be formulated as:
– "Insert an e on the surface tape just when the lexical tape has a morpheme
ending in x (or z, etc.) and the next morpheme is -s". Below is a formalization
of this rule:

ε → e / { x, s, z } ^ __ s #
This rule notation is due to Chomsky and Halle (1968) and is of the
form:
a → b / c __ d: "rewrite a as b when it occurs between c and d".
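For this one rule, the context-sensitive rewrite can be approximated with a regular-expression substitution on the intermediate tape; this is only a sketch of the rule's effect, not the two-level transducer itself.

import re

# Minimal sketch (regex approximation of the E-insertion rule, not the
# two-level transducer): insert e after a morpheme-final x, s, or z when the
# next morpheme is -s; ^ and # mark morpheme and word boundaries.
def e_insertion(intermediate):
    return re.sub(r"([xsz])\^(s#)", r"\1e\2", intermediate)

def to_surface(intermediate):
    # drop any remaining boundary markers to obtain the surface string
    return e_insertion(intermediate).replace("^", "").replace("#", "")

print(to_surface("fox^s#"))   # foxes
print(to_surface("cat^s#"))   # cats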
Orthographic Rule Example
• The symbol ε means an empty transition;
replacing it means inserting something.
• Morpheme boundaries ^ are deleted by
default on the surface level (^:ε).
• Since the # symbol marks a word boundary, the
rule on the previous slide means:
– insert an e after a morpheme-final x, s, or z, and
before the morpheme s.
Transducer for E-insertion rule
[Figure: the transducer for the E-insertion rule (states q0–q5), with arcs labeled z/s/x, ^:ε, ε:e, s, #, and other; see the state-transition table below.]
State/Input   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:            1     1     1     0     -    0      0
q1:            1     1     1     2     -    0      0
q2:            5     1     1     0     3    0      0
q3             4     -     -     -     -    -      -
q4             -     -     -     -     -    0      -
q5             1     1     1     2     -    -      0
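The table above can be run directly. Below is a minimal sketch in the generation direction (intermediate tape in, surface string out); the ε:e arc makes the machine non-deterministic, so a small backtracking search is used, and ^ and # are assumed to output nothing on the surface.

# Minimal sketch (generation direction) of the E-insertion transducer from the
# table above.  "" as an arc's input means an epsilon transition (the ε:e
# insertion); ^ and # are assumed to map to nothing on the surface.
ACCEPTING = {"q0", "q1", "q2"}   # assumed: the states marked with a colon above

ARCS = {  # state -> list of (input, output, next state)
    "q0": [("s", "s", "q1"), ("x", "x", "q1"), ("z", "z", "q1"),
           ("^", "", "q0"), ("#", "", "q0"), ("other", None, "q0")],
    "q1": [("s", "s", "q1"), ("x", "x", "q1"), ("z", "z", "q1"),
           ("^", "", "q2"), ("#", "", "q0"), ("other", None, "q0")],
    "q2": [("s", "s", "q5"), ("x", "x", "q1"), ("z", "z", "q1"),
           ("^", "", "q0"), ("", "e", "q3"), ("#", "", "q0"), ("other", None, "q0")],
    "q3": [("s", "s", "q4")],
    "q4": [("#", "", "q0")],
    "q5": [("s", "s", "q1"), ("x", "x", "q1"), ("z", "z", "q1"),
           ("^", "", "q2"), ("other", None, "q0")],
}

def generate(tape, state="q0", out=""):
    """Map an intermediate string such as 'fox^s#' to its surface form (or None)."""
    if not tape:
        return out if state in ACCEPTING else None
    c = tape[0]
    for inp, o, nxt in ARCS[state]:
        if inp == "":                          # epsilon arc: consume no input
            result = generate(tape, nxt, out + o)
        elif inp == c or (inp == "other" and c not in "sxz^#"):
            result = generate(tape[1:], nxt, out + (c if o is None else o))
        else:
            continue
        if result is not None:
            return result
    return None

print(generate("fox^s#"))   # foxes
print(generate("cat^s#"))   # cats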
Combining FST Lexicon and Rules
[Figure: cascade of the LEXICON-FST with the orthographic-rule transducers FST1 ... FSTn: the intermediate tape f o x ^ s # produced by the lexicon FST is mapped to the surface tape f o x e s by the rule transducers.]
Tlex FST Combined with Te-insert FST
[Figure: accepting foxes with Tlex composed with Te-insert. Tlex (state sequence 0 1 2 5 6 7) relates the lexical tape f o x +N +Pl to the intermediate tape f o x ^ s #; Te-insert (state sequence 0 0 0 1 2 3 4 0) relates the intermediate tape to the surface tape f o x e s.]
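A compact sketch of this cascade; a small dictionary stands in for Tlex and a regex substitution stands in for Te-insert, both illustrative assumptions rather than the book's transducers.

import re

# Minimal sketch of the cascade: lexical tape -> (Tlex stand-in) -> intermediate
# tape -> (Te-insert stand-in) -> surface tape.
T_LEX = {                       # a tiny dictionary standing in for the lexicon FST
    "fox+N+Pl":   "fox^s#",
    "fox+N+Sg":   "fox#",
    "cat+N+Pl":   "cat^s#",
    "goose+N+Pl": "geese#",
}

def t_e_insert(intermediate):
    # E-insertion, then removal of the remaining boundary markers
    s = re.sub(r"([xsz])\^(s#)", r"\1e\2", intermediate)
    return s.replace("^", "").replace("#", "")

def generate_surface(lexical):
    return t_e_insert(T_LEX[lexical])

print(generate_surface("fox+N+Pl"))    # foxes
print(generate_surface("goose+N+Pl"))  # geese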
Finite-State Transducers
• The exact same cascade with the same state sequences is used when the
machine is
– Generating the surface tape from the lexical tape
– Parsing the lexical tape from the surface tape.
• Parsing is slightly more complicated than generation, because of the
problem of ambiguity.
– Example: foxes can also be a verb (meaning "to baffle or confuse"), and hence
the lexical parse of foxes could be
• fox +V +3Sg: That trickster foxes me every time, as well as
• fox +N +Pl: I saw two foxes yesterday.
– For ambiguous cases of this sort the transducer is not capable of deciding.
– Disambiguation will require some external evidence such as the surrounding
words. Disambiguation algorithms are discussed in Ch. 5 and Ch. 20.
– In the absence of such external evidence, the best an FST can do is enumerate all
possible choices – transducing fox^s# into both fox +V +3Sg and fox +N +Pl.
FSTs and Ambiguity
• There is a kind of ambiguity that FSTs need to handle:
– Local ambiguity that occurs during the process of parsing.
– Example: parsing the input verb assess.
• After seeing ass, the E-insertion transducer may propose that the e that follows was
inserted by the spelling rule – as far as the FST is concerned, we might have been parsing
the word asses.
• It is not until we fail to see the # after asses, and instead run into another s,
that we realize we have gone down an incorrect path.
• Because of this non-determinism, FST-parsing algorithms need to
incorporate some sort of search algorithm.
• Note that many possible spurious segmentations of the input, such as
parsing assess as ^a^s^ses^s, will be ruled out since no entry in the
lexicon will match this string. (A toy illustration follows.)
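A toy sketch of this point; the lexicon and the suffix handling are illustrative assumptions. Segmentations whose pieces are not in the lexicon simply produce no parse.

# Minimal sketch (toy lexicon): candidate segmentations of the surface string
# survive only if their pieces are in the lexicon, so spurious splits such as
# a^s^ses^s of "assess" are ruled out automatically.
LEXICON = {"assess": "assess+V", "ass": "ass+N", "fox": "fox+N"}

def parses(surface):
    results = []
    if surface in LEXICON:                        # whole word as one lexicon entry
        results.append(LEXICON[surface])
    for suffix in ("s", "es"):                    # stem + plural -s (with E-insertion)
        if not surface.endswith(suffix):
            continue
        stem = surface[: -len(suffix)]
        if stem not in LEXICON or not LEXICON[stem].endswith("+N"):
            continue
        needs_e = stem.endswith(("s", "x", "z"))
        if (suffix == "es") == needs_e:           # "es" only after s/x/z, else plain "s"
            results.append(LEXICON[stem] + "+Pl")
    return results

print(parses("assess"))   # ['assess+V'] - the garden path through "asses" dies out
print(parses("asses"))    # ['ass+N+Pl']
print(parses("foxes"))    # ['fox+N+Pl']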