1. Introduction
2. Word classes
3. Tag sets and problem definition
What is a POS?
Morphological and Syntactic Definition of POS
Penn Treebank POS Tag set
Example of Penn Treebank Tagging of a Brown Corpus Sentence
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Book/VB that/DT flight/NN ./.
• Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
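As a quick illustration (not part of the original slides), NLTK's default tagger outputs Penn Treebank tags. This is a minimal sketch; it assumes the nltk package is installed, and resource names can vary slightly across NLTK versions.

```python
# Sketch: tagging the example sentences with NLTK's default Penn Treebank
# tagger (assumes nltk is installed; resource names vary by NLTK version).
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

for sentence in ["Book that flight.", "Does that flight serve dinner?"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# e.g. [('Book', 'VB'), ('that', 'DT'), ('flight', 'NN'), ('.', '.')]
# (the statistical tagger's output may differ slightly from the hand tags)
```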
The Problem
Why Do We Care about POS?
• Pronunciation
  • Hand me the lead pipe.
• Predicting what words can be expected next
  • Personal pronoun (e.g., I, she) ____________
• Stemming
  • -s means singular for verbs, plural for nouns
• Machine translation
  • (E) content +N → (F) contenu +N
  • (E) content +Adj → (F) content +Adj or satisfait +Adj
Definition
1. Introduction
2. WORD CLASSES
3. Tag sets and problem definition
Open class vs. closed class
• Open class: Nouns, Verbs, Adjectives, Adverbs
  • Why "open"? New ones can be created all the time
  • English has 4: Nouns, Verbs, Adjectives, Adverbs
  • Many languages have all 4, but not all!
  • In Lakhota and possibly Chinese, what English treats as adjectives act more like verbs
• Every known human language has nouns and verbs
  • Nouns: people, places, things
    • Classes of nouns: proper vs. common, count vs. mass
  • Verbs: actions and processes
  • Adjectives: properties, qualities
  • Adverbs: hodgepodge! (Unfortunately, John walked home extremely slowly yesterday)
  • Numerals: one, two, three, third, …
• Closed class: a relatively fixed membership
  • Usually function words (short common words which play a role in grammar)
  • Differ more from language to language than open class words
  • Examples:
    • prepositions: with, on, under, over, near, by, …
    • particles: up, down, on, off, …
    • determiners: the, a, an, …
    • pronouns: I, she, him, who, …
    • conjunctions: and, or, but, …
    • auxiliary verbs: can, may, should, …
1. Introduction
2. Word classes
3. TAG SETS AND PROBLEM DEFINITION
How do we assign POS tags to words in a sentence?
How hard is POS tagging? Measuring ambiguity
Potential Sources of Disambiguation
• Many words have only one POS tag (e.g. is, Mary, very, smallest)
• Others have a single most likely tag (e.g. a, dog)
• But tags also tend to co-occur regularly with other tags (e.g. Det, N)
• We can look at POS likelihoods P(ti | ti−1) to disambiguate sentences and to assess sentence likelihoods (see the sketch below)
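To make the idea of tag co-occurrence concrete, here is a small sketch (illustrative only; the toy hand-tagged corpus is an assumption) that estimates P(ti | ti−1) from tag bigram counts.

```python
# Sketch: estimating tag-transition probabilities P(t_i | t_{i-1})
# from a tiny hand-tagged corpus (toy data, for illustration only).
from collections import Counter, defaultdict

tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("bit", "VBD"), ("the", "DT"), ("man", "NN")],
]

bigram_counts = defaultdict(Counter)
for sentence in tagged_sentences:
    tags = ["<s>"] + [tag for _, tag in sentence]
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[prev][cur] += 1

def p(cur, prev):
    """Maximum-likelihood estimate of P(current tag | previous tag)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

print(p("NN", "DT"))   # 1.0 in this toy corpus: DT is always followed by NN
```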
Algorithms for POS Tagging - Knowledge
• Dictionary
• Morphological rules, e.g.,
• _____-tion
• _____-ly
• capitalization
• N-gram frequencies
• to _____
• DET _____ N
• But what about rare words, e.g., smelt (two verb senses: to melt ore, and the past tense of smell; one noun sense: a small fish)?
• Combining these (see the sketch below)
  • V _____-ing: I was gracking vs. Gracking is fun.
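A minimal sketch of how these knowledge sources might be combined in code. The suffix rules, the context cues, and the tiny dictionary below are illustrative assumptions, not an actual tagger.

```python
# Sketch: combining a dictionary, morphological cues, and local context
# to guess a POS tag for a single word (toy rules, for illustration only).
def guess_tag(word, prev_word=None):
    dictionary = {"the": "DT", "a": "DT", "is": "VBZ", "Mary": "NNP"}
    if word in dictionary:                        # 1. dictionary lookup
        return dictionary[word]
    if prev_word == "to":                         # 2. n-gram / context cue: "to ____"
        return "VB"
    if prev_word in ("is", "was") and word.endswith("ing"):
        return "VBG"                              #    "V ____-ing": I was gracking
    if word.endswith("tion"):                     # 3. morphological cues
        return "NN"
    if word.endswith("ly"):
        return "RB"
    if word[0].isupper():                         # 4. capitalization
        return "NNP"
    return "NN"                                   # default guess for rare words

print(guess_tag("gracking", prev_word="was"))     # VBG
print(guess_tag("celebration"))                   # NN
```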
Algorithms for POS Tagging - Approaches
• Basic approaches
• Rule-Based
• Stochastic: HMM-based
• Transformation-Based Tagger (Brill) (we won’t cover this)
• Do we return one best answer or several answers and let later steps
decide?
• How does the requisite knowledge get entered?
• Training/Teaching an NLP Component
• Each step of NLP analysis requires a module that knows what to do. How do such
modules get created?
• By hand. Advantages: based on sound linguistic principles, sensible to people, explainable.
• By training. Advantages: less work, extensible to new languages, customizable for specific domains.
Rule-Based POS Tagging
• Basic Idea:
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1
is not a verb like “consider” then eliminate non-adv else eliminate adv.
• Typically more than 1000 hand-written rules
• Leaving the correct tag for each word
Rule-Based POS Tagging
Start with a dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
Rule-Based POS Tagging
Use the dictionary to assign every possible tag:
She/PRP  promised/{VBN, VBD}  to/TO  back/{VB, JJ, RB, NN}  the/DT  bill/{NN, VB}
Rule-Based POS Tagging
Write rules to eliminate tags:
Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"
Applying the rule leaves: She/PRP  promised/VBD  to/TO  back/{VB, JJ, RB, NN}  the/DT  bill/{NN, VB}
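A small sketch of the two steps above: assign every tag the dictionary allows, then apply the elimination rule. The encoding of the rule as a Python function is an illustrative assumption.

```python
# Sketch: rule-based tagging as "assign all dictionary tags, then eliminate".
dictionary = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def tag_lattice(words):
    """Step 1: assign every possible tag from the dictionary."""
    return [set(dictionary[w.lower()]) for w in words]

def eliminate_vbn_after_start_prp(words, lattice):
    """Step 2: eliminate VBN if VBD is an option when VBN|VBD follows <start> PRP."""
    for i in range(1, len(words)):
        follows_start_prp = (i == 1 and lattice[0] == {"PRP"})
        if follows_start_prp and {"VBN", "VBD"} <= lattice[i]:
            lattice[i].discard("VBN")
    return lattice

words = "She promised to back the bill".split()
print(eliminate_vbn_after_start_prp(words, tag_lattice(words)))
# "promised" keeps only VBD; the other words keep all their dictionary tags
```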
Sample ENGTWOL (ENGlish TWO Level analysis) Lexicon
Rule-Based POS Tagging
ENGTWOL
• 1st stage: Run words through a morphological analyzer to get all parts
of speech.
• Example: Pavlov had shown that salivation …
Rule-Based POS Tagging
ENGTWOL
• 2nd stage: Figure out what to do about words that are unknown or ambiguous. Two approaches:
  • Rules that specify what to do.
  • Rules that specify what not to do. Example rule (from ENGTWOL), given input "that":
    If
      (+1 A/ADV/QUANT)   ; the next word is an adjective, adverb, or quantifier (It isn't that odd)
      (+2 SENT-LIM)      ; and the word after that is a sentence boundary
      (NOT -1 SVOC/A)    ; and the previous word is not a verb like "consider", which allows
                         ;   adjective complements (I consider that odd vs. I believe that he is right)
    then eliminate non-ADV tags
    else eliminate ADV tags
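A hedged sketch of this constraint as code. The token representation (word, candidate-tag set), the Penn-style tag names standing in for ENGTWOL's A/ADV/QUANT labels, and the small SVOC/A verb list are all illustrative assumptions.

```python
# Sketch: the adverbial-"that" constraint as code. Each token is a
# (word, candidate_tags) pair; SVOC/A verbs like "consider" take
# adjective complements. Representation and word list are illustrative.
SVOC_A_VERBS = {"consider", "find"}              # illustrative subset

def disambiguate_that(tokens, i):
    """Apply the ENGTWOL-style rule to the token at position i ('that')."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    nxt2 = tokens[i + 2] if i + 2 < len(tokens) else None
    prev = tokens[i - 1] if i > 0 else None
    word, tags = tokens[i]
    if (nxt and nxt[1] & {"JJ", "RB", "QUANT"}          # +1 adj/adv/quantifier
            and nxt2 and nxt2[0] in {".", "!", "?"}      # +2 sentence boundary
            and not (prev and prev[0] in SVOC_A_VERBS)): # NOT -1 SVOC/A
        tags = {"RB"}                    # keep only the adverbial reading
    else:
        tags = tags - {"RB"}             # eliminate the adverbial reading
    return word, tags

# "It isn't that odd ." -> adverbial; "I consider that odd ." -> not adverbial
sent = [("It", {"PRP"}), ("is", {"VBZ"}), ("n't", {"RB"}),
        ("that", {"RB", "DT", "IN"}), ("odd", {"JJ"}), (".", {"."})]
print(disambiguate_that(sent, 3))   # ('that', {'RB'})
```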
Markov chain
Markov chain - Example
• Say that there are only three kinds of weather conditions, namely Rainy, Sunny and Cloudy
• Peter is a small kid who loves to play outside. He loves it when the weather is sunny, because all his friends come out to play in the sunny conditions. He hates the rainy weather for obvious reasons.
• Every day, his mother observes the weather in the morning (that is when he usually goes out to play) and, as always, Peter comes up to her right after getting up and asks her to tell him what the weather is going to be like. Since she is a responsible parent, she wants to answer that question as accurately as possible. But the only thing she has is a set of observations taken over multiple days as to how the weather has been.
• How does she make a prediction of the weather for today based on what the weather has been for the past N days?
Markov chain - Example
• Let’s say we have a sequence: Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy, ….;
so, in a day we can be in any of the 3 states
• We can use the following state sequence notation: q1, q2, q3, q4, q5, …, where qi ∈ {Sunny, Rainy, Cloudy}.
• In order to compute the probability of tomorrow's weather we can use the Markov property: the probability of the next state depends only on the current state, P(qt+1 | q1, …, qt) = P(qt+1 | qt) (see the sketch below).
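As a sketch (the transition and initial probabilities below are invented for illustration), the Markov property lets us compute the probability of a whole weather sequence from the transition table alone.

```python
# Sketch: a weather Markov chain. Transition probabilities are invented
# for illustration; each row sums to 1.
P_INIT = {"Sunny": 0.5, "Rainy": 0.25, "Cloudy": 0.25}
P_TRANS = {
    "Sunny":  {"Sunny": 0.6, "Rainy": 0.1, "Cloudy": 0.3},
    "Rainy":  {"Sunny": 0.3, "Rainy": 0.4, "Cloudy": 0.3},
    "Cloudy": {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3},
}

def sequence_probability(states):
    """P(q1..qn) = P(q1) * product of P(q_t | q_{t-1}) -- the Markov property."""
    prob = P_INIT[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= P_TRANS[prev][cur]
    return prob

print(sequence_probability(["Sunny", "Rainy", "Cloudy", "Cloudy"]))
# 0.5 * 0.1 * 0.3 * 0.3 = 0.0045
```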
Markov chain
• Summary: A Markov chain is a weighted automaton in which
  • weights are probabilities, i.e., all weights are between 0 and 1 and the sum of the weights of all outgoing edges of a state is 1, and
  • the input sequence uniquely determines the states the automaton goes through.
• A Markov chain is actually a bigram language model.
• Markov chains are useful when we want to compute the probability for a sequence of events that we can observe.
Markov Model
• A Markov Model is a stochastic model which models temporal or sequential data, i.e.,
data that are ordered
• It provides a way to model the dependencies of current information (e.g. weather) with
previous information
• It is composed of states, transition scheme between states, and emission of outputs
(discrete or continuous)
• Several goals can be accomplished by using Markov models:
• Learn statistics of sequential data.
• Do prediction or estimation.
• Recognize patterns
Hidden Markov Model (HMM)
• An HMM is a stochastic model where the states of the model are hidden. Each state can emit an output, which is observed.
• Imagine: you were locked in a room for several days and you were asked about the weather outside. The only piece of evidence you have is whether the person who comes into the room bringing your daily meal is carrying an umbrella or not.
• What is hidden? Sunny, Rainy, Cloudy
• What can you observe? Umbrella or Not
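A minimal sketch of this setup as data, for a single day: the weather states are hidden and only the umbrella observation is visible. All probability values here, including the prior, are invented for illustration; transitions between days are left out of this first sketch.

```python
# Sketch: the locked-room HMM setup. The weather states are hidden; the
# only observable is whether the caretaker carries an umbrella.
# All probability values below are invented for illustration.
hidden_states = ["Sunny", "Rainy", "Cloudy"]
observables = ["umbrella", "no umbrella"]

prior = {"Sunny": 0.5, "Rainy": 0.25, "Cloudy": 0.25}   # P(weather)
emission = {                                            # P(observation | weather)
    "Sunny":  {"umbrella": 0.1, "no umbrella": 0.9},
    "Rainy":  {"umbrella": 0.8, "no umbrella": 0.2},
    "Cloudy": {"umbrella": 0.3, "no umbrella": 0.7},
}

def posterior(observation):
    """P(weather | observation) by Bayes' theorem, for a single day."""
    joint = {w: prior[w] * emission[w][observation] for w in hidden_states}
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

print(posterior("umbrella"))   # Rainy becomes the most probable hidden state
```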
HMM
• From Bayes' Theorem, we can obtain the probability for a particular day as, e.g.:
  P(Rainy | umbrella) = P(umbrella | Rainy) · P(Rainy) / P(umbrella)
• Thus we can reason backwards from what we observe (umbrella or not) to the hidden weather state.
HMM components and parameters
• A set of N hidden states q1, …, qN
• A transition probability matrix A, where aij is the probability of moving from state i to state j (each row sums to 1)
• A sequence of T observations O = o1, …, oT
• A matrix of observation (emission) likelihoods B, where bi(ot) is the probability of observation ot being generated from state i
• An initial probability distribution over states (equivalently, transition probabilities from a special start state)
• λ = (A, B) denotes the HMM parameters
Ice-cream
• The two hidden states (H and C) correspond to hot and cold weather,
• The observations (drawn from the alphabet O = {1,2,3}) correspond to the number of ice
creams eaten by Jason on a given day
HMM tagger
HMM decoding: putting it all together
• The goal of HMM decoding is to choose the tag sequence t1 … tn that is most probable given the observation sequence of n words w1 … wn:
  t̂1 … t̂n = argmax over t1 … tn of P(t1 … tn | w1 … wn)
HMM - the three fundamental problems
• Problem 1 (Likelihood): given an HMM λ = (A, B) and an observation sequence O, compute the likelihood P(O | λ)
• Problem 2 (Decoding): given an HMM λ = (A, B) and an observation sequence O, find the best hidden state sequence Q
• Problem 3 (Learning): given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B

How can we solve the problems?
• Problem 1: the forward algorithm
• Problem 2: the Viterbi algorithm
• Problem 3: the forward-backward (Baum-Welch) algorithm
HMM - Problem 1
Example: given the ice-cream eating HMM, what is the probability of the sequence 3 1 3? Note: we don't know what the hidden state sequence is.
• In an HMM, each hidden state produces only a single observation, so the sequence of hidden states and the sequence of observations have the same length
• Given:
  • A particular hidden state sequence Q = q1, q2, ..., qT
  • An observation sequence O = o1, o2, ..., oT
• Then the likelihood of the observation sequence given that state sequence is
  P(O | Q) = ∏ t=1..T P(ot | qt)
  and the joint probability of being in a particular weather sequence Q and generating a particular sequence O is
  P(O, Q) = P(O | Q) · P(Q) = ∏ t=1..T P(ot | qt) · ∏ t=1..T P(qt | qt−1)
• But we don't know what the hidden state sequence was
HMM - Problem 1
Example: given the ice-cream eating HMM, what is the probability of the sequence 3 1 3? Note: we don't know what the hidden state sequence is.
• Example: the computation of the forward probability for our ice-cream observation 3 1 3 from one possible hidden state sequence (hot hot cold)
• We don't know what the hidden state sequence was; the observation 3 1 3 has eight possible 3-state hidden sequences:
  • cold cold cold
  • cold cold hot
  • …
• Problem: with N hidden states and an observation sequence of T observations, there are N^T possible hidden sequences
• Solution: the forward algorithm
The forward algorithm
• It is an efficient O(N²T) dynamic programming algorithm that uses a table (trellis) to store intermediate values as it builds up the probability of the observation sequence.
• The forward algorithm computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence.
• Each cell of the trellis αt(j) represents the probability of being in state j after seeing the first t observations, given the automaton λ:
  αt(j) = P(o1, o2, ..., ot, qt = j | λ)
• Given state qj at time t, the value is computed as:
  αt(j) = Σ i=1..N αt−1(i) · aij · bj(ot)
The forward algorithm
Visualizing the computation of a single element αt(i) in the trellis: sum all the previous values αt−1, weighted by their transition probabilities a, and multiply by the observation probability bi(ot).
The forward algorithm
Observations: 3 1 3

α1(C) = P(C|start) · P(3|C) = 0.2 · 0.1 = 0.02
α1(H) = P(H|start) · P(3|H) = 0.8 · 0.4 = 0.32

α2(C) = α1(C) · P(C|C) · P(1|C) + α1(H) · P(C|H) · P(1|C)
      = 0.02 · 0.5 · 0.5 + 0.32 · 0.4 · 0.5 = 0.005 + 0.064 = 0.069
α2(H) = α1(C) · P(H|C) · P(1|H) + α1(H) · P(H|H) · P(1|H)
      = 0.02 · 0.5 · 0.2 + 0.32 · 0.6 · 0.2 = 0.002 + 0.0384 = 0.0404

α3(C) = α2(C) · P(C|C) · P(3|C) + α2(H) · P(C|H) · P(3|C)
      = 0.069 · 0.5 · 0.1 + 0.0404 · 0.4 · 0.1 = 0.00345 + 0.001616 = 0.005066
α3(H) = α2(C) · P(H|C) · P(3|H) + α2(H) · P(H|H) · P(3|H)
      = 0.069 · 0.5 · 0.4 + 0.0404 · 0.6 · 0.4 = 0.0138 + 0.009696 = 0.023496
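The trellis above can be reproduced with a short forward-algorithm sketch. The parameters are taken from the worked example; the brute-force check at the end enumerates all 2³ hidden sequences and must give the same total.

```python
# Sketch: the forward algorithm for the ice-cream HMM, using the
# parameters from the trellis above. The brute-force check sums the
# joint probability over all N^T hidden sequences and must agree.
from itertools import product

states = ["C", "H"]
start = {"C": 0.2, "H": 0.8}                       # P(state | start)
trans = {"C": {"C": 0.5, "H": 0.5},                # P(next | current)
         "H": {"C": 0.4, "H": 0.6}}
emit = {"C": {1: 0.5, 2: 0.4, 3: 0.1},             # P(ice creams | state)
        "H": {1: 0.2, 2: 0.4, 3: 0.4}}

def forward(obs):
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: sum(alpha[-1][p] * trans[p][s] for p in states) * emit[s][o]
                      for s in states})
    return alpha, sum(alpha[-1].values())

def brute_force(obs):
    total = 0.0
    for path in product(states, repeat=len(obs)):  # all N^T hidden sequences
        p = start[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

alpha, likelihood = forward([3, 1, 3])
print(alpha)                    # alpha_1..alpha_3, matching the trellis values
print(likelihood)               # P(3 1 3) = alpha_3(C) + alpha_3(H) = 0.028562
print(brute_force([3, 1, 3]))   # same value
```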
The forward algorithm
• For each possible hidden state sequence (HHH, HHC, HCH, …), we could run the
forward algorithm and compute the likelihood of the observation sequence given that
hidden state sequence.
• Then we could choose the hidden state sequence with the maximum observation
likelihood.
• It should be clear from the previous section that we cannot do this because there are an
exponentially large number of state sequences.
HMM - Problem 2: Decoding
• For any model, such as an HMM, that contains hidden variables, the task of determining
which sequence of variables is the underlying source of some sequence of observations is
called the decoding task.
• In the ice-cream domain, given a sequence of ice-cream observations 3 1 3 and an HMM,
the task of the decoder is to find the best hidden weather sequence (H H H)
Decoding: Given as input an HMM λ = (A,B) and a sequence of observations O = o1,o2,...,oT,
find the most probable sequence of states Q = q1q2q3 ...qT .
• Viterbi algorithm
  • a kind of dynamic programming algorithm that makes use of a dynamic programming trellis
  • strongly resembles another dynamic programming variant, the minimum edit distance algorithm
The Viterbi algorithm
• Given:
  • A particular hidden state sequence Q = q1, q2, ..., qT
  • An observation sequence O = o1, o2, ..., oT
• Each cell of the trellis, vt(j), represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q1, ..., qt−1, given the automaton λ:
  vt(j) = max over q1, ..., qt−1 of P(q1, ..., qt−1, o1, ..., ot, qt = j | λ)
• The value of each cell vt(j) is computed by recursively taking the most probable path that could lead us to this cell:
  vt(j) = max i=1..N vt−1(i) · aij · bj(ot)
The Viterbi algorithm
Viterbi algorithm for finding the optimal sequence of hidden states: given an observation sequence and an HMM λ = (A, B), the algorithm returns the state path through the HMM that assigns maximum likelihood to the observation sequence.
The Viterbi algorithm
The value of each cell vt(j) is computed by recursively taking the most probable path that could lead us to this cell.
The Viterbi algorithm vs. the forward algorithm
• The Viterbi algorithm is identical to the forward algorithm EXCEPT it takes the max over the
previous path probabilities whereas the forward algorithm takes the sum.
• The Viterbi algorithm has one component that the forward algorithm doesn’t have:
backpointers
• Why?
• The forward algorithm needs to produce an observation likelihood,
• The Viterbi algorithm must produce a probability and also the most likely state sequence.
• It computes this best state sequence by keeping track of the path of hidden states that led to each state and then, at the end, backtracing the best path to the beginning (the Viterbi backtrace), as in the sketch below.
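A compact Viterbi sketch for the same ice-cream HMM (same parameters as the forward-algorithm sketch above), showing the max in place of the sum plus the backpointer and backtrace steps.

```python
# Sketch: Viterbi decoding for the ice-cream HMM (same parameters as the
# forward-algorithm sketch). It differs from the forward algorithm only
# in taking max instead of sum, plus backpointers to recover the path.
states = ["C", "H"]
start = {"C": 0.2, "H": 0.8}
trans = {"C": {"C": 0.5, "H": 0.5}, "H": {"C": 0.4, "H": 0.6}}
emit = {"C": {1: 0.5, 2: 0.4, 3: 0.1}, "H": {1: 0.2, 2: 0.4, 3: 0.4}}

def viterbi(obs):
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    backpointer = [{}]
    for o in obs[1:]:
        col, bp = {}, {}
        for s in states:
            # max over previous states instead of the forward algorithm's sum
            best_prev = max(states, key=lambda p: v[-1][p] * trans[p][s])
            col[s] = v[-1][best_prev] * trans[best_prev][s] * emit[s][o]
            bp[s] = best_prev
        v.append(col)
        backpointer.append(bp)
    # Viterbi backtrace: follow backpointers back from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 3]))   # most probable hidden sequence and its probability
```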
Example – "Janet will back the bill."
1. Use the forward algorithm
2. Use the Viterbi algorithm
HMM - Problem 3: Training
• The standard algorithm for HMM training is the forward-backward, or Baum-Welch algorithm (Baum, 1972), a
special case of the Expectation-Maximization or EM algorithm (Dempster et al., 1977).
• The algorithm will let us train both the transition probabilities A and the emission probabilities B of the HMM.
• EM is an iterative algorithm: it computes an initial estimate for the probabilities, then uses those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns
• The real problem: we don't know the counts of being in any of the hidden states
• Solution: the Baum-Welch algorithm solves this by iteratively estimating the counts. We will start with an estimate for the transition and observation probabilities and then use these estimated probabilities to derive better and better probabilities
• It does this by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability
HMM - Problem 3
• The backward probability βt(i) is the probability of seeing the observations from time t+1 to the end, given that we are in state i at time t (and given the automaton λ):
  βt(i) = P(ot+1, ot+2, ..., oT | qt = i, λ)
• Denote by ξt(i, j) the probability of being in state i at time t and state j at time t+1, given the observation sequence and, of course, the model:
  ξt(i, j) = P(qt = i, qt+1 = j | O, λ)
• To compute ξt, we first compute a probability which is similar to ξt but differs in including the probability of the observation; note the different conditioning of O:
  not-quite-ξt(i, j) = P(qt = i, qt+1 = j, O | λ) = αt(i) · aij · bj(ot+1) · βt+1(j)
The forward-backward algorithm
Sketch of Baum-Welch (EM) Algorithm
for Training HMMs
Assume an HMM with N states.
Randomly set its parameters λ=(A,B)
(making sure they represent legal distributions)
Until convergence (i.e., λ no longer changes) do:
E Step: Use the forward/backward procedure to
determine the probability of various possible
state sequences for generating the training data
M Step: Use these probability estimates to
re-estimate values for all of the parameters λ
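A from-scratch sketch of the E and M steps above for a discrete HMM. The random initialization, the toy observation sequence, and the fixed iteration cap (instead of a convergence test) are illustrative assumptions; practical implementations scale α and β or work in log space to avoid underflow, and train on many sequences.

```python
# Sketch: a compact Baum-Welch (EM) trainer for a discrete two-state HMM.
# States, observation alphabet, and random initialization are illustrative.
import random

states, symbols = ["H", "C"], [1, 2, 3]

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def random_dist(keys):
    return normalize({k: random.random() + 0.1 for k in keys})

def forward(obs, start, trans, emit):
    a = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        a.append({s: sum(a[-1][p] * trans[p][s] for p in states) * emit[s][o]
                  for s in states})
    return a

def backward(obs, trans, emit):
    b = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        b.insert(0, {s: sum(trans[s][n] * emit[n][o] * b[0][n] for n in states)
                     for s in states})
    return b

def baum_welch(obs, iterations=20):
    start = random_dist(states)
    trans = {s: random_dist(states) for s in states}
    emit = {s: random_dist(symbols) for s in states}
    for _ in range(iterations):
        alpha, beta = forward(obs, start, trans, emit), backward(obs, trans, emit)
        likelihood = sum(alpha[-1][s] for s in states)
        # E step: expected state and transition counts (gamma and xi)
        gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
                 for t in range(len(obs))]
        xi = [{(i, j): alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
               / likelihood for i in states for j in states}
              for t in range(len(obs) - 1)]
        # M step: re-estimate start, transition, and emission probabilities
        start = {s: gamma[0][s] for s in states}
        trans = {i: normalize({j: sum(x[(i, j)] for x in xi) for j in states})
                 for i in states}
        emit = {s: normalize({k: sum(g[s] for g, o in zip(gamma, obs) if o == k)
                              for k in symbols}) for s in states}
    return start, trans, emit

print(baum_welch([3, 1, 3, 2, 3, 1, 1, 2, 3, 3]))
```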
Self-study
Training and test sets
• We take a set of test sentences
• Hand-label them for part of speech
• The result is a “Gold Standard” test set
• Who does this?
  • Brown corpus: done by U Penn grad students in linguistics
• Don't they disagree?
  • Yes! But they agree on about 97% of the tags
• And if you let the taggers discuss the remaining 3%, they often reach agreement
• But we can’t train our frequencies on the test set sentences
• So for testing the Most-Frequent-Tag algorithm (or any other probabilistic algorithm), we need
2 things:
• A hand-labeled training set: the data that we compute frequencies from, ….
• A hand-labeled test set: The data that we use to compute our % correct.
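A tiny sketch of the "% correct" computation on a hand-labeled test set. The gold and predicted tag sequences below are invented for illustration.

```python
# Sketch: computing tagging accuracy (% correct) against a gold-standard
# hand-labeled test set. The tag sequences below are invented examples.
def tag_accuracy(gold_sentences, predicted_sentences):
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for g, p in zip(gold, predicted):
            correct += (g == p)
            total += 1
    return correct / total

gold = [["PRP", "VBD", "TO", "VB", "DT", "NN"]]
pred = [["PRP", "VBN", "TO", "VB", "DT", "NN"]]
print(f"{tag_accuracy(gold, pred):.1%}")   # 83.3% (5 of 6 tags correct)
```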
Computing % correct
Training and test sets
Evaluation and rule-based taggers