
Part-of-Speech (POS) tagging

Slides adapted from: Dan Jurafsky, Julia Hirschberg, Jim Martin

Read J & M Chapter 8.

You may also want to look at:


http://www.georgetown.edu/faculty/ballc/ling361/tagging_overview.html
1. INTRODUCTION

2. Word classes
3. Tag sets and problem definition

4. POS tagging approaches


• Rule-based POS tagging
• Stochastic POS tagging – Statistical POS tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
5. Evaluation
2
POS examples

• N    noun          chair, bandwidth, pacing
• V    verb          study, debate, munch
• ADJ  adjective     purple, tall, ridiculous
• ADV  adverb        unfortunately, slowly
• P    preposition   of, by, to
• PRO  pronoun       I, me, mine
• DET  determiner    the, a, that, those

3
What is a POS?

• Is this a semantic distinction? For example, maybe Noun is the class


of words for people, places and things. Maybe Adjective is the class
of words for properties of nouns.
• Consider: green book
book is a Noun
green is an Adjective
• Now consider: book worm
This green is very soothing.

4
Morphological and Syntactic Definition of POS

• An Adjective is a word that can fill the blank in:


It’s so __________.
• A Noun is a word that can be marked as plural.
• A Noun is a word that can fill the blank in:
the __________ is
• What is green?
It’s so green.
Both greens could work for the walls.
The green is a little much given the red rug.

5
Penn TreeBank POS Tag set

6
Example of Penn Treebank Tagging
of Brown Corpus Sentence
•The/DT grand/JJ jury/NN commented/VBD on/IN
a/DT number/NN of/IN other/JJ topics/NNS ./.

•VB DT NN .
Book that flight .

•VBZ DT NN VB NN ?
Does that flight serve dinner ?

7
The Problem

• Words often have more than


one word class: this
• This is a nice day = PRP
• This day is nice = DT
• You can go this far = RB

8
Why Do We Care about POS?

• Pronunciation
Hand me the lead pipe.
• Predicting what words can be expected next
• Personal pronoun (e.g., I, she) ____________

• Stemming
• -s means singular for verbs, plural for nouns

• As the basis for syntactic parsing and then meaning extraction


• I will lead the group into the lead smelter.

• Machine translation
• (E) content +N → (F) contenu +N
• (E) content +Adj → (F) content +Adj or satisfait +Adj

9
Definition

• “The process of assigning a part-of-speech or other lexical class


marker to each word in a corpus” (Jurafsky and Martin)

10
1. Introduction

2. WORD CLASSES
3. Tag sets and problem definition

4. POS tagging approaches


• Rule-based POS tagging
• Stochastic POS tagging – Statistical POS tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
5. Evaluation
11
What is a word class?

• Words that somehow ‘behave’ alike:


• Appear in similar contexts
• Perform similar functions in sentences
• Undergo similar transformations
• Basic word classes:
• 8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article,
interjection, pronoun, conjunction, ….
• Called: parts-of-speech, lexical category, word classes, morphological classes,
lexical tags, POS

12
Open class vs. closed class

• Open class: Nouns, Verbs, Adjectives, Adverbs
  • Why “open”? → new ones can be created all the time
  • English has 4: Nouns, Verbs, Adjectives, Adverbs
  • Many languages have all 4, but not all!
  • In Lakhota and possibly Chinese, what English treats as adjectives act more like verbs
• Closed class: a relatively fixed membership
  • conjunctions: and, or, but
  • pronouns: I, she, him
  • prepositions: with, on, under, over, near, by, …
  • determiners: the, a, an
  • Usually function words (short common words which play a role in grammar)

Open class words
• Every known human language has nouns and verbs
• Nouns: people, places, things
  • Classes of nouns: proper vs. common, count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: hodgepodge!
  • Unfortunately, John walked home extremely slowly yesterday
• Numerals: one, two, three, third, …

Closed class words
• Differ more from language to language than open class words
• Examples:
  • prepositions: on, under, over, …
  • particles: up, down, on, off, …
  • determiners: a, an, the, …
  • pronouns: she, who, I, …
  • conjunctions: and, but, or, …
  • auxiliary verbs: can, may, should, …
13
1. Introduction

2. Word classes
3. TAG SETS AND PROBLEM DEFINITION

4. POS tagging approaches


• Rule-based POS tagging
• Stochastic POS tagging – Statistical POS tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
5. Evaluation
14
Tagsets

• Brown corpus tagset (87 tags):


• http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html

• Penn Treebank tagset (45 tags):


• http://www.cs.colorado.edu/~martin/SLP/Figures/

• C7 tagset (146 tags)


• http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
– Some tagging approaches (e.g., constraint grammar based) make fewer distinctions e.g.,
conflating prepositions, conjunctions, particles
– Simple morphology = more ambiguity = fewer tags
15
POS Tagging

• Words often have more than one POS: back


• The back door = JJ
• On my back = NN
• Win the voters back = RB
• Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular
instance of a word.

These examples from Dekang Lin

16
How do we assign POS tags to words in a sentence?

Time flies like an arrow.


• Time/[V,N] flies/[V,N] like/[V,Prep] an/Det arrow/N
• Time/N flies/V like/Prep an/Det arrow/N

Fruit flies like a banana


• Fruit/N flies/N like/V a/Det banana/N
• Fruit/N flies/V like/Prep a/Det banana/N

17
How hard is POS tagging? Measuring ambiguity

18
Potential Sources of Disambiguation

• Many words have only one POS tag (e.g. is, Mary, very, smallest)
• Others have a single most likely tag (e.g. a, dog)
• But tags also tend to co-occur regularly with other tags (e.g. Det, N)
• We can look at POS likelihoods P(ti | ti−1) to disambiguate sentences
and to assess sentence likelihoods

19
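As a concrete illustration, here is a minimal sketch (not part of the original slides) of estimating tag-bigram likelihoods P(ti | ti−1) from a tiny hand-tagged corpus; the toy corpus and function names are hypothetical.

from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical); each sentence is a list of (word, tag) pairs.
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("runs", "VBZ")],
]

# Count tag bigrams, using <s> as a pseudo-tag for the sentence start.
bigram_counts = defaultdict(Counter)
for sentence in tagged_sentences:
    tags = ["<s>"] + [tag for _, tag in sentence]
    for prev, curr in zip(tags, tags[1:]):
        bigram_counts[prev][curr] += 1

def p_tag_given_prev(curr, prev):
    # Maximum-likelihood estimate of P(curr | prev).
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(p_tag_given_prev("NN", "DT"))   # 1.0 in this toy corpus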
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical POS tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
5. Evaluation
20
Algorithms for POS Tagging

• Why can’t we just look them up in a dictionary?


• Words that aren’t in the dictionary
• One idea: P(ti | wi) = the probability that a random hapax legomenon
in the corpus has tag ti.
• Nouns are more likely than verbs, which are more likely than pronouns.
• Another idea: use morphology.

21
Algorithms for POS Tagging - Knowledge

• Dictionary
• Morphological rules, e.g.,
• _____-tion
• _____-ly
• capitalization
• N-gram frequencies
• to _____
• DET _____ N
• But what about rare words, e.g., smelt (two verb forms: to smelt, i.e., to melt ore, and the past tense of smell; and one noun form, a small fish)?
• Combining these
• V _____-ing I was gracking vs. Gracking is fun.

22
Algorithms for POS Tagging - Approaches

• Basic approaches
• Rule-Based
• Stochastic: HMM-based
• Transformation-Based Tagger (Brill) (we won’t cover this)
• Do we return one best answer or several answers and let later steps
decide?
• How does the requisite knowledge get entered?

23
• Training/Teaching an NLP Component
• Each step of NLP analysis requires a module that knows what to do. How do such
modules get created?
• By hand advantage: based on sound linguistic principles, sensible to people, explainable
• By training  less work, extensible to new languages, customizable for specific domains.

• Training/Teaching a POS Tagger


• The problem is tractable. We can do a very good job with just:
• a dictionary
• a tagset
• a large corpus, usually tagged by hand

24
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical POS tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
5. Evaluation
25
Rule-based POS Tagging

• Basic Idea:
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• e.g., if word+1 is an adjective, adverb, or quantifier, and the word after that is a sentence boundary, and word−1 is not a verb like “consider”, then eliminate non-ADV tags; else eliminate ADV
• Typically more than 1000 hand-written rules
• Leaving the correct tag for each word

26
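To make the basic idea concrete, here is a minimal sketch (not from the slides) of the dictionary-plus-elimination scheme; the dictionary entries and the single elimination rule below are illustrative only, not ENGTWOL’s actual rules.

# Assign all dictionary tags, then let a hand-written rule eliminate candidates.
DICTIONARY = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}

def initial_tags(words):
    # Unknown words default to NN here (an illustrative choice).
    return [set(DICTIONARY.get(w.lower(), {"NN"})) for w in words]

def rule_drop_vbn_after_pronoun(words, candidates):
    # Toy rule: eliminate VBN when VBD is also possible and the previous word is a pronoun.
    for i in range(1, len(words)):
        if {"VBN", "VBD"} <= candidates[i] and "PRP" in candidates[i - 1]:
            candidates[i].discard("VBN")
    return candidates

words = "She promised to back the bill".split()
print(rule_drop_vbn_after_pronoun(words, initial_tags(words)))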
Rule-Based POS Tagging
Start with a dictionary
• she: PRP
• promised: VBN,VBD
• to TO
• back:VB, JJ, RB, NN
• the: DT
• bill: NN, VB

• Etc… for the ~100,000 words of English

27
Rule-Based POS Tagging
Use the dictionary to assign every possible tag

She → PRP
promised → VBN, VBD
to → TO
back → VB, JJ, RB, NN
the → DT
bill → NN, VB

28
Rule-Based POS Tagging
Write rules to eliminate tags
Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
She → PRP
promised → VBN (eliminated), VBD
to → TO
back → VB, JJ, RB, NN
the → DT
bill → NN, VB

29
Sample ENGTWOL (ENGlish TWO Level analysis) Lexicon

30
Rule-Based POS Tagging
ENGTWOL
• 1st stage: Run words through a morphological analyzer to get all parts
of speech.
• Example: Pavlov had shown that salivation …

Pavlov PAVLOV N NOM SG PROPER


had HAVE V PAST VFIN SVO
HAVE PCP2 SVO
shown SHOW PCP2 SVOO SVO SV
that ADV
PRON DEM SG
DET CENTRAL DEM SG
CS
salivation N NOM SG

31
Rule-Based POS Tagging
ENGTWOL
• 2nd stage: Figure out what to do about words that are unknown or
ambiguous. Two approaches:
• Rules that specify what to do.
• Rules that specify what not to do:
Given input: “that”
If
  (+1 A/ADV/QUANT)   ; if the next word is an adjective, adverb, or quantifier
  (+2 SENT-LIM)      ; and the word after that is end-of-sentence
  (NOT -1 SVOC/A)    ; and the previous word is not a verb like “consider”,
                     ; which allows adjective complements, as in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV

Examples: “It isn’t that odd” vs. “I consider that odd” vs. “I believe that he is right.”
(From ENGTWOL)
32
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
5. Evaluation
33
Statistical POS tagging
• Based on probability theory
• The simple “most-frequent-tag” algorithm:
• baseline algorithm
• Meaning that no one would use it if they really wanted some data tagged
• No probabilities for words not in corpus
• But it’s useful as a comparison
• Conditional Probability and Tags
• P(Verb) is probability of randomly selected word being a verb.
• P(Verb|race) is “what’s the probability of a word being a verb given that it’s the word “race”?
• Race can be a noun or a verb. It’s more likely to be a noun
• P(Verb|race) == “out of all the times we saw ‘race’, how many were verbs?”
• In the Brown corpus, race occurs 98 times: 96 as a noun and 2 as a verb, so P(Noun|race) = 96/98 ≈ .98 and P(Verb|race) = 2/98 ≈ .02

  P(Verb | race) = Count(race tagged as Verb) / Count(race)
34
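A tiny numeric check of the estimate above, using the Brown-corpus counts quoted on the slide (96 noun occurrences out of 98 total):

count_race = 98          # total occurrences of "race" (from the slide)
count_race_noun = 96     # occurrences tagged as a noun (from the slide)
print("P(Noun|race) =", count_race_noun / count_race)                  # ~0.98
print("P(Verb|race) =", (count_race - count_race_noun) / count_race)   # ~0.02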
Most-frequent-tag
• Some ambiguous words have a more frequent tag and a less frequent tag
• Consider the word “a” in these 2 sentences:
  • would/MD prohibit/VB a/DT suit/NN for/IN refund/NN
  • of/IN section/NN 381/CD (/( a/NN )/) ./.
• Which do you think is more frequent?
• We could count in a corpus
  • The Brown Corpus, part-of-speech tagged at U Penn
  • Counts in this corpus:
    21830 DT
    6 NN
    3 FW

• Most-frequent-tag algorithm
  • For each word, create a dictionary with each possible tag for the word
  • Take a tagged corpus
  • Count the number of times each tag occurs for that word
  • Given a new sentence, for each word, pick the most frequent tag for that word from the corpus
• Q: Where does the dictionary come from?
  A: One option is to use the same corpus that we use for computing the tags
35
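A minimal sketch (not from the slides) of the most-frequent-tag baseline just described; the corpus format and the default tag for unseen words are assumptions for illustration.

from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    # word -> Counter of tags seen for that word in the training corpus
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default="NN"):
    # Unseen words get a default tag (an illustrative choice).
    return [(w, model.get(w.lower(), default)) for w in words]

corpus = [
    [("a", "DT"), ("suit", "NN")],
    [("a", "DT"), ("refund", "NN")],
    [("(", "("), ("a", "NN"), (")", ")")],
]
model = train_most_frequent_tag(corpus)
print(tag("a suit".split(), model))   # 'a' -> DT, its most frequent tag here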
Using a corpus to build a dictionary

• The/DT City/NNP Purchasing/NNP Department/NNP ,/, the/DT jury/NN said/VBD,/,


is/VBZ lacking/VBG in/IN experienced/VBN clerical/JJ personnel/NNS …
• From this sentence, dictionary is:
clerical
department
experienced
in
is
jury

36
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
• The forward algorithm
• The Viterbi algorithm
• The forward-backward algorithm
5. Evaluation
37
Markov chain

• A Markov chain is a model that tells us something about the probabilities of


sequences of random variables, states, each of which can take on values from
some set.
• These sets can be words, or tags, or symbols representing anything, like the
weather
• Markov assumption: if we want to predict the future in the sequence, all that
matters is the current state

38
Markov chain

• The states are represented as nodes in the graph, and the transitions, with their probabilities, as edges
• The transitions are probabilities: the values of the arcs leaving a given state must sum to 1
• Components of a Markov chain

[Figure: A Markov chain for weather (a) and one for words (b), showing states and transitions. A start distribution π is required; setting π = [0.1, 0.7, 0.2] for (a) would mean a probability 0.7 of starting in state 2 (cold), probability 0.1 of starting in state 1 (hot), etc.]

39
Markov chain - Example

• Say that there are only three kinds of weather conditions, namely Rainy, Sunny and Cloudy
• Peter is a small kid who loves to play outside. He loves it when the weather is sunny, because all his friends come out to play in sunny conditions. He hates rainy weather for obvious reasons.
• Every day, his mother observes the weather in the morning (that is when he usually goes out to play) and, as always, Peter comes up to her right after getting up and asks her to tell him what the weather is going to be like. Since she is a responsible parent, she wants to answer that question as accurately as possible. But the only thing she has is a set of observations taken over multiple days of how the weather has been.
• How does she make a prediction of the weather for today based on what the weather has been for the past N days?

40
Markov chain - Example

• Our model has only 3 states:


𝑆 = {𝑆1, 𝑆2, 𝑆3} ,
and the name of each state is:
• 𝑆1 = 𝑆𝑢𝑛𝑛𝑦 ,
• 𝑆2 = 𝑅𝑎𝑖𝑛𝑦,
• 𝑆3 = 𝐶𝑙𝑜𝑢𝑑𝑦
• To establish the transition probabilities between states, we will need to collect data
• Assume the data produces the
following transition probabilities

41
Markov chain - Example
• Let’s say we have a sequence: Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy, ….;
so, in a day we can be in any of the 3 states
• We can use the following state sequence notation:
𝑞1, 𝑞2, 𝑞3, 𝑞4, 𝑞5,….., where 𝑞𝑖 𝜖 {𝑆𝑢𝑛𝑛𝑦, 𝑅𝑎𝑖𝑛𝑦, 𝐶𝑙𝑜𝑢𝑑𝑦}.
• In order to compute the probability of tomorrow’s weather we can use the Markov property:
  P(q_{t+1} | q_1, …, q_t) = P(q_{t+1} | q_t)

42
Markov chain - Example

• Question 1: Given that today is Sunny,


what’s the probability that tomorrow is
Sunny and the next day Rainy?

43
Markov chain - Example

• Question 1: Given that today is Sunny,


what’s the probability that tomorrow is
Sunny and the next day Rainy?

44
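Written symbolically, since the concrete transition values appear only in the figure, the standard computation for Question 1 is:

\begin{aligned}
P(q_2 = \text{Sunny},\; q_3 = \text{Rainy} \mid q_1 = \text{Sunny})
  &= P(q_2 = \text{Sunny} \mid q_1 = \text{Sunny}) \cdot P(q_3 = \text{Rainy} \mid q_2 = \text{Sunny}) \\
  &= a_{\text{Sunny},\,\text{Sunny}} \cdot a_{\text{Sunny},\,\text{Rainy}}
\end{aligned}

where a_{ij} denotes the transition probability from state i to state j, and the second factor simplifies by the Markov assumption.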
Markov chain - Example

• Question 2: Assume that yesterday’s


weather was Rainy, and today is Cloudy,
what is the probability that tomorrow
will be Sunny?

45
Markov chain - Example

• Question 2: Assume that yesterday’s


weather was Rainy, and today is
Cloudy, what is the probability that
tomorrow will be Sunny?

46
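Again written symbolically (the numeric transition table is only in the figure): by the Markov assumption, yesterday’s weather is irrelevant once today’s is known, so

\begin{aligned}
P(q_3 = \text{Sunny} \mid q_1 = \text{Rainy},\, q_2 = \text{Cloudy})
  &= P(q_3 = \text{Sunny} \mid q_2 = \text{Cloudy})
   = a_{\text{Cloudy},\,\text{Sunny}}
\end{aligned}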
Markov chain
• Summary: A Markov chain is a weighted
automaton in which
• weights are probabilities, i.e., all weights are
between 0 and 1 and the sum of the weights of
all outgoing edges of a state is 1, and
• the input sequence uniquely determines the
states the automaton goes through.
• A Markov chain is actually a bigram
language model
• Markov chains are useful when we want to
compute the probability for a sequence of
events that we can observe.
47
Markov Model

• A Markov Model is a stochastic model which models temporal or sequential data, i.e.,
data that are ordered
• It provides a way to model the dependencies of current information (e.g. weather) with
previous information
• It is composed of states, transition scheme between states, and emission of outputs
(discrete or continuous)
• Several goals can be accomplished by using Markov models:
• Learn statistics of sequential data.
• Do prediction or estimation.
• Recognize patterns

48
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
• The forward algorithm
• The Viterbi algorithm
• The forward-backward algorithm
5. Evaluation
49
Hidden Markov Model

• An HMM is a stochastic model where the states of the model are hidden. Each state can emit an output
which is observed
• Imagine: You were locked in a room for several days and you were asked about the weather outside.
The only piece of evidence you have is whether the person who comes into the room bringing your
daily meal is carrying an umbrella or not.
• What is hidden? Sunny, Rainy, Cloudy
• What can you observe? Umbrella or Not

50
HMM

• Let’s assume that 𝑡 days had passed.


• Therefore, we will have an observation sequence O = {𝑜1,…,𝑜𝑡} ,
• where 𝑜𝑖 𝜖 {𝑈𝑚𝑏𝑟𝑒𝑙𝑙𝑎, 𝑁𝑜𝑡 𝑈𝑚𝑏𝑟𝑒𝑙𝑙𝑎} .
• Each observation comes from an unknown state.
• Therefore, we will also have an unknown sequence 𝑄 = {𝑞1,…,𝑞𝑡},
• where 𝑞𝑖 𝜖 {𝑆𝑢𝑛𝑛𝑦,𝑅𝑎𝑖𝑛𝑦,𝐶𝑙𝑜𝑢𝑑𝑦} .
• We would like to know: 𝑃(𝑞1,..,𝑞𝑡|𝑜1,…,𝑜𝑡).

51
HMM
• From Bayes’ Theorem, we can obtain the probability for a particular day as:

  P(q_t | o_t) = P(o_t | q_t) P(q_t) / P(o_t)

• For a sequence of length 𝑡:

  P(q_1, …, q_t | o_1, …, o_t) = P(o_1, …, o_t | q_1, …, q_t) P(q_1, …, q_t) / P(o_1, …, o_t)

• From the Markov property:

  P(q_1, …, q_t) = ∏_{i=1}^{t} P(q_i | q_{i−1})

• Independent observations assumption:

  P(o_1, …, o_t | q_1, …, q_t) = ∏_{i=1}^{t} P(o_i | q_i)

• Thus

  P(q_1, …, q_t | o_1, …, o_t) ∝ ∏_{i=1}^{t} P(o_i | q_i) P(q_i | q_{i−1})
52
HMM components and parameters

• A HMM is governed by the following parameters: λ = {𝐴,𝐵,𝜋}


• State-transition probability matrix 𝐴
• Emission/Observation/State Conditional Output probabilities 𝐵
• Initial (prior) state probabilities 𝜋
• Determine the fixed number of states (𝑁): 𝑆 = 𝑠1,…,𝑠𝑁
53
Ice-cream
• Imagine that you are a climatologist studying the history of global warming.
• You cannot find any records of the weather in Baltimore, Maryland, for the summer of 2020, but you do
find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer.
Our goal is to use these observations to estimate the temperature every day.
• We’ll simplify this weather task by assuming there are only two kinds of days:
• cold (C)
• hot (H).
• So the Eisner task is as follows:
• Given a sequence of observations O (each an integer representing the number of ice creams eaten on a
given day)
• find the ‘hidden’ sequence Q of weather states (H or C) which caused Jason to eat the ice cream

54
Ice-cream

• The two hidden states (H and C) correspond to hot and cold weather,
• The observations (drawn from the alphabet O = {1,2,3}) correspond to the number of ice
creams eaten by Jason on a given day

55
HMM tagger
• The goal of HMM decoding is to choose the tag sequence t1 … tn that is most probable given the observation sequence of n words w1 … wn:

  t̂_{1:n} = argmax_{t_{1:n}} P(t_{1:n} | w_{1:n})

• Applying Bayes’ rule:

  t̂_{1:n} = argmax_{t_{1:n}} P(w_{1:n} | t_{1:n}) P(t_{1:n}) / P(w_{1:n})

• Dropping the denominator (it is the same for every candidate tag sequence):

  t̂_{1:n} = argmax_{t_{1:n}} P(w_{1:n} | t_{1:n}) P(t_{1:n})

• Using the HMM assumptions (each word depends only on its own tag; the bigram assumption: each tag depends only on the previous tag):

  P(w_{1:n} | t_{1:n}) ≈ ∏_{i=1}^{n} P(w_i | t_i)      P(t_{1:n}) ≈ ∏_{i=1}^{n} P(t_i | t_{i−1})

• Putting it all together:

  t̂_{1:n} = argmax_{t_{1:n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i−1})
56
HMM- the three fundamental problems

57
How can we solve the problems?

• Problem 1. Likelihood of the input O:


• Compute P(O|λ) for the input O and HMM λ
  → Forward algorithm
• Problem 2. Decoding (= tagging) the input O:
• Find the best (tags) Q for the input O
  → Viterbi algorithm
• Problem 3. Estimation (= learning the model):
• Find the best model parameters A and B for the training data O
  → Forward-backward algorithm

58
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
• The forward algorithm
• The Viterbi algorithm
• The forward-backward algorithm
5. Evaluation
59
HMM- Problem 1

• Computing Likelihood: Given an HMM λ = (A,B) and an observation sequence O,


determine the likelihood P(O|λ).
• Example: given the ice-cream eating HMM, what is the probability of the sequence 3 1 3?
• → Problem: we don’t know what the hidden state sequence is

60
HMM – Problem 1

Example: given the ice-cream eating HMM, what is the probability of the sequence 3 1 3?
Note: we don’t know what the hidden state sequence is.

• In an HMM, each hidden state produces only a single observation → the sequence of hidden states and the sequence of observations have the same length
• Given:
  • A particular hidden state sequence Q = q1, q2, ..., qT
  • An observation sequence O = o1, o2, ..., oT
• Then, the likelihood of the observation sequence given that state sequence is
  P(O | Q) = ∏_{i=1}^{T} P(o_i | q_i)
• We don’t know what the hidden state sequence was → use the joint probability of being in a particular weather sequence Q and generating a particular sequence O:
  P(O, Q) = P(O | Q) × P(Q) = ∏_{i=1}^{T} P(o_i | q_i) × ∏_{i=1}^{T} P(q_i | q_{i−1})
• The total probability of the observations sums over all possible hidden state sequences:
  P(O) = Σ_Q P(O, Q)
61
HMM – Problem 1

Example: given the ice-cream eating HMM, what is the probability of the sequence 3 1 3?
Note: we don’t know what the hidden state sequence is.

• Example: the computation of the forward probability for our ice-cream observation 3 1 3 from one possible hidden state sequence, hot hot cold

• We don’t know what the hidden state sequence was → the sequence 3 1 3 has eight possible 3-state hidden sequences
  • cold cold cold
  • cold cold hot
  • …

• Problem: with N hidden states and an observation sequence of T observations, there are N^T possible hidden sequences → Solution: the forward algorithm
62
The forward algorithm
• It is an efficient O(N²T) algorithm and a kind of dynamic programming algorithm that uses a
table to store intermediate values as it builds up the probability of the observation sequence.
• The forward algorithm computes the observation probability by summing over the
probabilities of all possible hidden state paths that could generate the observation
sequence.
• Each cell of the trellis αt(j) represents the probability of being in state j after seeing the first t
observations, given the automaton λ
• Given state qj at time t, the value is computed as:
  α_t(j) = Σ_{i=1}^{N} α_{t−1}(i) a_{ij} b_j(o_t)
63
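Here is a minimal Python sketch (not from the slides) of the forward recursion, using the ice-cream HMM parameters that appear in the worked example on the following slides; variable names are assumptions, and only the emission probabilities the slides actually show (for observations 1 and 3) are included.

STATES = ["C", "H"]
START = {"C": 0.2, "H": 0.8}                      # pi: P(state | start)
TRANS = {"C": {"C": 0.5, "H": 0.5},               # A: P(next state | current state)
         "H": {"C": 0.4, "H": 0.6}}
EMIT = {"C": {1: 0.5, 3: 0.1},                    # B: P(observation | state)
        "H": {1: 0.2, 3: 0.4}}

def forward(observations):
    # alpha[t][j] = P(o_1 .. o_t, q_t = j | lambda)
    alpha = [{s: START[s] * EMIT[s][observations[0]] for s in STATES}]
    for o in observations[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * TRANS[i][j] for i in STATES) * EMIT[j][o]
                      for j in STATES})
    return alpha, sum(alpha[-1].values())          # trellis and P(O | lambda)

alpha, likelihood = forward([3, 1, 3])
print(alpha[1])        # approximately {'C': 0.069, 'H': 0.0404}, as in the worked example
print(likelihood)      # P(3 1 3 | lambda)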
The forward algorithm

[Figure: Visualizing the computation of a single element αt(i) in the trellis by summing all the previous values αt−1, weighted by their transition probabilities a, and multiplying by the observation probability bi(ot).]

[Figure: The forward algorithm, where forward[s,t] represents αt(s).]


64
The forward algorithm
Given state qj at time t, the value αt(j) sums the probabilities of every path into that cell; for example, the two paths into α2(2) are:

α1(1) × P(H|C) × P(1|H) = 0.02 × 0.5 × 0.2 = 0.002
α1(2) × P(H|H) × P(1|H) = 0.32 × 0.6 × 0.2 = 0.0384

so α2(2) = 0.002 + 0.0384 = 0.0404

65
The forward algorithm

Observations: 3 1 3

COLD:
  α1(1) = P(C|start) P(3|C) = 0.2 × 0.1 = 0.02
  α2(1) = α1(1) P(C|C) P(1|C) + α1(2) P(C|H) P(1|C)
        = 0.02 × 0.5 × 0.5 + 0.32 × 0.4 × 0.5 = 0.005 + 0.064 = 0.069
  α3(1) = α2(1) P(C|C) P(3|C) + α2(2) P(C|H) P(3|C)
        = 0.069 × 0.5 × 0.1 + 0.0404 × 0.4 × 0.1 = 0.00345 + 0.001616 = 0.005066

HOT:
  α1(2) = P(H|start) P(3|H) = 0.8 × 0.4 = 0.32
  α2(2) = α1(1) P(H|C) P(1|H) + α1(2) P(H|H) P(1|H)
        = 0.02 × 0.5 × 0.2 + 0.32 × 0.6 × 0.2 = 0.002 + 0.0384 = 0.0404
  α3(2) = α2(1) P(H|C) P(3|H) + α2(2) P(H|H) P(3|H)
        = 0.069 × 0.5 × 0.4 + 0.0404 × 0.6 × 0.4 = 0.0138 + 0.009696 = 0.023496

66
The forward algorithm

67
The forward algorithm

• For each possible hidden state sequence (HHH, HHC, HCH, …), we could run the
forward algorithm and compute the likelihood of the observation sequence given that
hidden state sequence.
• Then we could choose the hidden state sequence with the maximum observation
likelihood.
• It should be clear from the previous section that we cannot do this because there are an
exponentially large number of state sequences.

68
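For contrast, here is a minimal sketch (not from the slides) of the naive enumeration this slide rules out: score every one of the N^T hidden sequences and keep the best. The Viterbi algorithm in the next section computes the same answer in O(N²T). The ice-cream parameters are repeated so the snippet runs on its own; the function name is an assumption.

from itertools import product

STATES = ["C", "H"]
START = {"C": 0.2, "H": 0.8}
TRANS = {"C": {"C": 0.5, "H": 0.5}, "H": {"C": 0.4, "H": 0.6}}
EMIT = {"C": {1: 0.5, 3: 0.1}, "H": {1: 0.2, 3: 0.4}}

def brute_force_decode(observations):
    best_path, best_p = None, 0.0
    for path in product(STATES, repeat=len(observations)):     # all N^T hidden sequences
        p = START[path[0]] * EMIT[path[0]][observations[0]]
        for prev, curr, o in zip(path, path[1:], observations[1:]):
            p *= TRANS[prev][curr] * EMIT[curr][o]
        if p > best_p:
            best_path, best_p = path, p
    return best_path, best_p

print(brute_force_decode([3, 1, 3]))   # best hidden sequence and its joint probability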
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
• The forward algorithm
• The Viterbi algorithm
• The forward-backward algorithm
5. Evaluation
69
HMM- Problem 2

• For any model, such as an HMM, that contains hidden variables, the task of determining
which sequence of variables is the underlying source of some sequence of observations is
called the decoding task.
• In the ice-cream domain, given a sequence of ice-cream observations 3 1 3 and an HMM,
the task of the decoder is to find the best hidden weather sequence (H H H)
Decoding: Given as input an HMM λ = (A,B) and a sequence of observations O = o1,o2,...,oT,
find the most probable sequence of states Q = q1q2q3 ...qT .
• Viterbi algorithm
• is a kind of dynamic programming algorithm that makes use of a dynamic programming trellis
• strongly resembles another dynamic programming variant, the minimum edit distance
algorithm

70
The Viterbi algorithm
• Given:
• A particular hidden state sequence Q = q0,q1,q2,...,qT
• An observation sequence O = o1,o2,...,oT ,
• Each cell of the trellis, vt(j), represents the probability that the HMM is in state j after seeing the
first t observations and passing through the most probable state sequence q1,...,qt−1, given the
automaton λ
• The value of each cell vt(j) is computed by recursively taking the most probable path that could lead us to this cell:
  v_t(j) = max_{i=1..N} v_{t−1}(i) a_{ij} b_j(o_t)

71
The Viterbi algorithm

[Figure: Viterbi algorithm for finding the optimal sequence of hidden states. Given an observation sequence and an HMM λ = (A,B), the algorithm returns the state path through the HMM that assigns maximum likelihood to the observation sequence.]

72
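A minimal Python sketch (not from the slides) of Viterbi decoding with backpointers, reusing the same ice-cream HMM parameters; names and structure are assumptions.

STATES = ["C", "H"]
START = {"C": 0.2, "H": 0.8}
TRANS = {"C": {"C": 0.5, "H": 0.5}, "H": {"C": 0.4, "H": 0.6}}
EMIT = {"C": {1: 0.5, 3: 0.1}, "H": {1: 0.2, 3: 0.4}}

def viterbi(observations):
    # v[t][j] = probability of the best path that ends in state j at time t
    v = [{s: START[s] * EMIT[s][observations[0]] for s in STATES}]
    backpointers = [{}]
    for o in observations[1:]:
        prev = v[-1]
        column, pointers = {}, {}
        for j in STATES:
            best_i = max(STATES, key=lambda i: prev[i] * TRANS[i][j])
            column[j] = prev[best_i] * TRANS[best_i][j] * EMIT[j][o]
            pointers[j] = best_i
        v.append(column)
        backpointers.append(pointers)
    # Backtrace from the most probable final state (the Viterbi backtrace).
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(v) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 3]))   # most probable hidden sequence for 3 1 3 and its probability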
The Viterbi algorithm
The value of each cell vt(j) is computed by recursively taking the most probable path that could lead us to this cell; for example, the two paths into v2(2) are:

v1(1) × P(H|C) × P(1|H) = 0.02 × 0.5 × 0.2 = 0.002
v1(2) × P(H|H) × P(1|H) = 0.32 × 0.6 × 0.2 = 0.0384

so v2(2) = max(0.002, 0.0384) = 0.0384

73
The Viterbi algorithm

74
The Viterbi algorithm vs. the forward algorithm

• The Viterbi algorithm is identical to the forward algorithm EXCEPT it takes the max over the
previous path probabilities whereas the forward algorithm takes the sum.
• The Viterbi algorithm has one component that the forward algorithm doesn’t have:
backpointers
• Why?
• The forward algorithm needs to produce an observation likelihood,
• The Viterbi algorithm must produce a probability and also the most likely state sequence.
• It computes this best state sequence by keeping track of the path of hidden states that led to each state, and then, at the end, backtracing the best path to the beginning (the Viterbi backtrace).

75
Example

76
Example – “Janet will back the bill.” 1. Use the forward algorithm
2. Use the Viterbi algorithm

77
78
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS TAGGING APPROACHES


• Rule-based POS tagging
• Stochastic POS tagging – Statistical tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
• The forward algorithm
• The Viterbi algorithm
• The forward-backward algorithm
5. Evaluation
79
HMM – Problem 3
• Learning: Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters
A and B.
• The input to such a learning algorithm would be:
  • an unlabeled sequence of observations O, and
  • a vocabulary of potential hidden states Q
• For the ice cream task, we would start with:
  • a sequence of observations O = {1, 3, 2, ...}, and
  • the set of hidden states H and C

• The standard algorithm for HMM training is the forward-backward, or Baum-Welch algorithm (Baum, 1972), a
special case of the Expectation-Maximization or EM algorithm (Dempster et al., 1977).
• The algorithm will let us train both the transition probabilities A and the emission probabilities B of the HMM.
• EM is an iterative algorithm: it computes an initial estimate for the probabilities, then uses those estimates to
compute a better estimate, and so on, iteratively improving the probabilities that it learns.
• The real problem: we don’t know the counts of being in any of the hidden states.
• Solution: the Baum-Welch algorithm solves this by iteratively estimating the counts. We start with an estimate
for the transition and observation probabilities and then use these estimated probabilities to derive better and
better probabilities, by
  • computing the forward probability for an observation, and
  • then dividing that probability mass among all the different paths that contributed to this forward probability.
80
HMM- Problem 3
The backward probability β is the probability of seeing the observations from time t+1 to the end, given that we are in state i at time t (and given the automaton λ):

  β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ),  computed recursively as  β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(o_{t+1}) β_{t+1}(j)

[Figure: The computation of βt(i) by summing all the successive values βt+1(j), weighted by their transition probabilities aij and their observation probabilities bj(ot+1). Start and end states not shown.]
81
HMM- Problem 3
• Putting it all together: we can see how the forward and backward probabilities help compute the transition
probability aij and observation probability bi(ot) from an observation sequence, even though the
actual path taken through the model is hidden
• Estimate âij by a variant of simple maximum likelihood estimation

• Denote by ξt(i, j) the probability of being in state i at time t and state j at time t+1, given the
observation sequence and, of course, the model

• To compute ξt , we first compute a probability which is similar to ξt, but differs in including the
probability of the observation; note the different conditioning of O

82
HMM- Problem 3

The expected number of transitions from state i to state j is then the sum over all t of ξt(i, j).

83
HMM- Problem 3

• We also need a formula for recomputing the observation probability.


• This is the probability of a given symbol vk from the observation vocabulary V, given a state j

• Denoting γt(j) as the probability of being in state j at time t

84
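For reference, the standard Baum-Welch quantities and re-estimation formulas (following Jurafsky and Martin) that the last few slides rely on are:

\[
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{k=1}^{N} \alpha_t(k)\, \beta_t(k)}
\qquad
\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{\sum_{k=1}^{N} \alpha_t(k)\, \beta_t(k)}
\]

\[
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i,k)}
\qquad
\hat{b}_j(v_k) = \frac{\sum_{t=1,\; o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
\]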
The forward-backward algorithm

• In the E-step, we compute the expected state


occupancy count γ and the expected state
transition count ξ from the earlier A and B
probabilities.

• In the M-step, we use γ and ξ to recompute


new A and B probabilities.

85
Sketch of Baum-Welch (EM) Algorithm
for Training HMMs
Assume an HMM with N states.
Randomly set its parameters λ=(A,B)
(making sure they represent legal distributions)
Until converge (i.e. λ no longer changes) do:
E Step: Use the forward/backward procedure to
determine the probability of various possible
state sequences for generating the training data
M Step: Use these probability estimates to
re-estimate values for all of the parameters λ

86
Self-study

• Extending the HMM algorithm to trigrams


• Beam Search
• Maximum Entropy Markov models

87
1. Introduction

2. Word classes
3. Tag sets and problem definition

4. POS tagging approaches


• Rule-based POS tagging
• Stochastic POS tagging – Statistical POS tagging
• Markov chain
• Hidden Markov Model
• Stochastic POS tagging – HMM POS tagging
5. EVALUATION
88
Evaluation

• The result is compared with a manually coded “Gold Standard”


• Typically accuracy reaches 96-97%
• This may be compared with result for a baseline tagger (one that uses no context).
• Important: 100% is impossible even for human annotators.
• Evaluation performance
• How do we know how well a tagger does?
• Say we had a test sentence, or a set of test sentences, that were already tagged by a human (a
“Gold Standard”)
• We could run a tagger on this set of test sentences
• And see how many of the tags we got right.
• This is called “Tag accuracy” or “Tag percent correct”

89
Training and test sets
• We take a set of test sentences
• Hand-label them for part of speech
• The result is a “Gold Standard” test set
• Who does this?
Brown corpus: done by U Penn Grad students in linguistics
• Don’t they disagree?
• Yes! But there are no disagreements on about 97% of tags
• And if you let the taggers discuss the remaining 3%, they often reach agreement
• But we can’t train our frequencies on the test set sentences
• So for testing the Most-Frequent-Tag algorithm (or any other probabilistic algorithm), we need
2 things:
• A hand-labeled training set: the data that we compute frequencies from, ….
• A hand-labeled test set: The data that we use to compute our % correct.
90
Computing % correct

• Of all the words in the test set


• For what percent of them did the tag chosen by the tagger equal the human-selected tag.
  %correct = (# of words tagged correctly in test set) / (total # of words in test set)
• Human tag set: (“Gold Standard” set)



91
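A minimal sketch (not from the slides) of this computation; the function name and example tags are illustrative.

def tag_accuracy(predicted_tags, gold_tags):
    # Percent of positions where the tagger's tag equals the human ("Gold Standard") tag.
    assert len(predicted_tags) == len(gold_tags)
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)

print(tag_accuracy(["DT", "NN", "VBZ"], ["DT", "NN", "VB"]))   # 2/3 ≈ 0.67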
Training and Test sets

• Often they come from the same labeled corpus!


• We just use 90% of the corpus for training and save out 10% for
testing!
• Even better: cross-validation
• Take 90% training, 10% test, get a % correct
• Now take a different 10% test, 90% training, get % correct
• Do this 10 times and average

92
Evaluation and rule-based taggers

• Does the same evaluation metric work for rule-based taggers?


• Yes!
• Rule-based taggers don’t need the training set
• But they still need a test set to see how well the rules are working

93
94
