1. Introduction
2. Word classes
3. Tag sets and problem definition
What is a POS?
Morphological and Syntactic Definition of POS
Penn Treebank POS Tag set
Example of Penn Treebank Tagging of a Brown Corpus Sentence
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Book/VB that/DT flight/NN ./.
• Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
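As a quick illustration (not part of the original slides), NLTK's default tagger outputs Penn Treebank tags. This is a minimal sketch; it assumes the nltk package is installed, and resource names can vary slightly across NLTK versions.

```python
# Sketch: tagging the example sentences with NLTK's default Penn Treebank
# tagger (assumes nltk is installed; resource names vary by NLTK version).
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

for sentence in ["Book that flight.", "Does that flight serve dinner?"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# e.g. [('Book', 'VB'), ('that', 'DT'), ('flight', 'NN'), ('.', '.')]
# (the statistical tagger's output may differ slightly from the hand tags)
```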
The Problem
Why Do We Care about POS?
• Pronunciation
  • Hand me the lead pipe.
• Predicting what words can be expected next
  • Personal pronoun (e.g., I, she) ____________
• Stemming
  • -s means singular for verbs, plural for nouns
• Machine translation
  • (E) content +N → (F) contenu +N
  • (E) content +Adj → (F) content +Adj or satisfait +Adj
Definition
1. Introduction
2. WORD CLASSES
3. Tag sets and problem definition
Open class vs. closed class
• Open class: Nouns, Verbs, Adjectives, Adverbs
  • Why "open"? New ones can be created all the time
  • English has 4: Nouns, Verbs, Adjectives, Adverbs
  • Many languages have all 4, but not all!
  • In Lakhota and possibly Chinese, what English treats as adjectives act more like verbs
• Every known human language has nouns and verbs
  • Nouns: people, places, things
    • Classes of nouns: proper vs. common, count vs. mass
  • Verbs: actions and processes
  • Adjectives: properties, qualities
  • Adverbs: hodgepodge! (Unfortunately, John walked home extremely slowly yesterday)
  • Numerals: one, two, three, third, …
• Closed class: a relatively fixed membership
  • Usually function words (short common words which play a role in grammar)
  • Differ more from language to language than open class words
  • Examples:
    • prepositions: with, on, under, over, near, by, …
    • particles: up, down, on, off, …
    • determiners: the, a, an, …
    • pronouns: I, she, him, who, …
    • conjunctions: and, or, but, …
    • auxiliary verbs: can, may, should, …
1. Introduction
2. Word classes
3. TAG SETS AND PROBLEM DEFINITION
How do we assign POS tags to words in a sentence?
How hard is POS tagging? Measuring ambiguity
Potential Sources of Disambiguation
• Many words have only one POS tag (e.g. is, Mary, very, smallest)
• Others have a single most likely tag (e.g. a, dog)
• But tags also tend to co-occur regularly with other tags (e.g. Det, N)
• We can look at POS likelihoods P(ti | ti−1) to disambiguate sentences and to assess sentence likelihoods (see the sketch below)
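To make the idea of tag co-occurrence concrete, here is a small sketch (illustrative only; the toy hand-tagged corpus is an assumption) that estimates P(ti | ti−1) from tag bigram counts.

```python
# Sketch: estimating tag-transition probabilities P(t_i | t_{i-1})
# from a tiny hand-tagged corpus (toy data, for illustration only).
from collections import Counter, defaultdict

tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("bit", "VBD"), ("the", "DT"), ("man", "NN")],
]

bigram_counts = defaultdict(Counter)
for sentence in tagged_sentences:
    tags = ["<s>"] + [tag for _, tag in sentence]
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[prev][cur] += 1

def p(cur, prev):
    """Maximum-likelihood estimate of P(current tag | previous tag)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

print(p("NN", "DT"))   # 1.0 in this toy corpus: DT is always followed by NN
```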
Algorithms for POS Tagging - Knowledge
• Dictionary
• Morphological rules, e.g.,
• _____-tion
• _____-ly
• capitalization
• N-gram frequencies
• to _____
• DET _____ N
• But what about rare words, e.g., smelt (two verb senses: to melt ore, and the past tense of smell; one noun sense: a small fish)?
• Combining these (see the sketch below)
  • V _____-ing: I was gracking vs. Gracking is fun.
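A minimal sketch of how these knowledge sources might be combined in code. The suffix rules, the context cues, and the tiny dictionary below are illustrative assumptions, not an actual tagger.

```python
# Sketch: combining a dictionary, morphological cues, and local context
# to guess a POS tag for a single word (toy rules, for illustration only).
def guess_tag(word, prev_word=None):
    dictionary = {"the": "DT", "a": "DT", "is": "VBZ", "Mary": "NNP"}
    if word in dictionary:                        # 1. dictionary lookup
        return dictionary[word]
    if prev_word == "to":                         # 2. n-gram / context cue: "to ____"
        return "VB"
    if prev_word in ("is", "was") and word.endswith("ing"):
        return "VBG"                              #    "V ____-ing": I was gracking
    if word.endswith("tion"):                     # 3. morphological cues
        return "NN"
    if word.endswith("ly"):
        return "RB"
    if word[0].isupper():                         # 4. capitalization
        return "NNP"
    return "NN"                                   # default guess for rare words

print(guess_tag("gracking", prev_word="was"))     # VBG
print(guess_tag("celebration"))                   # NN
```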
Algorithms for POS Tagging - Approaches
• Basic approaches
• Rule-Based
• Stochastic: HMM-based
• Transformation-Based Tagger (Brill) (we won’t cover this)
• Do we return one best answer or several answers and let later steps
decide?
• How does the requisite knowledge get entered?
• Training/Teaching an NLP Component
• Each step of NLP analysis requires a module that knows what to do. How do such
modules get created?
• By hand. Advantages: based on sound linguistic principles, sensible to people, explainable.
• By training. Advantages: less work, extensible to new languages, customizable for specific domains.
Rule-Based POS Tagging
• Basic Idea:
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1
is not a verb like “consider” then eliminate non-adv else eliminate adv.
• Typically more than 1000 hand-written rules
• Leaving the correct tag for each word
Rule-Based POS Tagging
Start with a dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
Rule-Based POS Tagging
Use the dictionary to assign every possible tag:
She/PRP  promised/{VBN, VBD}  to/TO  back/{VB, JJ, RB, NN}  the/DT  bill/{NN, VB}
Rule-Based POS Tagging
Write rules to eliminate tags:
Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"
Applying the rule leaves: She/PRP  promised/VBD  to/TO  back/{VB, JJ, RB, NN}  the/DT  bill/{NN, VB}
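A small sketch of the two steps above: assign every tag the dictionary allows, then apply the elimination rule. The encoding of the rule as a Python function is an illustrative assumption.

```python
# Sketch: rule-based tagging as "assign all dictionary tags, then eliminate".
dictionary = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def tag_lattice(words):
    """Step 1: assign every possible tag from the dictionary."""
    return [set(dictionary[w.lower()]) for w in words]

def eliminate_vbn_after_start_prp(words, lattice):
    """Step 2: eliminate VBN if VBD is an option when VBN|VBD follows <start> PRP."""
    for i in range(1, len(words)):
        follows_start_prp = (i == 1 and lattice[0] == {"PRP"})
        if follows_start_prp and {"VBN", "VBD"} <= lattice[i]:
            lattice[i].discard("VBN")
    return lattice

words = "She promised to back the bill".split()
print(eliminate_vbn_after_start_prp(words, tag_lattice(words)))
# "promised" keeps only VBD; the other words keep all their dictionary tags
```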
Sample ENGTWOL (ENGlish TWO Level analysis) Lexicon
Rule-Based POS Tagging
ENGTWOL
• 1st stage: Run words through a morphological analyzer to get all parts
of speech.
• Example: Pavlov had shown that salivation …
Rule-Based POS Tagging
ENGTWOL
• 2nd stage: Figure out what to do about words that are unknown or ambiguous. Two approaches:
  • Rules that specify what to do.
  • Rules that specify what not to do. Example rule (from ENGTWOL), given input "that":
    If
      (+1 A/ADV/QUANT)   ; the next word is an adjective, adverb, or quantifier (It isn't that odd)
      (+2 SENT-LIM)      ; and the word after that is a sentence boundary
      (NOT -1 SVOC/A)    ; and the previous word is not a verb like "consider", which allows
                         ;   adjective complements (I consider that odd vs. I believe that he is right)
    then eliminate non-ADV tags
    else eliminate ADV tags
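A hedged sketch of this constraint as code. The token representation (word, candidate-tag set), the Penn-style tag names standing in for ENGTWOL's A/ADV/QUANT labels, and the small SVOC/A verb list are all illustrative assumptions.

```python
# Sketch: the adverbial-"that" constraint as code. Each token is a
# (word, candidate_tags) pair; SVOC/A verbs like "consider" take
# adjective complements. Representation and word list are illustrative.
SVOC_A_VERBS = {"consider", "find"}              # illustrative subset

def disambiguate_that(tokens, i):
    """Apply the ENGTWOL-style rule to the token at position i ('that')."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    nxt2 = tokens[i + 2] if i + 2 < len(tokens) else None
    prev = tokens[i - 1] if i > 0 else None
    word, tags = tokens[i]
    if (nxt and nxt[1] & {"JJ", "RB", "QUANT"}          # +1 adj/adv/quantifier
            and nxt2 and nxt2[0] in {".", "!", "?"}      # +2 sentence boundary
            and not (prev and prev[0] in SVOC_A_VERBS)): # NOT -1 SVOC/A
        tags = {"RB"}                    # keep only the adverbial reading
    else:
        tags = tags - {"RB"}             # eliminate the adverbial reading
    return word, tags

# "It isn't that odd ." -> adverbial; "I consider that odd ." -> not adverbial
sent = [("It", {"PRP"}), ("is", {"VBZ"}), ("n't", {"RB"}),
        ("that", {"RB", "DT", "IN"}), ("odd", {"JJ"}), (".", {"."})]
print(disambiguate_that(sent, 3))   # ('that', {'RB'})
```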
Markov chain
Markov chain - Example
• Say that there are only three kinds of weather conditions, namely Rainy, Sunny and Cloudy
• Peter is a small kid who loves to play outside. He loves it when the weather is sunny, because all his friends come out to play in the sunny conditions. He hates the rainy weather for obvious reasons.
• Every day, his mother observes the weather in the morning (that is when he usually goes out to play) and, as always, Peter comes up to her right after getting up and asks her to tell him what the weather is going to be like. Since she is a responsible parent, she wants to answer that question as accurately as possible. But the only thing she has is a set of observations taken over multiple days as to how the weather has been.
• How does she make a prediction of the weather for today based on what the weather has been for the past N days?
Markov chain - Example
• Let’s say we have a sequence: Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy, ….;
so, in a day we can be in any of the 3 states
• We can use the following state sequence notation: q1, q2, q3, q4, q5, …, where qi ∈ {Sunny, Rainy, Cloudy}.
• In order to compute the probability of tomorrow's weather we can use the Markov property: the probability of the next state depends only on the current state, P(qt+1 | q1, …, qt) = P(qt+1 | qt) (see the sketch below).
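As a sketch (the transition and initial probabilities below are invented for illustration), the Markov property lets us compute the probability of a whole weather sequence from the transition table alone.

```python
# Sketch: a weather Markov chain. Transition probabilities are invented
# for illustration; each row sums to 1.
P_INIT = {"Sunny": 0.5, "Rainy": 0.25, "Cloudy": 0.25}
P_TRANS = {
    "Sunny":  {"Sunny": 0.6, "Rainy": 0.1, "Cloudy": 0.3},
    "Rainy":  {"Sunny": 0.3, "Rainy": 0.4, "Cloudy": 0.3},
    "Cloudy": {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3},
}

def sequence_probability(states):
    """P(q1..qn) = P(q1) * product of P(q_t | q_{t-1}) -- the Markov property."""
    prob = P_INIT[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= P_TRANS[prev][cur]
    return prob

print(sequence_probability(["Sunny", "Rainy", "Cloudy", "Cloudy"]))
# 0.5 * 0.1 * 0.3 * 0.3 = 0.0045
```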
Markov chain
• Summary: A Markov chain is a weighted automaton in which
  • weights are probabilities, i.e., all weights are between 0 and 1 and the sum of the weights of all outgoing edges of a state is 1, and
  • the input sequence uniquely determines the states the automaton goes through.
• A Markov chain is actually a bigram language model.
• Markov chains are useful when we want to compute the probability for a sequence of events that we can observe.
Markov Model
• A Markov Model is a stochastic model which models temporal or sequential data, i.e.,
data that are ordered
• It provides a way to model the dependencies of current information (e.g. weather) with
previous information
• It is composed of states, transition scheme between states, and emission of outputs
(discrete or continuous)
• Several goals can be accomplished by using Markov models:
• Learn statistics of sequential data.
• Do prediction or estimation.
• Recognize patterns
Hidden Markov Model (HMM)
• An HMM is a stochastic model where the states of the model are hidden. Each state can emit an output, which is observed.
• Imagine: you were locked in a room for several days and you were asked about the weather outside. The only piece of evidence you have is whether the person who comes into the room bringing your daily meal is carrying an umbrella or not.
• What is hidden? Sunny, Rainy, Cloudy
• What can you observe? Umbrella or Not
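A minimal sketch of this setup as data, for a single day: the weather states are hidden and only the umbrella observation is visible. All probability values here, including the prior, are invented for illustration; transitions between days are left out of this first sketch.

```python
# Sketch: the locked-room HMM setup. The weather states are hidden; the
# only observable is whether the caretaker carries an umbrella.
# All probability values below are invented for illustration.
hidden_states = ["Sunny", "Rainy", "Cloudy"]
observables = ["umbrella", "no umbrella"]

prior = {"Sunny": 0.5, "Rainy": 0.25, "Cloudy": 0.25}   # P(weather)
emission = {                                            # P(observation | weather)
    "Sunny":  {"umbrella": 0.1, "no umbrella": 0.9},
    "Rainy":  {"umbrella": 0.8, "no umbrella": 0.2},
    "Cloudy": {"umbrella": 0.3, "no umbrella": 0.7},
}

def posterior(observation):
    """P(weather | observation) by Bayes' theorem, for a single day."""
    joint = {w: prior[w] * emission[w][observation] for w in hidden_states}
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

print(posterior("umbrella"))   # Rainy becomes the most probable hidden state
```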
HMM
• From Bayes' Theorem, we can obtain the probability for a particular day as, e.g.:
  P(Rainy | umbrella) = P(umbrella | Rainy) · P(Rainy) / P(umbrella)
• Thus we can reason backwards from what we observe (umbrella or not) to the hidden weather state.
HMM components and parameters
• A set of N hidden states q1, …, qN
• A transition probability matrix A, where aij is the probability of moving from state i to state j (each row sums to 1)
• A sequence of T observations O = o1, …, oT
• A matrix of observation (emission) likelihoods B, where bi(ot) is the probability of observation ot being generated from state i
• An initial probability distribution over states (equivalently, transition probabilities from a special start state)
• λ = (A, B) denotes the HMM parameters
Ice-cream
• The two hidden states (H and C) correspond to hot and cold weather,
• The observations (drawn from the alphabet O = {1,2,3}) correspond to the number of ice
creams eaten by Jason on a given day
HMM tagger
HMM decoding: putting it all together
• The goal of HMM decoding is to choose the tag sequence t1 … tn that is most probable given the observation sequence of n words w1 … wn:
  t̂1 … t̂n = argmax over t1 … tn of P(t1 … tn | w1 … wn)
HMM - the three fundamental problems
• Problem 1 (Likelihood): given an HMM λ = (A, B) and an observation sequence O, compute the likelihood P(O | λ)
• Problem 2 (Decoding): given an HMM λ = (A, B) and an observation sequence O, find the best hidden state sequence Q
• Problem 3 (Learning): given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B

How can we solve the problems?
• Problem 1: the forward algorithm
• Problem 2: the Viterbi algorithm
• Problem 3: the forward-backward (Baum-Welch) algorithm
HMM - Problem 1
Example: given the ice-cream eating HMM, what is the probability of the sequence 3 1 3? Note: we don't know what the hidden state sequence is.
• In an HMM, each hidden state produces only a single observation, so the sequence of hidden states and the sequence of observations have the same length
• Given:
  • A particular hidden state sequence Q = q1, q2, ..., qT
  • An observation sequence O = o1, o2, ..., oT
• Then the likelihood of the observation sequence given that state sequence is
  P(O | Q) = ∏ t=1..T P(ot | qt)
  and the joint probability of being in a particular weather sequence Q and generating a particular sequence O is
  P(O, Q) = P(O | Q) · P(Q) = ∏ t=1..T P(ot | qt) · ∏ t=1..T P(qt | qt−1)
• But we don't know what the hidden state sequence was
HMM - Problem 1
Example: given the ice-cream eating HMM, what is the probability of the sequence 3 1 3? Note: we don't know what the hidden state sequence is.
• Example: the computation of the forward probability for our ice-cream observation 3 1 3 from one possible hidden state sequence (hot hot cold)
• We don't know what the hidden state sequence was; the observation 3 1 3 has eight possible 3-state hidden sequences:
  • cold cold cold
  • cold cold hot
  • …
• Problem: with N hidden states and an observation sequence of T observations, there are N^T possible hidden sequences
• Solution: the forward algorithm
The forward algorithm
• It is an efficient O(N²T) dynamic programming algorithm that uses a table (trellis) to store intermediate values as it builds up the probability of the observation sequence.
• The forward algorithm computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence.
• Each cell of the trellis αt(j) represents the probability of being in state j after seeing the first t observations, given the automaton λ:
  αt(j) = P(o1, o2, ..., ot, qt = j | λ)
• Given state qj at time t, the value is computed as:
  αt(j) = Σ i=1..N αt−1(i) · aij · bj(ot)
The forward algorithm
Visualizing the computation of a single element αt(i) in the trellis: sum all the previous values αt−1, weighted by their transition probabilities a, and multiply by the observation probability bi(ot).
The forward algorithm
Observations: 3 1 3

α1(C) = P(C|start) · P(3|C) = 0.2 · 0.1 = 0.02
α1(H) = P(H|start) · P(3|H) = 0.8 · 0.4 = 0.32

α2(C) = α1(C) · P(C|C) · P(1|C) + α1(H) · P(C|H) · P(1|C)
      = 0.02 · 0.5 · 0.5 + 0.32 · 0.4 · 0.5 = 0.005 + 0.064 = 0.069
α2(H) = α1(C) · P(H|C) · P(1|H) + α1(H) · P(H|H) · P(1|H)
      = 0.02 · 0.5 · 0.2 + 0.32 · 0.6 · 0.2 = 0.002 + 0.0384 = 0.0404

α3(C) = α2(C) · P(C|C) · P(3|C) + α2(H) · P(C|H) · P(3|C)
      = 0.069 · 0.5 · 0.1 + 0.0404 · 0.4 · 0.1 = 0.00345 + 0.001616 = 0.005066
α3(H) = α2(C) · P(H|C) · P(3|H) + α2(H) · P(H|H) · P(3|H)
      = 0.069 · 0.5 · 0.4 + 0.0404 · 0.6 · 0.4 = 0.0138 + 0.009696 = 0.023496
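The trellis above can be reproduced with a short forward-algorithm sketch. The parameters are taken from the worked example; the brute-force check at the end enumerates all 2³ hidden sequences and must give the same total.

```python
# Sketch: the forward algorithm for the ice-cream HMM, using the
# parameters from the trellis above. The brute-force check sums the
# joint probability over all N^T hidden sequences and must agree.
from itertools import product

states = ["C", "H"]
start = {"C": 0.2, "H": 0.8}                       # P(state | start)
trans = {"C": {"C": 0.5, "H": 0.5},                # P(next | current)
         "H": {"C": 0.4, "H": 0.6}}
emit = {"C": {1: 0.5, 2: 0.4, 3: 0.1},             # P(ice creams | state)
        "H": {1: 0.2, 2: 0.4, 3: 0.4}}

def forward(obs):
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: sum(alpha[-1][p] * trans[p][s] for p in states) * emit[s][o]
                      for s in states})
    return alpha, sum(alpha[-1].values())

def brute_force(obs):
    total = 0.0
    for path in product(states, repeat=len(obs)):  # all N^T hidden sequences
        p = start[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

alpha, likelihood = forward([3, 1, 3])
print(alpha)                    # alpha_1..alpha_3, matching the trellis values
print(likelihood)               # P(3 1 3) = alpha_3(C) + alpha_3(H) = 0.028562
print(brute_force([3, 1, 3]))   # same value
```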
The forward algorithm
• For each possible hidden state sequence (HHH, HHC, HCH, …), we could run the
forward algorithm and compute the likelihood of the observation sequence given that
hidden state sequence.
• Then we could choose the hidden state sequence with the maximum observation
likelihood.
• It should be clear from the previous section that we cannot do this because there are an
exponentially large number of state sequences.
HMM - Problem 2: Decoding
• For any model, such as an HMM, that contains hidden variables, the task of determining
which sequence of variables is the underlying source of some sequence of observations is
called the decoding task.
• In the ice-cream domain, given a sequence of ice-cream observations 3 1 3 and an HMM,
the task of the decoder is to find the best hidden weather sequence (H H H)
Decoding: Given as input an HMM λ = (A,B) and a sequence of observations O = o1,o2,...,oT,
find the most probable sequence of states Q = q1q2q3 ...qT .
• Viterbi algorithm
  • a kind of dynamic programming algorithm that makes use of a dynamic programming trellis
  • strongly resembles another dynamic programming variant, the minimum edit distance algorithm
The Viterbi algorithm
• Given:
  • A particular hidden state sequence Q = q1, q2, ..., qT
  • An observation sequence O = o1, o2, ..., oT
• Each cell of the trellis, vt(j), represents the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence q1, ..., qt−1, given the automaton λ:
  vt(j) = max over q1, ..., qt−1 of P(q1, ..., qt−1, o1, ..., ot, qt = j | λ)
• The value of each cell vt(j) is computed by recursively taking the most probable path that could lead us to this cell:
  vt(j) = max i=1..N vt−1(i) · aij · bj(ot)
The Viterbi algorithm
Viterbi algorithm for finding the optimal sequence of hidden states: given an observation sequence and an HMM λ = (A, B), the algorithm returns the state path through the HMM that assigns maximum likelihood to the observation sequence.
The Viterbi algorithm
The value of each cell vt(j) is computed by recursively taking the most probable path that could lead us to this cell.
The Viterbi algorithm vs. the forward algorithm
• The Viterbi algorithm is identical to the forward algorithm EXCEPT it takes the max over the
previous path probabilities whereas the forward algorithm takes the sum.
• The Viterbi algorithm has one component that the forward algorithm doesn’t have:
backpointers
• Why?
• The forward algorithm needs to produce an observation likelihood,
• The Viterbi algorithm must produce a probability and also the most likely state sequence.
• It computes this best state sequence by keeping track of the path of hidden states that led to each state and then, at the end, backtracing the best path to the beginning (the Viterbi backtrace), as in the sketch below.
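A compact Viterbi sketch for the same ice-cream HMM (same parameters as the forward-algorithm sketch above), showing the max in place of the sum plus the backpointer and backtrace steps.

```python
# Sketch: Viterbi decoding for the ice-cream HMM (same parameters as the
# forward-algorithm sketch). It differs from the forward algorithm only
# in taking max instead of sum, plus backpointers to recover the path.
states = ["C", "H"]
start = {"C": 0.2, "H": 0.8}
trans = {"C": {"C": 0.5, "H": 0.5}, "H": {"C": 0.4, "H": 0.6}}
emit = {"C": {1: 0.5, 2: 0.4, 3: 0.1}, "H": {1: 0.2, 2: 0.4, 3: 0.4}}

def viterbi(obs):
    v = [{s: start[s] * emit[s][obs[0]] for s in states}]
    backpointer = [{}]
    for o in obs[1:]:
        col, bp = {}, {}
        for s in states:
            # max over previous states instead of the forward algorithm's sum
            best_prev = max(states, key=lambda p: v[-1][p] * trans[p][s])
            col[s] = v[-1][best_prev] * trans[best_prev][s] * emit[s][o]
            bp[s] = best_prev
        v.append(col)
        backpointer.append(bp)
    # Viterbi backtrace: follow backpointers back from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 3]))   # most probable hidden sequence and its probability
```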
Example – "Janet will back the bill."
1. Use the forward algorithm
2. Use the Viterbi algorithm
HMM - Problem 3: Training
• The standard algorithm for HMM training is the forward-backward, or Baum-Welch algorithm (Baum, 1972), a
special case of the Expectation-Maximization or EM algorithm (Dempster et al., 1977).
• The algorithm will let us train both the transition probabilities A and the emission probabilities B of the HMM.
• EM is an iterative algorithm: it computes an initial estimate for the probabilities, then uses those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns
• The real problem: we don't know the counts of being in any of the hidden states
• Solution: the Baum-Welch algorithm solves this by iteratively estimating the counts. We will start with an estimate for the transition and observation probabilities and then use these estimated probabilities to derive better and better probabilities
• It does this by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability
HMM - Problem 3
• The backward probability βt(i) is the probability of seeing the observations from time t+1 to the end, given that we are in state i at time t (and given the automaton λ):
  βt(i) = P(ot+1, ot+2, ..., oT | qt = i, λ)
• Denote by ξt(i, j) the probability of being in state i at time t and state j at time t+1, given the observation sequence and, of course, the model:
  ξt(i, j) = P(qt = i, qt+1 = j | O, λ)
• To compute ξt, we first compute a probability which is similar to ξt but differs in including the probability of the observation; note the different conditioning of O:
  not-quite-ξt(i, j) = P(qt = i, qt+1 = j, O | λ) = αt(i) · aij · bj(ot+1) · βt+1(j)
The forward-backward algorithm
Sketch of Baum-Welch (EM) Algorithm
for Training HMMs
Assume an HMM with N states.
Randomly set its parameters λ=(A,B)
(making sure they represent legal distributions)
Until convergence (i.e., λ no longer changes) do:
E Step: Use the forward/backward procedure to
determine the probability of various possible
state sequences for generating the training data
M Step: Use these probability estimates to
re-estimate values for all of the parameters λ
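A from-scratch sketch of the E and M steps above for a discrete HMM. The random initialization, the toy observation sequence, and the fixed iteration cap (instead of a convergence test) are illustrative assumptions; practical implementations scale α and β or work in log space to avoid underflow, and train on many sequences.

```python
# Sketch: a compact Baum-Welch (EM) trainer for a discrete two-state HMM.
# States, observation alphabet, and random initialization are illustrative.
import random

states, symbols = ["H", "C"], [1, 2, 3]

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def random_dist(keys):
    return normalize({k: random.random() + 0.1 for k in keys})

def forward(obs, start, trans, emit):
    a = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        a.append({s: sum(a[-1][p] * trans[p][s] for p in states) * emit[s][o]
                  for s in states})
    return a

def backward(obs, trans, emit):
    b = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        b.insert(0, {s: sum(trans[s][n] * emit[n][o] * b[0][n] for n in states)
                     for s in states})
    return b

def baum_welch(obs, iterations=20):
    start = random_dist(states)
    trans = {s: random_dist(states) for s in states}
    emit = {s: random_dist(symbols) for s in states}
    for _ in range(iterations):
        alpha, beta = forward(obs, start, trans, emit), backward(obs, trans, emit)
        likelihood = sum(alpha[-1][s] for s in states)
        # E step: expected state and transition counts (gamma and xi)
        gamma = [{s: alpha[t][s] * beta[t][s] / likelihood for s in states}
                 for t in range(len(obs))]
        xi = [{(i, j): alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
               / likelihood for i in states for j in states}
              for t in range(len(obs) - 1)]
        # M step: re-estimate start, transition, and emission probabilities
        start = {s: gamma[0][s] for s in states}
        trans = {i: normalize({j: sum(x[(i, j)] for x in xi) for j in states})
                 for i in states}
        emit = {s: normalize({k: sum(g[s] for g, o in zip(gamma, obs) if o == k)
                              for k in symbols}) for s in states}
    return start, trans, emit

print(baum_welch([3, 1, 3, 2, 3, 1, 1, 2, 3, 3]))
```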
Self-study
Training and test sets
• We take a set of test sentences
• Hand-label them for part of speech
• The result is a “Gold Standard” test set
• Who does this?
  • Brown corpus: done by U Penn grad students in linguistics
• Don't they disagree?
  • Yes! But they agree on about 97% of the tags
• And if you let the taggers discuss the remaining 3%, they often reach agreement
• But we can’t train our frequencies on the test set sentences
• So for testing the Most-Frequent-Tag algorithm (or any other probabilistic algorithm), we need
2 things:
• A hand-labeled training set: the data that we compute frequencies from, ….
• A hand-labeled test set: The data that we use to compute our % correct.
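A tiny sketch of the "% correct" computation on a hand-labeled test set. The gold and predicted tag sequences below are invented for illustration.

```python
# Sketch: computing tagging accuracy (% correct) against a gold-standard
# hand-labeled test set. The tag sequences below are invented examples.
def tag_accuracy(gold_sentences, predicted_sentences):
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for g, p in zip(gold, predicted):
            correct += (g == p)
            total += 1
    return correct / total

gold = [["PRP", "VBD", "TO", "VB", "DT", "NN"]]
pred = [["PRP", "VBN", "TO", "VB", "DT", "NN"]]
print(f"{tag_accuracy(gold, pred):.1%}")   # 83.3% (5 of 6 tags correct)
```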
Computing % correct
Training and test sets
Evaluation and rule-based taggers