13-TextGen-2024

Natural Language Generation (NLG) is a sub-field of natural language processing focused on creating systems that produce coherent text for human use. The document outlines various NLG tasks such as machine translation, dialogue systems, and summarization, as well as the autoregressive models used for text generation. It also discusses decoding methods and training algorithms essential for developing effective NLG systems.

Natural Language Generation: Task
Antoine Bosselut
What is natural language generation?

• Natural language generation (NLG) is a sub-field of natural language processing

• Focused on building systems that automatically produce coherent and useful written or spoken text for human consumption

• NLG systems are already changing the world we live in…
Machine Translation
Dialogue Systems
Summarization
Document Summarization, E-mail Summarization, Meeting Summarization

http://mogren.one/lic/   https://chrome.google.com/webstore/detail/gmail-summarization/   (Wang and Cardie, ACL 2013)
Data-to-Text Generation

(Parikh et al., EMNLP 2020)   (Wiseman and Rush, EMNLP 2017)   (Dusek et al., INLG 2019)
Visual Description Generation

Two children are sitting at a table in a restaurant. The children are one little girl and one little boy. The little girl is eating a pink frosted donut with white icing lines on top of it. The girl has blonde hair and is wearing a green jacket with a black long sleeve shirt underneath. The little boy is wearing a black zip up jacket and is holding his finger to his lip but is not eating. A metal napkin dispenser is in between them at the table. The wall next to them is white brick. Two adults are on the other side of the short white brick wall. The room has white circular lights on the ceiling and a large window in the front of the restaurant. It is daylight outside.

(Karpathy & Li, CVPR 2015)   (Krause et al., CVPR 2017)
Creative Generation

Stories & Narratives (Rashkin et al., EMNLP 2020)   Poetry (Ghazvininejad et al., ACL 2017)
All-in-one: ChatGPT
What is natural language generation?

Any task involving text production for human consumption requires natural language generation.

Deep Learning is powering next-gen NLG systems!
Today’s Outline

• Introduction

• Section 1: Formalizing NLG: a simple model and training algorithm

• Section 2: Decoding from NLG models

• Section 3: Evaluating NLG Systems

• Exercise Session: Playing around with our own story generation system
Basics of natural language generation

• Most text generation models are autoregressive: they predict the next token based on the values of past tokens.

• In autoregressive text generation models, at each time step t, our model takes in a sequence of tokens {y_{<t}} as input and outputs a new token, ŷ_t.

• This step repeats: the predicted token ŷ_t joins the input sequence when predicting ŷ_{t+1}, and so on, one token at a time.
Basics: What are we trying to do?

• At each time step t, our model computes a vector of scores for each token in our vocabulary, S ∈ ℝ^|V|:

    S = f({y_{<t}}, θ)        f(.) is your model

• Then, we compute a probability distribution P over tokens w ∈ V using these scores (the softmax function):

    P(y_t = w | {y_{<t}}) = exp(S_w) / Σ_{w′∈V} exp(S_{w′})
Basics: What are we trying to do?

Example: given the prefix "He wanted to go to the …", the model assigns a probability to each candidate next token, e.g. restroom, grocery, store, airport, pub, gym, bathroom, game, beach, hospital, doctor.

• At inference time, our decoding algorithm defines a function to select a token from this distribution P:

    ŷ_t = g(P(y_t | {y_{<t}}))        g(.) is your decoding algorithm
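To make the pipeline concrete (scores S from the model, softmax to get P, then a decoding function g), here is a minimal Python sketch. The `model_scores` function and the toy vocabulary are placeholders for illustration, not part of the original slides.

```python
import numpy as np

def softmax(scores):
    """Turn the score vector S into a probability distribution over the vocabulary."""
    exp_s = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp_s / exp_s.sum()

def decode_step(model_scores, prefix_tokens, g):
    """One generation step: S = f(y_<t), P = softmax(S), y_hat_t = g(P)."""
    scores = model_scores(prefix_tokens)    # S in R^|V|, from the (placeholder) model f
    probs = softmax(scores)                 # P(y_t = w | y_<t)
    return g(probs)                         # the decoding algorithm picks the token

# Toy usage with a 5-token vocabulary and argmax as the decoding algorithm g
toy_model = lambda prefix: np.array([1.0, 3.0, 0.5, 2.0, -1.0])
argmax_g = lambda probs: int(np.argmax(probs))
print(decode_step(toy_model, prefix_tokens=[0, 2], g=argmax_g))  # prints 1
```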
Basics: What are we trying to do?

• We train the model to minimize the negative log-likelihood of predicting the next token in the sequence:

    ℒ_t = − log P(y*_t | {y*_{<t}})        (sum ℒ_t over the entire sequence)

    - This is a multi-class classification task where each w ∈ V is a unique class.
    - The label at each step is the actual next word in the training sequence.
    - This token is often called the "gold" or "ground truth" token.
    - This algorithm is often called "teacher forcing".
Maximum Likelihood Training (i.e., teacher forcing)

• The model is trained to generate the next word y*_t given the set of preceding gold words {y*_{<t}}.

• The per-step losses accumulate as the sequence unfolds:

    ℒ = −(log P(y*_1 | y*_0) + log P(y*_2 | y*_0, y*_1) + log P(y*_3 | y*_0, y*_1, y*_2) + …)

• Over a full training sequence of length T (ending in <END>):

    ℒ = − Σ_{t=1}^{T} log P(y*_t | {y*_{<t}})
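A minimal sketch of this training objective in PyTorch, assuming the model has already produced a matrix of scores under teacher forcing; the tensors below are toy stand-ins for a real model's outputs.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(logits, gold_tokens):
    """Negative log-likelihood of the gold sequence under the model.

    logits:      (T, |V|) scores S_t the model computes at each step when fed
                 the gold prefix y*_{<t} as input (teacher forcing).
    gold_tokens: (T,) indices of the gold next tokens y*_t.
    """
    # cross_entropy applies the softmax and sums -log P(y*_t | y*_{<t}) over steps
    return F.cross_entropy(logits, gold_tokens, reduction="sum")

# Toy usage: T = 3 steps, vocabulary of size 5
logits = torch.randn(3, 5, requires_grad=True)   # stand-in for the model's scores
gold = torch.tensor([2, 0, 4])                   # gold tokens y*_1, y*_2, y*_3
loss = teacher_forcing_loss(logits, gold)
loss.backward()                                  # gradients would flow back into the model
```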
Text Generation: Takeaways

• Text generation is the foundation of many useful NLP applications (e.g., translation, summarisation, dialogue systems)

• In autoregressive NLG, we generate one token at a time, using the context and previously generated tokens as inputs for generating the next token

• Our model generates a set of scores for every token in the vocabulary, which we can convert to a probability distribution using the softmax function

• To get a calibrated distribution, we train our model using maximum likelihood estimation to predict the next token on a dataset of sequences
Natural Language Generation: Decoding
Antoine Bosselut
Section Outline

• Content - Greedy Decoding Methods: Argmax, Beam Search

• Content - Challenges of Greedy Decoding

• Content - Sampling Methods: Top-k, Top-p

• Advanced - kNN Language Models; Backprop-based decoding

Decoding: what is it all about?

• At each time step t, our model computes a vector of scores for each token in our vocabulary, S ∈ ℝ^|V|:

    S = f({y_{<t}})        f(.) is your model

• Then, we compute a probability distribution over these scores (usually with a softmax function):

    P(y_t = w | {y_{<t}}) = exp(S_w) / Σ_{w′∈V} exp(S_{w′})

• Our decoding algorithm defines a function to select a token from this distribution:

    ŷ_t = g(P(y_t | {y_{<t}}))        g(.) is your decoding algorithm
Decoding: what is it all about?

• Our decoding algorithm defines a function to select a token from this distribution:

    ŷ_t = g(P(y_t | {y*}, ŷ_{<t}))

• Starting from <START> (and the source tokens y*), the model repeatedly selects ŷ_1, ŷ_2, … until it generates <END>.
Greedy methods: Argmax Decoding

• g = select the token with the highest probability:

    ŷ_t = argmax_{w∈V} P(y_t = w | {y}_{<t})

Example: for the prefix "He wanted to go to the …", argmax decoding always selects the single most probable candidate among options such as restroom, grocery, store, airport, pub, gym, bathroom, game, beach, hospital, doctor.

What's a potential problem with argmax decoding?
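As a sketch, greedy (argmax) decoding is just a loop that repeatedly picks the most probable next token; `next_token_probs` is a hypothetical stand-in for the model's distribution, not an API from the slides.

```python
import numpy as np

def greedy_decode(next_token_probs, start_tokens, end_id, max_len=50):
    """Argmax decoding: at every step, append the single most probable token.

    `next_token_probs(prefix)` is a placeholder for the model's distribution
    P(y_t = . | y_<t); it should return a probability vector over the vocabulary.
    """
    tokens = list(start_tokens)
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        next_id = int(np.argmax(probs))   # g = argmax, so the output is deterministic
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens
```

Because every step commits to one choice, a single early mistake can never be undone, which is exactly the issue the next slides discuss.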
Issues with argmax decoding

• In argmax (greedy) decoding, we cannot go back and revise prior decisions; there is no way to undo a bad choice.

    • les pauvres sont démunis (the poor don't have any money)
    • → the ____
    • → the poor ____
    • → the poor are ____

• Potentially leads to sequences that are:
    - Ungrammatical
    - Unnatural
    - Nonsensical
    - Incorrect
Greedy methods: Beam Search

• In greedy decoding, we cannot go back and revise prior decisions.

• Better option: use beam search (a search algorithm) to explore several different hypotheses instead of just one, and select the best one.

    • Keep track of the b highest-scoring partial hypotheses at each decoder step instead of just one.

    • Score of a hypothesis after j steps: Σ_{t=1}^{j} log P(ŷ_t | ŷ_1, …, ŷ_{t−1}, X)

    • b is called the beam size (in practice around 5-10).
Beam search decoding: example

Beam size b = 2. At each step, both hypotheses are expanded and only the two with the highest cumulative score Σ_t log P(ŷ_t | ŷ_0, …, ŷ_{t−1}) are kept:

• Step 1: "the" (-1.05) and "a" (-1.39) are kept.
• Step 2: expansions are "the poor" (-1.90), "the people" (-2.3), "a poor" (-1.54), "a person" (-3.2); keep "a poor" and "the poor".
• Step 3: expansions are "the poor are" (-2.42), "the poor don't" (-2.13), "a poor person" (-3.12), "a poor but" (-3.53); keep "the poor don't" and "the poor are".
• Step 4: expansions are "the poor don't have" (-3.32), "the poor don't take" (-3.61), "the poor are not" (-2.67), "the poor are always" (-3.82); keep "the poor are not" and "the poor don't have".
• And so on, with continuations such as "in", "with", "any", "enough", "money", "funds", until the hypotheses are complete.
Greedy methods: Beam Search

• To pick the best-scoring path at every step, equivalently:

    • Maximize the likelihood of the sequence, or
    • Maximize the log-likelihood of the sequence, or
    • Minimize the negative log-likelihood of the sequence.

• In each case, use the (negative) (log-)likelihood of the full sequence up to this point.
Beam Search

• Different hypotheses may produce the <END> token at different time steps.
    - When a hypothesis produces <END>, stop expanding it and place it aside.

• Continue beam search until:
    - all b beams (hypotheses) produce <END>, OR
    - we hit the max decoding limit T.

• Select the top hypothesis using the length-normalized likelihood score:

    (1/T) Σ_{t=1}^{T} log P(ŷ_t | ŷ_1, …, ŷ_{t−1}, X)

    - Otherwise shorter hypotheses would have higher scores.
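A minimal beam search sketch following the recipe above (keep b hypotheses, set finished ones aside, length-normalize their scores); `next_token_logprobs` is a placeholder for the model, and the candidate-pruning details are simplifying assumptions.

```python
def beam_search(next_token_logprobs, start_tokens, end_id, beam_size=2, max_len=20):
    """Keep the `beam_size` highest-scoring partial hypotheses at each step.

    `next_token_logprobs(prefix)` is a placeholder for the model: it should
    return a dict {token_id: log P(token | prefix)}.
    """
    beams = [(list(start_tokens), 0.0)]        # (tokens, cumulative log-probability)
    finished = []                              # completed hypotheses, set aside
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == end_id:
                finished.append((tokens, score / len(tokens)))   # length-normalized
            elif len(beams) < beam_size:
                beams.append((tokens, score))
            if len(beams) == beam_size:
                break
        if not beams:                          # every surviving hypothesis has ended
            break
    finished.extend((t, s / len(t)) for t, s in beams)           # unfinished leftovers
    return max(finished, key=lambda c: c[1])[0]
```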
What do you think might happen if we increase the beam size?

Greedy methods (argmax and beam search) maximise the likelihood of the sequence.

What do maximum likelihood sequences look like?
Why does repetition happen?

The negative log-likelihood of the repeated phrase decreases over time!

(Holtzman et al., ICLR 2020)

Beam search gets repetitive and repetitive

Worse for transformer LMs.

The longer it goes, the worse it gets.

(Holtzman et al., ICLR 2020)


Greedy methods get repetitive

Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…

Repetition is a big problem in text generation!

(Holtzman et al., ICLR 2020)
How can we reduce repetition?

Simple option:
• Heuristic: don't repeat n-grams (a sketch of this filter follows below)

More complex:
• Minimize embedding distance between consecutive sentences (Celikyilmaz et al., 2018)
    • Doesn't help with intra-sentence repetition
• Coverage loss (See et al., 2017)
    • Prevents the attention mechanism from attending to the same words
• Unlikelihood objective (Welleck et al., 2020)
    • Penalize generation of already-seen tokens
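A sketch of the simple "don't repeat n-grams" heuristic from the list above: ban any next token that would recreate an n-gram already present in the generated prefix, then renormalize. The dictionary-based distribution interface is an assumption for illustration.

```python
def block_repeated_ngrams(probs, prefix, n=3):
    """Zero out any next token that would complete an n-gram already in the prefix.

    probs:  dict {token_id: probability} from the model for the next step.
    prefix: list of already-generated token ids.
    """
    if len(prefix) < n - 1:
        return probs
    banned = set()
    recent = tuple(prefix[-(n - 1):])                  # the last n-1 generated tokens
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == recent:       # this (n-1)-gram occurred before...
            banned.add(prefix[i + n - 1])              # ...so ban the token that followed it
    filtered = {tok: (0.0 if tok in banned else p) for tok, p in probs.items()}
    total = sum(filtered.values()) or 1.0
    return {tok: p / total for tok, p in filtered.items()}
```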
Are greedy methods reasonable?

(Holtzman et al., ICLR 2020)


Time to get random: Sampling!

• Sample a token from the distribution of tokens:

    ŷ_t ∼ P(y_t = w | {y}_{<t})

• It's random, so you can sample any token!

Example: for the prefix "He wanted to go to the …", any of restroom, grocery, store, airport, bathroom, beach, doctor, hospital, pub, gym could be sampled.

What's a potential problem with sampling?
Decoding: Top-k sampling

• Problem: Vanilla sampling makes every token in the vocabulary an option
    • Even if most of the probability mass in the distribution is over a limited set of options, the tail of the distribution could be very long
    • Many tokens are probably irrelevant in the current context
    • Why are we giving them individually a tiny chance to be selected?
    • Why are we giving them as a group a high chance to be selected?

• Solution: Top-k sampling
    • Only sample from the top k tokens in the probability distribution

(Fan et al., ACL 2018; Holtzman et al., ACL 2018)


Decoding: Top-k sampling

• Solution: Top-k sampling
    • Only sample from the top k tokens in the probability distribution
    • Common values are k = 5, 10, 20 (but it's up to you!)
    • Increase k for more diverse/risky outputs
    • Decrease k for more generic/safe outputs

What's a potential problem with top-k sampling?

(Fan et al., ACL 2018; Holtzman et al., ACL 2018)


Issues with Top-k sampling

Top-k sampling can cut off too quickly!

Top-k sampling can also cut off too slowly!

(Holtzman et al., ICLR 2020)


Decoding: Top-p (nucleus) sampling

• Problem: The probability distributions we sample from are dynamic
    • When the distribution Pt is flatter, a limited k removes many viable options
    • When the distribution Pt is peakier, a high k allows too many options to have a chance of being selected

• Solution: Top-p sampling
    • Sample from all tokens in the top p cumulative probability mass (i.e., where the mass is concentrated)
    • Varies k depending on the uniformity of Pt
    • (Both top-k and top-p sampling are sketched below.)

(Holtzman et al., ICLR 2020)
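Minimal sketches of both sampling schemes, assuming the model's next-token distribution is given as a NumPy probability vector; the example distribution is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k=10):
    """Sample only from the k most probable tokens (renormalized)."""
    top = np.argsort(probs)[-k:]                  # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def top_p_sample(probs, p=0.9):
    """Nucleus sampling: sample from the smallest set of tokens whose cumulative
    probability exceeds p, so the effective k adapts to how flat or peaked P_t is."""
    order = np.argsort(probs)[::-1]               # tokens sorted by decreasing probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    q = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=q))

# Example distribution over a 6-token vocabulary
probs = np.array([0.42, 0.30, 0.15, 0.08, 0.04, 0.01])
print(top_k_sample(probs, k=3), top_p_sample(probs, p=0.9))
```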


Scaling randomness: Softmax temperature

• Recall: On timestep t, the model computes a probability distribution P_t by applying the softmax function to a vector of scores S ∈ ℝ^|V|:

    P_t(y_t = w) = exp(S_w) / Σ_{w′∈V} exp(S_{w′})

• You can apply a temperature hyperparameter τ to the softmax to rebalance P_t:

    P_t(y_t = w) = exp(S_w / τ) / Σ_{w′∈V} exp(S_{w′} / τ)

• Raise the temperature (τ > 1):
    • P_t becomes more uniform
    • More diverse output (probability mass is spread around the vocabulary)

• Lower the temperature (τ < 1):
    • P_t becomes more spiky
    • Less diverse output (probability mass is concentrated on the top words)
What happens if the temperature τ goes to 0?

    P_t(y_t = w) = exp(S_w / τ) / Σ_{w′∈V} exp(S_{w′} / τ)
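A small sketch showing how the temperature τ reshapes the softmax; the printed distributions illustrate that a large τ flattens the distribution, while τ → 0 concentrates essentially all mass on the highest-scoring token (sampling degenerates to argmax).

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """P(y_t = w) = exp(S_w / tau) / sum_w' exp(S_w' / tau)."""
    z = (scores - scores.max()) / tau        # stabilized, temperature-scaled scores
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(scores, tau=2.0))    # tau > 1: flatter, more diverse
print(softmax_with_temperature(scores, tau=1.0))    # standard softmax
print(softmax_with_temperature(scores, tau=0.1))    # tau -> 0: mass collapses onto the argmax
```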
Improving decoding: re-balancing distributions

• Problem: What if I don't trust how well my model's distributions are calibrated?
    • Don't rely ONLY on your model's distribution over tokens

• Solution #1: Re-balance Pt using retrieval from n-gram phrase statistics!

(Khandelwal et al., ICLR 2020)
Improving decoding: re-balancing distributions

• Solution #1: Re-balance Pt using retrieval from n-gram phrase statistics!
    • Cache a database of phrases from your training corpus (or some other corpus)
    • At decoding time, search for the most similar phrases in the database
    • Re-balance Pt using the induced distribution Pphrase over words that follow these phrases (a sketch of this interpolation follows below)

(Khandelwal et al., ICLR 2020)
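The re-balancing idea can be illustrated with a simple sketch: build a distribution from the tokens that followed the retrieved phrases and interpolate it with the model's distribution. The count-based retrieval distribution and the mixing weight `lam` are simplifying assumptions for illustration, not the exact kNN-LM procedure from the paper.

```python
import numpy as np

def phrase_distribution(retrieved_next_tokens, vocab_size):
    """Turn the tokens that followed the retrieved similar phrases into a
    distribution P_phrase over the vocabulary (simple count-based version)."""
    counts = np.bincount(retrieved_next_tokens, minlength=vocab_size).astype(float)
    return counts / counts.sum()

def rebalance(p_model, p_phrase, lam=0.25):
    """Interpolate the model's distribution P_t with the retrieval-induced
    distribution P_phrase; `lam` is an assumed mixing weight."""
    mixed = (1.0 - lam) * p_model + lam * p_phrase
    return mixed / mixed.sum()    # guard against rounding drift

# Toy usage: vocabulary of size 4, retrieved phrases were followed by tokens 2, 2, 3
p_model = np.array([0.4, 0.3, 0.2, 0.1])
p_phrase = phrase_distribution(np.array([2, 2, 3]), vocab_size=4)
print(rebalance(p_model, p_phrase))
```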
Improving Decoding: Re-ranking

• Problem: What if I decode a bad sequence from my model?

• Decode a bunch of sequences
    • 10 candidates is a common number, but it's up to you

• Define a score to approximate the quality of sequences and re-rank by this score
    • The simplest is to use perplexity!
        • Careful! Remember that repetitive sequences generally get low perplexity, so they would be ranked favourably.
    • Re-rankers can score a variety of properties: style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more…
    • Beware of poorly-calibrated re-rankers
    • Can use multiple re-rankers in parallel (a sketch follows below)
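A generic re-ranking sketch: score each decoded candidate with one or more scorers (for example a length-normalized log-likelihood) and keep the best. The scorer interface here is hypothetical, not a specific system from the slides.

```python
def rerank(candidates, scorers, weights=None):
    """Re-rank decoded candidate sequences by a weighted sum of scorer outputs.

    candidates: list of decoded sequences.
    scorers:    list of functions mapping a sequence to a quality score, e.g. a
                length-normalized log-likelihood, a style or factuality classifier.
    """
    weights = weights or [1.0] * len(scorers)
    scored = [(sum(w * s(c) for w, s in zip(weights, scorers)), c) for c in candidates]
    return max(scored, key=lambda x: x[0])[1]

def make_normalized_loglikelihood_scorer(token_logprobs_fn):
    """Build a scorer: average per-token log-probability of the sequence,
    so longer candidates are not unfairly penalized."""
    return lambda seq: sum(token_logprobs_fn(seq)) / max(len(seq), 1)
```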
Decoding: Takeaways

• Decoding is still a challenging problem in natural language generation

• Human language distribution is noisy and doesn’t reflect simple properties (i.e.,
probability maximization)

• Different decoding algorithms can allow us to inject biases that encourage different
properties of coherent natural language generation

• Some of the most impactful advances in NLG of the last few years have come from
simple, but effective, modifications to decoding algorithms

• A lot more work to be done!


Decoding References
[1] Gulcehre et al., On Using Monolingual Corpora in Neural Machine Translation. arXiv 2015
[2] Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016
[3] Venugopalan et al., Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text. EMNLP 2016
[4] Li et al., A Diversity-Promoting Objective Function for Neural Conversation Models. EMNLP 2018
[5] Paulus et al., A Deep Reinforced Model for Abstractive Summarization. ICLR 2018
[6] Celikyilmaz et al., Deep Communicating Agents for Abstractive Summarization. NAACL 2018
[7] Holtzman et al., Learning to Write with Cooperative Discriminators. ACL 2018
[8] Fan et al., Hierarchical Neural Story Generation. ACL 2018
[9] Gabriel et al., Discourse Understanding and Factual Consistency in Abstractive Summarization. EACL 2021
[10] Dathathri et al., Plug and Play Language Models: A Simple Approach to Controlled Text Generation. ICLR 2020
[11] Holtzman et al., The Curious Case of Neural Text Degeneration. ICLR 2020
[12] Khandelwal et al., Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020
[13] Qin et al., Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense
Reasoning. EMNLP 2020

Natural Language Generation: Evaluation
Antoine Bosselut
Greedy methods get repetitive

Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…

(Holtzman et al., ICLR 2020)

How should we evaluate the quality of this sequence?
Perplexity: A first try

• Evaluate the quality of the model based on the perplexity of the model on reference sentences.

• Why can't we use the perplexity of our generated sentences?

    • Decoding algorithms that minimise perplexity (i.e., argmax, beam search) would be advantaged even if they don't produce the best text.

    • Perplexity on reference sequences tells us how calibrated our model is to real sequences, but doesn't say much about the generations it produces.
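For reference, perplexity is just the exponentiated average negative log-likelihood per token of a reference sequence, as in this small sketch (the log-probabilities are made-up illustrative values).

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a reference sequence under the model:
    exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# e.g., log-probabilities the model assigns to each gold token of a reference
print(perplexity([-0.1, -2.3, -0.7, -1.2]))   # lower is better (the model is less "surprised")
```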
How do you think text generation evaluation differs compared to classification evaluation?
A simple dialogue
Are you going to Prof.
Bosselut’s CS431 lecture?

Heck yes !

Yes !

You know it !

Yup .

Any “right” answer you know could be one of many!


Section Outline

Ref: They walked to the grocery store .

Gen: The woman went to the hardware store .

Content Overlap Metrics Model-based Metrics Human Evaluations

(Some slides repurposed from Asli Celikyilmaz from EMNLP 2020 tutorial)
Content overlap metrics

Ref: They walked to the grocery store .

Gen: The woman went to the hardware store .

• Compute a score that indicates the similarity between generated and gold-standard (human-written) text

• Fast and efficient and widely used

• Two broad categories:
    - N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.)
    - Semantic overlap metrics (e.g., PYRAMID, SPICE, SPIDEr, etc.)
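As an illustration of the n-gram overlap idea (not BLEU itself: no clipping, brevity penalty, or averaging over multiple n), here is a toy n-gram precision applied to the Ref/Gen pair above.

```python
def ngram_precision(generated, reference, n=2):
    """Toy overlap score: fraction of the generated text's n-grams that also
    appear in the reference (the core idea behind BLEU-style metrics)."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    gen, ref = ngrams(generated.split()), set(ngrams(reference.split()))
    if not gen:
        return 0.0
    return sum(g in ref for g in gen) / len(gen)

print(ngram_precision("The woman went to the hardware store .",
                      "They walked to the grocery store .", n=2))   # ~0.29
```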
N-gram overlap metrics

Word overlap based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.)

• They're not ideal for machine translation, but automatic metrics such as BLEU do correlate with human judgments of quality.
A simple failure case

Are you going to Prof. Bosselut's CS431 lecture?

Reference response: Heck yes !

    Score 0.61: Yes !
    Score 0.25: You know it !
    Score 0:    Yup .        (false negative)
    Score 0.67: Heck no !    (false positive)

n-gram overlap metrics have no concept of semantic relatedness!


A more comprehensive failure analysis

(Liu et al, EMNLP 2016)


N-gram overlap metrics

Word overlap based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.)

• They're not ideal for machine translation

• They get progressively much worse for tasks that are more open-ended than machine translation
    - Worse for summarization, where extractive methods that copy from documents are preferred
    - Much worse for dialogue, which is more open-ended than summarization
    - Much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem like you're getting decent scores!
Semantic overlap metrics

PYRAMID: Incorporates human content selection variation in summarization evaluation. Identifies Summarization Content Units (SCUs) to compare information content in summaries. (Nenkova et al., 2007)

SPICE: Semantic propositional image caption evaluation is an image captioning metric that initially parses the reference text to derive an abstract scene graph representation. (Anderson et al., 2016)

SPIDEr: A combination of semantic graph similarity (SPICE) and an n-gram similarity measure (CIDEr); the combined metric yields a more complete quality evaluation. (Liu et al., 2017)


Model-based metrics

• Use learned representations of words and sentences to compute semantic similarity between generated and reference texts

• No more n-gram bottleneck, because text units are represented as embeddings!

• Even though the embeddings are pretrained, the distance metrics used to measure the similarity can be fixed
Model-based metrics: Word distance functions

Vector Similarity: Embedding-based similarity for semantic distance between text.
    • Embedding Average (Liu et al., 2016)
    • Vector Extrema (Liu et al., 2016)
    • MEANT (Lo, 2017)
    • YISI (Lo, 2019)

Word Mover's Distance: Measures the distance between two sequences (e.g., sentences, paragraphs, etc.), using word embedding similarity matching. (Kusner et al., 2015; Zhao et al., 2019)

BERTScore: Uses pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. (Zhang et al., 2020)
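A toy sketch of the BERTScore-style matching idea: greedily match each token embedding to its most similar counterpart in the other sentence and average the cosine similarities. Real BERTScore uses contextual BERT embeddings and optional IDF weighting; the embedding lists here are assumed inputs from any embedder.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def greedy_match_score(gen_embs, ref_embs):
    """BERTScore-style F1 sketch over lists of token embedding vectors."""
    precision = np.mean([max(cosine(g, r) for r in ref_embs) for g in gen_embs])
    recall = np.mean([max(cosine(g, r) for g in gen_embs) for r in ref_embs])
    return 2 * precision * recall / (precision + recall + 1e-12)
```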


Model-based metrics: Beyond word matching

Sentence Mover's Similarity: Based on Word Mover's Distance, evaluates text in a continuous space using sentence embeddings from recurrent neural network representations. (Clark et al., 2019)

BLEURT: A regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text. (Sellam et al., 2020)


Model-based metrics: LLMs

• Use LLMs to evaluate generation outputs according to a clearly defined rubric
    - G-Eval (Liu et al., 2023)
    - LLM-as-a-judge (Zheng et al., 2023)

Example prompt (LLM-as-a-judge, pairwise comparison):
[System] Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.
[User Question] {question}
[The Start of Assistant A's Answer] {answer_a} [The End of Assistant A's Answer]
[The Start of Assistant B's Answer] {answer_b} [The End of Assistant B's Answer]

Example prompt (LLM-as-a-judge, single-answer grading):
[System] Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, please rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question] {question}
[The Start of Assistant's Answer] {answer} [The End of Assistant's Answer]

Example (G-Eval, summarization coherence):
Task Introduction: You will be given one summary written for a news article. Your task is to rate the summary on one metric……
Evaluation Criteria: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence……
Evaluation Steps (generated with auto chain-of-thought):
1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest, based on the Evaluation Criteria.
Input Context: Article: Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team's 0-0 draw with Burnley on ……
Input Target: Summary: Paul merson was brought on with only seven minutes remaining in his team's 0-0 draw with burnley ……
Evaluation Form (scores ONLY): Coherence. G-Eval weights the possible scores by their token probabilities (weighted summed score: 2.59 in this example).
What might be a benefit of model-based metrics compared to overlap metrics?
Human evaluations

• Automatic metrics fall short of matching human decisions

• Most important form of evaluation for text generation systems
    - >75% of generation papers at ACL 2019 include human evaluations

• Gold standard in developing new automatic metrics
    - New automated metrics must correlate well with human evaluations!
Human evaluations

• Ask humans to evaluate the quality of generated text

• Overall or along some specific dimension:
    - fluency
    - coherence / consistency
    - factuality and correctness
    - commonsense
    - style / formality
    - grammaticality
    - typicality
    - redundancy

(For details, see Celikyilmaz, Clark, Gao, 2020)

Note: Don't compare human evaluation scores across differently-conducted studies, even if they claim to evaluate the same dimensions!
Human evaluations: case study
Human evaluation: Issues

• Human judgments are regarded as the gold standard

• Human evaluation is slow and expensive

Suppose you can run a human evaluation. Do we have anything to worry about?
• Human judgments are regarded as the
gold standard Humans:
• are inconsistent
• Human evaluation is slow and expensive • can be illogical
(compared to automatic evaluation), • lose concentration
• misinterpret your question
even if your humans try to speed it up!
• can’t always explain why
they feel the way they do
• Conducting effective human evaluations • May try to speed through
is difficult your evaluation
Evaluation: Takeaways
• Content overlap metrics provide a good starting point for evaluating the quality of
generated text, but they’re not good enough on their own.

• Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable

• Human judgments are critical.


- Only ones that can directly evaluate factuality – is the model saying correct things?
- But humans are inconsistent!

• In many cases, the best judge of output quality is YOU!

• Look at your model generations. Don’t just rely on numbers!


Concluding Thoughts
• Interacting with natural language generation systems quickly shows their limitations

• Even in tasks with more progress, there are still many improvements ahead

• Evaluation remains a huge challenge.


- We need better ways of automatically evaluating performance of NLG systems

• With the advent of large-scale language models, deep NLG research has been reset
- it’s never been easier to jump in the space!

• One of the most exciting areas of NLP to work in!
