13-TextGen-2024

Natural Language Generation (NLG) is a sub-field of natural language processing focused on creating systems that produce coherent text for human use. The document outlines various NLG tasks such as machine translation, dialogue systems, and summarization, as well as the autoregressive models used for text generation. It also discusses decoding methods and training algorithms essential for developing effective NLG systems.

Natural Language Generation: Task
Antoine Bosselut
What is natural language generation?

• Natural language generation (NLG) is a sub-field of natural language processing

• Focused on building systems that automatically produce coherent and useful written or spoken text for human consumption

• NLG systems are already changing the world we live in…
Machine Translation
Dialogue Systems
Summarization
Document Summarization, E-mail Summarization, Meeting Summarization

http://mogren.one/lic/   https://chrome.google.com/webstore/detail/gmail-summarization/   (Wang and Cardie, ACL 2013)
Data-to-Text Generation

(Parikh et al., EMNLP 2020)   (Wiseman and Rush, EMNLP 2017)   (Dusek et al., INLG 2019)
Visual Description Generation

Two children are sitting at a table in a restaurant. The children are one little girl and one little boy. The little girl is eating a pink frosted donut with white icing lines on top of it. The girl has blonde hair and is wearing a green jacket with a black long sleeve shirt underneath. The little boy is wearing a black zip up jacket and is holding his finger to his lip but is not eating. A metal napkin dispenser is in between them at the table. The wall next to them is white brick. Two adults are on the other side of the short white brick wall. The room has white circular lights on the ceiling and a large window in the front of the restaurant. It is daylight outside.

(Karpathy & Li, CVPR 2015)   (Krause et al., CVPR 2017)
Creative Generation

Stories & Narratives (Rashkin et al., EMNLP 2020)   Poetry (Ghazvininejad et al., ACL 2017)
All-in-one: ChatGPT
What is natural language generation?

Any task involving text production for human consumption requires natural language generation.

Deep Learning is powering next-gen NLG systems!
Today’s Outline

• Introduction

• Section 1: Formalizing NLG: a simple model and training algorithm

• Section 2: Decoding from NLG models

• Section 3: Evaluating NLG Systems

• Exercise Session: Playing around with our own story generation system
Basics of natural language generation

• Most text generation models are autoregressive: they predict the next token based on the values of past tokens.

• In autoregressive text generation models, at each time step t, our model takes in a sequence of tokens {y_{<t}} as input and outputs a new token, ŷ_t.

• This step repeats: the predicted token ŷ_t joins the input sequence when predicting ŷ_{t+1}, and so on, one token at a time.
Basics: What are we trying to do?

• At each time step t, our model computes a vector of scores for each token in our vocabulary, S ∈ ℝ^|V|:

    S = f({y_{<t}}, θ)        f(.) is your model

• Then, we compute a probability distribution P over tokens w ∈ V using these scores (the softmax function):

    P(y_t = w | {y_{<t}}) = exp(S_w) / Σ_{w′∈V} exp(S_{w′})
Basics: What are we trying to do?

Example: given the prefix "He wanted to go to the …", the model assigns a probability to each candidate next token, e.g. restroom, grocery, store, airport, pub, gym, bathroom, game, beach, hospital, doctor.

• At inference time, our decoding algorithm defines a function to select a token from this distribution P:

    ŷ_t = g(P(y_t | {y_{<t}}))        g(.) is your decoding algorithm
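To make the pipeline concrete (scores S from the model, softmax to get P, then a decoding function g), here is a minimal Python sketch. The `model_scores` function and the toy vocabulary are placeholders for illustration, not part of the original slides.

```python
import numpy as np

def softmax(scores):
    """Turn the score vector S into a probability distribution over the vocabulary."""
    exp_s = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp_s / exp_s.sum()

def decode_step(model_scores, prefix_tokens, g):
    """One generation step: S = f(y_<t), P = softmax(S), y_hat_t = g(P)."""
    scores = model_scores(prefix_tokens)    # S in R^|V|, from the (placeholder) model f
    probs = softmax(scores)                 # P(y_t = w | y_<t)
    return g(probs)                         # the decoding algorithm picks the token

# Toy usage with a 5-token vocabulary and argmax as the decoding algorithm g
toy_model = lambda prefix: np.array([1.0, 3.0, 0.5, 2.0, -1.0])
argmax_g = lambda probs: int(np.argmax(probs))
print(decode_step(toy_model, prefix_tokens=[0, 2], g=argmax_g))  # prints 1
```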
Basics: What are we trying to do?

• We train the model to minimize the negative log-likelihood of predicting the next token in the sequence:

    ℒ_t = − log P(y*_t | {y*_{<t}})        (sum ℒ_t over the entire sequence)

    - This is a multi-class classification task where each w ∈ V is a unique class.
    - The label at each step is the actual next word in the training sequence.
    - This token is often called the "gold" or "ground truth" token.
    - This algorithm is often called "teacher forcing".
Maximum Likelihood Training (i.e., teacher forcing)

• The model is trained to generate the next word y*_t given the set of preceding gold words {y*_{<t}}.

• The per-step losses accumulate as the sequence unfolds:

    ℒ = −(log P(y*_1 | y*_0) + log P(y*_2 | y*_0, y*_1) + log P(y*_3 | y*_0, y*_1, y*_2) + …)

• Over a full training sequence of length T (ending in <END>):

    ℒ = − Σ_{t=1}^{T} log P(y*_t | {y*_{<t}})
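A minimal sketch of this training objective in PyTorch, assuming the model has already produced a matrix of scores under teacher forcing; the tensors below are toy stand-ins for a real model's outputs.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(logits, gold_tokens):
    """Negative log-likelihood of the gold sequence under the model.

    logits:      (T, |V|) scores S_t the model computes at each step when fed
                 the gold prefix y*_{<t} as input (teacher forcing).
    gold_tokens: (T,) indices of the gold next tokens y*_t.
    """
    # cross_entropy applies the softmax and sums -log P(y*_t | y*_{<t}) over steps
    return F.cross_entropy(logits, gold_tokens, reduction="sum")

# Toy usage: T = 3 steps, vocabulary of size 5
logits = torch.randn(3, 5, requires_grad=True)   # stand-in for the model's scores
gold = torch.tensor([2, 0, 4])                   # gold tokens y*_1, y*_2, y*_3
loss = teacher_forcing_loss(logits, gold)
loss.backward()                                  # gradients would flow back into the model
```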
Text Generation: Takeaways

• Text generation is the foundation of many useful NLP applications (e.g., translation, summarisation, dialogue systems)

• In autoregressive NLG, we generate one token at a time, using the context and previously generated tokens as inputs for generating the next token

• Our model generates a set of scores for every token in the vocabulary, which we can convert to a probability distribution using the softmax function

• To get a calibrated distribution, we train our model using maximum likelihood estimation to predict the next token on a dataset of sequences
Natural Language Generation: Decoding
Antoine Bosselut
Section Outline

• Content - Greedy Decoding Methods: Argmax, Beam Search

• Content - Challenges of Greedy Decoding

• Content - Sampling Methods: Top-k, Top-p

• Advanced - kNN Language Models; Backprop-based decoding

Decoding: what is it all about?

• At each time step t, our model computes a vector of scores for each token in our vocabulary, S ∈ ℝ^|V|:

    S = f({y_{<t}})        f(.) is your model

• Then, we compute a probability distribution over these scores (usually with a softmax function):

    P(y_t = w | {y_{<t}}) = exp(S_w) / Σ_{w′∈V} exp(S_{w′})

• Our decoding algorithm defines a function to select a token from this distribution:

    ŷ_t = g(P(y_t | {y_{<t}}))        g(.) is your decoding algorithm
Decoding: what is it all about?

• Our decoding algorithm defines a function to select a token from this distribution:

    ŷ_t = g(P(y_t | {y*}, ŷ_{<t}))

• Starting from <START> (and the source tokens y*), the model repeatedly selects ŷ_1, ŷ_2, … until it generates <END>.
Greedy methods: Argmax Decoding

• g = select the token with the highest probability:

    ŷ_t = argmax_{w∈V} P(y_t = w | {y}_{<t})

Example: for the prefix "He wanted to go to the …", argmax decoding always selects the single most probable candidate among options such as restroom, grocery, store, airport, pub, gym, bathroom, game, beach, hospital, doctor.

What's a potential problem with argmax decoding?
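As a sketch, greedy (argmax) decoding is just a loop that repeatedly picks the most probable next token; `next_token_probs` is a hypothetical stand-in for the model's distribution, not an API from the slides.

```python
import numpy as np

def greedy_decode(next_token_probs, start_tokens, end_id, max_len=50):
    """Argmax decoding: at every step, append the single most probable token.

    `next_token_probs(prefix)` is a placeholder for the model's distribution
    P(y_t = . | y_<t); it should return a probability vector over the vocabulary.
    """
    tokens = list(start_tokens)
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        next_id = int(np.argmax(probs))   # g = argmax, so the output is deterministic
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens
```

Because every step commits to one choice, a single early mistake can never be undone, which is exactly the issue the next slides discuss.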
Issues with argmax decoding

• In argmax (greedy) decoding, we cannot go back and revise prior decisions; there is no way to undo a bad choice.

    • les pauvres sont démunis (the poor don't have any money)
    • → the ____
    • → the poor ____
    • → the poor are ____

• Potentially leads to sequences that are:
    - Ungrammatical
    - Unnatural
    - Nonsensical
    - Incorrect
Greedy methods: Beam Search

• In greedy decoding, we cannot go back and revise prior decisions.

• Better option: use beam search (a search algorithm) to explore several different hypotheses instead of just one, and select the best one.

    • Keep track of the b highest-scoring partial hypotheses at each decoder step instead of just one.

    • Score of a hypothesis after j steps: Σ_{t=1}^{j} log P(ŷ_t | ŷ_1, …, ŷ_{t−1}, X)

    • b is called the beam size (in practice around 5-10).
Beam search decoding: example

Beam size b = 2. At each step, both hypotheses are expanded and only the two with the highest cumulative score Σ_t log P(ŷ_t | ŷ_0, …, ŷ_{t−1}) are kept:

• Step 1: "the" (-1.05) and "a" (-1.39) are kept.
• Step 2: expansions are "the poor" (-1.90), "the people" (-2.3), "a poor" (-1.54), "a person" (-3.2); keep "a poor" and "the poor".
• Step 3: expansions are "the poor are" (-2.42), "the poor don't" (-2.13), "a poor person" (-3.12), "a poor but" (-3.53); keep "the poor don't" and "the poor are".
• Step 4: expansions are "the poor don't have" (-3.32), "the poor don't take" (-3.61), "the poor are not" (-2.67), "the poor are always" (-3.82); keep "the poor are not" and "the poor don't have".
• And so on, with continuations such as "in", "with", "any", "enough", "money", "funds", until the hypotheses are complete.
Greedy methods: Beam Search

• To pick the best-scoring path at every step, equivalently:

    • Maximize the likelihood of the sequence, or
    • Maximize the log-likelihood of the sequence, or
    • Minimize the negative log-likelihood of the sequence.

• In each case, use the (negative) (log-)likelihood of the full sequence up to this point.
Beam Search

• Different hypotheses may produce the <END> token at different time steps.
    - When a hypothesis produces <END>, stop expanding it and place it aside.

• Continue beam search until:
    - all b beams (hypotheses) produce <END>, OR
    - we hit the max decoding limit T.

• Select the top hypothesis using the length-normalized likelihood score:

    (1/T) Σ_{t=1}^{T} log P(ŷ_t | ŷ_1, …, ŷ_{t−1}, X)

    - Otherwise shorter hypotheses would have higher scores.
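A minimal beam search sketch following the recipe above (keep b hypotheses, set finished ones aside, length-normalize their scores); `next_token_logprobs` is a placeholder for the model, and the candidate-pruning details are simplifying assumptions.

```python
def beam_search(next_token_logprobs, start_tokens, end_id, beam_size=2, max_len=20):
    """Keep the `beam_size` highest-scoring partial hypotheses at each step.

    `next_token_logprobs(prefix)` is a placeholder for the model: it should
    return a dict {token_id: log P(token | prefix)}.
    """
    beams = [(list(start_tokens), 0.0)]        # (tokens, cumulative log-probability)
    finished = []                              # completed hypotheses, set aside
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == end_id:
                finished.append((tokens, score / len(tokens)))   # length-normalized
            elif len(beams) < beam_size:
                beams.append((tokens, score))
            if len(beams) == beam_size:
                break
        if not beams:                          # every surviving hypothesis has ended
            break
    finished.extend((t, s / len(t)) for t, s in beams)           # unfinished leftovers
    return max(finished, key=lambda c: c[1])[0]
```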
What do you think might happen if we increase the beam size?

Greedy methods (argmax and beam search) maximise the likelihood of the sequence.

What do maximum likelihood sequences look like?
Why does repetition happen?

The negative log-likelihood of the repeated phrase decreases over time!

(Holtzman et al., ICLR 2020)

Beam search gets repetitive and repetitive

Worse for transformer LMs.

The longer it goes, the worse it gets.

(Holtzman et al., ICLR 2020)


Greedy methods get repetitive

Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…

Repetition is a big problem in text generation!

(Holtzman et al., ICLR 2020)
How can we reduce repetition?

Simple option:
• Heuristic: don't repeat n-grams (a sketch of this filter follows below)

More complex:
• Minimize embedding distance between consecutive sentences (Celikyilmaz et al., 2018)
    • Doesn't help with intra-sentence repetition
• Coverage loss (See et al., 2017)
    • Prevents the attention mechanism from attending to the same words
• Unlikelihood objective (Welleck et al., 2020)
    • Penalize generation of already-seen tokens
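A sketch of the simple "don't repeat n-grams" heuristic from the list above: ban any next token that would recreate an n-gram already present in the generated prefix, then renormalize. The dictionary-based distribution interface is an assumption for illustration.

```python
def block_repeated_ngrams(probs, prefix, n=3):
    """Zero out any next token that would complete an n-gram already in the prefix.

    probs:  dict {token_id: probability} from the model for the next step.
    prefix: list of already-generated token ids.
    """
    if len(prefix) < n - 1:
        return probs
    banned = set()
    recent = tuple(prefix[-(n - 1):])                  # the last n-1 generated tokens
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == recent:       # this (n-1)-gram occurred before...
            banned.add(prefix[i + n - 1])              # ...so ban the token that followed it
    filtered = {tok: (0.0 if tok in banned else p) for tok, p in probs.items()}
    total = sum(filtered.values()) or 1.0
    return {tok: p / total for tok, p in filtered.items()}
```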
Are greedy methods reasonable?

(Holtzman et al., ICLR 2020)


Time to get random: Sampling!

• Sample a token from the distribution of tokens:

    ŷ_t ∼ P(y_t = w | {y}_{<t})

• It's random, so you can sample any token!

Example: for the prefix "He wanted to go to the …", any of restroom, grocery, store, airport, bathroom, beach, doctor, hospital, pub, gym could be sampled.

What's a potential problem with sampling?
Decoding: Top-k sampling

• Problem: Vanilla sampling makes every token in the vocabulary an option
    • Even if most of the probability mass in the distribution is over a limited set of options, the tail of the distribution could be very long
    • Many tokens are probably irrelevant in the current context
    • Why are we giving them individually a tiny chance to be selected?
    • Why are we giving them as a group a high chance to be selected?

• Solution: Top-k sampling
    • Only sample from the top k tokens in the probability distribution

(Fan et al., ACL 2018; Holtzman et al., ACL 2018)


Decoding: Top-k sampling

• Solution: Top-k sampling
    • Only sample from the top k tokens in the probability distribution
    • Common values are k = 5, 10, 20 (but it's up to you!)
    • Increase k for more diverse/risky outputs
    • Decrease k for more generic/safe outputs

What's a potential problem with top-k sampling?

(Fan et al., ACL 2018; Holtzman et al., ACL 2018)


Issues with Top-k sampling

Top-k sampling can cut off too quickly!

Top-k sampling can also cut off too slowly!

(Holtzman et al., ICLR 2020)


Decoding: Top-p (nucleus) sampling

• Problem: The probability distributions we sample from are dynamic
    • When the distribution Pt is flatter, a limited k removes many viable options
    • When the distribution Pt is peakier, a high k allows too many options to have a chance of being selected

• Solution: Top-p sampling
    • Sample from all tokens in the top p cumulative probability mass (i.e., where the mass is concentrated)
    • Varies k depending on the uniformity of Pt
    • (Both top-k and top-p sampling are sketched below.)

(Holtzman et al., ICLR 2020)
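Minimal sketches of both sampling schemes, assuming the model's next-token distribution is given as a NumPy probability vector; the example distribution is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k=10):
    """Sample only from the k most probable tokens (renormalized)."""
    top = np.argsort(probs)[-k:]                  # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def top_p_sample(probs, p=0.9):
    """Nucleus sampling: sample from the smallest set of tokens whose cumulative
    probability exceeds p, so the effective k adapts to how flat or peaked P_t is."""
    order = np.argsort(probs)[::-1]               # tokens sorted by decreasing probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    q = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=q))

# Example distribution over a 6-token vocabulary
probs = np.array([0.42, 0.30, 0.15, 0.08, 0.04, 0.01])
print(top_k_sample(probs, k=3), top_p_sample(probs, p=0.9))
```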


Scaling randomness: Softmax temperature

• Recall: On timestep t, the model computes a probability distribution P_t by applying the softmax function to a vector of scores S ∈ ℝ^|V|:

    P_t(y_t = w) = exp(S_w) / Σ_{w′∈V} exp(S_{w′})

• You can apply a temperature hyperparameter τ to the softmax to rebalance P_t:

    P_t(y_t = w) = exp(S_w / τ) / Σ_{w′∈V} exp(S_{w′} / τ)

• Raise the temperature (τ > 1):
    • P_t becomes more uniform
    • More diverse output (probability mass is spread around the vocabulary)

• Lower the temperature (τ < 1):
    • P_t becomes more spiky
    • Less diverse output (probability mass is concentrated on the top words)
What happens if the temperature τ goes to 0?

    P_t(y_t = w) = exp(S_w / τ) / Σ_{w′∈V} exp(S_{w′} / τ)
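A small sketch showing how the temperature τ reshapes the softmax; the printed distributions illustrate that a large τ flattens the distribution, while τ → 0 concentrates essentially all mass on the highest-scoring token (sampling degenerates to argmax).

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """P(y_t = w) = exp(S_w / tau) / sum_w' exp(S_w' / tau)."""
    z = (scores - scores.max()) / tau        # stabilized, temperature-scaled scores
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(scores, tau=2.0))    # tau > 1: flatter, more diverse
print(softmax_with_temperature(scores, tau=1.0))    # standard softmax
print(softmax_with_temperature(scores, tau=0.1))    # tau -> 0: mass collapses onto the argmax
```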
Improving decoding: re-balancing distributions

• Problem: What if I don't trust how well my model's distributions are calibrated?
    • Don't rely ONLY on your model's distribution over tokens

• Solution #1: Re-balance Pt using retrieval from n-gram phrase statistics!

(Khandelwal et al., ICLR 2020)
Improving decoding: re-balancing distributions

• Solution #1: Re-balance Pt using retrieval from n-gram phrase statistics!
    • Cache a database of phrases from your training corpus (or some other corpus)
    • At decoding time, search for the most similar phrases in the database
    • Re-balance Pt using the induced distribution Pphrase over words that follow these phrases (a sketch of this interpolation follows below)

(Khandelwal et al., ICLR 2020)
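The re-balancing idea can be illustrated with a simple sketch: build a distribution from the tokens that followed the retrieved phrases and interpolate it with the model's distribution. The count-based retrieval distribution and the mixing weight `lam` are simplifying assumptions for illustration, not the exact kNN-LM procedure from the paper.

```python
import numpy as np

def phrase_distribution(retrieved_next_tokens, vocab_size):
    """Turn the tokens that followed the retrieved similar phrases into a
    distribution P_phrase over the vocabulary (simple count-based version)."""
    counts = np.bincount(retrieved_next_tokens, minlength=vocab_size).astype(float)
    return counts / counts.sum()

def rebalance(p_model, p_phrase, lam=0.25):
    """Interpolate the model's distribution P_t with the retrieval-induced
    distribution P_phrase; `lam` is an assumed mixing weight."""
    mixed = (1.0 - lam) * p_model + lam * p_phrase
    return mixed / mixed.sum()    # guard against rounding drift

# Toy usage: vocabulary of size 4, retrieved phrases were followed by tokens 2, 2, 3
p_model = np.array([0.4, 0.3, 0.2, 0.1])
p_phrase = phrase_distribution(np.array([2, 2, 3]), vocab_size=4)
print(rebalance(p_model, p_phrase))
```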
Improving Decoding: Re-ranking

• Problem: What if I decode a bad sequence from my model?

• Decode a bunch of sequences
    • 10 candidates is a common number, but it's up to you

• Define a score to approximate the quality of sequences and re-rank by this score
    • The simplest is to use perplexity!
        • Careful! Remember that repetitive sequences generally get low perplexity, so they would be ranked favourably.
    • Re-rankers can score a variety of properties: style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more…
    • Beware of poorly-calibrated re-rankers
    • Can use multiple re-rankers in parallel (a sketch follows below)
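A generic re-ranking sketch: score each decoded candidate with one or more scorers (for example a length-normalized log-likelihood) and keep the best. The scorer interface here is hypothetical, not a specific system from the slides.

```python
def rerank(candidates, scorers, weights=None):
    """Re-rank decoded candidate sequences by a weighted sum of scorer outputs.

    candidates: list of decoded sequences.
    scorers:    list of functions mapping a sequence to a quality score, e.g. a
                length-normalized log-likelihood, a style or factuality classifier.
    """
    weights = weights or [1.0] * len(scorers)
    scored = [(sum(w * s(c) for w, s in zip(weights, scorers)), c) for c in candidates]
    return max(scored, key=lambda x: x[0])[1]

def make_normalized_loglikelihood_scorer(token_logprobs_fn):
    """Build a scorer: average per-token log-probability of the sequence,
    so longer candidates are not unfairly penalized."""
    return lambda seq: sum(token_logprobs_fn(seq)) / max(len(seq), 1)
```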
Decoding: Takeaways

• Decoding is still a challenging problem in natural language generation

• Human language distribution is noisy and doesn’t reflect simple properties (i.e.,
probability maximization)

• Different decoding algorithms can allow us to inject biases that encourage different
properties of coherent natural language generation

• Some of the most impactful advances in NLG of the last few years have come from
simple, but effective, modifications to decoding algorithms

• A lot more work to be done!


Decoding References
[1] Gulcehre et al., On Using Monolingual Corpora in Neural Machine Translation. arXiv 2015
[2] Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016
[3] Venugopalan et al., Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text. EMNLP 2016
[4] Li et al., A Diversity-Promoting Objective Function for Neural Conversation Models. EMNLP 2018
[5] Paulus et al., A Deep Reinforced Model for Abstractive Summarization. ICLR 2018
[6] Celikyilmaz et al., Deep Communicating Agents for Abstractive Summarization. NAACL 2018
[7] Holtzman et al., Learning to Write with Cooperative Discriminators. ACL 2018
[8] Fan et al., Hierarchical Neural Story Generation. ACL 2018
[9] Gabriel et al., Discourse Understanding and Factual Consistency in Abstractive Summarization. EACL 2021
[10] Dathathri et al., Plug and Play Language Models: A Simple Approach to Controlled Text Generation. ICLR 2020
[11] Holtzman et al., The Curious Case of Neural Text Degeneration. ICLR 2020
[12] Khandelwal et al., Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020
[13] Qin et al., Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense
Reasoning. EMNLP 2020

Natural Language Generation: Evaluation
Antoine Bosselut
Greedy methods get repetitive

Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…

(Holtzman et al., ICLR 2020)

How should we evaluate the quality of this sequence?
Perplexity: A first try

• Evaluate the quality of the model based on the perplexity of the model on reference sentences.

• Why can't we use the perplexity of our generated sentences?

    • Decoding algorithms that minimise perplexity (i.e., argmax, beam search) would be advantaged even if they don't produce the best text.

    • Perplexity on reference sequences tells us how calibrated our model is to real sequences, but doesn't say much about the generations it produces.
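For reference, perplexity is just the exponentiated average negative log-likelihood per token of a reference sequence, as in this small sketch (the log-probabilities are made-up illustrative values).

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a reference sequence under the model:
    exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# e.g., log-probabilities the model assigns to each gold token of a reference
print(perplexity([-0.1, -2.3, -0.7, -1.2]))   # lower is better (the model is less "surprised")
```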
How do you think text generation evaluation differs compared to classification evaluation?
A simple dialogue
Are you going to Prof.
Bosselut’s CS431 lecture?

Heck yes !

Yes !

You know it !

Yup .

Any “right” answer you know could be one of many!


Section Outline

Ref: They walked to the grocery store .

Gen: The woman went to the hardware store .

Content Overlap Metrics Model-based Metrics Human Evaluations

(Some slides repurposed from Asli Celikyilmaz from EMNLP 2020 tutorial)
Content overlap metrics

Ref: They walked to the grocery store .

Gen: The woman went to the hardware store .

• Compute a score that indicates the similarity between generated and gold-standard (human-written) text

• Fast and efficient and widely used

• Two broad categories:
    - N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.)
    - Semantic overlap metrics (e.g., PYRAMID, SPICE, SPIDEr, etc.)
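As an illustration of the n-gram overlap idea (not BLEU itself: no clipping, brevity penalty, or averaging over multiple n), here is a toy n-gram precision applied to the Ref/Gen pair above.

```python
def ngram_precision(generated, reference, n=2):
    """Toy overlap score: fraction of the generated text's n-grams that also
    appear in the reference (the core idea behind BLEU-style metrics)."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    gen, ref = ngrams(generated.split()), set(ngrams(reference.split()))
    if not gen:
        return 0.0
    return sum(g in ref for g in gen) / len(gen)

print(ngram_precision("The woman went to the hardware store .",
                      "They walked to the grocery store .", n=2))   # ~0.29
```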
N-gram overlap metrics

Word overlap based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.)

• They're not ideal for machine translation, but automatic metrics such as BLEU do correlate with human judgments of quality.
A simple failure case

Are you going to Prof. Bosselut's CS431 lecture?

Reference response: Heck yes !

    Score 0.61: Yes !
    Score 0.25: You know it !
    Score 0:    Yup .        (false negative)
    Score 0.67: Heck no !    (false positive)

n-gram overlap metrics have no concept of semantic relatedness!


A more comprehensive failure analysis

(Liu et al, EMNLP 2016)


N-gram overlap metrics

Word overlap based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.)

• They're not ideal for machine translation

• They get progressively much worse for tasks that are more open-ended than machine translation
    - Worse for summarization, where extractive methods that copy from documents are preferred
    - Much worse for dialogue, which is more open-ended than summarization
    - Much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem like you're getting decent scores!
Semantic overlap metrics

PYRAMID: Incorporates human content selection variation in summarization evaluation. Identifies Summarization Content Units (SCUs) to compare information content in summaries. (Nenkova et al., 2007)

SPICE: Semantic propositional image caption evaluation is an image captioning metric that initially parses the reference text to derive an abstract scene graph representation. (Anderson et al., 2016)

SPIDEr: A combination of semantic graph similarity (SPICE) and an n-gram similarity measure (CIDEr); the combined metric yields a more complete quality evaluation. (Liu et al., 2017)


Model-based metrics

• Use learned representations of words and sentences to compute semantic similarity between generated and reference texts

• No more n-gram bottleneck, because text units are represented as embeddings!

• Even though the embeddings are pretrained, the distance metrics used to measure the similarity can be fixed
Model-based metrics: Word distance functions

Vector Similarity: Embedding-based similarity for semantic distance between text.
    • Embedding Average (Liu et al., 2016)
    • Vector Extrema (Liu et al., 2016)
    • MEANT (Lo, 2017)
    • YISI (Lo, 2019)

Word Mover's Distance: Measures the distance between two sequences (e.g., sentences, paragraphs, etc.), using word embedding similarity matching. (Kusner et al., 2015; Zhao et al., 2019)

BERTScore: Uses pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. (Zhang et al., 2020)
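A toy sketch of the BERTScore-style matching idea: greedily match each token embedding to its most similar counterpart in the other sentence and average the cosine similarities. Real BERTScore uses contextual BERT embeddings and optional IDF weighting; the embedding lists here are assumed inputs from any embedder.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def greedy_match_score(gen_embs, ref_embs):
    """BERTScore-style F1 sketch over lists of token embedding vectors."""
    precision = np.mean([max(cosine(g, r) for r in ref_embs) for g in gen_embs])
    recall = np.mean([max(cosine(g, r) for g in gen_embs) for r in ref_embs])
    return 2 * precision * recall / (precision + recall + 1e-12)
```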


Model-based metrics: Beyond word matching

Sentence Mover's Similarity: Based on Word Mover's Distance, evaluates text in a continuous space using sentence embeddings from recurrent neural network representations. (Clark et al., 2019)

BLEURT: A regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text. (Sellam et al., 2020)


Model-based metrics: LLMs

• Use LLMs to evaluate generation outputs according to a clearly defined rubric
    - G-Eval (Liu et al., 2023)
    - LLM-as-a-judge (Zheng et al., 2023)

Example prompt (LLM-as-a-judge, pairwise comparison):
[System] Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.
[User Question] {question}
[The Start of Assistant A's Answer] {answer_a} [The End of Assistant A's Answer]
[The Start of Assistant B's Answer] {answer_b} [The End of Assistant B's Answer]

Example prompt (LLM-as-a-judge, single-answer grading):
[System] Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, please rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question] {question}
[The Start of Assistant's Answer] {answer} [The End of Assistant's Answer]

Example (G-Eval, summarization coherence):
Task Introduction: You will be given one summary written for a news article. Your task is to rate the summary on one metric……
Evaluation Criteria: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence……
Evaluation Steps (generated with auto chain-of-thought):
1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest, based on the Evaluation Criteria.
Input Context: Article: Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team's 0-0 draw with Burnley on ……
Input Target: Summary: Paul merson was brought on with only seven minutes remaining in his team's 0-0 draw with burnley ……
Evaluation Form (scores ONLY): Coherence. G-Eval weights the possible scores by their token probabilities (weighted summed score: 2.59 in this example).
What might be a benefit of model-based metrics compared to overlap metrics?
Human evaluations

• Automatic metrics fall short of matching human decisions

• Most important form of evaluation for text generation systems
    - >75% of generation papers at ACL 2019 include human evaluations

• Gold standard in developing new automatic metrics
    - New automated metrics must correlate well with human evaluations!
Human evaluations

• Ask humans to evaluate the quality of generated text

• Overall or along some specific dimension:
    - fluency
    - coherence / consistency
    - factuality and correctness
    - commonsense
    - style / formality
    - grammaticality
    - typicality
    - redundancy

(For details, see Celikyilmaz, Clark, Gao, 2020)

Note: Don't compare human evaluation scores across differently-conducted studies, even if they claim to evaluate the same dimensions!
Human evaluations: case study
Human evaluation: Issues

• Human judgments are regarded as the gold standard

• Human evaluation is slow and expensive

Suppose you can run a human evaluation. Do we have anything to worry about?
• Human judgments are regarded as the
gold standard Humans:
• are inconsistent
• Human evaluation is slow and expensive • can be illogical
(compared to automatic evaluation), • lose concentration
• misinterpret your question
even if your humans try to speed it up!
• can’t always explain why
they feel the way they do
• Conducting effective human evaluations • May try to speed through
is difficult your evaluation
Evaluation: Takeaways
• Content overlap metrics provide a good starting point for evaluating the quality of
generated text, but they’re not good enough on their own.

• Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable

• Human judgments are critical.


- Only ones that can directly evaluate factuality – is the model saying correct things?
- But humans are inconsistent!

• In many cases, the best judge of output quality is YOU!

• Look at your model generations. Don’t just rely on numbers!


Concluding Thoughts
• Interacting with natural language generation systems quickly shows their limitations

• Even in tasks with more progress, there are still many improvements ahead

• Evaluation remains a huge challenge.


- We need better ways of automatically evaluating performance of NLG systems

• With the advent of large-scale language models, deep NLG research has been reset
- it’s never been easier to jump in the space!

• One of the most exciting areas of NLP to work in!
