13-TextGen-2024
Natural Language Generation: Task
Antoine Bosselut
What is natural language generation?
• Natural language generation (NLG) is a sub-field of natural language processing
[Figure: example NLG applications, e.g., summarization tools; http://mogren.one/lic/ , https://chrome.google.com/webstore/detail/gmail-summarization/ (Wang and Cardie, ACL 2013)]
Data-to-Text Generation
• Introduction
• Exercise Session: Playing around with our own story generation system
Basics of natural language generation
• Most text generation models are autoregressive: they predict the next
token based on the values of past tokens

    S = f({y}_{<t}, θ)        f(·) is your model

    P(y_t = w | {y}_{<t}) = exp(S_w) / ∑_{w′∈V} exp(S_{w′})
Basics: What are we trying to do?
• At each time step t, our model computes a vector of scores for each token
in our vocabulary, S_t ∈ ℝ^|V|:

    S_t = f({y}_{<t}, θ)        f(·) is your model

[Figure: the model consumes the prefix y_{t−4}, y_{t−3}, y_{t−2}, y_{t−1} and applies a softmax to S_t to obtain P(y_t | {y}_{<t})]
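To make the two formulas above concrete, here is a minimal sketch of the score-to-distribution step in Python/NumPy; the toy vocabulary and hard-coded score vector are stand-ins for a real model f(·) and are purely illustrative:

import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """P(y_t = w | {y}_{<t}) = exp(S_w) / sum_{w' in V} exp(S_{w'})."""
    scores = scores - scores.max()          # numerical stability; does not change the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example: a 5-token vocabulary and arbitrary scores standing in for f({y}_{<t}, theta)
vocab = ["the", "poor", "are", "rich", "<END>"]
S_t = np.array([2.0, 1.0, 0.5, -1.0, 0.1])  # S_t in R^|V|

P_t = softmax(S_t)
for token, p in zip(vocab, P_t):
    print(f"{token:6s} {p:.3f}")

Subtracting the maximum score before exponentiating is the standard overflow-avoidance trick; it leaves the resulting distribution unchanged.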
Basics: What are we trying to do?
[Figure: given the prefix "He wanted to go to the", the model assigns a score to each candidate token, e.g., restroom, grocery, store, airport, pub, gym, bathroom, game, beach, hospital, doctor]
• At inference time, our decoding algorithm defines a function to select a token from
this distribution:

    ŷ_t = g(P(y_t | {y}_{<t}))        g(·) is your decoding algorithm
Maximum Likelihood Training (i.e., teacher forcing)
• Trained to generate the next word y*_t given a set of preceding words {y*}_{<t}
• At each time step, the loss is the negative log-likelihood of the gold token, and we sum ℒ_t over the entire sequence:

    ℒ_t = − log P(y*_t | {y*}_{<t})

    ℒ = − ∑_{t=1}^{T} log P(y*_t | {y*}_{<t})
      = − ( log P(y*_1 | y*_0) + log P(y*_2 | y*_0, y*_1) + log P(y*_3 | y*_0, y*_1, y*_2) + … )

[Figure: at each step t, the model is fed the gold prefix y*_0, …, y*_{t−1} and trained to predict y*_t, through the final <END> token]
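As a sketch (not the lecture's reference implementation), the summed loss can be computed directly from the per-step distributions; step_probs and the toy numbers below are illustrative assumptions:

import numpy as np

def teacher_forcing_loss(step_probs: list[np.ndarray], gold_ids: list[int]) -> float:
    """L = - sum_t log P(y*_t | {y*}_{<t}).

    step_probs[t] is the model's distribution over the vocabulary at step t,
    computed while feeding the GOLD prefix y*_0 ... y*_{t-1} (teacher forcing);
    gold_ids[t] is the index of the gold token y*_t.
    """
    return -sum(np.log(probs[gold]) for probs, gold in zip(step_probs, gold_ids))

# Toy example with a 3-token vocabulary over two time steps
step_probs = [np.array([0.7, 0.2, 0.1]),   # P(y_1 | y*_0)
              np.array([0.1, 0.8, 0.1])]   # P(y_2 | y*_0, y*_1)
gold_ids = [0, 1]                          # y*_1 = token 0, y*_2 = token 1
print(teacher_forcing_loss(step_probs, gold_ids))  # -(log 0.7 + log 0.8) ≈ 0.580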
Text Generation: Takeaways
• Text generation is the foundation of many useful NLP applications (e.g.,
translation, summarisation, dialogue systems)
• In autoregressive NLG, we generate one token at a time, using the context and
previously generated tokens as inputs for generating the next token
• Our model generates a set of scores for every token in the vocabulary, which
we can convert to a probability distribution using the softmax function
Decoding: what is it all about?
• At each time step t, our model computes a vector of scores for each token in our
vocabulary, S_t ∈ ℝ^|V|:

    S_t = f({y}_{<t})        f(·) is your model

[Figure: starting from the context and <START>, the model generates ŷ_1, ŷ_2, …, feeding each generated token back in as input, until it produces <END>]
Greedy methods: Argmax Decoding
• Select the highest-scoring token at each step:

    ŷ_t = argmax_{w∈V} P(y_t = w | {y}_{<t})

[Figure: given the prefix "He wanted to go to the", the model scores candidate tokens such as restroom, grocery, store, airport, pub, gym, bathroom, game, beach, hospital, doctor]

What's a potential problem with argmax decoding?
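A minimal greedy-decoding loop, assuming a hypothetical model(ids) callable that returns P(y_t | {y}_{<t}) as a NumPy array:

import numpy as np

def argmax_decode(model, prefix_ids: list[int], end_id: int, max_len: int = 50) -> list[int]:
    """Greedy decoding: y_hat_t = argmax_{w in V} P(y_t = w | {y}_{<t}).

    `model(ids)` is assumed to return the next-token distribution given the
    prefix; plug in any autoregressive model with that interface.
    """
    ids = list(prefix_ids)
    for _ in range(max_len):
        probs = model(ids)                 # distribution over the vocabulary
        next_id = int(np.argmax(probs))    # select the single highest-scoring token
        ids.append(next_id)
        if next_id == end_id:
            break
    return ids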
Issues with argmax decoding
• In argmax (greedy) decoding, we cannot go back and revise previous decisions;
greedy decoding has no way to undo them!
  • les pauvres sont démunis (the poor don't have any money)
  • → the ____
  • → the poor ____
  • → the poor are ____
• Potentially leads to sequences that are:
  - Ungrammatical
  - Unnatural
  - Nonsensical
  - Incorrect
• Better option: use beam search (a search algorithm) to explore several hypotheses and select the best one
  • Fundamental idea of beam search: explore several different hypotheses instead of just a single one
  • Keep track of the k most probable partial translations at each decoder step instead of just one (the beam size k is usually 5-10)
Greedy methods: Beam Search
• In greedy decoding, we cannot go back and revise previous decisions!
  • les pauvres sont démunis (the poor don't have any money)
  • → the ____
  • → the poor ____
  • → the poor are ____

Beam search decoding: example (beam size = 2)
[Figure: from <START>, the two kept hypotheses after step 1 are "the" (−1.05) and "a" (−1.39)]
• Each hypothesis is scored by its cumulative log-probability:

    score(ŷ_1, …, ŷ_t) = ∑_{j=1}^{t} log P(ŷ_j | ŷ_0, ŷ_1, …, ŷ_{j−1})

[Figure, built up across several slides: the beam-search tree expands one step at a time.
Step 2 keeps "the poor" (−1.90) and "a poor" (−1.54), discarding "the people" (−2.3) and "a person" (−3.2).
Step 3 keeps "the poor are" (−2.42) and "the poor don't" (−2.13), discarding "a poor person" (−3.12) and "a poor but" (−3.53).
Step 4 keeps "are not" (−2.67) and "don't have" (−3.32), discarding "are always" (−3.82) and "don't take" (−3.61), and so on until hypotheses end in words like "money", "funds", "any", or "enough".]
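A compact sketch of the algorithm above; model(ids) is again a hypothetical callable returning the next-token distribution, and for simplicity finished and unfinished hypotheses are compared by raw cumulative log-probability (real implementations often length-normalize):

import numpy as np

def beam_search(model, start_id: int, end_id: int, k: int = 2, max_len: int = 20):
    """Keep the k most probable partial hypotheses at each step (beam size k).

    Each hypothesis is scored by sum_j log P(y_hat_j | y_hat_0, ..., y_hat_{j-1}).
    `model(ids)` is assumed to return the next-token distribution;
    in practice the beam size k is usually 5-10.
    """
    beams = [([start_id], 0.0)]                    # (token ids, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ids, score in beams:
            probs = model(ids)
            # expand each hypothesis with its k best continuations
            for w in np.argsort(probs)[-k:]:
                candidates.append((ids + [int(w)], score + float(np.log(probs[w]))))
        # keep only the k highest-scoring hypotheses overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ids, score in candidates[:k]:
            (finished if ids[-1] == end_id else beams).append((ids, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])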
Why does repetition happen?
• Negative log-likelihood decreases over time!
Simple option:
• Heuristic: Don't repeat n-grams (see the sketch after this list)
More complex:
• Minimize embedding distance between consecutive sentences (Celikyilmaz et al., 2018)
• Doesn’t help with intra-sentence repetition
• Coverage loss (See et al., 2017)
• Prevents attention mechanism from attending to the same words
• Unlikelihood objective (Welleck et al., 2020)
• Penalize generation of already-seen tokens
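A minimal version of the n-gram blocking heuristic mentioned above (the function name and interface are illustrative): it returns the tokens that would complete an already-seen n-gram, so the decoder can zero them out before selecting the next token.

def blocked_tokens(ids: list[int], n: int = 3) -> set[int]:
    """Tokens that would complete an n-gram already present in `ids`.

    A simple version of the "don't repeat n-grams" heuristic: before choosing
    the next token, set these tokens' probability to zero (or score to -inf).
    """
    if len(ids) < n - 1:
        return set()
    prefix = tuple(ids[-(n - 1):])          # the (n-1)-gram we are about to extend
    banned = set()
    for i in range(len(ids) - n + 1):
        if tuple(ids[i:i + n - 1]) == prefix:
            banned.add(ids[i + n - 1])      # this token would repeat a seen n-gram
    return banned

# With ids = [5, 7, 9, 5, 7] and n = 3, token 9 is banned (it would repeat "5 7 9")
print(blocked_tokens([5, 7, 9, 5, 7], n=3))  # {9}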
Are greedy methods reasonable?
[Figure: the model's distribution P_t(y_t = w | {y}_{<t}) at three consecutive time steps]

• Temperature: rescale the scores by τ before the softmax:

    P(y_t = w | {y}_{<t}) = exp(S_w / τ) / ∑_{w′∈V} exp(S_{w′} / τ)
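A small sketch of the temperature-scaled softmax above (the score vector is illustrative); τ < 1 sharpens the distribution toward greedy behavior, τ > 1 flattens it:

import numpy as np

def softmax_with_temperature(scores: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """P(y_t = w) = exp(S_w / tau) / sum_{w'} exp(S_{w'} / tau).

    tau < 1 sharpens the distribution (closer to argmax);
    tau > 1 flattens it (more diverse samples); tau = 1 is the plain softmax.
    """
    scaled = scores / tau
    scaled = scaled - scaled.max()           # numerical stability
    exp_scores = np.exp(scaled)
    return exp_scores / exp_scores.sum()

S = np.array([2.0, 1.0, 0.0])                # arbitrary example scores
for tau in (0.5, 1.0, 2.0):
    print(tau, np.round(softmax_with_temperature(S, tau), 3))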
Improving decoding: re-balancing distributions
• Problem: What if I don’t trust how well my model’s distributions are calibrated?
• Don't rely ONLY on your model's distribution over tokens
(Khandelwal et al., ICLR 2020)
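The rebalancing idea can be sketched as a simple interpolation in the spirit of kNN-LM (Khandelwal et al., ICLR 2020); p_external stands in for the retrieval-based distribution, and the mixing weight lam is an illustrative parameter:

import numpy as np

def rebalance(p_model: np.ndarray, p_external: np.ndarray, lam: float = 0.25) -> np.ndarray:
    """Interpolate the model's token distribution with an external one:

        P(y_t) = lam * P_external(y_t) + (1 - lam) * P_model(y_t)

    Here p_external stands in for a distribution built from retrieved
    nearest neighbors; lam controls how much we trust it.
    """
    return lam * p_external + (1.0 - lam) * p_model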
Decoding: Takeaways
• Human language distribution is noisy and doesn’t reflect simple properties (i.e.,
probability maximization)
• Different decoding algorithms can allow us to inject biases that encourage different
properties of coherent natural language generation
• Some of the most impactful advances in NLG of the last few years have come from
simple, but effective, modifications to decoding algorithms
Natural Language Generation:
Evaluation
Antoine Bosselut
Greedy methods get repetitive
Context: In a shocking finding, scientist discovered a herd of unicorns
living in a remote, previously unexplored valley, in the Andes
Mountains. Even more surprising to the researchers was the fact
that the unicorns spoke perfect English.
Perplexity: A first try
• Evaluate quality of the model based on the perplexity of the model on
reference sentences
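As a sketch, perplexity is the exponentiated average negative log-likelihood of the reference tokens; the toy log-probabilities below are illustrative:

import numpy as np

def perplexity(token_log_probs: list[float]) -> float:
    """PPL = exp(-(1/T) * sum_t log P(y_t | y_{<t})).

    token_log_probs holds the model's log-probability of each reference
    token given its gold prefix; lower perplexity = better fit.
    """
    return float(np.exp(-np.mean(token_log_probs)))

# Toy example: a 4-token reference sentence
log_probs = [np.log(0.5), np.log(0.25), np.log(0.5), np.log(0.125)]
print(perplexity(log_probs))  # = 2^((1+2+1+3)/4) ≈ 3.36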
A simple dialogue
Are you going to Prof.
Bosselut’s CS431 lecture?
Heck yes !
Yes !
You know it !
Yup .
(Some slides repurposed from Asli Celikyilmaz from EMNLP 2020 tutorial)
Content overlap metrics
• Compute a score that indicates the similarity between generated and gold-
standard (human-written) text

Ref: They walked to the grocery store .

[Figure: n-gram overlap scores for candidate responses ("Heck yes !", "Yes !", "You know it !") against the reference, e.g., 0.61 for "Yes !" and 0.25 for "You know it !"; n-gram overlap metrics have no concept of semantic relatedness!]
• They get progressively much worse for tasks that are more open-ended
than machine translation
  - Worse for summarization, where extractive methods that copy from documents are preferred
  - Much, much worse for story generation, which is also open-ended, but whose sequence length can
make it seem like you're getting decent scores!
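As an illustration of why n-gram overlap ignores semantics, here is a minimal clipped n-gram precision (a simplified piece of BLEU; the scores it produces are not the 0.61/0.25 from the slide, which come from a full metric):

from collections import Counter

def ngram_precision(candidate: list[str], reference: list[str], n: int = 1) -> float:
    """Fraction of candidate n-grams that also appear in the reference
    (clipped counts, as in BLEU's modified n-gram precision)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

ref = "They walked to the grocery store .".split()
print(ngram_precision("They went to the store .".split(), ref))  # high unigram overlap
print(ngram_precision("Heck yes !".split(), ref))                # no overlap: 0.0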
Semantic overlap metrics
BERTScore:
Use pre-trained contextual embeddings
from BERT and match words in candidate
and reference sentences by cosine similarity
BLEURT:
A regression model based on BERT returns a score
that indicates to what extent the candidate text is
grammatical and conveys the meaning of the
reference text.
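For reference, the bert-score package exposes BERTScore through a single call; the snippet below is a usage sketch (the candidate/reference strings are made up):

# pip install bert-score
from bert_score import score

candidates = ["Heck yes !", "They went to the store ."]
references = ["They walked to the grocery store .",
              "They walked to the grocery store ."]

# P, R, F1 are tensors with one value per candidate/reference pair;
# under the hood, tokens are matched greedily by cosine similarity
# of their contextual BERT embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(F1.tolist())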
Using LLMs to evaluate generation
• Use LLMs to evaluate generation quality against a defined rubric:
  - G-Eval (Liu et al., 2023)
  - LLM-as-a-judge (Zheng et al., 2023)

Example judge prompt (LLM-as-a-judge):
    Please act as an impartial judge and evaluate the quality of the responses provided by two
    AI assistants to the user question displayed below. You should choose the assistant that
    follows the user's instructions and answers the user's question better. Your evaluation
    should consider factors such as the helpfulness, relevance, accuracy, depth, creativity,
    […] of the assistants. Be as objective as possible. After providing your explanation, output your
    final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]"
    if assistant B is better, and "[[C]]" for a tie.

    [User Question]
    {question}

    {answer_a}
    [The End of Assistant A's Answer]
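A sketch of how such a judge might be wired up: fill the (abbreviated) template, send it to any LLM of your choice (the API call is omitted), and parse the [[A]]/[[B]]/[[C]] verdict. The template here is shortened and the helper names are illustrative:

import re

# Shortened version of the judge prompt; "..." elides the full rubric text.
JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality of the
responses provided by two AI assistants to the user question displayed below. ...
After providing your explanation, output your final verdict by strictly following
this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better,
and "[[C]]" for a tie.

[User Question]
{question}

{answer_a}
[The End of Assistant A's Answer]
"""

def parse_verdict(judge_output: str) -> str | None:
    """Extract the final [[A]] / [[B]] / [[C]] verdict from the judge's response."""
    match = re.search(r"\[\[([ABC])\]\]", judge_output)
    return match.group(1) if match else None

prompt = JUDGE_TEMPLATE.format(question="What is NLG?", answer_a="...")
print(parse_verdict("Assistant A is more accurate. [[A]]"))  # "A"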
Human evaluations
• Automatic metrics fall short of matching human
decisions
- commonsense
- style / formality
- grammaticality
- typicality
- redundancy
Human evaluations
• Ask humans to evaluate the quality of generated text
  - grammaticality
  - typicality
  - redundancy
• Caveat: scores from different human evaluation studies are not directly comparable, even if they claim to evaluate the same dimensions!
Human evaluations: case study
Human evaluation: Issues
• Human judgments are regarded as the gold standard
• Model-based metrics can be more correlated with human judgment, but their behavior is not
interpretable
• Even in tasks with more progress, there are still many improvements ahead
• With the advent of large-scale language models, deep NLG research has been reset
  - it's never been easier to jump into the space!