LLMs and Retrieval-Augmented Generation (RAG)

Retrieval-augmented LMs: Past, Present and Future


Large Language Models: Methods and Applications
Akari Asai (akari@cs.washington.edu)

Feel free to post questions on Sli.do!


Sli.do code #2068655
How do normal parametric LLMs work?
Encapsulating everything in parameters by pre-training on large-scale text corpora

[Figure: An LLM is pre-trained on text such as "Pittsburgh is a city in and the county seat of Allegheny County, Pennsylvania, United States", learning the next-token distribution $P(x_n \mid x_1, x_2, \ldots, x_{n-1})$. Given the prompt "Pittsburgh is located in" ($x_1\, x_2\, x_3\, x_4$), it ranks candidate next tokens such as "Allegheny", "Pennsylvania", and "King".]
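The slide's $P(x_n \mid x_1, \ldots, x_{n-1})$ can be probed directly. Below is a minimal sketch using GPT-2 via Hugging Face transformers; the model choice is purely illustrative, not one used in the lecture.

```python
# Minimal sketch: querying a parametric LM's next-token distribution
# P(x_n | x_1, ..., x_{n-1}). GPT-2 is an illustrative choice.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Pittsburgh is located in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Distribution over the next token x_n given the prompt x_1 .. x_{n-1}
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p:.3f}")
```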
Limitations of parametric LLMs #1: Hallucinations
LLMs cannot memorize everything in their parameters (yet), resulting in factual inaccuracies
Catastrophic incidents due to LLM hallucinations
Such LLM hallucinations have caused many critical incidents in the real world
Retrieval-augmented LMs: Definitions & Notations
A new type of LM that can use large-scale text data (a datastore) at inference time

[Figure: Given an input 𝒙, the system forms a query 𝒒; the Retriever scores documents in the Datastore by similarity sim(𝑞, 𝑑) and returns the top documents 𝒁, which the LLM consumes alongside 𝒙 to produce the output 𝒚.]
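To make the notation concrete, here is a minimal sketch of this pipeline with a toy embedding-based retriever. The sentence-transformers encoder and the tiny datastore are illustrative assumptions, and the final LLM call is left as a constructed prompt.

```python
# Toy retrieval-augmented generation loop: sim(q, d) via dot product
# of normalized embeddings, top-k documents Z, then input augmentation.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

datastore = [
    "Pittsburgh is the county seat of Allegheny County, Pennsylvania.",
    "Keir Starmer became Prime Minister of the UK on 5 July 2024.",
]
doc_embs = encoder.encode(datastore, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    sims = doc_embs @ q                      # sim(q, d) for every d
    return [datastore[i] for i in np.argsort(-sims)[:k]]

x = "Who is the current PM of the UK?"
Z = retrieve(x)
prompt = f"Reference: {Z[0]}\nQ: {x}\nA:"    # y = LLM(prompt)
print(prompt)
```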
Benefit of retrieval-augmented LMs #1: reduce hallucinations
Retrieval-augmented LMs can reduce hallucinations, especially in long-tail knowledge

“When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories.” Mallen*, Asai* et al., ACL 2023.
Quiz: What are the other benefits of using retrieval-augmented LMs?
Benefit of retrieval-augmented LMs #2: Adaptations w/o training
Parametric LMs’ knowledge gets obsolete quickly & requires continuous training

[Figure: An LLM pre-trained on data up to May 2024 learns from text such as “Rishi Sunak is a British politician and the current Prime Minister of the United Kingdom” (May 2024 Wikipedia), modeling $P(x_n \mid x_1, x_2, \ldots, x_{n-1})$. For the prompt “The current Prime Minister of the UK is”, it ranks candidates like “Rishi”, “Boris”, and “Liz”. Only after continued training on newer data (“The incumbent prime minister is Keir Starmer, who assumed the office on 5 July 2024.”) does “Keir” become a likely completion.]

“RealTime QA: What's the Answer Right Now?” Kasai et al., NeurIPS (Benchmark) 2023.
Benefit of retrieval-augmented LMs #2: Adaptations w/o training
We can easily swap datastores of retrieval-augmented LMs for new data distributions

[Figure: swapping the 2023 datastore for a 2024 datastore updates the model’s knowledge without any retraining]

“RealTime QA: What's the Answer Right Now?” Kasai et al., NeurIPS (Benchmark) 2023.
Benefit of retrieval-augmented LMs #3: Providing attributions
Retrieval-augmented LMs can provide a small number of documents as attributions

“Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models.” Bohnet et al., ArXiv 2022.
Benefit of retrieval-augmented LMs #4: Flexible data opt-in / out
We can incorporate or remove high-risk data dynamically at inference, not training time

“SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore.” Min et al., ICLR 2024.
Benefit of retrieval-augmented LMs #5: parameter efficiency
Retrieval-augmented LMs can be much more parameter-efficient and compute-optimal

“Scaling Retrieval-Based Language Models with a Trillion-Token Datastore.” Shao, He, Asai et al., ArXiv 2024.
Retrieval-augmented LMs have been widely used!
Retrieval-augmented LMs have been widely used both in academia and industry

“60% of LLM applications use some form of retrieval-augmented generation (RAG)”
Today’s outline
1. Introduction: What are retrieval-augmented LMs? Why do we want
to use them?
2. Past: Architecture and training of retrieval-augmented LMs for
downstream tasks
3. Present: Retrieval-augmented generation with LLMs
4. Future: Limitations & future directions

Feel free to post questions on Sli.do!


Sli.do code #2068655
Past: Architecture and training of retrieval-augmented LMs for downstream tasks
Brief history of retrieval-augmented LMs development
RAG was initially extensively studied for certain NLP tasks, namely Question Answering

2017: DrQA. Retrieve (BM25) and read (trained QA model) Wikipedia articles for open-domain QA

“Reading Wikipedia to Answer Open-Domain Questions.” Chen et al., ACL 2017.

Brief history of retrieval-augmented LMs development
RAG was initially extensively studied for certain NLP tasks, namely Question Answering

2017: DrQA
2019: ORQA
2020: REALM, RAG. End-to-end pre-training → fine-tuning of retriever & LM

“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Lewis et al., NeurIPS 2020.
Brief history of retrieval-augmented LMs development
RAG was initially extensively studied for certain NLP tasks, namely Question Answering

2017: DrQA
2019: ORQA
2020: REALM, RAG
2020: kNN LM. New architectures for retrieval-augmented LMs

“Generalization through Memorization: Nearest Neighbor Language Models.” Khandelwal et al., ICLR 2020.
Brief history of retrieval-augmented LMs development
RAG was initially extensively studied for certain NLP tasks, namely Question Answering

2017: DrQA
2019: ORQA
2020: REALM, RAG
2020: kNN LM
2021: RETRO. New architectures for retrieval-augmented LMs

“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., ArXiv 2021.
Brief history of retrieval-augmented LMs development
Versatile and powerful LLMs demonstrate effectiveness even without fine-tuning

2017: DrQA
2019: ORQA
2020: REALM, RAG
2020: kNN LM
2020: GPT-3
2021: RETRO
2022: ChatGPT

LLMs surpassed specialized QA models w/ retrieval

https://paperswithcode.com/sota/question-answering-on-triviaqa
Brief history of retrieval-augmented LMs development
Success of In-Context Retrieval-Augmented LMs (commonly referred to as RAG today)

2017: DrQA
2019: ORQA
2020: REALM, RAG
2020: kNN LM
2021: RETRO
2023: In-Context Retrieval-Augmented LMs. Use off-the-shelf LLMs & retrieval systems

“In-Context Retrieval-Augmented Language Models.” Ram et al., TACL 2023.
Brief history of retrieval-augmented LMs development
RAG was initially extensively studied for certain NLP tasks, namely Question Answering

2017: DrQA
2019: ORQA
2020: REALM, RAG
2020: kNN LM
2021: RETRO
2023: Retrieval-augmented LLMs

Past: architecture / training developments for certain downstream or upstream tasks
Current: designing versatile and reliable LLM-based RAG systems for diverse use cases
Diverse architectures of retrieval-augmented LMs
Classifying retrieval-augmented LMs based on “where” we incorporate retrieved context

● Input augmentation
○ Augment the input of LMs with retrieved context
○ E.g., RAG, REALM, DrQA, In-context RALM

[Diagram: Retriever feeds retrieved context plus 𝑥 into the LLM, which outputs 𝑦]
REALM: Augmenting input space of LMs
REALM is a retrieval-augmented masked LM that predicts masked tokens / spans in context

“REALM: Retrieval-Augmented Language Model Pre-Training.” Guu et al., ICML 2020.

REALM: Augmenting input space of LMs
REALM finds relevant context by conducting kNN search in embedding space

“REALM: Retrieval-Augmented Language Model Pre-Training.” Guu et al., ICML 2020.

REALM: Augmenting input space of LMs
REALM computes a weighted average of the final answer distributions, using retrieval similarities

“REALM: Retrieval-Augmented Language Model Pre-Training.” Guu et al., ICML 2020.
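In notation, the weighted average on this slide is the marginalization over retrieved documents $z$ (a standard rendering of REALM's objective, with $\mathcal{Z}$ the retrieved set):

```latex
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid x, z)\, p(z \mid x),
\qquad
p(z \mid x) = \frac{\exp\big(\mathrm{sim}(x, z)\big)}{\sum_{z'} \exp\big(\mathrm{sim}(x, z')\big)}
```

The retriever term $p(z \mid x)$ is exactly the softmax over the embedding similarities, which is what lets the retriever be trained end-to-end with the reader.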


RAG: Augmenting input space of LMs
RAG combines a trained retriever & autoregressive BART, starting from pre-trained weights

“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Lewis et al., NeurIPS 2020.
RAG & REALM: Results
RAG and REALM show their effectiveness on open-domain QA and other tasks

● RAG outperforms REALM and other baselines on open-domain QA such as NaturalQuestions
● RAG is also effective on generation tasks

[Figure: Left, open-domain QA exact match (higher is better) on NQ and WQ for T5 (no retrieval) vs. REALM, DPR, and RAG (with retrieval). Right, question generation (Tri-match on MS MARCO and Jeopardy) for Gold, BART (no retrieval), and RAG-Token (with retrieval).]

“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Lewis et al., NeurIPS 2020.
Recent follow-up: In-context Retrieval-augmented LMs
Similar principles as in DrQA, REALM, and RAG, but completely removes training

● Combining retrieval and off-the-shelf LMs (e.g., GPT-4) at inference time, without any training
● Often referred to as “RAG” nowadays
● We’ll cover this in depth in the next section!

“In-Context Retrieval-Augmented Language Models.” Ram et al., TACL 2023.


Pros and cons of input augmentation
Input augmentation is powerful but has several limitations

● Pros
○ Easy to switch to new, more powerful LMs, with light fine-tuning or even without training
○ LLMs can effectively leverage input context

● Cons
○ Expensive to scale up to hundreds or thousands of documents
■ LLMs also often do not fully leverage long context
○ No strict attribution to specific evidence

“Lost in the Middle: How Language Models Use Long Contexts.” Liu et al., TACL 2023.
Diverse architectures of retrieval-augmented LMs
Classifying retrieval-augmented LMs based on “where” we incorporate retrieved context

● Input augmentation
○ Augment the input of LMs with retrieved context
○ E.g., RAG, REALM, DrQA, In-context RALM

● Intermediate incorporation
○ Incorporate retrieved context in intermediate spaces of transformers
○ E.g., RETRO, Instruct RETRO
RETRO: Incorporating context in intermediate layers
RETRO enables more efficient incorporation of many documents

[Figure: RETRO modifies the standard transformer block by adding a chunked cross-attention layer over retrieved neighbors]

“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., ArXiv 2021.
RETRO: Incorporating context in intermediate layers
RETRO uses a frozen BERT as the retriever, retrieving nearest neighbors from a 1.7T-token datastore

“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., ArXiv 2021.
RETRO: Incorporating context in intermediate layers
Given the input sequence, RETRO first retrieves a set of relevant documents (as text embeddings)

“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., ArXiv 2021.
RETRO: Incorporating context in intermediate layers
Cross-attention generates retrieved-context-aware representations; concatenating all of the CCA outputs keeps the sizes of the input H and the output CCA(H, E) the same

“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., ArXiv 2021.
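To make the block structure concrete, here is a highly simplified sketch in PyTorch. It adds plain (not chunked) cross-attention from the hidden states H to retrieved neighbor embeddings E, so it illustrates the idea rather than DeepMind's exact architecture; all dimensions are arbitrary.

```python
# Simplified sketch of a RETRO-style block: a standard transformer
# block plus cross-attention from hidden states H to retrieved
# neighbor embeddings E. Real RETRO uses *chunked* cross-attention;
# chunking is omitted here for clarity.
import torch
import torch.nn as nn

class RetroStyleBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d) input hidden states; e: (batch, n_neighbors, d)
        x = self.n1(h)
        h = h + self.self_attn(x, x, x)[0]            # standard self-attention
        h = h + self.cross_attn(self.n2(h), e, e)[0]  # attend to retrieved E
        return h + self.ffn(self.n3(h))               # output same shape as H

h = torch.randn(1, 16, 256)   # hidden states H for 16 input tokens
e = torch.randn(1, 8, 256)    # embeddings E of 8 retrieved neighbors
print(RetroStyleBlock()(h, e).shape)  # torch.Size([1, 16, 256])
```

Because the neighbors enter through cross-attention rather than the input sequence, adding more of them grows the key/value side only, which is what makes this route cheaper than concatenating many passages into the prompt.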
RETRO: Results
RETRO shows impressive performance improvements on upstream (language modeling) tasks

● RETRO significantly outperforms non-retrieval baselines
● RETRO's performance continues to improve as the datastore scales from a few billion to 1.7 trillion tokens
● Increasing the number of retrieved documents up to 40 helps

[Figure: language-modeling perplexity (lower is better) improves with datastore scale and number of neighbors]

“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., ArXiv 2021.
Recent follow-up: Instruct RETRO
Develops RETRO blocks on top of Llama (an autoregressive LM), with retrieval-augmented pre-training & instruction tuning

“InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining.” Wang et al., ICML 2024.
Pros and cons of intermediate incorporation
An alternative way to incorporate retrieved context more scalably, but it requires training

● Pros
○ Incorporates many passages more efficiently than input augmentation
○ Possibly more effective than input augmentation (cf. the Instruct RETRO results)

● Cons
○ Requires modification of the underlying LM
○ Expensive pre-training is necessary
○ Doesn’t provide strict attribution
Diverse architectures of retrieval-augmented LMs
Classifying retrieval-augmented LMs based on “where” we incorporate retrieved context

● Input augmentation
○ Augment the input of LMs with retrieved context
○ E.g., RAG, REALM, DrQA, In-context RALM

● Intermediate incorporation
○ Incorporate retrieved context in intermediate spaces of transformers
○ E.g., RETRO, Instruct RETRO

● Output interpolation
○ Interpolate output token probabilities with retrieved non-parametric distributions
○ E.g., kNN LM
kNN LM: directly interpolate output token distributions
Directly interpolate output token distributions of LMs

● Given a context 𝑥, the model predicts a parametric distribution over the next token (the LM output distribution)

● kNN LM computes a nonparametric distribution by finding training contexts similar to 𝑥

● It interpolates the two token distributions, adjusting the balance with a hyperparameter 𝜆

“Generalization through Memorization: Nearest Neighbor Language Models.” Khandelwal et al., ICLR 2020.
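Written out, the interpolation on this slide takes the standard kNN-LM form (here $f(\cdot)$ is the context encoder, $d(\cdot,\cdot)$ a distance, and $\mathcal{N}(x)$ the retrieved set of (training context, next token) pairs):

```latex
p(y \mid x) = \lambda\, p_{\mathrm{kNN}}(y \mid x) + (1-\lambda)\, p_{\mathrm{LM}}(y \mid x),
\qquad
p_{\mathrm{kNN}}(y \mid x) \propto \sum_{(c_i, w_i) \in \mathcal{N}(x)} \mathbb{1}[y = w_i]\, \exp\!\big(-d(f(c_i), f(x))\big)
```

Setting $\lambda = 0$ recovers the purely parametric LM; $\lambda = 1$ trusts only the retrieved neighbors, which is what gives the explicit control mentioned in the pros/cons slide below.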
kNN LM: Results
kNN LM outperforms much larger parametric LMs by a large margin

● kNN LM consistently outperforms the parametric 100M LM, and with a larger datastore even a 30x larger 3B LM
● kNN LM also enables efficient & controlled domain adaptation

“Generalization through Memorization: Nearest Neighbor Language Models.” Khandelwal et al., ICLR 2020.
Recent follow-up: TRIME
Training kNN LM to better learn interpolations

● kNN LM uses pre-trained LMs without any training
● TRIME introduces an efficient training method, outperforming kNN LM

[Figure: WikiText-103 dev/test perplexity (lower is better) for Transformer, kNN LM, and TRIME]

“Training Language Models with Memory Augmentation.” Zhong et al., EMNLP 2022.
Pros and cons of output interpolation
kNN LM & variants have unique advantages but face several empirical challenges

● Pros
○ Provides token-level attributions
○ Enables explicit control between parametric and non-parametric memories

● Cons
○ Difficult to scale to large retrieval corpora (the number of embeddings equals the number of tokens)
○ Empirically shows limited effectiveness outside of upstream language modeling tasks
Summary
Diverse types of retrieval-augmented LMs have been studied; each has pros & cons

● Input augmentation: widely used and effective, but faces challenges when incorporating more passages
● Intermediate incorporation: can efficiently handle more passages, but requires pre-training and fine-tuning
● Output interpolation: provides direct control over LM output, but has limited success in downstream tasks and faces challenges scaling the datastore

                              Representative methods       Retrieval unit   Retrieval frequency
Input augmentation            DrQA, RAG, REALM, ICRALM     Passage          Once at the beginning
Intermediate incorporation    RETRO, InstructRETRO         Passage          Every k tokens
Output interpolation          kNN LM, TRIME                Token            Every token
Present: Retrieval-augmented Generation with LLMs
In-context retrieval-augmented LMs
Simply augmenting the input of LMs gives significant gains across different tasks

Without retrieval, the LLM answers from stale parametric knowledge:

Answer the following question, based on the reference.
Reference:
Q: Who is the current PM of UK?
A: Rishi Sunak

With a retriever supplying the reference, the LLM answers correctly:

Answer the following question, based on the reference.
Reference: The current prime minister is Keir Starmer, who succeeded Rishi Sunak on 5 July 2024, following the 2024 general election.
Q: Who is the current PM of UK?
A: Keir Starmer

“When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories.” Mallen*, Asai* et al., ACL 2023.
In-context Retrieval-augmented LMs: Results
Simply augmenting the input space of LMs gives significant gains across different tasks

● On upstream language modeling, simply adding retrieved context gives large gains, especially for smaller models
● Similar significant gains on downstream tasks such as Question Answering

“In-Context Retrieval-Augmented Language Models.” Ram et al., TACL 2023.


In-context Retrieval-augmented LMs: Results
Effects of the retrieval system on downstream task performance

● On language modeling, BM25 gives the best performance
● On downstream QA tasks, trained retrieval models (e.g., Contriever) give the best performance

“In-Context Retrieval-Augmented Language Models.” Ram et al., TACL 2023.
“When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories.” Mallen*, Asai* et al., ACL 2023.
Limitations of such naïve “RAG”
Is combining off-the-shelf models sufficient?

● In-context retrieval-augmented LMs sometimes generate content that is not fully supported by their citations
● They can easily be distracted by unhelpful context
● Diverse tasks have different retrieval needs, e.g., in content and frequency:

“Who is the current PM of UK?” → can be easily answered based on the top documents retrieved at the beginning

“Create a table listing all previous UK Prime Ministers, including their terms in office, political party, alma mater, and notable achievements.” → may require iterative retrieval, based on the current generation

“The equation x² + 2x = i has two complex solutions. Determine the product of their real parts.” (from MATH) → questions with similar solutions may have limited semantic similarity in embedding space

“Evaluating Verifiability in Generative Search Engines.” Liu et al., Findings of EMNLP 2023.
“Making Retrieval-Augmented Language Models Robust to Irrelevant Context.” Yoran et al., ICLR 2024.
Designing and training more reliable LLM RAG
Approaches to optimize (1) LM, (2) Retrievers, or (3) prompts for LLM RAG

1. Optimizing LLMs for RAG: training / controlling LLMs with retrieved context

SAIL: Training LMs with retrieval-augmented data
SAIL augments existing instruction-tuning data to teach the LM how to use retrieved context

● SAIL synthetically generates explanations by using an NLI model

“Search Augmented Instruction Learning.” Luo et al., Findings of EMNLP 2024.


Self-RAG: Teaching LLMs to learn to leverage retrieved context
Self-RAG teaches LMs to adaptively retrieve and to evaluate context & its own generation

● Train an arbitrary LM (e.g., Llama 3) to generate special tokens for (1) triggering retrieval only when necessary and (2) evaluating the relevance of retrieved context and its own generations.

“Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection”. Asai et al. ICLR 2024.
Advanced RAG inference algorithms
Advanced RAG inference algorithms better incorporate retrieved context

● Leverage model-generated tokens to improve the search process at inference time

“Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” Asai et al., ICLR 2024.
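As a schematic of the control flow these two slides describe, here is a sketch of adaptive retrieval plus self-critique at inference time. The special-token names and the helper functions (generate, retrieve, score) are illustrative placeholders, not Self-RAG's actual reflection vocabulary or implementation.

```python
# Schematic of Self-RAG-style adaptive retrieval + self-critique.
# Token names and helpers are illustrative placeholders only.

def generate(prompt: str) -> str:
    """Stub for the trained LM, which emits reflection tokens."""
    return "[Retrieve=Yes] ..."  # placeholder output

def retrieve(query: str) -> list[str]:
    return ["passage about the query"]  # placeholder retriever

def score(generation: str) -> float:
    return 1.0  # placeholder critique, e.g. read off relevance tokens

def self_rag(question: str) -> str:
    draft = generate(question)
    if "[Retrieve=Yes]" not in draft:      # (1) retrieve only when needed
        return draft
    # (2) generate one continuation per passage, keep the best-critiqued
    candidates = [generate(f"{question}\nContext: {p}")
                  for p in retrieve(question)]
    return max(candidates, key=score)

print(self_rag("Who is the current PM of UK?"))
```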
Optimizing LLMs for RAG: Results
New training and advanced inference algorithms for RAG significantly boost performance

● Training 8B and 13B models significantly boosts performance compared to off-the-shelf RAG pipelines
● Adaptive use of retrieval also improves the efficiency of RAG systems

[Figure: accuracy / ROUGE (higher is better) on PopQA, FActScore, ASQA, and PubHealth, comparing Llama 2 13B, Self-RAG 13B, ChatGPT, and SAIL]

“Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” Asai et al., ICLR 2024.
Designing and training more reliable LLM RAG
Approaches to optimize (1) LM, (2) Retrievers, or (3) prompts for LLM RAG

1. Optimizing LLMs for RAG: training / controlling LLMs with retrieved context

2. Optimizing Retriever for RAG: training retrievers for LLM RAG

Optimizing retrievers for RAG
Training retrieval modules using LM feedback

● For RAG pipelines built on black-box LLMs (e.g., OpenAI o1), we cannot directly train the LLM for RAG
● Can we train retrievers instead?

“REPLUG: Retrieval-Augmented Black-Box Language Models.” Shi et al., NAACL 2024.


REPLUG: Training a retriever using black-box LLM feedback
Training the retrieval model using LM feedback

● Train retrievers for black-box LLMs by minimizing the KL divergence between the retriever's document distribution and the LM's likelihood-based feedback

“REPLUG: Retrieval-Augmented Black-Box Language Models.” Shi et al., NAACL 2024.
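A rough sketch of that training signal, assuming access to the per-document LM likelihoods; the dummy tensors and the temperature value are illustrative, not the paper's exact hyperparameters.

```python
# Sketch of a REPLUG-LSR-style objective: align the retriever's
# distribution over the k retrieved documents with the LM's feedback
# distribution, by minimizing KL(P_retriever || Q_LM).
import torch
import torch.nn.functional as F

retriever_scores = torch.randn(4, requires_grad=True)  # sim(q, d), k=4 docs
lm_loglik = torch.randn(4)  # log p_LM(y | d, x) per doc (black box: no grad)

p_r = F.softmax(retriever_scores / 0.1, dim=-1)   # retriever likelihood
log_q = F.log_softmax(lm_loglik / 0.1, dim=-1)    # LM feedback distribution

# KL(P_R || Q_LM); gradients flow only into the retriever's scores
loss = torch.sum(p_r * (torch.log(p_r) - log_q))
loss.backward()
print(loss.item(), retriever_scores.grad)
```

The key property is that the LM stays frozen (its likelihoods are just numbers), so the gradient updates only the retriever, which is exactly what makes this usable with black-box LLM APIs that return log-probabilities.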


RA-DIT: Combining REPLUG + retrieval-augmented LM training
Trains both the retriever and the LM on multiple tasks, using REPLUG-style retriever training + retrieval-augmented instruction tuning

“RA-DIT: Retrieval-Augmented Dual Instruction Tuning.” Lin et al., ICLR 2024.


REPLUG, RA-DIT: Results
Training the retriever & LM gives large improvements across diverse tasks

● RA-DIT observes performance gains over the combination of off-the-shelf components (REPLUG w/o LSR)
● Both LM training and retriever training contribute to the performance gain

[Figure: accuracy / EM on MMLU, TQA, and the average, comparing REPLUG, RA-DIT (LM only), RA-DIT (R only), and full RA-DIT]

“RA-DIT: Retrieval-Augmented Dual Instruction Tuning.” Lin et al., ICLR 2024.


Optimizing retrievers for RAG
Alternative approaches: introducing additional modules for reranking or filtering

● From the initially retrieved documents 𝑍, select more relevant context before feeding it to the LM
● Examples include cross-encoder reranking and context compression (Xi et al., 2024)

“RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation.” Xu et al., ICLR 2024.
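For the reranking route, a minimal sketch with an off-the-shelf cross-encoder; the model name and toy documents are illustrative assumptions, not the modules used in RECOMP.

```python
# Sketch of reranking initially retrieved documents Z with a
# cross-encoder before feeding the LM; model choice is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Who is the current PM of UK?"
Z = [
    "Keir Starmer became Prime Minister of the UK on 5 July 2024.",
    "Pittsburgh is the county seat of Allegheny County, Pennsylvania.",
]
# Cross-encoders score (query, document) pairs jointly, which is more
# accurate than dot products of independent embeddings, but slower.
scores = reranker.predict([(query, d) for d in Z])
top = [d for _, d in sorted(zip(scores, Z), reverse=True)][:1]
print(top)
```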
Designing and training more reliable LLM RAG
Approaches to optimize (1) LM, (2) Retrievers, or (3) prompts for LLM RAG

1. Optimizing LLMs for RAG: training / controlling LLMs with retrieved context

2. Optimizing Retriever for RAG: training retrievers for LLM RAG

3. Optimizing Prompts for RAG: advanced prompt techniques

DSPy: Optimizing prompts for LLM RAG
Optimizing prompts for RAG applications

● Training-free RAG systems are brittle to prompts: a hand-written pipeline scores 33% with GPT-3.5 on a multi-hop QA task
● DSPy optimizes instructions and few-shot demonstrations to achieve the best performance, raising the score to 55% with GPT-3.5 on the same task

“DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.” Khattab et al., ICLR 2024.
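For flavor, a minimal DSPy-style RAG program in the spirit of the paper's examples; the module and optimizer names (dspy.Retrieve, dspy.ChainOfThought, BootstrapFewShot) follow the paper, but the current library API may differ, and running it requires configuring an LM and a retrieval model first.

```python
# Minimal DSPy-style RAG program, following the paper's examples.
# Requires dspy installed and dspy.settings.configure(lm=..., rm=...).
import dspy

class RAG(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# The optimizer ("teleprompter") compiles the program by searching over
# instructions / few-shot demonstrations against a metric and a trainset:
# teleprompter = dspy.teleprompt.BootstrapFewShot(metric=my_metric)
# compiled_rag = teleprompter.compile(RAG(), trainset=train_examples)
```

The point of the abstraction is that the 33% → 55% gain above comes from the compiler searching over prompts, not from any change to the program's structure.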
Future: Limitations & future directions
Roadmap for more efficient & reliable retrieval-augmented LMs
Infrastructure: challenges of scaling up datastores & increased inference-time costs

● Performance gains are achieved by scaling up the datastore to trillions of tokens
● This significantly increases inference costs, including CPU memory and storage requirements (e.g., 24 TB for a 1.7 trillion-token datastore)

“Scaling Retrieval-Based Language Models with a Trillion-Token Datastore.” Shao, He, Asai et al., ArXiv 2024.
Roadmap for more efficient & reliable retrieval-augmented LMs
Algorithms: new algorithms & architectures to enable more efficient and effective RAG

● Current “RAG” has many issues, such as efficiency & redundancy
● Alternative algorithms, better LM architectures, caching, etc. can improve efficiency and performance

“Generative Representational Instruction Tuning.” Muennighoff et al., ArXiv 2024.
“PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design.” Jiang et al., ArXiv 2024.
Roadmap for more efficient & reliable retrieval-augmented LMs
Evaluations: careful analyses of effectiveness and limitations

Prior systems are often evaluated only on simple, general-domain tasks; further exploration of their evaluation is needed:

● Domains: most prior evaluations are on general-domain tasks, where Wikipedia is a sufficient knowledge source
● Tasks: going beyond open-domain QA and multiple-choice QA
● Aspects: instead of merely evaluating final “correctness”, more holistic evaluations of different aspects of RAG
Questions? Sli.do code #2068655

Acknowledgements: Some slides are adapted from our ACL 2023 tutorial https://acl2023-retrieval-lm.github.io/, co-taught by Akari, Sewon Min, Zexuan Zhong, and Danqi Chen. We thank Omar Khattab for sharing the DSPy slides.
