LLMs and Retrieval-Augmented Generation (RAG)
How do normal parametric LLMs work?
Parametric LLMs encapsulate everything in their parameters by pre-training on large-scale text corpora.
P(x_n | x_1, x_2, …, x_{n-1})
[Figure: an LLM predicts the next token given the context "Pittsburgh is located in" (x_1 x_2 x_3 x_4)]
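The next-token distribution above can be sketched numerically. This toy example uses a made-up vocabulary and logits (not from a real model) to show how an LM turns logits into P(x_n | x_1, …, x_{n-1}) via softmax:

```python
# Toy illustration of next-token prediction: a parametric LM maps a context
# to a probability distribution over its vocabulary via softmax over logits.
# The vocabulary and logit values below are invented for illustration.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # subtract max for numerical stability
    z = sum(exps)
    return [e / z for e in exps]

vocab = ["Pennsylvania", "Ohio", "France", "pizza"]
# Pretend these logits came from a trained LM given "Pittsburgh is located in"
logits = [5.1, 2.3, 0.4, -1.0]
p_next = dict(zip(vocab, softmax(logits)))

best = max(p_next, key=p_next.get)  # "Pennsylvania" gets the highest probability
```

Everything the model "knows" here lives in the parameters that produced the logits, which is exactly why stale or missing knowledge leads to the limitations discussed next.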
Limitations of parametric LLMs #1: Hallucinations
LLMs cannot memorize (encapsulate) everything in their parameters yet, resulting in factual inaccuracies.
Catastrophic incidents due to LLM hallucinations
Such LLM hallucinations have been causing many critical incidents in the real world
Retrieval-augmented LMs: Definitions & Notations
A new type of LM that can use large-scale text data (a datastore) at inference time.
[Diagram: input x becomes a query q; the Retriever scores documents d in the datastore by sim(q, d) and returns documents Z; the LLM (which also retains its pre-training data in its parameters) takes x together with Z and produces output y]
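The retrieval step sim(q, d) can be sketched as follows. This is a minimal illustration with toy bag-of-words embeddings and a small invented datastore; real systems use trained dense or sparse retrievers:

```python
# Minimal sketch of the retrieval step: embed the query q and each document d,
# score sim(q, d) with cosine similarity, and return the top-k documents Z.
# Embeddings here are toy bag-of-words counts, not trained encoder outputs.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, datastore, k=2):
    q = embed(query)
    scored = sorted(datastore, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]  # the retrieved documents Z

datastore = [
    "Pittsburgh is a city in Pennsylvania.",
    "The Eiffel Tower is in Paris.",
    "Pennsylvania is a state in the United States.",
]
Z = retrieve("Where is Pittsburgh located?", datastore)
```

The documents Z are then passed to the LLM together with the input x, as in the diagram above.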
Benefit of retrieval-augmented LMs #1: reduce hallucinations
Retrieval-augmented LMs can reduce hallucinations, especially in long-tail knowledge
“When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories”. Mallen*, Asai* et al. ACL 2023
Quiz: What are the other benefits of using retrieval-augmented LMs?
Benefit of retrieval-augmented LMs #2: Adaptations w/o training
Parametric LMs’ knowledge quickly becomes obsolete & requires continuous re-training.
[Figure: an LLM computing P(x_n | x_1, x_2, …, x_{n-1}) from 2023 pre-training data answers "Rishi" or "Boris" for the current UK prime minister, while 2024 data gives the correct answer: "The incumbent prime minister is Keir Starmer, who assumed the office on 5 July 2024."]
“RealTime QA: What's the Answer Right Now?” Kasai et al. NeurIPS (Benchmark). 2023
Benefit of retrieval-augmented LMs #2: Adaptations w/o training
We can easily swap datastores of retrieval-augmented LMs to cover new data distributions.
[Figure: replacing the 2023 datastore with 2024 data, without re-training the LM]
“RealTime QA: What's the Answer Right Now?” Kasai et al. NeurIPS (Benchmark). 2023
Benefit of retrieval-augmented LMs #3: Providing attributions
Retrieval-augmented LMs can provide a small number of documents as attributions
“Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models”. Bohnet et al. ArXiv 2022.
Benefit of retrieval-augmented LMs #4: Flexible data opt-in / out
We can incorporate or remove high-risk data dynamically at inference time rather than training time
“SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore”. Min et al. In ICLR 2024
Benefit of retrieval-augmented LMs #5: parameter efficiency
Retrieval-augmented LMs can be much more parameter efficient and compute-optimal
“Scaling Retrieval-Based Language Models with a Trillion-Token Datastore.” Shao, He, Asai et al., ArXiv 2024.
Retrieval-augmented LMs have been widely used!
Retrieval-augmented LMs have been widely used both in academia and industry
Brief history of retrieval-augmented LMs development
2017: DrQA — a trained QA model over BM25 retrieval
2019: ORQA
2020: RALM, RAG
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Lewis et al., NeurIPS 2020.
Brief history of retrieval-augmented LMs development
RAG was initially studied extensively for specific NLP tasks, most notably question answering.
2017: DrQA
2019: ORQA
2020: RALM, RAG
2020: kNN LM
“Generalization through Memorization: Nearest Neighbor Language Models.” Khandelwal et al., ICLR 2020.
Brief history of retrieval-augmented LMs development
RAG was initially studied extensively for specific NLP tasks, most notably question answering.
2017: DrQA
2019: ORQA
2020: RALM, RAG
2020: kNN LM
2021: RETRO — new architectures for retrieval-augmented LMs
“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., arXiv 2021.
Brief history of retrieval-augmented LMs development
Versatile and powerful LLMs demonstrate effectiveness even without fine-tuning
2017: DrQA
2019: ORQA
2020: RALM, RAG
2020: kNN LM
2020: GPT3
2021: RETRO
2022: ChatGPT
https://paperswithcode.com/sota/question-answering-on-triviaqa
Brief history of retrieval-augmented LMs development
Success of in-context retrieval-augmented LMs (commonly referred to as “RAG” today): use off-the-shelf LLMs & retrieval systems.
Past (2017: DrQA; 2019: ORQA; 2020: RALM, RAG; 2020: kNN LM; 2021: RETRO): developments in architecture and training for specific tasks
2023: Retrieval-augmented LLMs
Diverse architectures of retrieval-augmented LMs
Classifying retrieval-augmented LMs based on “where” we incorporate retrieved context
● Input augmentation
○ Augment the input of LMs with retrieved context
○ E.g., RAG, REALM, DrQA, In-context RALM
[Diagram: the Retriever passes retrieved documents, together with input x, to the LLM, which outputs y]
REALM: Augmenting input space of LMs
REALM is a retrieval-augmented masked LM that predicts masked tokens / spans in context.
“REALM: Retrieval-Augmented Language Model Pre-Training.” Guu et al., ICML 2020.
RAG & REALM: Results
RAG and REALM show their effectiveness on open-domain QA and other tasks
[Bar charts: open-domain QA exact match (higher is better) on NQ and WQ for T5, REALM, DPR, and RAG; generation results comparing BART, RAG-Token, and a Gold reference]
● Combining retrieval and off-the-shelf LMs (e.g., GPT-4) at inference time without training
● Often referred to as “RAG” nowadays
● We’ll cover this in depth in the next section!
Pros and cons of input augmentation
Input augmentation is powerful but has several limitations.
● Pros
○ Easy to switch to new, more powerful LMs without fine-tuning / training
○ LLMs can effectively leverage input context
● Cons
○ Expensive to scale up to hundreds or thousands of documents
■ LLMs also often do not fully leverage long context
○ No strict attribution to specific evidence
“Lost in the Middle: How Language Models Use Long Contexts.” Liu et al., TACL 2023.
Diverse architectures of retrieval-augmented LMs
Classifying retrieval-augmented LMs based on “where” we incorporate retrieved context
● Input augmentation
○ Augment the input of LMs with retrieved context
○ E.g., RAG, REALM, DrQA, In-context RALM
● Intermediate incorporation
○ Incorporate retrieved context in intermediate spaces of transformers
○ E.g., RETRO, Instruct RETRO
RETRO: Incorporating context in intermediate layers
RETRO enables more efficient incorporation of many documents.
“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., arXiv 2021.
RETRO: Incorporating context in intermediate layers
RETRO uses a frozen BERT as its retriever and retrieves nearest neighbors from a 1.7T-token datastore.
“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., arXiv 2021.
RETRO: Incorporating context in intermediate layers
Given the input sequence, RETRO first retrieves a set of relevant documents (as embeddings of text chunks).
“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., arXiv 2021.
RETRO: Incorporating context in intermediate layers
Use cross-attention to generate retrieved context-aware representations.
“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., arXiv 2021.
RETRO: Incorporating context in intermediate layers
Concatenate all of the CA outputs (the sizes of the input H and the output CCA(H, E) remain the same).
“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., arXiv 2021.
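The cross-attention step can be approximated with a plain cross-attention sketch. RETRO additionally chunks the input, which is omitted here for brevity; shapes and values below are toy, not RETRO's actual implementation:

```python
# Sketch of the cross-attention (CA) used to fuse retrieved context: the hidden
# states H of the input attend to encoded retrieval embeddings E, producing an
# output with the same shape as H, so it can slot into a transformer block.
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 6, 8             # input length, retrieved length, hidden size
H = rng.normal(size=(n, d))   # hidden states of the input tokens
E = rng.normal(size=(m, d))   # encoded retrieved neighbors

def cross_attention(H, E):
    scores = H @ E.T / np.sqrt(E.shape[1])          # (n, m) attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over retrieved tokens
    return weights @ E                              # (n, d): same shape as H

out = cross_attention(H, E)
```

Because the output keeps the input's shape, the CA(H, E) block can be inserted into intermediate transformer layers without changing anything downstream.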
RETRO: Results
RETRO shows impressive performance improvements on upstream (language modeling) tasks.
[Charts: perplexity (lower is better) across model and datastore scales]
“Improving language models by retrieving from trillions of tokens.” Borgeaud et al., arXiv 2021.
Recent follow-up: Instruct RETRO
Develops RETRO blocks on top of autoregressive LMs (e.g., Llama), with retrieval-augmented pre-training & multi-task instruction tuning.
“InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining.” Wang et al., ICML 2024.
Pros and cons of intermediate incorporation
Alternative way to incorporate retrieved context in a more scalable way, but requires training
● Pros
○ More efficiently incorporates many passages than input augmentation
○ Possibly more effective than input augmentation (cf. Instruct RETRO results)
● Cons
○ Requires modification of the underlying LM
○ Expensive pre-training is necessary
○ Doesn’t provide strict attribution
Diverse architectures of retrieval-augmented LMs
Classifying retrieval-augmented LMs based on “where” we incorporate retrieved context
● Input augmentation
○ Augment the input of LMs with retrieved context
○ E.g., RAG, REALM, DrQA, In-context RALM
● Intermediate incorporation
○ Incorporate retrieved context in intermediate spaces of transformers
○ E.g., RETRO, Instruct RETRO
● Output interpolation
○ Interpolate output token probabilities with retrieved non-parametric distributions
○ E.g., kNN LM
kNN LM: directly interpolate output token distributions
kNN LM directly interpolates the output token distribution of the LM (the parametric distribution) with a nonparametric distribution derived from nearest-neighbor tokens in the datastore.
“Generalization through Memorization: Nearest Neighbor Language Models.” Khandelwal et al., ICLR 2020.
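The interpolation can be sketched as follows, with made-up probabilities and neighbor distances standing in for real model outputs:

```python
# Sketch of kNN LM interpolation: the final distribution is
#   p(y) = lam * p_kNN(y) + (1 - lam) * p_LM(y),
# where p_kNN is built from distances to retrieved nearest-neighbor tokens.
import math

def knn_distribution(neighbors, temperature=1.0):
    # neighbors: list of (next_token, distance) pairs from the datastore;
    # closer neighbors get exponentially larger weight.
    weights = {}
    for token, dist in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-dist / temperature)
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}

def interpolate(p_lm, p_knn, lam=0.25):
    tokens = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
            for t in tokens}

p_lm = {"Hawaii": 0.2, "Illinois": 0.7, "Texas": 0.1}   # parametric distribution
p_knn = knn_distribution([("Hawaii", 0.5), ("Hawaii", 1.0), ("Illinois", 4.0)])
p = interpolate(p_lm, p_knn)
```

The mixing weight lam gives the explicit control between parametric and non-parametric memories discussed below.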
kNN LM: Results
kNN LM outperforms much larger parametric LMs by a large margin.
● kNN LM consistently outperforms a parametric 100M LM, and with a larger datastore even a 30x larger 3B LM
● kNN LM also enables efficient & controlled domain adaptation
“Generalization through Memorization: Nearest Neighbor Language Models.” Khandelwal et al., ICLR 2020.
Recent follow-up: TRIME
TRIME trains kNN LM-style models to better learn the interpolation.
[Bar chart: dev / test perplexity (lower is better) for Transformer, kNN LM, and TRIME]
“Training Language Models with Memory Augmentation.” Zhong et al., EMNLP 2022.
Pros and cons of output interpolation
kNN LM & variants have unique advantages but face several empirical challenges.
● Pros
○ Provides token-level attributions
○ Enables explicit control between parametric and non-parametric memories
● Cons
○ Difficult to scale to large retrieval corpora (the number of embeddings equals the number of tokens)
○ Empirically shows limited effectiveness outside of upstream language modeling tasks
Summary
Diverse types of retrieval-augmented LMs have been studied, each with its own pros & cons.
● Input augmentation: widely used and effective but faces challenges when incorporating more
passages
● Intermediate incorporation: can efficiently handle more passages but requires pre-training and
fine-tuning
● Output interpolation: provides direct control over LM output, but has limited success in
downstream tasks and faces challenges of scaling the datastore
Approach                    | Representative methods    | Retrieval unit | Retrieval frequency
Input augmentation          | DrQA, RAG, REALM, ICRALM  | Passage        | Once at the beginning
Intermediate incorporation  | RETRO, InstructRETRO      | Passage        | Every k tokens
Output interpolation        | kNN LM, TRIME             | Token          | Every token
Present: Retrieval-augmented Generation with LLMs
In-context retrieval-augmented LMs
Simply augmenting the input of LMs gives significant gains across different tasks.
[Diagram: a Retriever prepends retrieved documents to the input of an off-the-shelf LLM]
“When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories”. Mallen*, Asai* et al. ACL 2023
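A minimal sketch of the input augmentation itself: retrieved documents are simply prepended to the prompt of an unmodified LLM. The prompt template and the `call_llm` placeholder are illustrative, not a specific system's API:

```python
# Sketch of in-context augmentation: retrieved documents are prepended to the
# prompt of an off-the-shelf LLM; no model weights change.
def build_rag_prompt(question, retrieved_docs):
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = ["Keir Starmer became UK prime minister on 5 July 2024."]
prompt = build_rag_prompt("Who is the UK prime minister?", docs)
# `prompt` would then be sent to any off-the-shelf LLM, e.g. call_llm(prompt)
```

Because only the prompt changes, swapping in a newer or stronger LLM requires no re-training, which is exactly the appeal of this approach.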
In-context Retrieval-augmented LMs: Results
Simply augmenting the input space of LMs gives significant gains across different tasks.
● In upstream language modeling, simply adding retrieved context gives large gains, especially for smaller models
● Similar significant gains in downstream tasks such as Question Answering
“When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories”. Mallen*, Asai* et al. ACL 2023
Limitations of such naïve “RAG”
Is combining off-the-shelf models sufficient?
“Evaluating Verifiability in Generative Search Engines”. Liu et al. Findings of EMNLP 2023.
Limitations of such naïve “RAG”
Is combining off-the-shelf models sufficient?
Example (from MATH): “The equation x² + 2x = i has two complex solutions. Determine the product of their real parts.” Questions with similar solutions may have limited semantic similarity in embedding space.
Designing and training more reliable LLM RAG
Approaches to optimize (1) LMs, (2) retrievers, or (3) prompts for LLM RAG
1. Optimizing LLMs for RAG: training / controlling LLMs with retrieved context
SAIL: Training LMs with retrieval-augmented data
SAIL augments existing instruction-tuning data to teach the LM how to use retrieved context.
Self-RAG: Learning to retrieve, generate, and critique
● Train an arbitrary LM (e.g., Llama 3) to generate special tokens for (1) triggering retrieval only when necessary and (2) evaluating the relevance of retrieved context and its own generations.
“Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection”. Asai et al. ICLR 2024.
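The special-token control flow can be caricatured as follows. The heuristics below are stand-ins for what a trained model actually predicts with its reflection tokens, so this is only a sketch of the control flow, not Self-RAG itself:

```python
# Toy sketch of adaptive retrieval with special tokens, in the spirit of
# Self-RAG: the LM first decides [Retrieve] vs [No Retrieve], and retrieved
# passages are critiqued for relevance before being used.
def needs_retrieval(question):
    # A trained LM emits a special token; here we fake it with a heuristic.
    factual_cues = ("who", "when", "where", "which")
    return question.lower().startswith(factual_cues)

def critique(question, passage):
    # Stand-in for the LM's relevance token ([Relevant] / [Irrelevant]).
    return any(w in passage.lower() for w in question.lower().split())

def answer(question, retriever, llm):
    if needs_retrieval(question):
        passages = [p for p in retriever(question) if critique(question, p)]
        prompt = "\n".join(passages) + "\n" + question
    else:
        prompt = question  # skip retrieval entirely for non-factual requests
    return llm(prompt)
```

Skipping retrieval when it is unnecessary is what yields the efficiency gains reported in the results below.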
Advanced RAG inference algorithm
Advanced RAG inference algorithm to better incorporate retrieved context
“Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection”. Asai et al. ICLR 2024.
Optimizing LLMs for RAG: Results
New training and advanced inference algorithms for RAG significantly boost performance.
● Training with 8B and 13B models significantly boosts performance compared to off-the-shelf RAG pipelines
● Adaptive use of retrieval also improves the efficiency of RAG systems
[Bar chart: accuracy / ROUGE on PopQA, FActScore, ASQA, and PubHealth for Llama 2 13B, Self-RAG 13B, ChatGPT, and SAIL]
“Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection”. Asai et al. ICLR 2024.
Designing and training more reliable LLM RAG
Approaches to optimize (1) LMs, (2) retrievers, or (3) prompts for LLM RAG
2. Optimizing retrievers for RAG: training retrieval modules using LM feedback
Optimizing retrievers for RAG
Training retrieval modules using LM feedback
● For RAG pipelines using black-box LLMs (e.g., OpenAI o1), we cannot directly train the LLMs for RAG
● Can we train retrievers instead?
● REPLUG trains retrievers for black-box LLMs by minimizing the KL divergence between the retriever’s document distribution and the LM’s likelihood distribution over documents
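One common form of this objective (following REPLUG LSR, with illustrative numbers standing in for real model scores) can be sketched as:

```python
# Sketch of REPLUG-style retriever training: minimize KL(P_R || Q_LM), where
# P_R(d|q) is the retriever's distribution over documents (softmax of retrieval
# scores) and Q_LM(d|q) is the LM feedback distribution (softmax of how well
# each document helps the LM score the gold answer). Numbers are illustrative.
import math

def softmax(xs, temp=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / temp) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    # KL(P || Q); P is the retriever distribution, Q the LM feedback (target).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

retrieval_scores = [2.0, 1.0, 0.5]       # sim(q, d) from the retriever
lm_log_likelihoods = [-0.2, -2.5, -3.0]  # log P_LM(gold answer | q, d)

P_R = softmax(retrieval_scores)
Q_LM = softmax(lm_log_likelihoods)
loss = kl_divergence(P_R, Q_LM)  # gradients flow into the retriever only
```

The black-box LM only needs to score candidate outputs; all gradient updates go to the retriever's parameters, which is why this works even when the LM is behind an API.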
● RA-DIT observes performance gains over combinations of off-the-shelf components (REPLUG w/o LSR)
● Both LM and retriever training contribute to the performance gain
[Bar chart: accuracy / EM on MMLU, TQA, and their average for REPLUG, RA-DIT (LM only), RA-DIT (R only), and full RA-DIT]
● From the initially retrieved docs Z, select more relevant context before feeding it to LMs
● Examples include: cross-encoder reranking, context compression (Xu et al., 2024)
“RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation”. Xu et al. ICLR 2024.
Designing and training more reliable LLM RAG
Approaches to optimize (1) LMs, (2) retrievers, or (3) prompts for LLM RAG
3. Optimizing prompts for RAG: optimizing the prompts of RAG pipelines
DSPy: Optimizing prompts for LLM RAG
Optimizing prompts for RAG applications
“DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines”. Khattab et al. ICLR 2024.
DSPy: Optimizing prompts for LLM RAG
Optimizing prompts for RAG applications
● DSPy optimizes instructions and few-shot demonstrations to achieve the best performance
● On a multi-hop QA task with GPT-3.5, the pipeline scores 33% before and 55% after prompt optimization
“DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines”. Khattab et al. ICLR 2024.
Future: Limitations & future directions
Roadmap for more efficient & reliable retrieval-augmented LMs
Challenges of scaling up datastores & increased inference-time costs (spanning both algorithms and infrastructure)
“Scaling Retrieval-Based Language Models with a Trillion-Token Datastore.” Shao, He, Asai et al., ArXiv 2024.
Roadmap for more efficient & reliable retrieval-augmented LMs
New algorithms & architectures to enable more efficient and effective RAG (spanning both algorithms and infrastructure)
“PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design.” Jiang et al., ArXiv 2024.
Roadmap for more efficient & reliable retrieval-augmented LMs
Careful analyses on their effectiveness and limitations
Acknowledgements: Some slides are adapted from our ACL 2023 tutorial https://acl2023-retrieval-lm.github.io/, co-taught by Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. We thank Omar Khattab for sharing the DSPy slides.