Practical RAG
Table of contents

Preface
1 Introduction
1.1 The Role of LLMs in NLP
1.2 The Importance of Question Answering over PDFs
1.3 The Retrieval-Augmented Generation Approach
1.4 A Brief History of LLMs
1.4.1 Foundation Models
1.4.2 Reinforcement Learning from Human Feedback
1.4.3 GAN
1.4.4 Applications
1.4.5 Prompt Learning
References
List of Figures
4.27 Insert text chunks into the vector database and perform retrieval
4.28 Metadata filtering in LlamaIndex for document retrieval
4.29 Define text node and metadata for auto retrieval
4.30 Define VectorIndexAutoRetriever retriever and VectorStoreInfo, which contains a structured description of the vector store collection and the metadata filters it supports
4.31 Hybrid retrieval pipeline
4.32 Load documents and initialize document store
4.33 Define keyword and embedding based retrievers
4.34 Create end-to-end pipeline and run the retrievers
4.35 Query re-writing using LLMs. LLM can expand the query or create multiple sub-queries
4.36 Query router architecture
4.37 Using a zero-shot classifier to categorize and route user queries
4.38 Using an LLM to categorize and route user queries
4.39 Query routing example in LlamaIndex. First we load documents and create different indices
4.40 Define QueryEngine and RouterQueryEngine objects, and run the engine for user queries
4.41 A basic key-value implementation of memory for RAG
Preface
1 Introduction
In this chapter, we will lay the foundation for building a chat-to-PDF app us-
ing Large Language Models (LLMs) with a focus on the Retrieval-Augmented
Generation approach. We’ll explore the fundamental concepts and technologies
that underpin this project.
1.1 The Role of LLMs in NLP

Large Language Models (LLMs) play a crucial role in Natural Language Processing (NLP). These models have revolutionized the field of NLP by their ability to
understand and generate human-like text. With advances in deep learning and
neural networks, LLMs have become valuable assets in various NLP tasks, in-
cluding language translation, text summarization, and chatbot development.
One of the key strengths of LLMs lies in their capacity to learn from vast
amounts of text data. By training on massive datasets, LLMs can capture
complex linguistic patterns and generate coherent and contextually appropriate responses. This enables them to produce high-quality outputs that can be difficult to distinguish from human-generated text.
LLMs are trained using a two-step process: pre-training and fine-tuning. Dur-
ing pre-training, models are exposed to a large corpus of text data and learn
to predict the next word in a sentence. This helps them develop a strong un-
derstanding of language structure and semantics. In the fine-tuning phase, the
models are further trained on task-specific data to adapt their knowledge to
specific domains or tasks.
The versatility and effectiveness of LLMs make them a powerful tool in advanc-
ing the field of NLP. They have not only improved the performance of existing
NLP systems but have also opened up new possibilities for developing inno-
vative applications. With continued research and development, LLMs are ex-
pected to further push the boundaries of what is possible in natural language
understanding and generation.
1.4 A Brief History of LLMs
Lately, ChatGPT, as well as DALL-E-2 and Codex, have been getting a lot of
attention. This has sparked curiosity in many who want to know more about
what’s behind their impressive performance. ChatGPT and other Generative AI
(GAI) technologies fall into a category called Artificial Intelligence Generated
Content (AIGC). This means they’re all about using AI models to create content
like images, music, and written language. The whole idea behind AIGC is to make creating content faster and easier.
With more data and bigger models, these AI systems can make things that look
and sound quite realistic and high-quality. The following shows an example of
text prompting that generates images according to the instructions, leveraging
the OpenAI DALL-E-2 model.
In the realm of Generative AI (GAI), models can typically be divided into two cat-
egories: unimodal models and multimodal models. Unimodal models operate
by taking instructions from the same type of data as the content they generate,
while multimodal models are capable of receiving instructions from one type
of data and generating content in another type. The following figure illustrates
these two categories of models.
These models have found applications across diverse industries, such as art and
design, marketing, and education. It's evident that in the foreseeable future, AIGC will remain a prominent and continually evolving research area within artificial intelligence.
Speaking of LLMs and GenAI, we cannot overlook the significant role played
by Transformer models.
Several later models are classic examples of masked language models and have further improved upon the BERT architecture with additional training data and techniques.
In this graph, you can see two types of information flow indicated by lines: the
black line represents bidirectional information flow, while the gray line repre-
sents left-to-right information flow. There are three main model categories:
1. Encoder models like BERT, which are trained with context-aware objec-
tives.
2. Decoder models like GPT, which are trained with autoregressive objec-
tives.
3. Encoder-decoder models like T5 and BART, which merge both ap-
proaches. These models use context-aware structures as encoders and
left-to-right structures as decoders.
1.4.2 Reinforcement Learning from Human Feedback

To improve the alignment of AI-generated content (AIGC) with user intent, i.e., to make outputs more useful and truthful, reinforcement learning from human feedback (RLHF) has been applied in models like Sparrow, InstructGPT, and ChatGPT.
The RLHF pipeline involves three steps: pre-training, reward learning, and fine-
tuning with reinforcement learning. In reward learning, human feedback on di-
verse responses is used to create reward scalars. Fine-tuning is done through re-
inforcement learning with Proximal Policy Optimization (PPO), aiming to max-
imize the learned reward.
However, the field lacks benchmarks and resources for RL, which is seen as a challenge. But this is changing day by day. For example, an open-source library called RL4LMs was introduced to address this gap. Claude, a dialogue agent, uses Constitutional AI, where the reward model is learned via RL from AI feedback. The focus is on reducing harmful outputs, with guidance from a set of principles provided by humans. See one of our earlier blog posts for more about Constitutional AI.
1.4.3 GAN
Generative Adversarial Networks (GANs) are widely used for image generation.
GANs consist of a generator and a discriminator. The generator creates new
data, while the discriminator decides if the input is real or not.
The design of the generator and discriminator influences GAN training and per-
formance. Various GAN variants have been developed, including LAPGAN, DC-
GAN, Progressive GAN, SAGAN, BigGAN, StyleGAN, and methods addressing
mode collapse like D2GAN and GMAN.
The following graph illustrates some of the categories of vision generative mod-
els.
Although GAN models are not the focus of our book, they are essential in powering multi-modality applications such as diffusion models.
1.4.4 Applications
Chatbots are probably one of the most popular applications for LLMs.
Chatbots are computer programs that mimic human conversation through text-
based interfaces. They use language models to understand and respond to user
input. Chatbots have various use cases, like customer support and answering
common questions. Our "chat with your PDF documents" application is an up-and-coming use case!
Other notable examples include Xiaoice, developed by Microsoft, which ex-
presses empathy, and Google’s Meena, an advanced chatbot. Microsoft’s Bing
now incorporates ChatGPT, opening up new possibilities for chatbot develop-
ment.
This graph illustrates the relationships among current research areas, applica-
tions, and related companies. Research areas are denoted by dark blue circles,
applications by light blue circles, and companies by green circles.
In addition, we have previously written about chatbots; while they are now part of history, they are still worth reviewing.
Of course, chatbots are not the only application. There are vast possibilities in
arts and design, music generation, education technology, coding and beyond -
your imagination doesn’t need to stop here.
1.4.5 Prompt Learning

Normally, prompt learning freezes the language model and directly performs few-shot or zero-shot learning on it. This enables language models to be pre-trained on large amounts of raw text data and then adapted to new domains without tuning them again. Hence, prompt learning can help save a great deal of time and effort.
Traditionally, prompt learning involves prompting the model with a task, and
it can be done in two stages: prompt engineering and answer engineering.
Prompt engineering: This involves creating prompts, which can be either dis-
crete (manually designed) or continuous (added to input embeddings) to convey
task-specific information.
Answer engineering: After reformulating the task, the generated answer
must be mapped to the correct answer space.
Besides single-prompt methods, multi-prompt methods combine multiple prompts for better predictions, and prompt augmentation enriches the prompt with additional context to generate better results.
Moreover, in-context learning, a subset of prompt learning, has gained popu-
larity. It enhances model performance by incorporating a pre-trained language
model and supplying input-label pairs and task-specific instructions to improve
alignment with the task.
Figure 1.6: Emerging RAG & Prompt Engineering Architecture for LLMs. Image
source
In the top-right corner, you can see complex yet powerful tools like OpenAI, Cohere, and Anthropic, which have pushed the boundaries of
what language models can achieve. Along the diagonal, the evolution of prompt
engineering is displayed, from static prompts to templates, prompt chaining,
RAG pipelines, autonomous agents, and prompt tuning. On the more flexible
side, options like Haystack and LangChain have excelled, presenting broader
horizons for those seeking to harness the versatility of language models.
This graph serves as a snapshot of the ever-evolving landscape of tooling in the realm of language models and prompt engineering today, providing a roadmap
for those navigating the exciting possibilities and complexities of this field. It
is likely going to be changing every day, reflecting the continuous innovation
and dynamism in the space.
In the next chapter, we'll turn our focus to the details of Retrieval Augmented Generation (RAG) pipelines. We will break down their key components and architecture, and the key steps involved in building an efficient retrieval system.
2 Retrieval Augmented Generation (RAG)
Your existing chatbot is built around a traditional Large Language Model (LLM).
While it’s knowledgeable about general product information, your customers
are increasingly seeking more specific and real-time assistance. Here are some
challenges you’ve encountered:
Product Reviews: Shoppers want to know about recent product reviews and
ratings to make informed decisions. The LLM lacks access to the latest customer
reviews and sentiment analysis.
2.2 Introducing Retrieval Augmented Generation (RAG)
Now, let’s introduce RAG into your e-commerce customer support system:
Retrieval of Real-Time Data: With RAG, your chatbot can connect to your e-
commerce platform’s databases and data warehouses in real-time. It can re-
trieve the latest information about product availability, stock levels, and ship-
ping status.
Incorporating User Reviews: RAG can scrape and analyze customer reviews
and ratings from your website, social media, and other sources. It can then gen-
erate responses that include recent reviews, helping customers make informed
choices.
Dynamic Promotions: RAG can access your promotion database and provide
up-to-the-minute details about ongoing discounts, flash sales, and limited-time
offers. It can even suggest personalized promotions based on a user’s browsing
history.
Order Tracking: RAG can query your logistics system to provide customers
with real-time tracking information for their orders. It can also proactively no-
tify customers of any delays or issues.
To address the limitations of generative AI, researchers and engineers have de-
veloped innovative approaches, one of which is the Retrieval Augmented Gen-
eration (RAG) approach. RAG initially caught the interest of generative AI de-
velopers following the release of a seminal paper titled “Retrieval-Augmented
Generation for Knowledge-Intensive NLP Tasks” (Lewis et al. 2020) at Facebook
AI Research. RAG combines the strengths of generative AI with retrieval tech-
niques to enhance the quality and relevance of generated text. Unlike traditional
generative models that rely solely on their internal knowledge, RAG incorpo-
rates an additional step where it retrieves information from external sources,
such as databases, documents, or the web, before generating a response. This
integration of retrieval mechanisms empowers RAG to access up-to-date infor-
mation and context, making it particularly valuable for applications where ac-
curate and current information is critical.
In this chapter, we will delve deeper into the Retrieval Augmented Generation
(RAG) approach, exploring its architecture, advantages, and real-world applica-
tions. By doing so, we will gain a better understanding of how RAG represents a
significant step forward in improving the capabilities of generative AI and over-
coming the limitations posed by reliance on static training data. Understanding
the key concepts and components of this approach is essential for building an
effective chat-to-PDF app.
3. The initial retrieval may not always return the perfect answer, so the generation component can refine and enhance the response iteratively by referring back to the retrieval results.
4. Fine-Tuning: Successful implementation of this approach often requires
fine-tuning LLMs on domain-specific data. Fine-tuning adapts the model
to understand and generate content relevant to the specific knowledge
domain, improving the quality of responses.
5. Latent Space Representations: Retrieval models often convert docu-
ments and queries into latent space representations, making it easier to
compare and rank documents based on their relevance to a query. These
representations are crucial for efficient retrieval.
6. Attention Mechanisms: Both the retrieval and generation components
typically employ attention mechanisms. Attention mechanisms help
the model focus on the most relevant parts of the input documents and
queries, improving the accuracy of responses.
Generation Model: On the other hand, the generation model excels in crafting
coherent and contextually rich responses to user queries. It’s often based on
large language models (LLMs) like GPT-3, which can generate human-like text.
Figure 2.1 shows the RAG architecture.
In this section, we will focus on building the retrieval system, a critical compo-
nent of the chat-to-PDF app that enables the extraction of relevant information
from PDF documents. This section is essential for implementing the Retrieval-
Augmented Generation approach effectively.
Choosing the right retrieval model is a crucial decision when building your
chat-to-PDF app. Retrieval models determine how efficiently and accurately
the system can find and rank relevant documents in response to user queries.
Here are some considerations when selecting a retrieval model:
• Domain and Data Size: Consider the specific requirements of your chat-
to-PDF app. Some retrieval models may be more suitable for small or
specialized document collections, while others excel in handling large,
diverse corpora.
• Scalability: Ensure that the chosen retrieval model can scale to meet the
needs of your application, especially if you anticipate handling a substan-
tial volume of PDF documents.
2.5 Embeddings and Vector Databases for Retrieval in RAG
There is an abundance of tutorials and resources that can help you learn more about vector embeddings and get started.
All of these factors have resulted in numerous new vector databases. Selecting and depending on one of these databases can have long-lasting consequences and dependencies within your system. Ideally, we opt for a vector database that exhibits strong scalability, all while maintaining cost-efficiency and minimizing latency. Some of these vector databases are Qdrant, Weaviate, Pinecone, pgvector, Milvus, and Chroma.
Before your Chat-to-PDF app can effectively retrieve information from a vector
database, it’s imperative to preprocess the PDF documents and create a struc-
tured and searchable index for the preprocessed data. This searchable index
serves as the cornerstone of your application, akin to a meticulously organized
library catalog. It empowers your system to swiftly and accurately locate rele-
vant information within PDF documents, enhancing the efficiency and precision
of the retrieval process.
Figure 2.3 illustrates the RAG data ingestion pipeline. In Chapter 3, we will fully discuss how to prepare, index, and store the documents in a vector database.
2.7 Challenges of Retrieval-Augmented Generation
RAG heavily relies on the availability of high-quality and relevant data for both retrieval and generation tasks, and several challenges arise in this area.
2.7.3 Scalability
Scaling RAG systems to handle large volumes of data and user requests can be challenging.
3 RAG Pipeline Implementation
Before we can harness the power of Large Language Models (LLMs) and partic-
ularly RAG method for question answering over PDF documents, it’s essential
to prepare our data. PDFs, while a common format for documents, pose unique
challenges for text extraction and analysis. In this section, we’ll explore the
critical steps involved in preprocessing PDF documents to make them suitable
for our Chat-to-PDF app. These steps are not only essential for PDFs but are
also applicable to other types of files. However, our primary focus is on PDF
documents due to their prevalence in various industries and applications.
PDFs may contain a mix of text, images, tables, and other elements. To enable
text-based analysis and question answering, we need to extract the textual con-
tent from PDFs. Here’s how you can accomplish this:
• Text Extraction Tools: Explore available tools and libraries like PyPDF2,
pdf2txt, or PDFMiner to extract text from PDF files programmatically.
• Handling Scanned Documents: If your PDFs contain scanned images
instead of selectable text, you may need Optical Character Recognition
(OCR) software to convert images into machine-readable text.
• Quality Control: Check the quality of extracted text and perform any
necessary cleanup, such as removing extraneous characters or fixing for-
matting issues.
PDF documents can span multiple pages, and maintaining context across pages is crucial for question answering.

PDFs may introduce artifacts or inconsistencies that can affect the quality of the extracted text. To ensure the accuracy of question answering, perform text cleanup and normalization.
3.2 Data Ingestion Pipeline Implementation
As depicted in Figure 2.3, the first step of the data ingestion pipeline is extracting and splitting text from the PDF documents. There are several packages for this goal, including:
• PyPDF2
• pdfminer.six
• unstructured
Note
If you have scanned PDFs, you can utilize libraries such as unstructured, pdf2image, and pytesseract.
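To illustrate, here is a minimal, hedged sketch of OCR-based extraction with pdf2image and pytesseract; it assumes the poppler utilities and the Tesseract binary are installed, and the file name is only a placeholder:

```python
from pdf2image import convert_from_path  # requires the poppler utilities
import pytesseract                        # requires the Tesseract OCR binary

# Render each page of the scanned PDF as an image, then OCR it to plain text.
images = convert_from_path("scanned.pdf")  # placeholder file name
ocr_text = [pytesseract.image_to_string(image) for image in images]
print(f"Extracted text from {len(ocr_text)} scanned pages")
```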
Additionally, there are data loader hubs like LlamaHub that contain tens of data loaders for reading and connecting a wide variety of data sources to a Large Language Model (LLM).

Finally, there are frameworks like LlamaIndex and LangChain that facilitate developing applications powered by LLMs. They have implemented many of these data loaders, including ones for extracting and splitting text from PDF files.
Step 2: Load the PDF file and extract the text from it

The code below will iterate over the pages of the PDF file, extract the text, and add it to the documents list object; see Figure 3.1.
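Since the figure is not reproduced here, a minimal sketch of this step using PyPDF2 (with a placeholder file name) might look like the following:

```python
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")  # placeholder file name

# Extract the text of each page and collect it in the documents list.
documents = []
for page in reader.pages:
    documents.append(page.extract_text())

print(f"Loaded {len(documents)} pages")
```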
Now every page has become a separate document that we can later embed (vectorize) and store in the vector database. However, some pages could be very lengthy and others could be very short, as page length varies. This could significantly impact the quality of the document search and retrieval.
Additionally, LLMs have a limited context window (token limit), i.e., they can handle a certain number of tokens (a token roughly equals a word). Therefore, we instead first concatenate all the pages into a long text document and then split that document into smaller, relatively equal-sized chunks. We then embed each chunk of text and insert it into the vector database.
pdf_content[0] contains the entire content of the PDF and has a special structure. It is a Document object with properties including page_content and metadata. page_content is the textual content and metadata contains some metadata about the PDF. A partial output of the Document object for our PDF is shown in Figure 3.3.
A Document object is a generic class for storing a piece of unstructured text and its associated metadata; see the LangChain documentation for more information.
There are several different text splitters; for more information, see the LangChain API reference or the LlamaIndex documentation. Two common strategies, discussed further in Section 3.4, are character-based and token-based splitting.
The following code in Figure 3.4 chunks the PDF content into pieces no greater than 1,000 in size, with a bit of overlap to allow for some continued context.
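A minimal sketch of this chunking step, assuming the LangChain RecursiveCharacterTextSplitter and the pdf_content list loaded earlier (the overlap value here is an arbitrary illustrative choice):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the concatenated PDF content into chunks of at most 1,000 characters,
# with a small overlap so neighboring chunks share some context.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(pdf_content)  # pdf_content: list of Document objects

print(f"Number of chunks: {len(chunks)}")
```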
Here's the number of chunks created from splitting the PDF file.
In this step we need to convert the chunks of text into embedding vectors. There are plenty of embedding models we can use, including OpenAI, Hugging Face, and Cohere models. You can even define your own custom embedding model. Selecting an embedding model depends on several factors.
Similar to having several choices for embedding models, there are many options for choosing a vector database, a full comparison of which is out of the scope of this book. Figure 3.5 shows some of the most popular vector database vendors and some of the features of their hosting. This blog fully examines these vector databases from different perspectives.
Please note that there are several different ways to achieve the same goal (embedding and storing in the vector database). You can use the Qdrant client library directly instead of using the LangChain wrapper for it. You can also first create the embeddings separately and then store them in the Qdrant vector database. Here, we embedded the documents and stored them all by calling Qdrant.from_documents().
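A hedged sketch of this embedding-and-storage step with the LangChain wrapper might look as follows; the embedding model and collection name are illustrative choices, not necessarily the book's exact ones:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

embeddings = OpenAIEmbeddings()  # any supported embedding model could be used here

# Embed the chunks and store them in a local, on-disk Qdrant index in one call.
qdrant = Qdrant.from_documents(
    chunks,
    embeddings,
    path="/tmp/local_qdrant",      # local index, as described below
    collection_name="pdf_chunks",  # illustrative collection name
)
```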
In addition, you can use the Qdrant cloud vector database to store the embeddings and use its REST API to interact with it, unlike this example where the index is stored locally in the /tmp/local_qdrant directory. The local approach is suitable for testing and POC (proof of concept), not for a production environment.
We can now try searching and retrieving relevant documents from the vector database. For instance, let's see what the answer to the question "what is k-nearest neighbor?" looks like; see the output in Figure 3.7.
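A sketch of that retrieval call, reusing the qdrant object from the previous step, might be:

```python
query = "what is k-nearest neighbor?"

# Retrieve the two most similar chunks to the query from the vector database.
found_docs = qdrant.similarity_search(query, k=2)
for doc in found_docs:
    print(doc.page_content[:200], "...")
```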
Figure 3.9 illustrates a simplified version of the RAG pipeline we saw in Chapter 2. So far, the Retrieval component of the RAG pipeline is implemented. In the next section we will implement the Generation component.

3.3 Generation Component Implementation
• Step 1: Embed the user's query using the same model used for embedding documents
• Step 2: Pass the query embedding to the vector database, and search and retrieve the top-k documents (i.e., the context) from the vector database
• Step 3: Create a "prompt" and include the user's query and context in it
• Step 4: Call the LLM and pass it the prompt
• Step 5: Get the generated response from the LLM and display it to the user
Again, we can follow each step one by one, or utilize the features LangChain or LlamaIndex provide. We are going to use LangChain in this case.

LangChain includes several kinds of built-in question-answering chains. A chain in LangChain refers to a sequence of calls to components, which can include other chains, LLMs, prompts, and retrievers.
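As a hedged sketch (not necessarily the exact chain used in the book), a built-in question-answering chain over the Qdrant retriever could be wired up like this:

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)  # illustrative model choice

# "stuff" simply stuffs the retrieved chunks into the prompt as context.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=qdrant.as_retriever(search_kwargs={"k": 4}),
)
answer = qa_chain.run("what is k-nearest neighbor?")
print(answer)
```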
3.4 Impact of Text Splitting on Retrieval Augmented Generation (RAG) Quality
Several aspects of text splitting affect the quality of the RAG system, and we need to take them into consideration. We will discuss some of the main challenges and how to address them in the next few sections.
Advantages:
Challenges:
• Token Limitations: Most LLMs, such as GPT-3, have token limits, of-
ten around 4,000 tokens. If a document exceeds this limit, you’ll need to
truncate or omit sections, potentially losing valuable context.
• Increased Inference Time: Longer sequences require more inference
time, which can lead to slower response times and increased computa-
tional costs.
Advantages:
• Token Efficiency: Splitting text by token ensures that each input se-
quence remains within the model’s token limit, allowing for efficient pro-
cessing.
• Balanced Context: Each token represents a meaningful unit, striking a
balance between granularity and manageability.
• Scalability: Splitting by token accommodates documents of varying
lengths, making the system more scalable and adaptable.
Challenges:
In practice, you can also explore hybrid approaches to text splitting. For in-
stance, you might use token-level splitting for most of the document and switch
to character-level splitting when a specific question requires fine-grained con-
text.
The impact of text splitting on RAG quality cannot be overstated. It’s a critical
design consideration that requires a balance between capturing detailed con-
text and ensuring system efficiency. Carefully assess the nature of your PDF
documents, the capabilities of your chosen LLM, and user expectations to deter-
mine the most suitable text splitting strategy for your Chat-to-PDF app. Regular
testing and user feedback can help refine this choice and optimize the overall
quality of your RAG system.
3.5 Impact of Metadata in the Vector Database on Retrieval Augmented Generation (RAG)

The inclusion of metadata about the data stored in the vector database is another factor that can significantly enhance the quality and effectiveness of your Retrieval Augmented Generation (RAG) system. Metadata provides valuable
contextual information about the PDF documents, making it easier for the RAG
model to retrieve relevant documents and generate accurate responses. Here,
we explore the ways in which metadata can enhance your RAG system.
Metadata acts as contextual clues that help the RAG model better understand the content and context of each PDF document. Typical metadata includes information such as document titles, authors, creation dates, and categories.
With metadata available in the vector database, the retrieval component of your RAG system can become more precise and efficient.
Metadata can also play a crucial role in the generation component of your RAG system.
Including metadata in the RAG system not only enhances its technical capabili-
ties but also improves the overall user experience. Users are more likely to trust
and find value in a system that provides contextually relevant responses. Meta-
data can help build this trust by demonstrating that the system understands and
respects the nuances of the user’s queries.
Overall, incorporating metadata about data in the vector database of your Chat-
to-PDF app’s RAG system can significantly elevate its performance and user
experience. Metadata acts as a bridge between the user’s queries and the con-
tent of PDF documents, facilitating more accurate retrieval and generation of
responses.
As we conclude our exploration of the nuts and bolts of RAG pipelines in this
Chapter, it’s time to move on to more complex topics. In Chapter 4, we’ll take
a deep dive and try to address some of the retrieval and generation challenges
that come with implementing advanced RAG systems.
We’ll discuss the optimal chunk size for efficient retrieval, consider the balance
between context and efficiency, and introduce additional resources for evalu-
ating RAG performance. Furthermore, we’ll explore retrieval chunks versus
synthesis chunks and ways to embed references to text chunks for better under-
standing.
We’ll also investigate how to rethink retrieval methods for heterogeneous doc-
ument corpora, delve into hybrid document retrieval, and examine the role of
query rewriting in enhancing RAG capabilities.
4 From Simple to Advanced RAG
4.1 Introduction
Please note that the LlamaIndex framework has been used for several of the code implementations in this chapter. This framework also contains many advanced tutorials about RAG that inspired the content of this chapter.
A RAG pipeline consists of two components: (i) retrieval and (ii) response generation (synthesis). Each component has its own challenges.
Retrieval Challenges
• Low precision: When retrieving the top-k chunks, there’s a risk of in-
cluding irrelevant content, potentially leading to issues like hallucination
and generating inaccurate responses.
• Low recall: In certain cases, even when all the relevant chunks are re-
trieved, the text chunks might lack the necessary global context beyond
the retrieved chunks to generate a coherent response.
• Obsolete information: Ensuring the data remains up-to-date is critical
to avoid reliance on obsolete information. Regular updates are essential
to maintain data relevance and accuracy.
Generation Challenges
4.2 Optimal Chunk Size for Efficient Retrieval
Let's dive a bit more deeply into the aforementioned challenges and propose how to alleviate each one.
The chunk size in a RAG system is the size of the text passages that are extracted
from the source text and used to generate the retrieval index. The chunk size
has a significant impact on the system’s efficiency and performance in several
ways:
The chunk size should strike a balance between providing sufficient context for generating coherent responses (i.e., relevance and granularity) and ensuring efficient retrieval and processing (i.e., performance).
A smaller chunk size results in more granular chunks, which can improve the
relevance of the retrieved chunks. However, it is important to note that the
most relevant information may not be contained in the top retrieved chunks,
especially if the similarity_top_k setting is low (e.g. k=2). A smaller chunk size
also results in more chunks, which can increase the system’s memory and pro-
cessing requirements.
A larger chunk size can improve the system’s performance by reducing the num-
ber of chunks that need to be processed. However, it is important to note that
a larger chunk size can also reduce the relevance of the retrieved chunks.
The optimal chunk size for a RAG system depends on a number of factors, in-
cluding the size and complexity of the source text, the desired retrieval perfor-
mance, and the available system resources. However, it is important to experi-
ment with different chunk sizes to find the one that works best for your specific
system. But, how do we know what works and what doesn’t?
Answer: We have to define evaluation metrics and then use evaluation tools to measure how the RAG system performs on those metrics. There are several tools to evaluate RAG, including the LlamaIndex Response Evaluation module, which we can use to test, evaluate, and choose the right chunk size.
The code below shows how to use the evaluation module to determine the optimal chunk size for retrieval; see the original article for the full write-up.
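Since the code figures are not reproduced here, the following is a minimal sketch of the idea, assuming a LlamaIndex 0.8.x-era API (ServiceContext, FaithfulnessEvaluator, RelevancyEvaluator); the data directory, evaluation question, and chunk sizes are illustrative:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms import OpenAI

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data directory

# Use a strong model as the judge for both evaluators.
eval_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4", temperature=0))
faithfulness = FaithfulnessEvaluator(service_context=eval_context)
relevancy = RelevancyEvaluator(service_context=eval_context)

question = "What is a k-nearest neighbor classifier?"  # illustrative evaluation question

for chunk_size in (256, 512, 1024, 2048):
    ctx = ServiceContext.from_defaults(chunk_size=chunk_size)
    index = VectorStoreIndex.from_documents(documents, service_context=ctx)
    response = index.as_query_engine().query(question)
    faithful = faithfulness.evaluate_response(response=response).passing
    relevant = relevancy.evaluate_response(query=question, response=response).passing
    print(chunk_size, faithful, relevant)
```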
And Figure 4.4 demonstrates testing the evaluation function with different
chunk sizes.
They test across different chunk sizes and conclude (in this experiment) that chunk_size = 1024 yields the peak Average Faithfulness and Average Relevancy.
Here is a summary of tips for choosing the optimal chunk size for a RAG system:
• Consider the size and complexity of the source text. Larger and more com-
plex texts will require larger chunk sizes to ensure that all of the relevant
information is captured.
• Consider the desired retrieval performance. If you need to retrieve chunks quickly, you may want to use a larger chunk size.
• Consider the available system resources. If you are limited on system
resources, you may want to use a smaller chunk size. However, if you
have ample system resources, you may want to use a larger chunk size.
You can evaluate the optimal chunk size for your RAG system by using a
variety of metrics, such as:
• Relevance: The percentage of retrieved chunks that are relevant to the
query.
• Faithfulness: The percentage of retrieved chunks that are faithful to the
source text.
• Response time: The time it takes to retrieve chunks for a query.

Once you have evaluated the performance of your RAG system for different chunk sizes, you can choose the chunk size that strikes the best balance between relevance, faithfulness, and response time.
Here is another resource for evaluating RAG: this blog post from Databricks describes some best practices for evaluating RAG applications, and Figure 4.5 illustrates what their experiment setup looks like.

4.3 Retrieval Chunks vs. Synthesis Chunks
The chunk representation that works best for retrieval may not necessarily align with the requirements for effective synthesis. While a raw text chunk could contain essential
details for the LLM to generate a comprehensive response, it might also con-
tain filler words or information that could introduce biases into the embedding
representation. Furthermore, it might lack the necessary global context, mak-
ing it challenging to retrieve the chunk when a relevant query is received. To
give an example, think about having a question answering system on emails.
Emails often contain so much fluff (a big portion of the email is “looking for-
ward”, “great to hear from you”, etc.) and so little information. Thus, retaining
semantics in this context for better and more accurate question answering is
very important.
There are a few ways to implement this technique.
The main idea here is that we create an index in the vector database for storing document summaries. When a query comes in, we first fetch the relevant document(s) at the high level (i.e., summaries) before retrieving smaller text chunks directly, because direct chunk retrieval might return irrelevant chunks. Then we retrieve the smaller chunks from the fetched document(s). In other words, we store the data in a hierarchical fashion: summaries of documents, and chunks for each document. We can consider this approach as Dense Hierarchical Retrieval, in which a document-level retriever (i.e., the summary index) first identifies the relevant documents, and then a passage-level retriever finds the relevant passages/chunks. Y. Liu et al. (2021) and Zhao et al. (2022) give you a deep understanding of this approach. Figure 4.6 shows how this technique works.
We can choose different strategies based on the type of documents we are deal-
ing with. For example, if we have a list of web pages, we can consider each page
as a document that we summarize, and also we split each document into a set of
smaller chunks as the second level of our data store strategy. When user asks
a question, we first find the relevant page using the summary embeddings, and
then we can retrieve the relevant chunks from that particular page.
If we have a PDF document, we can consider each page of the PDF as a separate document, and then split each page into smaller chunks. If we have a list of PDF files, we can choose the entire content of each PDF to be a document and split it into smaller chunks.
We read the PDF file (Figure 4.7) and create a list of pages, as we will later view each page as a separate document.
Figure 4.7: Read a list of documents from each page of the pdf
In order to create an index, we first have to convert the list of texts into a list of Document objects compatible with LlamaIndex. We can then use the summary index to get the summary of each page/document using its document id; for instance, Figure 4.9 shows the summary of one document.
In this step, when a query comes in, we run a retrieval against the document summary index to find the relevant pages. Each retrieved document has links to its corresponding chunks, which are used to generate the final response to the query.
LLM-based retrieval: This approach is lower level, so we can view and change its parameters. Figure 4.11 displays the code snippet.
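As the code figures are not reproduced, here is a hedged sketch of building the summary index and running LLM-based retrieval over it, assuming a LlamaIndex 0.8.x-era API and the pages list extracted earlier:

```python
from llama_index import Document, ServiceContext, get_response_synthesizer
from llama_index.indices.document_summary import (
    DocumentSummaryIndex,
    DocumentSummaryIndexLLMRetriever,
)
from llama_index.llms import OpenAI

# pages: list of page texts extracted earlier; each page becomes one Document.
documents = []
for i, page_text in enumerate(pages):
    doc = Document(text=page_text)
    doc.doc_id = f"page-{i}"
    documents.append(doc)

service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0))
summary_index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,
    response_synthesizer=get_response_synthesizer(response_mode="tree_summarize"),
)

# Summary of a single page/document, looked up by its document id.
print(summary_index.get_document_summary("page-0"))

# LLM-based retrieval: the LLM picks the most relevant document summaries,
# and the chunks of those documents are returned for synthesis.
retriever = DocumentSummaryIndexLLMRetriever(summary_index, choice_top_k=1)
nodes = retriever.retrieve("what is k-nearest neighbor?")
```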
In this approach, we split the text into sentence-level chunks to be able to perform fine-grained retrieval. However, before passing the fetched sentences to the LLM response generator, we include the sentences surrounding each retrieved sentence to enlarge the context window for better accuracy. Please be mindful of the lost-in-the-middle problem when splitting large textual content at a very fine-grained level, such as the sentence level.
Figure 4.13: Expanding the sentence level context, so LLM has a bigger context
to use to generate the response
Implementation
Figure 4.14 shows the basic setup, such as importing the necessary modules, reading the PDF file, and initializing the LLM and embedding models.
Next, we have to define the nodes that are going to be stored in the VectorIndex as well as the sentence index. Then, we create a query engine and run the query. Figure 4.15 shows the steps.
Figure 4.15: Build the sentence index, and run the query
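Since the figures are not reproduced, here is a hedged sketch of the sentence-window setup, assuming a LlamaIndex 0.8.x-era API (SentenceWindowNodeParser plus MetadataReplacementPostProcessor) and a placeholder file name:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceWindowNodeParser

documents = SimpleDirectoryReader(input_files=["example.pdf"]).load_data()  # placeholder file

# Parse the text into single-sentence nodes, each carrying a window of
# surrounding sentences in its metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    node_parser=node_parser,
)
sentence_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# At query time, replace each retrieved sentence with its surrounding window
# before the LLM synthesizes the answer.
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
response = query_engine.query("what is k-nearest neighbor?")
print(response.source_nodes[0].node.metadata["original_text"])
print(response.source_nodes[0].node.metadata["window"])
```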
We can see the original sentence that was retrieved for each node (we show the first node below) as well as the actual window of sentences in Figure 4.17.
The retrieval process in a RAG-based application is all about retrieving the right and most relevant documents for a given user query. The retrieval method assigns a relevance score to each document based on its similarity to the query, sorts the documents in descending order, and returns them. Nevertheless, this approach might not work well when we are returning many documents, such as top-k >= 10. The reason is that when we pass a very long context to an LLM, it tends to ignore or overlook the documents in the middle. Consequently, putting the least relevant documents at the bottom of the fetched documents is not the best strategy. A better way is to place the least relevant documents in the middle.
N. F. Liu et al. (2023) in Lost in the Middle: How Language Models Use Long
Contexts demonstrated interesting findings about LLMs behavior. They realized
that performance of LLMs is typically at its peak when relevant information
is located at the beginning or end of the input context. However, it notably
deteriorates when models are required to access relevant information buried
within the middle of lengthy contexts. Figure 4.18 demonstrates the results.
Figure 4.17: Original sentence that was retrieved for each node, as well as the
actual window of sentences
Figure 4.18: Accuracy of the RAG based on the positions of the retrieved documents. Image source
They also show that LLMs with longer context windows still face this problem, and increasing the context window doesn't solve the issue. Figure 4.19 demonstrates this experiment.
Figure 4.19: Comparing LLM models with various context size and the impact
of changing the position of relevant documents
How can we alleviate this problem? The answer is to reorder the retrieved documents in such a way that the most similar documents to the query are placed at the top, the less similar documents at the bottom, and the least similar documents in the middle. Figure 4.21 shows how to use the LangChain solution to deal with this problem.
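A minimal sketch of that reordering with LangChain's LongContextReorder transformer, applied to an already-retrieved list of documents, is:

```python
from langchain.document_transformers import LongContextReorder

# retrieved_docs: documents already sorted by descending relevance.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(retrieved_docs)
# The most relevant documents now sit at the beginning and end of the list,
# and the least relevant ones end up in the middle.
```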
We can also use Haystack to deal with this problem. Haystack is an open source framework for building custom NLP apps with large language models (LLMs) in an end-to-end fashion. It offers components that serve as building blocks for various tasks like document retrieval and summarization, and we can connect these components to create an end-to-end pipeline.
Figure 4.21: LangChain approach for solving the lost-in-the-middle problem
Optimizing embeddings can have a significant impact on the results of your use cases. There are various APIs and providers of embedding models, each catering to different objectives.
However, determining which embedding model is the best fit for your dataset
requires an effective evaluation method.
Figure 4.22: Models by average English MTEB score (y) vs speed (x) vs embed-
ding size (circle size). Image source
For better results, you can still utilize open-source tools by applying them to
your specific data and use case. Additionally, you can enhance relevance by in-
corporating human feedback through a simple relevance feedback endpoint.
Constructing your own datasets is also important as you have a deep under-
standing of your production data, relevant metrics, and what truly matters to
you. This allows you to tailor the training and evaluation process to your spe-
cific needs.
It is worth noting that recent research and experiments have shown that em-
bedding models with the same training objective and similar data tend to learn
very similar representations, up to an affine linear transform. This means that
it is possible to project one model’s embedding space into another model’s em-
bedding space using a simple linear transform.
This is called linear identifiability, and it was discussed in the paper by Roeder and Kingma (2021), On Linear Identifiability of Learned Representations, from Google Brain.
Therefore, the selection of a particular embedding model may not be that impor-
tant if you are able to discover and implement a suitable transformation from
your own dataset.
Example: Assume a user asks a question and the answer to the user's question only involves two PDF files. We would rather first get those two relevant PDF documents and then find the actual answer from their chunks, instead of searching through thousands of text chunks. But how do we do that?
4.4 Rethinking Retrieval Methods for Heterogeneous Document Corpora
• Add metadata about each document and store that along with the docu-
ment in the vector database.
Figure 4.24 shows how metadata filtering can help the retrieval process. When a user asks a question, they can explicitly provide metadata, for instance by selecting filters from a dropdown list, or we can use an LLM to determine the metadata filters from the query and search the vector database using those filters. The vector database uses the filters to narrow the search down to the documents that match them, then finds the most similar chunks from those documents and returns the top-k chunks.
Please note that although we can add metadata to the text chunks after they are stored in the vector database, we should do that in the preprocessing step while we are splitting and embedding the documents, because if the vector database index becomes very large (i.e., we already have a great number of embeddings in the vector database), updating it will be significantly time-consuming.
Figure 4.25 shows how to define new metadata for text chunks, and how to use
filters to perform retrieval.
In this example, we load two PDF files: one is about machine learning interview questions, and the other is a research paper. We would like to add the topic or category of each file as metadata, so later we can restrict our search to
the category. Therefore, we update the initial metadata field by adding a cate-
gory property to each text chunk and then store them in the vector database.
If we print out the metadata of the documents, we can see that the category property has been added, as shown in Figure 4.26.
Now, we define the index and use it to perform search and retrieval, which is
shown in Figure 4.27.
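A hedged sketch of the filtered retrieval, assuming a LlamaIndex 0.8.x-era API, the nodes prepared above, and an illustrative category value, could be:

```python
from llama_index import VectorStoreIndex
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

index = VectorStoreIndex(nodes)  # nodes: text chunks carrying the added "category" metadata

# Restrict retrieval to chunks whose category matches the filter value.
filters = MetadataFilters(filters=[ExactMatchFilter(key="category", value="machine_learning")])
retriever = index.as_retriever(similarity_top_k=2, filters=filters)
results = retriever.retrieve("What is a decision tree?")
```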
Figure 4.25: Read the files and update the metadata property
Figure 4.27: Insert text chunks into the vector database and perform retrieval
But can we use an LLM to infer the metadata filters from the user query? The short answer is: yes.
Figure 4.30 shows the retrieval process, including how to define the vector index, vector store, and VectorIndexAutoRetriever object.
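A hedged sketch of the auto-retrieval setup, again assuming a LlamaIndex 0.8.x-era API and the existing index over the chunks, might look like this (the descriptions are illustrative):

```python
from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo

# Structured description of the collection and the metadata it supports,
# which the LLM uses to infer filters from the user query.
vector_store_info = VectorStoreInfo(
    content_info="chunks from a collection of PDF files",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description="Topic of the source file, e.g. 'machine_learning' or 'research_paper'",
        ),
    ],
)

auto_retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
results = auto_retriever.retrieve("Show me interview questions about machine learning")
```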
Figure 4.29: Define text node and metadata for auto retrieval
4.5 Hybrid Document Retrieval
Merging the results from keyword-based and semantic retrievers can be ap-
proached in several ways, depending on the nature of the RAG application:
2. Reciprocal Rank Fusion (RRF): RRF operates with a formula that re-
ranks documents from both retrievers, giving priority to those that appear
in both results lists. Its purpose is to elevate the most relevant documents
to the top of the list, thereby enhancing the overall relevance of the results.
RRF is particularly useful when the order of the results is important or
when you intend to pass on only a subset of results to the subsequent
processing stages.
There are frameworks that support hybrid retrieval out of the box, such as ElasticSearch, Haystack, Weaviate, and Cohere Rerank. Let's find out how to implement this approach using Haystack. The following steps are based on the Haystack documentation, which contains all of the implementation details. The documents used for this example are abstracts of papers from PubMed, available as a downloadable dataset.
Step 1: We load the dataset and initialize the document store (i.e. vector
database). Figure 4.32 shows this step.
Step 2: Define the retrievers, insert the documents and embeddings into the
document store and choose the join document strategy. You can see this step in
Figure 4.33.
Step 3: Create an end-to-end pipeline in Haystack and perform the hybrid re-
trieval for a query. This step is depicted in Figure 4.34.
4.6 Query Rewriting for Retrieval-Augmented Large Language Models
Query rewriting is closely tied to the document retrieval phase in RAG systems.
It contributes to better retrieval rankings, which, in turn, leads to more informa-
tive and contextually relevant answers generated by the language model. The
goal is to ensure that the retrieved documents align closely with the user’s in-
tent and cover a wide spectrum of relevant information.
Question: With the advent of Large Language Models (LLMs) that have revolutionized natural language understanding and generation tasks, can we use them for query rewriting? The answer is yes. They can help in two primary ways: query expansion and generating better prompts. Figure 4.35 shows how query rewriting works. We can use an LLM to either expand (enhance) a query or generate multiple (sub)queries for a better retrieval process.
Figure 4.35: Query re-writing using LLMs. LLM can expand the query or create
multiple sub-queries.
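A minimal sketch of LLM-based query expansion; the prompt wording and model choice are illustrative, not the book's exact setup:

```python
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

rewrite_prompt = (
    "Generate three search queries that together cover the intent of the "
    "following question, one per line.\n\nQuestion: {question}"
)
# Each generated sub-query can be sent to the retriever separately
# and the results merged before synthesis.
sub_queries = llm.predict(rewrite_prompt.format(question="Market trends for electric cars in 2021"))
print(sub_queries)
```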
LLMs can assist in generating more effective prompts for retrieval, especially
when a prompt-based retrieval mechanism is employed. Here’s how LLMs con-
tribute to prompt generation.
Query Refinement: LLMs can refine a user query by making it more concise,
unambiguous, and contextually relevant. The refined query can then serve as a
prompt for the retrieval component, ensuring a more focused search.
Example: User Query: “Market trends for electric cars in 2021”. LLM-Generated
Multi-Step Prompts:
The LLM generates sequential prompts to guide the retrieval component in find-
ing documents related to market trends for electric cars in 2021.
Context-Aware Prompts: LLMs can consider the context of the available doc-
ument corpus and generate prompts that align with the characteristics of the
documents. They can adapt prompts for specific domains or industries, ensur-
ing that the retrieval phase retrieves contextually relevant documents.
Example: User Query: “Legal documents for the healthcare industry”. LLM-
Generated Domain-Specific Prompt: “Retrieve legal documents relevant to the
healthcare industry.”
Understanding the context, the LLM generates a prompt tailored to the health-
care industry, ensuring documents retrieved are pertinent to that domain.
Query rewriting for RAG is an active research area, and new approaches are suggested regularly. One recent work is Ma et al. (2023), in which the authors propose a new framework for query rewriting; see the paper to learn more about their method.
4.7 Query Routing in RAG
Rather than relying on a fixed retrieval method, query routing empowers the
system to intelligently assess the user’s query and select the appropriate re-
trieval mechanism. This approach is particularly powerful in scenarios where
a diverse range of retrieval techniques or tools can be employed to answer dif-
ferent types of queries. Figure 4.36 shows the general architecture of this approach.
A user enters a query into the RAG system. This query could encompass
various types of information needs, such as fact-based lookup, summa-
rization, translation, question answering, etc.
2. Query Analysis:
Based on the analysis, the router selects the retrieval technique or tool that best matches the query's requirements. This selection is often driven by heuristics, pre-defined rules, machine learning models, or a combination of these methods.
Figure 4.37 shows how to use a zero-shot classifier to categorize user queries.
Figure 4.37: Using a zero-shot classifier to categorize and route user queries
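A hedged sketch of zero-shot routing with the Hugging Face transformers pipeline; the candidate routes, example query, and model are illustrative:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

routes = ["summarization", "fact-based question answering", "translation"]
result = classifier("Summarize the attached quarterly report.", candidate_labels=routes)

chosen_route = result["labels"][0]  # the highest-scoring route
print(chosen_route, result["scores"][0])
```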
2. Prompt-Based Routing:
The code in Figure 4.38 shows how to use an LLM (text-davinci-002) for query routing.
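A hedged sketch of prompt-based routing with text-davinci-002, assuming the pre-1.0 openai SDK and an illustrative routing prompt:

```python
import openai  # assumes the pre-1.0 openai SDK and OPENAI_API_KEY in the environment

prompt = (
    "Classify the user query into exactly one category: "
    "summarization, question_answering, or translation.\n"
    "Query: How many electric cars were sold in 2021?\n"
    "Category:"
)
response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=5,
    temperature=0,
)
route = response["choices"][0]["text"].strip()
print(route)  # downstream code dispatches to the matching retriever
```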
3. Rule-Based Routing:
Let's take a look at an example where we use the LlamaIndex framework for routing queries between a summarization route and a fact-based route. LlamaIndex has a concept called RouterQueryEngine, which accepts a set of QueryEngineTool objects as input. A QueryEngineTool essentially wraps a query engine built on an index used for retrieving documents from the vector database. The first step is to load the documents and define the summary index and fact-based index, which is displayed in Figure 4.39.
Then, we create QueryEngine objects for each index and add them to the
RouterQueryEngine object. Figure 4.40 shows this step.
Figure 4.40: Define QueryEngine and RouterQueryEngine objects, and run the
engine for user queries.
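A hedged sketch of these two steps, assuming a LlamaIndex 0.8.x-era API and a placeholder input file; the tool descriptions are illustrative and guide the LLM selector:

```python
from llama_index import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools import QueryEngineTool

documents = SimpleDirectoryReader(input_files=["example.pdf"]).load_data()  # placeholder file

# Two indexes: one suited to summarization, one to fact-based lookup.
summary_index = SummaryIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for summarization questions about the document.",
)
fact_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    description="Useful for retrieving specific facts from the document.",
)

# The router asks the LLM to pick the tool whose description best matches the query.
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, fact_tool],
)
print(router_engine.query("Give me a high-level summary of the document."))
```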
4.8 Leveraging User History to Enhance RAG Performance

4.8.1 Challenge
4.8 Leveraging User History to Enhance RAG Performance
these challenges, RAG systems can benefit from using a vector database for
memory.
The implementation of a vector database for memory in a RAG system involves several key steps, sketched below: embedding each conversational turn, storing it in the vector database alongside metadata such as timestamps, and retrieving the most relevant past turns when a new query arrives.
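A hedged sketch of such a memory layer, using the LangChain Qdrant wrapper with an in-memory collection (all names here are illustrative):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

embeddings = OpenAIEmbeddings()

# Seed an in-memory collection with the first conversational turn.
memory_store = Qdrant.from_texts(
    ["User: What is RAG?\nAssistant: It combines retrieval with generation."],
    embeddings,
    location=":memory:",
    collection_name="chat_memory",
)

def remember(query: str, answer: str) -> None:
    # Store each completed turn so it can be recalled later.
    memory_store.add_texts([f"User: {query}\nAssistant: {answer}"])

def recall(query: str, k: int = 3) -> str:
    # Fetch the k most relevant past turns to prepend to the RAG prompt.
    hits = memory_store.similarity_search(query, k=k)
    return "\n\n".join(doc.page_content for doc in hits)

remember("How do I chunk PDF documents?", "Split the extracted text into overlapping chunks before embedding.")
print(recall("What did we say about chunking?"))
```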
Incorporating a vector database for memory in a RAG system enhances its abil-
ity to provide contextually relevant and coherent responses by efficiently man-
aging and retrieving historical information. This approach is particularly valu-
able in scenarios where extensive conversation histories or long-term memory
are essential for the application’s success.
As we wrap up our exploration of advanced RAG systems in this Chapter, we
are on the cusp of a new frontier. In Chapter 5 - “Observability Tools for RAG,”
we will discuss various observability tools tailored for RAG systems. We will ex-
plore their integration with LlamaIndex, including Weights & Biases, Phoenix,
and HoneyHive. These tools will not only help us monitor and evaluate the per-
formance of our RAG systems but also provide valuable insights for continuous
improvement.
5 Observability Tools for RAG
2. Storage: Data collected by the tool is stored for analysis and historical
reference. The storage can be in the form of time-series databases, log
repositories, or other storage solutions designed to handle large volumes
of data.
9. Scalability: The tool should be able to scale with the system it monitors.
It needs to handle growing data volumes and provide insights without
compromising performance.
10. Customization: Users should be able to customize the tool to meet the
specific needs of their system. This includes defining custom dashboards,
alerts, and data collection methods.
There are several observability tools for RAG-based systems, and frameworks like LlamaIndex provide an easy way to integrate some of these tools with a RAG application.
5.1 Weights & Biases Integration with LlamaIndex
Figure 5.1: General pattern for integrating observability tools into LlamaIndex
The code depicted in Figure 5.2 shows how to integrate W&B with LlamaIndex; for a complete example, please see the W&B and LlamaIndex documentation.
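A hedged sketch of the W&B integration via LlamaIndex's callback handler, assuming a LlamaIndex 0.8.x-era API and an illustrative project name:

```python
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, WandbCallbackHandler

# Send LlamaIndex traces and events to a Weights & Biases project.
wandb_callback = WandbCallbackHandler(run_args={"project": "practical-rag"})  # illustrative project name
callback_manager = CallbackManager([wandb_callback])

service_context = ServiceContext.from_defaults(callback_manager=callback_manager)
# Any index built with this service_context will now log traces to W&B.
```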
If we go to the W&B website and log in, we can see all the details; Figure 5.4 displays our project, including charts, artifacts, logs, and traces.
5.2 Phoenix Integration with LlamaIndex
When we run queries, we can see the traces in real time in the Phoenix UI.
Figure 5.6 illustrates the Phoenix UI for a RAG application.
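A hedged sketch of wiring Phoenix into LlamaIndex, assuming the arize-phoenix package and LlamaIndex's set_global_handler support:

```python
import phoenix as px
import llama_index

# Launch the local Phoenix app and route all LlamaIndex traces to it.
px.launch_app()
llama_index.set_global_handler("arize_phoenix")

# From here on, every query run through a LlamaIndex query engine
# shows up as a trace in the Phoenix UI.
```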
5.3 HoneyHive Integration with LlamaIndex

HoneyHive is a framework that can be used to test, evaluate, monitor, and debug LLM applications. It can be seamlessly integrated into LlamaIndex applications, as displayed in Figure 5.7.
The HoneyHive dashboard is shown in Figure 5.8.
6 Ending Note
6.1 Acknowledgements
References
Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. 2020. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems 33: 9459–74.
Liu, Nelson F, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua,
Fabio Petroni, and Percy Liang. 2023. “Lost in the Middle: How Language
Models Use Long Contexts.” arXiv Preprint arXiv:2307.03172.
Liu, Ye, Kazuma Hashimoto, Yingbo Zhou, Semih Yavuz, Caiming Xiong, and
Philip S Yu. 2021. “Dense Hierarchical Retrieval for Open-Domain Question
Answering.” arXiv Preprint arXiv:2110.15439.
Ma, Xinbei, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023.
“Query Rewriting for Retrieval-Augmented Large Language Models.” arXiv
Preprint arXiv:2305.14283.
Roeder, Geoffrey, Luke Metz, and Durk Kingma. 2021. "On Linear Identifiability of Learned Representations." arXiv Preprint arXiv:2007.00810.
Zhao, Wayne Xin, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. 2022. “Dense Text
Retrieval Based on Pretrained Language Models: A Survey.” arXiv Preprint
arXiv:2211.14876.
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu,
Yonghao Zhuang, Zi Lin, et al. 2023. “Judging LLM-as-a-Judge with MT-
Bench and Chatbot Arena.” arXiv Preprint arXiv:2306.05685.