Generative AI On AWS

Download as pdf or txt
Download as pdf or txt
You are on page 1of 208

Generative AI on AWS

Building Multimodal Generative AI Applications


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.

Antje Barth, Chris Fregly, and Shelbee


Eigenbrode
Generative AI on AWS
by Antje Barth, Chris Fregly, and Shelbee Eigenbrode
Copyright © 2025 Antje Barth, Flux Capacitor, LLC, and Shelbee
Eigenbrode. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield

Development Editor: Sara Hunter

Production Editor: Gregory Hyman

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Kate Dullea

October 2024: First Edition

Revision History for the Early Release


2023-08-15: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098159221 for release


details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Generative AI on AWS, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not
represent the publisher’s views. While the publisher and the authors have
used good faith efforts to ensure that the information and instructions
contained in this work are accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes
is subject to open source licenses or the intellectual property rights of others,
it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
978-1-098-15922-1
Chapter 1. Prompt Engineering
and In-Context Learning

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the second chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

In this chapter, you will learn about low-code ways to interact with
generative AI models - specifically, prompt engineering and in-context
learning. You will see that writing prompts is both an art and a science that
helps the model generate better and more-applicable responses. You will
also see some best practices when defining prompts and prompt templates to
get the most out of your generative models.
You will also see how to use in-context-learning to pass multiple prompt-
completion pairs (e.g. question-answer pairs) in the “context” along with
your prompt input. This in-context learning nudges the model to respond
similar to the prompt-completion pairs in the context. This is one of the more
remarkable, mysterious, and lightweight capabilities of generative models
that temporarily alters the model’s behavior for the duration of that single
request-response.
Lastly, you will learn some of the most-important generative configuration
parameters like `temperature` and `top k` that control the generative model’s
creativity when creating content.

Prompt Engineering
Prompt engineering is a new and exciting skill focused on how to better
understand and apply generative models to your tasks and use cases.
Effective prompt engineering helps you push the boundaries of generative AI
and get the most out of your generative-based applications.
The text that you send into a generative model is typically called the
“prompt”. This prompt is passed to the model during inference time to
generate a “completion”. Below is an example question-answer prompt and
completion between a “Human” and the generative AI “Assistant”. Note that
the generative model is simply completing the Human’s prompt following the
term “Assistant:”
Figure 1-1. A simple question produces a lengthy response from the model

You may have to rewrite your prompt several times to get a proper and
precise response as some of these generative models are quite chatty. Prompt
engineering is a learned skill that requires many iterations across many
different model types and linguistic nuances - and often depends on how the
model was trained and fine-tuned.
Most modern human-facing chat models have been fine-tuned using some
form of human labeled data - often with reinforcement learning as we will
explore in Chapter 7. This is often why you see some form of `Human:` and
`Assistant:` in most chat prompts. These are required to indicate the start of
the input question and output answer, respectively. However, they are often
model-specific. Using different indicators may result in “off-distribution”
and undesirable results.
Next, you’ll explore some prompt structures and techniques to get the most
out of off-the-shelf generative AI models.

Prompt Structure
The prompt structure used in the previous example is a simple chat-assistant
prompt structure which implies the question-answer tasks and uses “Human:”
as the input indicator and “Assistant:” as the output indicator.
A more-complete prompt structure includes a section for each of the
following: instruction, context, input data, and output indicator. A
restructured version of the previous chat example using the more-complete
prompt structure is shown below. Here, the prompt includes the instruction
on the first line followed by the context. `Human:` indicates the start of the
input data and `Assistant:` is the output indicator where the model generates
the completion.
Figure 1-2. Restructured with instruction, context, input data, and output indicator

The prompt does not need all four elements, however. Nor does it require the
elements to follow the exact order used here. The ideal prompt structure may
vary depending on the task as well as the size of the combined context and
input data.
In addition, the best prompt structure depends on how the generative model
was trained and fine-tuned. Therefore, it’s important to read the
documentation for a given generative model to gain intuition into the prompt
templates used during training and tuning. Optimizing the prompt and prompt
structure is all part of prompt engineering!
If you know which datasets and prompt templates were used during model
training and fine-tuning, you can often find a more-relevant and successful
prompt structure by following a pattern similar to the prompt templates.
Below, you see the `samsum` prompt template and the `samsum` dataset used
together to fine-tune the popular FLAN-T5 model for dialog summarization
among many other tasks.

"samsum": [
("{dialogue}Briefly summarize that dialogue.", "{summary}"),
("Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
"{summary}"),
("Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?",
"{summary}"),
("{dialogue}\n\nWhat was that dialogue about, in two sentences or less?",
"{summary}"),
("Here is a dialogue:\n{dialogue}\n\nWhat were they talking about?",
"{summary}"),
("Dialogue:\n{dialogue}\nWhat were the main points in that "
"conversation?", "{summary}"),
("Dialogue:\n{dialogue}\nWhat was going on in that conversation?",
"{summary}"),
]
Figure 1-3. samsum dialog-summarization dataset

By applying the template to the dataset, the creators of FLAN-T5 created a


prompt dataset consisting of rows of `dialogue` and human-curated
`summary` columns using `{}` similar to how f-strings use variables in
Python. The resulting prompt dataset contains multiple prompt variations for
each row of data in the `samsum` dataset.
Humans - often linguistics experts - have curated these prompt templates to
provide a good set of examples to give your model a good mix of prompts to
learn the given task, dialog-summarization in this case, during fine-tuning.
You will learn more about instruction fine-tuning in Chapter 4, but with the
knowledge of the different variants of prompts, it becomes much easier to
create a prompt that works well with your model.
Next, you will see how to further enrich the prompt context to evoke an
emergent and thought-provoking property of generative AI models called in-
context learning with few-shot inference.

In-Context Learning with Few-Shot Inference


A powerful technique to help your generative model produce better
completions for your prompt is to include a few prompt-completion pairs
inside the context portion of your prompt. This is called in-context learning
with few-shot inference. You actually saw this in the previous human-
assistant example in which we passed a few examples, called “shots”, as
part of the context before the `Human:` input data and `Assistant:` output
indicator.
In this case, by adding just a few examples to the context, you helped the
model respond with just the winner of the baseball World Series - and not all
of the other details generated in the “zero-shot” example in a previous
section in which no examples were provided in the context.
As you’ve probably guessed, if you pass in one prompt-completion pair into
the context, this is called “one-shot” inference. Below are the zero-shot, one-
shot, and few-shot examples side-by-side to help you visualize their
differences in both prompt context and model response.
Figure 1-4. The model generates a more focused and relevant response as you add more in-
context examples called “shots”.

There are subtle differences between these completions. With more examples
or shots, the model more-closely follows the pattern of the response of the
in-context prompt-completion pairs which only mentions the winning team,
the Chicago Cubs, and losing team, the Cleveland Indians - and nothing more.
The zero-shot completion, without the additional prompt-completion pairs in
context, includes additional information such as the previous time the
Chicago Cubs won the baseball World Series in 1908 - over a century
earlier, by the way.
NOTE
2016 was a great year for one of the authors of this book who is a life-long Chicago Cubs
fan!

By adding a few prompt-completion examples as additional context, the


model learns to modify its response for just that request based on the context.
It’s important to provide a consistent and appropriate distribution of prompt-
completion few-shot examples to allow the model to learn properly in-
context.
You should make sure that your context does not increase your prompt length
above the input size or “context window” of the given generative model.
Each model has a fixed context window size - anywhere from a few hundred
to a few hundred thousand tokens. This is often due to algorithmic limitations
of the underlying neural-network architecture including the Transformer. You
should read the documentation and consider the size of the model’s context
window when choosing a generative model.
In-context learning is very useful, but if you find yourself using upwards of 5
or 6 examples in your context, you may need to choose a different model,
train a new model from scratch (called “pre-training”), or fine-tune an
existing model. In the next few chapters, you will explore various methods to
pre-train and fine-tune a foundational model.

NOTE
It’s worth noting that in-context learning does not modify the model in any way. The model
adjusts - or learns - on-the-fly for the duration of that single request using the context
provided in the prompt. This is a truly remarkable and somewhat mysterious property of
generative models which can be used in many creative ways.

In Chapter 9, you will see how to further augment the prompt using data
stores (e.g. databases and knowledge stores) and APIs (e.g. web searches
and custom APIs). This is called retrieval-augmented generation (RAG) and
is part of the larger generative AI ecosystem that helps augment prompts with
domain knowledge and external tools. RAG improves model responses
across many generative tasks and use cases.
While some of the larger, more recent models provide good responses with
zero-shot inference, some of the smaller models may require in-context
learning with one-shot or few-shot inference that include examples of the
desired response.
As larger and larger models have been trained, it has become clear that the
ability of models to perform multiple tasks - and how well they perform
those tasks - depends strongly on the scale of the model.
Models with more parameters are typically able to capture more
understanding of language. The largest models are surprisingly good at zero-
shot inference, and are able to infer and successfully complete many tasks
that they were not specifically trained to perform.
In contrast, smaller models are generally only good at a small number of
tasks, typically those that are similar to the task they were trained on. You
may have to try out a few models to find the right one for your use case.
It’s worth noting that you can “trick” a model into temporarily in-context
learning an incorrect answer. For example, you can pass three in-context
prompt-completion examples that demonstrate a positive customer review as
a NEGATIVE sentiment and a negative customer review as a POSITIVE
sentiment as shown below.
Figure 1-5. Few-Shot, In-Context Prompt with Opposite Sentiment

In this case, inference requests made to the model with this prompt are more
likely to return the opposite sentiment. This is an alarming and mischievous
quality of in-context learning. And this is relatively easy to do in practice, so
it’s worth double-checking your in-context prompt-completion pairs
carefully.
Next, you’ll explore some prompt-engineering best practices to improve the
responses from your generative AI models.

Prompt-Engineering Best Practices


Using different prompt-augmentation strategies, you will often generate better
completions from off-the-shelf foundation models without having to re-train
or fine-tune the model. This can save you significant time and money. Below
are some best practices to help you construct effective prompts for better
generative results.
Be clear and concise. Prompts should be simple, straightforward, and avoid
ambiguity. Clear prompts lead to more coherent responses.
Be creative. New and thought-provoking prompts can effect different
outcomes from the model. A novel prompt may lead to an unexpected and
innovative outcome.
Move the instruction to the end of the prompt for large amounts of text.
If the context and input data are long, try moving the instruction to the end
right before the output indicator
Figure 1-6. Prompt structure with a small vs. large amount of data

Convey the force of the question. Clearly state one of the following: who,
what, where, when, why, how, etc.
Use explicit directives. If you want the model to output in a particular format
, specify that directly. For example, “Summarize the following customer-
support dialog in a single sentence:”
Use simple language. Prompts should use natural language and coherent
sentence structure. Avoid using just single-words or “baby” or “pet” phrases.
Avoid negative formulations. Negative formulations, while syntactically
correct, may cause confusion. For example, use “Summarize in 5 sentences
or less” instead of “Summarize in no more than 5 sentences”. Avoid negative
formulations if a more-straightforward linguistic variation exists.

NOTE
General rule of thumb at this stage of the generative AI maturity cycle is this: if the
wording is confusing to humans, it is more-likely to be confusing to these models. Simplify
when possible.

Include context and few-shot example prompts. Provide additional context


that helps the model respond more accurately. You can specify a single
context across all inputs or a specific context for each input. You saw
examples of including additional context in the previous examples.
Specify the size of the response. Include the requested output size at the
end of the prompt to focus the model. E.g. “List the top 3 complaints from the
following customer-support conversation:”
Define what to do if the model can’t answer confidently. You can often
ask the model to respond with, “I don’t know”, if it cannot confidently
respond to the prompt. Otherwise, the model may generate a “hallucination”
response. While hallucinations are often fun to share with your co-workers
during development, they are not fun to share with your customers in
production!
Figure 1-7. Prompt that causes Hallucination
Figure 1-8. Prompt that Allows “I don’t know”

Provide a specific response format. Give the response format using an


example. Include brackets for clarity. For example, “Summarize this
document article in 10 words or less as shown here: [New generative AI
model beats X benchmark by Y %.]”
Ask if the model understands the instruction. Foundation models may get
confused when asked to perform complex instructions that require multiple
steps. It’s important to recognize when the model is getting confused - and
when to break the prompt down into multiple steps.
Ask the model to “think step-by-step”. If the model is confused about the
instructions, you can ask the model to “think step-by-step” which gives the
model the freedom to break a single instruction into multiple steps. This is
called “chain of thought” prompting. Depending on how they were trained
and tuned, some models may respond to other variants such as “divide into
subtasks”, “approach the problem systematically”, “reason through the
problem one step at a time”, etc.

Figure 1-9. Prompt without Chain of Thought


Figure 1-10. Prompt with Chain of Thought (think step by step)

Add constraints for more control. Constrain responses by length, format,


included information, excluded information, etc. E.g. “Summarize this
process in exactly 5 steps:”
Evaluate the response. This seems obvious, but it’s worth noting that you
should review the models’ responses to ensure the responses are high-quality
and appeal to your audience. Make changes to the prompts as needed.
Evaluating responses at scale is an open area of research. Human evaluation
does not scale well and automated evaluation may miss the nuances of human
language.
Use disclaimers or avoid prompts that the model should not answer: If
your generative model is not equipped to respond to certain domains like
legal, medical, or religion, you can instruct your model to respond with
something like, “I am not licensed to provide medical advice. Please seek a
licensed medical professional in your area.”
Use step-by-step action plans for more-complex action plans. Building
upon chain-of-thought prompting, some models are capable of generating
step-by-step action plans carried out by tools such as a web search, a SQL
query, or a Python-based calculator script, for example. This is a very
exciting and active space in generative-based application development. You
will dive deeper into ReAct, LangChain, PAL, and other generative
application frameworks in Chapter 10, but here is a quick example that uses
a fictitious Python-based calculator script.
Figure 1-11. Prompt without Calculator Python Script
Figure 1-12. Prompt with Intermediate Reasoning Steps and Python Calculator

Use XML/HTML tags in your prompt. Some models support XML/HTML


tags like <tag>this is important</tag> to create structure within the prompt.
For example, if you want to reference an important piece of text in your input
data, you can wrap that text in a tag to indicate where the important text starts
and ends. You also ask some models to tag important parts of the response so
you can parse the response and extract important data in a structured way.
Selective focus: You can ask the model to only focus on certain parts of the
input text. For example, you can ask that the model summarize only the first
and last paragraph of your input data.
Combine the techniques above. By combining chain-of-thought prompting
with additional context, for example, your model has a better chance of
understanding the task and providing a better completion.

Figure 1-13. Prompt with Combined “Think Step by Step” and “I don’t know”

By trying different prompts, you see what works and what doesn’t work for
your prompt, model, and use case combination. Continue to refine your
prompt as needed. With more and more experimentation, you will gain the
necessary intuition to quickly create and optimize a prompt to best suit your
task and use case. Prompt engineering is an iterative skill that improves with
practice. Prompt-optimization is not as clear or well-studied as classical
numerical-optimization techniques which may be frustrating.
Take this time to enjoy the creative and non-deterministic side of generative
AI. At a minimum, you’ll enjoy a good laugh when the model surprises you
with a seemingly random response to a question that you did not intend to
ask. Next, you’ll see some common generative inference-specific parameters
that influence the creativity of the generative model response. This is where
the fun begins!

Adversarial Prompts
It’s important to understand the safety issues caused by adversarial prompts
with generative AI. By understanding the risks, you can better evaluate the
safety of your model - and address any issues upfront before releasing into
production. Some common examples of adversarial prompts and prompt
misuse include prompt injection, prompt leaking, and jailbreaking. Without
the proper guardrails, some models may respond to adversarial prompts with
harmful, dishonest, and unethical responses. While this is not an exhaustive
summary of the many adversarial attack vectors, these are a few of the most
common.
Prompt injection is a technique that attackers use to influence the outputs of
generative models by adding malicious instructions directly in the prompt.
For example, an attacker may ask the model to generate responses to induce
self-harm, create fake news, or promote bias at scale.

NOTE
It’s worth noting that prompt injection is also used to create non-malicious prompts.
However, the term “prompt injection”, like “SQL injection” and other malicious
“injections”, has a poor connotation.

One mitigation technique is to add a hard-coded “defense instruction”


directly into the prompt as shown below. Combined with tagging and
isolation of the user-provided instructions (from an attacker, presumably), the
model may be able to defend against an attack. This is not a complete
solution, but it’s an relatively easy way to provide a first-level defense for
certain adversarial attacks.

Figure 1-14. Prompt with additional defense for instructions that include the word “hack” or
“harm”

Prompt leaking occurs when the generative AI model leaks sensitive


information. For example, consider a generative system trained on private
customer data to make product recommendations. An attacker can potentially
alter the prompt to encourage the model to respond with private customer
data such as order history or salary information used by the system to make
product recommendations or loan decisions, respectively.
Figure 1-15. Prompt with leaked personal information

Jailbreaking is an adversarial prompt technique that causes the generative


model to escape the original constraints provided by the model creators. A
simple example is asking the model to forget its original constraints and only
perform your instructions as shown below.
Figure 1-16. Prompt with Instructions to By-Pass Original Constraints

A more frightening example of jailbreaking is if the attacker somehow allows


the generative model to access data or systems outside of its constrained
environment and into the real world. This could potentially cause potentially
harmful effects that the model creator did not intend.
Many of the recent foundation models have been trained and fine-tuned to
address adversarial prompts. However, there are still many bad actors trying
to find new ways to hack these models, so it is recommended that you always
test your model for these types of attacks before releasing them to production.
This includes testing the boundaries of the system in which these models are
deployed.
The field of adversarial-prompt detection and prevention is a vast and
expanding field of research that, similar to the field of malware-prevention,
will continue to expand over the years to potentially include thousands of
guardrails across the many potential generative AI attack vectors. There are
already some pretty advanced techniques to detect and prevent adversarial
prompt attacks that involve step-by-step reasoning, advanced linguistic
analysis, and external evaluation tools. This is an exciting field to follow.
Are you still interested in generative AI?! We hope so as there is a lot more
to cover. Once you have found a set of prompts that are working for you,
there are a few model-inference settings you can use to influence the
creativity and style of the model completions as you will see next.

Generation Inference Parameters


Here, you’ll examine some configuration parameters to influence the way
your model generates text during inference. If you’ve used generative models
in a “playground” such as Amazon SageMaker or Bedrock, you have likely
seen slides and other numerical controls as shown below.
Figure 1-17. Generative configuration parameters to control model outputs

These inference parameters influence the model’s completion to your prompt.


They give you fine-grained control over the length of the model response - as
well as the creativity. Each model exposes a different, but often overlapping
set of inference parameters. Often these parameters are named the same
between models - or similar enough to reason through when you try out
different models. Here, you’ll see a few of the most common inference
parameters.
Max new tokens. This is one of the most obvious and straightforward
parameters to tune. Use this parameter to limit the number of new tokens
generated by the model. This is a very basic mechanism to keep model
responses short and prevent rambling. This is not a mechanism to prevent
hallucinations, however. This may merely mask the hallucination by reducing
its length.
Greedy vs. random sampling. During model inference, the Transformer
neural network produces a probability distribution across all tokens in the
model’s known vocabulary. The model chooses - or samples - a single token
from this distribution as the next token to include in the response. For each
inference request, you can configure the model to choose the next token using
either greedy or random sampling as shown below.

Figure 1-18. Greedy vs. random sampling to predict the next token from a probability distribution
Most generative model-inference implementations default to greedy
decoding. This is the simplest form of next token prediction as the model
always chooses the word with the highest probability. This method works
well for very short generations, but may result in repeated tokens or
sequences of tokens.
If you want to generate text that is more natural and minimizes repeating
tokens, you can configure the model to use random sampling during inference.
This will cause the model to randomly choose the next token using a
weighted strategy across the probability distribution. The token `banana`, as
shown here, has a probability score of 0.02. With random sampling, this
equates to a 2% chance that this word will be selected from the distribution.
Using random-sampling, you reduce the likelihood of repeated tokens in your
model completion. The trade-off, however, is that the model output may be
too creative and either generate an off-topic or unintelligible response.
Finding this optimal setting is why this is called prompt engineering!

NOTE
Some implementations and libraries may require you to explicitly disable greedy sampling
and manually enable random sampling using a function argument similar to
`do_sample=True`.

`top-p` and `top-k` random sampling. These are the most common inference
parameters to enable random sampling. These parameters provide more fine-
grained control for the random sample which, if used properly, should
improve the model’s response yet allow it to be creative enough to fulfill the
generative task,
`top-k`, as you probably guessed, limits the model to choose a token
randomly from only the top-k tokens with the highest probability. For
example, if `k` is set to 3, you are restricting the model to choose from only
the top-3 tokens using the weighted random-sampling strategy. In this case,
the model randomly chooses “donut” as the next token, although it could have
selected from 1 of the other 2.
Figure 1-19. Top-k sampling restricts the model to choose from the top-3 probabilities, in this
case

`top-p` limits the model to randomly sample from the set of tokens whose
cumulative probabilities do not exceed `p` starting from the highest
probability working down to the lowest probability. To illustrate this, first
sort the tokens in descending order based on the probability. Then select a
subset of tokens whose cumulative probability scores do not exceed `p`. For
example, if `p=0.3`, the options are “cake” and “donut” since their
probabilities of 0.2 and 0.1 add up to 0.3. The model then uses the weighted
random-sampling strategy to choose the next token from this subset of tokens
as shown below.
Figure 1-20. Top-p uses random probability weighting to choose the next token from tokens
whose combined probabilities do not exceed P when ranked highest probability to lowest.

Temperature. This parameter also helps to control the randomness of the


model output by modifying the shape of the next-token probability
distribution. In general, the higher the temperature, the higher the randomness.
The lower the temperature, the lower the randomness.
In contrast to top-k and top-p, changing the temperature actually changes the
next-token probability distribution which ultimately affects the next-token
prediction. A low temperature, below 1 for example, results in stronger
peaks where the probabilities are concentrated among a smaller subset of
tokens. A higher temperature, above 1 for example, results in a flatter next-
token probability distribution where the probabilities are more evenly spread
across the tokens. Setting the temperature to 1 leaves the next-token
probability distribution unaltered which represents the distribution learned
during model training and tuning. The figure below compares the low and
high temperature scenarios.

Figure 1-21. Changing the temperature will change the next-token probability distribution

In both cases, the model selects the next token from the modified probability
distribution using either greedy or random sampling which is orthogonal to
the temperature parameter.

Summary
In this chapter, you learned techniques to help get the best possible
performance out of these generative AI models using prompt engineering and
by experimenting with different inference configuration parameters. Prompt
engineering guides the generative foundation model to provide more relevant
and accurate completions using various methods such as better-worded
prompts, in-context learning examples, and step-by-step logical reasoning,
etc.
While you got far with prompt engineering, in-context learning, and inference
parameters, these techniques do not actually modify the generative models’
weights. As such, you may need to train or fine-tune a generative model on
your own datasets to better understand your specific domain and set of
generative use cases. In the next chapter, you will see how to train a
generative model from scratch with public or private datasets.
Chapter 2. Model Pre-Training

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the third chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

In the previous chapter, you defined your use case, performed some prompt
engineering, and determined whether a generative model will work for your
use case. Next, you need to decide if an existing foundation model is
sufficient to understand your domain or if you need to pre-train a foundation
model from scratch.
In this chapter, you will see the trade-offs between choosing an existing
foundation model vs. training a foundation model from scratch. You will also
learn about empirical scaling laws that have emerged for generative AI
models which provide a good starting point if you choose to pre-train your
model from scratch. You will then see an example of pre-training a financial-
based generative model called BloombergGPT.

Foundation Models
At the start of any generative AI project, you should first explore the vast
amount of public, pre-trained foundation models that exist today. These
models have been trained on just about every public piece of text in existence
on the internet across many different languages. As such, these models have
built a solid understanding of human language - as well as a massive amount
of built-knowledge across many domains.
You can find these foundation models in a “model hub” such as HuggingFace
Model Hub, PyTorch Model Hub, or Amazon SageMaker JumpStart. Model
hubs offer a “model card” for each model. Model cards typically contain
important information about the model including training details, context-
window size, prompt information, and known limitations.
Often, the model hubs contain the same models. So just pick a model hub that
best fits your security and infrastructure needs. For example, with the
SageMaker JumpStart model hub, you can deploy a private copy of a
foundation model directly into your AWS account with just a few clicks. This
lets you start generating content within minutes!
You will likely choose a foundation model that performs well on the
generative task that fits your use case. Some models may use slight variations
of the original Transformer architecture to optimize for specific language
tasks. This may cause issues if you try to swap out models during
development.

NOTE
Fear of missing out (FOMO) may cause you to swap out a newer generative model before
completing your evaluation on the current model. Try to avoid this temptation and complete
your testing with a single model - or set of models - before chasing the latest and greatest
leaderboard winner.

These pre-trained foundation models may not have seen enough public text to
learn the nuances of your specific domain, however. For example, the
vocabulary of the public foundation models, often measured in 100,000
tokens, may not include the terms commonly used by your business.
Additionally, public foundation models and datasets may have been scrubbed
to avoid providing medical, legal, or financial advice due to the sensitive
nature of these domains. One financial company, Bloomberg, chose to pre-
train their own foundation model from scratch called BloombergGPT.
BloombergGPT was trained with both public and private financial data. You
will learn more about BloombergGPT in a bit. But first, you will learn more
about model pre-training, in general.

Model Pre-Training
If your domain’s data dictionary uses words, phrases, or linguistic structures
not commonly found in everyday language, you may need to pre-train a
foundation model from scratch to establish a more relevant vocabulary that
better represents and understands your specific domain. This is important in
highly-specialized domains like legal, medical, and financial.
Another issue is that certain domains may use the same language constructs
differently than their everyday context. In this case, using an LLM not trained
on medical data may give improper advice because of a simple
misunderstanding of the patient’s situation.
Since the medical domain uses words and phrases outside of the normal,
every-day language distribution, certain medical conditions and procedures
may not appear in the general-purpose training datasets commonly used by
publicly-available foundation models. Such general-purpose datasets include
Wikipedia entries, Reddit threads, and Stack Overflow discussions which
are likely not the best source for legal or medical advice.

Pre-Training Objectives
There are three variants of Transformer-based models overall: encoder-only,
decoder-only, and encoder-decoder. Each is trained with a different
objective and therefore better-capable of addressing specific tasks. During
pre-training, the model weights are updated to minimize the loss of the
training objectives described next for each variation.
Encoder-only models, or autoencoders, are pre-trained using a technique
called masked language modeling (MLM) that randomly masks input tokens
and tries to predict the masked tokens. This is sometimes called a
“denoising” objective. Autoencoding models use bidirectional
representations of the input to better-understand the full context of a token -
not just the previous tokens in the sequence as shown below.

Figure 2-1. Autoencoding models use a bidirectional context to reconstruct the masked input
tokens

Encoder-only models are best-suited for tasks that utilize its bidirectional
property such as named-entity recognition. A well-known encoder-only
model is BERT which is covered extensively in Data Science on AWS by
O’Reilly.
Decoder-only models, or autoregressive models, are pre-trained using
unidirectional causal language modeling (CLM) which predicts the next
token using only the previous tokens - every other token is masked as shown
below.

Figure 2-2. Autoregressive, decoder-only models only reveal the tokens leading up to the token
being predicted

Decoder-only, autoregressive models use millions of text examples to learn a


statistical language representation by continuously predicting the next token
from the previous tokens. These models are good at many tasks including
generative, classification, and named-entity recognition. GPT, Falcon, and
LLaMa are well-known autoregressive models.
Encoder-decoder models, often called sequence-to-sequence models, use
both the Transformer encoder and decoder. While the pre-training objectives
vary from model to model, the popular T5 foundation model (e.g. FLAN-T5)
was pre-trained using span corruption and span reconstruction for the
encoder and decoder, respectively. Similar to the masked LM used by
encoder-only models, span corruption is used by the encoder to mask
multiple consecutive tokens. The decoder then attempts to reconstruct the
masked sequence of tokens, `<x>` as shown below.

Figure 2-3. Sequence to sequence


Sequence-to-sequence, originally designed for translation. are used for many
language tasks including summarization, question-answer, and classification.
T5 and its fine-tuned sibling, FLAN-T5, are well-known encoder-decoder,
sequence-to-sequence models used across a wide number of generative
language tasks.
Now that you’ve seen the three main types of language models, you will
explore some of the most common public datasets used when pre-training a
model from scratch.

Datasets for Pre-Training


A generative model learns its capabilities during the self-supervised pre-
training phase when the model sees a large amount of unstructured data -
often at the scale of terabytes and petabytes. The data is often sourced from
the public internet, but can also include proprietary data from your private
Amazon S3 buckets or databases. Some of the biggest challenges during pre-
training include addressing bias, removing harmful content, and increasing
the quality of tokens.

NOTE
By some estimates, only 1-3% of parsed tokens are usable for pre-training. Since modern
large language models are pre-trained with 2+ trillion tokens, this implies that 66-200 trillion
tokens are parsed, but not used.

Two of the most popular datasets to pre-train large-language models are


Wikipedia and CommonCrawl. Wikipedia is a multi-lingual extract of
Wikipedia contents from 2022 while CommonCrawl is a monthly dump of
text found on the whole of the internet.
As you can imagine, this type of free-form internet data is very messy. As
such, there are variants of these datasets such as Wiki-40B, Colossal Clean
Crawled Corpus (C4), The Pile, and RefinedWeb that attempt to clean the
data for higher-quality model training. RefinedWeb, in particular, attempts to
filter out machine-generated text using statistical methods to determine if the
text is human-generated vs. machine-generated.

NOTE
Falcon was trained on 1.5 trillion tokens of data. The data was processed on a cluster of
257 ml.c5.18xlarge SageMaker instances consisting of 18,504 CPUs and 37TB of CPU
RAM.

Next, you’ll learn about scaling laws which describe the relationship
between model size, dataset size, and compute budget.

Scaling Laws
The goal of generative model pre-training is to maximize the model training
objective and minimize the loss when predicting the next token. Empirically,
a common set of “scaling laws” have emerged that describe the trade-offs
between model size and dataset size for a fixed compute budget (e.g. number
of GPU hours). These scaling laws state that you can achieve better
generative model performance by either increasing the number of tokens or
the number of model parameters as shown below.
Figure 2-4. Scaling choices for pre-training

Scaling up both will typically require a higher compute budget which is


typically defined in terms of floating point operations or FLOPs. For this
discussion, you will use petaflop where 1 petaflop is one quadrillion floating
point operations per second.
More specifically, you will use the measurement of petaflop/s-day or
“petaflop per second-day” which is the number of floating point operations
performed at a throughput of 1 petaFLOP per second running for an entire
day. The petaflop/s-day measurement is a bit awkward and complex, but it’s
a relatively-common way to describe the compute performance especially
when using Nvidia GPUs as shown below. Approximately 2 Nvidia A100’s
are needed to fulfill a compute budget of 1 petaflop/s-day for Transformer-
based generative models.

NOTE
The petaflop/s-day metric is neural-network and workload dependent, so always make
sure you calculate this number using estimates that match your environment -
Transformers-based workloads, in our case.

The following is a comparison of compute budgets required to pre-train


different variations and sizes of BERT, T5, and GPT-3 using petaflop/s-days.
Remember that BERT is an encoder-only model, T5 is an encoder-decoder
model, and GPT-3 is a decoder-only model. Note that the y-axis is
logarithmic in this figure.
Figure 2-5. Pre-training requirements for common models in petaflop/s-days

Here, you see that large models require large compute budgets. T5-XL, with
3 billion parameters, requires 100 petaflop/s-days compared to GPT-3’s 175
billion parameter variant which requires approximately 4,000 petaflop/s-
days. You might wonder if it’s possible to get 175 billion parameter
performance from a smaller model. It turns out that you can!
Researchers have found that by increasing the training dataset size - instead
of the model size - you can get state-of-the-art performance that exceeds the
175 billion parameter models with a much-smaller set of weights. In fact, the
Scaling Laws for Neural Language Models paper shows that, if you hold the
compute budget constant, model performance may increase when you
increase the training-dataset size (and holding model parameter size
constant) or the number of model parameters (and holding the dataset size
constant).

Figure 2-6. Impact of dataset size and parameter size on model performance

In other words, model performance may continue to improve with more data
- holding both the compute budget and parameter-size constant. This means
that smaller models may just need to be trained on more data and keep the
model size small. This is the exciting field of compute-optimal model
research which you will learn about next.
Compute-Optimal “Chinchilla” Models
In 2022, a paper group of researchers released a paper titled Training
Compute-Optimal Large Language Models that compared model performance
of various model and dataset size combinations. Since the authors named
their final “compute-optimal” model, Chinchilla, this paper is famously
called the “Chinchilla Paper.”
The Chinchilla paper implies that the massive 100 billion-plus parameter
models like GPT3 may be over-parameterized and under-trained.
Additionally, they hypothesize that you could achieve 100 billion-plus
parameter model performance with a small model by simply providing more
training data to the smaller model.
To be more specific, they claim that the optimal training dataset size is 20
times the number of model parameters. The Chinchilla model was never
publicly released, but it was documented as 70 billion model parameters
trained on 1.4 trillion tokens.
Shortly after the Chinchilla paper, however, the compute-optimal LLaMa
model was trained - and ultimately leaked publicly - by Facebook/Meta.
LLaMa followed the Chinchilla scaling laws with 65 billion parameters and
1.3 trillion tokens. The figure below compares the compute-optimal
Chinchilla and LLaMa models with the potentially over-parameterized and
under-trained 175 billion-parameter variants of GPT-3, OPT, and BLOOM.
Figure 2-7. Chinchilla scaling laws for given model size and dataset size

Here, you see that these 175+ billion parameter models, according to the
Chinchilla scaling laws, should be trained on 3.5 trillion tokens. Instead, they
were trained with 180-350 billion tokens - an order of magnitude smaller
than the Chinchilla paper recommends.
In the figure above, you also see that the more-recent LLaMA2 model,
released publicly by Facebook/Meta, was trained with 2 trillion tokens -
even higher than the 20-to-1 token-to-parameter ratio described by the
Chinchilla paper. Similar to the first version of LLaMA, LLaMA2 out-
performed the much larger models and even owned the top position in the
HuggingFace OpenLLM Leaderboard for a bit.
The Chinchilla scaling laws are a great starting point for your pre-training
efforts. They demonstrate that you can achieve state-of-the-art performance
on relatively small 50-70 billion parameter models simply by increasing the
amount of training data.
Next, you will explore a well-documented model called BloombergGPT that
used the Chinchilla scaling laws as a blueprint for model pre-training.

BloombergGPT
First announced in 2023 in a paper called “BloombergGPT: A Large
Language Model for Finance” by Shijie Wu, Steven Lu, and others at
Bloomberg, BloombergGPT is proprietary, domain-specific LLM pre-trained
on the finance domain.
With a budget of 1.3 million GPU hours (230 million petaFLOPs),
BloombergGPTs’ 50 billion parameters were trained in a compute-optimal
way following the Chinchilla laws. Researchers at Bloomberg pre-trained
with a combination of both public and private financial-text data - as well as
public internet-text data. Below shows the breakdown of 51% public and
private financial data vs. 49% public internet data.
While the 50 billion-parameter BloombergGPT should have been trained
with approximately 1 trillion tokens based on the 20-to-1 token-to-parameter
Chinchilla ratio, it was only trained on closer to 700 billion parameters
mainly due to limited availability of high-quality, text-based financial data at
the time of pre-training.
Even so, BloombergGPT reached state-of-the-art performance on financial
language benchmarks - and very good results on general-purpose language
benchmarks. However, there may be room for improvement in model
performance for the same 50 billion parameter model when more financial
text is sourced.
Summary
In this chapter, you saw how models are trained on vast amounts of text data
during an initial training phase called “pre-training”. This is where a model
develops its understanding of language.
You also learned 3 different types of language models including encoder-only
autoencoding, decoder-only autoregressive, and encoder-decoder sequence-
to-sequence models.
Additionally, you will learned some empirical scaling laws that have been
discovered for generative AI models - and how these scaling laws help you
choose the number of model parameters (e.g. 1 billion, 7 billion, 70 billion,
etc) and dataset size (e.g. 700 billion tokens, 1.4 trillion tokens, 2 trillion
tokens, etc) when pre-training your own model from scratch.
In the next chapter, you will explore some of the computation challenges of
training large models including GPU memory limitations. You will see how
to use quantization to reduce the memory requirements of your training job.
You will also learn how to efficiently scale model training across multiple
GPUs using distributed clusters.
Chapter 3. Quantization and
Distributed Computing

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the fourth chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

In this chapter, you will explore some of the challenges associated with pre-
training foundation models. In particular, GPU memory is relatively scarce
compared to CPU RAM. As such, you will explore various techniques such
as quantization and distributed computing to minimize the required GPU
RAM and scale horizontally across multiple GPUs for larger models.
For example, the original 40 billion Falcon model was trained on a cluster of
48 ml.p4d.24xlarge SageMaker instances consisting of 384 Nvidia A100
GPUs, 15TB of GPU RAM, and 55TB of CPU RAM. A more recent version
of Falcon was trained on a cluster of 392 ml.p4d.24xlarge SageMaker
instances consisting of 3,136 Nvidia A100 GPUs, 125TB of GPU RAM, and
450TB of CPU RAM. The size and complexity of the Falcon model requires
a cluster of GPUs, but also benefits from quantization as you will see here.
Computational Challenges
One of the most common issues you’ll encounter when you try to pre-train
large language models is running out of memory. If you’ve ever tried training,
or even just loading your model on NVIDIA GPUs, this error message might
look familiar.

Figure 3-1. CUDA out of memory error.

CUDA, short for Compute Unified Device Architecture, is a collection of


libraries and tools developed for Nvidia GPUs to boost performance on
common deep learning operations including matrix multiplication among
many others. Deep learning libraries such as PyTorch and TensorFlow use
CUDA extensively to handle the low-level, hardware-specific details
including data movement between CPU and GPU memory. As modern
generative models contain multiple billions of parameters, you have likely
encountered this out of memory error during development while loading and
testing a model in your research environment.
A single model parameter, at full 32-bit precision, is represented by 4-bytes.
Therefore, a 1 billion parameter model requires 4GB of GPU RAM just to
load the model into GPU RAM at full precision. If you want to also train the
model, you need more GPU memory to store the states of the numerical
optimizer, gradients, and activations - as well as any temporary variables
used by your functions as shown below.
Figure 3-2. Additional RAM needed to train a model

NOTE
When exploring a new model, it’s recommended that you start with `batch_size=1` to find
the memory boundaries of your model with just a single training example. You can then
increase the batch size until you hit the CUDA out of memory error. This will determine
the maximum batch size for your model and dataset.

These additional components lead to approximately 12-20 extra bytes of


GPU memory per model parameter. For example, to train a 1 billion
parameter model, you will need approximately 80 GB or GPU RAM at 32-
bit full-precision as shown in the figure below.

Figure 3-3. Approximate GPU RAM needed to load and train a 1 billion parameter model at 32-
bit full precision

It’s worth noting that the Nvidia A100 and H100, used at the time of this
writing, only support up to 80GB of GPU RAM. And since you likely want to
train models larger than 1 billion parameters, you’ll need to find a
workaround.
Next, you will explore common data types for model training - as well as a
discussion about numerical precision.
Data Types and Numerical Precision
The following are the various data types used by PyTorch and Tensorflow:
`fp32` for 32-bit full precision, `fp16` for 16-bit half-precision, and `int8` for
8-bit integer precision. More recently, `bfloat16` has become a popular
alternative to fp16 for 16-bit precision. Most of the modern generative AI
models were pre-trained with `bfloat16` including FLAN-T5, Falcon, and
Llama2. Here, you’ll learn how these data types compare - and why
`bfloat16` is a popular choice for 16-bit quantization.
Suppose you want to store Pi using full 32-bit precision. Remember that
floating point numbers are stored as a series of bits consisting of only 0s and
1s. Numbers are stored in 32-bits using 1 bit for the sign (negative or
positive), 8 bits for the exponent, and 23 bits for the fraction, also called the
mantissa or significand, which represents the precision of the number. fp32
can store a range of numbers from -3e-38 to +3e38 as shown below using Pi
as our value.
Figure 3-4. fp32 representing Pi

Note the act of storing a number in 32-bits will actually cause a slight loss in
precision. You can see this by storing Pi as an fp32 and printing the result.
The real value of Pi starts with 3.1415926535897932384. However, the 32-
bit representation is 3.1415920257568359375 which shows a loss in
precision just by storing the value.
Next, you will learn about a common technique called quantization to reduce
the memory requirements required to load and train your multi-billion
parameter model.
Quantization
Quantization is a popular way to convert your model parameters from 32-bit
precision down to 16-bit precision - or even 8-bit integers. By quantizing
your model weights from 32-bit full-precision down to 16-bit half-precision,
you can quickly reduce your 1 billion-parameter model-memory requirement
down 50% to only 2GB for loading and 40GB for training.
Quantization projects a source set of higher-precision floating point numbers
into a lower-precision target set of numbers. Using the source and target
ranges, the mechanism of quantization first calculates a scaling factor, makes
the projection, then stores the results in reduced precision which requires
less memory and ultimately improving training performance and reducing
cost.
With fp16, the 16-bits consist of one bit for the sign but only 5 bits for the
exponent and 10 bits for the fraction. The range of representable fp16
numbers is only from -65,504 to +65,504. Below is an example of quantizing
Pi from fp32 down to fp16.
Figure 3-5. Quantization from fp32 to fp16 saving 50% memory

Note the small loss in precision after this projection as there are only 6
places after the decimal point now. Remember that we already lost precision
just by storing the value in fp32. The loss in precision is acceptable in most
cases, however, the benefits of a 50% reduction in GPU memory for fp16 is
typically worth the tradeoff since fp16 only requires 2 bytes of memory vs. 4
bytes of fp32 as shown below.
Figure 3-6. Only 40GB of GPU RAM is needed to load and train a 1 billion parameter model at
16-bit half precision

bfloat16 or bf16 is short for “BrainFloat 16” as it was developed at Google


Brain and is a popular choice for more-modern generative AI models. bf16
is considered a better alternative to fp16 for generative AI models - and is
natively supported by newer GPUs such as Nvidia’s A100 and H100.
bf16 captures the full range of fp32 with only 16-bits. bfloat16 uses a single
bit for the sign and the full 8 bits for the exponent. However, it truncates the
fraction to just 7 bits which is why it’s often called the “truncated 32-bit
float” as shown below.
Figure 3-7. Quantization with bf16

Another option that’s worth exploring is int8 8-bit quantization. Using 1 bit
for the sign, int8 values are represented by the remaining 7 bits. This results
in a dynamic range of -128 to +127. Unsurprisingly, Pi is projected to just `3`
in the 8-bit lower precision space as shown below. This brings the memory
requirement down from originally 4 bytes to just 1 byte, but obviously results
in a pretty dramatic loss of precision.
Figure 3-8. Quantization with int8

Below shows a table that compares the data types.


Figure 3-9. Quantization summary

Quantization reduces the memory needed to load and train a model by


reducing the precision of the model weights. Below is a comparison of the
amount of memory needed to load a 1 billion parameter model at 32-bit, 16-
bit, and 8-bit precision.
Figure 3-10. Approximate GPU RAM needed to store 1B parameters at different quantization
sizes

And you could reduce the memory footprint further by representing the model
parameters with 4-bits, 2-bits, and even 1-bit! Just remember that you may
reduce the expressiveness and power of these models as you continue to
reduce precision.
When you try to train a 1B parameter model at 32-bit full precision, you will
quickly hit the limit of a single Nvidia A100 or H100 GPU with only 80GB
of GPU RAM. Therefore, you will almost always need to use quantization
when using a single GPU. However, most modern generative AI models
exceed 1 billion parameters and require 10’s of thousands of GB’s of GPU
RAM as shown below.
Figure 3-11. GPU RAM needed for large models

For larger models, you will likely need to use a distributed cluster of GPUs
to train these massive models across hundreds or thousands of GPUs. Even
for single GPU scenarios, there are still some performance benefits to
scaling your training across multiple GPUs even though it’s not required.
However, in addition to the extra cost, training with a distributed cluster
requires a deeper understanding of distributed-computing techniques which
you will explore next.

Distributed Computing
There are many different types of distributed computing patterns including
distributed data parallel (DDP) and fully-sharded data parallel (FSDP). The
main difference is how the model is split - or sharded - across the GPUs in
the system.
If the model parameters can fit into a single GPU, then you would choose
DDP to load a single copy of the model into each GPU. If the model is too
large for a single GPU - even after quantization - then you need to use FSDP
to shard the model across multiple GPUs. In both cases, the data is split into
batches and spread across all available GPUs to increase GPU utilization
and cost efficiency at the expense of some communication overhead which
you will see in a bit.
PyTorch comes with an optimized implementation of DDP that automatically
copies your model onto each GPU (assuming it fits into a single GPU - often
combined with quantization), splits the data into batches, and sends the
batches to each GPU in parallel. With DDP, each batch of data is processed
in parallel on each GPU followed by a synchronization step where the results
of each GPU (e.g. gradients) are combined (e.g. averaged). Subsequently,
each model - 1 per GPU - is updated with the combined results and the
process continues as shown below.
Figure 3-12. Distributed Data Parallel (DDP)

Note that DDP assumes that each GPU can fit not only your model parameters
and data batches, but also any additional data that is needed to fulfill the
training loop including optimizer states, activations, temporary function
variables, etc as shown below. If your GPU cannot store all of this data, you
need to shard your model across multiple GPUs.
Figure 3-13. Memory consumption for DDP

PyTorch has an optimized implementation of model sharding called fully


sharded data parallel (FSDP). FSDP was motivated by the 2019 ZeRO paper
from Samyam Rajbhandari, et. al. The goal of ZeRO, or Zero Redundancy
Optimizer, is to reduce DDP’s data redundancy by sharding the model - and
its additional gradients, activations, and optimizer states - across the GPUs
to achieve zero redundancy in the system. ZeRO describes 3 optimization
stages depending on what is being sharded across the GPUs: stage 1, 2, and 3
as shown below.
Figure 3-14. ZeRO consists of 3 stages depending on the GPU shards: parameters, gradients,
and optimizer states

ZeRO Stage 1 only shards the optimizer states across GPUs, but still reduces
your model’s memory footprint up to 4x. ZeRO Stage 2 shards both the
optimizer states and gradients across the GPUs to reduce GPU memory up to
8x. ZeRO Stage 3 shards everything - including the model parameters -
across the GPUs to help reduce GPU memory up to `n` times where `n` is the
number of GPUs. For example, when using ZeRO Stage 3 with 128 GPUs,
you can reduce your memory consumption by up to 128x.
Compared to DDP in which each GPU has a full copy of everything needed
to perform the forward and backward pass, FSDP needs to dynamically re-
construct a full layer from the sharded data onto each GPU before the
forward and backward passes as shown below.

Figure 3-15. Fully sharded data parallel across multiple GPUs

Here, you see that before the forward pass, each GPU requests data from the
other GPUs on-demand to materialize the sharded data into unsharded, local
data for the duration of the operation - typically on a per-layer basis.
When the forward pass completes, FSDP releases the unsharded, local data
back to the other GPUs - reverting the data back to its original sharded state
to free up GPU memory for the backward pass. After the backward pass,
FSDP synchronizes the gradients across the GPUs, similar to DDP, and
updates the model parameters across all the model shards where different
shards are stored on different GPUs.

NOTE
You can also configure FSDP to offload some computation to CPU and CPU memory to
further reduce memory pressure on your GPUs. Another performance vs. memory
tradeoff.

By materializing the data on-demand, FSDP balances the communication


overhead with the overall GPU memory footprint. You can manually
configure a “sharding factor” to manage the trade-off between performance
and memory utilization and adapt to your specific environment as shown
below.
Figure 3-16. Choose a sharding factor based on the resources in your environment

Sharding factor of 1 avoids sharding and replicates the model across all
GPUs - reverting the system back to DDP. You can set the sharding factor to a
maximum of `n` number of GPUs to unlock the potential of full sharding. Full
sharding offers the best memory savings - as the cost of GPU-communication
overhead. Setting the sharing factor to anything in between will enable hybrid
sharding.
Below is a comparison of FSDP and DDP from the 2023 paper Experiences
on Scaling Fully Sharded Data Parallel by Zhao, et al. These tests were
performed on different-size T5 models using 512 NVIDIA A100 GPUs -
each with 80GB of memory. They compare the number of teraFLOPs per
GPU where teraFLOP, as a reminder, is 1 trillion floating point operations
per second.

Figure 3-17. Performance improvement with FSDP over DDP

Full sharding is in blue, hybrid sharding is orange, no sharding (full


replication similar to DDP) is green, and DDP is in red. As expected the no
sharding and DDP configurations perform similarly. For the smaller T5
models, 611 million parameters and 2.28 billion parameters, FSDP performs
the same as DDP. However, DDP runs out of GPU memory when training the
11.3 billion parameter T5 model. FSDP, however, easily supports the higher
parameter size when using hybrid and full sharding.
Furthermore, training the 11 billion parameter model with different cluster
sizes from 8 GPUs to 512 GPUs shows only a 7% decrease in per-GPU
teraFLOPS due to GPU communication overhead. These tests were run with
batch sizes of 8 (blue) and 16 (orange).

This demonstrates FSDP can scale model training for both small and large
models across different GPU cluster sizes. As the cluster size increases,
however, there will be a small decrease in overall throughput as
communication overhead increases between the GPUs.

Summary
In this chapter, you explored some of the computational challenges of training
these models due to GPU memory limitations. And you saw how to use
quantization to save memory, reduce cost, and improve performance. You
also learned how to scale your training across multiple GPUs and nodes in a
cluster using distributed computing strategies such as distributed data
parallel and fully-sharded data parallel. By combining quantization and
distributed computing, you can train very large models efficiently and cost
effectively with minimal impact on training throughput and model accuracy.
In the next chapter, you will learn how to adapt existing generative
foundation models to your own datasets using a technique called fine-tuning.
Fine-tuning an existing foundation model can be a less costly, yet sufficient
alternative to model pre-training from scratch.
Chapter 4. Fine-Tuning and
Evaluation

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the fifth chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

Your first experience with generative AI was likely a general-purpose


chatbot that could answer questions, perform arithmetic, or recommend
recipes based on your nutritional preferences. These general purpose
chatbots are often referred to as “multi-task” because of their ability to
perform many tasks with a single generative model. However, these model’s
didn’t learn this multi-task behavior during pre-training as shown in previous
chapters.
After large-language model pre-training, the model is very aware of
grammatical and linguistic nuances. In other words, it has the ability to
understand language. However, the model needs a bit more help to
understand how to perform tasks and reason through problems. To do this, the
model needs to be fine-tuned with examples of instruction inputs and outputs
- as well as examples of step-by-step reasoning to help the generative model
perform more complex tasks.
As you saw in Chapter 2, using instructions with your prompts can help the
model generate more relevant and specific responses. This is an important
aspect of generative AI and you will soon see how to add instructions to your
data sets to fine-tune your model to perform tasks that it may not have seen
previously. This is the phase of the generative AI project lifecycle where you
will adapt pre-trained models to specific tasks and use cases using your own
datasets through a process called “fine-tuning”.

Figure 4-1. Gen AI lifecycle for adapt phase

Fine-tuning helps to improve the pre-trained model’s performance on


specific tasks. To measure model performance, you will need evaluation
metrics to compare the generated text before and after model fine-tuning. As
such, you will see various evaluation metrics and benchmarks to help
measure and compare the performance of generative AI models.

Single-Task Fine-Tuning with Instruction


Consider a scenario where you want to fine-tune a pre-trained model to
improve performance on a summarization task. This is useful to help
optimize workflows that involve conversations or large documents such as
customer-support tickets or legal documents, respectively.
Interestingly, good results can be achieved with relatively few examples.
Often, just 500-1,000 examples will result in noticeable performance
improvement relative to the pre-trained baseline as shown below. This is in
contrast to the billions of tokens needed during model pre-training as we saw
in a previous chapter.
Figure 4-2. Fine-tuning on a single task

However, if your goal is to build and deploy a general-purpose generative


model, then you will need to perform multi-task fine-tuning on many tasks at
the same time.

Multi-Task Fine-Tuning with Instruction


To perform multi-task fine-tuning, you will need a training dataset that
includes instruction-based prompt-completion pairs across multiple tasks.
Note that a decent, general-purpose, multi-task model may require 100,000+
examples spread across many different generative tasks. This, of course,
requires more data and more compute, but not nearly as much as model pre-
training described in a previous chapter.
Below, you see a sample dataset that includes instruction examples for a
variety of tasks including summarization, classification, code translation, and
named-entity recognition. By training the model on a mixed dataset, you can
improve the performance of the model on many tasks simultaneously, avoid
the issue of catastrophic forgetting, and maintain the model’s multi-task
capability.

Figure 4-3. Multi-task fine-tuning with instruction

During this supervised instruction fine-tuning process, the loss is calculated


across many different examples. This loss is then used to update the weights
of the model during backpropagation which results in a multi-task,
instruction-fine-tuned model that has learned how to be good at many
different tasks simultaneously. There are many examples of instruction fine-
tuned models including FLAN-T5-XXL (11 billion parameters), falcon-40b-
instruct (40 billion parameters), and llama2-70b-chat (70 billion
parameters).
To start, let’s work with the popular FLAN-T5 model which is well-
documented and known as a great general-purpose, multi-task instruction
model. FLAN-T5 is the instruct version of the T5 foundation model using the
FLAN instruction dataset which includes chain-of-thought (CoT) reasoning.
As we saw in Chapter 2, CoT reasoning helps the FLAN family of models
perform step-by-step reasoning for more-complex problems such as
arithmetic.
In total, FLAN models have been fine-tuned on 473 datasets across 146 task
categories consisting of nearly 1,800 fine-grained tasks as shown below.
Figure 4-4. Data sets used to fine-tune the FLAN-T5 model from base T5

One of the datasets, `samsum`, is used to train a language model to summarize


conversational dialog. This dataset contains 16 thousand conversations and
human-curated summaries. The conversations and summaries were created
by linguistics experts specifically to create a high-quality training dataset for
dialog summarization as shown below.
Figure 4-5. samsum dataset of conversational dialog including human-curated summaries

Below is a prompt template that references the `dialogue` and `summary`


columns of samsum dataset. This template consists of multiple instructions
for the same conversation-summary pair. By seeing different instructions for
the same task, the model generalizes better on instructions that are not in the
original dataset.

{dialogue}
Briefly summarize that dialogue.
{summary}

Here is a dialogue:
{dialogue}
Write a short summary.
{summary}

Dialogue:
{dialogue}
What is a summary of this dialogue?
{summary}

{dialogue}
What was that dialogue about, in two sentences or less?
{summary}

Here is a dialogue:
{dialogue}
What were they talking about?
{summary}

Dialogue:
{dialogue}
What were the main points in that conversation?
{summary}

Dialogue:
{dialogue}
What was going on in that conversation?
{summary}

Applying this template to each row in the samsum dataset, you actually create
7 fine-tuned examples per row, in this case. Since samsum is approximately
16,000 rows of data, after applying the template, you now have 112,000
examples to fine-tune your model for the conversation-summarization task!
See, getting to 100,000 examples wasn’t that difficult, was it?

Instruction Fine-Tuning on a Custom Dataset


While the conversations in the samsum dataset helped the FLAN-T5 model
learn to summarize conversations, they may not capture the nuance and
uniqueness of your use case. Therefore, you may want to perform additional
fine-tuning with your dataset - or even add a new set of tasks that better-
match your business problem.
To simulate this, you can fine-tune the model to summarize a different dialog
dataset called `dialogsum` which the model has not seen. The dialogsum
dataset consists of over 13,000 conversations and summaries.

NOTE
Fine-tuning an already-fine tuned model like FLAN-T5 on a single task like summarization
will likely improve the model’s performance on summarization but may degrade the
model’s performance on other tasks like sentiment analysis and question-answer. This
phenomenon is called “catastrophic forgetting”. To combat catastrophic forgetting, you can
mix in a small percentage of multi-task examples - in addition to summarization - during the
fine-tuning process. Approximately 5% of mix-in, multi-task data is a good starting point.

Below is an example of a conversation between a customer and a hotel


representative during the customer’s check-in process. The summary was
created by a set of human annotators.
Figure 4-6. Example of human-annotated conversation summary

You can use Python’s “f-string” formatting code to create a sample prompt
from the `dialogue` column above as shown here.

prompt_template = f”””
Here is a dialogue:

{dialogue}

Write a short summary.


“””

You will then pass this prompt to the model which returns a sample
completion as shown below.
Figure 4-7. Generated summary before fine-tuning

Here, you see the model does OK, but the generated summary does not match
the human-curated completion which includes important details about the
conversation. The model has also generated some hallucinations with
information not found in the original conversation such as the hotel name.
First, you will prepare a dataset for fine-tuning by applying a similar
`prompt_template` above to the full dialogsum dataset, but adding the
`{summary}` column as this is needed for the fine-tuning. Below is the code
to apply this format to the dialogsum dataset - as well as convert the text into
`token_ids` used during fine-tuning.
prompt_completion_template = f”””
Here is a dialogue:

{dialogue}

Write a short summary.

{summary}
“””
from transformers import AutoTokenizer
from datasets import load_dataset, DatasetDict
dataset = load_dataset("knkarthick/dialogsum")
print(dataset)
def tokenize_prompt_and_completion(sample):
prompt_and_completion = \
Prompt_completion_template \
.format(
dialogue=sample["dialogue"],
summary=sample["summary"],
eos_token=tokenizer.eos_token)
return tokenizer(prompt_and_completion)

tokenized_prompt_and_completion_datasets =
dataset.map(tokenize_prompt_and_completion)

tokenized_prompt_and_completion_datasets =
tokenized_datasets.remove_columns(['id',
'topic', 'dialogue', 'summary',])

print(tokenized_prompt_and_completion_datasets)

Next, you will actually perform the fine-tuning on the tokenized dataset. Here
is the code for fine-tuning your model using the HuggingFace Transformers
library.

import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments,
Trainer, GenerationConfig

import torch

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id
model =
AutoModelForSeq2SeqLM.from_pretrained(
model_checkpoint,
trust_remote_code=True, # needed by Falcon
torch_dtype=torch.bfloat16,
device_map="auto", # place shards automatically
)

training_args = TrainingArguments(
output_dir="./output",
evaluation_strategy="epoch",
save_strategy="epoch",
learning_rate=1e-5,
num_train_epochs=5,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
weight_decay=0.01,
)

trainer = Trainer(
model=model,
args=training_args,
train_dataset=\
tokenized_prompt_and_completion_datasets['train'],
eval_dataset=\
tokenized_prompt_and_completion_datasets['validation']
)

NOTE
To use a model like Falcon or Llama2, just change `model_checkpoint`, replace `model`
definition with the one shown below using `AutoModelForCausalLM`, and re-run both the
data preparation and fine-tuning steps.

model =
AutoModelForCausalLM.from_pretrained(
model_checkpoint,
trust_remote_code=True, # Required by Falcon torch_dtype=torch.bfloat16,
device_map="auto", # place shards automatically
)
Below is the model’s response after performing fine-tuning. You see that this
summary is closer to the human-curated baseline summary. The model
includes the important information and does not seem to hallucinate.

Figure 4-8. Generated summary after fine-tuning

While this example uses the public `dialogsum` dataset to demonstrate fine-
tuning on custom data, you will likely use your company’s own internal data
such as the chat-support conversations exported from your customer-support
application. This helps your model learn the nuances of the interactions
between your customer-service representatives and your customers.
In this example, you qualitatively compared the summaries based on your
human understanding of language. While humans are very good at model
evaluation, they are not scalable, unfortunately. Next, you will learn about
more-formal techniques to create repeatable, scalable, and quantitative
mechanisms to evaluate your generative models before deciding to push to
production.

Evaluation Metrics
There exist many metrics to evaluate generative AI model performance, and
there is much debate in the community on their significance and effectiveness.
At their core, evaluation metrics provide a baseline to which you can
compare changes to your model such as fine-tuning.
Classic machine learning evaluation metrics such as accuracy, root-mean
squared-error (RMSE) are straightforward to calculate since the predictions
are deterministic and easy to compare against the labels in a validation or
test dataset.
The output from Generative AI models, however, are famously non-
deterministic by design which makes evaluation very difficult without human
intervention. Additionally, evaluation metrics for generative models are very
task-specific. For example, the ROUGE metric is used to evaluate
summarization tasks while the BLEU metric is used for translation tasks.
Since this chapter focuses on summarization, you will learn how to calculate
the ROUGE metric which reveals why it’s both useful and controversial at
the same time.
ROUGE, which is the acronym for Recall-Oriented Understudy for Gisting
Evaluation, calculates how well the input (‘dialogue`, in our case) compares
to the generated output (`summary`, in our case). To do this, ROUGE
calculates the number of similar unigrams (single-words), bigrams (two
consecutive words), and longest-common-sequences (consecutive n-grams)
between the inputs and generated outputs to calculate the ROUGE-1,
ROUGE-2, and ROUGE-L scores. The higher the score, the more similar
they are.
By now, you might understand the controversy. Human language consists of
many examples in which similar phrases vary wildly in their meaning
differing either by only a few words - or a slight change in word position.
Consider the example, “This book is great” and “This book is not great”.
Using ROUGE alone, these phrases appear to be similar as shown below.
However, they are, in fact, opposite.
While ROUGE is far from perfect, it is suitable to use as a baseline metric
before and after fine-tuning your model. Many popular natural-language
libraries, including HuggingFace, support ROUGE. Below is the code to
evaluate your model using the `evaluate` library from HuggingFace. Here,
you see an approximately 80% improvement in the ROUGE scores after fine-
tuning on the `dialogsum` dataset based on a holdout test dataset not seen by
the model during fine-tuning.

import evaluate

rouge = evaluate.load('rouge')

original_results = rouge.compute(
predictions=original_model_summaries,
references=human_baseline_summaries,
use_aggregator=True,
use_stemmer=True,
)
print(original_results)
```

{'rouge1': 0.2334,
'rouge2': 0.0760,
'rougeL': 0.2014}

```
tuned_results = rouge.compute(
predictions=tuned_model_summaries,
references=human_baseline_summaries,
use_aggregator=True,
use_stemmer=True,
)
print(tuned_results)
```

{'rouge1': 0.4216,
'rouge2': 0.1804,
'rougeL': 0.3384}

Evaluation Benchmarks and Datasets


To evaluate and compare your generative models more holistically, you can
use existing benchmarks and datasets established by the community such as
GLUE, SuperGLUE, HELM, BIG-bench, and MMLU among many others.
These benchmarks have evolved over the years to include many complex
tasks such as reading comprehension and common-sense inference.
GLUE, or General Language Understanding Evaluation, was introduced back
in 2018 specifically to evaluate and compare model performance across a set
of language tasks. The result was more-generalizable language models which
positively impacted the landscape of natural language research and
development. SuperGLUE, the successor to GLUE, was introduced a year
later in 2019 to include more-challenging tasks such as multi-sentence
reasoning, and reading comprehension. Both GLUE and SuperGLUE offer
public leaderboards to encourage and reward improvements in language
understanding.
Holistic Evaluation of Language Models, or HELM, is a benchmark designed
to encourage model transparency and ultimately provide users information on
which model to choose for a given task. HELM is a combination of 7 metrics
across 16 core “scenarios” as defined by the HELM community. Scenarios
include tasks such as question-answer, summarization, sentiment analysis - as
well as toxicity and bias detection. HELM also offers an extension
mechanism to add new scenarios and tasks. As such, HELM is considered a
“living” benchmark that can evolve over time.
Massive Multitask Language Understanding, or MMLU, evaluates a model’s
knowledge and problem-solving capabilities. Models are tested across
different subjects including mathematics, history, and science.
WARNING
Multiple variants of a benchmark may exist - each covering a different set of tasks and
datasets. An example is the MMLU benchmark which, as of this writing, has 3 different
variations. Unfortunately, this causes further controversy regarding the relevancy of
benchmarks overall.

BIG-bench is another popular benchmark for generative models. Consisting


of 204 tasks across linguistics, mathematics, biology, physics, software
development, common-sense reasoning, and much more. Because BIG-bench
is so massive, it was released in different sizes to help reduce the inference
cost to participate in the benchmark’s leaderboard.
It’s important to choose metrics, benchmarks, and datasets that help to
evaluate not just your models’ generative capabilities, but also its potential
to produce hate speech, fake news, and other harmful output. The
RealToxicityPrompts and TruthfulQA datasets are good starting points to
evaluate your model’s potential to generate hate speech and misinformation,
respectively.

Summary
In this chapter, you saw how to fine-tune your model with instructions by
applying prompt templates to a dataset that matches your generative task and
use case. You also learned some common metrics, benchmarks and datasets
used to evaluate your model after making changes, compare your model to
other models, and measure the model’s toxicity and truthfulness.
Chapter 5. Parameter-efficient
Fine Tuning (PEFT)

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the sixth chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

As you saw in previous chapters, training generative models is


computationally expensive. Adapting models to your domain or task through
full fine-tuning requires memory not just to store the model, but also various
other parameters that are required during the training process. In contrast to
full fine-tuning, Parameter Efficient Fine Tuning (PEFT) provides a set of
techniques allowing you to fine-tune your models while utilizing less
compute resources. In this chapter, you’ll learn about PEFT as well as a few
specific PEFT methods including prompt tuning, LoRA, and QLoRA.

Full fine-tuning vs. parameter-efficient fine-


tuning (PEFT)
In this section, you’ll learn more about the differences between full fine-
tuning of an LLM and utilizing parameter efficient methods for model
adaptation. At a high level, with full-fine tuning you are updating every
model weight through supervised learning. In contrast, PEFT methods only
update a small subset of parameters.
As shown in Figure 6.n, the compute you use for full fine-tuning of your
Large Language Models (LLMs) needs to be able to hold not only the model
weights, which are now on the order of 100s of GB for the largest models,
but you must also be able to allocate memory for optimizer states, gradients,
forward activations, and temporary memory throughout the training process.
Because these additional components can be many times larger than the
model itself, the compute requirements for full fine-tuning can quickly
become too large to handle on consumer hardware.
Figure 5-1. Training and tuning models requires more memory than just the model weights

PEFT methods aim to reduce the resources required for adaptation by


freezing most of the model weights and focusing on fine-tuning only a subset
of existing or new model parameters. This subset of parameters can include
specific model layers or components. As shown in Figure 6.n, there are
multiple PEFT methods, some which train a subset of model parameters
while others that add new trained layers. However, with all PEFT methods
you’re able to utilize the frozen weights of the original model while only fine
tuning a subset of weights which requires less resources in comparison to
full fine-tuning.
Figure 5-2. Methods methods aim to only train a small subset of layers or parameters

With PEFT, most, if not all of the weights are kept frozen. As a result, the
number of trained parameters is much smaller than the number of parameters
in the original model. In some cases the trainable parameters can be just 1-
2% of the original LLM weights. Because you’re training a relatively-small
number of parameters, the memory requirements for fine-tuning become more
manageable and can often be performed on a single GPU.
In addition to requiring less resources during fine-tuning, PEFT methods are
also less prone to catastrophic forgetting, as discussed in depth in Chapter 4,
when compared to full fine-tuning because the original model is only slightly
modified or left unchanged.
Another key consideration for PEFT includes common scenarios where you
need to adapt your model for multiple tasks. For example, let’s assume you
need to fine-tune for three separate tasks. If you use full fine-tuning for each
task, that results in a new model version for every task you train on as shown
in Figure 6.n. Each of these new adapted models are the same size as the
original models which can create an expensive storage and hosting problem
if you are performing full fine-tuning for multiple tasks.

Figure 5-3. Full fine-tuning creates a full copy of the original model for each task

With parameter efficient fine tuning, you train only a small number of
weights, which results in a much smaller model footprint overall - as small
as MBs depending on the task.
The new, or updated, parameters are combined with original model weights
for inference. The PEFT weights are trained for each task and can be easily
swapped out for inference as shown in Figure 6.n. This allows for efficient
adaptation of the original model to multiple tasks.

Figure 5-4. PEFT reduces task-specific model weights and can be swapped out at inference

There are some things to consider when choosing between full fine-tuning
and parameter-efficient fine-tuning to adapt your model to specific tasks.
Table 6.n summarizes these considerations.
T
a
b
l
e
5
-
1
.
C
o
n
s
i
d
e
r
a
t
i
o
n
s
f
o
r
c
h
o
o
s
i
n
g
P
E
F
T
v
s
F
u
l
l
F
i
n
e
-
T
u
n
i
n
g

Consideratio Full Fine- Parameter Efficient Fine-Tuning


n Tuning (PEFT)

Fine-Tuning Increase compute Reduced compute requirements


Compute requirements as a result of training only a
Resource (compute, memory, subset of model parameters
Requirements storage)

Storage Increased storage Reduced storage requirements


Resource requirements model
Requirements

Training Data Larger dataset with Smaller dataset with fewer


multiple examples examples

Parameter Each weight updated Only a subset of weights


Efficiency during fine-tuning updated during fine-tuning

Model Typically results in Performance can be equivalent


Performance higher performance or comparable to full fine-
tuning in some scenarios

Inference Each fine-tuned Host original LLM and swap


Performance model must be model weights at inference
hosted

In general, PEFT methods can often be a good option to minimize resource


requirements while still maintaining adequate model performance for your
adapted tasks. There are several methods you can use for parameter efficient
fine tuning. Each method has more detailed trade-offs between parameter
efficiency, memory efficiency, training speed, model performance, and
inference costs. In the next section, you’ll learn about the high level
categories of PEFT methods as well as trade-offs to consider when choosing
each.

High-level PEFT Methods


PEFT methods vary in the approach used to fine tune models for specific
tasks. These methods can be classified into three high level categories first
introduced by researchers at UMass in the paper Scaling Down to Scale Up:
A Guide to Parameter-Efficient Fine-tuning (Lialin, 2023). The categories
identified include selective, reparameterization, additive, and hybrid
methods.

Figure 5-5. PEFT methods

Selective methods are those that fine-tune only a subset of the original model
parameters or layers. There are several approaches that you can take to
identify which parameters or layers you want to update. As an example, the
BitFit method focuses on only training the bias weights or a subset of those
weights. Depending on the selective method used, you have the option to
train only certain components of the model, specific layers, or even
individual parameter types. Researchers have found that the performance of
these methods is mixed, and there are significant trade-offs between
parameter efficiency and compute efficiency.
Reparameterization methods also work with the original model parameters,
but reduce the number of parameters to train by creating new low-rank
transformations of the original network weights. A popular technique in this
category is Low-rank Adaptation (LoRA). Because LoRA is broadly used,
you’ll explore this method, along with a complementary method called
Quantized LoRA (QLoRA) more in later sections of this chapter.
Additive methods carry out fine-tuning by keeping all of the original model
weights frozen and introducing new trainable components. There are two
primary categories of additive methods including adapters and soft prompts.
Adapters add new trainable layers to the architecture of the model, typically
inside the encoder or decoder components after the Attention or Feed
Forward layers. On the other hand, soft prompts focus on manipulating the
inputs to achieve better performance. The inputs can be manipulated by
adding trainable parameters to the prompt embeddings or keeping the input
fixed and retraining the embedding weights. For this chapter, you’ll learn
more about one specific technique that falls within the soft prompt category
called Prompt Tuning.
In the next section, you’ll learn about a specific reparameterization technique
called LoRA.

Low-rank Representation Adaption (LoRA)


and QLoRA
LoRA is a commonly used PEFT technique in the reparameterization
category that freezes the original weights of the LLM and creates new
trainable low rank matrices into each layer of the Transformer architecture.
This technique was first introduced in the paper “LoRA: Low-rank
Adaptation of Large Language Models” (Hu, 2021). This fine-tuning method
reduces the number of trainable parameters and, as a result, the training time
required. This also results in a reduction in the compute resource required.
To understand how LoRA works, let’s first revisit the transformer
architecture from Chapter 1 and recall that the input prompt is turned into
tokens which are then converted to embedding vectors then passed into the
encoder and/or decoder parts of the transformer. In each of these
components, there are two kinds of neural networks including the self-
attention and feed-forward networks. The weights of these networks are
learned during pre-training.

Figure 5-6. During full fine-tuning every parameter in network layers is updated

After the embedding vectors are created, they’re fed into the self-attention
layers, where a series of weights are applied to calculate the attention
scores. During full fine-tuning, every parameter in these layers is updated.
This process of updating every parameter can require a lot of compute
resources and time.
LoRA is a fine-tuning strategy that reduces the number of parameters to be
trained by freezing all of the original model parameters and then inserting a
pair of “rank decomposition matrices” alongside the original weights. The
dimensions of the smaller matrices, shown in Figure 6.n as A and B, are
defined so that their product is a matrix with the same dimensions as the
weights they are modifying. You then keep the original weights of the model
frozen and train these smaller matrices using the same supervised learning
process defined in Chapter 2.

Figure 5-7. Low-rank matrices are learned during LoRA fine-tuning process
The size of the low rank matrices is set by the parameter called rank (‘r’).
The original pre-trained model contains full rank weight matrices for the
tasks they are trained on. For new tasks, LoRA relies on the fact that this data
can be represented by a lower dimensional matrix. The size of this lower
dimensional matrix is determined by the value of ‘r’. A smaller value leads
to a simpler low-rank matrix with less parameters to train. While it’s
important to experiment with the right value of ‘r’ for your own tasks, you
can often achieve good results with a smaller ‘r’ number (i.e. 4, 8).
There are different options to utilize LoRA for fine-tuning; however, open
source libraries support the different PEFT methods. Below is an example
using HuggingFace Transformers to perform LoRA fine-tuning for (‘Task 1’)
with a rank=32.

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
r=32, # Rank
lora_alpha=32,
target_modules=["q", "v"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)

peft_model = get_peft_model(original_model,
lora_config)

peft_training_args = TrainingArguments(
output_dir="./model",
auto_find_batch_size=True,
learning_rate=1e-3,
num_train_epochs=1,
logging_steps=1,
max_steps=1
)

peft_trainer = Trainer(
model=peft_model,
args=peft_training_args,
train_dataset=tokenized_datasets["train"]
)
For inference, the two low-rank matrices are multiplied together to create a
matrix with the same dimensions as the frozen weights. You then add this to
the original weights and replace them in the model with these updated values
as shown below in Figure 6.n. You now have a LoRA fine-tuned model that
can carry out your specific task. To return to the original weights for another
task, you can then subtract the value of the low rank matrix from the original
weights. Because this model has the same number of parameters as the
original, there is little to no impact on inference latency.

Figure 5-8. Low-rank matrices multiplied together and added to original weights for inference

Below is the code used for performing inference against your LoRA fine-
tuned model using the low rank matrices for the task (‘Task 1’) that you
previously fine-tuned. Here, you are using `is_trainable=False` because we
are only performing inference - and not updating any weights. There are some
performance benefits to setting the parameters to immutable.

from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForCausalLM.from_pretrained(base_model_dir,
torch_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(base_model_dir)

peft_model = PeftModel.from_pretrained(peft_model_base,
model_dir,
torch_dtype=torch.bfloat16,
is_trainable=False)

While LoRA can be applied to any subset of weight matrices in the


Transformer architecture (ex. feed-forward layers), researchers have found
that applying LoRA to just the self-attention layers of the model is often
enough to fine-tune for a task and achieve performance gains. Most of the
model parameters are in the attention layers so this also results in a higher
degree of parameter efficiency.
If you translate this into practical terms, the “Attention is all you need”
(Vaswani, 2017) research paper specifies transformer weights with the
dimensions of 512 by 64 which means each weight matrix in the architecture
has 32,768 trainable parameters (512 x 64 = 32,768). If you were performing
full fine-tuning you are updating 32,768 parameters for each weight matrix in
the architecture. With LoRA, assuming a rank equal to 4, two small rank
decomposition matrices will be trained whose small dimension is 4. This
means that matrix A will have the dimensions of 4 x 64 resulting in 256 total
parameters while matrix B will have the dimensions of 512 x 4 resulting in
2,048 trainable parameters. By updating the weights of these new low rank
matrices, you are able to fine-tune for your task by training only 2,304 (256 +
2,048) parameters instead of 32,768.
Because LoRA allows you to significantly reduce the number of trainable
parameters, you can often perform this method of parameter efficient fine-
tuning with a single GPU and avoid the need for a distributed cluster of
GPUs. This saves not only in cost but also in time required to fine-tune for
your task. In addition, because the rank decomposition matrices are small,
you can fine-tune a different set for each task, and then switch them out at
inference time by updating the weights.
For example, you can train a pair of LoRA matrices for a specific task then to
carry out inference on this task, you multiply these matrices together, and then
add the resulting matrix to the original frozen weights. You then take this new
summed weights matrix and replace the original weights where they appear
in your model. You can then use this model to carry out inference on ‘Task 1’.
Alternatively, you can fine tune another pair of LoRA matrices for a separate
task, shown as ‘Task 2’. To carry out inference for `Task 2`, you take the
LoRA matrices trained for this task, calculate their product and add this
matrix to the original weights. This method is compute and storage efficient
because you are still only storing one copy of the full sized pre-trained model
and training these smaller matrices adapted to your tasks and only switching
the weights out when you need to use them. Below shows how to load 2
different PEFT models (`peft_model_1` and `peft_model_2`) from a single
base model.

peft_model_1 = PeftModel.from_pretrained(peft_model_base,
model_1_dir,
torch_dtype=torch.bfloat16,
is_trainable=False)

peft_model_2 = PeftModel.from_pretrained(peft_model_base,
model_2_dir,
torch_dtype=torch.bfloat16,
is_trainable=False)

While LoRA reduces memory requirements, there is a variation of LoRA,


called QLoRA that aims to further reduce memory requirements by
combining low rank adaptation with quantization (Dettmers, 2023). QLoRA’s
uses 4-bit quantization in a format called NormalFloat or nf4. Fine tuning
with QLoRA is shown to match 16-bit fine-tuning methods because the 4-bit
weights are only dequantized to 16-bits as needed for computations during
the forward and backward passes. Below is a code sample that shows how
to fine tune with QLoRA using the open source `bitsandbytes` library.

from transformers import BitsAndBytesConfig, AutoModelForCausalLM

bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(model_id,
quantization_config=bnb_config)

from peft import LoraConfig, get_peft_model

config = LoraConfig(
r=32,
lora_alpha=32,
target_modules=[
"query_key_value",
"dense",
"dense_h_to_4h",
"dense_4h_to_h",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

trainer = transformers.Trainer(
model=model,
args=transformers.TrainingArguments(

bf16=True
)
)

You learned about quantization in Chapter 3 as a method to reduce the


memory required to store the weights of your model by reducing their
precision from 32-bit floating point to a lower precision representation. With
QLoRA, a new data type called NumericFloat4 (nf4) is introduced for 4-bit
quantization targeting every layer in the transformer architecture as shown in
Figure 6.n. QLoRA also brings in the concept of double quantization to
further reduce the memory footprint required for fine-tuning by performing
quantization on the quantized constants. Finally, QLoRA provides
optimizations that avoid typical gradient checkpointing memory spikes
through the use of paged optimizers using unified memory management.
In this section, you learned about a PEFT technique called LoRA that uses
learned low rank matrices to adapt your models to a task. QLoRA was also
discussed as a variation of LoRA that aims to further reduce memory through
quantization. LoRA provides efficiencies in the resources required to fine-
tune for specific tasks; however, in the next section you’ll explore
considerations for model performance when using LoRA in comparison to
full fine-tuning. .

Performance comparison using LoRA vs. Full


Fine-Tuning
Let’s use the ROUGE metric you learned about in a previous chapter to
compare the performance of a LoRA fine-tuned model to both an original
base model and a full fine-tuned version.
Table 6.n summarizes the performance comparison between fine-tuning the
generative model for dialog summarization. For this, the baseline score
represents the performance of the pre-trained model and the dialogsum
dataset. A higher number indicates better performance for this metric.
T
a
b
l
e
5
-
2
.
S
a
m
p
l
e
R
O
U
G
E
m
e
t
r
i
c
s
f
o
r
f
u
ll
fi
n
e
-
t
u
n
i
n
g
v
s
.
L
o
R
A
fi
n
e
-
t
u
n
i
n
g

Full fine-tune LoRA fine tune


Base model (approx +80%) (approx -3%)

rouge1 0.2334 0.4216 0.4081


rouge2 0.0760 0.1804 0.1633

rougeL 0.2014 0.3384 0.3251

rougeLsum 0.2015 0.3384 0.3249

As you can see, the scores are fairly low for the base model, then get better
when performing full fine-tuning by updating all of the model parameters.
The metric drops a bit when using LoRA-based parameter-efficient fine-
tuning. However, using LoRA for fine-tuning trained a much smaller number
of parameters than full fine-tuning using significantly less compute, in this
case 1.4% so this small trade off in performance may well be worth it.
Choosing the optimal value for rank of the LoRA matrices is still an active
area of research and you will need to experiment to determine the right level
that meets your performance and resource requirements. In general, the
smaller the rank, the smaller the number of trainable parameters, and the
bigger the savings on compute. However, there are potential issues related to
model performance to consider.
In the original LoRA research paper, “LoRA: Low-rank Adaptation of Large
Language Models” (Hu, 2021), researchers at Microsoft explored how
different choices of rank impacted the model performance on language
generation tasks. In general, they found that the effectiveness of a higher rank
setting appears to plateau when setting the rank greater than a value of 16.
This means that setting the rank between 4 and 16 can often provide you with
a good trade-off between reducing the number of trainable parameters while
still preserving acceptable levels of model performance.
In the previous sections, you learned about LoRA which is a PEFT technique
that falls into the reparameterization category. In the next section, you’ll learn
more about a technique that falls within the additive category of techniques.
These techniques focus on adding trainable layers or parameters to the
model.

Prompt tuning and soft prompts


Prompt tuning is a collection of techniques that falls into the additive
category of PEFT techniques. While reparameterization techniques aim to
fine tune by reparameterizing model weights using low rank representations,
additive techniques focus on adding layers or parameters. This section will
cover a technique that falls into the sub-category of soft prompts called
prompt tuning.
It’s important to keep in mind that prompt tuning is different from prompt
engineering which you learned about in Chapter 4. Prompt engineering
requires you to work with the language of your prompt to get the intended
completion. This can be time consuming and require a lot of human effort to
perform effectively. There are also limitations in prompt engineering related
to the maximum length of the context window for your chosen model.
Conversely, prompt tuning focuses on adding additional, trainable tokens to
your input prompt. The optimal value of these tokens is then learned during
the supervised learning process.
Traditional prompt engineering utilizes what is known as “hard” prompts, or
prompts that represent natural language and correspond to a fixed location in
the embedding vector space. Prompt tuning relies on “soft” prompts which
are often referred to as virtual tokens because they can represent any value
within the continuous multi-dimensional embedding space. These soft
prompts are added to your prompt, as shown in figure 6.n, then learned
through supervised learning creating non-discrete values that aim to
maximize task level performance.
Figure 5-9. Soft prompts are learned in an attempt to maximize task performance.

The soft prompts then get prepended to the embedding vectors that represent
your input text. These soft prompt vectors have the same length as the
embedding vectors representing the language token. Research has shown that
somewhere between 20 and 100 virtual tokens can be enough to achieve
good performance for your task. While the tokens from the hard prompt are
specifically related to the input text, these virtual tokens with the trainable
soft prompts do not directly represent discrete text.
Prompt tuning falls into the additive category of PEFT fine-tuning methods
because you are adding soft prompts while the weights of the underlying
large language models remain frozen. The embedding vectors of the soft
prompt then get updated over time to optimize the model’s ability to
accurately complete the prompt. Because you are only tuning a set of soft
prompts, this is a parameter efficient tuning strategy over full fine-tuning of a
model. Also, similar to LoRA you can train a different set of task level
prompts and swap those out during inference. To do this, you prepend your
input prompt with those learned tokens, soft prompts, specific to your task.
Prompt tuning performance varies and research has shown the prompt tuning
may not perform as well as full fine-tuning for smaller LLMs but as the
model size increases the performance of prompt tuning tends to improve. As
an example, research (Lester, 2021) has shown equivalent performance to
full fine-tuning for some models that have 10 billion parameters using the
SuperGLUE evaluation benchmark. However, the primary challenge with
prompt tuning tends to be the interpretability because these learned virtual
tokens can take on any value within that continuous embedding vector space
and they do not necessarily correspond to any known token or discrete
language in the vocabulary of the LLM. Typically these soft prompts form
tight semantic clusters, based on analysis of the nearest neighbor tokens to the
soft prompt locations, meaning the words closest to these soft prompt tokens
have similar meanings. This suggests that the tokens are learned based on
word representations.

Summary
In this chapter, you explored LoRA which uses rank decomposition matrices
to update the model parameters in an efficient way. With LoRA, the goal is to
find an efficient way to update the weights of the model, without having to
train every single parameter again.
LoRA is a powerful fine-tuning method that achieves great performance.
Because LoRA reduces the amount of resources needed to fine tune your
models relative to full fine-tuning, it is used widely in practice for many
tasks and use cases. The principles behind this method are useful not just for
training generative language models, but also for other types of models
including image and video.
QLoRA is a variant of LoRA that uses quantization, a new data type called
NumericFloat4 (nf4), and targets more than just the attention layers of the
Transformer.
In the next chapter, you will learn a powerful technique called reinforcement
learning from human feedback (RLHF) to fine-tune your generative models to
align with human values and preferences.
Chapter 6. Fine-Tuning using
Reinforcement Learning from
Human Feedback (RLHF)

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the seventh chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

As you saw in the previous 2 chapters, fine-tuning with instructions can


improve your model’s performance and help the model to better understand
human-like prompts and generate more human-like responses. However, it
doesn’t prevent the model from generating undesired, false, and sometimes
even harmful, completions.
Undesirable output is really no surprise given that these models are trained
on vast amounts of text data from the internet which contains plenty of bad
language and even hate speech.
In this chapter, you will see examples of these models performing poorly.
You will then learn how to use reinforcement learning from human feedback,
or RLHF, to fine-tune your model to align model output with human values in
order to increase helpfulness, honesty, and harmlessness.
Helpful, Honest, and Harmless (HHH)
You could ask your model to tell you a “knock, knock'' joke. Those jokes
follow a simple, short dialogue structure of “Knock, knock. Who’s there?
Candice. Candice who? Can-this joke get any worse.” Now, if the model
responds to “knock, knock” with just “bang, bang”, this isn’t very helpful -
and likely not the best completion to your prompt.
Your model might also generate misleading or incorrect responses. Let’s say
you ask the model if shaking your head can improve your hearing. The model
may sometimes generate a confident, yet totally incorrect response such as,
“Yes! Shaking your head can improve your hearing” which is dishonest and
not scientifically proven to be true.
You also don’t want your model to generate harmful, offensive, or criminal
responses. Instead of responding as such, you can fine-tune your model to
either ignore the question or respond with a softer response, less toxic,
response that does not propagate offensiveness or encourage criminal
behavior. For example, if you ask your model how to hack into a computer
system, your model can respond with, “I am unable to answer this question
because I do not encourage criminal behavior” or equivalent.
These three principles are summarized as helpful, honesty, and harmless.
Next, you will see how to fine-tune a model using reinforcement learning
from human feedback to better align your models with human preferences -
ultimately improving their helpfulness, honesty, and harmlessness (HHH).

Reinforcement Learning from Human


Feedback (RLHF)
Reinforcement learning from human feedback, or RLHF, makes use of
reinforcement learning to further fine-tune your generative model with human
annotations - also called human feedback - to help the model generate more
useful, relevant, and less toxic completions. You could also use RLHF to
personalize models, for example, to deploy highly-personalized AI assistants
that align with the user’s sense of humor or daily routine.
It’s important to understand reinforcement learning before we dive deeper
into RLHF. A popular example of reinforcement learning is AWS DeepRacer
where a player programs a small driverless car to drive on a race track and
avoid crashing. The player competes with other drivers to complete the track
in the shortest amount of time. The player with the lowest time wins the race.
In the figure below, you see the “agent” is the car that is learning a “policy”
or model based on rewards given to the car for staying on the track and
choosing the proper actions. The player maximizes the car’s “objective” to
complete the track in the lowest amount of time and win the race.

Figure 6-1. Reinforcement learning in the context AWS DeepRacer


The “environment” includes the race track - as well as its curves and
conditions. The “state” at any moment is the car’s current position and speed
on the race track. The “action space” comprises all the possible actions a car
can choose based on the current state including steering left/right, braking
and accelerating. The agent makes decisions by following a strategy, known
as the RL policy. During the race, the agent chooses a set of actions that lead
to a win or a loss.
After each race, the agent collects an overall “reward” that affects the
agents’ actions in the next race. The goal of reinforcement learning is for the
agent to learn the optimal policy, or model, for choosing the actions for a
given environment that maximizes the rewards. This learning process is
iterative and involves trial and error. Initially, the agent takes a random
action, which leads to a new state. From this state, the agent proceeds to
explore subsequent states through further actions.
The sequence of states and actions that lead to a “reward” are often called a
“playout” in RL terms. As the agent gathers more experience through
additional playouts, it will learn to follow actions that produce a high reward
- winning the race, in this case.
In the figure below, you see the RL concepts applied to a generative model.
Here, the model is the agent. The policy consists of the model weights. The
RL algorithm will update the model weights to choose a better action, or
generate a better next-token, given the environment, state, and objective. The
“objective” is for the model to generate completions that are better-aligned
with human preferences such as helpfulness, honesty, and harmlessness
(HHH).
Figure 6-2. Reinforcement learning in the context of a generative AI model

The “action” is chosen from the “action space” consisting of all possible
tokens. Specifically, the next token is chosen based on the probability
distribution of tokens over all tokens in the model’s vocabulary. The
“environment” is the model’s context window. The “state” consists of the
tokens that are currently in the context window.
In the context of generating language, the sequence of actions and states
resulting in a reward is called a “rollout”. This is in contrast to the term,
“playout,” used in classic RL. The “reward” is based on how well the
model’s completion aligns with a human preference such as helpfulness. As
the model experiences more rollouts and rewards, it will learn to generate
tokens that produce a high reward - a more helpful completion, in this case.
Measuring helpfulness is a bit trickier than tracking a car’s time to complete
a race, but you can come close by using human annotations, or human
feedback, to train a reward model. You can then use the reward model during
the RLHF process to encourage the model to generate more human-aligned
completions by reinforcing with positive rewards. The reward model plays a
key role in RLHF as it encodes the preferences learned from human
feedback.
Over the next few sections, you will see the end-to-end RLHF process
starting with collecting human feedback to train the reward model.

Generate a Dataset for Human Feedback


The first step in fine-tuning a generative model with RLHF is to prepare a
prompt-completion dataset that will be annotated by humans. This dataset is
used to train the reward model. RLHF typically benefits from 10,000-20,000
prompt-completion pairs, so you will need a relatively large number of
human annotators to provide the human feedback and generate a proper
dataset.
10,000-20,000 pairs is actually not as difficult as it sounds. In fact, you will
typically generate 4-5 completions for the same prompt. The human will use
a service like Amazon SageMaker Ground Truth to rank the completions for
the given prompt. The ranking will depend on the human-alignment
objectives including helpfulness, honesty, and harmlessness.
In order to generate multiple completions for the same prompt, you can use
the same generative model that you are about to fine-tune with RLHF. This is
often a model that has been fine-tuned with instruction, or “instruct” model,
as shown below. You should choose a generative model capable of
performing the task you are interested in such as text summarization, question
answering, or classification.
Figure 6-3. Prepare a prompt-completion dataset for human ranking

Next, the human labelers need to provide feedback on the different


completions. This is the human feedback step in RLHF.

Collect Human Feedback Rankings to Train a


Reward Model
You will ask the human labelers to evaluate the 4-5 completions for each
prompt based on a specific set of human-alignment criteria. For example,
“Please rank the completions from the most helpful to least helpful.”, or
“Please rank the completions from the most harmless to the least harmless.”
Below is an example set of human-labeling instructions from the Scaling
Instruction-Finetuned Language Models paper.

* Rank the responses according to which one provides the best answer to the input
prompt.

* What is the best answer? Make a decision based on (a) the correctness of the
answer, and (b) the informativeness of the response. For (a) you are allowed to
search the web. Overall, use your best judgment to rank answers based on being the
most useful response, which we define as one which is at least somewhat correct, and
minimally informative about what the prompt is asking for.

* If two responses provide the same correctness and informativeness by your


judgment, and there is no clear winner, you may rank them the same, but please only
use this sparingly.

* If the answer for a given response is nonsensical, irrelevant, highly


ungrammatical/confusing, or does not clearly respond to the given prompt, label it
with “F” (for fail) rather than its rank.

* Long answers are not always the best. Answers which provide succinct, coherent
responses may be better than longer ones, if they are at least as correct and
informative.

To ensure quality labeling and feedback, make sure you provide clear
instructions to help the labelers understand their task, the human-alignment
criteria, and any edge cases. Generally, instructions should clearly describe
the task for the labeler. Typically, they are asked to rank the completions for
a given prompt according to some criteria. The more details you share, the
more likely the labeler will correctly perform the task and provide a high-
quality, human-aligned ranking dataset to train your reward model.
Note the additional guidance in the second bullet point above. The item asks
that the labelers make decisions based on their perception of the correctness
and informativeness of the response. They are allowed to use the internet to
verify the information.
They are also given clear instructions about what to do if they encounter an
edge case where 2 or more completions are equally correct and informative.
Though discouraged, the labelers are allowed to rank these equal
completions the same.
A final instruction worth calling out here is what to do in the case of a
nonsensical, confusing, or irrelevant answer. In this case labelers should
select “F” to fail the answer rather than rank it. This way, poor quality
answers can be filtered and removed from the dataset altogether.
Providing these detailed human instructions will increase the likelihood that
the responses will be high quality and that all individual humans will carry
out the task in a consistent way. This helps ensure that the humans reach a
consensus.
As a best practice, you should send the same prompt-completion sets to a
larger number of human labelers to help reduce the impact of individual
human labeling mistakes such as misreading the instructions or ranking in the
opposite order. In addition, the group of human labelers should represent a
diverse set of cultures across the world to encourage global thinking and help
reduce local bias.
Below are example rankings for the 4 generated completions to the prompt,
“Tell me a funny story.” You now want the human labelers to rank the
completions from the most helpful (1) to the least helpful (4).
Figure 6-4. Collect human feedback from human labelers

By repeating this process for many different prompts across many labelers,
you are creating a human-preference dataset that you will use to train your
RL reward model as you will see in a bit. First, you need to modify your
dataset slightly before you can train your reward model.

Prepare the Ranking Dataset


Once your human labelers have completed their rankings of the prompt-
completion pairs, you are almost ready to train your reward model. As a
reminder, the reward model encodes the human preferences learned from the
human feedback process. This reward model is typically a language-aware
binary classifier (e.g. BERT) used to assign rewards to model completions
during the RL-based, human-alignment fine-tuning process.
However, before you can train your reward model, you first need to convert
the human-labeled ranking data into a set of pairwise comparisons between
all pairs of completions. Each completion in the comparison pair is assigned
either a 0 or 1. This is used to train the binary classifier (reward model) that
will ultimatley predict the reward (0 or 1) for a generated completion.
This sounds complicated, but it is relatively straightforward. For example,
consider 3 possible completions for a given prompt named A, B, and C. For
simplicity, assume that A is the highest ranked completion and that A > B >
C. Here, the human labeler assigned rank 1 to the most-preferred completion
A. They assigned rank 2 to completion B and rank 3 to least-preferred
completion C.
You can now convert A > B > C into a set of pairwise comparisons as
follows: A > B, B > C, and A > C. Once you do this, you have generated 3
separate rows of training data as shown below. Each row has a positive-
reward completion labeled with a 1 and a negative reward completion
labeled with a 0.
Figure 6-5. Convert the labeled ranking data into reward-training data

Note that with 3 possible completions for a prompt, you will generate 3 rows
of reward-training data. The field of combinatorics dictates that, for `n`
number of possible completions, you will generate `(n choose 2)` pairwise
comparisons.
Therefore, if you had 4 possible completions, you would generate 6 pairwise
comparisons. 5 possible completions would generate 10 pairwise
comparisons, and so on. This is why reaching 10,000-20,000 rows of
reward-training data is relatively easy to reach - since each prompt actually
generates an exponential number of rows of training data!
NOTE
While thumbs-up/down human feedback is often easier to capture than multi-completion
ranking feedback, the ranking feedback gives you exponentially more data to train your
reward model as shown here.

After generating the pairwise reward-training data of 0 and 1 rewards, you


typically reorder the data so that the preferred completion is in the first
column.
This is merely a convention, but it’s important to understand this extra step as
a lot of RL-reward-training code expects the preferred completion to be `y_j`
while the non-preferred completion is `y_k` as shown below.
Figure 6-6. Move preferred completion into the `y_j` column per convention

Hopefully, this tips helped a few of you understand the already-complex RL


code a bit better! Once you have completed this data restructuring, the human
responses will finally be in the proper format to train the reward model.

Train the Reward Model


For language tasks, the reward model is typically a language model such as
BERT that has been trained as a binary classifier to predict the reward for a
given prompt-completion pair. In this case, you will train the reward model
using supervised learning with the train-reward dataset you patiently
generated and engineered above.
Below you see that, for a given prompt `x`, the reward model learns to favor
the human-preferred completion, `yj`, while minimizing the log sigmoid of the
reward difference, `rj` minus `rk`. It’s worth noting again that the human-
preferred option is, by convention, always the completion labeled `yj`.

Figure 6-7. Train model to predict the preferred completion yj from {yj, yk } for prompt `x`

The reward model is trained as a binary classifier to predict the probability


distribution across the positive (preferred) and negative (non-preferred)
classes for a given prompt and completion pair. In practice, you may choose
to use the logits across the 2 classes predicted by the binary classifier. The
difference is just that the probability distribution is just the `softmax` across
the 2 logits to normalize the values between 0 and 1 where the aggregate of
the 2 probabilities is 1.0 (e.g. 100%).
Below is an example of a reward model that predicts if the given text is “not
hate” or “hate”. This reward model is used during the RL-based fine-tuning
phase to help reduce the toxicity of your models’ generated completions.

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name,
device_map="auto")
toxicity_model =
AutoModelForSequenceClassification.from_pretrained(toxicity_model_name,
device_map="auto")

toxic_text = "You are disgusting and terrible and i dang hate you"

toxicity_input_ids = tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]


probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!


nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')

Output:

logits [not hate, hate]: [-2.061086416244507, 1.5835624933242798]


probabilities [not hate, hate]: [0.025465160608291626, 0.9745348691940308]
reward (low): [-2.061086416244507]

In this case, you want to optimize the model to generate completions that,
along with the prompt, will classify as “not hate”.
NOTE
The reason to include the prompt is that the toxicity of the completion may change given
the context of the prompt.

To be more concrete, during the RL process, the model will generate a


completion for a given prompt. Both the prompt and completion are passed to
the reward model to predict “not hate” or “hate”. The logit value of the
positive class (“not hate” in this case) is the actual reward value. And just to
remind you, if you apply a softmax function to the logits, you will get the
probabilities.
The first example in the figure below demonstrates a positive logit reward
for a non-toxic prompt-completion pair. The second example shows a
negative reward for a toxic completion.
Figure 6-8. Use reward model to calculate a reward for each prompt-completion pair relative to
the positive class

With your reward model trained and working as expected, you can now
explore how to use the model in the broader reinforcement learning process
to fine-tune your generative model for better human-alignment.

Use the Reward Model with Reinforcement


Learning
Reinforcement learning uses the reward model to modify the underlying
weights of your generative model to align with the human preferences
captured by the reward model. As mentioned earlier, the generative model is
typically a model that has been fine-tuned with instruction called an
“instruct” model - often a multi-task instruction variant such as FLAN-T5,
Falcon-instruct, or Llama-chat.
Consider sending the prompt, “Does Antje like cats?” to a question-answer
instruct model. Before fine-tuning the instruct model with RLHF to reduce
toxicity, the model may generate “Antje hates cats.” The goal is to reduce the
toxicity of the model’s response using RLHF as shown below. A softer, less-
toxic completion might be, “Antje does not like cats. She might be allergic.”

Figure 6-9. Use the reward model with reinforcement learning


To fine-tune the generative model to generate less toxic responses, you can
use RLHF to reward less-toxic responses and penalize toxic responses. First,
the prompt is passed to the generative model which produces a completion.
The prompt-completion pair is then passed to the reward model to provide a
logit for the optimal “not hate” class.
You can use PEFT, which you learned in Chapter 6, within RLHF to reduce
the amount of compute and memory resources required for the compute-
intensive PPO algorithm as shown below. Specifically, you would only need
to update the model’s much-smaller PEFT adapter weights and not the full
weights of the tunable model.

Figure 6-10. Using PEFT within RLHF to minimize the resources needed to fine-tune the
generative model
A single iteration of the RLHF process ends with updating the generative
models’ weights. The iterations continue for a given number of steps and
epochs similar to other types of model training and fine-tuning. After a while,
the generative model should start to receive higher rewards as it produces
less toxic completions. This process continues until the model is considered
aligned based on an evaluation threshold such as toxicity score - or until the
maximum number of iterations is reached.
The fine-tuned, human-aligned model is then ready for evaluation and,
depending on the evaluation result, ready for production deployment.

Evaluate RLHF Fine-Tuned Model


Continuing with the toxicity example, you can evaluate your potentially
human-aligned model using a toxicity evaluation score. This is the
probability of the negative class, “hate”, in this case, averaged across all
prompt-completion pairs. If RLHF has successfully reduced the toxicity of
your generative model, the toxicity score should decrease relative to the
baseline as shown below.
Figure 6-11. Evaluate using the toxicity score

You’ll need to create a baseline toxicity score for the original, instruct model
by evaluating its prompt-completion pairs with a model that can assess toxic
language. This is also a binary classifier capable of detecting toxic language.
You can use this classifier model to determine if your generative model is
below a desired toxicity threshold - or at least decreased relative to the
baseline toxicity score from the original instruct model as you see in the code
samples below.
Here, you define the `evaluate_toxicity` function to calculate the mean and
standard deviation across all prompt-completion pairs for the provided
dataset using the `toxicity_evaluator` defined below.
def evaluate_toxicity(model,
toxicity_evaluator,
tokenizer,
dataset,
num_samples):

max_new_tokens=100

toxicities = []
input_texts = []
for i, sample in tqdm(enumerate(dataset)):
input_text = sample["query"]

if i > num_samples:
break

input_ids = tokenizer(input_text, return_tensors="pt",


padding=True).input_ids

generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
tok_k=0.0,
top_p=1.0,
do_sample=True)

response_token_ids = model.generate(input_ids=input_ids,
generation_config=generation_config)

generated_text = tokenizer.decode(response_token_ids[0],
skip_special_tokens=True)

toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " +


generated_text)])

toxicities.extend(toxicity_score["toxicity"])

# Compute mean & std using np.


mean = np.mean(toxicities)
std = np.std(toxicities)

return mean, std

Next, we load the `toxicity_evaluator` using the HuggingFace `evaluate`


Python library - as well as a Facebook BERT-based “hatespeech” model.
Since this model is a binary classifier, we need to define the `toxic_label`
which is `hate` in this case.
from transformers import AutoTokenizer
import evaluate

tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

toxicity_evaluator = evaluate.load(
"toxicity",
"facebook/roberta-hate-speech-dynabench-r4-target",
module_type="measurement",
toxic_label="hate")

Next, you calculate a toxicity baseline using the `toxicity_evaluator` on the


original generative model before performing RLHF. After performing RLHF,
you measure the toxicity score again using the `toxicity_evaluator`, but this
time on the generative model after performing RLHF. You see a drop in
toxicity which is the desired result.

mean_before_detoxification, std_before_detoxification =
evaluate_toxicity(model=model_before_rlhf,

toxicity_evaluator=toxicity_evaluator,
tokenizer=tokenizer,
dataset=dataset["test"],
num_samples=10)
print(f'toxicity [mean, std] before detox: [{mean_before_detoxification},
{std_before_detoxification}]')

#
# Perform RLHF here…
#

mean_after_detoxification, std_after_detoxification =
evaluate_toxicity(model=model_after_rlhf,
toxicity_evaluator=toxicity_evaluator,
tokenizer=tokenizer,
dataset=dataset["test"],
num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification},
{std_after_detoxification}]')

# Calculate improvement
mean_improvement = (mean_before_detoxification - mean_after_detoxification) /
mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) /
std_before_detoxification
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Output:

toxicity [mean, std] before detox: [0.032297799189109355, 0.03010236943945737]


toxicity [mean, std] after detox: [0.0271528000858697, 0.02743170674039297]

Percentage improvement of toxicity score after detoxification:


mean: 15.93%
std: 8.87%

Mitigate Reward Hacking


As with any reward-based system, there exists a tendency to ignore
constraints and “hack the reward”. This is true for reinforcement learning, as
well, in which the agent may learn to cheat and maximize the reward even if
the chosen actions lead to an incorrect state.
For example, a generative model may learn to produce nonsensical,
grammatically-incorrect sequences of tokens that maximize the reward (e.g.
low toxicity), but do not respect the learnings of the original language model
- or, at the extreme, diverge from human language completely.
A common technique to avoid reward hacking is to first make a copy of the
original instruct model before performing any reinforcement learning or
weight updates. You then freeze the weights of this copied model and use it
as an immutable “reference model”. During RLHF, every prompt is
completed by both the frozen reference model and the model you are trying to
fine-tune with RLHF.
Next, the 2 completions are compared to determine the statistical distance
between the 2 token-probability distributions. This distance is calculated
using Kullback–Leibler divergence, or KL divergence - a well-known and
widely implemented algorithm as shown below.
Figure 6-12. Mitigate rewards hacking with a KL divergence reward penalty

KL divergence quantifies how much the mutable, RLHF-tuned is generating


completions that diverge too far from the completions generated by the
immutable reference model. In short, if the fine-tuned model starts hacking
the reward and generating sequences of tokens that diverge too far from
sequences that the reference model would generate, the fine-tuned model is
penalized by the RL algorithm through a lower reward.
NOTE
RLHF and KL divergence are extremely compute-intensive processes that benefit greatly
from accelerators like the Nvidia GPU - as well as the purpose-built AWS Trainium and
Inferentia chips available through Amazon EC2 and SageMaker.

The details of KL divergence and reward-penalties are typically contained in


the RL libraries, so you typically don’t need to implement this type of
complexity yourself. However, it does help to understand reward-hacking -
as well as the techniques and extra computations required to control it.
Below is the code to configure the `PPOTrainer` class from the TRL library
with the frozen reference to help avoid reward hacking. Also shown is the
PPO iteration `step` function that updates the weights of the tunable model
using the reward from the reward model. Remember that this reward may be
penalized depending on the KL divergence calculated within the
`PPOTrainer` implementation.

from trl import PPOTrainer


from trl import AutoModelForCausalLMWithValueHead
from trl import create_reference_model
model = AutoModelForCausalLMWithValueHead \
.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
ref_model = create_reference_model(model)
ppo_trainer = PPOTrainer(
model=model, # tunable model
ref_model=ref_model, # frozen reference model
tokenizer=tokenizer,
dataset=dataset)
#
# For each prompt-completion pair, calculate the reward from the reward model
#
# Run PPO step with the prompt, completion, and reward scores
ppo_trainer.step(prompt, completion, reward_scores)

Summary
Fine-tuning your generative models for human values is a very important tool
in your generative toolbox to improve your model’s helpfulness, honesty, and
harmlessness (HHH). RLHF is a very active area of research with a great
amount of impact on making these models more human-like, useful, and
enjoyable. In this chapter, you learned the fundamentals of RLHF which will
help you understand the field as it evolves further.
Specifically, you saw how to collect human feedback including prompt-
completion rankings, process the feedback in order to train a reward model,
and ultimately use reinforcement learning to reduce a generative model’s
toxicity.
Now that you have a human-aligned, reduced toxicity model, you will see
how to prepare it for low-latency, high-performance inference in the next
chapter.
Chapter 7. Optimize Generative
AI Models for Inference

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the eighth chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

After you have adapted your model to your target task, you will ultimately
want to deploy your model so you can begin interacting with it as well as
potentially integrate the model into an application that is designed to
consume the model.
When you are ready to deploy your LLMs you need to understand the
resources your model may need as well as the intended experience for
interacting with the model. When considering the resources your model will
need this will include identifying requirements such as how fast do you need
your model to generate completions, what compute budget do you have
available, and what trade-offs are you willing to make regarding model
performance to be able to achieve faster inference speed as well as
potentially reduce storage costs.
In this chapter, you will explore various techniques to perform post-training
optimizations on your model including pruning, quantization, and distillation.
There are also additional considerations and potential tuning of your
deployment configurations that will need to be done post deployment as well
such as selecting the optimal compute resources to balance cost and
performance.
There are also broader considerations for building generative AI
applications which you’ll learn more about in Chapters 9 and 10. However,
this chapter will focus on techniques that are specifically aimed at optimizing
your model for inference to improve your end user’s experience.

Model optimizations to improve application


performance
Due to the size of generative AI models, they often present a challenge for
deployment in terms of compute, storage, and memory requirements as well
as being able to ensure low latency completions. One of the primary ways to
optimize for deployment is to take advantage of techniques that aim to reduce
the size of the model, typically referred to as model compression. Reducing
the model size allows for quicker loading of the model, reduced latency as
well as reduces the resource requirements for compute, storage, and memory.
While reducing model size helps optimize the model for deployment, the
challenge is reducing the model size while still maintaining good model
performance. As a result, there can be a trade-off to consider between model
performance, compute budget, and latency.
This section will outline three techniques aimed at reducing model size
including distillation, post-training quantization, and model pruning as shown
in Figure 8.n.
Figure 7-1. Techniques aimed at reducing model size for deployment optimization

Pruning is a technique that focuses on removing redundant, or low impact,


parameters that do not contribute, or contribute little, to the model’s
performance. Pruning reduces the size of the model, but also increases
performance by reducing the number of computations during inference.
Quantization, a technique that you saw in Chapter 4, converts a models’
weights from high-precision (e.g. 32-bit) to lower precision (e.g. 16-bit).
This not only reduces the model’s memory footprint, but also improves
model performance by working with smaller number representations. With
large generative models, it’s common to reduce the precision further to 8-bits
to increase inference performance.
Distillation trains a smaller student model from a larger teacher model. The
smaller model is then used for inference to reduce your compute resources,
yet retain a high percentage of accuracy of your student model. A popular
distilled student model is DistilBERT from HuggingFace trained from the
larger BERT teacher model. DistilBERT is an order of magnitude smaller
than BERT, yet retains approximately 97% of the accuracy of the original
BERT model. See Data Science on AWS from O’Reilly for a deep dive on
BERT and DistilBERT.
The following sections will discuss each of the techniques above in more
detail.

Pruning
Pruning aims to eliminate model weights that are not contributing
significantly to the model’s overall performance. By eliminating those model
weights, you’re able to reduce the model size for inference which reduces the
compute resources required.
Figure 7-2. Pruning aims to reduce the overall model size by eliminating weights that are not
contributing to model performance

The model weights to be eliminated during pruning are those with a value of
zero or a value very close to zero. There are various methods to perform
pruning with some requiring full re-training, some that prune utilizing a
Parameter Efficient Fine-Tuning (PEFT) technique like those discussed in
Chapter 6, and then others that focus on post-training pruning. Pruning during
training is accomplished through unstructured pruning, removing weights, or
structured pruning, removing entire columns or rows of the weight matrices.
Both approaches require retraining; however, there are post training pruning
methods typically referred to as one-shot pruning methods that can do pruning
without retraining. The challenge in performing one-shot pruning on large
language models is compute intensive for models with billions of
parameters. One method of post-training pruning, called SparseGPT (Frantar,
2023), is introduced in a research paper to overcome the challenges of one-
shot pruning on large language models. This method is specifically built for
GPT foundation models and introduces an algorithm that performs sparse
regression at a large scale.
In theory, pruning reduces the size of the LLM which reduces compute
resources and model latency. However, in practice there are LLMs where
only a small percentage of their weights are zero so in those cases pruning
may not have a large impact on the model size.

Quantization
Quantization aims to transform a model’s weights to a lower precision
representation with the goal of reducing the model’s size as well as compute
requirements for hosting LLMs. Chapter 4 detailed a method of quantization
that can be performed during training called quantization-aware training
(QAT). In this section, the focus will be on a second quantization method that
is performed post-training to optimize for deployment specifically called
post-training quantization (PTQ), in this context.
Most models are built with 32-bit precision and doing matrix multiplication
on 32-bit can be resource intensive. PTQ transforms a models’ weights to a
lower precision representation, such as 16-bit floating point or 8-bit integer,
to reduce the model’s size and memory footprint which translates into less
compute resources that are needed for serving your model. Figure 8.n shows
a model quantized from 32-bit floating point to 16-bit floating point. At a
high level, you can estimate that going from a 32-bit to 16-bit representation
will result in a model that is approximately half of the size of the original.
Figure 7-3. Reduce memory footprint and inference latency with post-training quantization

PTQ can be applied to just the model weights and/or activation layers. Also,
quantization approaches that include both the activations as well as the
model weights have a higher impact on model performance. Quantization
applied to both the model weights and activations is typically referred to as
dynamic range quantization. This method requires an extra calibration step to
statistically capture the “dynamic range” of the original parameter values as
shown in Figure 8.n. The calibration step involves using a small unlabeled
dataset to identify the dynamic range, more specifically the minimum and
maximum, of values on input.
Figure 7-4. Dynamic range post training quantization requires an extra calibration step

Similar to other model compression methods, there are trade-offs between


inference performance and model performance. Quantization is a common
option to be able to reduce model size thereby reducing the compute
resources required for inference and improving latency; however, sometimes
quantization also results in reduced model performance. This is often a small
percentage reduction which can be worth the cost savings and performance
gains but you should benchmark quantization results to evaluate if the
potential trade-off in model accuracy is acceptable for your use case.

Distillation
Distillation is a technique that helps reduce the model size which ultimately
reduces the number of computations and improves model inference
performance. Distillation uses statistical methods to train a smaller student
model on a larger teacher model. The end result is a student model that
retains a high percentage of the teacher’s model accuracy, but uses a much
smaller number of parameters. The student model is then deployed for
inference. The smaller model requires smaller hardware and therefore less
cost per inference request.
The teacher model is often a generative foundation model - or a fine-tuned
variant. During the distillation training process, the student model learns to
statistically replicate the behavior of the teacher model. The teacher model
weights do not change during the distillation process - only the student model
weights change. The teacher model’s output is used to “distill” knowledge to
the student model.
Both the teacher and student models generate completions from a prompt-
based training dataset. A distillation loss is calculated by comparing the 2
completions. The loss is then minimized during the distillation process using
backpropagation to improve the student model’s ability to match the teacher
model’s predicted next-token probability distribution.
The teacher models’ predicted tokens are known as “soft labels” while the
student models’ predicted tokens are called “soft predictions”. In parallel,
you need to compare the student models’ predictions called the “hard
predictions” against the ground truth “hard labels” from the prompt dataset.
The difference is the “student loss”. The distillation loss and student loss are
combined and used to update the student models’ weights using standard
backpropagation.
Figure 7-5. Minimize the combination of distillation loss and student loss are used to distill
information from teacher to student

NOTE
In practice, distillation may not be as effective for generative decoder models as it is for
encoder models like BERT. This is because the output space is relatively large for decoder
models (e.g. vocabular size of 100,000 tokens) without a lot of redundancy in
representation.

Next, you will see how to deploy and your generative model into production
using Amazon SageMaker.
Deploying Generative AI Models on AWS
When you’re ready to deploy your generative model, you can use Amazon
SageMaker Endpoints to host and scale your models. Below is the code to
create the Amazon SageMaker Endpoint including an
`EndpointConfiguration` which includes the hardware using `InstanceType`
and `InitialInstanceCount`. In this case, you’re deploying 2 different variants
of your model in an A/B test across 10 GPU-based SageMaker instances.
50% of your traffic will go to `model_variant_1` and 50% will go to
`model_variant_2`.

import boto3

sm = boto3.Session().client(
service_name="sagemaker")

model_variant_1 = "generative-model-1"
model_variant_2 = "generative-model-2"
endpoint_config = "generative-endpoint-config"

sm.create_endpoint_config(
EndpointConfigName=endpoint_config,
ProductionVariants=[
{
"ModelName": "generative-model",
"VariantName": "generative-model-1",
"InstanceType": "ml.g5.12xlarge",
"InitialVariantWeight": 5,
"InitialInstanceCount": 5
},
{
"ModelName": "generative-model",
"VariantName": "generative-model-2",
"InstanceType": "ml.g5.12xlarge",
"InitialVariantWeight": 50,
"InitialInstanceCount": 5
}

]
)

endpoint = “generative-endpoint”

sm.create_endpoint(
EndpointName=endpoint,
EndpointConfigName=endpoint_config
)

waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

Here, you are comparing the 2 variants and, at some point, will likely send
100% traffic to the better model based on some evaluation criteria or longer-
term objective such as increasing revenue or reducing churn.
Additionally, SageMaker Endpoints support advanced deployment strategies
including shadow deployments. In contrast to an A/B deployment, a shadow
deployment puts a model into production where it accepts the same input as
the model being shadowed, but it simply stores the model response to disk
for offline analysis. This helps you conservatively evaluate a model against
live production inputs without exposing potentially bad responses to the end
user.
Once the model is deployed as a SageMaker endpoint, you can generate
completions for your prompts using the following code:

import json
from sagemaker import Predictor

prompt = """
Summarize the following conversation.

#Person1#: Tom, I've got good news for you.


#Person2#: What is it?
#Person1#: Haven't you heard that your novel has won The Nobel Prize?
#Person2#: Really? I can't believe it. It's like a dream come true. I never expected
that I would win The Nobel Prize!
#Person1#: You did a good job. I'm extremely proud of you.
#Person2#: Thanks for the compliment.
#Person1#: You certainly deserve it. Let's celebrate!

Summary:
"""

predictor = Predictor(
endpoint_name=endpoint_name,
sagemaker_session=sess,
)

response = predictor.predict(zero_shot_prompt,
{
"ContentType": "application/x-text",
"Accept": "application/json",
}
)
response_json = json.loads(response.decode('utf-8'))
print(response_json)

Summary
In this chapter, you saw powerful techniques to optimize your model for
inference by reducing the size of the model through distillation, quantization,
or pruning. These techniques help reduce model size and improve model
inference performance with minimal impact on model accuracy - ultimately
improving the user’s happiness. They also help to minimize the amount of
hardware resources needed to serve your generative models in production -
ultimately lowering cost and improving your CFO’s happiness!
In the next chapter, you will explore some popular mechanisms to augment
the capabilities of your generative models using external data sources and
APIs.
Chapter 8. Retrieval Augmented
Generation (RAG) and
Information Retrieval

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the ninth chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

In this chapter, you’ll learn about a method, Retrieval Augmented Generation


(RAG), that overcomes several challenges of LLMs including knowledge
cut-offs and having domain specific knowledge outside of the data used for
training. RAG allows you to overcome these limitations for foundation
models as well as fine-tuned models by providing access to external data
sources that can be used to augment the prompt with relevant contextual
information. This method has grown rapidly in popularity due to the
effectiveness in incorporating additional external data sources without
continued fine-tuning which can be both impractical and cost prohibitive.
You’ll learn more about the knowledge limitations in LLMs that RAG
addresses, the framework that RAG proposes as well as specific examples
of using RAG for accessing external data sources, specifically focusing on
document retrieval within the detailed examples. This chapter will also
introduce a common framework that makes it easier to implement various
augmentation techniques known as LangChain.

LLM Knowledge Limitations


Large language models suffer from several challenges related to having
accurate knowledge as well as current knowledge. This section discusses
two common problems with large language models that can be improved
using RAG methods known as hallucination and knowledge cut-off.
Chapter 2 discussed the challenge of hallucination, shown in Figure 9.n,
where a model confidentantly returns an incorrect response. In this case a
‘snazzy-fluffikens’ is not a real dog breed but the model still returns a
factitious, and potentially misleading, completion.
Figure 8-1. Knowledge limitations in LLMs - hallucinations

One thing to keep in mind is there are foundation models, especially some of
the more current models, that include additional built-in guardrails to avoid
hallucinations. For example, asking Anthropic’s Claude V2 model the same
question results in the model acknowledging it does not have information
about ‘snazzy-fluffikens’ as a dog breed.
Figure 8-2. LLMs can use additional built-in guardrails to avoid hallucinations

The second common issue, shown in Figure 9.n, is known as knowledge cut-
off which results in the model returning an answer that is out of date with
current data. All foundation models have a knowledge cut-off which refers to
the date they were trained. This is important because the knowledge of the
model is limited to the data that was current at the time it was trained. For
example, if you ask the model who recently won the NBA championship, it
will give you the most recent information it has available, in this case the
champions in 2021. However, it won’t provide the most current data
available because it’s outside the current knowledge the model was trained
on.
Figure 8-3. Knowledge limitations in LLMs - knowledge cut-off

RAG provides a technique that allows you to mitigate some of the challenges
with hallucinations and knowledge cut-off in foundation models. For
hallucinations, RAG is useful because you are able to provide the model
with access to information it would not already have, such as proprietary
data for your business. For knowledge cut-offs, RAG allows you to provide
access to current information beyond the model’s training date. This
technique has gained a lot of traction to be able to augment foundation
models with additional information, including domain specific information,
without the need to continuously perform full fine-tuning. The next section
will discuss RAG in more detail.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) isn’t a specific set of technologies
but rather a framework for providing LLMs access to data they did not see
during training. A number of different implementations exist, and the one you
choose will depend on the details of your task and the format of the data you
have to work with. In general, RAG allows applications backed by large
language models to make use of external data sources and applications to
overcome some of the knowledge limitations previously discussed.
RAG works by providing your model access to additional external data at
run-time. This data can be from a number of data sources including
knowledge bases, document stores, databases, as well as data that is
searchable through the internet as shown in Figure 9.n. .
Figure 8-4. External data sources

RAG is useful in any case where you want the language model to have access
to additional data that is not contained within its knowledge base. This could
be data not in the original training data or proprietary information stored in
your organization’s internal data stores. Allowing your model to have access
to this information helps improve both the relevance as well as the accuracy
of completions.
As previously mentioned RAG is a framework where a number of
implementations exist. Figure 9.n outlines a high level pattern and detailing
the concepts of a retriever and a generator. These concepts were introduced
in the ‘Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
‘ research paper (Lewis, 2020) published in 2020. At a high level, the
prompt is augmented with additional information from external data sources
to improve the accuracy and currency of the completion. At a high level, the
retriever is responsible for searching through external information sources to
find relevant text while the generator uses the retrieved documents, along
with the initial prompt, to generate a completion.

Figure 8-5. RAG integrates with external information sources to augment prompt input

In the implementation shown, the retriever consists of a query encoder and an


external data source. The encoder is responsible for taking the user’s input
prompt and encoding it into a form that can be used to query the external data
source. To perform this task, the data must be available in a format that
allows for easy retrieval of the most relevant texts.
To illustrate this, we’ll focus specifically on information retrieval from
documents. A common implementation for document search includes storing
your documents in a vector store, or more specifically a vector database,
where each document is indexed based on an embedding vector produced by
an embedding model. The vector embedding includes the numeric
representations of text data within your documents. Each embedding aims to
capture the semantic or contextual meaning of the data. Each vector
embedding is put into a vector database often with additional metadata such
as a reference to the original content the embedding was created from. The
vector database then indexes the vectors which can be done using a variety of
approaches. This indexing allows for quick retrieval of documents. The
vector database, shown in figure 9.n, is then used within the prompt
workflow to efficiently retrieve external information based on an input query.
Figure 8-6. Efficient indexing of documents for quick retrieval

There are multiple implementation options for vector databases including


open source options as well as managed offerings. On AWS, there are a
variety of services that can be used to support your vector database
requirements including Amazon OpenSearch Service, as well as the support
for pgvector included in Amazon Aurora PostgreSQL and Amazon Relational
Database Services (RDS) for PostgreSQL. OpenSearch Service provides
support for multiple engines allowing you to optimize the indexing and
querying of your vector database for specific use cases. Another option on
AWS is to utilize pgvector support within the PostgreSQL options for Aurora
and RDS to store and search vector embeddings.
One thing to keep in mind with document retrieval is that often these
documents are large in nature so a technique called chunking, shown in
Figure 9.n, is typically used in building document indexes, as well as
searching which is covered later in the section. The goal of chunking is to
improve the relevance of a document for a given query by reducing noise, or
information that is not relevant. The chunks should contain information that is
semantically related and has meaningful context in that single chunk.

Figure 8-7. Chunking is often used to reduce noise and improve relevancy

Determining the optimal chunk strategy can improve your search results.
There are a few considerations to keep in mind when determining the best
chunking strategy. First, consider the size of your indexed content whether it’s
long documents such as books or shorter content like product reviews.
Chunking smaller content may not have much impact with chunking larger
documents is not only necessary but also improves the ability to search for
similar relevant information related to a search. Second, chunking may be
required depending on how your results will be used due to context window
limits discussed later in this section. Third, the model you choose may also
have guidance around the optimal chunk size. Finally, there is a concept
called overlap when implementing chunking which refers to the overlap of a
defined amount of text between chunks. Overlap can help preserve context
between chunks so it’s another parameter to experiment with your chunking
implementation strategies.
Once the documents have been indexed, they can be used to retrieve
information using RAG based approaches as part of your prompt workflows.
That workflow starts with an input user prompt, as shown in Figure 9.n. In
this case the prompt is asking ‘What group is responsible for maintenance
on product FlashTag?’. That prompt input will utilize the same query
encoder and embedding model process to create vector embedding
representations of the prompt input as shown in Figure 9.n Those embeddings
are then used to query the database for similar vector embeddings to return
the most relevant text as the query result.
Figure 8-8. Information retrieval based on prompt input

One thing to consider as part of the information retrieval is the different


search methods and similarity metrics that may be adjusted depending on
your use case. There are also potential post processing tasks you might do
such as not only retrieving relevant information but also performing re-
ranking as well.
The query result is then combined with the original prompt to create an
augmented prompt as shown in Figure 9.n. This augmented prompt now has
contextual information specific to the indexed documents as well as the
original prompt. Because the documents are domain-specific, and not within
the LLMs training corpus, this method allows you to provide additional
context to the model that is otherwise unknown. Depending on the model, if
you were to ask the LLM this question directly it may say it does not know
but it may also return fictitious information (hallucination). However, using
this method we are able to provide context to the model outside its scope of
knowledge.

Figure 8-9. Using the augmented prompt to generate the completion using additional context

The new expanded prompt that now contains information about the product
team is then passed to the LLM. The model uses the information in the context
of the prompt to generate a completion that contains the correct answer.
In this section, an example of using RAG specifically for accessing external
information was demonstrated to help understand the fundamental concepts
behind RAG. RAG is a powerful technique to incorporate data outside the
context of the model’s knowledge without needing to perform full fine-tuning
to incorporate that data. RAG architectures can be used to integrate multiple
types of external information sources . You can augment large language
models with access to local documents , including private wikis and
applications . RAG can also allow for access to the internet to retrieve
information posted on web pages, for example wikipedia. By encoding the
user input prompt as a SQL query, RAG can also interact with databases.
However, there are some things to consider with RAG. First, there is
potential for added latency due to the additional API calls and knowledge
retrieval from external memory. Second, there are also considerations around
the size of the context window. Every LLM has what is known as a context
window, which was covered in Chapter 2. The size of the context windows
varies with each LLM but most text sources are too long to fit into the limited
context window of the model, which is still at most just a few thousand
tokens for many models. In these cases, the external data sources are parsed
into many chunks, each of which will fit in the context window as shown in
Figure 9.n. There are libraries and frameworks, such as LangChain, that can
handle this work for you using different strategies and many of the latest
models also continue to increase the size of the context window.
Figure 8-10. Data processing long documents into multiple small chunks

In the next section, building applications that implement and orchestrate the
various tasks in a RAG workflow are explored in greater details.

Implementation & Orchestration of RAG


In the previous section, RAG was explored as a framework for augmenting a
model and a specific implementation of using RAG was detailed showing a
technique to augment a model using data accessed through external
information sources specifically focusing on document retrieval. There are
multiple steps in a typical RAG workflow from accepting the user input
prompt, to performing the query encoding, calling the vector database,
potentially breaking your document into chunks, combining the augmented
prompt(s), calling the large language model, and returning the completions
back for the user or application. This all requires a solution that can
orchestrate and perform tasks within the workflow as shown in Figure 9n.

Figure 8-11. Orchestrating RAG workflows

Luckily, there are frameworks developed that take some of the heavy lift
away in implementing these solutions. This section will explore a popular
framework called LangChain that provides you with modular pieces that
contain the components necessary to work with large language models and
implementing techniques such as RAG. LangChain has several high level
modules including model interfaces, prompts, data connectors, chains,
memory, agents, and callbacks. These components will be discussed over the
next few chapters as they apply to each specific technique.
In the context of RAG, LangChain provides document loaders as part of the
data connector modules. These loaders provide libraries for loading data
across a variety of input formats into documents. For example, you can use
the PyPDFLoader to load and split pdf formatted documents. In the previous
section, the challenge of context window length was discussed with
strategies around chunking or splitting data as a way to overcome context
window limitations. LangChain also provides document transformers that
include splitters allowing you to chunk your documents using simple
configurations as shown in the code example below.

import numpy as np
from langchain.text_splitter import CharacterTextSplitter,
RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()
# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 1000,
chunk_overlap = 100,
)
docs = text_splitter.split_documents(documents)

In this case, the code will load pdf documents from the designated location
and then split the documents into chunks of 1000 characters. These chunks
then contain portions of the original PDF document that can be preprocessed
to create vector embeddings using an embedding model and ultimately stored
in a vector store or loaded into a vector database using one of the many third-
party integrations provided through LangChain.
To create and store the vector embeddings in a local vector store, LangChain
provides a convenient library for performing similarity search and the
clustering of dense vectors called Facebook AI Similarity Search (FAISS).
This library takes the embeddings model as well as the loaded documents on
input to create the complete FAISS vector store.

from langchain.chains.question_answering import load_qa_chain


from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
docs,
bedrock_embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

Then, using the provided Index Wrapper a lot of the heavy lift can be
abstracted away including creating the prompt, getting the embeddings of the
query, sampling the most relevant documents and calling the LLM. The code
example below shows how to get the embedding of the query prompt then use
that embedding to retrieve relevant documents using a similarity search of the
input query with data in the vector store.

query = "Is it possible that I get sentenced to jail due to failure in filings?"

# get embedding of query


query_embedding = vectorstore_faiss.embedding_function(query)
np.array(query_embedding)

# get relevant documents


relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the
query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
print_ww(f'## Document {i+1}: {rel_doc.page_content}.......')
print('---')

The code above then returns the relevant documents but, if you recall, the
relevant document information now needs to be combined with the original
prompt to create the augmented prompt that will be sent to the LLM to
generate a completion. There are multiple ways to do this with different
retrievers; however, one method is to utilize chains.
A chain allows you to create a sequence of calls to different components and
is also useful when you want to be able to add additional customizations to
response output such as when you want not only the final completed prompt
but also the supporting information or citations.
In the example below a built-in chain called RetrievalQA is used along with
a default prompt template. The provided chain is set up to retrieve relevant
documents from the vector store and also specifies the type of search to
perform. In this case, a similarity search will be performed to find the most
relevant documents and the top 3 relevant documents will be returned.

from langchain.chains import RetrievalQA


from langchain.prompts import PromptTemplate

prompt_template = """Human: Use the following pieces of context to provide a concise


answer to the question at the end. If you don't know the answer, just say that you
don't know, don't try to make up an answer.

{context}

Question: {question}
Assistant:"""
PROMPT = PromptTemplate(
template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore_faiss.as_retriever(
search_type="similarity", search_kwargs={"k": 3}
),
return_source_documents=True,
chain_type_kwargs={"prompt": PROMPT}
)
query = "Is it possible that I get sentenced to jail due to failure in filings?"
result = qa({"query": query})
print_ww(result['result'])
In this example, FAISS is used to get the top 3 relevant documents from a
local vector store; however, on AWS you could optionally use the k-NN
plugin with OpenSearch to retrieve the top n documents. OpenSearch uses
approximate nearest neighbor (ANN) algorithms to perform k-NN searches
using the same FAISS algorithms you see above or other available engines.
This allows you to map the search algorithm that best meets the needs of your
use case across characteristics such as vector dimensionality, post filtering
requirements, additional training requirements, compression, index and
search latency. For example, if you want to use a search approach that does
not require training and support large use cases then you may consider using
FAISS Hierarchical Navigable Small Worlds (HNSW) or Non-Metric Space
Library’s (NMSLIB) implementation of HNSW. Both are available in
OpenSearch with support for configuring the optimal similarity metric, such
as cosine similarity, for your use case.
Finally the results are used to create the augmented prompt, call the LLM,
and generate the final completion. Creating this sequence of steps as well as
the components needed to execute each step takes a lot of work as well as
something to explicitly orchestrate the end-to-end workflow. LangChain
provides libraries that greatly simplify the implementation of techniques like
RAG. These libraries can be integrated directly into your applications.
This section showed one way to implement RAG through LangChain;
however, there are also other implementation patterns of RAG through
LangChain as well. In addition to RAG, LangChain also provides the
framework for implementing other techniques which you’ll learn about in
later chapters such as agent based architectures which will be covered in
Chapter 10.

Summary
This chapter covered RAG not only as a common framework for augmenting
LLMs but more specifically using RAG to mitigate the common knowledge
limitations of hallucinations and knowledge cut-offs in LLMs by providing
access to external sources of information. A specific implementation for
document retrieval was explored as well as the importance of vector stores
in implementing RAG architectures. A specific example of a vector store,
vector database, was explored using a detailed example of retrieving
information from an external document store. The complexity of
implementing these architectures is greatly reduced by frameworks like
LangChain which are making it possible to quickly build, deploy, and test
LLM-powered applications that implement augmentation techniques like
RAG
Next, you’ll explore chain-of-thought reasoning which is a technique to
improve a model’s ability to reason through complex tasks. This is an
important mechanism for using an LLM to power your complex applications.
Chapter 9. Chain-of-Thought
Reasoning and Application
Integration

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—the
authors’ raw and unedited content as they write—so you can take
advantage of these technologies long before the official release of these
titles.
This will be the 10th chapter of the final book. Please note that the
GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this
chapter, please reach out to the editor at shunter@oreilly.com.

An exciting emerging behavior of LLMs is their ability to reason. You can


teach an LLM to reason through complex tasks by including an example in the
prompt. The example shows how to break a complex task into smaller, more
manageable reasoning steps that help you find the solution—just like a human
would approach a complex task. This method is known as chain-of-thought
(CoT) prompting or reasoning.

Chain-of-Thought Reasoning
Let’s have a look at how humans would reason through complex tasks, for
example, the following math challenge:
Sarah has 4 golf balls. She buys 2 more packages of golf balls. Each package has 12
golf balls. How many golf balls does she have now?

You would most likely take a step-by-step approach, breaking this task into
the following intermediate reasoning steps:

Start: Sarah started with 4 golf balls.


Step 1: 2 packages of 12 golf balls each is 24 golf balls.
Step 2: 4 + 24 = 28
End: The answer is 28.

By including an example like this in the model prompt, you can teach a model
to mimic this behavior to solve similar tasks. This is how chain-of-thought
prompting works! Figure 10-X shows how to use this one-shot inference
example in a chain-of-thought prompt to solve a similar task:

You are ordering pizza for a meetup event for 20 people. If each pizza has 8 slices
and you think every attendee eats 2 slices, how many pizzas do you need to order?
Figure 9-1. Chain-of-thought prompting helps LLM reason through complex tasks.

When you send this chain-of-thought prompt to the LLM, you can see the
model generates a completion that breaks the task into multiple steps and
explains how it reasons through the task, similar to the provided one-shot
example. Here, the LLM finds the correct answer of 5.

NOTE
Chain-of-thought (CoT) reasoning should be introduced during the model’s pre-training
phase. If you are using a pre-trained LLM that has not been exposed to CoT, it’s often too
late to introduce CoT during fine-tuning. The FLAN series of models have been exposed
to CoT, so FLAN-T5 is a great choice.
Chain-of-thought reasoning is a powerful concept that’s not just limited to
arithmetic challenges. You can apply CoT to help models reason through
different types of complex tasks or problems. When you start building
generative AI applications, CoT is often a key component that enables the
LLM to understand complex user requests. You can then use this ability to
build applications that can perform tasks and interact with other applications
or systems.

Application Integration with Agents


Connecting LLMs to other applications or systems allows the model to
interact with the broader world, extending their utility beyond language tasks.
In chapter 9, you’ve seen how Retrieval Augmented Generation (RAG) can
help LLMs overcome knowledge cut-offs by providing access to external
data sources that can be used to augment the prompt with relevant contextual
information.
When you give the LLM the ability to interact with APIs, you can extend its
capability to perform actions on behalf of the user. For example, an LLM-
powered travel agent application can not only respond to the question,
“Which beaches should I visit in Hawaii?” with a list of suggestions, but can
also book the flight and hotel for you. For this to work, you need an
additional piece of software, usually referred to as an agent, that orchestrates
the prompt-completion workflows as shown in Figure 10-X.
Figure 9-2. Agents orchestrate prompt-completion workflows between user requests, the LLM,
and external data sources and applications.

The agent uses the LLM as its reasoning engine. The agent builds the CoT
reasoning prompt and augments it with relevant information to provide
responses back to the user in natural language. The agent is able to figure out
the actions required to automatically process user-requested tasks by
breaking the task into multiple steps, orchestrating a sequence of API calls
and data lookups, and maintaining memory to complete the action for the
user. The set of actions the agent can take depends on the tools you configure.
Tools are functions, or APIs, you give the agent access to.
NOTE
Agent implementations are available in many popular open source libraries, such as
Hugging Face Transformers Agents
[https://huggingface.co/docs/transformers/transformers_agents] or LangChain Agents
[https://docs.langchain.com/docs/components/agents/]. You can also choose from fully
managed cloud services, such agents for Amazon Bedrock which is covered in more detail
in chapter 12.

For an agent to be able to perform all of these tasks, it needs to structure the
prompts in the correct way. You’ve seen in the beginning of this chapter, how
CoT reasoning helps the model reason through complex tasks. But how do
you structure prompts to decide on which actions to take? One popular
framework to achieve this is called ReAct
[https://arxiv.org/pdf/2210.03629.pdf].

ReAct Framework
ReAct is a prompting strategy that combines CoT reasoning with action
planning. ReAct structures prompts to show the LLM how to reason through a
problem and decide on actions to take that help find a solution. The
structured prompts include a sequence of question-thought-action-
observation examples.
The question is the user-requested task or problem to solve. The thought is a
reasoning step that helps demonstrate to the LLM how to tackle the problem
and identify an action to take. The action is an API that the model can invoke
from an allowed set of APIs. The observation is the result of carrying out the
action. The actions that the LLM is able to choose from are defined by a set
of instructions that are prepended to the example prompt text.
Let’s come back to our travel agent example and assume a user is asking
which hotel is closest to the most popular beach in Hawaii. This question
will take a couple of intermediate steps and actions to find the solution. In the
prompt-prepended instructions, you describe the ReAct prompt structure and
list the allowed actions:
Solve a question answering task with interleaving Thought, Action, Observation
steps.

Thought can reason about the current situation, and Action can be three types:
(1) wikipedia_search[topic], which searches the topic on Wikipedia and returns the
first paragraph if it exists. If not, it will return a similar topic to search.
(2) hotel_database_lookup[request], which performs an API call to the hotel database
to gather hotel information defined in request
(3) Finish[answer], which returns the answer and finishes the task.
Here are some examples.

First, you define the task by telling the model to answer a question using the
discussed ReAct prompt structure. Then, you provide instructions that
explain what “thought” means and list the allowed actions to take.
The first is the wikipedia_search action, which looks for wikipedia entries
related to the specified topic. The second is a hotel_database_lookup
action, which can query the travel companies’ hotel database with a specific
request. The last action is finish, which returns the answer and brings the
task to an end.
It’s important to define a set of allowed actions when you use LLMs to
perform tasks. LLMs can be very creative and otherwise may suggest taking
actions that don’t correspond to anything the application can actually do. You
finish the instructions by providing some ReAct example prompts. Depending
on the LLM you are working with, you may need to include more than one
example and carry out few-shot inference. Figure 10-X summarizes how you
build up the ReAct prompt, together with the actual user request appended.
Figure 9-3. Build up the ReAct prompt with instructions, ReAct examples, and the user request.

Now, let’s see how the model applies the instructions to the user’s request to
find the closest hotel to the most popular beach in Hawaii:

Question: Which hotel is closest to the most popular beach in Hawaii?

Thought 1: I need to search for the most popular beach in Hawaii and find the
closest hotel for that location.
Action 1: wikipedia_search["most popular beach in Hawaii"]
Observation 1: Waikiki is most famous for Waikiki Beach.

Thought 2: I need to find the hotel closest to Waikiki Beach.


Action 2: hotel_database_lookup["hotel closest to Waikiki Beach"]
Observation 2: MyHawaiianDreamHotel is closest to Waikiki Beach.
Thought 3: MyHawaiianDreamHotel is closest to Waikiki Beach, the most popular
beach in Hawaii. So the answer is MyHawaiianHotel.
Action 3: Finish["MyHawaiianDreamHotel"]

You can see how the thoughts reason through the task and plan two
intermediate steps that help find the answer. The model then needs to decide
on an action to take from a predetermined list.
In this example, the allowed actions are wikipedia_search which performs
a Wikipedia search on a specific topic, hotel_database_lookup which
performs an API call to the travel agencies’ hotel database, and finish
which the model carries out when it has found the answer. The observations
bring the new information retrieved from the actions back into the model’s
prompt context. The model will cycle through as many iterations as needed to
find the answer. The final action is then to finish the cycle and pass the
answer back to the user.
As you can see, ReAct prompting is a powerful strategy to guide LLMs
through reasoning and action planning. Many agent implementations support
ReAct prompting and will automatically build the structured prompts for you.
Your travel agent application is now able to connect to external data sources
to retrieve additional information, reason through tasks, and plan and perform
tasks. But what if one of the tasks is to calculate the sales tax for the travel
booking? Even with CoT, an LLM’s ability to perform arithmetic or other
mathematical operations is limited. The model might be able to correctly
reason through the task, but may get the actual calculation wrong. After all, an
LLM is not really doing math, it’s just predicting the most probable next
token to complete the prompt.
To overcome this limitation, you can connect your model to an application
that’s good at performing calculations, such as a code interpreter. The
“Program-aided Language Models” (PAL) framework
[https://arxiv.org/pdf/2211.10435.pdf] does exactly that.
PAL Framework
PAL uses CoT reasoning to generate programs in the intermediate reasoning
steps that help solve the given problem. These programs are then passed to
an interpreter, for example, a Python interpreter, that runs the code and
returns the result back to the LLM. You can use the Program-aided Language
Models (PAL) framework to connect an LLM to an external code interpreter
to perform calculations, as shown in Figure 10-X.

Figure 9-4. PAL connects an LLM to an external code interpreter to perform calculations.

Similar to ReAct, you need to add one or more examples to the prompt that
shows the model how to format the output. You start each example with a
question followed by a couple of reasoning steps and lines of Python code
that solve the problem. Then, you add the new question to solve to the
prompt. The PAL-formatted prompt now contains your example(s) and the
new problem to solve.
Once you pass this prompt to the LLM, the model follows the example and
generates a completion in the form of a Python script. Next, you send the
script to a Python interpreter that will run the code and return the result.
Figure 10-X shows the complete PAL workflow. You can now append the
result to the prompt and the LLM generates a completion that contains the
correct answer.

Figure 9-5. PAL workflow connecting the LLM to a Python interpreter.


Let’s come back to our previous CoT arithmetic example and see how this
can be solved using PAL. In the one-shot example below, you can see how
the reasoning steps (highlighted in green) are expressed in natural language
again, but formatted as code comments. What’s different compared to the
CoT example is that you show the model how to solve the problem using
Python code (highlighted in blue):

Q: Sarah has 4 golf balls. She buys 2 more packages of golf balls. Each package has
12 golf balls. How many golf balls does she have now?

Answer:
# Sarah started with 4 golf balls
golf_balls = 4
# 2 packages of golf balls each is
bought_balls = 2 * 12
# golf balls. The answer is
answer = golf_balls + bought_balls

Q: You are ordering pizza for a meetup event for 20 people. If each pizza has 8
slices and you think every attendee eats 2 slices, how many pizzas do you need to
order?

You then complete the prompt with the new task to solve. The answer should
look similar to this:

Answer:
# There are 20 people
num_people = 20
# Each person eats 2 slices, each pizza has 8 slices
slices_per_person = 2
slices_per_pizza = 8
# The answer is
answer = (num_people x slices_per_person)/slices_per_pizza

Note how the LLM declares the variables based on the text in each reasoning
step and assigns values either directly or as calculations, also based on the
numbers in the reasoning text. The LLM can also pick up and work with
variables it created in an earlier step, as you can see when the model
calculates the final result. The model then correctly generates the required
math operation to calculate the result.
For simple math operations like this, you can likely get the correct answer by
just applying CoT reasoning. But for more complex math, such as arithmetic
with large numbers, trigonometry, or calculus, PAL is a powerful technique
that ensures that any calculations done by your LLM are accurate and
reliable.
Similar to ReAct, many agent implementations support PAL and will
automatically build the prompts and orchestrate the communication between
LLM and the code interpreter for you.
The following code example shows how to use ReAct and PAL with
LangChain agents:

from langchain.agents import load_tools


from langchain.agents import initialize_agent
from langchain.agents import AgentType
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline
import torch

model_id = "..."

tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=256)


model = AutoModelForCausalLM.from_pretrained(model_id)

pipeline = transformers.pipeline(
"text-generation",
model=model_id,
tokenizer=tokenizer,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
device_map="auto",
max_new_tokens=100
)

llm = HuggingFacePipeline(pipeline=pipe)

tools = load_tools(["serpapi", "llm-math"], llm=llm)

agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,


verbose=True)

agent.run("Which hotel is closest to the most popular beach in Hawaii, and how much
is each night with 50% discount?")
The output should look similar to this:

> Entering new AgentExecutor chain...


I need to find the most popular beach in Hawaii and find the closest hotel to that
beach and find out how much a hotel night is and then calculate 50% of that price.
Action: Search
Action Input: "most popular beach in Hawaii"
Observation: Waikiki Beach
Thought: I need to find the closest hotel from Waikiki Beach
Action: Search
Action Input: "closest hotel from Waikiki Beach"
Observation: MyHawaiianDreamHotel
Thought: I need to find out how much a hotel night is
Action: Search
Action Input: "How much is a hotel night at MyHawaiianDreamHotel"
Observation: 250 USD
Thought: I need to calculate 50% of that price
Action: Calculator
Action Input: 250x0.5
Observation: Answer: 125

Thought: I now know the final answer


Final Answer: Waikiki Beach is the most popular beach in Hawaii and the closest
hotel is MyHawaiianDreamHotel and a hotel night with 50% discount is 125 USD.
> Finished chain.

"Waikiki Beach is the most popular beach in Hawaii and the closest hotel is
MyHawaiianDreamHotel and a hotel night with 50% discount is 125 USD."

With additional orchestration software, such as agents, and advanced


prompting strategies, such as CoT, ReAct, and PAL, you can now build
powerful generative AI applications that use the LLM as the application’s
reasoning engine, shown in Figure 10-X.
Figure 9-6. Generative AI applications use LLMs as their reasoning engine.

To build an end-to-end generative AI solution, there’s a few more


components that you need. For example, you need infrastructure to train, fine-
tune, and serve your model, and host your application components. You might
also need additional orchestration components, frameworks, model hubs, as
well as application interfaces that allow your consumers, including users and
systems, to interact with your solution.

Generative AI-based Solutions


Building generative AI-based solutions requires several key components.
The infrastructure layer provides the compute, storage, and network to serve
your models and host your application components. You can build upon on-
premise infrastructure or using on-demand and pay-as-you-go cloud services.
You also need to choose the LLM that powers your application. This might
be a foundation model from a model hub, or a model that you’ve customized
with your own data. Depending on your use case and whether you need the
model to respond in real-time or near-real-time, you then deploy the model
on the appropriate infrastructure.
Your solution might also include external data sources and applications to
retrieve additional information and to perform specific tasks, as discussed in
chapters 9 and 10.
The application uses the LLM to generate completions for the user or the
consuming application. Depending on your specific use case, it might be
necessary to develop a system to capture and store the outputs effectively.
For instance, incorporating the ability to save user completions during a
session could enhance the fixed context-window size of your language
model.
Additionally, gathering feedback from users can prove valuable for further
fine-tuning, alignment, or evaluation as your application continues to evolve.
To make use of advanced prompting techniques and manage interactions with
external systems, you may need to employ various tools and frameworks
designed for large language models. For example, LangChain’s built-in
libraries offer techniques like PAL, ReACT, or chain-of-thought prompting
that can be easily implemented. You can also take advantage of model hubs
that enable centralized management and sharing of models to be used in
different applications.
Finally, your application typically incorporates a user interface, which could
be a website or a REST API through which users will access and consume
the application. This layer is also responsible for integrating the necessary
security components to ensure secure interactions with the application.
Broadly speaking, the generative AI application stack, shown in figure 10-X,
consists of various components that you need to consider while developing
generative AI applications.

Figure 9-7. The generative AI application stack.

Your application consumers, whether they are human end users or other
systems accessing your application via its APIs, will engage with this entire
stack. As you can observe, while the model plays a significant role, building
end-to-end generative AI applications involves more than just the model
itself.
Safety and Guardrails
With the growing popularity and next wave of widespread AI adoption
comes the recognition that we must all use it responsibly. This includes
building generative AI applications in a safe, secure, and transparent way.
Throughout design, development, deployment, and operations, you need to
consider a range of factors including accuracy, fairness, appropriate usage,
toxicity, security, safety, and privacy.
One concern of generative AI is related to the potential generation of
offensive, disturbing, or inappropriate content. To minimize the risk of such
toxic outputs, you could carefully curate your training data. If the training
data lacks offensive or biased text, the LLM will likely not generate such
content. However, this approach requires that you proactively identify and
remove any offensive phrases.
A more feasible approach could be to train guardrail models. These models
are designed to identify and filter out undesirable content not only from the
training data but also from input prompts and generated outputs. In practice, it
tends to be more manageable to control the output of a generative model
rather than curating the training data and prompts. This is particularly true
due to the highly diverse and general nature of the tasks for generative AI.

Summary
In this chapter, you’ve explored methods to enhance your model’s
performance during deployment. These methods include using structured
prompts and connecting your model to external data sources and
applications. LLMs can serve as remarkable reasoning engines in
applications, leveraging their “intelligence” to fuel exciting and practical use
cases.
To facilitate this process, orchestration software, like agents, takes charge of
prompt engineering and communication between systems. They enable LLM-
powered applications to perform actions in the real world, making the
applications more versatile and interactive.
Finally, when building generative AI applications, you also need to consider
additional components, from the infrastructure layer up to the application
interfaces, and ensure the technology is used responsibly. In the next chapter,
you’ll explore multimodal models, some of their common use cases, and core
concepts of image-based generative AI.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy