Generative AI on AWS
In this chapter, you will learn about low-code ways to interact with
generative AI models - specifically, prompt engineering and in-context
learning. You will see that writing prompts is both an art and a science that
helps the model generate better and more-applicable responses. You will
also see some best practices when defining prompts and prompt templates to
get the most out of your generative models.
You will also see how to use in-context learning to pass multiple prompt-
completion pairs (e.g. question-answer pairs) in the “context” along with
your prompt input. In-context learning nudges the model to respond
similarly to the prompt-completion pairs in the context. This is one of the more
remarkable, mysterious, and lightweight capabilities of generative models:
it temporarily alters the model’s behavior for the duration of that single
request-response.
Lastly, you will learn some of the most important generative configuration
parameters, like `temperature` and `top-k`, that control the generative model’s
creativity when creating content.
Prompt Engineering
Prompt engineering is a new and exciting skill focused on how to better
understand and apply generative models to your tasks and use cases.
Effective prompt engineering helps you push the boundaries of generative AI
and get the most out of your generative-based applications.
The text that you send into a generative model is typically called the
“prompt”. This prompt is passed to the model during inference time to
generate a “completion”. Below is an example question-answer prompt and
completion between a “Human” and the generative AI “Assistant”. Note that
the generative model is simply completing the Human’s prompt following the
term “Assistant:”
Figure 1-1. A simple question produces a lengthy response from the model
You may have to rewrite your prompt several times to get a proper and
precise response as some of these generative models are quite chatty. Prompt
engineering is a learned skill that requires many iterations across many
different model types and linguistic nuances - and often depends on how the
model was trained and fine-tuned.
Most modern human-facing chat models have been fine-tuned using some
form of human labeled data - often with reinforcement learning as we will
explore in Chapter 7. This is often why you see some form of `Human:` and
`Assistant:` in most chat prompts. These are required to indicate the start of
the input question and output answer, respectively. However, they are often
model-specific. Using different indicators may result in “off-distribution”
and undesirable results.
Next, you’ll explore some prompt structures and techniques to get the most
out of off-the-shelf generative AI models.
Prompt Structure
The prompt structure used in the previous example is a simple chat-assistant
prompt structure, which implies a question-answer task and uses “Human:”
as the input indicator and “Assistant:” as the output indicator.
A more-complete prompt structure includes a section for each of the
following: instruction, context, input data, and output indicator. A
restructured version of the previous chat example using the more-complete
prompt structure is shown below. Here, the prompt includes the instruction
on the first line followed by the context. `Human:` indicates the start of the
input data and `Assistant:` is the output indicator where the model generates
the completion.
Figure 1-2. Restructured with instruction, context, input data, and output indicator
The prompt does not need all four elements, however. Nor does it require the
elements to follow the exact order used here. The ideal prompt structure may
vary depending on the task as well as the size of the combined context and
input data.
In addition, the best prompt structure depends on how the generative model
was trained and fine-tuned. Therefore, it’s important to read the
documentation for a given generative model to gain intuition into the prompt
templates used during training and tuning. Optimizing the prompt and prompt
structure is all part of prompt engineering!
If you know which datasets and prompt templates were used during model
training and fine-tuning, you can often find a more-relevant and successful
prompt structure by following a pattern similar to the prompt templates.
Below, you see the `samsum` prompt templates that are used together with the
`samsum` dataset to fine-tune the popular FLAN-T5 model for dialog
summarization, among many other tasks.
"samsum": [
("{dialogue}Briefly summarize that dialogue.", "{summary}"),
("Here is a dialogue:\n{dialogue}\n\nWrite a short summary!",
"{summary}"),
("Dialogue:\n{dialogue}\n\nWhat is a summary of this dialogue?",
"{summary}"),
("{dialogue}\n\nWhat was that dialogue about, in two sentences or less?",
"{summary}"),
("Here is a dialogue:\n{dialogue}\n\nWhat were they talking about?",
"{summary}"),
("Dialogue:\n{dialogue}\nWhat were the main points in that "
"conversation?", "{summary}"),
("Dialogue:\n{dialogue}\nWhat was going on in that conversation?",
"{summary}"),
]
Figure 1-3. samsum dialog-summarization dataset
There are subtle differences between these completions. With more examples,
or shots, the model more closely follows the pattern of the in-context
prompt-completion pairs, which mention only the winning team,
the Chicago Cubs, and the losing team, the Cleveland Indians - and nothing more.
The zero-shot completion, without the additional prompt-completion pairs in
context, includes additional information such as the previous time the
Chicago Cubs won the baseball World Series in 1908 - over a century
earlier, by the way.
NOTE
2016 was a great year for one of the authors of this book who is a life-long Chicago Cubs
fan!
NOTE
It’s worth noting that in-context learning does not modify the model in any way. The model
adjusts - or learns - on-the-fly for the duration of that single request using the context
provided in the prompt. This is a truly remarkable and somewhat mysterious property of
generative models which can be used in many creative ways.
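To make this concrete, here is a minimal sketch of how you might assemble a few-shot prompt from a list of prompt-completion pairs. The helper function, the example pairs, and the `Human:`/`Assistant:` indicators are illustrative assumptions rather than a specific model’s required format.

# Hypothetical prompt-completion pairs to use as in-context examples.
few_shot_examples = [
    ("Who won the 2016 baseball World Series?",
     "The Chicago Cubs won the 2016 World Series against the Cleveland Indians."),
    ("Who won the 2015 baseball World Series?",
     "The Kansas City Royals won the 2015 World Series against the New York Mets."),
]

def build_few_shot_prompt(question, examples):
    """Prepend prompt-completion pairs to a new question as in-context examples."""
    prompt = ""
    for example_question, example_answer in examples:
        prompt += f"Human: {example_question}\n\nAssistant: {example_answer}\n\n"
    prompt += f"Human: {question}\n\nAssistant:"
    return prompt

print(build_few_shot_prompt("Who won the 2014 baseball World Series?",
                            few_shot_examples))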
In Chapter 9, you will see how to further augment the prompt using data
stores (e.g. databases and knowledge stores) and APIs (e.g. web searches
and custom APIs). This is called retrieval-augmented generation (RAG) and
is part of the larger generative AI ecosystem that helps augment prompts with
domain knowledge and external tools. RAG improves model responses
across many generative tasks and use cases.
While some of the larger, more recent models provide good responses with
zero-shot inference, some of the smaller models may require in-context
learning with one-shot or few-shot inference that include examples of the
desired response.
As larger and larger models have been trained, it has become clear that the
ability of models to perform multiple tasks - and how well they perform
those tasks - depends strongly on the scale of the model.
Models with more parameters are typically able to capture more
understanding of language. The largest models are surprisingly good at zero-
shot inference, and are able to infer and successfully complete many tasks
that they were not specifically trained to perform.
In contrast, smaller models are generally only good at a small number of
tasks, typically those that are similar to the task they were trained on. You
may have to try out a few models to find the right one for your use case.
It’s worth noting that you can “trick” a model into temporarily in-context
learning an incorrect answer. For example, you can pass three in-context
prompt-completion examples that demonstrate a positive customer review as
a NEGATIVE sentiment and a negative customer review as a POSITIVE
sentiment as shown below.
Figure 1-5. Few-Shot, In-Context Prompt with Opposite Sentiment
In this case, inference requests made to the model with this prompt are more
likely to return the opposite sentiment. This is an alarming and mischievous
quality of in-context learning. And this is relatively easy to do in practice, so
it’s worth double-checking your in-context prompt-completion pairs
carefully.
Next, you’ll explore some prompt-engineering best practices to improve the
responses from your generative AI models.
Convey the force of the question. Clearly state one of the following: who,
what, where, when, why, how, etc.
Use explicit directives. If you want the model to output in a particular format,
specify that directly. For example, “Summarize the following customer-
support dialog in a single sentence:”
Use simple language. Prompts should use natural language and coherent
sentence structure. Avoid using single words or terse “baby”/“pet” phrases.
Avoid negative formulations. Negative formulations, while syntactically
correct, may cause confusion. For example, use “Summarize in 5 sentences
or less” instead of “Summarize in no more than 5 sentences”. Avoid negative
formulations if a more-straightforward linguistic variation exists.
NOTE
A general rule of thumb at this stage of the generative AI maturity cycle is this: if the
wording is confusing to humans, it is more likely to be confusing to these models. Simplify
when possible.
Figure 1-13. Prompt with Combined “Think Step by Step” and “I don’t know”
By trying different prompts, you see what works and what doesn’t work for
your prompt, model, and use case combination. Continue to refine your
prompt as needed. With more and more experimentation, you will gain the
necessary intuition to quickly create and optimize a prompt to best suit your
task and use case. Prompt engineering is an iterative skill that improves with
practice. Prompt optimization is not as well-defined or well-studied as classical
numerical-optimization techniques, which can be frustrating.
Take this time to enjoy the creative and non-deterministic side of generative
AI. At a minimum, you’ll enjoy a good laugh when the model surprises you
with a seemingly random response to a question that you did not intend to
ask. Next, you’ll see some common generative inference-specific parameters
that influence the creativity of the generative model response. This is where
the fun begins!
Adversarial Prompts
It’s important to understand the safety issues caused by adversarial prompts
with generative AI. By understanding the risks, you can better evaluate the
safety of your model - and address any issues upfront before releasing into
production. Some common examples of adversarial prompts and prompt
misuse include prompt injection, prompt leaking, and jailbreaking. Without
the proper guardrails, some models may respond to adversarial prompts with
harmful, dishonest, and unethical responses. While this is not an exhaustive
summary of the many adversarial attack vectors, these are a few of the most
common.
Prompt injection is a technique that attackers use to influence the outputs of
generative models by adding malicious instructions directly in the prompt.
For example, an attacker may ask the model to generate responses to induce
self-harm, create fake news, or promote bias at scale.
NOTE
It’s worth noting that prompt injection is also used to create non-malicious prompts.
However, the term “prompt injection”, like “SQL injection” and other malicious
“injections”, carries a negative connotation.
Figure 1-14. Prompt with additional defense for instructions that include the word “hack” or
“harm”
Figure 1-18. Greedy vs. random sampling to predict the next token from a probability distribution
Most generative model-inference implementations default to greedy
decoding. This is the simplest form of next token prediction as the model
always chooses the word with the highest probability. This method works
well for very short generations, but may result in repeated tokens or
sequences of tokens.
If you want to generate text that is more natural and minimizes repeating
tokens, you can configure the model to use random sampling during inference.
This will cause the model to randomly choose the next token using a
weighted strategy across the probability distribution. The token `banana`, as
shown here, has a probability score of 0.02. With random sampling, this
equates to a 2% chance that this word will be selected from the distribution.
Using random-sampling, you reduce the likelihood of repeated tokens in your
model completion. The trade-off, however, is that the model output may be
too creative and either generate an off-topic or unintelligible response.
Finding this optimal setting is why this is called prompt engineering!
NOTE
Some implementations and libraries may require you to explicitly disable greedy sampling
and manually enable random sampling using a function argument similar to
`do_sample=True`.
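As a minimal sketch using the HuggingFace Transformers `generate()` API, the difference looks like this. The “gpt2” checkpoint and the sample prompt are illustrative assumptions; swap in your own model and prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative small model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("I love baseball because", return_tensors="pt").input_ids

# Greedy decoding is typically the default: always pick the highest-probability token.
greedy_output = model.generate(input_ids, max_new_tokens=30)

# Random (weighted) sampling usually must be enabled explicitly.
sampled_output = model.generate(input_ids, max_new_tokens=30, do_sample=True)

print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
print(tokenizer.decode(sampled_output[0], skip_special_tokens=True))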
Two of the most common inference parameters that enable random sampling
are `top-p` and `top-k`. These parameters provide more fine-grained control
over the random sample which, if used properly, should improve the model’s
response yet allow it to be creative enough to fulfill the generative task.
`top-k`, as you probably guessed, limits the model to choose a token
randomly from only the top-k tokens with the highest probability. For
example, if `k` is set to 3, you are restricting the model to choose from only
the top-3 tokens using the weighted random-sampling strategy. In this case,
the model randomly chooses “donut” as the next token, although it could have
selected either of the other two.
Figure 1-19. Top-k sampling restricts the model to choose from the top-3 probabilities, in this
case
`top-p` limits the model to randomly sample from the set of tokens whose
cumulative probabilities do not exceed `p` starting from the highest
probability working down to the lowest probability. To illustrate this, first
sort the tokens in descending order based on the probability. Then select a
subset of tokens whose cumulative probability scores do not exceed `p`. For
example, if `p=0.3`, the options are “cake” and “donut” since their
probabilities of 0.2 and 0.1 add up to 0.3. The model then uses the weighted
random-sampling strategy to choose the next token from this subset of tokens
as shown below.
Figure 1-20. Top-p uses random probability weighting to choose the next token from tokens
whose combined probabilities do not exceed P when ranked highest probability to lowest.
Figure 1-21. Changing the temperature will change the next-token probability distribution
In both cases, the model selects the next token from the modified probability
distribution using either greedy or random sampling which is orthogonal to
the temperature parameter.
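Reusing the model, tokenizer, and `input_ids` from the earlier sampling sketch, the following hedged example shows how `temperature`, `top-k`, and `top-p` are typically combined with random sampling in the HuggingFace `generate()` API; the specific values simply mirror the examples in this section.

output = model.generate(
    input_ids,
    max_new_tokens=30,
    do_sample=True,    # sampling must be enabled for these parameters to take effect
    temperature=0.7,   # <1.0 sharpens the next-token distribution, >1.0 flattens it
    top_k=3,           # sample only from the 3 highest-probability tokens
    top_p=0.3,         # ...whose cumulative probability does not exceed 0.3
)
print(tokenizer.decode(output[0], skip_special_tokens=True))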
Summary
In this chapter, you learned techniques to help get the best possible
performance out of these generative AI models using prompt engineering and
by experimenting with different inference configuration parameters. Prompt
engineering guides the generative foundation model to provide more relevant
and accurate completions using methods such as better-worded prompts,
in-context learning examples, and step-by-step logical reasoning.
While you can get far with prompt engineering, in-context learning, and inference
parameters, these techniques do not actually modify the generative model’s
weights. As such, you may need to train or fine-tune a generative model on
your own datasets to better understand your specific domain and set of
generative use cases. In the next chapter, you will see how to train a
generative model from scratch with public or private datasets.
Chapter 2. Model Pre-Training
In the previous chapter, you defined your use case, performed some prompt
engineering, and determined whether a generative model will work for your
use case. Next, you need to decide if an existing foundation model is
sufficient to understand your domain or if you need to pre-train a foundation
model from scratch.
In this chapter, you will see the trade-offs between choosing an existing
foundation model vs. training a foundation model from scratch. You will also
learn about empirical scaling laws that have emerged for generative AI
models which provide a good starting point if you choose to pre-train your
model from scratch. You will then see an example of pre-training a finance-specific
generative model called BloombergGPT.
Foundation Models
At the start of any generative AI project, you should first explore the vast
number of public, pre-trained foundation models that exist today. These
models have been trained on just about every public piece of text in existence
on the internet across many different languages. As such, these models have
built a solid understanding of human language - as well as a massive amount
of built-in knowledge across many domains.
You can find these foundation models in a “model hub” such as HuggingFace
Model Hub, PyTorch Model Hub, or Amazon SageMaker JumpStart. Model
hubs offer a “model card” for each model. Model cards typically contain
important information about the model including training details, context-
window size, prompt information, and known limitations.
Often, the model hubs contain the same models. So just pick a model hub that
best fits your security and infrastructure needs. For example, with the
SageMaker JumpStart model hub, you can deploy a private copy of a
foundation model directly into your AWS account with just a few clicks. This
lets you start generating content within minutes!
You will likely choose a foundation model that performs well on the
generative task that fits your use case. Some models may use slight variations
of the original Transformer architecture to optimize for specific language
tasks. This may cause issues if you try to swap out models during
development.
NOTE
Fear of missing out (FOMO) may cause you to swap out a newer generative model before
completing your evaluation on the current model. Try to avoid this temptation and complete
your testing with a single model - or set of models - before chasing the latest and greatest
leaderboard winner.
These pre-trained foundation models may not have seen enough public text to
learn the nuances of your specific domain, however. For example, the
vocabulary of a public foundation model, often on the order of 100,000
tokens, may not include the terms commonly used by your business.
Additionally, public foundation models and datasets may have been scrubbed
to avoid providing medical, legal, or financial advice due to the sensitive
nature of these domains. One financial company, Bloomberg, chose to pre-
train their own foundation model from scratch called BloombergGPT.
BloombergGPT was trained with both public and private financial data. You
will learn more about BloombergGPT in a bit. But first, you will learn more
about model pre-training, in general.
Model Pre-Training
If your domain’s data dictionary uses words, phrases, or linguistic structures
not commonly found in everyday language, you may need to pre-train a
foundation model from scratch to establish a more relevant vocabulary that
better represents and understands your specific domain. This is important in
highly-specialized domains like legal, medical, and financial.
Another issue is that certain domains may use the same language constructs
differently than their everyday context. In this case, using an LLM not trained
on medical data may give improper advice because of a simple
misunderstanding of the patient’s situation.
Since the medical domain uses words and phrases outside of the normal,
every-day language distribution, certain medical conditions and procedures
may not appear in the general-purpose training datasets commonly used by
publicly-available foundation models. Such general-purpose datasets include
Wikipedia entries, Reddit threads, and Stack Overflow discussions which
are likely not the best source for legal or medical advice.
Pre-Training Objectives
There are three variants of Transformer-based models overall: encoder-only,
decoder-only, and encoder-decoder. Each is trained with a different
objective and therefore better-capable of addressing specific tasks. During
pre-training, the model weights are updated to minimize the loss of the
training objectives described next for each variation.
Encoder-only models, or autoencoders, are pre-trained using a technique
called masked language modeling (MLM) that randomly masks input tokens
and tries to predict the masked tokens. This is sometimes called a
“denoising” objective. Autoencoding models use bidirectional
representations of the input to better-understand the full context of a token -
not just the previous tokens in the sequence as shown below.
Figure 2-1. Autoencoding models use a bidirectional context to reconstruct the masked input
tokens
Encoder-only models are best suited for tasks that utilize this bidirectional
property, such as named-entity recognition. A well-known encoder-only
model is BERT, which is covered extensively in the O’Reilly book Data
Science on AWS.
Decoder-only models, or autoregressive models, are pre-trained using
unidirectional causal language modeling (CLM), which predicts the next
token using only the previous tokens - all of the following tokens are masked
as shown below.
Figure 2-2. Autoregressive, decoder-only models only reveal the tokens leading up to the token
being predicted
NOTE
By some estimates, only 1-3% of parsed tokens are usable for pre-training. Since modern
large language models are pre-trained with 2+ trillion tokens, this implies that 66-200 trillion
tokens are parsed, but not used.
NOTE
Falcon was trained on 1.5 trillion tokens of data. The data was processed on a cluster of
257 ml.c5.18xlarge SageMaker instances consisting of 18,504 CPUs and 37TB of CPU
RAM.
Next, you’ll learn about scaling laws which describe the relationship
between model size, dataset size, and compute budget.
Scaling Laws
The goal of generative model pre-training is to maximize the model training
objective and minimize the loss when predicting the next token. Empirically,
a common set of “scaling laws” have emerged that describe the trade-offs
between model size and dataset size for a fixed compute budget (e.g. number
of GPU hours). These scaling laws state that you can achieve better
generative model performance by either increasing the number of tokens or
the number of model parameters as shown below.
Figure 2-4. Scaling choices for pre-training
NOTE
The petaflop/s-day metric is neural-network and workload dependent, so always make
sure you calculate this number using estimates that match your environment -
Transformers-based workloads, in our case.
Here, you see that large models require large compute budgets. T5-XL, with
3 billion parameters, requires 100 petaflop/s-days compared to GPT-3’s 175
billion parameter variant which requires approximately 4,000 petaflop/s-
days. You might wonder if it’s possible to get 175 billion parameter
performance from a smaller model. It turns out that you can!
Researchers have found that by increasing the training dataset size - instead
of the model size - you can achieve state-of-the-art performance that exceeds the
175 billion-parameter models with a much smaller set of weights. In fact, the
Scaling Laws for Neural Language Models paper shows that, for a given
compute budget, model performance may increase when you increase the
training-dataset size (holding the number of model parameters constant) or
the number of model parameters (holding the dataset size constant).
Figure 2-6. Impact of dataset size and parameter size on model performance
In other words, model performance may continue to improve with more data
- holding both the compute budget and parameter-size constant. This means
that smaller models may just need to be trained on more data and keep the
model size small. This is the exciting field of compute-optimal model
research which you will learn about next.
Compute-Optimal “Chinchilla” Models
In 2022, a group of researchers released a paper titled Training
Compute-Optimal Large Language Models that compared the model performance
of various model-size and dataset-size combinations. Since the authors named
their final “compute-optimal” model Chinchilla, this paper is famously
called the “Chinchilla Paper.”
The Chinchilla paper implies that the massive 100 billion-plus parameter
models like GPT3 may be over-parameterized and under-trained.
Additionally, they hypothesize that you could achieve 100 billion-plus
parameter model performance with a much smaller model by simply providing
more training data to that smaller model.
To be more specific, they claim that the optimal training dataset size is 20
times the number of model parameters. The Chinchilla model was never
publicly released, but it was documented as 70 billion model parameters
trained on 1.4 trillion tokens.
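As a quick sanity check, the 20-to-1 ratio is easy to apply yourself; the small helper below is just an illustration of the rule of thumb, not code from the Chinchilla paper.

def chinchilla_optimal_tokens(num_parameters: int) -> int:
    """Rule of thumb from the Chinchilla paper: ~20 training tokens per parameter."""
    return 20 * num_parameters

print(chinchilla_optimal_tokens(70_000_000_000))   # 70B parameters -> 1.4 trillion tokens
print(chinchilla_optimal_tokens(175_000_000_000))  # 175B parameters -> 3.5 trillion tokens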
Shortly after the Chinchilla paper, however, the compute-optimal LLaMa
model was trained - and ultimately leaked publicly - by Facebook/Meta.
LLaMa followed the Chinchilla scaling laws with 65 billion parameters and
1.3 trillion tokens. The figure below compares the compute-optimal
Chinchilla and LLaMa models with the potentially over-parameterized and
under-trained 175 billion-parameter variants of GPT-3, OPT, and BLOOM.
Figure 2-7. Chinchilla scaling laws for given model size and dataset size
Here, you see that these 175+ billion parameter models, according to the
Chinchilla scaling laws, should be trained on 3.5 trillion tokens. Instead, they
were trained with 180-350 billion tokens - an order of magnitude smaller
than the Chinchilla paper recommends.
In the figure above, you also see that the more recent LLaMA2 model,
released publicly by Facebook/Meta, was trained with 2 trillion tokens -
even higher than the 20-to-1 token-to-parameter ratio described in the
Chinchilla paper. Similar to the first version of LLaMA, LLaMA2
outperformed much larger models and even held the top position on the
HuggingFace Open LLM Leaderboard for a time.
The Chinchilla scaling laws are a great starting point for your pre-training
efforts. They demonstrate that you can achieve state-of-the-art performance
on relatively small 50-70 billion parameter models simply by increasing the
amount of training data.
Next, you will explore a well-documented model called BloombergGPT that
used the Chinchilla scaling laws as a blueprint for model pre-training.
BloombergGPT
First announced in 2023 in a paper called “BloombergGPT: A Large
Language Model for Finance” by Shijie Wu, Steven Lu, and others at
Bloomberg, BloombergGPT is a proprietary, domain-specific LLM pre-trained
on the finance domain.
With a budget of 1.3 million GPU hours (230 million petaFLOPs),
BloombergGPT’s 50 billion parameters were trained in a compute-optimal
way following the Chinchilla scaling laws. Researchers at Bloomberg pre-trained
with a combination of both public and private financial-text data - as well as
public internet-text data. The breakdown is roughly 51% public and
private financial data vs. 49% public internet data.
While the 50 billion-parameter BloombergGPT should have been trained
with approximately 1 trillion tokens based on the 20-to-1 token-to-parameter
Chinchilla ratio, it was trained on closer to 700 billion tokens,
mainly due to the limited availability of high-quality, text-based financial data at
the time of pre-training.
Even so, BloombergGPT reached state-of-the-art performance on financial
language benchmarks - and very good results on general-purpose language
benchmarks. However, there may be room for improvement in model
performance for the same 50 billion parameter model when more financial
text is sourced.
Summary
In this chapter, you saw how models are trained on vast amounts of text data
during an initial training phase called “pre-training”. This is where a model
develops its understanding of language.
You also learned about three different types of language models: encoder-only
autoencoding, decoder-only autoregressive, and encoder-decoder sequence-
to-sequence models.
Additionally, you learned some empirical scaling laws that have been
discovered for generative AI models - and how these scaling laws help you
choose the number of model parameters (e.g. 1 billion, 7 billion, 70 billion,
etc.) and dataset size (e.g. 700 billion tokens, 1.4 trillion tokens, 2 trillion
tokens, etc.) when pre-training your own model from scratch.
In the next chapter, you will explore some of the computation challenges of
training large models including GPU memory limitations. You will see how
to use quantization to reduce the memory requirements of your training job.
You will also learn how to efficiently scale model training across multiple
GPUs using distributed clusters.
Chapter 3. Quantization and
Distributed Computing
In this chapter, you will explore some of the challenges associated with pre-
training foundation models. In particular, GPU memory is relatively scarce
compared to CPU RAM. As such, you will explore various techniques such
as quantization and distributed computing to minimize the required GPU
RAM and scale horizontally across multiple GPUs for larger models.
For example, the original 40 billion-parameter Falcon model was trained on a cluster of
48 ml.p4d.24xlarge SageMaker instances consisting of 384 Nvidia A100
GPUs, 15TB of GPU RAM, and 55TB of CPU RAM. A more recent version
of Falcon was trained on a cluster of 392 ml.p4d.24xlarge SageMaker
instances consisting of 3,136 Nvidia A100 GPUs, 125TB of GPU RAM, and
450TB of CPU RAM. The size and complexity of the Falcon model requires
a cluster of GPUs, but also benefits from quantization as you will see here.
Computational Challenges
One of the most common issues you’ll encounter when you try to pre-train
large language models is running out of memory. If you’ve ever tried training,
or even just loading your model on NVIDIA GPUs, this error message might
look familiar.
NOTE
When exploring a new model, it’s recommended that you start with `batch_size=1` to find
the memory boundaries of your model with just a single training example. You can then
increase the batch size until you hit the CUDA out of memory error. This will determine
the maximum batch size for your model and dataset.
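One way to automate this probing is sketched below. It assumes a recent PyTorch version that exposes `torch.cuda.OutOfMemoryError`, and a `run_training_step` callable of your own that runs a single training step at a given batch size; both are assumptions for illustration.

import torch

def find_max_batch_size(run_training_step, start_batch_size=1):
    """Double the batch size until CUDA runs out of memory, then return the last size that fit."""
    batch_size = start_batch_size
    while True:
        try:
            run_training_step(batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return max(start_batch_size, batch_size // 2)
        batch_size *= 2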
Figure 3-3. Approximate GPU RAM needed to load and train a 1 billion parameter model at 32-
bit full precision
It’s worth noting that the Nvidia A100 and H100, used at the time of this
writing, only support up to 80GB of GPU RAM. And since you likely want to
train models larger than 1 billion parameters, you’ll need to find a
workaround.
Next, you will explore common data types for model training - as well as a
discussion about numerical precision.
Data Types and Numerical Precision
The following are the various data types used by PyTorch and TensorFlow:
`fp32` for 32-bit full precision, `fp16` for 16-bit half-precision, and `int8` for
8-bit integer precision. More recently, `bfloat16` has become a popular
alternative to fp16 for 16-bit precision. Most of the modern generative AI
models were pre-trained with `bfloat16` including FLAN-T5, Falcon, and
Llama2. Here, you’ll learn how these data types compare - and why
`bfloat16` is a popular choice for 16-bit quantization.
Suppose you want to store Pi using full 32-bit precision. Remember that
floating point numbers are stored as a series of bits consisting of only 0s and
1s. A number is stored in 32 bits using 1 bit for the sign (negative or
positive), 8 bits for the exponent, and 23 bits for the fraction, also called the
mantissa or significand, which represents the precision of the number. fp32
can store a range of numbers from approximately -3e38 to +3e38, as shown
below using Pi as our value.
Figure 3-4. fp32 representing Pi
Note that the act of storing a number in 32 bits actually causes a slight loss in
precision. You can see this by storing Pi as an fp32 value and printing the result.
The real value of Pi starts with 3.1415926535897932384. However, the closest
32-bit representation is 3.1415927410125732421875, which shows a loss in
precision just from storing the value.
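You can reproduce this yourself with a short PyTorch snippet (a sketch; the exact digits shown depend on the formatting precision you request):

import math
import torch

pi_fp32 = torch.tensor(math.pi, dtype=torch.float32)

# Printing with extra precision reveals the value actually stored in 32 bits,
# which drifts from the true value of Pi after a handful of decimal digits.
print(f"{pi_fp32.item():.20f}")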
Next, you will learn about a common technique called quantization to reduce
the memory requirements required to load and train your multi-billion
parameter model.
Quantization
Quantization is a popular way to convert your model parameters from 32-bit
precision down to 16-bit precision - or even 8-bit integers. By quantizing
your model weights from 32-bit full-precision down to 16-bit half-precision,
you can quickly reduce your 1 billion-parameter model-memory requirement
down 50% to only 2GB for loading and 40GB for training.
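These memory figures follow a simple rule of thumb that you can sketch in a few lines. The ~20x training overhead (optimizer states, gradients, activations, and temporary buffers) is an approximation implied by the figures in this chapter, not an exact law.

def estimate_gpu_memory_gb(num_parameters, bytes_per_parameter, training_overhead=20):
    """Rough GPU memory estimate: loading needs bytes_per_parameter per weight,
    while training needs roughly 20x that once optimizer states, gradients,
    and activations are included (an approximation)."""
    load_gb = num_parameters * bytes_per_parameter / 1e9
    train_gb = load_gb * training_overhead
    return load_gb, train_gb

print(estimate_gpu_memory_gb(1_000_000_000, 4))  # fp32: ~4 GB to load, ~80 GB to train
print(estimate_gpu_memory_gb(1_000_000_000, 2))  # fp16/bfloat16: ~2 GB to load, ~40 GB to train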
Quantization projects a source set of higher-precision floating point numbers
into a lower-precision target set of numbers. Using the source and target
ranges, the quantization mechanism first calculates a scaling factor, makes
the projection, then stores the results in reduced precision, which requires
less memory and ultimately improves training performance and reduces
cost.
With fp16, the 16-bits consist of one bit for the sign but only 5 bits for the
exponent and 10 bits for the fraction. The range of representable fp16
numbers is only from -65,504 to +65,504. Below is an example of quantizing
Pi from fp32 down to fp16.
Figure 3-5. Quantization from fp32 to fp16 saving 50% memory
Note the small loss in precision after this projection as there are only 6
places after the decimal point now. Remember that we already lost some
precision just by storing the value in fp32. The loss in precision is acceptable
in most cases, however, and the benefit of a 50% reduction in GPU memory is
typically worth the tradeoff since fp16 only requires 2 bytes of memory vs. 4
bytes for fp32, as shown below.
Figure 3-6. Only 40GB of GPU RAM is needed to load and train a 1 billion parameter model at
16-bit half precision
Another option worth exploring is int8, or 8-bit, quantization. Using 1 bit
for the sign, int8 values are represented by the remaining 7 bits. This results
in a dynamic range of -128 to +127. Unsurprisingly, Pi is projected to just `3`
in the 8-bit lower-precision space, as shown below. This brings the memory
requirement down from the original 4 bytes to just 1 byte, but obviously results
in a pretty dramatic loss of precision.
Figure 3-8. Quantization with int8
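A minimal PyTorch sketch makes these projections easy to verify; simple casting is used here purely for illustration, while real model quantization is typically handled by libraries such as bitsandbytes.

import math
import torch

pi_fp32 = torch.tensor(math.pi, dtype=torch.float32)

# Project down to 16-bit half precision: a small loss of precision, 50% less memory.
pi_fp16 = pi_fp32.to(torch.float16)

# Project down to an 8-bit integer: Pi collapses to 3, a dramatic loss of precision.
pi_int8 = pi_fp32.to(torch.int8)

print(pi_fp32.item(), pi_fp16.item(), pi_int8.item())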
And you could reduce the memory footprint further by representing the model
parameters with 4-bits, 2-bits, and even 1-bit! Just remember that you may
reduce the expressiveness and power of these models as you continue to
reduce precision.
When you try to train a 1 billion-parameter model at 32-bit full precision, you will
quickly hit the limit of a single Nvidia A100 or H100 GPU with only 80GB
of GPU RAM. Therefore, you will almost always need to use quantization
when using a single GPU. However, most modern generative AI models
exceed 1 billion parameters and require tens of thousands of gigabytes of GPU
RAM, as shown below.
Figure 3-11. GPU RAM needed for large models
For larger models, you will likely need to use a distributed cluster of GPUs
to train these massive models across hundreds or thousands of GPUs. Even
when a model fits on a single GPU, there are still some performance benefits to
scaling your training across multiple GPUs, even though it’s not strictly required.
However, in addition to the extra cost, training with a distributed cluster
requires a deeper understanding of distributed-computing techniques, which
you will explore next.
Distributed Computing
There are many different types of distributed computing patterns including
distributed data parallel (DDP) and fully-sharded data parallel (FSDP). The
main difference is how the model is split - or sharded - across the GPUs in
the system.
If the model parameters can fit into a single GPU, then you would choose
DDP to load a single copy of the model into each GPU. If the model is too
large for a single GPU - even after quantization - then you need to use FSDP
to shard the model across multiple GPUs. In both cases, the data is split into
batches and spread across all available GPUs to increase GPU utilization
and cost efficiency at the expense of some communication overhead which
you will see in a bit.
PyTorch comes with an optimized implementation of DDP that automatically
copies your model onto each GPU (assuming it fits into a single GPU - often
combined with quantization), splits the data into batches, and sends the
batches to each GPU in parallel. With DDP, each batch of data is processed
in parallel on each GPU followed by a synchronization step where the results
of each GPU (e.g. gradients) are combined (e.g. averaged). Subsequently,
each model - 1 per GPU - is updated with the combined results and the
process continues as shown below.
Figure 3-12. Distributed Data Parallel (DDP)
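Below is a minimal DDP sketch in PyTorch. It assumes the script is launched with `torchrun` so the rank environment variables are set, and `MyModel` and `train_dataloader` are placeholders for your own model and (distributed-sampler-backed) data loader.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)                # placeholder for your own nn.Module
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-5)

for batch in train_dataloader:                  # placeholder DataLoader with a DistributedSampler
    optimizer.zero_grad()
    loss = ddp_model(**batch).loss              # forward pass on this GPU's batch
    loss.backward()                             # gradients are averaged across all GPUs here
    optimizer.step()                            # every replica applies the same update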
Note that DDP assumes that each GPU can fit not only your model parameters
and data batches, but also any additional data that is needed to fulfill the
training loop including optimizer states, activations, temporary function
variables, etc as shown below. If your GPU cannot store all of this data, you
need to shard your model across multiple GPUs.
Figure 3-13. Memory consumption for DDP
ZeRO Stage 1 only shards the optimizer states across GPUs, but still reduces
your model’s memory footprint up to 4x. ZeRO Stage 2 shards both the
optimizer states and gradients across the GPUs to reduce GPU memory up to
8x. ZeRO Stage 3 shards everything - including the model parameters -
across the GPUs to help reduce GPU memory up to `n` times where `n` is the
number of GPUs. For example, when using ZeRO Stage 3 with 128 GPUs,
you can reduce your memory consumption by up to 128x.
Compared to DDP in which each GPU has a full copy of everything needed
to perform the forward and backward pass, FSDP needs to dynamically re-
construct a full layer from the sharded data onto each GPU before the
forward and backward passes as shown below.
Here, you see that before the forward pass, each GPU requests data from the
other GPUs on-demand to materialize the sharded data into unsharded, local
data for the duration of the operation - typically on a per-layer basis.
When the forward pass completes, FSDP releases the unsharded, local data
back to the other GPUs - reverting the data back to its original sharded state
to free up GPU memory for the backward pass. After the backward pass,
FSDP synchronizes the gradients across the GPUs, similar to DDP, and
updates the model parameters across all the model shards where different
shards are stored on different GPUs.
NOTE
You can also configure FSDP to offload some computation to CPU and CPU memory to
further reduce memory pressure on your GPUs. Another performance vs. memory
tradeoff.
A sharding factor of 1 avoids sharding and replicates the model across all
GPUs - reverting the system back to DDP. You can set the sharding factor to a
maximum of `n`, the number of GPUs, to unlock full sharding. Full
sharding offers the best memory savings - at the cost of GPU-communication
overhead. Setting the sharding factor to anything in between enables hybrid
sharding.
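In PyTorch, the sharding factor maps roughly onto FSDP’s sharding strategies, as sketched below. This assumes the process group has already been initialized (as in the DDP sketch) and that `model` is your own nn.Module.

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Full sharding: parameters, gradients, and optimizer states are sharded across all GPUs.
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

# Hybrid sharding: shard within a node, replicate across nodes.
# fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)

# No sharding: every GPU holds a full replica, which behaves like DDP.
# fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.NO_SHARD)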
Below is a comparison of FSDP and DDP from the 2023 paper Experiences
on Scaling Fully Sharded Data Parallel by Zhao, et al. These tests were
performed on different-size T5 models using 512 NVIDIA A100 GPUs,
each with 80GB of memory. They compare the number of teraFLOPs per
second per GPU, where a teraFLOP is 1 trillion floating-point operations.
This demonstrates that FSDP can scale model training for both small and large
models across different GPU cluster sizes. As the cluster size increases,
however, there will be a small decrease in overall throughput as
communication overhead increases between the GPUs.
Summary
In this chapter, you explored some of the computational challenges of training
these models due to GPU memory limitations. And you saw how to use
quantization to save memory, reduce cost, and improve performance. You
also learned how to scale your training across multiple GPUs and nodes in a
cluster using distributed computing strategies such as distributed data
parallel and fully-sharded data parallel. By combining quantization and
distributed computing, you can train very large models efficiently and cost
effectively with minimal impact on training throughput and model accuracy.
In the next chapter, you will learn how to adapt existing generative
foundation models to your own datasets using a technique called fine-tuning.
Fine-tuning an existing foundation model can be a less costly, yet sufficient
alternative to model pre-training from scratch.
Chapter 4. Fine-Tuning and
Evaluation
{dialogue}
Briefly summarize that dialogue.
{summary}
Here is a dialogue:
{dialogue}
Write a short summary.
{summary}
Dialogue:
{dialogue}
What is a summary of this dialogue?
{summary}
{dialogue}
What was that dialogue about, in two sentences or less?
{summary}
Here is a dialogue:
{dialogue}
What were they talking about?
{summary}
Dialogue:
{dialogue}
What were the main points in that conversation?
{summary}
Dialogue:
{dialogue}
What was going on in that conversation?
{summary}
Applying this template to each row in the samsum dataset, you actually create
7 fine-tuning examples per row, in this case. Since samsum contains approximately
16,000 rows of data, after applying the template you now have 112,000
examples to fine-tune your model for the conversation-summarization task!
See, getting to 100,000 examples wasn’t that difficult, was it?
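A sketch of this expansion is shown below; the `expand_row` helper is illustrative, and only two of the seven samsum templates are listed.

templates = [
    ("{dialogue}\n\nBriefly summarize that dialogue.", "{summary}"),
    ("Here is a dialogue:\n{dialogue}\n\nWrite a short summary!", "{summary}"),
    # ...the remaining samsum templates...
]

def expand_row(row):
    """Turn one dataset row into one prompt-completion example per template."""
    examples = []
    for prompt_template, completion_template in templates:
        examples.append({
            "prompt": prompt_template.format(dialogue=row["dialogue"]),
            "completion": completion_template.format(summary=row["summary"]),
        })
    return examples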
NOTE
Fine-tuning an already-fine tuned model like FLAN-T5 on a single task like summarization
will likely improve the model’s performance on summarization but may degrade the
model’s performance on other tasks like sentiment analysis and question-answer. This
phenomenon is called “catastrophic forgetting”. To combat catastrophic forgetting, you can
mix in a small percentage of multi-task examples - in addition to summarization - during the
fine-tuning process. Approximately 5% of mix-in, multi-task data is a good starting point.
You can use Python’s “f-string” formatting to create a sample prompt
from the `dialogue` column above, as shown here.

# Assumes `dialogue` holds the text of a single row's `dialogue` column.
prompt_template = f"""
Here is a dialogue:

{dialogue}
"""
You will then pass this prompt to the model which returns a sample
completion as shown below.
Figure 4-7. Generated summary before fine-tuning
Here, you see the model does OK, but the generated summary does not match
the human-curated completion which includes important details about the
conversation. The model has also generated some hallucinations with
information not found in the original conversation such as the hotel name.
First, you will prepare a dataset for fine-tuning by applying a similar
`prompt_template` above to the full dialogsum dataset, but adding the
`{summary}` column as this is needed for the fine-tuning. Below is the code
to apply this format to the dialogsum dataset - as well as convert the text into
`token_ids` used during fine-tuning.
# A plain (non-f) string template so it can be filled in later with .format().
# The {eos_token} placeholder appends the tokenizer's end-of-sequence token.
prompt_completion_template = """
Here is a dialogue:

{dialogue}

{summary}{eos_token}
"""
from transformers import AutoTokenizer
from datasets import load_dataset, DatasetDict

dataset = load_dataset("knkarthick/dialogsum")
print(dataset)

def tokenize_prompt_and_completion(sample):
    # Fill in the template, including the end-of-sequence token.
    prompt_and_completion = prompt_completion_template.format(
        dialogue=sample["dialogue"],
        summary=sample["summary"],
        eos_token=tokenizer.eos_token)
    return tokenizer(prompt_and_completion)

# Note: `tokenizer` is created in the fine-tuning listing that follows.
tokenized_prompt_and_completion_datasets = \
    dataset.map(tokenize_prompt_and_completion)

tokenized_prompt_and_completion_datasets = \
    tokenized_prompt_and_completion_datasets.remove_columns(
        ['id', 'topic', 'dialogue', 'summary'])

print(tokenized_prompt_and_completion_datasets)
Next, you will actually perform the fine-tuning on the tokenized dataset. Here
is the code for fine-tuning your model using the HuggingFace Transformers
library.
import torch
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    TrainingArguments,
    Trainer,
    GenerationConfig,
)

# `model_checkpoint` refers to the base model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    trust_remote_code=True,   # needed by Falcon
    torch_dtype=torch.bfloat16,
    device_map="auto",        # place shards automatically
)

training_args = TrainingArguments(
    output_dir="./output",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_prompt_and_completion_datasets['train'],
    eval_dataset=tokenized_prompt_and_completion_datasets['validation'],
)
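From here, starting the fine-tuning job and saving the result is a short step; the output path below is an assumption for illustration.

trainer.train()

# Save the fine-tuned model and tokenizer for later evaluation and deployment.
trainer.save_model("./fine-tuned-model")        # hypothetical output path
tokenizer.save_pretrained("./fine-tuned-model")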
NOTE
To use a model like Falcon or Llama2, just change `model_checkpoint`, replace `model`
definition with the one shown below using `AutoModelForCausalLM`, and re-run both the
data preparation and fine-tuning steps.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_checkpoint,
    trust_remote_code=True,   # required by Falcon
    torch_dtype=torch.bfloat16,
    device_map="auto",        # place shards automatically
)
Below is the model’s response after performing fine-tuning. You see that this
summary is closer to the human-curated baseline summary. The model
includes the important information and does not seem to hallucinate.
While this example uses the public `dialogsum` dataset to demonstrate fine-
tuning on custom data, you will likely use your company’s own internal data
such as the chat-support conversations exported from your customer-support
application. This helps your model learn the nuances of the interactions
between your customer-service representatives and your customers.
In this example, you qualitatively compared the summaries based on your
human understanding of language. While humans are very good at model
evaluation, they are not scalable, unfortunately. Next, you will learn about
more-formal techniques to create repeatable, scalable, and quantitative
mechanisms to evaluate your generative models before deciding to push to
production.
Evaluation Metrics
There exist many metrics to evaluate generative AI model performance, and
there is much debate in the community on their significance and effectiveness.
At their core, evaluation metrics provide a baseline to which you can
compare changes to your model such as fine-tuning.
Classic machine learning evaluation metrics such as accuracy and root-mean-
squared error (RMSE) are straightforward to calculate since the predictions
are deterministic and easy to compare against the labels in a validation or
test dataset.
The outputs from generative AI models, however, are famously non-
deterministic by design, which makes evaluation very difficult without human
intervention. Additionally, evaluation metrics for generative models are very
task-specific. For example, the ROUGE metric is used to evaluate
summarization tasks while the BLEU metric is used for translation tasks.
Since this chapter focuses on summarization, you will learn how to calculate
the ROUGE metric which reveals why it’s both useful and controversial at
the same time.
ROUGE, an acronym for Recall-Oriented Understudy for Gisting
Evaluation, measures how well a generated output (the `summary`, in our case)
compares to a reference output (the human-curated baseline summary, in our
case). To do this, ROUGE counts the matching unigrams (single words),
bigrams (two consecutive words), and longest common subsequences
between the generated and reference outputs to calculate the ROUGE-1,
ROUGE-2, and ROUGE-L scores, respectively. The higher the score, the more
similar they are.
By now, you might understand the controversy. Human language is full of
examples in which similar-looking phrases vary wildly in meaning, differing
by only a word or two - or a slight change in word position.
Consider the example, “This book is great” and “This book is not great”.
Using ROUGE alone, these phrases appear to be similar, as shown below.
However, they are, in fact, opposite.
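You can verify this with the HuggingFace `evaluate` library used later in this section; the two sentences come straight from the paragraph above.

import evaluate

rouge = evaluate.load('rouge')

scores = rouge.compute(
    predictions=["This book is not great"],
    references=["This book is great"],
)

# The scores are high even though the two sentences have opposite meanings.
print(scores)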
While ROUGE is far from perfect, it is suitable to use as a baseline metric
before and after fine-tuning your model. Many popular natural-language
libraries, including HuggingFace, support ROUGE. Below is the code to
evaluate your model using the `evaluate` library from HuggingFace. Here,
you see an approximately 80% improvement in the ROUGE scores after fine-
tuning on the `dialogsum` dataset based on a holdout test dataset not seen by
the model during fine-tuning.
import evaluate

rouge = evaluate.load('rouge')

original_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)
print(original_results)

```
{'rouge1': 0.2334,
 'rouge2': 0.0760,
 'rougeL': 0.2014}
```

tuned_results = rouge.compute(
    predictions=tuned_model_summaries,
    references=human_baseline_summaries,
    use_aggregator=True,
    use_stemmer=True,
)
print(tuned_results)

```
{'rouge1': 0.4216,
 'rouge2': 0.1804,
 'rougeL': 0.3384}
```
Summary
In this chapter, you saw how to fine-tune your model with instructions by
applying prompt templates to a dataset that matches your generative task and
use case. You also learned some common metrics, benchmarks and datasets
used to evaluate your model after making changes, compare your model to
other models, and measure the model’s toxicity and truthfulness.
Chapter 5. Parameter-efficient
Fine Tuning (PEFT)
With PEFT, most, if not all, of the original weights are kept frozen. As a result, the
number of trained parameters is much smaller than the number of parameters
in the original model. In some cases, the trainable parameters can be just 1-
2% of the original LLM weights. Because you’re training a relatively small
number of parameters, the memory requirements for fine-tuning become more
manageable, and fine-tuning can often be performed on a single GPU.
In addition to requiring fewer resources during fine-tuning, PEFT methods are
also less prone to catastrophic forgetting, as discussed in depth in Chapter 4,
when compared to full fine-tuning because the original model is only slightly
modified or left unchanged.
Another key consideration for PEFT involves common scenarios where you
need to adapt your model to multiple tasks. For example, let’s assume you
need to fine-tune for three separate tasks. If you use full fine-tuning for each
task, that results in a new model version for every task you train on, as shown
in Figure 5-3. Each of these new adapted models is the same size as the
original model, which can create an expensive storage and hosting problem
if you are performing full fine-tuning for multiple tasks.
Figure 5-3. Full fine-tuning creates a full copy of the original model for each task
With parameter-efficient fine-tuning, you train only a small number of
weights, which results in a much smaller model footprint overall - as small
as megabytes, depending on the task.
The new, or updated, parameters are combined with the original model weights
for inference. The PEFT weights are trained for each task and can be easily
swapped out at inference time, as shown in Figure 5-4. This allows for efficient
adaptation of the original model to multiple tasks.
Figure 5-4. PEFT reduces task-specific model weights and can be swapped out at inference
There are some things to consider when choosing between full fine-tuning
and parameter-efficient fine-tuning to adapt your model to specific tasks.
Table 5-1 summarizes these considerations.
Table 5-1. Considerations for choosing PEFT vs. Full Fine-Tuning
Selective methods are those that fine-tune only a subset of the original model
parameters or layers. There are several approaches that you can take to
identify which parameters or layers you want to update. As an example, the
BitFit method focuses on only training the bias weights or a subset of those
weights. Depending on the selective method used, you have the option to
train only certain components of the model, specific layers, or even
individual parameter types. Researchers have found that the performance of
these methods is mixed, and there are significant trade-offs between
parameter efficiency and compute efficiency.
Reparameterization methods also work with the original model parameters,
but reduce the number of parameters to train by creating new low-rank
transformations of the original network weights. A popular technique in this
category is Low-rank Adaptation (LoRA). Because LoRA is broadly used,
you’ll explore this method, along with a complementary method called
Quantized LoRA (QLoRA) more in later sections of this chapter.
Additive methods carry out fine-tuning by keeping all of the original model
weights frozen and introducing new trainable components. There are two
primary categories of additive methods: adapters and soft prompts.
Adapters add new trainable layers to the architecture of the model, typically
inside the encoder or decoder components after the Attention or Feed
Forward layers. On the other hand, soft prompts focus on manipulating the
inputs to achieve better performance. The inputs can be manipulated by
adding trainable parameters to the prompt embeddings or keeping the input
fixed and retraining the embedding weights. For this chapter, you’ll learn
more about one specific technique that falls within the soft prompt category
called Prompt Tuning.
In the next section, you’ll learn about a specific reparameterization technique
called LoRA.
Figure 5-6. During full fine-tuning every parameter in network layers is updated
After the embedding vectors are created, they’re fed into the self-attention
layers, where a series of weights are applied to calculate the attention
scores. During full fine-tuning, every parameter in these layers is updated.
This process of updating every parameter can require a lot of compute
resources and time.
LoRA is a fine-tuning strategy that reduces the number of parameters to be
trained by freezing all of the original model parameters and then inserting a
pair of “rank decomposition matrices” alongside the original weights. The
dimensions of the smaller matrices, shown in Figure 5-7 as A and B, are
defined so that their product is a matrix with the same dimensions as the
weights they are modifying. You then keep the original weights of the model
frozen and train these smaller matrices using the same supervised fine-tuning
process described previously.
Figure 5-7. Low-rank matrices are learned during LoRA fine-tuning process
The size of the low-rank matrices is set by a parameter called rank (`r`).
The original pre-trained model contains full-rank weight matrices for the
tasks it was trained on. For new tasks, LoRA relies on the fact that the required
update can be represented by a lower-dimensional matrix. The size of this lower-
dimensional matrix is determined by the value of `r`. A smaller value leads
to a simpler low-rank matrix with fewer parameters to train. While it’s
important to experiment with the right value of `r` for your own tasks, you
can often achieve good results with a small `r` value (e.g. 4 or 8).
There are different options for implementing LoRA fine-tuning; open
source libraries such as HuggingFace PEFT support the different PEFT methods.
Below is an example using the HuggingFace PEFT and Transformers libraries
to perform LoRA fine-tuning (for ‘Task 1’) with a rank of 32.
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

peft_model = get_peft_model(original_model, lora_config)

peft_training_args = TrainingArguments(
    output_dir="./model",
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"]
)
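From here, a hedged sketch of the remaining steps is to confirm how few parameters are actually trainable and then launch training:

# Show how few parameters LoRA actually trains relative to the full model.
peft_model.print_trainable_parameters()

# Fine-tune only the low-rank matrices; the original model weights stay frozen.
peft_trainer.train()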
For inference, the two low-rank matrices are multiplied together to create a
matrix with the same dimensions as the frozen weights. You then add this to
the original weights and replace them in the model with these updated values,
as shown in Figure 5-8. You now have a LoRA fine-tuned model that
can carry out your specific task. To return to the original weights for another
task, you can subtract the low-rank update from the combined
weights. Because the resulting model has the same number of parameters as the
original, there is little to no impact on inference latency.
Figure 5-8. Low-rank matrices multiplied together and added to original weights for inference
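With the HuggingFace PEFT library, this merge step can optionally be performed with `merge_and_unload()`, sketched below; keep the unmerged adapter if you still want to swap tasks at runtime.

# Optionally fold the low-rank matrices into the frozen base weights so that
# inference uses a single set of weights with no separate adapter.
merged_model = peft_model.merge_and_unload()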
Below is the code for performing inference against your LoRA fine-
tuned model using the low-rank matrices for the task (‘Task 1’) that you
previously fine-tuned. Here, you are using `is_trainable=False` because you
are only performing inference - not updating any weights. There are some
performance benefits to marking the parameters as immutable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

peft_model_base = AutoModelForCausalLM.from_pretrained(
    base_model_dir,
    torch_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(base_model_dir)

peft_model = PeftModel.from_pretrained(
    peft_model_base,
    model_dir,
    torch_dtype=torch.bfloat16,
    is_trainable=False)

peft_model_1 = PeftModel.from_pretrained(
    peft_model_base,
    model_1_dir,
    torch_dtype=torch.bfloat16,
    is_trainable=False)

peft_model_2 = PeftModel.from_pretrained(
    peft_model_base,
    model_2_dir,
    torch_dtype=torch.bfloat16,
    is_trainable=False)
QLoRA, a variant of LoRA that you will see again in the chapter summary, combines LoRA with 4-bit quantization of the frozen base model and applies the low-rank adapters to more than just the attention layers, as shown in the configuration below.
import torch
import transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model with its weights quantized to 4-bit NormalFloat (nf4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config)

# Apply LoRA adapters to the attention and feed-forward layers
config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

trainer = transformers.Trainer(
    model=model,
    args=transformers.TrainingArguments(
        …
        bf16=True
    )
)
As you can see, the scores are fairly low for the base model and improve when performing full fine-tuning by updating all of the model parameters. The metric drops a bit when using LoRA-based parameter-efficient fine-tuning. However, LoRA fine-tuning trained a much smaller number of parameters than full fine-tuning - in this case, about 1.4% of them - using significantly less compute, so this small trade-off in performance may well be worth it.
Choosing the optimal rank for the LoRA matrices is still an active area of research, and you will need to experiment to determine the value that meets your performance and resource requirements. In general, the smaller the rank, the smaller the number of trainable parameters, and the bigger the savings on compute. However, there are potential issues related to model performance to consider.
In the original LoRA research paper, "LoRA: Low-Rank Adaptation of Large Language Models" (Hu, 2021), researchers at Microsoft explored how different choices of rank impacted model performance on language generation tasks. In general, they found that the benefit of a higher rank appears to plateau once the rank exceeds a value of 16. This means that setting the rank between 4 and 16 can often provide a good trade-off between reducing the number of trainable parameters and preserving acceptable model performance.
In the previous sections, you learned about LoRA, a PEFT technique that falls into the reparameterization category. In the next section, you'll learn about a technique from the additive category, in which trainable layers or parameters are added to the model.
One such additive technique is prompt tuning, which adds trainable "soft prompt" vectors that are prepended to the embedding vectors representing your input text. These soft prompt vectors have the same length as the embedding vectors representing the language tokens. Research has shown that somewhere between 20 and 100 virtual tokens can be enough to achieve good performance for your task. While the tokens of a hard prompt correspond to specific input text, the virtual tokens of a trainable soft prompt do not directly represent discrete text.
Prompt tuning falls into the additive category of PEFT methods because you are adding soft prompts while the weights of the underlying large language model remain frozen. The embedding vectors of the soft prompt are updated over time to optimize the model's ability to accurately complete the prompt. Because you are only tuning a small set of soft prompts, this is a parameter-efficient strategy compared to full fine-tuning of the model. Also, similar to LoRA, you can train a different set of task-level soft prompts and swap them out during inference. To do this, you prepend your input prompt with the learned soft prompt tokens specific to your task, as shown in the sketch below.
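As a rough sketch of how this looks in code, the Hugging Face PEFT library used earlier also supports prompt tuning. The base model name and the number of virtual tokens below are illustrative assumptions, not values from this chapter.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

# Illustrative base model; substitute the model you are adapting
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_tuning_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,  # soft prompts start as random vectors
    num_virtual_tokens=20                        # roughly 20-100 is often enough
)

peft_model = get_peft_model(model, prompt_tuning_config)

# Only the soft prompt embeddings are trainable; the base model stays frozen
peft_model.print_trainable_parameters()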
Prompt tuning performance varies. Research has shown that prompt tuning may not perform as well as full fine-tuning for smaller LLMs, but as the model size increases, the performance of prompt tuning tends to improve. As an example, research (Lester, 2021) has shown performance equivalent to full fine-tuning for some models with around 10 billion parameters on the SuperGLUE evaluation benchmark. The primary challenge with prompt tuning tends to be interpretability: the learned virtual tokens can take on any value within the continuous embedding vector space, and they do not necessarily correspond to any known token or discrete word in the vocabulary of the LLM. That said, analysis of the nearest-neighbor tokens to the soft prompt locations shows that the soft prompts typically form tight semantic clusters - the words closest to the soft prompt tokens have similar meanings - which suggests that the soft prompts learn word-like representations.
Summary
In this chapter, you explored LoRA which uses rank decomposition matrices
to update the model parameters in an efficient way. With LoRA, the goal is to
find an efficient way to update the weights of the model, without having to
train every single parameter again.
LoRA is a powerful fine-tuning method that achieves great performance. Because LoRA reduces the amount of resources needed to fine-tune your models relative to full fine-tuning, it is used widely in practice for many tasks and use cases. The principles behind this method are useful not just for training generative language models, but also for other types of models, including image and video models.
QLoRA is a variant of LoRA that quantizes the frozen base model using a 4-bit data type called NormalFloat4 (nf4) and applies low-rank adapters to more than just the attention layers of the Transformer.
In the next chapter, you will learn a powerful technique called reinforcement
learning from human feedback (RLHF) to fine-tune your generative models to
align with human values and preferences.
Chapter 6. Fine-Tuning using
Reinforcement Learning from
Human Feedback (RLHF)
The “action” is chosen from the “action space”, which consists of all possible tokens. Specifically, the next token is chosen based on the probability distribution over all tokens in the model’s vocabulary. The “environment” is the model’s context window. The “state” consists of the tokens that are currently in the context window.
In the context of generating language, the sequence of actions and states
resulting in a reward is called a “rollout”. This is in contrast to the term,
“playout,” used in classic RL. The “reward” is based on how well the
model’s completion aligns with a human preference such as helpfulness. As
the model experiences more rollouts and rewards, it will learn to generate
tokens that produce a high reward - a more helpful completion, in this case.
Measuring helpfulness is a bit trickier than tracking a car’s time to complete
a race, but you can come close by using human annotations, or human
feedback, to train a reward model. You can then use the reward model during
the RLHF process to encourage the model to generate more human-aligned
completions by reinforcing with positive rewards. The reward model plays a
key role in RLHF as it encodes the preferences learned from human
feedback.
Over the next few sections, you will see the end-to-end RLHF process
starting with collecting human feedback to train the reward model.
* Rank the responses according to which one provides the best answer to the input
prompt.
* What is the best answer? Make a decision based on (a) the correctness of the
answer, and (b) the informativeness of the response. For (a) you are allowed to
search the web. Overall, use your best judgment to rank answers based on being the
most useful response, which we define as one which is at least somewhat correct, and
minimally informative about what the prompt is asking for.
* Long answers are not always the best. Answers which provide succinct, coherent
responses may be better than longer ones, if they are at least as correct and
informative.
To ensure quality labeling and feedback, make sure you provide clear
instructions to help the labelers understand their task, the human-alignment
criteria, and any edge cases. Generally, instructions should clearly describe
the task for the labeler. Typically, they are asked to rank the completions for
a given prompt according to some criteria. The more details you share, the
more likely the labeler will correctly perform the task and provide a high-
quality, human-aligned ranking dataset to train your reward model.
Note the additional guidance in the second bullet point above. The item asks
that the labelers make decisions based on their perception of the correctness
and informativeness of the response. They are allowed to use the internet to
verify the information.
They are also given clear instructions about what to do if they encounter an
edge case where 2 or more completions are equally correct and informative.
Though discouraged, the labelers are allowed to rank these equal
completions the same.
A final instruction worth calling out here is what to do in the case of a
nonsensical, confusing, or irrelevant answer. In this case labelers should
select “F” to fail the answer rather than rank it. This way, poor quality
answers can be filtered and removed from the dataset altogether.
Providing these detailed human instructions will increase the likelihood that
the responses will be high quality and that all individual humans will carry
out the task in a consistent way. This helps ensure that the humans reach a
consensus.
As a best practice, you should send the same prompt-completion sets to multiple human labelers to help reduce the impact of individual labeling mistakes, such as misreading the instructions or ranking in the opposite order. In addition, the group of human labelers should represent a diverse set of cultures from across the world to encourage global thinking and help reduce local bias.
Below are example rankings for the 4 generated completions to the prompt,
“Tell me a funny story.” You now want the human labelers to rank the
completions from the most helpful (1) to the least helpful (4).
Figure 6-4. Collect human feedback from human labelers
By repeating this process for many different prompts across many labelers,
you are creating a human-preference dataset that you will use to train your
RL reward model as you will see in a bit. First, you need to modify your
dataset slightly before you can train your reward model.
Note that with 3 possible completions for a prompt, you will generate 3 rows of reward-training data. The field of combinatorics dictates that, for `n` possible completions, you will generate `(n choose 2) = n(n-1)/2` pairwise comparisons.
Therefore, if you had 4 possible completions, you would generate 6 pairwise comparisons, 5 possible completions would generate 10 pairwise comparisons, and so on. This is why reaching 10,000-20,000 rows of reward-training data is relatively easy - each ranked prompt generates a combinatorial number of rows of training data, as shown in the sketch below.
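A minimal sketch of this expansion, using placeholder completions already ordered by a labeler's ranking:
from itertools import combinations

# One prompt's completions, ordered from most helpful (rank 1) to least helpful
ranked_completions = ["completion_a", "completion_b", "completion_c", "completion_d"]

# Each pair becomes one row of reward-model training data: (preferred, rejected)
pairwise_rows = list(combinations(ranked_completions, 2))

print(len(pairwise_rows))          # 6 rows from 4 completions, i.e. n*(n-1)/2
for preferred, rejected in pairwise_rows:
    print(preferred, ">", rejected)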
NOTE
While thumbs-up/down human feedback is often easier to capture than multi-completion ranking feedback, the ranking feedback gives you combinatorially more data to train your reward model, as shown here.
Figure 6-7. Train the model to predict the preferred completion yj from {yj, yk} for prompt `x`
In this example, an existing binary classifier, Meta's RoBERTa-based hate speech model, serves as the reward model: it predicts the probabilities of two classes, "not hate" and "hate".
from transformers import AutoTokenizer, AutoModelForSequenceClassification

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name,
                                                   device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name,
                                                                    device_map="auto")

toxic_text = "You are disgusting and terrible and i dang hate you"

# Tokenize the text and score it with the classifier
toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')
Output:
In this case, you want to optimize the model to generate completions that,
along with the prompt, will classify as “not hate”.
NOTE
The reason to include the prompt is that the toxicity of the completion may change given
the context of the prompt.
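Building on the classifier above, here is a minimal sketch of turning the logits into a scalar reward, where `prompt` and `completion` stand in for a prompt-completion pair from your dataset:
import torch

# Score the prompt and completion together, per the note above
text = prompt + " " + completion
input_ids = toxicity_tokenizer(text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids).logits                    # [not hate, hate]
not_hate_reward = torch.softmax(logits, dim=-1)[0, 0].item()

# A value closer to 1.0 means "not hate", i.e. a higher reward for the RL step
print(not_hate_reward)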
With your reward model trained and working as expected, you can now
explore how to use the model in the broader reinforcement learning process
to fine-tune your generative model for better human-alignment.
Figure 6-10. Using PEFT within RLHF to minimize the resources needed to fine-tune the
generative model
A single iteration of the RLHF process ends with updating the generative model's weights. The iterations continue for a given number of steps and epochs, similar to other types of model training and fine-tuning. After a while, the generative model should start to receive higher rewards as it produces less toxic completions. This process continues until the model is considered aligned based on an evaluation threshold, such as a toxicity score, or until the maximum number of iterations is reached.
The fine-tuned, human-aligned model is then ready for evaluation and,
depending on the evaluation result, ready for production deployment.
You'll need to create a baseline toxicity score for the original instruct model by evaluating its prompt-completion pairs with a model that can assess toxic language - in this case, the same binary hate-speech classifier used above. You can use this classifier to determine whether your generative model is below a desired toxicity threshold, or at least has decreased relative to the baseline toxicity score of the original instruct model, as you see in the code samples below.
Here, you define the `evaluate_toxicity` function to calculate the mean and standard deviation of the toxicity score across all prompt-completion pairs in the provided dataset, using the `toxicity_evaluator` defined below.
import numpy as np
from tqdm import tqdm
from transformers import GenerationConfig

def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):
    max_new_tokens = 100
    toxicities = []

    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]
        if i > num_samples:
            break

        # Generate a completion for the prompt
        input_ids = tokenizer(input_text, return_tensors="pt").input_ids
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)
        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)
        generated_text = tokenizer.decode(response_token_ids[0],
                                          skip_special_tokens=True)

        # Score the prompt + completion with the toxicity evaluator
        toxicity_score = toxicity_evaluator.compute(predictions=[input_text + " " + generated_text])
        toxicities.extend(toxicity_score["toxicity"])

    return np.mean(toxicities), np.std(toxicities)
import evaluate

toxicity_evaluator = evaluate.load(
    "toxicity",
    "facebook/roberta-hate-speech-dynabench-r4-target",
    module_type="measurement",
    toxic_label="hate")

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(
    model=model_before_rlhf,
    toxicity_evaluator=toxicity_evaluator,
    tokenizer=tokenizer,
    dataset=dataset["test"],
    num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

#
# Perform RLHF here…
#

mean_after_detoxification, std_after_detoxification = evaluate_toxicity(
    model=model_after_rlhf,
    toxicity_evaluator=toxicity_evaluator,
    tokenizer=tokenizer,
    dataset=dataset["test"],
    num_samples=10)

print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

# Calculate the improvement relative to the baseline
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')
Output:
Summary
Fine-tuning your generative models for human values is a very important tool
in your generative toolbox to improve your model’s helpfulness, honesty, and
harmlessness (HHH). RLHF is a very active area of research with a great
amount of impact on making these models more human-like, useful, and
enjoyable. In this chapter, you learned the fundamentals of RLHF which will
help you understand the field as it evolves further.
Specifically, you saw how to collect human feedback including prompt-
completion rankings, process the feedback in order to train a reward model,
and ultimately use reinforcement learning to reduce a generative model’s
toxicity.
Now that you have a human-aligned, reduced toxicity model, you will see
how to prepare it for low-latency, high-performance inference in the next
chapter.
Chapter 7. Optimize Generative
AI Models for Inference
After you have adapted your model to your target task, you will ultimately want to deploy it so you can begin interacting with it - and potentially integrate it into an application designed to consume the model.
When you are ready to deploy your LLMs, you need to understand the resources your model will need as well as the intended experience for interacting with it. Identifying those resource requirements includes answering questions such as: How fast do you need the model to generate completions? What compute budget do you have available? And what trade-offs are you willing to make in model performance in order to achieve faster inference speed or reduced storage costs?
In this chapter, you will explore various techniques to perform post-training optimizations on your model, including pruning, quantization, and distillation. There are also deployment configurations you may need to tune after deployment, such as selecting the optimal compute resources to balance cost and performance.
There are broader considerations for building generative AI applications, which you'll learn more about in Chapters 9 and 10. This chapter, however, focuses on techniques specifically aimed at optimizing your model for inference to improve your end user's experience.
Pruning
Pruning aims to eliminate model weights that are not contributing
significantly to the model’s overall performance. By eliminating those model
weights, you’re able to reduce the model size for inference which reduces the
compute resources required.
Figure 7-2. Pruning aims to reduce the overall model size by eliminating weights that are not
contributing to model performance
The model weights eliminated during pruning are those with a value of zero or very close to zero. There are various methods to perform pruning: some require full retraining, some prune using a parameter-efficient fine-tuning (PEFT) technique like those discussed in Chapter 5, and others focus on post-training pruning. Pruning during training is accomplished through unstructured pruning, which removes individual weights, or structured pruning, which removes entire columns or rows of the weight matrices. Both approaches require retraining; however, there are post-training pruning methods, typically referred to as one-shot pruning methods, that can prune without retraining. The challenge is that one-shot pruning is compute intensive for models with billions of parameters. One post-training pruning method, SparseGPT (Frantar, 2023), was introduced in a research paper to overcome the challenges of one-shot pruning on large language models. This method is specifically built for GPT-style foundation models and introduces an algorithm that performs sparse regression at a large scale.
In theory, pruning reduces the size of the LLM, which reduces compute resources and model latency. In practice, however, some LLMs have only a small percentage of weights that are zero or near-zero, so pruning may not have a large impact on the model size.
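As a minimal illustration (not a method discussed above), the sketch below uses PyTorch's built-in pruning utilities to zero out the lowest-magnitude weights of a single layer, which stands in for one of the model's weight matrices.
import torch
import torch.nn.utils.prune as prune

# A single linear layer standing in for one of the model's weight matrices
layer = torch.nn.Linear(1024, 1024)

# Unstructured pruning: zero out the 30% of weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the re-parameterization hooks
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {sparsity:.2f}")   # ~0.30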
Quantization
Quantization aims to transform a model's weights to a lower-precision representation with the goal of reducing the model's size as well as the compute requirements for hosting LLMs. Chapter 4 detailed a method of quantization that can be performed during training called quantization-aware training (QAT). In this section, the focus is on a method performed after training to optimize the model for deployment: post-training quantization (PTQ).
Most models are trained with 32-bit precision, and doing matrix multiplication in 32-bit can be resource intensive. PTQ transforms a model's weights to a lower-precision representation, such as 16-bit floating point or 8-bit integer, to reduce the model's size and memory footprint, which translates into fewer compute resources needed to serve your model. Figure 7-3 shows a model quantized from 32-bit floating point to 16-bit floating point. At a high level, you can estimate that going from a 32-bit to a 16-bit representation will result in a model that is approximately half the size of the original.
Figure 7-3. Reduce memory footprint and inference latency with post-training quantization
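As a rough sketch of that size reduction, you can load the same Hugging Face model at two precisions and compare memory footprints; the model name below is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model_name = "..."   # placeholder for your model

# Default 32-bit floating point weights
model_fp32 = AutoModelForCausalLM.from_pretrained(model_name)

# Weights cast to 16-bit floating point after training
model_fp16 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# The 16-bit copy should be roughly half the size of the 32-bit copy
print(f"fp32: {model_fp32.get_memory_footprint() / 1e9:.2f} GB")
print(f"fp16: {model_fp16.get_memory_footprint() / 1e9:.2f} GB")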
PTQ can be applied to just the model weights, or to both the weights and the activation layers. Quantization approaches that include both the activations and the model weights typically have a higher impact on model performance. Quantization applied to both the model weights and activations is often referred to as dynamic range quantization. This method requires an extra calibration step to statistically capture the "dynamic range" of the original parameter values, as shown in Figure 7-4. The calibration step uses a small unlabeled dataset to identify the dynamic range - more specifically, the minimum and maximum - of the input values.
Figure 7-4. Dynamic range post training quantization requires an extra calibration step
Distillation
Distillation is a technique that reduces the model size, which ultimately reduces the number of computations and improves inference performance. Distillation uses statistical methods to train a smaller student model from a larger teacher model. The end result is a student model that retains a high percentage of the teacher model's accuracy but uses a much smaller number of parameters. The student model is then deployed for inference. The smaller model requires less hardware and therefore costs less per inference request.
The teacher model is often a generative foundation model - or a fine-tuned
variant. During the distillation training process, the student model learns to
statistically replicate the behavior of the teacher model. The teacher model
weights do not change during the distillation process - only the student model
weights change. The teacher model’s output is used to “distill” knowledge to
the student model.
Both the teacher and student models generate completions from a prompt-based training dataset. A distillation loss is calculated by comparing the two models' output token distributions. The loss is then minimized during the distillation process using backpropagation to improve the student model's ability to match the teacher model's predicted next-token probability distribution.
The teacher model's predicted tokens are known as "soft labels" while the student model's predicted tokens are called "soft predictions". In parallel, you compare the student model's predictions, called the "hard predictions", against the ground-truth "hard labels" from the prompt dataset. The difference is the "student loss". The distillation loss and student loss are combined and used to update the student model's weights using standard backpropagation.
Figure 7-5. The combination of distillation loss and student loss is minimized to distill knowledge from teacher to student
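A minimal PyTorch sketch of this combined loss is shown below; the temperature and the mixing weight `alpha` are illustrative hyperparameters rather than values from this chapter.
import torch
import torch.nn.functional as F

def distillation_step_loss(student_logits, teacher_logits, hard_labels,
                           temperature=2.0, alpha=0.5):
    # Distillation loss: match the student's softened distribution (soft predictions)
    # to the teacher's softened distribution (soft labels)
    soft_labels = F.softmax(teacher_logits / temperature, dim=-1)
    soft_predictions = F.log_softmax(student_logits / temperature, dim=-1)
    distillation_loss = F.kl_div(soft_predictions, soft_labels,
                                 reduction="batchmean") * (temperature ** 2)

    # Student loss: standard cross-entropy against the ground-truth hard labels
    student_loss = F.cross_entropy(student_logits, hard_labels)

    # The combined loss updates only the student's weights via backpropagation
    return alpha * distillation_loss + (1.0 - alpha) * student_loss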
NOTE
In practice, distillation may not be as effective for generative decoder models as it is for encoder models like BERT. This is because the output space is relatively large for decoder models (e.g. a vocabulary of 100,000 tokens) without a lot of redundancy in the representation.
Next, you will see how to deploy your generative model into production using Amazon SageMaker.
Deploying Generative AI Models on AWS
When you’re ready to deploy your generative model, you can use Amazon
SageMaker Endpoints to host and scale your models. Below is the code to
create the Amazon SageMaker Endpoint including an
`EndpointConfiguration` which includes the hardware using `InstanceType`
and `InitialInstanceCount`. In this case, you’re deploying 2 different variants
of your model in an A/B test across 10 GPU-based SageMaker instances.
50% of your traffic will go to `model_variant_1` and 50% will go to
`model_variant_2`.
import boto3

sm = boto3.Session().client(service_name="sagemaker")

model_variant_1 = "generative-model-1"
model_variant_2 = "generative-model-2"

endpoint_config = "generative-endpoint-config"

# Split traffic 50/50 between the two model variants across 10 instances total
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config,
    ProductionVariants=[
        {
            "ModelName": model_variant_1,
            "VariantName": "variant-1",
            "InstanceType": "ml.g5.12xlarge",
            "InitialVariantWeight": 50,
            "InitialInstanceCount": 5
        },
        {
            "ModelName": model_variant_2,
            "VariantName": "variant-2",
            "InstanceType": "ml.g5.12xlarge",
            "InitialVariantWeight": 50,
            "InitialInstanceCount": 5
        }
    ]
)

endpoint_name = "generative-endpoint"

sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config
)

# Wait until the endpoint is in service before sending requests
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
Here, you are comparing the 2 variants and, at some point, will likely send
100% traffic to the better model based on some evaluation criteria or longer-
term objective such as increasing revenue or reducing churn.
Additionally, SageMaker Endpoints support advanced deployment strategies
including shadow deployments. In contrast to an A/B deployment, a shadow
deployment puts a model into production where it accepts the same input as
the model being shadowed, but it simply stores the model response to disk
for offline analysis. This helps you conservatively evaluate a model against
live production inputs without exposing potentially bad responses to the end
user.
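SageMaker exposes shadow testing through the `ShadowProductionVariants` field of the endpoint configuration. The sketch below is illustrative and reuses the client and model names from the A/B example above.
# All live traffic is served by the production variant; the shadow variant
# receives a copy of each request, and its responses are not returned to the
# caller but can be captured for offline analysis.
sm.create_endpoint_config(
    EndpointConfigName="generative-endpoint-config-shadow",
    ProductionVariants=[
        {
            "ModelName": model_variant_1,
            "VariantName": "production-variant",
            "InstanceType": "ml.g5.12xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 5
        }
    ],
    ShadowProductionVariants=[
        {
            "ModelName": model_variant_2,
            "VariantName": "shadow-variant",
            "InstanceType": "ml.g5.12xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 5
        }
    ]
)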
Once the model is deployed as a SageMaker endpoint, you can generate
completions for your prompts using the following code:
import json
import sagemaker
from sagemaker import Predictor

sess = sagemaker.Session()

prompt = """
Summarize the following conversation.

Summary:
"""

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
)

response = predictor.predict(prompt,
    {
        "ContentType": "application/x-text",
        "Accept": "application/json",
    }
)

response_json = json.loads(response.decode('utf-8'))
print(response_json)
Summary
In this chapter, you saw powerful techniques to optimize your model for
inference by reducing the size of the model through distillation, quantization,
or pruning. These techniques help reduce model size and improve model
inference performance with minimal impact on model accuracy - ultimately
improving the user’s happiness. They also help to minimize the amount of
hardware resources needed to serve your generative models in production -
ultimately lowering cost and improving your CFO’s happiness!
In the next chapter, you will explore some popular mechanisms to augment
the capabilities of your generative models using external data sources and
APIs.
Chapter 8. Retrieval Augmented
Generation (RAG) and
Information Retrieval
One thing to keep in mind is there are foundation models, especially some of
the more current models, that include additional built-in guardrails to avoid
hallucinations. For example, asking Anthropic’s Claude V2 model the same
question results in the model acknowledging it does not have information
about ‘snazzy-fluffikens’ as a dog breed.
Figure 8-2. LLMs can use additional built-in guardrails to avoid hallucinations
The second common issue, shown in Figure 8-3, is known as knowledge cut-off, which results in the model returning an answer that is out of date. All foundation models have a knowledge cut-off, which refers to the point in time when their training data was collected. This is important because the knowledge of the model is limited to the data that was current at the time it was trained. For example, if you ask the model who recently won the NBA championship, it will give you the most recent information it has available - in this case, the champions in 2021. It won't provide the most current answer because that information lies outside the knowledge the model was trained on.
Figure 8-3. Knowledge limitations in LLMs - knowledge cut-off
RAG provides a technique that allows you to mitigate some of the challenges
with hallucinations and knowledge cut-off in foundation models. For
hallucinations, RAG is useful because you are able to provide the model
with access to information it would not already have, such as proprietary
data for your business. For knowledge cut-offs, RAG allows you to provide
access to current information beyond the model's training date. This technique has gained a lot of traction as a way to augment foundation models with additional information, including domain-specific information, without the need to continuously perform full fine-tuning. The next section discusses RAG in more detail.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) isn’t a specific set of technologies
but rather a framework for providing LLMs access to data they did not see
during training. A number of different implementations exist, and the one you
choose will depend on the details of your task and the format of the data you
have to work with. In general, RAG allows applications backed by large
language models to make use of external data sources and applications to
overcome some of the knowledge limitations previously discussed.
RAG works by providing your model access to additional external data at run-time. This data can come from a number of data sources, including knowledge bases, document stores, databases, and data that is searchable through the internet, as shown in Figure 8-4.
Figure 8-4. External data sources
RAG is useful in any case where you want the language model to have access
to additional data that is not contained within its knowledge base. This could
be data not in the original training data or proprietary information stored in
your organization’s internal data stores. Allowing your model to have access
to this information helps improve both the relevance as well as the accuracy
of completions.
As previously mentioned, RAG is a framework for which a number of implementations exist. Figure 8-5 outlines a high-level pattern, detailing the concepts of a retriever and a generator. These concepts were introduced in the "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" research paper (Lewis, 2020). At a high level, the prompt is augmented with additional information from external data sources to improve the accuracy and currency of the completion: the retriever is responsible for searching through external information sources to find relevant text, while the generator uses the retrieved documents, along with the initial prompt, to generate a completion.
Figure 8-5. RAG integrates with external information sources to augment prompt input
Figure 8-7. Chunking is often used to reduce noise and improve relevancy
Determining the optimal chunking strategy can improve your search results. There are a few considerations to keep in mind. First, consider the size of your indexed content - whether it's long documents such as books or shorter content like product reviews. Chunking shorter content may not have much impact, while chunking longer documents is not only necessary but also improves the ability to find the most relevant information for a given search. Second, chunking may be required depending on how your results will be used, due to the context window limits discussed later in this section. Third, the model you choose may also have guidance around the optimal chunk size. Finally, there is a concept called overlap, which refers to sharing a defined amount of text between adjacent chunks. Overlap can help preserve context across chunk boundaries, so it's another parameter to experiment with in your chunking strategy.
Once the documents have been indexed, they can be used to retrieve information as part of your RAG-based prompt workflows. That workflow starts with an input user prompt, as shown in Figure 8-8. In this case the prompt is asking 'What group is responsible for maintenance on product FlashTag?'. The prompt input goes through the same query encoder and embedding model to create a vector embedding representation of the prompt. Those embeddings are then used to query the database for similar vector embeddings and return the most relevant text as the query result.
Figure 8-8. Information retrieval based on prompt input
Figure 8-9. Using the augmented prompt to generate the completion using additional context
The new, expanded prompt that now contains information about the product team is then passed to the LLM. The model uses the information in the context of the prompt to generate a completion that contains the correct answer.
This section demonstrated an example of using RAG to access external information in order to explain the fundamental concepts behind RAG. RAG is a powerful technique for incorporating data outside the model's knowledge without needing to perform full fine-tuning. RAG architectures can integrate multiple types of external information sources. You can augment large language models with access to local documents, including private wikis and applications. RAG can also allow access to the internet to retrieve information posted on web pages, for example, Wikipedia. By encoding the user input prompt as a SQL query, RAG can also interact with databases.
However, there are some things to consider with RAG. First, there is potential for added latency due to the additional API calls and knowledge retrieval from external memory. Second, there are considerations around the size of the context window. Every LLM has a context window, which was covered in Chapter 2. The size of the context window varies with each LLM, but most text sources are too long to fit into the limited context window of the model, which is still at most a few thousand tokens for many models. In these cases, the external data sources are parsed into many chunks, each of which fits within the context window, as shown in Figure 8-10. There are libraries and frameworks, such as LangChain, that can handle this work for you using different strategies, and many of the latest models continue to increase the size of their context windows.
Figure 8-10. Data processing long documents into multiple small chunks
In the next section, building applications that implement and orchestrate the various tasks in a RAG workflow is explored in greater detail.
Luckily, there are frameworks that take some of the heavy lifting out of implementing these solutions. This section explores a popular framework called LangChain, which provides modular pieces containing the components necessary to work with large language models and implement techniques such as RAG. LangChain has several high-level modules, including model interfaces, prompts, data connectors, chains, memory, agents, and callbacks. These components are discussed over the next few chapters as they apply to each specific technique.
In the context of RAG, LangChain provides document loaders as part of its data connector modules. These loaders provide libraries for loading data from a variety of input formats into documents. For example, you can use the PyPDFLoader to load and split PDF-formatted documents. In the previous section, the challenge of context window length was discussed, with chunking or splitting data as a strategy to overcome context window limitations. LangChain also provides document transformers that include splitters, allowing you to chunk your documents using simple configurations as shown in the code example below.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

# Load all PDF documents from the designated directory
loader = PyPDFDirectoryLoader("./data/")
documents = loader.load()

# Split the documents into chunks of 1,000 characters with 100 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)
In this case, the code will load pdf documents from the designated location
and then split the documents into chunks of 1000 characters. These chunks
then contain portions of the original PDF document that can be preprocessed
to create vector embeddings using an embedding model and ultimately stored
in a vector store or loaded into a vector database using one of the many third-
party integrations provided through LangChain.
To create and store the vector embeddings in a local vector store, LangChain provides a convenient integration with Facebook AI Similarity Search (FAISS), a library for efficient similarity search and clustering of dense vectors. The FAISS vector store takes the embeddings model as well as the loaded documents as input.
from langchain.vectorstores import FAISS
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

# bedrock_embeddings is an embeddings model instance (e.g. BedrockEmbeddings)
vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)
Then, using the provided index wrapper, a lot of the heavy lifting can be abstracted away, including creating the prompt, getting the embeddings of the query, retrieving the most relevant documents, and calling the LLM. The code example below shows how to get the embedding of the query prompt and then use that embedding to retrieve the most relevant documents from the vector store using a similarity search.
query = "Is it possible that I get sentenced to jail due to failure in filings?"
The code above returns the relevant documents. If you recall, the relevant document information now needs to be combined with the original prompt to create the augmented prompt that is sent to the LLM to generate a completion. There are multiple ways to do this with different retrievers; one method is to utilize chains.
A chain allows you to create a sequence of calls to different components. Chains are also useful when you want to customize the response output - for example, when you want not only the final completion but also the supporting information or citations.
In the example below, a built-in chain called RetrievalQA is used along with a default prompt template. The chain is set up to retrieve relevant documents from the vector store and also specifies the type of search to perform. In this case, a similarity search is performed and the top 3 most relevant documents are returned.
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# The opening of the template is reconstructed here for completeness;
# adjust the instruction text for your own use case
prompt_template = """Human: Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,   # llm is the LangChain LLM wrapper for your generative model
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

query = "Is it possible that I get sentenced to jail due to failure in filings?"
result = qa({"query": query})
print(result['result'])
In this example, FAISS is used to get the top 3 relevant documents from a local vector store; however, on AWS you could optionally use the k-NN plugin with OpenSearch to retrieve the top n documents. OpenSearch uses approximate nearest neighbor (ANN) algorithms to perform k-NN searches using the same FAISS algorithms you see above or other available engines. This allows you to choose the search algorithm that best meets the needs of your use case across characteristics such as vector dimensionality, post-filtering requirements, additional training requirements, compression, and index and search latency. For example, if you want a search approach that does not require training and supports large use cases, you may consider FAISS's Hierarchical Navigable Small Worlds (HNSW) or the Non-Metric Space Library's (NMSLIB) implementation of HNSW. Both are available in OpenSearch with support for configuring the optimal similarity metric, such as cosine similarity, for your use case, as shown in the sketch below.
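As an illustrative sketch (the endpoint, index name, field name, and dimension below are placeholders), an OpenSearch k-NN index using HNSW and cosine similarity can be created like this:
from opensearchpy import OpenSearch

# Authentication options omitted for brevity
client = OpenSearch(hosts=[{"host": "my-opensearch-domain", "port": 443}], use_ssl=True)

index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,               # must match your embedding model
                "method": {
                    "name": "hnsw",              # Hierarchical Navigable Small Worlds
                    "engine": "nmslib",          # or "faiss"
                    "space_type": "cosinesimil"  # cosine similarity
                }
            }
        }
    }
}

client.indices.create(index="rag-documents", body=index_body)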
Finally, the results are used to create the augmented prompt, call the LLM, and generate the final completion. Creating this sequence of steps, along with the components needed to execute each step and the orchestration of the end-to-end workflow, takes a lot of work. LangChain provides libraries that greatly simplify the implementation of techniques like RAG, and these libraries can be integrated directly into your applications.
This section showed one way to implement RAG through LangChain; however, other implementation patterns exist as well. In addition to RAG, LangChain also provides the framework for implementing other techniques that you'll learn about in later chapters, such as the agent-based architectures covered in Chapter 10.
Summary
This chapter covered RAG as a common framework for augmenting LLMs - specifically, using RAG to mitigate hallucinations and knowledge cut-offs by providing access to external sources of information. A specific implementation for document retrieval was explored, along with the importance of vector stores in RAG architectures, using a detailed example of retrieving information from an external document store. The complexity of implementing these architectures is greatly reduced by frameworks like LangChain, which make it possible to quickly build, deploy, and test LLM-powered applications that implement augmentation techniques like RAG.
Next, you'll explore chain-of-thought reasoning, a technique to improve a model's ability to reason through complex tasks. This is an important mechanism for using an LLM to power your complex applications.
Chapter 9. Chain-of-Thought
Reasoning and Application
Integration
Chain-of-Thought Reasoning
Let's have a look at how humans reason through complex tasks - for example, the following math challenge:
Sarah has 4 golf balls. She buys 2 more packages of golf balls. Each package has 12
golf balls. How many golf balls does she have now?
You would most likely take a step-by-step approach, breaking this task into intermediate reasoning steps: Sarah starts with 4 golf balls; 2 packages of 12 golf balls is 24 more golf balls; 4 + 24 = 28, so she now has 28 golf balls.
By including an example like this in the model prompt, you can teach a model to mimic this behavior to solve similar tasks. This is how chain-of-thought prompting works! Figure 9-1 shows how to use this one-shot inference example in a chain-of-thought prompt to solve a similar task:
You are ordering pizza for a meetup event for 20 people. If each pizza has 8 slices
and you think every attendee eats 2 slices, how many pizzas do you need to order?
Figure 9-1. Chain-of-thought prompting helps LLM reason through complex tasks.
When you send this chain-of-thought prompt to the LLM, you can see the
model generates a completion that breaks the task into multiple steps and
explains how it reasons through the task, similar to the provided one-shot
example. Here, the LLM finds the correct answer of 5.
NOTE
Chain-of-thought (CoT) reasoning should be introduced during the model’s pre-training
phase. If you are using a pre-trained LLM that has not been exposed to CoT, it’s often too
late to introduce CoT during fine-tuning. The FLAN series of models have been exposed
to CoT, so FLAN-T5 is a great choice.
Chain-of-thought reasoning is a powerful concept that’s not just limited to
arithmetic challenges. You can apply CoT to help models reason through
different types of complex tasks or problems. When you start building
generative AI applications, CoT is often a key component that enables the
LLM to understand complex user requests. You can then use this ability to
build applications that can perform tasks and interact with other applications
or systems.
The agent uses the LLM as its reasoning engine. The agent builds the CoT
reasoning prompt and augments it with relevant information to provide
responses back to the user in natural language. The agent is able to figure out
the actions required to automatically process user-requested tasks by
breaking the task into multiple steps, orchestrating a sequence of API calls
and data lookups, and maintaining memory to complete the action for the
user. The set of actions the agent can take depends on the tools you configure.
Tools are functions, or APIs, you give the agent access to.
NOTE
Agent implementations are available in many popular open source libraries, such as
Hugging Face Transformers Agents
[https://huggingface.co/docs/transformers/transformers_agents] or LangChain Agents
[https://docs.langchain.com/docs/components/agents/]. You can also choose from fully managed cloud services, such as Agents for Amazon Bedrock, which is covered in more detail in Chapter 12.
For an agent to be able to perform all of these tasks, it needs to structure the prompts in the correct way. You've seen at the beginning of this chapter how CoT reasoning helps the model reason through complex tasks. But how do you structure prompts so the model decides which actions to take? One popular framework to achieve this is called ReAct [https://arxiv.org/pdf/2210.03629.pdf].
ReAct Framework
ReAct is a prompting strategy that combines CoT reasoning with action
planning. ReAct structures prompts to show the LLM how to reason through a
problem and decide on actions to take that help find a solution. The
structured prompts include a sequence of question-thought-action-
observation examples.
The question is the user-requested task or problem to solve. The thought is a
reasoning step that helps demonstrate to the LLM how to tackle the problem
and identify an action to take. The action is an API that the model can invoke
from an allowed set of APIs. The observation is the result of carrying out the
action. The actions that the LLM is able to choose from are defined by a set
of instructions that are prepended to the example prompt text.
Let’s come back to our travel agent example and assume a user is asking
which hotel is closest to the most popular beach in Hawaii. This question
will take a couple of intermediate steps and actions to find the solution. In the
prompt-prepended instructions, you describe the ReAct prompt structure and
list the allowed actions:
Solve a question answering task with interleaving Thought, Action, Observation
steps.
Thought can reason about the current situation, and Action can be three types:
(1) wikipedia_search[topic], which searches the topic on Wikipedia and returns the
first paragraph if it exists. If not, it will return a similar topic to search.
(2) hotel_database_lookup[request], which performs an API call to the hotel database
to gather hotel information defined in request
(3) Finish[answer], which returns the answer and finishes the task.
Here are some examples.
First, you define the task by telling the model to answer a question using the
discussed ReAct prompt structure. Then, you provide instructions that
explain what “thought” means and list the allowed actions to take.
The first is the wikipedia_search action, which looks for wikipedia entries
related to the specified topic. The second is a hotel_database_lookup
action, which can query the travel companies’ hotel database with a specific
request. The last action is finish, which returns the answer and brings the
task to an end.
It’s important to define a set of allowed actions when you use LLMs to
perform tasks. LLMs can be very creative and otherwise may suggest taking
actions that don’t correspond to anything the application can actually do. You
finish the instructions by providing some ReAct example prompts. Depending
on the LLM you are working with, you may need to include more than one
example and carry out few-shot inference. Figure 9-3 summarizes how you build up the ReAct prompt, together with the actual user request appended.
Figure 9-3. Build up the ReAct prompt with instructions, ReAct examples, and the user request.
Now, let’s see how the model applies the instructions to the user’s request to
find the closest hotel to the most popular beach in Hawaii:
Thought 1: I need to search for the most popular beach in Hawaii and find the
closest hotel for that location.
Action 1: wikipedia_search["most popular beach in Hawaii"]
Observation 1: Waikiki is most famous for Waikiki Beach.
You can see how the thoughts reason through the task and plan two
intermediate steps that help find the answer. The model then needs to decide
on an action to take from a predetermined list.
In this example, the allowed actions are wikipedia_search which performs
a Wikipedia search on a specific topic, hotel_database_lookup which
performs an API call to the travel agencies’ hotel database, and finish
which the model carries out when it has found the answer. The observations
bring the new information retrieved from the actions back into the model’s
prompt context. The model will cycle through as many iterations as needed to
find the answer. The final action is then to finish the cycle and pass the
answer back to the user.
As you can see, ReAct prompting is a powerful strategy to guide LLMs
through reasoning and action planning. Many agent implementations support
ReAct prompting and will automatically build the structured prompts for you.
Your travel agent application is now able to connect to external data sources
to retrieve additional information, reason through tasks, and plan and perform
tasks. But what if one of the tasks is to calculate the sales tax for the travel
booking? Even with CoT, an LLM’s ability to perform arithmetic or other
mathematical operations is limited. The model might be able to correctly
reason through the task, but may get the actual calculation wrong. After all, an
LLM is not really doing math, it’s just predicting the most probable next
token to complete the prompt.
To overcome this limitation, you can connect your model to an application
that’s good at performing calculations, such as a code interpreter. The
“Program-aided Language Models” (PAL) framework
[https://arxiv.org/pdf/2211.10435.pdf] does exactly that.
PAL Framework
PAL uses CoT reasoning to generate programs in the intermediate reasoning
steps that help solve the given problem. These programs are then passed to
an interpreter, for example, a Python interpreter, that runs the code and
returns the result back to the LLM. You can use the PAL framework to connect an LLM to an external code interpreter to perform calculations, as shown in Figure 9-4.
Figure 9-4. PAL connects an LLM to an external code interpreter to perform calculations.
Similar to ReAct, you need to add one or more examples to the prompt that show the model how to format the output. You start each example with a
question followed by a couple of reasoning steps and lines of Python code
that solve the problem. Then, you add the new question to solve to the
prompt. The PAL-formatted prompt now contains your example(s) and the
new problem to solve.
Once you pass this prompt to the LLM, the model follows the example and
generates a completion in the form of a Python script. Next, you send the
script to a Python interpreter that will run the code and return the result.
Figure 10-X shows the complete PAL workflow. You can now append the
result to the prompt and the LLM generates a completion that contains the
correct answer.
Q: Sarah has 4 golf balls. She buys 2 more packages of golf balls. Each package has
12 golf balls. How many golf balls does she have now?
Answer:
# Sarah started with 4 golf balls
golf_balls = 4
# She bought 2 packages of 12 golf balls each
bought_balls = 2 * 12
# The answer is
answer = golf_balls + bought_balls
Q: You are ordering pizza for a meetup event for 20 people. If each pizza has 8
slices and you think every attendee eats 2 slices, how many pizzas do you need to
order?
You then complete the prompt with the new task to solve. The answer should
look similar to this:
Answer:
# There are 20 people
num_people = 20
# Each person eats 2 slices, each pizza has 8 slices
slices_per_person = 2
slices_per_pizza = 8
# The answer is
answer = (num_people * slices_per_person) / slices_per_pizza
Note how the LLM declares the variables based on the text in each reasoning
step and assigns values either directly or as calculations, also based on the
numbers in the reasoning text. The LLM can also pick up and work with
variables it created in an earlier step, as you can see when the model
calculates the final result. The model then correctly generates the required
math operation to calculate the result.
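Here is a minimal sketch of that interpreter step, assuming the model's generated Python is held in a string; in practice you would sandbox this, since you are executing model-generated code.
generated_code = """
# There are 20 people
num_people = 20
# Each person eats 2 slices, each pizza has 8 slices
slices_per_person = 2
slices_per_pizza = 8
# The answer is
answer = (num_people * slices_per_person) / slices_per_pizza
"""

# Execute the generated script in an isolated namespace and read back the result
namespace = {}
exec(generated_code, namespace)
print(namespace["answer"])   # 5.0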
For simple math operations like this, you can likely get the correct answer by just applying CoT reasoning. But for more complex math, such as arithmetic with large numbers, trigonometry, or calculus, PAL is a powerful technique that helps ensure the calculations performed by your LLM-powered application are accurate and reliable.
Similar to ReAct, many agent implementations support PAL and will
automatically build the prompts and orchestrate the communication between
LLM and the code interpreter for you.
The following code example shows how to use ReAct and PAL with
LangChain agents:
import torch
import transformers
from langchain.llms import HuggingFacePipeline
from langchain.agents import initialize_agent, AgentType

model_id = "..."
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_new_tokens=100
)

llm = HuggingFacePipeline(pipeline=pipeline)

# tools: your hotel-lookup, search, and math tools (definitions omitted here)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

agent.run("Which hotel is closest to the most popular beach in Hawaii, and how much is each night with 50% discount?")
The output should look similar to this:
"Waikiki Beach is the most popular beach in Hawaii and the closest hotel is
MyHawaiianDreamHotel and a hotel night with 50% discount is 125 USD."
Your application consumers, whether they are human end users or other
systems accessing your application via its APIs, will engage with this entire
stack. As you can observe, while the model plays a significant role, building
end-to-end generative AI applications involves more than just the model
itself.
Safety and Guardrails
With the growing popularity and next wave of widespread AI adoption
comes the recognition that we must all use it responsibly. This includes
building generative AI applications in a safe, secure, and transparent way.
Throughout design, development, deployment, and operations, you need to
consider a range of factors including accuracy, fairness, appropriate usage,
toxicity, security, safety, and privacy.
One concern of generative AI is related to the potential generation of
offensive, disturbing, or inappropriate content. To minimize the risk of such
toxic outputs, you could carefully curate your training data. If the training
data lacks offensive or biased text, the LLM will likely not generate such
content. However, this approach requires that you proactively identify and
remove any offensive phrases.
A more feasible approach could be to train guardrail models. These models
are designed to identify and filter out undesirable content not only from the
training data but also from input prompts and generated outputs. In practice, it
tends to be more manageable to control the output of a generative model
rather than curating the training data and prompts. This is particularly true
due to the highly diverse and general nature of the tasks for generative AI.
Summary
In this chapter, you’ve explored methods to enhance your model’s
performance during deployment. These methods include using structured
prompts and connecting your model to external data sources and
applications. LLMs can serve as remarkable reasoning engines in
applications, leveraging their “intelligence” to fuel exciting and practical use
cases.
To facilitate this process, orchestration software, like agents, takes charge of
prompt engineering and communication between systems. They enable LLM-
powered applications to perform actions in the real world, making the
applications more versatile and interactive.
Finally, when building generative AI applications, you also need to consider
additional components, from the infrastructure layer up to the application
interfaces, and ensure the technology is used responsibly. In the next chapter,
you’ll explore multimodal models, some of their common use cases, and core
concepts of image-based generative AI.