ChatGPT: Optimizing Language Models for Dialogue

To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
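To make the comparison step concrete, here is a minimal sketch, in PyTorch, of the pairwise ranking loss typically used to fit a reward model to this kind of ranked data. The RewardModel class, its pooling, and the dummy tensors below are illustrative assumptions rather than OpenAI's implementation; the key idea is that for every pair in which trainers preferred one completion over another, the loss pushes the preferred completion's scalar reward above the rejected one's.

```python
# Illustrative sketch only: a toy reward model trained on ranked comparisons.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in for a reward model: maps token ids to a scalar score."""
    def __init__(self, vocab_size=50257, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)  # crude mean pooling over the sequence
        return self.head(pooled).squeeze(-1)        # one scalar reward per sequence

def comparison_loss(reward_model, preferred_ids, rejected_ids):
    """Bradley-Terry style pairwise loss: -log sigmoid(r_preferred - r_rejected)."""
    r_pref = reward_model(preferred_ids)
    r_rej = reward_model(rejected_ids)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Dummy usage: a batch of 4 ranked pairs of completions for the same prompts.
model = RewardModel()
preferred = torch.randint(0, 50257, (4, 32))  # completions trainers ranked higher
rejected = torch.randint(0, 50257, (4, 32))   # completions trainers ranked lower
loss = comparison_loss(model, preferred, rejected)
loss.backward()  # an optimizer step would follow in a real training loop
```

A ranking over several completions decomposes into pairwise comparisons like these, and the fitted reward model then supplies the scalar reward that Proximal Policy Optimization maximizes when fine-tuning the dialogue model.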

ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022. You can learn more about the GPT-3.5 series here. ChatGPT and GPT-3.5 were trained on an Azure AI supercomputing infrastructure.

Limitations
ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.
Fixing this issue is challenging, as: (1) during RL training, there’s currently no source
of truth; (2) training the model to be more cautious causes it to decline questions that
it can answer correctly; and (3) supervised training misleads the model because the
ideal answer depends on what the model knows, rather than what the human
demonstrator knows.
