data, which consisted of two or more model responses ranked by quality. To collect this
data, we took conversations that AI trainers had with the chatbot. We randomly selected
a model-written message, sampled several alternative completions, and had AI trainers
rank them. Using these reward models, we can fine-tune the model with Proximal Policy
Optimization. We performed several iterations of this process.
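To make the ranking step concrete, the sketch below shows one common way a reward model can be trained from such comparisons: for each pair of ranked completions, the model is pushed to assign a higher scalar score to the preferred one (a Bradley-Terry-style pairwise objective). This is an illustrative assumption, not OpenAI's actual implementation; the `RewardModel` class, embedding dimensions, and pooled-embedding inputs are all placeholders.

```python
# Minimal sketch (assumed, not OpenAI's code) of training a reward model from
# ranked comparisons, and of the scalar score later used as the reward in PPO.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a pooled (prompt + response) embedding to a scalar score."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score_head = nn.Linear(embed_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, embed_dim) pooled representations of prompt + response
        return self.score_head(embeddings).squeeze(-1)

def pairwise_ranking_loss(preferred_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    # For each labeler comparison, push the preferred completion's score above
    # the rejected one's (Bradley-Terry-style objective).
    return -torch.nn.functional.logsigmoid(preferred_scores - rejected_scores).mean()

# One training step over a batch of ranked pairs (random tensors stand in for
# real prompt+response embeddings).
reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-5)

preferred = torch.randn(8, 768)  # embeddings of higher-ranked completions
rejected = torch.randn(8, 768)   # embeddings of lower-ranked completions

optimizer.zero_grad()
loss = pairwise_ranking_loss(reward_model(preferred), reward_model(rejected))
loss.backward()
optimizer.step()
```

During the subsequent PPO stage, the trained reward model's scalar output on each sampled response serves as the reward signal that the policy (the chatbot) is optimized against.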
ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in
early 2022. You can learn more about the 3.5 series here. ChatGPT and GPT-3.5 were
trained on an Azure AI supercomputing infrastructure.
Limitations
ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.
Fixing this issue is challenging, as: (1) during RL training, there’s currently no source
of truth; (2) training the model to be more cautious causes it to decline questions that
it can answer correctly; and (3) supervised training misleads the model because the
ideal answer depends on what the model knows, rather than what the human
demonstrator knows.