ML Algorithms
The Transformer is a deep learning architecture used primarily for Natural Language Processing (NLP) applications. It was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al.
Faster and More Efficient: Transformers process every token in a sequence at once rather than one at a time, so they can be trained more quickly. This parallelism is what makes training on massive datasets practical.
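To make the parallelism concrete, here is a minimal sketch of the scaled dot-product attention at the heart of the Transformer, following the formula from Vaswani et al.; the tensor shapes and variable names are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (batch, seq_len, d_k) tensors. Every position attends to
    every other position in one batched matrix multiply -- there is no
    token-by-token loop, which is what makes training parallelizable.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ v                              # weighted sum of values

# Toy usage: a batch of 2 sequences, 5 tokens each, 16-dim heads.
q = torch.randn(2, 5, 16)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)  # shape (2, 5, 16)
```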
Text Generation: Transformers power AI writing and content tools (such as Grammarly and Jasper AI) that produce or improve written text.
The simplest way to train a language model is to have it predict a word in a sequence of words. This is most commonly framed as either masked language modeling (predict a hidden word from its surroundings) or next-token prediction (predict the next word given everything before it); a sketch of the latter follows.
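As an illustration, here is a minimal sketch of one next-token-prediction training step in PyTorch; the tiny embedding-plus-linear model and the random token ids are hypothetical stand-ins for a real Transformer and a real corpus.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
# Hypothetical stand-in for a Transformer: embedding + linear head.
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 8))    # one sequence of 8 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position

logits = head(embed(inputs))                     # (1, 7, vocab_size)
# Cross-entropy between the predicted distribution and the actual next token.
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # gradients flow back through the model's parameters
```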
Step 1: Supervised Fine-Tuning
The model learns from labeled datasets, meaning it is given input-output pairs
(examples of questions and their correct responses). This process is guided by human
annotators who provide correct answers.
● Technical Explanation:
○ The model compares its generated response with the labeled output, calculates the loss, and then adjusts its parameters to improve.
● Technical Example:
○ Given the input “What is the capital of France?” with labeled output “Paris,” the model’s predicted answer is scored against “Paris,” and the resulting loss drives the parameter update (a code sketch follows).
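A minimal sketch of this supervised fine-tuning step, assuming a Hugging Face-style causal language model; "gpt2" and the question-answer pair are placeholder assumptions, not a prescribed setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is used purely as a small, publicly available placeholder model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One labeled input-output pair (a hypothetical example).
text = "Q: What is the capital of France?\nA: Paris"
ids = tokenizer(text, return_tensors="pt").input_ids

# For causal LMs, passing labels=input_ids makes the library compute the
# next-token cross-entropy loss internally (in practice the prompt tokens
# are often masked out so only the answer contributes to the loss).
loss = model(input_ids=ids, labels=ids).loss

loss.backward()   # compute gradients of the loss
optimizer.step()  # adjust parameters to reduce the loss
```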
Step 2: Reward Model
The next step involves creating a Reward Model that evaluates how good or bad the model’s outputs are. Human feedback is collected and used to train a separate model that predicts a reward score for each output.
● Technical Explanation:
○ After fine-tuning, the model’s responses are evaluated by human
annotators, who rank responses from best to worst based on criteria like
relevance, helpfulness, and clarity.
○ A reward model is then trained to predict the quality of responses. The
human evaluations are used as labels, and the reward model learns to
assign a reward score (like a numerical value) to each response the base
model generates.
○ Mean Squared Error (MSE) or a pairwise ranking loss can be used to train the reward model, minimizing the difference between predicted rewards and the human evaluations (a ranking-loss sketch follows this section).
● Technical Example:
○ The AI generates three different responses to the input “Explain the theory
of relativity.”
○ Humans rank the responses based on how well they explain the concept:
■ Response A: “Relativity is a theory about space and time.” (Rank: 3)
■ Response B: “Einstein’s theory of relativity describes how objects move through space and time and how gravity affects that motion.” (Rank: 1)
■ Response C: “Relativity talks about the speed of light and black holes.” (Rank: 2)
● The reward model learns from these rankings and assigns higher reward scores
to responses like B that are more accurate and clear.
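To ground the ranking idea, here is a minimal sketch of a pairwise ranking loss for a reward model, i.e. the -log σ(r_chosen − r_rejected) form commonly used in RLHF work (what the text above calls a ranking loss); the tiny scoring head and random embeddings are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 32
# Hypothetical stand-in for a reward model: maps a pooled response
# embedding to a single scalar reward score.
reward_head = nn.Linear(d_model, 1)

# Pretend embeddings for a higher-ranked and a lower-ranked response
# to the same prompt, e.g. Response B vs. Response A above.
chosen = torch.randn(1, d_model)
rejected = torch.randn(1, d_model)

r_chosen = reward_head(chosen)      # predicted reward for the better answer
r_rejected = reward_head(rejected)  # predicted reward for the worse answer

# Pairwise ranking loss: push r_chosen above r_rejected.
# loss = -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()  # train the reward head to respect the human ranking
```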
Step 3: Reinforcement Learning
In this step, the model is fine-tuned again using Reinforcement Learning (RL), with the reward model providing feedback on the generated responses. The goal is for the model to maximize the rewards over time.
● Technical Explanation:
○ The model is now trained using Proximal Policy Optimization (PPO), a
popular reinforcement learning algorithm. The model generates responses
(called actions), and the reward model gives feedback (reward scores)
based on how good or bad the response is.
○ The model adjusts its responses (or policy) to maximize the cumulative
reward by trying different strategies and learning which ones work best.
○ This is where exploration vs. exploitation comes in. The model explores
different ways to answer questions, then exploits the patterns that receive
the highest rewards.
○ The policy is updated to generate responses that are not only accurate but also helpful, polite, and informative, guided by the feedback from the reward model (see the PPO sketch after the example below).
● Technical Example:
○ Input: “What’s the difference between machine learning and deep
learning?”
○ Initial Response: “Machine learning is about algorithms. Deep learning
uses neural networks.” (Reward Score: 0.3)
○ The model is penalized (low reward) because this explanation is too vague.
○ After many rounds of RL, the model learns to provide a more
comprehensive answer:
■ Improved Response: “Machine learning is a broad field where
computers learn from data. Deep learning is a subset of machine
learning that uses neural networks to mimic how the human brain
processes data.” (Reward Score: 0.9)
● The model now generates better responses that earn higher reward scores.
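As a final illustration, here is a minimal sketch of PPO’s clipped surrogate objective, which keeps each policy update close to the policy that generated the responses; the token-level log-probabilities and advantage values are made-up placeholders, not outputs of a real training run.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate loss for one batch of actions (tokens).

    logp_new:  log-probs of the actions under the current policy
    logp_old:  log-probs under the policy that generated the responses
    advantage: how much better than expected each action was, derived
               from the reward model's scores
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Take the pessimistic (minimum) objective, negated for minimization;
    # clipping discourages updates that move the policy too far at once.
    return -torch.min(unclipped, clipped).mean()

# Made-up numbers: 4 generated tokens, their old/new log-probs, and
# advantages computed from the reward model's feedback.
logp_old = torch.tensor([-1.2, -0.8, -2.0, -1.5])
logp_new = torch.tensor([-1.0, -0.9, -1.8, -1.6], requires_grad=True)
advantage = torch.tensor([0.5, -0.2, 0.9, 0.1])

loss = ppo_clip_loss(logp_new, logp_old, advantage)
loss.backward()  # gradient step nudges the policy toward higher reward
```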