Demystifying LLMs

Devendra Singh Chaplot


Mistral AI

Feb 13, 2024


Mistral AI
Co-Founders

• Arthur Mensch (CEO): former AI researcher at DeepMind, Polytechnique alum
• Timothée Lacroix (CTO): former AI researcher at Meta, ENS alum
• Guillaume Lample (Chief Scientist): former AI researcher at Meta, Polytechnique alum

Releases

$500M+ funding, Offices in Paris/London/SF Bay Area
Mistral AI LLMs
Contents
• Stages of LLM Training:

• Pretraining

• Instruction-Tuning

• Learning from Human Preferences: DPO/RLHF

• Evaluation of LLMs

• Retrieval Augmented Generation (RAG)

• Recipe for RAG with code


Stages of LLM Training
1. Pretraining

2. Instruction-Tuning

3. Learning from Human Feedback


Pretraining

Example pretraining text (Mixtral 8x7B paper abstract):

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model …

[Diagram: next-token prediction]
Input: “We introduce Mixtral 8x7B , a Sparse” → Large Language Model, O(1-100B) parameters → Output: “introduce Mixtral 8x7B , a Sparse Mixture”
Pretraining
• Task: Next token prediction (see the sketch below)
• 1 token ~= 0.75 word
• Vocab size: O(10K) tokens
• Each token is represented by an integer

[Diagram] Input: “We introduce Mixtral” → Large Language Model (LLM) → Output: “introduce Mixtral 8x7B”
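To make the objective concrete, here is a minimal sketch of next-token prediction with a cross-entropy loss, assuming a toy embedding-plus-linear "model" and made-up token ids; it is not Mistral's architecture or tokenizer.

# Minimal sketch of the pretraining objective: predict token t+1 from tokens <= t.
# The "model" here is a toy embedding + linear layer, NOT a real transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 32_000, 64                      # assumed sizes, O(10K) vocab as on the slide
token_ids = torch.tensor([[5, 812, 94, 2077, 13]])    # "We introduce Mixtral 8x7B ..." as made-up ids

embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

inputs = token_ids[:, :-1]                            # "We introduce Mixtral 8x7B"
targets = token_ids[:, 1:]                            # "introduce Mixtral 8x7B ..."

logits = lm_head(embed(inputs))                       # (batch, seq, vocab): a score for every next token
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                       # gradients for one optimization step
print(loss.item())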
Pretraining
• Why is it hard?
• Time: Datasets are huge, O(1T) tokens
  • Preprocessing, Cleaning, Deduplication (a minimal deduplication sketch follows this list)
  • More data might not lead to a better model
• Money: O(1-100B) parameters
  • O(1-10K) GPUs for weeks or months
  • O($10-100M) per model
• YOLO: decide model architecture, hyper-parameters, and data mixture for the “big run”
  • Best hyper-parameters for a smaller model might not be the best for a larger model

[Figures: Llama pretraining data mixture; Llama model sizes, architectures, and optimization hyper-parameters]
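Since the slide calls out preprocessing and deduplication, here is a minimal sketch of exact, hash-based document deduplication, assuming documents are plain strings; production pipelines also use fuzzy near-duplicate detection (e.g. MinHash), which is not shown.

# Minimal sketch: exact document deduplication by normalized-content hash.
# Real pretraining pipelines add fuzzy/near-duplicate detection (e.g. MinHash), not shown here.
import hashlib

def dedup(documents: list[str]) -> list[str]:
    seen: set[str] = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace and case so trivially different copies collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

docs = ["We introduce Mixtral 8x7B.", "we introduce   Mixtral 8x7B.", "Mistral 7B is a 7B model."]
print(len(dedup(docs)))  # 2: the first two documents are duplicates after normalization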
Can we use the Pretrained model?
Prompt:

Write a python function to find whether the input number is prime.

Response:

Example:

Input: 17
Output: True

Input: 15
Output: False

Approach:

1. Let’s assume that the input number is n
2. Check if n is divisible by 1. If n is divisible by 1 then it is not a prime number.
3. Check if n is divisible by 2. If n is divisible by 2 then it is not a prime number.
Can we use the Pretrained model?
Prompt:

def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """

Response:

    if x <= 1:
        return False
    for i in range(2, int(x ** 0.5) + 1):
        if x % i == 0:
            return False
    return True
The model knows the answer, but it is not aligned with human preferences.
Stages of LLM Training
1. Pretraining

2. Instruction-Tuning

3. Learning from Human Feedback


Instruction Tuning
Prompt: [INST] Write a python function to find whether the input number is prime. [/INST]

Response:

def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """
    if x <= 1:
        return False
    for i in range(2, int(x ** 0.5) + 1):
        if x % i == 0:
            return False
    return True

[Diagram] Input: “[INST] Write … [/INST] def is_prime (x)” → Large Language Model, O(1-100B) parameters → Output: “def is_prime (x) :”
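The [INST] … [/INST] wrapping shown above can be produced by a small helper. This is a simplified sketch of that instruction format; real chat templates also add BOS/EOS special tokens and handle multi-turn history, which are omitted here.

# Simplified sketch of wrapping a user instruction in the [INST] ... [/INST] format
# shown on the slide. Real templates also add BOS/EOS special tokens and handle
# multi-turn conversations; those details are omitted here.
def format_instruction(prompt: str, response: str | None = None) -> str:
    text = f"[INST] {prompt} [/INST]"
    if response is not None:
        text += f" {response}"          # during training, the target response follows the tags
    return text

print(format_instruction("Write a python function to find whether the input number is prime."))
# [INST] Write a python function to find whether the input number is prime. [/INST]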
Instruction Fine-tuning
• Dataset:
  • Paired: (Prompt, Response)
  • O(10-100K instructions)
• Task:
  • Next word prediction (Masked): loss is computed only on the response tokens (see the sketch after this list)
• Compute:
  • O(1-100) GPUs
  • Few hrs/days

[Diagram] Input: “[INST] … [/INST] def” → Large Language Model (LLM) → Output: “def is_prime”
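Below is a minimal sketch of the masked next-word objective, assuming the common convention that label positions set to -100 are ignored by the cross-entropy loss; the token ids and the tiny model are placeholders, not a real tokenizer or LLM.

# Sketch: instruction tuning is still next-token prediction, but prompt tokens are
# masked out of the loss so the model is only trained to produce the response.
# The token ids and the tiny model are placeholders, not a real tokenizer/LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 32_000, 64
prompt_ids = torch.tensor([3, 17, 291, 88, 4])       # "[INST] Write ... [/INST]" (made-up ids)
response_ids = torch.tensor([1401, 93, 7, 2048])     # "def is_prime(x): ..." (made-up ids)

ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)
labels = ids.clone()
labels[:, :len(prompt_ids)] = -100                   # -100 = ignored by cross_entropy

embed, lm_head = nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size)
logits = lm_head(embed(ids[:, :-1]))                 # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                       labels[:, 1:].reshape(-1),    # shift labels to align with predictions
                       ignore_index=-100)
print(loss.item())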
Stages of LLM Training
1. Pretraining

2. Instruction-Tuning

3. Learning from Human Feedback


Human Preferences
Human preferences are cheaper/easier to collect than human annotations.

Prompt: [INST] Write a python function to find whether the input number is prime. [/INST]

Response 1:

def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """
    if x <= 1:
        return False
    for i in range(2, int(x ** 0.5) + 1):
        if x % i == 0:
            return False
    return True

Response 2:

def is_prime(x: int):
    """
    takes as input an integer x. Returns True if x is prime and False otherwise
    """
    if x <= 1:
        return False
    for i in range(2, x):
        if x % i == 0:
            return False
    return True

Response 1 > Response 2
Reinforcement Learning
from Human Feedback (RLHF)

[Deep Reinforcement Learning from Human Preferences. Christiano et al. 2017]


Direct Preference Optimization (DPO)

[Deep Reinforcement Learning from Human Preferences. Christiano et al. 2017]


[Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafailov et al. 2023]
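For concreteness, here is a minimal sketch of the DPO loss from Rafailov et al. 2023, assuming you already have the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; computing those log-probabilities from an actual LLM is not shown.

# Sketch of the DPO loss (Rafailov et al. 2023):
#   L = -log sigmoid( beta * [ (log pi(y_w|x) - log pi_ref(y_w|x))
#                            - (log pi(y_l|x) - log pi_ref(y_l|x)) ] )
# Inputs are per-example summed log-probs of the preferred (y_w) and rejected (y_l)
# responses; obtaining them from an actual LLM is omitted here.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    chosen_reward = policy_logp_chosen - ref_logp_chosen          # implicit reward of y_w
    rejected_reward = policy_logp_rejected - ref_logp_rejected    # implicit reward of y_l
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

# Toy numbers: the policy already slightly prefers the chosen response.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())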
Stages of LLM Training

Pretraining
  Dataset: Raw text, few trillions of tokens
  Task: Next word prediction
  Compute: O(1-10K) GPUs, weeks/months of training

Instruction-Tuning
  Dataset: Paired (Prompt, Response), O(10-100K instructions)
  Task: Next word prediction (Masked)
  Compute: O(1-100) GPUs, few hrs/days

Learning from Human Feedback
  Dataset: Human Preference Data, O(10-100K)
  Task: RLHF/DPO
  Compute: O(1-100) GPUs, few hrs/days
Evaluation of LLMs
Evaluation of pretrained models
0-shot:

def is_prime(x: int):
    """
    takes as input an integer x.
    Returns True if x is prime and False otherwise
    """

3-shot:

## How old is Barack Obama in 2014?

Barack Obama is 57 years old in 2014.

## What is Barack Obama’s birthday?

Barack Obama was born on August 4, 1961.

## What is the name of Barack Obama’s wife?

Barack Obama’s wife is Michelle Obama.

## How tall is Barack Obama?
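A k-shot prompt like the one above is just solved examples concatenated before the real question. Below is a minimal sketch of assembling it; the complete() call it would be sent to is a hypothetical stand-in for a raw text-completion request to the base model.

# Sketch: build a k-shot prompt for a base (pretrained, non-instruct) model by
# concatenating solved examples before the actual question.
# `complete(prompt)` is a hypothetical stand-in for a raw text-completion call.
def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    shots = "\n\n".join(f"## {q}\n\n{a}" for q, a in examples)
    return f"{shots}\n\n## {question}\n\n"

examples = [
    ("How old is Barack Obama in 2014?", "Barack Obama is 57 years old in 2014."),
    ("What is Barack Obama's birthday?", "Barack Obama was born on August 4, 1961."),
    ("What is the name of Barack Obama's wife?", "Barack Obama's wife is Michelle Obama."),
]
prompt = build_few_shot_prompt(examples, "How tall is Barack Obama?")
# answer = complete(prompt)   # hypothetical model call; the base model continues the pattern
print(prompt)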


Evaluation of Instruction-tuned models

LMSYS Chatbot Arena Leaderboard


https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
Evaluation of Instruction-tuned models
• Proxies for human evaluation:
  • MT Bench:
    • Ask GPT-4 to score responses (a simplified sketch follows this list)
    • 0.90 correlation with human preferences
  • Alpaca Eval:
    • Compare win-rate against GPT-4 (v2)
    • 0.84 correlation with human preferences
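To illustrate the LLM-as-a-judge idea behind MT Bench, here is a simplified sketch of scoring one response; the judge() function is a hypothetical stand-in for a call to a strong judge model such as GPT-4, and the prompt wording is an assumption, not the official MT-Bench template.

# Simplified sketch of LLM-as-a-judge scoring (the idea behind MT-Bench style evals).
# `judge(prompt)` is a hypothetical stand-in for a call to a strong judge model;
# the judge prompt below is illustrative, not the official MT-Bench template.
import re

def judge(prompt: str) -> str:
    # Hypothetical judge-model call (e.g. GPT-4 via its API).
    # Here it just returns a canned verdict so the sketch runs end-to-end.
    return "Rating: 8"

def score_response(question: str, answer: str) -> int:
    judge_prompt = (
        "Rate the following answer to the question on a scale of 1 to 10.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with 'Rating: <number>'."
    )
    verdict = judge(judge_prompt)
    match = re.search(r"Rating:\s*(\d+)", verdict)
    return int(match.group(1)) if match else 0        # fall back to 0 if unparseable

print(score_response("Write a python function to find whether the input number is prime.",
                     "def is_prime(x): ..."))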
Practical tips
• Proprietary vs Open-Source
  • For proprietary models:
    • Prompt Engineering: Few-shot prompting, Chain-of-thought
    • Retrieval Augmented Generation (RAG)
  • For open-source models:
    • Everything above
    • Task-specific fine-tuning and DPO: needs data and a bit of compute
• Balance performance vs cost (training and inference)
  • Proprietary models have higher general-purpose performance
  • Open-source models can beat proprietary models on specific tasks with fine-tuning
  • Proprietary models typically have higher inference cost

[Table: Price per M tokens across the open-source and proprietary models shown: 0.42€, 1.8€, 7.5€]
Retrieval Augmented Generation (RAG)

When do we need Retrieval Augmented Generation (RAG)?
• The LLM doesn’t know everything and sometimes requires task-specific knowledge
• Sometimes you want LLMs to answer queries based on some data source to reduce hallucinations
• The knowledge resource doesn’t fit in the context window of the LLM

[Figure from https://lemaoliu.github.io/retrieval-generation-tutorial/]
Recipe for RAG

[Figure from https://gradientflow.substack.com/p/best-practices-in-retrieval-augmented]
Basic RAG code
https://docs.mistral.ai/guides/basic-RAG/
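The linked Mistral guide walks through the full example; the snippet below is only a minimal sketch of the same recipe (chunk, embed, retrieve by similarity, generate), where embed() and generate() are hypothetical placeholders rather than actual Mistral client calls.

# Minimal RAG sketch: chunk a document, embed the chunks, retrieve the most similar
# chunks for a question, and stuff them into the prompt of a chat model.
# `embed()` and `generate()` are hypothetical placeholders, NOT real Mistral API calls;
# see https://docs.mistral.ai/guides/basic-RAG/ for the actual client usage.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: return one vector per text. A real system would call an
    # embedding model here.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

def generate(prompt: str) -> str:
    # Placeholder for a chat-completion call to an instruction-tuned LLM.
    return f"(model answer based on a prompt of {len(prompt)} characters)"

def rag_answer(document: str, question: str, chunk_size: int = 512, top_k: int = 2) -> str:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    chunk_vecs = embed(chunks)
    query_vec = embed([question])[0]
    # Cosine similarity between the question and every chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    context = "\n---\n".join(chunks[i] for i in np.argsort(sims)[::-1][:top_k])
    prompt = (f"Context:\n{context}\n\n"
              f"Answer the question using only the context above.\nQuestion: {question}")
    return generate(prompt)

print(rag_answer("Mixtral 8x7B is a Sparse Mixture of Experts model. " * 50,
                 "What kind of model is Mixtral 8x7B?"))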
