Demystifying LLMs
Releases
• Pretraining
• Instruction-Tuning
• Evaluation of LLMs
Pretraining
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same
architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e.
experts). For every token, at each layer, a router network selects two experts to process the current
state and combine their outputs. Even though each token only sees two experts, the selected experts
can be different at each timestep. As a result, each token has access to 47B parameters, but only uses
13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it
outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,
Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.
We also provide a model fine-tuned to follow instructions, Mixtral 8x7B-Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat model on human benchmarks.
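The routing described above can be sketched as a top-2 mixture-of-experts feed-forward layer. This is a minimal illustration assuming PyTorch; the dimensions, the SiLU expert blocks, and the class name are simplifications for the sketch, not Mixtral's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    # A router scores 8 expert feed-forward blocks per token, keeps the top 2,
    # and combines their outputs with softmax weights over the chosen pair.
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                                # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)           # 4 token states
print(Top2MoELayer()(tokens).shape)    # torch.Size([4, 512])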
Pretraining
• Why is it hard?
• Time: Datasets are huge, O(1T) tokens
• Preprocessing, Cleaning, Deduplication (dedup sketched below)
• More data might not lead to a better model
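One way to make deduplication tractable at this scale is to hash normalized documents and keep only the first occurrence. A minimal sketch; the normalization rule and function names are illustrative, and production pipelines also use near-duplicate methods such as MinHash:

import hashlib

def normalize(doc: str) -> str:
    # Illustrative normalization: lowercase and collapse whitespace.
    return " ".join(doc.lower().split())

def deduplicate(docs):
    # Keep only the first occurrence of each exact (normalized) document.
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = ["Hello   world", "hello world", "A different document"]
print(deduplicate(corpus))  # the second entry is dropped as an exact duplicate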
Can we use the Pretrained model?
Prompt:
Example:
Input: 17
Output: True
Input: 15
Output: False
Approach:
Response:
def is_prime(x):
    # Trial division up to sqrt(x): any composite number has a factor in this range.
    if x <= 1:
        return False
    for i in range(2, int(x ** 0.5) + 1):
        if x % i == 0:
            return False
    return True
The model knows the answer, but it is not aligned with human preferences.
Stages of LLM Training
1. Pretraining
2. Instruction-Tuning
• Dataset: Instruction-tuning prompt/response pairs (example record sketched below)
• Time: a few hours/days
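An instruction-tuning dataset is essentially a collection of prompt/response pairs. A minimal sketch of one record; the field names and the JSONL layout are illustrative, not any specific dataset's schema:

import json

# One illustrative instruction-tuning record: a prompt and the desired response.
record = {
    "prompt": "Write a python function to find whether the input number is prime.",
    "response": (
        "def is_prime(x):\n"
        "    if x <= 1:\n"
        "        return False\n"
        "    for i in range(2, int(x ** 0.5) + 1):\n"
        "        if x % i == 0:\n"
        "            return False\n"
        "    return True"
    ),
}

# Such records are commonly stored one JSON object per line (JSONL).
print(json.dumps(record))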
Prompt: [INST] Write a python function to find whether the input number is prime. [/INST]
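The [INST] ... [/INST] markers above are the Mistral/Mixtral instruct format. A minimal sketch of building such a prompt by hand; the helper name is illustrative, and in practice a tokenizer chat template (e.g. Hugging Face's apply_chat_template) performs this formatting:

def build_inst_prompt(instruction: str) -> str:
    # Mistral/Mixtral instruct format: the user turn is wrapped in [INST] ... [/INST]
    # and the model generates its answer after the closing tag.
    return f"<s>[INST] {instruction} [/INST]"

print(build_inst_prompt(
    "Write a python function to find whether the input number is prime."
))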
3-shot:
## How old is Barack Obama in 2014?
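A k-shot prompt like the 3-shot example above prepends k solved question/answer pairs before the real question so the pretrained model continues the pattern. A minimal sketch; the example questions and answers below are illustrative, not the slide's actual shots:

# Minimal sketch of a 3-shot prompt in the "## <question>" style shown above.
shots = [
    ("How old is Barack Obama in 2014?",
     "Barack Obama was born in August 1961, so he turned 53 in 2014."),
    ("What is the capital of France?",
     "The capital of France is Paris."),
    ("How many legs does a spider have?",
     "A spider has 8 legs."),
]

def build_3shot_prompt(question: str) -> str:
    parts = [f"## {q}\n{a}\n" for q, a in shots]
    parts.append(f"## {question}\n")  # the model completes the final answer
    return "\n".join(parts)

print(build_3shot_prompt("How tall is the Eiffel Tower?"))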
Evaluation of Instruction-tuned models
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
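The Chatbot Arena leaderboard linked above ranks instruction-tuned models from pairwise human votes using Elo-style ratings. A minimal sketch of one rating update; the K-factor, initial ratings, and model names are illustrative, and the actual leaderboard uses its own fitting procedure:

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    # score_a is 1.0 if model A wins the pairwise vote, 0.0 if it loses, 0.5 for a tie.
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# One human vote where model_a's answer is preferred:
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)  # model_a gains rating, model_b loses the same amount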