Primers • Mixture of Experts


• Overview
• Mixture-of-Experts: the Classic Approach
◦ Intuition
▪ Gate Functionality
◦ Hands-On Exercise: How Does an MoE Model Work?
▪ Key Benefits
• Mixture of Attention Heads (MoA)
• SwitchHead
• Mixture of Depths (MoD)
• The Deep Learning Way
• Expert Choice Routing
• Implications and Outlooks
• The “How” Behind MoE
• What’s Next?
• Related Papers
◦ Outrageously Large Neural Networks: the Sparsely-Gated Mixture-of-Experts Layer
◦ Scaling Vision with Sparse Mixture of Experts
◦ Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

◦ Mixture-of-Experts Meets Instruction Tuning: a Winning Combination for Large Language Models
◦ From Sparse to Soft Mixtures of Experts
◦ Switch Transformers
◦ QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
◦ MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
• MoE Models
◦ Mixtral
◦ GPT-4
◦ Mixtral: Mistral’s 8x7B MoE Model
▪ Results
◦ OpenMoE
• Further Reading
• Citation

Overview
• Artificial neural networks have emerged as the cornerstone of deep learning, offering a remarkable way of drawing valuable insights from a plethora of data. However, the efficacy of these neural networks hinges heavily on their parameter count. Mixture-of-Experts (MoE) presents an efficient approach to dramatically increasing a model's capabilities without introducing a proportional amount of computational overhead.
• Originally proposed in 1991 by Robert A. Jacobs et al., MoE adopts a conditional computation
paradigm by only selecting parts of an ensemble, referred to as experts, and activating them
depending on the data at hand. The MoE structure appeared long before the popularization of deep
learning.
• The infographic below (source) outlines the significant milestones in the development of Sparse Mixtures of Experts (MoE) technology, which has been instrumental in the advancements of machine learning and particularly in the scaling of large language models (LLMs) like OpenAI's GPT-4 and Google's Switch Transformer.


◦ Starting with the first milestone in 1991, we have "Mixture of Experts" by Jacobs et al., which introduced the foundational concepts of "gating" and "experts". This approach models predictions as a weighted sum of experts' opinions, with the weights determined by a gating function.
◦ Moving to 2017, top-k routing was introduced. It streamlined the process by routing each input only to the k most suitable experts, reducing computational costs. Additionally, the paper introduced load-balancing losses to enhance training efficiency.
◦ In 2022, the "Switch Transformer" by Fedus et al. pushed top-k routing further by selecting only the most relevant expert for each token, streamlining the architecture of transformer models significantly and allowing them to scale up to unprecedented levels.
◦ Finally, also in 2022, “Dropless MoE” by Gale et al. reformulated sparse MoE as a block-sparse
matrix multiplication, which allowed scaling up transformer models without the load balancing
losses or capacity limitations seen in previous works. This led to one of the fastest sparse MoE
implementations in the industry, referred to as MegaBlocks.
◦ The figure emphasizes that these innovations have been part of a journey that has spanned over three decades, suggesting that the field has a robust foundation of research and development. It indicates optimism for future innovations that will continue to make sparse MoE more efficient, paving the way for even larger and more precise machine learning models across various domains.


Mixture-of-Experts: the Classic Approach


• The MoE concept is a type of ensemble learning technique initially developed within the field of artificial neural networks. It introduces the idea of training experts on specific subtasks of a complex predictive modeling problem.
• In a typical ensemble scenario, all models are trained on the same dataset, and their outputs are combined through simple averaging, weighted mean, or majority voting. However, in Mixture-of-Experts (MoE), each "expert" model within the ensemble is only trained on a subset of data where it can achieve optimal performance, thus narrowing the model's focus. Put simply, MoE is an architecture that divides input data into multiple sub-tasks and trains a group of experts to specialize in each sub-task. These experts can be thought of as smaller, specialized models that are better at solving their respective sub-tasks.
• The popularity of MoE only rose recently, as Large Language Models (LLMs) and transformer-based models in general swept through the machine learning field. This is largely because of modern datasets' increased complexity and size: each dataset contains different regimes with vastly different relationships between the features and the labels.
• To appreciate the essence of MoE, it is crucial to understand its architectural elements:
1. Division of dataset into local subsets: First, the predictive modeling problem is divided into
subtasks. This division often requires domain knowledge or employs an unsupervised clustering
algorithm. It’s important to clarify that clustering is not based on the feature vectors’ similarities.
Instead, it's based on similarities in the relationships between the features and the labels.
2. Expert Models: These are the specialized neural network layers or experts that are trained to
excel at specific sub-tasks. Each expert receives the same input pattern and processes it
according to its specialization. Put simply, an expert is trained for each subset of the data.
Typically, the experts themselves can be any model, from Support Vector Machines (SVM) to
neural networks. Each expert model receives the same input pattern and makes a prediction.
3. Gating Model (Router): The gating model/network, also called the MoE layer or router, is responsible for selecting which experts to use for each input. It works by estimating the compatibility between the input data and each expert, and then outputs a softmax distribution over the experts. This distribution is used as the weights to combine the outputs of the expert layers (see the sketch at the end of this section). Put simply, this model helps interpret predictions made by each expert and decide which expert to trust for a given input.
4. Pooling Method: Finally, an aggregation mechanism is needed to make a prediction based on
the output from the gating network and the experts.
• The gating network and expert layers are jointly trained to minimize the overall loss function of the MoE model. The gating network learns to route each input to the most relevant expert layer(s), while the expert layers specialize in their assigned sub-tasks.
• This divide-and-conquer approach effectively delegates complex tasks to experts, enabling efficient processing and improved accuracy. Together, these components ensure that the right expert handles the right task. The gating network effectively routes each input to the most appropriate expert(s), while the experts focus on their specific areas of strength. This collaborative approach leads to a more versatile and capable overall model.

Put simply, a Mixture of Experts (MoE) is how an ensemble of AI models decides as one. It's basically multiple "experts", i.e., individual models, in a "trench coat".
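Below is a minimal sketch of the classic (dense) MoE layer described above: a linear gate produces a softmax distribution over a few small MLP experts, and the final prediction is the gate-weighted sum of every expert's output. The class, dimensions, and expert architecture are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of a classic (dense) mixture-of-experts layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassicMoE(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_experts: int = 4):
        super().__init__()
        # Expert models: small MLPs, all receiving the same input pattern.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))
             for _ in range(num_experts)]
        )
        # Gating model (router): scores each expert for a given input.
        self.gate = nn.Linear(d_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.gate(x), dim=-1)                    # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, num_experts, d_out)
        # Pooling: weight each expert's prediction by its gate value and sum.
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)

moe = ClassicMoE(d_in=16, d_out=2)
print(moe(torch.randn(8, 16)).shape)  # torch.Size([8, 2])
```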

Intuition
• This section seeks to answer: how exactly do the experts specialize, and in what? Also, how exactly does gating work, and what does it do under the hood?
• Recent research has started to give us some insights. Here's a neat visualization from the paper "Towards Understanding the Mixture-of-Experts Layer in Deep Learning" by Chen et al. (2022), which shows how a 4-expert MoE model learns to solve a binary classification problem on a toy dataset that's segmented into 4 clusters.
• Initially, the experts (shown as different colors) are all over the place, but as training proceeds, different experts "specialize" in different clusters until there's almost a 1:1 correspondence. That specialization is entirely random and driven only by the small initial random perturbations. Meanwhile, the gate is learning to (1) cluster the data and (2) map experts to clusters.


• Another important take-away from this toy experiment is that non-linearity appears to be the key to
the success of MoE. Experts with linear activation simply don’t work as well as those with non-linear
(cubic in this work) activation.

Gate Functionality

• "Gate functionality" refers to two distinct but interconnected functions of the "gate" in a Mixture of Experts (MoE) model:
1. Clustering the Data: In the context of an MoE model, clustering the data means that the gate is
learning to identify and group together similar data points. This is not clustering in the traditional
unsupervised learning sense, where the algorithm discovers clusters without any external labels.
Instead, the gate is using the training process to recognize patterns or features in the data that
suggest which data points are similar to each other and should be treated similarly. This is a
crucial step because it determines how the data is organized and interpreted by the model.
2. Mapping Experts to Clusters: Once the gate has identified clusters within the data, its next role is to assign or map each cluster to the most appropriate expert within the MoE model. Each expert in the model is specialized to handle different types of data or different aspects of the problem.
The gate’s function here is to direct each data point (or each group of similar data points) to the
expert that is best suited to process it. This mapping is dynamic and is based on the strengths
and specialties of each expert as they evolve during the training process.
• In summary, the gate in an MoE model is responsible for organizing the incoming data into meaningful groups (clustering) and then efficiently allocating these groups to the most relevant expert models within the MoE system for further processing. This dual role of the gate is critical for the overall performance and efficiency of the MoE model, enabling it to handle complex tasks by leveraging the specialized skills of its various expert components.

Hands-On Exercise: How Does an MoE Model Work?


• Credits to Tom Yeh for this exercise.
• Let's calculate an MoE model by hand, with the following config: Experts: 2, Tokens: 2, Sparse.
• Step-by-Step Walkthrough:
1. The MoE block receives two tokens (blue, orange).
2. The gate network processes X1 (blue) and determines that Expert 2 should be activated.
3. Expert 2 processes X1 (blue).
4. The gate network processes X2 (orange) and determines that Expert 1 should be activated.
5. Expert 1 processes X2 (orange).
6. A ReLU activation function processes the outputs of the experts and produces the final output.
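The same walkthrough can be written as a short, hedged code sketch: 2 tokens, 2 experts, top-1 (sparse) routing, followed by a ReLU. The weights here are random rather than the hand-worked values from the exercise, so the routing decisions may differ from the figure.

```python
# Sketch of the hand exercise: 2 tokens, 2 experts, sparse top-1 routing.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 4
x = torch.randn(2, d)                    # X1 (blue), X2 (orange)
W_gate = torch.randn(d, 2)               # gate network: d -> 2 experts
W_experts = torch.randn(2, d, d)         # one weight matrix per expert

gate_logits = x @ W_gate                 # (2 tokens, 2 experts)
chosen = gate_logits.argmax(dim=-1)      # top-1 expert per token

out = torch.empty_like(x)
for t, e in enumerate(chosen.tolist()):  # each token only visits its chosen expert
    out[t] = x[t] @ W_experts[e]
out = F.relu(out)                        # final activation, as in step 6
print(chosen.tolist(), out.shape)
```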


Key Benefits

• Size: The model can get really large (while still being efficient, as highlighted in the next point) simply by adding more experts. In this example, adding one more expert means adding 16 more weight parameters.
• Efficiency: The gate network selects only a subset of experts to actually compute (in the above exercise, one expert). In other words, only 50% of the parameters are involved in processing a token.

Mixture of Attention Heads (MoA)


• Developed by Zhang et al. in 2022, MoA applies MoE to the attention mechanism of Transformers. In MoA, each attention head acts as an expert with its own specialized query (Q) and output (O) projection matrices, while the key (K) and value (V) projection matrices are shared across heads.
• The router selects the top-K heads for each token and integrates the outputs from the selected experts. This selective attention process reduces the computational load by focusing on the most relevant parts of the data, enhancing the model's efficiency and output quality.
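A rough sketch of this head-level routing is shown below (an illustration of the idea, not the authors' code): each expert head owns its Q and O projections, K and V are shared, and a router keeps only the top-k heads per token. For clarity every head is computed and then masked; a real implementation would compute only the selected heads.

```python
# Illustrative Mixture-of-Attention-Heads routing with shared K/V projections.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_head, n_experts, top_k, seq_len = 32, 8, 4, 2, 6

W_q = nn.Parameter(torch.randn(n_experts, d_model, d_head) * 0.02)  # per-expert queries
W_o = nn.Parameter(torch.randn(n_experts, d_head, d_model) * 0.02)  # per-expert outputs
W_k = nn.Linear(d_model, d_head, bias=False)                        # shared across heads
W_v = nn.Linear(d_model, d_head, bias=False)                        # shared across heads
router = nn.Linear(d_model, n_experts, bias=False)

x = torch.randn(seq_len, d_model)
k, v = W_k(x), W_v(x)                                    # (seq, d_head)

probs = F.softmax(router(x), dim=-1)                     # (seq, n_experts)
topk_p, topk_idx = probs.topk(top_k, dim=-1)             # top-k heads per token

q = torch.einsum("sd,edh->esh", x, W_q)                  # (experts, seq, d_head)
attn = F.softmax(q @ k.T / d_head ** 0.5, dim=-1)        # (experts, seq, seq)
head_out = torch.einsum("eqk,kh->eqh", attn, v)          # (experts, seq, d_head)
out_all = torch.einsum("eqh,ehd->eqd", head_out, W_o)    # (experts, seq, d_model)

out_all = out_all.permute(1, 0, 2)                       # (seq, experts, d_model)
sel = out_all.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, d_model))
y = (topk_p.unsqueeze(-1) * sel).sum(dim=1)              # (seq, d_model)
print(y.shape)
```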

SwitchHead
• Proposed by Csordás et al. in 2023, SwitchHead extends the concept of MoE to the self-attention
mechanism’s value (V) and output (O) projection matrices.
• Unlike MoA, SwitchHead preserves the standard multi-head structure but introduces multiple expert
layers within the projections of each head. It significantly reduces the number of attention heads required by focusing compute resources where they are most needed.
This results in lower memory usage and fewer compute operations, as each token is processed by a
smaller, more focused set of expert projections.

Mixture of Depths (MoD)


• Introduced by Raposo et al. in 2024, MoD integrates MoE into the skip connection paths within a
Transformer. It allows tokens to bypass certain layers of the network based on their complexity, as
determined by a router. This adaptive computation method routes simpler tokens directly through skip
connections, avoiding unnecessary processing by the more computationally intensive attention and
FFN layers. MoD sets a threshold determining which tokens bypass the Transformer block, allowing for
dynamic allocation of compute resources based on token complexity. This method proves highly
efficient, significantly reducing the computational load by avoiding processing for tokens that do not
require the full depth of the model.
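Below is an illustrative sketch of that token-skipping idea, under simplifying assumptions: a scalar router scores each token, only the highest-scoring tokens (up to a fixed capacity) pass through the block's compute, and the rest ride the residual path untouched. Multiplying the block output by the router score is one simple way to keep the routing signal differentiable; it is not claimed to be the paper's exact formulation.

```python
# Sketch of Mixture-of-Depths-style routing: most tokens skip the block.
import torch
import torch.nn as nn

d_model, seq_len, capacity = 32, 16, 4        # only 4 of 16 tokens are processed

block = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                      nn.Linear(4 * d_model, d_model))   # stand-in for attention + FFN
router = nn.Linear(d_model, 1)

x = torch.randn(seq_len, d_model)
scores = router(x).squeeze(-1)                # (seq,) one scalar per token
topk = scores.topk(capacity).indices          # tokens deemed "hard" enough to process

y = x.clone()                                 # skip path: identity for every token
y[topk] = x[topk] + scores[topk, None].sigmoid() * block(x[topk])
print(sorted(topk.tolist()), y.shape)
```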


The Deep Learning Way


• In 2017, an extension of the MoE paradigm suited for deep learning was proposed by Noam Shazeer et
al.
• In most deep learning models, increasing model capacity generally translates to improved performance when datasets are sufficiently large. However, when the entire model is activated for every example, it leads to "a roughly quadratic blow-up in training costs, as both the model size and the number of training examples increase", as stated by Shazeer et al.
• Although the disadvantages of dense models are clear, there have been various challenges for an effective conditional computation method targeted toward modern deep learning models, mainly for the following reasons:
1. Modern computing devices like GPUs and TPUs perform better at arithmetic operations than at network branching.
2. Larger batch sizes benefit performance, but conditional computation shrinks the effective batch size seen by each expert.
3. Network bandwidth can limit computational efficiency, notably affecting embedding layers.
4. Some schemes might need loss terms to attain required sparsity levels, impacting model quality and load balance.
5. Model capacity is vital for handling vast data sets, a challenge that the existing conditional computation literature does not adequately address.
• The MoE technique presented by Shazeer et al. aims to achieve conditional computation while addressing the abovementioned issues. They were able to increase model capacity by more than a thousandfold while sustaining only minor losses in computational efficiency.
• The authors introduced a new type of network layer called the "Sparsely-Gated Mixture-of-Experts Layer." It builds on previous iterations of MoE and aims to provide a general-purpose neural network component that can be adapted to different types of tasks.
• The Sparsely-Gated Mixture-of-Experts Layer, or MoE layer, consists of numerous expert networks, each a simple feed-forward neural network, plus a trainable gating network. The gating network is responsible for selecting a sparse combination of these experts to process each input.


• The fascinating feature here is the use of sparsity in the gating function. This means that for every input instance, the gating network only selects a few experts for processing, keeping the rest inactive. This sparsity and expert selection is achieved dynamically for each input, making the entire process highly flexible and adaptive. Notably, computational efficiency is preserved since inactive parts of the network are not processed.
• The MoE layer can also be stacked hierarchically, where a primary gating network selects a sparsely weighted combination of "experts," each of which is itself an MoE layer.


• Moreover, the authors also introduced an innovative technique called Noisy Top-K Gating. This mechanism adds tunable Gaussian noise to the gating function, retains only the top K values, and assigns the rest to negative infinity, translating to a zero gating value. Such an approach ensures the sparsity of the gating network while maintaining robustness against potential discontinuities in the gating function output. Interestingly, it also aids in load balancing across the expert networks.
• In their framework, both the gating network and the experts are trained jointly via back-propagation, the standard training mechanism for neural networks. The output from the gating network is a sparse, n-dimensional vector, which serves as the gate values for the n expert networks. The output from each expert is then weighted by the corresponding gating value to produce the final model output.
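A minimal sketch of noisy top-k gating in the spirit of this description is given below: add learnable Gaussian noise to the gate logits, keep only the top-k entries, and set the rest to negative infinity so their softmax weight is exactly zero. Shapes and names are illustrative; the load-balancing losses from the paper are omitted.

```python
# Noisy top-k gating sketch: sparse gate values with tunable Gaussian noise.
import torch
import torch.nn as nn
import torch.nn.functional as F

def noisy_top_k_gates(x, w_gate, w_noise, k=2):
    clean = x @ w_gate                                   # (batch, n_experts) logits
    noise_std = F.softplus(x @ w_noise)                  # learned, per-example noise scale
    noisy = clean + torch.randn_like(clean) * noise_std
    topk_val, topk_idx = noisy.topk(k, dim=-1)
    masked = torch.full_like(noisy, float("-inf"))       # non-selected experts -> -inf
    masked.scatter_(-1, topk_idx, topk_val)              # keep only the top-k logits
    return F.softmax(masked, dim=-1)                     # sparse gate values

d, n_experts = 16, 8
w_gate = nn.Parameter(torch.zeros(d, n_experts))
w_noise = nn.Parameter(torch.zeros(d, n_experts))
gates = noisy_top_k_gates(torch.randn(4, d), w_gate, w_noise, k=2)
print(gates)   # each row has exactly two non-zero weights that sum to 1
```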

Expert Choice Routing


• Despite the popularity of MoE in recent transformer-based models, demonstrated by the Switch Transformer, GLaM, V-MoE, and FLAN-MoE, improvements and research potential remain in the area.
• In any MoE scheme, the routing or gating function may leave specific experts undertrained while overfitting others. Regularization has been introduced to avoid too many examples being routed to a single expert or to a particular subset of experts. Additionally, Google Research proposed "Expert Choice Routing" in November 2022, aiming to address this potential flaw, explicitly targeting language models.
• Unlike traditional MoE models, the EC routing method is founded on a different approach to assigning "experts" to "tokens" within a Mixture-of-Experts (MoE) model. Instead of assigning tokens to experts as traditional MoE models do, EC reverses this process, assigning experts to tokens based on their importance or difficulty.
• EC routing sets an "expert capacity" value to regulate how many tokens an expert can handle simultaneously. It's calculated as the average number of tokens per expert in a batch of input sequences, which is then multiplied by a "capacity factor". The capacity factor is a variable that determines the average number of experts each token can be assigned to. By adjusting the capacity factor, researchers can control how many experts work on each token, providing flexibility in allocating computation resources.
• To decide which tokens should be assigned to which experts, the EC method uses a "token-to-expert score matrix." This matrix scores the compatibility between each token and each expert, ranking which experts would best fit each token. Based on these scores, the most relevant tokens for each expert are selected via a "top-k function". The k here refers to the number of tokens chosen for each expert.
• Once the most relevant tokens have been identified for each expert, a permutation function is applied to arrange the data. This means reshuffling the data so that each expert gets its assigned tokens, allowing for efficient parallel computation across all the experts.
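The following is a hedged sketch of the expert-choice idea described above: build a token-to-expert score matrix, derive each expert's token budget from the capacity factor, and let each expert pick its top-k tokens. Names and the exact normalization axis are illustrative assumptions rather than Google's implementation.

```python
# Expert-choice routing sketch: experts pick tokens, not the other way around.
import torch
import torch.nn.functional as F

n_tokens, n_experts, d = 12, 4, 16
capacity_factor = 2.0
k = int(capacity_factor * n_tokens / n_experts)   # tokens per expert (here 6)

x = torch.randn(n_tokens, d)
w_router = torch.randn(d, n_experts)

scores = F.softmax(x @ w_router, dim=-1)          # token-to-expert score matrix
gating, token_idx = scores.T.topk(k, dim=-1)      # each expert selects its top-k tokens
# token_idx[e] lists the tokens routed to expert e; gating[e] holds their weights.
# On average each token is processed by `capacity_factor` experts, but "hard"
# tokens may be picked by several experts while easy ones may be picked by none.
print(token_idx)
```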

Implications and Outlooks


• Incorporating MoE into deep learning is a relatively new development, gaining traction only as models for NLP and computer vision tasks began to scale significantly. Before this, the demand for conditional computation was less pronounced than it is for contemporary Large Language Models (LLMs) and intricate CNNs.
• In 2021, Meta AI conducted a dedicated study of MoE models trained on language data, comparing how MoE models scale in comparison with dense models. They found that, other than in fine-tuning, MoE-based models can match the performance of dense models with a quarter of the compute. They could scale MoE models up to a trillion parameters (this was long before GPT-4 was released) and consistently outperform their dense model counterparts.
• The same year, Google Brain proposed V-MoE, a vision transformer utilizing sparse MoE layers. They found that V-MoE can match the performance of state-of-the-art models with as little as half of the computational resources required.
• More famously, GPT-4 was also leaked to be adopting an MoE scheme with 8 local models, each containing 220 billion parameters, totaling a whopping ~1.76 trillion parameters.

The “How” Behind MoE



• Although the success of MoE is clear in the deep learning field, as with most things in deep learning, our understanding of how it performs so well is rather unclear.
• Notably, each expert model is initialized and trained in the same manner, and the gating network is typically configured to dispatch data equally to each expert. Unlike traditional MoE methods, all experts are trained jointly with the MoE layer on the same dataset. It is fascinating how each expert can become "specialized" in its own task, and how experts in MoE do not collapse into a single model.
• The paper “Towards Understanding Mixture of Experts in Deep Learning” by Zixiang Chen et al.
attempts to interpret the “how” behind the MoE layers. They conclude that the “cluster structure of the
underlying problem and the non-linearity of the expert is pivotal to the success of MoE.”
• Although the conclusion does not provide a direct answer, it helps to gain more insight into the simple yet effective approach of MoE.

What’s Next?
• Theoretically, a deeper understanding of MoE architectures and their working principles is needed. As
we saw in Chen et al.’s paper, the reasons behind the success of MoE layers are still partially obscure.
Therefore, more theoretical and empirical research is required to demystify the intrinsic mechanics of
these models, potentially leading to their optimization and better generalization.
• Additionally, how to design more effective gating mechanisms and expert models is an open question with great potential for future exploration. While Expert Choice Routing offers a promising direction, other innovative approaches might enhance the routing mechanism.
• Lastly, while MoE has shown impressive results in domains like NLP and computer vision, there is
considerable room to explore its utility in other domains, such as reinforcement learning, tabular data
domains, and more.
• The journey of MoE is in its infancy in the realm of deep learning, with many milestones yet to be
achieved. However, its potential for transforming how we understand and deploy deep learning
models is enormous. With the current state of computing, it's unlikely that we will see significant improvements to hardware as rapidly as we see improvements to modeling techniques. By leveraging the inherent strength of the MoE paradigm—the division of complex tasks into simpler subtasks handled by specialized expert models—we may continue to push the boundaries of what is achievable with deep learning. And that, indeed, is an exciting prospect to look forward to.

Related Papers

Outrageously Large Neural Networks: the Sparsely-Gated Mixture-of-Experts Layer
• The capacity of a neural network to absorb information is limited by its number of parameters.
Conditional computation, where parts of the network are active on a per-example basis, has been
proposed in theory as a way of dramatically increasing model capacity without a proportional
increase in computation. In practice, however, there are significant algorithmic and performance challenges. Also, static neural network architectures apply the same function to every example. In contrast, input-dependent models attempt to tailor the function to each example. While it is
straightforward for a human to manually specify a single static architecture, it is infeasible to specify
every input-dependent function by hand. Instead, the input-dependent function must be automatically
inferred by the model, which introduces an extra level of complexity in optimization.
• Given the need to automatically infer architectures for each example, a natural solution is to define a single large model (supernetwork) with numerous sub-networks (experts), and route examples through a path in the supernetwork. The figure below from Ramachandran and Le (2019) visualizes an example of a routing network. Intuitively, similar examples can be routed through similar paths and dissimilar examples can be routed through different paths. The example-dependent routing also encourages expert specialization, in which experts devote their representational capacity to transforming a chosen subset of examples.


• Learning to route examples to well-matched experts is critical for good performance. Effective routing can be achieved by training another small neural network (router) that learns to route examples through the supernetwork. The router takes the example as input and outputs the next expert to use. The router can take advantage of the intermediate representations of the example produced in the supernetwork.
• This paper by Shazeer et al. in ICLR 2017 addresses these challenges and finally realizes the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
• They introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. In this per-example routing setup, different examples are processed by different subcomponents, or experts, inside a larger model, a.k.a. a supernetwork.
• Specifically, the proposed MoE layer takes as input a token representation $x$ and then routes it to the best-determined top-$k$ experts, selected from a set $\{E_i(x)\}_{i=1}^{N}$ of $N$ experts. The router variable $W_r$ produces logits $h(x) = W_r \cdot x$, which are normalized via a softmax distribution over the available $N$ experts at that layer. The gate value for expert $i$ is given by,

$$p_i(x) = \frac{e^{h(x)_i}}{\sum_{j}^{N} e^{h(x)_j}}$$

• The top-$k$ gate values are selected for routing the token $x$. If $T$ is the set of selected top-$k$ indices, then the output of the layer is the linearly weighted combination of each expert's computation on the token by the gate value,

$$y = \sum_{i \in T} p_i(x) E_i(x)$$


• They apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. They present model architectures in which an MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than the state-of-the-art at lower computational cost.
• The following diagram from the paper illustrates a Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.

Scaling Vision with Sparse Mixture of Experts


• Almost all prevalent computer vision models are "dense," that is, every input is processed by every parameter.
• This paper by Riquelme et al. from Google Brain introduces the Vision Mixture of Experts (V-MoE), a
novel approach for scaling vision models. The V-MoE is a sparsely activated version of the Vision
Transformer (ViT) that demonstrates scalability and competitiveness with larger dense networks in
image recognition tasks.
• The paper proposes a sparse variant of the Vision Transformer (ViT) that uses a mixture-of-experts
architecture. This approach routes each image patch to a subset of experts, making it possible to scale
up to 15B parameters while matching the performance of state-of-the-art dense models.
• An innovative extension to the routing algorithm is presented, allowing prioritization of subsets of each
input across the entire batch. This adaptive per-image compute leads to a trade-off between performance and computational efficiency during inference.
• The figure below from the paper shows an overview of the architecture. V-MoE is composed of L ViT blocks. In some, the MLP is replaced with a sparsely activated mixture of MLPs. Each MLP (the expert) is stored on a separate device and processes a fixed number of tokens. The communication of these tokens between devices is shown in this example, which depicts the case where k = 1 expert is selected per token. Here each expert uses a capacity ratio $C=\frac{4}{3}$: the sparse MoE layer receives 12 tokens per device, but each expert has capacity for 16 ($\frac{16 \cdot 1}{12}=\frac{4}{3}$). Non-expert components of V-MoE such as routers, attention layers and normal MLP blocks are replicated identically across devices.

• The V-MoE shows impressive scalability, successfully trained up to 15B parameters, and demonstrates
strong performance, including 90.35% accuracy on ImageNet.
• The paper explores the transfer learning abilities of V-MoE, showing its adaptability and effectiveness across different tasks and datasets, even with limited data.
• A detailed analysis of the V-MoE's routing decisions and the behavior of its experts is provided, offering insights into the model's internal workings and guiding future improvements.
• V-MoE models require fewer computational resources than dense counterparts, both in training and inference, thanks to their sparsely activated nature and the efficient use of the Batch Prioritized Routing algorithm.
• The paper concludes with the potential of sparse conditional computation in vision tasks, emphasizing the environmental benefits due to reduced CO2 emissions and the promising directions for future research in large-scale multimodal or video modeling.
• The paper represents a significant advancement in the field of computer vision, particularly in the development of scalable and efficient vision models.

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

• This paper by Ma et al., published in KDD 2018, introduces a novel approach to multi-task learning called Multi-gate Mixture-of-Experts (MMoE). The method aims to enhance the performance of multi-task learning models by better handling the relationships between different tasks.
• The MMoE model adapts the Mixture-of-Experts (MoE) framework to multi-task learning by sharing expert submodels across all tasks and using a gating network optimized for each task. This design allows the model to dynamically allocate shared and task-specific resources, efficiently handling tasks with varying degrees of relatedness.
• The paper presents experiments using synthetic data and real datasets, including a binary classification benchmark and a large-scale content recommendation system at Google. These experiments demonstrate MMoE's effectiveness in scenarios where tasks have low relatedness and its superiority over traditional shared-bottom multi-task models in terms of both performance and trainability.
• MMoE's architecture consists of multiple experts (feed-forward networks) and a gating network for each task, which determines the contribution of each expert to that task. This setup allows the model to learn nuanced relationships between tasks and allocate computation resources more effectively (a minimal sketch of this multi-gate setup appears at the end of this section).
• The following figure from the paper shows (a) a shared-bottom model, (b) a one-gate MoE model, and (c) a multi-gate MoE model.

• In the experiments with the Census-income dataset, a UCI benchmark dataset, the task was to predict
whether an individual’s income exceeds $50,000 based on census data. The dataset contains
demographic and employment-related information. MMoE's application to this dataset involved addressing the challenge of binary classification using multiple socio-economic factors as input features.
• On synthetic data, MMoE showed better performance, especially when task correlation is low, and demonstrated improved trainability with less variance in model performance across runs. On real-world datasets, including the UCI Census-income dataset and Google's content recommendation system, MMoE consistently outperformed baseline models in terms of accuracy and robustness.
• MMoE offers computational efficiency by using lightweight gating networks and shared expert networks, making it suitable for large-scale applications. The experiments on Google's recommendation system highlighted MMoE's ability to improve both engagement and satisfaction metrics in live experiments compared to single-task and shared-bottom models.
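Below is a minimal sketch of the multi-gate setup referenced above: the experts are shared across tasks, while each task gets its own softmax gate and its own small tower. Layer sizes and names are illustrative, not the paper's configuration.

```python
# MMoE sketch: shared experts, one gate (and tower) per task.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    def __init__(self, d_in, n_experts=4, n_tasks=2, d_expert=16):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, d_expert), nn.ReLU()) for _ in range(n_experts)])
        self.gates = nn.ModuleList([nn.Linear(d_in, n_experts) for _ in range(n_tasks)])
        self.towers = nn.ModuleList([nn.Linear(d_expert, 1) for _ in range(n_tasks)])

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, d_expert)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=-1).unsqueeze(-1)               # task-specific mixing
            outputs.append(tower((w * expert_out).sum(dim=1)))         # (B, 1) per task
        return outputs

model = MMoE(d_in=10)
y_task1, y_task2 = model(torch.randn(8, 10))
print(y_task1.shape, y_task2.shape)
```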

Mixture-of-Experts Meets Instruction Tuning: a Winning Combination for Large Language Models
• The paper titled "Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models" presents an innovative approach to enhancing the performance and scalability of Large Language Models (LLMs) by combining the Sparse Mixture-of-Experts (MoE) architecture with instruction tuning. Sparse MoE is a neural architecture that adds learnable parameters to LLMs without increasing inference costs; instruction tuning, in contrast, trains LLMs to follow instructions more effectively.
• The authors advocate for the combination of these two approaches, demonstrating that MoE models benefit significantly more from instruction tuning than their dense counterparts.
• The paper presents three experimental setups: direct finetuning on individual downstream tasks without instruction tuning; instruction tuning followed by few-shot or zero-shot generalization on downstream tasks; and instruction tuning supplemented by further finetuning on individual tasks.
• The findings indicate that MoE models generally underperform dense models of the same computational capacity in the absence of instruction tuning. However, this changes with the introduction of instruction tuning, where MoE models outperform dense models.
• The paper introduces the FLAN-MOE32B model, which outperforms FLAN-PALM62B on four benchmark tasks while using only a third of the FLOPs. This highlights the efficiency and effectiveness of the FLAN-MOE approach.
• The authors conduct a comprehensive series of experiments to compare the performance of various MoE models subjected to instruction tuning. These experiments include evaluations in natural language understanding, reasoning, and question-answering tasks. The study also explores the impact of different routing strategies and the number of experts on the performance of FLAN-MOE models, showing that performance scales with the number of tasks rather than the number of experts.
• The following image from the paper shows the effect of instruction tuning on MoE models versus dense counterparts for base-size models (same FLOPs across all models in this figure). They perform single-task finetuning for each model on held-out benchmarks. Compared to dense models, MoE models benefit more from instruction tuning, and are more sensitive to the number of instruction-tuning tasks. Overall, the performance of MoE models scales better with respect to the number of tasks than with the number of experts.

• The paper discusses the challenge of adapting MoE models to multilingual benchmarks and highlights the importance of incorporating diverse linguistic data during training to ensure effective language coverage.
• Overall, the paper "Mixture-of-Experts Meets Instruction Tuning" by Sheng Shen et al. presents significant advancements in the scalability and efficiency of LLMs through the novel integration of the MoE architecture and instruction tuning, setting new standards in the field of natural language processing.

From Sparse to Soft Mixtures of Experts


• Sparse Mixture of Experts (MoE) architectures scale model capacity without large increases in training or inference costs. MoE allows us to dramatically scale model sizes without significantly increasing inference latency. In short, each "expert" can separately attend to a different subset of tasks via different data subsets before they are combined via an input routing mechanism. Thus, the model can learn a wide variety of tasks, but still specialize when appropriate. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning.
• This paper by Puigcerver et al. from Google DeepMind proposes Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges while maintaining the benefits of MoEs.
• Extra-large models such as Google's GLaM or (reportedly) OpenAI's GPT-4 use Sparse MoE under the hood, which suffers from training instabilities because it is not fully differentiable. Soft MoE replaces the non-differentiable expert routing with a differentiable layer. The end-to-end model is fully differentiable again, can be trained with ordinary SGD-like optimizers, and the training instabilities go away.
• Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoE works, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity at lower inference cost.
• The following figure from the paper illustrates the main differences between Sparse and Soft MoE layers. While the router in Sparse MoE layers (left) learns to assign individual input tokens to each of the available slots, in Soft MoE layers (right) each slot is the result of a (different) weighted average of all the input tokens. Learning to make discrete assignments introduces several optimization and implementation issues that Soft MoE sidesteps.

• They propose a fully-differentiable sparse vision transformer (ViT) that addresses the aforementioned challenges such as training instability, token dropping, and inefficient finetuning. In the context of visual recognition, Soft MoE greatly outperforms the standard ViT and popular MoE variants (Tokens Choice and Experts Choice). Soft MoE scales ViT models to >50B parameters with little effect on inference latency. For example, Soft MoE-Base/16 requires 10.5x lower inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its performance after similar training. Soft MoE also scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, while inference time cost grows by only 2%, and it performs substantially better.
• The following figure from the paper illustrates the Soft MoE routing algorithm (a minimal sketch of this dispatch/combine computation follows at the end of this section). Soft MoE first computes scores or logits for every pair of input token and slot, based on some learnable per-slot parameters. These logits are then normalized per slot (columns) and every slot computes a linear combination of all the input tokens based on these weights (in green). Each expert (an MLP in this work) then processes its slots (e.g., 2 slots per expert in this diagram). Finally, the same original logits are normalized per token (i.e., by row) and used to combine all the slot outputs, for every input token (in blue). Dashed boxes represent learnable parameters.

• The following infographic (source) presents an overview of their results:

• PyTorch implementation.
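For concreteness, here is a rough sketch of the dispatch/combine computation described in the figure above (an illustration of the algorithm, not the linked PyTorch implementation): the slot logits are normalized per slot to mix tokens into slots, each expert processes its slots, and the same logits normalized per token mix the slot outputs back. Sizes and names are illustrative.

```python
# Soft MoE sketch: soft dispatch of tokens to slots, experts on slots, soft combine.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_tokens, d, n_experts, slots_per_expert = 16, 32, 4, 2
n_slots = n_experts * slots_per_expert

x = torch.randn(n_tokens, d)
phi = nn.Parameter(torch.randn(d, n_slots) * 0.02)        # learnable per-slot parameters
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d, d), nn.GELU()) for _ in range(n_experts)])

logits = x @ phi                                          # (tokens, slots)
dispatch = F.softmax(logits, dim=0)                       # normalize per slot (columns)
slots = dispatch.T @ x                                    # (slots, d): weighted token mixes

slots = slots.view(n_experts, slots_per_expert, d)
slot_out = torch.stack([experts[e](slots[e]) for e in range(n_experts)])
slot_out = slot_out.view(n_slots, d)                      # each expert processed its slots

combine = F.softmax(logits, dim=1)                        # normalize per token (rows)
y = combine @ slot_out                                    # (tokens, d): fully differentiable
print(y.shape)
```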

Switch Transformers
• The Switch Transformer, introduced in the paper "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" by Fedus et al. from Google, innovates large language models by integrating a Mixture of Experts (MoE) into the feed-forward network layer of Transformers. This design uses a routing mechanism that directs each token to a specific expert, optimizing computational efficiency by maintaining a constant computational cost despite a large parameter count.
• Key aspects include:
◦ Hard Routing: Simplifies the computational demands by engaging only the most relevant expert per token. This contrasts with top-k routing (where k > 1), which activates multiple experts but increases complexity.


◦ Capacity Factor: Balances the trade-off between compute efficiency and token-processing accuracy. The expert capacity is defined by the formula $f \times \frac{T}{E}$, where $T$ is the total number of tokens, $E$ is the number of experts, and $f$ is a tunable hyperparameter (see the worked example at the end of this section). Adjusting $f$ affects how tokens are distributed across experts, influencing whether tokens need to be dropped or padded to keep the computational graph consistent.
◦ Performance: Empirical results demonstrate that the Switch Transformer can achieve the same modeling outcomes as the T5 model but at a 7x faster rate, utilizing the same amount of computational resources (FLOPs).
• This model architecture allows LLMs to scale more effectively by increasing the number of parameters without a corresponding rise in computational complexity, owing to the sparse activation of experts. This makes the Switch Transformer a groundbreaking development in the field of machine learning, especially for applications requiring large-scale and efficient computation.
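As a quick worked example of the capacity formula above (with illustrative numbers, not values from the paper):

```python
# Expert capacity = f * T / E, as defined in the Capacity Factor bullet above.
T = 2048                 # tokens in the batch
E = 64                   # experts
f = 1.25                 # capacity factor
expert_capacity = int(f * T / E)
print(expert_capacity)   # 40: each expert processes at most 40 tokens;
                         # overflow tokens are dropped, unused slots are padded.
```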

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models


• This paper by Frantar and Alistarh from the Institute of Science and Technology Austria and Neural
Magic Inc. presents QMoE, a framework designed to address the memory challenges in deploying
large language models (LLMs) with Mixture-of-Experts (MoE) architectures.
• The key problem QMoE addresses is the massive memory requirement of large models, exemplified by the 1.6 trillion-parameter SwitchTransformer-c2048 model, which typically requires 3.2TB of memory. QMoE effectively compresses such models to less than 1 bit per parameter, enabling their execution on commodity hardware with minor runtime overheads.
• QMoE employs a scalable algorithm and a custom compression format paired with GPU decoding
kernels. It compresses the SwitchTransformer-c2048 model to less than 160GB (0.8 bits per parameter)
with minor accuracy loss in under a day on a single GPU.
• The implementation includes a highly scalable compression algorithm and a bespoke compression format, facilitating efficient end-to-end compressed inference. The framework enables running trillion-parameter models on affordable hardware, like servers equipped with NVIDIA GPUs, at less than 5% runtime overhead compared to ideal uncompressed execution.


• The paper discusses the challenges in compressing MoE models, including conceptual issues with existing post-training compression methods and practical scaling challenges. It overcomes these by introducing a custom compression format and highly efficient decoding algorithms optimized for GPU accelerators.
• The technical contributions include a novel approach to handling massive activation sets and a unique system design for optimized activation offloading, expert grouping, and robustness modifications, ensuring efficient application of data-dependent compression to massive MoEs.
• The framework significantly reduces the size of large models, with QMoE-compressed models achieving over 20x compression rates compared to 16-bit precision models. This reduction in size is accompanied by minor increases in loss on pretraining validation and zero-shot data.
• The paper also discusses the system design and optimizations made to address memory costs, GPU utilization, and reliability requirements. These include techniques like optimized activation offloading, list buffer data structures, lazy weight fetching, and expert grouping.
• The following figure from the paper illustrates the offloading execution for the sparse part of a Transformer block. An expert $E_2$ and its corresponding input tokens $X_E$ are fetched to GPU memory to produce $E_2'$, which together with the corresponding outputs $Y_E$ are written back to the CPU again.

• The experiments demonstrate that QMoE effectively compresses MoE models while maintaining performance. The system was tested on various datasets, including Arxiv, GitHub, StackExchange, and Wikipedia, showing good performance preservation even for highly compressed models.
• The paper provides detailed insights into the encoding and decoding processes and the kernel implementation for the GPU, highlighting the challenges and solutions for achieving sub-1-bit-per-parameter compression.
• The QMoE framework is a significant step towards practical deployment of massive-scale MoE models, addressing key limitations of MoE architectures and facilitating further research and understanding of such models.


• The paper's findings are significant as they make it feasible to deploy and research trillion-parameter models on more accessible hardware, potentially democratizing access to high-performance LLMs and spurring further innovation in the field.

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts


• This paper by Gale et al. from Stanford University, Microsoft Research, and Google Research introduces Dropless Mixture-of-Experts (MoE), a novel system for efficient MoE training on GPUs.
• The system, named MegaBlocks, addresses the limitations of current frameworks that restrict dynamic routing in MoE layers, often leading to a tradeoff between model quality and hardware efficiency due to the necessity of dropping tokens or wasting computation on excessive padding. Token dropping leads to information loss, as it involves selectively ignoring part of the input data, while padding adds redundant data to make the varying input sizes uniform, which increases computational load without contributing to model learning. This challenge arises from the difficulty in efficiently handling the dynamic routing and load-imbalanced computation characteristic of MoE architectures, especially in the context of deep learning hardware and software constraints.
• MegaBlocks innovatively reformulates MoE computations as block-sparse operations, developing new GPU kernels specifically for this purpose. These kernels efficiently manage the dynamic, load-imbalanced computations inherent in MoEs without resorting to token dropping. This results in up to 40% faster end-to-end training compared to MoEs trained with the Tutel library, and a 2.4x speedup over DNNs trained with Megatron-LM.
• The system's core contributions include high-performance GPU kernels for block-sparse matrix multiplication, leveraging blocked-CSR-COO encoding and transpose indices. This setup enables efficient handling of sparse inputs and outputs in both transposed and non-transposed forms.
• Built upon the Megatron-LM library for Transformer model training, MegaBlocks supports distributed MoE training with data and expert model parallelism. Its unique ability to avoid token dropping through block-sparse computation provides a fresh approach to MoE algorithms as a form of dynamic structured activation sparsity.


• The figure below from the paper shows a Mixture-of-Experts layer, shown for num_experts=3, top_k=1 and capacity_factor=1 with the prevalent, token-dropping formulation. First (1), tokens are mapped to experts by the router. Along with expert assignments, the router produces probabilities that reflect the confidence of the assignments. Second (2), the feature vectors are permuted to group tokens by expert assignment. If the number of tokens assigned to an expert exceeds its capacity, extra tokens are dropped. Third (3), the expert layers are computed for the set of tokens they were assigned, as well as any padding needed for unused capacity. Lastly (4), the results of the expert computation are un-permuted and weighted by the router probabilities. The outputs for dropped tokens are shown here set to zero.

• Experiments demonstrate that MegaBlocks enables significant end-to-end training speedups for MoE models compared to existing approaches, especially as model size increases. The system also reduces the computational overhead and memory requirements associated with MoE layers, leading to more efficient utilization of hardware resources. Furthermore, the approach decreases the number of hyperparameters that need to be re-tuned for each model and task, simplifying the process of training large MoE models.
• The paper provides detailed insights into the design and performance of the block-sparse kernels, including analyses of throughput relative to cuBLAS batched matrix multiplication and discussions on efficient routing and permutation for MoEs. The results show that MegaBlocks' kernels perform comparably to cuBLAS, achieving an average of 98.6% of cuBLAS's throughput with minimal variation across different configurations.
• Code

MoE Models


Mixtral
• The Mixtral 8x7B, developed by Mistral AI, is a 32-block Transformer model that integrates a sparse Mixture of Experts (MoE) layer using top-k routing with k=2. The model employs 8 experts, each being a 1-layer MLP with SwiGLU activation, instead of the standard ReLU used in many similar architectures. In the Mixtral 8x7B, the top-k routing technique is applied, where the top 2 experts (out of the 8 available) are dynamically chosen for each token based on their relevance to the input, as determined by the routing matrix $W_g$ (see the sketch at the end of this section).
• This configuration results in a model with a total of 47B parameters; however, only 13B parameters are actively engaged at any given time due to the sparse activation nature of the model. This approach maintains the computational efficiency typically associated with smaller models while leveraging the capacity of a much larger model.
• Performance evaluations on diverse benchmark problems—including MMLU, knowledge retrieval, reasoning, comprehension, math, and coding—have shown that Mixtral 8x7B performs comparably or superior to larger models such as Llama 70B, especially on math and coding tasks. Mixtral also significantly upsamples multilingual data during pre-training, which enhances its performance across multiple languages.
• While the Mixtral 8x7B incorporates established techniques like top-k routing and sparse activation, its implementation using fewer experts and the absence of load-balancing losses or extensive parallelism distinguishes it from other MoE models like the Switch Transformer. The model's focus on syntactic rather than semantic specialization of experts is notable, particularly in how it influences performance on specific types of problems.
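Below is an illustrative, toy-sized sketch of a Mixtral-style sparse FFN block: 8 SwiGLU experts, a linear router $W_g$, and top-2 routing with the two selected gate values renormalized via a softmax. This is a sketch of the described architecture, not Mistral's implementation.

```python
# Mixtral-style sparse MoE FFN: 8 SwiGLU experts, top-2 routing per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoEBlock(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # routing matrix W_g
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, d_model)
        topk_logits, topk_idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)          # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                    # only selected experts run
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

block = SparseMoEBlock()
print(block(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```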

GPT-4
• Read our GPT-4 primer here.
• Per a rumor, GPT-4 might be an 8-way Mixture-of-Experts (MoE) model with 8 experts of 220B parameters each (a total of ~1.76T parameters).


• A Mixture of Experts (MoE) model essentially revolves around a router that directs questions to the appropriate expert. If GPT-4 does adopt the MoE approach, it would consist of eight specialist models, each trained in a specific domain, like mathematics, history, storytelling, etc. When a question is posed, the router analyses it and seamlessly forwards it to the most suitable expert.
• The concept of MoE is quite prevalent (refer Outrageously Large Neural Networks: the Sparsely-Gated Mixture-of-Experts Layer), with Langchain's high-level implementation of an LLMRouterChain, and notable low-level integrated examples like Google's Switch Transformer (refer Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity).
• Per yet another rumor, here are the specifics:
◦ Parameter count: GPT-4 is more than 10x the size of GPT-3, with a total of ~1.8 trillion parameters across 120 layers.
◦ Architecture: GPT-4 uses an MoE architecture; the main idea behind using an MoE model was to keep training/inference costs reasonable while ensuring great performance. In other words, it is not a dense transformer like, for instance, PaLM (or GPT-3). It utilizes 16 experts within the model, each with about ~111B parameters for the MLP. 2 of these experts are routed to per forward pass. There are roughly ~55B shared parameters for attention.
◦ MoE routing: While the literature talks a lot about advanced routing algorithms for choosing
which experts to route each token to, OpenAI’s is allegedly quite simple, for the current GPT-4
model.
◦ Inference: Each forward pass of inference (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOPs that would be required per forward pass of a purely dense model (vs. the MoE architecture that's used).
◦ Dataset: GPT-4 is trained on ~13T tokens. These are not unique tokens, but the total amount of tokens seen over all epochs. There are millions of instruction fine-tuning data samples from ScaleAI and internal sources (probably acquired through ChatGPT + their API before they changed the policy).
◦ Training epochs: 2 epochs for text-based data and 4 for code-based data.
◦ Training paradigm: Pre-training used an 8K context length; the 32K context version of GPT-4 was obtained by fine-tuning the 8K model after pre-training. Extending context is hard… but not impossible is a good reference on how to achieve this.
◦ Batch size: The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million tokens! This is “only” a batch size of ~7.5 million tokens per expert, since not every expert sees every token. To get the real (sequence-level) batch size, divide this number by the context width.
◦ Parallelism strategies: To parallelize across all of their A100 GPUs, they utilized 8-way tensor parallelism, as that is the limit for NVLink. Beyond that, they used 15-way pipeline parallelism. They also apparently used DeepSpeed ZeRO Stage 1 or block-level FSDP.
◦ Training cost: OpenAI’s training compute for GPT-4 is ~2.15e25 FLOPs, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU. Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints. If their cost in the cloud was about $1 per A100-hour, the training cost for this run alone would be about $63 million. Had H100s been used, pre-training could have been done with ~8,192 H100s in ~55 days for $21.5 million at $2 per H100-hour.
◦ MoE tradeoffs: Multiple MoE tradeoffs were made; for example, MoE is incredibly difficult to deal with at inference because not every part of the model is utilized on every token generation. This means some parts may sit dormant while other parts are being used, which really hurts utilization rates when serving users. Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that is purely research. There are multiple reasons to go with fewer experts; one reason OpenAI chose 16 experts is that models with more experts are harder to generalize across many tasks, and more experts can also make convergence harder to achieve. With such a large training run, OpenAI instead chose to be more conservative on the number of experts.
◦ GPT-4 inference cost: GPT-4 costs 3x that of the 175B-parameter DaVinci, largely due to the larger clusters required for GPT-4 and the much lower utilization achieved. An estimate of its cost is $0.0049 per 1K tokens when serving GPT-4 with 8K context width on 128 A100s, and $0.0021 per 1K tokens on 128 H100s. Note that these estimates assume decently high utilization and large batch sizes.
◦ Multi-Query Attention: GPT-4 uses MQA instead of MHA (MQA is a classic choice at this point). Because only a single KV head is needed, the memory footprint of the KV cache can be significantly reduced. Even then, the 32K-context GPT-4 definitely cannot run on 40GB A100s, and the 8K version is capped on max batch size.
◦ Continuous batching: OpenAI implements both variable batch sizes and continuous batching. This allows them to cap maximum latency while still optimizing inference costs.
◦ Vision multi-modal: They have a vision encoder separate from the text encoder, with cross-attention; the architecture is similar to Google DeepMind’s Flamingo. This adds more parameters on top of the 1.8T of text-only GPT-4, and it is fine-tuned with another ~2 trillion tokens after the text-only pre-training. OpenAI wanted to train the vision model from scratch, but the approach wasn’t mature enough, so they de-risked it by starting from text. One of the primary purposes of this vision capability is to enable autonomous agents that can read web pages and transcribe what’s in images and video. Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos (sampled frames, with Whisper run over the audio to get transcripts).
◦ Speculative decoding: OpenAI might be using speculative decoding for GPT-4’s inference. The idea is to use a smaller, faster model to decode several tokens in advance and then feed them into a large oracle model as a single batch. If the small model is right about its predictions (i.e., the larger model agrees), several tokens can be decoded in one batch; but if the larger model rejects the tokens predicted by the draft model, the rest of the batch is discarded and generation continues with the larger model. The conspiracy theory that the new GPT-4’s quality had deteriorated might simply be because they are letting the oracle model accept lower-probability sequences from the speculative decoding model. (A minimal code sketch of this draft-then-verify loop is given at the end of this list.)
▪ Per Andrej Karpathy, speculative sampling/decoding/execution for LLMs is an excellent
inference-time optimization. It hinges on the following unintuitive observation: forwarding
an LLM on a single input token takes about as much time as forwarding an LLM on K input
tokens in a batch (for larger K than what might be obvious). This unintuitive fact is
because sampling is heavily memory bound: most of the “work” is not doing compute, it is
reading in the weights of the transformer from VRAM into on-chip cache for processing. So
if you’re going to do all that work of reading in all those weights, you might as well apply
them to a whole batch of input vectors.
▪ At batch_size=1 (i.e. just generating a single stream of prediction on your computer),
the inference is super duper memory-bound. The on-chip compute units are twiddling
their thumbs while sucking model weights through a straw from DRAM. Every
individual weight that is expensively loaded from DRAM onto the chip is only used for
a single instant multiply to process each new input token. So the stat to look at is not
FLOPS but the memory bandwidth.
▪ Let’s take a look:
▪ A100: 1935 GB/s memory bandwidth, 1248 TOPS
▪ MacBook M2: 100 GB/s, 7 TFLOPS
▪ The compute is ~200X but the memory bandwidth only ~20X. So the little M2 chip that
could will only be about ~20X slower than a mighty A100. This is ~10X faster than you
might naively expect just looking at ops.
▪ The situation becomes quite different when you run inference at a very high batch size (e.g., ~160+), such as when you’re hosting an LLM engine simultaneously serving a lot of parallel requests. Or in training, where you aren’t forced to go serially token by token and can parallelize across both the batch and time dimensions, because the next-token targets (labels) are known. In these cases, once you load the weights into on-chip cache and pay that large fixed cost, you can re-use them across many input examples and reach ~50%+ utilization, actually making those FLOPS count.
▪ In summary, why is LLM inference surprisingly fast on your MacBook? If all you want to do is batch-1 inference (i.e., a single “stream” of generation), only the memory bandwidth matters. And the memory bandwidth gap between chips is a lot smaller, and has been a lot harder to scale, compared to flops.
▪ The reason we can’t naively use this fact to sample in chunks of K tokens at a time is that
every N th token depends on what token we sample at time at step N − 1. There is a serial
dependency, so the baseline implementation just goes one by one left to right.
▪ Now the clever idea is to use a small and cheap draft model to first generate a candidate sequence of K tokens – a “draft”. Then we feed all of these together through the big model in a batch. This is almost as fast as feeding in just one token, per the above. Then we go from left to right over the logits predicted by the model and sample tokens. Any sample that agrees with the draft allows us to immediately skip forward to the next token. If there is a disagreement, we throw the draft away and eat the cost of some throwaway work (sampling the draft and forward-passing the later tokens).
▪ The reason this works in practice is that most of the time the draft tokens get accepted,
because they are easy, so even a much smaller draft model gets them. As these easy
tokens get accepted, we skip through those parts in leaps. The hard tokens where the big
model disagrees “fall back” to original speed, but actually a bit slower because of all the
extra work.
▪ In summary, this one weird trick works because LLMs are memory-bound at inference time in the “batch size 1” setting of sampling a single sequence of interest, which a large fraction of “local LLM” use cases fall into. And because most tokens are “easy”.
▪ More on this here: Blockwise Parallel Decoding for Deep Autoregressive Models,
Accelerating Large Language Model Decoding with Speculative Sampling, and Fast
Inference from Transformers via Speculative Decoding
◦ Inference architecture: Inference runs on clusters of 128 GPUs; there are multiple such clusters in multiple datacenters in different locations, using 8-way tensor parallelism and 16-way pipeline parallelism. Each node of 8 GPUs holds only ~130B parameters, i.e., less than 30GB per GPU at FP16 and less than 15GB at FP8/INT8. The model has 120 layers, so it fits in 15 different nodes (possibly with fewer layers on the first node, since it also needs to compute the embeddings). According to these numbers, OpenAI should have trained on 2x the tokens if they were trying to be Chinchilla-optimal, which suggests they are struggling to source high-quality data.
◦ Why no Fully Sharded Data Parallel (FSDP)? A possible reason could be that some of the hardware infrastructure they secured is of an older generation. This is pretty common at local compute clusters, as organizations usually upgrade the infrastructure in several “waves” to avoid a complete pause of operation. With such a high degree of pipeline parallelism, it is very likely that they suffer from the “batch bubble”: slight idle time between batches.
◦ Dataset mixture: They trained on ~13T tokens. CommonCrawl and RefinedWeb each contribute ~5T. Removing the duplication of tokens from multiple epochs, we get to a much more reasonable number of “unaccounted for” tokens: the “secret” data, parts of which probably came from Twitter, Reddit, and YouTube. Some speculations are: LibGen (4M+ books), Sci-Hub (80M+ papers), and all of GitHub. Part of the missing dataset could also be a custom dataset of college textbooks collected by hand for as many courses as possible; this is very easy to convert to text form and then transform into instruction form with Self-Instruct. This creates the “illusion” that GPT-4 “is smart” no matter who uses it: for computer scientists, it can help with questions about P!=NP; for a philosophy major, it can totally talk about epistemology. There are also papers that try to forcibly extract memorized parts of books from GPT-4 to understand what it was trained on. There are some books it knows so well that it must have seen them; moreover, it even knows the unique IDs of Project Euler problems.
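As a concrete illustration of the draft-then-verify loop described under “Speculative decoding” above, here is a minimal, self-contained Python sketch. It is emphatically not OpenAI’s implementation: the “draft” and “oracle” models are toy functions over a ten-token vocabulary, and acceptance is a simple greedy agrees/disagrees check rather than the probabilistic rejection-sampling rule from the papers linked above, but the control flow is the same.

import random

random.seed(0)
VOCAB = list(range(10))

def toy_next_token(prefix, noise):
    # Deterministic "ground-truth" next token, optionally corrupted with noise.
    base = (sum(prefix) + len(prefix)) % 10
    return base if random.random() > noise else random.choice(VOCAB)

def draft_model(prefix):
    # Small, fast model: usually agrees with the oracle, sometimes wrong.
    return toy_next_token(prefix, noise=0.2)

def oracle_model_batch(prefixes):
    # Large model: one batched "forward pass" scores every draft position at once.
    return [toy_next_token(p, noise=0.0) for p in prefixes]

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(seq + draft))
        # 2) Verify all k positions with a single batched oracle call.
        oracle = oracle_model_batch([seq + draft[:i] for i in range(k)])
        # 3) Accept the longest prefix on which draft and oracle agree.
        n_accept = 0
        while n_accept < k and draft[n_accept] == oracle[n_accept]:
            n_accept += 1
        seq += draft[:n_accept]
        if n_accept < k:
            # On the first disagreement, keep the oracle's token and redraft from there.
            seq.append(oracle[n_accept])
    return seq[len(prompt):len(prompt) + n_tokens]

print(speculative_decode([1, 2, 3], n_tokens=12))

When most draft tokens are accepted, each (expensive) oracle call yields several output tokens; when they are rejected, throughput degrades gracefully toward one oracle call per token, which matches the “easy vs. hard tokens” intuition above.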

Mixtral: Mistral’s 8x7B MoE Model


• Mixtral 8x7B from Mistral follows a Mixture of Experts (MoE) architecture consisting of 8x 7B experts (nominally 56B parameters, ~47B in practice because the non-expert weights are shared). With 8 experts and a router network that selects two of them at every layer for the inference of each token, it looks directly inspired by the rumors about GPT-4’s architecture. This information can be derived from the model metadata:

{"dim": 4096, "n_layers": 32, "head_dim": 128, "hidden_dim": 14336, "n_heads": 32, "n_kv_heads": 8, "vocab_size": 32000, "moe": {"num_experts_per_tok": 2, "num_experts": 8}}

• From GPT-4 leaks, we can speculate that GPT-4 is an MoE model with 8 experts, each with ~111B parameters of its own and ~55B shared attention parameters (~166B parameters per model). Likewise, only 2 experts are used for the inference of each token.
• Since the model size (87GB) is smaller than 8x Mistral 7B (8*15GB=120GB), we could assume that the
new model uses the same architecture as Mistral 7B but the attention parameters are shared, reducing
the naïve 8x7B model size estimation.
• The conclusion is that (probably) Mixtral 8x7B uses a very similar architecture to that of GPT-4, but scaled down:
◦ 8 total experts instead of 16 (2x reduction).
◦ 7B parameters per expert instead of 166B (24x reduction).
◦ 42B total parameters (estimated) instead of 1.8T (42x reduction).
◦ Free to use under Apache 2.0 license
◦ Outperforms Llama 2 70B with 6x faster inference.
◦ Matches or outperforms GPT-3.5
◦ Multilingual: vastly outperforms LLaMA 2 70B on French, Italian, German and Spanish
◦ Same 32K context as the original GPT-4.
• Each layer in an 8x MoE model has its FFN split into 8 chunks and a router picks 2 of them, while the attention weights are always used in full for each token. This means that if the new Mixtral model uses 5B parameters for the attention, each forward pass uses 5 + (42 − 5)/4 = 14.25B params (the division by 4 because only 2 of the 8 experts are active).
• Mixtral is basically 8 models in a trenchcoat: the feedforward layers of the decoder blocks are divided into 8 experts, and for each token, a router decides which 2 experts to allocate the processing to. The advantage of this architecture is that even though the total is ~47B parameters (less than the naive 8 × 7B = 56B, since the non-expert parameters are shared rather than unique to each expert), the model is much cheaper and faster to run, because only 2 of the 8 experts are activated for each prediction (a minimal routing sketch is given right below).
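To make the routing concrete, below is a minimal PyTorch sketch of a top-2 MoE feed-forward block of the kind described in the bullet above. It is an illustrative toy rather than Mixtral’s actual implementation: the dimensions are tiny, the experts are plain MLPs instead of SwiGLU, there is no load balancing, and the per-expert Python loop trades throughput for readability.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoEFFN(nn.Module):
    # A sparse FFN: n_experts expert MLPs, a linear router, top-2 dispatch per token.
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (n_tokens, d_model)
        logits = self.router(x)                    # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] = out[mask] + weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(5, 64)                        # 5 tokens through the MoE block
print(Top2MoEFFN()(tokens).shape)                  # torch.Size([5, 64])

The routing logic mirrors the description above: compute router logits, keep the top-2 experts per token, renormalize their gate weights with a softmax, and run only those experts; the attention weights (not shown) would be applied to every token in full.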

But how do you maintain good performance with only 1/4th of your model running at any one time? The image below (source) gives us a view of the answer: there’s a marked specialization between experts, with one being stronger on logic, another on history, and so on. The router knows which one is good at each subject and, like an excellent TV host, carefully picks its experts to always get a good answer.
• Mistral has also released Mixtral 8x7B Instruct v0.1, trained using supervised fine-tuning and direct preference optimization (DPO). It scores 8.3 on MT-Bench, making it the best open-source model, with performance comparable to GPT-3.5.
• Mistral offers three chat endpoints with competitive pricing via Mistral AI La Plateforme:
◦ Mistral-tiny: Mistral 7B Instruct v0.2, an upgraded base model with higher context length (8K → 32K) and better fine-tuning (6.84 → 7.61 on MT-Bench).
◦ Mistral-small: Mistral 8x7B Instruct v0.1, matches or exceeds GPT-3.5 performance, multilingual.
◦ Mistral-medium: Outperforms GPT-3.5 on all metrics, multilingual.
• They’ve also announced Mistral-embed, an embedding model with a 1024 embedding dimension,
which achieves 55.26 on MTEB.
• Refer MoE Explanation.
• Blog; La Plateforme; Mixtral-8x7B-v0.1 Base model; Mixtral-8x7B-v0.1 Instruct model.

Results

• Benchmark results comparing against the other SOTA OSS models as of this writing: LLaMA-2, Yi-34B
(from 01.AI led by Kai-Fu Lee), and DeepSeek-67B (a strong model made by a quant trading
company).

OpenMoE
• OpenMoE is a family of open-sourced MoE LLMs.
• Colossal AI’s PyTorch OpenMoE implementation including both training and inference with expert
parallelism.

Further Reading
• Mixture of Experts Explained

Citation
If you found our work useful, please cite it as:

@article{Chadha2020DistilledMixtureOfExperts,
title = {Mixture of Experts},
author = {Chadha, Aman and Jain, Vinija},
journal = {Distilled AI},
year = {2020},
note = {\url{https://vinija.ai}}
}
