
Task-Based MoE for Multitask Multilingual Machine Translation

arXiv:2308.15772v1 [cs.CL] 30 Aug 2023

Hai Pham (Carnegie Mellon University), Young Jin Kim (Microsoft), Subhabrata Mukherjee* (Hippocratic AI),
David P. Woodruff (Carnegie Mellon University), Barnabás Póczos (Carnegie Mellon University), Hany Hassan Awadalla (Microsoft)

{htpham, bapoczos, dwoodruf}@cs.cmu.edu, {youki, hanyh}@microsoft.com, subhabrata.mukherjee.ju@gmail.com

*Work done while at Microsoft.

Abstract

Mixture-of-experts (MoE) architecture has been proven a powerful method for diverse tasks in training deep models in many applications. However, current MoE implementations are task agnostic, treating all tokens from different tasks in the same manner. In this work, we instead design a novel method that incorporates task information into MoE models at different granular levels with shared dynamic task-based adapters. Our experiments and analysis show the advantages of our approaches over the dense and canonical MoE models on multi-task multilingual machine translation. With task-specific adapters, our models can additionally generalize to new tasks efficiently.

1 Introduction

Mixture-of-Experts (MoE), while not a novel machine learning algorithm (Yüksel et al., 2012), has been revived in combination with deep learning, particularly the transformer (Vaswani et al., 2017), and has recently pushed forward various tasks such as natural language processing, computer vision, speech recognition, and multimodal and multitask learning, thanks to its scalability in distributed environments (Fedus et al., 2022). The main advantages of MoE stem from its ensemble design while maintaining sparsity in computation (Fedus et al., 2021). And with a proper design such as GShard (Lepikhin et al., 2020), the possibility for enterprise-level scalability is almost boundless. As a result, this method has been more and more widely adopted in many applications that require distributed and intensive workloads.

However, most of the current methods are task-agnostic, only optimizing for performance at lower levels in the architecture, such as the system or communication levels. In the case of multi-task learning, where a single model is required to learn from heterogeneous tasks, the task-specific data could be inherently diverse and vary largely from one task to another (Wu et al., 2020). As a result, treating data from such different sources the same makes the learning ineffective, as also evidenced recently by the interference between different task data (Pfeiffer et al., 2022).

In this work, we therefore design a novel MoE approach where task information is used during training and inference to assign experts based on individual task information. The intuition is to make the training more task-aware so that similar tasks are routed to the same group of experts and vice versa. From the architectural perspective, we incorporate high-level application-specific information with the system-level information to make the model task-aware and hence give it a better strategy for allocating experts based on the characteristics of distinct tasks, as illustrated in Figure 1.

Our proposed architecture allows for grouping experts based on the similarity of tasks, i.e. similar tasks should use a similar group of experts and different tasks should not, by using shared-task adapters. Our design of putting those adapters on top of MoE layers allows for flexibility in future extensions: if we want the model to acquire new tasks while keeping similar resources, we only finetune new adapters, and if we want to scale the hardware resources, e.g. for more speed, we simply deal with the MoE layers on such new resources.

Our experiments and analysis show the advantages of using task information in MoE architectures in multiple settings, including multitask multilingual machine translation, as well as its generalization in few-shot learning. In summary, our contributions are as follows.
• First, we design novel MoE architectures that dynamically allocate experts based on task information in the context of multilingual multitask machine translation, with many variations.

• Second, we thoroughly study the pros and cons of our approaches in training from scratch, finetuning, as well as transfer learning.

• Third, we implement our models on top of well-proven infrastructures for practicality and scalability, including DeepSpeed (Rasley et al., 2020), fairseq (Ott et al., 2019) and the transformer (Vaswani et al., 2017).

Figure 1: Extended from the typical MoE approaches that do not discriminate tokens from different tasks, we create shared task-related adapters that are trained to route tokens from similar tasks to the same shared adapters, and vice versa.

2 Related Work

MoE Basics  The transformer-based Mixture-of-Experts (MoE) architecture essentially sparsifies the transformer by replacing the heavy feed-forward network (FFN) with a sparse MoE layer with top-2 gates (Shazeer et al., 2017). However, since increasing the number of experts does not simply increase performance (Fedus et al., 2021; Clark et al., 2022), many approaches have been proposed to tackle large-scale MoE deployment, such as in (Kim et al., 2021). In large-scale deployment, however, additional techniques should also be employed to battle memory issues, such as “sharding” experts (Lepikhin et al., 2020) or stabilizing the training (Zoph et al., 2022), since the models are often deployed on separate nodes that mainly use GPUs with limited memory. The architecture in this work inherits all of those techniques and, in addition, incorporates task information into MoE routing, which in turn directs data into separate task adapters. This kind of routing is, however, hardware-agnostic, unlike some other work such as (Zheng et al., 2022; Chen et al., 2023; Zeng and Xiong, 2023).

MoE Routing Techniques  Gating is critical to the MoE layer, which works as a weighted sum of the experts and serves the ultimate purpose of load balancing across all available experts during both training and inference. Unlike the originally proposed top-k experts (Shazeer et al., 2017; Du et al., 2021), it was shown in SwitchTransformer that a single expert can preserve the quality if chosen properly, while significantly reducing the communication and computation cost (Fedus et al., 2021). In more detail, SwitchTransformer first divides tokens evenly amongst all experts, with an optional buffer for imbalanced cases, and then applies an auxiliary loss to enforce load balancing. An alternative approach, which is more computationally efficient, is to get rid of such an extra-heavy, complicated loss and instead use a hash function to route every token to its matched expert, which tends to balance the output (Roller et al., 2021). Another interesting approach is to permit each token to appear in the top-k list of multiple experts (Zhou et al., 2022), which has been proven to help, although it is not applicable to auto-regressive applications.
Yet because of the inherent problem of load imbalance, another approach is to replace the gating mechanism with a stochastic selection method, which randomly activates an input during processing. The intuition is somewhat similar to the hash approach since it relies on “fair” randomness to solve the balance problem while keeping the blueprint more lightweight than enforcing an auxiliary loss. Unlike all of those routing techniques, which are application agnostic, our proposed model connects the application level (i.e. task information) with the lower-level MoE layers to better deal with the interference of different tasks in the context of multilingual multitask applications.

Task-level Routing  Recently, task information has been used for improving MoE, e.g. in (Liu et al., 2023). Our model is, however, much simpler and can be trained end-to-end, unlike their approach, which requires clustering for off-the-shelf shared representation learning. Probably the most related work to ours is Mod-Squad (Chen et al., 2022), which shares our motivation while having several differences. First, their approach has multiple aids to make the task-based MoE work, with an additional loss for regularization, while we instead rely mainly on the simple motivation of incorporating task information into MoE. Second, we still stick to a single gate for routing, while they allocate multiple gates, one per task. Third, they additionally have MoE attention blocks, which make their architecture more complicated. Finally, our focused application is text-based machine translation, unlike the computer vision settings in both works mentioned.

3 Models

Transformer architecture (Vaswani et al., 2017) has been proven to be the core backbone of the pervasive successes in natural language processing, computer vision, and other artificial intelligence fields. The main bottleneck of this architecture is, however, its heavy blueprint, which leads to intensive resources in training and inference and is difficult to scale up. MoE is one powerful method to alleviate those problems in transformers.

3.0.1 Sparse Mixture-of-Experts (MoE)

MoE, which was first introduced before the deep learning era (Jacobs et al., 1991), was recently borrowed to address those drawbacks in the transformer architecture (Shazeer et al., 2017). In a nutshell, MoE creates an ensemble of experts in multi-layer transformer blocks in place of a single expert, typically in the form of a feed-forward neural network (FFN) that is dense with many parameters.

Formally, given an original FFN layer called $\tilde{E}$, we clone it into another layer containing a set of $N$ experts with exactly the same architecture, $\{E_i\}_{i=1}^{N}$. Likewise, the number of parameters for this particular layer is increased by a factor of $N$.

The typical granular level for applying those experts in the context of natural language processing is the token level. Given a token, its learned representation before the MoE layer is a vector $x$, and its post-MoE output $y$ is the weighted average of those experts' outputs:

$o_i = E_i(x)$    (1)

$y = \sum_{i=1}^{N} W_i o_i$    (2)

where $W_i$ is the weight (importance) of the corresponding expert $E_i$.

The key to MoE's power and its well-proven successes in tandem with the transformer architecture is its sparsity design: only one or a few experts are activated (i.e. have non-zero weight) at any point in time, in spite of the many more parameters just introduced by the ensemble. Typically the component responsible for this sparsity is a gate that is co-trained with the experts to route tokens to their target expert(s), eventually assigning only a single or a few non-zero weights across all experts per token, typically using softmax and a top-k method:

$g_{\mathrm{out}} = \mathrm{softmax}(W_g x)$    (3)

$G(x) = \mathrm{Top\_K}(g_{\mathrm{out}})$    (4)

With $G(x)$ being the set of $K$ chosen experts, Equation 2 becomes

$y = \sum_{i \in G(x)} W_i o_i$    (5)
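To make the gating concrete, below is a minimal PyTorch-style sketch of a sparse MoE layer implementing Eqs. (1)-(5). It is a simplified illustration under assumed module names and sizes, not the paper's actual implementation (which builds on DeepSpeed/fairseq and adds expert sharding and load-balancing losses).

```python
# Minimal sketch (PyTorch) of token-level sparse MoE, Eqs. (1)-(5).
# Module names and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        # N experts, each cloning the original FFN architecture (Sec. 3.0.1)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # W_g in Eq. (3)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        g_out = F.softmax(self.gate(x), dim=-1)               # Eq. (3)
        weights, chosen = torch.topk(g_out, self.k, dim=-1)   # Eq. (4): G(x)
        y = torch.zeros_like(x)
        # Eq. (5): sparse weighted sum over the K chosen experts only
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    y[mask] += w * expert(x[mask])             # o_i = E_i(x), Eq. (1)
        return y
```

In the large-scale systems the paper builds on, the experts live on different devices (GShard-style sharding), so the per-expert loop above would be replaced by an all-to-all token dispatch.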
The main architectural problem with this design is its scalability: the memory is quickly used up as we increase the number of experts, given the limited compute resources allocated to a single compute node in any distributed environment. GShard (Lepikhin et al., 2020) was born to fix this issue by trading memory for communication: allocating each expert to a single node and only aggregating them when needed, e.g. for gradient averaging in training or weight averaging when saving a model. This elegant design has unlocked MoE's scalability and practicality in enterprise-level deployments, especially with the follow-up work on optimizing the architecture for computation and communication, as mentioned in Section 2.

3.1 Task-based Adapters

Yet another problem on which we focus is not at the system level but at the higher application level. As mentioned, in the multitask setting, the interference amongst tasks that are inherently different from each other could lead to ineffective training. As a result, we employ task-based adapters to separate those different tasks into different adapters. Likewise, data (or tokens) from similar tasks should be routed to a similar group of adapters. There are three modes for those adapters.

The first and simplest is to allocate one adapter per individual task. Although this setting is straightforward and requires no additional computation for data routing, it has a drawback in acquiring new, unseen tasks: the model has to allocate a new adapter for each new task and fine-tune it with some amount of new data. Another potential problem is memory limitation if we want to extend to many new tasks in the future. This mode is called static, as shown in Figure 2a.

To enforce efficient learning of representations of similar task data, as well as to alleviate memory problems, we have dynamic (Figure 2b), where the number of adapters is less than the number of tasks. As a result, we intentionally guide the model to learn better cross-task representations in terms of similarity and dissimilarity. In other words, data from similar tasks should be routed to the same adapters and vice versa. In practice, we choose the number of adapters to be $\log_2(n)$, with $n$ being the number of tasks.

3.2 Task-based Adapters with MoE

In this section, we formulate the task-based adapters mentioned in Section 3.1 in combination with MoE, both of which are our core architecture components.

Given $M$ tasks, we allocate them into $L$ shared-task adapters ($L < M$). For every single token, we have the associated task information that makes up an input tuple $(x, t)$ per token. As before, $x$ is the representation vector from the input, and $t$ is the task representation vector learned by a task embedding.

Similarly to MoE, we use a learnable task gate $G_t$ that is responsible for this routing, with its input being the concatenation of the input components:

$G_t(x, t) = \mathrm{Top\_K}(x \oplus t)$    (6)

$y = \sum_{i \in G_t(x,t)} W_i o_i$    (7)

And since the number of adapters $L$ is less than the number of tasks $M$, we call this setting dynamic, as demonstrated in Figure 2b, as opposed to static (Figure 2a), where each task goes to its own individual adapter.

Our main model uses the shared task embedding representation for the task gate as well as the MoE gate, which we call shared-dynamic, as shown in Figure 2c.
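The sketch below, under assumed module and parameter names, illustrates the dynamic task-based routing of Eqs. (6)-(7): the token representation is concatenated with a learned task embedding, and a top-k task gate selects among L shared adapters, with L chosen as roughly log2 of the number of tasks as in Section 3.1. It is an illustration, not the authors' implementation; the bottleneck adapter design and the residual connection are assumptions.

```python
# Hedged sketch (PyTorch) of dynamic task-based adapter routing, Eqs. (6)-(7).
# Names, the bottleneck adapter design, and the residual connection are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAdapterLayer(nn.Module):
    def __init__(self, d_model: int, num_tasks: int, d_task: int = 64,
                 d_adapter: int = 256, k: int = 1):
        super().__init__()
        num_adapters = max(1, math.ceil(math.log2(num_tasks)))   # L < M, as in Sec. 3.1
        self.task_embed = nn.Embedding(num_tasks, d_task)        # task representation t
        self.task_gate = nn.Linear(d_model + d_task, num_adapters, bias=False)
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_adapter), nn.ReLU(), nn.Linear(d_adapter, d_model))
            for _ in range(num_adapters)
        )
        self.k = k

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); task_id: (num_tokens,) integer task labels
        t = self.task_embed(task_id)
        logits = self.task_gate(torch.cat([x, t], dim=-1))        # gate input is x ⊕ t, Eq. (6)
        weights, chosen = torch.topk(F.softmax(logits, dim=-1), self.k, dim=-1)
        y = torch.zeros_like(x)
        for slot in range(self.k):                                # Eq. (7): sum over chosen adapters
            for a, adapter in enumerate(self.adapters):
                mask = chosen[:, slot] == a
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    y[mask] += w * adapter(x[mask])
        return x + y  # residual around the adapter (assumption)
```

In the shared-dynamic variant, the same task embedding table would also be fed to the MoE gate of the layer below, so that both routing decisions see the same task representation.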
4 Experiment Setup

4.1 Data

We tackle the problem of multitask multilingual machine translation using data consisting of 10 different languages ranging from high-resource to low-resource ones, including English (En), French (Fr), German (De), Czech (Cs), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi (Hi), Turkish (Tr), and Gujarati (Gu). In more detail, the data for training, validation, and testing are listed in Table 1, where we can see that besides the high-resource ones, we have low-resource languages such as Estonian, Hindi, or Gujarati.

Those data are in the form of bitext in which one side is always English. As a result, we denote EX as the translation from English (E) to another language (X), and similarly XE for the other direction. Those data are populated from the popular WMT corpus (https://www.statmt.org/wmt20/index.html). For the given 1 English and 9 other languages, there are consequently 9 EX and 9 XE tasks. More information about the data can be found in Table 4 in Appendix A.
Figure 2: Our MoE models with variants. (a) Static: for each task, there is a separate adapter associated with it. (b) Dynamic: there are fewer adapters than tasks, in order to learn the shared representation of similar tasks. (c) Shared-Dynamic: the gates for the task adapters and MoE share the same embeddings for their routing decisions.

Split        Unit   de-en   fr-en   cs-en   et-en   fi-en   gu-en   hi-en   lv-en   ro-en
Training     M      4.6     10      10.3    0.7     4.8     0.9     0.3     1.4     0.5
Validation   K      3.0     3.0     3.0     2.0     1.4     2.0     0.5     2.0     2.0
Testing      K      3.0     3.0     3.0     2.0     1.4     2.0     0.5     2.0     2.0

Table 1: Training, validation, and testing sizes for all XE tasks (the data for EX are exactly the same). Note that the unit for training is millions (M) while that for both validation and testing is thousands (K), and the sizes are the same for validation and testing.

4.2 Task and Model Training

In this section, we describe the task, the evaluation metrics, and how we handle data and models for training.

Task  Our task is multi-task multilingual machine translation (MMMT), which uses the EX and XE pairs. Our single model is trained with two main capacities. First, this single model can translate all the training pairs with high accuracy. Second, the model is able to quickly acquire new translation pairs with only zero or a few shots.

Evaluation  While there are many evaluation metrics, we mainly use the BLEU score due to its popularity and credibility in evaluating machine translation tasks. This evaluation is implemented with SacreBLEU (https://github.com/mjpost/sacrebleu). We note that, unlike all available public implementations that we found, our implementation evaluates all BLEU scores on the fly along with the training, so there is no disruption for offline evaluation. That also helps in early stopping based on the BLEU scores on the validation sets.

Pre-Processing and Post-Processing  In terms of preprocessing, we first encode the data using the Byte-Pair Encoding (BPE) method and generate shared dictionaries where all the language pairs use the same vocabulary of size 64K, before feeding them to the model. To get accurate scores, for post-processing, we again use BPE decoding to reconstruct the whole translated sentences before comparing them with the original sentences prior to BPE pre-processing. Likewise, we treat all the processing and model manipulation as a black box for calculating the scores.
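A hedged sketch of the on-the-fly BLEU evaluation described above, using the public SacreBLEU Python API on post-processed (BPE-decoded) outputs; the helper name and surrounding training loop are illustrative, not the authors' code.

```python
# Minimal sketch of validation BLEU with SacreBLEU; helper name is illustrative.
import sacrebleu

def validation_bleu(hypotheses, references):
    """hypotheses: detokenized system outputs after BPE decoding;
    references: reference translations aligned with the hypotheses."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score  # e.g. compared across epochs for early stopping

# Illustrative usage:
# score = validation_bleu(decoded_valid_outputs, valid_references)
```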
XE Tasks
Model                  de-en  fr-en  cs-en  et-en  fi-en  gu-en  hi-en  lv-en  ro-en  Average
1. Dense               29.9   31.2   28.0   22.4   21.4   22.3   21.4   24.5   36.1   26.4
2. MoE Token           27.9   29.5   26.3   19.9   19.6   18.9   17.7   22.3   33.8   24.0
3. MoE Sentence        27.9   29.9   26.2   21.4   19.9   17.9   15.9   23.2   34.4   24.1
4. MoE Task-Static     32.1   33.3   30.7   24.3   23.4   20.6   22.5   27.2   38.8   28.1
5. MoE Task-Dynamic    31.4   32.0   29.1   23.4   22.1   18.9   20.5   25.5   37.2   26.7

EX Tasks
Model                  en-de  en-fr  en-cs  en-et  en-fi  en-gu  en-hi  en-lv  en-ro  Average
1. Dense               25.4   28.3   22.4   23.3   20.9   28.4   29.0   26.5   31.5   26.2
2. MoE Token           22.9   25.1   19.5   20.1   17.9   26.2   26.3   24.0   29.0   23.4
3. MoE Sentence        23.2   25.7   20.4   22.4   18.7   26.4   27.1   24.2   29.7   24.2
4. MoE Task-Static     29.5   32.5   27.9   27.4   25.8   28.8   30.8   32.2   34.6   29.9
5. MoE Task-Dynamic    27.3   29.6   25.0   24.7   22.7   27.7   29.3   28.4   32.7   27.5

Table 2: Comparison of task-based MoE models (models 4 & 5) to task-agnostic MoE models (models 2 & 3) and the non-MoE (Dense) model (model 1) in BLEU scores. With the help of task information, task-based MoE models show outperforming BLEU scores over all other model types across most of the tasks, including both high-resource and low-resource ones.

Model Configuration and Implementation  We use the transformer architecture (Vaswani et al., 2017) with 12 layers each for the encoder and the decoder, each of which uses a word embedding layer of dimension 1024 and a non-linear layer of dimension 4096. There are 16 attention heads and a dropout rate of 30%. For MoE, all jobs are trained on Azure cloud machines with 8 GPUs, and each job takes around 2 weeks for a model covering the 18 aforementioned tasks to reach decent scores. We apply early stopping based on the validation BLEU scores, with a non-increasing score after 2 epochs as the stopping condition. For the task-based information, we use a task embedding dimension of 64 and a task adapter hidden dimension of 256 for every single task adapter. Our implementation inherits the lower-level infrastructure code from Microsoft DeepSpeed and Fairseq (https://github.com/facebookresearch/fairseq).

As for the implementation, an important practical issue with MoE is load balancing among experts for the best utilization of the infrastructure systems. To enforce a balanced load during training, we therefore employ the auxiliary loss from Lepikhin et al. (2020).
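For reference, the configuration described above can be summarized as the following plain dictionary; the key names are illustrative and do not correspond to actual fairseq or DeepSpeed flags.

```python
# Hedged summary of the reported training configuration; key names are illustrative.
config = {
    "encoder_layers": 12,
    "decoder_layers": 12,
    "embed_dim": 1024,                 # word embedding dimension
    "ffn_dim": 4096,                   # non-linear (FFN) layer dimension
    "attention_heads": 16,
    "dropout": 0.3,
    "shared_bpe_vocab_size": 64_000,
    "experts_per_moe_layer": 8,        # see the MoE - Token baseline below
    "task_embed_dim": 64,
    "task_adapter_hidden_dim": 256,
    "early_stopping_patience_epochs": 2,   # on validation BLEU
    "gpus_per_job": 8,                 # Azure machines for MoE jobs
}
```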
4.2.1 Baselines

In order to show the performance of the task-based MoE models, the following baselines are selected:

Dense  This is the traditional transformer model without any MoE layer, i.e., no change to the fully connected (FFN) layer in each layer of the encoders or decoders.

MoE - Token  This is the MoE model that is usually considered the default option, where each FFN layer is replaced by an MoE layer. In our experiments, each MoE layer comprises 8 experts (each with the same size as the original FFN being replaced) and a gate for routing purposes.

MoE - Sentence  This is yet another MoE architecture with exactly the same configuration as the MoE - Token baseline. The difference is in the routing layer, which functions at a different granularity: sentences instead of tokens. In more detail, while the gate decides which expert to use for each token separately in the MoE - Token model, here it routes all tokens belonging to a single sentence to the same chosen expert.

5 Results and Discussions

5.1 Multitask Multilingual Machine Translation

We first present the main results for models capable of translating 18 tasks (see Section 4.2) concurrently. As shown in Table 2, our models that incorporate MoE layers and are enhanced with task information show great advantages over all the baseline models on most tasks, in both directions EX and XE, in accordance with our hypothesis that using task adapters in conjunction with MoE is helpful in multilingual multitask translation.
Model                        MoE   Task   MoE Routing   Task Routing      de-en  fr-en  et-en  fi-en  Average
MoE                          Y     N      Token         -                 32.4   33.7   24.2   23.6   28.5
Dense + Task Static          N     Y      -             Static            32.2   33.7   21.0   22.8   27.4
Dense + Task Dynamic         N     Y      -             Dynamic           31.9   33.0   22.0   22.5   27.4
MoE + Task Static            Y     Y      Task          Static            30.7   32.0   19.9   20.8   25.9
MoE + Task Dynamic           Y     Y      Task          Dynamic           32.6   33.9   24.0   23.9   28.6
MoE + Task Shared-Dynamic    Y     Y      Task          Shared-Dynamic    32.2   33.3   24.3   24.5   28.6

Table 3: Performance of different models with changes in whether MoE layers exist, whether Task Adapters exist, and how routing for those components is undertaken. The scores better than the baseline are highlighted. Task-based MoE shows advantages, especially with shared-dynamic adapters between MoE and Task Adapters on the low-resource language pairs.

An outstanding drawback that the task-based MoE models face, however, is on the low-resource translation pairs, e.g. Gu-En, Hi-En, or En-Gu. We hypothesize the problem is due to undersampling of the training data. Our training routine concatenates all the tasks' data into a single big dataset before drawing batches. However, without adjusting the sampling process, high-resource language pairs are trained significantly more, given their much larger data in place. In particular, for the case of Gujarati, where the Task-Dynamic MoE model underperforms in comparison to the baselines, our hypothesis is that linguistically, this language is the most different from all the other languages, which makes it very hard for the models to learn effective shared representations with any other pairs.

5.2 Ablation Study

5.2.1 Implications of Different Task Layers and MoE Layers

In this study, we limit the number of tasks to four (De-En, Fr-En, Et-En, and Fi-En), which can be divided into 2 groups of similar tasks: (De-En, Fr-En) is the first group and (Et-En, Fi-En) is the second one, to study the performance implications of different model variants when there is a task layer and/or an MoE layer.

As illustrated in Table 3, we again see that combining MoE and Task Adapters yields the best models, the same trend as shown in Table 2, particularly when the dynamic adapters are used to enforce similar tasks to share the same representations.

However, when task adapters are not used in conjunction with MoE, the performance is worse than MoE alone. This also means MoE should be the foundational infrastructure, and on top of that, task adapters should be used. It concurs with the motivation that the interference of different tasks or languages makes the training of experts difficult. In other words, there is not much help when there is only one expert for all the tasks (i.e. in Dense models).

5.2.2 Flexibility of Task-based MoE in Merging Two Trained Models

One of the important capabilities in multi-task learning, and in learning problems in general, is how to quickly acquire new capabilities given current models with minimal resources and effort. Aligned with this goal, this ablation explores how quickly our task-based MoE models can be merged, from 2 different models into a single new model that combines their capabilities.

In merging those two models, we restore the two respective checkpoints and merge layer-by-layer as follows. First, task-based adapters are kept and combined with each other: each model has 2 adapters (for the 4 tasks in that model) and the combined model has 4 adapters (for the 8 tasks in combination). Second, the task routers are also merged and changed so that the routing of each data point now has 4 selections instead of 2 outputs as in the previous models. Finally, the rest of the transformer and MoE layers have their weights averaged.
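The merging procedure just described can be sketched as a state-dict level operation, shown below. Parameter-name prefixes such as "task_adapter." and "task_gate." are assumptions for illustration; the paper does not specify its parameter naming.

```python
# Hedged sketch of merging two task-based MoE checkpoints (Sec. 5.2.2).
# Key prefixes ("task_adapter.", "task_gate.") are illustrative assumptions.
import torch

def merge_checkpoints(state_a: dict, state_b: dict) -> dict:
    """Keep both models' task adapters, widen the task router from 2 to 4
    outputs, and average all remaining transformer/MoE weights."""
    merged = {}
    for name, wa in state_a.items():
        wb = state_b[name]  # assumes both checkpoints share the same architecture
        if name.startswith("task_adapter."):
            # 1) Keep model A's adapters under their original names ...
            merged[name] = wa
        elif name.startswith("task_gate.out_proj"):
            # 2) Task router: stack the two routers' output rows so the merged
            #    gate selects among 4 adapters instead of 2.
            merged[name] = torch.cat([wa, wb], dim=0)
        else:
            # 3) The rest of the transformer and MoE layers: weight averaging.
            merged[name] = (wa + wb) / 2
    for name, wb in state_b.items():
        if name.startswith("task_adapter."):
            # ... and add model B's adapters under renamed keys.
            merged[name.replace("task_adapter.", "task_adapter_b.")] = wb
    return merged

# Illustrative usage:
# merged_state = merge_checkpoints(torch.load("model_a.pt"), torch.load("model_b.pt"))
```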
Figure 3: Ablation study on merging 2 checkpointed models with different capabilities; panels show (a) model 1, (b) model 2, and (c) the merged model. Model 1 is trained with 4 tasks: de-en, fr-en, et-en and fi-en. Model 2 is trained with the other 4 tasks: cs-en, gu-en, en-et, and en-fi. Although those 2 models are under-trained with only a few thousand steps, in the merged model that has the capabilities of the two combined, many pairs quickly pick up to similar levels as in the previous single models.

The tasks in the original two models are hand-picked as in Section 5.2.1 to form 2 different groups, each of which has 2 similar tasks. Model 1 has de-en, fr-en, et-en, and fi-en, while Model 2 has cs-en, gu-en, en-et and en-fi.

As shown in Figure 3, while the two original models have been trained with just a few thousand steps (a couple of hours), the combined model shows that it can quickly pick up their original capabilities with just a few hundred steps after merging. Although there are a few uncommon pairs that seem to fail, such as gu-en or en-et, the chart shows the optimistic result of combining trained models with our flexible task-based MoE architectures.

6 Conclusion

In the era of big data, large-scale models are more and more essential to big enterprises and institutions, where MoE in combination with transformer-based models has proven its great advantages very recently. It is, however, complicated to enable such an implementation in practice due to the difficulty of training a single model serving diverse tasks. The proposed task-based MoE, which employs task adapters in tandem with MoE, has shown promising advantages in the application of multitask multilingual machine translation. This novel design enforces a shared representation of similar tasks and separates different task data to counter the interference effects. In addition, it also offers the flexibility of changing adapters based on new tasks, or changing the MoE infrastructure, without affecting the application level. In the future, enforcing the shared representation learning explicitly using additional techniques such as contrastive learning or mutual information is also worth exploring.

7 Acknowledgements

The authors would like to thank Yiren Wang, Muhammad ElNokrashy, Alex Muzio, Akiko Eriguchi and other members of Microsoft's Machine Translation Group for their great feedback and help.
References

Chang-Qin Chen, Min Li, Zhihua Wu, Dianhai Yu, and Chao Yang. 2023. Ta-moe: Topology-aware large scale mixture-of-expert training. ArXiv, abs/2302.09915.

Z. Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik G. Learned-Miller, and Chuang Gan. 2022. Mod-squad: Designing mixture of experts as modular multi-task learners. ArXiv, abs/2212.08066.

Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake A. Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, T. W. Hennigan, Matthew G. Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, L. Sifre, Simon Osindero, Oriol Vinyals, Jack W. Rae, Erich Elsen, Koray Kavukcuoglu, and Karen Simonyan. 2022. Unified scaling laws for routed language models. In International Conference on Machine Learning.

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Z. Chen, and Claire Cui. 2021. Glam: Efficient scaling of language models with mixture-of-experts. ArXiv, abs/2112.06905.

William Fedus, Jeff Dean, and Barret Zoph. 2022. A review of sparse expert models in deep learning. ArXiv, abs/2209.01667.

William Fedus, Barret Zoph, and Noam M. Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23:120:1–120:39.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, 3:79–87.

Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andrés Felipe Cruz-Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, and Hany Hassan Awadalla. 2021. Scalable and efficient moe training for multitask multilingual models. ArXiv, abs/2109.10465.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668.

Zhili Liu, Kai Chen, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, and James Tin-Yau Kwok. 2023. Task-customized masked autoencoder via mixture of cluster-conditional experts. In International Conference on Learning Representations.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In North American Chapter of the Association for Computational Linguistics.

Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. 2022. Lifting the curse of multilinguality by pre-training modular transformers. In North American Chapter of the Association for Computational Linguistics.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. 2021. Hash layers for large sparse models. In Neural Information Processing Systems.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Sen Wu, Hongyang Zhang, and Christopher Ré. 2020. Understanding and improving information transfer in multi-task learning. ArXiv, abs/2005.00944.

Seniha Esen Yüksel, Joseph N. Wilson, and Paul D. Gader. 2012. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23:1177–1193.

Zhiyuan Zeng and Deyi Xiong. 2023. Scomoe: Efficient mixtures of experts with structured communication. In International Conference on Learning Representations.

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph Gonzalez, and Ion Stoica. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. ArXiv, abs/2201.12023.

Yan-Quan Zhou, Tao Lei, Han-Chu Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon. 2022. Mixture-of-experts with expert choice routing. ArXiv, abs/2202.09368.

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam M. Shazeer, and William Fedus. 2022. St-moe: Designing stable and transferable sparse expert models.
A WMT Data Information

Code   Language    Test Split
de     German      wmt2013
fr     French      wmt2013
cs     Czech       wmt2013
et     Estonian    wmt2018dev
fi     Finnish     wmt2015
gu     Gujarati    wmt2019dev
hi     Hindi       wmt2014dev
lv     Latvian     wmt2017dev
ro     Romanian    wmt2016dev

Table 4: More details about our datasets, for comparison and reproducibility.
