
Task-Based MoE for Multitask Multilingual Machine Translation

arXiv:2308.15772v1 [cs.CL] 30 Aug 2023

Hai Pham (Carnegie Mellon University), Young Jin Kim (Microsoft), Subhabrata Mukherjee* (Hippocratic AI),
David P. Woodruff (Carnegie Mellon University), Barnabás Póczos (Carnegie Mellon University), Hany Hassan Awadalla (Microsoft)

{htpham, bapoczos, dwoodruf}@cs.cmu.edu, {youki, hanyh}@microsoft.com, subhabrata.mukherjee.ju@gmail.com

*Work done while at Microsoft.

Abstract

Mixture-of-experts (MoE) architecture has been proven a powerful method for diverse tasks in training deep models in many applications. However, current MoE implementations are task agnostic, treating all tokens from different tasks in the same manner. In this work, we instead design a novel method that incorporates task information into MoE models at different granular levels with shared dynamic task-based adapters. Our experiments and analysis show the advantages of our approaches over the dense and canonical MoE models on multi-task multilingual machine translation. With task-specific adapters, our models can additionally generalize to new tasks efficiently.

1 Introduction

Mixture-of-Experts (MoE), while not a novel machine learning algorithm (Yüksel et al., 2012), has been revived in combination with deep learning, particularly the transformer (Vaswani et al., 2017), and has recently pushed forward various tasks such as natural language processing, computer vision, speech recognition, and multimodal and multitask learning, thanks to its scalability in distributed environments (Fedus et al., 2022). The main advantages of MoE stem from its ensemble design while maintaining sparsity in computation (Fedus et al., 2021). And with a proper design such as GShard (Lepikhin et al., 2020), the possibility for enterprise-level scalability is almost boundless. As a result, this method has been more and more widely adopted in many applications that require distributed and intensive workloads.

However, most of the current methods are task-agnostic, only optimizing for performance at lower levels in the architecture, such as the system or communication levels. In the case of multi-task learning, where a single model is required to learn from heterogeneous tasks, the task-specific data could be inherently diverse and vary largely from one task to another (Wu et al., 2020). As a result, treating data from such different sources the same makes the learning ineffective, as also evidenced recently by the interference between different task data (Pfeiffer et al., 2022).

In this work, we therefore design a novel MoE approach where task information is used during training and inference to assign experts based on individual task information. The intuition is to make the training more task-aware so that similar tasks are routed to the same group of experts and vice versa. From the architectural perspective, we incorporate high-level application-specific information with the system-level information to make the model task-aware and hence give it a better strategy for allocating experts based on the characteristics of distinct tasks, as illustrated in Figure 1.

Our proposed architecture allows for grouping experts based on the similarity of tasks, i.e. similar tasks should use a similar group of experts and different tasks should not, by using shared-task adapters. Our design of putting those adapters on top of MoE layers allows for flexibility in future extensions: if we want the model to acquire new tasks while keeping similar resources, we only finetune new adapters, and if we want to scale the hardware resources, e.g. for more speed, we simply deal with the MoE layers on such new resources.

Our experiments and analysis show the advantages of using task information in MoE architectures in multiple settings, including multitask multilingual machine translation, as well as its generalization in few-shot learning. In summary, our contributions are as follows.
• First, we design novel MoE architectures that dynamically allocate experts based on task information in the context of multilingual multitask machine translation, with many variations.

• Second, we thoroughly study the pros and cons of our approaches in training from scratch, finetuning, as well as transfer learning.

• Third, we implement our models on top of well-proven infrastructures for practicality and scalability, including DeepSpeed (Rasley et al., 2020), fairseq (Ott et al., 2019) and the transformer (Vaswani et al., 2017).

Figure 1: Extended from the typical MoE approaches that do not discriminate tokens from different tasks, we create shared task-related adapters that are trained to route tokens from similar tasks to the same shared adapters, and vice versa.

2 Related Work

MoE Basics  The transformer-based Mixture-of-Experts (MoE) architecture essentially sparsifies the transformer by replacing the heavy feed-forward network (FFN) with a sparse MoE layer with top-2 gates (Shazeer et al., 2017). However, since increasing the number of experts does not simply increase performance (Fedus et al., 2021; Clark et al., 2022), many approaches have been proposed to tackle large-scale MoE deployment, such as in (Kim et al., 2021). In large-scale deployment, however, additional techniques should also be employed to battle memory issues, such as “sharding” experts (Lepikhin et al., 2020) or stabilizing the training (Zoph et al., 2022), since the models are often deployed on separate nodes that mainly use GPUs with limited memory. The architecture in this work inherits all of those techniques and, in addition, incorporates task information into MoE routing, which in turn directs data into separate task adapters. This kind of routing is, however, hardware-agnostic, unlike some other work such as (Zheng et al., 2022; Chen et al., 2023; Zeng and Xiong, 2023).

MoE Routing Techniques  Gating is critical to the MoE layer, which works as a weighted sum of the experts and serves the ultimate purpose of load balancing across all available experts during both training and inference. Unlike the originally proposed top-k experts (Shazeer et al., 2017; Du et al., 2021), it was shown in SwitchTransformer that a single expert can preserve the quality if chosen properly, while significantly reducing the communication and computation cost (Fedus et al., 2021). In more detail, SwitchTransformer first divides tokens evenly amongst all experts, with an optional buffer for imbalanced cases, and then applies an auxiliary loss to enforce load balancing. An alternative approach, which is more computationally efficient, is to get rid of such an extra-heavy, complicated loss and instead use a hash function to route every token to its matched expert, which tends to balance the output (Roller et al., 2021). Another interesting approach is to permit each token to appear in the top-k list of multiple experts (Zhou et al., 2022), which has been proven to help, although it is not applicable to auto-regressive applications.
Yet because of the inherent problem of load imbalance, another approach is to replace the gating mechanism with a stochastic selection method, which randomly activates an input during processing. The intuition is somewhat similar to the hash approach since it relies on “fair” randomness to solve the balance problem while keeping the blueprint more lightweight than enforcing an auxiliary loss. Unlike all of those routing techniques, which are application agnostic, our proposed model connects the application level (i.e. task information) with the lower-level MoE layers to better deal with the interference of different tasks in the context of multilingual multitask applications.

Task-level Routing  Recently, task information has been used for improving MoE, e.g. in (Liu et al., 2023). Our model is, however, much simpler and can be trained end-to-end, unlike their approach, which requires clustering for off-the-shelf shared representation learning. Probably the most related work to ours is Mod-Squad (Chen et al., 2022), which shares our motivation while having several differences. First, their approach has multiple aids to make the task-based MoE work, with an additional loss for regularization, while we instead rely mainly on the simple motivation of incorporating task information into MoE. Second, we still stick to a single gate for routing, while they allocate multiple gates, one per task. Third, they additionally have MoE attention blocks, which make their architecture more complicated. Finally, our focused application is text-based machine translation, unlike the computer vision settings in both works mentioned.

3 Models

Transformer architecture (Vaswani et al., 2017) has been proven to be the core backbone of the pervasive successes in natural language processing, computer vision, and other artificial intelligence fields. The main bottleneck of this architecture is, however, its heavy blueprint, which leads to intensive resources in training and inference and is difficult to scale up. MoE is one powerful method to alleviate those problems in transformers.

3.0.1 Sparse Mixture-of-Experts (MoE)

MoE, which was first introduced before the deep learning era (Jacobs et al., 1991), was recently borrowed to address those drawbacks in the transformer architecture (Shazeer et al., 2017). In a nutshell, MoE creates an ensemble of experts in multi-layer transformer blocks in place of a single expert, typically in the form of a feed-forward neural network (FFN) that is dense with many parameters.

Formally, given an original FFN layer called $\tilde{E}$, we clone it into another layer containing a set of $N$ experts with exactly the same architecture, $\{E_i\}_{i=1}^{N}$. Likewise, the number of parameters for this particular layer is increased by a factor of $N$.

The typical granular level for applying those experts in the context of natural language processing is the token level. Given a token, its learned representation before the MoE layer is a vector $x$, and its post-MoE output $y$ is the weighted average of those experts' outputs:

$o_i = E_i(x)$    (1)

$y = \sum_{i=1}^{N} W_i o_i$    (2)

where $W_i$ is the weight (importance) of the corresponding expert $E_i$.

The key to MoE's power and its well-proven successes in tandem with the transformer architecture is its sparsity design: only one or a few experts are activated (i.e. have non-zero weight) at any point in time, in spite of the many more parameters just introduced by the ensemble. Typically the component responsible for this sparsity is a gate that is co-trained with the experts to route tokens to their target expert(s), eventually assigning only a single or a few non-zero weights across all experts per token, typically using softmax and a top-k method:

$g_{\mathrm{out}} = \mathrm{softmax}(W_g x)$    (3)

$G(x) = \mathrm{Top\_K}(g_{\mathrm{out}})$    (4)

With $G(x)$ being the set of $K$ chosen experts, Equation 2 becomes

$y = \sum_{i \in G(x)} W_i o_i$    (5)
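To make the gating concrete, below is a minimal PyTorch-style sketch of a sparse MoE layer implementing Eqs. (1)-(5). It is a simplified illustration under assumed module names and sizes, not the paper's actual implementation (which builds on DeepSpeed/fairseq and adds expert sharding and load-balancing losses).

```python
# Minimal sketch (PyTorch) of token-level sparse MoE, Eqs. (1)-(5).
# Module names and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        # N experts, each cloning the original FFN architecture (Sec. 3.0.1)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # W_g in Eq. (3)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        g_out = F.softmax(self.gate(x), dim=-1)               # Eq. (3)
        weights, chosen = torch.topk(g_out, self.k, dim=-1)   # Eq. (4): G(x)
        y = torch.zeros_like(x)
        # Eq. (5): sparse weighted sum over the K chosen experts only
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    y[mask] += w * expert(x[mask])             # o_i = E_i(x), Eq. (1)
        return y
```

In the large-scale systems the paper builds on, the experts live on different devices (GShard-style sharding), so the per-expert loop above would be replaced by an all-to-all token dispatch.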
The main architectural problem with this design is its scalability: the memory is quickly used up as we increase the number of experts, given the limited compute resources allocated to a single compute node in any distributed environment. GShard (Lepikhin et al., 2020) was born to fix this issue by trading memory for communication: allocating each expert to a single node and only aggregating them when needed, e.g. for gradient averaging in training or weight averaging when saving a model. This elegant design has unlocked MoE's scalability and practicality in enterprise-level deployments, especially with the follow-up work on optimizing the architecture for computation and communication, as mentioned in Section 2.

3.1 Task-based Adapters

Yet another problem on which we focus is not at the system level but at the higher application level. As mentioned, in the multitask setting, the interference amongst tasks that are inherently different from each other could lead to ineffective training. As a result, we employ task-based adapters to separate those different tasks into different adapters. Likewise, data (or tokens) from similar tasks should be routed to a similar group of adapters. There are three modes for those adapters.

The first and simplest is to allocate one adapter per individual task. Although this setting is straightforward and requires no additional computation for data routing, it has a drawback in acquiring new, unseen tasks: the model has to allocate a new adapter for each new task and fine-tune it with some amount of new data. Another potential problem is memory limitation if we want to extend to many new tasks in the future. This mode is called static, as shown in Figure 2a.

To enforce efficient learning of representations of similar task data, as well as to alleviate memory problems, we have dynamic (Figure 2b), where the number of adapters is less than the number of tasks. As a result, we intentionally guide the model to learn better cross-task representations in terms of similarity and dissimilarity. In other words, data from similar tasks should be routed to the same adapters and vice versa. In practice, we choose the number of adapters to be $\log_2(n)$, with $n$ being the number of tasks.

3.2 Task-based Adapters with MoE

In this section, we formulate the task-based adapters mentioned in Section 3.1 in combination with MoE, both of which are our core architecture components.

Given $M$ tasks, we allocate them into $L$ shared-task adapters ($L < M$). For every single token, we have the associated task information that makes up an input tuple $(x, t)$ per token. As before, $x$ is the representation vector from the input, and $t$ is the task representation vector learned by a task embedding.

Similarly to MoE, we use a learnable task gate $G_t$ that is responsible for this routing, with its input being the concatenation of the input components:

$G_t(x, t) = \mathrm{Top\_K}(x \oplus t)$    (6)

$y = \sum_{i \in G_t(x,t)} W_i o_i$    (7)

And since the number of adapters $L$ is less than the number of tasks $M$, we call this setting dynamic, as demonstrated in Figure 2b, as opposed to static (Figure 2a), where each task goes to its own individual adapter.

Our main model uses the shared task embedding representation for the task gate as well as the MoE gate, which we call shared-dynamic, as shown in Figure 2c.
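The sketch below, under assumed module and parameter names, illustrates the dynamic task-based routing of Eqs. (6)-(7): the token representation is concatenated with a learned task embedding, and a top-k task gate selects among L shared adapters, with L chosen as roughly log2 of the number of tasks as in Section 3.1. It is an illustration, not the authors' implementation; the bottleneck adapter design and the residual connection are assumptions.

```python
# Hedged sketch (PyTorch) of dynamic task-based adapter routing, Eqs. (6)-(7).
# Names, the bottleneck adapter design, and the residual connection are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAdapterLayer(nn.Module):
    def __init__(self, d_model: int, num_tasks: int, d_task: int = 64,
                 d_adapter: int = 256, k: int = 1):
        super().__init__()
        num_adapters = max(1, math.ceil(math.log2(num_tasks)))   # L < M, as in Sec. 3.1
        self.task_embed = nn.Embedding(num_tasks, d_task)        # task representation t
        self.task_gate = nn.Linear(d_model + d_task, num_adapters, bias=False)
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_adapter), nn.ReLU(), nn.Linear(d_adapter, d_model))
            for _ in range(num_adapters)
        )
        self.k = k

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); task_id: (num_tokens,) integer task labels
        t = self.task_embed(task_id)
        logits = self.task_gate(torch.cat([x, t], dim=-1))        # gate input is x ⊕ t, Eq. (6)
        weights, chosen = torch.topk(F.softmax(logits, dim=-1), self.k, dim=-1)
        y = torch.zeros_like(x)
        for slot in range(self.k):                                # Eq. (7): sum over chosen adapters
            for a, adapter in enumerate(self.adapters):
                mask = chosen[:, slot] == a
                if mask.any():
                    w = weights[:, slot][mask].unsqueeze(-1)
                    y[mask] += w * adapter(x[mask])
        return x + y  # residual around the adapter (assumption)
```

In the shared-dynamic variant, the same task embedding table would also be fed to the MoE gate of the layer below, so that both routing decisions see the same task representation.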
4 Experiment Setup

4.1 Data

We tackle the problem of multitask multilingual machine translation using data consisting of 10 different languages ranging from high-resource to low-resource ones, including English (En), French (Fr), German (De), Czech (Cs), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi (Hi), Turkish (Tr), and Gujarati (Gu). In more detail, the data for training, validation, and testing are listed in Table 1, where we can see that besides the high-resource ones, we have low-resource languages such as Estonian, Hindi, or Gujarati.

Those data are in the form of bitext in which one side is always English. As a result, we denote EX as the translation from English (E) to another language (X), and similarly XE for the other direction. Those data are populated from the popular WMT corpus (https://www.statmt.org/wmt20/index.html). For the given 1 English and 9 other languages, there are consequently 9 EX and 9 XE tasks. More information about the data can be found in Table 4 in Appendix A.
Figure 2: Our MoE models with variants. (a) Static: for each task, there is a separate adapter associated with it. (b) Dynamic: there are fewer adapters than tasks, in order to learn the shared representation of similar tasks. (c) Shared-Dynamic: the gates for the task adapters and MoE share the same embeddings for their routing decisions.

Split        Unit   de-en   fr-en   cs-en   et-en   fi-en   gu-en   hi-en   lv-en   ro-en
Training     M      4.6     10      10.3    0.7     4.8     0.9     0.3     1.4     0.5
Validation   K      3.0     3.0     3.0     2.0     1.4     2.0     0.5     2.0     2.0
Testing      K      3.0     3.0     3.0     2.0     1.4     2.0     0.5     2.0     2.0

Table 1: Training, validation, and testing sizes for all XE tasks (the data for EX are exactly the same). Note that the unit for training is millions (M) while that for both validation and testing is thousands (K), and the sizes are the same for validation and testing.

4.2 Task and Model Training

In this section, we describe the task, the evaluation metrics, and how we handle data and models for training.

Task  Our task is multi-task multilingual machine translation (MMMT), which uses the EX and XE pairs. Our single model is trained with two main capacities. First, this single model can translate all the training pairs with high accuracy. Second, the model is able to quickly acquire new translation pairs with only zero or a few shots.

Evaluation  While there are many evaluation metrics, we mainly use the BLEU score due to its popularity and credibility in evaluating machine translation tasks. This evaluation is implemented with SacreBLEU (https://github.com/mjpost/sacrebleu). We note that, unlike all available public implementations that we found, our implementation evaluates all BLEU scores on the fly along with the training, so there is no disruption for offline evaluation. That also helps in early stopping based on the BLEU scores on the validation sets.

Pre-Processing and Post-Processing  In terms of preprocessing, we first encode the data using the Byte-Pair Encoding (BPE) method and generate shared dictionaries where all the language pairs use the same vocabulary of size 64K, before feeding them to the model. To get accurate scores, for post-processing, we again use BPE decoding to reconstruct the whole translated sentences before comparing them with the original sentences prior to BPE pre-processing. Likewise, we treat all the processing and model manipulation as a black box for calculating the scores.
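A hedged sketch of the on-the-fly BLEU evaluation described above, using the public SacreBLEU Python API on post-processed (BPE-decoded) outputs; the helper name and surrounding training loop are illustrative, not the authors' code.

```python
# Minimal sketch of validation BLEU with SacreBLEU; helper name is illustrative.
import sacrebleu

def validation_bleu(hypotheses, references):
    """hypotheses: detokenized system outputs after BPE decoding;
    references: reference translations aligned with the hypotheses."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score  # e.g. compared across epochs for early stopping

# Illustrative usage:
# score = validation_bleu(decoded_valid_outputs, valid_references)
```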
XE Tasks
Model                  de-en  fr-en  cs-en  et-en  fi-en  gu-en  hi-en  lv-en  ro-en  Average
1. Dense               29.9   31.2   28.0   22.4   21.4   22.3   21.4   24.5   36.1   26.4
2. MoE Token           27.9   29.5   26.3   19.9   19.6   18.9   17.7   22.3   33.8   24.0
3. MoE Sentence        27.9   29.9   26.2   21.4   19.9   17.9   15.9   23.2   34.4   24.1
4. MoE Task-Static     32.1   33.3   30.7   24.3   23.4   20.6   22.5   27.2   38.8   28.1
5. MoE Task-Dynamic    31.4   32.0   29.1   23.4   22.1   18.9   20.5   25.5   37.2   26.7

EX Tasks
Model                  en-de  en-fr  en-cs  en-et  en-fi  en-gu  en-hi  en-lv  en-ro  Average
1. Dense               25.4   28.3   22.4   23.3   20.9   28.4   29.0   26.5   31.5   26.2
2. MoE Token           22.9   25.1   19.5   20.1   17.9   26.2   26.3   24.0   29.0   23.4
3. MoE Sentence        23.2   25.7   20.4   22.4   18.7   26.4   27.1   24.2   29.7   24.2
4. MoE Task-Static     29.5   32.5   27.9   27.4   25.8   28.8   30.8   32.2   34.6   29.9
5. MoE Task-Dynamic    27.3   29.6   25.0   24.7   22.7   27.7   29.3   28.4   32.7   27.5

Table 2: Comparison of task-based MoE models (models 4 & 5) to task-agnostic MoE models (models 2 & 3) and the non-MoE (Dense) model (model 1) in BLEU scores. With the help of task information, task-based MoE models show outperforming BLEU scores over all other model types across most of the tasks, including both high-resource and low-resource ones.

Model Configuration and Implementation  We use the transformer architecture (Vaswani et al., 2017) with 12 layers each for the encoder and the decoder, each of which uses a word embedding layer of dimension 1024 and a non-linear layer of dimension 4096. There are 16 attention heads and a dropout rate of 30%. For MoE, all jobs are trained on Azure cloud machines with 8 GPUs, and each job takes around 2 weeks for a model covering the 18 aforementioned tasks to reach decent scores. We apply early stopping based on the validation BLEU scores, with a non-increasing score after 2 epochs as the stopping condition. For the task-based information, we use a task embedding dimension of 64 and a task adapter hidden dimension of 256 for every single task adapter. Our implementation inherits the lower-level infrastructure code from Microsoft DeepSpeed and Fairseq (https://github.com/facebookresearch/fairseq).

As for the implementation, an important practical issue with MoE is load balancing among experts for the best utilization of the infrastructure systems. To enforce a balanced load during training, we therefore employ the auxiliary loss from Lepikhin et al. (2020).
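For reference, the configuration described above can be summarized as the following plain dictionary; the key names are illustrative and do not correspond to actual fairseq or DeepSpeed flags.

```python
# Hedged summary of the reported training configuration; key names are illustrative.
config = {
    "encoder_layers": 12,
    "decoder_layers": 12,
    "embed_dim": 1024,                 # word embedding dimension
    "ffn_dim": 4096,                   # non-linear (FFN) layer dimension
    "attention_heads": 16,
    "dropout": 0.3,
    "shared_bpe_vocab_size": 64_000,
    "experts_per_moe_layer": 8,        # see the MoE - Token baseline below
    "task_embed_dim": 64,
    "task_adapter_hidden_dim": 256,
    "early_stopping_patience_epochs": 2,   # on validation BLEU
    "gpus_per_job": 8,                 # Azure machines for MoE jobs
}
```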
4.2.1 Baselines

In order to show the performance of the task-based MoE models, the following baselines are selected:

Dense  This is the traditional transformer model without any MoE layer, i.e., no change to the fully connected (FFN) layer in each layer of the encoders or decoders.

MoE - Token  This is the MoE model that is usually considered the default option, where each FFN layer is replaced by an MoE layer. In our experiments, each MoE layer comprises 8 experts (each with the same size as the original FFN being replaced) and a gate for routing purposes.

MoE - Sentence  This is yet another MoE architecture with exactly the same configuration as the MoE - Token baseline. The difference is in the routing layer, which functions at a different granularity: sentences instead of tokens. In more detail, while the gate decides which expert to use for each token separately in the MoE - Token model, here it routes all tokens belonging to a single sentence to the same chosen expert.

5 Results and Discussions

5.1 Multitask Multilingual Machine Translation

We first present the main results for models capable of translating 18 tasks (see Section 4.2) concurrently. As shown in Table 2, our models that incorporate MoE layers and are enhanced with task information show great advantages over all the baseline models on most tasks, in both directions EX and XE, in accordance with our hypothesis that using task adapters in conjunction with MoE is helpful in multilingual multitask translation.
Model                        MoE   Task   MoE Routing   Task Routing      de-en  fr-en  et-en  fi-en  Average
MoE                          Y     N      Token         -                 32.4   33.7   24.2   23.6   28.5
Dense + Task Static          N     Y      -             Static            32.2   33.7   21.0   22.8   27.4
Dense + Task Dynamic         N     Y      -             Dynamic           31.9   33.0   22.0   22.5   27.4
MoE + Task Static            Y     Y      Task          Static            30.7   32.0   19.9   20.8   25.9
MoE + Task Dynamic           Y     Y      Task          Dynamic           32.6   33.9   24.0   23.9   28.6
MoE + Task Shared-Dynamic    Y     Y      Task          Shared-Dynamic    32.2   33.3   24.3   24.5   28.6

Table 3: Performance of different models with changes in whether MoE layers exist, whether Task Adapters exist, and how routing for those components is undertaken. The scores better than the baseline are highlighted. Task-based MoE shows advantages, especially with shared-dynamic adapters between MoE and Task Adapters on the low-resource language pairs.

An outstanding drawback that the task-based MoE models face, however, is on the low-resource translation pairs, e.g. Gu-En, Hi-En, or En-Gu. We hypothesize the problem is due to undersampling of the training data. Our training routine concatenates all the tasks' data into a single big dataset before drawing batches. However, without adjusting the sampling process, high-resource language pairs are trained significantly more, given their much larger data in place. In particular, for the case of Gujarati, where the Task-Dynamic MoE model underperforms in comparison to the baselines, our hypothesis is that linguistically, this language is the most different from all the other languages, which makes it very hard for the models to learn effective shared representations with any other pairs.

5.2 Ablation Study

5.2.1 Implications of Different Task Layers and MoE Layers

In this study, we limit the number of tasks to four (De-En, Fr-En, Et-En, and Fi-En), which can be divided into 2 groups of similar tasks: (De-En, Fr-En) is the first group and (Et-En, Fi-En) is the second one, to study the performance implications of different model variants when there is a task layer and/or an MoE layer.

As illustrated in Table 3, we again see that combining MoE and Task Adapters yields the best models, the same trend as shown in Table 2, particularly when the dynamic adapters are used to enforce similar tasks to share the same representations.

However, when task adapters are not used in conjunction with MoE, the performance is worse than MoE alone. This also means MoE should be the foundational infrastructure, and on top of that, task adapters should be used. It concurs with the motivation that the interference of different tasks or languages makes the training of experts difficult. In other words, there is not much help when there is only one expert for all the tasks (i.e. in Dense models).

5.2.2 Flexibility of Task-based MoE in Merging Two Trained Models

One of the important capabilities in multi-task learning, and in learning problems in general, is how to quickly acquire new capabilities given current models with minimal resources and effort. Aligned with this goal, this ablation explores how quickly our task-based MoE models can be merged, from 2 different models into a single new model that combines their capabilities.

In merging those two models, we restore the two respective checkpoints and merge layer-by-layer as follows. First, task-based adapters are kept and combined with each other: each model has 2 adapters (for the 4 tasks in that model) and the combined model has 4 adapters (for the 8 tasks in combination). Second, the task routers are also merged and changed so that the routing of each data point now has 4 selections instead of 2 outputs as in the previous models. Finally, the rest of the transformer and MoE layers have their weights averaged.
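The merging procedure just described can be sketched as a state-dict level operation, shown below. Parameter-name prefixes such as "task_adapter." and "task_gate." are assumptions for illustration; the paper does not specify its parameter naming.

```python
# Hedged sketch of merging two task-based MoE checkpoints (Sec. 5.2.2).
# Key prefixes ("task_adapter.", "task_gate.") are illustrative assumptions.
import torch

def merge_checkpoints(state_a: dict, state_b: dict) -> dict:
    """Keep both models' task adapters, widen the task router from 2 to 4
    outputs, and average all remaining transformer/MoE weights."""
    merged = {}
    for name, wa in state_a.items():
        wb = state_b[name]  # assumes both checkpoints share the same architecture
        if name.startswith("task_adapter."):
            # 1) Keep model A's adapters under their original names ...
            merged[name] = wa
        elif name.startswith("task_gate.out_proj"):
            # 2) Task router: stack the two routers' output rows so the merged
            #    gate selects among 4 adapters instead of 2.
            merged[name] = torch.cat([wa, wb], dim=0)
        else:
            # 3) The rest of the transformer and MoE layers: weight averaging.
            merged[name] = (wa + wb) / 2
    for name, wb in state_b.items():
        if name.startswith("task_adapter."):
            # ... and add model B's adapters under renamed keys.
            merged[name.replace("task_adapter.", "task_adapter_b.")] = wb
    return merged

# Illustrative usage:
# merged_state = merge_checkpoints(torch.load("model_a.pt"), torch.load("model_b.pt"))
```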
Figure 3: Ablation study on merging 2 checkpointed models with different capabilities; panels show (a) model 1, (b) model 2, and (c) the merged model. Model 1 is trained with 4 tasks: de-en, fr-en, et-en and fi-en. Model 2 is trained with the other 4 tasks: cs-en, gu-en, en-et, and en-fi. Although those 2 models are under-trained with only a few thousand steps, in the merged model that has the capabilities of the two combined, many pairs quickly pick up to similar levels as in the previous single models.

The tasks in the original two models are hand-picked as in Section 5.2.1 to form 2 different groups, each of which has 2 similar tasks. Model 1 has de-en, fr-en, et-en, and fi-en, while Model 2 has cs-en, gu-en, en-et and en-fi.

As shown in Figure 3, while the two original models have been trained with just a few thousand steps (a couple of hours), the combined model shows that it can quickly pick up their original capabilities with just a few hundred steps after merging. Although there are a few uncommon pairs that seem to fail, such as gu-en or en-et, the chart shows the optimistic result of combining trained models with our flexible task-based MoE architectures.

6 Conclusion

In the era of big data, large-scale models are more and more essential to big enterprises and institutions, where MoE in combination with transformer-based models has proven its great advantages very recently. It is, however, complicated to enable such an implementation in practice due to the difficulty of training a single model serving diverse tasks. The proposed task-based MoE, which employs task adapters in tandem with MoE, has shown promising advantages in the application of multitask multilingual machine translation. This novel design enforces a shared representation of similar tasks and separates different task data to counter the interference effects. In addition, it also offers the flexibility of changing adapters based on new tasks, or changing the MoE infrastructure, without affecting the application level. In the future, enforcing the shared representation learning explicitly using additional techniques such as contrastive learning or mutual information is also worth exploring.

7 Acknowledgements

The authors would like to thank Yiren Wang, Muhammad ElNokrashy, Alex Muzio, Akiko Eriguchi and other members of Microsoft's Machine Translation Group for their great feedback and help.
References

Chang-Qin Chen, Min Li, Zhihua Wu, Dianhai Yu, and Chao Yang. 2023. Ta-moe: Topology-aware large scale mixture-of-expert training. ArXiv, abs/2302.09915.

Z. Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik G. Learned-Miller, and Chuang Gan. 2022. Mod-squad: Designing mixture of experts as modular multi-task learners. ArXiv, abs/2212.08066.

Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake A. Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, T. W. Hennigan, Matthew G. Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, L. Sifre, Simon Osindero, Oriol Vinyals, Jack W. Rae, Erich Elsen, Koray Kavukcuoglu, and Karen Simonyan. 2022. Unified scaling laws for routed language models. In International Conference on Machine Learning.

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Z. Chen, and Claire Cui. 2021. Glam: Efficient scaling of language models with mixture-of-experts. ArXiv, abs/2112.06905.

William Fedus, Jeff Dean, and Barret Zoph. 2022. A review of sparse expert models in deep learning. ArXiv, abs/2209.01667.

William Fedus, Barret Zoph, and Noam M. Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23:120:1–120:39.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computation, 3:79–87.

Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andrés Felipe Cruz-Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, and Hany Hassan Awadalla. 2021. Scalable and efficient moe training for multitask multilingual models. ArXiv, abs/2109.10465.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668.

Zhili Liu, Kai Chen, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, and James Tin-Yau Kwok. 2023. Task-customized masked autoencoder via mixture of cluster-conditional experts. In International Conference on Learning Representations.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In North American Chapter of the Association for Computational Linguistics.

Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. 2022. Lifting the curse of multilinguality by pre-training modular transformers. In North American Chapter of the Association for Computational Linguistics.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. 2021. Hash layers for large sparse models. In Neural Information Processing Systems.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Sen Wu, Hongyang Zhang, and Christopher Ré. 2020. Understanding and improving information transfer in multi-task learning. ArXiv, abs/2005.00944.

Seniha Esen Yüksel, Joseph N. Wilson, and Paul D. Gader. 2012. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23:1177–1193.

Zhiyuan Zeng and Deyi Xiong. 2023. Scomoe: Efficient mixtures of experts with structured communication. In International Conference on Learning Representations.

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph Gonzalez, and Ion Stoica. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. ArXiv, abs/2201.12023.

Yan-Quan Zhou, Tao Lei, Han-Chu Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon. 2022. Mixture-of-experts with expert choice routing. ArXiv, abs/2202.09368.

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam M. Shazeer, and William Fedus. 2022. St-moe: Designing stable and transferable sparse expert models.
A WMT Data Information

Code   Language    Test Split
de     German      wmt2013
fr     French      wmt2013
cs     Czech       wmt2013
et     Estonian    wmt2018dev
fi     Finnish     wmt2015
gu     Gujarati    wmt2019dev
hi     Hindi       wmt2014dev
lv     Latvian     wmt2017dev
ro     Romanian    wmt2016dev

Table 4: More details about our datasets, for comparison and reproducibility.
