Trans-Tokenization
1 Introduction
Multilingual tokenization is unfair, with all existing approaches inadvertently favoring
some languages over others (Petrov et al., 2023; Rust et al., 2021). This bias is particularly
pronounced in multilingual subword tokenization techniques, which face the impossible
task of distributing their token capacity equitably among all supported languages. Western
European languages often benefit from this, thanks to their shared alphabet and linguistic
heritage (Limisiewicz et al., 2023). Although character or byte-level encoders appear to
handle diverse scripts more fairly, they frequently struggle to capture meaningful word-level
information, especially in non-ideographic languages with limited alphabets (Libovický
et al., 2022; Edman et al., 2022). Furthermore, byte-level tokenizers also display bias due to
the substantial disparities in Unicode encoding efficiency across languages.
In light of these challenges, we stress the need for a more personalized approach, where
each language is equipped with its own tokenizer, specifically tailored to its unique needs.
Unfortunately, the challenge of developing monolingual language models for all the world's
languages has never been more pressing, given the vast amounts of data required to train
large language models (LLMs), as evidenced by the technical reports of Mistral (2023),
OLMo (2024) and Gemma (2024). The trillions of tokens required to train LLMs simply do
not exist in most languages (Joshi et al., 2020), turning transfer learning into a requirement.
Moreover, serving a wide array of monolingual LLMs at scale remains impractical. Efficient
computation necessitates the batch-processing of requests (Pope et al., 2022), but many
languages also exhibit intermittent workloads. This makes it unsustainable to
dedicate extensive GPU resources to continuously hosting often-idling LLMs, while the time
required to load them back into memory impedes many commercial applications that
require low latency (Alizadeh et al., 2024).
In this paper, we introduce several key innovations designed to democratize the training
and deployment of high-quality monolingual models across a diverse set of languages.
More specifically, we demonstrate how model conversion enables researchers to adapt
LLMs to new languages using a very limited amount of resources, with a performance
competitive with continual pre-training. Our approach preserves most layers of the original
model, thereby facilitating the batch-processing of queries written in different languages, a
critical factor in making the deployment of language-specific models economically viable.
Figure 1: (a) Token alignment is performed first, based on a tokenized parallel corpus and an SMT-based alignment tool, to establish a probabilistic token mapping (e.g. mapping the English tokens en_some and en_few onto the Dutch token nl_sommige). We provide snippets of each stage of the full pipeline in Appendix E. (b) Embedding mapping is then performed: the embedding table for the target language (e.g. Dutch, indicated by nl) is initialized from the embeddings of mapped tokens in the source language (e.g. English, indicated by en), while the hidden layers are preserved.
3 Trans-Tokenization
Tokenizers limit the range of languages that a model can effectively support. Even when
performance for an unseen language is acceptable, the words of this language are often
encoded as many small subword tokens, reducing the effective context length, and these
tokens are rarely trained properly. As a consequence, the embeddings for these tokens are
also less meaningful, even for languages that share an ancestral language or significant
language contact with a supported one. For instance, the English word ‘music’ was borrowed
from the French ‘musique’, which is encoded by two tokens (‘mus’ + ‘ique’) in most English
BPE-based tokenizers. However, the ‘mus’ token is unlikely to have been pretrained well,
since ‘music’ exists as a single token and ‘mus’ therefore rarely appears in English training data.
To address this problem, we intuitively want to create a mapping between tokens based
on a translation scheme, instead of relying on orthographic or morphological similarity.
However, subword tokenization prevents the direct use of word translation dictionaries.
To achieve this mapping, our Trans-Tokenization method therefore relies on two steps,
as shown in Figure 1: (i) a token alignment generated using a parallel corpus and (ii) an
embedding mapping. Depending on the model, there is also a third step, where the untied
language modeling head undergoes the same mapping as the embedding table.
Token alignment: We start by tokenizing both sides of a parallel corpus using either
the source or target tokenizer, but re-encode words as single units1 for non-ideographic
languages. Next, we pass this tokenized parallel corpus through a Statistical Machine
Translation (SMT) model, FastAlign by Dyer et al. (2013). FastAlign provides a probabilistic
token mapping based on the real-world evidence extracted from the parallel corpus (e.g.
revealing that the Dutch token vijftien is matched with fifteen about 52% of the time,
with 15 about 46% of the time, and with the token pair Fif + teen the remaining 2% of the time).
1We determine word boundaries using the definition of ‘letter’ in Unicode (\p{L}) and tokenization:
mergeable tokens lack an initial whitespace marker (e.g. token rather than _token) or start with a word continuation sign (##izer),
depending on the tokenizer. We perform SMT alignment at a word-level instead of at a token level,
since tokens often occur in multiple words. Preliminary experiments showed that the mappings
obtained from using tokens without re-merging were of lower quality, with more noise. If needed, we
split up word mappings back to individual tokens in the next stage, where we map the embeddings.
2When the lengths do not match, tokens are matched proportionally to their relative position (e.g.
for 2-vs-3, the first target token would be matched partially to the first and second source tokens, with
token match counts of respectively 2⁄3 and 1⁄3 of the initial word match count, thus preserving the total).
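To make these two steps concrete, the following Python sketch (our own illustration, not the authors' released trans-tokenizers code) normalizes word-level alignment counts in the format of Listing 3 (Appendix E) into probabilities, and redistributes a word-level weight over token pairs using the proportional scheme of footnote 2; the exact input format is an assumption based on the appendix snippets.

from collections import defaultdict

def normalize_counts(count_lines):
    """Turn '<count> <target_word> <source_token...>' lines (as in Listing 3)
    into a probabilistic word-level mapping, normalized per target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in count_lines:
        count, target, *source = line.split()
        counts[target][tuple(source)] += int(count)
    return {target: {src: c / sum(srcs.values()) for src, c in srcs.items()}
            for target, srcs in counts.items()}

def split_word_weight(weight, n_target_tokens, n_source_tokens):
    """Distribute a word-level match weight over (target, source) token pairs by
    relative position, as in footnote 2: for 2-vs-3, the first target token
    receives 2/3 and 1/3 of the weight from the first two source tokens."""
    pairs = defaultdict(float)
    for i in range(n_target_tokens):      # target token i covers [i/n_t, (i+1)/n_t)
        for j in range(n_source_tokens):  # source token j covers [j/n_s, (j+1)/n_s)
            overlap = (min((i + 1) / n_target_tokens, (j + 1) / n_source_tokens)
                       - max(i / n_target_tokens, j / n_source_tokens))
            if overlap > 0:
                # scale so each target token's incoming weights sum to `weight`
                pairs[(i, j)] += weight * overlap * n_target_tokens
    return dict(pairs)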
4 Hydra LLMs
After adapting an English language model to a new language using the method described
above, we can also leverage our mapped embedding space to create models which accept
tokens from both tokenizers. We refer to these models as ‘Hydra’ LLMs, in reference to
their ability to stand on multiple legs (embedding tables) and grow multiple (language
modelling) heads. These Hydra LLMs can be utilized for tasks such as the translation of
texts or instructions from the source language to the target language, by encoding the source
language using the initial tokenizer and producing new tokens in the target language using
the newly-trained tokenizer. This approach is analogous to code-switching.
We envision several configurations of Hydra LLMs in this article, but focus our experiments
on zero-shot cross-lingual translation from the source language to the target language,
as we believe this task to be the most promising and the most reliably measurable using
well-established metrics. To test our hypothesis, we extend the popular Transformers library
from HuggingFace (Wolf et al., 2020) by introducing a new LlamaHydraForCausalLM class.
The most important difference of Hydra models lies in their use of distinct input and output
vocabularies; while the input vocabulary includes the output vocabulary, it also contains
one or several other embedding tables used to support tokens from other languages. To use
the embeddings located beyond the main tokenizer, an offset can be added to the token ids
produced by the additional tokenizers. To perform back-propagation, the labels of tokens
located beyond the output vocabulary should be set to a masked value (e.g. -100).
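As a minimal illustration of this bookkeeping (a hypothetical helper under our assumptions about the embedding layout, not the actual LlamaHydraForCausalLM implementation), the id offset and label masking could look as follows:

import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss

def encode_hydra_pair(source_tokenizer, target_tokenizer, source_text, target_text):
    """Hypothetical sketch: build Hydra input ids and labels for a translation pair.
    We assume the extra (source-language) embedding table is stored after the
    target table, so source token ids are shifted by the target vocabulary size;
    their labels are masked because they fall outside the output vocabulary."""
    offset = len(target_tokenizer)
    src_ids = [i + offset for i in source_tokenizer(source_text)["input_ids"]]
    tgt_ids = target_tokenizer(target_text)["input_ids"]
    input_ids = torch.tensor([src_ids + tgt_ids])
    labels = torch.tensor([[IGNORE_INDEX] * len(src_ids) + tgt_ids])
    return input_ids, labels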
We hypothesize (but did not verify) that the two bottom layers of the source model should
probably be used to encode tokens from the original vocabulary instead of the layers
finetuned for the target language. However, in our experiments, only the weights of the
trans-tokenized model are used for inference, as this did not seem to cause any issue.
5 Experimental setup
In the next sections, we discuss the performance of our method for several languages, with
a focus on low-resource (§ 6.1, § 6.2, § 6.3, § 6.4) and mid-resource languages (§ 6.5, § 6.6).
To test the capabilities of our transfer learning method in a worst-case scenario, we decided
to evaluate it on Tatar, an endangered low-resource language which has few similarities
with English. Indeed, 75% of the 8070 languages encoded in URIEL (Littell et al., 2017) are
more similar to English than Tatar. This figure remains identical if we only consider the 184
languages featuring a two-letter code, as a proxy for language prominence.
Additionally, none of the 10 languages supported by our translation model at initialization
feature a URIEL similarity of more than 35% with Tatar (as a comparison point, English
has a 40% similarity with Korean and 28% with Chinese). Finally, there is only a limited
amount of training data for the language. For example, Tatar Wikipedia contains only 1.43%
as many articles as English Wikipedia (68th out of 278 languages).
We also evaluate trans-tokenization performance on Armenian, an Indo-European language
with distinctive characteristics, such as an entirely unique writing script and the absence of
any closely related languages within its sub-group. Since Armenian is closer to English
than Tatar is, we expect better results for Armenian than for Tatar on language modeling
tasks (lower perplexity).
Finally, we wrap up our evaluations with Dutch, a Western Germanic language very close to
English, and for which more resources are available, enabling us to test more conclusively
the capabilities of our models in factuality and reasoning. We also finetune our Dutch model
using a Chat dataset, to compare its capabilities with other existing models.
Our evaluations cover a wide range of tasks, ranging from classical language modeling to
language understanding and text summarization for low-resource languages,
and extending to more advanced SQuAD-type evaluations for our mid-resource languages.
We also evaluate our Hydra LLMs using zero-shot translation from English to Tatar, a
challenging language pair for which no high-quality dataset exists.
For our low-resource and mid-resource experiments, we train several models and baselines.
We start from Mistral-7B (Team MistralAI et al., 2023), as this is a high-quality model for
English. We also perform some ablation studies in the low-resource setting to understand
which initialization and finetuning approach works best. To keep the result tables compact,
the strategies used for training these models are detailed below:
Mistral We use the target language in the prompt with Mistral (2023), without finetuning.
This strategy relies on the original model’s pre-existing understanding of the lan-
guage from its training corpus. While effective for well-resourced languages, it
is unlikely to yield good results for low-resource languages due to limited data
exposure during pre-training. Nevertheless, for languages with more resources like
Dutch, the source model provides a solid baseline.
Mistral+FT We perform continual pre-training using the original tokenizer of the language
model. Although BPE tokenizers are universal encoders (Sennrich et al., 2016), most
merged tokens cater to prominent languages, resulting in inefficient encoding for
low-resource ones.
MistralRAND We reinitialize the embedding table and language modeling head, retraining
them using the in-domain corpus. While effective for high-resource languages, this
strategy leads to substantially degraded performance for low-resource languages.
MistralAVG As an improvement over the preceding strategy, we restore the embeddings of
tokens shared between the source and target tokenizers. For Tatar, this concerns
only around 12% of the tokens. The embeddings of all remaining tokens are then
initialized with the average of the previously-mapped embeddings (to keep them
in distribution); a sketch of this initialization is shown after this list.
WECHSEL We apply WECHSEL (Minixhofer et al., 2022) to the languages in our setup,
initializing the embeddings of tokens using a bilingual dictionary derived from our
SMT-aligned corpus. For Dutch, we test WECHSEL with (i) the original bidirec-
tional dictionary and (ii) an equal-sized dictionary derived from our SMT-aligned
corpus. For Tatar, we only follow the latter strategy. We do this since the original
dictionaries were of extremely low quality: the Dutch one contained approximately
50% inaccurate or completely wrong translations3 and the Tatar dictionary contained
mostly text in the wrong language and script, making any comparison unfair.
Tweety Finally, we apply our trans-tokenization to initialize the embedding tables, as intro-
duced in Section 3, to improve transfer learning by providing initialization for most
tokens based on a cross-lingual token alignment. This strategy yields good results
across the board and we refer to the resulting models as Tweety (Appendix M).
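The sketch below illustrates the MistralAVG initialization referenced in the list above; it is an illustration under our assumptions (function and variable names are ours), not the exact training code.

import torch

def mistral_avg_init(source_embeddings: torch.Tensor,
                     source_vocab: dict, target_vocab: dict) -> torch.Tensor:
    """MistralAVG-style initialization: tokens shared between the source and target
    tokenizers keep their source embedding; all other target tokens are set to the
    average of those restored embeddings, so they start in distribution."""
    shared = {tok: idx for tok, idx in target_vocab.items() if tok in source_vocab}
    shared_rows = source_embeddings[[source_vocab[tok] for tok in shared]]
    mean = shared_rows.mean(dim=0)

    target = mean.repeat(len(target_vocab), 1)  # default: in-distribution average
    for tok, tgt_id in shared.items():
        target[tgt_id] = source_embeddings[source_vocab[tok]]
    return target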
6 Evaluations
The first way in which we evaluate the model adaptation strategies is by reporting the
validation perplexity of the trained models. To ensure a fair comparison between models
having different tokenizers, we report the “per native token” perplexity (that is, we nor-
malize the perplexity reported by our library relative to the number of tokens required to
represent a Tatar text using the tokenizer of the model, as detailed by Mielke (2019)).
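Concretely, one way to compute this normalization (a sketch based on our reading of Mielke (2019), not necessarily the exact evaluation script) is to renormalize the total negative log-likelihood of the validation text by the token count of a fixed 'native' reference tokenizer:

import math

def per_native_token_perplexity(total_nll: float, n_native_tokens: int) -> float:
    """Perplexity normalized per 'native' token: the total negative log-likelihood
    of the text is divided by a reference token count, so models with different
    tokenizers become comparable."""
    return math.exp(total_nll / n_native_tokens)

def renormalize(model_ppl: float, n_model_tokens: int, n_native_tokens: int) -> float:
    """Equivalent reformulation starting from a model-reported per-token perplexity."""
    total_nll = n_model_tokens * math.log(model_ppl)
    return math.exp(total_nll / n_native_tokens)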
Table 1: Perplexity for the Tatar language, compared with the normalized perplexities of
our baselines and ablation studies. Similar results for Armenian can be found in Appendix K
along with a qualitative analysis of its SMT mapping.
We also evaluate our adaptation strategies with the 1-shot performance of generative models
on the SART Word Analogies dataset (Khusainova et al., 2023), and compare them with
existing word embedding baselines. Despite looking trivial, this task remains quite
challenging in a 1-shot setting due to the lack of an instruction.
Table 2: Accuracy of models on the Tatar semantic word analogies from the SART dataset.
Refer to Appendix F for a detailed scoring per analogy type, and analysis thereof.
The 1-shot text summarization task is the third way we use to evaluate our Tatar models.
We compute ChrF (Popović, 2015) to compare the generated summaries and the reference.
We report our results in Table 3 and a description of the eval corpus in Appendix G.
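For reference, ChrF can be computed with the sacrebleu library; the snippet below is a generic illustration with placeholder data, not necessarily the exact evaluation script used here.

from sacrebleu.metrics import CHRF

chrf = CHRF()  # default character n-gram F-score settings

hypotheses = ["..."]    # generated Tatar summaries (placeholders)
references = [["..."]]  # one list per reference set, aligned with the hypotheses

result = chrf.corpus_score(hypotheses, references)
print(result.score)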
Table 3: Textual similarity of generated summaries with a reference. For the Mistral + Google
Translate results, the Tatar input is translated to English, summarized by Mistral, and the
summary is translated back to Tatar before being scored.
To evaluate our Hydra models, we focus on three English-to-Tatar machine translation tasks:
two experiments relying on our text summarization dataset, as well as one smaller-scale
evaluation on short social media messages scraped from Mastodon. For the latter, we paid a
professional translator to provide high-quality references. Refer to Appendix H for a more
detailed description of the datasets.
For the long text translation task, we showcase the advantage of using LLMs in translation
systems, providing the gold-standard translation of a short text as a 1-shot example in the prompt, to
perform neural fuzzy repair (Bulte & Tezcan, 2019, +NFR in Table 5).
The translations of the 125 social media messages were also ranked pairwise by one of the
authors, a native Tatar speaker. When no translation was good enough, neither received a
preference vote. The professional translation won 51 pairwise votes, Google Translate 29,
HydraTowerFT 24, and Microsoft Translator 10.
This confirms that HydraLLMs are competent machine translation systems.
Table 4: Machine translation scores (ChrF) between texts and their reference translations.
Social media references were produced by a professional translator in Tatarstan. The Google
Translate results on this set are struck through because of possible data contamination; see
Appendix I. RandomInDistrib refers to the average score obtained by comparing random
pairs of texts from the reference sets, and serves as an absolute baseline. ParFT refers to
finetuning the model on the parallel data used to initialize the Hydra embeddings. BackFT
refers to finetuning the model on a small but high-quality set of Tatar text back-translated to
English using Google Translate.
To evaluate our method on a mid-resource language, we train a Dutch model for 40
GPU hours and 417M tokens (see Appendix L for all details). We first compute a validation
perplexity on the ‘tiny’ subset of the Dutch section of C4 (Raffel et al., 2019).
We trans-tokenize Mistral-7B (Team MistralAI et al., 2023) to use the vocabulary of GPT NEO
1.3b Dutch (Havinga, 2024) to make an easy comparison between both models, especially
since we also train on the same dataset. Despite a significantly lower number of training
tokens (417M versus 33B), our model obtains a perplexity of 11.1, compared to GPT NEO
with 21.2. Mistral-7B reports a lower per-token perplexity, but its tokenizer is not adapted to
Dutch: it splits the same text into more tokens, which spreads the loss over more predictions
and artificially lowers the per-token perplexity (Mielke, 2019). Based on the evaluation token
counts, Mistral-7B needs 33.1% more tokens.
We also compare to related works in the mid-resource setting, more specifically WECH-
SEL (Minixhofer et al., 2022), FOCUS (Dobler & de Melo, 2023) and MaLA-500, an adaptation
of Llama 2 for 534 languages (Lin et al., 2024). For WECHSEL, we test two variations:
(i) the original bidirectional dictionary and (ii) an improved one based on our token
mapping, as explained in § 5.1.
7 Discussion
Advantages over other approaches. As our results demonstrate, our language adaptation
method is capable of producing high-quality LLMs for low-resource languages, at a fraction
of the cost of training similar-sized LLMs from scratch, and with improved performance over
continual pretraining. Unlike massively multilingual models, which inherently create
tokenization unfairness, our trans-tokenized monolingual models offer each language an
equal share of the embedding budget. This could in turn enable layer-sharing between
languages. Finally, our work demonstrates that evidence-based SMT mappings perform
better than traditional character-based embedding reinitialization techniques.
HydraLLMs. Hydra LLMs extend the trans-tokenization concept to enable zero-shot cross-
lingual tasks. In English-to-Tatar translation (Table 4), the HydraTower model performs
competitively with commercial systems, and can further benefit from high-quality finetun-
ing. This demonstrates the potential of Hydra LLMs for low-resource machine translation
without extensive high-quality parallel data, leveraging the strengths of large language
models in cross-lingual scenarios. However, our setup only enables translation in the
high-to-low resource direction, as our finetuning causes the LLM to lose its fluency in the source
language. We leave the investigation of the reverse direction as future work.
8 Conclusion
In this work, we have introduced a novel approach to the adaptation of LLMs for low-
resource languages through cross-lingual vocabulary transfers. Our experiments with
the Tweeties series of trans-tokenized LLMs and Hydra LLMs have demonstrated the
effectiveness of our approach across a range of downstream tasks and languages.
Notably, the development of a state-of-the-art machine translation model for Tatar, achieved
in a zero-shot manner with Hydra LLMs, underscores the potential of our strategy to
make significant strides in language technology for languages that have historically been
underrepresented in NLP research.
We hope that our contributions will inspire further exploration and innovation in the field,
and that the limitations we mentioned in Appendix A will be addressed in future works,
some of which we already suggest in Appendix B. We are eager to read your works!
Author Contributions
All authors participated in the paper writing and the experimental design. In addition,
François Remy and Alfiya Khabibulina worked on the Tatar experiments. Pieter Delobelle
worked on the Dutch experiments. Hayastan Avetisyan worked on the Armenian experi-
ments. Finally, Miryam de Lhoneux and Thomas Demeester participated in the ideation
process and provided guidance and feedback.
Acknowledgments
We thank Matthieu Meeus and Anthony Rathé for kickstarting this line of research with
their personal project based on BLOOM and Llama 2.
François Remy received the financial support of the Vlaams Agentschap Innoveren & On-
dernemen (VLAIO) through its ADAM project. This research also received funding from
the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI)
Vlaanderen” programme, and from the Research Foundation – Flanders (FWO-Vlaanderen)
under the project G0C2723N. Pieter Delobelle was also supported by the Research Foun-
dation - Flanders (FWO) under EOS No. 30992574 (VeriLearn) and received a grant from
“Interne Fondsen KU Leuven/Internal Funds KU Leuven”.
The resources and services used in this work were in part provided by the VSC (Flemish
Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the
Flemish Government, and in part by the GPULab, the machine learning infrastructure for
AI computing built in collaboration between UGent, UAntwerpen and the imec research
and development center.
References
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, and Benoît Sagot. Towards a cleaner
document-oriented multilingual crawled corpus. In Proceedings of the Thirteenth Language
Resources and Evaluation Conference, pp. 4344–4355, Marseille, France, June 2022. European
Language Resources Association. URL https://aclanthology.org/2022.lrec-1.463.
Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo
C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. Llm in a flash: Efficient
large language model inference with limited memory, 2024.
Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin
Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo,
José G. C. de Souza, and André F. T. Martins. Tower: An open multilingual large language
model for translation-related tasks, 2024.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Translation artifacts in cross-lingual trans-
fer learning. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.
7674–7684, Online, November 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.emnlp-main.618. URL https://aclanthology.org/2020.emnlp-main.
618.
Bram Bulte and Arda Tezcan. Neural fuzzy repair: Integrating fuzzy matches into neu-
ral machine translation. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.),
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.
1800–1809, Florence, Italy, July 2019. Association for Computational Linguistics. doi:
10.18653/v1/P19-1175. URL https://aclanthology.org/P19-1175.
Wietse de Vries, Martijn Bartelds, Malvina Nissim, and Martijn Wieling. Adapting mono-
lingual models: Data can be scarce when language similarity is high. In Chengqing
Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Findings of the Association for Com-
putational Linguistics: ACL-IJCNLP 2021, pp. 4901–4907, Online, August 2021. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.433. URL
https://aclanthology.org/2021.findings-acl.433.
Konstantin Dobler and Gerard de Melo. FOCUS: Effective embedding initialization for
monolingual specialization of multilingual models. In Houda Bouamor, Juan Pino, and
Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, pp. 13440–13454, Singapore, December 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.emnlp-main.829. URL https://aclanthology.org/
2023.emnlp-main.829.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameter-
ization of IBM model 2. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff (eds.),
Proceedings of the 2013 Conference of the North American Chapter of the Association for Compu-
tational Linguistics: Human Language Technologies, pp. 644–648, Atlanta, Georgia, June 2013.
Association for Computational Linguistics. URL https://aclanthology.org/N13-1073.
Lukas Edman, Antonio Toral, and Gertjan van Noord. Subword-delimited downsampling
for better character-level translation. In Yoav Goldberg, Zornitsa Kozareva, and Yue
Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 981–
992, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational
Linguistics. doi: 10.18653/v1/2022.findings-emnlp.69. URL https://aclanthology.org/
2022.findings-emnlp.69.
Xavier Garcia, Noah Constant, Ankur Parikh, and Orhan Firat. Towards continual
learning for multilingual machine translation via vocabulary substitution. In Kristina
Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven
Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of
the 2021 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pp. 1184–1192, Online, June 2021. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.93. URL
https://aclanthology.org/2021.naacl-main.93.
Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, and Paolo Torroni. Fast vocabu-
lary transfer for language model compression. In Yunyao Li and Angeliki Lazari-
dou (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing: Industry Track, pp. 409–416, Abu Dhabi, UAE, December 2022. Associa-
tion for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-industry.41. URL
https://aclanthology.org/2022.emnlp-industry.41.
Team Gemma and Google DeepMind. Gemma: Open models based on gemini research
and technologies. Google, 2024. URL https://storage.googleapis.com/deepmind-media/
gemma/gemma-report.pdf.
Evangelia Gogoulou, Ariel Ekgren, Tim Isbister, and Magnus Sahlgren. Cross-lingual
transfer of monolingual models. In Nicoletta Calzolari, Frédéric Béchet, Philippe
Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isa-
hara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis
(eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 948–
955, Marseille, France, June 2022. European Language Resources Association. URL
https://aclanthology.org/2022.lrec-1.100.
Google. Five new languages now available in google translate, 2020. URL https://blog.
google/products/translate/five-new-languages/.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In
International Conference on Learning Representations, 2022. URL https://openreview.net/
forum?id=nZeVKeeFYf9.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The
state and fate of linguistic diversity and inclusion in the NLP world. In Dan Jurafsky,
Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pp. 6282–6293, Online, July 2020.
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.560. URL
https://aclanthology.org/2020.acl-main.560.
Alexander Kalinowski and Yuan An. A survey of embedding space alignment methods for
language and knowledge graphs, 2020.
Albina Khusainova, Adil Khan, and Adín Ramírez Rivera. SART - similarity, analogies, and
relatedness for tatar language: New benchmark datasets for word embeddings evaluation.
In Alexander Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, pp.
380–390, Cham, 2023. Springer Nature Switzerland. ISBN 978-3-031-24337-0.
Jindřich Libovický, Helmut Schmid, and Alexander Fraser. Why don’t people use character-
level machine translation? In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio
(eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 2470–2485,
Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/
2022.findings-acl.194. URL https://aclanthology.org/2022.findings-acl.194.
Tomasz Limisiewicz, Jiří Balhar, and David Mareček. Tokenization impacts multilingual
language modeling: Assessing vocabulary allocation and overlap across languages. In
Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association
for Computational Linguistics: ACL 2023, pp. 5661–5681, Toronto, Canada, July 2023. As-
sociation for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.350. URL
https://aclanthology.org/2023.findings-acl.350.
Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André FT Martins, and Hinrich Schütze. Mala-500:
Massive language adaptation of large language models. arXiv preprint arXiv:2401.13303,
2024.
Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. OpenSubtitles2018: Statistical rescoring
of sentence alignments in large, noisy parallel corpora. In Nicoletta Calzolari, Khalid
Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara,
Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios
Piperidis, and Takenobu Tokunaga (eds.), Proceedings of the Eleventh International Conference
on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European
Language Resources Association (ELRA). URL https://aclanthology.org/L18-1275.
Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori
Levin. URIEL and lang2vec: Representing languages as typological, geographical, and
phylogenetic vectors. In Mirella Lapata, Phil Blunsom, and Alexander Koller (eds.),
Proceedings of the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, pp. 8–14, Valencia, Spain, April 2017. Association for
Computational Linguistics. URL https://aclanthology.org/E17-2002.
Microsoft. Overcoming language barriers with microsoft azure transla-
tor, Oct 2021. URL https://news.microsoft.com/en-cee/2021/10/11/
overcoming-language-barriers-with-microsoft-azure-translator/.
Sabrina J. Mielke. Can you compare perplexity across different segmentations?, Apr 2019.
URL https://sjmielke.com/comparing-perplexities.htm.
Benjamin Minixhofer, Fabian Paischer, and Navid Rekabsaz. WECHSEL: Effective initializa-
tion of subword embeddings for cross-lingual transfer of monolingual language models.
In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.),
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, pp. 3992–4006, Seattle, United States, July
2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.293.
URL https://aclanthology.org/2022.naacl-main.293.
Vladislav Mosin, Igor Samenko, Borislav Kozlovskii, Alexey Tikhonov, and Ivan P.
Yamshchikov. Fine-tuning transformers: Vocabulary transfer. Artificial Intelligence, 317:
103860, 2023. ISSN 0004-3702. doi: https://doi.org/10.1016/j.artint.2023.103860. URL
https://www.sciencedirect.com/science/article/pii/S0004370223000061.
Dan Saattrup Nielsen. ScandEval: A Benchmark for Scandinavian Natural Language Pro-
cessing. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa),
pp. 185–201, May 2023.
Team NLLB, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth
Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard,
Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Bar-
rault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett,
Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews,
Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj
Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers,
Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind: Scaling
human-centered machine translation, 2022.
Team OLMo, Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney,
Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang,
Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan,
Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill,
Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters,
Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith,
Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert,
Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith,
and Hannaneh Hajishirzi. Olmo: Accelerating the science of language models, 2024.
Pedro Javier Ortiz Suárez, Benoit Sagot, and Laurent Romary. Asynchronous pipelines for
processing huge corpora on medium to low resource infrastructures. Proceedings of the
Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff,
22nd July 2019, pp. 9 – 16, Mannheim, 2019. Leibniz-Institut für Deutsche Sprache. doi:
10.14618/ids-pub-9021. URL http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215.
Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language
model tokenizers introduce unfairness between languages. In A. Oh, T. Neu-
mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in
Neural Information Processing Systems, volume 36, pp. 36963–36990. Curran Asso-
ciates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
file/74bb24dca8334adce292883b4b651eda-Paper-Conference.pdf.
Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and
Peyman Passban. Investigating backtranslation in neural machine translation. In Juan An-
tonio Pérez-Ortiz, Felipe Sánchez-Martínez, Miquel Esplà-Gomis, Maja Popović, Celia
Rico, André Martins, Joachim Van den Bogaert, and Mikel L. Forcada (eds.), Proceedings of
the 21st Annual Conference of the European Association for Machine Translation, pp. 269–278,
Alicante, Spain, May 2018. URL https://aclanthology.org/2018.eamt-main.25.
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury,
Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently
scaling transformer inference, 2022.
Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Ondřej
Bojar, Rajan Chatterjee, Christian Federmann, Barry Haddow, Chris Hokamp, Matthias
Huck, Varvara Logacheva, and Pavel Pecina (eds.), Proceedings of the Tenth Workshop on
Statistical Machine Translation, pp. 392–395, Lisbon, Portugal, September 2015. Association
for Computational Linguistics. doi: 10.18653/v1/W15-3049. URL https://aclanthology.
org/W15-3049.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a
unified text-to-text transformer. arXiv e-prints, 2019.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable
questions for SQuAD. In Iryna Gurevych and Yusuke Miyao (eds.), Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.
784–789, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:
10.18653/v1/P18-2124. URL https://aclanthology.org/P18-2124.
François Remy, Pieter Delobelle, Bettina Berendt, Kris Demuynck, and Thomas Demeester.
Tik-to-tok: Translating language models one token at a time: An embedding initialization
strategy for efficient language adaptation, 2023. Full paper in the supplements.
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. How good is
your tokenizer? on the monolingual performance of multilingual language models. In
Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3118–3135, Online,
August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.
243. URL https://aclanthology.org/2021.acl-long.243.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare
words with subword units. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi:
10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162.
Team MistralAI, Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock,
Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.
Mistral 7b, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2:
Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. Do llamas work in
english? on the latent language of multilingual transformers, 2024.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers:
State-of-the-art natural language processing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online,
October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/
anthology/2020.emnlp-demos.6.
Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. How do
large language models handle multilingualism?, 2024.
A Limitations
Our proposed trans-tokenization strategy, while effective, is not without its limitations.
Firstly, initializing the target language’s token embeddings with those from a high-resource
language can inadvertently transfer certain cultural and idiomatic patterns from that lan-
guage to the new one. This may not be desirable, especially when the source and target
languages have significant cultural or linguistic differences. However, as the availability of
target language training data increases, this issue should tend to diminish.
Secondly, our intra-word many-to-many token mapping approach relies on a left-to-right
alignment assumption, which may not always be optimal. In theory, it is possible to extend
our method to accommodate different alignment strategies, but this has not been explored
in the current study.
Lastly, our fine-tuning process fully finetunes the bottom two and top two layers without
employing the Low-Rank Adaptation (LoRA) technique proposed by Hu et al.
(2022). This results in a significant VRAM footprint for every language supported on a GPU.
The use of more efficient adapters, such as LoRA, could have significantly reduced this
footprint, making our approach more resource-efficient. It might also be possible to train
a small projection on top of the existing embedding matrix, to avoid having an entire
embedding table per supported language.
B Future Work
C Release statement
Together with this publication, we release the code and documentation of our trans-
tokenizers library, which facilitates the conversion of models from one tokenizer to another.
All our trained models will be released on the HuggingFace hub, with the tag “tweety”,
to enable the community to replicate our work. Finally, we also open-source our Tatar
summarization dataset on HuggingFace.
Figure 2: Illustration of the arbitrary nature of token alignment which can be captured by
evidence-based SMT mappings (trans-tokenization) but not by character-based mappings.
While the semantic information pertaining to tokens can be recovered well using character-
based overlap metrics (because semantically-connected words usually share similar char-
acter combinations), the required token-sequence modeling is inherently specific to the
exact choices of tokenization in both the source and target languages, and it cannot be
recovered without consideration of the precise mappings taking place. This is where our
SMT approach shines, because it can align the first token of a word in the source language
precisely to the first token of that word translated in the target language, based on token
translations. This reduces the number of tokens from which a token is sourced, and
enhances the ability of converted LLMs to generate correct sequences of tokens, even for
less frequent combinations.
Many other token mapping techniques, such as FOCUS (Dobler & de Melo, 2023), also make
strong assumptions about the accidental or explicit exposure of language models to the
target language. For example, FOCUS assumes that mapping a new token to its former
constituents will make sense for the language model, but this is not true if the language was
never seen at all, or seen rarely. These approaches might however still shine for converting
LLMs between high-resource or mid-resource languages. When LLMs are already fluent in
a language, adjusting their vocabulary in this way can bring performance gains at inference
time, without also requiring a transfer of knowledge from English.
In conclusion, our previous Llama2 experiments and our new baselines show that token
mappings based on character-ngrams are not sufficient for GPT-models, where token map-
pings need to transfer precise information about arbitrary token sequences.
Trans-tokenizing a model starts with a parallel resource between the two languages to be
mapped to each other. This resource can be a noisy parallel corpus such as NLLB, or a word
translation dictionary (although a parallel corpus is preferable).
...
I ' m only fifteen ! ||| Ik ben pas vijftien !
We saw 15 of them . ||| Wij zagen er vijftien .
Fifteen maybe ? ||| Mischien vijftien ?
...
Listing 1: Parallel corpus
After using an SMT-based alignment tool, token matching counts are provided:
...
13721 _vijftien _fifteen
12293 _vijftien _15
544 _vijftien _Fif ## teen
...
Listing 3: FastAlign alignment counts
By normalizing the counts token per token, mapping probabilities can be derived:
...
_vijftien := 0.52* _fifteen + 0.46* _15 + 0.01* _Fif + 0.01* teen
...
Listing 5: Final mapping weights
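Given such weights, the target-language embedding table can be initialized as a weighted combination of source embeddings. The sketch below illustrates the idea for a single embedding table; it is our own illustration, not the released trans-tokenizers implementation, and the mean-embedding fallback for unmapped tokens is an assumption.

import torch

def init_target_embeddings(source_embeddings: torch.Tensor,
                           mapping: dict, target_vocab_size: int) -> torch.Tensor:
    """Initialize target embeddings as weighted sums of source embeddings.
    `mapping` gives, for each target token id, a list of (source token id, weight)
    pairs such as those in Listing 5; weights are renormalized defensively."""
    # fallback for unmapped tokens: the mean source embedding (stays in distribution)
    target = source_embeddings.mean(dim=0).repeat(target_vocab_size, 1)
    for tgt_id, pairs in mapping.items():
        total = sum(w for _, w in pairs)
        if total > 0:
            target[tgt_id] = sum((w / total) * source_embeddings[src_id]
                                 for src_id, w in pairs)
    return target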
Table 7: Accuracy of models on the Tatar semantic word analogies from the SART dataset,
broken down by sub-task. In the main paper, only the average was reported.
For our Social Media evaluation, we scraped 125 English snippets from the social network
mastodon.social, by sampling from the most popular posts from the network on Saturday
2024-03-16. The extracted snippets were manually checked for their ability to be understood
in context, the appropriateness of their length, and their exclusive usage of the English lan-
guage. We also filtered out messages pertaining to sensitive topics which could cause
discomfort to our translation agency (e.g. eroticism, pandemics, armed conflicts).
This resulted in a set of 125 snippets of 60 to 180 characters. A professional translation
agency was then hired to translate these snippets into Tatar, and these translations were used
as a reference for the task. We noted, however, similarities between the provided translations
and those of Google Translate, which might have been the result of data contamination
(see next appendix).
We suspect that the translations provided by the translation agency for the Social Media
task were partially contaminated by Google Translate, either directly through inspiration or
indirectly through the use of translation memories.
Figure 3: Google Translate results (in orange) are not in line with the otherwise strong
cross-task correlations of the other models. We estimate a real score of about 53 instead.
For this reason, we strike through the Google Translate result for that experiment and
refrain from providing a “best result” in bold. Based on the correlations found before, we estimate the
true score of Google Translate on the Social Media task to be situated between 49 and 55.
While we did not conduct enough experiments to make strong claims about the matter in
this paper, we investigated whether the source language from which a mapping was made
had a strong influence on the training results. We did this using the TowerInstruct model,
which supports English and Russian, two languages for which enough data exists to create
high-quality token mappings to Tatar. An interesting aspect of the TowerInstruct model is
that each of the 10 languages it supports received the same amount of training data, which
should ensure each language is given the same importance by the model.
Figure 4: We find that neither the English-to-Tatar nor the Russian-to-Tatar mapping performs
better than the other for transfer learning through trans-tokenization. We attribute this to the
fact that TowerInstruct was trained with corpora of equal size for English and Russian, and that
neither language is particularly close to Tatar. However, combining both initializations
provides some benefit.
We also tried merging the models after finetuning the embeddings separately for Russian
and English mappings; while this worked, it did not bring additional benefits over
merging early, and it cost twice the training time.
We hypothesize that the reason for this is that only a small subspace of the TowerInstruct
embedding matrix is perceived by the transformer as a result of its projections, and many
embeddings would produce the same output through the transformer, while looking sub-
stantially different in the full embedding space.
To test this hypothesis, we trained a projection layer using a contrastive strategy, such that
the English-initialized and the Russian-initialized embeddings of a token project to the same
value, while embeddings of different Tatar tokens remain as different as possible. We were
easily able to find such a projection, which tentatively confirms our intuition.
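A minimal sketch of such a contrastive projection follows; it is our own reconstruction, and the InfoNCE-style loss, temperature, and hyperparameters are assumptions rather than the exact setup used in this analysis.

import torch
import torch.nn.functional as F

def train_projection(emb_en: torch.Tensor, emb_ru: torch.Tensor,
                     dim_out: int = 512, steps: int = 1000, lr: float = 1e-3):
    """Learn a linear projection so that the English- and Russian-initialized
    embeddings of the same Tatar token land close together, while different
    tokens stay apart (contrastive objective with in-batch negatives)."""
    proj = torch.nn.Linear(emb_en.shape[1], dim_out, bias=False)
    opt = torch.optim.Adam(proj.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, emb_en.shape[0], (256,))  # random batch of token ids
        a = F.normalize(proj(emb_en[idx]), dim=-1)
        b = F.normalize(proj(emb_ru[idx]), dim=-1)
        logits = a @ b.T / 0.07                          # temperature-scaled similarities
        labels = torch.arange(len(idx))                  # positives on the diagonal
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return proj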
Interestingly, this projection can then be applied to the embeddings of tokens from the
original vocabulary of TowerInstruct. In our preliminary analysis, the projection appeared
effective at bringing the embeddings of semantically similar tokens closer across languages
(not limited to English and Russian). We however leave the exhaustive analysis of these
patterns to future work, for lack of time and space.
The Tweety-Armenian model performed very similarly to the Tatar model during and after
training, despite the two languages being completely unrelated (see the table below).
Table 8: Train perplexity per native token of our TweetyMistral model for Armenian
Due to a lack of readily usable downstream tasks on which to benchmark our Armenian
model, a qualitative analysis of the mapping was performed instead, to document the areas
that can still be improved.
As expected, the analysis of the English-Armenian word-level mapping revealed substantial
differences in the number of unique words between the two languages, with Armenian
having more than double the unique words compared to English. This discrepancy is
indicative of Armenian’s rich morphological structure. Indeed, Armenian is an agglutinative
language, meaning that words are often formed by stringing together morphemes to create
complex words. This results in a large number of word forms for a single lemma, making it
difficult for alignment models to achieve both high coverage and high accuracy.
This analysis also revealed the low quality of the parallel corpus used for alignment (in line
with the inadequacy of this data for finetuning HydraLLMs highlighted in Table 4, where
the parallel data proved detrimental to translation performance, unlike back-translated
data). However, most frequently-occurring alignments proved semantically correct. The words
with the highest number of translations (such as “setting”, “thread”, “push”) and the highest
translation entropy (such as “break”, “up”, and “pick”) indeed correspond to popular English words
exhibiting extreme translation diversity, likely due to their polysemous nature and varied
contextual usage. For instance, “setting” can translate to multiple Armenian verbs, nouns,
and idiomatic expressions, reflecting its contextual flexibility. The token “break”, for instance,
has the highest entropy value (8.3). It has multiple potential translations such as “kotrel”
(to break), “yndmijum” (intermission), and “cheghkel” (to crack) [ARM]. These translations
correspond to different senses of “break”, illustrating its semantic range.
Overall, the mapping appeared very usable as a statistical tool, but not that suitable as a
translation dictionary. The mapping is particularly deficient for words involved in
idiomatic expressions and phrasal verbs, as the FastAlign model struggles with these due
to their contextual dependencies. It seems likely that better results could be achieved by
cleaning the parallel data before computing the word-level mappings and by using a better
alignment tool.
L Experimental Details
Tatar: We compute the token alignment using the NLLB (NLLB et al., 2022) parallel corpus.
We finetune the embeddings on the first 41M tokens of OSCAR (Ortiz Suárez et al., 2019).
We then unfreeze 2x2 layers on the next 66M tokens of OSCAR (Ortiz Suárez et al., 2019).
Armenian: We compute the token alignment using the NLLB (NLLB et al., 2022) parallel corpus.
We finetune the embeddings on the first 41M tokens of OSCAR (Ortiz Suárez et al., 2019).
We then unfreeze 2x2 layers on the next 82M tokens of OSCAR (Ortiz Suárez et al., 2019).
Dutch: We compute the token alignment using the concatenation of two parallel corpora:
OpenSubtitles (Lison et al., 2018) and NLLB (NLLB et al., 2022).
We train and evaluate our model on a cleaned version of the Dutch fraction of C4 (see footnote 5). For the
finetuning, we use 2 A100 GPUs for a total of 40 GPU-hours, with an effective batch size of
256 and a maximal context length of 8,192.
5 https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned
The name Tweety comes from the abbreviation of our proposed method, trans-tokenization
(TT for short). The name sounds pleasing to hear, and is semantically associated with a bird.
This association made the choice of a mascot easy.
To enable each model to have its own personality and brand, we developed a template
providing space for a flag and a background photograph, which helps locate the language
and the region of the world it is usually spoken in.
This strong association between a language, a country, and a location is of course very
incomplete, as many languages are spoken in several regions worldwide. However, we
envision that the low computation cost of training new trans-tokenized models would
enable dialect-specific LLMs in the future, helping solve issues in cases where a language is
spoken in more than one country or region.