Figure 1: Generation performance of the various models averaged over both generation tasks. For each model size the results are presented as relative to the 1M-DOC model.
character-level tokenizer, equivalent to zero support. We then pre-trained from scratch copies of a decoder-only transformer-based model (Vaswani et al., 2017) with the different tokenizers, and fine-tuned them on several downstream tasks. In this work we hypothesize that downstream success should be correlated with the compression ability of the underlying tokenizers. We experimented with three model sizes, tokenizers of six different volumes of supporting data, and two languages, English and Turkish.

Our results show that, in terms of intrinsic performance, the tokenizers' compression ability is highly influenced by the amount of supporting data, with tokenizers trained on a minimal amount of data producing tokenized texts more than 60% longer than those of the best compressing tokenizer. The discrepancy in compression is significantly more marked for less frequent words.

Extrinsically, we also found that downstream success monotonically increases with the increase in the tokenizer's support. The correlation between the intrinsic and extrinsic measures of tokenization quality points to the conclusion that better compressing tokenizers are a desired goal on the road to better language models, a conclusion that may hold even for models dealing with other modalities (Ryoo et al., 2021; Ronen et al., 2023).

While we evaluated downstream performance on both classification and generation tasks, we observed that the correlation to compression is stronger for the latter type of tasks. This discrepancy could be attributed to the fact that generation tasks require the tokenizer more extensively than classification tasks, in line with the number of generation steps involved. We therefore conclude that tokenization's effect is better assessed through generation tasks rather than classification tasks.

Our results also show that smaller models are especially vulnerable to poor tokenization, with the smallest 10m parameter model suffering more significant drops in performance than our largest 1B model. Finally, experimentation with Turkish revealed the same trends, ruling out the option of an English-specific phenomenon.

In the remainder of the paper we describe the common practices in assessing tokenizers (section 2), argue for the theoretical sensibility of compression as an intrinsic tokenization evaluation (section 3), and then describe our experiments (section 4) and their results (section 5).

2 Measuring Tokenization Quality

From the very early days of NLP, models have always assumed text discretized into tokens as input (Winograd, 1971). For the most part, these tokens were whitespace-separated words, but in the recent decade non-trivial tokenization algorithms have surfaced (Mikolov et al., 2012; Sennrich et al., 2016; Ataman and Federico, 2018), primarily to deal with unseen tokens without smoothing techniques or other convoluted methods (Chen and Goodman, 1999). The underlying reasoning behind all modern tokenization methods is that some subwords, e.g., morphemes, may carry independent information that is of value to the model even if the word as a whole is rare or unseen. Better tokenization is therefore assumed to improve models' performance over rare words, while also carrying computational
TOKENIZER   SENTENCE
CHAR        _T h i s _i s _a b o u t _c o m p r e s s i n g _t o k e n i z e r s
1-DOC       _Th is _is _a b ou t _comp re ss ing _to k en i z ers
10-DOC      _This _is _about _comp res sing _to k en iz ers
100-DOC     _This _is _about _comp ress ing _tok en izers
1K-DOC      _This _is _about _comp ressing _to ken izers
1M-DOC      _This _is _about _compress ing _token izers

Figure 2: Six tokenizers differing in the amount of supporting documents tokenizing the same sentence. Note that better compression is achieved with more support.
benefits, like smaller models and the elimination of unknown tokens.

It is not surprising, then, that whenever tokenizers are presented or tested, they are usually accompanied by an array of evaluations that assess the tokenization's influence on the model's downstream success, mostly on translation tasks (Kudo, 2018; Provilkov et al., 2020; Vilar and Federico, 2021; Saleva and Lignos, 2023), although monolingual tasks are also used (Yehezkel and Pinter, 2023). Other works circle back and assess tokenization with respect to the desiderata it is supposed to serve as a stand-alone algorithm, independently of the model trained on top of it, usually in addition to evaluation over downstream performance. However, most works disagree on the desiderata themselves. Many emphasize alignment to linguistically meaningful units (Klein and Tsarfaty, 2020; Hofmann et al., 2021, 2022; Gow-Smith et al., 2022) or to human cognitive preferences in general (Beinborn and Pinter, 2023).¹ Others include analyses of token length and frequency (Bostrom and Durrett, 2020; Yehezkel and Pinter, 2023), mostly in addition to the above, assuming that ideal tokenizers use longer and more frequent tokens.

The two types of tokenization evaluations, extrinsic over downstream success and intrinsic over a plethora of metrics, are usually not compared directly. They are only used to demonstrate the superiority of a specific tokenizer, and the relation between the evaluation approaches is glossed over. In this work we explicitly focus on compression as a potential intrinsic indicator of tokenization quality, as has been suggested in past works in other settings (Gallé, 2019; Gutierrez-Vasques et al., 2023), and check to what extent it is correlated with extrinsic downstream success. We conclude that compression is a desideratum for tokenization not only due to its theoretical virtue, expanded on in the next section, but first and foremost because it correlates with downstream performance.

¹This line of work may view tokenization as a continuation of unsupervised morphemic segmentation (Creutz and Lagus, 2002; Virpioja et al., 2013).

3 The Role of Compression in Tokenization

In the realm of intrinsic measures for evaluating tokenization quality, compression particularly stands out. It has garnered considerable attention, notably due to its pivotal role as the cornerstone of the byte pair encoding tokenization algorithm (BPE; Schuster and Nakajima, 2012; Sennrich et al., 2016), an algorithm initially conceived for general data compression purposes (Gage, 1994).

Given data composed of sequences of atomic symbols, the algorithm minimizes the overall data length, for a given dictionary budget, by iteratively substituting a new symbol in place of the symbol pair most frequently occurring in a large corpus of supporting data. In the domain of language modeling, the symbols are usually characters and the supporting corpus is a subset of the text designated to be used as the training set of the language model which the tokenizer is meant to serve.
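The merge loop at the core of this procedure can be sketched in a few lines of Python. This is only an illustrative implementation of the pair-counting scheme described above, not the tokenizer-training code used in this work:

```python
from collections import Counter

def learn_bpe_merges(corpus_words, vocab_budget):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from sequences of atomic symbols (characters).
    sequences = [list(word) for word in corpus_words]
    vocab = {symbol for seq in sequences for symbol in seq}
    merges = []
    while len(vocab) < vocab_budget:
        pair_counts = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        new_symbol = a + b
        merges.append((a, b))
        vocab.add(new_symbol)
        # Substitute the new symbol wherever the pair occurs.
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [new_symbol]
                else:
                    i += 1
    return merges, vocab
```

Applying the learned merges greedily to new text yields the tokenization; merges learned from more representative supporting data translate directly into shorter token sequences on unseen text.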
But in a sense, compression-driven tokenization may be viewed as language modeling in and of itself. Consider that language models are aimed at assessing, and possibly maximizing, the likelihood of produced texts, expressed as a product of token probabilities,

P(x) = ∏_k P(x_k | x_{1:k−1}),

where x_k is the kth token in a sentence x. Compression limits the lower bound on this product of fractions by minimizing the number of operands, i.e., minimizing the length of the sequence.

In terms of n-gram language modeling, where the probability of each token is approximated as dependent only on a context of length n − 1,

P(x) ≈ ∏_k P(x_k | x_{k−(n−1):k−1}),

the probability of x_k given its context is further approximated by the number of appearances of the relevant n-gram in a training corpus,

P(x_k | x_{k−(n−1):k−1}) ∝ N(x_{k−(n−1):k}).

A compressor may be considered a 0-gram language model, where the relevant n-gram is of length 0 and the probability of each token is not even a function of its own frequency in the training data, but is set uniformly to

P(x_i) = |V|^{−1},

where |V| is the vocabulary size.

Although simplistic when thinking about language modeling with predefined whitespace-separated words, this type of objective is sensible when considering that it is used to determine the symbols themselves.
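To make the link between compression and likelihood explicit, the following short derivation (in our notation, following the formulas above) shows that under the uniform 0-gram model the log-likelihood of a tokenized sentence depends only on its length in tokens:

```latex
% Log-likelihood of x under the uniform 0-gram model:
\log P(x) \;=\; \sum_{k=1}^{|x|} \log P(x_k)
          \;=\; \sum_{k=1}^{|x|} \log |V|^{-1}
          \;=\; -\,|x| \cdot \log |V|.
% With |V| fixed by the dictionary budget, the only way to raise this
% lower bound is to decrease |x|, i.e., to compress the text into fewer tokens.
```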
From this point of view, prioritizing compression as an indicator of tokenization quality is very reasonable. Since BPE optimizes an approximation, albeit a crude one, of the downstream objective, doing better under this approximated objective should translate into better downstream performance, which would justify the focus on compression as a metric.

Moreover, from an information-theoretic perspective, Shannon's source coding theorem (Shannon, 1948) links the limit on compression to the entropy of the source of the data to be compressed. As language models aim to increase the log-likelihood of texts, and hence decrease the entropy of the distribution, they inadvertently also increase the possible compression of the texts. Our claim is that this relationship is symmetric: BPE tokenizers, as they compress texts, may also inadvertently increase their log-likelihood.

We set out to empirically examine our hypothesis by assessing the correlation between the tokenizer's compression ability and the performance of language models of various sizes over a set of downstream tasks.

To explicitly control the compression ability, while fixing any other intervening factors as much as possible, we deal only with BPE tokenizers. This is in contrast with other works that compared compression across different tokenization algorithms (Gallé, 2019; Schmidt et al., 2024). To create BPE tokenizers with varied compression rates, we recall that BPE's maximal compression is guaranteed only over its supporting corpus. Normally, for large enough corpora, a minimal discrepancy is assumed between the character distribution in the corpus and the "true" distribution in the language. In this work, however, we explicitly emphasize and expand this discrepancy by limiting the size of the support to a great extent. We will show that this intervention severely hinders the compression capacity of the tokenizer and that it also leads to deteriorating downstream performance.

4 Experimental Setup

4.1 English Experiments

Tokenizers We trained six different tokenizers with a dictionary size of up to 32k tokens.² Each tokenizer was supported by a different amount of documents from the model's train set: a million (1M-DOC), a thousand (1K-DOC), a hundred (100-DOC), ten (10-DOC), one document (1-DOC), and no documents at all (CHAR). The tokenizers are initialized with all the relevant symbols: the characters of the alphabet, punctuation marks, and all foreign characters that appear in the respective documents.

²For some tokenizers with little supporting data, there were fewer than 32k strings and sub-strings, so the vocabulary in practice was smaller. See Appendix A for details.
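The paper does not name a specific tokenizer implementation; as one possible realization of this setup, the sketch below trains a BPE tokenizer on the first n documents of the pretraining data using the HuggingFace tokenizers library. The helper `c4_documents` is a hypothetical iterator over the pretraining documents in their original order, not part of the paper's code:

```python
from itertools import islice
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_supported_tokenizer(documents, n_docs, vocab_size=32_000):
    """Train a BPE tokenizer supported by the first n_docs documents."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(islice(documents, n_docs), trainer=trainer)
    return tokenizer

# e.g. the 10-DOC and 1K-DOC settings of the paper (c4_documents is assumed):
# tok_10 = train_supported_tokenizer(c4_documents(), 10)
# tok_1k = train_supported_tokenizer(c4_documents(), 1_000)
```

The CHAR setting corresponds to skipping training entirely and keeping only the initial character alphabet.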
Models For every tokenizer, we trained decoder-only transformer-based language models in three sizes, in terms of number of parameters: 1B, 128m, and 10m. The model sizes exclude the parameters dedicated to the embedding layer, as its size may differ across tokenizers. See Appendix A for further details.

Data Pretraining of the English models was executed monolingually using the train split of C4 (Raffel et al., 2020).

Tasks To evaluate the tokenizers' contribution to downstream success we finetuned the models over four tasks. Two classification tasks:
• QQP (Quora Question Pairs³), where the task is to classify two questions as duplicates or not.
• MultiNLI (Williams et al., 2018), where the model is tested on natural language inference (NLI) examples from a domain which differs from the ones appearing in the training set.
And two generation tasks:
• X-Sum (Narayan et al., 2018), where news articles should be summarized into one single sentence.
• QG-QA (Question Generation over SQuAD; Rajpurkar et al., 2016), where the task is to generate questions based on a context paragraph and an answer.

³https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

4.2 Turkish Experiments

To make sure that our results are not due to English-specific phenomena, we repeat a representative subset of the experiments with Turkish, an agglutinative language with a higher morphemes-per-word ratio. For intrinsic evaluation, we trained six Turkish tokenizers, as we did for English. However, for extrinsic evaluation, due to the expensive pretraining and finetuning, we trained models only for three tokenizers: 1M-DOC, 10-DOC, and CHAR. The models were pretrained on the train split of the Turkish part of mC4 (Xue et al., 2021), and finetuned over three tasks: one classification task, XNLI (Conneau et al., 2018), and two generation tasks, XL-Sum (Hasan et al., 2021) and Question Generation over the TQuAD dataset.⁴

⁴https://tquad.github.io/turkish-nlp-qa-dataset/

5 Results

5.1 Intrinsic Evaluation

To illustrate the effect of limiting the tokenization support on the compression ability, we measured the accumulative length in tokens of the development sets of all English downstream tasks.

TOKENIZER   TOKEN LENGTH   RELATIVE LENGTH
1M-DOC      9,336,052      —
1K-DOC      9,541,368      +2%
100-DOC     10,489,029     +12%
10-DOC      15,126,769     +62%
1-DOC       20,647,861     +121%
CHAR        39,480,577     +323%

Table 1: Compression ability of the different tokenizers, accumulated over the development sets of all downstream tasks. Relative length is in comparison to the 1M-DOC tokenizer.

The results, depicted in Table 1, show that providing less support severely impedes the tokenizer's ability to compress unseen texts. Note that the inflation in the texts' length is not linear. Reducing the supporting data by three orders of magnitude, from 1M-DOC to 1K-DOC, results in only slightly longer texts, while a reduction of another three orders of magnitude, to the 1-DOC tokenizer, carries a much more significant effect.

5.2 Extrinsic Evaluation

Table 2 summarizes the downstream evaluation results for all models and all tokenizers over all four English tasks. Unsurprisingly, it shows that larger models, in terms of parameters, fare better on all tasks. Additionally, it shows that all models perform better on the classification tasks than on the generation tasks. Nevertheless, over most tasks and model sizes, there is a clear improvement in performance when the models are equipped with better supported tokenizers.

Similarly to the intrinsic metric above, the downstream improvement is not linear either. The improvements achieved by updating the tokenizer from 1-DOC to 10-DOC are more substantial than those from 1K-DOC to 1M-DOC, despite the introduction of significantly fewer documents. The findings for the Turkish models in Table 5 demonstrate analogous patterns, indicating that the results decline as the tokenizer's support diminishes. This is again particularly noticeable in the case of the generation tasks.

5.3 Intrinsic-Extrinsic Correlation

To assess the correlation between the tokenizer's support and the model's task performance we computed Spearman's ρ coefficient, separately for each task and each model size. This correlation coefficient was chosen since it refers to the relative rank of each data point, and thus does not ascribe linear importance to the differences in the absolute number of supporting documents.

The results are shown in Table 3. Note that due to the small sample size the correlation is statistically significant (α = 0.05) only for coefficients larger than 0.829. The results show that, for the most part, the tokenizer's support is well correlated with the model's overall success, with the clear exception of classification tasks on the 1B model.
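For concreteness, the rank correlation described above is a single SciPy call. The sketch below plugs in the published numbers from Table 1 and the 1B XSum column of Table 2; the variable names and the use of SciPy are our own illustration, not a statement about the paper's actual tooling:

```python
from scipy.stats import spearmanr, pearsonr

# Tokenizers ordered by support: CHAR, 1-DOC, 10-DOC, 100-DOC, 1K-DOC, 1M-DOC.
support_docs  = [0, 1, 10, 100, 1_000, 1_000_000]        # number of supporting documents
token_lengths = [39_480_577, 20_647_861, 15_126_769,
                 10_489_029, 9_541_368, 9_336_052]       # dev-set lengths, Table 1
xsum_1b       = [44.69, 46.33, 47.07, 47.69, 47.53, 47.71]  # Table 2, 1B params, XSum

# Spearman's rho only uses ranks, so the huge absolute gaps in support do not matter.
rho, p = spearmanr(support_docs, xsum_1b)   # rho ~ 0.943, matching the Table 3 entry

# Replacing the support counts with the tokenized lengths gives the Pearson
# correlation of the Table 4 setting (strongly negative: longer tokenizations hurt).
r, p_r = pearsonr(token_lengths, xsum_1b)
```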
TOKENIZER   QQP (F1)     MULTINLI (Acc.)   XSUM (RougeL)   QG-QA (RougeL)

1B params
1M-DOC      88.02±0.18   88.24±0.10        47.71±0.02      33.09±0.41
1K-DOC      87.38±0.05   88.32±0.07        47.53±0.04      32.95±0.42
100-DOC     88.30±0.09   88.75±0.11        47.69±0.05      32.99±0.46
10-DOC      87.44±0.07   88.27±0.09        47.07±0.06      30.51±0.61
1-DOC       86.07±0.20   86.67±0.25        46.33±0.11      28.42±0.99
CHAR        83.13±0.23   84.59±0.65        44.69±0.08      24.91±0.55

128m params
1M-DOC      82.13±0.14   85.33±0.16        45.68±0.04      27.66±0.59
1K-DOC      82.29±0.06   85.45±0.18        45.83±0.08      27.37±0.59
100-DOC     81.75±0.28   85.07±0.13        45.53±0.02      26.97±0.64
10-DOC      80.14±0.19   83.76±0.06        45.08±0.04      24.99±0.62
1-DOC       78.71±0.19   82.30±0.31        44.43±0.06      23.9±0.60
CHAR        76.27±0.23   82.10±0.26        43.19±0.06      21.81±0.43

10m params
1M-DOC      71.65±0.81   78.62±0.22        40.92±0.06      22.43±0.44
1K-DOC      69.97±0.11   79.94±0.15        40.69±0.08      22.13±0.53
100-DOC     71.26±0.28   78.57±0.17        41.05±0.02      21.59±0.52
10-DOC      66.51±0.22   75.95±0.24        39.00±0.04      19.73±0.43
1-DOC       66.25±0.11   78.11±0.12        37.01±0.08      18.61±0.40
CHAR        64.01±0.14   75.81±0.77        27.86±0.04      16.92±0.38

Table 2: Results over all downstream tasks, each in terms of its respective metric. Results are averaged over 5 finetunes.
MODEL SIZE   QQP       MULTINLI   XSUM      QG-QA
1B           0.714     0.600      0.943**   0.943**
128m         0.943**   0.943**    0.943**   1.000**
10m          0.943**   0.886*     0.829*    1.000**

Table 3: Spearman's ρ correlation between the tokenizer's support (number of supporting documents) and downstream performance, per task and model size.

MODEL SIZE   QQP        MULTINLI   XSUM       QG-QA
1B           -0.980**   -0.974**   -0.994**   -0.976**
128m         -0.971**   -0.863*    -0.988**   -0.949**
10m          -0.870*    -0.710     -0.996**   -0.933**

Table 4: Pearson's correlation between the tokenized length of the development sets (Table 1) and downstream performance, per task and model size.
Even starker correlation appears when measuring Pearson's correlation coefficient between downstream performance and the compression itself, i.e., the overall length of the development sets in tokens, from Table 1. As can be seen in Table 4, the inverse correlation between length in tokens and performance is high, with the exception of classification tasks on the 10m model. Note that here as well, the small sample size causes the correlation to be statistically significant only when above 0.729 in absolute value. No numerical correlation could be computed for the Turkish results due to the small sample size.

These results point to generation tasks as better downstream evaluators of tokenizers, as tokenization is less crucial to the models' success over classification tasks.

In addition to assessing the correlation's significance, Figure 1 visualizes the effect size for both languages. We averaged the performance over the generation tasks, as the correlation was less significant for classification. The graph depicts, separately for each model size, the performance with the various tokenizers compared to the best supported tokenizer. Notably, while compression consistently correlates with generation performance across all model sizes, the impact is particularly pronounced for smaller models.

The parallels drawn between tokenization and language modeling in section 3 may provide some explanation for the smaller effect of poorer tokenization on larger models. As we claim that compression is simple language modeling on its own, it is possible that LLMs that are more powerful language models in general are able to allocate resources to compensate for less compressing tokenizers.

6 Analysis

Tokenization of Frequent Words To better understand the source of the discrepancy in compression between tokenizers, we plot in Figure 3 the number of tokens needed per word with respect to the word's frequency (measured as the number of appearances in a sample of 3 million unseen documents from mC4). We averaged the token-per-word ratio over all words whose occurrence counts are of the same order of magnitude and provide the number of words in each bin. A similar analysis was done for Turkish and is shown in Figure 4.

Figure 4: Number of subwords per Turkish word as a function of its abundance in 3 million unseen documents, averaged over orders of magnitude. The number of words included in each bin is indicated under the x axis.
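A minimal sketch of this binning, assuming `word_counts` maps each word to its number of occurrences in the sampled documents and `tokenize` is the encode function of one of the trained tokenizers (both hypothetical names):

```python
import math
from collections import defaultdict

def tokens_per_word_by_magnitude(word_counts, tokenize):
    """Average token-per-word ratio, binned by order of magnitude of word frequency."""
    bins = defaultdict(list)
    for word, count in word_counts.items():
        magnitude = int(math.log10(count))  # e.g. 3 for 1,000-9,999 occurrences
        bins[magnitude].append(len(tokenize(word)))
    # For each bin: (average tokens per word, number of word types in the bin).
    return {
        magnitude: (sum(lengths) / len(lengths), len(lengths))
        for magnitude, lengths in sorted(bins.items())
    }
```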
The figures show that the token-to-word ratio is extremely similar across tokenizers for the most frequent words. On the other hand, the tokenizers diverge in token-to-word ratio when presented with rarer words, with less supported tokenizers being more sensitive to word frequency than better supported ones. It is worth noting that the same trend applies to the CHAR tokenizer, for which the number of tokens per word is simply its length in characters. This should not be surprising, given the tendency of frequently used words to be shorter, in accordance with Zipf's law of abbreviation (Zipf, 1949).

In addition, as predicted by Zipf's law (Estoup, 1912; Zipf, 1949), the number of frequent words over which the tokenizers agree is quite small in terms of types, but these words cover a large portion of the 3 million document sample over which the statistics were calculated. The English words that appear at least 10^6 times in the sampled corpus, 162 in number, cover 47% of the words in the corpus. In Turkish, on the other hand, due to the thicker tail of the Zipfian distribution of morphologically rich languages, only 71 words meet this criterion, covering 26% of the corpus.

We conclude that the discrepancy in compression ability evident in Table 1 stems mostly from the difference in the compression of less common words. This tail of less frequent words consists of the semantically interesting words, so it is likely that this gap in compression causes the gaps in model performance.

Performance over Frequent Words To complement the analysis above, we also broke down the results of the generation tasks by the average frequency of the words in the targeted output of each example. The results, plotted in Figure 5, show the difference in Rouge-L per example from the best 1M-DOC model, per task and model size. They show that the differences between differently tokenizing models are more pronounced over examples with rarer words.

Figure 5: Downstream success in Rouge-L relative to the 1M-DOC model plotted against the average word frequency in each example (panels (a)-(f): XSum and QG-QA with the 10m, 128m, and 1B parameter models). Trend lines were plotted based on the entire data, but for visibility reasons the scatter is based on averages over bins containing each 2% of the data.

Together, this and the previous analysis shed some light on the reasons for the correlation found in our main result. We demonstrate that the differences in performance between the various models are indeed more pronounced in the presence of rarer words, which are exactly the words that the tokenizers compress differently. It is thus highly probable that word frequency is a major confounding factor that connects compression with downstream performance.

In addition, this analysis may point to the benefit of challenge sets, comprised of examples with rarer words, in the evaluation of tokenization.

Similarity between Tokenizers The results so far compared the output of each model to the target outputs, showing that models perform better when equipped with better compressing tokenizers. In order to show that the models also converge towards similar generations, we plotted, in Figure 6, the pair-wise overlap between the outputs of all models for the English generation tasks, measured in Rouge-L.
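This pair-wise overlap can be computed with any Rouge implementation; the sketch below uses the rouge_score package and assumes `outputs` maps each tokenizer name to the list of generations produced by the corresponding model (a hypothetical variable, shown only to illustrate the computation):

```python
from itertools import combinations
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def pairwise_overlap(outputs):
    """Mean Rouge-L between the generations of every pair of models."""
    overlap = {}
    for name_a, name_b in combinations(outputs, 2):
        scores = [
            scorer.score(pred_a, pred_b)["rougeL"].fmeasure
            for pred_a, pred_b in zip(outputs[name_a], outputs[name_b])
        ]
        overlap[(name_a, name_b)] = sum(scores) / len(scores)
    return overlap
```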
The analysis shows that, for all tasks and model sizes, models with similarly supported tokenizers tend to output similar predictions, regardless of whether the predictions are similar to the gold targets. It is also noticeable that, in accordance with our main results, the differences in the high-support region are less pronounced than those between the less supported tokenizers.

Figure 6: Pair-wise Rouge scores between the outputs of the different models (panels (a)-(f): XSum and QG-QA with the 10m, 128m, and 1B parameter models). Darker means higher Rouge-L and higher similarity between outputs. Models with a similar number of supporting documents tend to output similar predictions.
7 Conclusions

In this paper we demonstrated the importance of compression to tokenization as an intrinsic evaluation of tokenization quality that is indicative of performance on extrinsic downstream tasks. We argued in favor of compression-driven tokenization from a theoretical perspective, since it may make the tokenizer act as a simple standalone language model, and we showed its correlation with downstream model success.

Our experiments point to generation tasks as better downstream evaluators of tokenization, since their results are both more sensitive to the tokenizer and better correlated with tokenization quality as expressed in compression ability.

In terms of linguistic diversity, the similarity in the results and analyses across two very different languages, English and Turkish, points to our conclusions being independent of specific typological characteristics. Yet, ample room is left for studying the effects of tokenization on more languages that are even more typologically diverse. Moreover, other intrinsic evaluators are still to be assessed even for the languages we did work with.

We conclude that tokenization matters, as poorly compressing tokenizers hinder the results of language models, and that investment in better compressing tokenizers has the potential of improving model performance while being relatively cheap in terms of compute. We therefore call for research to better understand the factors controlling the quality of tokenization and its relation to the overall success of LLMs.

8 Limitations

The main limitation of this paper has to do with the amount of resources allocated to this research. Pretraining our LLMs, especially the 1B parameter models, requires a lot of compute, and repeating these experiments, in a slightly different setting or just in order to replicate their results, is an expensive process.

Another limitation has to do with the limited experiments on non-English languages. Although we executed several experiments on Turkish, the cost of pretraining models of up to 1B parameters prevented us from equating the treatment given to the two languages, as well as from adding experiments in other non-English languages. It is possible, even if somewhat unlikely, that running the full suite of experiments on Turkish would have resulted in different conclusions. A more reasonable possibility is that running experiments on more typologically diverse languages would yield different conclusions for these languages. We mitigated this risk by choosing a language that is extremely different from English.
Acknowledgements

We thank Alon Jacovi and Uri Shaham for the helpful discussions and feedback.

References

Duygu Ataman and Marcello Federico. 2018. An evaluation of two vocabulary reduction methods for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 97-110.

Lisa Beinborn and Yuval Pinter. 2023. Analyzing cognitive plausibility of subword tokenization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4478-4486, Singapore. Association for Computational Linguistics.

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617-4624, Online. Association for Computational Linguistics.

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394.

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73-91.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475-2485, Brussels, Belgium. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21-30. Association for Computational Linguistics.

J.-B. Estoup. 1912. Gammes sténographiques. Recueil de textes choisis pour l'acquisition méthodique de la vitesse, précédé d'une introduction par J.-B. Estoup.

Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23-38.

Matthias Gallé. 2019. Investigating the effectiveness of BPE: The power of shorter sequences. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1375-1381, Hong Kong, China. Association for Computational Linguistics.

Team Gemini. 2023. Gemini: A family of highly capable multimodal models.

Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, and Aline Villavicencio. 2022. Improving tokenisation by alternative treatment of spaces. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11430-11443, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. 2024. OLMo: Accelerating the science of language models.

Eylon Gueta, Avi Shmidman, Shaltiel Shmidman, Cheyn Shmuel Shmidman, Joshua Guedalia, Moshe Koppel, Dan Bareket, Amit Seker, and Reut Tsarfaty. 2023. Large pre-trained models with extra-large vocabularies: A contrastive analysis of Hebrew BERT models and a new one to outperform them all.

Ximena Gutierrez-Vasques, Christian Bentz, and Tanja Samardžić. 2023. Languages through the looking glass of BPE compression. Computational Linguistics, 49(4):943-1001.

Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693-4703, Online. Association for Computational Linguistics.

Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze. 2021. Superbizarre is not superb: Derivational morphology improves BERT's interpretation of complex words. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3594-3608, Online. Association for Computational Linguistics.
Valentin Hofmann, Hinrich Schuetze, and Janet Pierrehumbert. 2022. An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385-393, Dublin, Ireland. Association for Computational Linguistics.

Omri Keren, Tal Avinari, Reut Tsarfaty, and Omer Levy. 2022. Breaking character: Are subwords good enough for MRLs after all?

Stav Klein and Reut Tsarfaty. 2020. Getting the ##life out of living: How adequate are word-pieces for modelling complex morphology? In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 204-209, Online. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66-75, Melbourne, Australia. Association for Computational Linguistics.

Tomáš Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, and Stefan Kombrink. 2012. Subword language modeling with neural networks.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797-1807, Brussels, Belgium. Association for Computational Linguistics.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882-1892, Online. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas. Association for Computational Linguistics.

Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. 2022. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189.

Tomer Ronen, Omer Levy, and Avram Golbert. 2023. Vision transformers with mixed-resolution tokenization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4612-4621.

Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2021. How good is your tokenizer? On the monolingual performance of multilingual language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3118-3135, Online. Association for Computational Linguistics.

Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. 2021. TokenLearner: Adaptive space-time tokenization for videos. In Advances in Neural Information Processing Systems, volume 34, pages 12786-12797. Curran Associates, Inc.

Jonne Saleva and Constantine Lignos. 2023. What changes when you randomly choose BPE merge operations? Not much. In The Fourth Workshop on Insights from Negative Results in NLP, pages 59-66, Dubrovnik, Croatia. Association for Computational Linguistics.

Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner. 2024. Tokenization is more than compression.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149-5152. IEEE.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.

Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379-423.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

David Vilar and Marcello Federico. 2021. A statistical extension of byte-pair encoding. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 263-275, Bangkok, Thailand (online). Association for Computational Linguistics.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, Mikko Kurimo, et al. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122, New Orleans, Louisiana. Association for Computational Linguistics.

Terry Winograd. 1971. Procedures as a representation for data in a computer program for understanding natural language. Technical report, Massachusetts Institute of Technology, Project MAC.

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291-306.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483-498, Online. Association for Computational Linguistics.

Shaked Yehezkel and Yuval Pinter. 2023. Incorporating context into subword vocabularies. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 623-635, Dubrovnik, Croatia. Association for Computational Linguistics.

George Kingsley Zipf. 1949. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books.

A Training Details

Tokenizers were trained on the first documents of C4, for English, and mC4, for Turkish, as they are ordered in the Tensorflow Datasets repository. Therefore, the data for the less supported tokenizers is contained in the data for the better supported ones.

In all cases we limited the vocabulary size to 32k, but in practice, for tokenizers supported by little data, the vocabulary size was lower than 32k, since the data did not contain a sufficient number of words and subwords. Specifically, for English, the vocabulary size of 100-DOC is 23k, of 10-DOC 3.7k, and of 1-DOC 1k. For Turkish, the size of 10-DOC is 9.5k and of 1-DOC 3.9k.

In the case of CHAR, supported by no documents at all, the tokenizer simply breaks all words into characters and replaces any foreign character with the unknown sign.

Models were trained using the T5X framework (Roberts et al., 2022) on the span corruption task (Raffel et al., 2020) for 200k training steps, with a batch size of 512. Every training example was truncated to a maximal length of 1024 tokens.⁵ The models were finetuned for 4k steps and 20k steps on the classification and generation tasks, respectively, with a batch size of 128. The decoder-only models were tasked with generating a gold output when used for classification tasks as well. For example, in the QQP task the outputs were expected to be either duplicated or not duplicated, with any other output considered wrong. A manual inspection showed that the models learned to output one of the desired targets perfectly.

⁵The fixed example length in terms of tokens leads of course to differences in the amount of data seen by the models during training, depending on their tokenizers. We consider this another boon of well-compressing tokenizers, since the computation budget is usually preset, as is the case in our setting.
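As an illustration of this classification-by-generation protocol, scoring then amounts to comparing the generated string against the gold label; the function and the example label strings below are hypothetical, chosen only to mirror the description above:

```python
def exact_match_accuracy(generations, gold_labels, label_set):
    """Accuracy of classification-by-generation: anything outside the label set is wrong."""
    correct = 0
    for generated, gold in zip(generations, gold_labels):
        prediction = generated.strip()
        correct += int(prediction in label_set and prediction == gold)
    return correct / len(gold_labels)

# e.g. for a QQP-style task (hypothetical label strings):
# acc = exact_match_accuracy(model_outputs, targets, {"duplicated", "not duplicated"})
```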