
Impact of Tokenization on Language Models: An Analysis for Turkish

CAGRI TORAMAN, EYUP HALIT YILMAZ, FURKAN ŞAHİNUÇ, and OGUZHAN OZCELIK,
Aselsan Research Center, Turkey

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are
de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for
morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We
compare five tokenizers at different granularity levels, i.e. their outputs vary from smallest pieces of characters to the surface form
of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using
RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks.
Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer is competitive with de facto
tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level
tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model
parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off
between model size and performance.

CCS Concepts: • Computing methodologies → Natural language processing; Language resources; Phonology / morphology;
Modeling methodologies.

Additional Key Words and Phrases: language model, morphological analysis, tokenization, vocabulary size

ACM Reference Format:


Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, and Oguzhan Ozcelik. 2022. Impact of Tokenization on Language Models: An
Analysis for Turkish. 1, 1 (April 2022), 17 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Deep language models gained popularity with the introduction of masked language modeling based on the Transformer
architecture [Vaswani et al. 2017] to pretrain a general purpose language understanding with BERT [Devlin et al. 2019]
and its variants. The language models are then able to transfer the pretrained knowledge to downstream tasks, such
as Sentiment Analysis and Named Entity Recognition. Indeed, such large models provide impressive results on many downstream tasks, not only in natural language processing [Devlin et al. 2019], but also in many
other research areas such as search [Yates et al. 2021] and recommendation [Sun et al. 2019].
Tokenization is an important text preprocessing step for deep language models. Conventional word embeddings,
such as word2vec [Mikolov et al. 2013], generally use vocabularies consisting of the surface forms of words. On the
other hand, deep language models employ more efficient tokenization algorithms where input text is split into smaller

Author’s address: Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, and Oguzhan Ozcelik,
Aselsan Research Center, Ankara, Turkey, emails:{ctoraman,ehyilmaz,fsahinuc,ogozcelik}@aselsan.com.tr.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2022 Association for Computing Machinery.
Manuscript submitted to ACM


pieces so that out-of-vocabulary words can still be processed. Language models can also benefit from the tokens that
represent basic semantic units to better comprehend text semantics.
Transformer-based language models generally employ two de facto tokenization algorithms, namely WordPiece
[Schuster and Nakajima 2012] and Byte Pair Encoding (BPE) [Sennrich et al. 2016]. For instance, BERT [Devlin et al. 2019]
uses WordPiece, whereas GPT-2 [Radford et al. 2019] employs BPE. However, large language models are typically pretrained for English first; successor pretrained models in low-resource languages thereby adopt the same tokenizers [Schweter 2020].
The impact of tokenization algorithms can be different for low-resource languages, such as agglutinative Turkic and
Uralic languages, where words can have prefixes and suffixes. For instance, in Turkish, parsing the word "veremedim"
(translated as "I could not give") results in "ver-e-me-di-m" including four suffixes in a single word. A Morphological-level
tokenizer can output five tokens in this case, providing the model with a better understanding of word semantics.
An example benefit is that the language model would relate the suffix "-me" to negation, similar to the word "not" in
English. Moreover, the impact of different tokenization methods, covering token representations at different levels from Character-level to Word-level, has not been examined in detail for low-resource languages, specifically for Turkish.
The number of unique tokens used in training deep language models is referred to as the vocabulary size. Any
evaluation data would be split into tokens by using the selected tokenization algorithm according to the trained
vocabulary. There is a likelihood of observing out-of-vocabulary or unknown tokens, i.e. some tokens in the evaluation
data can be missing in the trained vocabulary. The problem with unknown tokens is that they are mapped to the same
embedding without semantic context, resulting in possible performance loss. As the vocabulary size increases, the
likelihood of getting such unknown tokens is decreased, since the vocabulary would capture more instances of tokens.
On the other hand, the model becomes less efficient in terms of its size, i.e., the memory requirement would increase
and the model would become more costly to train. This results in a trade-off between model size and performance in
terms of vocabulary size.
In this study, we thereby examine the following research questions.
• RQ-1. What is the impact of different tokenization methods on the performance of Turkish language modeling
and varying downstream tasks? e.g. Does a Morphological-level tokenization method have benefits for Turkish
language modeling?
• RQ-2. How does the model performance change in different tokenization methods when the vocabulary size is
tuned for the trade-off between model size and performance?
In order to answer our research questions, we compare the performance of different tokenization methods for Turkish.
We select five tokenizers at different granularity levels, i.e. their outputs vary from smallest pieces (characters) to the
surface form (words), which are Character-level, BPE, WordPiece, Morphological-level, and Word-level tokenization,
respectively. In order to evaluate their performances, we train a tokenizer for each method, and pretrain medium
language models using RoBERTa [Liu et al. 2019] pretraining procedure on the Turkish split of the OSCAR [Ortiz Suárez
et al. 2019] corpus, called RoBERTa-TR-medium1 . We then evaluate the performance of our models by fine-tuning
them on six downstream tasks; namely News Classification, Hate Speech Detection, Sentiment Analysis, Named Entity
Recognition, Semantic Text Similarity, and Natural Language Inference.
The main contributions and practical implications of this study can be summarized as follows.
• We analyze the impact of tokenizers, at different granularity levels from character to word-level, on varying
downstream tasks for Turkish language models. We find that Morphological-level tokenizer is competitive with
1 We publish our pretrained models with different tokenizers and vocabulary sizes at https://huggingface.co/ctoraman

de facto tokenizers, i.e. BPE and WordPiece. Our experimental results, supported by statistical tests, can shed
light on the role of tokenization in language modeling, specifically for morphologically rich languages.
• We show that increasing the vocabulary size improves the performances of Morphological and Word-level
tokenizers more than that of de facto tokenizers, BPE and WordPiece. The ratio of the number of vocabulary
parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers
and 40% for others. This choice can result in a more efficient use of computational resources, where the majority
of the parameters are allocated to the Transformer blocks instead of the vocabulary embeddings. Moreover,
we believe that our experimental results on the vocabulary size can provide an empirical guidance for other
researchers who work on pretraining deep language models.
• We compare our medium-sized models with a state-of-the-art Turkish language model [Schweter 2020] that has the same model architecture as BERT-base [Devlin et al. 2019], and show that our approximately 3-times
smaller model can recover 97% of the performance of the larger one. Our language models are publicly available,
so that other researchers and practitioners can benefit from our models in terms of a more efficient model size
and varying vocabulary sizes. This would reduce memory requirements, and also provide responsible energy
usage and a smaller carbon footprint in return, which we discuss in detail in Section 5, along with other ethical
concerns including transparency and fairness.

The rest of the paper is organized as follows. In the next section, we provide a brief literature review of tokenization
algorithms in general and also tokenization in low-resource languages. We explain the details of tokenization methods,
our pretrained model, and downstream tasks in Section 3. Our comparative experiments on different tokenizers and
analysis on vocabulary size are given in Section 4. We then provide a short discussion on the ethical concerns and
broader impact of our study in Section 5. We conclude the study in the last section.

2 RELATED WORK
2.1 Tokenization Algorithms
Since tokenization is one of the first steps in any Information Retrieval or Natural Language Processing system, the
importance of using a tokenization algorithm is highlighted in early studies [Metke Jimenez et al. 2011]. The prevalent
tokenization algorithms in the literature, Byte Pair Encoding (BPE) [Sennrich et al. 2016] and WordPiece [Schuster
and Nakajima 2012], are of recent interest in language model pretraining research. Many noteworthy studies in the
literature focus on enhancing these subword tokenization methods. For example, Ding et al. [2019b] explore the impact
of the number of BPE merges on the machine translation performance. Provilkov et al. [2020] propose a drop-out
method for each merge step of BPE in order to break the deterministic nature of BPE, which provides a performance
improvement in machine translation.
BPE is found to be suboptimal for language pretraining [Bostrom and Durrett 2020] as it does not effectively utilize
the vocabulary space. Nayak et al. [2020] compare the activations of attention layers of BERT with WordPiece and
Word-level tokenization to assess the effect of including subword tokens. They find that the vocabulary with
frequency-based character combinations hinders the ability of modeling semantically meaningful relations between
words. Additionally, tokenization based on word occurrence statistics results in representations that are dependent on
frequency information rather than semantics [Gong et al. 2018]. On the other hand, it is proposed to apply subword
regularization by utilizing multiple subword segmentations to enhance the robustness of the neural machine translation
models [Kudo 2018]. Based on this algorithm, which implements BPE and Unigram language models, SentencePiece is

proposed as another tokenization method [Kudo and Richardson 2018]. Recently, Xu et al. [2021] approach the problem
of finding the best token vocabulary with a proper size in the scope of the trade-off between vocabulary entropy
and vocabulary size. The produced vocabularies in diverse scenarios achieve both reduced sizes and performance
improvements. In addition, learning an optimal vocabulary takes significantly less time than the regular BPE-search approach.
Alternative tokenization algorithms using morphological analysis can be promising candidates for subword tok-
enization that increase training efficiency and downstream performance [Park et al. 2020]. Rule-based tokenization
algorithms utilizing lexicons and semantic parsing can extend existing methods to cross-lingual settings [Vasiu and
Potolea 2020]. Joint and hybrid tokenization approaches combine coarse and fine-grained representations to incorporate
Word-level and subword representations [Hiraoka et al. 2021]. Multi-grained tokenization methods are incorporated
into the model architecture to capture multi-word representations, such as ice cream, at the expense of increased
computational complexity [Zhang et al. 2021a]. Enabling a gradient-based learnable representation in the tokenization
step of the pipeline is an emerging line of research [Tay et al. 2021]. In our study, we provide a comprehensive analysis
on the impact of tokenization algorithms at different granularity levels from character to word-level, and evaluate the
performances on a diverse range of downstream tasks.

2.2 Tokenization in Low-resource Languages


Tokenization-based methods aiming to enhance downstream task performance in low-resource languages have been studied before the introduction of the de facto tokenization methods [Kulick 2011]. With the emergence of the de facto
tokenizers, the effects of SentencePiece, Word-level, and Syllable-level tokenization strategies are investigated for
low-resource languages, such as Thai [Lowphansirikul et al. 2021]. In addition, Li et al. [2021] show that character-based
subword tokenization methods give better results than syllable-based ones in Tibetan-to-Chinese machine translation.
Part-of-Speech Tagging is one of the downstream tasks where different tokenization-based methods are employed
in low-resource languages [Ding et al. 2019a, 2018; Kaing et al. 2021]. Morphological analysis is used to propose a
tokenization system for Kurdish [Ahmadi 2020]. Exploiting pretrained models with parameter freezing and additional
intermediate layers is beneficial for Uyghur-Chinese machine translation [Zhang et al. 2021b]. Dossou and Emezue
[2021] propose a phrase-based tokenization method for neural machine translation task between Fon language and
French. Since Fon language is quite specific and low-resource, bilingual people are involved in data cleaning and
preprocessing phases to extract the best phrases based on the linguistic components of the Fon language. Park et al. [2021] study morphological features of the Korean language while training a tokenizer in the scope of machine translation. During tokenizer training, target sentences that are not processed by morphological analysis are also utilized.
Although there are some efforts for pretraining Turkish language models [Loodos 2020; Schweter 2020], the effect of
tokenization algorithms including a Morphological-level one is yet to be studied. To the best of our knowledge, this is
the first study that investigates the impact of tokenization and vocabulary size on Turkish.

3 IMPACT OF TOKENIZATION
In order to understand the impact of different tokenization methods on language modeling, we first explain the
tokenization approaches that are examined in this study. We then introduce our pipeline that describes the details of
various steps to obtain language models with different tokenizers.


3.1 Tokenization Methods


In our study, we consider five tokenization algorithms making use of different linguistic features including characters,
frequency, and grammatical rules, explained with respect to the granularity levels as follows.

• Character-level: Unlike the tokenization methods performing on word or subword units, Character-level tokenizers
split words into the smallest parts. Since Character-level tokenizer requires no training to learn a vocabulary, we
employ the ByT5 tokenization [Xue et al. 2021]. The advantage of this type of tokenization is that it can be utilized in any language to represent any character sequence at the byte level, enabling diverse modeling. Character-level tokenization also reduces the memory requirement in terms of model size, since it has a very limited number of tokens in the vocabulary. A disadvantage of this approach is that the model has to spend more capacity to reach a higher-level representation compared to other tokenization methods. For instance, the language model has to
learn during training that "t" and "h" co-occur frequently in English, whereas another tokenizer can provide this
information to the models with a "th" token. Furthermore, the output for a given sequence would contain a large
number of tokens when compared to other tokenizers. This results in potential information loss, since deep language
models have an input parameter of text sequence length.
• BPE: Byte Pair Encoding (BPE) is a frequently used tokenizer for pretrained language models [Sennrich et al. 2016].
The granularity of BPE can be considered as mid-level between character and word-level, such that tokens are mostly
subwords depending on vocabulary size. In this method, all unique words are first extracted. A base vocabulary is then
constructed from all symbols occurring in the unique words. The final vocabulary is built by merging the symbols
according to the frequencies of consecutive symbols or subwords. Since BPE operates with byte representations, the
vocabulary can encompass tokens from multiple languages and informal character sequences such as emojis.
• WordPiece: Similar to BPE, WordPiece is also based on merging characters in the documents [Schuster and Nakajima
2012]. Its main difference from BPE is that WordPiece merges symbols to maximize a likelihood score of
language modeling, i.e., when the probability of the merged symbol divided by individual probabilities of the symbols
is greater than any other symbol pair. WordPiece and BPE are frequency-based algorithms that aim to increase the
modeling power of individual tokens while being able to tokenize words that are not encountered during training of
the tokenizer (a minimal training sketch for these trainable tokenizers is given after this list).
• Morphological-level: Morphological analysis can provide suffixes and word stems that are semantically more
meaningful and valuable than the tokens obtained with overlapping frequency or likelihood. We therefore examine
using the parsing output (without tags) of morphological analysis as input tokens. We use Zemberek morphological
analysis tool for Turkish [Akın and Akın 2007] before training the tokenizer. The advantage of Morphological-level
tokenization is to capture grammatically interpretable character sequences in modeling and learn the semantics
based on the suffixes of words. A disadvantage of this approach is that word stems are not split further and constitute
a large set that has to be included in the vocabulary.
• Word-level: The granularity of the Word-level tokenizer is the surface forms of words, i.e. it splits text at the spaces between words. Word-level tokenization requires no vocabulary training, since one can apply it by simply splitting text on white space characters. One explicit disadvantage is that this tokenizer requires a larger vocabulary to properly tokenize the same amount of text compared to other methods. Since the vocabulary has a limited size in
language modeling, out-of-vocabulary or unknown tokens are likely to be observed in this approach.
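
To make the comparison concrete, the following is a minimal sketch of how the three trainable tokenizers above (BPE, WordPiece, and Word-level) could be trained with a fixed 16.6k vocabulary using the HuggingFace tokenizers library; the library choice, the corpus file name, and the special-token set are assumptions, since the paper does not prescribe a particular implementation.

```python
# Minimal sketch: training the trainable tokenizers compared in this section with
# the HuggingFace `tokenizers` library. The corpus file name and the choice of
# library are assumptions; the paper does not prescribe either.
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece, WordLevel
from tokenizers.trainers import BpeTrainer, WordPieceTrainer, WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

SPECIALS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
VOCAB_SIZE = 16_600  # fixed vocabulary size used in the comparison experiment

def train_tokenizer(model, trainer_cls, corpus_files):
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before merging
    trainer = trainer_cls(vocab_size=VOCAB_SIZE, special_tokens=SPECIALS)
    tok.train(files=corpus_files, trainer=trainer)
    return tok

corpus = ["oscar_tr_filtered.txt"]  # hypothetical preprocessed Turkish corpus
bpe_tok = train_tokenizer(BPE(unk_token="[UNK]"), BpeTrainer, corpus)
wp_tok = train_tokenizer(WordPiece(unk_token="[UNK]"), WordPieceTrainer, corpus)
word_tok = train_tokenizer(WordLevel(unk_token="[UNK]"), WordLevelTrainer, corpus)

print(bpe_tok.encode("toplumsal barış sağlanır").tokens)
```
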

The sample outputs provided by different tokenization methods are given for a sample sentence "Toplumsal barış
sağlanır" (translated as "Social peace would be achieved") in Table 1. All tokenizers have a vocabulary size of 16.6k

Method Tokenized text


Character-level "t", "o", "p", "l", "u", "m", "s", "a", "l", " ", "b", "a", "r", "ı", "ş", " ", "s", "a", "ğ", "l", "a", "n", "ı", "r"
BPE "[CLS]", "toplumsal", "barış", "sağ", "##lanır", "[SEP]"
WordPiece "[CLS]", "toplumsal", "barış", "sağlan", "##ır", "[SEP]"
Morphological-level "[CLS]", "toplum", "##sal", "barış", "sağ", "##lanır", "[SEP]"
Word-level "[CLS]", "[UNK]", "barış", "[UNK]", "[SEP]"
Table 1. The outputs of different tokenization methods for a sample input, "Toplumsal barış sağlanır" (translated as "Social peace
would be achieved").

tokens in this example, except that Character-level tokenizer has a vocabulary size of 384 characters. We note that BPE
and WordPiece tokenizers can assign the surface forms of words to tokens, while Word-level tokenizer fails to capture
some words and produces unknown tokens. The reason could be that the vocabulary capacity is utilized more efficiently
by BPE and WordPiece tokenizers, whereas Word-level tokenizer fills up the vocabulary with more frequent words
and cannot tokenize less frequent words. Morphological-level tokenizer overcomes this by assigning individual tokens
to suffixes and isolating the word stems. We note that the output sequence length of Character-level tokenization is
considerably higher than that of other tokenizers, which is not practical when the language model requires a limited length of
input text sequence (e.g. if this length parameter is set to 10 tokens, then all methods can properly represent the input,
except that Character-level tokenizer can only represent its first 10 tokens or characters).

3.2 Our Pretrained Model: RoBERTa-Turkish-medium


We develop a pipeline, illustrated in Figure 1, which consists of collecting and cleaning the training corpus, training a
tokenizer with a fixed-length vocabulary, and lastly pretraining a deep language model by using the selected tokenizer
and its vocabulary. We are then able to fine-tune the model on different downstream tasks to evaluate the performance
of the tokenizer.
We use the OSCAR deduplicated corpus for pretraining our language model [Huggingface 2021; Ortiz Suárez et al.
2019]. OSCAR is a multilingual corpus obtained by filtering the Common Crawl corpus, which maintains an open
repository of publicly available web pages. We use the split of this corpus prepared for Turkish. However, we observe
that this split includes many documents in languages other than Turkish. We thereby filter out 95,152 documents that
are not in Turkish by using an automated language detector [Shuyo 2010]. The filtering process results in 11,501,370
documents for pretraining.
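
A minimal sketch of this filtering step is given below, assuming the Python langdetect port of the language detector cited above [Shuyo 2010]; the corpus iteration is illustrative only.

```python
# Minimal sketch of the non-Turkish filtering step, assuming the Python `langdetect`
# port of the language detector cited in the paper [Shuyo 2010].
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make the detector deterministic across runs

def keep_turkish(documents):
    """Yield only the documents detected as Turkish ('tr')."""
    for doc in documents:
        try:
            if detect(doc) == "tr":
                yield doc
        except Exception:
            # detection can fail on empty or very short documents; drop them
            continue

# usage: filtered = list(keep_turkish(oscar_turkish_split))
```
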
The tokenization process, depicted inside a dashed rectangle in the figure, is conducted in three steps: (i) Applying
normalization to remove invalid characters from the text, (ii) training the tokenizer according to a predetermined
vocabulary size (except Character-level), and (iii) processing the corpus with the trained tokenizer to obtain a tokenized
pretraining data. We apply lowercase conversion and NFC normalization2 .
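
The normalization step can be sketched as follows; note that plain str.lower() is not Turkish-locale aware (e.g. 'I' lowercases to 'i' rather than 'ı'), and whether the authors apply locale-specific casing is not stated, so this is only an approximation of the preprocessing.

```python
import unicodedata

def normalize(text: str) -> str:
    # Lowercase conversion followed by NFC Unicode normalization, as described above.
    # Caveat: str.lower() maps 'I' to 'i' (not the Turkish dotless 'ı'); a Turkish-aware
    # casing step may be needed depending on the exact preprocessing used.
    return unicodedata.normalize("NFC", text.lower())
```
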
We pretrain language models using Turkish (TR) text, with RoBERTa pretraining procedure and configuration
[Liu et al. 2019], but smaller in terms of the number of layers, attention heads, and hidden size (we follow the same
architecture as BERT-medium [Devlin et al. 2019]). We thereby call the model RoBERTa-TR-medium. We determine
the vocabulary size based on the number of parameters of the models. Similar to BERT [Devlin et al. 2019], the number
of parameters associated with the vocabulary constitutes 20% of the whole model. The vocabulary size for tokenizers

2 Unicode normalization is important for Turkish, since there are special characters (ç, ğ, ı, ö, ş, ü) in the Turkish alphabet that are not observed in English.
We note that NFC Unicode normalization provides all letters in Turkish.

[Figure 1: pipeline stages — Obtain the corpus → Preprocess the corpus → Train the tokenizer → Vocabulary → Pretrain the model → Fine-tune the model]
Fig. 1. An illustration of our pretraining pipeline. There are five steps of pretraining process. We first obtain a text corpus (OSCAR
Turkish deduplicated), and preprocess the corpus by filtering non-Turkish texts. We then choose a tokenization algorithm and
implement it on the filtered corpus. We obtain a vocabulary from the trained tokenizer. We are then able to pretrain a deep language
model (RoBERTa-TR-medium) using the pretrained tokenizer and obtained vocabulary. We lastly fine-tune our model on several
downstream tasks including Sentiment Analysis (SA) and Named Entity Recognition (NER).

are therefore 16.6k tokens, except for Character-level. The calculation of vocabulary size is given as |𝑉 | = (𝑀 × 𝑅)/𝐻,
where |𝑉 | is the number of tokens in vocabulary (i.e. the vocabulary size), 𝑀 is the number of total parameters in the
language model, 𝑅 is the ratio of the vocabulary parameters to the total model parameters, and 𝐻 is the hidden dimension size (in our case, 𝑀 is approximately 42.7M and 𝐻 is 512).
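
As a worked example of this formula with the values reported above (assuming 𝑀 is the 42.69M parameter count from Table 2):

```python
M = 42.69e6  # total model parameters (RoBERTa-TR-medium, Table 2)
R = 0.20     # ratio of vocabulary parameters to all parameters
H = 512      # hidden dimension size
V = (M * R) / H
print(round(V))  # ~16676, i.e. the 16.6k vocabulary used for all trainable tokenizers
```
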
The pretraining details of our medium model are given in Table 2. Since we examine the effect of different tokenization
strategies in Turkish, we keep the pretraining procedure computationally simpler because extensive pretraining might
overshadow possible advantages of tokenization algorithms. When a model is extensively pretrained, the performance
can converge to high scores, even with Character-level encoding [Xue et al. 2021]. Nevertheless, we compare the results
of our model with the current state-of-the-art performance as a sanity check on the rationality of our results. To do
so, we employ the BERTurk model [Schweter 2020], which is a Turkish pretrained version of BERT-base [Devlin et al.
2019]. We therefore provide the configuration of BERTurk along with our model’s configuration in the table (we do not pretrain BERTurk, but only fine-tune it on our downstream tasks).
We use AdamW [Loshchilov and Hutter 2019] optimizer (𝛽 1 is 0.90, 𝛽 2 is 0.98, and 𝜖 is 1e-6), linear scheduling with a
warmup ratio of 1e-2 and peak learning rate of 5e-5, and gradient accumulation with 22 steps. Other hyperparameters
are set to the RoBERTa configuration [Liu et al. 2019].
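
A minimal sketch of this optimizer and scheduler setup in PyTorch/transformers is shown below; the model and the total number of training steps are placeholders, and the exact integration with the RoBERTa pretraining loop is not prescribed by the paper.

```python
# Sketch of the pretraining optimizer setup described above (AdamW with beta1=0.90,
# beta2=0.98, eps=1e-6, linear schedule with 1% warmup, peak LR 5e-5, and gradient
# accumulation over 22 steps). `model` and `total_steps` are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, accumulation_steps=22):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=5e-5, betas=(0.90, 0.98), eps=1e-6
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.01 * total_steps),  # warmup ratio of 1e-2
        num_training_steps=total_steps,
    )
    return optimizer, scheduler, accumulation_steps
```
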

3.3 Fine-tuning Tasks


The performances of the pretrained models with different tokenizers are evaluated by fine-tuning the models on six
downstream tasks. The tasks and datasets used for fine-tuning are explained as follows:

• News Classification: Given a set of news articles, this task aims to classify each document into a predetermined
set of classes, i.e. the task is text sequence classification. We use Turkish news classification datasets provided by
Toraman et al. [2011]. Merging two datasets from two different news resources results in approximately 7.5k news
instances. The news articles are given under eight news topics or categories; namely sports, economy, national, world,
politics, columnists, health, and culture-art.

BERTurk-base RoBERTa-TR-medium
Parameters 110.62 M 42.69 M
Train data 35 GB 27 GB
Layers 12 8
Heads 12 8
Hidden size 768 512
Batch size n/a 264
Max length 512 tokens 514 tokens
Train time 9.63 days 2 days*
Hardware TPU v3-8 2x Nvidia RTX2080 Ti
Table 2. Details of pretraining configurations for BERTurk and our model, RoBERTa-TR-medium. (*Train time and hardware are given
for a vocabulary size of 16.6k tokens. Train time can differ for other vocabulary sizes as we report detailed information in Section 5).

• Hate Speech Detection: The aim of this task is to determine whether a given text sequence includes hate speech
towards other individuals or communities with different backgrounds. Hate Speech Detection is a challenging problem
with a limited number of resources in the literature, since there is no decisive consensus on the definitions of the hate
or offensive speech, and hate language can have various forms in natural language. In this study, we use a recent hate
speech dataset in Turkish, curated by Toraman et al. [2022]. Data instances are tweets from different hate speech
domains including gender, religion, and politics. There are 100k tweets distributed equally among five hate speech
domains annotated as hate, offensive, and normal.
• Sentiment Analysis: Sentiment Analysis is a task of text sequence classification to find the author’s sentimental
state. We use a Turkish dataset including movie reviews prepared by Demirtas and Pechenizkiy [2013]. The reviews
are labeled as having positive and negative sentiments. The dataset is balanced and contains approximately 5.3k
instances for each sentiment class.
• Named Entity Recognition: Named Entity Recognition is a token classification task to predict pre-determined
named entities in a text sequence, such as person and location. We use the benchmark dataset [Tür et al. 2003]
including Turkish news articles. The dataset contains approximately 32.5k sentences and three named entity classes
given as person, location, and organization. The named entities in the dataset are annotated with the IOB2 [Ramshaw
and Marcus 1995] tags, such that each entity chunk starts with B-<class> and continues with I-<class>, e.g. New York has the tags B-<LOCATION> and I-<LOCATION> (a short illustration of this tag scheme is given after this list).
• Semantic Text Similarity: Semantic similarity between two text sequences is measured in this task. The sentence
pairs are annotated in a scale between 0 (i.e. no semantic similarity) and 5 (i.e. semantically equivalent) according to
their similarity degree. Different from the classification tasks, STS is handled as a regression problem. To evaluate the
performance of the model, the correlation between the ground truth and model predictions is taken into consideration.
We use a Turkish STS dataset that is the translation of the STSb dataset [Beken Fikri et al. 2021]. The domain of the
sentence pairs vary from news articles to online forum messages. The dataset includes approximately 8.6k sentence
pairs in total.
• Natural Language Inference: Given two sentences, the aim of Natural Language Inference is to predict whether
the latter is inferred by the former. The dataset includes three types of semantic relations: The first sentence can
entail the second one (entailment), the sentences can be irrelevant to each other (neutral), or the first sentence can
contradict the second one (contradiction). We use a Turkish NLI dataset which is the translated version of the SNLI
dataset [Budur et al. 2020]. The dataset includes approximately 570k sentence pairs.
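
As a short illustration of the IOB2 scheme mentioned for the Named Entity Recognition task, the snippet below labels a hypothetical tokenized sentence; the tokens and tag names are illustrative, not taken from the dataset.

```python
# Hypothetical example of IOB2 tags: each entity chunk starts with B-<class>,
# continues with I-<class>, and non-entity tokens are tagged O.
tokens = ["Ali", "New", "York'a", "gitti"]        # "Ali went to New York"
tags   = ["B-PERSON", "B-LOCATION", "I-LOCATION", "O"]
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```
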

                     News            Hate Speech  Sentiment  Named Entity  Semantic Text  Natural Language
                     Classification  Detection    Analysis   Recognition   Similarity     Inference
BERT    Epochs       10              5            10         10            10             3
        Max. length  514             256          514        514           514            514
        Batch size   32              32           32         16            16             16
R-TR-m  Epochs       3               3            3          3             25             3
        Max. length  256             256          256        256           256            256
        Batch size   32              32           32         16            32             16
Learning rate        1e-5            1e-5         1e-5       1e-5          1e-5           1e-5
Train Size           6,786           90,000       9,595      29,332        7,765          512,130
Test Size            754             10,000       1,067      3,259         863            56,903
Table 3. Details of fine-tuning configurations for different downstream tasks. R-TR-m refers to our model, RoBERTa-Turkish-medium,
and BERT refers to BERTurk. The task configurations can change due to space complexity and data size. We apply constant learning
rate for all tasks, except linear decay learning rate in NLI. When Character-level tokenizer is used, we set the number of epochs to 10,
max. sequence length to 1024, and batch size to 12, for all tasks.

4 EXPERIMENTS
We conduct two experiments in this study. First, we compare the performances of tokenization methods for Turkish
downstream tasks. Second, we analyze the effect of vocabulary size on the downstream task performance. For each
experiment, we describe our experimental design, and then report the results.

4.1 Comparison of Tokenization Methods


4.1.1 Experimental Design. We compare the performance of tokenizers using RoBERTa-TR-medium on Turkish down-
stream tasks (RQ-1). We use a fixed size of vocabulary, 16.6k tokens, in this experiment to understand how different
tokenizers would perform under the same configuration. Furthermore, we analyze the performance of our medium-size
model in comparison with a larger state-of-the-art model. To do so, we report the performance of BERTurk [Schweter
2020], a Turkish model with a size similar to BERT-base [Devlin et al. 2019].
We fine-tune the models, which are pretrained using different tokenizers, on the downstream tasks given in Section
3.3. For fine-tuning our models, the configurations and hyperparameters along with dataset sizes are given in Table 3.
We measure weighted precision, recall, and F1 score for all tasks, except STS where Pearson correlation is reported
with p-value. We apply 10-fold cross-validation and report the average scores. We determine statistically significant
differences between the means, which follow non-normal distributions, by using the two-sided Mann-Whitney U
(MWU) test at a 95% confidence level with Bonferroni correction.
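
A minimal sketch of this significance test on two sets of 10-fold scores is given below; the fold scores are placeholders, and the Bonferroni-corrected threshold of 0.05/4 = 0.0125 matches the threshold stated in the caption of Table 4.

```python
# Two-sided Mann-Whitney U test on 10-fold cross-validation scores of two tokenizers,
# with a Bonferroni-corrected significance threshold. Fold scores here are placeholders.
from scipy.stats import mannwhitneyu

wordpiece_f1 = [0.89, 0.88, 0.88, 0.89, 0.87, 0.88, 0.89, 0.88, 0.88, 0.89]
bpe_f1       = [0.88, 0.88, 0.87, 0.88, 0.88, 0.87, 0.88, 0.88, 0.87, 0.88]

stat, p_value = mannwhitneyu(wordpiece_f1, bpe_f1, alternative="two-sided")
alpha = 0.05 / 4  # Bonferroni correction over pairwise comparisons against the best method
print(f"U={stat}, p={p_value:.4f}, significant={p_value < alpha}")
```
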

4.1.2 Experimental Results. We report the fine-tuning results in Table 4. Our main observations and answers to RQ-1
can be summarized as follows.
WordPiece and BPE are the highest performing tokenizers in Turkish language modeling. WordPiece and BPE
are de facto standard tokenizers that are practically employed in language modeling. Indeed, we find that WordPiece
statistically significantly outperforms other tokenizers in most of the tasks in our experiments for Turkish. The only
exception is that BPE has higher scores in News Classification, but the difference between the performances of BPE and WordPiece is not statistically significant in that case. Similarly, WordPiece has higher scores than BPE in Sentiment Analysis and Named Entity Recognition, but the differences are not statistically significant either. We thereby argue

News Hate Speech Sentiment Named Entity Semantic Text Natural Language
Classification Detection Analysis Recognition Similarity Inference
P R F1 P R F1 P R F1 P R F1 corr p-value P R F1
BERT 0.918 0.917 0.917 0.781 0.781 0.781 0.927 0.927 0.927 0.935 0.955 0.945 0.862 <1e-178 0.852 0.852 0.852
R-TR-medium
Char 0.715 0.723 0.713 0.606 0.609 0.607 0.812 0.812 0.812 0.730 0.788 0.757 0.256 <1e-4 0.620 0.619 0.619
BPE 0.886 0.885 0.885 • 0.742 0.737 0.738 0.882 0.881 0.881 ◦ 0.851 0.883 0.866 ◦ 0.487 <2e-32 0.772 0.772 0.772
WP 0.882 0.881 0.881 ◦ 0.745 0.745 0.745 • 0.884 0.884 0.884 • 0.858 0.893 0.875 • 0.718 <3e-92 • 0.778 0.778 0.778 •
Morph 0.869 0.868 0.867 0.726 0.727 0.726 0.824 0.823 0.823 0.839 0.872 0.855 0.655 <5e-63 ◦ 0.768 0.768 0.768
Word 0.857 0.857 0.856 0.647 0.649 0.648 0.805 0.805 0.805 0.791 0.740 0.764 0.492 <2e-16 0.603 0.598 0.595
Table 4. Fine-tuning results of different tokenizers (rows) on six downstream tasks (columns) using Turkish datasets. The average
of 10-fold cross-validation is reported in terms of weighted precision (P), recall (R), and F1 score. BERT refers to BERTurk, which
is structurally similar to BERT-base, but pretrained for Turkish text. For STS, Pearson correlation (corr) is reported with p-value.
R-TR-medium refers to our pretrained model for Turkish text, RoBERTa-Turkish-medium, along with each tokenization method. Char
refers to Character-level tokenizer, BPE refers to Byte Pair Encoding, WP refers to WordPiece, Morph refers to Morphological-level
tokenizer, and Word refers to Word-level tokenizer. The highest score among tokenizers for each task is given in bold. The symbol “•"
indicates statistical significant difference at a 95% interval (with Bonferroni correction 𝑝 < 0.0125) in pairwise comparisons between
the highest performing method and others (except the ones with “◦").

that WordPiece and BPE are the highest performing tokenizers in Turkish, yet WordPiece has better performance than BPE
in the majority of tasks.
Word-level tokenizer performs poorly due to many unknown tokens. Word-level tokenizer performs poorly
compared to BPE, WordPiece, and Morphological-level tokenizers, possibly due to poor utilization of the available
vocabulary capacity. We therefore examine the ratio of the unknown tokens to all tokens in the fine-tuning datasets,
and report them in Table 5. It is apparent that almost half the tokens are unknown to the model when Word-level
tokenizer is used in all tasks. The reason why Word-level still achieves comparable results with other tokenizers might
be that the model is trained with the Masked Language Modeling task and gains an ability to infer meaning even with
many unknown tokens.
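
The unknown-token ratios in Table 5 can be computed with a sketch like the following, assuming a tokenizer trained with the HuggingFace tokenizers API and an [UNK] special token; the dataset variable is a placeholder.

```python
# Fraction of [UNK] tokens among all tokens produced for a fine-tuning dataset,
# as reported in Table 5. `tokenizer` is any of the trained tokenizer objects.
def unk_ratio(tokenizer, texts):
    unk_id = tokenizer.token_to_id("[UNK]")
    total, unknown = 0, 0
    for text in texts:
        ids = tokenizer.encode(text).ids
        total += len(ids)
        unknown += sum(1 for i in ids if i == unk_id)
    return unknown / max(total, 1)

# usage: print(unk_ratio(word_tok, sentiment_texts))
```
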
Morphological-level has competitive results with state-of-the-art tokenizers. The differences between the
weighted F1 scores of the Morphological-level tokenizer and the best performing tokenizer are statistically significant
but very small (between 0.01 and 0.02) in all tasks, except that it is approximately 0.06 in Sentiment Analysis and
Semantic Text Similarity (STS). However, this difference is not statistically significant in STS. We thereby argue that
Morphological-level tokenizer achieves competitive results with de facto tokenizers. Moreover, the performance of
Morphological-level tokenizer is always better than those of Character-level and Word-level tokenizers. We argue
that suffixes can provide useful information for language modeling in Turkish. The poor performance compared to
de facto tokenizers can be attributed to two observations. First, the method depends on the performance of
the morphological analyzer, Zemberek [Akın and Akın 2007], which we employ for obtaining prefixes and suffixes.
We observe possible errors such as wrong morphemes in the output of the morphological analyzer, as reported in Table 6.
For instance, the word "İstanbullular" (translated to "People of Istanbul") is not tokenized correctly; however, the word
contains inherent information, such as the hometown suffix (##lu) and the plural suffix (##lar). Second, Morphological-level tokenizer is
limited in terms of word stems since roots are not split into smaller pieces in this approach, increasing the likelihood of
observing unknown tokens, as observed in Table 5.
Character-level tokenizer has no significant benefit. Character-level tokenizer achieves the worst performance
for Turkish in most tasks. The reason could be that our medium models might be inadequate to comprehend the
relations among characters, which could be better modeled by larger language models [Xue et al. 2021]. However,

News Hate Speech Sentiment Named Entity Semantic Text Natural Language
Classification Detection Analysis Recognition Similarity Inference
BPE 0.000 0.000 0.000 0.000 0.000 0.000
WordPiece 1.447e-6 4.292e-6 4.824e-6 0.000 0.000 0.000
Morph-level 0.021 0.171 0.183 0.018 0.024 0.008
Word-level 0.587 0.515 0.502 0.457 0.522 0.457
Table 5. Ratios of unknown tokens to all tokens in the fine-tuning datasets that are used in Table 4.

Output
Sample Sentence İstanbullular güneşin tadını çıkarabildiler
True Parsing İstanbul ##lu ##lar güneş ##in tat ##ı ##nı çık ##ar ##abil ##di ##ler
Zemberek İstanbullular güneş ##in tad ##ın ##ı çıkar ##abil ##di ##ler
Table 6. Tokenization output of the true parsing, and a morphological analysis tool, Zemberek, considering Turkish syntax rules.
Sample sentence is "İstanbullular güneşin tadını çıkarabildiler" (translated as "People of Istanbul were able to enjoy the sun").

we argue that relying on larger model sizes and architectures to outperform de facto tokenizers would be inefficient, as we
address in Section 3.1.
Medium models can be competitive to larger ones. We expect that the performance of our medium models is
lower than larger models, i.e. BERTurk, due to the computational advantages of larger models. However, we find that the
performance gap is narrow for particular tasks. Our 3-times smaller model recovers 97% of BERTurk’s performance in
News Classification, 95% in Hate Speech, 95% in Sentiment Analysis, 93% in Named Entity Recognition, 83% in Semantic
Textual Similarity, and 91% in Natural Language Inference. One possible reason for the relatively lower recovery score on the STS task is that it is a regression task. In classification, output logits are mapped to classes. Since there is no such
quantization in regression, there can be more deviations in the correlation between ground truths and predictions.

4.2 Analysis of Vocabulary Size


4.2.1 Experimental Design. In the previous experiments, we fix the vocabulary size for all tokenizers except Character-
level tokenizer. However, vocabulary embeddings contribute to the total number of model parameters and the effect of
the vocabulary size can vary among different tokenizers (RQ-2). We thereby design an experiment that measures the
performances of tokenizers with changing vocabulary sizes. We note that in the ultimate case when the vocabulary size
tends to infinity, every possible character combination in the corpus is assigned a representation in the vocabulary. In such
a case, the modeling becomes similar to conventional word embeddings, such as word2vec [Mikolov et al. 2013]. On the
other extreme, the need for contextual representation of a given token increases as vocabulary size gets smaller. In
other words, a single token is expected to reflect a wide variety of contextual meanings due to limited vocabulary size.
In this experiment, we fix the hyperparameters of the Transformer blocks in the architecture, e.g. the number of
layers and hidden size, and adjust the vocabulary size such that the number of parameters attributed to the vocabulary
constitutes 10, 20, 30, 40, and 50 percent of the entire model. Since Character-level tokenizer has a fixed vocabulary size,
we exclude it from this experiment.
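
The five vocabulary sizes used in this experiment can be derived from the same formula as before, assuming the non-vocabulary (Transformer) parameters are held fixed at the value implied by the 16.6k model; this sketch reproduces the approximate 7k, 16k, 28k, 44k, and 66k sizes.

```python
# Deriving the vocabulary sizes for vocabulary-parameter ratios of 10-50%, assuming
# the non-vocabulary parameters of RoBERTa-TR-medium are held fixed.
H = 512                      # hidden size of RoBERTa-TR-medium
N = 42.69e6 * (1 - 0.20)     # fixed non-vocabulary parameters (from the 16.6k model)

for R in (0.10, 0.20, 0.30, 0.40, 0.50):
    M = N / (1 - R)          # total parameters when the vocabulary takes ratio R
    V = M * R / H            # |V| = (M * R) / H
    print(f"R={R:.0%}: |V| ~ {V/1000:.1f}k")
```
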
This analysis requires training a separate language model for each of the five vocabulary sizes and four tokenization methods, resulting in a total of 20 models. Considering six downstream tasks, we would have 120 experimental runs. We
therefore decide to select two important tasks among six tasks, for the sake of efficiency and carbon footprint. We select

Fig. 2. The effect of varying the vocabulary size for different tokenization methods. The results are reported for a text classification
task, Sentiment Analysis, and a token classification task, Named Entity Recognition.

a text classification task, Sentiment Analysis, and a token classification task, Named Entity Recognition. In fine-tuning,
we apply 10-fold cross-validation and report the average of weighted F1 scores.

4.2.2 Experimental Results. The results of varying vocabulary size for different tokenizers are given in Figure 2. We
report the vocabulary sizes by truncating the fractional part (e.g. 16.6k is given as 16k). Our main
observations and answers to RQ-2 are listed as follows:
We observe an increasing pattern for the performance of all tokenizers as the vocabulary size increases.
Increasing the vocabulary size can result in fewer unknown tokens for Morphological-level and Word-level
tokenizers. BPE and WordPiece can benefit from higher level tokens by merging subword tokens when vocabulary size
increases, e.g. New Y ##or ##k can be represented as New York when vocabulary size is sufficient. In terms of downstream
tasks, Named Entity Recognition can benefit from increasing vocabulary size more than Sentiment Analysis, since
Named Entity Recognition is a token classification task that depends on the performance of predicting individual tokens.
The reason for the smaller improvements in Sentiment Analysis could be that it is a sequence classification task that can tolerate individual unknown tokens while predicting the sentiment of the given sequence.
Performance improvement saturates faster for de facto tokenizers than for others. De facto
tokenizers, i.e. BPE and WordPiece, do not dramatically benefit from larger vocabulary sizes, which indicates a saturation
in performance at smaller sizes. A potential cause for this saturation might be that BPE and WordPiece can tokenize
almost any input in the fine-tuning datasets with a vocabulary size greater than 16k (i.e. the ratios of unknown tokens
to all tokens are mostly zeros in Table 5). Furthermore, they outperform Morphological-level and Word-level tokenizers
in both tasks. However, the performance gap diminishes as the vocabulary size increases, probably due to fewer unknown
tokens. This also reduces the impact of tokenization on the model performance. In terms of downstream tasks, while
Sentiment Analysis saturates at small vocabulary sizes (e.g. 7k and 16k), Named Entity Recognition saturates after 16k, probably because it is a token classification task.
Very large vocabulary sizes have less practical advantage, specifically for de facto tokenizers. When the
vocabulary size exceeds 66k tokens, the performance improvement may become infeasible due to the increasing

computational complexity. Instead, the computational resources can be dedicated to other components, such as
the number of parameters in the Transformer blocks. For de facto tokenizers, this situation is more apparent because
their performances are already saturated at even smaller vocabulary sizes. We observe that the ratio of the number of
vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers
and 40% for others, in order to satisfy the trade-off between model size and performance.

5 BROADER IMPACT AND ETHICAL CONCERNS


Since we pretrain large language models and fine-tune them on several downstream tasks for experimental evaluations,
we would like to emphasize broader impact and ethical concerns [Baeza-Yates 2022; Bender et al. 2021; Mitchell et al.
2019]. We thereby provide a discussion on our study’s broader impact, transparency, responsibility, and fairness in this
section.

5.1 Broader Impact


We anticipate that our study would have a broader impact on the research community that focuses on the impact of
tokenization and vocabulary size on language modeling. Although the models are medium-sized, we observe that their
downstream performances are comparable with base models (see Table 4). Our research elaborates on an agglutinative
low-resource language, Turkish, and our findings can provide guidance to other researchers that study similar languages.
We also provide performance measurements for a diverse set of downstream tasks which can be useful for various
real-life applications, such as recommendation systems and finance.

5.2 Transparency
In order to provide a transparent modeling [Mitchell et al. 2019], we explain all details regarding text corpus used in
pretraining our language models in Section 3.2, as well as the details of the algorithms used throughout obtaining the
language models in Section 3.1 and model configurations in Section 3.2. The details of how we conduct the experimental
evaluations are also reported in Section 3.3 and Section 4.

5.3 Responsibility
There is an increasing environmental awareness among the Machine Learning community about responsible training
such as the carbon footprint of extensive model training [Bender et al. 2021; Henderson et al. 2020]. We estimate the
carbon footprint of our study based on the energy usage of GPUs for pretraining and fine-tuning, and report them
in Table 7 and Table 8, respectively. We report execution time in hours and electrical energy consumption in kWh.
We assume that the GPUs operate at their maximum power draw (250 W) throughout training. This estimation ignores the carbon footprint of CPU utilization and the
manufacturing costs of the hardware. We note that the emissions for different tokenization methods are close to each
other, and report the cost for a single tokenization method, WordPiece, in this analysis.
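
The energy figures in Table 7 and Table 8 follow from this assumption directly, as in the sketch below (GPU count times hours times an assumed constant 250 W per GPU).

```python
# Sketch of the energy estimate described above, reproducing e.g. the 16k and 7k
# entries of Table 7 (2 GPUs x 40 h x 0.25 kW = 20 kWh; 2 x 36.3 h = 18.15 kWh).
def energy_kwh(n_gpus: int, hours: float, watts_per_gpu: float = 250.0) -> float:
    return n_gpus * hours * watts_per_gpu / 1000.0

print(energy_kwh(2, 40.0))    # 20.0 kWh, pretraining with the 16k vocabulary
print(energy_kwh(2, 36.3))    # 18.15 kWh, pretraining with the 7k vocabulary
```
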
Based on the calculations of execution time and energy consumption, we estimate the carbon footprint of our models
in terms of greenhouse gas (GHG) emission in kg CO2𝑒𝑞. We then plot the carbon footprint of pretraining with different
vocabulary sizes and fine-tuning on different downstream tasks in Figure 3. We note that for the GHG emission of
pretraining, the vocabulary sizes given at the x-axis have an exponential scale, starting from 7k to 66k. We observe a
linearly increasing trend in GHG emissions as the vocabulary size increases, which indicates an exponential growth in
environmental damage. We underline that the performance gain in Figure 2 comes at a cost of increasing environmental

Vocabulary Size
7k 16k 28k 44k 66k
GPU Hours (h) 2 × 36.3h 2 × 40h 2 × 44h 2 × 52.5h 2 × 57.75h
Energy Consumption (kWh) 18.15 20.00 22.00 26.25 28.875
Table 7. Energy consumption for pretraining with a single tokenization method for different vocabulary sizes.

News Hate Speech Sentiment Named Entity Semantic Text Natural Language
Classification Detection Analysis Recognition Similarity Inference
GPU Hours (h) 2 × 1.77h 2 × 6.08h 2 × 2.50h 1 × 8.33h 2 × 2.05h 2 × 35.00h
Energy Consump. (kWh) 0.89 3.04 1.25 2.08 1.03 17.5
Table 8. Energy consumption for fine-tuning with a single tokenization method for different downstream tasks.

[Figure 3: two panels — "GHG Emissions from Pretraining" (x-axis: Vocabulary Size, 7k to 66k; y-axis: Greenhouse Gas Emission in kg CO2 eq) and "GHG Emissions from Fine-tuning" (x-axis: downstream tasks; left y-axis: Greenhouse Gas Emission in kg CO2 eq; right y-axis: Number of Instances in the Training Set)]
Fig. 3. Carbon footprint (in terms of kg CO2 ) of pretraining with a single tokenization method for different vocabulary sizes (left) and
fine-tuning for different downstream tasks (right). The numbers of training instances used in fine-tuning are represented by a line at
the right subplot with their corresponding values at the second y-axis.

damage, and therefore suggest that a reasonably smaller vocabulary size is a preferable choice for pretraining. We
further observe that the greenhouse gas emissions caused by the fine-tuning experiments are in correlation with the
size of the utilized training set.
For the conversion of electrical energy usage to CO2𝑒𝑞 Greenhouse Gas (GHG) emission, we use a local conversion
factor specified by the Turkish government [Republic of Turkey Ministry of Environment and Change 2020]. The
specified value is an upper bound on the instantaneous value reported by electricityMap [ElectricityMap 2022]. Based on
our estimation for the GHG release, a lower bound for the social carbon cost (SCC) of the experiments in our study can
be approximated as $117.42 with a value of $300 per ton of CO2 [Kikstra et al. 2021].
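
As a quick consistency check of the quoted figures, the stated social carbon cost and carbon price imply the following total emission estimate:

```python
scc_usd = 117.42        # lower-bound social carbon cost reported above
price_per_ton = 300.0   # assumed $300 per ton of CO2-eq [Kikstra et al. 2021]
print(f"{scc_usd / price_per_ton:.3f} tCO2-eq")  # ~0.391 tons, i.e. roughly 390 kg CO2-eq
```
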

5.4 Fairness
The pretraining text corpus in our study, i.e. the OSCAR corpus, has millions of web page texts, a part of which may
include biased content that ignores the fairness principle towards communities and individuals with a variety of
backgrounds and profiles. Moreover, the fairness of the tokenization algorithms and Transformer-based language
models is still in debate [Baeza-Yates 2022]. We are not able to make a judgment on the fairness of the corpus and
model, since, to the best of our knowledge, there is no available automatic tool to assess fairness. Nevertheless, we
acknowledge that the researchers and practitioners should provide fair algorithms, data, and models. The bias can be
removed by filtering the pretraining corpus or by analyzing the fairness of the algorithms used in this study; however, we leave such filtering and analysis to future work, since they would require a dedicated effort to develop, whereas the scope of this study is to compare the empirical performance of different tokenizers and vocabulary sizes.

6 CONCLUSION
We provide a comprehensive study that examines the impact of tokenization in Turkish, which is a low-resource
language with a limited number of pretrained deep language models. In order to accomplish this task, we pretrain
a medium-sized language model, called RoBERTa-TR-medium, with different tokenization algorithms and varying
vocabulary sizes. Our language models are publicly available, so that other researchers and practitioners can benefit
from our models. This would reduce electrical energy and memory usage, with a smaller carbon footprint in return.
Our experimental results, supported by statistical tests, can shed light on the role of tokenization in language
modeling, specifically for morphologically rich languages. We find that Morphological-level tokenizer is competitive
with de facto tokenizers, i.e. BPE and WordPiece. Our models, which are approximately 3-times smaller than the state-of-the-art model, can recover up to 97% of its performance. We also show that increasing the
vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto
tokenizers. We suggest that the ratio of the number of vocabulary parameters to the total number of model parameters
can be empirically chosen as 20% for de facto tokenizers and 40% for others, for the trade-off between model size and
performance.
In future work, we plan to extend our experiments to other agglutinative languages, such as Finnish and Hungarian,
and other tokenization algorithms such as SentencePiece [Kudo 2018]. Morphological disambiguation [Hakkani-Tür
et al. 2018] can be used to improve the quality of morphological analysis, yielding potential improvements in
Morphological-level tokenization. We also plan to focus more on AI ethics for the impact of tokenization for pretraining
language models, including but not limited to filtering bias in pretraining text corpora and analysis of tokenization
algorithms in terms of fairness.

REFERENCES
Sina Ahmadi. 2020. A Tokenization System for the Kurdish Language. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and
Dialects. International Committee on Computational Linguistics (ICCL), Barcelona, Spain (Online), 114–127.
Ahmet Afsin Akın and Mehmet Dündar Akın. 2007. Zemberek, An Open Source NLP Framework for Turkic Languages. Structure 10 (2007), 1–5.
Ricardo Baeza-Yates. 2022. Ethical Challenges in AI. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1–2.
Figen Beken Fikri, Kemal Oflazer, and Berrin Yanikoglu. 2021. Semantic Similarity Based Evaluation for Abstractive News Summarization. In Proceedings
of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021). Association for Computational Linguistics, Online, 24–33.
https://doi.org/10.18653/v1/2021.gem-1.3
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models
Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT ’21). Association
for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
Kaj Bostrom and Greg Durrett. 2020. Byte Pair Encoding is Suboptimal for Language Model Pretraining. In Findings of the Association for Computational
Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 4617–4624. https://doi.org/10.18653/v1/2020.findings-emnlp.414
Emrah Budur, Rıza Özçelik, and Tunga Güngör. 2020. Data and Representation for Turkish Natural Language Inference. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 8253–8267. https:
//doi.org/10.18653/v1/2020.emnlp-main.662

Erkin Demirtas and Mykola Pechenizkiy. 2013. Cross-lingual Polarity Detection with Machine Translation. In Proceedings of the Second International
Workshop on Issues of Sentiment Discovery and Opinion Mining (Chicago, Illinois) (WISDOM ’13). Association for Computing Machinery, New York, NY,
USA. https://doi.org/10.1145/2502069.2502078
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
Chenchen Ding, Hnin Thu Zar Aye, Win Pa Pa, Khin Thandar Nwet, Khin Mar Soe, Masao Utiyama, and Eiichiro Sumita. 2019a. Towards Burmese
(Myanmar) Morphological Analysis: Syllable-Based Tokenization and Part-of-Speech Tagging. ACM Transactions on Asian and Low-Resource Language
Information Processing 19, 1 (2019). https://doi.org/10.1145/3325885
Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2018. NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech
Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 18, 2 (2018). https://doi.org/10.1145/3276773
Shuoyang Ding, Adithya Renduchintala, and Kevin Duh. 2019b. A Call for Prudent Choice of Subword Merge Operations in Neural Machine Translation.
In Proceedings of Machine Translation Summit XVII: Research Track. European Association for Machine Translation, Dublin, Ireland, 204–213.
Bonaventure FP Dossou and Chris C Emezue. 2021. Crowdsourced Phrase-based Tokenization for Low-resourced Neural Machine Translation: The case
of Fon Language. arXiv preprint arXiv:2103.08052 (2021).
ElectricityMap. 2022. Climate Impact by Area. https://app.electricitymap.org/map Accessed: 2022-03-16.
Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. Frage: Frequency-agnostic Word Representation. Advances in Neural
Information Processing Systems 31 (2018).
Dilek Zeynep Hakkani-Tür, Murat Saraçlar, Gökhan Tür, Kemal Oflazer, and Deniz Yuret. 2018. Morphological Disambiguation for Turkish. Springer
International Publishing, Cham, 53–67. https://doi.org/10.1007/978-3-319-90165-7_3
Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. Towards The Systematic Reporting of The Energy and
Carbon Footprints of Machine Learning. Journal of Machine Learning Research 21, 248 (2020), 1–43.
Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, and Naoaki Okazaki. 2021. Joint Optimization of Tokenization and Downstream Model.
In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 244–255. https:
//doi.org/10.18653/v1/2021.findings-acl.21
Huggingface. 2021. Oscar Dataset Huggingface. https://huggingface.co/datasets/oscar Accessed: 2022-03-16.
Hour Kaing, Chenchen Ding, Masao Utiyama, Eiichiro Sumita, Sethserey Sam, Sopheap Seng, Katsuhito Sudoh, and Satoshi Nakamura. 2021. Towards
Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion. ACM Transactions on Asian and Low-Resource Language Information
Processing 20, 6 (2021). https://doi.org/10.1145/3464378
JS Kikstra, P Waidelich, J Rising, D Yumashev, C Hope, and CM Brierley. 2021. The Social Cost of Carbon Dioxide Under Climate-economy Feedbacks and
Temperature Variability. Environmental Research Letters 16, 9 (2021), 094037.
Taku Kudo. 2018. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In Proceedings of the
56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne,
Australia, 66–75. https://doi.org/10.18653/v1/P18-1007
Taku Kudo and John Richardson. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text
Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for
Computational Linguistics, Brussels, Belgium, 66–71. https://doi.org/10.18653/v1/D18-2012
Seth Kulick. 2011. Exploiting Separation of Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging. ACM Transactions on Asian
Language Information Processing 10, 1 (2011). https://doi.org/10.1145/1929908.1929912
Yachao Li, Jing Jiang, Jia Yangji, and Ning Ma. 2021. Finding Better Subwords for Tibetan Neural Machine Translation. ACM Transactions on Asian and
Low-Resource Language Information Processing 20, 2 (2021). https://doi.org/10.1145/3448216
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
Loodos. 2020. Turkish Language Models. https://github.com/Loodos/turkish-language-models Accessed: 2022-03-16.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. LA, USA.
Lalita Lowphansirikul, Charin Polpanumas, Nawat Jantrakulchai, and Sarana Nutanong. 2021. WangchanBERTa: Pretraining Transformer-based Thai
Language Models. arXiv preprint arXiv:2101.09635 (2021).
Alejandro Metke Jimenez, Kerry Raymond, and Ian MacColl. 2011. Information Extraction from Web Services: A Comparison of Tokenisation Algorithms.
In Proceedings of the 2nd International Workshop on Software Knowledge 2011, in conjunction with 3rd International Joint Conference on Knowledge
Discovery, Knowledge Engineering and Knowledge Management. Scitepress, 12–23.
Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International
Conference on Learning Representations, ICLR, Workshop Track Proceedings. Scottsdale, Arizona, USA.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, et al. 2019. Model Cards for Model Reporting. In
Proceedings of the Conference on Fairness, Accountability, and Transparency (Atlanta, GA, USA) (FAT* ’19). Association for Computing Machinery, New
York, NY, USA, 220–229. https://doi.org/10.1145/3287560.3287596


Anmol Nayak, Hariprasad Timmapathini, Karthikeyan Ponnalagu, and Vijendran Gopalan Venkoparao. 2020. Domain Adaptation Challenges of BERT in
Tokenization and Sub-word Representations of Out-of-Vocabulary words. In Proceedings of the First Workshop on Insights from Negative Results in NLP.
Association for Computational Linguistics, Online, 1–5. https://doi.org/10.18653/v1/2020.insights-1.1
Pedro Javier Ortiz Suárez, Benoit Sagot, and Laurent Romary. 2019. Asynchronous Pipelines for Processing Huge Corpora on Medium to Low Resource
Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache, Cardiff,
United Kingdom, 9–16. https://doi.org/10.14618/ids-pub-9021
Chanjun Park, Sugyeong Eo, Hyeonseok Moon, and Heuiseok Lim. 2021. Should We Find Another Model?: Improving Neural Machine Translation
Performance with ONE-Piece Tokenization Method without Model Modification. In Proceedings of the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. Association for Computational Linguistics, Online,
97–104. https://doi.org/10.18653/v1/2021.naacl-industry.13
Kyubyong Park, Joohong Lee, Seongbo Jang, and Dawoon Jung. 2020. An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks. In
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference
on Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 133–142.
Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-Dropout: Simple and Effective Subword Regularization. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1882–1892. https://doi.org/10.18653/v1/
2020.acl-main.170
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language Models Are Unsupervised Multitask Learners.
OpenAI blog 1, 8 (2019), 9.
Lance Ramshaw and Mitch Marcus. 1995. Text Chunking using Transformation-Based Learning. In Third Workshop on Very Large Corpora. https:
//aclanthology.org/W95-0107
Republic of Turkey Ministry of Environment, Urbanization and Climate Change. 2020. Primary Energy and GHG Emissions Coefficients of Electricity.
https://meslekihizmetler.csb.gov.tr/elektrik-enerjisinin-birincil-enerji-ve-sera-gazi-salimi-katsayilari-2021-yilindan-itibaren-kullanilmak-uzere-
guncellenmistir-duyuru-411795 Accessed: 2022-03-16.
Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean Voice Search. In 2012 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). 5149–5152. https://doi.org/10.1109/ICASSP.2012.6289079
Stefan Schweter. 2020. BERTurk - BERT Models for Turkish. https://doi.org/10.5281/zenodo.3770924
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany,
1715–1725. https://doi.org/10.18653/v1/P16-1162
Nakatani Shuyo. 2010. Language Detection Library for Java. http://code.google.com/p/language-detection/ Accessed: 2022-03-16.
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder
Representations from Transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing,
China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA, 1441–1450. https://doi.org/10.1145/3357384.3357895
Yi Tay, Vinh Q Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. 2021.
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. arXiv preprint arXiv:2106.12672 (2021).
Cagri Toraman, Fazli Can, and Seyit Koçberber. 2011. Developing A Text Categorization Template for Turkish News Portals. In 2011 International
Symposium on Innovations in Intelligent Systems and Applications. 379–383. https://doi.org/10.1109/INISTA.2011.5946096
Cagri Toraman, Furkan Şahinuç, and Eyup Halit Yilmaz. 2022. Large-Scale Hate Speech Detection with Cross-Domain Transfer. arXiv preprint
arXiv:2203.01111 (2022).
Gökhan Tür, Dilek Hakkani-Tür, and Kemal Oflazer. 2003. A Statistical Information Extraction System for Turkish. Natural Language Engineering 9, 2 (2003),
181–210. https://doi.org/10.1017/S135132490200284X
Mihaela Alexandra Vasiu and Rodica Potolea. 2020. Enhancing Tokenization by Embedding Romanian Language Specific Morphology. In 2020 IEEE 16th
International Conference on Intelligent Computer Communication and Processing (ICCP). 243–250. https://doi.org/10.1109/ICCP51029.2020.9266140
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you
need. In Advances in Neural Information Processing Systems. 5998–6008.
Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, and Lei Li. 2021. Vocabulary Learning via Optimal Transport for Neural Machine Translation. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 7361–7373. https://doi.org/10.18653/v1/2021.acl-long.571
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, et al. 2021. ByT5: Towards A Token-free Future with Pre-trained
Byte-to-byte Models. arXiv preprint arXiv:2105.13626 (2021).
Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. 2021. Pretrained Transformers for Text Ranking: BERT and Beyond. In Proceedings of the 14th ACM
International Conference on Web Search and Data Mining. 1154–1156.
Wenbo Zhang, Xiao Li, Yating Yang, and Rui Dong. 2021b. Pre-Training on Mixed Data for Low-Resource Neural Machine Translation. Information 12, 3
(2021), 133.
Xinsong Zhang, Pengshuai Li, and Hang Li. 2021a. AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, 421–435. https://doi.org/10.18653/v1/2021.findings-acl.37
