Pretraining Data and Tokenizer for Indic LLM

Abstract

This study presents pretraining data preparation and tokenizer training for an Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. The study focuses on developing high-quality data and optimizing tokenization of our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy and demonstrate that our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.

1 Introduction

In the ever-evolving realm of Natural Language Processing (NLP), the development of Large Language Models (LLMs) has seen a meteoric rise since the inception of transformers. The introduction of LLMs has ushered in a new era in NLP, where the boundaries of what machines can achieve with human language are constantly pushed to astonishing limits. OpenAI's ChatGPT and Google's BARD have been a pioneering force in redefining the landscape of language modeling performance, and they have also illuminated the vast societal implications inherent in these technological advancements. Alongside ChatGPT and BARD, various open-source and proprietary LLMs have showcased remarkable natural language understanding and generation. There have been public releases of pretrained LLMs, and the chat models of these pretrained LLMs show comparable results on a few benchmarks with proprietary LLM chat models, including ChatGPT, BARD, and Claude. LLMs such as Llama and Falcon have outlined their data refinement and pre-training steps in their respective technical reports. We draw inspiration from the meticulous analysis of the RefinedWeb dataset by Falcon in our data filtering and deduplication approaches.

Indic culture boasts linguistic diversity, encompassing Indo-Aryan, Dravidian, and Munda languages; Indo-Aryan and Dravidian languages constitute 96% of India's spoken languages. Despite this richness, most open language models lack Indic language support, hindering innovation due to limited high-quality data and complex tokenization challenges (AI, 2023). Notable corpora like EMILLE/CIIL, Wikipedia for Indian languages, the Samanantar corpus, and AI4Bharat-IndicNLP provide valuable resources (Kunchukuttan et al., 2020). EMILLE/CIIL spans 14 languages with 92 million words, Wikipedia for Indian languages is limited, the Samanantar corpus offers 49.7 million parallel sentences across 11 languages, and the AI4Bharat-IndicNLP corpus contains 2.7 billion words for 10 languages (Kunchukuttan et al., 2020). IndicCorp, with 8.8 billion tokens across 11 languages, supplements these linguistic resources (Kakwani et al., 2020). Despite these efforts, it is noteworthy that Hindi, the third most spoken language, does not rank among the top 20 languages in processed Common Crawl documents, highlighting the scarcity of India-specific data for open large language model training (Penedo et al., 2023).
In this study, we provide a technical report on clean Indic dataset preparation and tokenizer training for Indic LLM, India's own foundational model with Indic-rich context. The study presents a comprehensive analysis of available open-source and proprietary datasets and of our data refinement steps. We also devise a state-of-the-art Indic tokenizer through rigorous experimentation and validate its performance through a pretrained model.

2 Related Works

Large language models (LLMs) owe their remarkable learning capabilities to massive model sizes and extensive training datasets. At present, there are numerous foundational models spanning open-source and proprietary LLMs. Noteworthy open-source foundational LLMs include Llama 2 (Touvron et al., 2023b), Mistral 7B (Jiang et al., 2023), Falcon (Penedo et al., 2023), MPT (Team, 2023), and Bloom (Workshop et al., 2022), whereas proprietary foundational LLMs comprise GPT-4 (Achiam et al., 2023), LaMDA (Thoppilan et al., 2022), and others. Llama 2, developed by Meta AI in partnership with Microsoft, focuses on multilingual capabilities and is optimized for swift training and inference. MPT-7B by MosaicML and Mistral-7B by Mistral AI are 7-billion-parameter models released with efficient open-source training code, promoting transparency and ease of use; these models have showcased superiority over other open-source models in the 7B-20B range. Falcon-40B, developed by the Technology Innovation Institute (TII), has 40 billion parameters and is a causal decoder-only model trained on a causal language modeling task. It is trained on a large dataset and has demonstrated performance superior to GPT-3. BLOOM is the world's largest open multilingual language model, with 176 billion parameters. Generally, proprietary systems are more expensive and offer product solutions that can be tailored to very specific business needs, while open-source models usually offer more affordable and customizable options but may lack the performance level and specialization of proprietary LLMs.

Despite the widespread availability of LLMs for public exploration, the lack of transparency regarding training datasets, especially for state-of-the-art models, hinders research on addressing relevant biases. Furthermore, LLMs are known to generate text lacking sufficient grounding in knowledge sources, thus posing risks of misinformation and hallucination (Li et al., 2023). This challenge is exacerbated in multilingual learning scenarios, where datasets are often inadequately collected. Researchers have therefore been advancing the development of LLMs tailored to specific regional languages (Cui et al., 2023; Balachandran, 2023; Kunchukuttan et al., 2020). For the development of a robust multilingual LLM, two pivotal components are the presence of abundant multilingual data and a diverse vocabulary (Yuan et al., 2023). Current state-of-the-art LLMs provide only rigid multilingual support, owing to the small share of multilingual data in their pre-training corpora. For example, Llama models (Touvron et al., 2023a,b) leveraged a vast pre-training corpus with over 1.6 trillion tokens, but less than 4.5% of it is multilingual data spread over 20 different languages. This share is increased in the Llama 2 models, where the proportion of multilingual data rises to approximately 11% and the number of languages to around 26. CulturaX (Nguyen et al., 2023) is another multilingual dataset, with 6.3 trillion tokens across 167 languages, created through meticulous cleaning and deduplication steps to facilitate advancements in multilingual LLMs. However, out of its 167 languages, only 14 account for 90.38% of the data, leaving a very small pre-training corpus for developing an efficient and versatile LLM with contextual understanding of Indic languages.

To address the challenge of procuring massive datasets for LLM development, researchers have been relying on open-source datasets such as the web-crawled Common Crawl, Wikipedia, public-domain books spanning cultural and historical facets (Gao et al., 2020), Stack Exchange and GitHub archives, journal articles and educational resources (Lewkowycz et al., 2022), news archives, government and institutional legal repositories, and multimedia transcripts. In this study, we aim to exploit these data sources for developing an extensive Indic data corpus.
Moreover, these massive corpora require meticulous, rigorous data refinement and deduplication to ensure data quality while maintaining integrity. Recent LLMs have demonstrated robust work on data preparation and filtering; RefinedWeb (Penedo et al., 2023) has been pivotal in providing an insightful technical background in this context, significantly enhancing our understanding of the nuances involved. RefinedWeb applies filtering and deduplication techniques to the Common Crawl corpus, including threshold-based language filtering, URL filtering, line-wise correction filters, and document deduplication. MassiveText (Rae et al., 2021) defines rules for reducing noise in text documents by implementing extensive quality-filtering techniques. After rigorous filtering and deduplication steps, a large amount of noisy data is removed from the original corpora. It has been observed that open-source language datasets hold a high number of boilerplate texts and documents with similar context (Penedo et al., 2023; Lee et al., 2021). Consequently, deduplication is a crucial step for producing a high-quality pre-training corpus. Various deduplication algorithms have been established in the literature, spanning exact matching with suffix arrays (Manber and Myers, 1993), largest-substring matching, MinHash (Broder, 1997), SimHash (Charikar, 2002), and other fuzzy techniques designed to minimize memory usage while maximizing efficiency. In this work, we methodically incorporate data filtering and deduplication techniques, drawing inspiration from the research findings presented in RefinedWeb and MassiveText, and also propose our own innovative filters for data preprocessing.
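As a concrete illustration of the fuzzy deduplication techniques named above, the following is a minimal MinHash-with-LSH sketch built on the datasketch library. It is a simplified stand-in rather than the exact implementation used in our pipeline, and the shingle size and similarity threshold are illustrative choices.

    from datasketch import MinHash, MinHashLSH

    def minhash_signature(text, num_perm=128, shingle_size=5):
        # MinHash over character shingles; robust to small edits in a document.
        m = MinHash(num_perm=num_perm)
        for i in range(max(len(text) - shingle_size + 1, 1)):
            m.update(text[i:i + shingle_size].encode("utf-8"))
        return m

    def near_deduplicate(documents, threshold=0.8, num_perm=128):
        # Keep a document only if no previously kept document exceeds the
        # Jaccard-similarity threshold estimated from the MinHash signatures.
        lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        kept = []
        for idx, doc in enumerate(documents):
            signature = minhash_signature(doc, num_perm)
            if lsh.query(signature):
                continue  # near-duplicate of an already kept document
            lsh.insert(str(idx), signature)
            kept.append(doc)
        return kept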
Conventional tokenization approaches often involve a complex preprocessing pipeline and are language-specific. A simple, multilingual tokenizer is required for diverse natural language processing (NLP) tasks. Common tokenization techniques include BPE (Sennrich et al., 2015), WordPiece (Wu et al., 2016), SentencePiece (Kudo and Richardson, 2018), IndicBERT (Kakwani et al., 2020), and spaCy (AI, Accessed: 2024). Kunchukuttan et al. (2020) proposed the Indic NLP tokenizer (Kunchukuttan, 2020), an effective tokenization tool for Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, Telugu) as well as English. In addition, Stanford NLP (Al-Rfou and Skiena, 2013) proposed an Indic tokenizer (Manning et al., 2014) that supports English, Indo-Aryan, and Dravidian languages along with preprocessing functionalities. The SentencePiece tokenizer (Kudo and Richardson, 2018) enables a fully end-to-end, language-independent system by training its subword models directly from raw input sentences, eliminating the need for pre-tokenized word sequences. In this study, we experiment with the SentencePiece tokenizer and fine-tune it on our Indic-rich corpora.

In this study, we meticulously compile a data corpus from open-source and proprietary datasets for a comprehensive selection of 12 languages, including English: Assamese, Bengali, English, Kannada, Gujarati, Hindi, Marathi, Malayalam, Punjabi, Odia, Tamil, and Telugu. Our contributions in this paper are:
1. Development of high-quality multilingual Indic data.
2. A novel multilingual tokenizer training strategy.

3 Pre-training Data

3.1 Common Crawl

Common Crawl is an open repository that houses extensive web crawl data. Since its inception in 2008, the archive has amassed petabytes of data and continues to perform crawls almost monthly. The data, accessible via Common Crawl's public S3 bucket, is provided in three distinct archive formats: WAT, WET, and WARC files. The WAT (Web Archive Transformation) files contain the metadata of the crawl, including HTTP headers, elements from the HTML <head> (such as title, meta tags, and scripts), and links from the websites. WET (Web Extracted Text) files contain the text extracted directly from the HTML of the crawl. Meanwhile, WARC (Web ARChive) files comprise the complete crawl data, encompassing both the metadata and the full HTML response. While WET files could have served as a straightforward source of text data, our experimentation with various HTML scraping tools revealed that the text extracted in WET files often lacks cleanliness. Therefore, we opted to process the WARC files, which allowed us to scrape cleaner text from their comprehensive HTML archives.

To date, we have processed a total of 93 Common Crawl snapshots, with CC-MAIN-2023-50 being the latest snapshot at the time of writing. Our processing of the Common Crawl data is divided into three major steps:
1. Preprocessing: raw text extraction.
2. Postprocessing: language detection and the application of heuristic filters.
3. Deduplication: removal of duplicate content.
Figure 1: Common Crawl Dataset Processing Pipeline

Common Crawl Preprocessing Among the various data sources utilized during the pretraining phase, the Common Crawl dataset presented the most significant challenge, primarily due to its vast scale, which encompasses approximately 7 petabytes of data. We employed the warcio library in Python, which is adept at efficiently streaming WARC files instead of loading them entirely into memory. This approach is crucial given the substantial size of WARC files, often reaching gigabytes, which makes loading them completely into memory impractical. The streaming process necessitated a separate postprocessing pipeline for language identification and heuristic feature computation, which proved more efficient when conducted in batches rather than on individual records.

In our exploration of various open-source text extraction libraries, including unstructured.io and trafilatura, we found that trafilatura delivered the most effective results. While streaming through the warcio iterator, we extracted clean text, markdown text, and URLs from each web page using trafilatura.
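A minimal sketch of this streaming extraction step, assuming the warcio and trafilatura packages; the exact production code, including markdown extraction and error handling, is not reproduced here.

    import trafilatura
    from warcio.archiveiterator import ArchiveIterator

    def extract_records(warc_path):
        # Stream a WARC file record by record instead of loading it into memory.
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                html = record.content_stream().read()  # raw HTML payload (bytes)
                text = trafilatura.extract(html)       # main text without boilerplate, or None
                if text:
                    yield url, text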
Our analysis encompassed a total of 5,455,398 WARC files across 93 snapshots. We initially conducted a proof of concept (POC) for WARC preprocessing using a PySpark pipeline. In this pipeline, we read multiple WARC files as RDDs (Resilient Distributed Datasets) and processed them concurrently; each RDD element was a tuple containing the file path and the content of the WARC file as a byte array. We employed the same warcio library in a mapPartitions user-defined function (UDF) to process these files.
Additionally, we established a vanilla Python processing pipeline utilizing multiprocessing, which enabled parallel processing of multiple WARC files while fully utilizing the cores of a machine. The extraction of text using trafilatura from each WARC file took approximately 40 to 60 minutes, varying with the number of web pages archived in each file. Both the PySpark and the multiprocessing vanilla Python pipelines demonstrated similar processing times for the WARC files. However, we opted to process the files using our multiprocessing vanilla Python pipeline. This decision was driven by the significant reduction in compute costs it offered. Furthermore, this approach allowed us to design our pipeline more fault-tolerantly, leveraging a custom orchestration pipeline that ran on multiple AWS EC2 instances.
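A simplified sketch of the multiprocessing driver, assuming the extract_records helper shown earlier and locally downloaded WARC files; the EC2 orchestration layer is omitted and the file paths are placeholders.

    import csv
    import multiprocessing as mp

    def process_warc(warc_path):
        # Write one CSV per WARC file, mirroring the per-file outputs described above.
        out_path = warc_path + ".csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["url", "text"])
            for url, text in extract_records(warc_path):
                writer.writerow([url, text])
        return out_path

    if __name__ == "__main__":
        warc_paths = ["CC-MAIN-0001.warc.gz", "CC-MAIN-0002.warc.gz"]  # placeholder paths
        with mp.Pool(processes=mp.cpu_count()) as pool:
            for finished in pool.imap_unordered(process_warc, warc_paths):
                print("done:", finished)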
Common Crawl Postprocessing Following the preprocessing step for the WARC files, we obtain a CSV file corresponding to each WARC file. These files comprise columns detailing the scraping date, clean text, markdown text, and URL. As our focus is on developing Large Language Models (LLMs) for Indic languages, an initial step involves applying a language detection filter to segregate non-Indic languages, with the exception of English. In our exploration of various open-source language detection models, including AI4Bharat and FastText, we found that FastText provided the most accurate results. Currently, the text extracted from the Common Crawl dataset is classified into English and 11 specific Indic languages.
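A minimal sketch of this language-filtering step with fastText, assuming its publicly released lid.176.bin language-identification model; the language codes and confidence cut-off shown here are illustrative.

    import fasttext

    # lid.176.bin is fastText's public language-identification model (176 languages).
    lid_model = fasttext.load_model("lid.176.bin")
    INDIC_PLUS_EN = {"en", "hi", "bn", "ta", "te", "ml", "mr", "kn", "gu", "pa", "or", "as"}

    def keep_document(text, threshold=0.6):
        # fastText's predict() rejects newlines, so flatten the document first.
        labels, scores = lid_model.predict(text.replace("\n", " "), k=1)
        lang = labels[0].replace("__label__", "")
        return lang in INDIC_PLUS_EN and float(scores[0]) >= threshold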
After this language filtration, approximately 50% of the initial data remains. Subsequently, we compute heuristic features such as token count, mean sentence length, mean word length, symbol-to-word ratio, perplexity (ppl) score, fraction of duplicate lines, fraction of characters in duplicate lines, fraction of characters in the most common 2- to 11-grams, fraction of lines ending with an ellipsis, and fraction of lines starting with a bullet point. Through exploratory data analysis conducted for each language, we determined that applying these filters within the 0-90th percentile range effectively eliminates unclean and gibberish documents. Additionally, we examined the impact of applying filters solely on mean word length, mean sentence length, language threshold, and symbol-to-word ratio; the outcomes were similar to those obtained when applying the full range of heuristic features. Table 1 presents the number of documents and the token counts after the application of these filters.
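The sketch below computes a few of the heuristic features listed above and applies a per-feature 90th-percentile cut-off. It is illustrative only and omits the perplexity, duplicate-line, and n-gram features.

    import re
    import numpy as np

    def heuristic_features(doc):
        words = doc.split()
        lines = [l for l in doc.splitlines() if l.strip()]
        sentences = [s for s in re.split(r"[.!?\u0964]", doc) if s.strip()]  # \u0964 is the Devanagari danda
        n_words = max(len(words), 1)
        return {
            "token_count": len(words),
            "mean_word_length": sum(len(w) for w in words) / n_words,
            "mean_sentence_length": n_words / max(len(sentences), 1),
            "symbol_to_word_ratio": sum(doc.count(s) for s in ("#", "...", "\u2026")) / n_words,
            "frac_lines_ending_ellipsis": sum(l.rstrip().endswith(("...", "\u2026")) for l in lines) / max(len(lines), 1),
            "frac_lines_starting_bullet": sum(l.lstrip().startswith(("-", "*", "\u2022")) for l in lines) / max(len(lines), 1),
        }

    def percentile_filter(docs, feature, upper_percentile=90):
        # Keep documents whose feature value lies within the 0-90th percentile range.
        values = np.array([heuristic_features(d)[feature] for d in docs])
        cutoff = np.percentile(values, upper_percentile)
        return [d for d, v in zip(docs, values) if v <= cutoff]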
Language | Lang(0.1) + Basic filter | Lang(0.6) + Basic filter | Dedup + Lang(0.6) + Basic filter | Dedup + Lang(0.6) + All filters
Hi | 108 | 85 | 54 | 51.2
Bn | 49.5 | 34 | 18.6 | 16.3
Ta | 33.6 | 22.3 | 6.2 | 5.1
Ml | 14.3 | 9.5 | 3.2 | 2.7
Mr | 11.2 | 8.1 | 2.7 | 2.4
Te | 10.5 | 4 | 1.8 | 1.6
Kn | 6.2 | 4.2 | 1.4 | 1.1
Gu | 6.1 | 4.5 | 1.5 | 1.3
Pa | 5.1 | 2.7 | 0.88 | 0.69
Or | 1.2 | 0.9 | 0.4 | 0.3
As | 0.8 | 0.6 | 0.01 | 0.01

Common Crawl Deduplication As noted on the Common Crawl website, each snapshot of their web crawl typically encompasses approximately 3 billion web pages, of which roughly 2 billion have been crawled in earlier snapshots. This results in an average of about 66% duplicate content per snapshot; when considering all snapshots cumulatively, the proportion of duplicate content is substantially higher. Consequently, a deduplication process is crucial: it not only ensures high data quality but also significantly reduces the computational resources required.
frameworks. Our goal in this extraction process is to identify bounding boxes of layouts in an article and then detect the relevant text blocks for the respective languages. Initially, we experiment with layoutparser, an open-source Python package widely used for document image analysis tasks. LayoutParser provides a rich repository of deep learning models for layout detection as well as a set of unified APIs for using them, along with layout data structures whose carefully designed APIs are optimized for document image analysis. In addition, we train models on Indic data comprising various handwritten documents and newspapers. In this step, we leverage Detectron2-based deep learning models from the layoutparser library to detect text snippets from the layouts of an article. This approach also detects tables, which are redundant for the text corpus. After rigorous experimentation, we resolved to fine-tune a Mask R-CNN based model for our extraction process.
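A short sketch of layout detection with layoutparser's Detectron2 interface. The publicly available PubLayNet Mask R-CNN weights and label map shown here are placeholders for the newspaper model that we fine-tune on our own annotated classes.

    import cv2
    import layoutparser as lp

    # Public PubLayNet Mask R-CNN weights are used here only for illustration.
    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.7],
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )

    image = cv2.imread("newspaper_page.jpg")[..., ::-1]  # convert BGR to RGB
    layout = model.detect(image)
    text_blocks = [block for block in layout if block.type == "Text"]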
Annotation All the segmented blocks of the newspaper images are labelled into four classes, namely headings, text, images, and non-text, using Label Studio. The text class focuses mainly on the article block, which is what we intend to extract. The headings class contains all the main headings and subheadings in the newspaper image. The non-text class takes care of tables, summaries, and quotes.

Model fine-tuning The Mask R-CNN based model predicts a class label, a bounding-box offset, and an object mask for each Region of Interest (RoI). It produces accurate, fine-grained masks that capture shape and contour details better than bounding boxes alone, and it handles overlapping and occluded objects well.

Model evaluation The performance of the object detection and localization algorithm is evaluated using Average Precision (AP) and mean Average Precision (mAP). AP is calculated with the help of several other metrics, such as Intersection over Union (IoU), the confusion matrix (TP, FP, FN), precision, and recall. IoU quantifies the closeness of two bounding boxes (ground truth and prediction); it is a value between 0 and 1, calculated as the ratio between the area of intersection and the area of the union of the two bounding boxes. Average Precision is the area under the precision-recall (PR) curve, summarizing the PR curve into one scalar value; it is high when both precision and recall are high across a range of confidence thresholds, and low when either of them is low, and its range is between 0 and 1. AP can be calculated for each class, and the mean Average Precision is obtained by averaging AP across all the classes under consideration.
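For concreteness, the IoU computation described above amounts to the following small function over two boxes given in (x1, y1, x2, y2) form.

    def iou(box_a, box_b):
        # Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2).
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
        area_a = (ax2 - ax1) * (ay2 - ay1)
        area_b = (bx2 - bx1) * (by2 - by1)
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0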
Model Inference The inference pipeline takes a single newspaper image and predicts bounding boxes, along with their classes, over the entire image. The bounding boxes of the heading, non-text, and image classes are masked out of the image, and for each text-class bounding box, Tesseract-based OCR is used to extract the text.
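A simplified sketch of this inference step; it assumes layout blocks carrying the annotation classes described above with layoutparser-style coordinates, and uses pytesseract for the per-block OCR.

    import cv2
    import pytesseract

    def extract_article_text(image, layout, ocr_lang="hin"):
        # Mask heading, image, and non-text regions, then OCR each text block.
        masked = image.copy()
        for block in layout:
            if block.type in {"heading", "image", "non-text"}:
                x1, y1, x2, y2 = map(int, block.coordinates)
                cv2.rectangle(masked, (x1, y1), (x2, y2), (255, 255, 255), thickness=-1)
        texts = []
        for block in layout:
            if block.type == "text":
                x1, y1, x2, y2 = map(int, block.coordinates)
                crop = masked[y1:y2, x1:x2]
                texts.append(pytesseract.image_to_string(crop, lang=ocr_lang))
        return texts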
3.3 Books

The web-crawled data from Common Crawl has a limited distribution of Indic-language-specific content, and the distribution of deduplicated content is smaller still. This is a significant challenge when we want to perform LLM pre-training with Indic languages. In order to increase the number of tokens as well as the breadth of Indic-specific content, we leverage open-source PDFs that mostly contain books, periodicals, magazines, and financial reports. The PDFs are downloaded using a pipeline that spans the Indic languages. These PDFs come in broadly two formats: image-based or with text embedded in the PDF. We built a pipeline that detects whether a page is image-based and then runs the appropriate extraction pipeline. Tesseract OCR is leveraged to extract text from PDF documents with embedded images. A challenge here is detecting the script so that the Tesseract OCR engine runs with the appropriate language; we overcome this with a pipeline step that detects the script before extracting the text. The process is then scaled up with multiprocessing and across multiple nodes.
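A condensed sketch of this two-branch extraction, assuming pypdf for detecting an embedded text layer, pdf2image for rasterising scanned pages, and Tesseract's orientation-and-script detection (OSD) to pick the language pack; the script-to-language mapping is illustrative.

    import pytesseract
    from pdf2image import convert_from_path
    from pypdf import PdfReader

    # Illustrative mapping from the script reported by Tesseract OSD to a language pack.
    SCRIPT_TO_LANG = {"Devanagari": "hin", "Bengali": "ben", "Tamil": "tam", "Latin": "eng"}

    def extract_pdf_text(pdf_path):
        reader = PdfReader(pdf_path)
        embedded = "\n".join((page.extract_text() or "") for page in reader.pages)
        if embedded.strip():
            return embedded                            # text layer present, no OCR needed
        pages = convert_from_path(pdf_path, dpi=300)   # scanned PDF: rasterise pages
        osd = pytesseract.image_to_osd(pages[0])       # includes a line such as "Script: Devanagari"
        script = next((line.split(":", 1)[1].strip()
                       for line in osd.splitlines() if line.startswith("Script:")), "Latin")
        lang = SCRIPT_TO_LANG.get(script, "eng")
        return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)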
4 Tokenizer

Tokenization is a preprocessing step in natural language processing (NLP) that facilitates machine comprehension and interpretation of human language by breaking text down into fundamental units such as words, characters, and subwords. Among the various tokenization tools, one widely employed approach is SentencePiece, which sets itself apart by training subword models directly from raw sentence data. This method offers a fully end-to-end, language-agnostic system, eliminating the need for pre-tokenized word sequences.
In our study, we adopted Byte Pair Encoding (BPE) as the tokenization strategy. We assessed the quality of tokenization using two key metrics: the token-to-word ratio and the exact score, which measures the proportion of correctly tokenized units out of the total tokens.

Language | 70k | 85k | 100k
en | 1.52 | 1.49 | 1.46
hi | 1.38 | 1.36 | 1.35
mr | 1.83 | 1.83 | 1.79
te | 2.58 | 2.45 | 2.39
ta | 2.51 | 2.41 | 2.30
as | 2.06 | 2.03 | 2.00
gu | 2.05 | 1.92 | 1.90
kn | 2.51 | 2.32 | 2.29
ml | 3.01 | 2.98 | 2.86
od | 1.94 | 1.89 | 1.85
pn | 1.58 | 1.66 | 1.63

Table 3: Token-to-Word Ratio for Different Vocabulary Sizes

4.1 Tokenizer Experiments

Corpus Size For training the tokenizer, we utilized sample data from various sources such as Common Crawl, Wikipedia, and books. This sampled data constitutes the corpus employed during tokenizer training, and it is sampled in proportion to the availability of data from the different sources. The vocabulary size of a tokenizer refers to the number of unique tokens or subword units that the tokenizer can produce or handle. We trained the tokenizer with different corpus sizes and vocabulary sizes. The results in the corpus and vocabulary tables indicate that increasing the corpus size does not significantly improve the performance of the tokenizer, whereas increasing the vocabulary size significantly improves the token-to-word ratio. This finding is logical because increasing the corpus size does not necessarily enhance tokenizer performance once a critical mass of major words has been covered: if the vocabulary size remains constant, the same or similar tokens will be generated, and the token-to-word ratio will remain unchanged even with an increased corpus size. Thus, beyond a certain point, expanding the corpus size yields diminishing returns in terms of tokenizer improvement. In particular, as presented in the corpus table, a tokenizer trained on a sampled corpus of size 225 million achieves a token-to-word ratio similar to that of a tokenizer trained on a sampled corpus of size 12 billion.

Vocabulary Size We also experimented with various vocabulary sizes (Table 3). Increasing the vocabulary size allows more eligible words or sub-words to be formed as tokens, thereby improving the tokenizer's quality, as indicated by the improved token-to-word ratio. However, this improvement comes with a trade-off: a tokenizer with a larger vocabulary has reduced inference speed compared to one with a smaller vocabulary.

Character Coverage We also experimented with different character coverage levels during tokenizer training. Setting character coverage to 1 resulted in the inclusion of numerous gibberish characters, such as emojis and special characters, in the tokenizer vocabulary: despite having low frequencies, these characters were included due to the high character coverage. To address this, we tested various character coverage settings and, as presented in the character coverage table, found a character coverage of 0.997 to be optimal. This setting ensures that the characters included in the Hindi tokenizer vocabulary closely match the typical character set of the Hindi language, which consists of approximately 50 characters (36 consonants and 12 vowels).
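As a concrete reference for these settings, the following SentencePiece training call uses BPE with a 100k vocabulary and 0.997 character coverage, matching the values discussed above; the corpus and model file names and the sentence cap are placeholders.

    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="indic_sampled_corpus.txt",   # one sentence per line, sampled per source
        model_prefix="indic_bpe_100k",
        model_type="bpe",
        vocab_size=100_000,
        character_coverage=0.997,           # drops extremely rare or gibberish characters
        input_sentence_size=10_000_000,     # illustrative cap on sentences read from the corpus
        shuffle_input_sentence=True,
    )

    sp = spm.SentencePieceProcessor(model_file="indic_bpe_100k.model")
    print(sp.encode("भारत एक विविधतापूर्ण देश है", out_type=str))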
Language | GPT-4o | Indic
en | 1.33 | 1.48
hi | 1.62 | 1.39
mr | 2.53 | 1.80
te | 3.15 | 2.29
ta | 3.16 | 2.30
as | 2.70 | 1.99
gu | 2.28 | 1.91
kn | 3.03 | 2.31
ml | 3.37 | 2.71
od | 6.39 | 1.80
pn | 2.68 | 1.56

Table 4: Token-to-word ratios for GPT-4o and our Indic 100k tokenizer on the AI4Bharat Sangraha corpus
4.2 Tokenizer Training

Initially, we trained a tokenizer with the optimal vocabulary size, corpus size, and character coverage determined from our previous experiments. Despite sampling a dataset focused on Indic content, we observed a significant number of false positives, which introduced characters from other languages, such as Chinese and French, into our Indic LLMs. To address this issue and create a clean Indic tokenizer, we implemented a series of steps specifically targeting Indic words.

First, we trained a dummy tokenizer using the sampled data from the various sources. With the assistance of linguistic experts, we manually removed gibberish characters and characters from other languages from the dummy tokenizer's vocabulary. Next, we encoded and decoded the sampled dataset using this dummy tokenizer with the byte-fallback flag set to false. This process converted all gibberish words and non-Indic words identified by the linguists into UNK tokens.
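A rough sketch of this cleaning pass. Rather than reproducing the exact encode-decode procedure with byte fallback disabled, it approximates the effect by mapping linguist-flagged pieces of the dummy tokenizer to the UNK piece before the corpus is re-assembled; the file names are placeholders.

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="dummy_tokenizer.model")
    # Pieces flagged by linguists (non-Indic scripts, gibberish); illustrative file name.
    flagged = set(open("flagged_pieces.txt", encoding="utf-8").read().split())
    unk_piece = sp.id_to_piece(sp.unk_id())

    def clean_line(line):
        pieces = sp.encode(line, out_type=str)
        kept = [unk_piece if p in flagged else p for p in pieces]
        # In SentencePiece pieces, '▁' marks a word boundary.
        return "".join(kept).replace("▁", " ").strip()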
With this cleaned dataset, we then trained our final tokenizer. The refined tokenizer retained the optimal vocabulary size, corpus size, and character coverage, with minimal inclusion of gibberish and foreign-language words. We compared the performance of our tokenizer with the OpenAI Tiktoken tokenizer across the 11 Indic languages. The results, shown in Table 4, clearly indicate that our tokenizer's metrics significantly outperform those of the Tiktoken tokenizer.
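The token-to-word ratio used throughout this comparison can be computed as sketched below; the snippet assumes our SentencePiece model file (placeholder path) and uses tiktoken's o200k_base encoding, the one used by GPT-4o, as the Tiktoken baseline.

    import sentencepiece as spm
    import tiktoken

    def token_to_word_ratio(encode, texts):
        tokens = sum(len(encode(t)) for t in texts)
        words = sum(len(t.split()) for t in texts)
        return tokens / max(words, 1)

    indic_sp = spm.SentencePieceProcessor(model_file="indic_bpe_100k.model")  # placeholder path
    tiktoken_enc = tiktoken.get_encoding("o200k_base")

    hindi_texts = ["भारत एक विविधतापूर्ण देश है"]  # in practice, a held-out evaluation corpus
    print("Indic tokenizer  :", token_to_word_ratio(indic_sp.encode, hindi_texts))
    print("Tiktoken (GPT-4o):", token_to_word_ratio(tiktoken_enc.encode, hindi_texts))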
5 Conclusion

In conclusion, our meticulous data acquisition and custom preprocessing pipeline have effectively curated a high-quality multilingual dataset, significantly enhancing the performance of our Indic large language models. Through rigorous filtering and deduplication processes, we ensured the elimination of redundant and low-quality content, optimizing the data for training. Our novel multilingual tokenizer training strategy demonstrated superior token-to-word ratios for Indic languages compared to the state-of-the-art OpenAI Tiktoken tokenizer. The experiments underscore the importance of tailored preprocessing and tokenizer design, paving the way for more accurate and efficient language models in Indic-rich multilingual contexts.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Explosion AI. Accessed: 2024. spaCy tokenizer API. https://spacy.io/api/tokenizer.

Sarvam AI. 2023. OpenHathi series: An approach to build bilingual LLMs frugally.

Rami Al-Rfou and Steven Skiena. 2013. SpeedRead: A fast named entity recognition pipeline. arXiv preprint arXiv:1301.2857.

Abhinand Balachandran. 2023. Tamil-LLaMA: A new Tamil language model based on LLaMA 2. arXiv preprint arXiv:2311.05845.

Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997, pages 21–29. IEEE.

Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388.

Yiming Cui, Ziqing Yang, and Xin Yao. 2023. Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv preprint arXiv:2304.08177.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, N. C. Gokul, Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Anoop Kunchukuttan. 2020. The IndicNLP library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf.
Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar, et al. 2020. AI4Bharat-IndicNLP corpus: Monolingual corpora and word embeddings for Indic languages. arXiv preprint arXiv:2005.00085.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857.

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464.

Udi Manber and Gene Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

MosaicML NLP Team. 2023. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. Accessed: 2023-05-05.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Fei Yuan, Shuai Yuan, Zhiyong Wu, and Lei Li. 2023. How multilingual is multilingual LLM? arXiv preprint arXiv:2311.09071.