Pretraining Data and Tokenizer for Indic LLM

Abstract

This study presents pretraining data preparation and tokenizer training for an Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. The study focuses on developing high-quality data and optimizing tokenization of our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy and demonstrate that our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.

1 Introduction

In the ever-evolving realm of Natural Language Processing (NLP), the development of Large Language Models (LLMs) has seen a meteoric rise since the inception of transformers. The introduction of LLMs has ushered in a new era in NLP, where the boundaries of what machines can achieve with human language are constantly pushed to astonishing limits. OpenAI's ChatGPT and Google's BARD have been a pioneering force in redefining the landscape of language modeling performance, and they have also illuminated the vast societal implications inherent in these technological advancements. Alongside ChatGPT and BARD, various open-source and proprietary LLMs have showcased remarkable natural language understanding and generation. There have been public releases of pretrained LLMs, and the chat models of these pretrained LLMs show comparable results on a few benchmarks with proprietary LLM chat models, including ChatGPT, BARD, and Claude. LLMs such as Llama and Falcon have outlined their data refinement and pre-training steps in their respective technical reports. We draw inspiration from the meticulous analysis of the RefinedWeb dataset by Falcon in our data filtering and deduplication approaches.

Indic culture boasts linguistic diversity, encompassing Indo-Aryan, Dravidian, and Munda languages; Indo-Aryan and Dravidian languages constitute 96% of India's spoken languages. Despite this richness, most open language models lack Indic language support, hindering innovation due to limited high-quality data and complex tokenization challenges (AI, 2023). Notable corpora like EMILLE/CIIL, Wikipedia for Indian languages, the Samanantar corpus, and AI4Bharat-IndicNLP provide valuable resources (Kunchukuttan et al., 2020). EMILLE/CIIL spans 14 languages with 92 million words, Wikipedia for Indian languages is limited, the Samanantar corpus offers 49.7 million parallel sentences across 11 languages, and the AI4Bharat-IndicNLP corpus contains 2.7 billion words for 10 languages (Kunchukuttan et al., 2020). IndicCorp, with 8.8 billion tokens across 11 languages, supplements these linguistic resources (Kakwani et al., 2020). Despite these efforts, it is noteworthy that Hindi, the third most spoken language, does not rank among the top 20 languages in processed Common Crawl documents, highlighting the scarcity of India-specific data for open large language model training (Penedo et al., 2023).
In this study, we provide a technical report on clean Indic dataset preparation and tokenizer training for Indic LLM, India's own foundational model with Indic-rich context. The study presents a comprehensive analysis of available open-source and proprietary datasets and of our data refinement steps. We also devise a state-of-the-art Indic tokenizer through rigorous experimentation and validate its performance through a pretrained model.

2 Related Works

Large language models (LLMs) owe their remarkable learning capabilities to massive model sizes and extensive training datasets. At present, there are numerous foundational models spanning open-source and proprietary LLMs. Noteworthy open-source foundational LLMs include Llama 2 (Touvron et al., 2023b), Mistral 7B (Jiang et al., 2023), Falcon (Penedo et al., 2023), MPT (Team, 2023), and Bloom (Workshop et al., 2022), whereas proprietary foundational LLMs comprise GPT-4 (Achiam et al., 2023), LaMDA (Thoppilan et al., 2022), and others. Llama 2, developed by Meta AI in partnership with Microsoft, focuses on multilingual capabilities and is optimized for swift training and inference. MPT-7B by MosaicML and Mistral-7B by Mistral AI are 7-billion-parameter models released with efficient open-source training code, promoting transparency and ease of use; these models have showcased superiority over other open-source models in the 7B-20B range. Falcon-40B, developed by the Technology Innovation Institute (TII), has 40 billion parameters and is a causal decoder-only model trained on a causal language modeling task. It is trained on a large dataset and has demonstrated performance superior to GPT-3. BLOOM is the world's largest open multilingual language model, with 176 billion parameters. Generally, proprietary systems are more expensive and offer product solutions that can be tailored to very specific business needs, while open-source models usually offer more affordable and customizable options but may lack the performance level and specialization of proprietary LLMs.

Despite the widespread availability of LLMs for public exploration, the lack of transparency regarding training datasets, especially for state-of-the-art models, hinders research on addressing relevant biases. Furthermore, LLMs are known to generate text lacking sufficient grounding in knowledge sources, thus posing risks of misinformation and hallucination (Li et al., 2023). This challenge is exacerbated in multilingual learning scenarios, where datasets are often inadequately collected. Researchers have therefore been advancing the development of LLMs tailored to specific regional languages (Cui et al., 2023; Balachandran, 2023; Kunchukuttan et al., 2020). For the development of a robust multilingual LLM, two pivotal components are the presence of abundant multilingual data and a diverse vocabulary (Yuan et al., 2023). Current state-of-the-art LLMs provide only rigid multilingual support, owing to the small share of multilingual data in their pre-training corpora. For example, Llama models (Touvron et al., 2023a,b) leveraged a vast pre-training corpus with over 1.6 trillion tokens, but less than 4.5% of it is multilingual data spread over 20 different languages. This share is increased in the Llama 2 models, where the proportion of multilingual data rises to approximately 11% and the number of languages to around 26. CulturaX (Nguyen et al., 2023) is another multilingual dataset, with 6.3 trillion tokens across 167 languages, created through meticulous cleaning and deduplication steps to facilitate advancements in multilingual LLMs. However, out of its 167 languages, only 14 account for 90.38% of the data, leaving a very small pre-training corpus for developing an efficient and versatile LLM with contextual understanding of Indic languages.

To address the challenge of procuring massive datasets for LLM development, researchers have been relying on open-source datasets such as the web-crawled Common Crawl, Wikipedia, public-domain books spanning cultural and historical facets (Gao et al., 2020), Stack Exchange and GitHub archives, journal articles and educational resources (Lewkowycz et al., 2022), news archives, government and institutional legal repositories, and multimedia transcripts. In this study, we aim to exploit these data sources for developing an extensive Indic data corpus.
Moreover, these massive corpora require meticulous, rigorous data refinement and deduplication to ensure data quality while maintaining integrity. Recent LLMs have demonstrated robust work on data preparation and filtering; RefinedWeb (Penedo et al., 2023) has been pivotal in providing an insightful technical background in this context, significantly enhancing our understanding of the nuances involved. RefinedWeb applies filtering and deduplication techniques to the Common Crawl corpus, including threshold-based language filtering, URL filtering, line-wise correction filters, and document deduplication. MassiveText (Rae et al., 2021) defines rules for reducing noise in text documents by implementing extensive quality-filtering techniques. After rigorous filtering and deduplication steps, a large amount of noisy data is removed from the original corpora. It has been observed that open-source language datasets hold a high number of boilerplate texts and documents with similar context (Penedo et al., 2023; Lee et al., 2021). Consequently, deduplication is a crucial step for producing a high-quality pre-training corpus. Various deduplication algorithms have been established in the literature, spanning exact matching with suffix arrays (Manber and Myers, 1993), largest-substring matching, MinHash (Broder, 1997), SimHash (Charikar, 2002), and other fuzzy techniques designed to minimize memory usage while maximizing efficiency. In this work, we methodically incorporate data filtering and deduplication techniques, drawing inspiration from the research findings presented in RefinedWeb and MassiveText, and also propose our own innovative filters for data preprocessing.
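As a concrete illustration of the fuzzy deduplication techniques named above, the following is a minimal MinHash-with-LSH sketch built on the datasketch library. It is a simplified stand-in rather than the exact implementation used in our pipeline, and the shingle size and similarity threshold are illustrative choices.

    from datasketch import MinHash, MinHashLSH

    def minhash_signature(text, num_perm=128, shingle_size=5):
        # MinHash over character shingles; robust to small edits in a document.
        m = MinHash(num_perm=num_perm)
        for i in range(max(len(text) - shingle_size + 1, 1)):
            m.update(text[i:i + shingle_size].encode("utf-8"))
        return m

    def near_deduplicate(documents, threshold=0.8, num_perm=128):
        # Keep a document only if no previously kept document exceeds the
        # Jaccard-similarity threshold estimated from the MinHash signatures.
        lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        kept = []
        for idx, doc in enumerate(documents):
            signature = minhash_signature(doc, num_perm)
            if lsh.query(signature):
                continue  # near-duplicate of an already kept document
            lsh.insert(str(idx), signature)
            kept.append(doc)
        return kept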
Conventional tokenization approaches often involve a complex preprocessing pipeline and are language-specific. A simple, multilingual tokenizer is required for diverse natural language processing (NLP) tasks. Common tokenization techniques include BPE (Sennrich et al., 2015), WordPiece (Wu et al., 2016), SentencePiece (Kudo and Richardson, 2018), IndicBERT (Kakwani et al., 2020), and spaCy (AI, Accessed: 2024). Kunchukuttan et al. (2020) proposed the Indic NLP tokenizer (Kunchukuttan, 2020), an effective tokenization tool for Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, Telugu) as well as English. In addition, Stanford NLP (Al-Rfou and Skiena, 2013) proposed an Indic tokenizer (Manning et al., 2014) that supports English, Indo-Aryan, and Dravidian languages along with preprocessing functionalities. The SentencePiece tokenizer (Kudo and Richardson, 2018) enables a fully end-to-end, language-independent system by training its subword models directly from raw input sentences, eliminating the need for pre-tokenized word sequences. In this study, we experiment with the SentencePiece tokenizer and fine-tune it on our Indic-rich corpora.

In this study, we meticulously compile a data corpus from open-source and proprietary datasets for a comprehensive selection of 12 languages, including English: Assamese, Bengali, English, Kannada, Gujarati, Hindi, Marathi, Malayalam, Punjabi, Odia, Tamil, and Telugu. Our contributions in this paper are:
1. Development of high-quality multilingual Indic data.
2. A novel multilingual tokenizer training strategy.

3 Pre-training Data

3.1 Common Crawl

Common Crawl is an open repository that houses extensive web crawl data. Since its inception in 2008, the archive has amassed petabytes of data and continues to perform crawls almost monthly. The data, accessible via Common Crawl's public S3 bucket, is provided in three distinct archive formats: WAT, WET, and WARC files. The WAT (Web Archive Transformation) files contain the metadata of the crawl, including HTTP headers, elements from the HTML <head> (such as title, meta tags, and scripts), and links from the websites. WET (Web Extracted Text) files contain the text extracted directly from the HTML of the crawl. Meanwhile, WARC (Web ARChive) files comprise the complete crawl data, encompassing both the metadata and the full HTML response. While WET files could have served as a straightforward source of text data, our experimentation with various HTML scraping tools revealed that the text extracted in WET files often lacks cleanliness. Therefore, we opted to process the WARC files, which allowed us to scrape cleaner text from their comprehensive HTML archives.

To date, we have processed a total of 93 Common Crawl snapshots, with CC-MAIN-2023-50 being the latest snapshot at the time of writing. Our processing of the Common Crawl data is divided into three major steps:
1. Preprocessing: raw text extraction.
2. Postprocessing: language detection and the application of heuristic filters.
3. Deduplication: removal of duplicate content.
Figure 1: Common Crawl Dataset Processing Pipeline

Common Crawl Preprocessing Among the various data sources utilized during the pretraining phase, the Common Crawl dataset presented the most significant challenge, primarily due to its vast scale, which encompasses approximately 7 petabytes of data. We employed the warcio library in Python, which is adept at efficiently streaming WARC files instead of loading them entirely into memory. This approach is crucial given the substantial size of WARC files, often reaching gigabytes, which makes loading them completely into memory impractical. The streaming process necessitated a separate postprocessing pipeline for language identification and heuristic feature computation, which proved more efficient when conducted in batches rather than on individual records.

In our exploration of various open-source text extraction libraries, including unstructured.io and trafilatura, we found that trafilatura delivered the most effective results. While streaming through the warcio iterator, we extracted clean text, markdown text, and URLs from each web page using trafilatura.
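A minimal sketch of this streaming extraction step, assuming the warcio and trafilatura packages; the exact production code, including markdown extraction and error handling, is not reproduced here.

    import trafilatura
    from warcio.archiveiterator import ArchiveIterator

    def extract_records(warc_path):
        # Stream a WARC file record by record instead of loading it into memory.
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                html = record.content_stream().read()  # raw HTML payload (bytes)
                text = trafilatura.extract(html)       # main text without boilerplate, or None
                if text:
                    yield url, text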
Our analysis encompassed a total of 5,455,398 WARC files across 93 snapshots. We initially conducted a proof of concept (POC) for WARC preprocessing using a PySpark pipeline. In this pipeline, we read multiple WARC files as RDDs (Resilient Distributed Datasets) and processed them concurrently; each RDD element was a tuple containing the file path and the content of the WARC file as a byte array. We employed the same warcio library in a mapPartitions user-defined function (UDF) to process these files.
Additionally, we established a vanilla Python processing pipeline utilizing multiprocessing, which enabled parallel processing of multiple WARC files while fully utilizing the cores of a machine. The extraction of text using trafilatura from each WARC file took approximately 40 to 60 minutes, varying with the number of web pages archived in each file. Both the PySpark and the multiprocessing vanilla Python pipelines demonstrated similar processing times for the WARC files. However, we opted to process the files using our multiprocessing vanilla Python pipeline. This decision was driven by the significant reduction in compute costs it offered. Furthermore, this approach allowed us to design our pipeline more fault-tolerantly, leveraging a custom orchestration pipeline that ran on multiple AWS EC2 instances.
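A simplified sketch of the multiprocessing driver, assuming the extract_records helper shown earlier and locally downloaded WARC files; the EC2 orchestration layer is omitted and the file paths are placeholders.

    import csv
    import multiprocessing as mp

    def process_warc(warc_path):
        # Write one CSV per WARC file, mirroring the per-file outputs described above.
        out_path = warc_path + ".csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["url", "text"])
            for url, text in extract_records(warc_path):
                writer.writerow([url, text])
        return out_path

    if __name__ == "__main__":
        warc_paths = ["CC-MAIN-0001.warc.gz", "CC-MAIN-0002.warc.gz"]  # placeholder paths
        with mp.Pool(processes=mp.cpu_count()) as pool:
            for finished in pool.imap_unordered(process_warc, warc_paths):
                print("done:", finished)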
Common Crawl Postprocessing Following the preprocessing step for the WARC files, we obtain a CSV file corresponding to each WARC file. These files comprise columns detailing the scraping date, clean text, markdown text, and URL. As our focus is on developing Large Language Models (LLMs) for Indic languages, an initial step involves applying a language detection filter to segregate non-Indic languages, with the exception of English. In our exploration of various open-source language detection models, including AI4Bharat and FastText, we found that FastText provided the most accurate results. Currently, the text extracted from the Common Crawl dataset is classified into English and 11 specific Indic languages.
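A minimal sketch of this language-filtering step with fastText, assuming its publicly released lid.176.bin language-identification model; the language codes and confidence cut-off shown here are illustrative.

    import fasttext

    # lid.176.bin is fastText's public language-identification model (176 languages).
    lid_model = fasttext.load_model("lid.176.bin")
    INDIC_PLUS_EN = {"en", "hi", "bn", "ta", "te", "ml", "mr", "kn", "gu", "pa", "or", "as"}

    def keep_document(text, threshold=0.6):
        # fastText's predict() rejects newlines, so flatten the document first.
        labels, scores = lid_model.predict(text.replace("\n", " "), k=1)
        lang = labels[0].replace("__label__", "")
        return lang in INDIC_PLUS_EN and float(scores[0]) >= threshold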
After this language filtration, approximately 50% of the initial data remains. Subsequently, we compute heuristic features such as token count, mean sentence length, mean word length, symbol-to-word ratio, perplexity (ppl) score, fraction of duplicate lines, fraction of characters in duplicate lines, fraction of characters in the most common 2- to 11-grams, fraction of lines ending with an ellipsis, and fraction of lines starting with a bullet point. Through exploratory data analysis conducted for each language, we determined that applying these filters within the 0-90th percentile range effectively eliminates unclean and gibberish documents. Additionally, we examined the impact of applying filters solely on mean word length, mean sentence length, language threshold, and symbol-to-word ratio; the outcomes were similar to those obtained when applying the full range of heuristic features. Table 1 presents the number of documents and the token counts after the application of these filters.
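The sketch below computes a few of the heuristic features listed above and applies a per-feature 90th-percentile cut-off. It is illustrative only and omits the perplexity, duplicate-line, and n-gram features.

    import re
    import numpy as np

    def heuristic_features(doc):
        words = doc.split()
        lines = [l for l in doc.splitlines() if l.strip()]
        sentences = [s for s in re.split(r"[.!?\u0964]", doc) if s.strip()]  # \u0964 is the Devanagari danda
        n_words = max(len(words), 1)
        return {
            "token_count": len(words),
            "mean_word_length": sum(len(w) for w in words) / n_words,
            "mean_sentence_length": n_words / max(len(sentences), 1),
            "symbol_to_word_ratio": sum(doc.count(s) for s in ("#", "...", "\u2026")) / n_words,
            "frac_lines_ending_ellipsis": sum(l.rstrip().endswith(("...", "\u2026")) for l in lines) / max(len(lines), 1),
            "frac_lines_starting_bullet": sum(l.lstrip().startswith(("-", "*", "\u2022")) for l in lines) / max(len(lines), 1),
        }

    def percentile_filter(docs, feature, upper_percentile=90):
        # Keep documents whose feature value lies within the 0-90th percentile range.
        values = np.array([heuristic_features(d)[feature] for d in docs])
        cutoff = np.percentile(values, upper_percentile)
        return [d for d, v in zip(docs, values) if v <= cutoff]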
Language | Lang(0.1) + Basic filter | Lang(0.6) + Basic filter | Dedup + Lang(0.6) + Basic filter | Dedup + Lang(0.6) + All filters
Hi | 108 | 85 | 54 | 51.2
Bn | 49.5 | 34 | 18.6 | 16.3
Ta | 33.6 | 22.3 | 6.2 | 5.1
Ml | 14.3 | 9.5 | 3.2 | 2.7
Mr | 11.2 | 8.1 | 2.7 | 2.4
Te | 10.5 | 4 | 1.8 | 1.6
Kn | 6.2 | 4.2 | 1.4 | 1.1
Gu | 6.1 | 4.5 | 1.5 | 1.3
Pa | 5.1 | 2.7 | 0.88 | 0.69
Or | 1.2 | 0.9 | 0.4 | 0.3
As | 0.8 | 0.6 | 0.01 | 0.01

Common Crawl Deduplication As noted on the Common Crawl website, each snapshot of their web crawl typically encompasses approximately 3 billion web pages, of which roughly 2 billion have been crawled in earlier snapshots. This results in an average of about 66% duplicate content per snapshot; when considering all snapshots cumulatively, the proportion of duplicate content is substantially higher. Consequently, a deduplication process is crucial: it not only ensures high data quality but also significantly reduces the computational resources required.
frameworks. Our goal in this extraction process is to identify bounding boxes of layouts in an article and then detect the relevant text blocks for the respective languages. Initially, we experiment with layoutparser, an open-source Python package widely used for document image analysis tasks. LayoutParser provides a rich repository of deep learning models for layout detection as well as a set of unified APIs for using them, along with layout data structures whose carefully designed APIs are optimized for document image analysis. In addition, we train models on Indic data comprising various handwritten documents and newspapers. In this step, we leverage Detectron2-based deep learning models from the layoutparser library to detect text snippets from the layouts of an article. This approach also detects tables, which are redundant for the text corpus. After rigorous experimentation, we resolved to fine-tune a Mask R-CNN based model for our extraction process.
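A short sketch of layout detection with layoutparser's Detectron2 interface. The publicly available PubLayNet Mask R-CNN weights and label map shown here are placeholders for the newspaper model that we fine-tune on our own annotated classes.

    import cv2
    import layoutparser as lp

    # Public PubLayNet Mask R-CNN weights are used here only for illustration.
    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.7],
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )

    image = cv2.imread("newspaper_page.jpg")[..., ::-1]  # convert BGR to RGB
    layout = model.detect(image)
    text_blocks = [block for block in layout if block.type == "Text"]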
Annotation All the segmented blocks of the newspaper images are labelled into four classes, namely headings, text, images, and non-text, using Label Studio. The text class focuses mainly on the article block, which is what we intend to extract. The headings class contains all the main headings and subheadings in the newspaper image. The non-text class takes care of tables, summaries, and quotes.

Model fine-tuning The Mask R-CNN based model predicts a class label, a bounding-box offset, and an object mask for each Region of Interest (RoI). It produces accurate, fine-grained masks that capture shape and contour details better than bounding boxes alone, and it handles overlapping and occluded objects well.

Model evaluation The performance of the object detection and localization algorithm is evaluated using Average Precision (AP) and mean Average Precision (mAP). AP is calculated with the help of several other metrics, such as Intersection over Union (IoU), the confusion matrix (TP, FP, FN), precision, and recall. IoU quantifies the closeness of two bounding boxes (ground truth and prediction); it is a value between 0 and 1, calculated as the ratio between the area of intersection and the area of the union of the two bounding boxes. Average Precision is the area under the precision-recall (PR) curve, summarizing the PR curve into one scalar value; it is high when both precision and recall are high across a range of confidence thresholds, and low when either of them is low, and its range is between 0 and 1. AP can be calculated for each class, and the mean Average Precision is obtained by averaging AP across all the classes under consideration.
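For concreteness, the IoU computation described above amounts to the following small function over two boxes given in (x1, y1, x2, y2) form.

    def iou(box_a, box_b):
        # Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2).
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
        area_a = (ax2 - ax1) * (ay2 - ay1)
        area_b = (bx2 - bx1) * (by2 - by1)
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0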
Model Inference The inference pipeline takes a single newspaper image and predicts bounding boxes, along with their classes, over the entire image. The bounding boxes of the heading, non-text, and image classes are masked out of the image, and for each text-class bounding box, Tesseract-based OCR is used to extract the text.
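A simplified sketch of this inference step; it assumes layout blocks carrying the annotation classes described above with layoutparser-style coordinates, and uses pytesseract for the per-block OCR.

    import cv2
    import pytesseract

    def extract_article_text(image, layout, ocr_lang="hin"):
        # Mask heading, image, and non-text regions, then OCR each text block.
        masked = image.copy()
        for block in layout:
            if block.type in {"heading", "image", "non-text"}:
                x1, y1, x2, y2 = map(int, block.coordinates)
                cv2.rectangle(masked, (x1, y1), (x2, y2), (255, 255, 255), thickness=-1)
        texts = []
        for block in layout:
            if block.type == "text":
                x1, y1, x2, y2 = map(int, block.coordinates)
                crop = masked[y1:y2, x1:x2]
                texts.append(pytesseract.image_to_string(crop, lang=ocr_lang))
        return texts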
3.3 Books

The web-crawled data from Common Crawl has a limited distribution of Indic-language-specific content, and the distribution of deduplicated content is smaller still. This is a significant challenge when we want to perform LLM pre-training with Indic languages. In order to increase the number of tokens as well as the breadth of Indic-specific content, we leverage open-source PDFs that mostly contain books, periodicals, magazines, and financial reports. The PDFs are downloaded using a pipeline that spans the Indic languages. These PDFs come in broadly two formats: image-based or with text embedded in the PDF. We built a pipeline that detects whether a page is image-based and then runs the appropriate extraction pipeline. Tesseract OCR is leveraged to extract text from PDF documents with embedded images. A challenge here is detecting the script so that the Tesseract OCR engine runs with the appropriate language; we overcome this with a pipeline step that detects the script before extracting the text. The process is then scaled up with multiprocessing and across multiple nodes.
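A condensed sketch of this two-branch extraction, assuming pypdf for detecting an embedded text layer, pdf2image for rasterising scanned pages, and Tesseract's orientation-and-script detection (OSD) to pick the language pack; the script-to-language mapping is illustrative.

    import pytesseract
    from pdf2image import convert_from_path
    from pypdf import PdfReader

    # Illustrative mapping from the script reported by Tesseract OSD to a language pack.
    SCRIPT_TO_LANG = {"Devanagari": "hin", "Bengali": "ben", "Tamil": "tam", "Latin": "eng"}

    def extract_pdf_text(pdf_path):
        reader = PdfReader(pdf_path)
        embedded = "\n".join((page.extract_text() or "") for page in reader.pages)
        if embedded.strip():
            return embedded                            # text layer present, no OCR needed
        pages = convert_from_path(pdf_path, dpi=300)   # scanned PDF: rasterise pages
        osd = pytesseract.image_to_osd(pages[0])       # includes a line such as "Script: Devanagari"
        script = next((line.split(":", 1)[1].strip()
                       for line in osd.splitlines() if line.startswith("Script:")), "Latin")
        lang = SCRIPT_TO_LANG.get(script, "eng")
        return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)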
4 Tokenizer

Tokenization is a preprocessing step in natural language processing (NLP) that facilitates machine comprehension and interpretation of human language by breaking text down into fundamental units such as words, characters, and subwords. Among the various tokenization tools, one widely employed approach is SentencePiece, which sets itself apart by training subword models directly from raw sentence data. This method offers a fully end-to-end, language-agnostic system, eliminating the need for pre-tokenized word sequences.
In our study, we adopted Byte Pair Encoding (BPE) as the tokenization strategy. We assessed the quality of tokenization using two key metrics: the token-to-word ratio and the exact score, which measures the proportion of correctly tokenized units out of the total tokens.

Language | 70k | 85k | 100k
en | 1.52 | 1.49 | 1.46
hi | 1.38 | 1.36 | 1.35
mr | 1.83 | 1.83 | 1.79
te | 2.58 | 2.45 | 2.39
ta | 2.51 | 2.41 | 2.30
as | 2.06 | 2.03 | 2.00
gu | 2.05 | 1.92 | 1.90
kn | 2.51 | 2.32 | 2.29
ml | 3.01 | 2.98 | 2.86
od | 1.94 | 1.89 | 1.85
pn | 1.58 | 1.66 | 1.63

Table 3: Token-to-Word Ratio for Different Vocabulary Sizes

4.1 Tokenizer Experiments

Corpus Size For training the tokenizer, we utilized sample data from various sources such as Common Crawl, Wikipedia, and books. This sampled data constitutes the corpus employed during tokenizer training, and it is sampled in proportion to the availability of data from the different sources. The vocabulary size of a tokenizer refers to the number of unique tokens or subword units that the tokenizer can produce or handle. We trained the tokenizer with different corpus sizes and vocabulary sizes. The results in the corpus and vocabulary tables indicate that increasing the corpus size does not significantly improve the performance of the tokenizer, whereas increasing the vocabulary size significantly improves the token-to-word ratio. This finding is logical because increasing the corpus size does not necessarily enhance tokenizer performance once a critical mass of major words has been covered: if the vocabulary size remains constant, the same or similar tokens will be generated, and the token-to-word ratio will remain unchanged even with an increased corpus size. Thus, beyond a certain point, expanding the corpus size yields diminishing returns in terms of tokenizer improvement. In particular, as presented in the corpus table, a tokenizer trained on a sampled corpus of size 225 million achieves a token-to-word ratio similar to that of a tokenizer trained on a sampled corpus of size 12 billion.

Vocabulary Size We also experimented with various vocabulary sizes (Table 3). Increasing the vocabulary size allows more eligible words or sub-words to be formed as tokens, thereby improving the tokenizer's quality, as indicated by the improved token-to-word ratio. However, this improvement comes with a trade-off: a tokenizer with a larger vocabulary has reduced inference speed compared to one with a smaller vocabulary.

Character Coverage We also experimented with different character coverage levels during tokenizer training. Setting character coverage to 1 resulted in the inclusion of numerous gibberish characters, such as emojis and special characters, in the tokenizer vocabulary: despite having low frequencies, these characters were included due to the high character coverage. To address this, we tested various character coverage settings and, as presented in the character coverage table, found a character coverage of 0.997 to be optimal. This setting ensures that the characters included in the Hindi tokenizer vocabulary closely match the typical character set of the Hindi language, which consists of approximately 50 characters (36 consonants and 12 vowels).
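As a concrete reference for these settings, the following SentencePiece training call uses BPE with a 100k vocabulary and 0.997 character coverage, matching the values discussed above; the corpus and model file names and the sentence cap are placeholders.

    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="indic_sampled_corpus.txt",   # one sentence per line, sampled per source
        model_prefix="indic_bpe_100k",
        model_type="bpe",
        vocab_size=100_000,
        character_coverage=0.997,           # drops extremely rare or gibberish characters
        input_sentence_size=10_000_000,     # illustrative cap on sentences read from the corpus
        shuffle_input_sentence=True,
    )

    sp = spm.SentencePieceProcessor(model_file="indic_bpe_100k.model")
    print(sp.encode("भारत एक विविधतापूर्ण देश है", out_type=str))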
Language | GPT-4o | Indic
en | 1.33 | 1.48
hi | 1.62 | 1.39
mr | 2.53 | 1.80
te | 3.15 | 2.29
ta | 3.16 | 2.30
as | 2.70 | 1.99
gu | 2.28 | 1.91
kn | 3.03 | 2.31
ml | 3.37 | 2.71
od | 6.39 | 1.80
pn | 2.68 | 1.56

Table 4: Token-to-word ratios for GPT-4o and our Indic 100k tokenizer on the AI4Bharat Sangraha corpus
4.2 Tokenizer Training

Initially, we trained a tokenizer with the optimal vocabulary size, corpus size, and character coverage determined from our previous experiments. Despite sampling a dataset focused on Indic content, we observed a significant number of false positives, which introduced characters from other languages, such as Chinese and French, into our Indic LLMs. To address this issue and create a clean Indic tokenizer, we implemented a series of steps specifically targeting Indic words.

First, we trained a dummy tokenizer using the sampled data from the various sources. With the assistance of linguistic experts, we manually removed gibberish characters and characters from other languages from the dummy tokenizer's vocabulary. Next, we encoded and decoded the sampled dataset using this dummy tokenizer with the byte-fallback flag set to false. This process converted all gibberish words and non-Indic words identified by the linguists into UNK tokens.
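A rough sketch of this cleaning pass. Rather than reproducing the exact encode-decode procedure with byte fallback disabled, it approximates the effect by mapping linguist-flagged pieces of the dummy tokenizer to the UNK piece before the corpus is re-assembled; the file names are placeholders.

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="dummy_tokenizer.model")
    # Pieces flagged by linguists (non-Indic scripts, gibberish); illustrative file name.
    flagged = set(open("flagged_pieces.txt", encoding="utf-8").read().split())
    unk_piece = sp.id_to_piece(sp.unk_id())

    def clean_line(line):
        pieces = sp.encode(line, out_type=str)
        kept = [unk_piece if p in flagged else p for p in pieces]
        # In SentencePiece pieces, '▁' marks a word boundary.
        return "".join(kept).replace("▁", " ").strip()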
With this cleaned dataset, we then trained our final tokenizer. The refined tokenizer retained the optimal vocabulary size, corpus size, and character coverage, with minimal inclusion of gibberish and foreign-language words. We compared the performance of our tokenizer with the OpenAI Tiktoken tokenizer across the 11 Indic languages. The results, shown in Table 4, clearly indicate that our tokenizer's metrics significantly outperform those of the Tiktoken tokenizer.
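The token-to-word ratio used throughout this comparison can be computed as sketched below; the snippet assumes our SentencePiece model file (placeholder path) and uses tiktoken's o200k_base encoding, the one used by GPT-4o, as the Tiktoken baseline.

    import sentencepiece as spm
    import tiktoken

    def token_to_word_ratio(encode, texts):
        tokens = sum(len(encode(t)) for t in texts)
        words = sum(len(t.split()) for t in texts)
        return tokens / max(words, 1)

    indic_sp = spm.SentencePieceProcessor(model_file="indic_bpe_100k.model")  # placeholder path
    tiktoken_enc = tiktoken.get_encoding("o200k_base")

    hindi_texts = ["भारत एक विविधतापूर्ण देश है"]  # in practice, a held-out evaluation corpus
    print("Indic tokenizer  :", token_to_word_ratio(indic_sp.encode, hindi_texts))
    print("Tiktoken (GPT-4o):", token_to_word_ratio(tiktoken_enc.encode, hindi_texts))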
5 Conclusion

In conclusion, our meticulous data acquisition and custom preprocessing pipeline have effectively curated a high-quality multilingual dataset, significantly enhancing the performance of our Indic large language models. Through rigorous filtering and deduplication processes, we ensured the elimination of redundant and low-quality content, optimizing the data for training. Our novel multilingual tokenizer training strategy demonstrated superior token-to-word ratios for Indic languages compared to the state-of-the-art OpenAI Tiktoken tokenizer. The experiments underscore the importance of tailored preprocessing and tokenizer design, paving the way for more accurate and efficient language models in Indic-rich multilingual contexts.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Explosion AI. Accessed: 2024. spaCy tokenizer API. https://spacy.io/api/tokenizer.

Sarvam AI. 2023. OpenHathi series: An approach to build bilingual LLMs frugally.

Rami Al-Rfou and Steven Skiena. 2013. SpeedRead: A fast named entity recognition pipeline. arXiv preprint arXiv:1301.2857.

Abhinand Balachandran. 2023. Tamil-LLaMA: A new Tamil language model based on LLaMA 2. arXiv preprint arXiv:2311.05845.

Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997, pages 21–29. IEEE.

Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388.

Yiming Cui, Ziqing Yang, and Xin Yao. 2023. Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv preprint arXiv:2304.08177.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, N. C. Gokul, Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Anoop Kunchukuttan. 2020. The IndicNLP library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf.
Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar, et al. 2020. AI4Bharat-IndicNLP corpus: Monolingual corpora and word embeddings for Indic languages. arXiv preprint arXiv:2005.00085.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. 2022. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857.

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464.

Udi Manber and Gene Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

MosaicML NLP Team. 2023. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. Accessed: 2023-05-05.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Fei Yuan, Shuai Yuan, Zhiyong Wu, and Lei Li. 2023. How multilingual is multilingual LLM? arXiv preprint arXiv:2311.09071.