Tokenization
Ack: Ujjwal MK, Teaching Assistant
UE22AM343BB5: Large Language Models and Their Applications
Tokenization - Overview
In NLP, most of the data that we handle consists of raw text. Raw text is not understood directly by machine learning models, so we use tokenizers to convert it into sequences of tokens (and token IDs) that models can process.
Tokenization - Types
Word Tokenization: the idea is to split the raw text into words, based on rules such as whitespace, punctuation, etc.
Disadvantages:
1. Lack of context understanding: for example, the words "dog" and "dogs" are represented with different IDs even though they are very similar and close in meaning.
2. The vocabulary becomes very large: since every unique word in the corpus gets its own ID, the number of unique words, and hence the vocabulary size, can become very high, leading to huge model sizes.
(Note: we can cap the vocabulary, e.g. learn only the 10,000 most frequent words. Every other word is then treated as an unknown word, which leads to a loss of information; see the sketch below.)
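As a rough sketch of this capping behaviour (the helper names and the toy corpus below are made up for illustration):

```python
from collections import Counter

def build_vocab(corpus, max_size=10_000):
    """Keep only the most frequent words; everything else will map to [UNK]."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    kept = [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(["[UNK]"] + kept)}

def tokenize(text, vocab):
    """Word-level tokenization: unknown words lose their identity entirely."""
    return [vocab.get(w, vocab["[UNK]"]) for w in text.lower().split()]

corpus = ["the dog runs", "the dogs run"]
vocab = build_vocab(corpus, max_size=4)      # tiny cap, just to force an [UNK]
print(tokenize("the dog jumps", vocab))      # "jumps" is out of vocabulary -> [UNK] id
```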
In character-based tokenization, we split the text into individual characters rather than into words.
Disadvantages:
1. Each individual token holds less context/meaning. For example, the token 'L' carries much less information than the token 'Let's'.
2. The token sequences become very long, since every character is tokenized separately instead of the text being represented with fewer, larger tokens (see the sketch below).
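A minimal sketch contrasting the two granularities (the sample sentence is arbitrary):

```python
# Character-level vs. word-level token counts for the same input.
text = "Let's tokenize this sentence."

char_tokens = list(text)     # every single character becomes a token
word_tokens = text.split()   # naive whitespace word tokenization

print(len(char_tokens), char_tokens[:6])   # far more tokens at the character level
print(len(word_tokens), word_tokens)       # only 4 tokens at the word level
```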
The idea is to find a middle ground: far fewer out-of-vocabulary words, while keeping tokens meaningful and sequence lengths shorter than with character-level tokenization.
Image Credits: https://www.youtube.com/watch?v=zHvTiHr506c&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=16
Byte-Pair Encoding (BPE) - Training Process
1. Initial Vocabulary:
a. Extract unique characters from words in the corpus.
b. Example: For "dog", "log", "fog", "dogs", "logs", the initial vocabulary is
["d", "f", "g", "l", "o", "s"].
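BPE training then repeatedly counts adjacent symbol pairs and merges the most frequent one until the vocabulary reaches a target size. A minimal sketch of this loop, assuming an unweighted list of words (not the exact implementation of any particular library):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Minimal BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    splits = {w: list(w) for w in words}                 # start from single characters
    vocab = sorted({ch for w in words for ch in w})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in splits.values():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                 # most frequent adjacent pair
        merges.append(best)
        vocab.append("".join(best))
        for w, symbols in splits.items():                # apply the new merge everywhere
            i, merged = 0, []
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[w] = merged
    return vocab, merges

vocab, merges = train_bpe(["dog", "log", "fog", "dogs", "logs"], num_merges=3)
print(merges)   # e.g. [('o', 'g'), ('d', 'og'), ('l', 'og')] with this toy corpus
```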
Tokenization Process
To tokenize new inputs:
1. Normalize and pre-tokenize the input.
2. Split words into characters.
3. Apply learned merge rules sequentially.
Example:
● Merge Rules: ("o", "g") → "og", ("d", "og") → "dog", ("l", "og") → "log"
● Tokenizing "fog" results in ["f", "og"].
● Tokenizing "dogs" results in ["dog", "s"].
● Tokenizing "frog" results in ["f", "[UNK]", "og"], since the character "r" is not in the vocabulary.
Credit: https://www.youtube.com/watch?v=HEikzVL-lZU
Key Benefits
● Efficient vocabulary management.
● Handles rare and unknown words by breaking them into subwords.
● Widely adopted in Transformer-based NLP models.
WordPiece - Training Process
1. Initial Vocabulary:
a. Start with a vocabulary of special tokens (like [UNK], [CLS], [SEP], etc.) and an
alphabet of all characters in the corpus.
b. Each character is considered separately and prefixed with “##” if it is not the first
character in a word.
c. Example: For the word “word”, the initial split would be: w ##o ##r ##d
3. Example:
● Initial corpus: (“cat”, 12), (“bat”, 8), (“rat”, 10), (“cats”, 5), (“bats”, 3)
● Initial splits: (“c” “##a” “##t”, 12), (“b” “##a” “##t”, 8), (“r” “##a” “##t”, 10), (“c” “##a”
“##t” “##s”, 5), (“b” “##a” “##t” “##s”, 3)
● Initial vocabulary: [“b”, “c”, “r”, “##a”, “##t”, “##s”]
Merge Selection:
● The most frequent pair is (”##a”, “##t”) (appears 30 times).
● Its score is 30/864 = 0.0347, due to the high frequency of "##a".
● The pair ("##t", "##s") has the next highest score (8/300).
● Vocabulary becomes: [“b”, “c”, “r”, “##s”, “##at”]
● Corpus becomes: (“c” “##at”, 12), (“b” “##at”, 8), (“r” “##at”, 10), (“c” “##at” “##s”,
5), (“b” “##at” “##s”, 3)
4. Iterative Merging:
The process continues until the vocabulary reaches the desired size (a minimal sketch of this scoring-and-merging loop is shown below).
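A minimal sketch of one scoring-and-merging step. It uses the toy corpus from the Hugging Face course chapter on WordPiece (reference 5), where the effect is easy to see: the most frequent pair is not necessarily the one merged, because the score divides by the frequencies of the individual pieces.

```python
from collections import Counter

def wordpiece_train_step(word_freqs):
    """One WordPiece step: split words into ##-prefixed pieces, score each adjacent
    pair as freq(pair) / (freq(first) * freq(second)), and pick the best pair."""
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}
    piece_freq, pair_freq = Counter(), Counter()
    for w, f in word_freqs.items():
        pieces = splits[w]
        for p in pieces:
            piece_freq[p] += f
        for a, b in zip(pieces, pieces[1:]):
            pair_freq[(a, b)] += f
    scores = {p: pair_freq[p] / (piece_freq[p[0]] * piece_freq[p[1]]) for p in pair_freq}
    best = max(scores, key=scores.get)
    merged = best[0] + best[1].lstrip("#")               # e.g. ("##g", "##s") -> "##gs"
    return best, scores[best], merged

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
# The most frequent pair is ("##u", "##g"), but ("##g", "##s") scores higher
# (5 / (20 * 5) = 0.05) because "##g" and "##s" are individually rarer.
print(wordpiece_train_step(word_freqs))
```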
Tokenization Process
1. Key Differences from BPE:
a. WordPiece does not save merge rules; it only saves the final vocabulary.
b. It uses the longest matching subword strategy to tokenize a word.
c. In cases where no valid subword exists, the word is tokenized as [UNK].
3. Unknown Tokens:
● If no valid subword exists for any part of the word, the entire word is tokenized as [UNK].
● Example:
○ For "mug": no prefix of the word (not even "m") is in the vocabulary, so the output is ["[UNK]"].
Summary
WordPiece focuses on efficient vocabulary learning by scoring pairs and
prioritising less frequent individual subwords. Its longest matching subword
tokenization strategy ensures robust handling of words, but it is stricter than BPE in
dealing with out-of-vocabulary segments, resulting in more [UNK] tokens for
unseen words.
Credits: https://www.youtube.com/watch?v=qpv6ms_t_1A
Unigram Tokenization
The objective is to find the subword sequence for the input sequence X that maximizes the (log-)likelihood of the sequence; a minimal sketch is shown below.
Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
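A minimal sketch of this objective as a Viterbi-style dynamic program; the subword log-probabilities below are made up for illustration and would normally come from the trained unigram language model:

```python
import math

def unigram_tokenize(text, logp):
    """Pick the segmentation of `text` that maximizes the summed subword
    log-probabilities, i.e. the (log-)likelihood under a unigram model."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)      # (best score, backpointer) for each prefix
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    tokens, i = [], n                         # walk the backpointers to recover the pieces
    while i > 0:
        start = best[i][1]
        if start is None:
            return ["[UNK]"]                  # no segmentation possible with this vocabulary
        tokens.append(text[start:i])
        i = start
    return tokens[::-1]

# Hypothetical subword log-probabilities (not from a real trained model).
logp = {"un": -2.0, "read": -3.0, "able": -2.5, "readable": -6.5, "u": -5.0, "n": -5.0}
print(unigram_tokenize("unreadable", logp))   # ['un', 'read', 'able']
```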
Text preprocessing is the first and most crucial step in Natural Language Processing (NLP). It involves cleaning and preparing raw text for analysis or for machine learning models, which ensures uniformity and improves model performance.
Key Steps Covered:
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Stemming and Lemmatization
5. Part-of-Speech (POS) Tagging
6. Named Entity Recognition (NER)
3. Stopword Removal: Removes common words (e.g., “is”, “the”) that add
little meaning.
● Libraries: nltk.corpus.stopwords, spaCy.
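A small sketch using NLTK's built-in English stopword list (the stopword corpus must be downloaded once; the sample sentence is made up):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)        # one-time download of the stopword list

stop_words = set(stopwords.words("english"))
tokens = "the cat is sitting on the mat".split()

filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['cat', 'sitting', 'mat']
```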
4. Lemmatization
● Reduces words to their base or root form.
● Tools: WordNetLemmatizer (nltk), spaCy.
● Example: "running" → "run"
5. Stemming
● Cuts off word suffixes to find the root form.
● Example: "jumps", "jumping" → "jump"
6. POS Tagging
● Assigns grammatical roles to words (noun, verb, etc.).
● Tools: nltk.pos_tag, spaCy.
● Example: "NLP is fun!" → [('NLP', 'NN'), ('is', 'VB'), ('fun', 'JJ')]
References
1. https://www.youtube.com/watch?v=tOMjTCO0htA
2. https://www.youtube.com/watch?v=HEikzVL-lZU
3. https://huggingface.co/learn/nlp-course/en/chapter6/5
4. https://www.youtube.com/watch?v=qpv6ms_t_1A
5. https://huggingface.co/learn/nlp-course/en/chapter6/6
6. https://www.cse.iitm.ac.in/~miteshk/llm-course.html
UE22AM343BB5
Large Language Models and Their Applications
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu
Ack: Ujjwal MK, Teaching Assistant