
UE22AM343BB5

Large Language Models and Their Applications


Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu

Ack: Ujjwal MK,
Teaching Assistant
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Overview

In NLP, most of the data we handle consists of raw text. Raw text is not
understood directly by machine learning models; hence we use tokenizers to
convert it into tokens the model can process.

Image Credit: https://www.youtube.com/watch?v=VFp38yj8h3A&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=13


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Types

● There are various types of tokenization techniques:


○ Word-based Tokenization
○ Character-based Tokenization
○ SubWord-based Tokenization
■ Byte-Pair Encoding
■ WordPiece Encoding
■ SentencePiece Encoding
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Word-based Tokenization

In this tokenization technique, the idea is to split the raw text into
words, based on rules such as whitespace, punctuation, etc.

Image Credit https://www.youtube.com/watch?v=nhJxYji1aho&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=14
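As a rough illustration (not the exact rules used in the video above), a word-based tokenizer can be sketched in a few lines of Python, splitting on whitespace and then also separating punctuation:

import re

text = "Let's tokenize this sentence, shall we?"

# Naive word-based tokenization: split on whitespace only
whitespace_tokens = text.split()
# -> ["Let's", 'tokenize', 'this', 'sentence,', 'shall', 'we?']

# A slightly better rule set: keep runs of word characters together
# and treat each punctuation mark as its own token
word_tokens = re.findall(r"\w+|[^\w\s]", text)
# -> ['Let', "'", 's', 'tokenize', 'this', 'sentence', ',', 'shall', 'we', '?']

print(whitespace_tokens)
print(word_tokens)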


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Word-based Tokenization

There are limitations:

1. Lack of context understanding: for example, the words “dog” and “dogs” get
different IDs even though they are very similar and close in meaning.
2. The vocabulary becomes too large: with this type of tokenization, the number of
unique words in the corpus can end up very high, leading to a large vocabulary
and therefore large model sizes.

(Note: we can cap the vocabulary at a fixed number of words, for example
learning only 10,000 words. Every word outside those 10,000 is then treated as
an unknown word, which leads to loss of information.)
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Character-based Tokenization

We now split the text into individual characters rather than words.

The advantage is that the vocabulary is small and remains adequate for the model.

For example, English has a vocabulary of around 170,000 words. We can avoid
that huge number by working with characters instead, of which there are only
around 256.

Image Credits: https://www.youtube.com/watch?v=ssLq_EK2jLE&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=15
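A character-based tokenizer is almost trivial to sketch; the following illustrative snippet simply treats every character as a token and assigns it an ID:

text = "Let's tokenize"

# Character-based tokenization: every character (letters, spaces,
# punctuation) becomes its own token
char_tokens = list(text)
# -> ['L', 'e', 't', "'", 's', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e']

# A tiny vocabulary: one ID per distinct character seen in the text
vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}
ids = [vocab[ch] for ch in char_tokens]

print(char_tokens)
print(ids)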


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Character-based Tokenization

Advantages of using character-based tokenization over word-based
tokenization:
1. The vocabulary size is limited and does not blow up the model size.
2. Fewer out-of-vocabulary tokens: since the vocabulary contains all the
characters present in the corpus, character-based tokenization rarely
produces unknown tokens.

Disadvantages:
1. Each token holds much less context/meaning. For example, ‘L’ carries far
less information than ‘Let’s’.
2. The token sequences become very long, since every single character is a
separate token, instead of the text being represented with fewer, larger tokens.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SubWord-based Tokenization

This method lies in between the word-based and character-based tokenization
techniques.

The idea is to find a middle ground: fewer out-of-vocabulary words, while
preserving meaning in the tokens and keeping sequence lengths smaller.
Image Credits: https://www.youtube.com/watch?v=zHvTiHr506c&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=16
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SubWord-based Tokenization

This algorithm relies on the following rules: frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.

Image Credit: https://www.youtube.com/watch?v=zHvTiHr506c&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=16


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding

BPE is a text compression algorithm adapted for tokenization in NLP, used


in models like GPT, RoBERTa, and BART. It generates subword tokens by
iteratively merging character pairs based on their frequency in a corpus,
making it robust for handling unknown words and reducing vocabulary size.

Training Process
1. Initial Vocabulary:
a. Extract unique characters from words in the corpus.
b. Example: for “dog”, “log”, “fog”, “dogs”, “logs”, the initial vocabulary is
[“d”, “f”, “g”, “l”, “o”, “s”].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding


2. Iterative Merging:
● Compute pair frequencies across all words.
● Merge the most frequent pair into a new token and update the vocabulary.
● Repeat until the desired vocabulary size is reached.
Example:
● Corpus Frequencies: (“dog”, 10), (“log”, 5), (“fog”, 12), (“dogs”, 4), (“logs”, 5)
○ Merge (“o”, “g”) → “og”: New vocabulary: [“d”, “f”, “g”, “l”, “o”, “s”, “og”]
○ Merge ("d", "og") → "dog": Vocabulary: [“d”, “f”, “g”, “l”, “o”, “s”, “og”, “dog”]
○ Merge ("l", "og") → "hug": Vocabulary: [“d”, “f”, “g”, “l”, “o”, “s”, “og”, “dog”,
“log”]
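The training loop above can be sketched in plain Python. This is a simplified illustration (the helper pair_counts is our own, and ties between equally frequent pairs are broken arbitrarily), not the implementation used by GPT or RoBERTa:

from collections import Counter

# Toy corpus: word -> frequency (same as the example above)
word_freqs = {"dog": 10, "log": 5, "fog": 12, "dogs": 4, "logs": 5}

# Start from character-level splits and a character vocabulary
splits = {w: list(w) for w in word_freqs}
vocab = sorted({c for w in word_freqs for c in w})
merges = []

def pair_counts(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

num_merges = 3  # the first merge is ("o", "g"), the most frequent pair
for _ in range(num_merges):
    counts = pair_counts(splits, word_freqs)
    if not counts:
        break
    best = max(counts, key=counts.get)      # most frequent pair
    merges.append(best)
    vocab.append("".join(best))
    for word, symbols in splits.items():    # apply the merge everywhere
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged

print(merges)   # learned merge rules, in order
print(vocab)    # final vocabulary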
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding

Tokenization Process
To tokenize new inputs:
1. Normalize and pre-tokenize the input.
2. Split words into characters.
3. Apply learned merge rules sequentially.
Example:
● Merge Rules:
○ (“o”, “g”) → “og”, (“d”, “og”) → “dog”, (“l”, “og”) → “log”
○ Tokenizing "fog": Results in ["f", "og"].
○ Tokenizing "dogs": Results in ["dog", "s"].
○ Tokenizing "frog": Results in ["[UNK]", "og"].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding - Animation

Credit: https://www.youtube.com/watch?v=HEikzVL-lZU
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding

Key Benefits
● Efficient vocabulary management.
● Handles rare and unknown words by breaking them into subwords.
● Widely adopted in Transformer-based NLP models.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

WordPiece is a tokenization algorithm initially developed by Google for training


BERT and has since been adopted in several Transformer models like DistilBERT,
MobileBERT, Funnel Transformers, and MPNET. It is similar to Byte Pair Encoding
(BPE) but differs in its merge rule selection and tokenization method.

Training Process
1. Initial Vocabulary:
a. Start with a vocabulary of special tokens (like [UNK], [CLS], [SEP], etc.) and an
alphabet of all characters in the corpus.
b. Each character is considered separately and prefixed with “##” if it is not the first
character in a word.
c. Example: For the word “word”, the initial split would be: w ##o ##r ##d
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

2. Learning Merge Rules:


• Like BPE, WordPiece uses merge rules to combine subwords iteratively.
• However, instead of selecting the most frequent pair of subwords to merge,
WordPiece uses a scoring formula:
score = (freq of pair) / (freq of first element * freq of second element)
• Purpose: This scoring prioritises merging pairs that are less frequent
individually in the vocabulary.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

3. Example:
● Initial corpus: (“cat”, 12), (“bat”, 8), (“rat”, 10), (“cats”, 5), (“bats”, 3)
● Initial splits: (“c” “##a” “##t”, 12), (“b” “##a” “##t”, 8), (“r” “##a” “##t”, 10), (“c” “##a”
“##t” “##s”, 5), (“b” “##a” “##t” “##s”, 3)
● Initial vocabulary: [“b”, “c”, “r”, “##a”, “##t”, “##s”]
Merge Selection:
● The most frequent pair is (“##a”, “##t”), appearing 38 times.
● Its score is 38 / (38 × 38) = 1/38 ≈ 0.026, held down by the high individual
frequencies of “##a” and “##t” (38 each).
● The pair (“##t”, “##s”) appears only 8 times, yet its score 8 / (38 × 8) = 1/38 is
just as high, because “##s” is rare. In this toy corpus every candidate pair
contains “##a” or “##t”, so all scores tie; breaking the tie in favour of (“##a”, “##t”):
● Vocabulary becomes: [“b”, “c”, “r”, “##s”, “##at”]
● Corpus becomes: (“c” “##at”, 12), (“b” “##at”, 8), (“r” “##at”, 10), (“c” “##at” “##s”,
5), (“b” “##at” “##s”, 3)
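The scoring step can be checked with a short sketch over the same toy corpus (illustrative code, not the BERT implementation); note that in this particular corpus every candidate pair contains “##a” or “##t”, so all scores come out equal to 1/38:

from collections import Counter

# Toy corpus from the example above: word -> frequency
word_freqs = {"cat": 12, "bat": 8, "rat": 10, "cats": 5, "bats": 3}

# WordPiece-style initial splits: continuation pieces get a "##" prefix
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}

# Count individual subwords and adjacent pairs, weighted by word frequency
token_freqs, pair_freqs = Counter(), Counter()
for word, freq in word_freqs.items():
    pieces = splits[word]
    for p in pieces:
        token_freqs[p] += freq
    for a, b in zip(pieces, pieces[1:]):
        pair_freqs[(a, b)] += freq

# WordPiece score: freq(pair) / (freq(first element) * freq(second element))
scores = {
    pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
    for pair, freq in pair_freqs.items()
}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, pair_freqs[pair], round(score, 4))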
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

4. Iterative Merging:
The process continues until the vocabulary reaches the desired size.

Tokenization Process
1. Key Differences from BPE:
a. WordPiece does not save merge rules; it only saves the final vocabulary.
b. It uses the longest matching subword strategy to tokenize a word.
c. In cases where no valid subword exists, the word is tokenized as [UNK].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

2. Longest Subword Matching:


● Starting from the beginning of the word, find the longest subword present
in the vocabulary.
● Split on the identified subword and continue with the remaining characters.
● Examples:
○ Tokenizing "hugs":
■ Longest subword: "hug" → split as ["hug", "##s"].
○ Tokenizing "bugs":
■ Longest subword: "b" → split as ["b", "##u", "##gs"].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

3. Unknown Tokens:
● If no valid subword exists for any part of the word, the entire word is
tokenized as [UNK].
● Example:
○ For "mug", since "##m" is not in the vocabulary, the output is ["[UNK]"].
Summary
WordPiece focuses on efficient vocabulary learning by scoring pairs and
prioritising less frequent individual subwords. Its longest matching subword
tokenization strategy ensures robust handling of words, but it is stricter than BPE in
dealing with out-of-vocabulary segments, resulting in more [UNK] tokens for
unseen words.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding - Animation

Credits: https://www.youtube.com/watch?v=qpv6ms_t_1A
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Comparison b/w BPE and WordPiece

● Merge rule selection: BPE merges the most frequent pair (purely frequency-based);
WordPiece merges the highest-scoring pair.
● Vocabulary output: BPE saves the merge rules plus the final vocabulary;
WordPiece saves only the final vocabulary.
● Tokenization method: BPE applies the learned merges sequentially;
WordPiece uses the longest-matching-subword strategy.
● Handling unknowns: in BPE only unknown characters become [UNK];
in WordPiece the entire word becomes [UNK] if any part of it cannot be matched.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding

Traditional tokenization algorithms assume that input text uses spaces to


separate words, which isn’t the case for all languages. Languages like Chinese,
Japanese, and Thai, for instance, do not rely on spaces, requiring
language-specific pre-tokenizers (e.g., XLM’s specialized pre-tokenizers). To
address this limitation more broadly, SentencePiece was introduced as a simple,
language-independent subword tokenizer and detokenizer for neural text
processing (Kudo et al., 2018). Unlike other methods, SentencePiece processes
raw input streams, treating spaces as characters and using algorithms like BPE
or unigram to construct the vocabulary.
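As a quick illustration, assuming the sentencepiece Python package is installed and a plain-text file corpus.txt is available (file and model names here are placeholders):

import sentencepiece as spm

# Train a unigram model directly on raw text; spaces are treated as the
# meta symbol "▁", so no language-specific pre-tokenizer is needed.
# vocab_size must be small enough for the size of the corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo",
    vocab_size=1000, model_type="unigram"  # "bpe" is also supported
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
pieces = sp.encode_as_pieces("Tokenization is fun")
print(pieces)                    # e.g. ['▁Token', 'ization', '▁is', '▁fun']
print(sp.decode_pieces(pieces))  # lossless round trip back to the raw text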
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding

Let x denote a subword sequence of length n:

x = (x_1, x_2, ..., x_n)

Then the probability of the subword sequence (with a unigram LM) is simply the
product of the individual subword probabilities:

P(x) = ∏_{i=1}^{n} p(x_i)

The objective is to find the subword sequence x* for the input sentence X that
maximizes the (log-)likelihood of the sequence:

x* = argmax_{x ∈ S(X)} log P(x),   where S(X) is the set of segmentation candidates of X.

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding

Therefore, given the vocabulary V, the Expectation-Maximization (EM) algorithm
can be used to maximize the marginal likelihood over all possible segmentations
of the training sentences X^(1), ..., X^(|D|):

L = Σ_{s=1}^{|D|} log ( Σ_{x ∈ S(X^(s))} P(x) )

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding - Algorithm

1. Construct a reasonably large seed vocabulary using BPE or the Extended Suffix
Array algorithm.
2. E-Step: estimate the probability of every token in the current vocabulary using
frequency counts in the training corpus.
3. M-Step: use the Viterbi algorithm to segment the corpus and return the optimal
segments that maximize the (log-)likelihood (see the sketch after this list).
4. Compute the likelihood of each new subword from the optimal segments.
5. Shrink the vocabulary by removing the x% of subwords with the smallest
likelihood.
6. Repeat steps 2 to 5 until the desired vocabulary size is reached.
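A minimal sketch of the Viterbi segmentation used in the M-step, with hypothetical subword probabilities (illustrative only, not the SentencePiece implementation):

import math

# Hypothetical unigram probabilities for illustration
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.1, "ug": 0.15, "hug": 0.3, "gs": 0.1}

def viterbi_segment(text, probs, max_len=10):
    """Find the segmentation with maximum log-likelihood under a unigram LM."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], split point that achieves it)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in probs and best[j][0] > -math.inf:
                score = best[j][0] + math.log(probs[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack to recover the optimal segments
    segments, i = [], n
    while i > 0:
        j = best[i][1]
        segments.append(text[j:i])
        i = j
    return segments[::-1], best[n][0]

print(viterbi_segment("hugs", probs))  # (['hug', 's'], log-likelihood)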
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding - Example

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding - Example

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Overview

Text preprocessing is the first and most crucial step in Natural Language
Processing (NLP). It involves cleaning and preparing raw text for analysis or
machine learning models. Ensures uniformity and improves model
performance.
Key Steps Covered:
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Stemming and Lemmatization
5. Part-of-Speech (POS) Tagging
6. Named Entity Recognition (NER)
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Methods and Steps

1. Tokenization: Splits text into smaller units: sentences or words.


● Tools: nltk.tokenize, spaCy.
● Example:
○ Input: "Natural Language Processing is fun!"
○ Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', '!']

2. Lowercasing: Converts text to lowercase for uniformity.


● Example: "NLP is FUN!" → "nlp is fun!"

3. Stopword Removal: Removes common words (e.g., “is”, “the”) that add
little meaning.
● Libraries: nltk.corpus.stopwords, spaCy.
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Methods and Steps

4. Lemmatization
• Reduces words to their base or root form.
• Tools: WordNetLemmatizer (nltk), spaCy.
• Example: "running" → "run"
5. Stemming
• Cuts off word suffixes to find the root form.
• Example: "jumps", "jumping" → "jump"
6. POS Tagging
• Assigns grammatical roles to words (noun, verb, etc.).
• Tools: nltk.pos_tag, spaCy.
• Example: "NLP is fun!" → [('NLP', 'NN'), ('is', 'VB'), ('fun', 'JJ')]
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Methods and Steps

7. Named Entity Recognition (NER)


• Identifies entities (names, dates, locations).
• Example: "John lives in New York." → [('John', 'PERSON'), ('New York', 'GPE')]

Text preprocessing is a key step in natural language processing (NLP) that


prepares raw text data for analysis or machine learning. It involves cleaning and
standardising text by removing noise like punctuation, stopwords, and special
characters, while applying techniques like case conversion and lemmatisation.
This process ensures consistency, reduces data complexity, and enhances model
performance, making it essential for effective NLP workflows.
UE22AM343BB5: Large Language Models and Their Applications

References

1. https://www.youtube.com/watch?v=tOMjTCO0htA
2. https://www.youtube.com/watch?v=HEikzVL-lZU
3. https://huggingface.co/learn/nlp-course/en/chapter6/5
4. https://www.youtube.com/watch?v=qpv6ms_t_1A
5. https://huggingface.co/learn/nlp-course/en/chapter6/6
6. https://www.cse.iitm.ac.in/~miteshk/llm-course.html
UE22AM343BB5
Large Language Models and Their Applications
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu

Ack: Ujjwal MK,
Teaching Assistant
