
UE22AM343BB5

Large Language Models and Their Applications


Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu

Ack: Ujjwal MK,
Teaching Assistant
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Overview

In NLP, most of the data we handle consists of raw text. Raw text is not
understood directly by machine learning models; hence we use tokenizers to
convert it into tokens the model can process.

Image Credit: https://www.youtube.com/watch?v=VFp38yj8h3A&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=13


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Types

● There are various types of tokenization techniques:


○ Word-based Tokenization
○ Character-based Tokenization
○ SubWord-based Tokenization
■ Byte-Pair Encoding
■ WordPiece Encoding
■ SentencePiece Encoding
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Word-based Tokenization

In this tokenization technique, the idea is to split the raw text into
words, based on rules such as whitespace, punctuation, etc.

Image Credit https://www.youtube.com/watch?v=nhJxYji1aho&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=14
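As a rough illustration (not the exact rules used in the video above), a word-based tokenizer can be sketched in a few lines of Python, splitting on whitespace and then also separating punctuation:

import re

text = "Let's tokenize this sentence, shall we?"

# Naive word-based tokenization: split on whitespace only
whitespace_tokens = text.split()
# -> ["Let's", 'tokenize', 'this', 'sentence,', 'shall', 'we?']

# A slightly better rule set: keep runs of word characters together
# and treat each punctuation mark as its own token
word_tokens = re.findall(r"\w+|[^\w\s]", text)
# -> ['Let', "'", 's', 'tokenize', 'this', 'sentence', ',', 'shall', 'we', '?']

print(whitespace_tokens)
print(word_tokens)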


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Word-based Tokenization

There are limitations:

1. Lack of context understanding: for example, the words “dog” and “dogs” get
different IDs even though they are very similar and close in meaning.
2. The vocabulary becomes too large: with this type of tokenization, the number of
unique words in the corpus can end up very high, leading to a large vocabulary
and therefore large model sizes.

(Note: we can cap the vocabulary at a fixed number of words, for example
learning only 10,000 words. Every word outside those 10,000 is then treated as
an unknown word, which leads to loss of information.)
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Character-based Tokenization

We now split the text into individual characters rather than words.

The advantage is that the vocabulary is small and remains adequate for the model.

For example, English has a vocabulary of around 170,000 words. We can avoid
that huge number by working with characters instead, of which there are only
around 256.

Image Credits: https://www.youtube.com/watch?v=ssLq_EK2jLE&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=15
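A character-based tokenizer is almost trivial to sketch; the following illustrative snippet simply treats every character as a token and assigns it an ID:

text = "Let's tokenize"

# Character-based tokenization: every character (letters, spaces,
# punctuation) becomes its own token
char_tokens = list(text)
# -> ['L', 'e', 't', "'", 's', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e']

# A tiny vocabulary: one ID per distinct character seen in the text
vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}
ids = [vocab[ch] for ch in char_tokens]

print(char_tokens)
print(ids)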


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Character-based Tokenization

Advantages of using character-based tokenization over word-based
tokenization:
1. The vocabulary size is limited and does not blow up the model size.
2. Fewer out-of-vocabulary tokens: since the vocabulary contains all the
characters present in the corpus, character-based tokenization rarely
produces unknown tokens.

Disadvantages:
1. Each token holds much less context/meaning. For example, ‘L’ carries far
less information than ‘Let’s’.
2. The token sequences become very long, since every single character is a
separate token, instead of the text being represented with fewer, larger tokens.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SubWord-based Tokenization

This method lies in between the word-based and character-based tokenization
techniques.

The idea is to find a middle ground: fewer out-of-vocabulary words, while
preserving meaning in the tokens and keeping sequence lengths smaller.
Image Credits: https://www.youtube.com/watch?v=zHvTiHr506c&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=16
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SubWord-based Tokenization

This algorithm relies on the following rules: frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.

Image Credit: https://www.youtube.com/watch?v=zHvTiHr506c&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=16


UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding

BPE is a text compression algorithm adapted for tokenization in NLP, used


in models like GPT, RoBERTa, and BART. It generates subword tokens by
iteratively merging character pairs based on their frequency in a corpus,
making it robust for handling unknown words and reducing vocabulary size.

Training Process
1. Initial Vocabulary:
a. Extract unique characters from words in the corpus.
b. Example: for “dog”, “log”, “fog”, “dogs”, “logs”, the initial vocabulary is
[“d”, “f”, “g”, “l”, “o”, “s”].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding


2. Iterative Merging:
● Compute pair frequencies across all words.
● Merge the most frequent pair into a new token and update the vocabulary.
● Repeat until the desired vocabulary size is reached.
Example:
● Corpus Frequencies: (“dog”, 10), (“log”, 5), (“fog”, 12), (“dogs”, 4), (“logs”, 5)
○ Merge (“o”, “g”) → “og”: New vocabulary: [“d”, “f”, “g”, “l”, “o”, “s”, “og”]
○ Merge ("d", "og") → "dog": Vocabulary: [“d”, “f”, “g”, “l”, “o”, “s”, “og”, “dog”]
○ Merge ("l", "og") → "hug": Vocabulary: [“d”, “f”, “g”, “l”, “o”, “s”, “og”, “dog”,
“log”]
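The training loop above can be sketched in plain Python. This is a simplified illustration (the helper pair_counts is our own, and ties between equally frequent pairs are broken arbitrarily), not the implementation used by GPT or RoBERTa:

from collections import Counter

# Toy corpus: word -> frequency (same as the example above)
word_freqs = {"dog": 10, "log": 5, "fog": 12, "dogs": 4, "logs": 5}

# Start from character-level splits and a character vocabulary
splits = {w: list(w) for w in word_freqs}
vocab = sorted({c for w in word_freqs for c in w})
merges = []

def pair_counts(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

num_merges = 3  # the first merge is ("o", "g"), the most frequent pair
for _ in range(num_merges):
    counts = pair_counts(splits, word_freqs)
    if not counts:
        break
    best = max(counts, key=counts.get)      # most frequent pair
    merges.append(best)
    vocab.append("".join(best))
    for word, symbols in splits.items():    # apply the merge everywhere
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged

print(merges)   # learned merge rules, in order
print(vocab)    # final vocabulary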
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding

Tokenization Process
To tokenize new inputs:
1. Normalize and pre-tokenize the input.
2. Split words into characters.
3. Apply learned merge rules sequentially.
Example:
● Merge Rules:
○ (“o”, “g”) → “og”, (“d”, “og”) → “dog”, (“l”, “og”) → “log”
○ Tokenizing "fog": Results in ["f", "og"].
○ Tokenizing "dogs": Results in ["dog", "s"].
○ Tokenizing "frog": Results in ["[UNK]", "og"].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding - Animation

Credit: https://www.youtube.com/watch?v=HEikzVL-lZU
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Byte-Pair Encoding

Key Benefits
● Efficient vocabulary management.
● Handles rare and unknown words by breaking them into subwords.
● Widely adopted in Transformer-based NLP models.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

WordPiece is a tokenization algorithm initially developed by Google for training


BERT and has since been adopted in several Transformer models like DistilBERT,
MobileBERT, Funnel Transformers, and MPNET. It is similar to Byte Pair Encoding
(BPE) but differs in its merge rule selection and tokenization method.

Training Process
1. Initial Vocabulary:
a. Start with a vocabulary of special tokens (like [UNK], [CLS], [SEP], etc.) and an
alphabet of all characters in the corpus.
b. Each character is considered separately and prefixed with “##” if it is not the first
character in a word.
c. Example: For the word “word”, the initial split would be: w ##o ##r ##d
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

2. Learning Merge Rules:


• Like BPE, WordPiece uses merge rules to combine subwords iteratively.
• However, instead of selecting the most frequent pair of subwords to merge,
WordPiece uses a scoring formula:
score = (freq of pair) / (freq of first element * freq of second element)
• Purpose: This scoring prioritises merging pairs that are less frequent
individually in the vocabulary.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

3. Example:
● Initial corpus: (“cat”, 12), (“bat”, 8), (“rat”, 10), (“cats”, 5), (“bats”, 3)
● Initial splits: (“c” “##a” “##t”, 12), (“b” “##a” “##t”, 8), (“r” “##a” “##t”, 10), (“c” “##a”
“##t” “##s”, 5), (“b” “##a” “##t” “##s”, 3)
● Initial vocabulary: [“b”, “c”, “r”, “##a”, “##t”, “##s”]
Merge Selection:
● The most frequent pair is (“##a”, “##t”), appearing 38 times.
● Its score is 38 / (38 × 38) = 1/38 ≈ 0.026, held down by the high individual
frequencies of “##a” and “##t” (38 each).
● The pair (“##t”, “##s”) appears only 8 times, yet its score 8 / (38 × 8) = 1/38 is
just as high, because “##s” is rare. In this toy corpus every candidate pair
contains “##a” or “##t”, so all scores tie; breaking the tie in favour of (“##a”, “##t”):
● Vocabulary becomes: [“b”, “c”, “r”, “##s”, “##at”]
● Corpus becomes: (“c” “##at”, 12), (“b” “##at”, 8), (“r” “##at”, 10), (“c” “##at” “##s”,
5), (“b” “##at” “##s”, 3)
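The scoring step can be checked with a short sketch over the same toy corpus (illustrative code, not the BERT implementation); note that in this particular corpus every candidate pair contains “##a” or “##t”, so all scores come out equal to 1/38:

from collections import Counter

# Toy corpus from the example above: word -> frequency
word_freqs = {"cat": 12, "bat": 8, "rat": 10, "cats": 5, "bats": 3}

# WordPiece-style initial splits: continuation pieces get a "##" prefix
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}

# Count individual subwords and adjacent pairs, weighted by word frequency
token_freqs, pair_freqs = Counter(), Counter()
for word, freq in word_freqs.items():
    pieces = splits[word]
    for p in pieces:
        token_freqs[p] += freq
    for a, b in zip(pieces, pieces[1:]):
        pair_freqs[(a, b)] += freq

# WordPiece score: freq(pair) / (freq(first element) * freq(second element))
scores = {
    pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
    for pair, freq in pair_freqs.items()
}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, pair_freqs[pair], round(score, 4))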
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

4. Iterative Merging:
The process continues until the vocabulary reaches the desired size.

Tokenization Process
1. Key Differences from BPE:
a. WordPiece does not save merge rules; it only saves the final vocabulary.
b. It uses the longest matching subword strategy to tokenize a word.
c. In cases where no valid subword exists, the word is tokenized as [UNK].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

2. Longest Subword Matching:


● Starting from the beginning of the word, find the longest subword present
in the vocabulary.
● Split on the identified subword and continue with the remaining characters.
● Examples:
○ Tokenizing "hugs":
■ Longest subword: "hug" → split as ["hug", "##s"].
○ Tokenizing "bugs":
■ Longest subword: "b" → split as ["b", "##u", "##gs"].
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding

3. Unknown Tokens:
● If no valid subword exists for any part of the word, the entire word is
tokenized as [UNK].
● Example:
○ For "mug", since "##m" is not in the vocabulary, the output is ["[UNK]"].
Summary
WordPiece focuses on efficient vocabulary learning by scoring pairs and
prioritising less frequent individual subwords. Its longest matching subword
tokenization strategy ensures robust handling of words, but it is stricter than BPE in
dealing with out-of-vocabulary segments, resulting in more [UNK] tokens for
unseen words.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - WordPiece Encoding - Animation

Credits: https://www.youtube.com/watch?v=qpv6ms_t_1A
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - Comparison b/w BPE and WordPiece

● Merge rule selection: BPE merges the most frequent pair (purely frequency-based);
WordPiece merges the highest-scoring pair.
● Vocabulary output: BPE saves the merge rules plus the final vocabulary;
WordPiece saves only the final vocabulary.
● Tokenization method: BPE applies the learned merges sequentially;
WordPiece uses the longest-matching-subword strategy.
● Handling unknowns: in BPE only unknown characters become [UNK];
in WordPiece the entire word becomes [UNK] if any part of it cannot be matched.
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding

Traditional tokenization algorithms assume that input text uses spaces to


separate words, which isn’t the case for all languages. Languages like Chinese,
Japanese, and Thai, for instance, do not rely on spaces, requiring
language-specific pre-tokenizers (e.g., XLM’s specialized pre-tokenizers). To
address this limitation more broadly, SentencePiece was introduced as a simple,
language-independent subword tokenizer and detokenizer for neural text
processing (Kudo et al., 2018). Unlike other methods, SentencePiece processes
raw input streams, treating spaces as characters and using algorithms like BPE
or unigram to construct the vocabulary.
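As a quick illustration, assuming the sentencepiece Python package is installed and a plain-text file corpus.txt is available (file and model names here are placeholders):

import sentencepiece as spm

# Train a unigram model directly on raw text; spaces are treated as the
# meta symbol "▁", so no language-specific pre-tokenizer is needed.
# vocab_size must be small enough for the size of the corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm_demo",
    vocab_size=1000, model_type="unigram"  # "bpe" is also supported
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
pieces = sp.encode_as_pieces("Tokenization is fun")
print(pieces)                    # e.g. ['▁Token', 'ization', '▁is', '▁fun']
print(sp.decode_pieces(pieces))  # lossless round trip back to the raw text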
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding

Let x denote a subword sequence of length n:

x = (x_1, x_2, ..., x_n)

Then the probability of the subword sequence (with a unigram LM) is simply the
product of the individual subword probabilities:

P(x) = ∏_{i=1}^{n} p(x_i)

The objective is to find the subword sequence x* for the input sentence X that
maximizes the (log-)likelihood of the sequence:

x* = argmax_{x ∈ S(X)} log P(x),   where S(X) is the set of segmentation candidates of X.

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding

Therefore, given the vocabulary V, the Expectation-Maximization (EM) algorithm
can be used to maximize the marginal likelihood over all possible segmentations
of the training sentences X^(1), ..., X^(|D|):

L = Σ_{s=1}^{|D|} log ( Σ_{x ∈ S(X^(s))} P(x) )

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding - Algorithm

1. Construct a reasonably large seed vocabulary using BPE or the Extended Suffix
Array algorithm.
2. E-Step: estimate the probability of every token in the current vocabulary using
frequency counts in the training corpus.
3. M-Step: use the Viterbi algorithm to segment the corpus and return the optimal
segments that maximize the (log-)likelihood (see the sketch after this list).
4. Compute the likelihood of each new subword from the optimal segments.
5. Shrink the vocabulary by removing the x% of subwords with the smallest
likelihood.
6. Repeat steps 2 to 5 until the desired vocabulary size is reached.
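A minimal sketch of the Viterbi segmentation used in the M-step, with hypothetical subword probabilities (illustrative only, not the SentencePiece implementation):

import math

# Hypothetical unigram probabilities for illustration
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.1, "ug": 0.15, "hug": 0.3, "gs": 0.1}

def viterbi_segment(text, probs, max_len=10):
    """Find the segmentation with maximum log-likelihood under a unigram LM."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], split point that achieves it)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in probs and best[j][0] > -math.inf:
                score = best[j][0] + math.log(probs[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack to recover the optimal segments
    segments, i = [], n
    while i > 0:
        j = best[i][1]
        segments.append(text[j:i])
        i = j
    return segments[::-1], best[n][0]

print(viterbi_segment("hugs", probs))  # (['hug', 's'], log-likelihood)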
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding - Example

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Tokenization - SentencePiece Encoding - Example

Credits: https://iitm-pod.slides.com/arunprakash_ai/tokenizers/fullscreen#/0/40/11
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Overview

Text preprocessing is the first and most crucial step in Natural Language
Processing (NLP). It involves cleaning and preparing raw text for analysis or
machine learning models. Ensures uniformity and improves model
performance.
Key Steps Covered:
1. Tokenization
2. Lowercasing
3. Stopword Removal
4. Stemming and Lemmatization
5. Part-of-Speech (POS) Tagging
6. Named Entity Recognition (NER)
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Methods and Steps

1. Tokenization: Splits text into smaller units: sentences or words.


● Tools: nltk.tokenize, spaCy.
● Example:
○ Input: "Natural Language Processing is fun!"
○ Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', '!']

2. Lowercasing: Converts text to lowercase for uniformity.


● Example: "NLP is FUN!" → "nlp is fun!"

3. Stopword Removal: Removes common words (e.g., “is”, “the”) that add
little meaning.
● Libraries: nltk.corpus.stopwords, spaCy.
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Methods and Steps

4. Lemmatization
• Reduces words to their base or root form.
• Tools: WordNetLemmatizer (nltk), spaCy.
• Example: "running" → "run"
5. Stemming
• Cuts off word suffixes to find the root form.
• Example: "jumps", "jumping" → "jump"
6. POS Tagging
• Assigns grammatical roles to words (noun, verb, etc.).
• Tools: nltk.pos_tag, spaCy.
• Example: "NLP is fun!" → [('NLP', 'NN'), ('is', 'VB'), ('fun', 'JJ')]
UE22AM343BB5: Large Language Models and Their Applications

Text Preprocessing - Methods and Steps

7. Named Entity Recognition (NER)


• Identifies entities (names, dates, locations).
• Example: "John lives in New York." → [('John', 'PERSON'), ('New York', 'GPE')]

Text preprocessing is a key step in natural language processing (NLP) that


prepares raw text data for analysis or machine learning. It involves cleaning and
standardising text by removing noise like punctuation, stopwords, and special
characters, while applying techniques like case conversion and lemmatisation.
This process ensures consistency, reduces data complexity, and enhances model
performance, making it essential for effective NLP workflows.
UE22AM343BB5: Large Language Models and Their Applications

References

1. https://www.youtube.com/watch?v=tOMjTCO0htA
2. https://www.youtube.com/watch?v=HEikzVL-lZU
3. https://huggingface.co/learn/nlp-course/en/chapter6/5
4. https://www.youtube.com/watch?v=qpv6ms_t_1A
5. https://huggingface.co/learn/nlp-course/en/chapter6/6
6. https://www.cse.iitm.ac.in/~miteshk/llm-course.html
UE22AM343BB5
Large Language Models and Their Applications
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu

Ack: Ujjwal MK,
Teaching Assistant
