02. N-Gram Language Models

The document discusses N-Gram language models, which are statistical models used in natural language processing to estimate the likelihood of word sequences based on their frequency. It covers unigrams, bigrams, trigrams, their limitations, and applications such as speech recognition and machine translation. While N-Gram models are simple and efficient, modern models like RNNs and transformers are better suited for tasks requiring deeper language understanding.

Uploaded by

Trinh Tan Quang Bao K17 DN


N-Gram Language Models

Contents
• Introduction
• Unigrams and Bigrams
• Limitations
• Trigrams and Higher N-Grams
• Applications of N-Gram Language Models

Introduction
• N-Gram models are statistical language models used
in NLP. They estimate the likelihood of a sequence of
N words from how often it occurs in training data. "N"
is the number of words grouped together.

• For instance, in a bigram (2-gram) model, the
probability of a word depends only on the previous word.
These models are simple, but they struggle to capture
complex language patterns compared with modern models
such as RNNs and transformers.

Introduction
• Example
Dataset with the following sentences:
"I love to eat ice cream."
"I love to play soccer."
"I prefer tea over coffee."

If we want to create a bigram (2-gram) model, we'll look at pairs of
consecutive words in each sentence. For example, the first sentence
would be split into these bigrams:

"I love"
"love to"
"to eat"
"eat ice"
"ice cream"

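The split described above can be sketched in a few lines of Python. The helper name `bigrams` and the naive tokenization (strip the final period, split on spaces) are assumptions made for this illustration, not part of the original slides.

```python
def bigrams(sentence):
    """Return the list of consecutive word pairs in a sentence."""
    # Assumption: tokens are separated by spaces and the sentence
    # ends with a single period that is not part of any word.
    tokens = sentence.rstrip(".").split()
    return list(zip(tokens, tokens[1:]))


sentences = [
    "I love to eat ice cream.",
    "I love to play soccer.",
    "I prefer tea over coffee.",
]

for sentence in sentences:
    print(bigrams(sentence))
```

A sentence with k words yields k - 1 bigrams, so the first sentence (six words) produces the five pairs listed above.
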
Unigrams and Bigrams
• Unigrams (1-grams):
Unigrams are the simplest form of N-Grams: each word
or token in a text is treated as a separate unit, and its
probability is calculated independently of other words.
Unigrams therefore ignore any context or relationship
with surrounding words.

• Bigrams (2-grams):
Bigrams group words into consecutive pairs, so the
probability of a word depends on the word that
immediately precedes it. This introduces a basic level
of context into the model.

Unigrams and Bigrams
• Example

"I love to eat ice cream."

Unigrams: each word's probability is calculated
separately: P("I"), P("love"), P("to"),
P("eat"), P("ice"), P("cream").

Bigrams: each word's probability is conditioned on the
preceding word: P("love" | "I"), P("to" | "love"),
P("eat" | "to"), P("ice" | "eat"), P("cream" | "ice").
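As a rough illustration of how such probabilities could be estimated, the sketch below uses relative frequencies over the three example sentences from the Introduction. It treats the bigram probability as conditional: P(word | previous word) = count(previous word, word) / count(previous word used as a context). The function names `unigram_prob` and `bigram_prob` and the naive tokenization are assumptions made for this example.

```python
from collections import Counter

sentences = [
    "I love to eat ice cream.",
    "I love to play soccer.",
    "I prefer tea over coffee.",
]

# Assumption: drop the final period and split on spaces.
tokenized = [s.rstrip(".").split() for s in sentences]

unigram_counts = Counter(w for toks in tokenized for w in toks)
bigram_counts = Counter(p for toks in tokenized for p in zip(toks, toks[1:]))
total_words = sum(unigram_counts.values())

# How often each word appears as the first element of a bigram.
context_counts = Counter()
for (prev, _), count in bigram_counts.items():
    context_counts[prev] += count


def unigram_prob(word):
    """Relative frequency of a word across the whole dataset."""
    return unigram_counts[word] / total_words


def bigram_prob(prev, word):
    """Estimate P(word | prev); returns 0.0 for an unseen context."""
    seen = context_counts[prev]
    return bigram_counts[(prev, word)] / seen if seen else 0.0


print(unigram_prob("I"))
print(bigram_prob("I", "love"))
```

With this data, `bigram_prob("I", "love")` is 2/3, because "I" is followed by "love" in two of the three sentences and by "prefer" in the third.
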

Example in Python
In this example, the code performs the following steps:

• Preprocess the text by removing non-alphanumeric characters and
converting it to lowercase.

• Tokenize the text into individual words.

• Calculate the frequency of each unigram (individual word) using the
Counter class.

• Calculate the frequency of each bigram (pair of consecutive words)
using the ngrams function and the Counter class.

• Display the results for both unigrams and bigrams.
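The code from the slide is not reproduced in this text, so the following is only a sketch of what those steps might look like. The `ngrams` helper is defined locally as a stand-in (NLTK offers a similar `nltk.util.ngrams` function), and the sample text and the name `preprocess` are assumptions.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and remove non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())


def ngrams(tokens, n):
    """Return an iterator over tuples of n consecutive tokens."""
    # Local stand-in helper: zip the token list with shifted copies of itself.
    return zip(*(tokens[i:] for i in range(n)))


# Assumed sample text: the three sentences from the Introduction.
text = "I love to eat ice cream. I love to play soccer. I prefer tea over coffee."

tokens = preprocess(text).split()          # tokenize on whitespace
unigram_freq = Counter(tokens)             # frequency of each word
bigram_freq = Counter(ngrams(tokens, 2))   # frequency of each word pair

print("Unigrams:", unigram_freq.most_common(3))
print("Bigrams:", bigram_freq.most_common(3))
```

Because punctuation is removed before tokenizing, this sketch also counts pairs that span sentence boundaries, such as ("cream", "i"). Splitting into sentences first would avoid that.
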

Limitations
Using only unigrams and bigrams in language modeling
has several limitations:

• Lack of Context Beyond Adjacent Words

• Ignoring Long-Range Dependencies

• Limited Understanding of Nuanced Language Patterns

• Difficulty with Ambiguity

• Poor Performance on Tasks Requiring Deep Semantics

• Inadequate for Creative Text Generation

Trigrams and Higher N-Grams
• Trigrams (3-grams) and higher N-Grams are extensions
of the N-Gram language modeling concept.

• While bigrams consider pairs of adjacent words,
trigrams consider sequences of three consecutive
words, and higher N-Grams consider sequences of N
words, where N is greater than 3.

• These models aim to capture more intricate language
patterns by considering a broader context within the
text.
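To make the idea of a broader context concrete, here is a minimal sketch that conditions a word on the two words before it, using the three example sentences from the Introduction. The relative-frequency estimate, the function name `trigram_prob`, and the naive tokenization are assumptions made for this illustration.

```python
from collections import Counter

sentences = [
    "I love to eat ice cream.",
    "I love to play soccer.",
    "I prefer tea over coffee.",
]

# Assumption: drop the final period and split on spaces.
tokenized = [s.rstrip(".").split() for s in sentences]

trigram_counts = Counter(
    t for toks in tokenized for t in zip(toks, toks[1:], toks[2:])
)

# How often each two-word context is followed by some third word.
context_counts = Counter()
for (w1, w2, _), count in trigram_counts.items():
    context_counts[(w1, w2)] += count


def trigram_prob(w1, w2, w3):
    """Estimate P(w3 | w1 w2); returns 0.0 for an unseen context."""
    seen = context_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / seen if seen else 0.0


print(trigram_prob("love", "to", "eat"))
print(trigram_prob("I", "love", "to"))
```

Here "love to" is followed once by "eat" and once by "play", so `trigram_prob("love", "to", "eat")` is 0.5. Longer contexts are seen less often in the data, which is why higher-order models generally need more text to estimate reliably.
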

Example in Python
In this example, the code performs the following steps:

• Preprocess the text by removing non-alphanumeric characters and
converting it to lowercase.

• Tokenize the text into individual words.

• Calculate the frequency of each trigram (group of three consecutive
words) using the ngrams function and the Counter class.

• Display the results for trigrams.
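The slide's code is not included in this text. A minimal sketch of the listed steps might look like the following, where the `ngrams` helper is a local stand-in (NLTK offers a similar `nltk.util.ngrams` function) and the sample text is an assumption.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and remove non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())


def ngrams(tokens, n):
    """Return an iterator over tuples of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


# Assumed sample text: the three sentences from the Introduction.
text = "I love to eat ice cream. I love to play soccer. I prefer tea over coffee."

tokens = preprocess(text).split()
trigram_freq = Counter(ngrams(tokens, 3))

for trigram, count in trigram_freq.most_common(5):
    print(" ".join(trigram), count)
```

A text with k tokens yields k - 2 trigrams, so the 16 tokens here give 14 trigrams, of which only ("i", "love", "to") occurs more than once.
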

Applications
• Speech Recognition
• Machine Translation
• Text Generation
• Language Modeling
• Spell Checking and Correction
• Information Retrieval
• Predictive Text Analytics

Summary
• While N-Gram models offer simplicity and efficiency,
they are best suited for tasks that require basic
language understanding and context prediction.

• For more advanced applications that demand nuanced
understanding and generation of text, modern models
like RNNs, LSTMs, and transformer-based models have
proven to be more effective due to their ability to
capture long-range dependencies and semantic
relationships in language.

Practice on N-Gram Models
• Instructions
  • Create a Python program that builds an N-Gram language model
from a given text and generates text based on the model.
  • You'll work with a dataset containing sample text.
  • Your task is to build a simple N-Gram language model and use it to
generate text.

• Dataset
  • text_data = """Natural language processing (NLP) is a field of
artificial intelligence that focuses on the interaction between
computers and humans through natural language. NLP techniques
are used to analyze, understand, and generate human language in
a valuable way."""

Practice on N-Gram Models
• Tasks
• Task 1: Preprocessing
Write a function preprocess_text(text) that takes a text as input and preprocesses it by
converting it to lowercase and removing non-alphanumeric characters.

• Task 2: Build N-Gram Model
Write a function build_ngram_model(text, n) that takes preprocessed text and the
value of n (order of the N-Gram) as input, and builds an N-Gram language model.

• Task 3: Generate Text
Write a function generate_text(model, seed, length) that takes the N-Gram model, a
seed (starting word or phrase), and the desired length of the generated text as input,
and generates text using the N-Gram language model.

• Task 4: Test Your Functions
Apply each function to the given text_data and print the generated text using different
seeds and lengths.
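One possible way to approach these tasks is sketched below. It is not a reference solution. It assumes that the model maps each (n-1)-word context to the list of words that followed it, that `length` means the total number of words in the output, and that the seed is already lowercase. The optional `rng` parameter is an addition for reproducible sampling and is not part of the task statement.

```python
import random
import re
from collections import defaultdict

text_data = """Natural language processing (NLP) is a field of
artificial intelligence that focuses on the interaction between
computers and humans through natural language. NLP techniques
are used to analyze, understand, and generate human language in
a valuable way."""


def preprocess_text(text):
    """Lowercase the text and remove non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())


def build_ngram_model(text, n):
    """Map each (n-1)-word context to the list of words that followed it."""
    tokens = text.split()
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model


def generate_text(model, seed, length, rng=random):
    """Extend the seed until it has `length` words or the context is unseen."""
    output = seed.split()
    if not model:
        return " ".join(output)
    order = len(next(iter(model)))  # context size, i.e. n - 1
    while len(output) < length:
        context = tuple(output[-order:]) if order else ()
        followers = model.get(context)
        if not followers:
            break  # this context never occurred in the training text
        # Repeated followers make frequent continuations more likely.
        output.append(rng.choice(followers))
    return " ".join(output)


clean = preprocess_text(text_data)
bigram_model = build_ngram_model(clean, 2)
print(generate_text(bigram_model, "natural", 8))
print(generate_text(build_ngram_model(clean, 3), "natural language", 8))
```

Storing every follower in a list, duplicates included, means that `rng.choice` samples each continuation in proportion to how often it was observed, without computing probabilities explicitly.
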

Q&A
