02. N-Gram Language Models

The document discusses N-Gram language models, which are statistical models used in natural language processing to estimate the likelihood of word sequences based on their frequency. It covers unigrams, bigrams, trigrams, their limitations, and applications such as speech recognition and machine translation. While N-Gram models are simple and efficient, modern models like RNNs and transformers are better suited for tasks requiring deeper language understanding.

N-Gram Language Models

Contents
 Introduction
 Unigrams and Bigrams
 Limitations
 Trigrams and Higher N-Grams
 Applications of N-Gram Language Model

Introduction
 N-Gram models are statistical language models used in NLP. They estimate the likelihood of a sequence of N words based on how frequently those sequences occur in data. "N" represents the number of words grouped together.

 For instance, in a bigram (2-gram) model, the probability of a word depends on the previous word. These models are simple but struggle with understanding complex language patterns compared to modern models like RNNs and transformers.

Introduction
 Example
Dataset with the following sentences:
"I love to eat ice cream."
"I love to play soccer."
"I prefer tea over coffee."

If we want to create a bigram (2-gram) model, we look at pairs of consecutive words in each sentence. For example, the first sentence would be split into these bigrams:

"I love"
"love to"
"to eat"
"eat ice"
"ice cream"

Unigrams and Bigrams
 Unigrams (1-grams):
Unigrams are the simplest form of N-Grams: each word or token in a text is treated as a separate unit, and its probability is calculated independently of other words. In other words, unigrams don't take into account any context or relationship with surrounding words.

 Bigrams (2-grams):
Bigrams are a type of N-Gram where words are grouped into pairs, and the probability of a word depends on the previous word. This introduces a basic level of context into the model. Bigram models consider the likelihood of observing a word given the word that immediately precedes it.
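
In maximum-likelihood form (the standard N-Gram estimate, stated here for reference since the slides leave it implicit), these probabilities come from corpus counts:

P(w) = Count(w) / (total number of words)                  [unigram]
P(w_n | w_(n-1)) = Count(w_(n-1) w_n) / Count(w_(n-1))     [bigram]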

Unigrams and Bigrams
 Example

"I love to eat ice cream."

Unigrams: each word's probability is calculated separately. So, for each word: P("I"), P("love"), P("to"), P("eat"), P("ice"), P("cream").

Bigrams: each word's probability is calculated based on its relationship with the preceding word: P("love" | "I"), P("to" | "love"), P("eat" | "to"), P("ice" | "eat"), P("cream" | "ice").

Example in Python
In this example, the code performs the following steps:

Preprocess the text by removing non-alphanumeric characters and converting it to lowercase.

Tokenize the text into individual words.

Calculate the frequency of each unigram (individual word) using the Counter class.

Calculate the frequency of each bigram (pair of consecutive words) using the ngrams function and the Counter class.

Display the results for both unigrams and bigrams.
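
The code itself appears only as an image in the original slides; the following is a minimal sketch of the steps described above, assuming NLTK's ngrams helper, a simple whitespace tokenizer, and the sample sentences from the earlier example:

import re
from collections import Counter
from nltk import ngrams  # nltk.util.ngrams

text = "I love to eat ice cream. I love to play soccer. I prefer tea over coffee."

# Preprocess: lowercase and remove non-alphanumeric characters.
cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())

# Tokenize the text into individual words.
tokens = cleaned.split()

# Unigram frequencies with the Counter class.
unigram_counts = Counter(tokens)

# Bigram frequencies with the ngrams function and the Counter class.
bigram_counts = Counter(ngrams(tokens, 2))

# Display the results for both unigrams and bigrams.
print(unigram_counts.most_common())
print(bigram_counts.most_common())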


Limitations
Using only unigrams and bigrams in language modeling has several limitations:

 Lack of Context Beyond Adjacent Words
 Ignoring Long-Range Dependencies
 Limited Understanding of Nuanced Language Patterns
 Difficulty with Ambiguity
 Poor Performance on Tasks Requiring Deep Semantics
 Inadequate for Creative Text Generation

For example, in "The book that I borrowed yesterday was fascinating," a bigram model sees only adjacent pairs, so it cannot relate "book" to "was fascinating" across the intervening words.

Trigrams, Higher N-Grams
 Trigrams (3-grams) and higher N-Grams are extensions of the N-Gram language modeling concept.

 While bigrams consider pairs of adjacent words, trigrams consider sequences of three consecutive words, and higher N-Grams consider sequences of N words, where N is greater than 3.

 These models aim to capture more intricate language patterns by considering a broader context within the text.
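
By analogy with the bigram case, the standard maximum-likelihood trigram estimate (not shown on the slides) is:

P(w_n | w_(n-2) w_(n-1)) = Count(w_(n-2) w_(n-1) w_n) / Count(w_(n-2) w_(n-1))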

Example in Python
In this example, the code performs the following steps:

Preprocess the text by removing non-alphanumeric characters and converting it to lowercase.

Tokenize the text into individual words.

Calculate the frequency of each trigram (group of three consecutive words) using the ngrams function and the Counter class.

Display the results for trigrams.
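
As with the bigram example, the original code is an image; here is a minimal sketch under the same assumptions (nltk.util.ngrams, whitespace tokenization, the same sample text):

import re
from collections import Counter
from nltk import ngrams  # nltk.util.ngrams

text = "I love to eat ice cream. I love to play soccer. I prefer tea over coffee."

# Preprocess: lowercase and remove non-alphanumeric characters, then tokenize.
tokens = re.sub(r"[^a-z0-9\s]", "", text.lower()).split()

# Trigram frequencies with the ngrams function and the Counter class.
trigram_counts = Counter(ngrams(tokens, 3))

# Display the results for trigrams.
for trigram, count in trigram_counts.most_common():
    print(trigram, count)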

Applications
 Speech Recognition
 Machine Translation
 Text Generation
 Language Modeling
 Spell Checking and Correction
 Information Retrieval
 Predictive Text Analytics

Summary
 While N-Gram models offer simplicity and efficiency, they are best suited for tasks that require basic language understanding and context prediction.

 For more advanced applications that demand nuanced understanding and generation of text, modern models like RNNs, LSTMs, and transformer-based models have proven to be more effective due to their ability to capture long-range dependencies and semantic relationships in language.

Practice on N-Gram Models
 Instructions
 Create a Python program that builds an N-Gram language model from a given text and generates text based on the model.
 You'll work with a dataset containing sample text.
 Your task is to build a simple N-Gram language model and use it to generate text.

 Dataset
 text_data = """Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. NLP techniques are used to analyze, understand, and generate human language in a valuable way."""

Practice on N-Gram Models
 Tasks
 Task 1: Preprocessing
Write a function preprocess_text(text) that takes a text as input and preprocesses it by converting to lowercase and removing non-alphanumeric characters.

 Task 2: Build N-Gram Model
Write a function build_ngram_model(text, n) that takes preprocessed text and the value of n (order of the N-Gram) as input, and builds an N-Gram language model.

 Task 3: Generate Text
Write a function generate_text(model, seed, length) that takes the N-Gram model, a seed (starting word or phrase), and the desired length of the generated text as input, and generates text using the N-Gram language model.

 Task 4: Test Your Functions
Apply each function to the given text_data and print the generated text using different seeds and lengths. (One possible sketch follows after this list.)
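
One possible sketch of a solution (an illustration, not the official answer key; it assumes the seed is a tuple of n-1 words and samples uniformly among observed continuations):

import re
import random
from collections import defaultdict

text_data = """Natural language processing (NLP) is a field of
artificial intelligence that focuses on the interaction between
computers and humans through natural language. NLP techniques
are used to analyze, understand, and generate human language in
a valuable way."""

def preprocess_text(text):
    # Task 1: lowercase and remove non-alphanumeric characters.
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

def build_ngram_model(text, n):
    # Task 2: map each (n-1)-word context to the words observed after it.
    tokens = preprocess_text(text).split()
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate_text(model, seed, length):
    # Task 3: extend the seed word by word, sampling observed continuations.
    output = list(seed)
    for _ in range(length):
        candidates = model.get(tuple(output[-len(seed):]))
        if not candidates:
            break  # no continuation observed for this context
        output.append(random.choice(candidates))
    return " ".join(output)

# Task 4: try different seeds and lengths (bigram model, so a 1-word seed).
model = build_ngram_model(text_data, 2)
print(generate_text(model, ("language",), 10))
print(generate_text(model, ("nlp",), 8))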

Q&A
