02. N-Gram Language Models
Contents
Introduction
Unigrams and Bigrams
Limitations
Trigrams and Higher N-Grams
Applications of N-Gram Language Models
Introduction
N-Gram models are statistical language models used
in NLP. They estimate the probability of a word sequence
from how often each run of N consecutive words appears
in the training data. "N" is the number of consecutive
words grouped together.
Example
Dataset with the following sentences
"I love to eat ice cream."
"I love to play soccer."
"I prefer tea over coffee."
Bigrams extracted from the first sentence
"I love"
"love to"
"to eat"
"eat ice"
"ice cream"
Unigrams and Bigrams
Unigrams (1-grams):
Unigrams are the simplest form of N-Gram: each word or
token in a text is treated as a separate unit, and its
probability is calculated independently of other words.
A unigram model therefore ignores context and any
relationship with surrounding words.
Bigrams (2-grams):
Bigrams are a type of N-Gram where words are grouped into
pairs, and the probability of a word depends on the
previous word. This introduces a basic level of context into
the model. Bigram models consider the likelihood of
observing a word given the word that immediately
precedes it.
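As a minimal sketch of the two definitions above, the following code counts unigrams and bigrams in the three example sentences from the introduction and turns the counts into probabilities. The helper names (`tokenize`, `unigram_prob`, `bigram_prob`) and the very simple tokenizer are assumptions made for illustration, not part of the original slides.

```python
from collections import Counter

sentences = [
    "I love to eat ice cream.",
    "I love to play soccer.",
    "I prefer tea over coffee.",
]

def tokenize(sentence):
    # Assumed tokenizer: lowercase, drop the final period, split on spaces.
    return sentence.lower().rstrip(".").split()

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    tokens = tokenize(sentence)
    unigram_counts.update(tokens)
    # Pair each token with the one that follows it.
    bigram_counts.update(zip(tokens, tokens[1:]))

total_tokens = sum(unigram_counts.values())

def unigram_prob(word):
    # P(word) = count(word) / total number of tokens
    return unigram_counts[word] / total_tokens

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    # Simplification: count(prev) ignores sentence boundaries.
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(unigram_prob("i"))         # 3 of the 16 tokens are "i"
print(bigram_prob("i", "love"))  # "i" is followed by "love" in 2 of 3 cases
```

The unigram probability of "love" is the same wherever the word appears, while the bigram probability depends on the word before it: "love" is likely after "i" but has probability 0 after "tea" in this tiny dataset.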
Example
Example in Python
In this example, the code
performs the following steps
Trigrams and Higher N-Grams
Trigrams (3-grams) and higher-order N-Grams extend the
same idea: the probability of a word is conditioned on
the previous N-1 words (two words for a trigram).
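One way to see how the same idea generalizes is a single function that extracts N-Grams of any size. The sketch below is illustrative only: the function name `ngrams`, the `model` dictionary, and the pre-tokenized example sentence are assumptions, not code from the original slides.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    # Slide a window of n tokens over the list; returns [] if the list is too short.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love to eat ice cream".split()

trigrams = ngrams(tokens, 3)
print(trigrams[0])    # first trigram: ('i', 'love', 'to')
print(len(trigrams))  # 6 tokens give 6 - 3 + 1 = 4 trigrams

# Map each two-word context to a count of the words that follow it.
model = defaultdict(Counter)
for *context, word in trigrams:
    model[tuple(context)][word] += 1

print(model[("i", "love")])  # words seen after the context "i love"
```

Setting `n` to 1 or 2 reproduces unigrams and bigrams, so the only thing that changes for higher orders is the length of the context.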
Example in Python
In this example, the code
performs the following steps:
Applications
Speech Recognition
Machine Translation
Text Generation
Language Modeling
Spell Checking and Correction
Information Retrieval
Predictive Text Analytics
Summary
N-Gram models are simple and efficient, but they see
only a short window of preceding words, so they are
best suited for tasks that require basic language
understanding and context prediction.
Practice on N-Gram Models
Instructions
Create a Python program that builds a simple N-Gram language model
from the sample text in the dataset below, then uses the model to
generate new text.
Dataset
text_data = """Natural language processing (NLP) is a field of
artificial intelligence that focuses on the interaction between
computers and humans through natural language. NLP techniques
are used to analyze, understand, and generate human language in
a valuable way."""
Tasks
Task 1: Preprocessing
Write a function preprocess_text(text) that takes a text as input and preprocesses it by
converting to lowercase and removing non-alphanumeric characters.
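One possible starting point for Task 1 is sketched below. It assumes that whitespace should be kept so the text can still be split into words, and that "alphanumeric" means the ASCII letters a-z and digits 0-9; both choices are assumptions, and other interpretations of the task are equally valid.

```python
import re

def preprocess_text(text):
    # Step 1: convert to lowercase.
    text = text.lower()
    # Step 2: replace every character that is not a-z, 0-9 or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Step 3: collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())

print(preprocess_text("Natural language processing (NLP) is a field"))
```

Replacing punctuation with a space rather than deleting it keeps words such as "processing (NLP)" from being glued together.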
Q&A