
BIG DATA

Assignment

Group 7

Monalisa Kakati (2757)
Sejal Gandhi (2403)
Indrani Das (3890)
Nitesh Deshmukh (0505)
Farhan Ali (3232)
N-gram Representation

N-grams of texts are extensively used in text mining and natural language processing tasks.
They are basically a set of co-occurring words within a given window, and when computing
the n-grams you typically move one word forward (although you can move X words forward
in more advanced scenarios). For example, for the sentence "The cow jumps over the moon",
if N = 2 (known as bigrams), the n-grams would be:

- the cow
- cow jumps
- jumps over
- over the
- the moon
If X = the number of words in a given sentence K, the number of n-grams for sentence K would be:

Ngrams_K = X - (N - 1)
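For instance, a minimal Python sketch of this sliding-window extraction (the ngrams function
name is our choice, not something fixed by the assignment):

def ngrams(sentence, n):
    """Return the list of word n-grams in a sentence, moving one word forward."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The cow jumps over the moon", 2))
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
# 6 words, N = 2, so 6 - (2 - 1) = 5 bigrams, matching the formula above.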

What are N-grams used for?


N-grams are used for a variety of different tasks. For example, when developing a
language model, n-grams are used to develop not just unigram models but also bigram
and trigram models.

Another use of n-grams is for developing features for supervised Machine Learning models
such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams
in the feature space instead of just unigrams.
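As a sketch of that idea, the snippet below builds a combined unigram + bigram feature space
with scikit-learn's CountVectorizer; the library choice and the two-sentence toy corpus are our
assumptions, and the resulting matrix X could then be fed to an SVM or Naive Bayes classifier:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cow jumps over the moon",
          "the dog smelled like a skunk"]

# ngram_range=(1, 2) puts both unigrams and bigrams into the feature space,
# so the classifier sees features like 'cow' as well as 'cow jumps'.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())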
An n-gram model models sequences, notably natural languages, using the statistical properties
of n-grams.
This idea can be traced to an experiment in Claude Shannon's work on information theory.
Shannon posed the question: given a sequence of letters (for example, the sequence "for ex"),
what is the likelihood of the next letter? From training data, one can derive a probability
distribution for the next letter given a history of size n: a = 0.4, b = 0.00001, c = 0, ...; where the
probabilities of all possible "next letters" sum to 1.0.
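A minimal sketch of this counting experiment on a toy training text (the
next_letter_distribution name and the example text are illustrative):

from collections import Counter

def next_letter_distribution(text, history):
    """Relative frequency of each character that follows `history` in `text`."""
    n = len(history)
    followers = Counter(text[i + n] for i in range(len(text) - n)
                        if text[i:i + n] == history)
    total = sum(followers.values())
    return {ch: count / total for ch, count in followers.items()}

print(next_letter_distribution("for example, for every formal form", "for"))
# {' ': 0.5, 'm': 0.5} -- the probabilities of all observed next letters sum to 1.0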
More concisely, an n-gram model predicts x_i based on x_(i-(n-1)), ..., x_(i-1). In probability
terms, this is P(x_i | x_(i-(n-1)), ..., x_(i-1)). When used
for language modeling, independence assumptions are made so that each word depends only on
the last n − 1 words. This Markov model is used as an approximation of the true underlying
language. This assumption is important because it massively simplifies the problem of estimating
the language model from data. In addition, because of the open nature of language, it is common
to group words unknown to the language model together.
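A minimal sketch of such a bigram (n = 2) model, with maximum-likelihood estimates from a toy
corpus (no smoothing or unknown-word handling yet; in a real model, out-of-vocabulary words
would first be mapped to a shared token such as <unk>):

from collections import Counter

corpus = "the cow jumps over the moon".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("cow", "the"))  # 0.5: 'the' occurs twice and is followed by 'cow' once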
Note that in a simple n-gram language model, the probability of a word, conditioned on some
number of previous words (one word in a bigram model, two words in a trigram model, etc.) can
be described as following a categorical distribution (often imprecisely called a "multinomial
distribution").
In practice, the probability distributions are smoothed by assigning non-zero probabilities to
unseen words or n-grams; see smoothing techniques.
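For instance, add-one (Laplace) smoothing, one of the simplest such techniques, could look like
this on the toy bigram model above (corpus and names are again illustrative):

from collections import Counter

corpus = "the cow jumps over the moon".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size (5 distinct words here)

def p_smoothed(word, prev):
    """P(word | prev) with add-one smoothing: every bigram gets a non-zero probability."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_smoothed("cow", "the"))   # seen bigram:   (1 + 1) / (2 + 5) = 0.286
print(p_smoothed("moon", "cow"))  # unseen bigram: (0 + 1) / (1 + 5) = 0.167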
n-gram models are widely used in statistical natural language processing. In speech
recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution.
For parsing, words are modeled such that each n-gram is composed of n words. For language
identification, sequences of characters/graphemes (e.g., letters of the alphabet) are modeled for
different languages.[4] For sequences of characters, the 3-grams (sometimes referred to as
"trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor"
and so forth, counting the space character as a gram (sometimes the beginning and end of a text
are modeled explicitly, adding "__g", "_go", "ng_", and "g__"). For sequences of words, the
trigrams (shingles) that can be generated from "the dog smelled like a skunk" are "# the dog",
"the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #".

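A short sketch of character n-gram extraction with explicit start/end padding, reproducing the
"good morning" trigrams above (using '_' as the padding character, as in the example):

def char_ngrams(text, n, pad="_"):
    """Character n-grams of text, padded with n - 1 pad characters on each side."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("good morning", 3))
# ['__g', '_go', 'goo', 'ood', 'od ', 'd m', ' mo', 'mor', ..., 'ng_', 'g__']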