Big Data Assignment Group 7 Monalisa Kakati (2757) Sejal Gandhi (2403) Indrani Das (3890) Nitesh Deshmukh (0505) Farhan Ali (3232)
Assignment
Group 7
N-grams of texts are extensively used in text mining and natural language processing tasks.
They are basically a set of co-occurring words within a given window, and when computing
the n-grams you typically move one word forward (although you can move X words forward
in more advanced scenarios). For example, take the sentence "The cow jumps over the moon".
If N = 2 (known as bigrams), then the n-grams would be:
the cow
cow jumps
jumps over
over the
the moon
If X = the number of words in a given sentence K, the number of n-grams for sentence K would be:
Ngrams(K) = X − (N − 1)
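As a concrete illustration, the short Python sketch below (our own example; the helper name ngrams and the whitespace tokenization are assumptions, not part of the assignment text) generates the bigrams listed above by sliding a window one word at a time and checks the count against X − (N − 1):

def ngrams(tokens, n):
    # Slide a window of size n over the tokens, moving one word forward each step.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumps over the moon".split()

bigrams = ngrams(tokens, 2)
print(bigrams)
# [('the', 'cow'), ('cow', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'moon')]

# Count check: X - (N - 1) = 6 - 1 = 5 bigrams
assert len(bigrams) == len(tokens) - (2 - 1)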
Another use of n-grams is for developing features for supervised Machine Learning models
such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams
in the feature space instead of just unigrams.
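One common way to build such a feature space (our choice of tooling, not something the text prescribes) is scikit-learn's CountVectorizer; the tiny corpus below is purely illustrative, and the resulting matrix is what a classifier such as an SVM or Naive Bayes would be trained on:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cow jumps over the moon",
    "the dog smelled like a skunk",
]

# ngram_range=(1, 2) keeps unigrams and adds bigrams to the feature space.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# Features include unigrams ("cow", "dog", ...) and bigrams ("the cow", "dog smelled", ...)
print(X.toarray())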
An n-gram model models sequences, notably natural languages, using the statistical properties
of n-grams.
This idea can be traced to Claude Shannon's work in information theory.
Shannon posed the question: given a sequence of letters (for example, the sequence "for ex"),
what is the likelihood of the next letter? From training data, one can derive a probability
distribution for the next letter given a history of size n: a = 0.4, b = 0.00001, c = 0, ...; where the
probabilities of all possible "next-letters" sum to 1.0.
More concisely, an n-gram model predicts x_i based on x_{i−(n−1)}, ..., x_{i−1}. In probability
terms, this is P(x_i | x_{i−(n−1)}, ..., x_{i−1}). When used
for language modeling, independence assumptions are made so that each word depends only on
the last n − 1 words. This Markov model is used as an approximation of the true underlying
language. This assumption is important because it massively simplifies the problem of estimating
the language model from data. In addition, because of the open nature of language, it is common
to group words unknown to the language model together.
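To make the estimation step concrete, here is a minimal sketch of a bigram (n = 2) model built from counts; the toy corpus, the <s>/</s> boundary markers, and the <UNK> token are our own illustrative assumptions:

from collections import Counter, defaultdict

corpus = [
    "the cow jumps over the moon",
    "the dog smelled like a skunk",
]

vocab = {w for line in corpus for w in line.split()}

def map_unknown(word):
    # Words outside the training vocabulary are grouped under one shared token.
    return word if word in vocab else "<UNK>"

bigram_counts = defaultdict(Counter)
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def prob(word, prev):
    # Maximum-likelihood estimate of P(word | prev): each word depends only
    # on the single previous word (the Markov assumption for n = 2).
    prev, word = map_unknown(prev), map_unknown(word)
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(prob("cow", "the"))     # count("the cow") / count("the ...") = 1/3
print(prob("rocket", "the"))  # unseen word -> mapped to <UNK> -> 0.0 without smoothing

# For a fixed history, the conditional probabilities form a categorical
# distribution and sum to 1:
print(sum(prob(w, "the") for w in vocab | {"</s>", "<UNK>"}))  # 1.0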
Note that in a simple n-gram language model, the probability of a word, conditioned on some
number of previous words (one word in a bigram model, two words in a trigram model, etc.) can
be described as following a categorical distribution (often imprecisely called a "multinomial
distribution").
In practice, the probability distributions are smoothed by assigning non-zero probabilities to
unseen words or n-grams; see smoothing techniques.
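For example, add-one (Laplace) smoothing is one standard technique; the text does not name a specific method, and the vocabulary size below is an assumed value:

from collections import Counter

V = 1000          # assumed vocabulary size
counts = Counter({("the", "cow"): 1, ("the", "dog"): 1, ("the", "moon"): 1})

def laplace_prob(word, prev):
    # P(word | prev) with add-one smoothing: (count + 1) / (total + V),
    # so every bigram, seen or unseen, gets a non-zero probability.
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return (counts[(prev, word)] + 1) / (total + V)

print(laplace_prob("cow", "the"))     # (1 + 1) / (3 + 1000)
print(laplace_prob("rocket", "the"))  # (0 + 1) / (3 + 1000) -> non-zero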
n-gram models are widely used in statistical natural language processing. In speech
recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution.
For parsing, words are modeled such that each n-gram is composed of n words. For language
identification, sequences of characters/graphemes (e.g., letters of the alphabet) are modeled for
different languages.[4] For sequences of characters, the 3-grams (sometimes referred to as
"trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor"
and so forth, counting the space character as a gram (sometimes the beginning and end of a text
are modeled explicitly, adding "__g", "_go", "ng_", and "g__"). For sequences of words, the
trigrams (shingles) that can be generated from "the dog smelled like a skunk" are "# the dog",
"the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #".