Example 2.9
Training set:
The Arabian Knights
These are the fairy tales of the east
The stories of the Arabian knights are translated in many languages
Bi-gram model:
P(the/&lt;s&gt;) = 0.67      P(Arabian/the) = 0.4       P(knights/Arabian) = 1.0
P(are/these) = 1.0      P(the/are) = 0.5           P(fairy/the) = 0.2
P(tales/fairy) = 1.0    P(of/tales) = 1.0          P(the/of) = 1.0
P(east/the) = 0.2       P(stories/the) = 0.2       P(of/stories) = 1.0
P(are/knights) = 1.0    P(translated/are) = 0.5    P(in/translated) = 1.0
P(many/in) = 1.0        P(languages/many) = 1.0
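The probabilities above can be reproduced with a short maximum-likelihood estimate, P(w2/w1) = count(w1 w2) / count(w1). The sketch below is illustrative (the marker `<s>` for sentence start and the helper `bigram_prob` are our own naming, not from the text):

```python
from collections import defaultdict

# Training set of Example 2.9, lower-cased; "<s>" marks sentence start.
training = [
    "the arabian knights",
    "these are the fairy tales of the east",
    "the stories of the arabian knights are translated in many languages",
]

# Count bi-gram occurrences: counts[w1][w2] = number of times w2 follows w1.
counts = defaultdict(lambda: defaultdict(int))
for sentence in training:
    words = ["<s>"] + sentence.split()
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate P(w2/w1) = count(w1 w2) / count(w1).
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

print(round(bigram_prob("<s>", "the"), 2))  # 0.67
print(bigram_prob("the", "arabian"))        # 0.4
print(bigram_prob("arabian", "knights"))    # 1.0
```

Note that an unseen pair such as ("the", "languages") gets probability 0.0, which is exactly the data-sparseness problem discussed below.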
Test sentence (s): The Arabian knights are the fairy tales of the east.
P(s) = P(The/&lt;s&gt;) × P(Arabian/the) × P(knights/Arabian) × P(are/knights)
       × P(the/are) × P(fairy/the) × P(tales/fairy) × P(of/tales)
       × P(the/of) × P(east/the)
     = 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
     ≈ 0.0054
As each probability is necessarily less than 1, multiplying many
probabilities may cause a numerical underflow, particularly for long
sentences. To avoid this, calculations are made in log space, where the
calculation corresponds to adding the logs of the individual probabilities
and taking the antilog of the sum.
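The log-space trick can be sketched as follows, using the bi-gram probabilities of the test sentence above (the variable names are ours):

```python
import math

# Bi-gram probabilities of the test sentence from Example 2.9.
probs = [0.67, 0.4, 1.0, 1.0, 0.5, 0.2, 1.0, 1.0, 1.0, 0.2]

# Instead of multiplying the probabilities directly (which can underflow
# for long sentences), add their logs and take the antilog at the end.
log_sum = sum(math.log(p) for p in probs)
sentence_prob = math.exp(log_sum)

print(round(sentence_prob, 4))  # ≈ 0.0054
```

In practice the antilog step is usually skipped: log-probabilities are compared directly, since the log function is monotonic.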
The n-gram model suffers from data sparseness: an n-gram that does not
occur in the training data is assigned zero probability, so that even a
large corpus has several zero entries in its bi-gram matrix. This is
because of the assumption that the probability of occurrence of a word
depends only on the preceding word (or preceding n−1 words), which is not
true in general. There are several long-distance dependencies in natural
language sentences, which this model fails to capture. It has also been
pointed out (2003) that 'there is rarely enough data to accurately estimate
the parameters of a language model.'
A number of smoothing techniques have been developed to handle
the data sparseness problem, the simplest of these being add-one
smoothing. In the words of Jurafsky and Martin (2000):
Smoothing in general refers to the task of re-evaluating zero-
probability or low-probability n-grams and assigning them non-zero
values.
The word 'smoothing' is used to denote these techniques because they tend
to make distributions more uniform, by adjusting the extreme probabilities
towards the average.
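As a minimal sketch of the idea, add-one (Laplace) smoothing increments every bi-gram count by 1, so unseen pairs receive a small non-zero probability while seen pairs are pulled down towards the average. The counts and vocabulary below are taken from Example 2.9; the helper `add_one_prob` is our own illustration:

```python
# Vocabulary of the Example 2.9 training set (V = 14 word types).
vocab = {"the", "arabian", "knights", "these", "are", "fairy", "tales",
         "of", "east", "stories", "translated", "in", "many", "languages"}

# Observed bi-gram counts with history "the"; count(the) = 5 overall.
counts = {("the", "arabian"): 2, ("the", "fairy"): 1,
          ("the", "east"): 1, ("the", "stories"): 1}

def add_one_prob(w1, w2, w1_count):
    # Add-one smoothed estimate: (count(w1 w2) + 1) / (count(w1) + V).
    return (counts.get((w1, w2), 0) + 1) / (w1_count + len(vocab))

print(round(add_one_prob("the", "arabian", 5), 3))  # seen pair:   3/19 ≈ 0.158
print(round(add_one_prob("the", "knights", 5), 3))  # unseen pair: 1/19 ≈ 0.053
```

Compared with the unsmoothed estimates (0.4 and 0.0), the seen bi-gram loses probability mass and the unseen one gains it, which is exactly the "more uniform" behaviour described above.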
2.3.2 Add-one Smoothing