An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains
A. A. Markov
Science in Context 19(4), 591–600 (2006). Copyright © Cambridge University Press. doi:10.1017/S0269889706001074. Printed in the United Kingdom.
(Lecture at the physical-mathematical faculty, Royal Academy of Sciences, St. Petersburg, 23 January 1913)¹
This study investigates a text excerpt containing 20,000 Russian letters of the alphabet, excluding ъ and ь,² from Pushkin’s novel Eugene Onegin – the entire first chapter and sixteen stanzas of the second.
This sequence provides us with 20,000 connected trials, each of which results in either a vowel or a consonant.
Accordingly, we assume the existence of an unknown constant probability p that the
observed letter is a vowel. We determine the approximate value of p by observation,
by counting all the vowels and consonants. Apart from p, we shall find – also through
observation – the approximate values of two numbers p1 and p0, and four numbers p1,1, p1,0, p0,1, and p0,0. They represent the following probabilities: p1 – a vowel follows
another vowel; p0 – a vowel follows a consonant; p1,1 – a vowel follows two vowels; p1,0 –
a vowel follows a consonant that is preceded by a vowel; p0,1 – a vowel follows a vowel
that is preceded by a consonant; and, finally, p0,0 – a vowel follows two consonants.
The indices follow the same system that I introduced in my paper “On a Case of Samples Connected in Complex Chain” [Markov 1911b]; in the notation of my other paper, “Investigation of a Remarkable Case of Dependent Samples” [Markov 1907a], however, p0 = p2. We denote the complementary probabilities for consonants by q with indices following the same pattern.
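For readers who wish to retrace this counting on a machine, the following sketch estimates all seven quantities as relative frequencies from a sequence of 0/1 indicators (1 for a vowel, 0 for a consonant); the representation, the function name, and the helper are ours, not part of Markov's text.

```python
from collections import Counter

def chain_frequencies(is_vowel):
    """Estimate p, p1, p0, p1,1, p1,0, p0,1, p0,0 as relative frequencies.

    `is_vowel` is a list of 0/1 indicators (1 = vowel).  Indices are read as in
    the paper: the first index refers to the letter two places back, the second
    to the letter immediately preceding the current one."""
    p = sum(is_vowel) / len(is_vowel)                             # overall share of vowels
    pairs = Counter(zip(is_vowel, is_vowel[1:]))                  # (preceding, current)
    triples = Counter(zip(is_vowel, is_vowel[1:], is_vowel[2:]))  # (two back, one back, current)

    def after(counter, *given):
        # relative frequency of a vowel right after the pattern `given`
        return counter[given + (1,)] / (counter[given + (1,)] + counter[given + (0,)])

    p1, p0 = after(pairs, 1), after(pairs, 0)
    p11, p10, p01, p00 = (after(triples, 1, 1), after(triples, 1, 0),
                          after(triples, 0, 1), after(triples, 0, 0))
    return p, p1, p0, p11, p10, p01, p00
```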
If we seek the value of p, we first find 200 approximate values from which we
can determine the arithmetic mean. To be precise, we divide the entire sequence of
20,000 letters into 200 separate sequences of 100 letters, and count how many vowels
there are in each 100: we obtain 200 numbers, which, when divided by 100, yield 200
approximate values of p.
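A minimal sketch of this division, assuming the same 0/1 representation as above (the input is hypothetical; the letter sequence itself is not reproduced here):

```python
def p_by_hundreds(is_vowel):
    """Split the 0/1 vowel indicators into hundreds and average the block estimates."""
    hundreds = [is_vowel[i:i + 100] for i in range(0, len(is_vowel), 100)]
    counts = [sum(block) for block in hundreds]      # vowels in each hundred (200 numbers)
    estimates = [c / 100 for c in counts]            # 200 approximate values of p
    return counts, sum(estimates) / len(estimates)   # the counts and their arithmetic mean
```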
¹ Cf. Markov 1913a. Translated into German by Alexander Y. Nitussov, Lioudmila Voropai, and David Link; translated into English by Gloria Custance and David Link.
² In Russian, these letters are hard and soft signs, which are not pronounced independently but modify the pronunciation of the preceding letter.
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
11, 12, 13, 14, 15, 16, 17, 18, 19, 20
.....................................................
91, 92, 93, 94, 95, 96, 97, 98, 99, 100.
Next, we count how many vowels there are in each column taken separately and
join the numbers in pairs:
the 1st and 6th, 2nd and 7th, 3rd and 8th, 4th and 9th, 5th and 10th.
In this way, we obtain five numbers for each 100 letters, which we denote with the
following symbols
etc.³
We start a new table by counting how often each of the numbers occurs in this
group.
Number:               37 38 39 40 41 42 43 44 45 46 47 48 49
How often it occurs:   3  1  6 18 12 31 43 29 25 17 12  2  1
In the first row are all the numbers that occur in the group, and in the second row
beneath it, how often they appear.
With the aid of this table, the arithmetic mean is easy to find:
$$43 + \frac{29 + 25 \cdot 2 + 17 \cdot 3 + 12 \cdot 4 + 2 \cdot 5 + 6 - 31 - 12 \cdot 2 - 18 \cdot 3 - 6 \cdot 4 - 5 - 3 \cdot 6}{200} = 43.19.$$
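The same mean can be recomputed directly from the frequency table above (a small check of the arithmetic):

```python
# Frequency table from the text: vowel count per hundred -> how often it occurs.
freq = {37: 3, 38: 1, 39: 6, 40: 18, 41: 12, 42: 31, 43: 43,
        44: 29, 45: 25, 46: 17, 47: 12, 48: 2, 49: 1}
hundreds = sum(freq.values())                            # 200
mean = sum(v * n for v, n in freq.items()) / hundreds    # 8638 / 200 = 43.19
print(hundreds, mean)
```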
³ This is the beginning of Pushkin’s text.
that is, approximately 1/5, which is explained well by the connectedness of our samples.
To clarify this connectedness, though not completely, it will be helpful to calculate approximate values of the above-mentioned probabilities p1 and p0.
We take the entire text of 20,000 letters, count the number of sequences
vowel, vowel,
and obtain the number 1104; after dividing it by the total number of vowels in the
text, we get the following approximate quantity for p1:
$$\frac{1104}{8638} = 0.128.$$
In the same manner, we could find an approximate value for q0 by counting the
number of sequences
consonant, consonant
and dividing it by 11,362; then p0 = 1 − q0. However, we can replace this tedious direct count with the following reasoning. If we subtract 1104 from 8638, we obtain the number of consonants that follow a vowel, namely 7534; and since every consonant apart from the first one must follow either a vowel or a consonant, the number of sequences
consonant, consonant
is determined by the difference
11,361 − 7534 = 3827.
Therefore, we get the following approximate quantity for p0
$$\frac{7534}{11{,}361} = \frac{7534}{11{,}362} = 0.663.$$
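The counting argument can be retraced with the totals reported above; the only extra assumption, implicit in the text's subtraction of one consonant, is that the first letter of the excerpt is a consonant.

```python
vowels, consonants = 8638, 11362     # totals reported in the text
vv = 1104                            # number of "vowel, vowel" sequences
p1 = vv / vowels                     # 1104 / 8638 = 0.128
vc = vowels - vv                     # 7534: consonants that follow a vowel
cc = (consonants - 1) - vc           # 11,361 - 7534 = 3827 "consonant, consonant" sequences
p0 = vc / (consonants - 1)           # 7534 / 11,361 = 0.663 (the text also divides by 11,362)
print(p1, cc, p0)
```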
As we can see, the probability of a letter being a vowel changes considerably
depending upon which letter – vowel or consonant – precedes it. The difference
p1 − p0, which we denote by the Greek letter δ, is
0.128 − 0.663 = −0.535.
Now, if we assume that the sequence of 20,000 letters forms a simple chain, then
for
δ = −0.535,
according to “Investigation of a Remarkable Case of Dependent Samples,” the number
$$\frac{1+\delta}{1-\delta} = \frac{465}{1535} = 0.3$$
can be regarded as the theoretical dispersion coefficient; naturally, this number does
not agree exactly with the previously found
0.208,
but it is closer to it than to one, which corresponds to the case of independent samples.
If we consider the sequence as a complex chain and apply the findings of the study
“On a Case of Samples Connected in Complex Chain,” we can make the theoretical
dispersion coefficient agree still better with the experimental one.
For this, we count the number of the combinations
vowel, vowel, vowel
and
consonant, consonant, consonant
in our sequence. According to my count, there are 115 cases of the first combination
and of the second – 505. When we divide these numbers by the numbers found earlier
1104 and 3827,
we get the approximate equations
$$p_{1,1} = \frac{115}{1104} = 0.104, \qquad q_{0,0} = \frac{505}{3827} = 0.132.$$
With the aim of applying the findings of the above-mentioned article to our case
here, we assume that
$$p = 0.432, \quad q = 0.568, \quad p_1 = 0.128, \quad q_1 = 0.872, \quad p_0 = 0.663, \quad q_0 = 0.337, \quad p_{1,1} = 0.104, \quad q_{0,0} = 0.132,$$
and from these numbers we get
$$\delta = -0.535, \qquad \varepsilon = \frac{-24}{872} = -0.027, \qquad \eta = -\frac{205}{663} = -0.309.$$
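The printed fractions are reproduced if ε and η are read as ε = (p1,1 − p1)/q1 and η = (q0,0 − q0)/p0; this reading of [Markov 1911b] is our assumption, offered only as a numerical check:

```python
# Assumed definitions (our reading, not stated in this excerpt):
#   eps = (p11 - p1) / q1,   eta = (q00 - q0) / p0
p1, q1, p0, q0 = 0.128, 0.872, 0.663, 0.337
p11 = round(115 / 1104, 3)        # 0.104
q00 = round(505 / 3827, 3)        # 0.132
eps = (p11 - p1) / q1             # -0.024 / 0.872 = -24/872  ≈ -0.027
eta = (q00 - q0) / p0             # -0.205 / 0.663 = -205/663 ≈ -0.309
print(eps, eta)
```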
Next, we turn to the expression of the coefficient of dispersion
$$\frac{\left\{q(1 - 3\varepsilon)(1 - \eta) + p(1 - 3\eta)(1 - \varepsilon) - 2(1 - \varepsilon)(1 - \eta)\right\}(1 - \delta) + 2(1 - \varepsilon\eta)}{(1 - \delta)(1 - \varepsilon)(1 - \eta)} = \frac{1 + \delta}{1 - \delta}\left\{\frac{1 + \varepsilon}{2(1 - \varepsilon)} + \frac{1 + \eta}{2(1 - \eta)}\right\} + \frac{(q - p)(\eta - \varepsilon)}{(1 - \varepsilon)(1 - \eta)},$$
which corresponds to the conditions of my article and is derived there.
If we insert here the values found p, q, δ, ε, η
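As a purely numerical aside (our evaluation, not a figure quoted from the lecture), inserting these values into the two sides of the identity above gives approximately 0.195 for both, which is indeed closer to the observed 0.208 than the 0.3 obtained for a simple chain:

```python
p, q = 0.432, 0.568
delta, eps, eta = -0.535, -24 / 872, -205 / 663

# Left-hand side of the dispersion-coefficient expression quoted above.
lhs = (((q * (1 - 3*eps) * (1 - eta) + p * (1 - 3*eta) * (1 - eps)
         - 2 * (1 - eps) * (1 - eta)) * (1 - delta) + 2 * (1 - eps*eta))
       / ((1 - delta) * (1 - eps) * (1 - eta)))

# Right-hand side, term by term.
rhs = ((1 + delta) / (1 - delta) * ((1 + eps) / (2 * (1 - eps)) + (1 + eta) / (2 * (1 - eta)))
       + (q - p) * (eta - eps) / ((1 - eps) * (1 - eta)))

print(lhs, rhs)   # both come out near 0.195
```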
Number:               26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
How often it occurs:   1  0  0  0  1  2  1  3  5  1  2  9 13 12 13 11

Number:               42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
How often it occurs:  17 16 15 10 10 16 10 10  5  5  3  3  3  0  1  2
The arithmetic mean of these 200 new numbers is the same as before
43.19.
However, the sum of the squares of their deviations from 43.2 is considerably greater
than before; it is, namely,
5788.8.
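Both figures can be recomputed from the second table above (a small check of the arithmetic):

```python
values = list(range(26, 58))
counts = [1, 0, 0, 0, 1, 2, 1, 3, 5, 1, 2, 9, 13, 12, 13, 11,
          17, 16, 15, 10, 10, 16, 10, 10, 5, 5, 3, 3, 3, 0, 1, 2]
n = sum(counts)                                               # 200 numbers
mean = sum(v * c for v, c in zip(values, counts)) / n         # 8638 / 200 = 43.19
squares = sum(c * (v - 43.2) ** 2 for v, c in zip(values, counts))
print(n, mean, round(squares, 1))                             # 200 43.19 5788.8
```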
Here, it is necessary to pay attention to the assumption of the independence of the quantities, which is usually connected with the method of least squares (see chapter VII of my book “Calculus of Probability”); let us recall why this assumption is required. It is required in order to determine the weight of the end result, which is expressed by equation (21), and also to calculate the mathematical expectation W, which gives the approximate value k (see my book). However, this condition proves to be superfluous if, first, we leave aside the question of the weight of equation (21) and, second, replace ξ in the expression W with the number a, which we shall assume to be equal to a0, thereby disregarding the difference a − a0. Then, the equations
$$\mathrm{M.E.}\;\frac{p'x' + p''x'' + \cdots + p^{(n)}x^{(n)}}{p' + p'' + \cdots + p^{(n)}} = a$$
and
$$\mathrm{M.E.}\;\frac{p'(x' - a)^2 + p''(x'' - a)^2 + \cdots + p^{(n)}(x^{(n)} - a)^2}{n} = k$$
form the basis of our deductions,⁴ not requiring any independence of the quantities x′, x″, . . . , x^(n).
Based on such equations and the law of large numbers, we suggest that
$$a = \frac{p'a' + p''a'' + \cdots + p^{(n)}a^{(n)}}{p' + p'' + \cdots + p^{(n)}} = a_0$$
and
$$k = \frac{\sum p^{(i)}\,(a^{(i)} - a)^2}{n} = \frac{\sum p^{(i)}\,(a^{(i)} - a_0)^2}{n}.$$
Only the theorem of the weight of the end result, which is expressed by the well-known equation (22), is disregarded: the weight of the result is the same as the sum of the weights of all parts.
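Markov's remark that these equations need no independence of x′, x″, . . . can be illustrated with a toy simulation (the weights and distributions below are made up for the illustration): the expectation of the weighted mean equals a0 even when all the quantities share a common random component.

```python
import random

weights = [1.0, 2.0, 3.0]            # hypothetical weights p', p'', p'''
a = [0.5, 1.0, 1.5]                  # the individual expectations a', a'', a'''
a0 = sum(w * ai for w, ai in zip(weights, a)) / sum(weights)

def weighted_mean_once():
    z = random.gauss(0.0, 1.0)       # common shock: the x(i) are strongly dependent
    x = [ai + z for ai in a]
    return sum(w * xi for w, xi in zip(weights, x)) / sum(weights)

trials = [weighted_mean_once() for _ in range(100_000)]
print(sum(trials) / len(trials), a0)   # empirical mean of the weighted mean stays close to a0
```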
In the given case, each of our 200 numbers represents the sum of nearly independent quantities; however, the sums themselves are connected in groups of five, so that only forty of them can be regarded as independent. We have 40 groups of 500 letters each; in no single hundred are there letters that are adjacent in the text, and this is the reason for the observed independence of the parts. On the other hand, in each group the letters of the first hundred are next to those of the second hundred, those of the second hundred are next to both those of the first and those of the third, and so on; for this reason, as mentioned above, our numbers are connected in groups of five.
Under these conditions and according to the given explanations, the number
$$\frac{5788.8}{200} = 28.944$$
can be considered as the approximate value of the mathematical expectation of the
square of the deviations of our 200 new numbers
49, 42, 38, 42, 44, ...
from their mathematical expectation, which is approximately
43.2.
If we pass from counting the letters (samples) in hundreds to the single letters, we ascertain that the number
0.28944
⁴ M.E. = mathematical expectation.