Science in Context

http://journals.cambridge.org/SIC

An Example of Statistical Investigation of the Text Eugene 
Onegin Concerning the Connection of Samples in Chains
A. A. Markov

Science in Context / Volume 19 / Issue 04 / December 2006, pp 591–600
DOI: 10.1017/S0269889706001074, Published online: 04 January 2007

Link to this article: http://journals.cambridge.org/abstract_S0269889706001074

How to cite this article:
A. A. Markov (2006). An Example of Statistical Investigation of the Text Eugene Onegin
Concerning the Connection of Samples in Chains. Science in Context, 19, pp 591–600.
doi:10.1017/S0269889706001074

Science in Context 19(4), 591–600 (2006). Copyright © Cambridge University Press
doi:10.1017/S0269889706001074 Printed in the United Kingdom

Classical Text in Translation


An Example of Statistical Investigation of the Text Eugene
Onegin Concerning the Connection of Samples in Chains

A. A. Markov
(Lecture at the physical-mathematical faculty, Royal Academy of Sciences, St. Petersburg,
23 January 1913)¹

This study investigates a text excerpt containing 20,000 Russian letters of the alphabet,
excluding ъ and ь,² from Pushkin’s novel Eugene Onegin – the entire first chapter and
sixteen stanzas of the second.
This sequence provides us with 20,000 connected trials, which are either a vowel
or a consonant.
Accordingly, we assume the existence of an unknown constant probability $p$ that the
observed letter is a vowel. We determine the approximate value of $p$ by observation,
by counting all the vowels and consonants. Apart from $p$, we shall find – also through
observation – the approximate values of two numbers $p_1$ and $p_0$, and four numbers
$p_{1,1}$, $p_{1,0}$, $p_{0,1}$, and $p_{0,0}$. They represent the following probabilities:
$p_1$ – a vowel follows another vowel; $p_0$ – a vowel follows a consonant; $p_{1,1}$ – a
vowel follows two vowels; $p_{1,0}$ – a vowel follows a consonant that is preceded by a
vowel; $p_{0,1}$ – a vowel follows a vowel that is preceded by a consonant; and, finally,
$p_{0,0}$ – a vowel follows two consonants.
The indices follow the same system that I introduced in my paper “On a Case of
Samples Connected in Complex Chain” [Markov 1911b]; with reference to my other
paper, “Investigation of a Remarkable Case of Dependent Samples” [Markov 1907a],
however, $p_0 = p_2$. We denote the opposite probabilities for consonants with $q$ and
indices that follow the same pattern.
If we seek the value of p, we first find 200 approximate values from which we
can determine the arithmetic mean. To be precise, we divide the entire sequence of
20,000 letters into 200 separate sequences of 100 letters, and count how many vowels
there are in each 100: we obtain 200 numbers, which, when divided by 100, yield 200
approximate values of p.
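
The counting procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the vowel set is the modern Russian one, whereas the 1913 count was made on pre-reform orthography, so the sketch would not necessarily reproduce the published figures.

```python
# Hypothetical helpers; VOWELS is an assumption (modern Russian vowels only).
VOWELS = set("аеёиоуыэюя")
SKIPPED = set("ъь")  # hard and soft signs are excluded from the count


def vowel_counts_per_block(text, block=100):
    """Return the number of vowels in each consecutive block of `block` letters."""
    letters = [ch for ch in text.lower() if ch.isalpha() and ch not in SKIPPED]
    usable = len(letters) - len(letters) % block  # drop an incomplete last block
    return [
        sum(ch in VOWELS for ch in letters[i:i + block])
        for i in range(0, usable, block)
    ]


def estimates_of_p(text, block=100):
    """Divide each block count by the block size to get approximate values of p."""
    return [count / block for count in vowel_counts_per_block(text, block)]
```

Applied to a 20,000-letter excerpt, `estimates_of_p` would return 200 values whose arithmetic mean plays the role of the approximate value of p.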

¹ Cf. Markov 1913a. Translated into German by Alexander Y. Nitussov, Lioudmila Voropai, and David Link;
translated into English by Gloria Custance and David Link.
² In Russian, these letters are hard and soft signs, which are not pronounced independently but modify the
pronunciation of the preceding letter.
When we determine the number of vowels, we wish to retain the possibility of
constructing other combinations of 100 letters; we therefore write down each hundred in a
square with ten rows and ten columns, maintaining the order of the letters:

 1,  2,  3,  4,  5,  6,  7,  8,  9,  10
11, 12, 13, 14, 15, 16, 17, 18, 19,  20
.......................................
91, 92, 93, 94, 95, 96, 97, 98, 99, 100.

Next, we count how many vowels there are in each column taken separately and
join the numbers in pairs:

the 1st and 6th, 2nd and 7th, 3rd and 8th, 4th and 9th, 5th and 10th.

In this way, we obtain five numbers for each 100 letters, which we denote with the
following symbols

(1,6), (2,7), (3,8), (4,9), (5,10);

and the following sum

(1,6) + (2,7) + (3,8) + (4,9) + (5,10)

represents the total number of vowels in this hundred.


If we combine 500 letters, we can construct five new groups of 100 letters each: the
first – from the first and sixth column, the second – from the second and the seventh,
and so on.
The number of vowels in these new groups of 100, obviously, is made up from the
following sums

(1,6), (2,7), (3,8), (4,9), (5,10),

which consist of the corresponding five summands.
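
A minimal sketch of this bookkeeping follows, assuming each letter has already been reduced to a boolean flag (True for a vowel). The function names are illustrative and not taken from the paper.

```python
def column_pair_counts(hundred):
    """hundred: sequence of 100 booleans (True = vowel), in text order.

    The letters are laid out row by row in a 10 x 10 square; the function
    returns the five numbers (1,6), (2,7), (3,8), (4,9), (5,10).
    """
    assert len(hundred) == 100
    # Column c (0-based) holds positions c, c + 10, c + 20, ...
    columns = [sum(hundred[c::10]) for c in range(10)]
    return [columns[c] + columns[c + 5] for c in range(5)]


def regroup_five_hundred(flags):
    """flags: 500 booleans. Returns (original, regrouped) counts per hundred.

    original[h]  = vowels in the h-th consecutive hundred;
    regrouped[k] = vowels in the new hundred built from column pair k
                   of all five squares.
    """
    assert len(flags) == 500
    table = [column_pair_counts(flags[i:i + 100]) for i in range(0, 500, 100)]
    original = [sum(row) for row in table]
    regrouped = [sum(row[k] for row in table) for k in range(5)]
    return original, regrouped
```

Both lists always add up to the same total for the 500 letters, so the two arrangements differ only in how that total is split into hundreds.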


The results of our counts are entered in forty small tables, each containing the
following: in the first row – five numbers (1,6) and their sum, in the second row – five
numbers (2,7) and their sum, etc., in the last row – the number of vowels in the first
hundred, second hundred, etc., and finally the number of vowels in all five hundreds;
to save space, reduced by 200:
[Forty small tables of vowel counts; not reproduced in this copy.]

First, we shall look at the group of numbers

42, 46, 40, 44, 43, 44, 45, 43, . . .

which are found in the last rows of our 40 small tables and show the number of vowels
in consecutive groups of 100 letters of the text, for example:

etc.³
We start a new table by counting how often each of the numbers occurs in this
group.

37  38  39  40  41  42  43  44  45  46  47  48  49

 3   1   6  18  12  31  43  29  25  17  12   2   1

In the first row are all the numbers that occur in the group, and in the second row
beneath it, how often they appear.
With the aid of this table, the arithmetic mean is easy to find:

$$43 + \frac{29 + 25 \times 2 + 17 \times 3 + 12 \times 4 + 2 \times 5 + 6 - 31 - 12 \times 2 - 18 \times 3 - 6 \times 4 - 5 - 3 \times 6}{200} = 43.19$$

and from this it follows that

$$p = 0.4319 \approx 0.432.$$
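
The arithmetic can be checked directly from the frequency table. The sketch below uses only the numbers printed in the table; the variable names are illustrative.

```python
# Frequency table from the text: vowels per hundred -> number of hundreds.
counts = {37: 3, 38: 1, 39: 6, 40: 18, 41: 12, 42: 31, 43: 43,
          44: 29, 45: 25, 46: 17, 47: 12, 48: 2, 49: 1}

n = sum(counts.values())                                # number of hundreds
offset = sum((v - 43) * k for v, k in counts.items())   # signed deviations from 43
mean = 43 + offset / n                                  # arithmetic mean per hundred
total_vowels = sum(v * k for v, k in counts.items())    # vowels in all 20,000 letters
p = mean / 100                                          # approximate probability of a vowel

print(n, offset, mean, total_vowels, round(p, 3))
```

Working with deviations from 43 rather than with the raw values mirrors the shortcut used in the text; it keeps the hand computation small.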
Next, we calculate the sum of the squares of their deviations from 43.2; it is
1022.8,
divided by 200, we get
5.114,
and this number can be regarded as the approximate quantity of the mathematical
expectation of the square of the deviation of each of our 200 numbers from their
common mathematical expectation, which is 43.2. Finally, the number

$$\frac{5.114}{200} = 0.02557$$

³ This is the beginning of Pushkin’s text.

represents the approximate quantity of the mathematical expectation of the square of
the error when determining $100p$ with the equation

$$100p = 43.2.$$
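
The three numbers 1022.8, 5.114, and 0.02557 follow from the same frequency table. A short check, with illustrative variable names:

```python
# Frequency table from the text: vowels per hundred -> number of hundreds.
counts = {37: 3, 38: 1, 39: 6, 40: 18, 41: 12, 42: 31, 43: 43,
          44: 29, 45: 25, 46: 17, 47: 12, 48: 2, 49: 1}

center = 43.2
# Sum of the squared deviations of the 200 counts from 43.2.
sum_sq = sum(k * (v - center) ** 2 for v, k in counts.items())
# Approximate expected squared deviation of a single count.
per_hundred = sum_sq / 200
# Approximate expected squared error of the mean of 200 independent counts.
error_of_mean = per_hundred / 200

print(round(sum_sq, 1), round(per_hundred, 3), round(error_of_mean, 5))
```

The last division by 200 is the step that relies on treating the 200 counts as independent, which is the assumption discussed next.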
Such a deduction is associated with the usual assumption of the method of least
squares, namely, that we are dealing with independent quantities. This assumption is
not less justified in this case than in many others, because the connection between the
numbers is fairly weak due to the way in which they were obtained.
One can also discern a certain correspondence of our results with the well-known
law of error, which is associated with the names of Gauss and Laplace; for example,
the quantity called probable error in our case is

$$0.67\sqrt{5.11} \approx 1.5$$

and accordingly, between

$$43.2 - 1.5 = 41.7 \quad\text{and}\quad 43.2 + 1.5 = 44.7$$

lie 103 numbers, that is, approximately half [of the total]: 31 times the number 42, 43
times the number 43, and 29 times the number 44.
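
A quick check of the probable error and of the count of 103, using the factor 0.67 and the frequency table from the text (variable names are illustrative):

```python
import math

# Frequency table from the text: vowels per hundred -> number of hundreds.
counts = {37: 3, 38: 1, 39: 6, 40: 18, 41: 12, 42: 31, 43: 43,
          44: 29, 45: 25, 46: 17, 47: 12, 48: 2, 49: 1}

variance = 5.114                                  # expected squared deviation per hundred
probable_error = 0.67 * math.sqrt(variance)       # roughly 1.5
low, high = 43.2 - 1.5, 43.2 + 1.5                # the interval 41.7 .. 44.7
inside = sum(k for v, k in counts.items() if low < v < high)

print(round(probable_error, 2), inside)
```

Under the Gauss–Laplace law about half of the observations should fall within one probable error of the mean, which is what the count of 103 out of 200 illustrates.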
To the independence of the quantities corresponds the fact that when we combine
them in twos, fours, or fives, and calculate for these 100, 50, and 40 combinations the
sums of the squares of their deviations from
86.4, 172.8, and 216,
we obtain the numbers
827.6, 975.2, 1004,
which do not differ very much from the number found earlier
1022.8.
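
The reason why these sums should roughly agree under independence can be spelled out briefly. The following is a sketch under the idealized assumption that the counts $x_1, \ldots, x_n$ are independent with common expectation $a$ and common variance $\sigma^2$; the symbols $\sigma^2$, $y_j$, and $m$ are introduced here for illustration only.

```latex
% Single counts: n quantities, each with variance \sigma^2.
\mathrm{E}\sum_{i=1}^{n} (x_i - a)^2 = n\sigma^2 .

% Combine the counts in groups of m (m = 2, 4, 5):
% y_j = x_{(j-1)m+1} + \cdots + x_{jm}, \qquad j = 1, \ldots, n/m .
% Under independence the variances add, so each y_j has variance m\sigma^2,
% and there are n/m such sums with expectation ma:
\mathrm{E}\sum_{j=1}^{n/m} (y_j - ma)^2 = \frac{n}{m}\cdot m\sigma^2 = n\sigma^2 .
```

With $n = 200$ and $a = 43.2$, the centres $ma$ are 86.4, 172.8, and 216, and in each case the expected sum of squares is the same $n\sigma^2$; a marked departure from this equality would point to dependence between the counts.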
Now, if we move on from samples in hundreds to single samples, we ascertain that
the number

$$\frac{5.114}{100} = 0.05114$$

differs strongly from

$$0.432 \times 0.568 = 0.245376:$$

the coefficient of dispersion (we deviate here slightly from usual terminology, whereby
we should have taken the square root of the number that we call the coefficient of
dispersion) is

$$\frac{5114}{24537.6} = 0.208,$$

that is, approximately $\frac{1}{5}$, which is explained well by the connectedness of our samples.
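
The coefficient of dispersion as used here is the ratio of the observed per-letter variance to the variance $pq$ that independent trials would give. A small sketch, with a hypothetical function name:

```python
def dispersion_coefficient(sum_sq, n_groups, group_size, p):
    """Ratio of the observed per-letter variance to p * (1 - p).

    sum_sq     -- sum of squared deviations of the group counts from their mean
    n_groups   -- number of groups (200 hundreds in the text)
    group_size -- letters per group (100 in the text)
    p          -- approximate probability of a vowel
    """
    observed = sum_sq / n_groups / group_size   # variance per single letter
    independent = p * (1 - p)                   # variance for independent trials
    return observed / independent


coeff = dispersion_coefficient(1022.8, 200, 100, 0.432)
print(round(coeff, 3))
```

A value near 1 would indicate behaviour like independent trials; the value of about one fifth obtained here shows that vowel counts per hundred vary far less than independence would predict.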
To clarify this connectedness, although not entirely, it will help us to calculate the
above-mentioned probabilities $p_1$ and $p_0$ approximately.
We take the entire text of 20,000 letters, count the number of sequences
vowel, vowel,
and obtain the number 1104; after dividing it by the total number of vowels in the
text, we get the following approximate quantity for $p_1$:

$$\frac{1104}{8638} = 0.128.$$
In the same manner, we could find an approximate value for $q_0$ by counting the
number of sequences
consonant, consonant
and dividing it by 11,362, then $p_0 = 1 - q_0$. However, we can also replace the tedious
direct count with the following reasoning. If we subtract 1104 from 8638, we obtain the
number of consonants
7534,
which follow a vowel, and as all consonants apart from the first one must follow either
a vowel or a consonant, the number of sequences
consonant, consonant
is determined by the difference
11,361 − 7534 = 3827.
Therefore, we get the following approximate quantity for $p_0$:

$$\frac{7534}{11{,}361} \approx \frac{7534}{11{,}362} = 0.663.$$
As we can see, the probability of a letter being a vowel changes considerably
depending upon which letter – vowel or consonant – precedes it. The difference
$p_1 - p_0$, which we denote with the [Greek] letter $\delta$, is

$$0.128 - 0.663 = -0.535.$$
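
The bookkeeping behind $p_1$, $p_0$, and $\delta$ can be retraced from the three counts given in the text. The helper `pair_counts` is a hypothetical addition showing how such pair counts could be obtained from a sequence of vowel flags; the remaining lines use only the published numbers.

```python
from collections import Counter


def pair_counts(flags):
    """Count adjacent pairs in a sequence of booleans (True = vowel)."""
    return Counter(zip(flags, flags[1:]))


# Counts reported in the text.
vowels, consonants = 8638, 11362
vowel_vowel = 1104

# Every vowel not followed by a vowel is followed by a consonant.
vowel_consonant = vowels - vowel_vowel                       # 7534
# Every consonant except the first follows a vowel or a consonant.
consonant_consonant = (consonants - 1) - vowel_consonant     # 3827

p1 = vowel_vowel / vowels                      # vowel after vowel
q0 = consonant_consonant / (consonants - 1)    # consonant after consonant
p0 = 1 - q0                                    # vowel after consonant
delta = p1 - p0

print(round(p1, 3), round(p0, 3), round(delta, 3))
```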
Now, if we assume that the sequence of 20,000 letters forms a simple chain, then
for
δ = −0.535,
according to “Investigation of a Remarkable Case of Dependent Samples,” the number

$$\frac{1+\delta}{1-\delta} = \frac{465}{1535} = 0.3$$

can be regarded as the theoretical dispersion coefficient; naturally, this number does
not agree exactly with the previously found
0.208,
but it is closer to it than to one, which corresponds to the case of independent samples.
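
The comparison made in this paragraph amounts to the following small calculation; the function name is illustrative.

```python
def simple_chain_dispersion(delta):
    """Theoretical dispersion coefficient (1 + delta) / (1 - delta) of a simple chain."""
    return (1 + delta) / (1 - delta)


theory = simple_chain_dispersion(-0.535)   # roughly 0.3
observed = 0.208                           # found from the 200 counts
independent = 1.0                          # value for independent samples

closer_to_observed = abs(theory - observed) < abs(theory - independent)
print(round(theory, 3), closer_to_observed)
```

For $\delta = 0$ the expression equals 1, the case of independent samples; a negative $\delta$, as found here, pushes it below 1.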
If we consider the sequence as a complex chain and apply the findings of the study
“On a Case of Samples Connected in Complex Chain,” we can make the theoretical
dispersion coefficient agree still better with the experimental one.
For this, we count the number of the combinations
vowel, vowel, vowel
and
consonant, consonant, consonant
in our sequence. According to my count, there are 115 cases of the first combination
and of the second – 505. When we divide these numbers by the numbers found earlier
1104 and 3827,
we get the approximate equations

$$p_{1,1} = \frac{115}{1104} = 0.104, \qquad q_{0,0} = \frac{505}{3827} = 0.132.$$
With the aim of applying the findings of the above-mentioned article to our case
here, we assume that

$$p = 0.432,\quad q = 0.568,\quad p_1 = 0.128,\quad q_1 = 0.872,\quad p_0 = 0.663,$$
$$q_0 = 0.337,\quad p_{1,1} = 0.104,\quad q_{0,0} = 0.132$$

and from these numbers we get

$$\delta = -0.535,\qquad \varepsilon = -\frac{24}{872} = -0.027,\qquad \eta = -\frac{205}{663} = -0.309.$$
Next, we turn to the expression of the coefficient of dispersion
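
$$\frac{\{q(1-3\varepsilon)(1-\eta) + p(1-3\eta)(1-\varepsilon) - 2(1-\varepsilon)(1-\eta)\}(1-\delta) + 2(1-\varepsilon\eta)}{(1-\delta)(1-\varepsilon)(1-\eta)}$$

$$= \frac{1+\delta}{1-\delta}\left\{\frac{1+\varepsilon}{2(1-\varepsilon)} + \frac{1+\eta}{2(1-\eta)}\right\} + \frac{(q-p)(\eta-\varepsilon)}{(1-\varepsilon)(1-\eta)},$$
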
which corresponds to the conditions of my article and is derived there.
If we insert here the values found
p, q , δ, ε, η
and calculate the result, we obtain


0.195
as coefficient of dispersion, which agrees very well with the number
0.208,
found following general rules and independent of our special assumptions, so that one
can hardly demand any better agreement.
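
The value 0.195 can be reproduced by evaluating the expression for the coefficient of dispersion with the rounded values of $p$, $\delta$, $\varepsilon$, and $\eta$ given above. The sketch below evaluates both sides of the equation; the function names are hypothetical, and the second function assumes that the braces on the right-hand side enclose only the two fractions in $\varepsilon$ and $\eta$, which is the reading under which both sides agree.

```python
def dispersion_ratio_form(p, delta, eps, eta):
    """Left-hand side: a single fraction."""
    q = 1 - p
    numerator = (
        (q * (1 - 3 * eps) * (1 - eta)
         + p * (1 - 3 * eta) * (1 - eps)
         - 2 * (1 - eps) * (1 - eta)) * (1 - delta)
        + 2 * (1 - eps * eta)
    )
    return numerator / ((1 - delta) * (1 - eps) * (1 - eta))


def dispersion_split_form(p, delta, eps, eta):
    """Right-hand side: simple-chain factor times a bracket, plus a correction."""
    q = 1 - p
    chain = (1 + delta) / (1 - delta) * (
        (1 + eps) / (2 * (1 - eps)) + (1 + eta) / (2 * (1 - eta))
    )
    correction = (q - p) * (eta - eps) / ((1 - eps) * (1 - eta))
    return chain + correction


lhs = dispersion_ratio_form(0.432, -0.535, -0.027, -0.309)
rhs = dispersion_split_form(0.432, -0.535, -0.027, -0.309)
print(round(lhs, 3), round(rhs, 3))
```

Setting $\varepsilon = \eta = 0$ reduces both forms to $(1+\delta)/(1-\delta)$, the coefficient of the simple chain.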
Of course, we cannot claim that our example satisfies fully all theoretical
assumptions; however, on the other hand, we can scarcely believe that the agreement of
the numbers we have discovered is pure coincidence; rather, that it is related to a certain
correspondence of the theoretical assumptions and the conditions of the example.
Now, we shall turn to the other arrangement of the 20,000 letters in hundreds that
we have made. We construct a table with the repetitions of individual numbers, like
the one before.

26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41

 1   0   0   0   1   2   1   3   5   1   2   9  13  12  13  11

42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57

17  16  15  10  10  16  10  10   5   5   3   3   3   0   1   2

The arithmetic mean of these 200 new numbers is the same as before
43.19.
However, the sum of the squares of their deviations from 43.2 is considerably greater
than before; it is, namely,
5788.8.
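
Both statements, the unchanged mean and the larger sum of squares, can be checked from the second frequency table. The sketch uses only the tabulated numbers; variable names are illustrative.

```python
# Second arrangement: vowels per regrouped hundred (26..57) and how often each occurs.
values = range(26, 58)
freqs = [1, 0, 0, 0, 1, 2, 1, 3, 5, 1, 2, 9, 13, 12, 13, 11,
         17, 16, 15, 10, 10, 16, 10, 10, 5, 5, 3, 3, 3, 0, 1, 2]
counts = dict(zip(values, freqs))

n = sum(counts.values())                                  # number of hundreds
mean = sum(v * k for v, k in counts.items()) / n          # same mean as before
sum_sq = sum(k * (v - 43.2) ** 2 for v, k in counts.items())

print(n, mean, round(sum_sq, 1))
```

The mean must coincide with the earlier one because regrouping only redistributes the same 8638 vowels among different hundreds.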
Here, it is necessary to pay attention to the assumption of the independence of the
quantities, which is usually connected with the method of least squares (see chapter VII
of my book “Calculus of Probability”); let us recall why this assumption is necessary. It
is necessary to determine the weight of the end result that is expressed by equation (21),
and also to calculate the mathematical expectation W, which gives the approximate
value k (see my book). However, this condition will prove to be superfluous, if first,
we leave aside the question of the weight of equation (21), and second, replace $\xi$ in
expression $W$ with the number $a$, which we shall assume to be equal to $a_0$, in that we
disregard the difference $a - a_0$. Then, the equations

$$\text{M. E.}\ \frac{p'x' + p''x'' + \cdots + p^{(n)}x^{(n)}}{p' + p'' + \cdots + p^{(n)}} = a$$

and

$$\text{M. E.}\ \frac{p'(x'-a)^2 + p''(x''-a)^2 + \cdots + p^{(n)}(x^{(n)}-a)^2}{n} = k$$

form the basis of our deductions,⁴ not requiring any independence of the quantities
$x', x'', \ldots, x^{(n)}$.
Based on such equations and the law of large numbers, we suggest that

$$a = \frac{p'a' + p''a'' + \cdots + p^{(n)}a^{(n)}}{p' + p'' + \cdots + p^{(n)}} = a_0$$

and

$$k = \frac{\sum p^{(i)}\,(a^{(i)} - a)^2}{n} = \frac{\sum p^{(i)}\,(a^{(i)} - a_0)^2}{n}.$$
Only the theorem of the weight of the end result, which is expressed by the well-
known equation (22), is disregarded: the weight of the result is the same as the sum of
the weights of all parts.
In the given case, each of our 200 numbers represents the sum of nearly independent
quantities; however, the sums themselves are connected in groups of five so that only
forty of them can be regarded as independent. We have 40 groups of 500 letters each;
in no group of 100 are there letters that are adjacent in the text and this is the reason for
the observed independence of the parts; on the other hand, in each group the letters of
the first hundred are next to those of the second hundred, those of the second hundred
are next to both those of the first and those of the third, etc., and for this reason, as
mentioned above, our numbers are connected in groups of five.
Under these conditions and according to the given explanations, the number

$$\frac{5788.8}{200} = 28.944$$
can be considered as the approximate value of the mathematical expectation of the
square of the deviations of our 200 new numbers
49, 42, 38, 42, 44, ...
from their mathematical expectation, which is approximately
43.2.
If we pass over from the letters (samples) in hundreds to the single letters, we
ascertain that the number
0.28944

⁴ M. E. = mathematical expectation.

does not differ significantly from


0.432 × 0.568 = 0.245376 :
the dispersion coefficient is

$$\frac{28944}{24537.6} = 1.18.$$
If we now turn to the end result
43.19,
then because of the connectedness of the numbers
49, 42, 38, 42, 44, ...
the mathematical expectation of its square of error can no longer be expressed by

$$\frac{28.944}{200} = 0.14472;$$
on the contrary, corresponding to the results of the initial arrangement of the letters in
hundreds, it can be expressed (of course, approximately) by the number

$$\frac{5.114}{200} = 0.02557.$$
The connectedness of the numbers as mentioned appears when their sums are
combined in twos, fours, and particularly in fives. If we calculate for these 100, 50,
and 40 combinations the sums of the squares of their deviations from
86.4, 172.8, and 216,
instead of
5788.8
we obtain the numbers
3551.6, 3089.2, 1004,
the last of which is nearly six times smaller than 5788.8.
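
One way to read these figures, offered here as an illustration rather than as part of the original argument, is to estimate the error of the mean from the forty sums of five alone, treating those forty sums as independent. The variable names are hypothetical; the inputs are the numbers quoted in the text.

```python
sum_sq_fives = 1004.0        # 40 sums of five hundreds, deviations from 216
n_groups, n_hundreds = 40, 200

var_of_500_sum = sum_sq_fives / n_groups        # variance of one 500-letter total
var_of_total = n_groups * var_of_500_sum        # 40 independent totals added up
var_of_mean = var_of_total / n_hundreds ** 2    # mean per hundred = total / 200

naive = 28.944 / 200             # treats the 200 connected numbers as independent
first_arrangement = 5.114 / 200  # estimate from the initial arrangement

ratio = 5788.8 / sum_sq_fives    # "nearly six times smaller"
print(round(var_of_mean, 4), naive, first_arrangement, round(ratio, 2))
```

The result lies close to 0.02557 and far from 0.14472, in line with the statement that the error of the end result should be judged by the initial arrangement.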
