Introstat
N. H. BINGHAM
Syllabus.
I. Simulation (Monte Carlo, ...), Fri. 6, 340, 10-12 am,
II. Estimation (Maximum likelihood, etc.), Fri 6, 340, 3-5 pm.
I. SIMULATION
Our raw material here will be a sequence of (independent) random num-
bers xn , each uniformly distributed on the unit interval: xn ∼ U [0, 1].
The point is that we want the numbers to be random. But we cannot
strictly achieve this except by genuine sampling. This is awkward, indeed
impossible to do in practice if we need very many random numbers. Instead,
we use a computer to generate a (perhaps very long) sequence of pseudo-
random numbers – numbers that are not random at all but deterministic,
but which ‘look random’.
That this may be possible is familiar. Think of mathematical tables, such
as of logs or antilogs, trig functions, etc – say, four-figure tables. The first
digits are informative, and systematic. The last digits are not: they are de-
termined by rounding error from the first digit not displayed, and look like
mere ‘noise’.
Note. You should compare the numerical behaviour of first digits with that
of last digits. For last digits, you would expect each of the ten possibilities
0,1,..,9 to occur with equal frequency 1/10 in the long run. They do; you
can check this. (This is an instance of the ‘Law of Averages’, below, or the
Strong Law of Large Numbers (SLLN) [Stochastic Processes II.10, L14].) By
contrast, the first digits show decreasing frequency from 1 to 9! You should
(i) check this for yourselves numerically;
(ii) then check out the theory here – Benford’s Law: Frank BENFORD (1883-
1948) in 1938.
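Check (i) can be sketched in a few lines of code. (A sketch only: the sequence 2^n below is my illustrative stand-in for a table of values, not something specified above; it is a standard example of a Benford sequence.)

```python
import math
from collections import Counter

# First digits of 2^n, n = 1..5000, compared with Benford's Law:
# P(first digit = d) = log10(1 + 1/d), so 1 is the most frequent first digit.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 5001))
for d in range(1, 10):
    observed = counts[d] / 5000
    benford = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.3f}, Benford {benford:.3f}")
```

The observed frequencies decrease from digit 1 (about 0.301) to digit 9 (about 0.046), as Benford's Law predicts.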
Our main theoretical tool for generating such pseudo-random sequences
is the congruential generator: starting from a seed x₀, put
x_{n+1} = (a xₙ + c) mod m,
for fixed constants a (the multiplier), c (the increment) and m (the modulus);
scaling by m then gives pseudo-uniforms xₙ/m in [0, 1).
These were introduced by D. H. LEHMER (1905-1991) in 1948. We shall
take it for granted that such congruential generators ‘work’; for background,
see e.g.
D. E. KNUTH, The Art of Computer Programming, Vol. 2: Semi-numerical
algorithms, Addison-Wesley, 1969, Ch. 3.
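A minimal sketch of such a generator (the parameter values a = 16807, m = 2³¹ − 1 are my illustrative choice, the well-known Lehmer/'minimal standard' multiplier, not something fixed by the text):

```python
# Linear congruential generator: x_{n+1} = (a * x_n + c) mod m,
# scaled by m to give pseudo-uniforms in [0, 1).
def lcg(seed, a=16807, c=0, m=2**31 - 1):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m

gen = lcg(seed=12345)
sample = [next(gen) for _ in range(10000)]
print(sum(sample) / len(sample))  # should be near 1/2 for a uniform law
```

The sample mean being close to 1/2 is of course only the crudest of checks; Knuth's Ch. 3 discusses serious statistical tests.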
Note. Donald E. Knuth (1938–) is also the inventor of TeX (pronounced
‘tech’ – ch as in ‘loch’), in 1978. This is now known as plain TeX; more widely
used nowadays, and recommended in this MSc, is LaTeX (Leslie LAMPORT
(1941–), in 1986).
The uniform distribution (or more briefly, uniform law) U[0, 1] models
probability = length:
P(a ≤ X ≤ b) = b − a   (0 ≤ a ≤ b ≤ 1).
Now let F be the distribution function of the law we wish to sample from,
F(x) := P(X ≤ x),
and suppose for simplicity that F is continuous and strictly increasing, so
that the inverse F⁻¹ exists. Then if U ∼ U[0, 1], X := F⁻¹(U) has distribution
function F – the Probability Integral Transform (PIT).
Proof.
P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x). //
So, as we can generate a sequence of uniforms (above), we can hence
generate a sequence sampled from any given distribution F.
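The inversion recipe X := F⁻¹(U) can be sketched as follows (the particular F here, F(x) = x² on [0, 1], is my illustrative choice, picked because its inverse is explicit):

```python
import random

random.seed(42)  # fixed seed, so the run is reproducible

def sample_from(F_inv, n):
    """Inverse-transform sampling: if U ~ U[0,1], then F_inv(U) ~ F."""
    return [F_inv(random.random()) for _ in range(n)]

# Illustration: F(x) = x^2 on [0,1] (density 2x), so F^{-1}(u) = sqrt(u).
xs = sample_from(lambda u: u ** 0.5, 100000)
print(sum(xs) / len(xs))  # population mean is 2/3
```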
Examples.
1. Coin-tossing. We can generate coin-tosses by taking tails (T, or 0) if
U < 1/2, heads (H, or 1) if not.
2. Rolling dice. Similarly, we can generate die-rolls by taking the outcome
as 1 if 0 ≤ U < 1/6, 2 if 1/6 ≤ U < 1/3, etc.
3. Dealing bridge hands. There are 52 cards in a pack (4 suits, spades ♠,
hearts ♥, diamonds ♦, clubs ♣, 13 cards per suit – 2,..., 10, J, Q, K, A).
By labelling each card with a number from 1 to 52, we can as above ‘deal
each card uniformly’ (make it equally likely that each player N, S, E, W gets
it). [To make each player get 13 cards is more tricky, and involves ‘sampling
without replacement’, or we can proceed as follows. Proceed with prob. 1/4
each until one player has 13 cards, then with prob. 1/3 each till the next
has 13, then with prob. 1/2 till the next, then the remaining player gets the
rest.]
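The dealing procedure in brackets can be sketched directly: giving each card, with equal probability, to one of the players still short of 13 cards is the same as the stated 1/4, 1/3, 1/2 scheme.

```python
import random

random.seed(2024)  # fixed seed, so the deal is reproducible

def deal_bridge():
    """Deal cards labelled 1..52 to N, S, E, W: each card goes, with equal
    probability, to one of the players not yet holding 13 cards."""
    hands = {p: [] for p in "NSEW"}
    for card in range(1, 53):
        eligible = [p for p in "NSEW" if len(hands[p]) < 13]
        hands[random.choice(eligible)].append(card)
    return hands

hands = deal_bridge()
print({p: len(h) for p, h in hands.items()})  # each player gets 13 cards
```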
4. Tennis. In a tennis game, suppose the server wins each point with prob-
ability p. What is the probability that the server wins the game? the set?
the match?
Draw a picture of a game of tennis.
In the bottom row is the starting point, 0-0.
In the first row are the two possibilities after 1 point, 15-0 and 0-15.
In the second row are the three possibilities after 2 points, 30-0, 15-15, 0-30.
In the third row are the four possibilities after 3 points, 40-0, 30-15, 15-30,
0-40.
In the fourth row are two of the possibilities after 4 points in which the game
continues: 15-40 and 40-15.
In the top [fifth] row are the five remaining possibilities after 4 or 5 points:
Win (for server), 40-30, 30-30, 30-40, Lose.
If the game continues, we can combine ‘Advantage in’ with 40-30, ‘Deuce’
with 30-30, ‘Advantage out’ with 30-40 (you should check this!).
Note. (i) 40 is short here for 45. Formerly each player had a clock, and each
point was worth 1/4 of a revolution, i.e. 15 minutes. To win, a player
had to complete a revolution but be more than one point ahead.
With a flow-diagram to represent these 17 states, one can ‘play tennis’
by computer, with each point leading to the upper left and right neighbours
with probabilities p, 1 − p.
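One can indeed 'play tennis' by computer. The sketch below skips the explicit 17-state flow-diagram and simply scores points until the game is decided (at least four points, and two clear); the value p = 0.6 is an arbitrary illustrative choice.

```python
import random

random.seed(1)

def server_wins_game(p):
    """Play one game: server wins each point with probability p;
    the game ends when one side has >= 4 points and leads by 2."""
    s, r = 0, 0
    while True:
        if random.random() < p:
            s += 1
        else:
            r += 1
        if max(s, r) >= 4 and abs(s - r) >= 2:
            return s > r

p = 0.6
n = 100000
wins = sum(server_wins_game(p) for _ in range(n))
print(wins / n)  # for p = 0.6 the exact answer is about 0.736
```

Note how a modest edge per point (0.6) becomes a much bigger edge per game, and bigger still per set and match.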
(ii) The general context of such flow-diagrams is that of finite Markov chains.
See e.g.
John G. KEMENY and J. Laurie SNELL, Finite Markov chains, Van Nos-
trand, 1960 (tennis, 7.2 p. 161-7),
Olle HÄGGSTRÖM, Finite Markov chains and algorithmic applications,
CUP, 2002.
5. Binomial tree. In a discrete financial model, suppose that at each stage the
price of a risky asset can go up or down. One can model the price evolution
over a finite time-period (say, from the start of an option to its expiry) by
a binomial tree, with two paths leading from each node, one ‘up right’, one
‘down right’. This gives the Cox-Ross-Rubinstein binomial tree model (1979).
This leads to the discrete Black-Scholes model for option pricing. Lurking
in the background here is the binomial distribution, which in the limit gives
the normal distribution (Central Limit Theorem (CLT), or ‘Law of Errors’
– below). This leads from the discrete Black-Scholes formula (1979) to the
more familiar and famous (continuous) Black-Scholes formula of mathemat-
ical finance (1973).
Densities.
1. Exponential distribution, E(λ) (λ > 0 is the parameter): here
f(x) = λe^{−λx}, F(x) = 1 − e^{−λx} (x > 0),
so
F⁻¹(U) = −λ⁻¹ log(1 − U) ∼ E(λ).
So we use instead
−λ⁻¹ log U ∼ E(λ),
since 1 − U ∼ U[0, 1] also. Note that the two last formulae are equivalent
mathematically, but not computationally: it would be avoidably inefficient,
and so count as a programming error, not to use the second.
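In code (a sketch; λ = 2 is an arbitrary illustrative value):

```python
import math
import random

random.seed(7)

lam = 2.0
n = 100000
# -log(U)/lam ~ E(lam), which has mean 1/lam
xs = [-math.log(random.random()) / lam for _ in range(n)]
mean_x = sum(xs) / n
print(mean_x)  # should be near 1/lam = 0.5
```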
2. Gamma distribution, Γ(n, λ).
This has moment-generating function (MGF) (λ/(λ − t))ⁿ, and is the distri-
bution of the sum of n (independent) copies of E(λ).
3. Chi-square, χ²(n) = Γ(n/2, 1/2). If the Xᵢ are independent copies of the
standard normal (below),
X₁² + · · · + Xₙ² ∼ χ²(n).
4. Standard normal distribution, Φ = N (0, 1). Density φ and distribution Φ:
φ(x) = (2π)^{−1/2} e^{−x²/2},   Φ(x) = ∫_{−∞}^x φ(u) du = (2π)^{−1/2} ∫_{−∞}^x e^{−u²/2} du.
There is no closed form for Φ (except that Φ(0) = 1/2 by symmetry), so
none for Φ−1 . So how do we use PIT to simulate from Φ? One method is the
Box-Muller method of 1958. Recall that the element of area dA is dx dy in
plane cartesian coordinates and r drdθ in plane polar coordinates. If X, Y
are independent N (0, 1),
P(X² + Y² ≤ R²) = ∫∫_{x²+y²≤R²} (2π)⁻¹ e^{−(x²+y²)/2} dx dy
               = ∫₀^{2π} ∫₀^R (2π)⁻¹ e^{−r²/2} r dr dθ
               = ∫₀^R e^{−r²/2} d(r²/2)
               = 1 − e^{−R²/2}.
This says that r² := X² + Y² ∼ E(1/2), the exponential law with parameter
1/2 (mean 2). We know how to simulate from this, by above!
θ ∼ U [0, 2π] by symmetry: simulate by θ = 2πU . So take U1 , U2 (inde-
pendent) ∼ U [0, 1],
r² := −2 log U₁ (∼ E(1/2)),   r := √(−2 log U₁),   θ := 2πU₂.
Then X := r cos θ, Y := r sin θ are independent N (0, 1).
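The Box-Muller method in code (a sketch):

```python
import math
import random

random.seed(0)

def box_muller():
    """One Box-Muller step: two independent uniforms -> two independent N(0,1)."""
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2 * math.log(u1))
    theta = 2 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

# Generate 100,000 standard normals (50,000 pairs) and check mean and variance.
zs = [z for _ in range(50000) for z in box_muller()]
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs)
print(mean, var)  # should be near 0 and 1
```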
Rejection Method.
This is due to John von NEUMANN (1903-1957) in 1951. Suppose we
have a density f . Then the area under the curve is 1. The subgraph of f is
{(x, y) : 0 ≤ y ≤ f (x)}. So the area of the subgraph is 1. By definition of
density,
P (X ∈ [x, x + dx]) = f (x)dx = dA,
where A denotes area under the subgraph to the left of x. So (‘probability
= area’) X has density f iff X is the x-coordinate of a point uniformly
distributed over the subgraph of f . So we can go from uniform points (X, Y )
on the subgraph to points X with density f by projecting onto the first
coordinate; conversely, we can go from such an X to such an (X, Y) by
taking Y := U f(x) given X = x (where as usual U ∼ U[0, 1]).
Suppose we have a density g that we know how to simulate from, and a
density f that we don’t know how to simulate from, but
f(x) ≤ cg(x)
for some constant c ≥ 1. Write F for the subgraph of f and G for that of cg,
so F ⊂ G (writing |.| for area, |F| = 1 and |G| = c). Simulate points (X, Y)
uniformly on G (take X with density g, then Y := U cg(X) as above), and
reject any point falling above the graph of f, i.e. outside F. Now for B ⊂ F,
the distribution of the non-rejected points (i.e. of the points conditional on
their being in F) is given by
(|B ∩ F|/|G|) / (|F|/|G|) = |B ∩ F|/|F|.
This says that the non-rejected points are uniform over F , the subgraph of
f , i.e. that they have density f , as required. //
Note. The closer the graph of f is to that of cg, the fewer points are rejected,
and the greater the computational efficiency. For heavy computational use,
it is worth making an effort to achieve such a ‘good fit’, but for details we
must refer to a specialist book on simulation.
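A minimal sketch of the rejection method (the target f(x) = 3x² on [0, 1], uniform proposal g and bound c = 3 are my illustrative choices, not from the text):

```python
import random

random.seed(3)

def rejection_sample(f, g_sample, g_pdf, c, n):
    """von Neumann rejection: propose X ~ g, accept with probability
    f(X) / (c * g(X)); the accepted points have density f."""
    out = []
    while len(out) < n:
        x = g_sample()
        if random.random() <= f(x) / (c * g_pdf(x)):
            out.append(x)
    return out

# Target f(x) = 3x^2 on [0,1]; proposal g = U[0,1], so f <= 3g and c = 3.
xs = rejection_sample(lambda x: 3 * x * x, random.random, lambda x: 1.0, 3, 50000)
print(sum(xs) / len(xs))  # population mean of density 3x^2 on [0,1] is 3/4
```

Here a proportion 1/c = 1/3 of proposals are accepted, illustrating the efficiency remark above: the closer c is to 1, the less work is wasted.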
By the Strong Law of Large Numbers, if X₁, X₂, . . . are independent with
common distribution F and f is integrable,
(1/n) Σₖ f(Xₖ) → E f(X₁) = ∫ f dF   (n → ∞) a.s.
(a.s. = almost surely, or with probability one – see SP). The idea of the
Monte Carlo method is to simulate the Xₖ on the left, form the average on
the left numerically, and use it as an approximation to the expectation or
integral on the right. The method is widely used, and very powerful.
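A minimal Monte Carlo sketch (the integrand e^{−x²} on [0, 1] is my illustrative choice; here X ∼ U[0, 1], so the expectation is an ordinary integral):

```python
import math
import random

random.seed(11)

# Monte Carlo: (1/n) * sum of f(X_k) approximates E f(X) = integral of f.
# Illustration: estimate the integral of exp(-x^2) over [0, 1].
n = 200000
estimate = sum(math.exp(-random.random() ** 2) for _ in range(n)) / n
print(estimate)  # true value is (sqrt(pi)/2) * erf(1), about 0.7468
```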
The idea can be traced back to Buffon’s needle (G. L. Leclerc, Comte de
BUFFON (1707-1788) in 1777), but is due in its modern form to Stanislaw
ULAM (1909-1984) in 1946. It emerged in work by physicists at the Los
Alamos Laboratory (Manhattan Project, WWII, atom (fission) bomb, pre-
computer, then 1950s, hydrogen (fusion) bomb, with computers).
There is a whole area of Statistics called Markov Chain Monte Carlo or
MCMC, based on this idea (Professor Alastair Young of the Statistics Sec-
tion here is the local expert).
Borel’s theorem (1909): for almost all reals, each digit 0, 1, . . . , 9 occurs in
the decimal expansion with limiting frequency 1/10 – an instance of the Law
of Large Numbers mentioned earlier. This property is called (strong) nor-
mality; Borel’s result says that almost all reals are strongly normal. But
there is nothing special about the base 10 of decimals: we can use the base
2 of binary (as computers do), etc. Borel’s result says more: almost all reals
are strongly normal to all bases simultaneously.
Now that we know that almost all reals behave like this, it would be nice
to have a specific example – and π is the obvious candidate. The decimal
expansion of π has been subjected to every statistical test for randomness
known to Statistics – and passed them all with flying colours (hence the
Chudnovsky quotation above). This strongly suggests that π is indeed nor-
mal – but does not prove it. Indeed, there is no reason to suppose that we
will ever be able to prove this. From the point of view of Number Theory,
there is only one natural way to expand π, and that is as a continued fraction:
π/4 = 1/(1 + 1²/(2 + 3²/(2 + 5²/(2 + · · · ))))
(William, Lord BROUNCKER (1620-1684), in 1655 – related to Wallis’ prod-
uct for π, for which see e.g. my home-page, M2PM3, L32).
Of course we all know that π begins 3.14159..., so the string “14159” has
us all thinking “pi”. But if we start after, say, a million places, the expansion
would “look random”. One can even personalise this. My date of birth is
19.03.1945; the decimal expansion of π started after 19,031,945 places would
be perfectly deterministic and predictable to me, but would look perfectly
random to anyone who did not know this.
Such examples make one think about what randomness is. Consider, for
example, a thousand independent tosses of a fair coin (heads = 1, tails =
0). There are 2¹⁰⁰⁰ possible outcomes, each with equal probability 2⁻¹⁰⁰⁰ by
symmetry. Imagine two such outcomes, (i) one obtained by you, laboriously
tossing a coin a thousand times – or, simulating as above, (ii) all 1s. There is
one sense in which these are on the same footing (symmetry, above). There
is another in which they are obviously not: the first takes a thousand bits
of information to describe, the second takes two. Lurking in the background
here is a whole subject – Algorithmic Information Theory. This was created
by A. N. Kolmogorov (1903-1987) – the same Kolmogorov who gave us the
measure-theoretic probability we use all the time, though these are different
theories!
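One can make this description-length point semi-quantitative with a general-purpose compressor as a crude stand-in for algorithmic information (zlib is my choice here, and only a proxy: Kolmogorov complexity itself is not computable):

```python
import random
import zlib

random.seed(5)

# Two outcomes of 1000 coin tosses, equally probable by symmetry, but with
# very different description lengths; zlib compressed size is a crude proxy.
tossed = bytes(random.getrandbits(1) + 48 for _ in range(1000))  # '0'/'1' chars
all_ones = b"1" * 1000
print(len(zlib.compress(tossed)), len(zlib.compress(all_ones)))
```

The simulated tosses compress hardly at all (they carry about a thousand bits of information), while the all-ones string collapses to a handful of bytes.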
II. ESTIMATION (of parameters); LIKELIHOOD
Normal, N(µ, σ²): f(x|µ, σ) := (σ√(2π))⁻¹ exp{−(x − µ)²/(2σ²)}.
Here µ is the mean, σ² the variance (σ > 0 the standard deviation, or SD);
µ, σ are the parameters (the term is due to R. A. (Sir Ronald) FISHER (1890-
1962) in 1922).
Exponential E(λ) (λ > 0): f (x|λ) := λe−λx , x > 0.
Uniform, U[a, b] (a < b): f(x|a, b) := 1/(b − a) on [a, b], 0 elsewhere.
Poisson, P (λ) (λ > 0): f (k|λ) := e−λ λk /k! (k = 0, 1, . . .).
We write θ for a parameter (scalar or vector), and write such examples
as f (x|θ), which we will call the density (w.r.t. Lebesgue measure in the first
three examples, counting measure in the fourth – see SP Ch. I). Here x is
the argument of a function, the density function.
If we have n independent copies sampled from this density, the joint
density is the product of the marginal densities:
f(x₁|θ) · · · f(xₙ|θ).   (∗)
DATA.
Now suppose that the numerical values of the random variables in our
data set are x1 , . . . , xn . Fisher’s great idea of 1912 was to put the data xi
where the arguments xi were in (∗). He called this (later, 1921 on) the
likelihood, L – a function of the parameter θ:
L(θ) := f(x₁|θ) · · · f(xₙ|θ).
The data points will tend to be concentrated where the probability is con-
centrated. Fisher advocated choosing as our estimate of the (unknown, but
non-random) parameter θ, the value(s) θ̂ (or θ̂n ) for which the likelihood
L(θ) is maximised. This gives the maximum likelihood estimator (MLE); the
method is the Method of Maximum Likelihood. It is intuitive, simple to use
and very powerful – ‘everyone’s favourite method of estimating parameters’.
It is often more convenient to use the log-likelihood,
` := log L,
and maximise that instead (as log is increasing, maximising L and ` are the
same).
Examples.
1. Normal, N(µ, σ²).
L = σ⁻ⁿ(2π)^{−n/2} exp{−(1/2) Σᵢ (xᵢ − µ)²/σ²},
ℓ = const − n log σ − (1/2) Σᵢ (xᵢ − µ)²/σ².
∂ℓ/∂µ = 0 :  Σᵢ (xᵢ − µ) = 0,  µ = (1/n) Σᵢ xᵢ.
So the MLE of the (population) mean µ is the sample mean (average of the
data points):
µ̂ = x̄,   x̄ := (1/n) Σᵢ xᵢ.
This makes sense: one would hope and expect that the sample mean is
informative about the population mean. Indeed, by SLLN,
µ̂ = X̄ → E X  (n → ∞) a.s.
(we revert to capitals for random variables; we use lower case for data values,
= observed values of random variables).
∂ℓ/∂σ = 0 :  −n/σ + (1/σ³) Σᵢ (xᵢ − µ)² = 0,  σ² = (1/n) Σᵢ (xᵢ − µ)².
The RHS is called the sample variance, S². So the MLEs of the population
mean and variance µ, σ² are the sample mean and variance x̄, S².
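A quick numerical check by simulation (the true values µ = 2, σ = 1.5 are arbitrary illustrative choices):

```python
import random

random.seed(8)

mu, sigma = 2.0, 1.5  # 'unknown' true parameters, chosen for illustration
xs = [random.gauss(mu, sigma) for _ in range(100000)]
n = len(xs)
mu_hat = sum(xs) / n                              # sample mean = MLE of mu
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # sample variance = MLE of sigma^2
print(mu_hat, var_hat)  # should be near mu = 2 and sigma^2 = 2.25
```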
Note. 1. Many authors use 1/(n − 1) in place of 1/n in the definition of the
sample variance (this is needed to get the estimate unbiased). But for large
n, there is little difference.
2. We can extend the bar notation:
σ̂² = \overline{(X − X̄)²} = \overline{X²} − 2X̄·X̄ + X̄² = \overline{X²} − X̄².
Then by SLLN,
\overline{X²} → E(X²), X̄ → E X, so σ̂² → E(X²) − (E X)² = var X  (n → ∞) a.s.
Thus the bar notation is ideally suited to use of SLLN. We can show similarly
that the (suitably defined) sample covariance and correlation tend to the
corresponding population covariance and correlation, etc.
3. The above shows clearly that desirable properties of estimators (e.g. being
MLEs and being unbiased) may be incompatible.
2. Exponential. For E(λ), the mean EX = 1/λ.
L = λⁿ exp{−λ Σᵢ xᵢ} = λⁿ exp{−nλx̄},
ℓ = n log λ − nλx̄.
∂ℓ/∂λ = 0 :  n/λ − nx̄ = 0,  λ̂ = 1/x̄.
Again, this is natural, and what we would expect.
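Again a quick check by simulation (λ = 3 is an arbitrary illustrative value):

```python
import random

random.seed(9)

lam = 3.0  # 'unknown' true parameter, chosen for illustration
xs = [random.expovariate(lam) for _ in range(100000)]
xbar = sum(xs) / len(xs)
lam_hat = 1 / xbar  # MLE: since E X = 1/lambda, estimate lambda by 1/xbar
print(lam_hat)  # should be near lam = 3
```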
3. Uniform, U[a, b]. Here
L = (b − a)⁻ⁿ 1(a ≤ x₁, . . . , xₙ ≤ b) = (b − a)⁻ⁿ 1(a ≤ min, max ≤ b).
We are to maximise this wrt a, b. Don’t use calculus as above (we can’t
– the RHS is discontinuous, so not differentiable). Instead, we can do the
maximising of L on sight: the MLEs are
â = min, b̂ = max .
Note. We shall see later that this example is less well behaved, and different
from, the ones above.
Sufficiency.
Return to the normal example. Since Σᵢ (xᵢ − µ)² = Σᵢ (xᵢ − x̄)² + n(x̄ − µ)² =
n[S² + (x̄ − µ)²] (the cross-terms vanish),
L = σ⁻ⁿ(2π)^{−n/2} exp{−(n/2)[S² + (x̄ − µ)²]/σ²}.
This involves the data x1 , . . . , xn only through x̄ and S 2 (equivalently, only
through x̄ and x2 ). Suppose n = 1, 000, and we record these two statistics,
but lose the data. Does it matter? There are two plausible views:
(i) Yes. A thousand numbers are more informative than two.
(ii) No. As above, we expect x̄, S 2 to be informative about µ, σ 2 , and the
fact that the likelihood only involves these confirms this.
In fact the optimistic view (ii) is the correct one: we can reduce the data
set of 1,000 values down to just two, without loss of information. This is
the idea of sufficiency, due to Fisher in 1920. It can be formulated in var-
ious equivalent ways, one of which is that above: we say that a statistic T
is sufficient for a parameter θ if the likelihood factorises into factors, one
which involves the data only through T (and may involve the parameters),
the other free of the parameters.
Sufficiency (or data reduction) is such a good idea that it should always
be used, to reduce the data. We would naturally like to be able to reduce
the data as much as possible (without loss of information). This is the idea
of minimal sufficiency (Lehmann and Scheffé; see Statistical Methods for Fi-
nance – SMF – I.4).
Example: Uniform distribution.
The form of the likelihood above shows that (min, max) is sufficient for
(a, b). [It is also minimal sufficient, but this is clear: we couldn’t expect a fur-
ther reduction, to a one-dimensional statistic, to suffice for a two-dimensional
parameter.]
Limit Theorems
This qualifying phrase ‘under suitable regularity conditions’ is ubiquitous
in large-sample theory in Statistics. What is needed is that the model be
smooth enough to justify differentiating under the integral sign wrt θ twice
(this is what is needed to get the Cramér-Rao lower bound). Some such
condition is needed, as the following example shows.
Note the contrast to the above! There, the limit is normal, and the rate of
convergence is √n. Here, the limit is exponential E(1), and the rate of
convergence is n, much faster. It looks as if this faster rate of convergence
is an advantage. So it is – but we pay a heavy price for it. We have no
protection against contamination of the data. If one of the data points is
way too small (say), it will permanently affect the sample minimum, and so
the estimate of a. By contrast, in the examples above, the influence of one
bad data point will be damped out as the sample size increases.
The branch of Statistics concerned with such protection against data
contamination is called Robust Statistics. The phenomenon noted above
(‘too rapid convergence’) is called super-efficiency – and indicates extreme
non-robustness. When the ‘suitable regularity conditions’ hold, we have a
regular maximum-likelihood estimation problem. The example above of the
uniform distribution U [a, b] is non-regular. The non-regularity results from
the support of the distribution (the region where it is positive) depending on
the parameters (the support is [a, b]).
Uniform distribution (continued).
Suppose now that the length b−a of the interval in U [a, b] is known. Then
we can w.l.o.g. take it as 1. This is now a one-parameter problem rather
than a two-parameter one; the natural choice of parameter is the mid-point,
θ, giving U[θ − 1/2, θ + 1/2]. The likelihood now is
L(θ) = 1(θ − 1/2 ≤ x₁, . . . , xₙ ≤ θ + 1/2) = 1(θ − 1/2 ≤ min, max ≤ θ + 1/2).
The maximum value is 1, but this is now attained on an entire interval,
[max −1/2, min +1/2]. So we have infinitely many MLEs! Of course, we prefer the
symmetrical choice, of the sample mid-range. This is the average of max and
min, while the sample range is their difference:
mid := (max + min)/2,   ran := max − min.
The argument above shows that the MLE mid is super-efficient for θ, with
convergence rate n as for U [a, b], but now the limit law is the symmetric
exponential, SE(1), with density
f(x) = (1/2) e^{−|x|}:
n(mid − θ) → SE(1)  (n → ∞) in distribution.
The argument above also shows that {min, max} is (minimal) sufficient for θ,
but mid is not! So now we have (i) a non-unique MLE; (ii) a two-dimensional
minimal sufficient statistic {min, max} – equivalently, {mid, ran} – for a one-
dimensional parameter θ.
Note. This shows that ran is relevant. But this is strange: the parameter θ
is a location parameter, while ran = max − min is translation-invariant, and
so tells us nothing about the location of the interval, i.e. about θ. True –
but what ran does tell us about is the accuracy of mid as an estimator for
θ. If we are lucky and get a ‘good’ sample, max and min will be nearly as
far apart as they can be (1), and their average mid will be close to θ. But if
we are unlucky and have a ‘bad’ sample, they may be close, and then their
average mid may be nearly as far away from θ as it can be (1/2). So ran is
uninformative about θ on its own, but {mid, ran} is more informative than
mid alone. We say that ran is ancillary for θ (Fisher, 1934).
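The rate-n convergence of mid can be seen by simulation (a sketch; θ = 5 is an arbitrary illustrative choice). The mean absolute error falls roughly like 1/n, against 1/√n for a regular estimator:

```python
import random

random.seed(10)

theta = 5.0  # 'unknown' mid-point, chosen for illustration

def mid_range(n):
    """Sample mid-range of n draws from U[theta - 1/2, theta + 1/2]."""
    xs = [random.uniform(theta - 0.5, theta + 0.5) for _ in range(n)]
    return (min(xs) + max(xs)) / 2

avg_err = {}
for n in (100, 1000, 10000):
    errs = [abs(mid_range(n) - theta) for _ in range(300)]
    avg_err[n] = sum(errs) / len(errs)
print(avg_err)  # mean error shrinks by about a factor 10 per tenfold n
```

Replacing one draw by a wild outlier would, as the text warns, wreck mid while barely moving the sample mean: super-efficiency bought at the price of non-robustness.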
Higher dimensions
The above goes through with minimal changes in higher dimensions.
Normal distribution, N (µ, Σ). In d dimensions, the mean µ is a d-vector and
the covariance matrix Σ = (σij ) is a positive definite symmetric d × d matrix.
The sample mean x̄ (a d-vector) and sample covariance matrix S (a d × d
matrix) can be defined. They are the MLEs for µ, Σ, and by the LLN
x̄ → µ, S → Σ.
Also (x̄, S) are (minimal) sufficient for (µ, Σ). See e.g.
[BF] N. H. BINGHAM and John M. FRY, Regression: Linear Models in
Statistics, SUMS (Springer Undergraduate Mathematics Series), 2010.
The information (per reading) I(θ) becomes the information matrix. With
these changes, the above theory goes through much as before.
Method of Moments
This is due to Karl PEARSON (1857-1936) in the 1880s. It is generally
inferior to the Method of Maximum Likelihood, but can be useful. It con-
sists of estimating parameters by matching sample moments to population
moments.
Example: The Binomial distribution, B(k, p). Here k = 1, 2, . . ., p ∈ (0, 1),
P(X = i) = C(k, i) pⁱ (1 − p)^{k−i}   (i = 0, 1, . . . , k),
where C(k, i) = k!/(i!(k − i)!) is the binomial coefficient.
This counts the number of successes in k Bernoulli trials (or heads in k tosses
of a biased coin).
Now assume that both p and k are unknown. Sample x₁, . . . , xₙ from
B(k, p). The first two sample moments are x̄ and \overline{x²} = S² + x̄². The mean and
variance of the binomial are µ = kp, σ² = kp(1 − p) = kp − kp². Matching
moments gives two equations for the two unknown parameters:
x̄ = kp,   \overline{x²} = kp(1 − p) + k²p².
So
S² = kp − kp²,   x̄ = kp,
or x̄ − S² = kp², (x̄)² = k²p², giving the estimates
k̃ = (x̄)²/(x̄ − S²),   p̃ = x̄/k̃.
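A numerical sketch of the moment estimates (the true values k = 20, p = 0.3 are arbitrary illustrative choices):

```python
import random

random.seed(12)

k_true, p_true = 20, 0.3  # 'unknown' true values, chosen for illustration
# Each observation counts successes in k_true Bernoulli(p_true) trials.
xs = [sum(random.random() < p_true for _ in range(k_true)) for _ in range(100000)]
n = len(xs)
xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / n  # sample variance (1/n form)
k_tilde = xbar ** 2 / (xbar - s2)          # moment estimate of k
p_tilde = xbar / k_tilde                   # moment estimate of p
print(k_tilde, p_tilde)  # should be near k = 20, p = 0.3
```

Note that k̃ divides by x̄ − S², which is small (here kp²); with few observations the estimate of k can be very unstable, one reason the method is generally inferior to maximum likelihood.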
Application: under-reported crime (example: rape); k is the number of
crimes, p is the probability that a crime is reported. So: Statistics can
help police with crime that hasn’t even been reported to them! NHB