Introstat
N. H. BINGHAM
Syllabus.
I. Simulation (Monte Carlo, ...), Fri. 6, 340, 10-12 am,
II. Estimation (Maximum likelihood, etc.), Fri 6, 340, 3-5 pm.
I. SIMULATION
Our raw material here will be a sequence of (independent) random num-
bers xn , each uniformly distributed on the unit interval: xn ∼ U [0, 1].
The point is that we want the numbers to be random. But we cannot
strictly achieve this except by genuine sampling. This is awkward, indeed
impossible to do in practice if we need very many random numbers. Instead,
we use a computer to generate a (perhaps very long) sequence of pseudo-
random numbers – numbers that are not random at all but deterministic,
but which ‘look random’.
That this may be possible is familiar. Think of mathematical tables, such
as of logs or antilogs, trig functions, etc – say, four-figure tables. The first
digits are informative, and systematic. The last digits are not: they are de-
termined by rounding error from the first digit not displayed, and look like
mere ‘noise’.
Note. You should compare the numerical behaviour of first digits with that
of last digits. For last digits, you would expect each of the ten possibilities
0,1,..,9 to occur with equal frequency 1/10 in the long run. They do; you
can check this. (This is an instance of the ‘Law of Averages’, below, or the
Strong Law of Large Numbers (SLLN) [Stochastic Processes II.10, L14].) By
contrast, the first digits show decreasing frequency from 1 to 9! You should
(i) check this for yourselves numerically;
(ii) then check out the theory here – Benford’s Law: Frank BENFORD (1883-
1948) in 1938.
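Check (i) can be sketched in a few lines of code. (A sketch only: the sequence 2^n below is my illustrative stand-in for a table of values, not something specified above; it is a standard example of a Benford sequence.)

```python
import math
from collections import Counter

# First digits of 2^n, n = 1..5000, compared with Benford's Law:
# P(first digit = d) = log10(1 + 1/d), so 1 is the most frequent first digit.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 5001))
for d in range(1, 10):
    observed = counts[d] / 5000
    benford = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.3f}, Benford {benford:.3f}")
```

The observed frequencies decrease from digit 1 (about 0.301) to digit 9 (about 0.046), as Benford's Law predicts.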
Our main theoretical tool for generating such pseudo-random sequences
is the congruential generator: starting from a seed x₀, put
x_{n+1} = (a xₙ + c) mod m,
for fixed constants a (the multiplier), c (the increment) and m (the modulus);
scaling by m then gives pseudo-uniforms xₙ/m in [0, 1).
These were introduced by D. H. LEHMER (1905-1991) in 1948. We shall
take it for granted that such congruential generators ‘work’; for background,
see e.g.
D. E. KNUTH, The Art of Computer Programming, Vol. 2: Semi-numerical
algorithms, Addison-Wesley, 1969, Ch. 3.
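A minimal sketch of such a generator (the parameter values a = 16807, m = 2³¹ − 1 are my illustrative choice, the well-known Lehmer/'minimal standard' multiplier, not something fixed by the text):

```python
# Linear congruential generator: x_{n+1} = (a * x_n + c) mod m,
# scaled by m to give pseudo-uniforms in [0, 1).
def lcg(seed, a=16807, c=0, m=2**31 - 1):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m

gen = lcg(seed=12345)
sample = [next(gen) for _ in range(10000)]
print(sum(sample) / len(sample))  # should be near 1/2 for a uniform law
```

The sample mean being close to 1/2 is of course only the crudest of checks; Knuth's Ch. 3 discusses serious statistical tests.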
Note. Donald E. Knuth (1938–) is also the inventor of TeX (pronounced
‘tech’ – ch as in ‘loch’), in 1978. This is now known as plain TeX; more widely
used nowadays, and recommended in this MSc, is LaTeX (Leslie LAMPORT
(1941–), in 1986).
The uniform distribution (or more briefly, uniform law) U[0, 1] models
probability = length:
P(a ≤ X ≤ b) = b − a   (0 ≤ a ≤ b ≤ 1).
Now let F be the distribution function of the law we wish to sample from,
F(x) := P(X ≤ x),
and suppose for simplicity that F is continuous and strictly increasing, so
that the inverse F⁻¹ exists. Then if U ∼ U[0, 1], X := F⁻¹(U) has distribution
function F – the Probability Integral Transform (PIT).
Proof.
P(F⁻¹(U) ≤ x) = P(U ≤ F(x)) = F(x). //
So, as we can generate a sequence of uniforms (above), we can hence
generate a sequence sampled from any given distribution F.
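The inversion recipe X := F⁻¹(U) can be sketched as follows (the particular F here, F(x) = x² on [0, 1], is my illustrative choice, picked because its inverse is explicit):

```python
import random

random.seed(42)  # fixed seed, so the run is reproducible

def sample_from(F_inv, n):
    """Inverse-transform sampling: if U ~ U[0,1], then F_inv(U) ~ F."""
    return [F_inv(random.random()) for _ in range(n)]

# Illustration: F(x) = x^2 on [0,1] (density 2x), so F^{-1}(u) = sqrt(u).
xs = sample_from(lambda u: u ** 0.5, 100000)
print(sum(xs) / len(xs))  # population mean is 2/3
```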
Examples.
1. Coin-tossing. We can generate coin-tosses by taking tails (T, or 0) if
U < 1/2, heads (H, or 1) if not.
2. Rolling dice. Similarly, we can generate die-rolls by taking the outcome
as 1 if 0 ≤ U < 1/6, 2 if 1/6 ≤ U < 1/3, etc.
3. Dealing bridge hands. There are 52 cards in a pack (4 suits, spades ♠,
hearts ♥, diamonds ♦, clubs ♣, 13 cards per suit – 2,..., 10, J, Q, K, A).
By labelling each card with a number from 1 to 52, we can as above ‘deal
each card uniformly’ (make it equally likely that each player N, S, E, W gets
it). [To make each player get 13 cards is more tricky, and involves ‘sampling
without replacement’, or we can proceed as follows. Proceed with prob. 1/4
each until one player has 13 cards, then with prob. 1/3 each till the next
has 13, then with prob. 1/2 till the next, then the remaining player gets the
rest.]
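The dealing procedure in brackets can be sketched directly: giving each card, with equal probability, to one of the players still short of 13 cards is the same as the stated 1/4, 1/3, 1/2 scheme.

```python
import random

random.seed(2024)  # fixed seed, so the deal is reproducible

def deal_bridge():
    """Deal cards labelled 1..52 to N, S, E, W: each card goes, with equal
    probability, to one of the players not yet holding 13 cards."""
    hands = {p: [] for p in "NSEW"}
    for card in range(1, 53):
        eligible = [p for p in "NSEW" if len(hands[p]) < 13]
        hands[random.choice(eligible)].append(card)
    return hands

hands = deal_bridge()
print({p: len(h) for p, h in hands.items()})  # each player gets 13 cards
```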
4. Tennis. In a tennis game, suppose the server wins each point with prob-
ability p. What is the probability that the server wins the game? the set?
the match?
Draw a picture of a game of tennis.
In the bottom row is the starting point, 0-0.
In the first row are the two possibilities after 1 point, 15-0 and 0-15.
In the second row are the three possibilities after 2 points, 30-0, 15-15, 0-30.
In the third row are the four possibilities after 3 points, 40-0, 30-15, 15-30,
0-40.
In the fourth row are two of the possibilities after 4 points in which the game
continues: 15-40 and 40-15.
In the top [fifth] row are the five remaining possibilities after 4 or 5 points:
Win (for server), 40-30, 30-30, 30-40, Lose.
If the game continues, we can combine ‘Advantage in’ with 40-30, ‘Deuce’
with 30-30, ‘Advantage out’ with 30-40 (you should check this!).
Note. (i) 40 is short here for 45. Formerly each player had a clock, and each
point was worth 1/4 of a revolution, i.e. 15 minutes. To win, a player
had to complete a revolution but be more than one point ahead.
With a flow-diagram to represent these 17 states, one can ‘play tennis’
by computer, with each point leading to the upper left and right neighbours
with probabilities p, 1 − p.
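One can indeed 'play tennis' by computer. The sketch below skips the explicit 17-state flow-diagram and simply scores points until the game is decided (at least four points, and two clear); the value p = 0.6 is an arbitrary illustrative choice.

```python
import random

random.seed(1)

def server_wins_game(p):
    """Play one game: server wins each point with probability p;
    the game ends when one side has >= 4 points and leads by 2."""
    s, r = 0, 0
    while True:
        if random.random() < p:
            s += 1
        else:
            r += 1
        if max(s, r) >= 4 and abs(s - r) >= 2:
            return s > r

p = 0.6
n = 100000
wins = sum(server_wins_game(p) for _ in range(n))
print(wins / n)  # for p = 0.6 the exact answer is about 0.736
```

Note how a modest edge per point (0.6) becomes a much bigger edge per game, and bigger still per set and match.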
(ii) The general context of such flow-diagrams is that of finite Markov chains.
See e.g.
John G. KEMENY and J. Laurie SNELL, Finite Markov chains, Van Nos-
trand, 1960 (tennis, 7.2 p. 161-7),
Olle HÄGGSTRÖM, Finite Markov chains and algorithmic applications,
CUP, 2002.
5. Binomial tree. In a discrete financial model, suppose that at each stage the
price of a risky asset can go up or down. One can model the price evolution
over a finite time-period (say, from the start of an option to its expiry) by
a binomial tree, with two paths leading from each node, one ‘up right’, one
‘down right’. This gives the Cox-Ross-Rubinstein binomial tree model (1979).
This leads to the discrete Black-Scholes model for option pricing. Lurking
in the background here is the binomial distribution, which in the limit gives
the normal distribution (Central Limit Theorem (CLT), or ‘Law of Errors’
– below). This leads from the discrete Black-Scholes formula (1979) to the
more familiar and famous (continuous) Black-Scholes formula of mathemat-
ical finance (1973).
Densities.
1. Exponential distribution, E(λ) (λ > 0 is the parameter): here
f(x) = λe^{−λx}, F(x) = 1 − e^{−λx} (x > 0),
so
F⁻¹(U) = −λ⁻¹ log(1 − U) ∼ E(λ).
So we use instead
−λ⁻¹ log U ∼ E(λ),
since 1 − U ∼ U[0, 1] also. Note that the two last formulae are equivalent
mathematically, but not computationally: it would be avoidably inefficient,
and so count as a programming error, not to use the second.
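In code (a sketch; λ = 2 is an arbitrary illustrative value):

```python
import math
import random

random.seed(7)

lam = 2.0
n = 100000
# -log(U)/lam ~ E(lam), which has mean 1/lam
xs = [-math.log(random.random()) / lam for _ in range(n)]
mean_x = sum(xs) / n
print(mean_x)  # should be near 1/lam = 0.5
```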
2. Gamma distribution, Γ(n, λ).
This has moment-generating function (MGF) (λ/(λ − t))ⁿ, and is the distri-
bution of the sum of n (independent) copies of E(λ).
3. Chi-square, χ²(n) = Γ(n/2, 1/2). If the Xᵢ are independent copies of the
standard normal (below),
X₁² + · · · + Xₙ² ∼ χ²(n).
4. Standard normal distribution, Φ = N (0, 1). Density φ and distribution Φ:
φ(x) = (2π)^{−1/2} e^{−x²/2},   Φ(x) = ∫_{−∞}^x φ(u) du = (2π)^{−1/2} ∫_{−∞}^x e^{−u²/2} du.
There is no closed form for Φ (except that Φ(0) = 1/2 by symmetry), so
none for Φ−1 . So how do we use PIT to simulate from Φ? One method is the
Box-Muller method of 1958. Recall that the element of area dA is dx dy in
plane cartesian coordinates and r drdθ in plane polar coordinates. If X, Y
are independent N (0, 1),
P(X² + Y² ≤ R²) = ∫∫_{x²+y²≤R²} (2π)⁻¹ e^{−(x²+y²)/2} dx dy
               = ∫₀^{2π} ∫₀^R (2π)⁻¹ e^{−r²/2} r dr dθ
               = ∫₀^R e^{−r²/2} d(r²/2)
               = 1 − e^{−R²/2}.
This says that r² := X² + Y² ∼ E(1/2), the exponential law with parameter
1/2 (mean 2). We know how to simulate from this, by above!
θ ∼ U [0, 2π] by symmetry: simulate by θ = 2πU . So take U1 , U2 (inde-
pendent) ∼ U [0, 1],
r² := −2 log U₁ (∼ E(1/2)),   r := √(−2 log U₁),   θ := 2πU₂.
Then X := r cos θ, Y := r sin θ are independent N (0, 1).
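The Box-Muller method in code (a sketch):

```python
import math
import random

random.seed(0)

def box_muller():
    """One Box-Muller step: two independent uniforms -> two independent N(0,1)."""
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2 * math.log(u1))
    theta = 2 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

# Generate 100,000 standard normals (50,000 pairs) and check mean and variance.
zs = [z for _ in range(50000) for z in box_muller()]
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs)
print(mean, var)  # should be near 0 and 1
```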
Rejection Method.
This is due to John von NEUMANN (1903-1957) in 1951. Suppose we
have a density f . Then the area under the curve is 1. The subgraph of f is
{(x, y) : 0 ≤ y ≤ f (x)}. So the area of the subgraph is 1. By definition of
density,
P (X ∈ [x, x + dx]) = f (x)dx = dA,
where A denotes area under the subgraph to the left of x. So (‘probability
= area’) X has density f iff X is the x-coordinate of a point uniformly
distributed over the subgraph of f . So we can go from uniform points (X, Y )
on the subgraph to points X with density f by projecting onto the first
coordinate; conversely, we can go from such an X to such an (X, Y) by
taking Y := U f(x) given X = x (where as usual U ∼ U[0, 1]).
Suppose we have a density g that we know how to simulate from, and a
density f that we don’t know how to simulate from, but
f(x) ≤ cg(x)
for some constant c ≥ 1. Write F for the subgraph of f and G for that of cg,
so F ⊂ G (writing |.| for area, |F| = 1 and |G| = c). Simulate points (X, Y)
uniformly on G (take X with density g, then Y := U cg(X) as above), and
reject any point falling above the graph of f, i.e. outside F. Now for B ⊂ F,
the distribution of the non-rejected points (i.e. of the points conditional on
their being in F) is given by
(|B ∩ F|/|G|) / (|F|/|G|) = |B ∩ F|/|F|.
This says that the non-rejected points are uniform over F , the subgraph of
f , i.e. that they have density f , as required. //
Note. The closer the graph of f is to that of cg, the fewer points are rejected,
and the greater the computational efficiency. For heavy computational use,
it is worth making an effort to achieve such a ‘good fit’, but for details we
must refer to a specialist book on simulation.
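A minimal sketch of the rejection method (the target f(x) = 3x² on [0, 1], uniform proposal g and bound c = 3 are my illustrative choices, not from the text):

```python
import random

random.seed(3)

def rejection_sample(f, g_sample, g_pdf, c, n):
    """von Neumann rejection: propose X ~ g, accept with probability
    f(X) / (c * g(X)); the accepted points have density f."""
    out = []
    while len(out) < n:
        x = g_sample()
        if random.random() <= f(x) / (c * g_pdf(x)):
            out.append(x)
    return out

# Target f(x) = 3x^2 on [0,1]; proposal g = U[0,1], so f <= 3g and c = 3.
xs = rejection_sample(lambda x: 3 * x * x, random.random, lambda x: 1.0, 3, 50000)
print(sum(xs) / len(xs))  # population mean of density 3x^2 on [0,1] is 3/4
```

Here a proportion 1/c = 1/3 of proposals are accepted, illustrating the efficiency remark above: the closer c is to 1, the less work is wasted.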
By the Strong Law of Large Numbers, if X₁, X₂, . . . are independent with
common distribution F and f is integrable,
(1/n) Σₖ f(Xₖ) → E f(X₁) = ∫ f dF   (n → ∞) a.s.
(a.s. = almost surely, or with probability one – see SP). The idea of the
Monte Carlo method is to simulate the Xₖ on the left, form the average on
the left numerically, and use it as an approximation to the expectation or
integral on the right. The method is widely used, and very powerful.
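A minimal Monte Carlo sketch (the integrand e^{−x²} on [0, 1] is my illustrative choice; here X ∼ U[0, 1], so the expectation is an ordinary integral):

```python
import math
import random

random.seed(11)

# Monte Carlo: (1/n) * sum of f(X_k) approximates E f(X) = integral of f.
# Illustration: estimate the integral of exp(-x^2) over [0, 1].
n = 200000
estimate = sum(math.exp(-random.random() ** 2) for _ in range(n)) / n
print(estimate)  # true value is (sqrt(pi)/2) * erf(1), about 0.7468
```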
The idea can be traced back to Buffon’s needle (G. L. Leclerc, Comte de
BUFFON (1707-1788) in 1777), but is due in its modern form to Stanislaw
ULAM (1909-1984) in 1946. It emerged in work by physicists at the Los
Alamos Laboratory (Manhattan Project, WWII, atom (fission) bomb, pre-
computer, then 1950s, hydrogen (fusion) bomb, with computers).
There is a whole area of Statistics called Markov Chain Monte Carlo or
MCMC, based on this idea (Professor Alastair Young of the Statistics Sec-
tion here is the local expert).
Borel’s theorem (1909): for almost all reals, each digit 0, 1, . . . , 9 occurs in
the decimal expansion with limiting frequency 1/10 – an instance of the Law
of Large Numbers mentioned earlier. This property is called (strong) nor-
mality; Borel’s result says that almost all reals are strongly normal. But
there is nothing special about the base 10 of decimals: we can use the base
2 of binary (as computers do), etc. Borel’s result says more: almost all reals
are strongly normal to all bases simultaneously.
Now that we know that almost all reals behave like this, it would be nice
to have a specific example – and π is the obvious candidate. The decimal
expansion of π has been subjected to every statistical test for randomness
known to Statistics – and passed them all with flying colours (hence the
Chudnovsky quotation above). This strongly suggests that π is indeed nor-
mal – but does not prove it. Indeed, there is no reason to suppose that we
will ever be able to prove this. From the point of view of Number Theory,
there is only one natural way to expand π, and that is as a continued fraction:
π/4 = 1/(1 + 1²/(2 + 3²/(2 + 5²/(2 + · · · ))))
(William, Lord BROUNCKER (1620-1684), in 1655 – related to Wallis’ prod-
uct for π, for which see e.g. my home-page, M2PM3, L32).
Of course we all know that π begins 3.14159..., so the string “14159” has
us all thinking “pi”. But if we start after, say, a million places, the expansion
would “look random”. One can even personalise this. My date of birth is
19.03.1945; the decimal expansion of π started after 19,031,945 places would
be perfectly deterministic and predictable to me, but would look perfectly
random to anyone who did not know this.
Such examples make one think about what randomness is. Consider, for
example, a thousand independent tosses of a fair coin (heads = 1, tails =
0). There are 2¹⁰⁰⁰ possible outcomes, each with equal probability 2⁻¹⁰⁰⁰ by
symmetry. Imagine two such outcomes, (i) one obtained by you, laboriously
tossing a coin a thousand times – or, simulating as above, (ii) all 1s. There is
one sense in which these are on the same footing (symmetry, above). There
is another in which they are obviously not: the first takes a thousand bits
of information to describe, the second takes two. Lurking in the background
here is a whole subject – Algorithmic Information Theory. This was created
by A. N. Kolmogorov (1903-1987) – the same Kolmogorov who gave us the
measure-theoretic probability we use all the time, though these are different
theories!
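One can make this description-length point semi-quantitative with a general-purpose compressor as a crude stand-in for algorithmic information (zlib is my choice here, and only a proxy: Kolmogorov complexity itself is not computable):

```python
import random
import zlib

random.seed(5)

# Two outcomes of 1000 coin tosses, equally probable by symmetry, but with
# very different description lengths; zlib compressed size is a crude proxy.
tossed = bytes(random.getrandbits(1) + 48 for _ in range(1000))  # '0'/'1' chars
all_ones = b"1" * 1000
print(len(zlib.compress(tossed)), len(zlib.compress(all_ones)))
```

The simulated tosses compress hardly at all (they carry about a thousand bits of information), while the all-ones string collapses to a handful of bytes.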
II. ESTIMATION (of parameters); LIKELIHOOD
Normal, N(µ, σ²): f(x|µ, σ) := (σ√(2π))⁻¹ exp{−(x − µ)²/(2σ²)}.
Here µ is the mean, σ² the variance (σ > 0 the standard deviation, or SD);
µ, σ are the parameters (the term is due to R. A. (Sir Ronald) FISHER (1890-
1962) in 1922).
Exponential E(λ) (λ > 0): f (x|λ) := λe−λx , x > 0.
Uniform, U[a, b] (a < b): f(x|a, b) := 1/(b − a) on [a, b], 0 elsewhere.
Poisson, P (λ) (λ > 0): f (k|λ) := e−λ λk /k! (k = 0, 1, . . .).
We write θ for a parameter (scalar or vector), and write such examples
as f (x|θ), which we will call the density (w.r.t. Lebesgue measure in the first
three examples, counting measure in the fourth – see SP Ch. I). Here x is
the argument of a function, the density function.
If we have n independent copies sampled from this density, the joint
density is the product of the marginal densities:
f(x₁|θ) · · · f(xₙ|θ).   (∗)
DATA.
Now suppose that the numerical values of the random variables in our
data set are x1 , . . . , xn . Fisher’s great idea of 1912 was to put the data xi
where the arguments xi were in (∗). He called this (later, 1921 on) the
likelihood, L – a function of the parameter θ:
L(θ) := f(x₁|θ) · · · f(xₙ|θ).
The data points will tend to be concentrated where the probability is con-
centrated. Fisher advocated choosing as our estimate of the (unknown, but
non-random) parameter θ, the value(s) θ̂ (or θ̂n ) for which the likelihood
L(θ) is maximised. This gives the maximum likelihood estimator (MLE); the
method is the Method of Maximum Likelihood. It is intuitive, simple to use
and very powerful – ‘everyone’s favourite method of estimating parameters’.
It is often more convenient to use the log-likelihood,
` := log L,
and maximise that instead (as log is increasing, maximising L and ` are the
same).
Examples.
1. Normal, N(µ, σ²).
L = σ⁻ⁿ(2π)^{−n/2} exp{−(1/2) Σᵢ (xᵢ − µ)²/σ²},
ℓ = const − n log σ − (1/2) Σᵢ (xᵢ − µ)²/σ².
∂ℓ/∂µ = 0 :  Σᵢ (xᵢ − µ) = 0,  µ = (1/n) Σᵢ xᵢ.
So the MLE of the (population) mean µ is the sample mean (average of the
data points):
µ̂ = x̄,   x̄ := (1/n) Σᵢ xᵢ.
This makes sense: one would hope and expect that the sample mean is
informative about the population mean. Indeed, by SLLN,
µ̂ = X̄ → E X  (n → ∞) a.s.
(we revert to capitals for random variables; we use lower case for data values,
= observed values of random variables).
∂ℓ/∂σ = 0 :  −n/σ + (1/σ³) Σᵢ (xᵢ − µ)² = 0,  σ² = (1/n) Σᵢ (xᵢ − µ)².
The RHS is called the sample variance, S². So the MLEs of the population
mean and variance µ, σ² are the sample mean and variance x̄, S².
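A quick numerical check by simulation (the true values µ = 2, σ = 1.5 are arbitrary illustrative choices):

```python
import random

random.seed(8)

mu, sigma = 2.0, 1.5  # 'unknown' true parameters, chosen for illustration
xs = [random.gauss(mu, sigma) for _ in range(100000)]
n = len(xs)
mu_hat = sum(xs) / n                              # sample mean = MLE of mu
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # sample variance = MLE of sigma^2
print(mu_hat, var_hat)  # should be near mu = 2 and sigma^2 = 2.25
```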
Note. 1. Many authors use 1/(n − 1) in place of 1/n in the definition of the
sample variance (this is needed to get the estimate unbiased). But for large
n, there is little difference.
2. We can extend the bar notation:
σ̂² = \overline{(X − X̄)²} = \overline{X²} − 2X̄·X̄ + X̄² = \overline{X²} − X̄².
Then by SLLN,
\overline{X²} → E(X²), X̄ → E X, so σ̂² → E(X²) − (E X)² = var X  (n → ∞) a.s.
Thus the bar notation is ideally suited to use of SLLN. We can show similarly
that the (suitably defined) sample covariance and correlation tend to the
corresponding population covariance and correlation, etc.
3. The above shows clearly that desirable properties of estimators (e.g. being
MLEs and being unbiased) may be incompatible.
2. Exponential. For E(λ), the mean EX = 1/λ.
L = λⁿ exp{−λ Σᵢ xᵢ} = λⁿ exp{−nλx̄},
ℓ = n log λ − nλx̄.
∂ℓ/∂λ = 0 :  n/λ − nx̄ = 0,  λ̂ = 1/x̄.
Again, this is natural, and what we would expect.
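Again a quick check by simulation (λ = 3 is an arbitrary illustrative value):

```python
import random

random.seed(9)

lam = 3.0  # 'unknown' true parameter, chosen for illustration
xs = [random.expovariate(lam) for _ in range(100000)]
xbar = sum(xs) / len(xs)
lam_hat = 1 / xbar  # MLE: since E X = 1/lambda, estimate lambda by 1/xbar
print(lam_hat)  # should be near lam = 3
```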
3. Uniform, U[a, b]. Here
L = (b − a)⁻ⁿ 1(a ≤ x₁, . . . , xₙ ≤ b) = (b − a)⁻ⁿ 1(a ≤ min, max ≤ b).
We are to maximise this wrt a, b. Don’t use calculus as above (we can’t
– the RHS is discontinuous, so not differentiable). Instead, we can do the
maximising of L on sight: the MLEs are
â = min, b̂ = max .
Note. We shall see later that this example is less well behaved, and different
from, the ones above.
Sufficiency.
Return to the normal example. Since Σᵢ (xᵢ − µ)² = Σᵢ (xᵢ − x̄)² + n(x̄ − µ)² =
n[S² + (x̄ − µ)²] (the cross-terms vanish),
L = σ⁻ⁿ(2π)^{−n/2} exp{−(n/2)[S² + (x̄ − µ)²]/σ²}.
This involves the data x1 , . . . , xn only through x̄ and S 2 (equivalently, only
through x̄ and x2 ). Suppose n = 1, 000, and we record these two statistics,
but lose the data. Does it matter? There are two plausible views:
(i) Yes. A thousand numbers are more informative than two.
(ii) No. As above, we expect x̄, S 2 to be informative about µ, σ 2 , and the
fact that the likelihood only involves these confirms this.
In fact the optimistic view (ii) is the correct one: we can reduce the data
set of 1,000 values down to just two, without loss of information. This is
the idea of sufficiency, due to Fisher in 1920. It can be formulated in var-
ious equivalent ways, one of which is that above: we say that a statistic T
is sufficient for a parameter θ if the likelihood factorises into factors, one
which involves the data only through T (and may involve the parameters),
the other free of the parameters.
Sufficiency (or data reduction) is such a good idea that it should always
be used, to reduce the data. We would naturally like to be able to reduce
the data as much as possible (without loss of information). This is the idea
of minimal sufficiency (Lehmann and Scheffé; see Statistical Methods for Fi-
nance – SMF – I.4).
Example: Uniform distribution.
The form of the likelihood above shows that (min, max) is sufficient for
(a, b). [It is also minimal sufficient, but this is clear: we couldn’t expect a fur-
ther reduction, to a one-dimensional statistic, to suffice for a two-dimensional
parameter.]
Limit Theorems
This qualifying phrase ‘under suitable regularity conditions’ is ubiquitous
in large-sample theory in Statistics. What is needed is that the model be
smooth enough to justify differentiating under the integral sign wrt θ twice
(this is what is needed to get the Cramér-Rao lower bound). Some such
condition is needed, as the following example shows.
Note the contrast to the above! There, the limit is normal, and the rate of
convergence is √n. Here, the limit is exponential E(1), and the rate of
convergence is n, much faster. It looks as if this faster rate of convergence
is an advantage. So it is – but we pay a heavy price for it. We have no
protection against contamination of the data. If one of the data points is
way too small (say), it will permanently affect the sample minimum, and so
the estimate of a. By contrast, in the examples above, the influence of one
bad data point will be damped out as the sample size increases.
The branch of Statistics concerned with such protection against data
contamination is called Robust Statistics. The phenomenon noted above
(‘too rapid convergence’) is called super-efficiency – and indicates extreme
non-robustness. When the ‘suitable regularity conditions’ hold, we have a
regular maximum-likelihood estimation problem. The example above of the
uniform distribution U [a, b] is non-regular. The non-regularity results from
the support of the distribution (the region where it is positive) depending on
the parameters (the support is [a, b]).
Uniform distribution (continued).
Suppose now that the length b−a of the interval in U [a, b] is known. Then
we can w.l.o.g. take it as 1. This is now a one-parameter problem rather
than a two-parameter one; the natural choice of parameter is the mid-point,
θ, giving U[θ − 1/2, θ + 1/2]. The likelihood now is
L(θ) = 1(θ − 1/2 ≤ x₁, . . . , xₙ ≤ θ + 1/2) = 1(θ − 1/2 ≤ min, max ≤ θ + 1/2).
The maximum value is 1, but this is now attained on an entire interval,
[max −1/2, min +1/2]. So we have infinitely many MLEs! Of course, we prefer the
symmetrical choice, of the sample mid-range. This is the average of max and
min, while the sample range is their difference:
mid := (max + min)/2,   ran := max − min.
The argument above shows that the MLE mid is super-efficient for θ, with
convergence rate n as for U [a, b], but now the limit law is the symmetric
exponential, SE(1), with density
f(x) = (1/2) e^{−|x|}:
n(mid − θ) → SE(1)  (n → ∞) in distribution.
The argument above also shows that {min, max} is (minimal) sufficient for θ,
but mid is not! So now we have (i) a non-unique MLE; (ii) a two-dimensional
minimal sufficient statistic {min, max} – equivalently, {mid, ran} – for a one-
dimensional parameter θ.
Note. This shows that ran is relevant. But this is strange: the parameter θ
is a location parameter, while ran = max − min is translation-invariant, and
so tells us nothing about the location of the interval, i.e. about θ. True –
but what ran does tell us about is the accuracy of mid as an estimator for
θ. If we are lucky and get a ‘good’ sample, max and min will be nearly as
far apart as they can be (1), and their average mid will be close to θ. But if
we are unlucky and have a ‘bad’ sample, they may be close, and then their
average mid may be nearly as far away from θ as it can be (1/2). So ran is
uninformative about θ on its own, but {mid, ran} is more informative than
mid alone. We say that ran is ancillary for θ (Fisher, 1934).
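The rate-n convergence of mid can be seen by simulation (a sketch; θ = 5 is an arbitrary illustrative choice). The mean absolute error falls roughly like 1/n, against 1/√n for a regular estimator:

```python
import random

random.seed(10)

theta = 5.0  # 'unknown' mid-point, chosen for illustration

def mid_range(n):
    """Sample mid-range of n draws from U[theta - 1/2, theta + 1/2]."""
    xs = [random.uniform(theta - 0.5, theta + 0.5) for _ in range(n)]
    return (min(xs) + max(xs)) / 2

avg_err = {}
for n in (100, 1000, 10000):
    errs = [abs(mid_range(n) - theta) for _ in range(300)]
    avg_err[n] = sum(errs) / len(errs)
print(avg_err)  # mean error shrinks by about a factor 10 per tenfold n
```

Replacing one draw by a wild outlier would, as the text warns, wreck mid while barely moving the sample mean: super-efficiency bought at the price of non-robustness.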
Higher dimensions
The above goes through with minimal changes in higher dimensions.
Normal distribution, N (µ, Σ). In d dimensions, the mean µ is a d-vector and
the covariance matrix Σ = (σij ) is a positive definite symmetric d × d matrix.
The sample mean x̄ (a d-vector) and sample covariance matrix S (a d × d
matrix) can be defined. They are the MLEs for µ, Σ, and by the LLN
x̄ → µ, S → Σ.
Also (x̄, S) are (minimal) sufficient for (µ, Σ). See e.g.
[BF] N. H. BINGHAM and John M. FRY, Regression: Linear Models in
Statistics, SUMS (Springer Undergraduate Mathematics Series), 2010.
The information (per reading) I(θ) becomes the information matrix. With
these changes, the above theory goes through much as before.
Method of Moments
This is due to Karl PEARSON (1857-1936) in the 1880s. It is generally
inferior to the Method of Maximum Likelihood, but can be useful. It con-
sists of estimating parameters by matching sample moments to population
moments.
Example: The Binomial distribution, B(k, p). Here k = 1, 2, . . ., p ∈ (0, 1),
P(X = i) = C(k, i) pⁱ (1 − p)^{k−i}   (i = 0, 1, . . . , k),
where C(k, i) = k!/(i!(k − i)!) is the binomial coefficient.
This counts the number of successes in k Bernoulli trials (or heads in k tosses
of a biased coin).
Now assume that both p and k are unknown. Sample x₁, . . . , xₙ from
B(k, p). The first two sample moments are x̄ and \overline{x²} = S² + x̄². The mean and
variance of the binomial are µ = kp, σ² = kp(1 − p) = kp − kp². Matching
moments gives two equations for the two unknown parameters:
x̄ = kp,   \overline{x²} = kp(1 − p) + k²p².
So
S² = kp − kp²,   x̄ = kp,
or x̄ − S² = kp², (x̄)² = k²p², giving the estimates
k̃ = (x̄)²/(x̄ − S²),   p̃ = x̄/k̃.
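A numerical sketch of the moment estimates (the true values k = 20, p = 0.3 are arbitrary illustrative choices):

```python
import random

random.seed(12)

k_true, p_true = 20, 0.3  # 'unknown' true values, chosen for illustration
# Each observation counts successes in k_true Bernoulli(p_true) trials.
xs = [sum(random.random() < p_true for _ in range(k_true)) for _ in range(100000)]
n = len(xs)
xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / n  # sample variance (1/n form)
k_tilde = xbar ** 2 / (xbar - s2)          # moment estimate of k
p_tilde = xbar / k_tilde                   # moment estimate of p
print(k_tilde, p_tilde)  # should be near k = 20, p = 0.3
```

Note that k̃ divides by x̄ − S², which is small (here kp²); with few observations the estimate of k can be very unstable, one reason the method is generally inferior to maximum likelihood.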
Application: under-reported crime (example: rape); k is the number of
crimes, p is the probability that a crime is reported. So: Statistics can
help police with crime that hasn’t even been reported to them! NHB