
Statistical foundations of machine learning
INFO-F-422
Gianluca Bontempi
Machine Learning Group
Computer Science Department
mlg.ulb.ac.be

Parametric estimation

Given a r.v. z, suppose that
1. we do not know Fz(z) but we can write Fz(z) = Fz(z, θ), where θ ∈ Θ is a constant (or time-invariant) parameter,
2. we have a sample DN of N i.i.d. measurements of z.

Estimation: find a value θ̂ of the parameter θ so that the parametrized distribution Fz(z, θ̂) closely matches the distribution Fz(z).

I.I.D. samples

- I.I.D.: Identically and Independently Distributed.
- Identically distributed: all observations are sampled from the same distribution

  Prob{zi = z} = Prob{zj = z}   ∀ i, j = 1, . . . , N and z ∈ Z

  N.B.: the data are random but the underlying distribution is fixed: this is the regularity we want to learn!
- Independently distributed: the fact that we observed a certain zi does not influence the probability of observing the value zj

  Prob{zj = z | zi = zi} = Prob{zj = z}

Some estimation problems

1. Let DN = {20, 31, 14, 11, 19, . . . } be the times in minutes spent during the last 2 weeks to go home. How long does it take on average to reach my house from ULB?
2. Suppose that the inter-arrival times of cars in a street are DN = {10, 11, 1, 21, 2, . . . } seconds. What does this imply about the mean inter-arrival time?
3. Consider the students of the last year of Computer Science. What is the variance of their grades?
Parametric estimation (II)

Parametric estimation is a mapping from the space of the sample data to the space of parameters Θ.

Two outcomes:
1. point estimation: a specific value of Θ
2. interval of confidence: a region of Θ

Point estimation

- z with a parametric distribution Fz(z, θ), θ ∈ Θ.
- parameter θ as a function(al) of F:

  θ = t(F)

- N i.i.d. observations DN = {z1, z2, . . . , zN}.
- Point estimate: a function

  θ̂ = h(DN)

  of the dataset DN.

Methods of constructing estimators

How to define h? We will focus on:
1. the plug-in principle
2. maximum likelihood

Empirical distribution function

Given an i.i.d. sample

  Fz → {z1, z2, . . . , zN}

where Fz(z) = Prob{z ≤ z}, the empirical distribution function is

  F̂z(z) = N(z)/N = #{zi ≤ z}/N

where N(z) is the number of samples in DN that do not exceed z.
TP R: empirical distribution

- Dataset of N = 14 observed ages

  DN = {20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29}

- The empirical distribution function F̂z (Estimation/cumdis.R) is a staircase function with discontinuities at the points zi.
- [Figure: "Empirical Distribution function", Fn(x) rising in steps from 0 to 1 over x = 20, . . . , 30.]

Plug-in principle to define an estimator

- Sample DN from Fz(z, θ) where θ = t(F(z)).
- Plug-in estimate of θ:

  θ̂ = t(F̂(z))

  where the distribution function F is replaced by the empirical distribution F̂.
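As a concrete illustration of the two slides above, here is a minimal R sketch (not the course script Estimation/cumdis.R, which is not reproduced here) using base R's ecdf(), which implements exactly the plug-in F̂z:

D <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)  # the N = 14 ages
F.hat <- ecdf(D)           # empirical distribution function F.hat(z) = #{zi <= z}/N
F.hat(24)                  # plug-in estimate of Prob{z <= 24}
plot(F.hat, main = "Empirical Distribution function")  # staircase with jumps at the zi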

Sample average

- Consider the expectation of z ∼ Fz(·)

  θ = E[z] = ∫ z dF(z)

  with θ unknown.
- Sample Fz → DN.
- The plug-in point estimate of θ is the sample average

  θ̂ = ∫ z dF̂(z) = (1/N) Σ_{i=1}^N zi = µ̂

  which is a function h of the dataset DN.

Sample variance

- z ∼ Fz(·) whose mean µ and variance σ² are unknown.
- Sample Fz → DN.
- The plug-in estimate of σ² is the sample variance

  σ̂² = 1/(N − 1) Σ_{i=1}^N (zi − µ̂)²

  where µ̂ is the plug-in estimate of µ.
- Why N − 1 instead of N at the denominator?
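A brief illustration (not part of the original slides) of the two plug-in estimates on the age data from the previous example; note that base R's var() already uses the N − 1 denominator:

D <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)
N <- length(D)
mu.hat <- mean(D)                                # sample average
sigma2.hat <- sum((D - mu.hat)^2) / (N - 1)      # sample variance, N - 1 denominator
c(mu.hat, sigma2.hat, var(D))                    # var() matches the N - 1 version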
Other plug-in estimators

- Skewness estimator:

  γ̂ = [ (1/N) Σ_{i=1}^N (zi − µ̂)³ ] / σ̂³

- Upper critical point estimator:

  ẑα = sup{z : F̂(z) ≤ 1 − α}

- Sample correlation:

  ρ̂(x, y) = Σ_{i=1}^N (xi − µ̂x)(yi − µ̂y) / [ √(Σ_{i=1}^N (xi − µ̂x)²) √(Σ_{i=1}^N (yi − µ̂y)²) ]

(A short R illustration of these estimators follows the next slide.)

Sampling distribution

- The point estimate is

  θ̂ = h(DN)

  where DN is the realisation of a random variable DN.
- If DN is random, the point estimator

  θ̂ = h(DN)

  is random as well, and its probability distribution is the sampling distribution.
- The sampling distribution is a theoretical notion (it cannot be directly observed with a single dataset).
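Returning to the plug-in estimators of the previous slide, a short hypothetical R sketch (the data and the choice α = 0.05 are illustrative assumptions, not from the slides); quantile(..., type = 1) inverts the empirical distribution function:

set.seed(0)
x <- rnorm(200); y <- 0.5 * x + rnorm(200)        # illustrative paired sample
mu.hat <- mean(x)
sigma.hat <- sqrt(mean((x - mu.hat)^2))           # plug-in standard deviation
gamma.hat <- mean((x - mu.hat)^3) / sigma.hat^3   # skewness estimator
z.alpha <- quantile(x, probs = 0.95, type = 1)    # upper critical point, alpha = 0.05
rho.hat <- cor(x, y)                              # sample correlation
c(gamma.hat, z.alpha, rho.hat)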

Sampling or finite-sample distribution

[Figure: the sampling distribution p(θ̂N). Several datasets DN(1), DN(2), DN(3), . . . are drawn from the unknown r.v. distribution; each yields an estimate θ̂N(1), θ̂N(2), θ̂N(3), . . . whose distribution is the sampling distribution.]

Monte Carlo illustration of a sampling distribution

To illustrate the theoretical notion of sampling distribution, we need to generate random samples from F(z, θ).

1: S = {}
2: for r = 1 to R do
3:   Fz → DN = {z1, z2, . . . , zN}   // sample dataset
4:   θ̂ = h(DN)                        // compute estimate
5:   S = S ∪ {θ̂}
6: end for
7: Plot histogram of S
8: Compute statistics of S (mean, variance)
9: Study distribution of S with respect to θ

See the R scripts Estimation/sam_dis.R and Estimation/est_step.R.
R script

mu <- 0               # parameter
R <- 10000            # number of trials
N <- 20               # size of each dataset
mu.hat <- numeric(R)

for (r in 1:R) {
  D <- rnorm(N, mean = mu, sd = 10)   # random generator
  mu.hat[r] <- mean(D)                # estimator
}

hist(mu.hat)          # histogram of mu.hat

Histogram

[Figure: "Mean estimator on samples of size N = 20: var ≈ 5.078", histogram of mu.hat (Density on the vertical axis), roughly bell-shaped and centred at 0.]

Suppose θ = 0. What could you say about this estimator? And if θ = 1?
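A quick check (not in the slides) that the empirical variance reported in the histogram title agrees with the theory derived a few slides later (Var[µ̂] = σ²/N): continuing the script above, with sd = 10 and N = 20 we expect a value close to 100/20 = 5.

var(mu.hat)      # empirical variance of the R = 10000 estimates (about 5.08 in the run shown)
10^2 / N         # theoretical value sigma^2 / N = 5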

Bias and variance

How accurate is θ̂?

Definition
An estimator θ̂ of θ is said to be unbiased if and only if

  E_DN[θ̂] = θ

Otherwise, it is biased with bias

  Bias[θ̂] = E_DN[θ̂] − θ

Definition
The variance of θ̂ is the variance of its sampling distribution

  Var[θ̂] = E_DN[(θ̂ − E[θ̂])²]

Estimation and darts

[Figure-only slide.]
Bias/variance issue

[Figure-only slide.]

Some considerations

- An unbiased estimator takes on average the right value.
- Many unbiased estimators may exist for θ.
- If θ̂ is an unbiased estimator of θ, it may happen that f(θ̂) is a BIASED estimator of f(θ).
- A biased estimator with a known bias (not depending on θ) can easily be made unbiased (e.g. by compensating for the bias).
- The sample average µ̂ and the sample variance σ̂² are unbiased estimators of the mean E[z] and the variance Var[z], respectively.
- In general σ̂ is not an unbiased estimator of σ, even if σ̂² is an unbiased estimator of σ².

Useful relationships

- E[ax + by] = aE[x] + bE[y]

- Var[ax + by] = a² Var[x] + b² Var[y] + 2ab (E[xy] − E[x]E[y])
               = a² Var[x] + b² Var[y] + 2ab Cov[x, y]

  where

  Cov[x, y] = E[(x − E[x])(y − E[y])] = E[xy] − E[x]E[y]

  is the covariance.

Bias and variance of µ̂

- Let µ and σ² be the mean and variance of Fz(·).
- i.i.d. sample DN ← Fz.
- Then

  E_DN[µ̂] = E_DN[(1/N) Σ_{i=1}^N zi] = (Σ_{i=1}^N E[zi]) / N = Nµ/N = µ

- The sample average estimator is unbiased, whatever the distribution Fz(·).
- Since Cov[zi, zj] = 0 for i ≠ j, the variance of the sample average estimator is

  Var[µ̂] = Var[(1/N) Σ_{i=1}^N zi] = (1/N²) Var[Σ_{i=1}^N zi] = (1/N²) N σ² = σ²/N
Bias of σ̂²

What is the bias of the estimator of the variance?

Given an i.i.d. DN ← Fz it can be shown that

  E_DN[σ̂²] = E_DN[ 1/(N − 1) Σ_{i=1}^N (zi − µ̂)² ] = σ²

- The sample variance (with N − 1 at the denominator) is unbiased!
- Question: is µ̂² an unbiased estimator of µ²? Try to answer first in analytical terms, then use a Monte Carlo simulation to validate the answer (a sketch of such a simulation follows the next slide).

Considerations

- The previous results are independent of the form F(·) of the distribution.
- The variance of µ̂ is 1/N times the variance of z. This is the rationale for collecting several samples: the larger N, the smaller Var[µ̂], so a bigger N means a better estimate of µ.
- According to the central limit theorem, under quite general conditions on the distribution Fz, the distribution of µ̂ will be approximately normal as N gets large:

  µ̂ ∼ N(µ, σ²/N)   for N → ∞

- The standard error √Var[µ̂] indicates statistical accuracy. Roughly speaking, we expect µ̂ to be less than one standard error away from µ about 68% of the time, and less than two standard errors away from µ about 95% of the time.
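For the question about µ̂² above, one possible Monte Carlo sketch along the lines suggested in the slide (the values of µ, σ and N are arbitrary choices for illustration):

set.seed(1)
R <- 10000; N <- 10; mu <- 2; sigma <- 5                # illustrative values
mu2.hat <- replicate(R, mean(rnorm(N, mu, sigma))^2)    # estimator mu.hat^2 on each dataset
mean(mu2.hat)        # compare with the target mu^2 = 4
mu^2 + sigma^2 / N   # E[mu.hat^2] = mu^2 + Var[mu.hat], close to the simulated average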

Exercise

Let z be such that E[z] = µ and Var[z] = σ². Suppose we want to estimate from an i.i.d. dataset DN the parameter θ = µ².
Let us consider three estimators:

1. θ̂1 = ( Σ_{i=1}^N zi / N )²
2. θ̂2 = Σ_{i=1}^N zi² / N
3. θ̂3 = ( Σ_{i=1}^N zi )² / N

Are they unbiased? Compute analytically the bias and verify the result by Monte Carlo simulation for different values of N.
Hint: σ² = E[z²] − µ²
Solution in the file gbcode/exercises/Exercise1.pdf in the R package.

Bias/variance decomposition of MSE

Mean-square error (MSE):

  MSE = E_DN[(θ − θ̂)²]

- The MSE of an unbiased estimator is its variance.
- For a generic estimator it can be shown that

  MSE = (E_DN[θ̂] − θ)² + Var[θ̂] = Bias[θ̂]² + Var[θ̂]

  i.e., the mean-square error is equal to the sum of the variance and the squared bias (bias-variance decomposition).
- See the R script Estimation/mse_bv.R (a sketch in the same spirit follows).
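The course script Estimation/mse_bv.R is not reproduced here; a minimal sketch in the same spirit, which checks the decomposition numerically for the N-denominator variance estimator (all numerical values are illustrative assumptions):

set.seed(2)
R <- 10000; N <- 10; mu <- 0; sigma <- 10
theta <- sigma^2                                   # parameter to estimate
theta.hat <- replicate(R, {
  D <- rnorm(N, mean = mu, sd = sigma)
  mean((D - mean(D))^2)                            # plug-in variance, N denominator
})
bias2 <- (mean(theta.hat) - theta)^2               # squared bias
vari <- var(theta.hat)                             # variance of the estimator
mse <- mean((theta.hat - theta)^2)                 # mean-square error
c(mse, bias2 + vari)                               # the two values nearly coincide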
Bias/variance decomposition of MSE (II)

  MSE = E_DN[(θ − θ̂)²]
      = E_DN[(θ − E_DN[θ̂] + E_DN[θ̂] − θ̂)²]
      = E_DN[(θ − E_DN[θ̂])²] + E_DN[(E_DN[θ̂] − θ̂)²] + E_DN[2(θ − E_DN[θ̂])(E_DN[θ̂] − θ̂)]
      = E_DN[(θ − E_DN[θ̂])²] + E_DN[(E_DN[θ̂] − θ̂)²] + 2(θ − E_DN[θ̂])(E_DN[θ̂] − E_DN[θ̂])
      = (E_DN[θ̂] − θ)² + Var[θ̂]
      = Bias[θ̂]² + Var[θ̂]

The cross term vanishes because E_DN[θ̂] − E_DN[θ̂] = 0.

Question

- Suppose z1, . . . , zN is an i.i.d. sample of observations from a distribution with mean µ and variance σ².
- Study the unbiasedness of the three estimators of the mean µ:

  θ̂1 = µ̂ = Σ_{i=1}^N zi / N

  θ̂2 = N θ̂1 / (N + 1)

  θ̂3 = z1

Efficiency

Suppose we have two unbiased estimators. How to choose between them?

Definition (Relative efficiency)
Let us consider two unbiased estimators θ̂1 and θ̂2. If

  Var[θ̂1] < Var[θ̂2]

we say that θ̂1 is more efficient than θ̂2.

If the estimators are biased, the comparison is typically done on the basis of the mean square error.

Sampling distributions for Gaussian r.v.

Let z1, . . . , zN be i.i.d. N(µ, σ²) and let us consider the following sample statistics

  µ̂ = (1/N) Σ_{i=1}^N zi,   SS = Σ_{i=1}^N (zi − µ̂)²,   σ̂² = SS/(N − 1)

It can be shown that the following relations hold:
1. µ̂ ∼ N(µ, σ²/N) and (µ̂ − µ)/(σ/√N) ∼ N(0, 1)
2. √N (µ̂ − µ)/σ̂ ∼ T_{N−1}, i.e. (µ̂ − µ)/(σ̂/√N) ∼ T_{N−1}, where T_{N−1} denotes the Student distribution with N − 1 degrees of freedom.
3. If E[|z − µ|⁴] = µ4 then Var[σ̂²] = (1/N) ( µ4 − (N − 3)/(N − 1) σ⁴ ).
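A small simulation (not from the slides; parameter values are arbitrary) illustrating relation 2: the studentized sample average follows a Student distribution with N − 1 degrees of freedom.

set.seed(3)
R <- 10000; N <- 8; mu <- 1; sigma <- 2
t.stat <- replicate(R, {
  D <- rnorm(N, mu, sigma)
  (mean(D) - mu) / (sd(D) / sqrt(N))               # (mu.hat - mu) / (sigma.hat / sqrt(N))
})
qqplot(qt(ppoints(R), df = N - 1), t.stat,
       xlab = "T(N-1) quantiles", ylab = "simulated statistic")
abline(0, 1)                                       # points should lie near this line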
Likelihood

Let us consider
1. a density distribution pz(z, θ) which depends on a parameter θ
2. an i.i.d. DN = {z1, z2, . . . , zN} from such a distribution.

The joint probability density of the sample data is

  p_DN(DN, θ) = Π_{i=1}^N pz(zi, θ) = LN(θ)

where, for a fixed DN, LN(·) is a function of θ and is called the empirical likelihood of θ given DN.

Maximum likelihood

- Idea: given an unknown parameter θ and sample data DN, the maximum likelihood estimate θ̂ is the value for which the likelihood LN(θ) attains its maximum

  θ̂ml = arg max_{θ∈Θ} LN(θ)

- The r.v. θ̂ml is the maximum likelihood estimator (m.l.e.).
- It is usual to consider the log-likelihood lN(θ) since, log(·) being a monotone function, we have

  θ̂ml = arg max_{θ∈Θ} LN(θ) = arg max_{θ∈Θ} log(LN(θ)) = arg max_{θ∈Θ} lN(θ)

Example: maximum likelihood

- Let us observe N = 10 realizations of a continuous variable z:

  DN = {z1, . . . , z10} = {1.263, . . . , 2.405}

- Suppose that the probabilistic model underlying the data is Gaussian with an unknown mean µ and a known variance σ² = 1.
- The likelihood LN(µ) is a function of (only) the unknown parameter µ.
- By applying the maximum likelihood technique we have

  µ̂ = arg max_µ LN(µ) = arg max_µ Π_{i=1}^N (1/(√(2π) σ)) e^{−(zi − µ)²/(2σ²)}

- Note that in this case µ̂ = Σ_{i=1}^N zi / N.

Example: maximum likelihood (II)

By plotting LN(µ) for µ ∈ [−2, 2] we have

[Figure: the likelihood L as a function of mu, a bell-shaped curve peaking near 0.358.]

Then the most likely value of µ according to the data is µ̂ ≈ 0.358.
R script Estimation/ml_norm.R
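A hypothetical reconstruction of what Estimation/ml_norm.R might do (the slide's ten observations are not listed in full, so the data are simulated here): evaluate LN(µ) on a grid and locate its maximum.

set.seed(4)
DN <- rnorm(10, mean = 0.5, sd = 1)                # simulated stand-in for the slide's data
mu.grid <- seq(-2, 2, by = 0.01)
L <- sapply(mu.grid, function(m) prod(dnorm(DN, mean = m, sd = 1)))   # L_N(mu), sigma^2 = 1
plot(mu.grid, L, type = "l", xlab = "mu", ylab = "L")
mu.grid[which.max(L)]                              # grid maximizer of the likelihood
mean(DN)                                           # analytical m.l.e.: the sample average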
Some considerations

- The likelihood measures the relative ability of the various parameter values to explain the observed data.
- Rationale: the value of the parameter under which the observed data have the highest probability of arising is the best estimator of θ.
- The likelihood function is a function of the parameter θ.
- The likelihood function is NOT the probability function of θ (θ is a constant).
- LN(θ) is rather the conditional probability of observing the dataset DN for a given θ.
- Likelihood is the probability of the data given the parameter, not the probability of the parameter given the data.

Example: log-likelihood

Consider the previous example. The behaviour of the log-likelihood for this model is shown below.

[Figure: log(L) as a function of mu over [−2, 2], a concave curve with its maximum at the same µ̂.]

M.l. estimation

- If the parametric distribution is given, the analytical form of the log-likelihood lN(θ) is known.
- In many cases the function lN(θ) is well behaved, being continuous with a single maximum away from the extremes of the range of variation of θ.
- Then θ̂ is obtained simply as the solution of

  ∂lN(θ)/∂θ = 0

  subject to

  ∂²lN(θ)/∂θ² |_{θ̂ml} < 0

  to ensure that the identified stationary point is a maximum.

Gaussian case: ML estimators

- Let DN be a random sample from the r.v. z ∼ N(µ, σ²).
- The likelihood of the N samples is given by

  LN(µ, σ²) = Π_{i=1}^N pz(zi, µ, σ²) = Π_{i=1}^N (1/(√(2π) σ)) exp( −(zi − µ)²/(2σ²) )

- The log-likelihood is

  lN(µ, σ²) = log LN(µ, σ²) = log [ Π_{i=1}^N pz(zi, µ, σ²) ] = Σ_{i=1}^N log pz(zi, µ, σ²)
            = − Σ_{i=1}^N (zi − µ)²/(2σ²) + N log( 1/(√(2π) σ) )

- Note that, for a given σ, maximizing the log-likelihood is equivalent to minimizing the sum of squares of the differences between the zi and the mean.
Gaussian case: ML estimators (II)

- Taking the derivatives with respect to µ and σ² and setting them equal to zero, we obtain

  µ̂ml = Σ_{i=1}^N zi / N = µ̂

  σ̂²ml = Σ_{i=1}^N (zi − µ̂ml)² / N ≠ σ̂²

- The m.l. estimator of the mean coincides with the sample average.
- The m.l. estimator of the variance differs from the sample variance by its denominator (N instead of N − 1).

Questions

1. Let z ∼ U(0, M) and Fz → DN = {z1, . . . , zN}. Find the maximum likelihood estimator of M.
2. Let z have a Poisson distribution, i.e.

   pz(z, λ) = e^{−λ} λ^z / z!

   If Fz(z, λ) → DN = {z1, . . . , zN}, find the m.l.e. of λ.
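A quick simulation (values chosen for illustration, not from the slides) contrasting the two variance estimators; on average the m.l. version underestimates σ² by the factor (N − 1)/N:

set.seed(5)
R <- 10000; N <- 5; sigma <- 1
est <- replicate(R, {
  D <- rnorm(N, 0, sigma)
  c(ml = mean((D - mean(D))^2),                    # m.l. estimator, N denominator
    unbiased = var(D))                             # sample variance, N - 1 denominator
})
rowMeans(est)        # roughly (N-1)/N = 0.8 and 1 respectively, for sigma^2 = 1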

M.l. estimation with numerical methods

Computational difficulties may arise if
1. No analytical solution in explicit form exists for ∂lN(θ)/∂θ = 0. Iterative numerical methods must then be used (see a numerical analysis class). This is particularly serious for a vector of parameters θ or when lN has several relative maxima.
2. lN(θ) is discontinuous, has a discontinuous first derivative, or has a maximum at an extremal point.

R example: Numerical optimization

- Suppose we know the analytical form of a one-dimensional function f(x): I → R.
- We want to find the value of x ∈ I that minimizes the function.
- If no analytical solution is available, numerical optimization methods can be applied (see the course "Calcul numérique").
- In the R language these methods are already implemented.
- Let f(x) = (x − 1/3)² and I = [0, 1]. The minimum is found by

f <- function(x, a) (x - a)^2
xmin <- optimize(f, c(0, 1), tol = 0.0001, a = 1/3)
xmin
R example: Numerical max. likelihood

- Let DN be a random sample from the r.v. z ∼ N(µ, σ²).
- The minus log-likelihood function of the N samples can be written in R as

eml <- function(m, D, var) {
  N <- length(D)
  Lik <- 1
  for (i in 1:N)
    Lik <- Lik * dnorm(D[i], m, sqrt(var))   # product of the N Gaussian densities
  -log(Lik)                                  # minus log-likelihood
}

- The numerical minimization of −lN(µ, s²) for a given σ = s over the interval I = [−10, 10] can then be written as

xmin <- optimize(eml, c(-10, 10), D = DN, var = s^2)   # var is the variance, hence s^2

- Script Estimation/emp_ml.R.

Properties of m.l. estimators

Under the (strong) assumption that the probabilistic model structure is known, the maximum likelihood technique has the following properties:
- θ̂ml is asymptotically unbiased but usually biased in small samples (e.g. σ̂²ml).
- The Cramer-Rao theorem establishes a lower bound on the variance of an estimator.
- θ̂ml is asymptotically normally distributed around θ.
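The eml function above multiplies N densities before taking the logarithm, which can underflow for larger N. A numerically safer alternative sketch (not the course script) sums log-densities directly via dnorm(..., log = TRUE):

eml2 <- function(m, D, var) {
  -sum(dnorm(D, mean = m, sd = sqrt(var), log = TRUE))   # minus log-likelihood as a sum
}
# usage, assuming DN and a standard deviation s as on the slide:
# xmin <- optimize(eml2, c(-10, 10), D = DN, var = s^2)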

Interval estimation

- Unlike point estimation, interval estimation maps DN to an interval of Θ.
- An interval estimator is a transformation which, given a dataset DN, returns an interval estimate of θ.
- While an estimator is a random variable, an interval estimator is a random interval.
- Let θ and θ̄ be the lower and the upper bound respectively.
- While an interval either contains a certain value or not, a random interval has a certain probability of containing a value.

Horseshoe game

[Figure-only slide.]
From sampling distribution to confidence interval

Let us suppose we have an estimator θ̂ of the parameter θ which is
- unbiased,
- with variance σ²θ̂,
- with a Normal sampling distribution θ̂ ∼ N(θ, σ²θ̂).

We can write

  Prob{ θ − 1.96 σθ̂ ≤ θ̂ ≤ θ + 1.96 σθ̂ } = 0.95

which entails

  Prob{ θ̂ − 1.96 σθ̂ ≤ θ ≤ θ̂ + 1.96 σθ̂ } = 0.95

Interval estimation (II)

- Suppose that our interval estimator satisfies

  Prob{ θ ≤ θ ≤ θ̄ } = 1 − α,   α ∈ [0, 1]

  then the random interval [θ, θ̄] is called a 100(1 − α)% confidence interval of θ.
- Notice that θ is a fixed unknown value and that at each realization DN the interval either does or does not contain the true θ.
- If we repeat the procedure of sampling DN and constructing the confidence interval many times, then our confidence interval will contain the true θ at least 100(1 − α)% of the time (i.e. 95% of the time if α = 0.05).
- While an estimator is characterized by bias and variance, an interval estimator is characterized by its endpoints and confidence.
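A short coverage simulation (illustrative values, not from the slides) for the 95% interval µ̂ ± 1.96 σ/√N when σ is known:

set.seed(6)
R <- 10000; N <- 20; mu <- 0; sigma <- 10
covered <- replicate(R, {
  D <- rnorm(N, mu, sigma)
  half <- 1.96 * sigma / sqrt(N)                   # half-width of the interval
  (mean(D) - half <= mu) && (mu <= mean(D) + half)
})
mean(covered)        # fraction of intervals containing the true mu, close to 0.95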
