18.6501x Fundamentals of Statistics
Probability  Previous studies showed that the drug was 80% effective. Then we can anticipate that, for a study on 100 patients, on average 80 will be cured, and at least 65 will be cured with 99.99% probability.

Statistics  Observe that 78 out of 100 patients were cured. We (will be able to) conclude that we are 95% confident that, for other studies, the drug will be effective on between 69.88% and 86.11% of patients.

Probability Redux

Let X1, ..., Xn be i.i.d. random variables with E[X] = µ and Var(X) = σ². Let F denote the CDF of X.

Law of Large Numbers
  X̄n = (1/n) Σ_{i=1}^n Xi →(P, a.s.) µ as n → ∞.

Central Limit Theorem
  √n (X̄n − µ)/σ →(d) N(0, 1) as n → ∞.
Equivalently,
  √n (X̄n − µ) →(d) N(0, σ²) as n → ∞.

Hoeffding's Inequality  Let n be a positive integer and X, X1, ..., Xn be i.i.d. random variables such that E[X] = µ and X ∈ [a, b] almost surely. Then,
  P(|X̄n − µ| ≥ ε) ≤ 2 exp(−2nε²/(b − a)²) for all ε > 0.

Quantiles  The quantile of order 1 − α of X is the number qα such that
  • F(qα) = 1 − α
  • If F is invertible, then qα = F^{-1}(1 − α)
  • P(X > qα) = α
  • If X ~ N(0, 1), P(|X| > q_{α/2}) = α
[Figure: standard Gaussian density with the quantile qα; area 1 − α to its left, area α in the right tail.]

Three Types of Convergence

Almost Surely (a.s.) Convergence
  Tn →(a.s.) T  ⇐⇒  P({ω : Tn(ω) → T(ω) as n → ∞}) = 1.

Convergence in Probability
  Tn →(P) T  ⇐⇒  P(|Tn − T| ≥ ε) → 0 as n → ∞, for all ε > 0.

Convergence in Distribution
  Tn →(d) T  ⇐⇒  E[f(Tn)] → E[f(T)] as n → ∞,
for all continuous and bounded functions f.

Slutsky's Theorem  Let Tn →(d) T and Un →(P) u, where T is a random variable and u is a given real number. Then,
  • Tn + Un →(d) T + u
  • Tn Un →(d) T u
  • If, in addition, u ≠ 0, then Tn/Un →(d) T/u.

Continuous Mapping Theorem  If f is a continuous function, then
  Tn →(a.s./P/(d)) T  ⇒  f(Tn) →(a.s./P/(d)) f(T).

Foundation of Inference

Statistical Model  Let the observed outcome of a statistical experiment be a sample X1, ..., Xn of n i.i.d. random variables in some measurable space E (usually E ⊆ R) and denote by P their common distribution. A statistical model associated to that statistical experiment is a pair
  (E, (Pθ)θ∈Θ)
where
  • E is called the sample space;
  • (Pθ)θ∈Θ is a family of probability measures on E;
  • Θ is any set, called the parameter set.
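As a quick numerical companion to the opening drug example (a Bernoulli model with unknown p), here is a minimal Python sketch. The 80%, 100-patient and 78-cured figures are the ones quoted above; using scipy.stats.norm for the Gaussian approximation is an implementation choice, not part of the original text.

    import numpy as np
    from scipy import stats

    # Probability viewpoint: if p = 0.8 and n = 100, how likely is it that
    # at least 65 patients are cured?  CLT / Gaussian approximation of X_bar_n.
    p, n = 0.8, 100
    z = (0.65 - p) / np.sqrt(p * (1 - p) / n)
    print(1 - stats.norm.cdf(z))               # approx 0.9999

    # Statistics viewpoint: observe 78 cured out of 100 and build a 95%
    # asymptotic confidence interval for p (plug-in estimate of the variance).
    p_hat, alpha = 0.78, 0.05
    q = stats.norm.ppf(1 - alpha / 2)          # approx 1.96
    half_width = q * np.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - half_width, p_hat + half_width)   # approx (0.699, 0.861)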
Parametric, Nonparametric and Semiparametric Models

• Usually, we will assume that the statistical model is well specified, i.e., defined such that there exists θ such that P = Pθ. This particular θ is called the true parameter and is unknown.
• We often assume that Θ ⊆ R^d for some d ≥ 1. The model is then called parametric.
• Sometimes we could have Θ be infinite dimensional, in which case the model is called nonparametric.
• If Θ = Θ1 × Θ2, where Θ1 is finite dimensional and Θ2 is infinite dimensional, then we have a semiparametric model. In these models, we only care to estimate the finite-dimensional parameter; the infinite-dimensional one is called a nuisance parameter.

Identifiability

The parameter θ is called identifiable if and only if the map θ ∈ Θ ↦ Pθ is injective, i.e.,
  θ ≠ θ′  ⇒  Pθ ≠ Pθ′,
or, equivalently,
  Pθ = Pθ′  ⇒  θ = θ′.

Parameter Estimation

Statistic  Any measurable function of the sample, e.g., X̄n, max_i Xi, etc.

Estimator of θ  Any statistic whose expression does not depend on θ.

• An estimator θ̂n of θ is weakly (resp. strongly) consistent if
  θ̂n →(P) θ (resp. →(a.s.)) as n → ∞ (w.r.t. P).
• An estimator θ̂n of θ is asymptotically normal if
  √n (θ̂n − θ) →(d) N(0, σ²).

Bias of an Estimator

• Bias of an estimator θ̂n of θ:
  bias(θ̂n) = E[θ̂n] − θ.
• If bias(θ̂n) = 0, we say that θ̂n is unbiased.

Jensen's Inequality

• If the function f(x) is convex,
  E[f(X)] ≥ f(E[X]).
• If the function g(x) is concave,
  E[g(X)] ≤ g(E[X]).

Quadratic Risk

• We want estimators to have low bias and low variance at the same time.
• The risk (or quadratic risk) of an estimator θ̂n ∈ R is
  R(θ̂n) = E[(θ̂n − θ)²] = variance + bias².
• Low quadratic risk means that both bias and variance are small.

Confidence Intervals

Let (E, (Pθ)θ∈Θ) be a statistical model based on observations X1, ..., Xn, and assume Θ ⊆ R. Let α ∈ (0, 1).

• Confidence interval (C.I.) of level 1 − α for θ: any random (depending on X1, ..., Xn) interval I whose boundaries do not depend on θ and such that
  Pθ[I ∋ θ] ≥ 1 − α, for all θ ∈ Θ.
• C.I. of asymptotic level 1 − α for θ: any random interval I whose boundaries do not depend on θ and such that
  lim_{n→∞} Pθ[I ∋ θ] ≥ 1 − α, for all θ ∈ Θ.

Example  We observe R1, ..., Rn i.i.d. ~ Ber(p) for some unknown p ∈ (0, 1).
• Statistical model: ({0, 1}, (Ber(p))_{p∈(0,1)}).
• From the CLT:
  √n (R̄n − p)/√(p(1 − p)) →(d) N(0, 1) as n → ∞.
• It yields
  I = [ R̄n − q_{α/2} √(p(1 − p))/√n , R̄n + q_{α/2} √(p(1 − p))/√n ].
• But this is not a confidence interval because it depends on p! Three solutions:
  1. Conservative bound (use p(1 − p) ≤ 1/4)
  2. Solving the (quadratic) equation for p
  3. Plug-in (replace p by R̄n)

The Delta Method

Let (Zn)_{n≥1} be a sequence of random variables that satisfies
  √n (Zn − θ) →(d) N(0, σ²)
for some θ ∈ R and σ² > 0 (the sequence (Zn)_{n≥1} is said to be asymptotically normal around θ). Let g : R → R be continuously differentiable at the point θ. Then,
• (g(Zn))_{n≥1} is also asymptotically normal around g(θ).
• More precisely,
  √n (g(Zn) − g(θ)) →(d) N(0, g′(θ)² σ²) as n → ∞.

Introduction to Hypothesis Testing

Statistical Formulation  Consider a sample X1, ..., Xn of i.i.d. random variables and a statistical model (E, (Pθ)θ∈Θ). Let Θ0 and Θ1 be disjoint subsets of Θ. Consider the two hypotheses:
  • H0 : θ ∈ Θ0
  • H1 : θ ∈ Θ1
H0 is the null hypothesis and H1 is the alternative hypothesis.

Asymmetry in the hypotheses  H0 and H1 do not play a symmetric role: the data are only used to try to disprove H0. Lack of evidence does not mean that H0 is true.

A test is a statistic ψ ∈ {0, 1} such that:
  • If ψ = 0, H0 is not rejected.
  • If ψ = 1, H0 is rejected.

Errors

• Rejection region of a test ψ:
  Rψ = {x ∈ E^n : ψ(x) = 1}.
• Type 1 error of a test ψ:
  αψ : Θ0 → R (or [0, 1]),  θ ↦ Pθ[ψ = 1].
• Type 2 error of a test ψ:
  βψ : Θ1 → R,  θ ↦ Pθ[ψ = 0].
• Power of a test ψ:
  πψ = inf_{θ∈Θ1} (1 − βψ(θ)).

Level, test statistic and rejection region

• A test ψ has level α if
  αψ(θ) ≤ α for all θ ∈ Θ0.
• A test ψ has asymptotic level α if
  lim_{n→∞} αψ(θ) ≤ α for all θ ∈ Θ0.
• In general, a test has the form
  ψ = 1{Tn > c}
for some statistic Tn and threshold c ∈ R. Tn is called the test statistic, and the rejection region is Rψ = {Tn > c}.

p-value  The (asymptotic) p-value of a test ψα is the smallest (asymptotic) level α at which ψα rejects H0.

Methods of Estimation

Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, ..., Xn. Assume that there exists θ* ∈ Θ such that X1 ~ Pθ*.

Statistician's goal: given X1, ..., Xn, find an estimator θ̂ = θ̂(X1, ..., Xn) such that Pθ̂ is close to Pθ* for the true parameter θ*.

Total Variation Distance

The total variation distance between two probability measures Pθ and Pθ′ is defined by
  TV(Pθ, Pθ′) = max_{A⊂E} |Pθ(A) − Pθ′(A)|.

Total Variation Distance between Discrete Measures  Assume that E is discrete (i.e., finite or countable). The total variation distance between Pθ and Pθ′ is
  TV(Pθ, Pθ′) = (1/2) Σ_{x∈E} |pθ(x) − pθ′(x)|.

Total Variation Distance between Continuous Measures  Assume that E is continuous. The total variation distance between Pθ and Pθ′ is
  TV(Pθ, Pθ′) = (1/2) ∫ |fθ(x) − fθ′(x)| dx.

Properties of Total Variation
• TV(Pθ, Pθ′) = TV(Pθ′, Pθ)  (symmetric)
• 0 ≤ TV(Pθ, Pθ′) ≤ 1  (positive)
• If TV(Pθ, Pθ′) = 0, then Pθ = Pθ′  (definite)
• TV(Pθ, Pθ′) ≤ TV(Pθ, Pθ″) + TV(Pθ″, Pθ′)  (triangle inequality)
These imply that the total variation is a distance between probability distributions.
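A minimal numerical check of the discrete total-variation formula; the two Bernoulli parameters below are arbitrary choices for illustration.

    import numpy as np

    # TV = (1/2) * sum over x of |p_theta(x) - p_theta'(x)| on E = {0, 1}
    def tv_bernoulli(p, q):
        pmf_p = np.array([1 - p, p])
        pmf_q = np.array([1 - q, q])
        return 0.5 * np.abs(pmf_p - pmf_q).sum()

    print(tv_bernoulli(0.5, 0.5))   # 0.0  (definite: identical distributions)
    print(tv_bernoulli(0.3, 0.8))   # 0.5  (= |0.3 - 0.8| for two Bernoullis)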
Kullback-Leibler (KL) Divergence

The Kullback-Leibler (KL) divergence between two probability measures Pθ and Pθ′ is defined by
  KL(Pθ, Pθ′) = Σ_{x∈E} pθ(x) log( pθ(x)/pθ′(x) )   if E is discrete,
  KL(Pθ, Pθ′) = ∫_E fθ(x) log( fθ(x)/fθ′(x) ) dx    if E is continuous.
KL-divergence is also known as relative entropy.

Properties of KL-divergence
• KL(Pθ, Pθ′) ≠ KL(Pθ′, Pθ) in general
• KL(Pθ, Pθ′) ≥ 0
• If KL(Pθ, Pθ′) = 0, then Pθ = Pθ′  (definite)
• KL(Pθ, Pθ′) ≰ KL(Pθ, Pθ″) + KL(Pθ″, Pθ′) in general (no triangle inequality)

Maximum Likelihood Estimation

Likelihood, Discrete Case  Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, ..., Xn. Assume that E is discrete (i.e., finite or countable).

Definition  The likelihood of the model is the map Ln (or just L) defined as
  Ln : E^n × Θ → R
  (x1, ..., xn; θ) ↦ Pθ[X1 = x1, ..., Xn = xn] = Π_{i=1}^n Pθ[Xi = xi].

Likelihood, Continuous Case  Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, ..., Xn. Assume that all the Pθ have density fθ.

Definition  The likelihood of the model is the map L defined as
  L : E^n × Θ → R
  (x1, ..., xn; θ) ↦ Π_{i=1}^n fθ(xi).

Maximum Likelihood Estimator  Let X1, ..., Xn be an i.i.d. sample associated with a statistical model (E, (Pθ)θ∈Θ) and let L be the corresponding likelihood.

Definition  The maximum likelihood estimator of θ is defined as
  θ̂n^MLE = argmax_{θ∈Θ} L(X1, ..., Xn, θ),
provided it exists.

Log-likelihood  In practice, we use the fact that
  θ̂n^MLE = argmax_{θ∈Θ} log L(X1, ..., Xn, θ).

Concave and Convex Functions

A twice-differentiable function h : Θ ⊂ R → R is said to be concave if its second derivative satisfies
  h″(θ) ≤ 0 for all θ ∈ Θ.
It is said to be strictly concave if the inequality is strict: h″(θ) < 0. Moreover, h is said to be (strictly) convex if −h is (strictly) concave, i.e. h″(θ) ≥ 0 (h″(θ) > 0).

Multivariate Concave Functions  More generally, for a multivariate function h : Θ ⊂ R^d → R, d ≥ 2, define the
• gradient vector:
  ∇h(θ) = ( ∂h(θ)/∂θ1, ..., ∂h(θ)/∂θd )^T ∈ R^d,
• Hessian matrix:
  Hh(θ) = ( ∂²h(θ)/∂θi ∂θj )_{1≤i,j≤d} ∈ R^{d×d}.
Then
  h is concave ⇐⇒ x^T Hh(θ) x ≤ 0 for all x ∈ R^d, θ ∈ Θ,
  h is strictly concave ⇐⇒ x^T Hh(θ) x < 0 for all x ∈ R^d, θ ∈ Θ.

Covariance  In general, when θ ∈ Θ ⊂ R^d, d ≥ 2, its coordinates are not necessarily independent. The covariance between two random variables X and Y is
  Cov(X, Y) := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y].

Properties
• Cov(X, X) = Var(X)
• Cov(X, Y) = Cov(Y, X)
• If X and Y are independent, then Cov(X, Y) = 0.

Covariance Matrix  The covariance matrix of a random vector X = (X^(1), ..., X^(d))^T ∈ R^d is given by
  Σ = Cov(X) = E[(X − E[X])(X − E[X])^T].
This is a matrix of size d × d.

If X ∈ R^d and A, B are matrices:
  Cov(AX + B) = Cov(AX) = A Cov(X) A^T = A ΣX A^T.

The Multivariate Gaussian Distribution  If (X, Y)^T is a Gaussian vector, then its PDF depends on 5 parameters:
  E[X], Var(X), E[Y], Var(Y), and Cov(X, Y).
A Gaussian vector X ∈ R^d is completely determined by its expected value and covariance matrix Σ:
  X ~ Nd(µ, Σ).
It has PDF over R^d given by:
  f(x) = ((2π)^d det(Σ))^{-1/2} exp( −(1/2) (x − µ)^T Σ^{-1} (x − µ) ).

The Multivariate CLT  Let X1, ..., Xn ∈ R^d be independent copies of a random vector X such that E[X] = µ and Cov(X) = Σ. Then
  √n (X̄n − µ) →(d) Nd(0, Σ) as n → ∞.

Multivariate Delta Method  Let (Tn)_{n≥1} be a sequence of random vectors in R^d such that
  √n (Tn − θ) →(d) Nd(0, Σ) as n → ∞,
for some θ ∈ R^d and some covariance Σ ∈ R^{d×d}. Let g : R^d → R^k (k ≥ 1) be continuously differentiable at θ. Then,
  √n (g(Tn) − g(θ)) →(d) Nk(0, ∇g(θ)^T Σ ∇g(θ)) as n → ∞,
where ∇g(θ) = ∂g(θ)/∂θ = ( ∂gj/∂θi )_{1≤i≤d, 1≤j≤k} ∈ R^{d×k}.

Fisher Information

Define the log-likelihood for one observation as
  ℓ(θ) = log L1(X, θ),  θ ∈ Θ ⊂ R^d.
Assume that ℓ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as
  I(θ) = E[∇ℓ(θ) ∇ℓ(θ)^T] − E[∇ℓ(θ)] E[∇ℓ(θ)]^T = −E[Hℓ(θ)].
If Θ ⊂ R, we get
  I(θ) = Var(ℓ′(θ)) = −E[ℓ″(θ)].

Consistency of Maximum Likelihood Estimator  Under mild regularity conditions, we have
  θ̂n^MLE →(P) θ* as n → ∞.

Asymptotic Normality of the MLE

Theorem  Let θ* ∈ Θ (the true parameter). Assume the following:
  1. The parameter is identifiable.
  2. For all θ ∈ Θ, the support of Pθ does not depend on θ.
  3. θ* is not on the boundary of Θ.
  4. I(θ) is invertible in a neighborhood of θ*.
  5. A few more technical conditions.
Then, θ̂n^MLE satisfies
  • θ̂n^MLE →(P) θ* w.r.t. Pθ*;
  • √n (θ̂n^MLE − θ*) →(d) Nd(0, I^{-1}(θ*)) w.r.t. Pθ*.

The Method of Moments

Moments  Let X1, ..., Xn be an i.i.d. sample associated with a statistical model (E, (Pθ)θ∈Θ). Assume that E ⊆ R and Θ ⊆ R^d, for some d ≥ 1.

Population Moments  Let mk(θ) = Eθ[X1^k], 1 ≤ k ≤ d.

Empirical Moments  Let m̂k = (1/n) Σ_{i=1}^n Xi^k, 1 ≤ k ≤ d.

From the LLN,
  m̂k →(P/a.s.) mk(θ) as n → ∞.
More compactly, we say that the whole vector converges:
  (m̂1, ..., m̂d) →(P/a.s.) (m1(θ), ..., md(θ)) as n → ∞.
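A small simulation sketch of this convergence of empirical moments; the N(2, 4) population and the sample sizes are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 2.0          # population N(2, 4): m1 = 2, m2 = sigma^2 + mu^2 = 8

    for n in (100, 10_000, 1_000_000):
        x = rng.normal(mu, sigma, size=n)
        m1_hat = x.mean()         # empirical first moment
        m2_hat = (x ** 2).mean()  # empirical second moment
        print(n, round(m1_hat, 3), round(m2_hat, 3))   # approaches (2, 8) as n grows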
Moments Estimator

Let
  M : Θ → R^d,  θ ↦ M(θ) = (m1(θ), ..., md(θ)).
Assume M is one-to-one:
  θ = M^{-1}(m1(θ), ..., md(θ)).
The moments estimator of θ replaces the population moments by the empirical ones:
  θ̂n^MM = M^{-1}(m̂1, ..., m̂d).

M-Estimation  More generally, to estimate a quantity µ* associated with P, choose a function ρ(x, µ) such that Q(µ) = E[ρ(X, µ)] achieves its minimum at µ = µ*.

Examples (1)
• If E = M = R and ρ(x, µ) = (x − µ)², for all x, µ ∈ R: µ* = E[X].
• If E = M = R^d and ρ(x, µ) = ‖x − µ‖²₂, for all x, µ ∈ R^d: µ* = E[X] ∈ R^d.
• If E = M = R and ρ(x, µ) = |x − µ|, for all x, µ ∈ R: µ* is a median of P.

Example (2)  If E = M = R, α ∈ (0, 1) is fixed and ρ(x, µ) = Cα(x − µ), for all x, µ ∈ R: µ* is an α-quantile of P.

Check Function
  Cα(x) = −(1 − α) x  if x < 0,
  Cα(x) = α x         if x ≥ 0.
[Figure: the check function Cα, piecewise linear with slope −(1 − α) on (−∞, 0) and slope α on [0, ∞).]

Parametric Hypothesis Testing

Hypotheses
  H0 : Δc = Δd  vs.  H1 : Δd > Δc.
Since the data are Gaussian by assumption, we don't need the CLT:
  X̄n ~ N(Δd, σd²/n)  and  Ȳm ~ N(Δc, σc²/m).
Then,
  ( X̄n − Ȳm − (Δd − Δc) ) / √( σd²/n + σc²/m ) ~ N(0, 1).

Asymptotic test  Assume that m = c·n and n → ∞. Using Slutsky's theorem, we also have
  ( X̄n − Ȳm − (Δd − Δc) ) / √( σ̂d²/n + σ̂c²/m ) →(d) N(0, 1) as n → ∞,
where
  σ̂d² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)²  and  σ̂c² = (1/(m − 1)) Σ_{i=1}^m (Yi − Ȳm)².

Student's T test (one-sample, two-sided)

  H0 : µ = 0  vs.  H1 : µ ≠ 0.
Test statistic:
  Tn = √n X̄n / √(S̃n).
Since √n X̄n/σ ~ N(0, 1) (under H0) and S̃n/σ² ~ χ²_{n−1}/(n − 1) are independent by Cochran's theorem, we have
  Tn ~ t_{n−1}.
Student's test with (non-asymptotic) level α ∈ (0, 1):
  ψα = 1{ |Tn| > q_{α/2} },
where q_{α/2} is the (1 − α/2)-quantile of t_{n−1}.
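A minimal sketch of this two-sided one-sample test in Python; the simulated N(0.2, 1) sample is only for illustration, and scipy.stats.ttest_1samp is used as a cross-check of the hand-computed statistic.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(0.2, 1.0, size=50)      # simulated sample; H0: mu = 0

    # Manual statistic: Tn = sqrt(n) * x_bar / sqrt(S_tilde_n)
    n = len(x)
    t_manual = np.sqrt(n) * x.mean() / x.std(ddof=1)

    # Same test via scipy (two-sided by default); reject H0 at level alpha if p < alpha
    t_scipy, p_value = stats.ttest_1samp(x, popmean=0.0)
    print(t_manual, t_scipy, p_value)      # the two statistics coincide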
Student's T test (one-sample, one-sided)

  H0 : µ ≤ µ0  vs.  H1 : µ > µ0.
Test statistic:
  Tn = √n (X̄n − µ0)/√(S̃n) ~ t_{n−1} (under H0).
Student's test with (non-asymptotic) level α ∈ (0, 1):
  ψα = 1{Tn > qα},
where qα is the (1 − α)-quantile of t_{n−1}.

Two-sample T-test

  ( X̄n − Ȳm − (Δd − Δc) ) / √( σ̂d²/n + σ̂c²/m ) ~ tN,
where N is given by the Welch-Satterthwaite formula
  N = ( σ̂d²/n + σ̂c²/m )² / ( σ̂d⁴/(n²(n − 1)) + σ̂c⁴/(m²(m − 1)) ) ≥ min(n, m).

Likelihood Ratio Test

Likelihood function
  Ln : R^n × R^d → R,  (x1, ..., xn; θ) ↦ Π_{i=1}^n pθ(xi).
The likelihood ratio test in this set-up is of the form
  ψC = 1{ Ln(x1, ..., xn; θ1) / Ln(x1, ..., xn; θ0) > C },
where C is a threshold to be specified.

A test based on the log-likelihood  Consider an i.i.d. sample X1, ..., Xn with statistical model (E, (Pθ)θ∈Θ), where Θ ⊆ R^d (d ≥ 1). Suppose the null hypothesis has the form
  H0 : (θ_{r+1}, ..., θ_d) = (θ⁰_{r+1}, ..., θ⁰_d),
for some fixed and given numbers θ⁰_{r+1}, ..., θ⁰_d. Let
  θ̂n = argmax_{θ∈Θ} ℓn(θ)   (MLE)
and
  θ̂nᶜ = argmax_{θ∈Θ0} ℓn(θ)  (constrained MLE),
where Θ0 = {θ ∈ Θ : (θ_{r+1}, ..., θ_d) = (θ⁰_{r+1}, ..., θ⁰_d)}.
Test statistic:
  Tn = 2( ℓn(θ̂n) − ℓn(θ̂nᶜ) ).

Wilks' Theorem  Assume H0 is true and the MLE technical conditions are satisfied. Then,
  Tn →(d) χ²_{d−r} as n → ∞.
Likelihood ratio test with asymptotic level α ∈ (0, 1):
  ψ = 1{Tn > qα},
where qα is the (1 − α)-quantile of χ²_{d−r}.

Wald's Test

A test based on the MLE  Consider an i.i.d. sample X1, ..., Xn with statistical model (E, (Pθ)θ∈Θ), where Θ ⊆ R^d (d ≥ 1), and let θ0 ∈ Θ be fixed and given. θ* is the true parameter. Consider the following hypotheses:
  H0 : θ* = θ0  vs.  H1 : θ* ≠ θ0.
Let θ̂n^MLE be the MLE. Assume the MLE technical conditions are satisfied. If H0 is true, then
  √n I(θ̂n^MLE)^{1/2} (θ̂n^MLE − θ0) →(d) Nd(0, Id) as n → ∞.

Wald's test
  Tn := n (θ̂n^MLE − θ0)^T I(θ̂n^MLE) (θ̂n^MLE − θ0) →(d) χ²_d as n → ∞.

Goodness of Fit Tests

Let X be a r.v. We want to know if the hypothesized distribution is a good fit for the data. Key characteristic of goodness of fit tests: no parametric modeling.

Discrete distribution  Let E = {a1, ..., aK} be a finite space and (Pp)_{p∈ΔK} be the family of all probability distributions on E. For a fixed p⁰ ∈ ΔK, test H0 : p = p⁰ vs. H1 : p ≠ p⁰.

Categorical Likelihood
• Likelihood of the model:
  Ln(X1, ..., Xn; p) = p1^{N1} p2^{N2} ... pK^{NK},
where Nj = #{i = 1, ..., n : Xi = aj}.
• Let p̂ be the MLE:
  p̂j = Nj/n,  j = 1, ..., K.
  p̂ maximizes log Ln(X1, ..., Xn, p) under the constraint.

χ² test  If H0 is true, then √n (p̂ − p⁰) is asymptotically normal, and the following holds.

Theorem  Under H0:
  Tn = n Σ_{j=1}^K (p̂j − p⁰j)² / p⁰j →(d) χ²_{K−1} as n → ∞.

CDF and empirical CDF  Let X1, ..., Xn be i.i.d. real random variables. The CDF of X1 is defined as
  F(t) = P[X1 ≤ t], for all t ∈ R.
It completely characterizes the distribution of X1.
The empirical CDF of the sample X1, ..., Xn is defined as
  Fn(t) = (1/n) Σ_{i=1}^n 1{Xi ≤ t} = #{i = 1, ..., n : Xi ≤ t}/n, for all t ∈ R.

Consistency  By the LLN, for all t ∈ R,
  Fn(t) →(a.s.) F(t) as n → ∞.

Glivenko-Cantelli Theorem (Fundamental theorem of statistics)
  sup_{t∈R} |Fn(t) − F(t)| →(a.s.) 0 as n → ∞.

Asymptotic normality  By the CLT, for all t ∈ R,
  √n (Fn(t) − F(t)) →(d) N(0, F(t)(1 − F(t))) as n → ∞.

Donsker's Theorem  If F is continuous, then
  √n sup_{t∈R} |Fn(t) − F(t)| →(d) sup_{0≤t≤1} |B(t)| as n → ∞,
where B is a Brownian bridge on [0, 1].
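Donsker's theorem is what makes the Kolmogorov-Smirnov statistic described next usable in practice. A minimal sketch with scipy.stats.kstest; the simulated samples and the N(0, 1) null are illustrative choices.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.standard_normal(200)

    # KS statistic sup_t |Fn(t) - F(t)| against the hypothesized CDF, here N(0, 1)
    stat, p_value = stats.kstest(x, "norm")
    print(stat, p_value)                          # large p-value: no evidence against H0

    # Same test on data that do not follow N(0, 1)
    stat2, p2 = stats.kstest(rng.exponential(size=200), "norm")
    print(stat2, p2)                              # tiny p-value: H0 is rejected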
Pivotal Distribution  The Kolmogorov-Smirnov statistic Tn = √n sup_{t∈R} |Fn(t) − F(t)| is called a pivotal statistic: if H0 is true, the distribution of Tn does not depend on the distribution of the Xi's.
[Figure: the empirical CDF Fn (step function) versus F for a small sample, illustrating the vertical gaps (e.g. at X(2)) that enter the supremum.]

Other Goodness of Fit Tests

Kolmogorov-Smirnov
  d(Fn, F) = sup_{t∈R} |Fn(t) − F(t)|,  X ~ F.

Anderson-Darling
  d²(Fn, F) = ∫_R [Fn(t) − F(t)]² / ( F(t)(1 − F(t)) ) dF(t).

Kolmogorov-Lilliefors Test

We want to test if X has a Gaussian distribution with unknown parameters. In this case, Donsker's theorem is no longer valid. Instead, we compute the quantiles for the test statistic
  sup_{t∈R} |Fn(t) − Φ_{µ̂,σ̂²}(t)|,
where µ̂ = X̄n, σ̂² = Sn², and Φ_{µ̂,σ̂²}(t) is the CDF of N(µ̂, σ̂²). These quantiles do not depend on unknown parameters.

Quantile-Quantile (Q-Q) Plots

• Not a formal test, but a quick and easy check to see if a distribution is plausible.
• Main idea: we want to check visually if the plot of Fn is close to that of F or, equivalently, if the plot of Fn^{-1} is close to F^{-1}.
• Check if the points
  ( F^{-1}(1/n), Fn^{-1}(1/n) ), ..., ( F^{-1}((n−1)/n), Fn^{-1}((n−1)/n) )
are near the line y = x.
• Fn is not technically invertible, but we define
  Fn^{-1}(i/n) = X(i),
the ith largest observation.
[Figures: Normal Q-Q plots (sample quantiles vs. theoretical quantiles) for several samples; e.g. panel 4 shows light tails.]

Bayesian Statistics

Introduction to Bayesian Statistics

Prior and Posterior
• Consider a probability distribution on a parameter space Θ with some PDF π(·): the prior distribution.
• Remark: Ln(X1, ..., Xn|θ) is the likelihood used in the frequentist approach.
• The conditional distribution of θ given X1, ..., Xn is called the posterior distribution. Denote by π(·|X1, ..., Xn) its PDF.

Bayes' formula
  π(θ|X1, ..., Xn) ∝ π(θ) Ln(X1, ..., Xn|θ), for all θ ∈ Θ.

Example (Bernoulli with a Beta prior)
• Given p, X1, ..., Xn i.i.d. ~ Ber(p), so
  Ln(X1, ..., Xn|p) = p^{Σ Xi} (1 − p)^{n − Σ Xi}.
• Hence, with a Beta(a, a) prior π(p) ∝ p^{a−1}(1 − p)^{a−1},
  π(p|X1, ..., Xn) ∝ p^{a−1+Σ Xi} (1 − p)^{a−1+n−Σ Xi}.

Non-informative Priors
• We can still use a Bayesian approach if we have no prior information about the parameter.
• Good candidate: π(θ) ∝ 1, i.e., constant PDF on Θ.
• If Θ is bounded, this is the uniform prior on Θ.
• If Θ is unbounded, this does not define a proper PDF on Θ.
• An improper prior on Θ is a measurable, non-negative function π(·) defined on Θ that is not integrable:
  ∫_Θ π(θ) dθ = ∞.
• In general, one can still define a posterior distribution using an improper prior, using Bayes' formula.

Jeffreys Prior  A prior used in the Bayesian framework, as well as an example of a non-informative prior:
  πJ(θ) ∝ √(det I(θ)),
where I(θ) is the Fisher information matrix of the statistical model associated with X1, ..., Xn in the frequentist approach (provided it exists).

Examples
• Bernoulli experiment: πJ(p) ∝ 1/√(p(1 − p)), p ∈ (0, 1): the prior is Beta(1/2, 1/2).
• Gaussian experiment: πJ(θ) ∝ 1, θ ∈ R, is an improper prior.
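Tying the two Bernoulli examples above together, here is a minimal sketch of the posterior update under the Jeffreys prior Beta(1/2, 1/2); the 78-out-of-100 count reuses the drug-example numbers purely for illustration.

    from scipy import stats

    # Prior Beta(a, a) with a = 1/2 (Jeffreys); data: k successes out of n trials
    a, n, k = 0.5, 100, 78
    posterior = stats.beta(a + k, a + n - k)    # Beta(a + sum Xi, a + n - sum Xi)

    print(posterior.mean())                     # posterior mean, a Bayesian point estimate
    print(posterior.ppf([0.025, 0.975]))        # central 95% Bayesian confidence region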
Jeffreys prior satisfies a reparametrization invariance principle: if η is a reparametrization of θ (i.e., η = φ(θ) for some one-to-one map φ), then the PDF π̃(·) of η satisfies
  π̃(η) ∝ √(det Ĩ(η)),
where Ĩ(η) is the Fisher information of the statistical model parametrized by η instead of θ.

Bayesian confidence regions  For α ∈ (0, 1), a Bayesian confidence region with level α is a random subset R of the parameter space Θ, which depends on the sample X1, ..., Xn, such that
  P[θ ∈ R | X1, ..., Xn] = 1 − α.
Note that R depends on the prior π(·). Bayesian confidence region and confidence interval are two distinct notions.

Bayesian estimation  Summaries of the posterior distribution can be reported, e.g.:
  – conditional quantiles
  – conditional variance (not information about location)

Linear Regression

We focus on modeling the regression function
  f(x) = E[Y | X = x].
Restrict to simple functions. The simplest is
  f(x) = a + bx,  a linear (or affine) function.

Probabilistic Analysis  Let X and Y be two r.v. (not necessarily independent) with two moments and such that Var(X) > 0. The theoretical linear regression of Y on X is the line x ↦ a* + b*x, where
  (a*, b*) = argmin_{(a,b)∈R²} E[(Y − a − bX)²],
which gives
  b* = Cov(X, Y)/Var(X),
  a* = E[Y] − b* E[X] = E[Y] − (Cov(X, Y)/Var(X)) E[X].

Noise  The points are not exactly on the line x ↦ a* + b*x if Var(Y | X = x) > 0. The random variable ε = Y − (a* + b*X) is called noise and satisfies
  Y = a* + b*X + ε.

In the multivariate case:
• Let X be the n × p matrix whose rows are X1^T, ..., Xn^T. X is called the design matrix.
• Let ε = (ε1, ..., εn)^T ∈ R^n be the unobserved noise. Then,
  Y = Xβ* + ε,  β* unknown.
• The LSE β̂ satisfies
  β̂ = argmin_{β∈R^p} ‖Y − Xβ‖²₂.

Closed Form Solution  Assume that rank(X) = p. Then,
  β̂ = (X^T X)^{-1} X^T Y.

Geometric Interpretation of the LSE  Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:
  Xβ̂ = PY,  where P = X (X^T X)^{-1} X^T.

Statistical Inference  To make inference, we need more assumptions.
• The design matrix X is deterministic and rank(X) = p.
• The model is homoscedastic: ε1, ..., εn are i.i.d.
• The noise vector ε is Gaussian:
  ε ~ Nn(0, σ² In).
Tests on a single coefficient βj are then based on the t_{n−p} distribution, where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.

Bonferroni's test  Test whether a group of explanatory variables is significant in the linear regression.
• H0 : βj = 0 for all j ∈ S  vs.  H1 : βj ≠ 0 for some j ∈ S, where S ⊆ {1, ..., p}.
• Bonferroni's test:
  R_{S,α} = ∪_{j∈S} R_{j, α/k},  where k = |S|.

Generalized Linear Model

Generalization  A generalized linear model (GLM) generalizes normal linear regression models in the following directions:
1. Random component:
  Y | X = x ~ some distribution.
2. Regression function:
  g(µ(x)) = x^T β,
where g is called the link function and µ(x) = E[Y | X = x] is the regression function.

In GLM, we have Y | X = x ~ a distribution in the exponential family. Then,
  E[Y | X = x] = f(X^T β).

Exponential Family

A family of distributions {Pθ : θ ∈ Θ}, Θ ⊂ R^k, is said to be a k-parameter exponential family on R^q if there exist real-valued functions
• η1, ..., ηk and B(θ),
• T1, ..., Tk, and h(y), y ∈ R^q,
such that the density function of Pθ can be written as
  fθ(y) = exp( Σ_{i=1}^k ηi(θ) Ti(y) − B(θ) ) h(y).

Examples of discrete distributions  The following distributions form discrete exponential families of distributions with PMF:
• Bernoulli(p): p^y (1 − p)^{1−y}, y ∈ {0, 1}
• Poisson(λ): (λ^y / y!) e^{−λ}, y = 0, 1, ...

Examples of continuous distributions  The following distributions form continuous exponential families of distributions with PDF:
• Gamma(a, b): (1/(Γ(a) b^a)) y^{a−1} e^{−y/b}
• Inverse Gamma(α, β): (β^α/Γ(α)) y^{−α−1} e^{−β/y}
• Inverse Gaussian(µ, σ²): √(σ²/(2πy³)) exp( −σ²(y − µ)²/(2µ²y) )

One-parameter Canonical Exponential Family

  fθ(y) = exp( (yθ − b(θ))/φ + c(y, φ) )
for some known functions b(θ) and c(y, φ).
• If φ is known, this is a one-parameter exponential family with θ being the canonical parameter.
• If φ is unknown, this may/may not be a two-parameter exponential family.
• φ is called the dispersion parameter.

Expected value  Note that
  ℓ(θ) = (Yθ − b(θ))/φ + c(Y; φ),
which leads to
  E[Y] = b′(θ).

Variance
  Var(Y) = b″(θ) · φ.

Link function  β is the parameter of interest. A link function g relates the linear predictor X^T β to the mean parameter µ:
  X^T β = g(µ) = g(µ(X)).
g is required to be monotone increasing and differentiable, and
  µ = g^{-1}(X^T β).

Canonical Link  The function g that links the mean µ to the canonical parameter θ is called the canonical link:
  g(µ) = θ.
Since µ = b′(θ), the canonical link is given by
  g(µ) = (b′)^{-1}(µ).
If φ > 0, the canonical link function is strictly increasing.

Example  Bernoulli distribution:
  p^y (1 − p)^{1−y} = exp( y log(p/(1 − p)) + log(1 − p) ) = exp( yθ − log(1 + e^θ) ).
Hence, θ = log(p/(1 − p)) and b(θ) = log(1 + e^θ), and
  b′(θ) = e^θ/(1 + e^θ) = µ  ⇐⇒  θ = log(µ/(1 − µ)).
The canonical link for the Bernoulli distribution is the logit link.

Model and Notation

Let (Xi, Yi) ∈ R^p × R, i = 1, ..., n, be independent random pairs such that the conditional distribution of Yi given Xi = xi has density in the canonical exponential family:
  fθi(yi) = exp( (yi θi − b(θi))/φ + c(yi, φ) ).

Back to β  Given a link function g, note the following relationship between β and θ:
  θi = (b′)^{-1}(µi) = (b′)^{-1}( g^{-1}(Xi^T β) ) ≡ h(Xi^T β),
where h is defined as
  h = (b′)^{-1} ∘ g^{-1} = (g ∘ b′)^{-1}.
If g is the canonical link function, g = (b′)^{-1}, then h is the identity.

Log-likelihood  The log-likelihood is given by
  ℓn(Y, X, β) = Σ_i ( Yi θi − b(θi) )/φ + constant
             = Σ_i ( Yi h(Xi^T β) − b(h(Xi^T β)) )/φ + constant.
When we use the canonical link function, we obtain the expression
  ℓn(Y, X, β) = Σ_i ( Yi Xi^T β − b(Xi^T β) )/φ + constant.

Strict concavity  The log-likelihood is strictly concave (if rank(X) = p) when the canonical link is used and φ > 0. As a consequence, the maximum likelihood estimator is unique. On the other hand, if another parametrization is used, the likelihood function may not be strictly concave, leading to several local maxima.

Recommended Resources

• Probability and Statistics (DeGroot and Schervish)
• Mathematical Statistics and Data Analysis (Rice)
• Fundamentals of Statistics [Lecture Slides] (http://www.edx.org)

Please share this cheatsheet with friends!