
18.6501x Fundamentals of Statistics

This is a cheat sheet for statistics based on the online course given by Prof. Philippe Rigollet. Compiled by Janus B. Advincula.
Last Updated December 18, 2019

Introduction to Statistics

What is Statistics?
Statistical view: Data comes from a random process. The goal is to learn how this process works in order to make predictions or to understand what plays a role in it.
Probability describes how the truth generates observations (data); statistics goes in the other direction, using the observations (data) to learn about the truth.

Statistics vs. Probability
• Probability: Previous studies showed that the drug was 80% effective. Then we can anticipate that, for a study on 100 patients, on average 80 will be cured, and at least 65 will be cured with 99.99% chances.
• Statistics: Observe that 78/100 patients were cured. We (will be able to) conclude that we are 95% confident that, for other studies, the drug will be effective on between 69.88% and 86.11% of patients.

Probability Redux

Let X1, ..., Xn be i.i.d. random variables with E[X] = µ and Var(X) = σ².

Law of Large Numbers
    X̄n = (1/n) Σ_{i=1}^n Xi → µ   (a.s. and in probability) as n → ∞.

Central Limit Theorem
    √n (X̄n − µ)/σ →(d) N(0, 1)   as n → ∞,
equivalently
    √n (X̄n − µ) →(d) N(0, σ²).

Hoeffding's Inequality
Let n be a positive integer and X, X1, ..., Xn be i.i.d. random variables such that E[X] = µ and X ∈ [a, b] almost surely. Then
    P(|X̄n − µ| ≥ ε) ≤ 2 exp(−2nε²/(b − a)²)   for all ε > 0.
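As a quick numerical illustration (not part of the original sheet; the sample size, ε, and Ber(0.5) data are arbitrary choices), the sketch below compares the Hoeffding bound with the Gaussian tail suggested by the CLT:

```python
# A minimal sketch, assuming X_i ~ Ber(0.5): compare the Hoeffding bound on
# P(|X̄_n - µ| >= eps) with the CLT (Gaussian) approximation of the same tail.
import numpy as np
from scipy import stats

n, eps, p = 200, 0.05, 0.5           # arbitrary illustration values
sigma = np.sqrt(p * (1 - p))         # std of one Ber(p) draw

# Hoeffding: X in [a, b] = [0, 1], so the bound is 2 exp(-2 n eps^2 / (b-a)^2)
hoeffding = 2 * np.exp(-2 * n * eps**2)

# CLT: X̄_n ≈ N(µ, σ²/n), so the two-sided tail is about 2 (1 - Φ(eps √n / σ))
clt_approx = 2 * (1 - stats.norm.cdf(eps * np.sqrt(n) / sigma))

print(f"Hoeffding bound  : {hoeffding:.4f}")   # valid for every n, but crude
print(f"CLT approximation: {clt_approx:.4f}")  # only asymptotic, but much tighter
```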

Three Types of Convergence

Almost Surely (a.s.) Convergence
    Tn →a.s. T  ⟺  P({ω : Tn(ω) → T(ω)}) = 1.

Convergence in Probability
    Tn →P T  ⟺  P(|Tn − T| ≥ ε) → 0 for all ε > 0.

Convergence in Distribution
    Tn →(d) T  ⟺  E[f(Tn)] → E[f(T)] for all continuous and bounded functions f.

Properties
• If (Tn)n≥1 converges a.s., then it also converges in probability, and the two limits are equal.
• If (Tn)n≥1 converges in probability, then it also converges in distribution.
• If (Tn)n≥1 converges in distribution and the limit has a density (e.g. Gaussian), then probabilities of intervals converge:
    Tn →(d) T  ⇒  P(a ≤ Tn ≤ b) → P(a ≤ T ≤ b)   as n → ∞.

Addition, Multiplication, Division
Assume Tn →a.s./P T and Un →a.s./P U. Then
• Tn + Un →a.s./P T + U
• Tn Un →a.s./P T U
• If, in addition, U ≠ 0 a.s., then Tn/Un →a.s./P T/U.

Slutsky's Theorem
Let (Tn), (Un) be two sequences of random variables such that (i) Tn →(d) T and (ii) Un →P u, where T is a random variable and u is a given real number. Then,
• Tn + Un →(d) T + u
• Tn Un →(d) T u
• If, in addition, u ≠ 0, then Tn/Un →(d) T/u.

Continuous Mapping Theorem
If f is a continuous function, then
    Tn →a.s./P/(d) T  ⇒  f(Tn) →a.s./P/(d) f(T).
The Gaussian Distribution

Because of the CLT, the Gaussian (a.k.a. normal) distribution is ubiquitous in statistics.
• X ~ N(µ, σ²)
• E[X] = µ
• Var(X) = σ² > 0

Gaussian density (PDF)
    f_{µ,σ²}(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)).

Useful Properties of Gaussian
It is invariant under affine transformation.
• If X ~ N(µ, σ²), then for any a, b ∈ R,  aX + b ~ N(aµ + b, a²σ²).
• Standardization: If X ~ N(µ, σ²), then
    Z = (X − µ)/σ ~ N(0, 1).
  We can compute probabilities from the CDF of Z ~ N(0, 1):
    P(u ≤ X ≤ v) = P((u − µ)/σ ≤ Z ≤ (v − µ)/σ).
• Symmetry: If X ~ N(0, σ²), then −X ~ N(0, σ²). If x > 0,
    P(|X| > x) = P(X > x) + P(−X > x) = 2 P(X > x).

Quantiles
Let α ∈ (0, 1). The quantile of order 1 − α of a random variable X is the number qα such that P(X ≤ qα) = 1 − α. Let F denote the CDF of X.
[Figure: density of X with area 1 − α to the left of qα and area α to the right.]
• F(qα) = 1 − α
• If F is invertible, then qα = F⁻¹(1 − α)
• P(X > qα) = α
• If X ~ N(0, 1), P(|X| > q_{α/2}) = α
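A minimal sketch (not from the sheet; µ, σ, and the interval are arbitrary) of how these quantities are computed with scipy:

```python
# Standardization, interval probabilities, and quantiles for X ~ N(µ, σ²).
from scipy.stats import norm

mu, sigma = 2.0, 3.0          # X ~ N(2, 9), arbitrary example values
u, v = -1.0, 5.0

# P(u <= X <= v) via standardization: P((u-µ)/σ <= Z <= (v-µ)/σ)
p_interval = norm.cdf((v - mu) / sigma) - norm.cdf((u - mu) / sigma)
# Same result letting scipy handle loc/scale directly
p_direct = norm.cdf(v, loc=mu, scale=sigma) - norm.cdf(u, loc=mu, scale=sigma)

# Quantile q_{α/2} of the standard normal, with α = 0.05
alpha = 0.05
q = norm.ppf(1 - alpha / 2)   # ≈ 1.96, so P(|Z| > q) = α

print(p_interval, p_direct, q)
```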
Foundation of Inference

Statistical Model
Let the observed outcome of a statistical experiment be a sample X1, ..., Xn of n i.i.d. random variables in some measurable space E (usually E ⊆ R) and denote by P their common distribution. A statistical model associated to that statistical experiment is a pair
    (E, (Pθ)θ∈Θ)
where
• E is called the sample space;
• (Pθ)θ∈Θ is a family of probability measures on E;
• Θ is any set, called the parameter set.
Parametric, Nonparametric and Semiparametric Models
• Usually, we will assume that the statistical model is well-specified, i.e., defined such that there exists θ with P = Pθ. This particular θ is called the true parameter and is unknown.
• We often assume that Θ ⊆ Rd for some d ≥ 1. The model is then called parametric.
• Sometimes we could have Θ be infinite dimensional, in which case the model is called nonparametric.
• If Θ = Θ1 × Θ2, where Θ1 is finite dimensional and Θ2 is infinite dimensional, then we have a semiparametric model. In these models, we only care to estimate the finite-dimensional parameter; the infinite-dimensional one is called the nuisance parameter.

Identifiability
The parameter θ is called identifiable if and only if the map θ ∈ Θ ↦ Pθ is injective, i.e.,
    θ ≠ θ'  ⇒  Pθ ≠ Pθ',
or equivalently,  Pθ = Pθ'  ⇒  θ = θ'.

Parameter Estimation
Statistic: any measurable function of the sample, e.g., X̄n, max_i Xi, etc.
Estimator of θ: any statistic whose expression does not depend on θ.
• An estimator θ̂n of θ is weakly (resp. strongly) consistent if
    θ̂n → θ   in probability (resp. a.s.) w.r.t. P, as n → ∞.
• An estimator θ̂n of θ is asymptotically normal if
    √n (θ̂n − θ) →(d) N(0, σ²).

Bias of an Estimator
• Bias of an estimator θ̂n of θ:  bias(θ̂n) = E[θ̂n] − θ.
• If bias(θ̂n) = 0, we say that θ̂n is unbiased.

Jensen's Inequality
• If the function f(x) is convex,  E[f(X)] ≥ f(E[X]).
• If the function g(x) is concave,  E[g(X)] ≤ g(E[X]).

Quadratic Risk
• We want estimators to have low bias and low variance at the same time.
• The risk (or quadratic risk) of an estimator θ̂n ∈ R is
    R(θ̂n) = E[(θ̂n − θ)²] = variance + bias².
• Low quadratic risk means that both bias and variance are small.
Confidence Intervals
Let (E, (Pθ)θ∈Θ) be a statistical model based on observations X1, ..., Xn, and assume Θ ⊆ R. Let α ∈ (0, 1).
• Confidence interval (C.I.) of level 1 − α for θ: any random interval I (depending on X1, ..., Xn) whose boundaries do not depend on θ and such that
    Pθ[I ∋ θ] ≥ 1 − α,  ∀θ ∈ Θ.
• C.I. of asymptotic level 1 − α for θ: any random interval I whose boundaries do not depend on θ and such that
    lim_{n→∞} Pθ[I ∋ θ] ≥ 1 − α,  ∀θ ∈ Θ.

Example
We observe R1, ..., Rn i.i.d. ~ Ber(p) for some unknown p ∈ (0, 1).
• Statistical model: ({0, 1}, (Ber(p))_{p∈(0,1)}).
• From the CLT:
    √n (R̄n − p)/√(p(1 − p)) →(d) N(0, 1).
• It yields the interval
    I = [ R̄n − q_{α/2} √(p(1 − p))/√n ,  R̄n + q_{α/2} √(p(1 − p))/√n ].
• But this is not a confidence interval because it depends on p! Three solutions:
  1. Conservative bound (use p(1 − p) ≤ 1/4);
  2. Solving the (quadratic) equation for p;
  3. Plug-in (replace p by R̄n).
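A sketch of the three resulting 95% intervals on simulated Bernoulli data (my own illustration; the seed, n, and true p are arbitrary):

```python
# The three 1-α confidence intervals for p in the Ber(p) example:
# conservative, quadratic, and plug-in.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p_true, alpha = 100, 0.65, 0.05
x = rng.binomial(1, p_true, size=n)
r_bar = x.mean()
q = norm.ppf(1 - alpha / 2)

# 1. Conservative bound: p(1-p) <= 1/4 for every p
half1 = q * 0.5 / np.sqrt(n)
ci_conservative = (r_bar - half1, r_bar + half1)

# 2. Solve the quadratic inequality (R̄_n - p)^2 <= q^2 p(1-p)/n for p
a = 1 + q**2 / n
b = -(2 * r_bar + q**2 / n)
c = r_bar**2
roots = np.roots([a, b, c])
ci_quadratic = (roots.min(), roots.max())

# 3. Plug-in: replace p(1-p) by R̄_n(1-R̄_n)
half3 = q * np.sqrt(r_bar * (1 - r_bar) / n)
ci_plugin = (r_bar - half3, r_bar + half3)

print(ci_conservative, ci_quadratic, ci_plugin)
```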
The Delta Method
Let (Zn)n≥1 be a sequence of random variables that satisfies
    √n (Zn − θ) →(d) N(0, σ²)
for some θ ∈ R and σ² > 0 (the sequence (Zn)n≥1 is said to be asymptotically normal around θ). Let g : R → R be continuously differentiable at the point θ. Then,
• (g(Zn))n≥1 is also asymptotically normal around g(θ).
• More precisely,
    √n (g(Zn) − g(θ)) →(d) N(0, g'(θ)² σ²).
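A quick simulation check of the statement above (my own illustration, not from the sheet): for exponential data, Z_n = X̄_n estimates 1/λ and g(x) = 1/x, so the Delta method predicts √n (g(Z_n) − λ) ≈ N(0, λ²).

```python
# Empirical standard deviation of √n (1/X̄_n - λ) versus the Delta-method value λ.
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 500, 20000
samples = rng.exponential(scale=1 / lam, size=(reps, n))  # mean 1/λ, variance 1/λ²
zn = samples.mean(axis=1)
errors = np.sqrt(n) * (1 / zn - lam)

print("empirical std:", errors.std())  # should be close to λ = 2
print("delta method :", lam)           # g'(1/λ)² σ² = λ⁴ · (1/λ²) = λ²
```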
Introduction to Hypothesis Testing

Statistical Formulation
Consider a sample X1, ..., Xn of i.i.d. random variables and a statistical model (E, (Pθ)θ∈Θ). Let Θ0 and Θ1 be disjoint subsets of Θ. Consider the two hypotheses:
• H0 : θ ∈ Θ0
• H1 : θ ∈ Θ1
H0 is the null hypothesis and H1 is the alternative hypothesis.

Asymmetry in the hypotheses: H0 and H1 do not play a symmetric role: the data is only used to try to disprove H0. Lack of evidence does not mean that H0 is true.

A test is a statistic ψ ∈ {0, 1} such that:
• If ψ = 0, H0 is not rejected.
• If ψ = 1, H0 is rejected.

Errors
• Rejection region of a test ψ:  Rψ = {x ∈ Eⁿ : ψ(x) = 1}.
• Type 1 error of a test ψ:  αψ : Θ0 → [0, 1],  θ ↦ Pθ[ψ = 1].
• Type 2 error of a test ψ:  βψ : Θ1 → [0, 1],  θ ↦ Pθ[ψ = 0].
• Power of a test ψ:  πψ = inf_{θ∈Θ1} (1 − βψ(θ)).

Level, test statistic and rejection region
• A test ψ has level α if  αψ(θ) ≤ α, ∀θ ∈ Θ0.
• A test ψ has asymptotic level α if  lim_{n→∞} αψ(θ) ≤ α, ∀θ ∈ Θ0.
• In general, a test has the form  ψ = 1{Tn > c}  for some statistic Tn and threshold c ∈ R. Tn is called the test statistic. The rejection region is Rψ = {Tn > c}.

p-value
The (asymptotic) p-value of a test ψα is the smallest (asymptotic) level α at which ψα rejects H0.

Methods of Estimation

Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, ..., Xn. Assume that there exists θ* ∈ Θ such that X1 ~ Pθ*.
Statistician's goal: given X1, ..., Xn, find an estimator θ̂ = θ̂(X1, ..., Xn) such that P_θ̂ is close to Pθ* for the true parameter θ*.

Total Variation Distance
The total variation distance between two probability measures Pθ and Pθ' is defined by
    TV(Pθ, Pθ') = max_{A⊂E} |Pθ(A) − Pθ'(A)|.
Total Variation Distance between Discrete Measures (E finite or countable):
    TV(Pθ, Pθ') = (1/2) Σ_{x∈E} |pθ(x) − pθ'(x)|.
Total Variation Distance between Continuous Measures:
    TV(Pθ, Pθ') = (1/2) ∫_E |fθ(x) − fθ'(x)| dx.
Properties of Total Variation
• TV(Pθ, Pθ') = TV(Pθ', Pθ)   (symmetric)
• 0 ≤ TV(Pθ, Pθ') ≤ 1   (positive)
• If TV(Pθ, Pθ') = 0, then Pθ = Pθ'   (definite)
• TV(Pθ, Pθ') ≤ TV(Pθ, Pθ'') + TV(Pθ'', Pθ')   (triangle inequality)
These imply that the total variation is a distance between probability distributions.
Kullback-Leibler (KL) Divergence
The Kullback-Leibler (KL) divergence between two probability measures Pθ and Pθ' is defined by
    KL(Pθ, Pθ') = Σ_{x∈E} pθ(x) log(pθ(x)/pθ'(x))      if E is discrete,
    KL(Pθ, Pθ') = ∫_E fθ(x) log(fθ(x)/fθ'(x)) dx        if E is continuous.
KL-divergence is also known as relative entropy.
Properties of KL-divergence
• KL(Pθ, Pθ') ≠ KL(Pθ', Pθ) in general
• KL(Pθ, Pθ') ≥ 0
• If KL(Pθ, Pθ') = 0, then Pθ = Pθ'   (definite)
• KL(Pθ, Pθ') ≤ KL(Pθ, Pθ'') + KL(Pθ'', Pθ') does not hold in general (no triangle inequality)

Maximum Likelihood Estimation

Likelihood, Discrete Case
Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, ..., Xn. Assume that E is discrete (i.e., finite or countable).
Definition: the likelihood of the model is the map Ln (or just L) defined as
    Ln : Eⁿ × Θ → R,  (x1, ..., xn; θ) ↦ Pθ[X1 = x1, ..., Xn = xn] = Π_{i=1}^n Pθ[Xi = xi].

Likelihood, Continuous Case
Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. r.v. X1, ..., Xn. Assume that all the Pθ have density fθ.
Definition: the likelihood of the model is the map L defined as
    L : Eⁿ × Θ → R,  (x1, ..., xn; θ) ↦ Π_{i=1}^n fθ(xi).

Maximum Likelihood Estimator
Let X1, ..., Xn be an i.i.d. sample associated with a statistical model (E, (Pθ)θ∈Θ) and let L be the corresponding likelihood.
Definition: the maximum likelihood estimator of θ is defined as
    θ̂n^MLE = argmax_{θ∈Θ} L(X1, ..., Xn, θ),
provided it exists.

Log-likelihood Estimator
In practice, we use the fact that
    θ̂n^MLE = argmax_{θ∈Θ} log L(X1, ..., Xn, θ).

Concave and Convex Functions
A twice-differentiable function h : Θ ⊂ R → R is said to be concave if its second derivative satisfies
    h''(θ) ≤ 0,  ∀θ ∈ Θ.
It is said to be strictly concave if the inequality is strict: h''(θ) < 0. Moreover, h is said to be (strictly) convex if −h is (strictly) concave, i.e. h''(θ) ≥ 0 (h''(θ) > 0).

Multivariate Concave Functions
More generally, for a multivariate function h : Θ ⊂ Rd → R, d ≥ 2, define the
• gradient vector:  ∇h(θ) = (∂h(θ)/∂θ1, ..., ∂h(θ)/∂θd)ᵀ ∈ Rd;
• Hessian matrix:  Hh(θ) ∈ Rd×d with entries (Hh(θ))_{ij} = ∂²h(θ)/∂θi∂θj.
Then
    h is concave  ⟺  xᵀ Hh(θ) x ≤ 0, ∀x ∈ Rd, θ ∈ Θ,
    h is strictly concave  ⟺  xᵀ Hh(θ) x < 0, ∀x ∈ Rd, θ ∈ Θ.
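A minimal sketch of maximizing a log-likelihood numerically (simulated N(µ, σ²) data; the seed and true values are arbitrary), with the closed-form Gaussian MLE printed alongside as a check:

```python
# Numerical MLE by minimizing the negative log-likelihood with scipy.optimize.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=2.0, size=400)

def neg_log_lik(params):
    mu, log_sigma = params                  # optimize log σ so that σ stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

res = optimize.minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print("numerical MLE:", mu_hat, sigma_hat)
print("closed form  :", x.mean(), x.std())  # Gaussian MLE uses the 1/n variance
```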
Consistency of Maximum Likelihood Estimator
Under mild regularity conditions, we have
    θ̂n^MLE →P θ*   as n → ∞.

Covariance
In general, when θ ∈ Rd, d ≥ 2, its coordinates are not necessarily independent. The covariance between two random variables X and Y is
    Cov(X, Y) := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].
Properties
• Cov(X, X) = Var(X)
• Cov(X, Y) = Cov(Y, X)
• If X and Y are independent, then Cov(X, Y) = 0.

Covariance Matrix
The covariance matrix of a random vector X = (X^(1), ..., X^(d))ᵀ ∈ Rd is given by
    Σ = Cov(X) = E[(X − E[X])(X − E[X])ᵀ].
This is a matrix of size d × d. If X ∈ Rd and A, B are matrices:
    Cov(AX + B) = Cov(AX) = A Cov(X) Aᵀ = A ΣX Aᵀ.

The Multivariate Gaussian Distribution
If (X, Y)ᵀ is a Gaussian vector, then its PDF depends on 5 parameters: E[X], Var(X), E[Y], Var(Y), and Cov(X, Y).
A Gaussian vector X ∈ Rd is completely determined by its expected value µ and covariance matrix Σ:
    X ~ Nd(µ, Σ).
It has PDF over Rd given by:
    f(x) = (1/((2π)^d det(Σ))^{1/2}) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)).

The Multivariate CLT
Let X1, ..., Xn ∈ Rd be independent copies of a random vector X such that E[X] = µ and Cov(X) = Σ. Then
    √n (X̄n − µ) →(d) Nd(0, Σ).

Multivariate Delta Method
Let (Tn)n≥1 be a sequence of random vectors in Rd such that
    √n (Tn − θ) →(d) Nd(0, Σ),
for some θ ∈ Rd and some covariance Σ ∈ Rd×d. Let g : Rd → Rk (k ≥ 1) be continuously differentiable at θ. Then,
    √n (g(Tn) − g(θ)) →(d) Nk(0, ∇g(θ)ᵀ Σ ∇g(θ)),
where ∇g(θ) = ∂g(θ)/∂θ = (∂gj/∂θi)_{1≤i≤d, 1≤j≤k} ∈ Rd×k.

Fisher Information
Define the log-likelihood for one observation as
    ℓ(θ) = log L1(X, θ),  θ ∈ Θ ⊂ Rd.
Assume that ℓ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as
    I(θ) = E[∇ℓ(θ) ∇ℓ(θ)ᵀ] − E[∇ℓ(θ)] E[∇ℓ(θ)]ᵀ = −E[Hℓ(θ)].
If Θ ⊂ R, we get
    I(θ) = Var(ℓ'(θ)) = −E[ℓ''(θ)].

Asymptotic Normality of the MLE
Theorem. Let θ* ∈ Θ (the true parameter). Assume the following:
1. The parameter is identifiable.
2. For all θ ∈ Θ, the support of Pθ does not depend on θ.
3. θ* is not on the boundary of Θ.
4. I(θ) is invertible in a neighborhood of θ*.
5. A few more technical conditions.
Then θ̂n^MLE satisfies
• θ̂n^MLE →P θ*   w.r.t. Pθ*;
• √n (θ̂n^MLE − θ*) →(d) Nd(0, I⁻¹(θ*))   w.r.t. Pθ*.
The Method of Moments

Moments
Let X1, ..., Xn be an i.i.d. sample associated with a statistical model (E, (Pθ)θ∈Θ). Assume that E ⊆ R and Θ ⊆ Rd, for some d ≥ 1.
Population Moments: mk(θ) = Eθ[X1^k],  1 ≤ k ≤ d.
Empirical Moments: m̂k = (1/n) Σ_{i=1}^n Xi^k,  1 ≤ k ≤ d.
From the LLN,
    m̂k →P/a.s. mk(θ).
More compactly, we say that the whole vector converges:
    (m̂1, ..., m̂d) →P/a.s. (m1(θ), ..., md(θ)).
Moments Estimator
Let
    M : Θ → Rd,  θ ↦ M(θ) = (m1(θ), ..., md(θ)).
Assume M is one-to-one:
    θ = M⁻¹(m1(θ), ..., md(θ)).
Moments estimator of θ:
    θ̂n^MM = M⁻¹(m̂1, ..., m̂d),
provided it exists.

Generalized Method of Moments
Statistical Analysis: applying the multivariate CLT and Delta method yields:
Theorem.
    √n (θ̂n^MM − θ) →(d) N(0, Γ(θ)),
where
    Γ(θ) = [ ∂M⁻¹/∂θ (M(θ)) ]ᵀ Σ(θ) [ ∂M⁻¹/∂θ (M(θ)) ].

MLE vs. Moment Estimator
• Comparison of the quadratic risks: in general, the MLE is more accurate.
• MLE still gives good results if the model is misspecified.
• Computational issues: sometimes the MLE is intractable, but MM is easier (polynomial equations).
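A small sketch of the moments estimator in practice (my own example, not from the sheet): for Gamma(shape a, scale b) data, m1 = ab and m2 − m1² = ab², so M⁻¹ gives â = m̂1²/(m̂2 − m̂1²) and b̂ = (m̂2 − m̂1²)/m̂1.

```python
# Method of moments for a Gamma(a, b) sample using the first two moments.
import numpy as np

rng = np.random.default_rng(3)
a_true, b_true = 2.5, 1.8
x = rng.gamma(shape=a_true, scale=b_true, size=5000)

m1_hat = x.mean()             # empirical first moment
m2_hat = np.mean(x**2)        # empirical second moment
var_hat = m2_hat - m1_hat**2

a_mm = m1_hat**2 / var_hat    # M⁻¹ applied to (m̂1, m̂2)
b_mm = var_hat / m1_hat

print("moments estimator:", a_mm, b_mm)  # should be close to (2.5, 1.8)
```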
M-Estimation
• Let X1, ..., Xn be i.i.d. with some unknown distribution P in some sample space E (E ⊆ Rd for some d ≥ 1).
• No statistical model needs to be assumed (similar to ML).
• The goal is to estimate some parameter µ* associated with P, e.g. its mean, variance, median, other quantiles, the true parameter in some statistical model, etc.
• We want to find a function ρ : E × M → R, where M is the set of all possible values for the unknown µ*, such that
    Q(µ) := E[ρ(X1, µ)]
  achieves its minimum at µ = µ*.

Examples (1)
• If E = M = R and ρ(x, µ) = (x − µ)², for all x, µ ∈ R: µ* = E[X].
• If E = M = Rd and ρ(x, µ) = ||x − µ||₂², for all x, µ ∈ Rd: µ* = E[X] ∈ Rd.
• If E = M = R and ρ(x, µ) = |x − µ|, for all x, µ ∈ R: µ* is a median of P.

Example (2)
If E = M = R, α ∈ (0, 1) is fixed and ρ(x, µ) = Cα(x − µ), for all x, µ ∈ R: µ* is an α-quantile of P.
Check Function
    Cα(x) = −(1 − α)x  if x < 0,   αx  if x ≥ 0.
[Figure: plot of the check function Cα, piecewise linear with slope −(1 − α) for x < 0 and slope α for x ≥ 0.]

MLE is an M-estimator
Assume that (E, (Pθ)θ∈Θ) is a statistical model associated with the data.
Theorem. Let M = Θ and ρ(x, θ) = −log L1(x, θ), provided the likelihood is positive everywhere. Then
    µ* = θ*,
where P = Pθ* (i.e., θ* is the true value of the parameter).

Statistical Analysis
• Define µ̂n as a minimizer of
    Qn(µ) := (1/n) Σ_{i=1}^n ρ(Xi, µ).
• Let J(µ) = ∂²Q(µ)/∂µ∂µᵀ. Under some regularity conditions,
    J(µ) = E[ ∂²ρ(X1, µ)/∂µ∂µᵀ ].
• Let K(µ) = Cov[ ∂ρ(X1, µ)/∂µ ].
• Remark: in the log-likelihood case,  J(θ) = K(θ) = I(θ)  (Fisher information).

Asymptotic Normality
Let µ* ∈ M (the true parameter). Assume the following:
1. µ* is the only minimizer of the function Q,
2. J(µ) is invertible for all µ ∈ M,
3. A few more technical conditions.
Then µ̂n satisfies
• µ̂n →P µ*;
• √n (µ̂n − µ*) →(d) N(0, J(µ*)⁻¹ K(µ*) J(µ*)⁻¹).

Hypothesis Testing

Parametric Hypothesis Testing
Hypotheses (two independent Gaussian samples of sizes n and m):
    H0 : Δc = Δd   vs.   H1 : Δd > Δc.
Since
    X̄n ~ N(Δd, σd²/n)   and   Ȳm ~ N(Δc, σc²/m),
we have
    (X̄n − Ȳm − (Δd − Δc)) / √(σd²/n + σc²/m) ~ N(0, 1).

Asymptotic test
Assume that m = cn and n → ∞. Using Slutsky's theorem, we also have
    (X̄n − Ȳm − (Δd − Δc)) / √(σ̂d²/n + σ̂c²/m) →(d) N(0, 1),
where
    σ̂d² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)²   and   σ̂c² = (1/(m − 1)) Σ_{i=1}^m (Yi − Ȳm)².
We get the following test at asymptotic level α:
    Rψ = { (X̄n − Ȳm) / √(σ̂d²/n + σ̂c²/m) > qα }.

The χ² Distribution
Definition. For a positive integer d, the χ² distribution with d degrees of freedom is the law of the random variable Z1² + ... + Zd², where Z1, ..., Zd ~ i.i.d. N(0, 1).
Properties. If V ~ χ²_d, then
• E[V] = E[Z1²] + ... + E[Zd²] = d
• Var(V) = Var(Z1²) + ... + Var(Zd²) = 2d

Sample Variance
    Sn = (1/n) Σ_{i=1}^n (Xi − X̄n)² = (1/n) Σ_{i=1}^n Xi² − (X̄n)².

Cochran's Theorem
If X1, ..., Xn ~ i.i.d. N(µ, σ²), then
• X̄n ⊥⊥ Sn, for all n;
• nSn/σ² ~ χ²_{n−1}.
We often prefer the unbiased estimator of σ²:
    S̃n = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)² = (n/(n − 1)) Sn.

Student's T Distribution
Definition. For a positive integer d, the Student's T distribution with d degrees of freedom (denoted by td) is the law of the random variable Z/√(V/d), where Z ~ N(0, 1), V ~ χ²_d and Z ⊥⊥ V.

Student's T test (one-sample, two-sided)
Let X1, ..., Xn ~ i.i.d. N(µ, σ²) where both µ and σ² are unknown. We want to test:
    H0 : µ = 0   vs.   H1 : µ ≠ 0.
Test statistic:
    Tn = √n X̄n/√(S̃n).
Since the data is Gaussian by assumption, we don't need the CLT. Since √n X̄n/σ ~ N(0, 1) (under H0) and S̃n/σ² ~ χ²_{n−1}/(n − 1) are independent by Cochran's theorem, we have
    Tn ~ t_{n−1}.
Student's test with (non-asymptotic) level α ∈ (0, 1):
    ψα = 1{|Tn| > q_{α/2}},
where q_{α/2} is the (1 − α/2)-quantile of t_{n−1}.
αx if x ≥ 0. n m 2 2 -quantile of tn−1 .
Student’s T test (one-sample, one-sided) Likelihood function Categorical Likelihood
n d
H0 : µ ≤ µ 0 vs. H1 : µ > µ0 Ln : R × R → R • Likelihood of the model:
n N N N
Y Ln (X1 , . . . , Xn ; p) = p1 1 p2 2 . . . pKK
Test statistic: (x1 , . . . , xn ; θ) 7→ pθ (xi )
√ X n − µ0 i=1 where Nj = # {i = 1, . . . , n : Xi = aj } .
Tn = n q ∼ tn−1 (under H0 )
S
en The likelihood ratio test in this set-up is of the form • Let p
b be the MLE:
Nj

Ln (x1 , . . . , xn ; θ1 )
 p
bj =
, j = 1, . . . , K.
Student’s test with (non-asymptotic) level α ∈ (0, 1): ψC = 1 >C n
Ln (x1 , . . . , xn ; θ0 ) b maximizes log Ln (X1 , . . . , Xn , p) under the constraint.
p
ψα = 1 {Tn > qα } √
where C is a threshold to be specified. χ2 test If H0 is true, then n p
b − p0 is asymptotically normal, and the following

where qα is the (1 − α)-quantile of tn−1 . A test based on the log-likelihood Consider an i.i.d. sample X1 , . . . , Xn with holds:
statistical model E, (Pθ )θ∈Θ , where Θ ⊆ Rd (d ≥ 1). Suppose the null hypothesis

Two-sample T-test Theorem Under H0 :
X n − Y m − (∆d − ∆c ) has the form    2
s ∼ tN (0) (0)
H0 : (θr+1 , . . . , θd ) = θr+1 , . . . , θd , n b j − p0j
p (d)
bd2 bc2
X 2
σ σ Tn = n −−−−→ χK−1
+ (0) (0) p0j n→∞
n m for some fixed and given numbers θr+1 , . . . , θd . j=1

Welch-Satterthwaite formula Let CDF and empirical CDF Let X1 , . . . , Xn be i.i.d. real random variables. The CDF
!2 θbn = argmax `n (θ) (MLE) of X1 is defined as
θ∈Θ
bd2
σ bc2
σ F (t) = P [X1 ≤ 1] , ∀t ∈ R.
+ and
n m c It completely characterizes the distribution of X1 .
N = ≥ min(n, m) θbn = argmax `n (θ) (constrained MLE)
bd4
σ bc4
σ θ∈Θ0 The empirical CDF of the sample X1 , . . . , Xn is defined as
+ n  o
n2 (n − 1) m2 (m − 1) where Θ0 = θ ∈ Θ : (θr+1 , . . . , θd ) =
(0)
θr+1 , . . . , θd
(0) n
1 X
Fn (t) = 1 {Xi ≤ 1}
Test statistic: n i=1
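A sketch of the one-sample two-sided t-test on simulated Gaussian data (arbitrary seed and true mean), once by hand and once with scipy as a cross-check:

```python
# One-sample t-test: H0: µ = 0, data generated with true mean 0.3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=0.3, scale=1.0, size=30)
n = len(x)

s_tilde = x.var(ddof=1)                        # unbiased variance estimator S̃_n
t_stat = np.sqrt(n) * x.mean() / np.sqrt(s_tilde)
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value from t_{n-1}

print("by hand:", t_stat, p_val)
print("scipy  :", stats.ttest_1samp(x, popmean=0.0))
```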
Wald's Test
A test based on the MLE. Consider an i.i.d. sample X1, ..., Xn with statistical model (E, (Pθ)θ∈Θ), where Θ ⊆ Rd (d ≥ 1), and let θ0 ∈ Θ be fixed and given. θ* is the true parameter. Consider the following hypotheses:
    H0 : θ* = θ0   vs.   H1 : θ* ≠ θ0.
Let θ̂n^MLE be the MLE. Assume the MLE technical conditions are satisfied. If H0 is true, then
    √n I(θ̂n^MLE)^{1/2} (θ̂n^MLE − θ0) →(d) Nd(0, Id).
Wald's test statistic:
    Tn := n (θ̂n^MLE − θ0)ᵀ I(θ̂n^MLE) (θ̂n^MLE − θ0) →(d) χ²_d.
Wald's test with asymptotic level α ∈ (0, 1):
    ψ = 1{Tn > qα},
where qα is the (1 − α)-quantile of χ²_d.
Wald's Test in 1 dimension: in one dimension, Wald's test coincides with the two-sided test based on the asymptotic normality of the MLE.

Likelihood Ratio Test

Basic Form of the Likelihood Ratio Test
Let X1, ..., Xn ~ i.i.d. Pθ*, and consider the associated statistical model (E, (Pθ)θ∈Rd). Suppose that Pθ is a discrete probability distribution with pmf pθ. In its most basic form, the likelihood ratio test can be used to decide between two hypotheses of the following form:
    H0 : θ* = θ0   vs.   H1 : θ* = θ1.
Likelihood function:
    Ln : Rⁿ × Rd → R,  (x1, ..., xn; θ) ↦ Π_{i=1}^n pθ(xi).
The likelihood ratio test in this set-up is of the form
    ψC = 1{ Ln(x1, ..., xn; θ1) / Ln(x1, ..., xn; θ0) > C },
where C is a threshold to be specified.

A test based on the log-likelihood
Consider an i.i.d. sample X1, ..., Xn with statistical model (E, (Pθ)θ∈Θ), where Θ ⊆ Rd (d ≥ 1). Suppose the null hypothesis has the form
    H0 : (θ_{r+1}, ..., θ_d) = (θ^(0)_{r+1}, ..., θ^(0)_d),
for some fixed and given numbers θ^(0)_{r+1}, ..., θ^(0)_d. Let
    θ̂n = argmax_{θ∈Θ} ℓn(θ)   (MLE)
and
    θ̂n^c = argmax_{θ∈Θ0} ℓn(θ)   (constrained MLE),
where Θ0 = {θ ∈ Θ : (θ_{r+1}, ..., θ_d) = (θ^(0)_{r+1}, ..., θ^(0)_d)}.
Test statistic:
    Tn = 2 ( ℓn(θ̂n) − ℓn(θ̂n^c) ).
Wilks' Theorem
Assume H0 is true and the MLE technical conditions are satisfied. Then
    Tn →(d) χ²_{d−r}.
Likelihood ratio test with asymptotic level α ∈ (0, 1):
    ψ = 1{Tn > qα},
where qα is the (1 − α)-quantile of χ²_{d−r}.
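A small sketch of a likelihood ratio test via Wilks' theorem (my own example, not from the sheet): H0: λ = λ0 for i.i.d. Poisson(λ) data, so d = 1 free parameter, r = 0 under H0, and Tn is compared with a χ²_1 quantile.

```python
# Likelihood ratio test for a Poisson rate using Wilks' theorem.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam0, n = 3.0, 200
x = rng.poisson(lam=3.6, size=n)         # data generated away from H0

def log_lik(lam):
    return np.sum(stats.poisson.logpmf(x, mu=lam))

lam_mle = x.mean()                        # unconstrained MLE of a Poisson rate
Tn = 2 * (log_lik(lam_mle) - log_lik(lam0))
q_alpha = stats.chi2.ppf(0.95, df=1)      # (1-α)-quantile of χ²_1 with α = 0.05

print("Tn =", Tn, " reject H0:", Tn > q_alpha)
```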
Goodness of Fit Tests
Let X be a r.v. We want to know if the hypothesized distribution is a good fit for the data. Key characteristic of goodness of fit tests: no parametric modeling.

Discrete distribution
Let E = {a1, ..., aK} be a finite space and (Pp)_{p∈ΔK} be the family of all probability distributions on E.
The Probability Simplex in K Dimensions: the probability simplex in R^K, denoted by ΔK, is the set of all vectors p = (p1, ..., pK)ᵀ such that
    pᵀ1 = p1 + ... + pK = 1,   pj ≥ 0 for all j,
where 1 denotes the vector 1 = (1, ..., 1)ᵀ; equivalently,
    ΔK = { p = (p1, ..., pK) ∈ (0, 1)^K : Σ_{j=1}^K pj = 1 }.
For p ∈ ΔK and X ~ Pp,
    Pp[X = aj] = pj,  j = 1, ..., K.
Let X1, ..., Xn ~ i.i.d. Pp, for some unknown p ∈ ΔK, and let p⁰ ∈ ΔK be fixed. We want to test:
    H0 : p = p⁰   vs.   H1 : p ≠ p⁰.

Categorical Likelihood
• Likelihood of the model:
    Ln(X1, ..., Xn; p) = p1^{N1} p2^{N2} ... pK^{NK},
  where Nj = #{i = 1, ..., n : Xi = aj}.
• Let p̂ be the MLE:
    p̂j = Nj/n,  j = 1, ..., K.
  p̂ maximizes log Ln(X1, ..., Xn, p) under the constraint p ∈ ΔK.

χ² test
If H0 is true, then √n (p̂ − p⁰) is asymptotically normal, and the following holds:
Theorem. Under H0:
    Tn = n Σ_{j=1}^K (p̂j − p⁰j)² / p⁰j →(d) χ²_{K−1}.
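A sketch of the χ² goodness-of-fit test for a categorical variable (hypothetical die-roll counts, purely illustrative), by hand and with scipy.stats.chisquare:

```python
# χ² goodness of fit: is a six-sided die fair?
import numpy as np
from scipy import stats

counts = np.array([18, 22, 16, 25, 21, 18])  # hypothetical observed counts N_j
n, K = counts.sum(), len(counts)
p0 = np.full(K, 1 / K)                       # H0: p = p0 (fair die)

p_hat = counts / n                           # MLE of p
Tn = n * np.sum((p_hat - p0) ** 2 / p0)      # → χ²_{K-1} under H0
p_val = stats.chi2.sf(Tn, df=K - 1)

print("by hand:", Tn, p_val)
print("scipy  :", stats.chisquare(counts, f_exp=n * p0))
```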
CDF and empirical CDF
Let X1, ..., Xn be i.i.d. real random variables. The CDF of X1 is defined as
    F(t) = P[X1 ≤ t],  ∀t ∈ R.
It completely characterizes the distribution of X1.
The empirical CDF of the sample X1, ..., Xn is defined as
    Fn(t) = (1/n) Σ_{i=1}^n 1{Xi ≤ t} = #{i = 1, ..., n : Xi ≤ t}/n,  ∀t ∈ R.
[Figure: the step function Fn, with jumps of size 1/n at the order statistics X(1), X(2), X(3), X(4), plotted against the true CDF F; the vertical gaps such as Fn(X(i)) − F(X(i)) are what the statistics below measure.]

Consistency
By the LLN, for all t ∈ R,
    Fn(t) →a.s. F(t).

Glivenko-Cantelli Theorem (Fundamental theorem of statistics)
    sup_{t∈R} |Fn(t) − F(t)| →a.s. 0.

Asymptotic normality
By the CLT, for all t ∈ R,
    √n (Fn(t) − F(t)) →(d) N(0, F(t)(1 − F(t))).

Donsker's Theorem
If F is continuous, then
    √n sup_{t∈R} |Fn(t) − F(t)| →(d) sup_{0≤t≤1} |B(t)|,
where B(t) is a Brownian bridge on [0, 1].

Kolmogorov-Smirnov Test
Let Tn = sup_{t∈R} √n |Fn(t) − F(t)|. By Donsker's theorem, if H0 is true, then Tn →(d) Z, where Z has a known distribution (the supremum of the absolute value of a Brownian bridge).
KS test with asymptotic level α ∈ (0, 1):
    δα^KS = 1{Tn > qα},
where qα is the (1 − α)-quantile of Z.
Let X(1) ≤ X(2) ≤ ... ≤ X(n) be the reordered sample. The expression for Tn reduces to
    Tn = √n max_{i=1,...,n} max{ i/n − F(X(i)),  F(X(i)) − (i − 1)/n }.

Pivotal Distribution
Tn is called a pivotal statistic: if H0 is true, the distribution of Tn does not depend on the distribution of the Xi's.
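A sketch of the Kolmogorov-Smirnov statistic on simulated data (H0: the data are standard normal), computed from the reordered-sample formula above and cross-checked against scipy.stats.kstest:

```python
# KS statistic by hand versus scipy.stats.kstest.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.sort(rng.normal(size=100))
n = len(x)
i = np.arange(1, n + 1)
F = stats.norm.cdf(x)                       # hypothesized CDF at the order statistics

Dn = np.max(np.maximum(i / n - F, F - (i - 1) / n))  # sup_t |F_n(t) - F(t)|
Tn = np.sqrt(n) * Dn                        # the pivotal statistic of the sheet

print("by hand:", Dn, Tn)
print("scipy  :", stats.kstest(x, 'norm'))  # scipy reports Dn and its p-value
```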

Other Goodness of Fit Tests
• Kolmogorov-Smirnov:
    d(Fn, F) = sup_{t∈R} |Fn(t) − F(t)|
• Cramér-Von Mises:
    d²(Fn, F) = ∫_R [Fn(t) − F(t)]² dF(t) = E_{X~F}[|Fn(X) − F(X)|²]
• Anderson-Darling:
    d²(Fn, F) = ∫_R [Fn(t) − F(t)]² / (F(t)(1 − F(t))) dF(t)

Kolmogorov-Lilliefors Test
We want to test if X has a Gaussian distribution with unknown parameters. In this case, Donsker's theorem is no longer valid. Instead, we compute the quantiles for the test statistic
    sup_{t∈R} |Fn(t) − Φ_{µ̂,σ̂²}(t)|,
where µ̂ = X̄n, σ̂² = Sn, and Φ_{µ̂,σ̂²}(t) is the CDF of N(µ̂, σ̂²). These quantiles do not depend on unknown parameters.
Quantile-Quantile (QQ) plots
• Provide a visual way to perform goodness of fit tests.
• Not a formal test, but a quick and easy check to see if a distribution is plausible.
• Main idea: we want to check visually if the plot of Fn is close to that of F or, equivalently, if the plot of Fn⁻¹ is close to F⁻¹.
• Check if the points
    (F⁻¹(1/n), Fn⁻¹(1/n)), ..., (F⁻¹((n − 1)/n), Fn⁻¹((n − 1)/n))
  are near the line y = x.
• Fn is not technically invertible, but we define
    Fn⁻¹(i/n) = X(i),
  the i-th order statistic of the sample.

Four patterns
1. heavy tails
2. right skewed
3. left skewed
4. light tails
[Figure: four Normal Q-Q plots (sample quantiles vs. theoretical quantiles), one illustrating each of the four patterns above.]

Bayesian Statistics

Introduction to Bayesian Statistics

Prior and Posterior
• Consider a probability distribution on a parameter space Θ with some PDF π(·): the prior distribution.
• Let X1, ..., Xn be a sample of n random variables.
• Denote by Ln(·|θ) the joint PDF of X1, ..., Xn conditionally on θ, where θ ~ π.
• Remark: Ln(X1, ..., Xn|θ) is the likelihood used in the frequentist approach.
• The conditional distribution of θ given X1, ..., Xn is called the posterior distribution. Denote by π(·|X1, ..., Xn) its PDF.

Bayes' formula
    π(θ|X1, ..., Xn) ∝ π(θ) Ln(X1, ..., Xn|θ),  ∀θ ∈ Θ.

Bernoulli experiment with a Beta prior
• p ~ Beta(a, a):
    π(p) ∝ p^{a−1} (1 − p)^{a−1},  p ∈ (0, 1).
• Given p, X1, ..., Xn ~ i.i.d. Ber(p), so
    Ln(X1, ..., Xn|p) = p^{Σ Xi} (1 − p)^{n − Σ Xi}.
• Hence,
    π(p|X1, ..., Xn) ∝ p^{a−1+Σ Xi} (1 − p)^{a−1+n−Σ Xi}.
• The posterior distribution is
    Beta( a + Σ_{i=1}^n Xi,  a + n − Σ_{i=1}^n Xi )   (conjugate prior).
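A sketch of the Beta-Bernoulli update above on simulated data (the prior parameter a, sample size, and true p are arbitrary choices): with a Beta(a, a) prior and k successes out of n, the posterior is Beta(a + k, a + n − k).

```python
# Posterior mean and a central 95% Bayesian confidence region for p.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a, n, p_true = 0.5, 50, 0.7                # a = 1/2 corresponds to the Beta(1/2, 1/2) prior
x = rng.binomial(1, p_true, size=n)
k = x.sum()

posterior = stats.beta(a + k, a + n - k)
post_mean = posterior.mean()
cred_int = posterior.ppf([0.025, 0.975])   # central 95% credible interval

print("posterior mean:", post_mean)
print("95% credible  :", cred_int)
```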
Non-informative Priors
• We can still use a Bayesian approach if we have no prior information about the parameter.
• Good candidate: π(θ) ∝ 1, i.e., constant PDF on Θ.
• If Θ is bounded, this is the uniform prior on Θ.
• If Θ is unbounded, this does not define a proper PDF on Θ.
• An improper prior on Θ is a measurable, non-negative function π(·) defined on Θ that is not integrable:
    ∫_Θ π(θ) dθ = ∞.
• In general, one can still define a posterior distribution using an improper prior, using Bayes' formula.

Jeffreys Prior and Bayesian Confidence Interval
Jeffreys prior is an attempt to incorporate frequentist ideas of likelihood in the Bayesian framework, as well as an example of a non-informative prior:
    πJ(θ) ∝ √(det I(θ)),
where I(θ) is the Fisher information matrix of the statistical model associated with X1, ..., Xn in the frequentist approach (provided it exists).
Examples
• Bernoulli experiment: πJ(p) ∝ 1/√(p(1 − p)), p ∈ (0, 1): the prior is Beta(1/2, 1/2).
• Gaussian experiment: πJ(θ) ∝ 1, θ ∈ R, is an improper prior.
Jeffreys prior satisfies a reparametrization invariance principle: if η is a reparametrization of θ (i.e., η = φ(θ) for some one-to-one map φ), then the PDF π̃(·) of η satisfies
    π̃(η) ∝ √(det Ĩ(η)),
where Ĩ(η) is the Fisher information of the statistical model parametrized by η instead of θ.

Bayesian confidence regions
For α ∈ (0, 1), a Bayesian confidence region with level α is a random subset R of the parameter space Θ, which depends on the sample X1, ..., Xn, such that
    P[θ ∈ R | X1, ..., Xn] = 1 − α.
Note that R depends on the prior π(·). Bayesian confidence region and confidence interval are two distinct notions.

Bayesian estimation
• Posterior mean:  θ̂^(π) = ∫_Θ θ π(θ|X1, ..., Xn) dθ.
• MAP (maximum a posteriori):  θ̂^MAP = argmax_{θ∈Θ} π(θ|X1, ..., Xn). It is the point that maximizes the posterior distribution, provided it is unique.

Linear Regression

Modeling Assumptions
(Xi, Yi), i = 1, ..., n, are i.i.d. from some unknown joint distribution P. P can be described entirely by (assuming all exist):
• either a joint PDF h(x, y);
• or the marginal density of X, h(x) = ∫ h(x, y) dy, and the conditional density
    h(y|x) = h(x, y)/h(x).
h(y|x) answers all our questions. It contains all the information about Y given X.

Partial Modeling
We can also describe the distribution only partially, e.g. using
• the expectation of Y: E[Y];
• the conditional expectation of Y given X = x: E[Y|X = x]. The function
    x ↦ f(x) := E[Y|X = x] = ∫ y h(y|x) dy
  is called the regression function;
• other possibilities:
  – the conditional median: m(x) such that ∫_{−∞}^{m(x)} h(y|x) dy = 1/2,
  – conditional quantiles,
  – conditional variance (not information about location).

Linear Regression
We focus on modeling the regression function
    f(x) = E[Y|X = x].
Restrict to simple functions. The simplest is
    f(x) = a + bx,   a linear (or affine) function.

Probabilistic Analysis
Let X and Y be two r.v. (not necessarily independent) with two moments and such that Var(X) > 0. The theoretical linear regression of Y on X is the line x ↦ a* + b*x, where
    (a*, b*) = argmin_{(a,b)∈R²} E[(Y − a − bX)²],
which gives
    b* = Cov(X, Y)/Var(X),
    a* = E[Y] − b* E[X] = E[Y] − (Cov(X, Y)/Var(X)) E[X].

Noise
The points are not exactly on the line x ↦ a* + b*x if Var(Y|X = x) > 0. The random variable ε = Y − (a* + b*X) is called noise and satisfies
    Y = a* + b*X + ε,
with E[ε] = 0 and Cov(X, ε) = 0.

Statistical Problem
In practice, a*, b* need to be estimated from data.

Least Squares
The least squares estimator (LSE) of (a, b) is the minimizer of the sum of squared errors:
    Σ_{i=1}^n (Yi − a − bXi)².
Then,
    b̂ = ( (1/n)Σ XiYi − X̄n Ȳn ) / ( (1/n)Σ Xi² − X̄n² ),
    â = Ȳn − b̂ X̄n.

Multivariate Regression
We have a vector of explanatory variables or covariates:
    Xi = (Xi^(1), ..., Xi^(p))ᵀ ∈ Rp.
The response or dependent variable is Yi with
    Yi = Xiᵀ β* + εi,  i = 1, ..., n,
and β*₁ is called the intercept.

Least Squares Estimator
The least squares estimator of β* is the minimizer of the sum of squared errors:
    β̂ = argmin_{β∈Rp} Σ_{i=1}^n (Yi − Xiᵀβ)².

LSE in Matrix Form
• Let Y = (Y1, ..., Yn)ᵀ ∈ Rⁿ.
• Let X be the n × p matrix whose rows are X1ᵀ, ..., Xnᵀ. X is called the design matrix.
• Let ε = (ε1, ..., εn)ᵀ ∈ Rⁿ, the unobserved noise. Then,
    Y = Xβ* + ε,   β* unknown.
• The LSE β̂ satisfies
    β̂ = argmin_{β∈Rp} ||Y − Xβ||².

Closed Form Solution
Assume that rank(X) = p. Then,
    β̂ = (XᵀX)⁻¹ XᵀY.

Geometric Interpretation of the LSE
Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:
    Xβ̂ = P Y,   where P = X(XᵀX)⁻¹Xᵀ.
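A sketch of the closed-form LSE above on simulated data with an intercept column (the design, true coefficients, and noise level are arbitrary), compared with numpy's least-squares solver:

```python
# β̂ = (XᵀX)⁻¹XᵀY via the normal equations, plus the unbiased estimate of σ².
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # normal equations, no explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # same answer from numpy's solver

sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)  # unbiased estimator of σ²

print(beta_hat, beta_lstsq, sigma2_hat)
```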
Statistical Inference
To make inference, we need more assumptions.
• The design matrix X is deterministic and rank(X) = p.
• The model is homoscedastic: ε1, ..., εn are i.i.d.
• The noise vector ε is Gaussian:
    ε ~ Nn(0, σ² In),
  for some known or unknown σ² > 0.

Properties of LSE
• Under the Gaussian noise assumption, the LSE coincides with the MLE.
• Distribution of β̂:
    β̂ ~ Np(β*, σ² (XᵀX)⁻¹).
• Quadratic risk of β̂:
    E[||β̂ − β*||²] = σ² tr((XᵀX)⁻¹).
• Prediction error:
    E[||Y − Xβ̂||²] = σ² (n − p).
• Unbiased estimator of σ²:
    σ̂² = ||Y − Xβ̂||²/(n − p) = (1/(n − p)) Σ_{i=1}^n ε̂i².

Significance Tests
• Test whether the j-th explanatory variable is significant in the linear regression.
• H0 : βj = 0   vs.   H1 : βj ≠ 0.
• If γj (γj > 0) is the j-th diagonal coefficient of (XᵀX)⁻¹:
    (β̂j − βj)/√(σ̂² γj) ~ t_{n−p}.
• Let Tn^(j) = β̂j/√(σ̂² γj).
• Test with non-asymptotic level α ∈ (0, 1):
    R_{j,α} = { |Tn^(j)| > q_{α/2}(t_{n−p}) },
  where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.

Bonferroni's test
Test whether a group of explanatory variables is significant in the linear regression.
• H0 : βj = 0 ∀j ∈ S   vs.   H1 : ∃j ∈ S, βj ≠ 0,  where S ⊆ {1, ..., p}.
• Bonferroni's test:
    R_{S,α} = ∪_{j∈S} R_{j, α/k},   where k = |S|.

Generalized Linear Model

Generalization
A generalized linear model (GLM) generalizes normal linear regression models in the following directions:
1. Random component: Y|X = x ~ some distribution;
2. Regression function: g(µ(x)) = xᵀβ,
where g is called the link function and µ(x) = E[Y|X = x] is the regression function. In a GLM, we have Y|X = x ~ a distribution in the exponential family; then E[Y|X = x] = f(Xᵀβ), where f = g⁻¹.

Exponential Family
A family of distributions {Pθ : θ ∈ Θ}, Θ ⊂ Rk, is said to be a k-parameter exponential family on Rq if there exist real-valued functions
• η1, ..., ηk and B(θ),
• T1, ..., Tk, and h(y) on Rq,
such that the density function of Pθ can be written as
    fθ(y) = exp[ Σ_{i=1}^k ηi(θ) Ti(y) − B(θ) ] h(y).

Examples of discrete distributions
The following distributions form discrete exponential families of distributions with PMF:
• Bernoulli(p):  p^y (1 − p)^{1−y},  y ∈ {0, 1}
• Poisson(λ):  (λ^y/y!) e^{−λ},  y = 0, 1, ...

Examples of continuous distributions
The following distributions form continuous exponential families of distributions with PDF:
• Gamma(a, b):  (1/(Γ(a) bᵃ)) y^{a−1} e^{−y/b}
• Inverse Gamma(α, β):  (β^α/Γ(α)) y^{−α−1} e^{−β/y}
• Inverse Gaussian(µ, σ²):  √(σ²/(2π y³)) exp(−σ²(y − µ)²/(2µ² y))

One-parameter Canonical Exponential Family
    fθ(y) = exp( (yθ − b(θ))/φ + c(y, φ) )
for some known functions b(θ) and c(y, φ).
• If φ is known, this is a one-parameter exponential family with θ being the canonical parameter.
• If φ is unknown, this may or may not be a two-parameter exponential family.
• φ is called the dispersion parameter.

Expected value
Note that
    ℓ(θ) = (Yθ − b(θ))/φ + c(Y; φ),
which leads to
    E[Y] = b'(θ).

Variance
    Var(Y) = b''(θ) · φ.

Link function
β is the parameter of interest. A link function g relates the linear predictor Xᵀβ to the mean parameter µ:
    Xᵀβ = g(µ) = g(µ(X)).
g is required to be monotone increasing and differentiable, so that
    µ = g⁻¹(Xᵀβ).

Canonical Link
The function g that links the mean µ to the canonical parameter θ is called the canonical link:
    g(µ) = θ.
Since µ = b'(θ), the canonical link is given by
    g(µ) = (b')⁻¹(µ).
If φ > 0, the canonical link function is strictly increasing.

Example: Bernoulli distribution
    p^y (1 − p)^{1−y} = exp( y log(p/(1 − p)) + log(1 − p) ) = exp( yθ − log(1 + e^θ) ).
Hence θ = log(p/(1 − p)) and b(θ) = log(1 + e^θ). Then
    b'(θ) = e^θ/(1 + e^θ) = µ  ⟺  θ = log(µ/(1 − µ)).
The canonical link for the Bernoulli distribution is the logit link.

Model and Notation
Let (Xi, Yi) ∈ Rp × R, i = 1, ..., n, be independent random pairs such that the conditional distribution of Yi given Xi = xi has density in the canonical exponential family:
    f_{θi}(yi) = exp( (yi θi − b(θi))/φ + c(yi, φ) ).
Back to β: given a link function g, note the following relationship between β and θ:
    θi = (b')⁻¹(µi) = (b')⁻¹(g⁻¹(Xiᵀβ)) ≡ h(Xiᵀβ),
where h is defined as
    h = (b')⁻¹ ∘ g⁻¹ = (g ∘ b')⁻¹.
If g is the canonical link function, h is the identity (g = (b')⁻¹).

Log-likelihood
The log-likelihood is given by
    ℓn(Y, X, β) = Σ_i (Yi θi − b(θi))/φ + constant
                = Σ_i ( Yi h(Xiᵀβ) − b(h(Xiᵀβ)) )/φ + constant.
When we use the canonical link function, we obtain the expression
    ℓn(Y, X, β) = Σ_i ( Yi Xiᵀβ − b(Xiᵀβ) )/φ + constant.

Strict concavity
The log-likelihood ℓn(β) is strictly concave (if rank(X) = p) using the canonical link when φ > 0. As a consequence, the maximum likelihood estimator is unique. On the other hand, if another parametrization is used, the likelihood function may not be strictly concave, leading to several local maxima.
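A closing sketch (my own illustration, on simulated data) of the canonical-link GLM log-likelihood above for the Bernoulli case: with the logit link, ℓn(β) = Σ_i [Yi Xiᵀβ − log(1 + exp(Xiᵀβ))] (φ = 1), which is strictly concave and can be maximized numerically.

```python
# Logistic regression (canonical/logit link) fitted by maximizing the
# canonical-link log-likelihood, where b(θ) = log(1 + e^θ).
import numpy as np
from scipy import optimize

rng = np.random.default_rng(9)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, 2.0])
prob = 1 / (1 + np.exp(-X @ beta_true))
Y = rng.binomial(1, prob)

def neg_log_lik(beta):
    eta = X @ beta                                   # linear predictor X_iᵀβ
    return -np.sum(Y * eta - np.log1p(np.exp(eta)))  # -ℓ_n(β) up to a constant

res = optimize.minimize(neg_log_lik, x0=np.zeros(p))
print("MLE:", res.x)   # strictly concave log-likelihood, so the maximizer is unique
```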

Recommended Resources
• Probability and Statistics (DeGroot and Schervish)
• Mathematical Statistics and Data Analysis (Rice)
• Fundamentals of Statistics [Lecture Slides] (http://www.edx.org)

Please share this cheatsheet with friends!