The EM Algorithm
9.1 Introduction
Maximum likelihood is the dominant form of estimation in applied statistics. Because closed-form solutions to likelihood equations are the exception rather than the rule, numerical methods for finding maximum likelihood estimates are of paramount importance. In this chapter we study maximum likelihood estimation by the EM algorithm [65, 179, 191], a special case of the MM algorithm. At the heart of every EM algorithm is some notion of missing data. Data can be missing in the ordinary sense of a failure to record certain observations on certain cases. Data can also be missing in a theoretical sense. We can think of the E (expectation) step of the algorithm as filling in the missing data. This action replaces the loglikelihood of the observed data by a minorizing function. This surrogate function is then maximized in the M step. Because the surrogate function is usually much simpler than the likelihood, we can often solve the M step analytically. The price we pay for this simplification is that the EM algorithm is iterative. Reconstructing the missing data is bound to be slightly wrong if the parameters do not already equal their maximum likelihood estimates.
One of the advantages of the EM algorithm is its numerical stability.
As an MM algorithm, any EM algorithm leads to a steady increase in
the likelihood of the observed data. Thus, the EM algorithm avoids wildly
overshooting or undershooting the maximum of the likelihood along its
current direction of search. Besides this desirable feature, the EM algorithm handles parameter constraints gracefully.
In the E step we form the surrogate function
$$
Q(\theta \mid \theta_n) \;=\; \mathrm{E}\bigl[\ln f(X \mid \theta) \mid Y = y, \theta_n\bigr],
$$
where f denotes the complete data likelihood and g(y | θ) the likelihood of the observed data Y = y.
Because the ratio f(x | θ)/g(y | θ) is the conditional density of X given Y = y, the information inequality gives
\begin{align*}
Q(\theta \mid \theta_n) - \ln g(y \mid \theta)
  &= \mathrm{E}\!\left[\ln \frac{f(X \mid \theta)}{g(Y \mid \theta)} \,\Big|\, Y = y, \theta_n\right] \\
  &\le \mathrm{E}\!\left[\ln \frac{f(X \mid \theta_n)}{g(Y \mid \theta_n)} \,\Big|\, Y = y, \theta_n\right] \\
  &= Q(\theta_n \mid \theta_n) - \ln g(y \mid \theta_n).
\end{align*}
Thus, the difference Q(θ | θ_n) − ln g(y | θ) attains its maximum at θ = θ_n, and Q(θ | θ_n) minorizes ln g(y | θ) up to an irrelevant constant.
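The ascent property described above is easy to verify numerically. The following sketch, which is not taken from the text, runs EM for the mixing weight of a two-component mixture whose component densities are assumed fully known (two normal densities chosen purely for illustration); the printed loglikelihood of the observed data increases at every iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f1(y):  # standard normal density (assumed first component)
    return np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)

def f2(y):  # normal density with mean 3 (assumed second component)
    return np.exp(-0.5 * (y - 3.0)**2) / np.sqrt(2 * np.pi)

# simulate data from the mixture with true mixing weight 0.4
z = rng.random(500) < 0.4
y = np.where(z, rng.normal(0.0, 1.0, 500), rng.normal(3.0, 1.0, 500))

pi = 0.5                              # initial guess at the mixing weight
for it in range(20):
    mix = pi * f1(y) + (1 - pi) * f2(y)
    print(it, np.log(mix).sum())      # observed data loglikelihood: never decreases
    w = pi * f1(y) / mix              # E step: posterior weight of component 1
    pi = w.mean()                     # M step: maximize the surrogate
print("estimated mixing weight:", pi)
```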
9.4 Allele Frequency Estimation

In the E step we compute the expected genotype counts given the observed phenotype counts. For the n_A observations of phenotype A,
\begin{align*}
n_{mA/A} &= \mathrm{E}(n_{A/A} \mid Y, p_m) \;=\; n_A\,\frac{p_{mA}^2}{p_{mA}^2 + 2 p_{mA} p_{mO}} \\
n_{mA/O} &= \mathrm{E}(n_{A/O} \mid Y, p_m) \;=\; n_A\,\frac{2 p_{mA} p_{mO}}{p_{mA}^2 + 2 p_{mA} p_{mO}}.
\end{align*}
The conditional expectations n_{mB/B} and n_{mB/O} reduce to similar expressions. Hence, the surrogate function Q(p | p_m) derived from the complete data likelihood matches the surrogate function of the MM algorithm up to a constant, and the maximization step proceeds as described earlier. One of the advantages of the EM derivation is that it explicitly reveals the nature of the conditional expectations n_{mA/A}, n_{mA/O}, n_{mB/B}, and n_{mB/O}.
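For concreteness, here is a hypothetical sketch of the resulting EM (gene counting) iteration for the ABO system. The phenotype counts nA, nB, nAB, nO and the starting frequencies below are placeholders rather than data from the text; the M step simply counts alleles among the 2n genes implied by the expected genotype counts.

```python
import numpy as np

# placeholder phenotype counts (not data from the text)
nA, nB, nAB, nO = 186, 38, 13, 284
n = nA + nB + nAB + nO

pA, pB, pO = 0.3, 0.2, 0.5                     # initial allele frequencies
for _ in range(50):
    # E step: expected genotype counts given the current frequencies
    nAA = nA * pA**2 / (pA**2 + 2 * pA * pO)
    nAO = nA * 2 * pA * pO / (pA**2 + 2 * pA * pO)
    nBB = nB * pB**2 / (pB**2 + 2 * pB * pO)
    nBO = nB * 2 * pB * pO / (pB**2 + 2 * pB * pO)
    # M step: count alleles among the 2n genes
    pA = (2 * nAA + nAO + nAB) / (2 * n)
    pB = (2 * nBB + nBO + nAB) / (2 * n)
    pO = 1 - pA - pB                           # remaining genes are O alleles
print(pA, pB, pO)
```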
9.5 Clustering by EM
The k-means clustering algorithm discussed in Example 7.2.3 makes hard
choices in cluster assignment. The alternative of soft choices is possible
with admixture models [192, 259]. An admixture probability density h(y)
can be written as a convex combination
$$
h(y) \;=\; \sum_{j=1}^{k} \pi_j h_j(y), \tag{9.3}
$$
with solution
$$
\mu_{n+1,j} \;=\; \frac{\sum_{i=1}^{m} w_{ij}\, y_i}{\sum_{i=1}^{m} w_{ij}}.
$$
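A minimal sketch of soft clustering along these lines appears below. It assumes an admixture of k one-dimensional normal densities with a known common variance; the responsibilities w[i, j] play the role of the weights w_ij in the mean update just displayed. The data and settings are illustrative only.

```python
import numpy as np

def em_cluster(y, k, n_iter=100, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                      # admixture proportions pi_j
    mu = rng.choice(y, size=k, replace=False)     # initial cluster means
    for _ in range(n_iter):
        # E step: responsibilities w[i, j] proportional to pi_j h_j(y_i)
        dens = np.exp(-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2)
        w = pi * dens
        w /= w.sum(axis=1, keepdims=True)
        # M step: weighted means (the update displayed above) and proportions
        mu = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
        pi = w.mean(axis=0)
    return pi, mu

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
print(em_cluster(y, k=2))
```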
$$
\mathrm{E}(X_j \mid X_m) \;=\; \mathrm{E}(X_j - X_m \mid X_m) + X_m \;=\; \mu_j - \mu_m + X_m.
$$
where S_ij is the set of pixels between the source and pixel j along projection i. If j′ is the next pixel after pixel j along projection i, then
$$
M_{ij} \;=\; \mathrm{E}(X_{ij} \mid Y_i = y_i, \theta_n), \qquad
N_{ij} \;=\; \mathrm{E}(X_{ij'} \mid Y_i = y_i, \theta_n).
$$
The surrogate function of the E step is therefore
$$
Q(\theta \mid \theta_n) \;=\; \sum_i \sum_j \bigl[-N_{ij} l_{ij} \theta_j + (M_{ij} - N_{ij}) \ln\bigl(1 - e^{-l_{ij}\theta_j}\bigr)\bigr]
$$
up to an irrelevant constant.
If we try to maximize Q(θ | θ n ) by setting its partial derivatives equal
to 0, we get for pixel j the equation
$$
-\sum_i N_{ij}\, l_{ij} + \sum_i \frac{(M_{ij} - N_{ij})\, l_{ij}}{e^{l_{ij}\theta_j} - 1} \;=\; 0. \tag{9.8}
$$
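Equation (9.8) has no closed-form solution in θ_j, but it is a one-dimensional root-finding problem. The sketch below is an illustration rather than the text's method: it solves the equation for a single pixel j with Brent's method from SciPy, where the arrays M, N, and l stand for the conditional expectations M_ij, N_ij and the constants l_ij of equation (9.8), with made-up values.

```python
import numpy as np
from scipy.optimize import brentq

def update_pixel(M, N, l, lo=1e-8, hi=50.0):
    # left side of equation (9.8) as a function of theta_j; it is strictly
    # decreasing in theta_j, so the root found by brentq is unique
    def score(theta):
        return np.sum(-N * l + (M - N) * l / (np.exp(l * theta) - 1.0))
    return brentq(score, lo, hi)

# toy values, purely illustrative
M = np.array([5.0, 3.0, 4.0])
N = np.array([4.0, 2.5, 3.5])
l = np.array([1.0, 0.8, 1.2])
print(update_pixel(M, N, l))
```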
$$
\mathrm{E}(X^* M X) \;=\; \mathrm{E}\Bigl(\sum_i \sum_j X_i m_{ij} X_j\Bigr) \;=\; \sum_i \sum_j m_{ij}\, \mathrm{E}(X_i X_j)
$$
The E step requires the conditional means v_k and conditional variances A_k of the factor scores X_k given the observed data and the current values of the matrices F and D.
Combining these results with equation (9.9) yields
\begin{align*}
&\mathrm{E}\bigl[(Y_k - F X_k)^* D^{-1} (Y_k - F X_k) \mid Y_k = y_k\bigr] \\
&\quad= \operatorname{tr}\bigl(D^{-1} F A_k F^*\bigr) + (y_k - F v_k)^* D^{-1} (y_k - F v_k) \\
&\quad= \operatorname{tr}\bigl\{D^{-1}\bigl[F A_k F^* + (y_k - F v_k)(y_k - F v_k)^*\bigr]\bigr\}.
\end{align*}
If we define
$$
\Lambda \;=\; \sum_{k=1}^{l}\bigl[A_k + v_k v_k^*\bigr], \qquad
\Gamma \;=\; \sum_{k=1}^{l} v_k y_k^*, \qquad
\Omega \;=\; \sum_{k=1}^{l} y_k y_k^*
$$
and take conditional expectations in equation (9.10), then we can write the
surrogate function of the E step as
$$
Q(F, D \mid F_n, D_n) \;=\; -\frac{l}{2}\sum_{i=1}^{p} \ln d_i \;-\; \frac{1}{2}\operatorname{tr}\bigl[D^{-1}(F \Lambda F^* - F\Gamma - \Gamma^* F^* + \Omega)\bigr].
$$
Completing the square in F gives
\begin{align*}
&\operatorname{tr}\bigl[D^{-1}(F \Lambda F^* - F\Gamma - \Gamma^* F^* + \Omega)\bigr] \\
&\quad= \operatorname{tr}\bigl[D^{-1}(F - \Gamma^*\Lambda^{-1})\Lambda(F - \Gamma^*\Lambda^{-1})^*\bigr]
       + \operatorname{tr}\bigl[D^{-1}(\Omega - \Gamma^*\Lambda^{-1}\Gamma)\bigr] \\
&\quad= \operatorname{tr}\bigl[D^{-\frac{1}{2}}(F - \Gamma^*\Lambda^{-1})\Lambda(F - \Gamma^*\Lambda^{-1})^* D^{-\frac{1}{2}}\bigr]
       + \operatorname{tr}\bigl[D^{-1}(\Omega - \Gamma^*\Lambda^{-1}\Gamma)\bigr].
\end{align*}
This calculation depends on the existence of the inverse matrix Λ−1 . Now
Λ is certainly positive definite if Ak is positive definite, and Problem 22
asserts that Ak is positive definite. It follows that Λ−1 not only exists but
is positive definite as well. Furthermore, the matrix
$$
D^{-\frac{1}{2}}(F - \Gamma^*\Lambda^{-1})\,\Lambda\,(F - \Gamma^*\Lambda^{-1})^* D^{-\frac{1}{2}}
$$
is positive semidefinite, so its trace is nonnegative and vanishes precisely when $F = \Gamma^*\Lambda^{-1}$. The M step therefore updates the loading matrix to $F_{n+1} = \Gamma^*\Lambda^{-1}$.
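As a rough illustration, one EM iteration for the factor analysis model can be coded as follows. The conditional moments v_k and A_k come from the standard normal-theory formulas v_k = F^*(FF^* + D)^{-1} y_k and A_k = I − F^*(FF^* + D)^{-1} F, which are assumptions here rather than formulas quoted from the excerpt; the loading update F_{n+1} = Γ^*Λ^{-1} follows from the completion of the square above, and the diagonal update of D comes from maximizing Q with F fixed at its new value.

```python
import numpy as np

def em_step(Y, F, d):
    # Y: l x p data matrix (rows y_k*), F: p x q loadings, d: diagonal of D
    l, p = Y.shape
    q = F.shape[1]
    Sigma = F @ F.T + np.diag(d)          # marginal covariance of Y_k (assumed model)
    B = np.linalg.solve(Sigma, F)         # Sigma^{-1} F
    V = Y @ B                             # rows are v_k* (conditional means)
    A = np.eye(q) - F.T @ B               # common conditional variance A_k
    Lam = l * A + V.T @ V                 # Lambda = sum_k (A_k + v_k v_k*)
    Gam = V.T @ Y                         # Gamma  = sum_k v_k y_k*
    Om = Y.T @ Y                          # Omega  = sum_k y_k y_k*
    F_new = Gam.T @ np.linalg.inv(Lam)    # F_{n+1} = Gamma* Lambda^{-1}
    resid = Om - F_new @ Gam - Gam.T @ F_new.T + F_new @ Lam @ F_new.T
    d_new = np.diag(resid) / l            # diagonal update of D
    return F_new, d_new

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 5))                   # toy data: 200 cases, p = 5
F, d = rng.normal(size=(5, 2)), np.ones(5)      # initial loadings (q = 2) and variances
for _ in range(50):
    F, d = em_step(Y, F, d)
```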
In terms of the forward variables α_n(j), the likelihood of the observed data
$$
P \;=\; \Pr(Y_1 = y_1, \ldots, Y_n = y_n) \tag{9.11}
$$
can be evaluated as
$$
P \;=\; \sum_j \alpha_n(j)\,\phi_n(y_n \mid j).
$$
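A small sketch of the forward algorithm under this convention is given below; it assumes the standard recursion α_1(j) = π_j and α_{i+1}(k) = Σ_j α_i(j) φ_i(y_i | j) p_{jk}, which is consistent with the display above but not quoted from the text, and the transition and emission values are illustrative. In practice one rescales the α's to avoid underflow.

```python
import numpy as np

def forward_likelihood(pi, trans, emis):
    # emis[i, j] = phi_i(y_i | j), trans[j, k] = transition probability j -> k
    n, _ = emis.shape
    alpha = pi.copy()                       # alpha_1(j) = pi_j
    for i in range(n - 1):
        alpha = (alpha * emis[i]) @ trans   # alpha_{i+1}(k)
    return np.sum(alpha * emis[n - 1])      # P = sum_j alpha_n(j) phi_n(y_n | j)

pi = np.array([0.6, 0.4])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emis = np.array([[0.5, 0.1], [0.4, 0.3], [0.2, 0.6]])   # 3 observations, 2 states
print(forward_likelihood(pi, trans, emis))
```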
The part of the surrogate function involving the initial distribution π over the R independent runs of the chain is
$$
Q(\pi \mid \pi_m) \;=\; \sum_r \sum_j \mathrm{E}\bigl(X_{1j}^r \mid Y^r = y^r, \pi_m\bigr) \ln \pi_j.
$$
Maximizing subject to the constraint $\sum_j \pi_j = 1$ gives the update
$$
\pi_{m+1,j} \;=\; \frac{\sum_r \mathrm{E}\bigl(X_{1j}^r \mid Y^r = y^r, \pi_m\bigr)}{R}.
$$
These updates for R runs can be interpreted as multinomial proportions with fractional category counts.
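Numerically the update is just an average of posterior initial-state probabilities across runs, as in this toy sketch; the array gamma1 of posteriors is assumed to have been computed already, for instance by a forward-backward pass.

```python
import numpy as np

# gamma1[r, j] stands for E(X_{1j}^r | Y^r = y^r, pi_m), the posterior
# probability that run r starts in hidden state j (illustrative values)
gamma1 = np.array([[0.7, 0.3],
                   [0.2, 0.8],
                   [0.5, 0.5]])        # R = 3 runs, 2 hidden states
pi_next = gamma1.mean(axis=0)          # pi_{m+1,j} = (1/R) sum_r gamma1[r, j]
print(pi_next)                         # fractional multinomial proportions
```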
category counts. Problem 24 asks the reader to derive the EM algorithm for
estimating time homogeneous transition probabilities. Problem 25 covers
estimation of the parameters of the conditional densities φi (yi | j) for some
common densities.
9.9 Problems
1. Code and test any of the algorithms discussed in the text or problems
of this chapter.
2. The entropy of a probability density p(x) on R^n is defined by
$$
-\int p(x) \ln p(x)\, dx.
$$
Show that p(x) = αe^{βu(x)} does indeed maximize entropy subject to the average energy constraint. The density p(x) is the celebrated Maxwell-Boltzmann density.
The EM updates for the two-component mixture are
\begin{align*}
\alpha_{m+1} &= \frac{\sum_i n_i z_i(\theta_m)}{\sum_i n_i} \\
\mu_{m+1,1} &= \frac{\sum_i n_i\, i\, z_i(\theta_m)}{\sum_i n_i z_i(\theta_m)} \\
\mu_{m+1,2} &= \frac{\sum_i n_i\, i\, [1 - z_i(\theta_m)]}{\sum_i n_i [1 - z_i(\theta_m)]}.
\end{align*}
From the initial estimates α_0 = 0.3, µ_{0,1} = 1.0, and µ_{0,2} = 2.5, compute via the EM algorithm the maximum likelihood estimates α̂ = 0.3599, µ̂_1 = 1.2561, and µ̂_2 = 2.6634. Note how slowly the EM algorithm converges in this example.
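A hedged sketch of this iteration follows. It assumes, as is standard for problems of this form, that z_i(θ) is the posterior probability that a count of i arises from the first Poisson component; the category counts n_i below are placeholders rather than the data referred to in the problem, so the printed estimates will not match the quoted values.

```python
import numpy as np

n = np.array([10, 30, 30, 15, 10, 5])        # placeholder counts n_i for i = 0,...,5
i = np.arange(len(n))

alpha, mu1, mu2 = 0.3, 1.0, 2.5
for _ in range(500):                         # many iterations: EM converges slowly here
    # E step: posterior probability z_i that count i comes from component 1
    d1 = alpha * np.exp(-mu1) * mu1**i       # the common factor 1/i! cancels
    d2 = (1 - alpha) * np.exp(-mu2) * mu2**i
    z = d1 / (d1 + d2)
    # M step: the three updates displayed above
    alpha = np.sum(n * z) / np.sum(n)
    mu1 = np.sum(n * i * z) / np.sum(n * z)
    mu2 = np.sum(n * i * (1 - z)) / np.sum(n * (1 - z))
print(alpha, mu1, mu2)
```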
with weights
\begin{align*}
w_{ni1} &= \frac{f(y_i \mid \theta_n)}{f(y_i \mid \theta_n) + f(-y_i \mid \theta_n)} \\
w_{ni2} &= \frac{f(-y_i \mid \theta_n)}{f(y_i \mid \theta_n) + f(-y_i \mid \theta_n)},
\end{align*}
$$
\Omega \;=\; \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.
$$
\begin{align*}
\mathrm{E}(X_i \mid Y_i = c_i, \theta_n) &= a_i \beta_n + \sigma_n H\!\left(\frac{c_i - a_i \beta_n}{\sigma_n}\right) \\
\mathrm{E}(X_i^2 \mid Y_i = c_i, \theta_n) &= (a_i \beta_n)^2 + \sigma_n^2
   + \sigma_n (c_i + a_i \beta_n)\, H\!\left(\frac{c_i - a_i \beta_n}{\sigma_n}\right).
\end{align*}
$$
\frac{1}{e^s - 1} \;=\; \frac{1}{s} - \frac{1}{2} + \frac{s}{12} + O(s^2).
$$
Truncating the expansion after the first two terms and substituting into equation (9.8) yields the approximate update
$$
\theta_{n+1,j} \;=\; \frac{\sum_i (M_{ij} - N_{ij})}{\frac{1}{2}\sum_i (M_{ij} + N_{ij})\, l_{ij}}.
$$
15. Suppose that the complete data in the EM algorithm involve N binomial trials with success probability θ per trial. Here N can be random or fixed. If M trials result in success, then the complete data likelihood can be written as θ^M (1 − θ)^{N−M} c, where c is an irrelevant constant. The E step of the EM algorithm amounts to forming
$$
Q(\theta \mid \theta_n) \;=\; \mathrm{E}(M \mid Y, \theta_n) \ln \theta + \bigl[\mathrm{E}(N \mid Y, \theta_n) - \mathrm{E}(M \mid Y, \theta_n)\bigr] \ln(1 - \theta)
$$
up to an irrelevant additive constant.
19. Suppose light bulbs have an exponential lifetime with mean θ. Two
experiments are conducted. In the first, the lifetimes y1 , . . . , ym of m
independent bulbs are observed. In the second, p independent bulbs
are observed to burn out before time t, and q independent bulbs are
observed to burn out after time t. In other words, the lifetimes in the
second experiment are both left and right censored. Construct an EM
algorithm for finding the maximum likelihood estimate of θ [95].
20. In many discrete probability models, only data with positive counts
are observed. Counts that are 0 are missing. Show that the likelihoods
for the binomial, Poisson, and negative binomial models truncated at
0 amount to
$$
L_1(p) \;=\; \prod_i \frac{\binom{m_i}{x_i} p^{x_i} (1-p)^{m_i - x_i}}{1 - (1-p)^{m_i}}
$$
$$
L_2(\lambda) \;=\; \prod_i \frac{\lambda^{x_i} e^{-\lambda}}{x_i!\,\bigl(1 - e^{-\lambda}\bigr)}
$$
$$
L_3(p) \;=\; \prod_i \frac{\binom{m_i + x_i - 1}{x_i} (1-p)^{x_i} p^{m_i}}{1 - p^{m_i}}.
$$
The corresponding EM updates are
$$
p_{n+1} \;=\; \frac{\sum_i x_i}{\sum_i \dfrac{m_i}{1 - (1-p_n)^{m_i}}}, \qquad
\lambda_{n+1} \;=\; \frac{\sum_i x_i}{\sum_i \dfrac{1}{1 - e^{-\lambda_n}}}, \qquad
p_{n+1} \;=\; \frac{\sum_i \dfrac{m_i}{1 - p_n^{m_i}}}{\sum_i \Bigl(x_i + \dfrac{m_i}{1 - p_n^{m_i}}\Bigr)}.
$$
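As a quick numerical check of the zero-truncated Poisson update, the following sketch iterates it on simulated positive counts (illustrative data, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(2.0, 2000)
x = x[x > 0]                                  # only positive counts are observed

lam = x.mean()                                # starting value
for _ in range(100):
    # lambda_{n+1} = sum_i x_i / sum_i [1 / (1 - exp(-lambda_n))]
    lam = x.sum() / (len(x) / (1 - np.exp(-lam)))
print(lam)                                    # should land near the true rate 2.0
```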
for u and un in the interval (0, 1). Prove this minorization first. (Hint:
If you rearrange the minorization, then Proposition 9.2.1 applies.)
22. Suppose that Σ is a positive definite matrix. Prove that the matrix
I − F ∗ (F F ∗ + Σ)−1 F is also positive definite. This result is used
in the derivation of the EM algorithm in Sect. 9.7. (Hints: For readers familiar with the sweep operator of computational statistics, the
simplest proof relies on applying Propositions 7.5.2 and 7.5.3 of the
reference [166].)
where the weight w_{mij} = E(X_{ij} | Y, µ_m). Show that the same update applies when Y_i given Z_i = j is exponentially distributed with mean µ_j or normally distributed with mean µ_j and common variance σ². In the latter setting, demonstrate that the EM update of σ² is
$$
\sigma_{m+1}^2 \;=\; \frac{\sum_{i=1}^{n} \sum_j w_{mij}\,(y_i - \mu_{m+1,j})^2}{\sum_{i=1}^{n} \sum_j w_{mij}}.
$$