
9  The EM Algorithm

9.1 Introduction
Maximum likelihood is the dominant form of estimation in applied
statistics. Because closed-form solutions to likelihood equations are the
exception rather than the rule, numerical methods for finding maximum
likelihood estimates are of paramount importance. In this chapter we study
maximum likelihood estimation by the EM algorithm [65, 179, 191], a spe-
cial case of the MM algorithm. At the heart of every EM algorithm is some
notion of missing data. Data can be missing in the ordinary sense of a
failure to record certain observations on certain cases. Data can also be
missing in a theoretical sense. We can think of the E (expectation) step of
the algorithm as filling in the missing data. This action replaces the log-
likelihood of the observed data by a minorizing function. This surrogate
function is then maximized in the M step. Because the surrogate function
is usually much simpler than the likelihood, we can often solve the M step
analytically. The price we pay for this simplification is that the EM algo-
rithm is iterative. Reconstructing the missing data is bound to be slightly
wrong if the parameters do not already equal their maximum likelihood
estimates.
One of the advantages of the EM algorithm is its numerical stability.
As an MM algorithm, any EM algorithm leads to a steady increase in
the likelihood of the observed data. Thus, the EM algorithm avoids wildly
overshooting or undershooting the maximum of the likelihood along its
current direction of search. Besides this desirable feature, the EM handles
parameter constraints gracefully. Constraint satisfaction is by definition
built into the solution of the M step. In contrast, competing methods of
maximization must incorporate special techniques to cope with parame-
ter constraints. The EM shares some of the negative features of the more
general MM algorithm. For example, the EM algorithm often converges
at an excruciatingly slow rate in a neighborhood of the maximum point.
This rate directly reflects the amount of missing data in a problem. In the
absence of concavity, there is also no guarantee that the EM algorithm
will converge to the global maximum. The global maximum can usually be
reached by starting the parameters at good but suboptimal estimates such
as method-of-moments estimates or by choosing multiple random starting
points.

9.2 Definition of the EM Algorithm


A sharp distinction is drawn in the EM algorithm between the observed,
incomplete data y and the unobserved, complete data x of a statistical
experiment [65, 179, 251]. Some function t(x) = y collapses x onto y.
For instance, if we represent x as (y, z), with z as the missing data, then
t is simply projection onto the y-component of x. It should be stressed
that the missing data can consist of more than just observations missing
in the ordinary sense. In fact, the definition of x is left up to the intuition
and cleverness of the statistician. The general idea is to choose x so that
maximum likelihood estimation becomes trivial for the complete data.
The complete data are assumed to have a probability density f (x | θ)
that is a function of a parameter vector θ as well as of x. In the E step of
the EM algorithm, we calculate the conditional expectation

Q(θ | θn ) = E[ln f (X | θ) | Y = y, θ n ].

Here θ n is the current estimated value of θ, upper case letters indicate


random vectors, and lower case letters indicate corresponding realizations
of these random vectors. In the M step, we maximize Q(θ | θ n ) with
respect to θ. This yields the new parameter estimate θ n+1 , and we repeat
this two-step process until convergence occurs. Note that θ and θ n play
fundamentally different roles in Q(θ | θ n ).
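The two steps translate directly into a simple computational loop. The following Python sketch is purely illustrative: e_step and m_step are hypothetical user-supplied functions returning the conditional expectations that define Q(θ | θ n ) and the maximizer of that surrogate, and the stopping rule on successive iterates is only one of several reasonable choices.

    import numpy as np

    def em(theta0, e_step, m_step, tol=1e-8, max_iter=1000):
        # Generic EM iteration: alternate E and M steps until the
        # parameter vector stabilizes.
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            stats = e_step(theta)       # E step: expectations given theta_n
            theta_new = m_step(stats)   # M step: maximize Q(theta | theta_n)
            if np.linalg.norm(theta_new - theta) < tol:
                break
            theta = theta_new
        return theta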
If ln g(y | θ) denotes the loglikelihood of the observed data, then the EM
algorithm enjoys the ascent property

ln g(y | θ n+1 ) ≥ ln g(y | θn ).

Proof of this assertion unfortunately involves measure theory, so some read-


ers may want to take it on faith and skip the rest of this section. A necessary
preliminary is the following well-known inequality from statistics.

Proposition 9.2.1 (Information Inequality) Let h and k be probabil-


ity densities with respect to a measure µ. Suppose h > 0 and k > 0 almost
everywhere relative to µ. If Eh denotes expectation with respect to the prob-
ability measure hdµ, then Eh (ln h) ≥ Eh (ln k), with equality if and only if
h = k almost everywhere relative to µ.
Proof: Because − ln(w) is a strictly convex function on (0, ∞), Proposi-
tion 6.6.1 applied to the random variable k/h yields

Eh (ln h) − Eh (ln k) = Eh [− ln(k/h)]
                      ≥ − ln Eh (k/h)
                      = − ln ∫ (k/h) h dµ
                      = − ln ∫ k dµ
                      = 0.

Equality holds if and only if k/h = Eh (k/h) almost everywhere relative


to µ. This necessary and sufficient condition is equivalent to h = k since
Eh (k/h) = 1.
To prove the ascent property of the EM algorithm, it suffices to demon-
strate the minorization inequality

ln g(y | θ) ≥ Q(θ | θ n ) + ln g(y | θ n ) − Q(θn | θ n ),

where Q(θ | θ n ) = E[ln f (X | θ) | Y = y, θ n ]. With this end in mind,


note that both f (x | θ)/g(y | θ) and f (x | θ n )/g(y | θ n ) are conditional
densities of X on {x : t(x) = y} with respect to some measure µy . The
information inequality now indicates that

Q(θ | θ n ) − ln g(y | θ) = E[ln{f (X | θ)/g(Y | θ)} | Y = y, θ n ]
                          ≤ E[ln{f (X | θ n )/g(Y | θ n )} | Y = y, θ n ]
                          = Q(θ n | θ n ) − ln g(y | θ n ).

Maximizing Q(θ | θ n ) therefore drives ln g(y | θ) uphill. The ascent in-


equality is strict whenever the conditional density f (x | θ)/g(y | θ) differs
at the parameter points θn and θ n+1 or

Q(θ n+1 | θ n ) > Q(θn | θ n ).

The preceding proof is a little vague as to the meaning of the conditional
density f (x | θ)/g(y | θ) and its associated measure µy . Commonly the
complete data decomposes as x = (y, z), where z is considered the missing


data and t(y, z) = y is projection onto the observed data. Suppose (y, z)
has joint density f (y, z | θ) relative to a product measure ω × µ in (y, z); ω and
µ are typically Lebesgue measure or counting measure. In this framework,
we define g(y | θ) = ∫ f (y, z | θ) dµ(z) and set µy = µ. The function
g(y | θ) serves as a density relative to ω. To check that these definitions
make sense, it suffices to prove that ∫ h(y, z)[f (y, z | θ)/g(y | θ)] dµ(z)
is a version of the conditional expectation E[h(Y , Z) | Y = y] for every
well-behaved function h(y, z). This assertion can be verified by showing
E{1S (Y ) E[h(Y , Z) | Y ]} = E[1S (Y )h(Y , Z)]
for every measurable set S. With
E[h(Y , Z) | Y = y] = ∫ h(y, z) [f (y, z | θ)/g(y | θ)] dµ(z),

we calculate

E{1S (Y ) E[h(Y , Z) | Y = y]} = ∫_S ∫ h(y, z) [f (y, z | θ)/g(y | θ)] dµ(z) g(y | θ) dω(y)
                              = ∫_S ∫ h(y, z) f (y, z | θ) dµ(z) dω(y)
                              = E[1S (Y )h(Y , Z)].
Hence in this situation, f (x | θ)/g(y | θ) is indeed the conditional density
of X given Y = y.

9.3 Missing Data in the Ordinary Sense


The most common application of the EM algorithm is to data missing in
the ordinary sense. For example, Problem 40 of Chap. 8 considers a bal-
anced ANOVA model with two factors. Missing observations in this setting
break the symmetry that permits explicit solution of the likelihood equa-
tions. Thus, there is ample incentive for filling in the missing observations.
If the observations follow an exponential model, and missing data are miss-
ing completely at random, then the EM algorithm replaces the sufficient
statistic of each missing observation by its expected value.
The density of a random variable Y from an exponential family can be
written as

f (y | θ) = g(y)e^{β(θ)+h(y)∗γ(θ)}    (9.1)
relative to some measure ν [73, 218]. The normal, Poisson, binomial, nega-
tive binomial, gamma, beta, and multinomial families are prime examples
of exponential families. The function h(y) in equation (9.1) is the sufficient
statistic. The maximum likelihood estimate of the parameter vector θ de-
pends on an observation y only through h(y). Predictors of y are incorpo-
rated into the functions β(θ) and γ(θ).
To fill in a missing observation y, we take the ordinary expectation

E[ln f (Y | θ) | θ n ] = E[ln g(Y ) | θ n ] + β(θ) + E[h(Y ) | θn ]∗ γ(θ)

of the complete data loglikelihood. This function is added to the loglikeli-


hood of the regular observations y1 , . . . , ym to generate the surrogate func-
tion Q(θ | θ n ). For example, if a typical observation is normally distributed
with mean µ(α) and variance σ 2 , then θ is the vector (α∗ , σ 2 )∗ and
E[ln f (Y | θ) | θ n ] = ln[1/√(2πσ²)] − (1/(2σ²)) E{[Y − µ(α)]² | θ n }
                      = ln[1/√(2πσ²)] − (1/(2σ²)) {σn² + [µ(αn ) − µ(α)]²}.
Once we have filled in the missing data, we can estimate α without refer-
ence to σ 2 . This is accomplished by adding each square [µi (αn ) − µi (α)]2
corresponding to a missing observation yi to the sum of squares for the ac-
tual observations and then minimizing the entire sum over α. In classical
models such as balanced ANOVA, the M step is exact. Once the iterative
limit limn→∞ αn = α̂ is reached, we can estimate σ 2 in one step by the
formula
σ̂² = (1/m) Σ_{i=1}^m [yi − µi (α̂)]²

using only the observed yi . The reader is urged to work Problem 40 of


Chap. 8 to see the whole process in action.

9.4 Allele Frequency Estimation


It is instructive to compare the EM and MM algorithms on identical prob-
lems. Even when the two algorithms specify the same iteration scheme, the
differences in deriving the algorithms are illuminating. Consider the ABO
allele frequency estimation problem of Sect. 8.4. From the EM perspective,
the complete data x are genotype counts rather than phenotype counts y.
In passing from the complete data to the observed data, nature collapses
genotypes A/A and A/O into phenotype A and genotypes B/B and B/O
into phenotype B. In view of the Hardy-Weinberg equilibrium law, the
complete data multinomial loglikelihood becomes

ln f (X | p) = nA/A ln pA² + nA/O ln(2pA pO ) + nB/B ln pB²
             + nB/O ln(2pB pO ) + nAB ln(2pA pB ) + nO ln pO²
             + ln [n!/(nA/A ! nA/O ! nB/B ! nB/O ! nAB ! nO !)].    (9.2)
In the E step of the EM algorithm we take the expectation of ln f (X | p)
conditional on the observed counts nA , nB , nAB , and nO and the current
parameter vector pm = (pmA , pmB , pmO )∗ . This action yields the surrogate
function
Q(p | pm ) = E(nA/A | Y, pm ) ln pA² + E(nA/O | Y, pm ) ln(2pA pO )
           + E(nB/B | Y, pm ) ln pB² + E(nB/O | Y, pm ) ln(2pB pO )
           + E(nAB | Y, pm ) ln(2pA pB ) + E(nO | Y, pm ) ln pO²
           + E{ln [n!/(nA/A ! nA/O ! nB/B ! nB/O ! nAB ! nO !)] | Y, pm }.
It is obvious that
E(nAB | Y, pm ) = nAB
E(nO | Y, pm ) = nO .
Application of Bayes’ rule gives

nmA/A = E(nA/A | Y, pm ) = nA pmA²/(pmA² + 2pmA pmO )
nmA/O = E(nA/O | Y, pm ) = nA 2pmA pmO /(pmA² + 2pmA pmO ).
The conditional expectations nmB/B and nmB/O reduce to similar expres-
sions. Hence, the surrogate function Q(p | pm ) derived from the complete
data likelihood matches the surrogate function of the MM algorithm up to
a constant, and the maximization step proceeds as described earlier. One of
the advantages of the EM derivation is that it explicitly reveals the nature
of the conditional expectations nmA/A , nmA/O , nmB/B , and nmB/O .
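As a concrete illustration, here is a minimal Python sketch of the resulting iteration for the ABO problem, assuming only the phenotype counts are available; the function and variable names are illustrative, and the gene-counting M step follows from maximizing the surrogate above subject to pA + pB + pO = 1.

    def abo_em(nA, nB, nAB, nO, p=(1/3, 1/3, 1/3), iters=100):
        # EM iteration for the ABO allele frequencies (pA, pB, pO).
        pA, pB, pO = p
        n = nA + nB + nAB + nO
        for _ in range(iters):
            # E step: expected genotype counts given phenotypes and current p
            nAA = nA * pA**2 / (pA**2 + 2 * pA * pO)
            nAO = nA * 2 * pA * pO / (pA**2 + 2 * pA * pO)
            nBB = nB * pB**2 / (pB**2 + 2 * pB * pO)
            nBO = nB * 2 * pB * pO / (pB**2 + 2 * pB * pO)
            # M step: gene counting (each individual carries two genes)
            pA = (2 * nAA + nAO + nAB) / (2 * n)
            pB = (2 * nBB + nBO + nAB) / (2 * n)
            pO = (nAO + nBO + 2 * nO) / (2 * n)
        return pA, pB, pO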

9.5 Clustering by EM
The k-means clustering algorithm discussed in Example 7.2.3 makes hard
choices in cluster assignment. The alternative of soft choices is possible
with admixture models [192, 259]. An admixture probability density h(y)
can be written as a convex combination

h(y) = Σ_{j=1}^k πj hj (y),    (9.3)

where the πj are nonnegative probabilities that sum to 1 and hj (y) is


the probability density of group j. According to Bayes’ rule, the posterior
probability that an observation y belongs to group j equals the ratio

πj hj (y) / Σ_{i=1}^k πi hi (y).    (9.4)
If hard assignment is necessary, then the rational procedure is to assign y
to the group with highest posterior probability.
Suppose the observations y 1 , . . . , y m represent a random sample from
the admixture density (9.3). In practice we want to estimate the admixture
proportions and whatever further parameters θ characterize the densities
hj (y | θ). The EM algorithm is natural in this context with group mem-
bership as the missing data. If we let zij be an indicator specifying whether
observation y i comes from group j, then the complete data loglikelihood
amounts to

Σ_{i=1}^m Σ_{j=1}^k zij [ln πj + ln hj (y i | θ)].

To find the surrogate function, we must find the conditional expectation


wij of zij . But this reduces to the Bayes’ rule (9.4) with θ fixed at θ n and
π fixed at π n , where as usual n indicates iteration number. Note that the
property Σ_{j=1}^k zij = 1 entails the property Σ_{j=1}^k wij = 1.
Fortunately, the E step of the EM algorithm separates the π parameters
from the θ parameters. The problem of maximizing

Σ_{j=1}^k cj ln πj

with cj = Σ_{i=1}^m wij should be familiar by now. Since Σ_{j=1}^k cj = m,
Example 1.4.2 shows that πn+1,j = cj /m.
We now undertake estimation of the remaining parameters assuming
the groups are normally distributed with a common variance matrix Ω
but different mean vectors µ1 , . . . , µk . The pertinent part of the surrogate
function is

Σ_{i=1}^m Σ_{j=1}^k wij [−(1/2) ln det Ω − (1/2)(y i − µj )∗ Ω−1 (y i − µj )]
  = −(m/2) ln det Ω − (1/2) Σ_{j=1}^k Σ_{i=1}^m wij (y i − µj )∗ Ω−1 (y i − µj )
  = −(m/2) ln det Ω − (1/2) tr[Ω−1 Σ_{j=1}^k Σ_{i=1}^m wij (y i − µj )(y i − µj )∗ ].    (9.5)

Differentiating the surrogate (9.5) with respect to µj gives the equation


Σ_{i=1}^m wij Ω−1 (y i − µj ) = 0

with solution

µn+1,j = (Σ_{i=1}^m wij )−1 Σ_{i=1}^m wij y i .

Maximization of the surrogate (9.5) with respect to Ω can be rephrased


as maximization of

−(m/2) ln det Ω − (1/2) tr(Ω−1 M )

for the choice

M = Σ_{j=1}^k Σ_{i=1}^m wij (y i − µn+1,j )(y i − µn+1,j )∗ .

Abstractly this is just the problem we faced in Example 6.5.7. Inspection


of the arguments there shows that

Ωn+1 = (1/m) M.    (9.6)

There is no guarantee of a unique mode in this model. Fortunately,


k-means clustering generates good starting values for the parameters. The
cluster centers provide the group means. If we set wij equal to 1 or 0 de-
pending on whether observation i belongs to cluster j or not, then the
matrix (9.6) serves as an initial guess of the common variance matrix.
The initial admixture proportion πj can be taken to be the proportion of
the observations assigned to cluster j.
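A compact Python sketch of one pass of this EM clustering scheme follows; it assumes the data are stacked in an m × p array y, and the names are illustrative. The normalizing constant of the normal density is dropped in the E step because it cancels in the posterior weights.

    import numpy as np

    def admixture_em_step(y, pi, mu, Omega):
        # One EM update for a k-component normal admixture with a shared
        # covariance matrix Omega.  y: (m, p); pi: (k,); mu: (k, p).
        m, p = y.shape
        k = len(pi)
        inv = np.linalg.inv(Omega)
        logdet = np.linalg.slogdet(Omega)[1]
        logw = np.empty((m, k))
        for j in range(k):               # E step: posterior weights w[i, j]
            r = y - mu[j]
            logw[:, j] = (np.log(pi[j]) - 0.5 * logdet
                          - 0.5 * np.einsum('ip,pq,iq->i', r, inv, r))
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        pi = w.sum(axis=0) / m           # M step: closed-form updates
        mu = (w.T @ y) / w.sum(axis=0)[:, None]
        M = np.zeros((p, p))
        for j in range(k):
            r = y - mu[j]
            M += (w[:, j, None] * r).T @ r
        Omega = M / m                    # equation (9.6)
        return pi, mu, Omega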

9.6 Transmission Tomography


The EM and MM algorithms for transmission tomography differ. The MM
algorithm is easier to derive and computationally more efficient. In other
examples, the opposite is true.
In the transmission tomography example of Sect. 8.10, it is natural to
view the missing data as the number of photons Xij entering each pixel j
along each projection line i. These random variables supplemented by the
observations Yi constitute the complete data. If projection line i does not
intersect pixel j, then Xij = 0. Although Xij and Xij′ are not independent,
the collection {Xij }j indexed by projection i is independent of the collection
{Xi′j }j indexed by another projection i′. This allows us to work projection
by projection in writing the complete data likelihood. We will therefore
temporarily drop the projection subscript i and relabel pixels, starting
with pixel 1 adjacent to the source and ending with pixel m − 1 adjacent
to the detector. In this notation X1 is the number of photons leaving the
source, Xj is the number of photons entering pixel j, and Xm = Y is the
number of photons detected.
By assumption X1 follows a Poisson distribution with mean d. Condi-
tional on X1 , . . . , Xj , the random variable Xj+1 is binomially distributed
with Xj trials and success probability e−lj θj . In other words, each of the
Xj photons entering pixel j behaves independently and has a chance e−lj θj
of avoiding attenuation in pixel j. It follows that the complete data loglike-
lihood for the current projection is
−d + X1 ln d − ln X1 !
+ Σ_{j=1}^{m−1} [ln(Xj choose Xj+1 ) + Xj+1 ln e^{−lj θj } + (Xj − Xj+1 ) ln(1 − e^{−lj θj })].    (9.7)
To perform the E step of the EM algorithm, we merely need to compute
the conditional expectations E(Xj | Xm = y, θ) for 1 ≤ j ≤ m. The
conditional expectations of other terms appearing in (9.7), such as
ln(Xj choose Xj+1 ),
are irrelevant in the subsequent M step.
Reasoning as above, we infer that the unconditional mean of Xj is
µj = E(Xj ) = d e^{−Σ_{k=1}^{j−1} lk θk }

and that the distribution of Xm conditional on Xj is binomial with Xj
trials and success probability

µm /µj = e^{−Σ_{k=j}^{m−1} lk θk }.
In view of our remarks about random thinning in Chap. 8, the joint prob-
ability density of Xj and Xm therefore reduces to
Pr(Xj = xj , Xm = xm ) = e^{−µj } (µj^{xj}/xj !) (xj choose xm ) (µm /µj )^{xm} (1 − µm /µj )^{xj −xm},

and the conditional probability density of Xj given Xm becomes

Pr(Xj = xj | Xm = xm ) = [e^{−µj } (µj^{xj}/xj !) (xj choose xm ) (µm /µj )^{xm} (1 − µm /µj )^{xj −xm}] / [e^{−µm } µm^{xm}/xm !]
                       = e^{−(µj −µm )} (µj − µm )^{xj −xm} / (xj − xm )!.

In other words, conditional on Xm , the difference Xj −Xm follows a Poisson


distribution with mean µj − µm . This implies in particular that

E(Xj | Xm ) = E(Xj − Xm | Xm ) + Xm = µj − µm + Xm .

Reverting to our previous notation, it is now possible to assemble the


surrogate function Q(θ | θn ) of the E step. Define
Mij = di (e^{−Σ_{k∈Sij} lik θnk } − e^{−Σ_k lik θnk }) + yi
Nij = di (e^{−Σ_{k∈Sij∪{j}} lik θnk } − e^{−Σ_k lik θnk }) + yi ,

where Sij is the set of pixels between the source and pixel j along projec-
tion i. If j′ is the next pixel after pixel j along projection i, then

Mij = E(Xij | Yi = yi , θn )
Nij = E(Xij′ | Yi = yi , θn ).

In view of expression (9.7), we find that

Q(θ | θ n ) = Σ_i Σ_j [−Nij lij θj + (Mij − Nij ) ln(1 − e^{−lij θj })]

up to an irrelevant constant.
If we try to maximize Q(θ | θ n ) by setting its partial derivatives equal
to 0, we get for pixel j the equation
−Σ_i Nij lij + Σ_i (Mij − Nij )lij /(e^{lij θj } − 1) = 0.    (9.8)

This is an intractable transcendental equation in the single variable θj ,


and the M step must be solved numerically, say by Newton’s method. It is
straightforward to check that the left-hand side of equation (9.8) is strictly
decreasing in θj and has exactly one positive solution. Thus, the EM al-
gorithm like the MM algorithm has the advantages of decoupling the pa-
rameters in the likelihood equations and of satisfying the natural boundary
constraints θj ≥ 0. The MM algorithm is preferable to the EM algorithm
because the MM algorithm involves far fewer exponentiations in defining
its surrogate function.
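Because the M step here reduces to a one-dimensional root-finding problem for each pixel, a short Newton iteration suffices. The sketch below is one possible implementation, with illustrative arrays M, N, and l holding the values Mij, Nij, and lij over the projections i that cross pixel j.

    import numpy as np

    def m_step_pixel(M, N, l, theta, iters=10):
        # Newton's method for equation (9.8) in the single unknown theta_j.
        for _ in range(iters):
            e = np.exp(l * theta)
            f = -np.sum(N * l) + np.sum((M - N) * l / (e - 1))
            df = -np.sum((M - N) * l**2 * e / (e - 1)**2)  # strictly negative
            # damped step keeps theta positive; (9.8) has a unique positive root
            theta = max(theta - f / df, 0.5 * theta)
        return theta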

9.7 Factor Analysis


In some instances, the missing data framework of the EM algorithm offers
the easiest way to exploit convexity in deriving an MM algorithm. The com-
plete data for a given problem is often fairly natural, and the difficulty
in deriving an EM algorithm shifts toward specifying the E step. Statisti-


cians are particularly adept at calculating complicated conditional expecta-
tions connected with sampling distributions. We now illustrate these truths
for estimation in factor analysis. Factor analysis explains the covariation
among the components of a random vector by approximating the vector by
a linear transformation of a small number of uncorrelated factors. Because
factor analysis models usually involve normally distributed random vec-
tors, Appendix A.2 reviews some basic facts about the multivariate normal
distribution.
For the sake of notational convenience, we now extend the expectation
and variance operators to random vectors. The expectation of a random
vector X = (X1 , . . . , Xn )∗ is defined componentwise by

E(X) = (E[X1 ], . . . , E[Xn ])∗ .
Linearity carries over from the scalar case in the sense that
E(X + Y ) = E(X) + E(Y )
E(M X) = M E(X)
for a compatible random vector Y and a compatible matrix M . The same
componentwise conventions hold for the expectation of a random matrix
and the variances and covariances of a random vector. Thus, we can express
the variance matrix of a random vector X as
Var(X) = E{[X − E(X)][X − E(X)]∗ } = E(XX ∗ ) − E(X) E(X)∗ .
These notational choices produce many other compact formulas. For in-
stance, the random quadratic form X ∗ MX has expectation
E(X ∗ M X) = tr[M Var(X)] + E(X)∗ M E(X). (9.9)
To verify this assertion, observe that

E(X ∗ M X) = E(Σ_i Σ_j Xi mij Xj )
           = Σ_i Σ_j mij E(Xi Xj )
           = Σ_i Σ_j mij [Cov(Xi , Xj ) + E(Xi ) E(Xj )]
           = tr[M Var(X)] + E(X)∗ M E(X).
The classical factor analysis model deals with l independent multivariate
observations of the form
Yk = µ + F Xk + U k.

Here the p × q factor loading matrix F transforms the unobserved factor


score X k into the observed Y k . The random vector U k represents random
measurement error. Typically, q is much smaller than p. The random vectors
X k and U k are independent and normally distributed with means and
variances
E(X k ) = 0, Var(X k ) = I
E(U k ) = 0, Var(U k ) = D,
where I is the q × q identity matrix and D is a p × p diagonal matrix with
ith diagonal entry di . The entries of the mean vector µ, the factor loading
matrix F , and the diagonal matrix D constitute the parameters of the
model. For a particular realization y 1 , . . . , y l of the model, the maximum
likelihood estimation of µ is simply the sample mean µ̂ = ȳ. This fact is a
consequence of the reasoning given in Example 6.5.7. Therefore, we replace
each y k by y k − ȳ, assume µ = 0, and focus on estimating F and D.
The random vector (X ∗k , Y ∗k )∗ is the obvious choice of the complete data
for case k. If f (xk ) is the density of X k and g(y k | xk ) is the conditional
density of Y k given X k = xk , then the complete data loglikelihood can be
expressed as
Σ_{k=1}^l ln f (xk ) + Σ_{k=1}^l ln g(y k | xk )
  = −(l/2) ln det I − (1/2) Σ_{k=1}^l xk∗ xk − (l/2) ln det D
    − (1/2) Σ_{k=1}^l (y k − F xk )∗ D−1 (y k − F xk ).    (9.10)

We can simplify this by noting that ln det I = 0 and ln det D = Σ_{i=1}^p ln di .


The key to performing the E step is to note that (X ∗k , Y ∗k )∗ follows a
multivariate normal distribution with variance matrix
Var [ X k ]   =   [ I         F ∗       ]
    [ Y k ]       [ F         F F ∗ + D ] .
Equation (A.1) of Appendix A.2 then permits us to calculate the condi-
tional expectation
vk = E(X k | Y k = y k , F n , D n )
= F ∗n (F n F ∗n + Dn )−1 y k
and the conditional variance
Ak = Var(X k | Y k = y k , F n , Dn )
= I − F ∗n (F n F ∗n + Dn )−1 F n ,

given the observed data and the current values of the matrices F and D.
Combining these results with equation (9.9) yields

E[(Y k − F X k )∗ D−1 (Y k − F X k ) | Y k = y k ]
= tr(D −1 F Ak F ∗ ) + (y k − F v k )∗ D−1 (y k − F v k )
= tr{D−1 [F Ak F ∗ + (y k − F v k )(y k − F v k )∗ ]}.

If we define
Λ = Σ_{k=1}^l [Ak + v k v k∗ ],    Γ = Σ_{k=1}^l v k y k∗ ,    Ω = Σ_{k=1}^l y k y k∗

and take conditional expectations in equation (9.10), then we can write the
surrogate function of the E step as

Q(F , D | F n , D n ) = −(l/2) Σ_{i=1}^p ln di − (1/2) tr[D −1 (F ΛF ∗ − F Γ − Γ∗ F ∗ + Ω)],

omitting the additive constant


−(1/2) Σ_{k=1}^l E(X k∗ X k | Y k = y k , F n , D n ),

which depends on neither F nor D.


To perform the M step, we first maximize Q(F , D | F n , D n ) with respect
to F , holding D fixed. We can do so by permuting factors and completing
the square in the trace

tr[D −1 (F ΛF ∗ − F Γ − Γ∗ F ∗ + Ω)]
= tr[D −1 (F − Γ∗ Λ−1 )Λ(F − Γ∗ Λ−1 )∗ ] + tr[D −1 (Ω − Γ∗ Λ−1 Γ)]
= tr[D −1/2 (F − Γ∗ Λ−1 )Λ(F − Γ∗ Λ−1 )∗ D −1/2 ] + tr[D −1 (Ω − Γ∗ Λ−1 Γ)].

This calculation depends on the existence of the inverse matrix Λ−1 . Now
Λ is certainly positive definite if Ak is positive definite, and Problem 22
asserts that Ak is positive definite. It follows that Λ−1 not only exists but
is positive definite as well. Furthermore, the matrix
D −1/2 (F − Γ∗ Λ−1 )Λ(F − Γ∗ Λ−1 )∗ D −1/2

is positive semidefinite and has a nonnegative trace. Hence, the maximum


value of the surrogate function Q(F , D | F n , D n ) with respect to F is
attained at the point F = Γ∗ Λ−1 , regardless of the value of D. In other
words, the EM update of F is F n+1 = Γ∗ Λ−1 . It should be stressed that
Γ and Λ implicitly depend on the previous values F n and D n . Once F n+1
is determined, the equation

0 = (∂/∂di ) Q(F , D | F n , D n ) = −l/(2di ) + (1/(2di²))(F ΛF ∗ − F Γ − Γ∗ F ∗ + Ω)ii

provides the update

dn+1,i = (1/l)(F n+1 ΛF ∗n+1 − F n+1 Γ − Γ∗ F ∗n+1 + Ω)ii .
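Putting the E and M steps together, one EM sweep for factor analysis can be sketched as follows in Python; the observations are assumed to be centered and stored in an l × p array y, d holds the diagonal of D, and the variable names are illustrative only.

    import numpy as np

    def factor_em_step(y, F, d):
        # One EM update of the loadings F (p x q) and the diagonal d of D.
        l, p = y.shape
        q = F.shape[1]
        Sigma = F @ F.T + np.diag(d)         # marginal covariance F F* + D
        W = np.linalg.solve(Sigma, F)        # Sigma^{-1} F, so W.T = F* Sigma^{-1}
        v = y @ W                            # rows are the conditional means v_k
        A = np.eye(q) - F.T @ W              # common conditional variance A_k
        Lam = l * A + v.T @ v                # Lambda
        Gam = v.T @ y                        # Gamma
        Omg = y.T @ y                        # Omega
        F_new = np.linalg.solve(Lam, Gam).T  # F_{n+1} = Gamma* Lambda^{-1}
        resid = F_new @ Lam @ F_new.T - F_new @ Gam - Gam.T @ F_new.T + Omg
        d_new = np.diag(resid) / l
        return F_new, d_new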
One of the frustrating features of factor analysis is that the factor loading
matrix F is not uniquely determined. To understand the source of the
ambiguity, consider replacing F by F O, where O is a q × q orthogonal
matrix. The distribution of each random vector Y k is normal with mean
µ and variance matrix F F ∗ + D. If we substitute F O for F , then the
variance F OO ∗ F ∗ + D = F F ∗ + D remains the same. Another problem
in factor analysis is the existence of more than one local maximum. Which
one of these the EM algorithm converges to depends on its starting value
[76]. For a suggestion of how to improve the chances of converging to the
dominant mode, see the article [281].

9.8 Hidden Markov Chains


A hidden Markov chain incorporates both observed data and missing data.
The missing data are the sequence of states visited by the chain; the ob-
served data provide partial information about this sequence of states. De-
note the sequence of visited states by Z1 , . . . , Zn and the observation taken
at epoch i when the chain is in state Zi by Yi = yi . Baum’s algorithms
[8, 71] recursively compute the likelihood of the observed data

P = Pr(Y1 = y1 , . . . , Yn = yn ) (9.11)

without actually enumerating all possible realizations Z1 , . . . , Zn . Baum’s


algorithms can be adapted to perform an EM search. The references [78,
165, 216] discuss several concrete examples of hidden Markov chains.
The likelihood (9.11) is constructed from three ingredients: (a) the ini-
tial distribution π at the first epoch of the chain, (b) the epoch-dependent
transition probabilities pijk = Pr(Zi+1 = k | Zi = j), and (c) the condi-
tional densities φi (yi | j) = Pr(Yi = yi | Zi = j). The dependence of the
transition probability pijk on i allows the chain to be inhomogeneous over
time and promotes greater flexibility in modeling. Implicit in the definition
of φi (yi | j) are the assumptions that Y1 , . . . , Yn are independent given
Z1 , . . . , Zn and that Yi depends only on Zi . For simplicity, we will assume


that the Yi are discretely distributed.
Baum’s forward algorithm is based on recursively evaluating the joint
probabilities
αi (j) = Pr(Y1 = y1 , . . . , Yi−1 = yi−1 , Zi = j).
At the first epoch, α1 (j) = πj by definition. The obvious update to αi (j) is

αi+1 (k) = Σ_j αi (j)φi (yi | j)pijk .    (9.12)

The likelihood (9.11) can be recovered by computing the sum

P = Σ_j αn (j)φn (yn | j)

at the final epoch n.


In Baum’s backward algorithm, we recursively evaluate the conditional
probabilities
βi (k) = Pr(Yi+1 = yi+1 , . . . , Yn = yn | Zi = k),
starting by convention at βn (k) = 1 for all k. The required update is clearly

βi (j) = Σ_k pijk φi+1 (yi+1 | k)βi+1 (k).    (9.13)

In this instance, the likelihood is recovered at the first epoch by forming
the sum P = Σ_j πj φ1 (y1 | j)β1 (j).
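In code, the two recursions are only a few lines each. The sketch below assumes a time-homogeneous chain with transition matrix P_trans and an array phi with phi[i, j] = φi (yi | j); the names are illustrative, and the scaling needed to avoid numerical underflow on long sequences is omitted for clarity.

    import numpy as np

    def forward_backward(pi, P_trans, phi):
        # Baum's forward and backward recursions for epochs i = 0, ..., n-1.
        n, s = phi.shape
        alpha = np.zeros((n, s))
        beta = np.ones((n, s))
        alpha[0] = pi                        # alpha_1(j) = pi_j
        for i in range(n - 1):               # forward update (9.12)
            alpha[i + 1] = (alpha[i] * phi[i]) @ P_trans
        for i in range(n - 2, -1, -1):       # backward update (9.13)
            beta[i] = P_trans @ (phi[i + 1] * beta[i + 1])
        P = np.sum(alpha[-1] * phi[-1])      # likelihood of the observed data
        return alpha, beta, P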
Baum’s algorithms also interdigitate beautifully with the E step of the
EM algorithm. It is natural to summarize the missing data by a collection
of indicator random variables Xij . If the chain occupies state j at epoch i,
then we take Xij = 1. Otherwise, we take Xij = 0. In this notation, the
complete data loglikelihood can be written as
Lcom (θ) = Σ_j X1j ln πj + Σ_{i=1}^n Σ_j Xij ln φi (Yi | j) + Σ_{i=1}^{n−1} Σ_j Σ_k Xij Xi+1,k ln pijk .

Execution of the E step amounts to calculation of the conditional expecta-


tions

E(Xij Xi+1,k | Y , θ m ) = αi (j)φi (yi | j)pijk φi+1 (yi+1 | k)βi+1 (k)/P
E(Xij | Y , θ m ) = αi (j)φi (yi | j)βi (j)/P,

with the right-hand sides evaluated at θ = θ m ,

where Y = y is the observed data, P is the likelihood of the observed data,


and θ m is the current parameter vector.
The M step may or may not be exactly solvable. If it is not, then one can
always revert to the MM gradient algorithm discussed in Sect. 10.4. In the
case of hidden multinomial trials, it is possible to carry out the M step
analytically. Hidden multinomial trials may govern (a) the choice of the
initial state j, (b) the choice of an observed outcome Yi at the ith epoch
given the hidden state j of the chain at that epoch, or (c) the choice of
the next state k given the current state j in a time-homogeneous chain.
In the first case, the multinomial parameters are the πj ; in the last case,
they are the common transition probabilities pjk .
As a concrete example, consider estimation of the initial distribution π
at the first epoch of the chain. For estimation to be accurate, there must
be multiple independent runs of the chain. Let the superscript r index the
various runs. The surrogate function delivered by the E step equals

Q(π | π m ) = Σ_r Σ_j E(X^r_{1j} | Y^r = y^r , π m ) ln πj

up to an additive constant. Maximizing Q(π | π m ) subject to the constraints
Σ_j πj = 1 and πj ≥ 0 for all j is done as in Example 1.4.2. The
resulting EM updates

πm+1,j = (1/R) Σ_r E(X^r_{1j} | Y^r = y^r , π m )
for R runs can be interpreted as multinomial proportions with fractional
category counts. Problem 24 asks the reader to derive the EM algorithm for
estimating time homogeneous transition probabilities. Problem 25 covers
estimation of the parameters of the conditional densities φi (yi | j) for some
common densities.

9.9 Problems
1. Code and test any of the algorithms discussed in the text or problems
of this chapter.
2. The entropy of a probability density p(x) on Rn is defined by

−∫ p(x) ln p(x) dx.    (9.14)

Among all densities with a fixed mean vector µ = ∫ x p(x) dx and
variance matrix Ω = ∫ (x − µ)(x − µ)∗ p(x) dx, prove that the multi-
variate normal has maximum entropy. (Hints: Apply Proposition 9.2.1
and formula (9.9).)

3. In statistical mechanics, entropy is employed to characterize the sta-


tionary distribution of many independently behaving particles. Let
p(x) be the probability density that a particle is found at position x
in phase space Rn , and suppose that each position x is assigned an
energy u(x). If the average energy U = ∫ u(x)p(x) dx per particle is
fixed, then Nature chooses p(x) to maximize entropy as defined in
equation (9.14). Show that if constants α and β exist satisfying

∫ αe^{βu(x)} dx = 1   and   ∫ u(x)αe^{βu(x)} dx = U,

then p(x) = αe^{βu(x)} does indeed maximize entropy subject to the av-
erage energy constraint. The density p(x) is the celebrated Maxwell-
Boltzmann density.

4. Show that the normal, Poisson, binomial, negative binomial, gamma,


beta, and multinomial families are exponential by writing their den-
sities in the form (9.1). What are the corresponding measure and
sufficient statistic in each case?

5. In the EM algorithm [65], suppose that the complete data X possesses


a regular exponential density

f (x | θ) = g(x)e^{β(θ)+h(x)∗θ}

relative to some measure ν. Prove that the unconditional mean of the


sufficient statistic h(X) is given by the negative gradient −∇β(θ) and
that the EM update is characterized by the condition

E[h(X) | Y, θn ] = −∇β(θ n+1 ).

6. Suppose the phenotypic counts in the ABO allele frequency estima-


tion example satisfy nA + nAB > 0, nB + nAB > 0, and nO > 0. Show
that the loglikelihood is strictly concave and possesses a single global
maximum on the interior of the feasible region.

7. In a genetic linkage experiment, 197 animals are randomly assigned


to four categories according to the multinomial distribution with cell
probabilities π1 = 1/2 + θ/4, π2 = (1 − θ)/4, π3 = (1 − θ)/4, and π4 = θ/4. If the
corresponding observations are

y = (y1 , y2 , y3 , y4 )∗ = (125, 18, 20, 34)∗,

then devise an EM algorithm and use it to estimate θ̂ = .6268 [218].


(Hint: Split the first category into two so that there are five categories
for the complete data.)

8. Derive the EM algorithm solving Problem 7 as an MM algorithm. No


mention of missing data is necessary.
9. Consider the data from The London Times [259] during the years
1910–1912 given in Table 9.1. The two columns labeled “Deaths i”
refer to the number of deaths to women 80 years and older reported
by day. The columns labeled “Frequency ni ” refer to the number of
days with i deaths. A Poisson distribution gives a poor fit to these
data, possibly because of different patterns of deaths in winter and
summer. A mixture of two Poissons provides a much better fit. Under
the Poisson admixture model, the likelihood of the observed data is
Π_{i=0}^9 [αe^{−µ1 } µ1^i /i! + (1 − α)e^{−µ2 } µ2^i /i!]^{ni },

where α is the admixture parameter and µ1 and µ2 are the means of


the two Poisson distributions.

TABLE 9.1. Death notices from The London Times


Deaths i Frequency ni Deaths i Frequency ni
0 162 5 61
1 267 6 27
2 271 7 8
3 185 8 3
4 111 9 1

Formulate an EM algorithm for this model. Let θ = (α, µ1 , µ2 )∗ and


zi (θ) = αe^{−µ1 } µ1^i / [αe^{−µ1 } µ1^i + (1 − α)e^{−µ2 } µ2^i ]

be the posterior probability that a day with i deaths belongs to Pois-
son population 1. Show that the EM algorithm is given by

αm+1 = Σ_i ni zi (θ m ) / Σ_i ni
µm+1,1 = Σ_i ni i zi (θ m ) / Σ_i ni zi (θ m )
µm+1,2 = Σ_i ni i [1 − zi (θ m )] / Σ_i ni [1 − zi (θ m )].

From the initial estimates α0 = 0.3, µ01 = 1.0, and µ02 = 2.5, compute
via the EM algorithm the maximum likelihood estimates α̂ = 0.3599,
µ̂1 = 1.2561, and µ̂2 = 2.6634. Note how slowly the EM algorithm
converges in this example.
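A short Python sketch of these updates, with the Table 9.1 counts typed in directly, is one way to check the quoted estimates; the loop is run for a fixed number of iterations rather than to a formal convergence tolerance.

    import numpy as np
    from math import factorial

    deaths = np.arange(10)
    freq = np.array([162, 267, 271, 185, 111, 61, 27, 8, 3, 1])
    fact = np.array([float(factorial(i)) for i in range(10)])

    def poisson_mixture_em(alpha, mu1, mu2, iters=5000):
        for _ in range(iters):
            f1 = alpha * np.exp(-mu1) * mu1**deaths / fact
            f2 = (1 - alpha) * np.exp(-mu2) * mu2**deaths / fact
            z = f1 / (f1 + f2)          # E step: posterior of population 1
            alpha = np.sum(freq * z) / np.sum(freq)
            mu1 = np.sum(freq * deaths * z) / np.sum(freq * z)
            mu2 = np.sum(freq * deaths * (1 - z)) / np.sum(freq * (1 - z))
        return alpha, mu1, mu2

    print(poisson_mixture_em(0.3, 1.0, 2.5))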

10. Derive the least squares algorithm (8.19) as an EM algorithm [112].


(Hint: Decompose yi as the sum Σ_{j=1}^q yij of realizations from inde-
pendent normal deviates with means xij θj and variances 1/q.)

11. Let x1 , . . . , xm be an i.i.d. sample from a normal density with mean


µ and variance σ 2 . Suppose for each xi we observe yi = |xi | rather
than xi . Formulate an EM algorithm for estimating µ and σ 2 , and
show that its updates are

µn+1 = (1/m) Σ_{i=1}^m (wni1 yi − wni2 yi )
σ²n+1 = (1/m) Σ_{i=1}^m [wni1 (yi − µn+1 )² + wni2 (−yi − µn+1 )²]

with weights

wni1 = f (yi | θ n ) / [f (yi | θ n ) + f (−yi | θ n )]
wni2 = f (−yi | θ n ) / [f (yi | θ n ) + f (−yi | θ n )],

where f (x | θ) is the normal density with θ = (µ, σ 2 )∗ . Demonstrate


that the modes of the likelihood of the observed data come in sym-
metric pairs differing only in the sign of µ. This fact does not prevent
accurate estimation of |µ| and σ 2 .
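A minimal sketch of these updates in Python, assuming the data are supplied as an array y of absolute values; the normal density constants cancel in the weights, so only the exponential kernels are needed.

    import numpy as np

    def folded_normal_em(y, mu, sigma2, iters=200):
        # EM for mu and sigma^2 when only y_i = |x_i| is observed.
        y = np.asarray(y, dtype=float)
        m = len(y)
        for _ in range(iters):
            f_pos = np.exp(-(y - mu)**2 / (2 * sigma2))
            f_neg = np.exp(-(-y - mu)**2 / (2 * sigma2))
            w1 = f_pos / (f_pos + f_neg)    # weight that x_i = +y_i
            w2 = 1 - w1                     # weight that x_i = -y_i
            mu = np.sum(w1 * y - w2 * y) / m
            sigma2 = np.sum(w1 * (y - mu)**2 + w2 * (-y - mu)**2) / m
        return mu, sigma2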

12. Consider an i.i.d. sample drawn from a bivariate normal distribution


with mean vector µ = (µ1 , µ2 )∗ and variance matrix

Ω  =  [ σ1²   σ12 ]
      [ σ12   σ2² ] .

Suppose through some random accident that the first p observations


are missing their first component, the next q observations are miss-
ing their second component, and the last r observations are com-
plete. Design an EM algorithm to estimate the five mean and vari-
ance parameters, taking as complete data the original data before the
accidental loss.

13. The standard linear regression model can be written in matrix


notation as X = Aβ + U . Here X is the r × 1 vector of responses,
A is the r × s design matrix, β is the s × 1 vector of regression co-
efficients, and U is the r × 1 normally distributed error vector with
mean 0 and variance σ 2 I. The responses are right censored if for each
i there is a constant ci such that only Yi = min{ci , Xi } is observed.

The EM algorithm offers a vehicle for estimating the parameter vec-


tor θ = (β ∗ , σ 2 )∗ in the presence of right censoring [65, 251]. Show
that

β n+1 = (A∗ A)−1 A∗ E(X | Y , θ n )
σ²n+1 = (1/r) E[(X − Aβ n+1 )∗ (X − Aβ n+1 ) | Y , θ n ].
To compute the conditional expectations appearing in these formulas,
let ai be the ith row of A and define
H(v) = [(2π)^{−1/2} e^{−v²/2}] / [∫_v^∞ (2π)^{−1/2} e^{−w²/2} dw].

For a censored observation yi = ci < ∞, prove that

E(Xi | Yi = ci , θ n ) = ai β n + σn H((ci − ai β n )/σn )
E(Xi² | Yi = ci , θ n ) = (ai β n )² + σn² + σn (ci + ai β n )H((ci − ai β n )/σn ).

Use these formulas to complete the specification of the EM algorithm.
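A sketch of the two conditional moments in Python, using math.erfc for the standard normal upper tail; the argument names are illustrative (mean stands for ai β n and sigma for σn ).

    import math

    def hazard(v):
        # H(v): standard normal density divided by its upper tail probability
        pdf = math.exp(-0.5 * v * v) / math.sqrt(2 * math.pi)
        tail = 0.5 * math.erfc(v / math.sqrt(2))
        return pdf / tail

    def censored_moments(c, mean, sigma):
        # E(X | Y = c) and E(X^2 | Y = c) for a right-censored normal response
        h = hazard((c - mean) / sigma)
        ex = mean + sigma * h
        ex2 = mean**2 + sigma**2 + sigma * (c + mean) * h
        return ex, ex2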

14. In the transmission tomography model it is possible to approximate


the solution of equation (9.8) to good accuracy in certain situations.
Verify the expansion

1/(e^s − 1) = 1/s − 1/2 + s/12 + O(s²).

Using the approximation 1/(e^s − 1) ≈ 1/s − 1/2 for s = lij θj , show
that

θn+1,j = Σ_i (Mij − Nij ) / [(1/2) Σ_i (Mij + Nij )lij ]

results. Can you motivate this result heuristically?

15. Suppose that the complete data in the EM algorithm involve N bi-
nomial trials with success probability θ per trial. Here N can be
random or fixed. If M trials result in success, then the complete data
likelihood can be written as θM (1 − θ)N −M c, where c is an irrelevant
constant. The E step of the EM algorithm amounts to forming

Q(θ | θn ) = E(M | Y , θn ) ln θ + E(N − M | Y , θn ) ln(1 − θ) + ln c.



The binomial trials are hidden because only a function Y of them


is directly observed. The brief derivation in Sect. 9.8 shows that the
EM update amounts to
E(M | Y , θn )
θn+1 = .
E(N | Y , θn )
Prove that this is equivalent to the update
θn (1 − θn ) d
θn+1 = θn + L(θn ),
E(N | Y , θn ) dθ
where L(θ) is the loglikelihood of the observed data Y [270]. (Hint:
Apply identity (8.4) of Chap. 8.)
16. As an example of hidden binomial trials, consider a random sample
of twin pairs. Let u of these pairs consist of male pairs, v consist
of female pairs, and w consist of opposite sex pairs. A simple model
to explain these data involves a random Bernoulli choice for each
pair dictating whether it consists of identical or nonidentical twins.
Suppose that identical twins occur with probability p and noniden-
tical twins with probability 1 − p. Once the decision is made as to
whether the twins are identical, then sexes are assigned to the twins.
If the twins are identical, one assignment of sex is made. If the twins
are nonidentical, then two independent assignments of sex are made.
Suppose boys are chosen with probability q and girls with probabil-
ity 1 − q. Model these data as hidden binomial trials. Derive the EM
algorithm for estimating p and q.
17. Chun Li has derived an EM update for hidden multinomial trials. Let
N denote the number of hidden trials, θi the probability of outcome
i of k possible outcomes, and L(θ) the loglikelihood of the observed
data Y . Derive the EM update

θn+1,i = θni + [θni / E(N | Y , θ n )] [∂L(θ n )/∂θi − Σ_{j=1}^k θnj ∂L(θ n )/∂θj ]

following the reasoning of Problem 15.


18. In this problem you are asked to formulate models for hidden Poisson
and exponential trials [270]. If the number of trials is N and the mean
per trial is θ, then show that the EM update in the Poisson case is

θn+1 = θn + [θn / E(N | Y , θn )] (d/dθ)L(θn )

and in the exponential case is

θn+1 = θn + [θn² / E(N | Y , θn )] (d/dθ)L(θn ),

where L(θ) is the loglikelihood of the observed data Y .

19. Suppose light bulbs have an exponential lifetime with mean θ. Two
experiments are conducted. In the first, the lifetimes y1 , . . . , ym of m
independent bulbs are observed. In the second, p independent bulbs
are observed to burn out before time t, and q independent bulbs are
observed to burn out after time t. In other words, the lifetimes in the
second experiment are both left and right censored. Construct an EM
algorithm for finding the maximum likelihood estimate of θ [95].

20. In many discrete probability models, only data with positive counts
are observed. Counts that are 0 are missing. Show that the likelihoods
for the binomial, Poisson, and negative binomial models truncated at
0 amount to
L1 (p) = Π_i (mi choose xi ) p^{xi} (1 − p)^{mi −xi} / [1 − (1 − p)^{mi}]
L2 (λ) = Π_i λ^{xi} e^{−λ} / [xi ! (1 − e^{−λ})]
L3 (p) = Π_i (mi + xi − 1 choose xi ) (1 − p)^{xi} p^{mi} / (1 − p^{mi}).

For observation i of the binomial model, there are xi successes out


of mi trials with success probability p per trial. For observation i of
the negative binomial model, there are xi failures before mi required
successes. For each model, devise an EM algorithm that fills in the
missing observations by imputing a geometrically distributed number
of truncated observations for every real observation. Show that the
EM updates reduce to

pn+1 = Σ_i xi / Σ_i [mi /(1 − (1 − pn )^{mi})]
λn+1 = Σ_i xi / Σ_i [1/(1 − e^{−λn })]
pn+1 = Σ_i [mi /(1 − pn^{mi})] / Σ_i [xi + mi /(1 − pn^{mi})]

for the three models.
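As a sanity check on the middle formula, the zero-truncated Poisson update can be coded in a few lines; xs is assumed to hold the observed positive counts.

    import numpy as np

    def truncated_poisson_em(xs, lam=1.0, iters=500):
        # EM for the Poisson model truncated at zero: each real observation
        # imputes a geometric number of unseen zero counts, so the expected
        # total number of observations is len(xs) / (1 - exp(-lam)).
        xs = np.asarray(xs, dtype=float)
        for _ in range(iters):
            lam = np.sum(xs) / (len(xs) / (1 - np.exp(-lam)))
        return lam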

21. Demonstrate that the EM updates of the previous problem can be


derived as MM updates based on the minorization
− ln(1 − u) ≥ − ln(1 − un ) + [un /(1 − un )] ln(u/un )

for u and un in the interval (0, 1). Prove this minorization first. (Hint:
If you rearrange the minorization, then Proposition 9.2.1 applies.)

22. Suppose that Σ is a positive definite matrix. Prove that the matrix
I − F ∗ (F F ∗ + Σ)−1 F is also positive definite. This result is used
in the derivation of the EM algorithm in Sect. 9.7. (Hints: For read-
ers familiar with the sweep operator of computational statistics, the
simplest proof relies on applying Propositions 7.5.2 and 7.5.3 of the
reference [166].)

23. A certain company asks consumers to rate movies on an integer scale


from 1 to 5. Let Mi be the set of movies rated by person i. Denote
the cardinality of Mi by |Mi |. Each rater does so in one of two modes
that we will call “quirky” and “consensus”. In quirky mode, i has
a private rating distribution (qi1 , qi2 , qi3 , qi4 , qi5 ) that applies to ev-
ery movie regardless of its intrinsic merit. In consensus mode, rater
i rates movie j according to the distribution (cj1 , cj2 , cj3 , cj4 , cj5 )
shared with all other raters in consensus mode. For every movie i
rates, he or she makes a quirky decision with probability πi and a
consensus decision with probability 1 − πi . These decisions are made
independently across raters and movies. If xij is the rating given to
movie j by rater i, then prove that the likelihood of the data is

L = Π_i Π_{j∈Mi} [πi qi,xij + (1 − πi )cj,xij ].

Once we estimate the parameters, we can rank the reliability of rater


i by the estimate π̂i and the popularity of movie j by its estimated
average rating Σ_k k ĉjk .
If we choose the natural course of estimating the parameters by maxi-
mum likelihood, then it is possible to derive an EM or MM algorithm.
From the right perspectives, these two algorithms coincide. Let n de-
note iteration number and wnij the weight
wnij = πni qni,xij / [πni qni,xij + (1 − πni )cnj,xij ].

Derive either algorithm and show that it updates the parameters by

πn+1,i = (1/|Mi |) Σ_{j∈Mi} wnij
qn+1,ix = Σ_{j∈Mi} 1{xij =x} wnij / Σ_{j∈Mi} wnij
cn+1,jx = Σ_i 1{xij =x} (1 − wnij ) / Σ_i (1 − wnij ).

These updates are easy to implement. Can you motivate them as


ratios of expected counts?
24. In the hidden Markov chain model, suppose that the chain is time
homogeneous with transition probabilities pjk . Derive an EM algo-
rithm for estimating the pjk from one or more independent runs of
the chain.
25. In the hidden Markov chain model, consider estimation of the pa-
rameters of the conditional densities φi (yi | j) of the observed data
y1 , . . . , yn . When Yi given Zi = j is Poisson distributed with mean
µj , show that the EM algorithm updates µj by
µm+1,j = Σ_{i=1}^n wmij yi / Σ_{i=1}^n wmij ,

where the weight wmij = E(Xij | Y, µm ). Show that the same update
applies when Yi given Zi = j is exponentially distributed with mean
µj or normally distributed with mean µj and common variance σ 2 .
In the latter setting, demonstrate that the EM update of σ 2 is
σ²m+1 = Σ_{i=1}^n Σ_j wmij (yi − µm+1,j )² / Σ_{i=1}^n Σ_j wmij .
