
9  The EM Algorithm

9.1 Introduction
Maximum likelihood is the dominant form of estimation in applied
statistics. Because closed-form solutions to likelihood equations are the
exception rather than the rule, numerical methods for finding maximum
likelihood estimates are of paramount importance. In this chapter we study
maximum likelihood estimation by the EM algorithm [65, 179, 191], a spe-
cial case of the MM algorithm. At the heart of every EM algorithm is some
notion of missing data. Data can be missing in the ordinary sense of a
failure to record certain observations on certain cases. Data can also be
missing in a theoretical sense. We can think of the E (expectation) step of
the algorithm as filling in the missing data. This action replaces the log-
likelihood of the observed data by a minorizing function. This surrogate
function is then maximized in the M step. Because the surrogate function
is usually much simpler than the likelihood, we can often solve the M step
analytically. The price we pay for this simplification is that the EM algo-
rithm is iterative. Reconstructing the missing data is bound to be slightly
wrong if the parameters do not already equal their maximum likelihood
estimates.
One of the advantages of the EM algorithm is its numerical stability.
As an MM algorithm, any EM algorithm leads to a steady increase in
the likelihood of the observed data. Thus, the EM algorithm avoids wildly
overshooting or undershooting the maximum of the likelihood along its
current direction of search. Besides this desirable feature, the EM handles
parameter constraints gracefully. Constraint satisfaction is by definition
built into the solution of the M step. In contrast, competing methods of
maximization must incorporate special techniques to cope with parame-
ter constraints. The EM shares some of the negative features of the more
general MM algorithm. For example, the EM algorithm often converges
at an excruciatingly slow rate in a neighborhood of the maximum point.
This rate directly reflects the amount of missing data in a problem. In the
absence of concavity, there is also no guarantee that the EM algorithm
will converge to the global maximum. The global maximum can usually be
reached by starting the parameters at good but suboptimal estimates such
as method-of-moments estimates or by choosing multiple random starting
points.

9.2 Definition of the EM Algorithm


A sharp distinction is drawn in the EM algorithm between the observed,
incomplete data y and the unobserved, complete data x of a statistical
experiment [65, 179, 251]. Some function t(x) = y collapses x onto y.
For instance, if we represent x as (y, z), with z as the missing data, then
t is simply projection onto the y-component of x. It should be stressed
that the missing data can consist of more than just observations missing
in the ordinary sense. In fact, the definition of x is left up to the intuition
and cleverness of the statistician. The general idea is to choose x so that
maximum likelihood estimation becomes trivial for the complete data.
The complete data are assumed to have a probability density f (x | θ)
that is a function of a parameter vector θ as well as of x. In the E step of
the EM algorithm, we calculate the conditional expectation

Q(θ | θn ) = E[ln f (X | θ) | Y = y, θ n ].

Here θ n is the current estimated value of θ, upper case letters indicate


random vectors, and lower case letters indicate corresponding realizations
of these random vectors. In the M step, we maximize Q(θ | θ n ) with
respect to θ. This yields the new parameter estimate θ n+1 , and we repeat
this two-step process until convergence occurs. Note that θ and θ n play
fundamentally different roles in Q(θ | θ n ).
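The two steps translate directly into a simple computational loop. The following Python sketch is purely illustrative: e_step and m_step are hypothetical user-supplied functions returning the conditional expectations that define Q(θ | θ n ) and the maximizer of that surrogate, and the stopping rule on successive iterates is only one of several reasonable choices.

    import numpy as np

    def em(theta0, e_step, m_step, tol=1e-8, max_iter=1000):
        # Generic EM iteration: alternate E and M steps until the
        # parameter vector stabilizes.
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            stats = e_step(theta)       # E step: expectations given theta_n
            theta_new = m_step(stats)   # M step: maximize Q(theta | theta_n)
            if np.linalg.norm(theta_new - theta) < tol:
                break
            theta = theta_new
        return theta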
If ln g(y | θ) denotes the loglikelihood of the observed data, then the EM
algorithm enjoys the ascent property

ln g(y | θ n+1 ) ≥ ln g(y | θn ).

Proof of this assertion unfortunately involves measure theory, so some read-


ers may want to take it on faith and skip the rest of this section. A necessary
preliminary is the following well-known inequality from statistics.

Proposition 9.2.1 (Information Inequality) Let h and k be probabil-


ity densities with respect to a measure µ. Suppose h > 0 and k > 0 almost
everywhere relative to µ. If Eh denotes expectation with respect to the prob-
ability measure hdµ, then Eh (ln h) ≥ Eh (ln k), with equality if and only if
h = k almost everywhere relative to µ.
Proof: Because − ln(w) is a strictly convex function on (0, ∞), Proposi-
tion 6.6.1 applied to the random variable k/h yields

Eh (ln h) − Eh (ln k) = Eh [− ln(k/h)]
                      ≥ − ln Eh (k/h)
                      = − ln ∫ (k/h) h dµ
                      = − ln ∫ k dµ
                      = 0.

Equality holds if and only if k/h = Eh (k/h) almost everywhere relative


to µ. This necessary and sufficient condition is equivalent to h = k since
Eh (k/h) = 1.
To prove the ascent property of the EM algorithm, it suffices to demon-
strate the minorization inequality

ln g(y | θ) ≥ Q(θ | θ n ) + ln g(y | θ n ) − Q(θn | θ n ),

where Q(θ | θ n ) = E[ln f (X | θ) | Y = y, θ n ]. With this end in mind,


note that both f (x | θ)/g(y | θ) and f (x | θ n )/g(y | θ n ) are conditional
densities of X on {x : t(x) = y} with respect to some measure µy . The
information inequality now indicates that

Q(θ | θ n ) − ln g(y | θ) = E[ln{f (X | θ)/g(Y | θ)} | Y = y, θ n ]
                          ≤ E[ln{f (X | θ n )/g(Y | θ n )} | Y = y, θ n ]
                          = Q(θ n | θ n ) − ln g(y | θ n ).

Maximizing Q(θ | θ n ) therefore drives ln g(y | θ) uphill. The ascent in-


equality is strict whenever the conditional density f (x | θ)/g(y | θ) differs
at the parameter points θn and θ n+1 or

Q(θ n+1 | θ n ) > Q(θn | θ n ).

The preceding proof is a little vague as to the meaning of the conditional
density f (x | θ)/g(y | θ) and its associated measure µy . Commonly the
complete data decomposes as x = (y, z), where z is considered the missing


data and t(y, z) = y is projection onto the observed data. Suppose (y, z)
has joint density f (y, z | θ) relative to a product measure ω × µ in (y, z); ω and
µ are typically Lebesgue measure or counting measure. In this framework,
we define g(y | θ) = ∫ f (y, z | θ) dµ(z) and set µy = µ. The function
g(y | θ) serves as a density relative to ω. To check that these definitions
make sense, it suffices to prove that ∫ h(y, z)[f (y, z | θ)/g(y | θ)] dµ(z)
is a version of the conditional expectation E[h(Y , Z) | Y = y] for every
well-behaved function h(y, z). This assertion can be verified by showing
E{1S (Y ) E[h(Y , Z) | Y ]} = E[1S (Y )h(Y , Z)]
for every measurable set S. With
E[h(Y , Z) | Y = y] = ∫ h(y, z) [f (y, z | θ)/g(y | θ)] dµ(z),

we calculate

E{1S (Y ) E[h(Y , Z) | Y = y]} = ∫_S ∫ h(y, z) [f (y, z | θ)/g(y | θ)] dµ(z) g(y | θ) dω(y)
                              = ∫_S ∫ h(y, z) f (y, z | θ) dµ(z) dω(y)
                              = E[1S (Y )h(Y , Z)].
Hence in this situation, f (x | θ)/g(y | θ) is indeed the conditional density
of X given Y = y.

9.3 Missing Data in the Ordinary Sense


The most common application of the EM algorithm is to data missing in
the ordinary sense. For example, Problem 40 of Chap. 8 considers a bal-
anced ANOVA model with two factors. Missing observations in this setting
break the symmetry that permits explicit solution of the likelihood equa-
tions. Thus, there is ample incentive for filling in the missing observations.
If the observations follow an exponential model, and missing data are miss-
ing completely at random, then the EM algorithm replaces the sufficient
statistic of each missing observation by its expected value.
The density of a random variable Y from an exponential family can be
written as

f (y | θ) = g(y)e^{β(θ)+h(y)∗γ(θ)}    (9.1)
relative to some measure ν [73, 218]. The normal, Poisson, binomial, nega-
tive binomial, gamma, beta, and multinomial families are prime examples
of exponential families. The function h(y) in equation (9.1) is the sufficient
statistic. The maximum likelihood estimate of the parameter vector θ de-
pends on an observation y only through h(y). Predictors of y are incorpo-
rated into the functions β(θ) and γ(θ).
To fill in a missing observation y, we take the ordinary expectation

E[ln f (Y | θ) | θ n ] = E[ln g(Y ) | θ n ] + β(θ) + E[h(Y ) | θn ]∗ γ(θ)

of the complete data loglikelihood. This function is added to the loglikeli-


hood of the regular observations y1 , . . . , ym to generate the surrogate func-
tion Q(θ | θ n ). For example, if a typical observation is normally distributed
with mean µ(α) and variance σ 2 , then θ is the vector (α∗ , σ 2 )∗ and
E[ln f (Y | θ) | θ n ] = ln[1/√(2πσ²)] − (1/(2σ²)) E{[Y − µ(α)]² | θ n }
                      = ln[1/√(2πσ²)] − (1/(2σ²)) {σn² + [µ(αn ) − µ(α)]²}.
Once we have filled in the missing data, we can estimate α without refer-
ence to σ 2 . This is accomplished by adding each square [µi (αn ) − µi (α)]2
corresponding to a missing observation yi to the sum of squares for the ac-
tual observations and then minimizing the entire sum over α. In classical
models such as balanced ANOVA, the M step is exact. Once the iterative
limit limn→∞ αn = α̂ is reached, we can estimate σ 2 in one step by the
formula
σ̂² = (1/m) Σ_{i=1}^m [yi − µi (α̂)]²

using only the observed yi . The reader is urged to work Problem 40 of


Chap. 8 to see the whole process in action.

9.4 Allele Frequency Estimation


It is instructive to compare the EM and MM algorithms on identical prob-
lems. Even when the two algorithms specify the same iteration scheme, the
differences in deriving the algorithms are illuminating. Consider the ABO
allele frequency estimation problem of Sect. 8.4. From the EM perspective,
the complete data x are genotype counts rather than phenotype counts y.
In passing from the complete data to the observed data, nature collapses
genotypes A/A and A/O into phenotype A and genotypes B/B and B/O
into phenotype B. In view of the Hardy-Weinberg equilibrium law, the
complete data multinomial loglikelihood becomes

ln f (X | p) = nA/A ln pA² + nA/O ln(2pA pO ) + nB/B ln pB²
             + nB/O ln(2pB pO ) + nAB ln(2pA pB ) + nO ln pO²
             + ln [n!/(nA/A ! nA/O ! nB/B ! nB/O ! nAB ! nO !)].    (9.2)
In the E step of the EM algorithm we take the expectation of ln f (X | p)
conditional on the observed counts nA , nB , nAB , and nO and the current
parameter vector pm = (pmA , pmB , pmO )∗ . This action yields the surrogate
function
Q(p | pm ) = E(nA/A | Y, pm ) ln pA² + E(nA/O | Y, pm ) ln(2pA pO )
           + E(nB/B | Y, pm ) ln pB² + E(nB/O | Y, pm ) ln(2pB pO )
           + E(nAB | Y, pm ) ln(2pA pB ) + E(nO | Y, pm ) ln pO²
           + E{ln [n!/(nA/A ! nA/O ! nB/B ! nB/O ! nAB ! nO !)] | Y, pm }.
It is obvious that
E(nAB | Y, pm ) = nAB
E(nO | Y, pm ) = nO .
Application of Bayes’ rule gives

nmA/A = E(nA/A | Y, pm ) = nA pmA²/(pmA² + 2pmA pmO )
nmA/O = E(nA/O | Y, pm ) = nA 2pmA pmO /(pmA² + 2pmA pmO ).
The conditional expectations nmB/B and nmB/O reduce to similar expres-
sions. Hence, the surrogate function Q(p | pm ) derived from the complete
data likelihood matches the surrogate function of the MM algorithm up to
a constant, and the maximization step proceeds as described earlier. One of
the advantages of the EM derivation is that it explicitly reveals the nature
of the conditional expectations nmA/A , nmA/O , nmB/B , and nmB/O .
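As a concrete illustration, here is a minimal Python sketch of the resulting iteration for the ABO problem, assuming only the phenotype counts are available; the function and variable names are illustrative, and the gene-counting M step follows from maximizing the surrogate above subject to pA + pB + pO = 1.

    def abo_em(nA, nB, nAB, nO, p=(1/3, 1/3, 1/3), iters=100):
        # EM iteration for the ABO allele frequencies (pA, pB, pO).
        pA, pB, pO = p
        n = nA + nB + nAB + nO
        for _ in range(iters):
            # E step: expected genotype counts given phenotypes and current p
            nAA = nA * pA**2 / (pA**2 + 2 * pA * pO)
            nAO = nA * 2 * pA * pO / (pA**2 + 2 * pA * pO)
            nBB = nB * pB**2 / (pB**2 + 2 * pB * pO)
            nBO = nB * 2 * pB * pO / (pB**2 + 2 * pB * pO)
            # M step: gene counting (each individual carries two genes)
            pA = (2 * nAA + nAO + nAB) / (2 * n)
            pB = (2 * nBB + nBO + nAB) / (2 * n)
            pO = (nAO + nBO + 2 * nO) / (2 * n)
        return pA, pB, pO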

9.5 Clustering by EM
The k-means clustering algorithm discussed in Example 7.2.3 makes hard
choices in cluster assignment. The alternative of soft choices is possible
with admixture models [192, 259]. An admixture probability density h(y)
can be written as a convex combination

h(y) = Σ_{j=1}^k πj hj (y),    (9.3)

where the πj are nonnegative probabilities that sum to 1 and hj (y) is


the probability density of group j. According to Bayes’ rule, the posterior
probability that an observation y belongs to group j equals the ratio

πj hj (y) / Σ_{i=1}^k πi hi (y).    (9.4)
If hard assignment is necessary, then the rational procedure is to assign y
to the group with highest posterior probability.
Suppose the observations y 1 , . . . , y m represent a random sample from
the admixture density (9.3). In practice we want to estimate the admixture
proportions and whatever further parameters θ characterize the densities
hj (y | θ). The EM algorithm is natural in this context with group mem-
bership as the missing data. If we let zij be an indicator specifying whether
observation y i comes from group j, then the complete data loglikelihood
amounts to

Σ_{i=1}^m Σ_{j=1}^k zij [ln πj + ln hj (y i | θ)].

To find the surrogate function, we must find the conditional expectation


wij of zij . But this reduces to the Bayes’ rule (9.4) with θ fixed at θ n and
π fixed at π n , where as usual n indicates iteration number. Note that the
property Σ_{j=1}^k zij = 1 entails the property Σ_{j=1}^k wij = 1.
Fortunately, the E step of the EM algorithm separates the π parameters
from the θ parameters. The problem of maximizing

Σ_{j=1}^k cj ln πj

with cj = Σ_{i=1}^m wij should be familiar by now. Since Σ_{j=1}^k cj = m,
Example 1.4.2 shows that πn+1,j = cj /m.
We now undertake estimation of the remaining parameters assuming
the groups are normally distributed with a common variance matrix Ω
but different mean vectors µ1 , . . . , µk . The pertinent part of the surrogate
function is

Σ_{i=1}^m Σ_{j=1}^k wij [−(1/2) ln det Ω − (1/2)(y i − µj )∗ Ω−1 (y i − µj )]
  = −(m/2) ln det Ω − (1/2) Σ_{j=1}^k Σ_{i=1}^m wij (y i − µj )∗ Ω−1 (y i − µj )
  = −(m/2) ln det Ω − (1/2) tr[Ω−1 Σ_{j=1}^k Σ_{i=1}^m wij (y i − µj )(y i − µj )∗ ].    (9.5)

Differentiating the surrogate (9.5) with respect to µj gives the equation


Σ_{i=1}^m wij Ω−1 (y i − µj ) = 0

with solution

µn+1,j = (Σ_{i=1}^m wij )−1 Σ_{i=1}^m wij y i .

Maximization of the surrogate (9.5) with respect to Ω can be rephrased


as maximization of

−(m/2) ln det Ω − (1/2) tr(Ω−1 M )

for the choice

M = Σ_{j=1}^k Σ_{i=1}^m wij (y i − µn+1,j )(y i − µn+1,j )∗ .

Abstractly this is just the problem we faced in Example 6.5.7. Inspection


of the arguments there shows that

Ωn+1 = (1/m) M.    (9.6)

There is no guarantee of a unique mode in this model. Fortunately,


k-means clustering generates good starting values for the parameters. The
cluster centers provide the group means. If we set wij equal to 1 or 0 de-
pending on whether observation i belongs to cluster j or not, then the
matrix (9.6) serves as an initial guess of the common variance matrix.
The initial admixture proportion πj can be taken to be the proportion of
the observations assigned to cluster j.
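A compact Python sketch of one pass of this EM clustering scheme follows; it assumes the data are stacked in an m × p array y, and the names are illustrative. The normalizing constant of the normal density is dropped in the E step because it cancels in the posterior weights.

    import numpy as np

    def admixture_em_step(y, pi, mu, Omega):
        # One EM update for a k-component normal admixture with a shared
        # covariance matrix Omega.  y: (m, p); pi: (k,); mu: (k, p).
        m, p = y.shape
        k = len(pi)
        inv = np.linalg.inv(Omega)
        logdet = np.linalg.slogdet(Omega)[1]
        logw = np.empty((m, k))
        for j in range(k):               # E step: posterior weights w[i, j]
            r = y - mu[j]
            logw[:, j] = (np.log(pi[j]) - 0.5 * logdet
                          - 0.5 * np.einsum('ip,pq,iq->i', r, inv, r))
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        pi = w.sum(axis=0) / m           # M step: closed-form updates
        mu = (w.T @ y) / w.sum(axis=0)[:, None]
        M = np.zeros((p, p))
        for j in range(k):
            r = y - mu[j]
            M += (w[:, j, None] * r).T @ r
        Omega = M / m                    # equation (9.6)
        return pi, mu, Omega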

9.6 Transmission Tomography


The EM and MM algorithms for transmission tomography differ. The MM
algorithm is easier to derive and computationally more efficient. In other
examples, the opposite is true.
In the transmission tomography example of Sect. 8.10, it is natural to
view the missing data as the number of photons Xij entering each pixel j
along each projection line i. These random variables supplemented by the
observations Yi constitute the complete data. If projection line i does not
intersect pixel j, then Xij = 0. Although Xij and Xij′ are not independent,
the collection {Xij }j indexed by projection i is independent of the collection
{Xi′j }j indexed by another projection i′. This allows us to work projection
by projection in writing the complete data likelihood. We will therefore
temporarily drop the projection subscript i and relabel pixels, starting
with pixel 1 adjacent to the source and ending with pixel m − 1 adjacent
to the detector. In this notation X1 is the number of photons leaving the
source, Xj is the number of photons entering pixel j, and Xm = Y is the
number of photons detected.
By assumption X1 follows a Poisson distribution with mean d. Condi-
tional on X1 , . . . , Xj , the random variable Xj+1 is binomially distributed
with Xj trials and success probability e−lj θj . In other words, each of the
Xj photons entering pixel j behaves independently and has a chance e−lj θj
of avoiding attenuation in pixel j. It follows that the complete data loglike-
lihood for the current projection is
−d + X1 ln d − ln X1 !
+ Σ_{j=1}^{m−1} [ln(Xj choose Xj+1 ) + Xj+1 ln e^{−lj θj } + (Xj − Xj+1 ) ln(1 − e^{−lj θj })].    (9.7)
To perform the E step of the EM algorithm, we merely need to compute
the conditional expectations E(Xj | Xm = y, θ) for 1 ≤ j ≤ m. The
conditional expectations of other terms appearing in (9.7), such as
ln(Xj choose Xj+1 ),
are irrelevant in the subsequent M step.
Reasoning as above, we infer that the unconditional mean of Xj is
µj = E(Xj ) = d e^{−Σ_{k=1}^{j−1} lk θk }

and that the distribution of Xm conditional on Xj is binomial with Xj
trials and success probability

µm /µj = e^{−Σ_{k=j}^{m−1} lk θk }.
In view of our remarks about random thinning in Chap. 8, the joint prob-
ability density of Xj and Xm therefore reduces to
Pr(Xj = xj , Xm = xm ) = e^{−µj } (µj^{xj}/xj !) (xj choose xm ) (µm /µj )^{xm} (1 − µm /µj )^{xj −xm},

and the conditional probability density of Xj given Xm becomes

Pr(Xj = xj | Xm = xm ) = [e^{−µj } (µj^{xj}/xj !) (xj choose xm ) (µm /µj )^{xm} (1 − µm /µj )^{xj −xm}] / [e^{−µm } µm^{xm}/xm !]
                       = e^{−(µj −µm )} (µj − µm )^{xj −xm} / (xj − xm )!.

In other words, conditional on Xm , the difference Xj −Xm follows a Poisson


distribution with mean µj − µm . This implies in particular that

E(Xj | Xm ) = E(Xj − Xm | Xm ) + Xm = µj − µm + Xm .

Reverting to our previous notation, it is now possible to assemble the


surrogate function Q(θ | θn ) of the E step. Define
Mij = di (e^{−Σ_{k∈Sij} lik θnk } − e^{−Σ_k lik θnk }) + yi
Nij = di (e^{−Σ_{k∈Sij∪{j}} lik θnk } − e^{−Σ_k lik θnk }) + yi ,

where Sij is the set of pixels between the source and pixel j along projec-
tion i. If j′ is the next pixel after pixel j along projection i, then

Mij = E(Xij | Yi = yi , θn )
Nij = E(Xij′ | Yi = yi , θn ).

In view of expression (9.7), we find that

Q(θ | θ n ) = Σ_i Σ_j [−Nij lij θj + (Mij − Nij ) ln(1 − e^{−lij θj })]

up to an irrelevant constant.
If we try to maximize Q(θ | θ n ) by setting its partial derivatives equal
to 0, we get for pixel j the equation
−Σ_i Nij lij + Σ_i (Mij − Nij )lij /(e^{lij θj } − 1) = 0.    (9.8)

This is an intractable transcendental equation in the single variable θj ,


and the M step must be solved numerically, say by Newton’s method. It is
straightforward to check that the left-hand side of equation (9.8) is strictly
decreasing in θj and has exactly one positive solution. Thus, the EM al-
gorithm like the MM algorithm has the advantages of decoupling the pa-
rameters in the likelihood equations and of satisfying the natural boundary
constraints θj ≥ 0. The MM algorithm is preferable to the EM algorithm
because the MM algorithm involves far fewer exponentiations in defining
its surrogate function.
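Because the M step here reduces to a one-dimensional root-finding problem for each pixel, a short Newton iteration suffices. The sketch below is one possible implementation, with illustrative arrays M, N, and l holding the values Mij, Nij, and lij over the projections i that cross pixel j.

    import numpy as np

    def m_step_pixel(M, N, l, theta, iters=10):
        # Newton's method for equation (9.8) in the single unknown theta_j.
        for _ in range(iters):
            e = np.exp(l * theta)
            f = -np.sum(N * l) + np.sum((M - N) * l / (e - 1))
            df = -np.sum((M - N) * l**2 * e / (e - 1)**2)  # strictly negative
            # damped step keeps theta positive; (9.8) has a unique positive root
            theta = max(theta - f / df, 0.5 * theta)
        return theta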

9.7 Factor Analysis


In some instances, the missing data framework of the EM algorithm offers
the easiest way to exploit convexity in deriving an MM algorithm. The com-
plete data for a given problem is often fairly natural, and the difficulty
in deriving an EM algorithm shifts toward specifying the E step. Statisti-


cians are particularly adept at calculating complicated conditional expecta-
tions connected with sampling distributions. We now illustrate these truths
for estimation in factor analysis. Factor analysis explains the covariation
among the components of a random vector by approximating the vector by
a linear transformation of a small number of uncorrelated factors. Because
factor analysis models usually involve normally distributed random vec-
tors, Appendix A.2 reviews some basic facts about the multivariate normal
distribution.
For the sake of notational convenience, we now extend the expectation
and variance operators to random vectors. The expectation of a random
vector X = (X1 , . . . , Xn )∗ is defined componentwise by

E(X) = (E[X1 ], . . . , E[Xn ])∗ .
Linearity carries over from the scalar case in the sense that
E(X + Y ) = E(X) + E(Y )
E(M X) = M E(X)
for a compatible random vector Y and a compatible matrix M . The same
componentwise conventions hold for the expectation of a random matrix
and the variances and covariances of a random vector. Thus, we can express
the variance matrix of a random vector X as
Var(X) = E{[X − E(X)][X − E(X)]∗ } = E(XX ∗ ) − E(X) E(X)∗ .
These notational choices produce many other compact formulas. For in-
stance, the random quadratic form X ∗ MX has expectation
E(X ∗ M X) = tr[M Var(X)] + E(X)∗ M E(X). (9.9)
To verify this assertion, observe that

E(X ∗ M X) = E(Σ_i Σ_j Xi mij Xj )
           = Σ_i Σ_j mij E(Xi Xj )
           = Σ_i Σ_j mij [Cov(Xi , Xj ) + E(Xi ) E(Xj )]
           = tr[M Var(X)] + E(X)∗ M E(X).
The classical factor analysis model deals with l independent multivariate
observations of the form
Yk = µ + F Xk + U k.

Here the p × q factor loading matrix F transforms the unobserved factor


score X k into the observed Y k . The random vector U k represents random
measurement error. Typically, q is much smaller than p. The random vectors
X k and U k are independent and normally distributed with means and
variances
E(X k ) = 0, Var(X k ) = I
E(U k ) = 0, Var(U k ) = D,
where I is the q × q identity matrix and D is a p × p diagonal matrix with
ith diagonal entry di . The entries of the mean vector µ, the factor loading
matrix F , and the diagonal matrix D constitute the parameters of the
model. For a particular realization y 1 , . . . , y l of the model, the maximum
likelihood estimation of µ is simply the sample mean µ̂ = ȳ. This fact is a
consequence of the reasoning given in Example 6.5.7. Therefore, we replace
each y k by y k − ȳ, assume µ = 0, and focus on estimating F and D.
The random vector (X ∗k , Y ∗k )∗ is the obvious choice of the complete data
for case k. If f (xk ) is the density of X k and g(y k | xk ) is the conditional
density of Y k given X k = xk , then the complete data loglikelihood can be
expressed as
Σ_{k=1}^l ln f (xk ) + Σ_{k=1}^l ln g(y k | xk )
  = −(l/2) ln det I − (1/2) Σ_{k=1}^l xk∗ xk − (l/2) ln det D
    − (1/2) Σ_{k=1}^l (y k − F xk )∗ D−1 (y k − F xk ).    (9.10)

We can simplify this by noting that ln det I = 0 and ln det D = Σ_{i=1}^p ln di .


The key to performing the E step is to note that (X ∗k , Y ∗k )∗ follows a
multivariate normal distribution with variance matrix
Var [ X k ]   =   [ I         F ∗       ]
    [ Y k ]       [ F         F F ∗ + D ] .
Equation (A.1) of Appendix A.2 then permits us to calculate the condi-
tional expectation
vk = E(X k | Y k = y k , F n , D n )
= F ∗n (F n F ∗n + Dn )−1 y k
and the conditional variance
Ak = Var(X k | Y k = y k , F n , Dn )
= I − F ∗n (F n F ∗n + Dn )−1 F n ,

given the observed data and the current values of the matrices F and D.
Combining these results with equation (9.9) yields

E[(Y k − F X k )∗ D−1 (Y k − F X k ) | Y k = y k ]
= tr(D −1 F Ak F ∗ ) + (y k − F v k )∗ D−1 (y k − F v k )
= tr{D−1 [F Ak F ∗ + (y k − F v k )(y k − F v k )∗ ]}.

If we define
Λ = Σ_{k=1}^l [Ak + v k v k∗ ],    Γ = Σ_{k=1}^l v k y k∗ ,    Ω = Σ_{k=1}^l y k y k∗

and take conditional expectations in equation (9.10), then we can write the
surrogate function of the E step as

Q(F , D | F n , D n ) = −(l/2) Σ_{i=1}^p ln di − (1/2) tr[D −1 (F ΛF ∗ − F Γ − Γ∗ F ∗ + Ω)],

omitting the additive constant


−(1/2) Σ_{k=1}^l E(X k∗ X k | Y k = y k , F n , D n ),

which depends on neither F nor D.


To perform the M step, we first maximize Q(F , D | F n , D n ) with respect
to F , holding D fixed. We can do so by permuting factors and completing
the square in the trace

tr[D −1 (F ΛF ∗ − F Γ − Γ∗ F ∗ + Ω)]
= tr[D −1 (F − Γ∗ Λ−1 )Λ(F − Γ∗ Λ−1 )∗ ] + tr[D −1 (Ω − Γ∗ Λ−1 Γ)]
= tr[D −1/2 (F − Γ∗ Λ−1 )Λ(F − Γ∗ Λ−1 )∗ D −1/2 ] + tr[D −1 (Ω − Γ∗ Λ−1 Γ)].

This calculation depends on the existence of the inverse matrix Λ−1 . Now
Λ is certainly positive definite if Ak is positive definite, and Problem 22
asserts that Ak is positive definite. It follows that Λ−1 not only exists but
is positive definite as well. Furthermore, the matrix
D −1/2 (F − Γ∗ Λ−1 )Λ(F − Γ∗ Λ−1 )∗ D −1/2

is positive semidefinite and has a nonnegative trace. Hence, the maximum


value of the surrogate function Q(F , D | F n , D n ) with respect to F is
attained at the point F = Γ∗ Λ−1 , regardless of the value of D. In other
words, the EM update of F is F n+1 = Γ∗ Λ−1 . It should be stressed that
Γ and Λ implicitly depend on the previous values F n and D n . Once F n+1
is determined, the equation

0 = (∂/∂di ) Q(F , D | F n , D n ) = −l/(2di ) + (1/(2di²))(F ΛF ∗ − F Γ − Γ∗ F ∗ + Ω)ii

provides the update

dn+1,i = (1/l)(F n+1 ΛF ∗n+1 − F n+1 Γ − Γ∗ F ∗n+1 + Ω)ii .
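Putting the E and M steps together, one EM sweep for factor analysis can be sketched as follows in Python; the observations are assumed to be centered and stored in an l × p array y, d holds the diagonal of D, and the variable names are illustrative only.

    import numpy as np

    def factor_em_step(y, F, d):
        # One EM update of the loadings F (p x q) and the diagonal d of D.
        l, p = y.shape
        q = F.shape[1]
        Sigma = F @ F.T + np.diag(d)         # marginal covariance F F* + D
        W = np.linalg.solve(Sigma, F)        # Sigma^{-1} F, so W.T = F* Sigma^{-1}
        v = y @ W                            # rows are the conditional means v_k
        A = np.eye(q) - F.T @ W              # common conditional variance A_k
        Lam = l * A + v.T @ v                # Lambda
        Gam = v.T @ y                        # Gamma
        Omg = y.T @ y                        # Omega
        F_new = np.linalg.solve(Lam, Gam).T  # F_{n+1} = Gamma* Lambda^{-1}
        resid = F_new @ Lam @ F_new.T - F_new @ Gam - Gam.T @ F_new.T + Omg
        d_new = np.diag(resid) / l
        return F_new, d_new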
One of the frustrating features of factor analysis is that the factor loading
matrix F is not uniquely determined. To understand the source of the
ambiguity, consider replacing F by F O, where O is a q × q orthogonal
matrix. The distribution of each random vector Y k is normal with mean
µ and variance matrix F F ∗ + D. If we substitute F O for F , then the
variance F OO ∗ F ∗ + D = F F ∗ + D remains the same. Another problem
in factor analysis is the existence of more than one local maximum. Which
one of these the EM algorithm converges to depends on its starting value
[76]. For a suggestion of how to improve the chances of converging to the
dominant mode, see the article [281].

9.8 Hidden Markov Chains


A hidden Markov chain incorporates both observed data and missing data.
The missing data are the sequence of states visited by the chain; the ob-
served data provide partial information about this sequence of states. De-
note the sequence of visited states by Z1 , . . . , Zn and the observation taken
at epoch i when the chain is in state Zi by Yi = yi . Baum’s algorithms
[8, 71] recursively compute the likelihood of the observed data

P = Pr(Y1 = y1 , . . . , Yn = yn ) (9.11)

without actually enumerating all possible realizations Z1 , . . . , Zn . Baum’s


algorithms can be adapted to perform an EM search. The references [78,
165, 216] discuss several concrete examples of hidden Markov chains.
The likelihood (9.11) is constructed from three ingredients: (a) the ini-
tial distribution π at the first epoch of the chain, (b) the epoch-dependent
transition probabilities pijk = Pr(Zi+1 = k | Zi = j), and (c) the condi-
tional densities φi (yi | j) = Pr(Yi = yi | Zi = j). The dependence of the
transition probability pijk on i allows the chain to be inhomogeneous over
time and promotes greater flexibility in modeling. Implicit in the definition
of φi (yi | j) are the assumptions that Y1 , . . . , Yn are independent given
Z1 , . . . , Zn and that Yi depends only on Zi . For simplicity, we will assume


that the Yi are discretely distributed.
Baum’s forward algorithm is based on recursively evaluating the joint
probabilities
αi (j) = Pr(Y1 = y1 , . . . , Yi−1 = yi−1 , Zi = j).
At the first epoch, α1 (j) = πj by definition. The obvious update to αi (j) is

αi+1 (k) = Σ_j αi (j)φi (yi | j)pijk .    (9.12)

The likelihood (9.11) can be recovered by computing the sum

P = Σ_j αn (j)φn (yn | j)

at the final epoch n.


In Baum’s backward algorithm, we recursively evaluate the conditional
probabilities
βi (k) = Pr(Yi+1 = yi+1 , . . . , Yn = yn | Zi = k),
starting by convention at βn (k) = 1 for all k. The required update is clearly

βi (j) = Σ_k pijk φi+1 (yi+1 | k)βi+1 (k).    (9.13)

In this instance, the likelihood is recovered at the first epoch by forming
the sum P = Σ_j πj φ1 (y1 | j)β1 (j).
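In code, the two recursions are only a few lines each. The sketch below assumes a time-homogeneous chain with transition matrix P_trans and an array phi with phi[i, j] = φi (yi | j); the names are illustrative, and the scaling needed to avoid numerical underflow on long sequences is omitted for clarity.

    import numpy as np

    def forward_backward(pi, P_trans, phi):
        # Baum's forward and backward recursions for epochs i = 0, ..., n-1.
        n, s = phi.shape
        alpha = np.zeros((n, s))
        beta = np.ones((n, s))
        alpha[0] = pi                        # alpha_1(j) = pi_j
        for i in range(n - 1):               # forward update (9.12)
            alpha[i + 1] = (alpha[i] * phi[i]) @ P_trans
        for i in range(n - 2, -1, -1):       # backward update (9.13)
            beta[i] = P_trans @ (phi[i + 1] * beta[i + 1])
        P = np.sum(alpha[-1] * phi[-1])      # likelihood of the observed data
        return alpha, beta, P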
Baum’s algorithms also interdigitate beautifully with the E step of the
EM algorithm. It is natural to summarize the missing data by a collection
of indicator random variables Xij . If the chain occupies state j at epoch i,
then we take Xij = 1. Otherwise, we take Xij = 0. In this notation, the
complete data loglikelihood can be written as
Lcom (θ) = Σ_j X1j ln πj + Σ_{i=1}^n Σ_j Xij ln φi (Yi | j) + Σ_{i=1}^{n−1} Σ_j Σ_k Xij Xi+1,k ln pijk .

Execution of the E step amounts to calculation of the conditional expecta-


tions

E(Xij Xi+1,k | Y , θ m ) = αi (j)φi (yi | j)pijk φi+1 (yi+1 | k)βi+1 (k)/P
E(Xij | Y , θ m ) = αi (j)φi (yi | j)βi (j)/P,

with the right-hand sides evaluated at θ = θ m ,

where Y = y is the observed data, P is the likelihood of the observed data,


and θ m is the current parameter vector.
The M step may or may not be exactly solvable. If it is not, then one can
always revert to the MM gradient algorithm discussed in Sect. 10.4. In the
case of hidden multinomial trials, it is possible to carry out the M step
analytically. Hidden multinomial trials may govern (a) the choice of the
initial state j, (b) the choice of an observed outcome Yi at the ith epoch
given the hidden state j of the chain at that epoch, or (c) the choice of
the next state k given the current state j in a time-homogeneous chain.
In the first case, the multinomial parameters are the πj ; in the last case,
they are the common transition probabilities pjk .
As a concrete example, consider estimation of the initial distribution π
at the first epoch of the chain. For estimation to be accurate, there must
be multiple independent runs of the chain. Let the superscript r index the
various runs. The surrogate function delivered by the E step equals

Q(π | π m ) = Σ_r Σ_j E(X^r_{1j} | Y^r = y^r , π m ) ln πj

up to an additive constant. Maximizing Q(π | π m ) subject to the constraints
Σ_j πj = 1 and πj ≥ 0 for all j is done as in Example 1.4.2. The
resulting EM updates

πm+1,j = (1/R) Σ_r E(X^r_{1j} | Y^r = y^r , π m )
for R runs can be interpreted as multinomial proportions with fractional
category counts. Problem 24 asks the reader to derive the EM algorithm for
estimating time homogeneous transition probabilities. Problem 25 covers
estimation of the parameters of the conditional densities φi (yi | j) for some
common densities.

9.9 Problems
1. Code and test any of the algorithms discussed in the text or problems
of this chapter.
2. The entropy of a probability density p(x) on Rn is defined by

−∫ p(x) ln p(x) dx.    (9.14)

Among all densities with a fixed mean vector µ = ∫ x p(x) dx and
variance matrix Ω = ∫ (x − µ)(x − µ)∗ p(x) dx, prove that the multi-
variate normal has maximum entropy. (Hints: Apply Proposition 9.2.1
and formula (9.9).)

3. In statistical mechanics, entropy is employed to characterize the sta-


tionary distribution of many independently behaving particles. Let
p(x) be the probability density that a particle is found at position x
in phase space Rn , and suppose that each position x is assigned an
energy u(x). If the average energy U = ∫ u(x)p(x) dx per particle is
fixed, then Nature chooses p(x) to maximize entropy as defined in
equation (9.14). Show that if constants α and β exist satisfying

∫ αe^{βu(x)} dx = 1   and   ∫ u(x)αe^{βu(x)} dx = U,

then p(x) = αe^{βu(x)} does indeed maximize entropy subject to the av-
erage energy constraint. The density p(x) is the celebrated Maxwell-
Boltzmann density.

4. Show that the normal, Poisson, binomial, negative binomial, gamma,


beta, and multinomial families are exponential by writing their den-
sities in the form (9.1). What are the corresponding measure and
sufficient statistic in each case?

5. In the EM algorithm [65], suppose that the complete data X possesses


a regular exponential density

f (x | θ) = g(x)e^{β(θ)+h(x)∗θ}

relative to some measure ν. Prove that the unconditional mean of the


sufficient statistic h(X) is given by the negative gradient −∇β(θ) and
that the EM update is characterized by the condition

E[h(X) | Y, θn ] = −∇β(θ n+1 ).

6. Suppose the phenotypic counts in the ABO allele frequency estima-


tion example satisfy nA + nAB > 0, nB + nAB > 0, and nO > 0. Show
that the loglikelihood is strictly concave and possesses a single global
maximum on the interior of the feasible region.

7. In a genetic linkage experiment, 197 animals are randomly assigned


to four categories according to the multinomial distribution with cell
probabilities π1 = 1/2 + θ/4, π2 = (1 − θ)/4, π3 = (1 − θ)/4, and π4 = θ/4. If the
corresponding observations are

y = (y1 , y2 , y3 , y4 )∗ = (125, 18, 20, 34)∗,

then devise an EM algorithm and use it to estimate θ̂ = .6268 [218].


(Hint: Split the first category into two so that there are five categories
for the complete data.)

8. Derive the EM algorithm solving Problem 7 as an MM algorithm. No


mention of missing data is necessary.
9. Consider the data from The London Times [259] during the years
1910–1912 given in Table 9.1. The two columns labeled “Deaths i”
refer to the number of deaths to women 80 years and older reported
by day. The columns labeled “Frequency ni ” refer to the number of
days with i deaths. A Poisson distribution gives a poor fit to these
data, possibly because of different patterns of deaths in winter and
summer. A mixture of two Poissons provides a much better fit. Under
the Poisson admixture model, the likelihood of the observed data is
Π_{i=0}^9 [αe^{−µ1 } µ1^i /i! + (1 − α)e^{−µ2 } µ2^i /i!]^{ni },

where α is the admixture parameter and µ1 and µ2 are the means of


the two Poisson distributions.

TABLE 9.1. Death notices from The London Times


Deaths i Frequency ni Deaths i Frequency ni
0 162 5 61
1 267 6 27
2 271 7 8
3 185 8 3
4 111 9 1

Formulate an EM algorithm for this model. Let θ = (α, µ1 , µ2 )∗ and


zi (θ) = αe^{−µ1 } µ1^i / [αe^{−µ1 } µ1^i + (1 − α)e^{−µ2 } µ2^i ]

be the posterior probability that a day with i deaths belongs to Pois-
son population 1. Show that the EM algorithm is given by

αm+1 = Σ_i ni zi (θ m ) / Σ_i ni
µm+1,1 = Σ_i ni i zi (θ m ) / Σ_i ni zi (θ m )
µm+1,2 = Σ_i ni i [1 − zi (θ m )] / Σ_i ni [1 − zi (θ m )].

From the initial estimates α0 = 0.3, µ01 = 1.0, and µ02 = 2.5, compute
via the EM algorithm the maximum likelihood estimates α̂ = 0.3599,
µ̂1 = 1.2561, and µ̂2 = 2.6634. Note how slowly the EM algorithm
converges in this example.
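A short Python sketch of these updates, with the Table 9.1 counts typed in directly, is one way to check the quoted estimates; the loop is run for a fixed number of iterations rather than to a formal convergence tolerance.

    import numpy as np
    from math import factorial

    deaths = np.arange(10)
    freq = np.array([162, 267, 271, 185, 111, 61, 27, 8, 3, 1])
    fact = np.array([float(factorial(i)) for i in range(10)])

    def poisson_mixture_em(alpha, mu1, mu2, iters=5000):
        for _ in range(iters):
            f1 = alpha * np.exp(-mu1) * mu1**deaths / fact
            f2 = (1 - alpha) * np.exp(-mu2) * mu2**deaths / fact
            z = f1 / (f1 + f2)          # E step: posterior of population 1
            alpha = np.sum(freq * z) / np.sum(freq)
            mu1 = np.sum(freq * deaths * z) / np.sum(freq * z)
            mu2 = np.sum(freq * deaths * (1 - z)) / np.sum(freq * (1 - z))
        return alpha, mu1, mu2

    print(poisson_mixture_em(0.3, 1.0, 2.5))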

10. Derive the least squares algorithm (8.19) as an EM algorithm [112].


(Hint: Decompose yi as the sum Σ_{j=1}^q yij of realizations from inde-
pendent normal deviates with means xij θj and variances 1/q.)

11. Let x1 , . . . , xm be an i.i.d. sample from a normal density with mean


µ and variance σ 2 . Suppose for each xi we observe yi = |xi | rather
than xi . Formulate an EM algorithm for estimating µ and σ 2 , and
show that its updates are

µn+1 = (1/m) Σ_{i=1}^m (wni1 yi − wni2 yi )
σ²n+1 = (1/m) Σ_{i=1}^m [wni1 (yi − µn+1 )² + wni2 (−yi − µn+1 )²]

with weights

wni1 = f (yi | θ n ) / [f (yi | θ n ) + f (−yi | θ n )]
wni2 = f (−yi | θ n ) / [f (yi | θ n ) + f (−yi | θ n )],

where f (x | θ) is the normal density with θ = (µ, σ 2 )∗ . Demonstrate


that the modes of the likelihood of the observed data come in sym-
metric pairs differing only in the sign of µ. This fact does not prevent
accurate estimation of |µ| and σ 2 .
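A minimal sketch of these updates in Python, assuming the data are supplied as an array y of absolute values; the normal density constants cancel in the weights, so only the exponential kernels are needed.

    import numpy as np

    def folded_normal_em(y, mu, sigma2, iters=200):
        # EM for mu and sigma^2 when only y_i = |x_i| is observed.
        y = np.asarray(y, dtype=float)
        m = len(y)
        for _ in range(iters):
            f_pos = np.exp(-(y - mu)**2 / (2 * sigma2))
            f_neg = np.exp(-(-y - mu)**2 / (2 * sigma2))
            w1 = f_pos / (f_pos + f_neg)    # weight that x_i = +y_i
            w2 = 1 - w1                     # weight that x_i = -y_i
            mu = np.sum(w1 * y - w2 * y) / m
            sigma2 = np.sum(w1 * (y - mu)**2 + w2 * (-y - mu)**2) / m
        return mu, sigma2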

12. Consider an i.i.d. sample drawn from a bivariate normal distribution


with mean vector µ = (µ1 , µ2 )∗ and variance matrix

Ω  =  [ σ1²   σ12 ]
      [ σ12   σ2² ] .

Suppose through some random accident that the first p observations


are missing their first component, the next q observations are miss-
ing their second component, and the last r observations are com-
plete. Design an EM algorithm to estimate the five mean and vari-
ance parameters, taking as complete data the original data before the
accidental loss.

13. The standard linear regression model can be written in matrix


notation as X = Aβ + U . Here X is the r × 1 vector of responses,
A is the r × s design matrix, β is the s × 1 vector of regression co-
efficients, and U is the r × 1 normally distributed error vector with
mean 0 and variance σ 2 I. The responses are right censored if for each
i there is a constant ci such that only Yi = min{ci , Xi } is observed.

The EM algorithm offers a vehicle for estimating the parameter vec-


tor θ = (β ∗ , σ 2 )∗ in the presence of right censoring [65, 251]. Show
that

β n+1 = (A∗ A)−1 A∗ E(X | Y , θ n )
σ²n+1 = (1/r) E[(X − Aβ n+1 )∗ (X − Aβ n+1 ) | Y , θ n ].
To compute the conditional expectations appearing in these formulas,
let ai be the ith row of A and define
H(v) = [(2π)^{−1/2} e^{−v²/2}] / [∫_v^∞ (2π)^{−1/2} e^{−w²/2} dw].

For a censored observation yi = ci < ∞, prove that

E(Xi | Yi = ci , θ n ) = ai β n + σn H((ci − ai β n )/σn )
E(Xi² | Yi = ci , θ n ) = (ai β n )² + σn² + σn (ci + ai β n )H((ci − ai β n )/σn ).

Use these formulas to complete the specification of the EM algorithm.
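A sketch of the two conditional moments in Python, using math.erfc for the standard normal upper tail; the argument names are illustrative (mean stands for ai β n and sigma for σn ).

    import math

    def hazard(v):
        # H(v): standard normal density divided by its upper tail probability
        pdf = math.exp(-0.5 * v * v) / math.sqrt(2 * math.pi)
        tail = 0.5 * math.erfc(v / math.sqrt(2))
        return pdf / tail

    def censored_moments(c, mean, sigma):
        # E(X | Y = c) and E(X^2 | Y = c) for a right-censored normal response
        h = hazard((c - mean) / sigma)
        ex = mean + sigma * h
        ex2 = mean**2 + sigma**2 + sigma * (c + mean) * h
        return ex, ex2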

14. In the transmission tomography model it is possible to approximate


the solution of equation (9.8) to good accuracy in certain situations.
Verify the expansion

1/(e^s − 1) = 1/s − 1/2 + s/12 + O(s²).

Using the approximation 1/(e^s − 1) ≈ 1/s − 1/2 for s = lij θj , show
that

θn+1,j = Σ_i (Mij − Nij ) / [(1/2) Σ_i (Mij + Nij )lij ]

results. Can you motivate this result heuristically?

15. Suppose that the complete data in the EM algorithm involve N bi-
nomial trials with success probability θ per trial. Here N can be
random or fixed. If M trials result in success, then the complete data
likelihood can be written as θM (1 − θ)N −M c, where c is an irrelevant
constant. The E step of the EM algorithm amounts to forming

Q(θ | θn ) = E(M | Y , θn ) ln θ + E(N − M | Y , θn ) ln(1 − θ) + ln c.



The binomial trials are hidden because only a function Y of them


is directly observed. The brief derivation in Sect. 9.8 shows that the
EM update amounts to
E(M | Y , θn )
θn+1 = .
E(N | Y , θn )
Prove that this is equivalent to the update
θn (1 − θn ) d
θn+1 = θn + L(θn ),
E(N | Y , θn ) dθ
where L(θ) is the loglikelihood of the observed data Y [270]. (Hint:
Apply identity (8.4) of Chap. 8.)
16. As an example of hidden binomial trials, consider a random sample
of twin pairs. Let u of these pairs consist of male pairs, v consist
of female pairs, and w consist of opposite sex pairs. A simple model
to explain these data involves a random Bernoulli choice for each
pair dictating whether it consists of identical or nonidentical twins.
Suppose that identical twins occur with probability p and noniden-
tical twins with probability 1 − p. Once the decision is made as to
whether the twins are identical, then sexes are assigned to the twins.
If the twins are identical, one assignment of sex is made. If the twins
are nonidentical, then two independent assignments of sex are made.
Suppose boys are chosen with probability q and girls with probabil-
ity 1 − q. Model these data as hidden binomial trials. Derive the EM
algorithm for estimating p and q.
17. Chun Li has derived an EM update for hidden multinomial trials. Let
N denote the number of hidden trials, θi the probability of outcome
i of k possible outcomes, and L(θ) the loglikelihood of the observed
data Y . Derive the EM update

θn+1,i = θni + [θni / E(N | Y , θ n )] [∂L(θ n )/∂θi − Σ_{j=1}^k θnj ∂L(θ n )/∂θj ]

following the reasoning of Problem 15.


18. In this problem you are asked to formulate models for hidden Poisson
and exponential trials [270]. If the number of trials is N and the mean
per trial is θ, then show that the EM update in the Poisson case is

θn+1 = θn + [θn / E(N | Y , θn )] (d/dθ)L(θn )

and in the exponential case is

θn+1 = θn + [θn² / E(N | Y , θn )] (d/dθ)L(θn ),

where L(θ) is the loglikelihood of the observed data Y .

19. Suppose light bulbs have an exponential lifetime with mean θ. Two
experiments are conducted. In the first, the lifetimes y1 , . . . , ym of m
independent bulbs are observed. In the second, p independent bulbs
are observed to burn out before time t, and q independent bulbs are
observed to burn out after time t. In other words, the lifetimes in the
second experiment are both left and right censored. Construct an EM
algorithm for finding the maximum likelihood estimate of θ [95].

20. In many discrete probability models, only data with positive counts
are observed. Counts that are 0 are missing. Show that the likelihoods
for the binomial, Poisson, and negative binomial models truncated at
0 amount to
L1 (p) = Π_i (mi choose xi ) p^{xi} (1 − p)^{mi −xi} / [1 − (1 − p)^{mi}]
L2 (λ) = Π_i λ^{xi} e^{−λ} / [xi ! (1 − e^{−λ})]
L3 (p) = Π_i (mi + xi − 1 choose xi ) (1 − p)^{xi} p^{mi} / (1 − p^{mi}).

For observation i of the binomial model, there are xi successes out


of mi trials with success probability p per trial. For observation i of
the negative binomial model, there are xi failures before mi required
successes. For each model, devise an EM algorithm that fills in the
missing observations by imputing a geometrically distributed number
of truncated observations for every real observation. Show that the
EM updates reduce to

pn+1 = Σ_i xi / Σ_i [mi /(1 − (1 − pn )^{mi})]
λn+1 = Σ_i xi / Σ_i [1/(1 − e^{−λn })]
pn+1 = Σ_i [mi /(1 − pn^{mi})] / Σ_i [xi + mi /(1 − pn^{mi})]

for the three models.
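As a sanity check on the middle formula, the zero-truncated Poisson update can be coded in a few lines; xs is assumed to hold the observed positive counts.

    import numpy as np

    def truncated_poisson_em(xs, lam=1.0, iters=500):
        # EM for the Poisson model truncated at zero: each real observation
        # imputes a geometric number of unseen zero counts, so the expected
        # total number of observations is len(xs) / (1 - exp(-lam)).
        xs = np.asarray(xs, dtype=float)
        for _ in range(iters):
            lam = np.sum(xs) / (len(xs) / (1 - np.exp(-lam)))
        return lam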

21. Demonstrate that the EM updates of the previous problem can be


derived as MM updates based on the minorization
− ln(1 − u) ≥ − ln(1 − un ) + [un /(1 − un )] ln(u/un )

for u and un in the interval (0, 1). Prove this minorization first. (Hint:
If you rearrange the minorization, then Proposition 9.2.1 applies.)

22. Suppose that Σ is a positive definite matrix. Prove that the matrix
I − F ∗ (F F ∗ + Σ)−1 F is also positive definite. This result is used
in the derivation of the EM algorithm in Sect. 9.7. (Hints: For read-
ers familiar with the sweep operator of computational statistics, the
simplest proof relies on applying Propositions 7.5.2 and 7.5.3 of the
reference [166].)

23. A certain company asks consumers to rate movies on an integer scale


from 1 to 5. Let Mi be the set of movies rated by person i. Denote
the cardinality of Mi by |Mi |. Each rater does so in one of two modes
that we will call “quirky” and “consensus”. In quirky mode, i has
a private rating distribution (qi1 , qi2 , qi3 , qi4 , qi5 ) that applies to ev-
ery movie regardless of its intrinsic merit. In consensus mode, rater
i rates movie j according to the distribution (cj1 , cj2 , cj3 , cj4 , cj5 )
shared with all other raters in consensus mode. For every movie i
rates, he or she makes a quirky decision with probability πi and a
consensus decision with probability 1 − πi . These decisions are made
independently across raters and movies. If xij is the rating given to
movie j by rater i, then prove that the likelihood of the data is

L = Π_i Π_{j∈Mi} [πi qi,xij + (1 − πi )cj,xij ].

Once we estimate the parameters, we can rank the reliability of rater


i by the estimate π̂i and the popularity of movie j by its estimated
average rating Σ_k k ĉjk .
If we choose the natural course of estimating the parameters by maxi-
mum likelihood, then it is possible to derive an EM or MM algorithm.
From the right perspectives, these two algorithms coincide. Let n de-
note iteration number and wnij the weight
wnij = πni qni,xij / [πni qni,xij + (1 − πni )cnj,xij ].

Derive either algorithm and show that it updates the parameters by

πn+1,i = (1/|Mi |) Σ_{j∈Mi} wnij
qn+1,ix = Σ_{j∈Mi} 1{xij =x} wnij / Σ_{j∈Mi} wnij
cn+1,jx = Σ_i 1{xij =x} (1 − wnij ) / Σ_i (1 − wnij ).

These updates are easy to implement. Can you motivate them as


ratios of expected counts?
24. In the hidden Markov chain model, suppose that the chain is time
homogeneous with transition probabilities pjk . Derive an EM algo-
rithm for estimating the pjk from one or more independent runs of
the chain.
25. In the hidden Markov chain model, consider estimation of the pa-
rameters of the conditional densities φi (yi | j) of the observed data
y1 , . . . , yn . When Yi given Zi = j is Poisson distributed with mean
µj , show that the EM algorithm updates µj by
µm+1,j = Σ_{i=1}^n wmij yi / Σ_{i=1}^n wmij ,

where the weight wmij = E(Xij | Y, µm ). Show that the same update
applies when Yi given Zi = j is exponentially distributed with mean
µj or normally distributed with mean µj and common variance σ 2 .
In the latter setting, demonstrate that the EM update of σ 2 is
σ²m+1 = Σ_{i=1}^n Σ_j wmij (yi − µm+1,j )² / Σ_{i=1}^n Σ_j wmij .
