
Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is a general algorithm for maximum-likelihood estimation where the data are "incomplete" (missing data).
Informally, the EM algorithm starts by randomly assigning values to all the parameters to be estimated. It then iteratively alternates between two steps, called the expectation step (the "E-step") and the maximization step (the "M-step"). In the E-step, it computes the expected likelihood of the complete data (the Q-function), where the expectation is taken with respect to the missing data. In the M-step, it re-estimates all the parameters by maximizing the Q-function. Once we have a new generation of parameter values, we repeat the E-step and the M-step. This process continues until the likelihood converges, i.e., reaches a local maximum.
Thus, the general procedure of the EM algorithm is the following:
1. Initialize θ0 randomly or heuristically according to any prior knowledge about where the optimal parameter value might be.
2. Iteratively improve the estimate of θ by alternating between the following two steps:
(a) The E-step (expectation): compute Q(θ; θn) = E_{Z|X, θn}[log P(X, Z | θ)].
(b) The M-step (maximization): re-estimate θ by maximizing the Q-function.
3. Stop when the likelihood L(θ) converges.
The EM algorithm is a hill-climbing approach; thus, it can only be guaranteed to reach a local maximum. When there are multiple maxima, whether we actually reach the global maximum clearly depends on where we start; if we start on the "right hill", we will be able to find the global maximum. When there are multiple maxima, it is often hard to identify the "right hill". There are two commonly used strategies to deal with this problem. The first is to try many different initial values and choose the solution that has the highest converged likelihood value. The second uses a much simpler model (ideally one with a unique global maximum) to determine an initial value for the more complex model. The idea is that a simpler model can help locate a rough region where the global optimum lies, and we then start from a value in that region to search for a more accurate optimum using the more complex model.
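As an illustration of the first strategy, the sketch below wraps a generic EM loop and reruns it from several starting points, keeping the run with the highest converged likelihood. It is a minimal sketch, not a library routine: the names run_em and em_with_restarts, and the e_step, m_step, log_lik and init_sampler callables, are placeholders to be supplied for a concrete model such as the examples that follow.

```python
import numpy as np

def run_em(x, theta0, e_step, m_step, log_lik, tol=1e-8, max_iter=500):
    """Generic EM loop: alternate E- and M-steps until the likelihood converges."""
    theta, ll_old = theta0, -np.inf
    for _ in range(max_iter):
        stats = e_step(x, theta)      # E-step: expected complete-data statistics (Q-function)
        theta = m_step(x, stats)      # M-step: parameters maximizing the Q-function
        ll = log_lik(x, theta)        # observed-data log-likelihood at the new parameters
        if abs(ll - ll_old) < tol:    # converged, i.e. reached a (local) maximum
            break
        ll_old = ll
    return theta, ll

def em_with_restarts(x, init_sampler, e_step, m_step, log_lik, n_restarts=10):
    """Strategy 1: try many initial values, keep the solution with the highest likelihood."""
    runs = (run_em(x, init_sampler(), e_step, m_step, log_lik) for _ in range(n_restarts))
    return max(runs, key=lambda result: result[1])
```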
Examples
1. Consider a set of n observations drawn from a Normal distribution with mean µ and variance 1, where one observation is missing.
The likelihood function is

$$L(\mu) = f(x_1, x_2, \ldots, x_n; \mu) = f(x_1; \mu)\,f(x_2; \mu)\cdots f(x_n; \mu) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x_i-\mu)^2} = \frac{1}{(2\pi)^{n/2}}\,e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2}$$

$$\log L(\mu) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2$$

If the missing observation is z (taken as the nth observation), the complete-data log-likelihood is

$$\log L(\mu) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n-1}(x_i-\mu)^2 - \frac{1}{2}(z-\mu)^2$$

$$\log L(\mu) = C + \mu\sum_{i=1}^{n-1}x_i - \frac{(n-1)\mu^2}{2} + \mu z - \frac{\mu^2}{2}$$

$$\log L(\mu) \propto \mu(x_1 + x_2 + \cdots + x_{n-1} + z) - \frac{n\mu^2}{2}$$

$$Q(\mu, \mu_0) = E(\log L \mid x) = \mu(x_1 + x_2 + \cdots + x_{n-1} + \mu_0) - \frac{n\mu^2}{2}$$

$$\frac{dQ(\mu, \mu_0)}{d\mu} = 0 \;\Rightarrow\; (x_1 + x_2 + \cdots + x_{n-1} + \mu_0) - n\mu = 0$$

$$\mu_1 = \frac{x_1 + x_2 + \cdots + x_{n-1} + \mu_0}{n}$$

Using the initial guess µ0, the estimate of µ is obtained as $\mu_1 = \frac{x_1 + x_2 + \cdots + x_{n-1} + \mu_0}{n}$, and the process is repeated (at the kth stage, $\mu_{k+1} = \frac{x_1 + x_2 + \cdots + x_{n-1} + \mu_k}{n}$) till it converges to the estimated value of µ.

Assume that we observe a sample x = (9, 11, ?) from the univariate normal distribution N(µ, 1); estimate the missing observation.

$$\mu_{k+1} = \frac{x_1 + x_2 + \cdots + x_{n-1} + \mu_k}{n} = \frac{x_1 + x_2 + \mu_k}{3}$$

Let the initial guess be µ0 = 0. Then the maximum likelihood estimate of µ will be

$$\mu_1 = \frac{9 + 11 + \mu_0}{3} = \frac{9 + 11 + 0}{3} = 6.67$$

Using this formula again gives $\mu_2 = \frac{9 + 11 + \mu_1}{3} = \frac{9 + 11 + 6.67}{3} = 8.89$, and so on.

µn µn+1
0 6.67
6.67 8.89
8.89 9.63
9.63 9.88
9.88 9.96
9.96 9.99
9.99 10.00
10.00 10.00
It can be seen that µn → 10 as n → ∞.
For this simple model (a univariate normal distribution with unit variance), substituting the average of the known values, (9 + 11)/2 = 10, is the best replacement for the missing value.
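A few lines of Python reproduce this iteration; em_missing_obs is a name chosen here for illustration, not a standard routine.

```python
def em_missing_obs(observed, mu0=0.0, tol=1e-6, max_iter=100):
    """EM for one missing value from N(mu, 1): mu_{k+1} = (sum of observed + mu_k) / n."""
    n = len(observed) + 1          # total sample size, including the missing value
    s = sum(observed)              # sum of the observed values
    mu = mu0
    for _ in range(max_iter):
        mu_new = (s + mu) / n      # E-step fills in z with mu_k; M-step averages
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu_new

print(em_missing_obs([9, 11], mu0=0.0))   # approaches 10.0, the mean of the observed values
```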

2. Let X1, X2, …, Xn be a random sample from N(µ, σ²), where m out of the n observations are missing. Estimate µ and σ² using the EM algorithm.

X ~ N(µ, σ²). Denote the observed values by x1, …, x_{n-m} and the missing values by z1, …, zm.

$$L(\mu, \sigma^2) = f(x_1; \mu, \sigma^2)\cdots f(x_{n-m}; \mu, \sigma^2)\,f(z_1; \mu, \sigma^2)\cdots f(z_m; \mu, \sigma^2) = \frac{1}{(\sqrt{2\pi})^{n}\sigma^{n}}\exp\!\Big(-\frac{1}{2\sigma^2}\Big[\sum_{i=1}^{n-m}(x_i-\mu)^2 + \sum_{i=1}^{m}(z_i-\mu)^2\Big]\Big)$$

$$\log L = -n\log\sigma - \frac{\sum x_i^2}{2\sigma^2} - \frac{\sum z_i^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\Big[\sum_{i=1}^{n-m}x_i + \sum_{i=1}^{m}z_i\Big] - \frac{n\mu^2}{2\sigma^2} + \text{const}$$

Taking the expectation with respect to the missing data, with current estimates µ0 and σ0², $E(z_i) = \mu_0$ and $E(z_i^2) = V(z_i) + (E(z_i))^2 = \sigma_0^2 + \mu_0^2$, so

$$Q(\mu, \sigma^2; \mu_0, \sigma_0^2) = E(\log L \mid x) = -n\log\sigma - \frac{\sum x_i^2}{2\sigma^2} - \frac{m(\sigma_0^2 + \mu_0^2)}{2\sigma^2} + \frac{\mu}{\sigma^2}\Big[\sum_{i=1}^{n-m}x_i + m\mu_0\Big] - \frac{n\mu^2}{2\sigma^2}$$

$$\frac{\partial Q}{\partial\mu} = 0 \;\Rightarrow\; \frac{1}{\sigma^2}\Big[\sum_{i=1}^{n-m}x_i + m\mu_0\Big] - \frac{n\mu}{\sigma^2} = 0$$

$$\therefore\; \mu_1 = \frac{\sum_{i=1}^{n-m}x_i + m\mu_0}{n}$$

$$\frac{\partial Q}{\partial\sigma} = 0 \;\Rightarrow\; -\frac{n}{\sigma} + \frac{\sum x_i^2}{\sigma^3} + \frac{m(\sigma_0^2 + \mu_0^2)}{\sigma^3} - \frac{2\mu_1\big[\sum x_i + m\mu_0\big]}{\sigma^3} + \frac{n\mu_1^2}{\sigma^3} = 0$$

$$\therefore\; \sigma_1^2 = \frac{\sum_{i=1}^{n-m}x_i^2 + m(\sigma_0^2 + \mu_0^2) - 2\mu_1\big[\sum x_i + m\mu_0\big] + n\mu_1^2}{n} = \frac{\sum_{i=1}^{n-m}x_i^2 + m(\sigma_0^2 + \mu_0^2)}{n} - \mu_1^2$$

The iteration is therefore

$$\mu_{k+1} = \frac{\sum_{i=1}^{n-m}x_i + m\mu_k}{n}, \qquad \sigma_{k+1}^2 = \frac{\sum_{i=1}^{n-m}x_i^2 + m(\sigma_k^2 + \mu_k^2)}{n} - \mu_{k+1}^2$$
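The two updates above translate directly into a short Python routine. This is a sketch under the same setup; the function name em_normal_missing and the data in the usage line are illustrative, not from the text.

```python
import numpy as np

def em_normal_missing(x_obs, m, mu0=0.0, sigma2_0=1.0, tol=1e-10, max_iter=1000):
    """EM for N(mu, sigma^2) when m of the n observations are missing.

    mu_{k+1}     = (sum(x_obs) + m*mu_k) / n
    sigma2_{k+1} = (sum(x_obs^2) + m*(sigma2_k + mu_k^2)) / n - mu_{k+1}^2
    """
    x_obs = np.asarray(x_obs, dtype=float)
    n = len(x_obs) + m
    s1, s2 = x_obs.sum(), (x_obs ** 2).sum()
    mu, sigma2 = mu0, sigma2_0
    for _ in range(max_iter):
        mu_new = (s1 + m * mu) / n
        sigma2_new = (s2 + m * (sigma2 + mu ** 2)) / n - mu_new ** 2
        if abs(mu_new - mu) < tol and abs(sigma2_new - sigma2) < tol:
            return mu_new, sigma2_new
        mu, sigma2 = mu_new, sigma2_new
    return mu, sigma2

# hypothetical usage: 8 observed values, 2 missing
print(em_normal_missing([9.1, 10.4, 11.2, 9.8, 10.0, 10.7, 9.5, 10.3], m=2))
```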

3. A Classic Genetic Example


This is a famous example from Rao (1973). Consider the genetic linkage of 197 animals, in which the phenotypes are distributed into 4 categories, 1 to 4. The numbers of animals in the 4 categories are (y1, y2, y3, y4) = (125, 18, 20, 34), with cell probabilities

$$\Big(\frac{1}{2} + \frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4}\Big).$$

It is possible to maximize this multinomial likelihood directly to obtain an estimate of θ. However, the EM algorithm brings a substantial simplification by using the augmentation method. The observed data y are augmented by dividing the first cell into two, with respective cell probabilities 1/2 and θ/4. This gives an augmented data set (x1, x2, x3, x4, x5), where x1 + x2 = y1, and x3 = y2, x4 = y3, x5 = y4.
The likelihood function based on the x's is

$$L = \frac{197!}{x_1!\,x_2!\,x_3!\,x_4!\,x_5!}\Big(\frac{1}{2}\Big)^{x_1}\Big(\frac{\theta}{4}\Big)^{x_2}\Big(\frac{1-\theta}{4}\Big)^{x_3}\Big(\frac{1-\theta}{4}\Big)^{x_4}\Big(\frac{\theta}{4}\Big)^{x_5}$$

$$L(\theta \mid x) \propto \theta^{x_2 + x_5}(1-\theta)^{x_3 + x_4}$$

$$\log L(\theta \mid x) = (x_2 + x_5)\log\theta + (x_3 + x_4)\log(1-\theta)$$

$$Q(\theta, \theta_0) = E_{X\mid Y,\theta_0}[\log L(\theta \mid X)] = E_{X\mid Y,\theta_0}[(x_2 + x_5)\log\theta + (x_3 + x_4)\log(1-\theta)] = E_{X\mid Y,\theta_0}[x_2]\log\theta + x_5\log\theta + (x_3 + x_4)\log(1-\theta)$$

X2 is not observed, but

$$P(X_2 \mid Y_1) = \frac{P(X_2 \cap Y_1)}{P(Y_1)} = \frac{\theta_0/4}{\tfrac{1}{2} + \tfrac{\theta_0}{4}} = \frac{\theta_0}{\theta_0 + 2}$$

$$X_2 \mid Y_1 \sim \text{Binomial}\Big(125,\; \frac{\theta_0}{\theta_0 + 2}\Big)$$

Thus

$$Q(\theta, \theta_0) = \frac{125\,\theta_0}{\theta_0 + 2}\log\theta + x_5\log\theta + (x_3 + x_4)\log(1-\theta)$$

is to be maximized.

$$\frac{dQ(\theta, \theta_0)}{d\theta} = 0 \;\text{ gives }\; \frac{\frac{125\,\theta_0}{\theta_0 + 2} + x_5}{\theta} - \frac{x_3 + x_4}{1-\theta} = 0$$

$$\Big(\frac{125\,\theta_0}{\theta_0 + 2} + x_5\Big)(1-\theta) - (x_3 + x_4)\theta = 0$$

$$\big(125\,\theta_0 + x_5(\theta_0 + 2)\big)(1-\theta) - (x_3 + x_4)\theta(\theta_0 + 2) = 0$$

$$\theta\big(125\,\theta_0 + x_5(\theta_0 + 2) + (x_3 + x_4)(\theta_0 + 2)\big) = 125\,\theta_0 + x_5(\theta_0 + 2)$$

$$\theta_1 = \frac{(125 + x_5)\theta_0 + 2x_5}{(125 + x_5 + x_3 + x_4)\theta_0 + 2(x_5 + x_3 + x_4)}$$

$$\theta_1 = \frac{159\,\theta_0 + 68}{197\,\theta_0 + 144}$$

Starting with θ0 = 0.5 and using the above formula, θ1 = 0.6082. Repeatedly applying the formula gives the following values of θ, which converge to 0.6268.
n θn
0 0.5
1 0.6082
2 0.6243
3 0.6265
4 0.6268
5 0.6268

MLE method

$$L = \frac{197!}{y_1!\,y_2!\,y_3!\,y_4!}\Big(\frac{2+\theta}{4}\Big)^{y_1}\Big(\frac{1-\theta}{4}\Big)^{y_2}\Big(\frac{1-\theta}{4}\Big)^{y_3}\Big(\frac{\theta}{4}\Big)^{y_4}$$

$$L(\theta \mid y) = c\,(2+\theta)^{y_1}(1-\theta)^{y_2 + y_3}\theta^{y_4}$$

$$\log L(\theta \mid y) = y_1\log(2+\theta) + (y_2 + y_3)\log(1-\theta) + y_4\log\theta$$

$$\frac{d\log L(\theta \mid y)}{d\theta} = 0 \;\text{ gives }\; \frac{y_1}{2+\theta} - \frac{y_2 + y_3}{1-\theta} + \frac{y_4}{\theta} = 0$$

$$(1-\theta)\theta\,y_1 - (2+\theta)\theta(y_2 + y_3) + (2+\theta)(1-\theta)y_4 = 0$$

$$\theta^2(y_1 + y_2 + y_3 + y_4) + \theta(-y_1 + 2y_2 + 2y_3 + y_4) - 2y_4 = 0$$

$$197\,\theta^2 - 15\,\theta - 68 = 0 \;\Rightarrow\; \theta = 0.6268$$
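The EM iteration for this example is easy to check numerically; the sketch below (em_linkage is an illustrative name) reproduces the table above and agrees with the direct MLE.

```python
def em_linkage(y, theta0=0.5, tol=1e-8, max_iter=100):
    """EM for the Rao (1973) linkage data y = (y1, y2, y3, y4)."""
    y1, y2, y3, y4 = y
    theta = theta0
    for _ in range(max_iter):
        # E-step: expected count in the hidden theta/4 part of the first cell
        x2 = y1 * theta / (theta + 2)
        # M-step: maximize (x2 + y4) log(theta) + (y2 + y3) log(1 - theta)
        theta_new = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

print(em_linkage((125, 18, 20, 34)))   # about 0.6268, the root of 197*theta^2 - 15*theta - 68 = 0
```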

4. A coin tossing experiment: Consider an experiment of tossing two coins A and B having probabilities of success (heads) θ1 and θ2 respectively. In each of five trials, one of the coins is tossed 10 times; the results are given below.

Coin    Flips          No. of heads, coin A    No. of heads, coin B
B       HTTTHHTHTH     0                       5
A       HHHHTHHHHH     9                       0
A       HTHHHHHTHH     8                       0
B       HTHTTTHHTT     0                       4
A       THHHTHHHTH     7                       0
Total                  24                      9

Estimate of θ1 is 24/30 = 0.8 and estimate of θ2 is 9/20 = 0.45

Now assume that the identities of the coins used in each trial are not known; the coin identity is a hidden variable. The EM algorithm is used to obtain estimates of θ1 and θ2 in this situation.

It turns out that we can make progress by starting with a guess for the coin biases. This guess lets us estimate which coin was chosen in each trial and compute the expected number of heads and tails attributed to each coin across the trials (E-step). We then use these counts to recompute a better guess for each coin bias (M-step). By repeating these two steps, we continue to improve the estimates of the two coin biases and converge to a solution that is a local maximum of the likelihood.

The E-Step: estimating the probability that each coin was chosen. Let the series of flips in a trial be the event E, and let the numbers of heads and tails in E be h and t. Let Z_A and Z_B be the events that coin A and coin B were chosen.

First, assume it was coin A; then the probability of seeing these flips is

$$P(E \mid Z_A) = \theta_1^{h}(1-\theta_1)^{t}$$

Similarly, if we assume it was coin B,

$$P(E \mid Z_B) = \theta_2^{h}(1-\theta_2)^{t}$$

Using Bayes' theorem,

$$P(Z_A \mid E) = \frac{P(E \mid Z_A)P(Z_A)}{P(E \mid Z_A)P(Z_A) + P(E \mid Z_B)P(Z_B)} = \frac{P(E \cap Z_A)}{P(E \cap Z_A) + P(E \cap Z_B)}$$

We know that P(Z_A) = P(Z_B) = 0.5, so

$$P(Z_A \mid E) = \frac{P(E \mid Z_A)}{P(E \mid Z_A) + P(E \mid Z_B)} = \frac{\theta_1^{h}(1-\theta_1)^{t}}{\theta_1^{h}(1-\theta_1)^{t} + \theta_2^{h}(1-\theta_2)^{t}}$$

and for coin B,

$$P(Z_B \mid E) = \frac{\theta_2^{h}(1-\theta_2)^{t}}{\theta_1^{h}(1-\theta_1)^{t} + \theta_2^{h}(1-\theta_2)^{t}}$$

The E-step, assuming θ1 = 0.6 and θ2 = 0.5: the probabilities for the first trial (5 heads, 5 tails) are

$$P(Z_A \mid E) = \frac{0.6^5 \cdot 0.4^5}{0.6^5 \cdot 0.4^5 + 0.5^5 \cdot 0.5^5} = 0.45, \qquad P(Z_B \mid E) = \frac{0.5^5 \cdot 0.5^5}{0.6^5 \cdot 0.4^5 + 0.5^5 \cdot 0.5^5} = 0.55$$
Flips          P(ZA|E)    P(ZB|E)    No. heads attributed to A    No. heads attributed to B
HTTTHHTHTH     0.45       0.55       2.2                          2.8
HHHHTHHHHH     0.80       0.20       7.2                          1.8
HTHHHHHTHH     0.73       0.27       5.9                          2.1
HTHTTTHHTT     0.35       0.65       1.4                          2.6
THHHTHHHTH     0.65       0.35       4.5                          2.5
Total          2.98       2.02       21.2                         11.8

The M-Step: Revised estimates of θ1 and θ2 are obtained by dividing the expected number of heads by the expected total number of flips for each coin:

$$\theta_1 = \frac{21.2}{10 \times 2.98} = 0.71, \qquad \theta_2 = \frac{11.8}{10 \times 2.02} = 0.58$$

With updated estimates for θ1 and θ2, we can repeat the E-step and then the M-step.
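The whole procedure for the coin example fits in a few lines; the sketch below (em_two_coins is an illustrative name) takes per-trial (heads, tails) counts and reproduces the first-iteration values 0.71 and 0.58 before converging.

```python
def em_two_coins(trials, theta1=0.6, theta2=0.5, n_iter=20):
    """EM for two coins with unknown identities; trials is a list of (heads, tails) pairs."""
    for _ in range(n_iter):
        h_a = t_a = h_b = t_b = 0.0
        for h, t in trials:
            # E-step: posterior probability the trial used coin A (equal priors of 0.5)
            like_a = theta1 ** h * (1 - theta1) ** t
            like_b = theta2 ** h * (1 - theta2) ** t
            p_a = like_a / (like_a + like_b)
            p_b = 1.0 - p_a
            h_a += p_a * h; t_a += p_a * t   # expected heads/tails attributed to coin A
            h_b += p_b * h; t_b += p_b * t   # expected heads/tails attributed to coin B
        # M-step: expected heads divided by expected flips for each coin
        theta1 = h_a / (h_a + t_a)
        theta2 = h_b / (h_b + t_b)
    return theta1, theta2

trials = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]   # the five trials tabulated above
print(em_two_coins(trials))
```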

5. Gaussian Mixture Model


Let the observations X1, X2, …, Xn be a random sample from a mixture model with k mixture components N(µj, σj²), j = 1, 2, …, k. The joint probability of the observations and the component indicators (the complete-data likelihood) is

$$L = \prod_{i=1}^{n}\prod_{j=1}^{k}\big[\pi_j f(x_i; \mu_j, \sigma_j^2)\big]^{z_{ij}}$$

For n = 3, k = 2:

$$L = [\pi_1 f(x_1; \mu_1, \sigma_1^2)]^{z_{11}}[\pi_2 f(x_1; \mu_2, \sigma_2^2)]^{z_{12}}[\pi_1 f(x_2; \mu_1, \sigma_1^2)]^{z_{21}}[\pi_2 f(x_2; \mu_2, \sigma_2^2)]^{z_{22}}[\pi_1 f(x_3; \mu_1, \sigma_1^2)]^{z_{31}}[\pi_2 f(x_3; \mu_2, \sigma_2^2)]^{z_{32}}$$

πj: the probability that an observation comes from the jth Normal distribution.

zij = 1 if the ith observation is from the jth Normal distribution, and 0 otherwise.

The zij's are hidden variables; their conditional expectations given the data and the current parameters (the responsibilities) are

$$\gamma_{ij} = E(z_{ij} \mid x_i) = P(z_{ij} = 1 \mid x_i) = \frac{\pi_j f(x_i; \mu_j, \sigma_j^2)}{\sum_{l=1}^{k}\pi_l f(x_i; \mu_l, \sigma_l^2)} \qquad (1)$$

$$\log L = \sum_{i=1}^{n}\sum_{j=1}^{k} z_{ij}\big[\log\pi_j + \log f(x_i; \mu_j, \sigma_j^2)\big]$$

$$Q = E(\log L \mid x) = \sum_{i=1}^{n}\sum_{j=1}^{k} \gamma_{ij}\big[\log\pi_j + \log f(x_i; \mu_j, \sigma_j^2)\big]$$

Maximize Q subject to the condition $\sum_{j=1}^{k}\pi_j = 1$ by introducing a Lagrange multiplier λ:

$$\Phi = \sum_{i=1}^{n}\sum_{j=1}^{k} \gamma_{ij}\big[\log\pi_j + \log f(x_i; \mu_j, \sigma_j^2)\big] + \lambda\Big(1 - \sum_{j=1}^{k}\pi_j\Big)$$

$$\frac{\partial\Phi}{\partial\pi_j} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\frac{\gamma_{ij}}{\pi_j} - \lambda = 0 \;\Rightarrow\; \pi_j = \frac{\sum_{i=1}^{n}\gamma_{ij}}{\lambda}$$

$$\sum_{j=1}^{k}\pi_j = 1 \;\Rightarrow\; \lambda = \sum_{i=1}^{n}\sum_{j=1}^{k}\gamma_{ij} = n$$

$$\therefore\; \pi_j = \frac{\sum_{i=1}^{n}\gamma_{ij}}{n} \qquad (2)$$

$$f(x_i; \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi}\,\sigma_j}\,e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$$

$$\log f(x_i; \mu_j, \sigma_j^2) = -\log\sqrt{2\pi} - \log\sigma_j - \frac{(x_i - \mu_j)^2}{2\sigma_j^2}$$

$$Q = \sum_{i=1}^{n}\sum_{j=1}^{k}\gamma_{ij}\Big[\log\pi_j - \log\sqrt{2\pi} - \log\sigma_j - \frac{(x_i - \mu_j)^2}{2\sigma_j^2}\Big]$$

$$\frac{\partial Q}{\partial\mu_j} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\gamma_{ij}\,\frac{x_i - \mu_j}{\sigma_j^2} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\gamma_{ij}x_i = \mu_j\sum_{i=1}^{n}\gamma_{ij}$$

$$\mu_j = \frac{\sum_{i=1}^{n}\gamma_{ij}x_i}{\sum_{i=1}^{n}\gamma_{ij}} \qquad (3)$$

$$\frac{\partial Q}{\partial\sigma_j} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\gamma_{ij}\Big(-\frac{1}{\sigma_j} + \frac{(x_i - \mu_j)^2}{\sigma_j^3}\Big) = 0 \;\Rightarrow\; \sigma_j^2\sum_{i=1}^{n}\gamma_{ij} = \sum_{i=1}^{n}\gamma_{ij}(x_i - \mu_j)^2$$

$$\sigma_j^2 = \frac{\sum_{i=1}^{n}\gamma_{ij}(x_i - \mu_j)^2}{\sum_{i=1}^{n}\gamma_{ij}} \qquad (4)$$

The EM algorithm is as follows:

1. Initialize the μj’s, σj’s and πj’s.


2. E-step: Evaluate the posterior probabilities γij using the current values of the μj’s, σj’s and πj’s with equation (1), and evaluate the expected log-likelihood with these parameters.
3. M-step: Estimate new parameters μ̂j, σ̂j² and π̂j from the current values of γij using equations (2), (3) and (4).
4. Evaluate the expected log-likelihood with the new parameter estimates. If the log-
likelihood has changed by less than some small ϵ, stop. Otherwise, go back to step 2.

The EM algorithm is sensitive to the initial values of the parameters, so care must be taken in
the first step. However, assuming the initial values are “valid,” one property of the EM
algorithm is that the log-likelihood increases at every step.

This invariant proves to be useful when debugging the algorithm in practice.
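Putting equations (1)–(4) together gives the following one-dimensional sketch. It is illustrative only: em_gmm_1d is a name chosen here, and initializing the means by sampling k points from the data is one common choice, not the only one. The loop also tracks the observed-data log-likelihood, which should never decrease, as noted above.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=200, tol=1e-8, seed=0):
    """EM for a one-dimensional Gaussian mixture, following equations (1)-(4)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # 1. initialize means, variances and mixing proportions
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    ll_old = -np.inf
    for _ in range(n_iter):
        # 2. E-step (equation 1): responsibilities gamma[i, j]
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step (equations 2, 3, 4)
        nj = gamma.sum(axis=0)
        pi = nj / n
        mu = (gamma * x[:, None]).sum(axis=0) / nj
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj
        # 4. stop when the observed-data log-likelihood has changed by less than tol
        ll = np.log(dens.sum(axis=1)).sum()   # log-likelihood at the parameters used in this E-step
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return pi, mu, var
```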

Expectation Maximization is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables.
A related problem is estimating a missing observation y when (X, Y) follows a bivariate Normal distribution, (X, Y) ~ BVN(µ1, µ2, σ1, σ2, ρ).
