Chap2 Part2 GMM
• Able to use the Expectation-Maximisation algorithm to compute the optimal parameters of a GMM
Motivation
• Hard clustering vs Soft clustering
• Hard clustering: each point is assigned to one and only one cluster.
• Soft clustering: each point belongs to several clusters, each with a certain degree of membership.
Gaussian Distribution
• A Gaussian distribution, or normal distribution, is a continuous probability distribution that is symmetric about its mean. Its density has a bell-shaped curve: values near the mean have the highest probability of occurring, and the probability falls off symmetrically on both sides of the mean.
Gaussian distribution
• Mathematical definition:
The probability density function of a continuous random variable x that follows a normal/Gaussian distribution is given by:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)

with:
µ = mean
σ = standard deviation
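As a quick illustration (not from the slides), this density can be evaluated directly with NumPy; the function name gaussian_pdf is just for illustration:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Gaussian density N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))   # about 0.3989, the peak of the standard normal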
Gaussian distribution
• For d dimensions, the Gaussian distribution of a vector x = (x_1, \dots, x_d)^{T} is defined by:

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)

where µ is the d-dimensional mean vector and Σ is the d×d covariance matrix.
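As a small illustration (assuming SciPy is available; the numbers below are arbitrary), the d-dimensional density can be evaluated with scipy.stats.multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                      # mean vector (d = 2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                 # covariance matrix (symmetric, positive definite)
x = np.array([0.5, -1.0])
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))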
Gaussian Mixture Models (GMM)
• Gaussian mixture models (GMMs) are probabilistic models widely used in machine learning.
• They are used to group data into categories (clusters) according to the probability distribution that generated each point.
Gaussian Mixture Model (GMM)
• Definition:
• A Gaussian Mixture is a function composed of several Gaussians, each identified by k ∈ {1, …, K}, where K is the number of clusters in our dataset.
Each Gaussian k in the mixture is described by the following parameters:
• A mean μ_k that defines its centre.
• A covariance Σ_k that defines its width and shape.
• A mixing probability π_k that defines how big or small the contribution of that Gaussian will be.
Gaussian Mixture Model (GMM)
• The probability density of a mixture model of K Gaussians is:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \text{with } \sum_{k=1}^{K} \pi_k = 1
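As an illustration (the weights and component parameters below are made up), the density of a 1-D, two-component mixture can be evaluated like this:

import numpy as np
from scipy.stats import norm

weights = np.array([0.4, 0.6])      # mixing probabilities pi_k, summing to 1
means   = np.array([-2.0, 3.0])
stds    = np.array([1.0, 1.5])

def mixture_pdf(x):
    # p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)
    return float(np.sum(weights * norm.pdf(x, loc=means, scale=stds)))

print(mixture_pdf(0.0))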
Gaussian Mixture Model (GMM)
• Example of a Gaussian Mixture model with two components (2 Gaussian distributions)
Gaussian Mixture Model (GMM)
• Example: (a) & (b) presents each one a 2-dimensions Gaussian Mixture model
14
Parameters Estimation
• Problem:
Given a set of data X = {x_1, x_2, …, x_N}, assumed to follow a GMM distribution, estimate the parameters θ of the GMM that best fit the data,
with θ = {µ_k, Σ_k, π_k} (mean, covariance and mixing weight) for each component k of the mixture.
Maximum Likelihood Estimation (MLE)
• Maximum Likelihood Estimation is a frequentist approach used to estimate the optimal parameters of a mixture model.
• Given a data sample, the maximum likelihood estimator selects the parameter values of the mixture model under which the observed data are most probable.
Expectation Maximisation (EM) Algorithm
• MLE is a frequentist principle that suggests that, given a dataset, the “best” parameters to use are the ones that maximise the probability of the data.
• EM is widely used for such optimisation problems, especially when the objective function is hard to maximise directly, as is the case for the GMM log-likelihood.
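Concretely, the quantity that MLE maximises for a GMM is the log-likelihood of the data (the same expression is derived in the appendix):

\ln p(X \mid \theta) = \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)

The sum inside the logarithm prevents a closed-form solution, which is why the EM algorithm is used.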
EM more in depth
• Expectation (E) step: Using the current estimate of the parameters, create the function for the expectation of the log-likelihood.
• Maximisation (M) step: Compute new parameter values that maximise this expected log-likelihood.
EM more in depth
• Problem:
z_ik is a hidden (latent) variable associated with x_i: z_ik = 1 if x_i belongs to the k-th component of the mixture, and 0 otherwise.
• EM steps (a minimal Python sketch of these four steps is given below):
1. Initialise the parameters θ of the model
2. E step: Find the posterior probabilities of the latent variables Z given the current parameters θ
3. M step: Re-estimate the parameter values given the current posterior probabilities, i.e. use the computed values of Z to re-estimate θ
4. Iterate steps 2 and 3 until convergence
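As a rough illustration (not part of the original slides), a minimal NumPy/SciPy sketch of these four steps might look as follows; the function name em_gmm and all numerical details are hypothetical:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    # Minimal EM sketch for a GMM; X is an (N, d) array with d >= 2, K the number of components
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 1: initialise theta = {pi_k, mu_k, Sigma_k}
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]          # K random points as initial means
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # Step 2 (E step): responsibilities gamma[i, k] = p(z_ik = 1 | x_i)
        dens = np.column_stack([pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M step): re-estimate pi, mu, Sigma from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        # Step 4: iterate (a full implementation would stop once the log-likelihood converges)
    return pi, mu, Sigma

For example, em_gmm(np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 4]), K=2) should recover two components centred roughly at (0, 0) and (4, 4).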
EM more in depth
• Example:
The hidden variable indicates, for each point, which Gaussian (component) generated it.
EM more in depth
• E step:
For each point, estimate the probability that each Gaussian generated it.
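In symbols (consistent with Eq1 in the appendix), the E step computes, for every point x_i and every component k, the responsibility

\gamma(z_{ik}) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}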
EM more in depth
• M step:
For each Gaussian, re-estimate its parameters (mean, covariance, mixing weight) using the probabilities computed in the E step, with each point weighted by how strongly it belongs to that Gaussian.
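In symbols, with N_k = \sum_{i=1}^{N} \gamma(z_{ik}) the effective number of points assigned to component k, the M step applies the standard GMM updates (derived in the appendix):

\pi_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik})\, x_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{ik})\,(x_i - \mu_k)(x_i - \mu_k)^{T}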
K-means vs GMM
K-means
• Objective function: minimise the within-cluster sum of squared distances
– Performs hard assignment during the E-step
• Assumes spherical clusters, each with equal probability

GMM
• Objective function: maximise the (log-)likelihood of the data
– Performs soft assignment during the E-step
• Can be used for non-spherical clusters
• Can generate clusters with different probabilities
(the hard vs soft assignment difference is illustrated in the sketch below)
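As a sketch of that difference (assuming scikit-learn is installed; the synthetic data below is arbitrary), K-means returns one hard label per point while a GMM returns a probability per point and cluster:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])

hard_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)           # hard assignment
soft_probs  = GaussianMixture(n_components=2).fit(X).predict_proba(X)  # soft assignment
print(hard_labels[:3])
print(soft_probs[:3].round(3))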
Gaussian Mixture Models (GMMs)
• Strengths
– Give probabilistic (soft) cluster assignments
– Have a probabilistic interpretation
– Can handle clusters with varying sizes, variances, etc.
• Weaknesses
– Initialisation matters
– An appropriate form (and number) of component distributions must be chosen
– Prone to overfitting
Appendix
Expectation-Maximisation
Expectation-Maximisation
• Let's suppose we want to know the probability that a data point x_n comes from Gaussian k. We can express this as:

p(z_k = 1 \mid x_n)

• In words: "given a data point x_n, what is the probability that it came from Gaussian k?"
• Here, z is a latent (hidden, unknown) variable that takes only two possible values: it is one when x came from Gaussian k, and zero otherwise.
• Knowing the probability of occurrence of z will be useful in helping us determine the Gaussian mixture parameters.
• Likewise, we can state the following:

p(z_k = 1) = \pi_k

• This means that the overall probability of observing a point that comes from Gaussian k is actually equivalent to the mixing coefficient of that Gaussian.
• Now let z be the set of all latent variables, hence:

z = \{z_1, \dots, z_K\}
Expectation-Maximisation
• Each z_k occurs independently of the others, and z_k can only take the value one when k is equal to the cluster the point comes from. Therefore:

p(z) = \prod_{k=1}^{K} \pi_k^{\,z_k}

• Now, what about the probability of observing our data given that it came from Gaussian k? It turns out to be the Gaussian function itself! Following the same logic we used to define p(z), we can state:

p(x_n \mid z) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{\,z_k}

• The aim is to determine the probability of z given our observation x. It turns out that the equations we have just derived, along with Bayes' rule, will help us determine this probability. From the product rule of probabilities, we know that:

p(x_n, z) = p(x_n \mid z)\, p(z)
Expectation-Maximisation
• To get p(x_n) rather than p(x_n, z), we just need to sum the joint probability over z:

p(x_n) = \sum_{z} p(x_n, z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

• This is the equation that defines a Gaussian Mixture. To determine the optimal values of the parameters we need to maximise the likelihood of the model. We can find the likelihood as the joint probability of all observations x_n, defined by:

p(X \mid \theta) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
Expectation-Maximisation
• Now, remember that our aim is to find the probability of z given x.
• From Bayes' rule, we know that:

p(z_k = 1 \mid x_n) = \frac{p(x_n \mid z_k = 1)\, p(z_k = 1)}{p(x_n)}

• Substituting the expressions derived above, we have:

\gamma(z_{nk}) \;\equiv\; p(z_k = 1 \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \qquad \text{(Eq1)}
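As a small numerical check of Eq1 (the numbers are made up): for a 1-D mixture of two equally weighted Gaussians centred at 0 and 4, a point at x = 1 is assigned almost entirely to the first component.

import numpy as np
from scipy.stats import norm

pi    = np.array([0.5, 0.5])
mu    = np.array([0.0, 4.0])
sigma = np.array([1.0, 1.0])
x = 1.0

numerator = pi * norm.pdf(x, loc=mu, scale=sigma)   # pi_k * N(x | mu_k, sigma_k^2)
gamma = numerator / numerator.sum()                 # Eq1: p(z_k = 1 | x)
print(gamma)                                        # roughly [0.98, 0.02]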
Expectation-Maximisation
• Let us now define the steps that the general EM algorithm will follow.
• Step 1: Initialise θ accordingly. For instance, we can use the results obtained by a previous K-Means run as a good starting point for our algorithm.
• Step 2 (Expectation step): Evaluate

Q(\theta^{*}, \theta) = \mathbb{E}_{p(Z \mid X, \theta)}\!\left[\ln p(X, Z \mid \theta^{*})\right] \qquad \text{(Eq2)}

• The expectation step amounts to calculating the value of γ, so if we replace Eq1 in Eq2 we get:

Q(\theta^{*}, \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \, \ln\!\left[\pi_k^{*}\, \mathcal{N}(x_n \mid \mu_k^{*}, \Sigma_k^{*})\right] \qquad \text{(Eq3)}
Expectation-Maximisation
• The log of the expression inside Eq3 is given by:

\ln\!\left[\pi_k^{*}\, \mathcal{N}(x_n \mid \mu_k^{*}, \Sigma_k^{*})\right] = \ln \pi_k^{*} + \ln \mathcal{N}(x_n \mid \mu_k^{*}, \Sigma_k^{*}) \qquad \text{(Eq4)}

• Now, we replace Eq4 in Eq3:

Q(\theta^{*}, \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[\ln \pi_k^{*} + \ln \mathcal{N}(x_n \mid \mu_k^{*}, \Sigma_k^{*})\right] \qquad \text{(Eq5)}
Expectation-Maximisation
• We can now determine the parameters by maximum likelihood. Let us take the derivative of Q with respect to π_k, adding a Lagrange multiplier λ for the constraint that the mixing coefficients sum to one, and set it equal to zero:

\frac{\partial}{\partial \pi_k}\!\left[Q - \lambda\!\left(\sum_{j=1}^{K}\pi_j - 1\right)\right] = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} - \lambda = 0

• By rearranging the terms and applying a summation over k to both sides of the equation, we obtain:

\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) = \lambda \sum_{k=1}^{K} \pi_k

• We know that the summation of all mixing coefficients π equals one. In addition, summing the probabilities γ over k also gives 1 for each point. Thus we get λ = N. Using this result, we can solve for π_k:

\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})
Expectation-Maximisation
• Similarly, if we differentiate Q with respect to μ_k and Σ_k, equate the derivatives to zero and solve for the parameters, we obtain:

\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}, \qquad
\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,(x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma(z_{nk})}

• We then use these revised values to determine γ in the next EM iteration, and so on, until the likelihood value converges.
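As a practical note (assuming scikit-learn is available; the synthetic data below is arbitrary), sklearn.mixture.GaussianMixture is fitted with this EM procedure (plus some numerical safeguards), and its fitted attributes correspond to the quantities above:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 1)),
               rng.normal( 3.0, 0.5, size=(150, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)               # estimated mixing coefficients pi_k
print(gmm.means_.ravel())         # estimated means mu_k
print(gmm.covariances_.ravel())   # estimated variances Sigma_k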