2 Clustering
Frauke Liers
Friedrich-Alexander-Universität Erlangen-Nürnberg
Rough Differentiation in Learning Methods
supervised learning:
• predict values of an outcome measure based on a number of input measures
(e.g., given patient data together with the label 'has illness' / 'does not have illness':
when new patient data comes in, predict whether the patient is ill or not)
unsupervised learning:
• no outcome measure is given; the goal is to find structure in the data
...and also something in between: semi-supervised learning.
Given
• N: number of data points
• M: number of variables (e.g., "mass", "price", "color", ...)
• Data X = {x_1, . . . , x_N}, where x_n ∈ R^M for all n = 1, . . . , N
• K: assumed number of clusters
Want
• Assignment: x_n ↦ k_n ∈ {1, . . . , K} for all n = 1, . . . , N
• Assignment rule: x ↦ k(x) ∈ {1, . . . , K} for all x ∈ R^M
• Reconstruction rule ('representative'): k ↦ m_k ∈ R^M
On an abstract level:
• Determining the best possible clustering (w.r.t. some objective) is a classical
combinatorial optimization problem.
• K-means clustering: determine K points, the centers, that minimize the sum over all
data points of the squared Euclidean distance to their closest center (see the sketch below).
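For concreteness, a minimal numpy sketch of this objective; the names `X` for the (N, M) data array and `centers` for the (K, M) array of candidate centers are assumptions, not notation from the slides:

```python
import numpy as np

def kmeans_energy(X, centers):
    """Sum of squared Euclidean distances of each data point to its closest center."""
    # dists[n, k] = ||x_n - m_k||^2 for every point/center pair
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()
```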
Observations
• The clustering energy has local minima.
(Figure: K-means converging to a local minimum;
https://upload.wikimedia.org/wikipedia/commons/7/7c/K-means_convergence_to_a_local_minimum.png, modified)
Iterate between: determine the clustering for fixed means, determine the means for a
fixed clustering.
Let us fix the clustering C in
E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2,
and hence the minimizing means are
m_k = \frac{1}{|C_k|} \sum_{x \in C_k} x, i.e., the mean of cluster C_k (see the snippet below).
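A small numpy sketch of this update step; `X`, `labels`, and `K` are assumed names for the data array, the current cluster indices, and the number of clusters:

```python
import numpy as np

def update_means(X, labels, K):
    """Recompute each m_k as the mean of the points currently assigned to cluster k.
    Assumes every cluster is non-empty."""
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])
```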
Conversely, let us fix the means m in
E(C, m) := \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2.
Then E is minimized by assigning each point x to the cluster C_k whose mean m_k is closest.
This yields the K-means algorithm:
repeat
    assignment step: assign each x to the cluster C_k with the closest mean m_k;
    update step: recompute each mean m_k as the mean of C_k;
until assignment step does not do anything;
• Assignment rule: x ↦ argmin_k ‖x − m_k‖.
• Reconstruction rule: k ↦ m_k
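Putting the two steps together, a minimal didactic sketch of the whole iteration in Python/numpy (not the scikit-learn implementation; initializing with K randomly chosen data points is just one common choice):

```python
import numpy as np

def lloyd_kmeans(X, K, max_iter=100, seed=0):
    """Alternate the assignment and update steps until the assignment stops changing."""
    rng = np.random.default_rng(seed)
    # initialize the means with K distinct data points
    means = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assignment step: each point goes to its closest current mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignment step does not change anything -> done
        labels = new_labels
        # update step: each mean becomes the average of its assigned points
        for k in range(K):
            if np.any(labels == k):  # keep the old mean if a cluster runs empty
                means[k] = X[labels == k].mean(axis=0)
    return labels, means
```

Usage would be something like `labels, means = lloyd_kmeans(X, K=3)`; the reconstruction rule is then `k ↦ means[k]`, and the assignment rule maps a new point to its nearest entry of `means`.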
Disadvantages:
• K-means sometimes does not work well, in particular for non-spherical /
nonconvex data or for unevenly sized clusters, i.e. it has some implicit
assumptions:
(Figure: failure cases of K-means, from varianceexplained.org)
next: some improvements.
Expectation-Maximization Clustering Algorithm (EM)
Let x = (x_1, . . . , x_M)^T be a random vector with finite variance and mean. Let the
covariance matrix Σ = (Σ_{x_i, x_j}) ∈ R^{M×M} be defined as
Σ_{x_i, x_j} = E((x_i − E(x_i))(x_j − E(x_j))), where E denotes the expected value, µ_X = E(X).
The covariance matrix
• represents important statistical information, in particular the correlations between the variables
• is a real, square, symmetric matrix
• is positive semidefinite
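To relate the definition above to code, a small numerical check in numpy (assumptions: rows of `X` are samples, columns the M variables; `bias=True` gives the plain 1/N normalization of the expected value):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # 500 samples of an M = 3 dimensional random vector

# covariance from the definition: Sigma_ij = E[(x_i - E x_i)(x_j - E x_j)]
centered = X - X.mean(axis=0)
Sigma_def = centered.T @ centered / len(X)

# the same matrix via numpy (rowvar=False: columns are the variables)
Sigma_np = np.cov(X, rowvar=False, bias=True)
print(np.allclose(Sigma_def, Sigma_np))  # True
```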
Update step takes data points into account via relative frequencies.
Compare this to K-means... Indeed, K-means can be seen as a special case of
EM.
drawback: K-means and EM only find local optima. (Recall that already the K-means
problem is NP-hard...)
Play around with scikit-learn (a machine learning library in Python).
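For experimenting, a minimal scikit-learn sketch that runs both K-means and a Gaussian mixture fitted by EM on the same toy data (the dataset and parameter choices here are purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# toy data with three roughly spherical clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-means: hard assignment of each point to the nearest of K centers
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)      # the representatives m_k
print(km.labels_[:10])          # hard cluster assignments

# EM for a Gaussian mixture: soft assignments via relative frequencies
gm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gm.means_)                # estimated component means
print(gm.predict_proba(X[:5]))  # per-point component probabilities
```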
Then
m_k = x_{i_k^*},  k = 1, 2, . . . , K
are the current estimates of the cluster centers.
2. Given a current set of cluster centers {m_1, . . . , m_K}, minimize the total
dissimilarity by assigning each observation to the closest (current) cluster center
(sketched in code below):
C(i) = argmin_{1 ≤ k ≤ K} D(x_i, m_k).
3. Iterate steps 1 and 2 until the assignments do not change.
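A short sketch of the assignment step 2 for a generic dissimilarity D (here the Manhattan distance, purely as an illustrative choice; `centers` holds the current m_k):

```python
import numpy as np

def assign_to_centers(X, centers, D=lambda x, m: np.abs(x - m).sum()):
    """C(i) = argmin_k D(x_i, m_k): each observation gets the index of its closest center."""
    return np.array([min(range(len(centers)), key=lambda k: D(x, centers[k]))
                     for x in X])
```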
For a clustering problem, the best number of clusters depends on the goal and on the
knowledge of the application.
• Sometimes, the best value of K is given as input (e.g., K salespeople are
employed, and the task is to cluster a database into K segments).
• However, if 'natural' clusters need to be determined, the best value of K is
unknown and needs to be estimated from the data as well (one option is sketched below).
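One common way to estimate K from the data (not prescribed by the slides, just one option) is to run K-means for several values of K and look for an "elbow" where the clustering energy stops dropping sharply:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

# inertia_ is the K-means energy for the fitted clustering
for K in range(1, 8):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    print(K, round(km.inertia_, 1))
```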