Theory of Clustering (KDD 2010 Tutorial)
Sergei V. and Suresh V.
Outline
I Euclidean Clustering and k-means algorithm
  What to do to select initial centers (and what not to do)
  How long does k-means take to run in theory, practice and theoretical practice
  How to run k-means on large datasets
II Bregman Clustering and k-means
  Bregman Clustering as generalization of k-means
  Performance Results
III Stability
  How to relate closeness in cost function to closeness in clusters.
Introduction
Given n points in R^d, split them into k "similar" groups.
How do we define "best"?
  Minimize the maximum radius of a cluster
  Maximize the average inter-cluster distance
  Minimize the variance within each cluster
Minimizing Variance
Given X and k, find a clustering C = {C_1, C_2, ..., C_k} that minimizes
  φ(X, C) = Σ_{c_i} Σ_{x ∈ C_i} ‖x − c_i‖²

Definition
Let φ* denote the value of the optimum solution above. We say that a clustering C' is α-approximate if
  φ* ≤ φ(X, C') ≤ α · φ*
The k-means (Lloyd) algorithm:
  Assign each point to its nearest center
  Recompute each center as the centroid of its cluster
  Repeat until the clustering stops changing
(A code sketch follows below.)
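A minimal NumPy sketch of the Lloyd iteration above (the function name, the convergence test, and the empty-cluster handling are our own choices, not from the tutorial):

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iters=100, tol=1e-9):
    """A bare-bones Lloyd iteration.
    X: (n, d) array of points; centers: (k, d) array of initial centers.
    Returns (centers, assignment, cost) where cost is phi(X, C)."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Update step: each center moves to the centroid of its cluster.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[assign == j]
            if len(members):                      # leave empty clusters in place
                new_centers[j] = members.mean(axis=0)
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    # Final assignment and cost phi(X, C) = sum of squared distances.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    cost = d2[np.arange(len(X)), assign].sum()
    return centers, assign, cost
```

How the initial centers are chosen is exactly the seeding question discussed next.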
Performance
k-means Accuracy
How good is this algorithm? The solution it finds can be arbitrarily worse than the optimum.
But does this really happen?
Seeding on Gaussians
With random seeding, several initial centers can land in the same Gaussian while another Gaussian gets none, and the Lloyd iterations never recover.

Seeding on Gaussians: Simple Fix
But the Gaussian case has an easy fix: use a furthest point heuristic.
Select centers using a furthest point algorithm (a 2-approximation to k-Center clustering).

Seeding on Gaussians: Sensitive to Outliers
The furthest point heuristic, however, is sensitive to outliers.
Interpolating between random and furthest-point seeding:
Let D(x) be the distance between a point x and its nearest cluster center. Choose the next center proportionally to D^α(x).
  α = 0 → random initialization
  α = ∞ → furthest point heuristic
  α = 2 → k-means++

More generally
Set the probability of selecting a point proportional to its contribution to the overall error:
  If minimizing Σ_{c_i} Σ_{x ∈ C_i} ‖x − c_i‖, sample according to D.
  If minimizing Σ_{c_i} Σ_{x ∈ C_i} ‖x − c_i‖^∞, sample according to D^∞.
k-Means++
If the data set looks Gaussian...
If the outlier should be its own cluster...
Theorem (AV07)
This algorithm always attains an O(log k) approximation in
expectation
Theorem (ORSS06)
A slightly modified version of this algorithm attains an O(1)
approximation if the data is ‘nicely clusterable’ with k clusters.
Definition
A pointset X is (k, ε)-separated if φ*_k(X) ≤ ε² · φ*_{k−1}(X).
Intuition
Look at the optimum clustering. In expectation:
1 If the algorithm selects a point from a new OPT cluster, that
cluster is covered pretty well
2 If the algorithm picks two points from the same OPT cluster,
then other clusters must contribute little to the overall error
As long as the points are reasonably well separated, the first
condition holds.
Two theorems
Assume the points are (k, ε)-separated and get an O(1)
approximation.
Make no assumptions about separability and get an O(log k)
approximation.
k-means++ Summary:
  To select the next center, sample a point in proportion to its current contribution to the error
  Works for k-means, k-median, other objective functions
  Universal O(log k) approximation, O(1) approximation under some assumptions
  Can be implemented to run in O(nkd) time (same as a single k-means step; see the sketch below)
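A compact sketch of the D²-sampling initialization just summarized (squared Euclidean distances, NumPy arrays; the helper name kmeans_pp_init is ours):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: pick each new center with probability
    proportional to D^2(x), the squared distance from x to the
    nearest center chosen so far."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]                 # first center: uniform at random
    d2 = ((X - centers[0]) ** 2).sum(axis=1)       # current D^2(x)
    for _ in range(1, k):
        idx = rng.choice(n, p=d2 / d2.sum())       # sample proportionally to D^2
        centers.append(X[idx])
        # One pass over the data updates D^2, so the whole seeding is O(nkd).
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.asarray(centers)
```

Replacing the sampling by a uniform draw (α = 0) or by an argmax over D (α = ∞) recovers random initialization and the furthest-point heuristic, respectively; for the k-median objective one would sample according to D instead of D².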
KM++ v. KM v. KM-Hybrid
[Two plots: Error vs. Stage for LLOYD, HYBRID, and KM++ initializations]
Theorem (V09)
There exists a pointset X in R² and a set of initial centers C so that k-means takes 2^Ω(k) iterations to converge when initialized with C.
Perturbation
To each point x ∈ X, add independent noise drawn from N(0, σ²).
Definition
The smoothed complexity of an algorithm is the maximum, over inputs, of the expected running time after adding the noise.
Theorem (AMR09)
The smoothed complexity of k-means is bounded by
  O(n^34 k^34 d^8 D^6 log^4 n / σ^6)
Notes
  While the bound is large, it is not exponential (2^k ≫ k^34 for large enough k)
  The (D/σ)^6 factor shows the bound is scale invariant
Comparing bounds
The smoothed complexity of k-means is polynomial in n, k and D/σ
where D is the diameter of X, whereas the worst case complexity of
k-means is exponential in k
Implications
The pathological examples:
Are very brittle
Can be avoided with a little bit of random noise
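The perturbation itself is a one-liner in practice; a small sketch (the choice of σ relative to the diameter of X is left to the user, and the function name is ours):

```python
import numpy as np

def perturb(X, sigma, rng=None):
    """Add independent N(0, sigma^2) noise to every coordinate,
    as in the smoothed-analysis model above; a little noise already
    destroys the brittle worst-case instances."""
    rng = np.random.default_rng(rng)
    return X + rng.normal(scale=sigma, size=X.shape)
```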
Running Time
Exponential worst case running time
Polynomial typical case running time
Solution Quality
Arbitrary local optimum, even with many random restarts
Simple initialization leads to a good solution
Implementing k-means++
Initialization:
  Takes O(nd) time and one pass over the data to select the next center
  Takes O(nkd) time total
Overall running time:
  Each round of k-means takes O(nkd) time
  Typically finishes after a constant number of rounds
Large Data
What if O(nkd) is too much? Can we parallelize this algorithm?
Approach
Partition the data:
  Split X into X_1, X_2, ..., X_m of roughly equal size.
In parallel, compute a clustering on each partition:
  Find C^j = {C^j_1, ..., C^j_k}: a good clustering of each partition, and denote by w^j_i the number of points in cluster C^j_i.
Cluster the clusters:
  Let Y = ∪_{1≤j≤m} C^j. Find a clustering of Y, weighted by the weights W = {w^j_i}.
Running time
Suppose we partition the input across m different machines.
  First phase running time: O(nkd/m).
  Second phase running time: O(mk²d).
Approximation Guarantees
Using k-means++ sets β = γ = O(log k) and leads to an O(log² k) approximation.
(Here β and γ denote the approximation guarantees of the two phases: clustering each partition, and clustering the set Y of representatives.)

Theorem (ADK09)
Running the k-means++ initialization for O(k) rounds leads to an O(1) approximation to the optimal solution (but uses more centers than OPT).
Final Algorithm
Partition the data:
  Split X into X_1, X_2, ..., X_m of roughly equal size.
Compute a clustering using ℓ = O(k) centers on each partition:
  Find C^j = {C^j_1, ..., C^j_ℓ} using k-means++ on each partition, and denote by w^j_i the number of points in cluster C^j_i.
Cluster the clusters:
  Let Y = ∪_{1≤j≤m} C^j be a set of O(ℓm) points. Use k-means++ to cluster Y, weighted by the weights W = {w^j_i}.

Theorem
The algorithm achieves an O(1) approximation in time O(nkd/m + mk²d).
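A sequential sketch of this two-phase scheme (in a real deployment the first loop would run on m machines in parallel; lloyd_kmeans and kmeans_pp_init are the earlier sketches, ell = 2k is just one concrete choice of ℓ = O(k), and each partition is assumed to contain many more than ℓ points):

```python
import numpy as np

def partitioned_kmeans(X, k, m, ell=None, rng=None):
    """Two-phase clustering: k-means++ with ell = O(k) centers on each
    partition, then weighted k-means++ / Lloyd on the representatives."""
    rng = np.random.default_rng(rng)
    ell = ell if ell is not None else 2 * k
    parts = np.array_split(rng.permutation(X), m)

    # Phase 1 (would run in parallel): cluster each partition,
    # keep the centers and their cluster sizes as weights.
    Y, W = [], []
    for Xj in parts:
        cj, assign, _ = lloyd_kmeans(Xj, kmeans_pp_init(Xj, ell, rng))
        Y.append(cj)
        W.append(np.bincount(assign, minlength=ell))
    Y, W = np.vstack(Y), np.concatenate(W).astype(float)

    # Phase 2: weighted k-means++ seeding on the O(ell * m) representatives.
    centers = Y[[rng.choice(len(Y), p=W / W.sum())]]
    for _ in range(1, k):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        p = W * d2
        centers = np.vstack([centers, Y[rng.choice(len(Y), p=p / p.sum())]])

    # Weighted Lloyd iterations on the representatives.
    for _ in range(50):
        assign = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if W[assign == j].sum() > 0:
                centers[j] = np.average(Y[assign == j], axis=0, weights=W[assign == j])
    return centers
```

The first phase does the O(nkd/m) work per machine; the second phase touches only the O(ℓm) weighted representatives.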
Before...
k-means used to be a prime example of the disconnect between
theory and practice – it works well, but has horrible worst case
analysis
...and after
Smoothed analysis explains the running time, and rigorously analyzed initialization routines help improve clustering quality.
Outline
I Euclidean Clustering and k-means algorithm
  What to do to select initial centers (and what not to do)
  How long does k-means take to run in theory, practice and theoretical practice
  How to run k-means on large datasets
II Bregman Clustering and k-means
  Bregman Clustering as generalization of k-means
  Performance Results
III Stability
  How to relate closeness in cost function to closeness in clusters.
Kullback-Leibler distance:
  D(p, q) = Σ_i p_i log(p_i / q_i)
Itakura-Saito distance:
  D(p, q) = Σ_i (p_i / q_i − log(p_i / q_i) − 1)
Definition
Let φ : R^d → R be a strictly convex function. The Bregman divergence D_φ is defined as
  D_φ(x ‖ y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩
Examples:
  Kullback-Leibler: φ(x) = Σ_i (x_i ln x_i − x_i), D_φ(x ‖ y) = Σ_i x_i ln(x_i / y_i)
  Itakura-Saito: φ(x) = −Σ_i ln x_i, D_φ(x ‖ y) = Σ_i (x_i / y_i − log(x_i / y_i) − 1)
  ℓ₂²: φ(x) = ‖x‖², D_φ(x ‖ y) = ‖x − y‖²
Asymmetry: In general, D_φ(p ‖ q) ≠ D_φ(q ‖ p)
No triangle inequality: D_φ(p ‖ q) + D_φ(q ‖ r) can be less than D_φ(p ‖ r)!
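These examples translate directly into code; a small sketch of the three divergences and of the asymmetry just noted (component-wise formulas, assuming strictly positive coordinates where required; the function names are ours):

```python
import numpy as np

def squared_euclidean(x, y):
    # phi(x) = ||x||^2  ->  D_phi(x || y) = ||x - y||^2
    return ((x - y) ** 2).sum()

def kl_divergence(x, y):
    # phi(x) = sum x_i ln x_i - x_i  ->  D_phi(x || y) = sum x_i ln(x_i / y_i)
    # (for x, y on the probability simplex)
    return (x * np.log(x / y)).sum()

def itakura_saito(x, y):
    # phi(x) = -sum ln x_i  ->  D_phi(x || y) = sum (x_i/y_i - ln(x_i/y_i) - 1)
    r = x / y
    return (r - np.log(r) - 1).sum()

# Asymmetry: D(p || q) and D(q || p) generally differ.
p, q = np.array([0.7, 0.2, 0.1]), np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))
```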
How can we now do clustering?
Breaking down k-means
Key Point
Setting the cluster center to the centroid minimizes the average squared distance to the center.

Problem
Given points x_1, ..., x_n ∈ R^d, find c such that
  Σ_i D_φ(x_i ‖ c)
is minimized.

Answer
  c = (1/n) Σ_i x_i
Independent of φ [BMDG05]!

Key Point
Setting the cluster center to the centroid minimizes the average Bregman divergence to the center.
Lemma ([BMDG05])
The (Bregman) k-means algorithm converges in cost.
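Because the center update is the plain mean regardless of φ, the Lloyd loop from Part I only needs its assignment step generalized; a sketch parameterized by a divergence function (for example the kl_divergence above; the function name is ours):

```python
import numpy as np

def bregman_kmeans(X, centers, divergence, max_iters=100):
    """Bregman hard clustering: assign each point to the center with the
    smallest D_phi(x || c); update every center as the arithmetic mean of
    its cluster -- the same update for every Bregman divergence."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iters):
        # Assignment step under the chosen divergence.
        assign = np.array([np.argmin([divergence(x, c) for c in centers])
                           for x in X])
        # Update step: the centroid, independently of phi.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[assign == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```

With divergence = squared_euclidean this is exactly the Euclidean k-means loop.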
EM: Euclidean and Bregman
Expectation maximization:
Initialize density parameters and means for k distributions
while not converged do
For distribution i and point x, compute conditional probability
p(i|x) that x was drawn from i (by Bayes rule)
For each distribution i, recompute new density parameters
and means (via maximum likelihood)
end while
Exponential families: p_{Ψ,θ}(x) = exp(⟨x, θ⟩ − Ψ(θ)) p_0(x), with Ψ convex.
Theorem ([BMDG05])
p_{Ψ,θ}(x) = exp(−D_φ(x ‖ µ)) b_φ(x), where µ = ∇Ψ(θ) is the expectation parameter and φ is the convex conjugate of Ψ.
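For concreteness, a sketch of the E and M steps for the simplest member of this family (spherical unit-variance Gaussians with uniform mixing weights, i.e. the squared-Euclidean case; the general Bregman soft clustering replaces the exponent with −D_φ(x ‖ µ_j), exactly as in the theorem above, and the function name is ours):

```python
import numpy as np

def soft_kmeans_em(X, mus, n_iters=50):
    """EM for a mixture of unit-variance spherical Gaussians with
    uniform mixing weights -- a soft version of k-means."""
    mus = np.array(mus, dtype=float)
    for _ in range(n_iters):
        # E step: responsibilities p(j | x) via Bayes' rule.
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        logits = -0.5 * d2
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)
        # M step: maximum-likelihood means = responsibility-weighted averages.
        mus = (r.T @ X) / (r.sum(axis=0)[:, None] + 1e-12)
    return mus, r
```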
Two questions:
  Does the running time analysis carry over to Bregman divergences?
  Does the initialization analysis carry over?
Parameters: n, k, d.
Good news
k-means always converges, in O(n^{kd}) time.
Bad news
k-means can take time 2^Ω(k) to converge:
  Even if d = 2, i.e., in the plane
  Even if the centers are chosen from the initial data
Theorem
The smoothed complexity of k-means using Gaussian noise N(0, σ²) is polynomial in n and 1/σ.
Theorem ([MR09])
For "well-behaved" Bregman divergences, the smoothed complexity is bounded by poly(n^√k, 1/σ) and by k^{kd} · poly(n, 1/σ).
Second question: does the initialization analysis carry over?
Problem
Given x_1, ..., x_n and a parameter k, find k centers c_1, ..., c_k such that
  Σ_{i=1}^n min_{1≤j≤k} d(x_i, c_j)
is minimized.

Problem (c-approximation)
Let OPT be the value of the optimal solution above. Fix c > 0. Find centers c'_1, ..., c'_k such that if A = Σ_{i=1}^n min_{1≤j≤k} d(x_i, c'_j), then
  OPT ≤ A ≤ c · OPT
Initialization
Let the distance from x to its nearest cluster center be D(x).
Pick x as the new center with probability
  p(x) ∝ D²(x)
Properties of the solution:
  For arbitrary data, this gives an O(log k)-approximation
  For "well-separated" data, this gives a constant (O(1)) approximation.
Initialization (Bregman)
Let the Bregman divergence from x to its nearest cluster center be D(x).
Pick x as the new center with probability
  p(x) ∝ D(x)
Measuring d_q
  d_q(C, OPT) = cost(C) / cost(OPT)
Measuring d
  d(OPT, C*): the distance between the optimal clustering OPT and the target clustering C*
Definition (α-perturbations [BL09])
A clustering instance (P, d) is α-perturbation-resilient if the optimal clustering is identical to the optimal clustering for any (P, d'), where d' satisfies d(x, y) ≤ d'(x, y) ≤ α · d(x, y) for all pairs x, y.
The larger the α, the more resilient the instance (and the more "stable")
Center-based clustering problems (k-median, k-means, k-center) can be solved optimally for 3-perturbation-resilient inputs [ABS10]
Surprising facts:
Finding a c-approximation in general might be NP-hard.
Finding a c-approximation here is easy!
Theorem
In polynomial time, we can find a clustering that is O(ε)-close to the
target clustering, even if finding a c-approximation is NP-hard.
http://www.cs.utah.edu/~suresh/web/2010/05/08/new-developments-in-the-theory-of-clustering-tutorial/

Andrea Vattani. k-means requires exponentially many iterations even in the plane. In SCG '09: Proceedings of the 25th Annual Symposium on Computational Geometry, pages 324–332, New York, NY, USA, 2009. ACM.