Alternatives To The K-Means Algorithm That Find Better Clusterings
ABSTRACT
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions. We then show that the algorithms do behave very differently from each other on simple low-dimensional synthetic datasets and image segmentation tasks, and that the k-harmonic means method is superior. Having a soft membership function is essential for finding high-quality clusterings, but having a non-constant data weight function is useful also.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval; I.5.3 [Computing Methodologies]: Pattern Recognition

General Terms
Clustering quality, k-means, k-harmonic means, unsupervised classification

1. INTRODUCTION
Data clustering, which is the task of finding natural groupings in data, is an important problem in machine learning and pattern recognition. Typically in clustering there is no one perfect solution to the problem; instead, algorithms seek to minimize a certain mathematical criterion (which varies between algorithms). Minimizing such criteria is known to be NP-hard for the general problem of partitioning d-dimensional data into k sets [6]. Algorithms like k-means seek local rather than global minimum solutions, and can therefore get stuck at poor solutions. In these cases we consider a solution which better minimizes the mathematical criterion (for the same number of centers) to be a better-quality clustering.

We use the term "center-based clustering" to refer to the family of algorithms such as k-means and Gaussian expectation-maximization, since they use a number of "centers" to represent and/or partition the input data. Each center defines a cluster with a central point and perhaps a covariance matrix. Center-based clustering algorithms begin with a guess about the solution, and then refine the positions of centers until reaching a local optimum. These methods can work well, but they can also converge to a local minimum that is far from the global minimum, i.e. the clustering that has the highest quality according to the criterion in use. Converging to bad local optima is related to sensitivity to initialization, and is a primary problem of data clustering.

The goal of this work is to understand and extend center-based clustering algorithms to find good-quality clusterings in spatial data. Recently, many wrapper methods have been proposed to improve clustering solutions. A wrapper method is one that transforms the input or output of the clustering algorithm, and/or uses the algorithm multiple times. One commonly used wrapper method is simply running the clustering algorithm several times from different starting points (often called random restart) and taking the best solution; a minimal sketch of this idea appears at the end of this introduction. Algorithms such as the one used in [10] push this technique to its extreme, at the cost of computation. Another wrapper method is searching for the best initializations possible; this has been looked at in [16, 12, 3]. This is fruitful research, as many clustering algorithms are sensitive to their initializations. Other research [15, 17] has looked at finding the appropriate number of clusters, and at analyzing the difference between the cluster solution and the dataset. This is useful when the appropriate number of centers is unknown, or the algorithm is stuck at a sub-optimal solution.

These approaches are beneficial, but they attempt to fix the problems of clustering algorithms externally, rather than to improve the clustering algorithms themselves. We are interested in improving the clustering algorithms directly, to make them less sensitive to initialization and give better solutions. Of course, any clustering algorithm developed could benefit from wrapper methods.

Recently, Zhang et al. introduced a new clustering algorithm called k-harmonic means (KHM) that arises from an optimization criterion based on the harmonic mean [22, 21]. This algorithm shows promise in finding good clustering solutions quickly, and outperforms k-means (KM) and Gaussian expectation-maximization (GEM) in many tests. The KHM algorithm also has a novel feature that gives more influence to data points that are not well-modeled by the clustering solution, but it is unknown how important this feature is.
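The random-restart wrapper mentioned above is not one of the algorithms studied in this paper, but a minimal sketch shows how simply it wraps any center-based method. The `cluster` routine and `quality` score below are generic placeholders of our own naming (lower quality is better), not part of the paper.

```python
import numpy as np

def random_restart(cluster, quality, X, k, restarts=10, seed=0):
    """Run a clustering algorithm several times from random starting
    centers and keep the solution with the best (lowest) quality score."""
    rng = np.random.default_rng(seed)
    best_centers, best_score = None, np.inf
    for _ in range(restarts):
        # choose k data points at random as the starting centers
        init = X[rng.choice(len(X), size=k, replace=False)]
        centers = cluster(X, init)        # any center-based algorithm
        score = quality(X, centers)       # e.g. the k-means objective
        if score < best_score:
            best_centers, best_score = centers, score
    return best_centers, best_score
```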
3. CENTER-BASED CLUSTERING ALGORITHMS
The algorithms k-means, Gaussian expectation-maximization, fuzzy k-means, and k-harmonic means are in the family of center-based clustering algorithms. Each has its own objective function, which defines how good a clustering solution is. The goal of each algorithm is to minimize its objective function. Since these objective functions cannot be minimized directly, we use iterative update algorithms which converge on local minima.

3.1 General iterative clustering
We can formulate a general model for the family of clustering algorithms that use iterative optimization, following [8], and use this framework to make comparisons between algorithms. Define a d-dimensional set of n data points X = {x_1, ..., x_n} as the data to be clustered. Define a d-dimensional set of k centers C = {c_1, ..., c_k} as the clustering solution that an iterative algorithm refines.

A membership function m(c_j|x_i) defines the proportion of data point x_i that belongs to center c_j, with constraints m(c_j|x_i) ≥ 0 and \sum_{j=1}^{k} m(c_j|x_i) = 1. Some algorithms use a hard membership function, meaning m(c_j|x_i) ∈ {0, 1}, while others use a soft membership function, meaning 0 ≤ m(c_j|x_i) ≤ 1. Kearns and colleagues have analyzed the differences between hard and soft membership from an information-theoretic standpoint [9]. One of the reasons that k-means can converge to poor solutions is its hard membership function. However, the hard membership function makes possible many computational optimizations that do not affect the accuracy of the algorithm, such as using kd-trees [14].

A weight function w(x_i) defines how much influence data point x_i has in recomputing the center parameters in the next iteration, with constraint w(x_i) > 0. A dynamic, or changing, weight function was introduced in [21]. Giving variable influence to data in clustering has analogies to boosting in supervised learning [7]. Each approach gives more weight to data points that are not "well-covered" by the current solution. Unlike boosting, this approach does not create an ensemble of solutions.

Now we can define a general model of iterative, center-based clustering. The steps are:

1. Initialize the algorithm with guessed centers C.

2. For each data point x_i, compute its membership m(c_j|x_i) in each center c_j and its weight w(x_i).

3. For each center c_j, recompute its location from all data points x_i according to their memberships and weights:

  c_j = \frac{\sum_{i=1}^{n} m(c_j|x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} m(c_j|x_i)\, w(x_i)}    (1)

3.2 K-means
The function that the KM algorithm optimizes is

  KM(X, C) = \sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \|x_i - c_j\|^2    (2)

This objective function gives an algorithm which minimizes the within-cluster variance (the squared distance between each center and its assigned data points).

The membership and weight functions for KM are:

  m_{KM}(c_l|x_i) = \begin{cases} 1 & \text{if } l = \arg\min_j \|x_i - c_j\|^2 \\ 0 & \text{otherwise} \end{cases}    (3)

  w_{KM}(x_i) = 1    (4)

KM has a hard membership function, and a constant weight function that gives all data points equal importance. KM is easy to understand and implement, making it a popular algorithm for clustering.

3.3 Gaussian expectation-maximization
The Gaussian expectation-maximization (GEM) algorithm for clustering uses a linear combination of d-dimensional Gaussian distributions as the centers. It minimizes the objective function

  GEM(X, C) = -\sum_{i=1}^{n} \log \left( \sum_{j=1}^{k} p(x_i|c_j)\, p(c_j) \right)    (5)

where p(x_i|c_j) is the probability of x_i given that it is generated by the Gaussian distribution with center c_j, and p(c_j) is the prior probability of center c_j. We use a logarithm to make the math easier (while not changing the solution), and we negate the value so that we can minimize the quantity (as we do with the other algorithms we investigate). See [2, pages 59–73] for more about this algorithm. The membership and weight functions of GEM are

  m_{GEM}(c_j|x_i) = \frac{p(x_i|c_j)\, p(c_j)}{p(x_i)}    (6)

  w_{GEM}(x_i) = 1    (7)

Bayes' rule is used to compute the soft membership, and m_{GEM} is a probability since the factors in Equation 6 are probabilities. GEM has a constant weight function that gives all data points equal importance, like KM. Note that w_{GEM}(x_i) is not the same as p(x_i).
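The general model above maps directly to code. The following is a minimal sketch of the iterative loop (ours, not the authors' Matlab implementation), with k-means-style hard membership and unit weights plugged in; any of the membership/weight pairs defined in this paper can be substituted.

```python
import numpy as np

def iterate_centers(X, C, membership, weight, iters=100):
    """Generic center-based clustering loop: alternate computing
    memberships m(c_j|x_i) and weights w(x_i), then move each center
    to the weighted mean of the data (Equation 1)."""
    for _ in range(iters):
        M = membership(X, C)             # shape (n, k), rows sum to 1
        w = weight(X, C)                 # shape (n,), all entries > 0
        A = M * w[:, None]               # combined coefficients
        # small epsilon guards centers that receive no weight
        C = (A.T @ X) / (A.sum(axis=0)[:, None] + 1e-12)
    return C

def km_membership(X, C):
    """Hard membership: each point belongs entirely to its closest center."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    M = np.zeros_like(d2)
    M[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return M

def km_weight(X, C):
    """Constant weight: every data point has equal influence."""
    return np.ones(len(X))
```

With these two functions the loop reduces to the familiar k-means centroid update; GEM's soft membership (Equation 6) and the functions defined below for FKM, KHM, H1, and H2 drop into the same loop.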
3.4 Fuzzy k-means
The fuzzy k-means algorithm (FKM) [1] uses a soft membership controlled by a "fuzziness" exponent r > 1. The membership and weight functions for FKM are:

  m_{FKM}(c_j|x_i) = \frac{\|x_i - c_j\|^{-2/(r-1)}}{\sum_{j=1}^{k} \|x_i - c_j\|^{-2/(r-1)}}    (9)

  w_{FKM}(x_i) = 1    (10)

FKM has a soft membership function, and a constant weight function. As r tends toward 1 from above, the algorithm behaves more like standard k-means, and the centers share the data points less.

3.5 K-harmonic means
The k-harmonic means algorithm (KHM) is a method similar to KM that arises from a different objective function [21]. The KHM objective function uses the harmonic mean of the distances from each data point to all centers:

  KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \frac{1}{\|x_i - c_j\|^{p}}}    (11)

Here p is an input parameter, and typically p ≥ 2. The harmonic mean gives a good (low) score for each data point when that data point is close to any one center. This is a property of the harmonic mean: it is similar to the minimum function used by KM, but it is a smooth, differentiable function.

The membership and weight functions for KHM are:

  m_{KHM}(c_j|x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}    (12)

  w_{KHM}(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\left( \sum_{j=1}^{k} \|x_i - c_j\|^{-p} \right)^2}    (13)

Note that KHM has a soft membership function, and also a varying weight function. This weight function gives higher weight to points that are far away from every center, which aids the centers in spreading to cover the data.

The implementation of KHM needs to deal with the case where x_i = c_j. In this case we follow Zhang, using max(\|x_i - c_j\|, ε) with a small positive value of ε. We also apply this technique to FKM and to the algorithms discussed in Section 4. We have not encountered any numerical problems in any of our tests.

4. NEW CLUSTERING ALGORITHMS
We are interested in the properties of the new algorithm KHM. It has a soft membership function and a varying weight function, which makes it unique among the algorithms we have encountered. KHM has been shown to be less sensitive to initialization on synthetic data [21].

Here we analyze two aspects of KHM (the membership and the weight functions) and define two new algorithms we call Hybrid 1 and Hybrid 2. They are named for the fact that they are hybrid algorithms that combine features of KM and KHM. The purpose of creating these algorithms is to find out what effects the membership and weight functions of KHM have by themselves.

4.1 Hybrid 1: hard membership, varying weights
Hybrid 1 (H1) uses the hard membership function of KM: every point belongs only to its closest center. However, H1 uses the KHM weight function, which gives more weight to points that are far from every center. We expect that this algorithm should converge more quickly than KM due to the weights, but will still have problems related to the hard membership function. As far as we know, adding weights in this manner to KM is a new idea.

The definitions of the membership and weight functions for H1 are:

  m_{H1}(c_l|x_i) = \begin{cases} 1 & \text{if } l = \arg\min_j \|x_i - c_j\|^2 \\ 0 & \text{otherwise} \end{cases}    (14)

  w_{H1}(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\left( \sum_{j=1}^{k} \|x_i - c_j\|^{-p} \right)^2}    (15)

4.2 Hybrid 2: soft membership, constant weights
Hybrid 2 (H2) uses the soft membership function of KHM, and the constant weight function of KM. The definitions of the membership and weight functions for H2 are:

  m_{H2}(c_j|x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}    (16)

  w_{H2}(x_i) = 1    (17)

Note that H2 resembles FKM. In fact, for certain values of r and p they are mathematically equivalent. It is interesting to note, then, that the membership functions of KHM (from which we get H2) and FKM are also very similar. We investigate H2 and FKM as separate entities to keep clear the fact that we are investigating the membership and weight functions of KHM separately.

5. EXPERIMENTAL SETUP
We perform two sets of experiments to demonstrate the properties of the algorithms described in Sections 3 and 4. We want to answer several questions: how do different initializations affect each algorithm, what is the influence of soft versus hard membership, and what is the benefit of using varying versus constant weights?

Though each algorithm minimizes a different objective function, we measure the quality of each clustering solution by the square root of the k-means objective function in Equation 2. It is a reasonable metric by which to judge cluster quality, and by using a single metric we can compare different algorithms. We use the square root because the squared distance term can exaggerate the severity of poor solutions. We considered running KM on the output of each algorithm, so that the KM objective function could be better minimized; we found that this did not help significantly, so we do not do it here.

Our experiments use two datasets already used in recent empirical work on clustering algorithms [23, 14], and a photograph of a hand from [4]. The algorithms we test are KM, KHM, FKM, H1, H2, and GEM. The code for each of these algorithms is our own (written in Matlab), except for GEM (FastMix code provided by [18]). We need to supply KHM, H1, and H2 with the parameter p, and FKM with r. We set p = 3.5 for all tests, as that was the best value found by Zhang. We set r = 1.3, as that is the best value we found in our preliminary tests.

The two initializations we use are the Forgy and Random Partition methods [16]. The Forgy method chooses k data points from the dataset at random and uses them as the initial centers. The Random Partition method assigns each data point to a random center, then computes the initial location of each center as the centroid of its assigned points. The Forgy method tends to spread centers out in the data, while the Random Partition method tends to place the centers in a small area near the middle of the dataset. Random Partition was found to be a preferable initialization method for its simplicity and quality in [16, 12]. For GEM, we also initialize …
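As a concrete illustration of the setup just described, the two initialization methods and the quality metric of Equation 2 can be sketched as follows. This is our own Python sketch; the paper's implementation is in Matlab and is not reproduced here.

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy: pick k distinct data points as the initial centers."""
    return X[rng.choice(len(X), size=k, replace=False)].copy()

def random_partition_init(X, k, rng):
    """Random Partition: assign each point to a random center, then use
    the centroid of each group; centers start near the middle of the data."""
    labels = rng.integers(k, size=len(X))
    centers = []
    for j in range(k):
        members = X[labels == j]
        # fall back to the overall mean if a group happens to be empty
        centers.append(members.mean(axis=0) if len(members) else X.mean(axis=0))
    return np.array(centers)

def quality(X, C):
    """Square root of the k-means objective (Equation 2); lower is better."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1).sum())
```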
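Similarly, the KHM membership and weight functions of Equations 12 and 13, which the hybrids of Section 4 reuse, can be sketched as follows; these plug directly into the generic loop given after Section 3.3. The function names are ours, and distances are clipped at a small ε as described in Section 3.5.

```python
import numpy as np

def khm_membership_and_weight(X, C, p=3.5, eps=1e-8):
    """KHM soft membership (Eq. 12) and varying weight (Eq. 13).
    X: (n, d) data, C: (k, d) centers, p: the KHM distance exponent."""
    d = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
    d = np.maximum(d, eps)                      # guard the x_i == c_j case
    num = d ** (-p - 2)                         # ||x_i - c_j||^(-p-2)
    m = num / num.sum(axis=1, keepdims=True)    # rows sum to 1 (soft membership)
    w = num.sum(axis=1) / (d ** (-p)).sum(axis=1) ** 2   # larger for uncovered points
    return m, w

# Hybrid 1 combines the hard k-means membership with the KHM weight above;
# Hybrid 2 combines the KHM membership above with constant weight w(x_i) = 1.
def h1_membership(X, C):
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    m = np.zeros_like(d2)
    m[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return m
```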
6. EXPERIMENTAL RESULTS

[Figure 3: Experiment 1: Forgy (left) and Random Partition (right) initializations for the BIRCH dataset. Centers are shown in the dark color, data points in the light color. This dataset has a grid of 10x10 natural clusters. Panel titles: "K-means: 1 iteration. Quality: 412.3576" (Forgy) and "K-means: 1 iteration. Quality: 13784.0551" (Random Partition).]

Table 2: Experiment 1: Quality of solutions for one run on the BIRCH dataset, using Forgy and Random Partition initializations. Lower quality scores are better. "Clusters found" is the number of true clusters (maximum 100) in which the algorithm placed at least one center.

           √KM quality            Clusters found
           Forgy      RP          Forgy    RP
    GEM    15.530     24.399      77       49
    KM     12.771     18.396      83       60
    H1     12.159     15.242      86       72
    FKM    11.612     10.441      89       93
    H2     10.670      9.908      92       95
    KHM    10.255      9.999      94       95
[Figure 5: Experiment 2: Convergence curves starting from Forgy (top) and Random Partition (bottom) initializations on 2-d synthetic data. The x-axis shows the number of iterations, and the y-axis (log scale) shows the average clustering quality score, where lower values are better. Only the final results for GEM are shown. Note that KM and GEM perform worse than every other algorithm. Axis labels: x "iteration #"; y "average performance times optimal".]
Table 3 shows the point-wise comparison of each algorithm for the same experiment. It is clear from this as well that the soft membership algorithms (KHM, FKM, H2) perform better than the hard membership algorithms (KM, H1) in both average performance and variance. The results for the 4- and 6-dimensional datasets are very similar, so we do not report them here.

Figure 5 shows the speed of convergence of each algorithm for d = 2 dimensions. The x-axis shows the iteration number, and the y-axis shows the average k-means quality ratio at that iteration, computed using the 100 datasets. We can see that GEM and KM are uniformly inferior to every other algorithm, and that the soft membership algorithms KHM, H2, and FKM move quickly to find good solutions. Only the final result for GEM is plotted, as we cannot capture clustering progress before the FastMix software terminates.

FastMix has the ability to add and remove centers to better fit its data. FastMix adds a center if the model underpredicts the data, and removes a center if its prior probability is too low. We expect that FastMix's ability to add centers would be helpful in a dataset in which clusters are well-separated. For experiment 2, FastMix began with 50 centers and only removed centers. FastMix converged with an average of 48.39 centers (Forgy) and 40.13 centers (Random Partition) in the 2-dimension test. This shows that GEM is also sensitive to poor initializations.

Table 4: Experiment 2: The mean and standard deviation of R_{2,i}, the ratio between the k-means quality and the optimum, over 100 datasets, in 2 dimensions. Lower values are better. Results for 4 and 6 dimensions are similar, and have the same ranking.

           Forgy                 Random Partition
    GEM    1.3262 +/- 0.1342     2.3653 +/- 0.4497
    KM     1.1909 +/- 0.0953     2.0905 +/- 0.2616
    H1     1.1473 +/- 0.0650     1.7644 +/- 0.2403
    FKM    1.1281 +/- 0.0637     1.0989 +/- 0.0499
    H2     1.1077 +/- 0.0536     1.0788 +/- 0.0416
    KHM    1.0705 +/- 0.0310     1.0605 +/- 0.0294

7. CONCLUSIONS
Our experiments clearly show the superiority of the k-harmonic means algorithm (KHM) for finding clusterings of high quality in low dimensions. Our algorithms H1 and H2 let us study the effects of the KHM weight and membership functions separately. They show that soft membership is essential for finding good clusterings, as H2 performs nearly as well as KHM, but that varying weights are beneficial with a hard membership function, since H1 performs better than KM. Varying weights are intuitively similar to the weights applied to training examples by boosting [7]. It remains to be seen whether this analogy can be made precise.

Previous work on initialization methods has concluded that the Random Partition method is good for GEM and for KM, but our experiments do not confirm this conclusion. The Forgy method of initialization (choosing random points as initial centers) works best for GEM, KM, and H1. Overall, our results suggest that the best algorithms available today are FKM, H2, and KHM, initialized by the Random Partition method.

Clustering in high dimensions has been an open problem for many years. However, recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-harmonic means, rather than clustering in the high dimension directly. In [5] the author shows that using a simple, inexpensive linear projection preserves many of the properties of the data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data, such as k-harmonic means.

8. REFERENCES
[1] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[3] P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering. In Proc. 15th International Conf. on Machine Learning, pages 91–99. Morgan Kaufmann, San Francisco, CA, 1998.
[4] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
Table 3: Experiment 2: Competition matrix for 2-d data starting from Forgy (left) and Random Partition (right) initializations. Each entry shows the number of times (maximum 100) that the algorithm in the column had a better-quality clustering than the algorithm in the row. The results for 4 and 6 dimensions are similar, so we do not report them here. In particular, KHM finds a better clustering than KM in 99 or 100 of the 100 trials in each dimension tested.
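The competition matrix described in this caption can be computed from per-dataset quality scores as in the following sketch; the function and variable names are ours, and lower quality values are taken to be better, as in Equation 2.

```python
import numpy as np

def competition_matrix(scores):
    """scores: dict mapping algorithm name -> array of quality values,
    one per dataset (lower is better). Entry (row, col) counts how many
    datasets the column algorithm clustered better than the row algorithm."""
    names = list(scores)
    M = np.zeros((len(names), len(names)), dtype=int)
    for r, a in enumerate(names):
        for c, b in enumerate(names):
            if a != b:
                M[r, c] = int(np.sum(np.asarray(scores[b]) < np.asarray(scores[a])))
    return names, M
```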
[Figure: Final clusterings on the BIRCH dataset, arranged in two columns by initialization (Forgy, left; Random Partition, right). The k-harmonic means (p=3.5) panels after 100 iterations report Quality: 105.1674 (Forgy) and Quality: 99.98 (Random Partition).]