
Alternatives to the k-means algorithm that find better clusterings

Greg Hamerly (ghamerly@cs.ucsd.edu) and Charles Elkan (elkan@cs.ucsd.edu)
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093

ABSTRACT

We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions. We then show that the algorithms do behave very differently from each other on simple low-dimensional synthetic datasets and image segmentation tasks, and that the k-harmonic means method is superior. Having a soft membership function is essential for finding high-quality clusterings, but having a non-constant data weight function is useful also.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval; I.5.3 [Computing Methodologies]: Pattern Recognition

General Terms

Clustering quality, k-means, k-harmonic means, unsupervised classification

1. INTRODUCTION

Data clustering, which is the task of finding natural groupings in data, is an important task in machine learning and pattern recognition. Typically in clustering there is no one perfect solution to the problem, but algorithms seek to minimize a certain mathematical criterion (which varies between algorithms). Minimizing such criteria is known to be NP-hard for the general problem of partitioning d-dimensional data into k sets [6]. Algorithms like k-means seek local rather than global minimum solutions, and can get stuck at poor solutions. In these cases we consider a solution which better minimizes the mathematical criterion (for the same number of centers) to be a better-quality clustering.

We use the term "center-based clustering" to refer to the family of algorithms such as k-means and Gaussian expectation-maximization, since they use a number of "centers" to represent and/or partition the input data. Each center defines a cluster with a central point and perhaps a covariance matrix. Center-based clustering algorithms begin with a guess about the solution, and then refine the positions of centers until reaching a local optimum. These methods can work well, but they can also converge to a local minimum that is far from the global minimum, i.e. the clustering that has the highest quality according to the criterion in use. Converging to bad local optima is related to sensitivity to initialization, and is a primary problem of data clustering.

The goal of this work is to understand and extend center-based clustering algorithms to find good-quality clusterings in spatial data. Recently, many wrapper methods have been proposed to improve clustering solutions. A wrapper method is one that transforms the input or output of the clustering algorithm, and/or uses the algorithm multiple times. One commonly used wrapper method is simply running the clustering algorithm several times from different starting points (often called random restart), and taking the best solution. Algorithms such as the one used in [10] push this technique to its extreme, at the cost of computation. Another wrapper method is searching for the best initializations possible; this has been studied in [16, 12, 3]. This is fruitful research, as many clustering algorithms are sensitive to their initializations. Other research [15, 17] has looked at finding the appropriate number of clusters, and at analyzing the difference between the cluster solution and the dataset. This is useful when the appropriate number of centers is unknown, or the algorithm is stuck at a sub-optimal solution.

These approaches are beneficial, but they attempt to fix the problems of clustering algorithms externally, rather than to improve the clustering algorithms themselves. We are interested in improving the clustering algorithms directly, to make them less sensitive to initialization and give better solutions. Of course, any clustering algorithm developed could benefit from wrapper methods.

Recently, Zhang et al. introduced a new clustering algorithm called k-harmonic means (KHM) that arises from an optimization criterion based on the harmonic mean [22, 21]. This algorithm shows promise in finding good clustering solutions quickly, and outperforms k-means (KM) and Gaussian expectation-maximization (GEM) in many tests. The KHM algorithm also has a novel feature that gives more influence to data points that are not well-modeled by the clustering solution, but it is unknown how important this feature is. Our work is a first answer to this question.

Figure 1: The original "hand" image used for segmentation. The image size is 80x64 pixels and is full color.

In this paper, we present a unified framework for looking at center-based clustering algorithms, and then derive two new algorithms that are based on properties of KM and KHM. The algorithms are compared analytically and empirically.

2. IMAGE SEGMENTATION

We motivate the need for good-quality clustering algorithms with an image segmentation example. Image segmentation is the task of grouping the pixels of an image according to color, texture, and location. Clustering is an important part of image segmentation. There are two parts to image segmentation. First, we define a set of useful features on image pixels (such as position, color, and texture) that are used as the input to a clustering algorithm. Second, we feed the data into a clustering algorithm, and cluster it to produce the segmentation.

We use the image "hand" (see Figure 1) from the computer vision literature [4]. Rather than using complicated features such as texture or color histograms, we simply convert the image to a dataset with 5 dimensions. The first two dimensions are the pixel coordinates (x, y), and the last three dimensions are the color values (l, u, v) in LUV color space. LUV is a standard in computer graphics for representing perceptually uniform colors. Other image attributes may be more appropriate (e.g. texture similarity or color histogram windows around each pixel), but we are interested in the clustering algorithm, and not the inputs to the algorithm. Other image segmentation methods [4, 19] also use similarity metrics for spatial (x and y), color, and texture information.

We normalize each dimension to zero mean and unit variance, and then cluster the data using two algorithms (KM and KHM), two different initializations (Forgy and Random Partition, described later in the paper), and k = 5 clusters. We found that KHM was able to find better-quality clusterings than KM, according to the KM quality metric. Table 1 shows the KM quality metric for both KM and KHM (lower is better). Also, KHM found the same segmentation for both types of initialization, while KM behaved very differently depending on the initialization.

Table 1: Quality of solutions for image segmentation. Numbers are for the k-means quality metric; lower values are better. The KHM algorithm found better-quality clusterings than KM, and found the same clustering regardless of initialization.

        Forgy    Random Partition
KM      10398    10244
KHM     10137    10137

Figure 2 shows the outputs for the image segmentation task, with k = 5 clusters. Each segment in the output image is colored by the average color within the segment. Please see a full-color version of this paper for the correct view. The segmentation by KM does poorly because it splits the hand into two segments with Forgy initialization, and it has a poor segmentation of the index finger with Random Partition initialization. However, KHM does a better job of clearly segmenting the hand away from the background and has a visually consistent segmentation across initializations. Thus we see that the better-quality clustering (according to the KM quality metric) has given a better-quality image segmentation (upon visual inspection).

Figure 2: Image segmentation results. From top to bottom: KM, KHM. We chose k = 5 clusters on the data (x, y, l, u, v) of the original input image "hand". Each segment is colored by the average color of the segment. Please see a full-color version of this paper for the correct view. Note that KHM has more clearly segmented the hand from the background, and its segmentation is the same for both initializations. The KM segmentation varies depending on its initialization, and its segmentations are poorer. Notice that the hand is split into two segments by KM with Forgy initialization, and notice the differences around the tip of the index finger.

Several recent algorithms in image segmentation [19, 13] are based on eigenvector computations on distance matrices. These "spectral" algorithms still use k-means as a post-processing step to find the actual segmentation, usually in a lower-dimensional space than the original input. Thus there is a great need for good-quality clustering algorithms like k-means. Additionally, these spectral algorithms are expensive, costing O(n^3) time, where n is the number of input data points. Another class of clustering algorithms, agglomerative clustering algorithms, costs O(n^2) time to compare the distances between all pairs of points. For large images, or real-time video segmentation, speed is essential. We are therefore interested in linear-time O(n) clustering algorithms, which is what we consider in this paper.
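To make the preprocessing in this section concrete, the sketch below builds the 5-dimensional (x, y, l, u, v) dataset from an RGB image and applies the zero-mean, unit-variance normalization described above. This is a minimal Python/NumPy illustration, not the authors' original Matlab code; the use of scikit-image's rgb2luv and all names are our own assumptions.

```python
import numpy as np
from skimage.color import rgb2luv  # assumed helper for RGB -> LUV conversion

def image_to_features(rgb_image):
    """Convert an RGB image (H x W x 3) into an n x 5 dataset of
    (x, y, l, u, v) rows, z-score normalized per dimension."""
    height, width, _ = rgb_image.shape
    luv = rgb2luv(rgb_image)                     # per-pixel (l, u, v) values

    ys, xs = np.mgrid[0:height, 0:width]         # pixel coordinates
    features = np.column_stack([
        xs.ravel(), ys.ravel(),                  # spatial dimensions x, y
        luv[..., 0].ravel(),                     # l
        luv[..., 1].ravel(),                     # u
        luv[..., 2].ravel(),                     # v
    ]).astype(float)

    # Normalize each dimension to zero mean and unit variance so that every
    # dimension has equal influence on the distance metric.
    features -= features.mean(axis=0)
    features /= features.std(axis=0)
    return features

# The resulting array can then be clustered with k = 5 centers as in the paper.
```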

3. CENTER-BASED CLUSTERING
The algorithms k-means, Gaussian expectation-maximization, fuzzy k-means, and k-harmonic means are in the family of center-based clustering algorithms. They each have their own objective function, which defines how good a clustering solution is. The goal of each algorithm is to minimize its objective function. Since these objective functions cannot be minimized directly, we use iterative update algorithms which converge on local minima.

3.1 General iterative clustering

We can formulate a general model for the family of clustering algorithms that use iterative optimization, following [8], and use this framework to make comparisons between algorithms. Define a d-dimensional set of n data points X = {x_1, ..., x_n} as the data to be clustered. Define a d-dimensional set of k centers C = {c_1, ..., c_k} as the clustering solution that an iterative algorithm refines.

A membership function m(c_j | x_i) defines the proportion of data point x_i that belongs to center c_j, with constraints m(c_j | x_i) ≥ 0 and ∑_{j=1}^{k} m(c_j | x_i) = 1. Some algorithms use a hard membership function, meaning m(c_j | x_i) ∈ {0, 1}, while others use a soft membership function, meaning 0 ≤ m(c_j | x_i) ≤ 1. Kearns and colleagues have analyzed the differences between hard and soft membership from an information-theoretic standpoint [9]. One of the reasons that k-means can converge to poor solutions is its hard membership function. However, the hard membership function makes possible many computational optimizations that do not affect the accuracy of the algorithm, such as using kd-trees [14].

A weight function w(x_i) defines how much influence data point x_i has in recomputing the center parameters in the next iteration, with constraint w(x_i) > 0. A dynamic, or changing, weight function was introduced in [21]. Giving variable influence to data in clustering has analogies to boosting in supervised learning [7]. Each approach gives more weight to data points that are not "well-covered" by the current solution. Unlike boosting, this approach does not create an ensemble of solutions.

Now we can define a general model of iterative, center-based clustering. The steps are:

1. Initialize the algorithm with guessed centers C.

2. For each data point x_i, compute its membership m(c_j | x_i) in each center c_j and its weight w(x_i).

3. For each center c_j, recompute its location from all data points x_i according to their memberships and weights:

   c_j = \frac{\sum_{i=1}^{n} m(c_j | x_i) \, w(x_i) \, x_i}{\sum_{i=1}^{n} m(c_j | x_i) \, w(x_i)}    (1)

4. Repeat steps 2 and 3 until convergence.

Now we can compare algorithms based on their membership and weight functions. An alternative initialization procedure is to guess an initial partition, and then start the algorithm from step 3. The computational complexity of each algorithm in this paper is O(nkd) for each update iteration (Equation 1). The algorithms vary by constant factors but have the same order complexity.

3.2 K-means

The k-means algorithm (KM) [11] partitions data into k sets. The solution is then a set of k centers, each of which is located at the centroid of the data for which it is the closest center. For the membership function, each data point belongs to its nearest center, forming a Voronoi partition of the data. The objective function that the KM algorithm optimizes is

   KM(X, C) = \sum_{i=1}^{n} \min_{j \in \{1, \ldots, k\}} \|x_i - c_j\|^2    (2)

This objective function gives an algorithm which minimizes the within-cluster variance (the squared distance between each center and its assigned data points).

The membership and weight functions for KM are:

   m_{KM}(c_l | x_i) = 1 if l = \arg\min_j \|x_i - c_j\|^2, and 0 otherwise    (3)

   w_{KM}(x_i) = 1    (4)

KM has a hard membership function, and a constant weight function that gives all data points equal importance. KM is easy to understand and implement, making it a popular algorithm for clustering.

3.3 Gaussian expectation-maximization

The Gaussian expectation-maximization (GEM) algorithm for clustering uses a linear combination of d-dimensional Gaussian distributions as the centers. It minimizes the objective function

   GEM(X, C) = -\sum_{i=1}^{n} \log \left( \sum_{j=1}^{k} p(x_i | c_j) \, p(c_j) \right)    (5)

where p(x_i | c_j) is the probability of x_i given that it is generated by the Gaussian distribution with center c_j, and p(c_j) is the prior probability of center c_j. We use a logarithm to make the math easier (while not changing the solution), and we negate the value so that we can minimize the quantity (as we do with the other algorithms we investigate). See [2, pages 59–73] for more about this algorithm. The membership and weight functions of GEM are

   m_{GEM}(c_j | x_i) = \frac{p(x_i | c_j) \, p(c_j)}{p(x_i)}    (6)

   w_{GEM}(x_i) = 1    (7)

Bayes' rule is used to compute the soft membership, and m_{GEM} is a probability since the factors in Equation 6 are probabilities. GEM has a constant weight function that gives all data points equal importance, like KM. Note that w_{GEM}(x_i) is not the same as p(x_i).

3.4 Fuzzy k-means

The fuzzy k-means algorithm (FKM; also called fuzzy c-means) [1] is an adaptation of the KM algorithm that uses a soft membership function. Unlike KM, which assigns each data point to its closest center, the FKM algorithm allows a data point to belong partly to all centers, like GEM. Its objective function is

   FKM(X, C) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{r} \|x_i - c_j\|^2    (8)

The parameter u_{ij} denotes the proportion of data point x_i that is assigned to center c_j, and is under the constraints ∑_{j=1}^{k} u_{ij} = 1 for all i and u_{ij} ≥ 0. The parameter r has the constraint r ≥ 1. A larger value for r makes the method "more fuzzy."

Bezdek and others give separate update functions for u_{ij} and c_j. The u_{ij} update equation depends only on C and X, so we incorporate its update function into the update for c_j. Then we can represent FKM in the form of the general iterative update of Equation 1.
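As a sketch of the general update of Equation 1, the following Python/NumPy code (our own illustration, not the authors' Matlab implementation; all names are assumptions) runs the generic iterative loop with pluggable membership and weight functions, instantiated here with the hard membership and constant weight of KM (Equations 3 and 4). The other algorithms in this framework differ only in the two functions that are plugged in.

```python
import numpy as np

def km_membership(X, C):
    """Hard membership (Equation 3): an (n, k) 0/1 matrix with a single 1
    per row, at the closest center."""
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # (n, k)
    m = np.zeros_like(dists)
    m[np.arange(len(X)), dists.argmin(axis=1)] = 1.0
    return m

def km_weight(X, C):
    """Constant weight (Equation 4): every data point has equal influence."""
    return np.ones(len(X))

def iterative_clustering(X, C, membership, weight, iterations=100):
    """Generic center-based clustering loop (Equation 1).
    X is (n, d) data; C is (k, d) initial centers."""
    C = np.asarray(C, dtype=float).copy()
    for _ in range(iterations):
        m = membership(X, C)               # (n, k) memberships m(c_j | x_i)
        w = weight(X, C)                   # (n,) weights w(x_i)
        mw = m * w[:, None]                # combined coefficients
        numer = mw.T @ X                   # (k, d)
        denom = mw.sum(axis=0)             # (k,)
        # Recompute each center as a weighted average of the data (Equation 1).
        # A center with zero total weight (possible under hard membership)
        # is left where it is.
        nonzero = denom > 0
        C[nonzero] = numer[nonzero] / denom[nonzero, None]
    return C

# Plain k-means is recovered by plugging in Equations 3 and 4:
# centers = iterative_clustering(X, initial_centers, km_membership, km_weight)
```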
The membership and weight functions for FKM are:

   m_{FKM}(c_j | x_i) = \frac{\|x_i - c_j\|^{-2/(r-1)}}{\sum_{j'=1}^{k} \|x_i - c_{j'}\|^{-2/(r-1)}}    (9)

   w_{FKM}(x_i) = 1    (10)

FKM has a soft membership function, and a constant weight function. As r tends toward 1 from above, the algorithm behaves more like standard k-means, and the centers share the data points less.

3.5 K-harmonic means

The k-harmonic means algorithm (KHM) is a method similar to KM that arises from a different objective function [21]. The KHM objective function uses the harmonic mean of the distances from each data point to all centers:

   KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \frac{1}{\|x_i - c_j\|^{p}}}    (11)

Here p is an input parameter, and typically p ≥ 2. The harmonic mean gives a good (low) score for each data point when that data point is close to any one center. This is a property of the harmonic mean; it is similar to the minimum function used by KM, but it is a smooth differentiable function.

The membership and weight functions for KHM are:

   m_{KHM}(c_j | x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j'=1}^{k} \|x_i - c_{j'}\|^{-p-2}}    (12)

   w_{KHM}(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\left( \sum_{j=1}^{k} \|x_i - c_j\|^{-p} \right)^2}    (13)

Note that KHM has a soft membership function, and also a varying weight function. This weight function gives higher weight to points that are far away from every center, which aids the centers in spreading to cover the data.

The implementation of KHM needs to deal with the case where x_i = c_j. In this case we follow Zhang in using max(\|x_i - c_j\|, ε) with a small positive value of ε. We also apply this technique for FKM and the algorithms discussed in Section 3. We have not encountered any numerical problems in any of our tests.

4. NEW CLUSTERING ALGORITHMS

We are interested in the properties of the new algorithm KHM. It has a soft membership function and a varying weight function, which makes it unique among the algorithms we have encountered. KHM has been shown to be less sensitive to initialization on synthetic data [21].

Here we analyze two aspects of KHM (the membership and the weight functions) and define two new algorithms we call Hybrid 1 and Hybrid 2. They are named for the fact that they are hybrid algorithms that combine features of KM and KHM. The purpose of creating these algorithms is to find out what effects the membership and weight functions of KHM have by themselves.

4.1 Hybrid 1: hard membership, varying weights

Hybrid 1 (H1) uses the hard membership function of KM. Every point belongs only to its closest center. However, H1 uses the KHM weight function, which gives more weight to points that are far from every center. We expect that this algorithm should converge more quickly than KM due to the weights, but will still have problems related to the hard membership function. As far as we know, adding weights in this manner to KM is a new idea.

The definitions of the membership and weight functions for H1 are:

   m_{H1}(c_l | x_i) = 1 if l = \arg\min_j \|x_i - c_j\|^2, and 0 otherwise    (14)

   w_{H1}(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\left( \sum_{j=1}^{k} \|x_i - c_j\|^{-p} \right)^2}    (15)

4.2 Hybrid 2: soft membership, constant weights

Hybrid 2 (H2) uses the soft membership function of KHM, and the constant weight function of KM. The definitions of the membership and weight functions for H2 are:

   m_{H2}(c_j | x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j'=1}^{k} \|x_i - c_{j'}\|^{-p-2}}    (16)

   w_{H2}(x_i) = 1    (17)

Note that H2 resembles FKM. In fact, for certain values of r and p they are mathematically equivalent. It is interesting to note, then, that the membership functions of KHM (from which we get H2) and FKM are also very similar. We investigate H2 and FKM as separate entities to keep clear the fact that we are investigating the membership and weight functions of KHM separately.

5. EXPERIMENTAL SETUP

We perform two sets of experiments to demonstrate the properties of the algorithms described in Sections 3 and 4. We want to answer several questions: how do different initializations affect each algorithm, what is the influence of soft versus hard membership, and what is the benefit of using varying versus constant weights?

Though each algorithm minimizes a different objective function, we measure the quality of each clustering solution by the square root of the k-means objective function in Equation 2. It is a reasonable metric by which to judge cluster quality, and by using a single metric we can compare different algorithms. We use the square root because the squared distance term can exaggerate the severity of poor solutions. We considered running KM on the output of each algorithm, so that the KM objective function could be better minimized. We found that this did not help significantly, so we do not do it here.

Our experiments use two datasets already used in recent empirical work on clustering algorithms [23, 14], and a photograph of a hand from [4]. The algorithms we test are KM, KHM, FKM, H1, H2, and GEM. The code for each of these algorithms is our own (written in Matlab), except for GEM (FastMix code provided by [18]). We need to supply KHM, H1, and H2 with the parameter p, and FKM with r. We set p = 3.5 for all tests, as that was the best value found by Zhang. We set r = 1.3, as that is the best value we found based on our preliminary tests.

The two initializations we use are the Forgy and Random Partition methods [16]. The Forgy method chooses k data points from the dataset at random and uses them as the initial centers. The Random Partition method assigns each data point to a random center, then computes the initial location of each center as the centroid of its assigned points. The Forgy method tends to spread centers out in the data, while the Random Partition method tends to place the centers in a small area near the middle of the dataset. Random Partition was found to be a preferable initialization method for its simplicity and quality in [16, 12]. For GEM, we also initialize p(c_j) = 1/k and set the initial covariance of each center to 0.2I, where I is the identity matrix.
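The two initialization methods lend themselves to a short sketch. The following Python/NumPy code is our own illustration of Forgy and Random Partition initialization as described above, not the authors' Matlab code; the fallback for an empty partition is our own assumption.

```python
import numpy as np

def forgy_init(X, k, rng=None):
    """Forgy: choose k distinct data points at random as the initial centers.
    Tends to spread the centers out over the data."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].astype(float).copy()

def random_partition_init(X, k, rng=None):
    """Random Partition: assign every point to a random center, then place
    each center at the centroid of its assigned points. Tends to place all
    centers in a small area near the middle of the dataset."""
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.integers(0, k, size=len(X))
    centers = np.empty((k, X.shape[1]))
    for j in range(k):
        members = X[labels == j]
        # If a center happens to receive no points, fall back to the overall
        # centroid of the data (our own assumption for robustness).
        centers[j] = members.mean(axis=0) if len(members) else X.mean(axis=0)
    return centers
```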
Before clustering, all datasets used in both experiments are shifted and re-scaled to give each dimension zero mean and unit variance. This is the standard z-score transformation. This can be a good idea before using algorithms based on distance metrics, as it gives the same influence to each dimension.

5.1 Experiment 1: BIRCH

The purpose of our first experiment is to illustrate the convergence properties of the different algorithms, and to show the need to improve clustering algorithms. We use a randomly generated synthetic dataset we call BIRCH, as defined by [23]. This dataset has k = 100 true clusters arranged in a 10x10 grid in d = 2 dimensions. Each cluster generates 100 data points from its own Gaussian distribution, for a total of n = 10,000 data points. The distance between two adjacent cluster means is 4√2, with a cluster radius of √2 (meaning the variance in each dimension is 1). We run each algorithm twice, once with the Forgy initialization and once with the Random Partition initialization. Figure 3 shows the two initializations. We use the same randomly chosen initializations for all algorithms. Our results are similar for other random initializations.

Figure 3: Experiment 1: Forgy (left) and Random Partition (right) initializations for the BIRCH dataset. Centers are shown in the dark color, data points in the light color. This dataset has a grid of 10x10 natural clusters. [Panel titles: "K-means: 1 iterations. Quality: 412.3576" (left) and "K-means: 1 iterations. Quality: 13784.0551" (right).]

5.2 Experiment 2: Pelleg and Moore data

The second experiment uses a synthetic dataset based on work by [14]. Here we run many tests to determine the average-case behavior of the algorithms. We test datasets of dimensions 2, 4, and 6 to show that all these algorithms work well in low dimensions. Each dataset has k = 50 true natural clusters which generate n = 2500 total data points. The true cluster centers are chosen at random in the unit hypercube, then 2500 data points are generated by choosing a cluster at random and generating a data point according to a Gaussian distribution with standard deviation s = d × 0.012 and mean at the true cluster center. We generate data that is more separated (clusters have less overlap) than the work by Pelleg and Moore (who used s = d × 0.025), because this presents a more difficult task to the clustering algorithms: it is harder for centers to move freely through the whole dataset due to gaps between natural clusters.

For each d ∈ {2, 4, 6} we generate 100 datasets, and two initializations (Random Partition and Forgy) for each dataset. Then we test each algorithm from both of these initializations. We allow each algorithm to run for 100 iterations, which is plenty for the algorithms to converge.

6. EXPERIMENTAL RESULTS

6.1 Experiment 1: BIRCH

Running each algorithm on the BIRCH dataset once gives an intuition for how each behaves. Figure 3 shows the two initializations we use. The results of the KM, GEM, and KHM runs are shown in Figure 4, and the cluster qualities for all algorithms are shown in Table 2. FKM, H2, and KHM all found good clusterings for both types of initialization, and they are all soft membership algorithms.

Table 2: Experiment 1: Quality of solutions for one run on the BIRCH dataset, using Forgy and Random Partition (RP) initializations. Lower quality scores are better. "Clusters found" is the number of true clusters (maximum 100) in which the algorithm placed at least one center.

        KM quality           Clusters found
        Forgy     RP         Forgy    RP
GEM     15.530    24.399     77       49
KM      12.771    18.396     83       60
H1      12.159    15.242     86       72
FKM     11.612    10.441     89       93
H2      10.670     9.908     92       95
KHM     10.255     9.999     94       95

The two hard membership algorithms, KM and H1, have distinctly different behavior for the two initializations. Starting from Forgy initialization these two algorithms perform reasonably well, but starting from Random Partition these algorithms converge with many centers remaining in the middle of the dataset, "trapped" there by hard assignment. This is because the hard membership function prevents centers from moving if they do not own enough points. In Table 2 we show the number of true clusters found, which is the number of true clusters (out of 100) that received a center from the algorithm.

Although GEM has a soft membership function, it does poorly on this dataset due to some centers having variance that is too large and taking over several clusters. The output of GEM initialized by Random Partition appears to have more centers concentrated in the middle of the dataset, where the centers began. This is similar to the hard membership results. The FastMix implementation we used for GEM started with 100 centers and removes centers whose prior becomes too small. For this reason, it ended with 98 (Forgy) and 81 (Random Partition) centers depending on the initialization. FastMix has the ability to search for the number of centers using density estimation. We tried starting FastMix without a pre-defined number of centers, and it found 23.

6.2 Experiment 2: Pelleg and Moore data

Our second experiment shows the average performance of the algorithms compared over many randomly generated datasets in several dimensions. For each dataset X_{d,i}, where d ∈ {2, 4, 6} and 1 ≤ i ≤ 100, we compute the optimal KM partition O_{d,i} by running KM to convergence starting with the centers that generated the dataset. Then we compute the score of a clustering C_{d,i} as the ratio

   R_{d,i} = \sqrt{ \frac{KM(X_{d,i}, C_{d,i})}{KM(X_{d,i}, O_{d,i})} }    (18)

Table 4 shows the mean and standard deviation of R_{d,i} for each algorithm, computed using the 100 datasets in 2 dimensions. Table 3 shows the point-wise comparison of each algorithm for the same experiment.
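A minimal sketch of this evaluation in Python/NumPy, assuming the square-root quality metric of Section 5 and the ratio of Equation 18 (function names are our own):

```python
import numpy as np

def km_objective(X, C):
    """k-means objective (Equation 2): sum of squared distances from each
    point to its closest center."""
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # (n, k)
    return (dists.min(axis=1) ** 2).sum()

def km_quality(X, C):
    """Square root of the k-means objective, used as the common quality
    metric for all algorithms (lower is better)."""
    return np.sqrt(km_objective(X, C))

def quality_ratio(X, C, O):
    """Equation 18: quality of clustering C relative to the near-optimal
    partition O obtained by running KM from the true generating centers."""
    return km_quality(X, C) / km_quality(X, O)
```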

[Figure 5 plots convergence curves for KM, KHM (p=3.5), FKM (r=1.3), H1 (p=3.5), H2 (p=3.5), and the final GEM result; the y-axis is labeled "average performance times optimal".]

Figure 5: Experiment 2: Convergence curves starting from Forgy (top) and Random Partition (bottom) initializations on 2-d synthetic data. The x-axis shows the number of iterations, and the y-axis (log scale) shows the average clustering quality score, where lower values are better. Only the final results for GEM are shown. Note that KM and GEM perform worse than every other algorithm.

It is clear from this as well that soft membership algorithms (KHM, FKM, H2) perform better than hard membership algorithms (KM, H1) in both average performance and variance. The results for the 4 and 6 dimensional datasets are very similar, so we do not report them here.

Table 4: Experiment 2: The mean and standard deviation of R_{2,i}, the ratio between the k-means quality and the optimum, over 100 datasets in 2 dimensions. Lower values are better. Results for 4 and 6 dimensions are similar, and have the same ranking.

        Forgy               Random Partition
GEM     1.3262 +/- 0.1342   2.3653 +/- 0.4497
KM      1.1909 +/- 0.0953   2.0905 +/- 0.2616
H1      1.1473 +/- 0.0650   1.7644 +/- 0.2403
FKM     1.1281 +/- 0.0637   1.0989 +/- 0.0499
H2      1.1077 +/- 0.0536   1.0788 +/- 0.0416
KHM     1.0705 +/- 0.0310   1.0605 +/- 0.0294

Table 3: Experiment 2: Competition matrices for 2-d data starting from Forgy (first matrix) and Random Partition (second matrix) initializations. Each entry shows the number of times (maximum 100) that the algorithm in the column had a better-quality clustering than the algorithm in the row. The results for 4 and 6 dimensions are similar, so we do not report them here. In particular, KHM is better than KM in 99 or 100 out of 100 trials in each dimension tested.

Forgy:
        GEM   KM   KHM   FKM   H1    H2
GEM      -    97   100   100   98   100
KM       3     -    99    91   73    91
KHM      0     1     -    15    7    24
FKM      0     9    85     -   19    62
H1       2    27    93    81    -    90
H2       0     9    76    38   10     -
sum:     5   143   453   325  207   367

Random Partition:
        GEM   KM   KHM   FKM   H1    H2
GEM      -   100   100   100  100   100
KM       0     -   100   100  100   100
KHM      0     0     -    21    0    34
FKM      0     0    79     -    0    70
H1       0     0   100   100    -   100
H2       0     0    66    30    0     -
sum:     0   100   445   351  200   404

Figure 5 shows the speed of convergence of each algorithm for d = 2 dimensions. The x-axis shows the iteration number, and the y-axis shows the average k-means quality ratio at that iteration, computed using the 100 datasets. We can see that GEM and KM are uniformly inferior to every other algorithm, and that the soft membership algorithms KHM, H2, and FKM move quickly to find good solutions. Only the final result for GEM is plotted, as we cannot capture clustering progress before the FastMix software terminates.

FastMix has the ability to add and remove centers to better fit its data. FastMix adds a center if the model underpredicts the data and removes a center if its prior probability is too low. We expected that FastMix's ability to add centers would be helpful in a dataset in which clusters are well-separated. For experiment 2, FastMix began with 50 centers and only removed centers. FastMix converged with an average of 48.39 centers (Forgy) and 40.13 centers (Random Partition) in the 2-dimension test. This shows that GEM is also sensitive to poor initializations.

7. CONCLUSIONS

Our experiments clearly show the superiority of the k-harmonic means algorithm (KHM) for finding clusterings of high quality in low dimensions. Our algorithms H1 and H2 let us study the effects of the KHM weight and membership functions separately. They show that soft membership is essential for finding good clusterings, as H2 performs nearly as well as KHM, but that varying weights are beneficial with a hard membership function, since H1 performs better than KM. Varying weights are intuitively similar to the weights applied to training examples by boosting [7]. It remains to be seen whether this analogy can be made precise.

Previous work on initialization methods has concluded that the Random Partition method is good for GEM and for KM, but our experiments do not confirm this conclusion. The Forgy method of initialization (choosing random points as initial centers) works best for GEM, KM, and H1. Overall, our results suggest that the best algorithms available today are FKM, H2, and KHM, initialized by the Random Partition method.

Clustering in high dimensions has been an open problem for many years. However, recent research has shown that it may be preferable to use dimensionality reduction techniques before clustering, and then use a low-dimensional clustering algorithm such as k-harmonic means, rather than clustering in the high dimension directly. In [5] the author shows that a simple, inexpensive linear projection preserves many of the properties of the data (such as cluster distances), while making it easier to find the clusters. Thus there is a need for good-quality, fast clustering algorithms for low-dimensional data, such as k-harmonic means.

8. REFERENCES

[1] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[3] P. S. Bradley and U. M. Fayyad. Refining initial points for K-Means clustering. In Proc. 15th International Conf. on Machine Learning, pages 91–99. Morgan Kaufmann, San Francisco, CA, 1998.
[4] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.


[5] S. Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[6] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1999.
[7] Y. Freund and R. Schapire. A short introduction to boosting.
Japanese Society for Artificial Intelligence, 14:771–780, 1999.
[8] A. Kalton, P. Langley, K. Wagstaff, and J. Yoo. Generalized
clustering, supervised learning, and data assignment. In Proceedings
of the Seventh International Conference on Knowledge Discovery
and Data Mining, pages 299–304, San Francisco, CA, 2001. ACM
Press.
[9] M. Kearns, Y. Mansour, and A. Y. Ng. An information-theoretic
analysis of hard and soft assignment methods for clustering. In
Proceedings of Uncertainty in Artificial Intelligence, pages 282–293.
AAAI, 1997.
[10] A. Likas, N. Vlassis, and J. Verbeek. The global k-means clustering
algorithm. Technical report, Computer Science Institute, University
of Amsterdam, The Netherlands, February 2001. IAS-UVA-01-02.
[11] J. MacQueen. Some methods for classification and analysis of
multivariate observations. In L. M. LeCam and J. Neyman, editors,
Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, volume 1, pages 281–297, Berkeley, CA,
1967. University of California Press.
[12] M. Meilă and D. Heckerman. An experimental comparison of
model-based clustering methods. Machine learning, 42:9–29, 2001.
[13] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and
an algorithm. Neural Information Processing Systems, 14, 2002.
[14] D. Pelleg and A. Moore. Accelerating exact k-means algorithms with
geometric reasoning. In Proceedings of the Fifth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
pages 277–281. AAAI Press, 1999.
[15] D. Pelleg and A. Moore. X-means: Extending K-means with efficient
estimation of the number of clusters. In Proceedings of the 17th
International Conf. on Machine Learning, pages 727–734. Morgan
Kaufmann, San Francisco, CA, 2000.
[16] J. Peña, J. Lozano, and P. Larrañaga. An empirical comparison of
four initialization methods for the k-means algorithm. Pattern
recognition letters, 20:1027–1040, 1999.
[17] P. Sand and A. Moore. Repairing faulty mixture models using density
estimation. In Proceedings of the 18th International Conf. on
Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001.
[18] P. Sand and A. Moore. Fastmix clustering software, 2002.
http://www.cs.cmu.edu/~psand/.
[19] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
22(8):888–905, 2000.
[20] L. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.
[21] B. Zhang. Generalized k-harmonic means – boosting in unsupervised learning. Technical Report HPL-2000-137, Hewlett-Packard Labs, 2000.
[22] B. Zhang, M. Hsu, and U. Dayal. K-harmonic means – a data clustering algorithm. Technical Report HPL-1999-124, Hewlett-Packard Labs, 1999.
[23] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, volume 1, pages 141–182, 1997.
[Figure 4 panels, left column Forgy and right column Random Partition, show the final clusterings after 100 iterations. Panel titles report the k-means quality: KM 163.1047 / 338.435; KHM (p=3.5) 105.1674 / 99.98; FKM (r=1.3, P=2) 134.8514 / 109.022; H1 (p=3.5) 147.8599 / 232.34; H2 (p=3.5) 113.8546 / 98.168.]

Figure 4: Experiment 1: Convergence on the BIRCH dataset. From top to bottom: GEM, KM, KHM, FKM, H1, H2. GEM and KM both converge to very low-quality optima, while KHM does not. The FastMix software generated the plots for GEM, showing the 1-sigma contours.
