


Journal of Computer Science 6 (3): 363-368, 2010
ISSN 1549-3636
© 2010 Science Publications

Computational Complexity between K-Means and K-Medoids Clustering Algorithms
for Normal and Uniform Distributions of Data Points

T. Velmurugan and T. Santhanam


Department of Computer Science, DG Vaishnav College, Chennai, India

Abstract: Problem statement: Clustering is one of the most important research areas in the field of
data mining. Clustering means creating groups of objects based on their features in such a way that the
objects belonging to the same group are similar and those belonging to different groups are dissimilar.
Clustering is an unsupervised learning technique. The main advantage of clustering is that interesting
patterns and structures can be found directly from very large data sets with little or no background
knowledge. Clustering algorithms can be applied in many domains. Approach: In this research, the
representative algorithms K-Means and K-Medoids were examined and analyzed based on their basic
approach, and the better algorithm was identified based on performance. The input data points were
generated in two ways, one using a normal distribution and the other using a uniform distribution.
Results: The randomly distributed data points were taken as input to these algorithms and clusters
were found for each algorithm. The algorithms were implemented in the Java language and the
performance was analyzed based on their clustering quality. The execution time of the algorithms was
compared for different runs, and the accuracy was investigated over different executions of the
program on the input data points. Conclusion: The average time taken by the K-Means algorithm is
greater than the time taken by the K-Medoids algorithm for both normal and uniform distributions.
The results proved to be satisfactory.

Key words: K-Means clustering, K-Medoids clustering, data clustering, cluster analysis

INTRODUCTION

Clustering can be considered the most important unsupervised learning problem; like every other
problem of this kind, it deals with finding a structure in a collection of unlabeled data (Jain and
Dubes, 1988; Jain et al., 1999). A loose definition of clustering could be "the process of organizing
objects into groups whose members are similar in some way". A cluster is therefore a collection of
objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Unlike classification, in which objects are assigned to predefined classes, clustering does not have
any predefined classes. The main advantage of clustering is that interesting patterns and structures
can be found directly from very large data sets with little or no background knowledge. The cluster
results are subjective and implementation dependent. The quality of a clustering method depends on:

• The similarity measure used by the method and its implementation
• Its ability to discover some or all of the hidden patterns
• The definition and representation of the cluster chosen

A number of algorithms for clustering have been proposed by researchers, of which this study presents
a comparative study of the K-Means and K-Medoids clustering algorithms (Berkhin, 2002; Dunham, 2002;
Han and Kamber, 2006; Xiong et al., 2009; Park et al., 2006; Khan and Ahmad, 2004; Borah and Ghose,
2009; Rakhlin and Caponnetto, 2007).

MATERIALS AND METHODS

Problem specification: In this research, two unsupervised clustering methods, namely K-Means and
K-Medoids, are examined and analyzed based on the distance between the various input data points.
The clusters are formed according to the distance between data points, and cluster centers are
computed for each cluster. The implementation plan is in two parts, one for a normal distribution
and another for a uniform distribution of input data points. Both the algorithms
Corresponding Author: T. Santhanam, Department of Computer Science, DG Vaishnav College, Chennai, India
Tel: +91-9444169090
are implemented using the Java language. The number of clusters is chosen by the user. The data
points in each cluster are displayed in different colors and the execution time is calculated in
milliseconds.

K-Means algorithm: K-Means is one of the simplest unsupervised learning algorithms that solves the
well-known clustering problem. The procedure follows a simple and easy way to classify a given data
set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define
k centroids, one for each cluster. These centroids should be placed carefully, because different
locations cause different results; the better choice is to place them as far away from each other as
possible. The next step is to take each point belonging to the data set and associate it with the
nearest centroid. When no point is pending, the first step is completed and an early grouping is
done. At this point, the k new centroids must be re-calculated as the centers of the clusters
resulting from the previous step. After these k new centroids are obtained, a new binding is done
between the same data points and the nearest new centroid, generating a loop. As a result of this
loop, the k centroids change their location step by step until no more changes occur; in other
words, the centroids no longer move. Finally, this algorithm aims at minimizing an objective
function, in this case a squared error function. The objective function:

    J = Σ_{j=1}^{k} Σ_{i=1}^{n} ‖x_i^(j) − c_j‖²

where ‖x_i^(j) − c_j‖² is a chosen distance measure between a data point x_i^(j) and the cluster
center c_j; J is an indicator of the distance of the n data points from their respective cluster
centers. The algorithm is composed of the following steps:

1. Place k points into the space represented by the objects that are being clustered. These points
   represent the initial group centroids
2. Assign each object to the group that has the closest centroid
3. When all objects have been assigned, recalculate the positions of the k centroids
4. Repeat Steps 2 and 3 until the centroids no longer move

This produces a separation of the objects into groups from which the metric to be minimized can be
calculated. Although it can be proved that the procedure will always terminate, the K-Means
algorithm does not necessarily find the optimal configuration corresponding to the global minimum
of the objective function. The algorithm is also significantly sensitive to the initial randomly
selected cluster centers; it can be run multiple times to reduce this effect. K-Means is a simple
algorithm that has been adapted to many problem domains. First, the algorithm randomly selects k of
the objects. Each selected object represents a single cluster and, because in this case only one
object is in the cluster, this object represents the mean or center of the cluster.

K-Medoids algorithm: The K-Means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of data. How might the algorithm be
modified to diminish such sensitivity? Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
Thus the partitioning method can still be performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. This forms the basis
of the K-Medoids method. The basic strategy of K-Medoids clustering algorithms is to find k clusters
in n objects by first arbitrarily finding a representative object (the medoid) for each cluster.
Each remaining object is clustered with the medoid to which it is most similar. The K-Medoids method
uses representative objects as reference points instead of taking the mean value of the objects in
each cluster. The algorithm takes the input parameter k, the number of clusters to be partitioned
among a set of n objects. A typical K-Medoids algorithm for partitioning based on medoids, or
central objects, is as follows:

Input:
    k: The number of clusters
    D: A data set containing n objects
Output: A set of k clusters that minimizes the sum of the dissimilarities of all the objects to
their nearest medoid.
Method:
    Arbitrarily choose k objects in D as the initial representative objects;
    Repeat:
        Assign each remaining object to the cluster with the nearest medoid;
        Randomly select a non-medoid object O_random;
        Compute the total cost S of swapping medoid O_j with O_random;
        If S < 0 then swap O_j with O_random to form the new set of k medoids;
    Until no change;
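The K-Means procedure described above (Steps 1-4, minimizing J) can be sketched in Java, the language the authors used for their implementation. This is our own illustrative version, not the paper's code: the class and method names are assumptions, and for reproducibility it seeds the centroids with the first k points, whereas the paper's program uses random starts.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Squared Euclidean distance, the per-term quantity in J.
    static double sqDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Lloyd-style K-Means on 2-D points; returns the final centroids.
    static double[][] kMeans(double[][] points, int k, int maxIter) {
        double[][] centroids = new double[k][];
        // Step 1: place k initial centroids (first k points, for a
        // deterministic sketch; the paper's program picks random starts).
        for (int j = 0; j < k; j++) centroids[j] = points[j].clone();
        int[] assign = new int[points.length];
        Arrays.fill(assign, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Step 2: assign each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double d = sqDist(points[i], centroids[j]);
                    if (d < bestDist) { bestDist = d; best = j; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Step 3: recompute each centroid as the mean of its cluster.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assign[i]][0] += points[i][0];
                sums[assign[i]][1] += points[i][1];
                counts[assign[i]]++;
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] > 0) {
                    centroids[j][0] = sums[j][0] / counts[j];
                    centroids[j][1] = sums[j][1] / counts[j];
                }
            }
            // Step 4: stop when no assignment changed.
            if (!changed) break;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {5, 5}, {1.2, 0.8}, {5.1, 4.9} };
        double[][] c = kMeans(pts, 2, 100);
        System.out.println(Arrays.deepToString(c)); // ≈ [[1.1, 0.9], [5.05, 4.95]]
    }
}
```

A K-Medoids variant replaces the mean update of Step 3 with the medoid-swap test shown in the pseudocode above.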
Like this algorithm, Partitioning Around Medoids (PAM) was one of the first K-Medoids algorithms
introduced. It attempts to determine k partitions for n objects. After an initial random selection
of k medoids, the algorithm repeatedly tries to make a better choice of medoids.

RESULTS

In this study, the K-Means algorithm is explained with an example first, followed by the K-Medoids
algorithm. Two types of experimental results are discussed for the K-Means algorithm: one for a
normal distribution and another for a uniform distribution of input data points. For both types of
distributions, the data points are created using the Box-Muller formula. The resulting clusters for
the normal distribution under the K-Means algorithm are presented in Fig. 1. The number of clusters
and data points is given by the user during the execution of the program. The number of data points
is 1000 and the number of clusters given by the user is 10 (k = 10). The algorithm is repeated 1000
times to get efficient output. The cluster centers (centroids) are calculated for each cluster by
its mean value, and clusters are formed depending upon the distance between data points.

For different input data points, the algorithm gives different types of outputs. The input data
points are generated in red color and the output of the algorithm is displayed in different colors,
as shown in Fig. 1. The center point of each cluster is displayed in green color. The execution
time of each run is calculated in milliseconds. The time taken for execution of the algorithm
varies from one run to another and also differs from one computer to another. The result shows the
normal distribution of 1000 data points around ten cluster centers. The number of data points is
the size of the cluster. The size of the cluster is high for some clusters due to an increase in
the number of data points around a particular cluster center. For example, the size of cluster 10
is 139, because the distance between the data points of that particular cluster and the cluster
center is very small. Since the data points are normally distributed, clusters vary in size, with
the maximum data points in cluster 10 and the minimum in cluster 8. For the same normal
distribution of the input data points and for the uniform distribution, the algorithm is executed
10 times and the results are tabulated in Table 1.

From Table 1 (against the rows indicated by 'N'), it is observed that the execution time for Run 6
is 7640 msec and for Run 8 is 7297 msec. In a normal distribution of the data points, there is much
difference between the sizes of the clusters. For example, in Run 7, cluster 2 has 167 points and
cluster 5 has only 56 points. Next, the uniformly distributed one thousand data points are taken as
input. One of the executions is shown in Fig. 2. The number of clusters chosen by the user is 10.
The result of the algorithm for 10 different executions of the program is given in Table 1 (against
the rows indicated by 'U').
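The paper states only that the data points are created using the Box-Muller formula; a possible Java sketch of such a generator is below. The class and method names are ours, and the cluster-center and range parameters are illustrative assumptions, not values from the paper.

```java
import java.util.Random;

// Sketch of generating 2-D input points for the experiments (our
// reconstruction; the paper only names the Box-Muller formula).
public class PointGenerator {
    private final Random rnd = new Random();

    // Box-Muller transform: two independent uniform samples in (0, 1)
    // yield one standard-normal sample (the sine branch is discarded
    // here for simplicity).
    double nextGaussian() {
        double u1 = rnd.nextDouble();
        while (u1 == 0.0) u1 = rnd.nextDouble(); // guard against log(0)
        double u2 = rnd.nextDouble();
        return Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
    }

    // Normally distributed point scattered around a given cluster center.
    double[] normalPoint(double cx, double cy, double sigma) {
        return new double[] { cx + sigma * nextGaussian(),
                              cy + sigma * nextGaussian() };
    }

    // Uniformly distributed point in [0, width) x [0, height).
    double[] uniformPoint(double width, double height) {
        return new double[] { rnd.nextDouble() * width,
                              rnd.nextDouble() * height };
    }
}
```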

Fig. 1: Normal distribution output-K-Means


365
J. Computer Sci., 6 (3): 363-368, 2010

Table 1: Cluster results for K-Means algorithm


Cluster 1 2 3 4 5 6 7 8 9 10 Time (m sec)
Run 1 N 62 116 106 108 117 96 109 53 94 139 7609
U 118 87 101 109 103 91 104 100 94 93 7281
Run 2 N 149 108 60 123 90 132 99 74 81 84 7500
U 101 98 111 93 96 99 79 125 106 92 7500
Run 3 N 79 108 134 85 89 95 78 72 125 135 7578
U 103 86 98 87 77 109 108 91 121 120 7469
Run 4 N 54 102 88 83 86 131 113 129 101 113 7391
U 84 117 90 98 106 97 85 111 110 102 7375
Run 5 N 78 55 143 81 79 101 157 91 123 92 7360
U 103 115 83 120 105 100 77 103 109 85 7297
Run 6 N 96 100 119 132 119 108 66 75 52 133 7640
U 94 96 89 120 119 92 98 78 113 101 7437
Run 7 N 106 167 98 114 56 60 63 129 81 126 7515
U 89 84 116 100 108 89 107 92 99 116 7422
Run 8 N 74 64 130 114 88 103 114 83 137 93 7297
U 99 88 103 93 140 108 75 99 105 90 7469
Run 9 N 100 105 70 135 90 157 68 93 113 69 7453
U 95 103 100 84 102 104 105 95 112 100 7391
Run 10 N 58 42 151 126 121 97 74 101 71 159 7422
U 112 79 80 114 79 102 90 135 99 110 7407
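The Time column of Table 1 is in milliseconds; one common way to obtain such timings in Java is shown below. This is a generic sketch with hypothetical names, not the paper's measurement code, and the placeholder loop merely stands in for one run of a clustering algorithm.

```java
// Generic timing sketch: wall-clock time of one run in milliseconds.
public class TimingSketch {
    static long timeMillis(Runnable run) {
        long start = System.currentTimeMillis();
        run.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long ms = timeMillis(() -> {
            // place one execution of the clustering algorithm here
            for (int i = 0; i < 1_000_000; i++) { Math.sqrt(i); }
        });
        System.out.println("Execution time: " + ms + " msec");
    }
}
```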

Fig. 2: Uniform distribution output-K-Means

From the results, it can be inferred that the sizes of the clusters are near the average number of
data points in the uniform distribution. It can easily be observed that for 1000 points and 10
clusters the average is 100. But this is not true in the case of the normal distribution. It is
easy to identify from Fig. 1 and 2 that the sizes of the clusters are different. This shows that
the behavior of the algorithm is different for the two types of distributions.

Next, for the K-Medoids algorithm, the input data points and the experimental results are
considered by the same method as discussed for the K-Means algorithm. In the next example, one
thousand normally distributed data points are taken as input. The number of clusters chosen by the
user is 10. The output of one of the trials is shown in Fig. 3. Next, one thousand uniformly
distributed data points are taken as input, again with 10 clusters chosen by the user. One of the
executions is shown in Fig. 4. The result of the algorithm for ten different executions of the
program is given in Table 2. For both algorithms, K-Means and K-Medoids, the program is executed
many times and the results are analyzed based on the number of data points and the number of
clusters. The behavior of the algorithms is analyzed based on these observations.
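The K-Medoids pseudocode given in MATERIALS AND METHODS can be sketched in Java as follows. This is our illustrative version, with names that are assumptions: the pseudocode selects the swap candidate at random, while this sketch tries all non-medoid candidates deterministically and keeps any swap whose cost change S is negative.

```java
import java.util.Arrays;

public class KMedoidsSketch {

    // Euclidean distance between two 2-D points.
    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Total dissimilarity: sum of distances from each object to its
    // nearest medoid (the quantity the pseudocode minimizes).
    static double cost(double[][] points, int[] medoids) {
        double total = 0;
        for (double[] p : points) {
            double best = Double.MAX_VALUE;
            for (int m : medoids) best = Math.min(best, dist(p, points[m]));
            total += best;
        }
        return total;
    }

    static boolean contains(int[] arr, int v) {
        for (int x : arr) if (x == v) return true;
        return false;
    }

    // PAM-style loop: tentatively swap each medoid with each non-medoid
    // object and keep the swap when it lowers the total cost (S < 0).
    static int[] kMedoids(double[][] points, int k) {
        int[] medoids = new int[k];
        for (int j = 0; j < k; j++) medoids[j] = j; // arbitrary initial medoids
        double current = cost(points, medoids);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int j = 0; j < k; j++) {
                for (int o = 0; o < points.length; o++) {
                    if (contains(medoids, o)) continue;
                    int old = medoids[j];
                    medoids[j] = o;                    // tentative swap
                    double trial = cost(points, medoids);
                    if (trial < current) {             // S = trial - current < 0
                        current = trial;
                        improved = true;
                    } else {
                        medoids[j] = old;              // undo the swap
                    }
                }
            }
        }
        return medoids;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0.1, 0}, {0, 0.1}, {5, 5}, {5.1, 5}, {5, 5.1} };
        int[] m = kMedoids(pts, 2);
        Arrays.sort(m);
        System.out.println(Arrays.toString(m)); // indices of the two medoids
    }
}
```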
Fig. 3: Normal distribution output-K-Medoids

Fig. 4: Uniform distribution output-K-Medoids


Table 2: Cluster results for K-Medoids algorithm
Cluster 1 2 3 4 5 6 7 8 9 10 Time (m sec)
Run 1 N 73 128 103 88 85 125 89 92 111 106 7641
U 92 99 109 104 100 115 98 107 92 84 7312
Run 2 N 73 71 138 186 86 101 124 83 64 76 7578
U 119 73 80 99 93 89 141 94 105 107 7484
Run 3 N 83 53 149 75 110 123 61 93 130 123 7422
U 98 91 109 107 89 89 118 114 93 92 7375
Run 4 N 66 80 80 65 95 170 81 97 148 118 7360
U 111 90 95 89 96 117 103 112 186 81 7219
Run 5 N 78 91 73 165 82 104 154 74 60 119 7469
U 115 88 101 114 96 97 111 96 87 95 7250
Run 6 N 95 142 76 80 79 147 131 73 89 88 7390
U 116 118 110 100 99 83 79 97 99 99 7531
Run 7 N 99 81 137 98 153 63 76 73 82 138 7484
U 100 104 105 97 91 82 111 98 96 116 7375
Run 8 N 129 83 120 111 112 95 68 123 80 79 7282
U 104 112 97 80 70 82 136 95 131 93 7172
Run 9 N 98 84 54 116 116 60 110 60 156 146 7485
U 100 64 97 94 88 112 101 117 117 110 7141
Run 10 N 81 139 81 83 150 133 83 94 85 71 7312
U 102 115 113 77 68 86 91 143 105 100 7219
The performance of the algorithms has also been analyzed over several executions by considering
different numbers of data points (for which the results are not shown) as input (250 data points,
500 data points, etc.); the outcomes are found to be highly satisfactory.

DISCUSSION

From the experimental results, for the K-Means algorithm, the average time for clustering the
normal distribution of data points is found to be 7476.5 msec and the average time for clustering
the uniform distribution of data points is found to be 7404.8 msec. For the K-Medoids algorithm,
the average time for clustering the normal distribution of data points is found to be 7442.3 msec
and the average time for clustering the uniform distribution of data points is 7307.8 msec. Thus
the average time for the normal distribution is greater than the average time for the uniform
distribution; this is true for both the K-Means and K-Medoids algorithms. If the number of data
points is small, the K-Means algorithm takes less execution time. But when the data points are
increased to the maximum, the K-Means algorithm takes maximum time and the K-Medoids algorithm
performs reasonably better than the K-Means algorithm. A characteristic feature of the K-Medoids
algorithm is that it requires the distance between every pair of objects only once and uses this
distance at every stage of iteration.

CONCLUSION

The time taken for one execution of the program for the uniform distribution is less than the time
taken for the normal distribution. Usually the time complexity varies from one processor to
another, depending on the speed and the type of the system. The partition-based algorithms work
well for finding spherical-shaped clusters in small to medium-sized data sets. The advantage of the
K-Means algorithm is its favorable execution time. Its drawback is that the user has to know in
advance how many clusters are searched for. From the experimental results (over many executions of
the programs), it is observed that the K-Means algorithm is efficient for smaller data sets and the
K-Medoids algorithm seems to perform better for large data sets.

REFERENCES

Berkhin, P., 2002. Survey of clustering data mining techniques. Technical Report, Accrue Software,
    Inc. http://www.ee.ucr.edu/~barth/EE242/clustering_survey.pdf
Borah, S. and M.K. Ghose, 2009. Performance analysis of AIM-K-Means and K-Means in quality cluster
    generation. J. Comput., 1: 175-178. http://arxiv.org/ftp/arxiv/papers/0912/0912.3983.pdf
Dunham, M., 2002. Data Mining: Introductory and Advanced Topics. 1st Edn., Prentice Hall, USA,
    ISBN-10: 0130888923, pp: 315.
Han, J. and M. Kamber, 2006. Data Mining: Concepts and Techniques. 2nd Edn., Morgan Kaufmann
    Publishers, New Delhi, ISBN: 978-81-312-0535-8.
Jain, A.K. and R.C. Dubes, 1988. Algorithms for Clustering Data. Prentice Hall Inc., Englewood
    Cliffs, New Jersey, ISBN: 0-13-022278-X, pp: 320.
Jain, A.K., M.N. Murty and P.J. Flynn, 1999. Data clustering: A review. ACM Comput. Surveys,
    31: 264-323. DOI: 10.1145/331499.331504
Khan, S.S. and A. Ahmad, 2004. Cluster center initialization algorithm for K-Means clustering.
    Patt. Recog. Lett., 25: 1293-1302. DOI: 10.1016/j.patrec.2004.04.007
Park, H.S., J.S. Lee and C.H. Jun, 2006. A K-means-like algorithm for K-medoids clustering and its
    performance. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.7981&rep=rep1&type=pdf
Rakhlin, A. and A. Caponnetto, 2007. Stability of K-Means clustering. Adv. Neural Inform. Process.
    Syst., 12: 216-222. http://cbcl.mit.edu/projects/cbcl/publications/ps/rakhlin-stability-clustering.pdf
Xiong, H., J. Wu and J. Chen, 2009. K-Means clustering versus validation measures: A data
    distribution perspective. IEEE Trans. Syst., Man, Cybernet. Part B, 39: 318-331.
    http://www.ncbi.nlm.nih.gov/pubmed/19095536

