Velmurugan Thambusamy
Dwaraka Doss Goverdhan Doss Vaishnav College
All content following this page was uploaded by Velmurugan Thambusamy on 02 June 2014.
Abstract: Problem statement: Clustering is one of the most important research areas in the field of data mining. Clustering means creating groups of objects based on their features in such a way that the objects belonging to the same group are similar and those belonging to different groups are dissimilar. Clustering is an unsupervised learning technique. The main advantage of clustering is that interesting patterns and structures can be found directly from very large data sets with little or no background knowledge. Clustering algorithms can be applied in many domains. Approach: In this research, the most representative algorithms, K-Means and K-Medoids, were examined and analyzed based on their basic approach. The best algorithm in each category was identified based on its performance. The input data points were generated in two ways, one by using a normal distribution and the other by applying a uniform distribution. Results: The randomly distributed data points were taken as input to these algorithms and clusters were found for each algorithm. The algorithms were implemented using the JAVA language and their performance was analyzed based on clustering quality. The execution time of the algorithms in each category was compared for different runs, and the accuracy of each algorithm was investigated across different executions of the program on the input data points. Conclusion: The average time taken by the K-Means algorithm is greater than the time taken by the K-Medoids algorithm for both normal and uniform distributions. The results proved to be satisfactory.
Key words: K-Means clustering, K-Medoids clustering, data clustering, cluster analysis
are implemented by using the JAVA language. The number of clusters is chosen by the user. The data points in each cluster are displayed in different colors and the execution time is calculated in milliseconds.

K-Means algorithm: K-Means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results. So the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point, it is necessary to re-calculate k new centroids as centers of the clusters resulting from the previous step. After these k new centroids are obtained, a new binding has to be done between the same data points and the nearest new centroid. A loop has been generated. As a result of this loop, it may be noticed that the k centroids change their location step by step until no more changes are made; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is:

   J = Σ_{j=1}^{k} Σ_{i=1}^{n} ‖x_i^(j) − c_j‖²

where ‖x_i^(j) − c_j‖² is a chosen distance measure between a data point x_i^(j) and the cluster centre c_j; J is an indicator of the distance of the n data points from their respective cluster centers. The algorithm is composed of the following steps:

1. Place k points into the space represented by the objects that are being clustered. These points represent initial group centroids
2. Assign each object to the group that has the closest centroid
3. When all objects have been assigned, recalculate the positions of the k centroids
4. Repeat Steps 2 and 3 until the centroids no longer move

This produces a separation of the objects into groups from which the metric to be minimized can be calculated. Although it can be proved that the procedure will always terminate, the K-Means algorithm does not necessarily find the most optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centers; the K-Means algorithm can be run multiple times to reduce this effect. K-Means is a simple algorithm that has been adapted to many problem domains. First, the algorithm randomly selects k of the objects. Each selected object represents a single cluster and, because in this case only one object is in the cluster, this object represents the mean or center of the cluster.

K-Medoids algorithm: The K-Means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data. How might the algorithm be modified to diminish such sensitivity? Instead of taking the mean value of the objects in a cluster as a reference point, a Medoid can be used, which is the most centrally located object in a cluster. Thus the partitioning method can still be performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. This forms the basis of the K-Medoids method. The basic strategy of K-Medoids clustering algorithms is to find k clusters in n objects by first arbitrarily finding a representative object (the Medoid) for each cluster. Each remaining object is clustered with the Medoid to which it is the most similar. The K-Medoids method uses representative objects as reference points instead of taking the mean value of the objects in each cluster. The algorithm takes the input parameter k, the number of clusters to be partitioned among a set of n objects. A typical K-Medoids algorithm for partitioning based on Medoids or central objects is as follows:

Input:
   k: the number of clusters
   D: a data set containing n objects
Output: a set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
Method:
   Arbitrarily choose k objects in D as the initial representative objects;
   Repeat:
      Assign each remaining object to the cluster with the nearest medoid;
      Randomly select a non-medoid object Orandom;
      Compute the total cost S of swapping a medoid Oj with Orandom;
      If S < 0, then swap Oj with Orandom to form the new set of k medoids;
   Until no change;
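The K-Means steps above can be sketched in JAVA, the language in which the study's algorithms were implemented. The class below is a minimal 2-D illustration of plain K-Means, not the authors' program; the class name, method names and the convergence tolerance are our own choices.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal K-Means on 2-D points: a sketch of the four steps listed above.
public class KMeans {

    // Runs K-Means; returns the final centroids. assignments[i] holds the
    // cluster index of points[i] when the loop terminates.
    static double[][] cluster(double[][] points, int k, int[] assignments, Random rnd) {
        int n = points.length;
        double[][] centroids = new double[k][];
        // Step 1: place k initial centroids on randomly chosen data points
        for (int j = 0; j < k; j++) {
            centroids[j] = points[rnd.nextInt(n)].clone();
        }
        boolean moved = true;
        while (moved) {
            // Step 2: assign each point to the group with the closest centroid
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double d = sqDist(points[i], centroids[j]);
                    if (d < bestDist) { bestDist = d; best = j; }
                }
                assignments[i] = best;
            }
            // Step 3: recalculate each centroid as the mean of its cluster
            moved = false;
            for (int j = 0; j < k; j++) {
                double sx = 0, sy = 0;
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assignments[i] == j) { sx += points[i][0]; sy += points[i][1]; count++; }
                }
                if (count == 0) continue;                 // empty cluster: keep old centroid
                double[] mean = { sx / count, sy / count };
                // Step 4: repeat until no centroid moves any more
                if (sqDist(mean, centroids[j]) > 1e-12) moved = true;
                centroids[j] = mean;
            }
        }
        return centroids;
    }

    // Squared Euclidean distance: the per-point term of the objective J
    static double sqDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        // Two well separated groups, so k = 2 recovers the obvious clustering
        double[][] pts = { {0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10} };
        int[] labels = new int[pts.length];
        double[][] c = cluster(pts, 2, labels, new Random(42));
        System.out.println(Arrays.deepToString(c));
        System.out.println(Arrays.toString(labels));
    }
}
```

As the text notes, the result depends on the randomly selected initial centers, which is why the seed is passed in explicitly here; running with several seeds and keeping the lowest-J result is the multiple-run remedy the paragraph above describes.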
J. Computer Sci., 6 (3): 363-368, 2010
Like this algorithm, Partitioning Around Medoids (PAM) was one of the first k-Medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids.

RESULTS

In this study, the K-Means algorithm is explained with an example first, followed by the K-Medoids algorithm. Two types of experimental results are discussed for the K-Means algorithm: one for a Normal and another for a Uniform distribution of the input data points. For both types of distribution, the data points are created by using the Box-Muller formula. The resulting clusters of the normal distribution for the K-Means algorithm are presented in Fig. 1. The number of clusters and data points is given by the user during the execution of the program. The number of data points is 1000 and the number of clusters given by the user is 10 (k = 10). The algorithm is repeated 1000 times to get efficient output. The cluster centers (centroids) are calculated for each cluster by its mean value and clusters are formed depending upon the distance between data points.

For different input data points, the algorithm gives different types of outputs. The input data points are generated in red color and the output of the algorithm is displayed in different colors, as shown in Fig. 1. The center point of each cluster is displayed in green color. From the result, it can be inferred that the sizes of the clusters are near the average number of data points per cluster in the uniform distribution; it can easily be observed that for 1000 points and 10 clusters, the average is 100. But this is not true in the case of the normal distribution. It is easy to identify from Fig. 1 and 2 that the sizes of the clusters are different. This shows that the behavior of the algorithm is different for the two types of distributions. Next, for the K-Medoids algorithm, the input data points and the experimental results are considered by the same method as discussed for the K-Means algorithm.

The execution time of each run is calculated in milliseconds. The time taken for execution of the algorithm varies from one run to another and also differs from one computer to another. The result shows the normal distribution of 1000 data points around ten cluster centers. The number of data points is the size of the cluster. The size of the cluster is high for some clusters due to an increase in the number of data points around a particular cluster center. For example, the size of cluster 10 is 139, because the distance between the data points of that particular cluster and the cluster center is very small. Since the data points are normally distributed, clusters vary in size, with the maximum number of data points in cluster 10 and the minimum in cluster 8. For the same normal distribution of the input data points and for the uniform distribution, the algorithm is executed 10 times and the results are tabulated in Table 1.

From Table 1 (against the rows indicated by 'N'), it is observed that the execution time for Run 6 is 7640 msec and for Run 8 is 7297 msec. In a normal distribution of the data points, there is much difference between the sizes of the clusters. For example, in Run 7, cluster 2 has 167 points and cluster 5 has only 56 points. Next, the uniformly distributed one thousand data points are taken as input. One of the executions is shown in Fig. 2. The number of clusters chosen by the user is 10. The result of the algorithm for 10 different executions of the program is given in Table 1 (against the rows indicated by 'U').

In the next example, one thousand normally distributed data points are taken as input. The number of clusters chosen by the user is 10. The output of one of the trials is shown in Fig. 3. Next, the uniformly distributed one thousand data points are taken as input. The number of clusters chosen by the user is 10. One of the executions is shown in Fig. 4. The result of the algorithm for ten different executions of the program is given in Table 2. For both the algorithms, K-Means and K-Medoids, the program is executed many times and the results are analyzed based on the number of data points and the number of clusters. The behavior of the algorithms is analyzed based on observations.
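The experiments above generate their input points with the Box-Muller formula and report per-run times in milliseconds, and the K-Medoids method of the previous section keeps a swap only when it lowers the total dissimilarity. The sketch below illustrates both pieces in JAVA; the class name, cluster centers and spread are illustrative assumptions, not the paper's exact values.

```java
import java.util.Random;

// Sketch of the experimental setup: normally distributed 2-D points produced
// with the Box-Muller transform, plus the K-Medoids cost (total dissimilarity
// to the nearest medoid) that a swap must reduce to be accepted.
public class ClusterData {

    // Box-Muller: turns two uniform samples into one standard-normal sample
    static double gaussian(Random rnd) {
        double u1 = 1.0 - rnd.nextDouble();   // in (0,1], keeps log() finite
        double u2 = rnd.nextDouble();
        return Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
    }

    // n points scattered normally (standard deviation = spread) around the
    // given centers, assigned to the centers round-robin
    static double[][] generate(int n, double[][] centers, double spread, Random rnd) {
        double[][] pts = new double[n][2];
        for (int i = 0; i < n; i++) {
            double[] c = centers[i % centers.length];
            pts[i][0] = c[0] + spread * gaussian(rnd);
            pts[i][1] = c[1] + spread * gaussian(rnd);
        }
        return pts;
    }

    // Sum of dissimilarities of all points to their nearest medoid: the
    // quantity the Method section minimizes when deciding on a swap
    static double totalCost(double[][] pts, double[][] medoids) {
        double total = 0;
        for (double[] p : pts) {
            double best = Double.MAX_VALUE;
            for (double[] m : medoids) {
                double d = Math.hypot(p[0] - m[0], p[1] - m[1]);
                if (d < best) best = d;
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] centers = { {0, 0}, {50, 0}, {0, 50}, {50, 50} };
        long t0 = System.currentTimeMillis();
        double[][] pts = generate(1000, centers, 5.0, new Random(1));
        double cost = totalCost(pts, centers);
        // Execution time in milliseconds, as in the paper's measurements
        System.out.println("cost at true centers: " + cost);
        System.out.println((System.currentTimeMillis() - t0) + " msec");
    }
}
```

A candidate swap of a medoid Oj for a non-medoid Orandom would simply be evaluated by comparing totalCost before and after the exchange, keeping the exchange when the cost decreases.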