


International Journal of Engineering & Technology, 7 (4) (2018) 4766-4768

International Journal of Engineering & Technology

Website: www.sciencepubco.com/index.php/IJET
doi: 10.14419/ijet.v7i4.21472
Research paper

A Comparison of K-Means Clustering Algorithm and CLARA Clustering Algorithm on Iris Dataset

Tanvi Gupta 1*, Supriya P. Panda 2

1 Research Scholar, Manav Rachna International Institute of Research & Studies
2 Professor, Manav Rachna International Institute of Research & Studies
*Corresponding author E-mail: Tanvigupta.fet@mriu.edu.in

Abstract

K-Means clustering is a clustering technique used to partition the observations into a number of clusters. Here each cluster's center point is the 'mean' of that cluster, and the other points in the cluster are the observations nearest to that mean value. In Clustering Large Applications (CLARA), by contrast, medoids are used as the cluster center points, and the rest of the observations in a cluster are those nearest to that center point. CLARA works on larger datasets than the K-Medoids and K-Means clustering algorithms, since it selects random observations from the dataset and performs the Partitioning Around Medoids (PAM) algorithm on them. This paper shows that, of the two algorithms, K-Means and CLARA, CLARA clustering gives the better result.

Keywords: K-Means Clustering; CLARA Clustering; K-Medoids Clustering; PAM Algorithm; Iris Dataset.

1. Introduction

Clustering is a technique for partitioning data according to its characteristics: data points that are similar in nature end up in one cluster. Clustering is an unsupervised learning technique and is mainly used for data analysis and data mining.

Fig. 1: Clustering Concept.

Fig. 1 depicts the concept of clustering: a dataset, when passed through a clustering algorithm, yields a number of clusters as its final result.
There are four categories of clustering algorithms:
1) Partition based algorithms
2) Hierarchical based algorithms
3) Density based algorithms
4) Grid based algorithms
These categories vary in:
i) the procedure used for partitioning the dataset,
ii) the use of threshold values in constructing the clusters, and
iii) the manner of clustering.

1.1. Partition Based Algorithm

A partition based algorithm partitions the dataset into a number of clusters according to a center point, which can be a mean, a medoid, a mode, etc. Its drawback is that whenever a point is close to the center of another cluster, it gives a poor outcome due to the overlapping of data points [1]. It also uses a number of greedy heuristic schemes of iterative optimization. Some of the partition based algorithms are the K-Means algorithm, the K-Medoids algorithm, the Partitioning Around Medoids (PAM) algorithm, and Clustering Large Applications (CLARA).

1.1.1. K-Means Clustering

K-Means clustering is an unsupervised learning algorithm. The basic idea of the algorithm is to define k clusters using k centers, one for each cluster. Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of centers. The steps of the algorithm are as follows:
1) Randomly select 'c' cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from it is the minimum over all cluster centers.
4) Recalculate each new cluster center as the mean of its assigned points:

v_i = (1/c_i) * sum_{j=1..c_i} x_j,

where c_i represents the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3).


Advantages:
1) It is easy to understand, fast, and robust,
2) It works on numeric values,
3) The clusters have convex shapes [2], and
4) It gives the best result when the clusters in the dataset are distinct or well separated from each other [3].
Disadvantages:
1) The algorithm needs prior knowledge of the number of clusters to be formed.
2) If data points overlap, K-Means cannot separate them, as the clusters are formed in the same space.
3) The algorithm uses Euclidean distance, whose measures can unequally weight underlying factors.
4) It is not able to handle outliers and noisy data.
5) It is not invariant to non-linear transformations, i.e., with different representations of the data we get different results.
6) It will not work for non-linear data.
7) If the cluster centers are chosen randomly, this may not lead to the actual result.
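To make the steps above concrete, here is a minimal R sketch using base R's kmeans() on the two petal attributes of the Iris dataset; the column choice and k = 3 are illustrative assumptions, as the paper does not show its exact K-Means call.

# Minimal sketch of the K-Means steps above using base R's kmeans().
# Column choice and k = 3 (one cluster per species) are illustrative assumptions.
data(iris)
X <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(42)                      # step 1 starts from random centers
km <- kmeans(X, centers = 3)      # iterates steps 2-6 until no reassignment
km$centers                        # final 'mean' center of each cluster
table(km$cluster, iris$Species)   # clusters vs. the true species labels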
and CLARA clustering algorithm, R programming is used to clus-
1.1.2. K-Medoid Clustering Algorithm

The K-Medoid clustering algorithm is like the K-Means clustering algorithm, but the difference lies in the center point: in K-Medoids clustering the center point is a medoid instead of a mean. Also, K-Medoids uses Manhattan distance as its metric instead of the Euclidean distance used by K-Means. Due to this, K-Medoids is more robust to outliers and noise.
The K-Medoid clustering algorithm [3] is as follows:
Input: Ky: the number of clusters; Dy: a dataset containing n objects.
Output: A set of Ky clusters.
Algorithm:
1) Randomly select Ky of the n data points as the medoids.
2) Find the closest medoid for each data point by calculating the distance between the data point and each medoid, and map the data point to it.
3) For each medoid 'm' and each data point 'o' associated with 'm', do the following:
a) swap 'm' and 'o' and compute the total cost of the configuration, then
b) select the medoid 'o' with the lowest cost of the configuration.
4) Repeat steps 2 and 3 until there is no change in the assignments.
Advantages:
1) It is more robust to outliers and noise.
2) It uses Manhattan distance for calculating the dissimilarity among the nodes, which is more robust than Euclidean distance.
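The swap procedure above is what the PAM implementation in R's 'cluster' package performs internally; a short sketch follows. Applying it to the Iris petal attributes with k = 3 and metric = "manhattan" is an assumption for illustration.

# K-Medoids via PAM from the 'cluster' package; metric = "manhattan"
# matches the distance the text attributes to K-Medoids.
library(cluster)
data(iris)
X <- iris[, c("Petal.Length", "Petal.Width")]
pm <- pam(X, k = 3, metric = "manhattan")
pm$medoids                          # actual observations chosen as cluster centers
table(pm$clustering, iris$Species)  # clusters vs. the true species labels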

1.1.3. CLARA Clustering

Clustering Large Applications (CLARA) was proposed by Kaufman and Rousseeuw (1990). It is an extension of the K-Medoids clustering algorithm that uses a sampling approach to handle large datasets: it selects random samples of observations from the dataset and performs PAM on each sample.
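A sketch of CLARA using clara() from the 'cluster' package is shown below; the sample count and the use of the Iris petal attributes are illustrative assumptions.

# CLARA from the 'cluster' package: draws 'samples' random subsets,
# runs PAM on each, and keeps the best set of medoids.
library(cluster)
data(iris)
X <- iris[, c("Petal.Length", "Petal.Width")]
cl <- clara(X, k = 3, metric = "manhattan", samples = 10)
cl$medoids                          # medoids of the best sample
table(cl$clustering, iris$Species)  # clusters vs. the true species labels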

2. Literature Survey

In [4], the author, Tagaram Soni Madhulatha, explains the concept of clustering along with the difference between K-Means and K-Medoids clustering. The author says that clustering is an unsupervised form of learning, which helps to partition the data into clusters using distance measures, without any background knowledge.
In [5], the authors, T. Velmurugan and T. Santhanam, explain clustering as a form of unsupervised learning. They say that the clustering approach is decided based on the type of available data and the purpose for which the clustering is to be done. From the experiments they performed, K-Medoids is more robust than K-Means clustering in terms of noise and outliers, but K-Medoids is good only for small datasets.
In [6], the authors, K. Chitra and D. Maheswari, explain the different types of clustering. They say that clustering is one of the data mining processes: it is unsupervised learning and is the arrangement of sets of identical objects into one cluster. The authors compare the different types of clustering, such as partition-based, hierarchical, grid-based, and density-based.
In [7], the author, T. Velmurugan, analyses the performance of two clustering algorithms using the calculation of distance between two data points. The computational time is also measured in order to assess the performance of each algorithm. In this paper, K-Means is better than K-Medoids clustering in terms of efficiency for the application chosen.

3. Experimental Setup and Dataset

To show the comparison between the K-Means clustering algorithm and the CLARA clustering algorithm, R programming is used to produce cluster plots for both clustering techniques on the Iris dataset.

Fig. 2: Sample of Iris Dataset.

In Fig. 2 above, the first attribute is Sepal.Length, the second is Sepal.Width, the third is Petal.Length, and the last is Petal.Width. According to these attributes, the Iris dataset is a dataset of a flower: it represents three species of flower (setosa, versicolor, virginica), each described by the attributes above.
There are three types of graphs, shown as Fig. 3, Fig. 4 and Fig. 5:
1) A simple plot of the dataset without clustering (Fig. 3: Graph of Iris Dataset without Clustering, Having Three Types of Species (Setosa, Versicolor, Virginica) with Two Attributes, Petal.Length and Petal.Width).
2) A K-Means cluster plot of the Iris dataset (Fig. 4: K-Means Cluster Plot of Iris Dataset Using Two Components: Petal.Length and Petal.Width).
3) A cluster plot of CLARA clustering on the Iris dataset (Fig. 5: Clusplot of CLARA Clustering on Iris Dataset Using Two Components: Petal.Length and Petal.Width).
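The figures themselves are not reproduced here, but a sketch of how the three plots can be generated in R follows; the authors' exact plotting calls are not shown in the paper, so these calls (base plot() and clusplot() from the 'cluster' package) are illustrative assumptions.

# Sketch of the three plots described above; the exact calls used by the
# authors are not shown in the paper, so these are illustrative assumptions.
library(cluster)
data(iris)
X <- iris[, c("Petal.Length", "Petal.Width")]
# 1) Simple plot without clustering, colored by species (cf. Fig. 3)
plot(X, col = iris$Species, main = "Iris Dataset without Clustering")
# 2) K-Means cluster plot; clusplot projects the data onto two components (cf. Fig. 4)
km <- kmeans(X, centers = 3)
clusplot(X, km$cluster, main = "K-Means Cluster Plot of Iris Dataset")
# 3) CLARA cluster plot; 'clara' objects have their own clusplot method (cf. Fig. 5)
cl <- clara(X, k = 3, metric = "manhattan")
clusplot(cl, main = "Clusplot of CLARA Clustering on Iris Dataset")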

Fig. 4 and Fig. 5 show the cluster plots of the two clustering techniques: Fig. 4 uses the K-Means algorithm with Euclidean distance, and Fig. 5 uses CLARA clustering with Manhattan distance. From the two figures we conclude that CLARA clustering has more power to detect outliers and noise than K-Means clustering, as the point variability explained by the two plotted components is 100% for CLARA and approximately 90% for K-Means. This indicates that CLARA clustering is more robust than the K-Means algorithm.

Code and output in R for finding the Euclidean distance and the Manhattan distance between two sample vectors:

> library(philentropy)  # distance() is assumed to come from the philentropy package
> P <- 1:10
> Q <- 11:20
> x <- rbind(P, Q)
> distance(x, method = "euclidean")
euclidean
 31.62278
> distance(x, method = "manhattan")
manhattan
      100

The distance is computed between the row containing 1 to 10 and the row containing 11 to 20. Each of the ten coordinates differs by 10, so the Euclidean distance is sqrt(10 * 10^2) = sqrt(1000) ≈ 31.62, while the Manhattan distance is 10 * 10 = 100. Since distance here means dissimilarity, we observe from the output above that Euclidean distance detects less dissimilarity than Manhattan distance. So, the Manhattan distance measure is better than the Euclidean distance measure.

4. Conclusion

This paper compares K-Means clustering and CLARA clustering on the Iris dataset, using Euclidean distance and Manhattan distance as their respective dissimilarity measures. After plotting the graphs using the two attributes of the dataset, "Petal.Length" and "Petal.Width", we can conclude that CLARA clustering using Manhattan distance is better than K-Means clustering with Euclidean distance.

References

[1] S. Anitha Elavarasi and J. Akilandeswari, "A Survey on Partition Clustering Algorithms", International Journal of Enterprise Computing and Business Systems, 2011.
[2] Navneet Kaur, "Survey Paper on Clustering Techniques", International Journal of Science, Engineering and Technology Research (IJSETR), Volume 2, Issue 4, 2013.
[3] Preeti Arora, Deepali, Shipra Varshney, "Analysis of K-Means and K-Medoids Algorithm for Big Data", International Conference on Information Security & Privacy (ICISP 2015), Nagpur, India, 2015.
[4] Tagaram Soni Madhulatha, "Comparison between K-Means and K-Medoids Clustering Algorithms", Advances in Computing and Information Technology, Volume 198, pp. 472-481.
[5] T. Velmurugan and T. Santhanam, "Performance Analysis of K-Means and K-Medoids Clustering Algorithms for a Randomly Generated Data Set", International Conference on Systemics, Cybernetics and Informatics, pp. 578-583, 2008.
[6] K. Chitra and D. Maheswari, "A Comparative Study of Various Clustering Algorithms in Data Mining", IJCSMC, Vol. 6, Issue 8, pp. 109-115, 2017.
[7] T. Velmurugan, "Efficiency of K-Means and K-Medoids Algorithms for Clustering Arbitrary Data Points", International Journal of Computer Technology & Applications, Volume 3(5), pp. 1758-1764, 2012.
[8] https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm