A Survey of Clustering Algorithms For Big Data: Taxonomy & Empirical Analysis


This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TETC.2014.2330519,
IEEE Transactions on Emerging Topics in Computing
TRANSACTIONS ON EMERGING TOPICS
IN COMPUTING, 2014
1

A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis
A. Fahad, N. Alshatri, Z. Tari, Member, IEEE, A. Alamri, I. Khalil, A. Zomaya, Fellow, IEEE, S. Foufou, and A. Bouras
Abstract: Clustering algorithms have emerged as a powerful alternative meta-learning tool for accurately analysing the massive
volume of data generated by modern applications. Their main goal is to categorize data into clusters such that objects are
grouped in the same cluster when they are similar according to specific metrics. There is a vast body of knowledge in the area of
clustering, and there have been attempts to analyse and categorise clustering algorithms for a large number of applications. However,
one of the major issues in using clustering algorithms for big data, and one that causes confusion amongst practitioners, is the lack
of consensus in the definition of their properties as well as the lack of a formal categorization. To alleviate these problems, this
paper introduces concepts and algorithms related to clustering, gives a concise survey of existing clustering algorithms, and provides
a comparison from both a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing
framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments in which we
compared the most representative algorithm from each of the categories using a large number of real (big) datasets. The effectiveness
of the candidate clustering algorithms is measured through a number of internal and external validity metrics, stability, runtime and
scalability tests. Additionally, we highlight the set of clustering algorithms that perform best for big data.
Index Terms: clustering algorithms, unsupervised learning, big data.

1 INTRODUCTION

In the current digital era, owing to the massive progress and development of the
internet and online technologies such as big and powerful data servers, we face a
huge volume of information and data every day from many different resources and
services that were not available to humankind just a few decades ago. Massive
quantities of data are produced by and about people, things, and their interactions.
Diverse groups argue about the potential benefits and costs of analyzing information
from Twitter, Google, Verizon, 23andMe, Facebook, Wikipedia, and every space where
large groups of people leave digital traces and deposit data. This data comes from
the different online resources and services that have been established to serve
their customers. Services and resources such as sensor networks, cloud storage and
social networks produce big volumes of data and also need to manage and reuse that
data or some analytical aspects of it. Although this massive volume of data can be
really useful for people and corporations, it can be problematic as well. Big data
needs big storage, and its volume makes operations such as analysis, processing and
retrieval very difficult and hugely time consuming. One way to overcome these
problems is to have big data clustered in a compact format that is still an
informative version of the entire data. Such clustering techniques aim to produce
good-quality clusters/summaries. Therefore, they would hugely benefit everyone from
ordinary users to researchers and people in the corporate world, as they could
provide an efficient tool to deal with large data, for example in critical systems
(to detect cyber attacks).

A. Fahad, N. Alshatri, Z. Tari and A. Alamri are with the Distributed
Systems and Networking (DSN) research group, School of Computer
Science and Information Technology (CS&IT), Royal Melbourne Institute
of Technology (RMIT), Melbourne, Australia.
E-mail: adil.alharthi@rmit.edu.au
A. Zomaya is with the Centre for Distributed and High Performance
Computing, School of Information Technologies, The University of
Sydney, Sydney, Australia.
S. Foufou and A. Bouras are with the Department of Computer Science,
College of Engineering, Qatar University, Doha, Qatar.

The main goal of this paper is to provide readers with
a proper analysis of the different classes of available
clustering techniques for big data by experimentally
comparing them on real big data. The paper does not
refer to simulation tools. However, it specifically looks
at the use and implementation of an efficient algorithm
from each class. It also provides experimental results
from a variety of big datasets. Some aspects need careful
attention when dealing with big data, and this work
will therefore help researchers as well as practitioners
in selecting techniques and algorithms that are suitable
for big data. Volume is the first and most obvious
characteristic to deal with when clustering big data compared to conventional
data clustering, as this requires substantial changes in the architecture of
storage systems. The other important characteristic of big data is Velocity.
This requirement leads to a high demand for online processing of data, where
processing speed is required to deal with the data flows. Variety is the third
characteristic, where different data types, such as text, image, and video, are
produced from various sources, such as sensors, mobile phones, etc. These three
Vs (Volume, Velocity, and Variety) are the core characteristics of big data which
must be taken into account when selecting appropriate clustering techniques.

This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.

Despite a vast number of surveys for clustering algorithms available in the literature [1], [2], [7], [38] for various
domains (such as machine learning, data mining, information retrieval, pattern recognition, bio-informatics
and semantic ontology), it is difficult for users to decide
a priori which algorithm would be the most appropriate
for a given big dataset. This is because of some of the
limitations in existing surveys: (i) the characteristics of
the algorithms are not well studied; (ii) the field has
produced many new algorithms, which were not considered in these surveys; and (iii) no rigorous empirical
analysis has been carried out to ascertain the benefit of
one algorithm over another. Motivated by these reasons,
this paper attempts to review the field of clustering
algorithms and achieve the following objectives:
• To propose a categorizing framework that systematically groups a collection of existing clustering
  algorithms into categories and compares their advantages and drawbacks from a theoretical point of
  view.
• To present a complete taxonomy of the clustering evaluation measurements to be used for empirical
  study.
• To make an empirical study analyzing the most representative algorithm of each category with respect
  to both theoretical and empirical perspectives.
Therefore, the proposed survey presents a taxonomy
of clustering algorithms and proposes a categorizing
framework that covers major factors in the selection
of a suitable algorithm for big data. It further conducts experiments involving the most representative
clustering algorithm of each category, a large number
of evaluation metrics and 10 traffic datasets. The rest of
this paper is organized as follows. Section 2 provides
a review of clustering algorithm categories. Section 3
describes the proposed criteria and properties for the
categorizing framework. In Section 4, we group and
compare different clustering algorithms based on the
proposed categorizing framework. Section 5 introduces
the taxonomy of clustering evaluation measurements,
describes the experimental framework and summarises
the experimental results. In Section 6, we conclude the
paper and discuss future research.

2 CLUSTERING ALGORITHM CATEGORIES

As there are so many clustering algorithms, this section introduces a
categorizing framework that groups the various clustering algorithms found in
the literature into distinct categories. The proposed categorization framework
is developed from an algorithm designer's perspective that focuses on the
technical details of the general procedures of the clustering process.
Accordingly, the processes of different clustering algorithms can be broadly
classified as follows:

• Partitioning-based: In such algorithms, all clusters are determined at once.
  Initial groups are specified and objects are then reallocated among them until
  the partition converges. In other words, the partitioning algorithms divide data
  objects into a number of partitions, where each partition represents a cluster.
  These clusters should fulfil the following requirements: (1) each group must
  contain at least one object, and (2) each object must belong to exactly one
  group. In the K-means algorithm, for instance, a cluster center is the arithmetic
  mean of the coordinates of all points in the cluster. In the K-medoids algorithm,
  the objects nearest to the centers represent the clusters. There are many other
  partitioning algorithms such as K-modes, PAM, CLARA, CLARANS and FCM.
• Hierarchical-based: Data are organized in a hierarchical manner depending on the
  medium of proximity. Proximities are obtained by the intermediate nodes. A
  dendrogram represents the dataset, where individual data objects are represented
  by leaf nodes. The initial cluster gradually divides into several clusters as the
  hierarchy continues. Hierarchical clustering methods can be agglomerative
  (bottom-up) or divisive (top-down). An agglomerative clustering starts with one
  object for each cluster and recursively merges two or more of the most
  appropriate clusters. A divisive clustering starts with the dataset as one
  cluster and recursively splits the most appropriate cluster. The process
  continues until a stopping criterion is reached (frequently, the requested
  number k of clusters). The hierarchical method has a major drawback, though:
  once a step (merge or split) is performed, it cannot be undone. BIRCH, CURE,
  ROCK and Chameleon are some of the well-known algorithms of this category.
• Density-based: Here, data objects are separated based on their regions of
  density, connectivity and boundary. They are closely related to point-nearest
  neighbours. A cluster, defined as a connected dense component, grows in any
  direction that density leads to. Therefore, density-based algorithms are capable
  of discovering clusters of arbitrary shapes. This also provides a natural
  protection against outliers. Thus the overall density of a point is analyzed to
  determine the functions of datasets that influence a particular data point.
  DBSCAN, OPTICS, DBCLASD and DENCLUE are algorithms that use such a method to
  filter out noise (outliers) and discover clusters of arbitrary shape.
• Grid-based: The space of the data objects is divided into grids. The main
  advantage of this approach is its fast processing time, because it goes through
  the dataset once to compute the statistical values for the grids. The accumulated
  grid data make grid-based clustering techniques independent of the number of data
  objects: they employ a uniform grid to collect regional statistical data and then
  perform the clustering on the grid instead of on the database directly. The
  performance of a grid-based method depends on the size of the grid, which is
  usually much less than the size of the database. However, for highly irregular
  data distributions, using a single uniform grid may not be sufficient to obtain
  the required clustering quality or fulfill the time requirement. Wave-Cluster and
  STING are typical examples of this category.
• Model-based: Such a method optimizes the fit between the given data and some
  (predefined) mathematical model. It is based on the assumption that the data is
  generated by a mixture of underlying probability distributions. It also leads to
  a way of automatically determining the number of clusters based on standard
  statistics, taking noise (outliers) into account and thus yielding a robust
  clustering method. There are two major approaches that are based on the
  model-based method: statistical and neural network approaches. MCLUST is probably
  the best-known model-based algorithm, but there are other good algorithms, such
  as EM (which uses a mixture density model), conceptual clustering (such as
  COBWEB), and neural network approaches (such as self-organizing feature maps).
  The statistical approach uses probability measures in determining the concepts or
  clusters. Probabilistic descriptions are typically used to represent each derived
  concept. The neural network approach uses a set of connected input/output units,
  where each connection has a weight associated with it. Neural networks have
  several properties that make them popular for clustering. First, neural networks
  are inherently parallel and distributed processing architectures. Second, neural
  networks learn by adjusting their interconnection weights so as to best fit the
  data. This allows them to "normalize" or "prototype" the patterns and act as
  feature (or attribute) extractors for the various clusters. Third, neural
  networks process numerical vectors and require object patterns to be represented
  by quantitative features only. Many clustering tasks handle only numerical data
  or can transform their data into quantitative features if needed. The neural
  network approach to clustering tends to represent each cluster as an exemplar.
  An exemplar acts as a prototype of the cluster and does not necessarily have to
  correspond to a particular object. New objects can be assigned to the cluster
  whose exemplar is the most similar, based on some distance measure.

Clustering Algorithm
    Partitioning-Based: 1. K-means, 2. K-medoids, 3. K-modes, 4. PAM, 5. CLARANS, 6. CLARA, 7. FCM
    Hierarchical-Based: 1. BIRCH, 2. CURE, 3. ROCK, 4. Chameleon, 5. Echidna
    Density-Based:      1. DBSCAN, 2. OPTICS, 3. DBCLASD, 4. DENCLUE
    Grid-Based:         1. Wave-Cluster, 2. STING, 3. CLIQUE, 4. OptiGrid
    Model-Based:        1. EM, 2. COBWEB, 3. CLASSIT, 4. SOMs
Fig. 1. An overview of clustering taxonomy.

Figure 1 provides an overview of the clustering algorithm
taxonomy following the five classes of categorization
described above.
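
To make the grid-based idea above more concrete, the following minimal sketch accumulates per-cell statistics in a single pass over the data and then clusters the cells rather than the points. It is not Wave-Cluster or STING; the function names, the `cell_size` and `min_count` parameters, and the rule that dense cells sharing a face belong to the same cluster are all assumptions made for illustration.

```python
from collections import defaultdict


def grid_statistics(points, cell_size):
    """Single pass over the data: count and per-dimension sum for each grid cell."""
    cells = defaultdict(lambda: {"count": 0, "sum": None})
    for p in points:
        key = tuple(int(x // cell_size) for x in p)
        cell = cells[key]
        if cell["sum"] is None:
            cell["sum"] = [0.0] * len(p)
        cell["count"] += 1
        for j, x in enumerate(p):
            cell["sum"][j] += x
    return cells


def dense_cell_clusters(cells, min_count):
    """Cluster on the grid only: join dense cells that share a face (flood fill)."""
    dense = {k for k, c in cells.items() if c["count"] >= min_count}
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            cell = stack.pop()
            component.append(cell)
            for dim in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(sorted(component))
    return clusters


# Hypothetical toy data: two groups of points plus one isolated point.
points = [(0.1, 0.2), (0.3, 0.4), (0.8, 0.9), (1.2, 0.3), (1.4, 0.6),
          (5.1, 5.2), (5.3, 5.9), (5.6, 5.5), (9.0, 0.5)]
cells = grid_statistics(points, cell_size=1.0)
clusters = dense_cell_clusters(cells, min_count=2)
print(clusters)
```

After the first pass, the second step never touches the original points, which is why the cost of the clustering step depends on the number of occupied cells rather than on the number of objects.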

3 CRITERION TO BENCHMARK CLUSTERING METHODS

When evaluating clustering methods for big data, specific criteria need to be
used to evaluate the relative strengths and weaknesses of every algorithm with
respect to the three-dimensional properties of big data: Volume, Velocity, and
Variety. In this section, we define these properties and compile the key
criteria of each property.
• Volume refers to the ability of a clustering algorithm to deal with a large
  amount of data. To guide the selection of a suitable clustering algorithm with
  respect to the Volume property, the following criteria are considered: (i) size
  of the dataset, (ii) handling high dimensionality and (iii) handling
  outliers/noisy data.
• Variety refers to the ability of a clustering algorithm to handle different
  types of data (numerical, categorical and hierarchical). To guide the selection
  of a suitable clustering algorithm with respect to the Variety property, the
  following criteria are considered: (i) type of dataset and (ii) cluster shape.
• Velocity refers to the speed of a clustering algorithm on big data. To guide the
  selection of a suitable clustering algorithm with respect to the Velocity
  property, the following criteria are considered: (i) complexity of the algorithm
  and (ii) run time performance.
In what follows, we explain in detail the corresponding
criteria of each property of big data:
1) Type of dataset: Most of the traditional clustering algorithms are designed to
   focus either on numeric data or on categorical data. The data collected in the
   real world often contain both numeric and categorical attributes, and it is
   difficult to apply traditional clustering algorithms directly to these kinds
   of data. Clustering algorithms work effectively either on purely numeric data
   or on purely categorical data; most of them perform poorly on mixed categorical
   and numerical data types.
2) Size of dataset: The size of the dataset has a major effect on the clustering
   quality. Some clustering methods are more efficient than others when the data
   size is small, and vice versa.
3) Input parameter: A desirable feature for practical clustering is a small number
   of parameters, since with a large number of parameters the cluster quality
   comes to depend on the values chosen for them.
4) Handling outliers/noisy data: A successful algorithm will often be able to
   handle outliers/noisy data, because the data in most real applications are not
   pure. Noise also makes it difficult for an algorithm to cluster an object into
   a suitable cluster, and therefore affects the results provided by the
   algorithm.
5) Time complexity: Most of the clustering methods must be run several times to
   improve the clustering quality. Therefore, if the process takes too long, it
   can become impractical for applications that handle big data.
6) Stability: One of the important features of any clustering algorithm is the
   ability to generate the same partition of the data irrespective of the order
   in which the patterns are presented to the algorithm.
7) Handling high dimensionality: This is a particularly important feature in
   cluster analysis because many applications require the analysis of objects
   containing a large number of features (dimensions). For example, text documents
   may contain thousands of terms or keywords as features. This is challenging due
   to the curse of dimensionality. Many dimensions may not be relevant. As the
   number of dimensions increases, the data become increasingly sparse, so that
   the distance measurement between pairs of points becomes meaningless and the
   average density of points anywhere in the data is likely to be low.
8) Cluster shape: A good clustering algorithm should be able to handle real data
   and their wide variety of data types, which will produce clusters of arbitrary
   shape.
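
The remark in criterion 7 that distances lose their meaning in high dimensions can be illustrated with a small experiment. The sketch below is an illustration on assumed synthetic data (uniform random points, not the datasets used in this survey); `relative_contrast` is a hypothetical helper that compares the farthest and nearest neighbour of a random query point.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """(max distance - min distance) / min distance from a random query point."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, p))
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_low = relative_contrast(200, 2, rng)
contrast_high = relative_contrast(200, 500, rng)
print(f"relative contrast, d=2:   {contrast_low:.2f}")
print(f"relative contrast, d=500: {contrast_high:.2f}")
```

When the contrast is small, the nearest and farthest points are almost equally far away, so a distance threshold or a nearest-centre rule separates objects much less sharply than it does in low dimensions.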

4 CANDIDATE CLUSTERING ALGORITHMS

This section aims to find good candidate clustering algorithms for big data.
By good, we refer to those algorithms that satisfy most of the criteria listed
in Section 3. Table 1 provides a summary of the evaluation we performed on the
various methods described in Section 2 based on the described criteria. After
this evaluation, the next step is to select the most appropriate clustering
algorithm from each category based on the proposed criteria, so as to benchmark
them for big data. In this way, the best algorithm is selected from each
category, and the selected algorithms are then properly evaluated. This process
produced the following selections: FCM [6], BIRCH [40], DENCLUE [17],
OptiGrid [18] and EM [8].
This section discusses each of the selected algorithms in detail, showing how
it works, its strengths and weaknesses, as well as the input parameters it takes.
4.1 Fuzzy C-Means (FCM)

FCM [6] is a representative algorithm of fuzzy clustering which is based on
K-means concepts to partition a dataset into clusters. The FCM algorithm is a
soft clustering method in which the objects are assigned to the clusters with a
degree of belief. Hence, an object may belong to more than one cluster with
different degrees of belief. It attempts to find the most characteristic point
in each cluster, named the centre of the cluster; then it computes the
membership degree of each object in the clusters. The fuzzy c-means algorithm
minimizes intra-cluster variance as well. However, it inherits the problems of
K-means, as the minimum is just a local one and the final clusters depend on the
initial choice of weights.
The FCM algorithm follows the same principle as the K-means algorithm, i.e. it
iteratively searches for the cluster centers and updates the memberships of
objects. The main difference is that, instead of making a hard decision about
which cluster an object should belong to, it assigns each object a value ranging
from 0 to 1 to measure the likelihood with which the object belongs to that
cluster. A fuzzy rule states that the sum of the membership values of an object
to all clusters must be 1. The higher the membership value, the more likely it
is that the object belongs to that cluster. The FCM clustering is obtained by
minimizing the objective function shown in Equation 1:
$$ J = \sum_{i=1}^{n} \sum_{k=1}^{c} \mu_{ik}^{m} \, |p_i - v_k|^2 \tag{1} $$

where J is the objective function, n is the number of objects, c is the number
of defined clusters, $\mu_{ik}$ is the likelihood value obtained by assigning
object i to cluster k, m is a fuzziness factor (a value > 1), and $|p_i - v_k|$
is the Euclidean distance between the i-th object $p_i$ and the k-th cluster
centre $v_k$, defined by Equation 2:

$$ |p_i - v_k| = \sqrt{\sum_{i=1}^{n} (p_i - v_k)^2} \tag{2} $$


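TABLE 1
Categorization of clustering algorithms with respect to big data properties and other criteria described in Section 3.

Categories | Abb. name | Volume: Size of Dataset | Volume: Handling High Dimensionality | Volume: Handling Noisy Data | Variety: Type of Dataset | Variety: Clusters Shape
-----------|-----------|-------------------------|--------------------------------------|-----------------------------|--------------------------|------------------------
Partitional algorithms | K-Means [25] | Large | No | No | Numerical | Non-convex
Partitional algorithms | K-modes [19] | Large | Yes | No | Categorical | Non-convex
Partitional algorithms | K-medoids [33] | Small | Yes | Yes | Categorical | Non-convex
Partitional algorithms | PAM [31] | Small | No | No | Numerical | Non-convex
Partitional algorithms | CLARA [23] | Large | No | No | Numerical | Non-convex
Partitional algorithms | CLARANS [32] | Large | No | No | Numerical | Non-convex
Partitional algorithms | FCM [6] | Large | No | No | Numerical | Non-convex
Hierarchical algorithms | BIRCH [40] | Large | No | No | Numerical | Non-convex
Hierarchical algorithms | CURE [14] | Large | Yes | Yes | Numerical | Arbitrary
Hierarchical algorithms | ROCK [15] | Large | No | No | Categorical and Numerical | Arbitrary
Hierarchical algorithms | Chameleon [22] | Large | Yes | No | All type of data | Arbitrary
Hierarchical algorithms | ECHIDNA [26] | Large | No | No | Multivariate Data | Non-convex
Density-based algorithms | DBSCAN [9] | Large | No | No | Numerical | Arbitrary
Density-based algorithms | OPTICS [5] | Large | No | Yes | Numerical | Arbitrary
Density-based algorithms | DBCLASD [39] | Large | No | Yes | Numerical | Arbitrary
Density-based algorithms | DENCLUE [17] | Large | Yes | Yes | Numerical | Arbitrary
Grid-based algorithms | Wave-Cluster [34] | Large | No | Yes | Special data | Arbitrary
Grid-based algorithms | STING [37] | Large | No | Yes | Special data | Arbitrary
Grid-based algorithms | CLIQUE [21] | Large | Yes | No | Numerical | Arbitrary
Grid-based algorithms | OptiGrid [18] | Large | Yes | Yes | Special data | Arbitrary
Model-based algorithms | EM [8] | Large | Yes | No | Special data | Non-convex
Model-based algorithms | COBWEB [12] | Small | No | No | Numerical | Non-convex
Model-based algorithms | CLASSIT [13] | Small | No | No | Numerical | Non-convex
Model-based algorithms | SOMs [24] | Small | Yes | No | Multivariate Data | Non-convex
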

The centroid of the k-th cluster is updated using Equation 3:

$$ v_k = \frac{\sum_{i=1}^{n} \mu_{ik}^{m} \, p_i}{\sum_{i=1}^{n} \mu_{ik}^{m}} \tag{3} $$

The fuzzy membership table is computed using Equation 4:

$$ \mu_{ik} = \frac{1}{\sum_{l=1}^{c} \left( \frac{|p_i - v_k|}{|p_i - v_l|} \right)^{\frac{2}{m-1}}} \tag{4} $$

This algorithm has been extended for clustering RGB color images, where the
distance computation given in Equation 2 is modified as follows:

$$ |p_i - v_k| = \sqrt{\sum_{i=1}^{n} (p_{iR} - v_{kR})^2 + (p_{iG} - v_{kG})^2 + (p_{iB} - v_{kB})^2} \tag{5} $$

As mentioned earlier, FCM is an iterative process (see the
FCM pseudo-code).
FCM pseudo-code:
Input: The dataset, the desired number of clusters c, the fuzzy parameter m
(a constant > 1), and the stopping condition. Initialize the fuzzy partition
matrix and set stop = false.
Step 1. Do:
Step 2. Calculate the cluster centroids and the objective value J.
Step 3. Compute the membership values stored in the matrix.
Step 4. If the change in the value of J between consecutive iterations is less
than the stopping condition, then stop = true.
Step 5. While (!stop)
Output: A list of c cluster centres and a partition matrix.
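
The pseudo-code above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation benchmarked in this survey: the function name `fcm`, the defaults `tol`, `max_iter` and `seed`, and the toy data are assumptions. The distance is the ordinary Euclidean distance over the coordinates of an object, and a tiny floor on the distances (an added safeguard) avoids division by zero when an object coincides with a centre.

```python
import math
import random


def fcm(points, c, m=2.0, tol=1e-6, max_iter=100, seed=0):
    """Fuzzy c-means on a list of equal-length tuples; returns (centres, U, J)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Input step: random fuzzy partition matrix, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([x / total for x in row])
    prev_j = float("inf")
    for _ in range(max_iter):
        # Step 2 (Equation 3): centres are membership-weighted means.
        centres = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            wsum = sum(w)
            centres.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / wsum
                for d in range(dim)))
        # Euclidean distances, floored to avoid division by zero.
        dist = [[max(math.dist(points[i], centres[k]), 1e-12) for k in range(c)]
                for i in range(n)]
        # Step 2 (Equation 1): objective value J.
        j_val = sum(u[i][k] ** m * dist[i][k] ** 2
                    for i in range(n) for k in range(c))
        # Step 3 (Equation 4): membership update.
        exponent = 2.0 / (m - 1.0)
        u = [[1.0 / sum((dist[i][k] / dist[i][l]) ** exponent for l in range(c))
              for k in range(c)] for i in range(n)]
        # Step 4: stop when J changes by less than the stopping condition.
        if abs(prev_j - j_val) < tol:
            break
        prev_j = j_val
    return centres, u, j_val


# Hypothetical toy data: two well-separated groups of three objects each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centres, u, j_val = fcm(points, c=2)
labels = [row.index(max(row)) for row in u]  # harden memberships for display
print(centres, labels)
```

Taking the arg-max of each membership row, as in the last lines, discards the soft assignment; it is done here only to make the result easy to inspect.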

TABLE 1 (continued)

Categories | Abb. name | Velocity: Complexity of Algorithm | Other criterion: Input Parameter
-----------|-----------|-----------------------------------|---------------------------------
Partitional algorithms | K-Means [25] | O(nkd) | 1
Partitional algorithms | K-modes [19] | O(n) | 1
Partitional algorithms | K-medoids [33] | O(n^2 dt) | 1
Partitional algorithms | PAM [31] | O(k(n-k)^2) | 1
Partitional algorithms | CLARA [23] | O(k(40+k)^2 + k(n-k)) | 1
Partitional algorithms | CLARANS [32] | O(kn^2) | 2
Partitional algorithms | FCM [6] | O(n) | 1
Hierarchical algorithms | BIRCH [40] | O(n) | 2
Hierarchical algorithms | CURE [14] | O(n^2 log n) | 2
Hierarchical algorithms | ROCK [15] | O(n^2 + n m_m m_a + n^2 log n) | 1
Hierarchical algorithms | Chameleon [22] | O(n^2) | 3
Hierarchical algorithms | ECHIDNA [26] | O(N B (1 + log_B m)) | 2
Density-based algorithms | DBSCAN [9] | O(n log n) if a spatial index is used; otherwise O(n^2) | 2
Density-based algorithms | OPTICS [5] | O(n log n) | 2
Density-based algorithms | DBCLASD [39] | O(3n^2) | No
Density-based algorithms | DENCLUE [17] | O(log|D|) | 2
Grid-based algorithms | Wave-Cluster [34] | O(n) | 3
Grid-based algorithms | STING [37] | O(k) | 1
Grid-based algorithms | CLIQUE [21] | O(C^k + mk) | 2
Grid-based algorithms | OptiGrid [18] | Between O(nd) and O(nd log n) | 3
Model-based algorithms | EM [8] | O(knp) | 3
Model-based algorithms | COBWEB [12] | O(n^2) | 1
Model-based algorithms | CLASSIT [13] | O(n^2) | 1
Model-based algorithms | SOMs [24] | O(n^2 m) | 2

4.2 BIRCH

The BIRCH algorithm [40] builds a dendrogram known as a clustering feature tree
(CF tree). The CF tree can be built by scanning the dataset in an incremental
and dynamic way. Thus, it does not need the whole dataset in advance. It has two
main phases: the database is first scanned to build an in-memory tree, and then
the algorithm is applied to cluster the leaf nodes. The CF tree is a
height-balanced tree which is based on two parameters: branching factor B and
threshold T. The CF tree is built while scanning the data. When a data point is
encountered, the CF tree is traversed, starting from the root and choosing the
closest node at each level. Once the closest leaf cluster for the current data
point is identified, a test is performed to see whether the data point belongs
to the candidate cluster or not. If not, i.e. if adding the point would give the
candidate cluster a diameter greater than the given T, a new cluster is created.
BIRCH can typically find a good clustering with a single scan of the dataset and
improve the quality further with a few additional scans. It can also handle
noise effectively. However, BIRCH may not work well when clusters are not
spherical because it uses the concept of radius or diameter to control the
boundary of a cluster. In addition, it is order-sensitive and may generate
different clusters for different orders of the same input data. The details of
the algorithm are given below.
BIRCH pseudo-code:
Input: The dataset, threshold T, the maximum diameter (or
radius) of a cluster R, and the branching factor B
Step 1. (Load data into memory) An initial in-memory CF-tree
is constructed with one scan of the data. Subsequent phases
become fast, accurate and less order sensitive.
Step 2. (Condense data) Rebuild the CF-tree with a larger T.
Step 3. (Global clustering) Use an existing clustering algorithm
on the CF leaves.
Step 4. (Cluster refining) Do additional passes over the dataset
and reassign data points to the closest centroid from Step 3.
Output: The CF points, where CF = (number of points in a
cluster N, linear sum of the points in the cluster LS, square
sum of the N data points SS).
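
As an illustration of the clustering feature CF = (N, LS, SS) used by BIRCH, the sketch below shows why CFs can be maintained incrementally: two CFs merge by component-wise addition, and the centroid and radius follow from the three components alone. It is a simplification under stated assumptions: the leaves are kept in a flat list rather than in a height-balanced CF tree (so the branching factor B plays no role), the threshold test uses the radius, and the names `CF` and `insert_point` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int      # number of points N
    ls: tuple   # linear sum LS, one entry per dimension
    ss: float   # square sum SS (sum of squared norms)

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: the CF of a union is the component-wise sum.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root mean squared distance of the points to the centroid.
        c = self.centroid()
        var = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(var, 0.0))


def insert_point(leaves, p, threshold):
    """Absorb p into the closest leaf CF if its radius stays within the
    threshold; otherwise start a new leaf CF (flat list, not a real CF tree)."""
    new = CF.from_point(p)
    if leaves:
        idx = min(range(len(leaves)),
                  key=lambda i: math.dist(leaves[i].centroid(), p))
        candidate = leaves[idx].merge(new)
        if candidate.radius() <= threshold:
            leaves[idx] = candidate
            return
    leaves.append(new)


# Hypothetical toy stream of points, inserted one at a time.
leaves = []
for p in [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (4.0, 4.0), (4.2, 4.0)]:
    insert_point(leaves, p, threshold=0.5)
print([(cf.n, cf.centroid()) for cf in leaves])
```

Because a point is only ever compared with the current summaries, feeding the same points in a different order can produce different leaves, which is the order sensitivity mentioned above.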

4.3 DENCLUE

The DENCLUE algorithm [17] analytically models the cluster distribution
according to the sum of influence functions of all of the data points. The
influence function can be seen as a function that describes the impact of a data
point within its neighbourhood. Then density attractors can be identified as
clusters. Density attractors are local maxima of the overall density function.
In this algorithm, clusters of arbitrary shape can be easily described by a
simple equation with kernel density functions. Even though DENCLUE requires a
careful selection of its input parameters (i.e. σ and ξ), since such parameters
may influence the quality of the clustering results, it has several advantages
in comparison to other clustering algorithms [16]: (i) it has a solid
mathematical foundation and generalizes other clustering methods, such as
partitional and hierarchical; (ii) it has good clustering properties for
datasets with a large amount of noise; (iii) it allows a compact mathematical
description of arbitrarily shaped clusters in high-dimensional datasets; and
(iv) it uses grid cells and only keeps information about the cells that actually
contain points. It manages these cells in a tree-based access structure and thus
it is significantly faster than some influential algorithms, such as DBSCAN. All
of these properties make DENCLUE able to produce good clusters in datasets with
a large amount of noise.
The details of this algorithm are given below.
DENCLUE pseudo-code:
Input: The dataset, cluster radius, and minimum number of
objects
Step 1. Place the dataset in a grid in which each cell side has length 2σ.
Step 2. Find the highly dense cells, i.e. find the mean of the highly
populated cells.
Step 3. If d(mean(c1), mean(c2)) < 4σ, then the two cubes are
connected.
Step 4. Highly populated cells, or cubes that are connected
to highly populated cells, are then considered in determining
clusters.
Step 5. Find density attractors using a hill-climbing procedure.
Step 6. Randomly pick a point r.
Step 7. Compute the local 4σ density.
Step 8. Pick another point (r+1) close to the previously computed
density.
Step 9. If den(r) < den(r+1), climb; then put the points within
(σ/2) of the path into the cluster.
Step 10. Connect the density-attractor-based clusters.
Output: Assignment of data values to clusters.
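
The hill-climbing part of the pseudo-code (Steps 5 to 10) can be sketched as follows, assuming a Gaussian influence function whose width is called `sigma` here. The sketch is a simplification: the grid pre-processing of Steps 1 to 4 and the noise threshold are left out, attractors are merged by a plain distance test (`merge_dist`), and the step size, function names and toy data are assumptions.

```python
import math


def density(x, data, sigma):
    """Overall density at x: sum of Gaussian influence functions of all points."""
    return sum(math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in data)


def gradient(x, data, sigma):
    """Direction of steepest density increase (constant factor 1/sigma^2 dropped)."""
    g = [0.0] * len(x)
    for p in data:
        w = math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2))
        for j in range(len(x)):
            g[j] += w * (p[j] - x[j])
    return g


def hill_climb(x, data, sigma, step=0.05, max_iter=1000):
    """Follow the gradient in fixed-size steps until the density stops increasing."""
    x = tuple(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(v * v for v in g))
        if norm == 0.0:
            break
        nxt = tuple(xi + step * gi / norm for xi, gi in zip(x, g))
        if density(nxt, data, sigma) <= density(x, data, sigma):
            break
        x = nxt
    return x


def assign_clusters(data, sigma, merge_dist):
    """Points whose attractors lie within merge_dist of each other share a cluster."""
    attractors, labels = [], []
    for p in data:
        a = hill_climb(p, data, sigma)
        for idx, b in enumerate(attractors):
            if math.dist(a, b) < merge_dist:
                labels.append(idx)
                break
        else:
            attractors.append(a)
            labels.append(len(attractors) - 1)
    return labels, attractors


# Hypothetical toy data: two small groups of points.
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 5.1), (5.1, 4.8)]
labels, attractors = assign_clusters(data, sigma=0.5, merge_dist=1.0)
print(labels)
```

Each point is labelled by the local density maximum it climbs to, which is what lets this family of methods follow clusters of arbitrary shape instead of assuming a centre and a radius.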

4.4 Optimal Grid (OptiGrid)

The OptiGrid algorithm [18] is designed to obtain an optimal grid partitioning.
This is achieved by constructing the best cutting hyperplanes through a set of
selected projections. These projections are then used to find the optimal
cutting planes. Each cutting plane is selected to have minimal point density and
to separate the dense region into two half spaces. After each step of a
multi-dimensional grid construction defined by the best cutting planes, OptiGrid
finds the clusters using the density function. The algorithm is then applied
recursively to the clusters. In each round of recursion, OptiGrid only maintains
data objects in the dense grids from the previous round of recursion. This
method is very efficient for clustering large high-dimensional databases.
However, it may perform poorly in locating clusters embedded in a
low-dimensional subspace of a very high-dimensional database, because its
recursive method only reduces the dimensions by one at every step. In addition,
it suffers from sensitivity to parameter choice and does not efficiently handle
grid sizes that exceed the available memory [12]. Moreover, OptiGrid requires
users to select the projections and the density estimate very carefully and to
determine what constitutes a best or optimal cutting plane. How difficult this
is can only be determined on a case-by-case basis for the data being studied.
OptiGrid pseudo-code:
Input: The dataset (x), a set of contracting projections P =
{P0, P1, ..., Pk}, a list of cutting planes BEST_CUT ← ∅, and
CUT ← ∅;
Step 1. For i = 0, ..., k, do
Step 2. CUT ← best local cuts Pi(D), CUT_SCORE ← Score best
local cuts Pi(D)
Step 3. Insert all the cutting planes with a score ≥ min_cut_score
into BEST_CUT;
Step 4. Select the q cutting planes of the highest score from
BEST_CUT and construct a multidimensional grid G using the
q cutting planes;
Step 5. Insert all data points in D into G and determine the
highly populated grid cells in G; add these cells to the set of
clusters C;
Refine C: For all clusters Ci in C, perform the same process
with dataset Ci;
Output: Assignment of data values to clusters.
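The notion of a "best local cut" in Step 2 can be made concrete on a single one-dimensional projection. The sketch below is a simplified illustration and not the scoring used in [18]: it assumes a plain histogram as the density estimate, and the function name `best_cut_1d` and its score (the smaller of the two peak counts on either side of a bin, minus the count inside the bin) are hypothetical choices. It returns a cut position that lies in a low-density bin separating two dense regions.

```python
def best_cut_1d(values, bins=20):
    """Pick a low-density cut between two dense regions of a 1-D projection.

    Returns (cut_position, score). The score is a hypothetical quality
    measure: the smaller of the two peak counts on either side of the
    candidate bin, minus the count inside that bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("projection has zero extent; no cut possible")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls exactly on the upper edge, so clamp it.
        counts[min(int((v - lo) / width), bins - 1)] += 1

    def score(i):
        return min(max(counts[:i]), max(counts[i + 1:])) - counts[i]

    # Only interior bins can separate two dense regions.
    best = max(range(1, bins - 1), key=score)
    return lo + (best + 0.5) * width, score(best)
```

In the spirit of Step 3, a caller would keep only those cuts whose score reaches a chosen threshold and then combine the q highest-scoring cuts into a grid.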

4.5 Expectation-Maximization (EM)

The EM algorithm [8] is designed to estimate the maximum likelihood
parameters of a statistical model in many situations, such as the one
where the equations cannot be solved directly. The EM algorithm
iteratively approximates the unknown model parameters with two steps:
the E step and the M step. In the E step (expectation), the current
model parameter values are used to evaluate the posterior distribution
of the latent variables. Then the objects are fractionally assigned to
each cluster based on this posterior distribution. In the M step
(maximization), the fractional assignment is used to re-estimate the
model parameters with the maximum likelihood rule. The EM algorithm is
guaranteed to find a local maximum for the model parameters estimate.
The major disadvantages of the EM algorithm are: the requirement of a
non-singular covariance matrix, the sensitivity to the selection of
initial parameters, the possibility of convergence to a local optimum,
and the slow convergence rate. Moreover, the precision of the EM
algorithm may decrease when it is limited to a finite number of
steps [28].
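The two steps described above can be sketched for the simplest case, a one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names, the variance floor `1e-6` (a guard against the singular-covariance problem mentioned above) and the log-likelihood stopping rule are all assumptions introduced for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, tol=1e-6, max_iter=200):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights, loglik)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E step: posterior responsibility of each component for each point.
        resp = []
        ll = 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # M step: re-estimate parameters from the fractional assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate component
            weights[j] = nj / len(data)
        if abs(ll - prev_ll) < tol:  # accepted error for convergence
            break
        prev_ll = ll
    return means, variances, weights, ll
```

Different starting values for `means` can lead to different local optima, which is the sensitivity to initial parameters noted in the text.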
The details of the EM algorithm are given below.

EM pseudo-code:
Input: The dataset (x), the total number of clusters (M), the
accepted error for convergence (e) and the maximum number
of iterations
E-step: Compute the expectation of the complete data log-likelihood.

$$ Q(\theta, \theta^{T}) = E\left[ \log p(x^{g}, x^{m} \mid \theta) \,\middle|\, x^{g}, \theta^{T} \right] \qquad (6) $$

M-step: Select a new parameter estimate that maximizes the
Q-function,

$$ \theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^{T}) \qquad (7) $$

Iteration: increment t = t + 1; repeat the E-step and the M-step
until the convergence condition is satisfied.
Output: A series of parameter estimates {θ^0, θ^1, ..., θ^T}, which
represents the achievement of the convergence criterion.

5 EXPERIMENTAL EVALUATION ON REAL DATA

In some cases, it is not sufficient to decide the most suitable
clustering algorithm for big data based only on the theoretical point
of view. Thus, the main focus of this section is to investigate the
behaviour of the algorithms selected in Section 4 from an empirical
perspective.
In what follows, we describe the traffic datasets used for this
experimental study in Section 5.1. Section 5.2 provides details of the
experimental set up and Section 5.3 presents a complete survey of the
performance metrics used to experimentally investigate the relative
strengths and weaknesses of each algorithm. Finally, the collected
results and a comprehensive analysis are given in Section 5.4.

5.1 Data Set
To compare the advantages of the candidate clustering algorithms,
eight simulated datasets are used in the experiments, including:
Multi-hop Outdoor Real Data (MHORD) [36], Multi-hop Indoor Real Data
(MHIRD) [36], Single-hop Outdoor Real Data (SHORD) [36], Single-hop
Indoor Real Data (SHIRD) [36], a simulated spoofing attack for a SCADA
system (denoted as SPFDS) [3] [4], a simulated denial of service (DOS)
attack for a SCADA system (denoted as DOSDS) [3] [4], simulated
spoofing and denial of service attacks combined for a SCADA system
(denoted as SPDOS) [3] [4], and the operational state of a water
treatment plant (WTP). We also experimented with two other publicly
available datasets, namely DARPA [35] and internet traffic data
(ITD) [29]. These two datasets have become a benchmark for many
studies since the work of Andrew et al. [30]. Table 2 summarizes the
number of instances, the number of attributes and the number of
classes for each dataset. This paper does not include detailed
descriptions of the datasets due to space restrictions. Thus, we
recommend that readers consult the original references [10], [11],
[36], [3], [4] for more complete details about the characteristics of
the datasets.

TABLE 2
Data Sets Used in the Experiments

dataset   # instances   # attributes   # classes
MHIRD     699           10             2
MHORD     2,500         3              2
SPFDS     100,500       15             2
DOSDS     400,350       15             2
SPDOS     290,007       15             3
SHIRD     1,800         4              2
SHORD     400           4              2
ITD       377,526       149            12
WTP       512           39             2
DARPA     1,000,000     42             5

5.2 Experimental set up
Algorithm 1 shows the experimental procedures used to evaluate the
five candidate clustering algorithms. In particular, a cross
validation strategy is used to make the best use of the traffic data
and to obtain accurate and stable results. For each dataset, all
instances are randomised and divided into two subsets, a training set
and a testing set. We then evaluate the performance of each clustering
algorithm by building a model using the training set and using the
testing set to evaluate the constructed model. To assure that the five
candidate clustering algorithms are not exhibiting an order effect,
the result of each clustering is averaged over 10 runs on each
dataset. The five candidate clustering algorithms studied here employ
different parameters. However, the experimental evaluation does not
correspond to an exhaustive search for the best parameter settings for
each algorithm. Given the datasets at hand, the main objective is to
use a default configuration to set the parameters of the clustering
algorithms. In general, finding an optimal number of clusters is an
ill-posed problem of crucial relevance in clustering analysis [20].
Thus, we have chosen the number of clusters with respect to the number
of unique labels in each dataset. However, the true number of clusters
may not be the optimal number for which a particular clustering
algorithm will determine, to its best potential, the structure in the
data.
Following the procedure and the pseudo-code of each clustering
algorithm discussed in Section 4, the candidate clustering algorithms
were implemented in Matlab 2013a. Our experiments were performed on a
64-bit Windows-based system with an Intel core (i7), 2.80 GHz
processor machine with 8 Gbytes of RAM.
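The randomise-and-split loop described above (and listed in Algorithm 1) can be sketched as follows. This is an illustrative sketch only: the experiments were implemented in Matlab, and the names `make_folds` and `repeated_evaluation`, the `evaluate` callback and the fixed seed are assumptions introduced here. Each repetition shuffles the instances, cuts them into N bins, and uses each bin once as the test set.

```python
import random


def make_folds(instances, n_bins, rng):
    """Shuffle the instances, split them into n_bins bins, and yield
    (train, test) pairs in which each bin serves once as the test set."""
    shuffled = list(instances)
    rng.shuffle(shuffled)
    bins = [shuffled[i::n_bins] for i in range(n_bins)]
    for fold in range(n_bins):
        test = bins[fold]
        train = [x for j, b in enumerate(bins) if j != fold for x in b]
        yield train, test


def repeated_evaluation(instances, evaluate, n_bins=10, repeats=3, seed=0):
    """Average evaluate(train, test) over all folds of all repetitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for train, test in make_folds(instances, n_bins, rng):
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

Here `evaluate` stands in for "build the clustering model on the training data, assign the test data, and compute a validity score"; averaging over repetitions is what reduces the order effect mentioned above.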
5.3 Validity Metrics
In response to the growing necessity for an objective method of
comparing clustering algorithms, we present a complete survey of
performance metrics that covers all the properties and issues related
to the experimental study of clustering. In particular, this survey of
performance metrics will allow researchers to compare different
algorithms in an objective way and to characterize their advantages
and drawbacks in order to choose a clustering algorithm from an
empirical point of view.
The survey covers three measurements: validity evaluation, stability
of the results and runtime performance.


Algorithm 1: Experimental Procedure
Input:
  Parameter N := 10; M := 100;
  Clustering Algorithms Cls := {cl1, cl2, ..., clm};
  DATA = {D1, D2, ..., Dn};
Output:
  Validity & Stability;
foreach Clustering_i ∈ [1, Cls] do
  foreach D_i ∈ DATA do
    foreach times ∈ [1, M] do
      randomise instance-order for D_i;
      generate N bins from the randomised D_i;
      foreach fold ∈ [1, N] do
        TestData = bin[fold];
        TrainData = data − TestData;
        Train'Data = select Subset from TrainData;
        Test'Data = select Subset from TestData;
        ClsASGN = TestModel(Test'Data);
        Validity = CompuValidaty(ClsASGN, Testlbs);
        Assignment_i^cls = assignment_i^cls ∪ ClsASGN;
    Stability = ComputeStability(Assignment_i^cls);

1) Validity evaluation. Unsupervised learning techniques require
different evaluation criteria than supervised learning techniques. In
this section, we briefly summarize the criteria used for performance
evaluation according to internal and external validation indices. The
former evaluate the goodness of a data partition using quantities and
features inherent to the datasets; they include Compactness (CP) and
the Dunn Validity Index (DVI). The latter are similar to the process
of cross-validation that is used in evaluating supervised learning
techniques. Such evaluation criteria include Cluster Accuracy (CA),
Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).
Given a dataset whose class labels are known, it is possible to assess
how accurately a clustering technique partitions the data relative to
their correct class labels. Note that some clustering algorithms do
not have centroids, and therefore the internal indices are not
directly applicable to such algorithms (e.g. OptiGrid and DENCLUE). To
address this issue, we obtain the centroid of a cluster by using the
measure in [27], [26] and the Euclidean distance metric.
The following notation is used: X is the dataset formed by x_i flows;
Ω is the set of flows that have been grouped in a cluster; and W is
the set of w_j centroids of the clusters in Ω. We refer to each of the
k elements of the clustering method as a node.

Compactness (CP). It is one of the measurements commonly used to
validate clusters by employing only the information inherent to the
dataset. Thus, a good clustering will create clusters with instances
that are similar or closest to one another. More precisely, CP
measures the average distance between the data points of a cluster and
its centroid as follows:

$$ CP_i = \frac{1}{|\Omega_i|} \sum_{x_i \in \Omega_i} \| x_i - w_i \| \qquad (8) $$

where Ω is the set of instances (x_i) that have been grouped into a
cluster and W is the set of w_i centroids of clusters in Ω. As a
global measure of compactness, the average of all clusters is
calculated as follows:

$$ CP = \frac{1}{K} \sum_{k=1}^{K} CP_k , $$

where K denotes the number of clusters in the clustering result.
Ideally, the members of each cluster should be as close to each other
as possible. Therefore, a lower value of CP indicates better and more
compact clusters.

Separation (SP). This measure quantifies the degree of separation
between individual clusters. It measures the mean Euclidean distance
between cluster centroids as follows:

$$ SP = \frac{2}{k^{2} - k} \sum_{i=1}^{k} \sum_{j=i+1}^{k} \| w_i - w_j \|_2 \qquad (9) $$

where an SP close to 0 is an indication of closer clusters.

Davies-Bouldin Index (DB). This index can identify cluster overlap by
measuring the ratio of the sum of within-cluster scatters to
between-cluster separations. It is defined as:

$$ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{C_i + C_j}{\| w_i - w_j \|_2} \right) \qquad (10) $$

where C_i denotes the within-cluster scatter of cluster i, and a DB
close to 0 indicates that the clusters are compact and far from each
other.

Dunn Validity Index (DVI). The DVI index quantifies not only the
degree of compactness of clusters but also the degree of separation
between individual clusters. DVI measures intercluster distances
(separation) over intracluster distances (compactness). For a given
number of clusters K, the definition of such an index is given by the
following equation:

$$ DVI = \frac{\min_{0 < m \neq n < K} \left\{ \min_{x_i \in \Omega_m,\; x_j \in \Omega_n} \| x_i - x_j \| \right\}}{\max_{0 < m \leq K} \; \max_{x_i, x_j \in \Omega_m} \| x_i - x_j \|} \qquad (11) $$

If a dataset contains compact and well-separated clusters, the
distances between the clusters are usually large and their diameters
are expected to be small. Thus, a larger DVI value indicates compact
and well-separated clusters.
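The four internal indices defined above can be computed together from a list of clusters. The sketch below is illustrative and makes two assumptions that the text does not fix: the centroid of a cluster is taken as the coordinate-wise mean of its points, and the within-cluster scatter C_i in the DB index is taken to be CP_i. The function names are hypothetical, and every cluster is assumed to contain at least two points.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]


def internal_indices(clusters):
    """clusters: list of at least two clusters, each a list of >= 2 points."""
    k = len(clusters)
    w = [centroid(c) for c in clusters]
    # CP_i: mean distance of the members of cluster i to its centroid.
    cp_i = [sum(dist(x, w[i]) for x in c) / len(c)
            for i, c in enumerate(clusters)]
    cp = sum(cp_i) / k
    # SP: mean pairwise distance between centroids.
    sp = 2.0 / (k * k - k) * sum(dist(w[i], w[j])
                                 for i, j in combinations(range(k), 2))
    # DB: worst-case scatter-to-separation ratio, averaged over clusters.
    db = sum(max((cp_i[i] + cp_i[j]) / dist(w[i], w[j])
                 for j in range(k) if j != i)
             for i in range(k)) / k
    # DVI: smallest between-cluster distance over largest cluster diameter.
    min_between = min(dist(x, y)
                      for a, b in combinations(clusters, 2)
                      for x in a for y in b)
    max_diameter = max(dist(x, y)
                       for c in clusters for x, y in combinations(c, 2))
    return {"CP": cp, "SP": sp, "DB": db, "DVI": min_between / max_diameter}
```

The DVI computation compares all pairs of points and is therefore quadratic in the number of instances, which matters for datasets of the size considered here.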
Cluster Accuracy (CA). CA measures the percentage of correctly
classified data points in the clustering solution compared to
pre-defined class labels. The CA is defined as:

$$ CA = \sum_{i=1}^{K} \frac{\max(C_i \mid L_i)}{|\Omega|} \qquad (12) $$

where C_i is the set of instances in the ith cluster; L_i is the class
labels for all instances in the ith cluster, and max(C_i | L_i) is the
number of instances with the majority label in the ith cluster (e.g.
if label l appears in the ith cluster more often than any other label,
then max(C_i | L_i) is the number of instances in C_i with the
label l).
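As a small illustration of Eq. (12), the majority-label count can be accumulated per cluster and divided by the total number of instances. The function name `cluster_accuracy` is hypothetical, and the result is returned as a fraction; multiplying by 100 gives a percentage.

```python
from collections import Counter


def cluster_accuracy(cluster_ids, labels):
    """Fraction of instances that carry the majority label of their cluster."""
    by_cluster = {}
    for c, l in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[l] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)
```

For example, with clusters `[0, 0, 0, 1, 1, 1]` and labels `['a', 'a', 'b', 'b', 'b', 'b']`, cluster 0 contributes 2 instances and cluster 1 contributes 3, giving 5/6.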
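Adjusted Rand index (ARI). ARI takes into account the number of
instances that exist in the same cluster and different clusters. The
expected value of such a validation measure is not zero when comparing
partitions. For two partitions A and B of the same n instances (here,
the clustering output and the class labels), it is computed as:

$$ ARI = \frac{n_{11} + n_{00}}{n_{00} + n_{01} + n_{10} + n_{11}} = \frac{n_{11} + n_{00}}{\binom{n}{2}} \qquad (13) $$

where:
n11: Number of pairs of instances that are in the same cluster in
both A and B.
n00: Number of pairs of instances that are in different clusters in
both A and B.
n10: Number of pairs of instances that are in the same cluster in A,
but in different clusters in B.
n01: Number of pairs of instances that are in different clusters in A,
but in the same cluster in B.
Note that the pair-counting ratio in Eq. (13) has the form of the
plain Rand index (RI); the adjusted index additionally corrects this
ratio for agreement expected by chance. The value of the ratio in
Eq. (13) lies between 0 and 1, and a higher value indicates that more
data instances are clustered correctly; a value of 1 indicates that
all data instances are clustered correctly and each cluster contains
only pure instances.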
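The pair counts n11, n00, n10 and n01 can be obtained by enumerating all pairs of instances. The sketch below is an illustration, with hypothetical function names, that computes the pair-counting ratio of Eq. (13) and, for comparison, a chance-corrected index written in terms of the same four counts. The chance-corrected expression is the standard pair-counting form of the adjusted Rand index and is an addition made here for illustration, not a formula taken from this paper.

```python
from itertools import combinations


def pair_counts(a, b):
    """Count instance pairs by co-membership in partitions a and b."""
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n00, n10, n01


def rand_index(a, b):
    """Pair-counting ratio of Eq. (13)."""
    n11, n00, n10, n01 = pair_counts(a, b)
    return (n11 + n00) / (n11 + n00 + n10 + n01)


def adjusted_rand_index(a, b):
    """Chance-corrected index expressed with the same pair counts."""
    n11, n00, n10, n01 = pair_counts(a, b)
    denom = (n11 + n10) * (n10 + n00) + (n11 + n01) * (n01 + n00)
    if denom == 0:
        return 1.0  # degenerate case: the two partitions are identical
    return 2.0 * (n11 * n00 - n10 * n01) / denom
```

The two functions agree for identical partitions but differ otherwise: for `[0, 0, 1, 1]` against `[0, 1, 0, 1]` the ratio of Eq. (13) is 1/3, while the chance-corrected value is negative.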
Normalized Mutual Information (NMI). This is one of the common
external clustering validation metrics that estimate the quality of
the clustering with respect to a given class labeling of the data.
More formally, NMI can effectively measure the amount of statistical
information shared by random variables representing the cluster
assignments and the pre-defined label assignments of the instances.
Thus, NMI is calculated as follows:

$$ NMI = \frac{\sum_{h,l} d_{h,l} \log\left( \frac{|\Omega| \cdot d_{h,l}}{d_h c_l} \right)}{\sqrt{\left( \sum_{h} d_h \log \frac{d_h}{d} \right) \left( \sum_{l} c_l \log \frac{c_l}{d} \right)}} \qquad (14) $$

where d_h is the number of flows in class h, c_l is the number of
flows in cluster l, d_{h,l} is the number of flows in class h as well
as in cluster l, and d denotes the total number of flows. The NMI
value is 1 when the clustering solution perfectly matches the
pre-defined label assignments and close to 0 for a low matching.
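Eq. (14) translates almost directly into code once the counts d_h, c_l and d_{h,l} are tabulated. The following sketch is illustrative; the function name `nmi` is hypothetical, and returning 0 when either partition has a single group (so that the denominator vanishes) is a convention assumed here.

```python
import math
from collections import Counter


def nmi(classes, clusters):
    """Normalized mutual information between class labels and cluster ids."""
    d = len(classes)
    d_h = Counter(classes)                 # flows per class
    c_l = Counter(clusters)                # flows per cluster
    d_hl = Counter(zip(classes, clusters))  # flows per (class, cluster) pair
    num = sum(n * math.log(d * n / (d_h[h] * c_l[l]))
              for (h, l), n in d_hl.items())
    den = math.sqrt(sum(n * math.log(n / d) for n in d_h.values())
                    * sum(n * math.log(n / d) for n in c_l.values()))
    return num / den if den > 0 else 0.0
```

Because only co-occurrence counts are used, the score does not depend on how the cluster identifiers are named.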
2) Stability of the results. Since most clustering algorithms rely on
a random component, stability of the results across different runs is
considered to be an asset for an algorithm. Our experimental study
examined the stability of the candidate clustering algorithms. In
doing so, we consider a pairwise approach to measuring the stability
of the candidate clusterers. In particular, the match between each of
the n(n − 1)/2 pairs of runs of a single clustering algorithm is
calculated and the stability index is obtained as the averaged degree
of match across different runs. Let S_r(R_i, R_j) be the degree of
match between runs R_i and R_j. The cluster pairwise stability index
S_k is:

$$ S_k = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} S_r(R_i, R_j) \qquad (15) $$

where:

$$ S_r(R_i, R_j) = \begin{cases} 1 & \text{if } R_i(x_i) = R_j(x_j) \\ 0 & \text{otherwise} \end{cases} \qquad (16) $$

Clearly, S_k is the average stability measure over all pairs of
clusterings across different runs. It takes values in [0, 1], with 0
indicating that the results between all pairs of R_i, R_j are totally
different and 1 indicating that the results of all pairs across
different runs are identical.
3) Time requirements. A key motivation for selecting
the candidate clustering algorithms is to deal with
big data. Therefore, if a clustering algorithm takes
too long, it will be impractical for big data.
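The pairwise stability index of Eqs. (15) and (16) can be sketched as follows. The code reads Eq. (16) as an all-or-nothing match between two runs and treats two runs as matching when they induce the same partition regardless of how the clusters are numbered; both readings are assumptions made for this illustration, and the function names are hypothetical.

```python
from itertools import combinations


def canonical(labels):
    """Relabel clusters by order of first appearance, so that two runs
    producing the same partition compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(l, len(mapping)) for l in labels)


def pairwise_stability(runs):
    """Share of the n(n-1)/2 pairs of runs that produce the same partition."""
    n = len(runs)
    canon = [canonical(r) for r in runs]
    matches = sum(1 for a, b in combinations(canon, 2) if a == b)
    return 2.0 * matches / (n * (n - 1))
```

A graded degree of match (for example, one of the external indices above applied to the two runs) could be substituted for the equality test without changing the averaging in Eq. (15).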
5.4 Experimental Results and Comparison
First of all, this section presents a comparison of the clustering
outputs with respect to both the external and internal validity
measures. After that, the candidate clustering algorithms are analyzed
from the perspective of stability, run-time performance and
scalability.
5.5 Evaluating Validity
The aim of this test is to determine how accurately a clustering
algorithm can group traffic records from two different populations.
Assessing the validity of clustering algorithms based on a single
measure only can lead to misleading conclusions. Thus, we have
conducted four types of external tests: Cluster Accuracy (CA),
Adjusted Rand index (ARI), Rand index (RI) and Normalized Mutual
Information (NMI). Such measurements allow us to exploit prior
knowledge of known data partitions and cluster labels of the data.
Note that the class labels of instances (e.g. attack/normal) are used
for evaluation purposes only and are not used in the cluster formation
process.
Table 3 shows the results of the candidate clustering algorithms with
respect to the external validity measures. It can be seen from Table 3
that the EM algorithm provides the best clustering output on most
datasets for all external measures in comparison to the remaining
clustering algorithms. The second best clustering algorithm in terms
of external validity is the FCM algorithm. The analysis reveals that
BIRCH, OptiGrid and DENCLUE respectively yield the lowest quality of
clustering output in comparison to EM and FCM.
Table 4 reports the results of the clustering algorithms according to
the internal validity measures. This is very important, especially
when there is no prior knowledge of the correct class labels of the
datasets. Each of the validation measures evaluates different aspects
of a clustering output separately, and based just on the raw data.
None of them uses explicit information from the domain of application
to evaluate the obtained cluster. Note that the best value for each
measure for each of the clustering algorithms is shown in bold. There
are several observations from Table 4. First, it can be seen that the
DENCLUE algorithm often produces compact clusters in comparison to
other clustering algorithms. Averaged over the ten datasets, the
compactness of DENCLUE is only 37.26 percent of that of BIRCH, 47.27
percent of that of EM, 47.48 percent of that of FCM and 75.74 percent
of that of OptiGrid. Second, for the separation measure, we observe
that the EM algorithm often yields clusters with higher mean
separations among the considered clustering algorithms. Averaged over
the ten datasets, the separation of DENCLUE is 42.27 percent of that
of BIRCH, 50.52 percent of that of FCM, 52.98 percent of that of EM
and 80.60 percent of that of OptiGrid. Third, according to the
Davies-Bouldin index (DB), it can be seen that EM, DENCLUE and
OptiGrid respectively were often able to produce not only compact
clusters, but also well-separated clusters.

TABLE 3
External validity results for the candidate clustering algorithms

Measures  Cls. Algorithms  MHIRD   MHORD   SPFDS   DOSDS   SPDOS   SHIRD   SHORD   ITD     WTP     DARPA
CA        DENCLUE          67.904  69.729  68.042  61.864  66.731  63.149  71.265  51.909  64.350  70.460
CA        OptiGrid         71.914  71.105  72.045  72.191  70.632  72.234  37.216  40.953  51.953  62.215
CA        FCM              75.387  73.271  74.682  74.222  72.873  74.723  75.553  59.974  66.435  73.114
CA        EM               82.512  81.940  82.786  82.919  82.114  79.450  82.023  65.035  72.085  80.685
CA        BIRCH            71.310  69.763  69.553  69.930  69.716  68.351  70.365  24.510  59.343  77.343
ARI       DENCLUE          57.772  45.248  41.535  39.822  44.510  35.081  46.267  35.663  47.665  60.665
ARI       OptiGrid         35.894  30.297  32.140  29.402  29.970  32.598  29.956  55.824  32.137  52.137
ARI       FCM              58.439  53.418  61.489  64.181  57.038  58.776  59.168  38.567  49.926  58.534
ARI       EM               70.047  69.481  73.914  70.655  79.205  67.864  66.731  44.403  55.343  65.725
ARI       BIRCH            52.424  44.011  52.470  41.662  39.627  40.377  56.462  19.260  51.260  61.483
RI        DENCLUE          74.988  73.527  71.217  68.384  70.043  69.115  75.024  44.164  57.460  75.477
RI        OptiGrid         76.909  75.404  75.963  75.448  75.631  76.550  75.359  49.252  59.201  66.201
RI        FCM              64.876  81.210  75.118  77.645  74.855  66.113  88.302  53.160  62.694  78.981
RI        EM               87.873  83.664  84.858  65.113  88.302  81.210  84.499  68.081  74.808  84.395
RI        BIRCH            73.099  65.823  77.521  71.422  73.069  70.589  74.156  33.184  62.357  79.890
MI        DENCLUE          59.853  48.916  39.949  49.533  46.986  37.158  47.439  36.561  49.762  65.762
MI        OptiGrid         34.966  38.308  36.906  39.429  37.328  34.029  47.197  54.081  33.411  53.411
MI        FCM              64.256  65.680  76.428  69.129  69.708  72.129  73.242  39.242  50.589  59.257
MI        EM               74.925  85.077  82.405  86.374  85.550  81.742  85.572  64.029  58.871  67.142
MI        BIRCH            58.450  58.780  56.230  57.930  57.376  55.750  57.979  25.980  52.764  64.994

TABLE 4
Internal validity results for the candidate clustering algorithms

Measures  Cls. Algorithms  MHIRD   MHORD   SPFDS   DOSDS   SPDOS   SHIRD   SHORD   ITD     WTP     DARPA
CP        DENCLUE          1.986   1.207   1.886   1.104   1.300   1.391   1.357   1.014   1.485   0.832
CP        OptiGrid         1.629   1.678   1.643   1.232   1.271   2.505   2.330   2.454   1.189   1.973
CP        FCM              3.243   1.523   3.014   2.540   2.961   3.504   2.548   3.945   2.555   2.727
CP        EM               3.849   2.163   4.683   2.405   2.255   4.354   3.198   1.537   2.874   1.367
CP        BIRCH            3.186   3.466   3.310   5.164   1.692   2.793   5.292   5.529   1.834   4.131
SP        DENCLUE          2.973   1.450   2.776   1.247   1.632   1.810   1.742   1.073   1.993   0.716
SP        OptiGrid         1.914   1.990   1.936   1.311   1.370   3.247   2.981   3.170   1.245   2.437
SP        FCM              3.972   1.636   3.660   3.017   3.588   4.326   3.028   4.926   3.038   3.271
SP        EM               4.535   2.389   5.597   2.696   2.505   5.178   3.706   1.592   3.294   1.375
SP        BIRCH            3.566   3.907   3.717   5.979   1.742   3.086   6.136   6.425   1.916   4.719
DB        DENCLUE          1.788   2.467   4.273   3.524   4.551   0.821   4.990   6.870   7.702   5.987
DB        OptiGrid         3.798   5.582   3.085   1.703   2.655   5.573   2.128   4.673   5.078   8.502
DB        FCM              3.972   2.315   4.036   4.104   3.586   6.964   5.760   10.824  10.239  9.238
DB        EM               8.164   2.065   4.672   4.989   3.198   2.645   4.776   10.882  10.013  2.320
DB        BIRCH            4.943   6.471   3.234   2.512   2.600   5.002   5.272   11.882  9.336   6.641
DVI       DENCLUE          0.343   0.508   0.354   0.560   0.472   0.444   0.454   0.620   0.420   0.837
DVI       OptiGrid         0.526   0.518   0.524   0.615   0.603   0.446   0.457   0.449   0.630   0.484
DVI       FCM              0.491   0.606   0.498   0.517   0.500   0.485   0.516   0.476   0.516   0.509
DVI       EM               0.524   0.580   0.512   0.567   0.575   0.516   0.538   0.640   0.548   0.669
DVI       BIRCH            0.568   0.562   0.565   0.539   0.646   0.580   0.537   0.536   0.632   0.550
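Which algorithm each compactness percentage quoted above refers to can be checked by recomputing the ratios from the CP rows of Table 4. The sketch below is a worked check, not part of the original study: it assumes that the percentages are ratios of per-algorithm means over the ten datasets, and the names `cp`, `mean` and `percent_of_denclue` are hypothetical.

```python
# CP values per algorithm, in the dataset order of Table 4
# (MHIRD, MHORD, SPFDS, DOSDS, SPDOS, SHIRD, SHORD, ITD, WTP, DARPA).
cp = {
    "DENCLUE":  [1.986, 1.207, 1.886, 1.104, 1.300, 1.391, 1.357, 1.014, 1.485, 0.832],
    "OptiGrid": [1.629, 1.678, 1.643, 1.232, 1.271, 2.505, 2.330, 2.454, 1.189, 1.973],
    "FCM":      [3.243, 1.523, 3.014, 2.540, 2.961, 3.504, 2.548, 3.945, 2.555, 2.727],
    "EM":       [3.849, 2.163, 4.683, 2.405, 2.255, 4.354, 3.198, 1.537, 2.874, 1.367],
    "BIRCH":    [3.186, 3.466, 3.310, 5.164, 1.692, 2.793, 5.292, 5.529, 1.834, 4.131],
}


def mean(values):
    return sum(values) / len(values)


def percent_of_denclue(other):
    """Mean CP of DENCLUE as a percentage of the mean CP of another algorithm."""
    return 100.0 * mean(cp["DENCLUE"]) / mean(cp[other])


for name in ("BIRCH", "EM", "FCM", "OptiGrid"):
    print(f"DENCLUE CP is {percent_of_denclue(name):.2f}% of {name}")
```

Under this assumption the smallest ratio is obtained against BIRCH and the largest against OptiGrid, with EM and FCM in between, and the values agree with the quoted figures to within rounding.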


5.6 Evaluating Stability
The main focus of this section is to compare the stability of the
output of the candidate clustering algorithms over 10-fold runs on all
datasets. The stability measures the variation in the outputs of a
particular clustering algorithm rather than a general property of the
dataset. Thus higher values indicate lower output changes and are
always preferable. For comparison, Table 5 displays the stability
results obtained for each clustering algorithm on all datasets. Note
that the sequence roughly orders the candidate clustering algorithms
according to growing stability values. Let us point out some of the
most notable phenomena that can be observed regarding the presented
stability results. First, the overall stability level only rarely
approaches 0.599, indicating that clustering algorithms often suffer
from stability issues and frequently fail to produce stable output.
Second, it can be seen that in most cases the EM algorithm achieves
the highest stability value on all datasets except for the ITD, WTP
and DARPA datasets. Third, it can be seen that the OptiGrid and
DENCLUE algorithms often yield the highest stability values for the
ITD, WTP and DARPA datasets among all considered clustering
algorithms. This confirms their suitability for dealing with
high-dimensional datasets. Finally, Table 5 shows that FCM scores the
lowest stability values on datasets with high problem dimensionality.
Future work would investigate the stability of clustering algorithms
using different parameter settings.
TABLE 5
Stability of the candidate clustering algorithms

            Clustering Algorithms
Data sets   EM      OptiGrid  BIRCH   FCM     DENCLUE
MHIRD       0.495   0.532     0.567   0.596   0.415
MHORD       0.408   0.528     0.537   0.589   0.487
SPFDS       0.478   0.518     0.544   0.599   0.451
DOSDS       0.481   0.593     0.561   0.608   0.467
SPDOS       0.479   0.531     0.556   0.591   0.441
SHIRD       0.476   0.513     0.504   0.559   0.492
SHORD       0.486   0.532     0.562   0.519   0.492
ITD         0.473   0.215     0.372   0.272   0.292
WTP         0.436   0.357     0.307   0.278   0.311
DARPA       0.459   0.481     0.397   0.284   0.359

5.7 Evaluating Runtime & Scalability

A key motivation for this section is to evaluate the ability of the
candidate clustering algorithms to group similar objects efficiently
(i.e. with faster runtimes). This is particularly important when the
collected data is very large. In order to compare the efficiency of
the candidate clustering algorithms, we applied each clustering
algorithm to the ten datasets. We then measured the execution time
required by each algorithm on an Intel core i7 2.80 GHz processor
machine with 8 Gbytes of RAM. Table 6 records the runtime of the five
candidate clustering algorithms. First, we observe that DENCLUE is
significantly faster than all other clustering algorithms. On average,
the runtime of DENCLUE is 0.1 percent of that of EM, 2.89 percent of
that of FCM, 3.71 percent of that of BIRCH and 28.19 percent of that
of OptiGrid. This indicates that DENCLUE is more efficient than the
others when choosing a clustering algorithm to deal with big data.
Second, the EM algorithm had the slowest runtime of all, and was
slower than FCM, BIRCH, OptiGrid and DENCLUE by factors of 20.40,
26.12, 198.70 and 704.94, respectively. This indicates that EM is less
efficient, and thus it is not recommended for big data.
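The relative runtimes quoted above follow from the average row of Table 6. The short sketch below recomputes them; it is a worked check rather than part of the original study, the names `avg_runtime`, `percent_of` and `slowdown` are hypothetical, and the averages are copied from Table 6.

```python
# Average runtime per algorithm, taken from the "Average" row of Table 6.
avg_runtime = {
    "DENCLUE": 3.610,
    "OptiGrid": 12.806,
    "BIRCH": 97.416,
    "FCM": 124.701,
    "EM": 2544.647,
}


def percent_of(a, b):
    """Average runtime of algorithm a as a percentage of that of b."""
    return 100.0 * avg_runtime[a] / avg_runtime[b]


def slowdown(a, b):
    """How many times slower algorithm a is than algorithm b, on average."""
    return avg_runtime[a] / avg_runtime[b]


for other in ("EM", "FCM", "BIRCH", "OptiGrid"):
    print(f"DENCLUE takes {percent_of('DENCLUE', other):.2f}% of the {other} runtime")
for other in ("FCM", "BIRCH", "OptiGrid", "DENCLUE"):
    print(f"EM is {slowdown('EM', other):.2f} times slower than {other}")
```

The second loop shows that the EM figures are ratios of average runtimes (multiplicative factors) rather than percentage differences.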

TABLE 6
Runtime of the candidate clustering algorithms

            Clustering Algorithms
Data sets   DENCLUE   OptiGrid   BIRCH     FCM       EM
MHIRD       0.336     0.081      1.103     0.109     3.676
MHORD       0.290     0.290      2.253     7.511     60.689
SPFDS       2.5095    8.365      67.401    139.03    830.55
DOSDS       1.73229   5.7743     86.031    126.471   581.59
SPDOS       6.5178    32.6625    208.875   226.55    1543.4
SHIRD       0.011     0.038      0.811     0.603     3.140
SHORD       0.017     0.058      0.780     0.824     4.929
ITD         7.107     23.689     241.074   262.353   1982.790
WTP         0.230     0.388      1.246     1.768     6.429
DARPA       17.347    56.716     364.592   401.795   20429.281
Average     3.610     12.806     97.416    124.701   2544.647

CONCLUSION

This survey provided a comprehensive study of the clustering
algorithms proposed in the literature. In order to reveal future
directions for developing new algorithms and to guide the selection of
algorithms for big data, we proposed a categorizing framework to
classify a number of clustering algorithms. The categorizing framework
is developed from a theoretical viewpoint that would automatically
recommend the most suitable algorithm(s) to network experts while
hiding all technical details irrelevant to an application. Thus, even
future clustering algorithms could be incorporated into the framework
according to the proposed criteria and properties. Furthermore, the
most representative clustering algorithms of each category have been
empirically analyzed over a vast number of evaluation metrics and
traffic datasets. In order to support the conclusion drawn, we have
added Table 7, which provides a summary of the evaluation. In general,
the empirical study allows us to draw the following conclusions for
big data:
• No clustering algorithm performs well for all the evaluation
  criteria, and future work should be dedicated to addressing the
  drawbacks of each clustering algorithm for handling big data.
• EM and FCM clustering algorithms show excellent performance with
  respect to the quality of the clustering outputs, except for
  high-dimensional data. However, these algorithms suffer from high
  computational time requirements. Hence, a possible solution is to
  rely on advances in programming languages and hardware technology,
  which may allow such algorithms to be executed more efficiently.
• All clustering algorithms suffer from stability problems. To
  mitigate this issue, ensemble clustering should be considered.
• DENCLUE, OptiGrid and BIRCH are suitable clustering algorithms for
  dealing with large datasets, especially DENCLUE and OptiGrid, which
  can also deal with high dimensional data.
As future work, we would investigate the following questions:
• Are ensembles of a single clustering algorithm more stable and
  accurate than an individual clustering?
• Are ensembles of multiple clustering algorithms more stable and
  accurate than an ensemble of a single clustering algorithm?
• How can the concept of distributed systems be incorporated to
  improve the performance and efficiency of existing algorithms on
  big data?
• How can the most suitable parameter settings be found for each
  clustering algorithm?


TABLE 7
Compliance summary of the clustering algorithms based on empirical evaluation metrics

Cls. Algorithms   External Validity   Internal Validity
EM                Yes                 Partially
FCM               Yes                 Partially
DENCLUE           No                  Yes
OptiGrid          No                  Yes
BIRCH             No                  Suffer from

REFERENCES
[1] A. A. Abbasi and M. Younis. A survey on clustering algorithms for wireless sensor networks. Computer communications, 30(14):2826-2841, 2007.
[2] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data, pp. 77-128. Springer, 2012.
[3] A. Almalawi, Z. Tari, A. Fahad, and I. Khalil. A framework for improving the accuracy of unsupervised intrusion detection for SCADA systems. Proc. of the 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 292-301, 2013.
[4] A. Almalawi, Z. Tari, I. Khalil, and A. Fahad. Scadavt-a framework for scada security testbed based on virtualization technology. Proc. of the 38th IEEE Conference on Local Computer Networks (LCN), pp. 639-646. IEEE, 2013.
[5] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2):49-60, 1999.
[6] J. C. Bezdek, R. Ehrlich, and W. Full. Fcm: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2):191-203, 1984.
[7] J. Brank, M. Grobelnik, and D. Mladenic. A survey of ontology evaluation techniques. Proc. of the Conference on Data Mining and Data Warehouses (SiKDD), 2005.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pp. 1-38, 1977.
[9] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 226-231, 1996.
[10] A. Fahad, Z. Tari, A. Almalawi, A. Goscinski, I. Khalil, and A. Mahmood. PPFSCADA: Privacy preserving framework for scada data publishing. Future Generation Computer Systems (FGCS), 2014.
[11] A. Fahad, Z. Tari, I. Khalil, I. Habib, and H. Alnuweiri. Toward an efficient and scalable feature selection approach for internet traffic classification. Computer Networks, 57(9):2040-2057, 2013.
[12] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2):139-172, 1987.
[13] J. H. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation. Artificial Intelligence, 40(1):11-61, 1989.
[14] S. Guha, R. Rastogi, and K. Shim. Cure: an efficient clustering algorithm for large databases. ACM SIGMOD Record, volume 27, pp. 73-84. ACM, 1998.
[15] S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345-366, 2000.
[16] J. Han and M. Kamber. Data mining: Concepts and techniques, 2006.
[17] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 58-65, 1998.
[18] A. Hinneburg, D. A. Keim, et al. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. Proc. Very Large Data Bases (VLDB), pp. 506-517, 1999.
[19] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proc. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 1-8, 1997.
[20] L. Hubert and P. Arabie. Comparing partitions. Journal of classification, 2(1):193-218, 1985.

Stability
Suffer from
Suffer from
Suffer from
Suffer from
Suffer from

Efficiency Problem
Suffer from
Suffer from
Yes
Yes
Yes

Scalability
Low
Low
High
High
High

[21] A. K. Jain and R. C. Dubes. Algorithms for clustering data.
Prentice-Hall, Inc., 1988.
[22] G. Karypis, E.-H. Han, and V. Kumar. Chameleon: Hierarchical
clustering using dynamic modelling. IEEE Computer, 32(8):68-75,
1999.
[23] L. Kaufman and P. J. Rousseeuw. Finding groups in data: an
introduction to cluster analysis. John Wiley & Sons, 2009.
[24] T. Kohonen. The self-organizing map. Neurocomputing, 21(1):1-6,
1998.
[25] J. MacQueen et al. Some methods for classification and analysis
of multivariate observations. Proc. of 5th Berkeley Symposium
on Mathematical Statistics and Probability, pp. 281-297. California,
USA, 1967.
[26] A. Mahmood, C. Leckie, and P. Udaya. An efficient clustering
scheme to exploit hierarchical data in network traffic analysis.
IEEE Transactions on Knowledge and Data Engineering (TKDE),
pp. 752-767, 2007.
[27] A. N. Mahmood, C. Leckie, and P. Udaya. ECHIDNA: efficient
clustering of hierarchical data for network traffic analysis. In
NETWORKING (Networking Technologies, Services, and Protocols;
Performance of Computer and Communication Networks; Mobile and
Wireless Communications Systems), pp. 1092-1098, 2006.
[28] M. Meila and D. Heckerman. An experimental comparison of
several clustering and initialization methods. Proc. 14th
Conference on Uncertainty in Artificial Intelligence, pp. 386-395, 1998.
[29] A. Moore, J. Hall, C. Kreibich, E. Harris, and I. Pratt. Architecture
of a network monitor. In Passive & Active Measurement Workshop
(PAM), La Jolla, CA, April 2003.
[30] A. Moore and D. Zuev. Internet traffic classification using
bayesian analysis techniques. Proc. of the ACM International
Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS), pp. 50-60, 2005.
[31] R. T. Ng and J. Han. Efficient and effective clustering methods
for spatial data mining. Proc. of the International Conference Very
Large Data Bases (VLDB), pp. 144-155, 1994.
[32] R. T. Ng and J. Han. Clarans: A method for clustering objects
for spatial data mining. IEEE Transactions on Knowledge and Data
Engineering (TKDE), 14(5):1003-1016, 2002.
[33] H.-S. Park and C.-H. Jun. A simple and fast algorithm for
k-medoids clustering. Expert Systems with Applications,
36(2):3336-3341, 2009.
[34] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster:
A multi-resolution clustering approach for very large spatial
databases. Proc. of the International Conference Very Large Data
Bases (VLDB), pp. 428-439, 1998.
[35] S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan.
Cost-based modeling for fraud and intrusion detection: Results from
the jam project. Proc. of the IEEE DARPA Information Survivability
Conference and Exposition (DISCEX), pp. 130-144, 2000.
[36] S. Suthaharan, M. Alzahrani, S. Rajasegarar, C. Leckie, and
M. Palaniswami. Labelled data collection for anomaly detection
in wireless sensor networks. Proc. of the 6th International
Conference on Intelligent Sensors, Sensor Networks and Information
Processing (ISSNIP), pp. 269-274, 2010.
[37] W. Wang, J. Yang, and R. Muntz. Sting: A statistical information
grid approach to spatial data mining. Proc. of the International
Conference Very Large Data Bases (VLDB), pp. 186-195, 1997.
[38] R. Xu, D. Wunsch, et al. Survey of clustering algorithms. IEEE
Transactions on Neural Networks, 16(3):645-678, 2005.
[39] X. Xu, M. Ester, H.-P. Kriegel, and J. Sander. A distribution-based
clustering algorithm for mining in large spatial databases. Proc. of
the 14th IEEE International Conference on Data Engineering (ICDE),
pp. 324-331, 1998.
[40] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient
data clustering method for very large databases. ACM SIGMOD
Record, volume 25, pp. 103-114, 1996.

This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.
