A Survey of Clustering Algorithms For Big Data: Taxonomy & Empirical Analysis
1 INTRODUCTION
Overview figure: taxonomy of clustering algorithms.
Partitioning-based: K-means, K-medoids, K-modes, PAM, CLARANS, CLARA, FCM.
Hierarchical-based: BIRCH, CURE, ROCK, Chameleon, ECHIDNA.
Density-based: DBSCAN, OPTICS, DBCLASD, DENCLUE.
Grid-based: Wave-Cluster, STING, CLIQUE, OptiGrid.
Model-based: EM, COBWEB, CLASSIT, SOMs.
3 CRITERION TO BENCHMARK CLUSTERING METHODS
When evaluating clustering methods for big data, specific criteria are needed to assess the relative strengths and weaknesses of each algorithm with respect to the three-dimensional properties of big data: Volume, Velocity and Variety. In this section, we define these properties and compile the key criteria for each.

Volume refers to the ability of a clustering algorithm to deal with a large amount of data. To guide the selection of a suitable clustering algorithm with respect to the Volume property, the following criteria are considered: (i) size of the dataset, (ii) handling high dimensionality and (iii) handling outliers/noisy data.

Variety refers to the ability of a clustering algorithm to handle different types of data (numerical, categorical and hierarchical). To guide the selection of a suitable clustering algorithm with respect to the Variety property, the following criteria are considered: (i) type of dataset and (ii) cluster shape.

Velocity refers to the speed of a clustering algorithm on big data. To guide the selection of a suitable clustering algorithm with respect to the Velocity property, the following criterion is considered: (i) complexity of the algorithm.
4 CANDIDATE CLUSTERING ALGORITHMS
4.1 Fuzzy C-Means (FCM)
FCM [6] assigns to each data point p_i a degree of membership \mu_{ik} in each of the c clusters and seeks centroids v_1, ..., v_c that minimise the objective function

J = \sum_{i=1}^{n} \sum_{k=1}^{c} \mu_{ik}^{m} \, | p_i - v_k |^2    (1)

where m > 1 is the fuzziness parameter and | p_i - v_k | denotes the Euclidean distance between the data point p_i and the cluster centre v_k,

| p_i - v_k | = \sqrt{ \sum_{i=1}^{n} ( p_i - v_k )^2 }    (2)
TABLE 1
Categorization of clustering algorithms with respect to the big data properties and the other criteria described in Section 3.

Abb. name | Size of Dataset | Handling High Dimensionality | Handling Noisy Data | Type of Dataset | Clusters Shape | Complexity of Algorithm | Input Parameters

Partitional algorithms:
K-Means [25] | Large | No | | Numerical | Non-convex | O(nkd) | 1
K-modes [19] | Large | Yes | | Categorical | Non-convex | O(n) | 1
K-medoids [33] | Small | Yes | | Categorical | Non-convex | O(n^2 d t) | 1
PAM [31] | Small | No | | Numerical | Non-convex | O(k(n-k)^2) | 1
CLARA [23] | Large | No | | Numerical | Non-convex | O(k(40+k)^2 + k(n-k)) | 1
CLARANS [32] | Large | No | | Numerical | Non-convex | O(kn^2) | 2
FCM [6] | Large | No | | Numerical | Non-convex | O(n) | 1

Hierarchical algorithms:
BIRCH [40] | Large | No | | Numerical | Non-convex | O(n) | 2
CURE [14] | Large | Yes | | Numerical | Arbitrary | O(n^2 log n) | 2
ROCK [15] | Large | No | | Categorical and Numerical | Arbitrary | O(n^2 + n m_m m_a + n^2 log n) | 1
Chameleon [22] | Large | Yes | | All types of data | Arbitrary | O(n^2) | 3
ECHIDNA [26] | Large | No | | Multivariate data | Non-convex | O(N B (1 + log_B m)) | 2

Density-based algorithms:
DBSCAN [9] | Large | No | No | Numerical | Arbitrary | O(n log n) with a spatial index, otherwise O(n^2) | 2
OPTICS [5] | Large | No | Yes | Numerical | Arbitrary | O(n log n) | 2
DBCLASD [39] | Large | No | Yes | Numerical | Arbitrary | O(3n^2) | No
DENCLUE [17] | Large | Yes | Yes | Numerical | Arbitrary | O(log |D|) | 2

Grid-based algorithms:
Wave-Cluster [34] | Large | No | Yes | Special data | Arbitrary | O(n) | 3
STING [37] | Large | No | Yes | Special data | Arbitrary | O(k) | 1
CLIQUE [21] | Large | Yes | No | Numerical | Arbitrary | O(Ck + mk) | 2
OptiGrid [18] | Large | Yes | Yes | Special data | Arbitrary | Between O(nd) and O(nd log n) | 3

Model-based algorithms:
EM [8] | Large | Yes | No | Special data | Non-convex | O(knp) | 3
COBWEB [12] | Small | No | No | Numerical | Non-convex | O(n^2) | 1
CLASSIT [13] | Small | No | No | Numerical | Non-convex | O(n^2) | 1
SOMs [24] | Small | Yes | No | Multivariate data | Non-convex | O(n^2 m) | 2
The membership values are then updated as

\mu_{ik} = \frac{1}{\sum_{l=1}^{c} \left( \frac{| p_i - v_k |}{| p_i - v_l |} \right)^{2/(m-1)}}    (4)
As mentioned earlier, this is an iterative process (see the FCM pseudo-code below).

FCM pseudo-code:
Input: Given the dataset, set the desired number of clusters c, the fuzzy parameter m (a constant > 1) and the stopping condition; initialize the fuzzy partition matrix; and set stop = false.
Step 1. Do:
Step 2. Calculate the cluster centroids and the objective value J.
Step 3. Compute the membership values stored in the matrix.
Step 4. If the change in J between consecutive iterations is less than the stopping condition, then stop = true.
Step 5. While (!stop)
Output: A list of c cluster centres and a partition matrix are produced.
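The loop above can be written compactly in NumPy. The sketch below is illustrative only (our own variable names, Euclidean distance, random initialisation on synthetic data), not the implementation used in the experiments.

Python sketch:
import numpy as np

def fcm(X, c, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: returns (centroids, membership matrix)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialise the fuzzy partition matrix U (each row sums to 1).
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    J_old = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: cluster centroids and the objective value J (Eq. 1).
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-10
        J = np.sum(Um * dist ** 2)
        # Step 3: membership update (Eq. 4).
        ratio = dist[:, :, None] / dist[:, None, :]   # |p_i - v_k| / |p_i - v_l|
        U = 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)
        # Step 4: stop when the objective barely changes.
        if abs(J_old - J) < tol:
            break
        J_old = J
    return V, U

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
V, U = fcm(X, c=2)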
4.2 BIRCH
4.3 DENCLUE
4.4 OptiGrid
The OptiGrid algorithm [18] is designed to obtain an optimal grid partitioning. This is achieved by constructing the best cutting hyperplanes through a set of selected projections. These projections are then used to find the optimal cutting planes. Each cutting plane is chosen to have minimal point density and to separate a dense region into two half-spaces. After each step of the multi-dimensional grid construction defined by the best cutting planes, OptiGrid finds the clusters using the density function. The algorithm is then applied recursively to the clusters; in each round of recursion, OptiGrid only maintains the data objects that lie in the dense grids of the previous round. This method is very efficient for clustering large, high-dimensional datasets.
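The cutting-plane idea can be illustrated with a deliberately simplified, axis-parallel sketch: project the data onto each dimension, place a cut at a low-density histogram bin that separates two denser regions, and recurse on the resulting cells. The helper names and parameters below are ours; this illustrates the principle rather than the OptiGrid implementation evaluated in this paper.

Python sketch:
import numpy as np

def best_axis_cut(X, bins=20, min_peak=5):
    """Pick the (axis, value) cut with the lowest density between two denser regions."""
    best = None
    for axis in range(X.shape[1]):
        hist, edges = np.histogram(X[:, axis], bins=bins)
        for j in range(1, bins - 1):
            # A valid cutting point has low density and denser regions on both sides.
            if hist[j] < min_peak and hist[:j].max() >= min_peak and hist[j + 1:].max() >= min_peak:
                if best is None or hist[j] < best[0]:
                    best = (hist[j], axis, (edges[j] + edges[j + 1]) / 2)
    return None if best is None else best[1:]

def optigrid_like(X, min_size=10):
    """Recursively split the data at low-density cuts; the leaves are the clusters."""
    cut = best_axis_cut(X) if len(X) > min_size else None
    if cut is None:
        return [X]                      # dense cell: report as one cluster
    axis, value = cut
    left, right = X[X[:, axis] <= value], X[X[:, axis] > value]
    return optigrid_like(left, min_size) + optigrid_like(right, min_size)

clusters = optigrid_like(np.vstack([np.random.randn(100, 3), np.random.randn(100, 3) + 6]))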
4.5 Expectation-Maximization (EM)
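EM [8] alternates an expectation step, which computes the posterior responsibility of each mixture component for each point, with a maximization step, which re-estimates the component weights, means and variances from those responsibilities. The following is a rough, generic sketch for a spherical Gaussian mixture (our own function and parameter names), not the exact variant benchmarked in the experiments.

Python sketch:
import numpy as np

def em_gmm(X, k, n_iter=100, seed=0):
    """Toy EM for a spherical Gaussian mixture: returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)]
    var = np.full(k, X.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of component j for point i (log-space for stability).
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(w) - 0.5 * d * np.log(var) - 0.5 * sq / var
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities.
        nk = resp.sum(axis=0) + 1e-10
        w = nk / n
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        var = (resp * sq).sum(axis=0) / (d * nk) + 1e-6
    return w, means, var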
5 EXPERIMENTAL EVALUATION ON REAL DATA
TABLE 2
Data Sets Used in the Experiments

Dataset | # instances | # attributes | # classes
MHIRD | 699 | 10 | 2
MHORD | 2500 | 3 | 2
SPFDS | 100500 | 15 | 2
DOSDS | 400350 | 15 | 2
SPDOS | 290007 | 15 | 3
SHIRD | 1800 | 4 | 2
SHORD | 400 | 4 | 2
ITD | 377,526 | 149 | 12
WTP | 512 | 39 | 2
DARPA | 1,000,000 | 42 | 5
Validity Metrics
1) Validity evaluation. Unsupervised learning techniques require different evaluation criteria than supervised learning techniques. In this section, we briefly summarize the criteria used for performance evaluation, grouped into internal and external validation indices. The former evaluate the goodness of a data partition using only quantities and features inherent in the dataset; these include Compactness (CP) and the Dunn Validity Index (DVI). The latter are similar to the cross-validation process used to evaluate supervised learning techniques and include Classification Accuracy (CA), Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). Given a dataset whose class labels are known, it is possible to assess how accurately a clustering technique partitions the data relative to their correct class labels. Note that some clustering algorithms do not have centroids, so the internal indices are not directly applicable to them (e.g. OptiGrid and DENCLUE). To address this issue, we obtain the centroid of a cluster using the measure in [27], [26] and the Euclidean distance metric.
The following notation is used: X is the dataset formed by the flows x_i; \Omega is the set of flows that have been grouped into a cluster; and W is the set of centroids w_j of the clusters in \Omega. Each of the k elements produced by the clustering method is referred to as a node.
Compactness (CP). This is one of the most common measures used to validate clusters, employing only information inherent in the dataset. For a cluster \Omega_i with centroid w_i,

CP_i = \frac{1}{|\Omega_i|} \sum_{x_i \in \Omega_i} \lVert x_i - w_i \rVert    (8)

and the overall compactness is the average over all K clusters:

\overline{CP} = \frac{1}{K} \sum_{k=1}^{K} CP_k    (9)
Davies-Bouldin index (DB). Using the compactness C_i of cluster i and the distances between centroids,

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{C_i + C_j}{\lVert w_i - w_j \rVert_2}    (10)

Dunn Validity Index (DVI). DVI relates the minimum distance between points of different clusters to the maximum cluster diameter:

DVI = \frac{\min_{0 < m \neq n < K} \; \min_{x_i \in \Omega_m, \, x_j \in \Omega_n} \lVert x_i - x_j \rVert}{\max_{0 < m \le K} \; \max_{x_i, x_j \in \Omega_m} \lVert x_i - x_j \rVert}    (11)
If a dataset contains compact and well-separated clusters, the distances between clusters are expected to be large and the cluster diameters small. Thus, a larger DVI value indicates compact and well-separated clusters.
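Both internal indices follow directly from the definitions above. A small sketch (our own helpers; the centroid is taken simply as the cluster mean, and at least two clusters are assumed):

Python sketch:
import numpy as np

def compactness(X, labels):
    """Average of CP_i (Eqs. 8-9): mean distance of each point to its cluster centroid."""
    X, labels = np.asarray(X), np.asarray(labels)
    cps = []
    for c in np.unique(labels):
        pts = X[labels == c]
        w = pts.mean(axis=0)                      # cluster centroid
        cps.append(np.linalg.norm(pts - w, axis=1).mean())
    return float(np.mean(cps))

def dunn_index(X, labels):
    """DVI (Eq. 11): smallest between-cluster distance over largest cluster diameter."""
    X, labels = np.asarray(X), np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    min_sep = min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).min()
                  for i, a in enumerate(clusters) for b in clusters[i + 1:])
    max_diam = max(np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2).max()
                   for c in clusters)
    return min_sep / max_diam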
Cluster Accuracy (CA). CA measures the percentage of correctly classified data points in the clustering solution with respect to the pre-defined class labels. It is defined as:

CA = \frac{\sum_{i=1}^{K} \max(C_i \,|\, L_i)}{|\Omega|}    (12)

where \max(C_i \,|\, L_i) is the number of instances in cluster C_i that carry its most frequent class label and |\Omega| is the total number of instances.
Rand Index (RI) and Adjusted Rand Index (ARI). Given two partitions A and B of the same dataset, the Rand index counts the pairs of instances on which the two partitions agree:

RI = \frac{n_{11} + n_{00}}{n_{00} + n_{01} + n_{10} + n_{11}} = \frac{n_{11} + n_{00}}{\binom{n}{2}}    (13)

where:
n_{11}: number of pairs of instances that are in the same cluster in both A and B.
n_{00}: number of pairs of instances that are in different clusters in both A and B.
n_{10}: number of pairs of instances that are in the same cluster in A, but in different clusters in B.
n_{01}: number of pairs of instances that are in different clusters in A, but in the same cluster in B.

The value of ARI lies between 0 and 1, and a higher value indicates that all data instances are clustered correctly and each cluster contains only pure instances.
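Both CA and the pair-counting Rand index can be computed directly from the definitions above. A sketch (integer-encoded label vectors assumed; scikit-learn's adjusted_rand_score provides the adjusted variant):

Python sketch:
import numpy as np
from itertools import combinations

def cluster_accuracy(labels_true, labels_pred):
    """CA (Eq. 12): each cluster is credited with its majority class label."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    correct = 0
    for c in np.unique(labels_pred):
        true_in_c = labels_true[labels_pred == c]
        correct += np.bincount(true_in_c).max()   # max(C_i | L_i)
    return correct / len(labels_true)

def rand_index(labels_true, labels_pred):
    """RI (Eq. 13): fraction of point pairs on which the two partitions agree."""
    n11 = n00 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        n11 += same_true and same_pred               # same cluster in both
        n00 += (not same_true) and (not same_pred)   # different clusters in both
    n_pairs = len(labels_true) * (len(labels_true) - 1) // 2
    return (n11 + n00) / n_pairs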
Normalized Mutual Information (NMI). This is one of the common external clustering validation metrics that estimate the quality of the clustering with respect to a given class labeling of the data. More formally, NMI measures the amount of statistical information shared by the random variables representing the cluster assignments and the pre-defined label assignments of the instances. NMI is calculated as follows:

NMI = \frac{\sum_{h,l} d_{h,l} \log\left( \frac{d \cdot d_{h,l}}{d_h \, c_l} \right)}{\sqrt{\left( \sum_{h} d_h \log \frac{d_h}{d} \right)\left( \sum_{l} c_l \log \frac{c_l}{d} \right)}}    (14)

where d_{h,l} is the number of instances with class label h assigned to cluster l, d_h is the number of instances with class label h, c_l is the number of instances in cluster l, and d is the total number of instances.
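Eq. (14) can be evaluated directly from the class/cluster contingency counts d_{h,l}; a small sketch (our own helper, zero counts skipped):

Python sketch:
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI (Eq. 14) from the contingency table of class labels vs. cluster assignments."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    n = len(labels_true)
    d = np.array([[np.sum((labels_true == h) & (labels_pred == l)) for l in clusters]
                  for h in classes], dtype=float)         # d[h, l]
    dh, cl = d.sum(axis=1), d.sum(axis=0)                 # class and cluster sizes
    nz = d > 0
    num = np.sum(d[nz] * np.log(n * d[nz] / np.outer(dh, cl)[nz]))
    den = np.sqrt(np.sum(dh * np.log(dh / n)) * np.sum(cl * np.log(cl / n)))
    return num / den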
The stability of a clustering algorithm over n repeated runs R_1, ..., R_n is measured as the average pairwise agreement between the resulting clusterings:

Stability = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} S_r(R_i, R_j)    (15)

where:

S_r(R_i, R_j) = \begin{cases} 1 & \text{if } R_i(x_i) = R_j(x_j) \\ 0 & \text{otherwise} \end{cases}    (16)
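One reading of Eqs. (15)-(16) is sketched below: average, over all pairs of runs, the fraction of points that the two runs assign to matching clusters. This assumes the cluster labels of different runs have already been aligned, and `runs` is a list of label vectors from repeated executions (e.g. on resampled data); these names are ours.

Python sketch:
import numpy as np
from itertools import combinations

def stability(runs):
    """Average pairwise agreement (Eqs. 15-16) between clustering results R_1..R_n."""
    def agreement(r_i, r_j):
        # S_r: fraction of points assigned to matching (pre-aligned) clusters by both runs.
        return float(np.mean(np.asarray(r_i) == np.asarray(r_j)))
    pairs = list(combinations(runs, 2))
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)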
Evaluating Validity
TABLE 3
External validity results for the candidate clustering algorithms

Measure | Algorithm | MHIRD | MHORD | SPFDS | DOSDS | SPDOS | SHIRD | SHORD | ITD | WTP | DARPA
CA | DENCLUE | 67.904 | 69.729 | 68.042 | 61.864 | 66.731 | 63.149 | 71.265 | 51.909 | 64.350 | 70.460
CA | OptiGrid | 71.914 | 71.105 | 72.045 | 72.191 | 70.632 | 72.234 | 37.216 | 40.953 | 51.953 | 62.215
CA | FCM | 75.387 | 73.271 | 74.682 | 74.222 | 72.873 | 74.723 | 75.553 | 59.974 | 66.435 | 73.114
CA | EM | 82.512 | 81.940 | 82.786 | 82.919 | 82.114 | 79.450 | 82.023 | 65.035 | 72.085 | 80.685
CA | BIRCH | 71.310 | 69.763 | 69.553 | 69.930 | 69.716 | 68.351 | 70.365 | 24.510 | 59.343 | 77.343
ARI | DENCLUE | 57.772 | 45.248 | 41.535 | 39.822 | 44.510 | 35.081 | 46.267 | 35.663 | 47.665 | 60.665
ARI | OptiGrid | 35.894 | 30.297 | 32.140 | 29.402 | 29.970 | 32.598 | 29.956 | 55.824 | 32.137 | 52.137
ARI | FCM | 58.439 | 53.418 | 61.489 | 64.181 | 57.038 | 58.776 | 59.168 | 38.567 | 49.926 | 58.534
ARI | EM | 70.047 | 69.481 | 73.914 | 70.655 | 79.205 | 67.864 | 66.731 | 44.403 | 55.343 | 65.725
ARI | BIRCH | 52.424 | 44.011 | 52.470 | 41.662 | 39.627 | 40.377 | 56.462 | 19.260 | 51.260 | 61.483
RI | DENCLUE | 74.988 | 73.527 | 71.217 | 68.384 | 70.043 | 69.115 | 75.024 | 44.164 | 57.460 | 75.477
RI | OptiGrid | 76.909 | 75.404 | 75.963 | 75.448 | 75.631 | 76.550 | 75.359 | 49.252 | 59.201 | 66.201
RI | FCM | 64.876 | 81.210 | 75.118 | 77.645 | 74.855 | 66.113 | 88.302 | 53.160 | 62.694 | 78.981
RI | EM | 87.873 | 83.664 | 84.858 | 65.113 | 88.302 | 81.210 | 84.499 | 68.081 | 74.808 | 84.395
RI | BIRCH | 73.099 | 65.823 | 77.521 | 71.422 | 73.069 | 70.589 | 74.156 | 33.184 | 62.357 | 79.890
NMI | DENCLUE | 59.853 | 48.916 | 39.949 | 49.533 | 46.986 | 37.158 | 47.439 | 36.561 | 49.762 | 65.762
NMI | OptiGrid | 34.966 | 38.308 | 36.906 | 39.429 | 37.328 | 34.029 | 47.197 | 54.081 | 33.411 | 53.411
NMI | FCM | 64.256 | 65.680 | 76.428 | 69.129 | 69.708 | 72.129 | 73.242 | 39.242 | 50.589 | 59.257
NMI | EM | 74.925 | 85.077 | 82.405 | 86.374 | 85.550 | 81.742 | 85.572 | 64.029 | 58.871 | 67.142
NMI | BIRCH | 58.450 | 58.780 | 56.230 | 57.930 | 57.376 | 55.750 | 57.979 | 25.980 | 52.764 | 64.994
TABLE 4
Internal validity results for the candidate clustering algorithms (CP: Compactness; SP: Separation; DB: Davies-Bouldin index; DVI: Dunn Validity Index)

Measure | Algorithm | MHIRD | MHORD | SPFDS | DOSDS | SPDOS | SHIRD | SHORD | ITD | WTP | DARPA
CP | DENCLUE | 1.986 | 1.207 | 1.886 | 1.104 | 1.300 | 1.391 | 1.357 | 1.014 | 1.485 | 0.832
CP | OptiGrid | 1.629 | 1.678 | 1.643 | 1.232 | 1.271 | 2.505 | 2.330 | 2.454 | 1.189 | 1.973
CP | FCM | 3.243 | 1.523 | 3.014 | 2.540 | 2.961 | 3.504 | 2.548 | 3.945 | 2.555 | 2.727
CP | EM | 3.849 | 2.163 | 4.683 | 2.405 | 2.255 | 4.354 | 3.198 | 1.537 | 2.874 | 1.367
CP | BIRCH | 3.186 | 3.466 | 3.310 | 5.164 | 1.692 | 2.793 | 5.292 | 5.529 | 1.834 | 4.131
SP | DENCLUE | 2.973 | 1.450 | 2.776 | 1.247 | 1.632 | 1.810 | 1.742 | 1.073 | 1.993 | 0.716
SP | OptiGrid | 1.914 | 1.990 | 1.936 | 1.311 | 1.370 | 3.247 | 2.981 | 3.170 | 1.245 | 2.437
SP | FCM | 3.972 | 1.636 | 3.660 | 3.017 | 3.588 | 4.326 | 3.028 | 4.926 | 3.038 | 3.271
SP | EM | 4.535 | 2.389 | 5.597 | 2.696 | 2.505 | 5.178 | 3.706 | 1.592 | 3.294 | 1.375
SP | BIRCH | 3.566 | 3.907 | 3.717 | 5.979 | 1.742 | 3.086 | 6.136 | 6.425 | 1.916 | 4.719
DB | DENCLUE | 1.788 | 2.467 | 4.273 | 3.524 | 4.551 | 0.821 | 4.990 | 6.870 | 7.702 | 5.987
DB | OptiGrid | 3.798 | 5.582 | 3.085 | 1.703 | 2.655 | 5.573 | 2.128 | 4.673 | 5.078 | 8.502
DB | FCM | 3.972 | 2.315 | 4.036 | 4.104 | 3.586 | 6.964 | 5.760 | 10.824 | 10.239 | 9.238
DB | EM | 8.164 | 2.065 | 4.672 | 4.989 | 3.198 | 2.645 | 4.776 | 10.882 | 10.013 | 2.320
DB | BIRCH | 4.943 | 6.471 | 3.234 | 2.512 | 2.600 | 5.002 | 5.272 | 11.882 | 9.336 | 6.641
DVI | DENCLUE | 0.343 | 0.508 | 0.354 | 0.560 | 0.472 | 0.444 | 0.454 | 0.620 | 0.420 | 0.837
DVI | OptiGrid | 0.526 | 0.518 | 0.524 | 0.615 | 0.603 | 0.446 | 0.457 | 0.449 | 0.630 | 0.484
DVI | FCM | 0.491 | 0.606 | 0.498 | 0.517 | 0.500 | 0.485 | 0.516 | 0.476 | 0.516 | 0.509
DVI | EM | 0.524 | 0.580 | 0.512 | 0.567 | 0.575 | 0.516 | 0.538 | 0.640 | 0.548 | 0.669
DVI | BIRCH | 0.568 | 0.562 | 0.565 | 0.539 | 0.646 | 0.580 | 0.537 | 0.536 | 0.632 | 0.550
Evaluating Stability
TABLE 5
Stability results for the candidate clustering algorithms

Data sets | DENCLUE | OptiGrid | BIRCH | FCM | EM
MHIRD | 0.415 | 0.532 | 0.567 | 0.596 | 0.495
MHORD | 0.487 | 0.528 | 0.537 | 0.589 | 0.408
SPFDS | 0.451 | 0.518 | 0.544 | 0.599 | 0.478
DOSDS | 0.467 | 0.593 | 0.561 | 0.608 | 0.481
SPDOS | 0.441 | 0.531 | 0.556 | 0.591 | 0.479
SHIRD | 0.492 | 0.513 | 0.504 | 0.559 | 0.476
SHORD | 0.492 | 0.532 | 0.562 | 0.519 | 0.486
ITD | 0.292 | 0.215 | 0.372 | 0.272 | 0.473
WTP | 0.311 | 0.357 | 0.307 | 0.278 | 0.436
DARPA | 0.359 | 0.481 | 0.397 | 0.284 | 0.459
6 CONCLUSION
This survey provided a comprehensive study of the clustering algorithms proposed in the literature. In order to
reveal future directions for developing new algorithms
and to guide the selection of algorithms for big data, we
proposed a categorizing framework to classify a number
of clustering algorithms. The categorizing framework
TABLE 6
Runtime of the candidate clustering algorithms

Data sets | DENCLUE | OptiGrid | BIRCH | FCM | EM
MHIRD | 0.336 | 0.081 | 1.103 | 0.109 | 3.676
MHORD | 0.290 | 0.290 | 2.253 | 7.511 | 60.689
SPFDS | 2.5095 | 8.365 | 67.401 | 139.03 | 830.55
DOSDS | 1.73229 | 5.7743 | 86.031 | 126.471 | 581.59
SPDOS | 6.5178 | 32.6625 | 208.875 | 226.55 | 1543.4
SHIRD | 0.011 | 0.038 | 0.811 | 0.603 | 3.140
SHORD | 0.017 | 0.058 | 0.780 | 0.824 | 4.929
ITD | 7.107 | 23.689 | 241.074 | 262.353 | 1982.790
WTP | 0.230 | 0.388 | 1.246 | 1.768 | 6.429
DARPA | 17.347 | 56.716 | 364.592 | 401.795 | 20429.281
Average | 3.610 | 12.806 | 97.416 | 124.701 | 2544.647
TABLE 7
Compliance summary of the clustering algorithms based on empirical evaluation metrics

Cls. Algorithms | External Validity | Internal Validity | Stability | Efficiency | Scalability
EM | Yes | Partially | Suffer from | Suffer from | Low
FCM | Yes | Partially | Suffer from | Suffer from | Low
DENCLUE | No | Yes | Suffer from | Yes | High
OptiGrid | No | Yes | Suffer from | Yes | High
BIRCH | No | Suffer from | Suffer from | Yes | High
REFERENCES
[1] A. A. Abbasi and M. Younis. A survey on clustering algorithms for wireless sensor networks. Computer Communications, 30(14):2826–2841, 2007.
[2] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data, pp. 77–128. Springer, 2012.
[3] A. Almalawi, Z. Tari, A. Fahad, and I. Khalil. A framework for improving the accuracy of unsupervised intrusion detection for SCADA systems. Proc. of the 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 292–301, 2013.
[4] A. Almalawi, Z. Tari, I. Khalil, and A. Fahad. SCADAVT-A framework for SCADA security testbed based on virtualization technology. Proc. of the 38th IEEE Conference on Local Computer Networks (LCN), pp. 639–646, 2013.
[5] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2):49–60, 1999.
[6] J. C. Bezdek, R. Ehrlich, and W. Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2):191–203, 1984.
[7] J. Brank, M. Grobelnik, and D. Mladenic. A survey of ontology evaluation techniques. Proc. of the Conference on Data Mining and Data Warehouses (SiKDD), 2005.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[9] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231, 1996.
[10] A. Fahad, Z. Tari, A. Almalawi, A. Goscinski, I. Khalil, and A. Mahmood. PPFSCADA: Privacy preserving framework for SCADA data publishing. Future Generation Computer Systems (FGCS), 2014.
[11] A. Fahad, Z. Tari, I. Khalil, I. Habib, and H. Alnuweiri. Toward an efficient and scalable feature selection approach for internet traffic classification. Computer Networks, 57(9):2040–2057, 2013.
[12] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2):139–172, 1987.
[13] J. H. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation. Artificial Intelligence, 40(1):11–61, 1989.
[14] S. Guha, R. Rastogi, and K. Shim. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Record, volume 27, pp. 73–84, 1998.
[15] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.
[16] J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2006.
[17] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 58–65, 1998.
[18] A. Hinneburg, D. A. Keim, et al. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. Proc. of the International Conference on Very Large Data Bases (VLDB), pp. 506–517, 1999.
[19] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proc. of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 1–8, 1997.
[20] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[21] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
[22] G. Karypis, E.-H. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modelling. IEEE Computer, 32(8):68–75, 1999.
[23] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 2009.
[24] T. Kohonen. The self-organizing map. Neurocomputing, 21(1):1–6, 1998.
[25] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, California, USA, 1967.
[26] A. Mahmood, C. Leckie, and P. Udaya. An efficient clustering scheme to exploit hierarchical data in network traffic analysis. IEEE Transactions on Knowledge and Data Engineering (TKDE), pp. 752–767, 2007.
[27] A. N. Mahmood, C. Leckie, and P. Udaya. ECHIDNA: efficient clustering of hierarchical data for network traffic analysis. In NETWORKING (Networking Technologies, Services, and Protocols; Performance of Computer and Communication Networks; Mobile and Wireless Communications Systems), pp. 1092–1098, 2006.
[28] M. Meila and D. Heckerman. An experimental comparison of several clustering and initialization methods. Proc. of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 386–395, 1998.
[29] A. Moore, J. Hall, C. Kreibich, E. Harris, and I. Pratt. Architecture of a network monitor. In Passive & Active Measurement Workshop (PAM), La Jolla, CA, April 2003.
[30] A. Moore and D. Zuev. Internet traffic classification using Bayesian analysis techniques. Proc. of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), pp. 50–60, 2005.
[31] R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. Proc. of the International Conference on Very Large Data Bases (VLDB), pp. 144–155, 1994.
[32] R. T. Ng and J. Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE), 14(5):1003–1016, 2002.
[33] H.-S. Park and C.-H. Jun. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2):3336–3341, 2009.
[34] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proc. of the International Conference on Very Large Data Bases (VLDB), pp. 428–439, 1998.
[35] S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan. Cost-based modeling for fraud and intrusion detection: Results from the JAM project. Proc. of the IEEE DARPA Information Survivability Conference and Exposition (DISCEX), pp. 130–144, 2000.
[36] S. Suthaharan, M. Alzahrani, S. Rajasegarar, C. Leckie, and M. Palaniswami. Labelled data collection for anomaly detection in wireless sensor networks. Proc. of the 6th International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), pp. 269–274, 2010.
[37] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. Proc. of the International Conference on Very Large Data Bases (VLDB), pp. 186–195, 1997.
[38] R. Xu, D. Wunsch, et al. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.
[39] X. Xu, M. Ester, H.-P. Kriegel, and J. Sander. A distribution-based clustering algorithm for mining in large spatial databases. Proc. of the 14th IEEE International Conference on Data Engineering (ICDE), pp. 324–331, 1998.
[40] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record, volume 25, pp. 103–114, 1996.