By Lior Rokach and Oded Maimon: Clustering Methods
This paper defines the criteria used to determine whether two objects are
similar and surveys the different types of clustering methods.
The criteria for evaluating whether a cluster is good or not fall into two
categories: internal and external. Internal quality metrics usually measure
the compactness of the clusters, with the Sum of Squared Error (SSE) being the
simplest and most widely used. SSE sums, over all instances, the squared
deviation of each value from the centroid of its cluster. Clustering methods
that minimize the SSE criterion are often called minimum variance partitions.
The SSE criterion is suitable for cases in which the clusters form compact
clouds that are well separated from one another. Internal quality can also be
measured by scatter criteria, which are derived from the scatter matrices
reflecting the within-cluster scatter, the between-cluster scatter, and their
summation, the total scatter matrix.
External quality criteria, on the other hand, are useful for examining whether
the structure of the clusters matches some predefined classification of the
instances. One such measure is mutual information. Precision-recall is another
way of measuring external quality: precision is the fraction of retrieved
instances that are correct, while recall is the fraction of correctly
retrieved instances out of all matching instances. The Rand index is a simple
criterion for comparing two clustering structures; it divides the number of
pairs of instances that are assigned to the same cluster in both structures,
or to different clusters in both structures, by the total number of pairs of
instances.
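As a rough illustration of these criteria, the Python sketch below computes
SSE against cluster centroids and the Rand index via pairwise agreement; the
array names and shapes are assumptions for the example, not part of the paper.

```python
# A minimal sketch of the two criteria above, assuming NumPy label arrays;
# the names points, labels_a, labels_b are illustrative, not from the paper.
import numpy as np
from itertools import combinations

def sse(points, labels):
    """Sum of Squared Error: squared deviation of each point
    from the centroid of its assigned cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

def rand_index(labels_a, labels_b):
    """Fraction of instance pairs on which the two clustering structures
    agree (same cluster in both, or different clusters in both)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total
```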
Based on these criteria, many clustering algorithms have been developed, each
of which uses a different induction principle. Clustering methods are mainly
divided into two groups, hierarchical and partitioning methods, with three
additional main categories: density-based methods, model-based clustering, and
grid-based methods. A sketch of a partitioning method follows below.
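To make the partitioning family concrete, here is a minimal sketch of Lloyd's
k-means, probably the best-known partitioning method; the random
initialization and stopping rule are simplifying assumptions for the example.

```python
# A minimal k-means (Lloyd's algorithm) sketch; initialization and
# convergence test are simplifications, not prescriptions from the paper.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its old centroid.
        new = np.array([points[labels == j].mean(axis=0)
                        if (labels == j).any() else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```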
Graph-theoretic methods produce clusters via graphs. A well-known
graph-theoretic algorithm is based on the Minimum Spanning Tree (MST).
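A hedged sketch of the MST idea, assuming SciPy is available: build the
minimum spanning tree of the pairwise-distance graph, then delete its heaviest
edges so the remaining forest splits into the desired number of clusters. The
function name and parameters are illustrative assumptions.

```python
# MST-based clustering sketch: remove the (n_clusters - 1) heaviest MST
# edges; the connected components that remain are the clusters.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clustering(points, n_clusters):
    dist = squareform(pdist(points))           # dense pairwise distances
    mst = minimum_spanning_tree(dist).toarray()
    edges = np.argwhere(mst > 0)
    weights = mst[edges[:, 0], edges[:, 1]]
    # Cut the heaviest edges to break the tree into n_clusters pieces.
    for idx in np.argsort(weights)[::-1][:n_clusters - 1]:
        i, j = edges[idx]
        mst[i, j] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels
```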
Density-based methods assume that the points belonging to each cluster are
drawn from a specific probability distribution, where the overall distribution
of the data is a mixture of several such distributions. These methods are
designed for discovering clusters of arbitrary shape. The idea is to continue
growing a given cluster as long as the density in its neighborhood exceeds
some threshold. Density-based methods include the DBSCAN algorithm, which
checks whether the neighborhood of an object contains more than a minimum
number of objects, and AUTOCLASS, which covers a wide variety of
distributions, including Gaussian, Bernoulli, Poisson, and log-normal
distributions. Other well-known methods are SNOB and MCLUST.
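As a brief usage sketch, scikit-learn's DBSCAN implementation exposes the two
quantities mentioned above as eps (the neighborhood radius) and min_samples
(the minimum number of objects); the data and parameter values here are
synthetic illustrations, not from the paper.

```python
# DBSCAN via scikit-learn: a point is a core point when its eps-neighborhood
# contains at least min_samples objects; label -1 marks noise.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.default_rng(0).normal(size=(200, 2))
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(points)
print(np.unique(labels))   # cluster ids, plus -1 for noise points
```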
Model-based clustering methods not only identify clusters but also find a
characteristic description for each group. The most frequently used induction
methods are decision trees and neural networks. In a decision tree, the data
is represented by a hierarchical tree, where each leaf refers to a concept and
contains a probabilistic description of that concept. The most well-known
algorithms are COBWEB and CLASSIT, an extension of COBWEB. In a neural
network, the input data is represented by neurons that are connected to
prototype neurons, where each connection has a weight that is learned
adaptively during training. A very popular neural algorithm for clustering is
the self-organizing map (SOM), which is useful for visualizing
high-dimensional data in 2D or 3D space.
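The following is a minimal NumPy sketch of SOM training, assuming a 2D grid of
prototype neurons; the learning-rate and neighborhood schedules are
illustrative choices, not values from the paper.

```python
# SOM sketch: each input pulls the best-matching prototype neuron and its
# grid neighbors toward itself, with shrinking rate and neighborhood.
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, sigma0=3.0):
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(rows, cols, data.shape[1]))  # prototypes
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Decay learning rate and neighborhood width over time.
            frac = t / steps
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            # Find the best-matching unit (closest prototype neuron).
            d = ((weights - x) ** 2).sum(axis=2)
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Update the BMU and its grid neighbors toward the input.
            g = ((grid - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-g / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
            t += 1
    return weights
```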
Grid-based methods partition the space into a finite number of cells that form
a grid structure, on which all of the operations are performed.
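A toy sketch of the grid-based idea, assuming 2D points: map each point to a
fixed-size cell and keep only per-cell summaries, on which later operations
would act. The cell size is an illustrative assumption.

```python
# Grid-based sketch: operations run on per-cell counts, not raw points.
import numpy as np

def grid_summary(points, cell_size=1.0):
    # Map each point to the integer coordinates of its grid cell.
    cells = np.floor(points / cell_size).astype(int)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    # Later steps (e.g., merging dense adjacent cells) would operate on
    # this summary rather than on the original points.
    return dict(zip(map(tuple, uniq), counts))
```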
For clustering large data sets, CLARANS (Clustering Large Applications based
on RANdom Search) was developed by Ng and Han (1994). This method identifies
candidate cluster centroids by using repeated random samples of the original
data. Because of the use of random sampling, the time complexity is O(n) for a
pattern set of n elements. The BIRCH algorithm (Balanced Iterative Reducing
and Clustering using Hierarchies) stores summary information about candidate
clusters in a dynamic tree data structure that hierarchically organizes the
clusters represented at the leaf nodes. The tree can be rebuilt when a
threshold specifying cluster size is updated manually, and the algorithm's
time complexity is linear in the number of instances.
All the algorithms presented up to this point assume that the entire dataset
fits in main memory. However, there are cases in which this assumption does
not hold, and three current approaches address the problem. In the
decomposition approach, the dataset is stored in secondary memory and subsets
of the data are clustered independently, followed by a merging step that
yields a clustering of the entire dataset. Incremental clustering algorithms
process the data one element at a time and keep only the cluster
representations in main memory, alleviating the space limitations (a sketch
follows below). In a parallel implementation, the computations are distributed
over a network of workstations.
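As a sketch of the incremental approach, the simple leader-style clusterer
below sees each element exactly once and keeps only cluster representatives (a
running centroid and a count) in memory; the distance threshold is an
illustrative assumption, not a value from the paper.

```python
# Incremental ("leader") clustering sketch: one pass over the stream,
# only per-cluster centroids and counts are kept in main memory.
import numpy as np

def incremental_cluster(stream, threshold=1.0):
    centroids, counts = [], []
    for x in stream:                       # one element at a time
        x = np.asarray(x, dtype=float)
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] <= threshold:
                # Fold the point into the nearest cluster's running mean.
                counts[j] += 1
                centroids[j] += (x - centroids[j]) / counts[j]
                continue
        centroids.append(x.copy())         # start a new cluster
        counts.append(1)
    return centroids, counts
```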