
CLUSTERING METHODS

-By Lior Rokach and Oded Maimon

This paper addresses two questions: what criteria are used to decide whether two objects are similar, and what the different types of clustering methods are.

Clustering is the grouping of similar objects, while dissimilar objects belong to different clusters. There are two kinds of measures for determining similarity: distance measures and similarity measures. Under distance measures, the distance between two data instances with numeric attributes can be calculated using the Minkowski metric, a generalization of the Euclidean distance. For binary attributes, the distance between objects can be calculated from a contingency table, taking into account whether both states are equally valuable. For nominal attributes, the distance is based on the number of attribute mismatches out of the total number of attributes, and ordinal attributes are treated as numeric values mapped to the range [0, 1]. For mixed-type attributes, the distance can be calculated by combining the methods above. An alternative to distance is the similarity function, which gives a large value when two objects are similar and the largest value for identical objects. One way to compute similarity is the cosine measure, which considers the angle between two vectors. Other common measures are the Pearson correlation measure, which uses the average feature value, and the extended Jaccard measure.
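As an illustrative sketch of the measures above (assuming NumPy is available; the sample vectors are made up for the example), the Minkowski distance and the cosine similarity can be written as:

    import numpy as np

    def minkowski(x, y, p=2):
        # Minkowski metric: p = 2 gives the Euclidean distance, p = 1 the Manhattan distance.
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def cosine_similarity(x, y):
        # Cosine measure: the cosine of the angle between the two vectors,
        # largest (1.0) when the vectors point in the same direction.
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 1.0])
    print(minkowski(x, y, p=2))     # Euclidean distance between x and y
    print(cosine_similarity(x, y))  # similarity value in [-1, 1]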

The criteria for evaluating whether a cluster is good or not fall into two categories: internal and external. Internal quality metrics usually measure the compactness of the clusters, the sum of squared error (SSE) being the simplest and most widely used. It is computed by squaring the deviation of each instance from its cluster representative and summing the results. Clustering methods that minimize the SSE criterion are often called minimum variance partitions. The SSE criterion is suitable for cases in which the clusters form compact clouds that are well separated from one another. Internal quality can also be measured by scatter criteria, which are derived from the scatter matrices: the within-cluster scatter matrix, the between-cluster scatter matrix, and their sum, the total scatter matrix. External quality criteria, on the other hand, are useful for examining whether the structure of the clusters matches some predefined classification of the instances; they can be measured using mutual information. Precision-recall is another way of measuring external quality, checking the fraction of correctly retrieved instances out of all matching instances. The Rand index is a simple criterion used to compare two clustering structures; it is calculated by dividing the number of pairs of instances that are assigned to the same cluster in both structures, or to different clusters in both structures, by the total number of pairs of instances.
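A minimal sketch of two of these criteria, the SSE and the Rand index (assuming NumPy; the function names are illustrative and the pairwise loop is only suitable for small data sets):

    import numpy as np
    from itertools import combinations

    def sse(X, labels, centers):
        # Sum of squared deviations of each instance from its cluster representative.
        return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centers))

    def rand_index(labels_a, labels_b):
        # Fraction of instance pairs on which the two clustering structures agree:
        # same cluster in both, or different clusters in both.
        pairs = list(combinations(range(len(labels_a)), 2))
        agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                    for i, j in pairs)
        return agree / len(pairs)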

Using the above criteria, many clustering algorithms have been developed, each using a different induction principle. Clustering methods are mainly divided into two groups, hierarchical and partitioning methods, with three additional main categories: density-based methods, model-based clustering and grid-based methods.

Hierarchical methods construct the clusters by recursively partitioning the instances in either a top-down or a bottom-up fashion, and are accordingly subdivided into divisive and agglomerative hierarchical clustering. In divisive hierarchical clustering, all objects initially belong to one cluster, which is divided into sub-clusters that are successively divided into sub-clusters of their own until the desired cluster structure is obtained. In agglomerative hierarchical clustering, each object initially represents a cluster of its own; clusters are then successively merged until the desired cluster structure is obtained. The merging or division of clusters is based on a similarity measure, which further divides hierarchical clustering into three variants according to the measure used. Single-link clustering considers the shortest distance between two clusters, complete-link clustering the longest distance, and average-link clustering the average distance. Single-link clustering is more versatile but suffers from the "chaining effect", as it may unify two large clusters into one, whereas average-link clustering may cause elongated clusters to split and portions of neighboring elongated clusters to merge. As a whole, hierarchical methods produce multiple partitions, which allows different users to choose different partitions. The main disadvantages of hierarchical clustering are its inability to scale well and its lack of back-tracking capability.
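As a small illustration of agglomerative clustering with the three linkage criteria (assuming SciPy; the data and the choice of three clusters are arbitrary):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)  # illustrative data

    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)                    # bottom-up merge tree
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
        print(method, labels)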

Partitioning methods construct a number of clusters preset by the user by relocating instances, moving them from one cluster to another starting from an initial clustering. The basic idea is to find a clustering structure that minimizes a certain error criterion measuring the distance of each instance to its representative value; the best-known criterion is the sum of squared error (SSE). The simplest and most commonly used algorithm is the K-means algorithm, which partitions the data into K clusters represented by their means. It starts with a set of randomly chosen cluster centers. In each iteration, each instance is assigned to its nearest cluster center, and the cluster centers are then recalculated as the means of all instances belonging to each cluster. The advantages of K-means are its simplicity, speed of convergence, adaptability to sparse data and linear complexity. Its weaknesses are sensitivity to the selection of the initial partition, being less versatile than the single-link algorithm for some instances, and sensitivity to noisy data and outliers. Another partitioning algorithm that attempts to minimize the SSE is K-medoids, or PAM (Partitioning Around Medoids). Here each cluster is represented by its most central object rather than by an implicit mean that may not belong to the cluster, which makes it less influenced by outliers. However, its processing is more costly.
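A plain NumPy sketch of the K-means iteration described above (the initialization and stopping rule are simplified; the function and parameter names are illustrative):

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
        for _ in range(n_iter):
            # assign each instance to its nearest cluster center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recalculate each center as the mean of the instances assigned to it
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):          # converged
                break
            centers = new_centers
        return labels, centers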

Graph-theoretic methods produce clusters via graphs. A well-known graph-theoretic algorithm is based on the minimum spanning tree (MST).
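A possible sketch of MST-based clustering (assuming SciPy and at least two desired clusters; removing the k-1 longest tree edges is one common heuristic, not necessarily the exact variant the paper describes):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial.distance import pdist, squareform

    def mst_clusters(X, k):
        # Build the MST of the complete distance graph, then delete the
        # k-1 longest edges; the remaining connected components are the clusters.
        mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
        edge_lengths = np.sort(mst[mst > 0])
        cut = edge_lengths[-(k - 1)]   # shortest of the k-1 longest edges (k >= 2)
        mst[mst >= cut] = 0
        _, labels = connected_components(mst, directed=False)
        return labels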

Density-based methods assume that the points belonging to each cluster are drawn from a specific probability distribution, so the overall distribution of the data is a mixture of several distributions. These methods are designed to discover clusters of arbitrary shape. The idea is to keep growing a given cluster as long as the density in its neighborhood exceeds some threshold. Density-based methods include the DBSCAN algorithm, which checks whether the neighborhood of an object contains more than a minimum number of objects, and AUTOCLASS, which covers a wide variety of distributions including Gaussian, Bernoulli, Poisson and log-normal distributions. Other well-known methods are SNOB and MCLUST.
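For example, DBSCAN as implemented in scikit-learn takes exactly these two inputs, a neighborhood radius and a minimum number of objects (the data and parameter values below are made up):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(100, 2)   # illustrative data
    # eps is the neighborhood radius; min_samples is the minimum number of
    # objects a neighborhood must contain for the cluster to keep growing.
    labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
    print(set(labels))           # label -1 marks noise points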
Model-based clustering methods find characteristic descriptions for each group while identifying the clusters. The most frequently used induction methods are decision trees and neural networks. With decision trees, the data is represented by a hierarchical tree where each leaf refers to a concept and contains a probabilistic description of that concept; the best-known algorithms are COBWEB and its extension CLASSIT. With neural networks, the input data is represented by neurons connected to prototype neurons, where each connection has a weight that is learned adaptively during training. A very popular neural algorithm for clustering is the self-organizing map (SOM), which is also useful for visualizing high-dimensional data in 2D or 3D space.
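A toy NumPy sketch of the SOM idea, prototype neurons on a grid being pulled towards the inputs (grid size, learning rate and neighborhood width are arbitrary choices for illustration):

    import numpy as np

    def train_som(X, grid=(5, 5), n_iter=1000, lr=0.5, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        weights = rng.random((grid[0], grid[1], X.shape[1]))   # prototype neurons
        coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                      indexing="ij"), axis=-1)
        for t in range(n_iter):
            x = X[rng.integers(len(X))]
            # best-matching unit: the prototype closest to the input
            bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)), grid)
            # neighborhood function: grid cells near the BMU are updated more strongly
            d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights += lr * (1 - t / n_iter) * h * (x - weights)  # pull towards input
        return weights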

Grid-based methods partition the space into a finite number of cells that form a grid structure on which all of the operations are performed.

Clustering also has other soft-computing methods, such as fuzzy clustering. Traditionally, in partitioning each instance belongs to one and only one cluster; by using the fuzzy-set concept, each data point can belong to more than one cluster. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Another method is the evolutionary approach to clustering, which refers to the application of evolutionary algorithms (also known as genetic algorithms) to data clustering. A fitness value is associated with each cluster structure, and a higher fitness value indicates a better cluster structure; cluster structures with a small squared error therefore have a larger fitness value. Genetic algorithms perform a globalized search for solutions, whereas most other clustering procedures perform a localized search. Simulated annealing for clustering is another general-purpose stochastic search technique; it is used to find near-optimal partitions with respect to each of several clustering criteria for a variety of simulated data sets.
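A condensed NumPy sketch of the fuzzy c-means update, alternating between membership-weighted means and membership degrees (the parameter names and the fuzziness exponent m = 2 are illustrative):

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)       # memberships of each point sum to 1
        for _ in range(n_iter):
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]   # membership-weighted means
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            inv = d ** (-2.0 / (m - 1))
            U = inv / inv.sum(axis=1, keepdims=True)        # updated memberships
        return U, centers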

For clustering large data sets, CLARANS (Clustering Large Applications based on RANdom Search) was developed by Ng and Han (1994). This method identifies candidate cluster centroids by using repeated random samples of the original data; because of the use of random sampling, the time complexity is O(n) for a pattern set of n elements. The BIRCH algorithm (Balanced Iterative Reducing and Clustering) stores summary information about candidate clusters in a dynamic tree data structure. This tree hierarchically organizes the clusters represented at the leaf nodes, and it can be rebuilt when a threshold specifying cluster size is updated manually. The algorithm's time complexity is linear in the number of instances. All algorithms presented to this point assume that the entire dataset fits in main memory; however, there are cases in which this assumption does not hold, and three current approaches address the problem. In the decomposition approach, the dataset is stored in secondary memory and subsets of the data are clustered independently, followed by a merging step that yields a clustering of the entire dataset. Incremental clustering algorithms process the data one element at a time and keep only the cluster representations in main memory to alleviate the space limitations. Parallel implementations distribute the computations over a network of workstations.
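For instance, scikit-learn's Birch estimator keeps only the CF-tree summary in memory and supports incremental fitting, so the data can be fed in chunks rather than loaded at once (the chunking scheme and parameter values below are illustrative):

    import numpy as np
    from sklearn.cluster import Birch

    birch = Birch(threshold=0.5, n_clusters=3)
    # Process the data chunk by chunk; only the tree summary stays in memory.
    for chunk in np.array_split(np.random.rand(10000, 2), 10):
        birch.partial_fit(chunk)
    labels = birch.predict(np.random.rand(5, 2))   # assign new points to clusters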

Many clustering algorithms require the number of clusters K to be defined before clustering, and it is well known that this parameter affects the performance of the algorithm significantly. One way to choose the value of K is based on intra-cluster scatter: as the number of clusters increases, the within-cluster scatter first declines rapidly, and from a certain K the curve flattens. That value is considered the appropriate K according to this method. Other methods are based on both the inter- and intra-cluster scatter, since a good clustering minimizes intra-cluster scatter while maximizing inter-cluster scatter; they therefore use a measure that equals the ratio of intra-cluster scatter to inter-cluster scatter.
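A small sketch of this "elbow" heuristic using K-means and its within-cluster SSE (assuming scikit-learn; the data and the range of K are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(300, 2)   # illustrative data
    # Print the within-cluster scatter (inertia, i.e. SSE) against K and pick
    # the K at which the curve stops declining rapidly and flattens out.
    for k in range(1, 11):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(k, round(inertia, 3))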
