
CLUSTERING METHODS

-By Lior Rokach and Oded Maimon

This paper addresses two questions: what criteria are used to decide whether two objects are similar, and what the different types of clustering methods are.

Clustering is the grouping of similar objects, while dissimilar objects belong to different clusters. There are two kinds of measures for determining similarity: distance measures and similarity measures. Under distance measures, the distance between two data instances with numeric attributes can be calculated using the Minkowski metric, a generalization of the Euclidean distance. For binary attributes, the distance between objects can be calculated from a contingency table, taking into account whether both states are equally valuable. For nominal attributes, the distance is based on the number of attribute mismatches out of the total number of attributes, and ordinal attributes are treated as numeric values mapped to the range [0, 1]. For mixed-type attributes, the distance can be calculated by combining the methods above. An alternative to distance is the similarity function, which gives a large value when two objects are similar and the largest value for identical objects. One way to compute similarity is the cosine measure, which considers the angle between two vectors. Other common measures are the Pearson correlation measure, which uses the average feature value, and the extended Jaccard measure.
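As an illustrative sketch of the measures above (assuming NumPy is available; the sample vectors are made up for the example), the Minkowski distance and the cosine similarity can be written as:

    import numpy as np

    def minkowski(x, y, p=2):
        # Minkowski metric: p = 2 gives the Euclidean distance, p = 1 the Manhattan distance.
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def cosine_similarity(x, y):
        # Cosine measure: the cosine of the angle between the two vectors,
        # largest (1.0) when the vectors point in the same direction.
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 1.0])
    print(minkowski(x, y, p=2))     # Euclidean distance between x and y
    print(cosine_similarity(x, y))  # similarity value in [-1, 1]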

The criteria for evaluating whether a cluster is good or not fall into two categories: internal and external. Internal quality metrics usually measure the compactness of the clusters, the sum of squared error (SSE) being the simplest and most widely used. It is computed by squaring the deviation of each instance from its cluster representative and summing the results. Clustering methods that minimize the SSE criterion are often called minimum variance partitions. The SSE criterion is suitable for cases in which the clusters form compact clouds that are well separated from one another. Internal quality can also be measured by scatter criteria, which are derived from the scatter matrices: the within-cluster scatter matrix, the between-cluster scatter matrix, and their sum, the total scatter matrix. External quality criteria, on the other hand, are useful for examining whether the structure of the clusters matches some predefined classification of the instances; they can be measured using mutual information. Precision-recall is another way of measuring external quality, checking the fraction of correctly retrieved instances out of all matching instances. The Rand index is a simple criterion used to compare two clustering structures; it is calculated by dividing the number of pairs of instances that are assigned to the same cluster in both structures, or to different clusters in both structures, by the total number of pairs of instances.
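A minimal sketch of two of these criteria, the SSE and the Rand index (assuming NumPy; the function names are illustrative and the pairwise loop is only suitable for small data sets):

    import numpy as np
    from itertools import combinations

    def sse(X, labels, centers):
        # Sum of squared deviations of each instance from its cluster representative.
        return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centers))

    def rand_index(labels_a, labels_b):
        # Fraction of instance pairs on which the two clustering structures agree:
        # same cluster in both, or different clusters in both.
        pairs = list(combinations(range(len(labels_a)), 2))
        agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                    for i, j in pairs)
        return agree / len(pairs)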

Using the above criteria, many clustering algorithms have been developed, each using a different induction principle. Clustering methods are mainly divided into two groups, hierarchical and partitioning methods, with three additional main categories: density-based methods, model-based clustering and grid-based methods.

Hierarchical methods construct the clusters by recursively partitioning the instances in either a top-down or a bottom-up fashion, and are accordingly subdivided into divisive and agglomerative hierarchical clustering. In divisive hierarchical clustering, all objects initially belong to one cluster, which is divided into sub-clusters that are successively divided into sub-clusters of their own until the desired cluster structure is obtained. In agglomerative hierarchical clustering, each object initially represents a cluster of its own; clusters are then successively merged until the desired cluster structure is obtained. The merging or division of clusters is based on a similarity measure, which further divides hierarchical clustering into three variants according to the measure used. Single-link clustering considers the shortest distance between two clusters, complete-link clustering the longest distance, and average-link clustering the average distance. Single-link clustering is more versatile but suffers from the "chaining effect", as it may unify two large clusters into one, whereas average-link clustering may cause elongated clusters to split and portions of neighboring elongated clusters to merge. As a whole, hierarchical methods produce multiple partitions, which allows different users to choose different partitions. The main disadvantages of hierarchical clustering are its inability to scale well and its lack of back-tracking capability.
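As a small illustration of agglomerative clustering with the three linkage criteria (assuming SciPy; the data and the choice of three clusters are arbitrary):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)  # illustrative data

    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)                    # bottom-up merge tree
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
        print(method, labels)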

Partitioning methods construct a number of clusters preset by the user by relocating instances, moving them from one cluster to another starting from an initial clustering. The basic idea is to find a clustering structure that minimizes a certain error criterion measuring the distance of each instance to its representative value; the best-known criterion is the sum of squared error (SSE). The simplest and most commonly used algorithm is the K-means algorithm, which partitions the data into K clusters represented by their means. It starts with a set of randomly chosen cluster centers. In each iteration, each instance is assigned to its nearest cluster center, and the cluster centers are then recalculated as the means of all instances belonging to each cluster. The advantages of K-means are its simplicity, speed of convergence, adaptability to sparse data and linear complexity. Its weaknesses are sensitivity to the selection of the initial partition, being less versatile than the single-link algorithm for some instances, and sensitivity to noisy data and outliers. Another partitioning algorithm that attempts to minimize the SSE is K-medoids, or PAM (Partitioning Around Medoids). Here each cluster is represented by its most central object rather than by an implicit mean that may not belong to the cluster, which makes it less influenced by outliers. However, its processing is more costly.
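A plain NumPy sketch of the K-means iteration described above (the initialization and stopping rule are simplified; the function and parameter names are illustrative):

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
        for _ in range(n_iter):
            # assign each instance to its nearest cluster center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recalculate each center as the mean of the instances assigned to it
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):          # converged
                break
            centers = new_centers
        return labels, centers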

Graph-theoretic methods produce clusters via graphs. A well-known graph-theoretic algorithm is based on the minimum spanning tree (MST).
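A possible sketch of MST-based clustering (assuming SciPy and at least two desired clusters; removing the k-1 longest tree edges is one common heuristic, not necessarily the exact variant the paper describes):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial.distance import pdist, squareform

    def mst_clusters(X, k):
        # Build the MST of the complete distance graph, then delete the
        # k-1 longest edges; the remaining connected components are the clusters.
        mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
        edge_lengths = np.sort(mst[mst > 0])
        cut = edge_lengths[-(k - 1)]   # shortest of the k-1 longest edges (k >= 2)
        mst[mst >= cut] = 0
        _, labels = connected_components(mst, directed=False)
        return labels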

Density-based methods assume that the points belonging to each cluster are drawn from a specific probability distribution, so the overall distribution of the data is a mixture of several distributions. These methods are designed to discover clusters of arbitrary shape. The idea is to keep growing a given cluster as long as the density in its neighborhood exceeds some threshold. Density-based methods include the DBSCAN algorithm, which checks whether the neighborhood of an object contains more than a minimum number of objects, and AUTOCLASS, which covers a wide variety of distributions including Gaussian, Bernoulli, Poisson and log-normal distributions. Other well-known methods are SNOB and MCLUST.
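For example, DBSCAN as implemented in scikit-learn takes exactly these two inputs, a neighborhood radius and a minimum number of objects (the data and parameter values below are made up):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(100, 2)   # illustrative data
    # eps is the neighborhood radius; min_samples is the minimum number of
    # objects a neighborhood must contain for the cluster to keep growing.
    labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
    print(set(labels))           # label -1 marks noise points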
Model-based clustering methods find characteristic descriptions for each group while identifying the clusters. The most frequently used induction methods are decision trees and neural networks. With decision trees, the data is represented by a hierarchical tree where each leaf refers to a concept and contains a probabilistic description of that concept; the best-known algorithms are COBWEB and its extension CLASSIT. With neural networks, the input data is represented by neurons connected to prototype neurons, where each connection has a weight that is learned adaptively during training. A very popular neural algorithm for clustering is the self-organizing map (SOM), which is also useful for visualizing high-dimensional data in 2D or 3D space.
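A toy NumPy sketch of the SOM idea, prototype neurons on a grid being pulled towards the inputs (grid size, learning rate and neighborhood width are arbitrary choices for illustration):

    import numpy as np

    def train_som(X, grid=(5, 5), n_iter=1000, lr=0.5, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        weights = rng.random((grid[0], grid[1], X.shape[1]))   # prototype neurons
        coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                      indexing="ij"), axis=-1)
        for t in range(n_iter):
            x = X[rng.integers(len(X))]
            # best-matching unit: the prototype closest to the input
            bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)), grid)
            # neighborhood function: grid cells near the BMU are updated more strongly
            d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights += lr * (1 - t / n_iter) * h * (x - weights)  # pull towards input
        return weights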

Grid-based methods partition the space into a finite number of cells that form a grid structure on which all of the operations are performed.

Clustering also has other soft-computing methods, such as fuzzy clustering. Traditionally, in partitioning each instance belongs to one and only one cluster; by using the fuzzy-set concept, each data point can belong to more than one cluster. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Another method is the evolutionary approach to clustering, which refers to the application of evolutionary algorithms (also known as genetic algorithms) to data clustering. A fitness value is associated with each cluster structure, and a higher fitness value indicates a better cluster structure; cluster structures with a small squared error therefore have a larger fitness value. Genetic algorithms perform a globalized search for solutions, whereas most other clustering procedures perform a localized search. Simulated annealing for clustering is another general-purpose stochastic search technique; it is used to find near-optimal partitions with respect to each of several clustering criteria for a variety of simulated data sets.
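A condensed NumPy sketch of the fuzzy c-means update, alternating between membership-weighted means and membership degrees (the parameter names and the fuzziness exponent m = 2 are illustrative):

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)       # memberships of each point sum to 1
        for _ in range(n_iter):
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]   # membership-weighted means
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            inv = d ** (-2.0 / (m - 1))
            U = inv / inv.sum(axis=1, keepdims=True)        # updated memberships
        return U, centers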

For clustering large data sets, CLARANS (Clustering Large Applications based on RANdom Search) was developed by Ng and Han (1994). This method identifies candidate cluster centroids by using repeated random samples of the original data; because of the use of random sampling, the time complexity is O(n) for a pattern set of n elements. The BIRCH algorithm (Balanced Iterative Reducing and Clustering) stores summary information about candidate clusters in a dynamic tree data structure. This tree hierarchically organizes the clusters represented at the leaf nodes, and it can be rebuilt when a threshold specifying cluster size is updated manually. The algorithm's time complexity is linear in the number of instances. All algorithms presented to this point assume that the entire dataset fits in main memory; however, there are cases in which this assumption does not hold, and three current approaches address the problem. In the decomposition approach, the dataset is stored in secondary memory and subsets of the data are clustered independently, followed by a merging step that yields a clustering of the entire dataset. Incremental clustering algorithms process the data one element at a time and keep only the cluster representations in main memory to alleviate the space limitations. Parallel implementations distribute the computations over a network of workstations.
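For instance, scikit-learn's Birch estimator keeps only the CF-tree summary in memory and supports incremental fitting, so the data can be fed in chunks rather than loaded at once (the chunking scheme and parameter values below are illustrative):

    import numpy as np
    from sklearn.cluster import Birch

    birch = Birch(threshold=0.5, n_clusters=3)
    # Process the data chunk by chunk; only the tree summary stays in memory.
    for chunk in np.array_split(np.random.rand(10000, 2), 10):
        birch.partial_fit(chunk)
    labels = birch.predict(np.random.rand(5, 2))   # assign new points to clusters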

Many clustering algorithms require the number of clusters K to be defined before clustering, and it is well known that this parameter affects the performance of the algorithm significantly. One way to choose the value of K is based on intra-cluster scatter: as the number of clusters increases, the within-cluster scatter first declines rapidly, and from a certain K the curve flattens. That value is considered the appropriate K according to this method. Other methods are based on both the inter- and intra-cluster scatter, since a good clustering minimizes intra-cluster scatter while maximizing inter-cluster scatter; they therefore use a measure that equals the ratio of intra-cluster scatter to inter-cluster scatter.
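A small sketch of this "elbow" heuristic using K-means and its within-cluster SSE (assuming scikit-learn; the data and the range of K are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(300, 2)   # illustrative data
    # Print the within-cluster scatter (inertia, i.e. SSE) against K and pick
    # the K at which the curve stops declining rapidly and flattens out.
    for k in range(1, 11):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(k, round(inertia, 3))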
