4.5-Cluster Analysis
Cluster analysis is the task of grouping a set of objects in such a way that
objects in the same group, called a cluster, are more similar (in some sense) to
each other than to objects in other groups (clusters).
Distance metrics and similarity metrics have been developed more or less
independently for different purposes. In most cases, however, a specific similarity
metric is intuitively the inverse of a corresponding distance metric, so each can be
transformed into the other.
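As an illustration of how a distance can be turned into a similarity and back, here is a minimal sketch. The mapping s = 1 / (1 + d) is only one common choice and is an assumption here, not something prescribed by the lecture; the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally long numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(d):
    """Map a distance in [0, inf) to a similarity in (0, 1].

    Assumed mapping: s = 1 / (1 + d). Distance 0 gives similarity 1,
    and the similarity shrinks towards 0 as the distance grows.
    """
    return 1.0 / (1.0 + d)


def similarity_to_distance(s):
    """Inverse of distance_to_similarity, valid for s in (0, 1]."""
    return 1.0 / s - 1.0
```

Other mappings, such as exp(-d), behave similarly. Which one is appropriate depends on the metric and the application.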
The more than 100 published clustering algorithms can themselves be grouped in many ways.
The structure chosen for this lecture is described below.
Partitioning-based clustering
Examples of algorithms:
• K-means clustering
• K-medoids
• CLARA
Partitioning-based clustering as exemplified by the
approach in the k-means algorithm
Goal: partition N instances into k clusters.
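The goal above can be sketched with the standard k-means iteration: assign every instance to its nearest centroid, then move each centroid to the mean of its members, and repeat until nothing changes. This is a minimal sketch under stated assumptions (Euclidean distance, random initial centroids drawn from the data, a hypothetical function name), not a tuned implementation.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Partition `points` (a list of equal-length tuples) into k clusters.

    Returns (labels, centroids). Initial centroids are k distinct input
    points chosen at random; this is an assumption of the sketch.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if a cluster lost all its members.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: centroids no longer move.
        centroids = new_centroids
    return labels, centroids
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10). The result of k-means generally depends on the initial centroids, so in practice it is often run several times.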
Hierarchical clustering
Examples of algorithms:
• CURE
• BIRCH
• ROCK
• Chameleon
Dendrogram
A dendrogram is a diagram that
shows the hierarchical relationship
between objects.
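The information a dendrogram displays is a sequence of merges, each with the distance at which it happened. The sketch below produces that sequence with agglomerative single-linkage clustering. Single linkage is chosen here only because it is simple; it is an assumption of this example and not one of the algorithms listed above. The function name and the merge-record layout are hypothetical.

```python
import math


def single_linkage(points):
    """Agglomerative single-linkage clustering.

    Returns a list of merges (id_a, id_b, distance, size). The original
    points have ids 0..n-1; each merge creates a new cluster id n, n+1, ...
    Read bottom-up, this list is exactly what a dendrogram draws.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d, len(clusters[next_id])))
        next_id += 1
    return merges
```

For the one-dimensional points `[(0.0,), (1.0,), (5.0,), (6.5,)]`, the two close pairs are merged first and the two resulting clusters are joined last, at the largest distance. Cutting the dendrogram below that last merge yields two clusters.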
Density-based clustering
Properties of algorithms:
• Clusters are dense regions in the instance space, separated by regions of
lower instance density.
• A cluster is defined as a maximal set of density-connected instances.
• No predefined number of clusters is needed, but thresholds for
reachability and density must be defined.
• Discovers clusters of arbitrary shape.
• Is insensitive to noise.
Examples of algorithms:
• DBSCAN
• OPTICS
Density-based clustering as exemplified by
the approach in DBSCAN
Instances are classified as core instances, reachable instances,
or outliers.
• A core instance has at least a minimum number of instances within a
threshold radius.
• An instance is directly density reachable from a core instance if it lies
within the threshold radius of that core instance; it is density reachable
if it can be reached through a chain of such steps.
• An instance is density connected to another instance if both
instances are density reachable from a third instance or if they
are directly density reachable from each other.
• All instances not reachable from any other instance are
considered outliers (possibly noise).
• If p is a core instance, then it forms a cluster together with
all instances that are reachable from it. Each cluster
contains at least one core instance; non-core points can be part
of a cluster, but they form its "edge".
• All points within the cluster are mutually density-connected.
• If a point is density-reachable from any point of the cluster, it
is part of the cluster as well.
Figure: Point A and the other red instances are core instances,
because the area within an ε radius of each of these instances
contains at least the specified minimum of 4 points.
Because they are all reachable from one another,
they form a single cluster. Points B and C are not core
points, but are reachable from A (via other core points)
and thus belong to the cluster as well.
Point N is a noise point that is neither
a core point nor directly reachable.
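The rules above can be sketched as a small DBSCAN-style procedure. This is a minimal, quadratic-time sketch under stated assumptions: Euclidean distance, a neighbourhood count that includes the point itself, and hypothetical parameter names `eps` (threshold radius) and `min_pts` (minimum number of instances).

```python
import math

NOISE = -1  # Label used for outliers.


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; outliers get the label NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [
            j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # Already assigned to a cluster or marked as noise.
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # Not a core instance; may become an edge point later.
            continue
        cluster += 1  # Point i is a core instance: start a new cluster.
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # Reachable non-core point: cluster edge.
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # Only core instances expand the cluster.
    return labels
```

With `eps=1.5` and `min_pts=4`, the points `[(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]` give one cluster for the first five points and a noise label for the last. The point (2, 1) is not a core instance but joins the cluster as an edge point because it is reachable from a core instance.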
Grid-based clustering
Grid-based methods quantize the instance space into a finite number of
cells (hyper-rectangles) and then perform the required operations on the
quantized space.
Examples of algorithms:
• CLIQUE (CLustering In QUEst)
• STING (STatistical INformation Grid)
• WaveCluster
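The quantization step can be illustrated with a short sketch that maps each instance to the index of its grid cell and keeps the cells holding enough instances. It shows only the shared first step, not CLIQUE, STING, or WaveCluster themselves. The uniform `cell_size`, the `min_count` threshold, and the function names are assumptions made for this example.

```python
from collections import Counter


def grid_cells(points, cell_size):
    """Count how many points fall into each grid cell.

    A cell is identified by a tuple of integer indices, one per dimension,
    obtained by floor-dividing each coordinate by cell_size.
    """
    return Counter(
        tuple(int(x // cell_size) for x in p) for p in points
    )


def dense_cells(points, cell_size, min_count):
    """Return the set of cells that contain at least min_count points."""
    counts = grid_cells(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_count}
```

After this step, later operations work on the cell counts rather than on the individual instances, so their cost depends on the number of occupied cells and not on the number of instances.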
Model-based clustering
Model-based clustering means that clustering is based on some model of, or background
knowledge about, the domain from which the instances of the dataset are drawn.
The model can be more or less extensive, but in all cases it can guide the clustering process.
Model-based clustering can in principle be an extension of any of the other clustering
approaches.
If the domain knowledge is statistical information about the distributions of the
various kinds of instances involved, this kind of clustering technique is called
distribution-based clustering.
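As one possible illustration of distribution-based clustering, the sketch below assumes that one-dimensional instances come from a mixture of two Gaussian distributions and fits it with expectation-maximization (EM). The Gaussian assumption, the choice of two components, the crude initialization, and all names are assumptions of this example and are not taken from the lecture.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; return (labels, means)."""
    # Crude initialization (an assumption): extremes of the data as means.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in data]
    for _ in range(iterations):
        # E-step: responsibility of each component for each instance.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # Guard against division by zero.
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # Keep the variance strictly positive.
            weights[k] = nk / len(data)
    # Each instance is assigned to the component with the higher responsibility.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return labels, means
```

For data such as `[0.0, 0.5, 1.0, 9.0, 9.5, 10.0]`, the two fitted means settle near the centres of the two groups. Unlike k-means, the responsibilities give each instance a soft, probabilistic membership before the final hard assignment.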