
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 4: Inductive Learning based on Symbolic Representations and Weak Theories

Video 4.5: Cluster Analysis


Agenda for the Lecture
• Cluster Analysis in general
• Hyperparameters
• Distance measures

Categories of clustering algorithms:


• Partitioning–based clustering
• Hierarchical-based Clustering
• Density-based clustering
• Grid-based clustering
• Model-based clustering
Cluster analysis is an important element in unsupervised concept learning, i.e. the learning of multiple concepts from unsorted examples.

Apart from being an important methodology for preprocessing of datasets in the unsupervised machine learning scenario, cluster analysis can be used as a stand-alone technique for particular categorization purposes.

As instances are not classified in the unsupervised scenario, algorithms have to identify commonalities and structures in the data set and group the instances based on similarity.

The detailed concept formation can then be performed by any of the techniques for supervised learning (scenarios 1-10 from an earlier lecture).
Cluster Analysis
Synonyms: Clustering, Conceptual Clustering, Clustering techniques

Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar (in some sense) to each other than to those in other groups (clusters).

Cluster analysis can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to find clusters efficiently.

There are possibly over 100 published clustering algorithms.

Typically, clustering algorithms depend on several hyperparameter settings. Potentially, the parameter settings can also be automated based on separate learning processes.
Examples of hyperparameters that may need to be specified for clustering algorithms (a code sketch follows the list):

1. Number of clusters to establish

2. Number of features used to describe instances

3. Type of distance measure to employ

4. Threshold for the maximum distance between instances, and the minimum number of instances that satisfy that threshold, as one kind of definition of density

5. Alternative density threshold measures

6. Number of sessions for inspection of the data set
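
As an illustration beyond the slides, the snippet below shows how hyperparameters 1, 3 and 4 surface as estimator arguments in scikit-learn; the concrete values are arbitrary examples, not recommendations.

# Illustrative only: arbitrary example values, assuming scikit-learn is installed.
from sklearn.cluster import KMeans, DBSCAN

# Hyperparameter 1: number of clusters to establish (k-means).
kmeans = KMeans(n_clusters=3, random_state=0)

# Hyperparameter 3: type of distance measure; hyperparameter 4: distance
# threshold (eps) and minimum number of instances within that threshold
# (min_samples) as a definition of density.
dbscan = DBSCAN(eps=0.5, min_samples=4, metric="euclidean")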


Distance and Similarity Metrics
Distance metrics were described in the lecture on instance-based learning but also play a crucial role in cluster analysis. A distance metric (measure, function) is typically a real-valued function that quantifies the distance between two objects.

Distance metrics and similarity metrics have been developed more or less independently for different purposes, but usually specific similarity metrics are intuitively inverses of corresponding distance metrics and can be transformed into each other.

We will exemplify by (a code sketch follows the list):

Metrics in a normed Euclidean vector space
• Minkowski distance
• Manhattan or taxicab distance = the Minkowski distance with k=1
• Euclidean distance = the Minkowski distance with k=2
• Chebyshev or chessboard distance = the Minkowski distance with k=∞
• Cosine similarity measure

Metrics based on overlapping elements
• Levenshtein distance
• Jaccard similarity, index or coefficient
• Hamming distance
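
The following sketch, an illustration added to these notes with helper names of our own choosing, computes several of the listed metrics with NumPy:

import numpy as np

def minkowski(x, y, k):
    # Minkowski distance: (sum_i |x_i - y_i|^k)^(1/k)
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])

manhattan = minkowski(x, y, 1)      # k = 1: taxicab distance
euclidean = minkowski(x, y, 2)      # k = 2: Euclidean distance
chebyshev = np.max(np.abs(x - y))   # limit k -> infinity: chessboard distance

# Cosine similarity: based on the angle between vectors, not their distance.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hamming distance between equal-length sequences: number of differing positions.
hamming = sum(a != b for a, b in zip("karolin", "kathrin"))   # 3

# Jaccard similarity between sets: |intersection| / |union|.
A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)                             # 0.5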
Categorization of Clustering Methods

The more than 100 published clustering algorithms can themselves be clustered in many ways. The structure chosen for this lecture follows the five categories listed in the agenda: partitioning-based, hierarchical-based, density-based, grid-based and model-based clustering.
Partitioning-based Clustering

Partitioning algorithms are clustering techniques that subdivide the data set into a set of k clusters.

A majority of partitioning algorithms are based on the selection of prototypical instances, or synonymously centroid instances. These algorithms may be termed centroid clustering techniques. In this approach, the selection of centroids is iteratively optimized and instances are iteratively reallocated to the closest centroid to ultimately form the resulting clusters. The result can be illustrated as a partitioning of the data space in a Voronoi diagram.

Properties of the algorithms:
• the target number of clusters k needs to be preset, a sensitive choice
• initial seeds have a strong impact on the final results
• partitioning may produce tighter clusters than hierarchical approaches

Algorithms:
• K-means clustering
• K-medoids
• CLARA
Partitioning-based clustering as exemplified by the approach in the k-means algorithm

Goal: partition N instances into k clusters.

Steps of the algorithm (a code sketch follows the steps):
1. Select k instances and allocate these as initial means (centroids, prototypes)
2. Calculate the distance (typically Euclidean) from each instance to all the centroids
3. Associate each instance with the closest mean (centroid, prototype)
4. Let the resulting subsets of instances constitute the initial clusters
5. Create new means (centroids, prototypes) as the centroid of all instances in each cluster
6. Recalculate and reallocate all instances. An instance can change cluster when the centroids are recomputed.
7. Reiterate from step 4 until the centroids remain stable.
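
A minimal NumPy sketch of these steps, assuming numeric feature vectors; the variable names and convergence test are our own choices, and empty clusters are not handled:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k instances as initial means (centroids, prototypes).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2-4: distance from each instance to all centroids;
        # associate each instance with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 5-6: new centroid = mean of all instances in each cluster
        # (empty clusters would yield NaN; not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stop when the centroids remain stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids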
Hierarchical-based Clustering or Hierarchical Clustering
Hierarchical clustering is a clustering technique which seeks to build a hierarchy of clusters. The results of hierarchical clustering are usually presented in a dendrogram.

Properties of hierarchical clustering:
• It does not assume a particular value of k, as needed by e.g. k-means clustering.
• The generated tree may correspond to a meaningful taxonomy = concept hierarchy.
• A distance matrix is needed to compute the clustering steps.
• Initial seeds have a strong impact on the final results, as assignments cannot be undone iteratively.
• Very sensitive to outliers.

Algorithms:
• CURE
• BIRCH
• ROCK
• Chameleon
Dendrogram
A dendrogram is a diagram that
shows the hierarchical relationship
between objects.

It is most commonly created as an output from hierarchical clustering.

The main use of a dendrogram is to work out the best way to allocate objects to clusters.
Hierarchical-based Clustering or Hierarchical Clustering
Hierarchical clustering proceeds successively in either an:
• Agglomerative fashion: a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive fashion: a top-down approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Splits and merges are typically performed based on a proximity matrix between clusters. The proximity of two clusters is, e.g., the average of the distances between the instances in the two clusters (average linkage). A proximity matrix for clusters can be calculated from a distance matrix for the instances.

The proximity matrix is recalculated in each step of the algorithm. In general, the merges and splits are determined in a greedy manner (a code sketch follows).
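
As an illustrative sketch, assuming SciPy and Matplotlib are available, agglomerative clustering with average-linkage proximity and the corresponding dendrogram can be produced like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data

# Agglomerative (bottom-up) clustering; "average" linkage defines the
# proximity of two clusters as the average pairwise instance distance.
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)   # visualize the merge hierarchy
plt.show()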
Density-based clustering
Density-based clustering is a clustering technique which groups together instances that are closely packed (instances with many nearby neighbors), marking as outliers instances that lie alone in low-density regions (whose nearest neighbors are far away).

Properties of algorithms:
• Clusters are dense regions in the instance space, separated by regions of lower instance density.
• A cluster is defined as a set of connected instances with maximal density.
• Does not need a predefined target value for the number of clusters, but needs definitions of thresholds for reachability and density.
• Discovers clusters of arbitrary shape.
• Is insensitive to noise.

Examples of algorithms:
• DBSCAN
• OPTICS
Density-based clustering as exemplified with
the approach in DBSCAN
Instances are classified as core instances, reachable instances or outliers (a usage sketch in code follows the figure description):
• A core instance has a minimum number of instances within a threshold radius.
• An instance is density reachable from another instance if it is within the threshold radius of a core instance.
• An instance is density connected to another instance if both instances are density reachable from a third instance, or if they are directly density reachable from each other.
• All instances not reachable from any other instances are considered outliers (possibly noise).
• If p is a core instance, then it forms a cluster together with all instances that are reachable from it. Each cluster contains at least one core instance; non-core points can be part of a cluster, but they form its "edge".
• All points within the cluster are mutually density-connected.
• If a point is density-reachable from any point of the cluster, it is part of the cluster as well.

Figure caption: Point A and the other red instances are core instances, because the area surrounding these instances within an ε radius contains a specified minimum of 4 points. Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
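
A brief usage sketch, assuming scikit-learn; eps and min_samples mirror the ε radius and the minimum of 4 points from the figure:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(100, 2))  # toy data

# eps = threshold radius, min_samples = minimum neighbors for a core instance.
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)

# Cluster ids are 0, 1, ...; outliers (noise points such as N) get label -1.
print(set(labels))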
Grid-based clustering
Grid-based methods quantize the instance space into a finite number of cells (hyper-rectangles) and then perform the required operations on the quantized space.

Typical steps in algorithms (a code sketch follows the list):
• Define a set of grid cells
• Assign instances to grid cells and compute the densities of the cells
• Eliminate cells whose densities are below a certain threshold
• Form clusters from adjacent cells based upon some objective (optimization) function

Examples of algorithms:
• CLIQUE (CLustering In QUEst)
• STING (STatistical INformation Grid)
• WaveCluster
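
A minimal sketch of these steps in two dimensions, with simplifying assumptions of our own: a fixed 10x10 grid, a plain count threshold for density, and 4-connectivity as the adjacency criterion for merging cells:

import numpy as np
from scipy.ndimage import label

X = np.random.default_rng(0).normal(size=(500, 2))

# Steps 1-2: define a grid and compute per-cell densities (instance counts).
H, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# Step 3: eliminate cells whose densities fall below the threshold.
dense = H >= 5

# Step 4: form clusters from adjacent dense cells (4-connected components).
cell_cluster, n_clusters = label(dense)
print(n_clusters, "grid-level clusters")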
Model-based clustering
Model-based clustering means that clustering is based on some model or background knowledge about the domain from which the instances of the dataset are harvested.

The model can be more or less extensive but can in all cases guide the clustering process. Model-based clustering can in principle be an extension to any of the other clustering approaches. If the domain knowledge is statistical information about the distributions of the various kinds of instances involved, one can call this kind of clustering technique distribution-based clustering.

Example of a distribution-based clustering scenario (an illustrative code sketch follows):
• Sample instances arise from a distribution that is a mixture of two or more components.
• Each component is described by a density function and has an associated probability or "weight" in the mixture.
• In principle, we can adopt any probability model for the components, but typically we will assume that the components are p-variate normal distributions.
• Thus, the probability model for clustering will often be a mixture of multivariate normal distributions.
• Each component in the mixture is what we call a cluster.
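
An illustrative sketch of distribution-based clustering, assuming scikit-learn: a mixture of two multivariate normal components is fitted, and each instance is assigned to its most probable component (cluster):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

# Fit a mixture of two p-variate normal components; each component is a cluster.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)    # hard assignment to the most probable component
weights = gmm.weights_     # each component's probability ("weight") in the mixture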
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Thanks for your attention!

The next lecture 4.6 will be on the topic:

Tutorial for Week 4
