
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 4: Inductive Learning based on Symbolic Representations and Weak Theories

Video 4.5: Cluster Analysis


Agenda for the Lecture
• Cluster Analysis in general
• Hyperparameters
• Distance measures

Categories of clustering algorithms:


• Partitioning–based clustering
• Hierarchical-based Clustering
• Density-based clustering
• Grid-based clustering
• Model-based clustering
Cluster analysis is an important element in unsupervised concept learning, i.e. the learning of multiple concepts from unsorted examples.

Apart from being an important methodology for preprocessing of datasets in the unsupervised machine learning scenario, cluster analysis can be used as a stand-alone technique for particular categorization purposes.

As instances are not classified in the unsupervised scenario, algorithms have to identify commonalities and structures in the data set and group the instances based on similarity.

The detailed concept formation can then be performed by any of the techniques for supervised learning (scenarios 1-10 from an earlier lecture).
Cluster Analysis
Synonyms: Clustering, Conceptual Clustering, Clustering techniques

Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar (in some sense) to each other than to those in other groups (clusters).

Cluster analysis can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to find clusters efficiently.

There are possibly over 100 published clustering algorithms.

Typically, clustering algorithms depend on several hyperparameter settings. Potentially, the parameter settings can also be automated based on separate learning processes.
Examples of hyperparameters that may need to be specified for clustering algorithms (a code sketch follows the list):

1. Number of clusters to establish

2. Number of features used to describe instances

3. Type of distance measure to employ

4. Threshold for the maximum distance between instances, and the minimum number of instances that satisfy that threshold, as one kind of definition of density

5. Alternative density threshold measures

6. Number of sessions for inspection of the data set
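
As an illustration beyond the slides, the snippet below shows how hyperparameters 1, 3 and 4 surface as estimator arguments in scikit-learn; the concrete values are arbitrary examples, not recommendations.

# Illustrative only: arbitrary example values, assuming scikit-learn is installed.
from sklearn.cluster import KMeans, DBSCAN

# Hyperparameter 1: number of clusters to establish (k-means).
kmeans = KMeans(n_clusters=3, random_state=0)

# Hyperparameter 3: type of distance measure; hyperparameter 4: distance
# threshold (eps) and minimum number of instances within that threshold
# (min_samples) as a definition of density.
dbscan = DBSCAN(eps=0.5, min_samples=4, metric="euclidean")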


Distance and Similarity Metrics
Distance metrics were described in the lecture on instance-based learning but also play a crucial role in cluster analysis. A distance metric (measure, function) is typically a real-valued function that quantifies the distance between two objects.

Distance metrics and similarity metrics have been developed more or less independently for different purposes, but usually specific similarity metrics are intuitively inverses of corresponding distance metrics and can be transformed into each other.

We will exemplify by (a code sketch follows the list):

Metrics in a normed Euclidean vector space
• Minkowski distance
• Manhattan or taxicab distance = the Minkowski distance with k=1
• Euclidean distance = the Minkowski distance with k=2
• Chebyshev or chessboard distance = the Minkowski distance with k=∞
• Cosine similarity measure

Metrics based on overlapping elements
• Levenshtein distance
• Jaccard similarity, index or coefficient
• Hamming distance
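
The following sketch, an illustration added to these notes with helper names of our own choosing, computes several of the listed metrics with NumPy:

import numpy as np

def minkowski(x, y, k):
    # Minkowski distance: (sum_i |x_i - y_i|^k)^(1/k)
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])

manhattan = minkowski(x, y, 1)      # k = 1: taxicab distance
euclidean = minkowski(x, y, 2)      # k = 2: Euclidean distance
chebyshev = np.max(np.abs(x - y))   # limit k -> infinity: chessboard distance

# Cosine similarity: based on the angle between vectors, not their distance.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hamming distance between equal-length sequences: number of differing positions.
hamming = sum(a != b for a, b in zip("karolin", "kathrin"))   # 3

# Jaccard similarity between sets: |intersection| / |union|.
A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)                             # 0.5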
Categorization of Clustering Methods

The more than 100 published clustering algorithms can themselves be clustered in many ways. The structure chosen for this lecture follows the five categories listed in the agenda: partitioning-based, hierarchical-based, density-based, grid-based and model-based clustering.
Partitioning-based Clustering

Partitioning algorithms are clustering techniques that subdivide the data set into a set of k clusters.

A majority of partitioning algorithms are based on the selection of prototypical instances, or synonymously centroid instances. These algorithms may be termed centroid clustering techniques. In this approach, the selection of centroids is iteratively optimized and instances are iteratively reallocated to the closest centroid to ultimately form the resulting clusters. The result can be illustrated as a partitioning of the data space in a Voronoi diagram.

Properties of the algorithms:
• the target number of clusters k needs to be preset, a sensitive choice
• initial seeds have a strong impact on the final results
• partitioning may produce tighter clusters than hierarchical approaches

Algorithms:
• K-means clustering
• K-medoids
• CLARA
Partitioning-based clustering as exemplified by the approach in the k-means algorithm

Goal: partition N instances into k clusters.

Steps of the algorithm (a code sketch follows the steps):
1. Select k instances and allocate these as initial means (centroids, prototypes)
2. Calculate the distance (typically Euclidean) from each instance to all the centroids
3. Associate each instance with the closest mean (centroid, prototype)
4. Let the resulting subsets of instances constitute the initial clusters
5. Create new means (centroids, prototypes) as the centroid of all instances in each cluster
6. Recalculate and reallocate all instances. An instance can change cluster when the centroids are recomputed.
7. Reiterate from step 4 until the centroids remain stable.
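
A minimal NumPy sketch of these steps, assuming numeric feature vectors; the variable names and convergence test are our own choices, and empty clusters are not handled:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k instances as initial means (centroids, prototypes).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2-4: distance from each instance to all centroids;
        # associate each instance with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 5-6: new centroid = mean of all instances in each cluster
        # (empty clusters would yield NaN; not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stop when the centroids remain stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids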
Hierarchical-based Clustering or Hierarchical Clustering
Hierarchical clustering is a clustering technique which seeks to build a hierarchy of clusters. The results of hierarchical clustering are usually presented in a dendrogram.

Properties of hierarchical clustering:
• It does not assume a particular value of k, as needed by e.g. k-means clustering.
• The generated tree may correspond to a meaningful taxonomy = concept hierarchy.
• A distance matrix is needed to compute the clustering steps.
• Initial seeds have a strong impact on the final results, as assignments cannot be undone iteratively.
• Very sensitive to outliers.

Algorithms:
• CURE
• BIRCH
• ROCK
• Chameleon
Dendrogram
A dendrogram is a diagram that
shows the hierarchical relationship
between objects.

It is most commonly created as an output from hierarchical clustering.

The main use of a dendrogram is to work out the best way to allocate objects to clusters.
Hierarchical-based Clustering or Hierarchical Clustering
Hierarchical clustering proceeds successively in either an:
• Agglomerative fashion: a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive fashion: a top-down approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Splits and merges are typically performed based on a proximity matrix between clusters. The proximity of two clusters is, e.g., the average of the distances between the instances in the two clusters (average linkage). A proximity matrix for clusters can be calculated from a distance matrix for the instances.

The proximity matrix is recalculated in each step of the algorithm. In general, the merges and splits are determined in a greedy manner (a code sketch follows).
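
As an illustrative sketch, assuming SciPy and Matplotlib are available, agglomerative clustering with average-linkage proximity and the corresponding dendrogram can be produced like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data

# Agglomerative (bottom-up) clustering; "average" linkage defines the
# proximity of two clusters as the average pairwise instance distance.
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)   # visualize the merge hierarchy
plt.show()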
Density-based clustering
Density-based clustering is a clustering technique which groups together instances that are closely packed (instances with many nearby neighbors), marking as outliers instances that lie alone in low-density regions (whose nearest neighbors are far away).

Properties of algorithms:
• Clusters are dense regions in the instance space, separated by regions of lower instance density.
• A cluster is defined as a set of connected instances with maximal density.
• Does not need a predefined target value for the number of clusters, but needs definitions of thresholds for reachability and density.
• Discovers clusters of arbitrary shape.
• Is insensitive to noise.

Examples of algorithms:
• DBSCAN
• OPTICS
Density-based clustering as exemplified with
the approach in DBSCAN
Instances are classified as core instances, reachable instances or outliers (a usage sketch in code follows the figure description):
• A core instance has a minimum number of instances within a threshold radius.
• An instance is density reachable from another instance if it is within the threshold radius of a core instance.
• An instance is density connected to another instance if both instances are density reachable from a third instance, or if they are directly density reachable from each other.
• All instances not reachable from any other instances are considered outliers (possibly noise).
• If p is a core instance, then it forms a cluster together with all instances that are reachable from it. Each cluster contains at least one core instance; non-core points can be part of a cluster, but they form its "edge".
• All points within the cluster are mutually density-connected.
• If a point is density-reachable from any point of the cluster, it is part of the cluster as well.

Figure caption: Point A and the other red instances are core instances, because the area surrounding these instances within an ε radius contains a specified minimum of 4 points. Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
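
A brief usage sketch, assuming scikit-learn; eps and min_samples mirror the ε radius and the minimum of 4 points from the figure:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(100, 2))  # toy data

# eps = threshold radius, min_samples = minimum neighbors for a core instance.
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)

# Cluster ids are 0, 1, ...; outliers (noise points such as N) get label -1.
print(set(labels))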
Grid-based clustering
Grid-based methods quantize the instance space into a finite number of cells (hyper-rectangles) and then perform the required operations on the quantized space.

Typical steps in algorithms (a code sketch follows the list):
• Define a set of grid cells
• Assign instances to grid cells and compute the densities of the cells
• Eliminate cells whose densities are below a certain threshold
• Form clusters from adjacent cells based upon some objective (optimization) function

Examples of algorithms:
• CLIQUE (CLustering In QUEst)
• STING (STatistical INformation Grid)
• WaveCluster
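
A minimal sketch of these steps in two dimensions, with simplifying assumptions of our own: a fixed 10x10 grid, a plain count threshold for density, and 4-connectivity as the adjacency criterion for merging cells:

import numpy as np
from scipy.ndimage import label

X = np.random.default_rng(0).normal(size=(500, 2))

# Steps 1-2: define a grid and compute per-cell densities (instance counts).
H, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# Step 3: eliminate cells whose densities fall below the threshold.
dense = H >= 5

# Step 4: form clusters from adjacent dense cells (4-connected components).
cell_cluster, n_clusters = label(dense)
print(n_clusters, "grid-level clusters")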
Model-based clustering
Model-based clustering means that clustering is based on some model or background knowledge about the domain from which the instances of the dataset are harvested.

The model can be more or less extensive but can in all cases guide the clustering process. Model-based clustering can in principle be an extension to any of the other clustering approaches. If the domain knowledge is statistical information about the distributions of the various kinds of instances involved, one can call this kind of clustering technique distribution-based clustering.

Example of a distribution-based clustering scenario (an illustrative code sketch follows):
• Sample instances arise from a distribution that is a mixture of two or more components.
• Each component is described by a density function and has an associated probability or "weight" in the mixture.
• In principle, we can adopt any probability model for the components, but typically we will assume that the components are p-variate normal distributions.
• Thus, the probability model for clustering will often be a mixture of multivariate normal distributions.
• Each component in the mixture is what we call a cluster.
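
An illustrative sketch of distribution-based clustering, assuming scikit-learn: a mixture of two multivariate normal components is fitted, and each instance is assigned to its most probable component (cluster):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

# Fit a mixture of two p-variate normal components; each component is a cluster.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)    # hard assignment to the most probable component
weights = gmm.weights_     # each component's probability ("weight") in the mixture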
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Thanks for your attention!

The next lecture 4.6 will be on the topic:

Tutorial for Week 4
