Clustering
November 2023
Outline
1. Clustering - General Concepts
   Main idea, real-life applications, types
Motivating Example: Customer Segmentation
Clusters:
● Demographic
● Geographic
● Behavioral
● Psychographic
What is Cluster Analysis or Clustering?
Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
● Inter-cluster distances are maximized
● Intra-cluster distances are minimized
Real-life Applications: Google News
Real-life Applications: Anomaly Detection
Source: https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
Real-life Applications: Sport Science
Source: https://www.americansocceranalysis.com/home/2019/3/11/using-k-means-to-learn-what-soccer-passing-tells-us-about-playing-styles
Real-life Applications: Image Segmentation
Source: http://pixelsciences.blogspot.com/2017/07/image-segmentation-k-means-clustering.html
Real-life Applications: Recommendation
● Cluster-based ranking
● Group recommendation
● …
What Affects Cluster Analysis?
[Diagram: the Data and the Algorithm together drive the Clustering, which produces the Clusters]
Characteristics of the Input Data Are Important
● High dimensionality
○ Dimensionality reduction
● Types of attributes
○ Binary, discrete, continuous, asymmetric
○ Mixed attribute types (e.g., continuous & nominal)
● Differences in attribute scales
○ Normalization techniques
● Size of data set
● Noise and Outliers
● Properties of the data space
Characteristics of Clusters
● Data distribution
○ Parametric models
● Shape
○ Globular or arbitrary shape
● Differing sizes
● Differing densities
● Level of separation among clusters
○ Distance metrics
● Relationship among clusters
● Subspace clusters
How to Measure the Similarity/Distance?
Source: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
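To make these concrete, here is a minimal NumPy/SciPy sketch of three common measures (the vectors are illustrative, not from the source):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.linalg.norm(a - b)        # straight-line (L2) distance
manhattan = np.abs(a - b).sum()          # city-block (L1) distance
cosine_d = distance.cosine(a, b)         # 1 - cosine similarity (~0 here: same direction)

print(euclidean, manhattan, cosine_d)
```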
Notion of a Cluster can be Ambiguous
Types of Clusterings
Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/
Partitional Clustering
Hierarchical Clustering
[Figure: clustering dendrogram]
Fuzzy Clustering
Density-based Clustering
Non-linear separation
Model-based Clustering
2. Typical Clustering Algorithms
   Intuition, main idea, limitations
Typical Clustering Algorithms
◎ Partitional Clustering
○ K-Means & Variants
◎ Hierarchical Clustering
○ HAC
◎ Density-based Clustering
○ DBSCAN
K-Means Clustering: An Example
K-Means Clustering
● Main idea: Each point is assigned to the cluster with the closest centroid
● Number of clusters, K, must be specified
● Objective: minimize the Sum of Squared Error (SSE); see the formula below
● Complexity: O(n * K * I * d)
○ n = number of points, K = number of clusters,
○ I = number of iterations, d = number of attributes
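For reference, the SSE objective that K-means minimizes is commonly written as

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2$$

where $C_i$ is the i-th cluster and $c_i$ is its centroid.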
Elbow Method for Optimal Value of K
WCSS (within-cluster sum of squares) is the sum of squared distances between each point and the centroid of its cluster.
The plot of WCSS versus K changes slope sharply at a point called the elbow point, which suggests a good value for K.
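A minimal scikit-learn sketch of the elbow method (the data matrix `X` is random and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                 # illustrative data: (n_samples, n_features)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)               # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(ks, wcss, marker="o")             # look for the 'elbow' in this curve
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()
```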
Two different K-means Clusterings
[Figure: original points, an optimal clustering, and a sub-optimal clustering]
Importance of Choosing Initial Centroids
Solutions to Initial Centroids Problem
● Multiple runs
○ Helps, but probability is not on your side
● Use a strategy to select candidate initial centroids and then choose among these candidates
○ Select most widely separated, e.g., K-means++
○ Use hierarchical clustering to determine initial centroids
● Bisecting K-Means
○ Not as susceptible to initialization issues
K-Means++
1. Choose one center uniformly at random among the data points.
2. For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard K-Means clustering.
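A minimal NumPy sketch of this seeding procedure (illustrative only; in practice scikit-learn's `KMeans(init="k-means++")` already implements it):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers from X with the K-means++ weighting."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                    # step 1: first center uniformly at random
    for _ in range(1, k):
        C = np.array(centers)
        # step 2: D(x)^2 = squared distance to the nearest already-chosen center
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # step 3: sample the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)                          # step 4 happens inside the loop

centers = kmeans_pp_init(np.random.rand(100, 2), k=3)
```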
Bisecting K-Means
Limitations of K-means: Differing Sizes
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Hierarchical Agglomerative Clustering
● Main Idea:
○ Start with the points as individual clusters
○ At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left
● Key operation is the computation of the proximity of two clusters
○ Worst-case complexity: O(N³)
HAC: Algorithm
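As an illustrative sketch of the agglomerative procedure described above, using SciPy (data and parameters are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)                        # illustrative points
Z = linkage(X, method="single")                  # agglomerative merges, closest pair first

dendrogram(Z)                                    # visualize the merge order
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into K = 3 clusters
```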
Closest Pair of Clusters
● Many variants for defining the closest pair of clusters
● Single-link
○ Similarity of the closest elements
● Complete-link
○ Similarity of the “furthest” points
● Average-link
○ Average cosine similarity between pairs of elements
● Ward’s Method
○ The increase in squared error when two clusters are merged
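For reference, these variants map directly onto the `method` argument of SciPy's `linkage` (a sketch; Ward additionally assumes Euclidean distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(20, 2)                        # illustrative points
Z_single   = linkage(X, method="single")         # MIN: closest elements
Z_complete = linkage(X, method="complete")       # MAX: furthest elements
Z_average  = linkage(X, method="average")        # average pairwise proximity
Z_ward     = linkage(X, method="ward")           # increase in squared error when merging
```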
HAC - Single-link (MIN)
[Figure: nested clusters and the corresponding dendrogram]
HAC - Complete-link (MAX)
[Figure: nested clusters and the corresponding dendrogram]
HAC - Average-link
[Figure: nested clusters and the corresponding dendrogram]
HAC: Limitations
● Once two clusters are combined, the merge cannot be undone
● No global objective function is directly minimized
● Typical problems:
○ Sensitivity to noise
○ Difficulty handling clusters of different sizes and non-globular shapes
○ Breaking large clusters
Density-based Clustering - DBSCAN
DBSCAN: Algorithm
How to Determine Points? (example: MinPts = 7)
● Core point: has at least a specified number of points (MinPts) within Eps
● Border point: not a core point, but is in the neighborhood of a core point
● Noise point: any point that is not a core point or a border point
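A minimal scikit-learn sketch that also recovers the three point types (parameter values are arbitrary examples, not the slide's):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2)                       # illustrative data
db = DBSCAN(eps=0.1, min_samples=7).fit(X)

labels = db.labels_                              # cluster id per point; -1 marks noise
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True             # core points
border = ~core & (labels != -1)                  # assigned to a cluster but not core
noise = labels == -1
```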
DBSCAN: Core, Border and Noise Points
DBSCAN: How to Determine Eps, MinPts?
Intuition:
● Core point: its k-th nearest neighbor is at a close distance.
● Noise point: its k-th nearest neighbor is at a far distance.
Plot the sorted distance of every point to its k-th nearest neighbor; the sharp bend in this curve suggests a value for Eps.
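A sketch of this heuristic with scikit-learn's NearestNeighbors (k plays the role of MinPts; the data is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(300, 2)
k = 4                                            # MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
dist, _ = nn.kneighbors(X)

plt.plot(np.sort(dist[:, -1]))                   # sorted distance to the k-th nearest neighbor
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to k-th nearest neighbor")
plt.show()                                       # Eps ≈ the distance at the sharp bend
```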
DBSCAN: Limitations
● Varying densities
● High-dimensional data
[Figure: original points clustered with (MinPts=4, Eps=9.92) and with (MinPts=4, Eps=9.75)]
Which Clustering Algorithm?
● Type of Clustering
● Type of Cluster
○ Prototype vs connected regions vs density-based
● Characteristics of Clusters
● Characteristics of Data Sets and Attributes
● Noise and Outliers
● Number of Data Objects
● Number of Attributes
● Algorithmic Considerations
A Comparison of Clustering Algorithms
Summary
◎ General Concepts of Clustering
○ Definition
○ Real-life Applications
○ Types of Clustering
◎ Typical Clustering Algorithms
○ K-Means
○ HAC
○ DBSCAN
Dimensionality Reduction
Curse of Dimensionality
The number of training examples required increases exponentially with the dimensionality d (i.e., as k^d, where k is the number of bins per feature). For example, with k = 3 bins per feature, the feature space is divided into 3¹, 3², and 3³ cells for d = 1, 2, and 3.
Dimensionality Reduction
● Feature extraction: extract features from the raw samples (this may increase the number of features)
● Feature selection: select the relevant features from the full set of features
● Dimensionality reduction: find a proper mapping to a lower-dimensional space
The mapping f() could be linear or non-linear. Feature selection keeps a subset of the original coordinates:

$$x = [x_1, x_2, \ldots, x_N]^T \;\rightarrow\; [x_{i_1}, x_{i_2}, \ldots, x_{i_K}]^T, \quad K \ll N$$

while dimensionality reduction applies a general mapping f:

$$x = [x_1, x_2, \ldots, x_N]^T \;\xrightarrow{\;f(x)\;}\; y = [y_1, y_2, \ldots, y_K]^T, \quad K \ll N$$
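When f() is linear, the mapping is just a matrix multiplication; a minimal NumPy sketch (W here is an arbitrary illustrative matrix, whereas PCA below derives it from the data):

```python
import numpy as np

N, K = 10, 3                       # original and reduced dimensionality, K << N
x = np.random.rand(N)              # an N-dimensional sample
W = np.random.rand(K, N)           # a linear map f(x) = W x
y = W @ x                          # the K-dimensional representation
```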
Dimensionality Reduction
PCA achieves this by:
(1) Solving the eigenvalue problem $\Sigma_x u_i = \lambda_i u_i$
(2) Choosing the K “largest” eigenvectors $u_i$ (corresponding to the K “largest” eigenvalues $\lambda_i$)
PCA - Steps
Suppose we are given M vectors $x_1, x_2, \ldots, x_M$, each of size $N \times 1$.
Step 1: compute the sample mean $\bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i$
Step 2: subtract the mean: $\Phi_i = x_i - \bar{x}$
Step 3: compute the sample covariance matrix $\Sigma_x$ (the usual $\hat{S} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^T$ written in terms of the centered vectors):

$$\Sigma_x = \frac{1}{M}\sum_{i=1}^{M}\Phi_i\Phi_i^T = \frac{1}{M}AA^T, \quad \text{where } A = [\Phi_1\ \Phi_2\ \cdots\ \Phi_M] \ (N \times M \text{ matrix})$$
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of $\Sigma_x$:

$$\Sigma_x u_i = \lambda_i u_i$$

The projection of a centered vector onto eigenvector $u_i$ is

$$y_i = \frac{(x-\bar{x})^T u_i}{u_i^T u_i} = (x-\bar{x})^T u_i \quad \text{if } \|u_i\| = 1 \text{ (normalized)}$$
PCA - Steps
Step 5: approximation with the transformation matrix U (using the first K eigenvectors)

$$x - \bar{x} = \sum_{i=1}^{N} y_i u_i = y_1 u_1 + y_2 u_2 + \cdots + y_N u_N$$

$$\hat{x} - \bar{x} = \sum_{i=1}^{K} y_i u_i = y_1 u_1 + y_2 u_2 + \cdots + y_K u_K$$

or, in matrix form,

$$\hat{x} - \bar{x} = U \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix}, \quad \text{where } U = [u_1\ u_2\ \cdots\ u_K] \ \text{is } N \times K$$
Example
Compute the PCA for the dataset (1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)
The eigenvectors are the solutions of the system $\Sigma_x u_i = \lambda_i u_i$.
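A NumPy sketch that carries out the steps above on this dataset (illustrative; the numbers it prints are whatever `numpy.linalg.eigh` returns, not values quoted from the slides):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

mean = X.mean(axis=0)                    # step 1: sample mean
Phi = X - mean                           # step 2: subtract the mean
Sigma = (Phi.T @ Phi) / len(X)           # step 3: covariance matrix (1/M) A A^T
lam, U = np.linalg.eigh(Sigma)           # step 4: eigenvalues/eigenvectors (ascending)
lam, U = lam[::-1], U[:, ::-1]           # largest eigenvalue first
y = Phi @ U[:, :1]                       # step 5: project onto the first K = 1 eigenvector
print(lam, U, y, sep="\n")
```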
How do we choose K ?
• K is typically chosen based on how much information (variance) we want to preserve:

$$\frac{\sum_{i=1}^{K}\lambda_i}{\sum_{i=1}^{N}\lambda_i} > T, \quad \text{where } T \text{ is a threshold (e.g., 0.9)}$$

• If K = N, then we “preserve” 100% of the information in the data (i.e., it is just a change of basis)
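A sketch of applying this criterion to eigenvalues sorted in decreasing order (the values and the threshold are just examples):

```python
import numpy as np

lam = np.array([4.2, 1.1, 0.4, 0.2, 0.1])   # example eigenvalues, largest first
T = 0.9
ratio = np.cumsum(lam) / lam.sum()          # variance preserved by the first K components
K = int(np.argmax(ratio > T)) + 1           # smallest K whose ratio exceeds T (here K = 3)
```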
Approximation Error
• The approximation error (or reconstruction error) can be computed as:

$$e = \|x - \hat{x}\|, \quad \text{where } \hat{x} = \bar{x} + \sum_{i=1}^{K} y_i u_i$$
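Continuing the NumPy sketch from the PCA example above (reusing `X`, `mean`, `Phi`, and `U` defined there), the reconstruction and its error can be computed as:

```python
import numpy as np

K = 1
y = Phi @ U[:, :K]                          # coordinates in the K-dimensional basis
X_hat = mean + y @ U[:, :K].T               # reconstruction x_hat from the first K eigenvectors
err = np.linalg.norm(X - X_hat, axis=1)     # approximation error ||x - x_hat|| per sample
```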