
Unsupervised Learning

November 2023
Outline

◎ Part I: Clustering - General Concepts
○ Real-life Applications
○ Types of Clusterings
◎ Part II: Typical Clustering Algorithms

2
1. Clustering - General Concepts
Main idea, real-life applications, types
3
Motivating Example: Customer Segmentation

Clusters

● Demographic
● Geographic
● Behavioral
● Psychographic
4
What is Cluster Analysis or Clustering?
Given a set of objects, place them in groups such that the objects in
a group are similar (or related) to one another and different from
(or unrelated to) the objects in other groups

Inter-cluster distances
are maximized

Intra-cluster distances
are minimized
5
Real-life Applications: Google News

6
Real-life Applications: Anomaly Detection

● Fake News Detection
● Fraud Detection
● Spam Email Detection

Source: https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7

7
Real-life Applications: Sport Science

Find players with similar styles

Source: https://www.americansocceranalysis.com/home/2019/3/11/using-k-means-to-learn-what-soccer-passing-tells-us-about-playing-styles
8
Real-life Applications: Image Segmentation

Source: http://pixelsciences.blogspot.com/2017/07/image-segmentation-k-means-clustering.html

9
Real-life Applications: Recommendation

● Cluster-based ranking
● Group recommendation
● …

10
What Affects Cluster Analysis?

● Data
● Clustering algorithm
● Desired clusters

11
Characteristics of the Input Data Are Important
● High dimensionality
○ Dimensionality reduction
● Types of attributes
○ Binary, discrete, continuous, asymmetric
○ Mixed attribute types (e.g., continuous & nominal)
● Differences in attribute scales
○ Normalization techniques
● Size of data set
● Noise and Outliers
● Properties of the data space

12
Characteristics of Cluster
● Data distribution
○ Parametric models
● Shape
○ Globular or arbitrary shape
● Differing sizes
● Differing densities
● Level of separation among clusters (distance metrics)
● Relationship among clusters
● Subspace clusters

13
How to Measure the Similarity/Distance?

Source: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa 14
Notion of a Cluster can be Ambiguous

15
Types of Clusterings

Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/

16
Partitional Clustering

Data objects are separated into non-overlapping subsets, i.e., clusters.

17
Hierarchical Clustering

Data objects are organized into nested clusters that form a hierarchical tree.

Hierarchical Clustering

Clustering dendrogram

18
Fuzzy Clustering

Fuzzy clustering, i.e., soft clustering, is a form of clustering in which each data point can belong to more than one cluster, with a membership weight for each.

19
Density-based Clustering

A cluster is a dense region of points that is separated from other high-density regions by regions of low density.

Non-linear separation

20
Model-based Clustering

Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data.

Gaussian Mixture Model
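A minimal sketch of model-based clustering with a Gaussian mixture, assuming scikit-learn is available; the data and the number of components are illustrative choices, not part of the original slides:

# Gaussian mixture sketch: fit a generative model, then read hard and soft assignments.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # recover the generating model
labels = gmm.predict(X)                                        # hard cluster assignments
probs = gmm.predict_proba(X)                                   # per-component membership probabilities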

21
2. Typical Clustering Algorithms
Intuition, Main Idea, Limitations

22
Typical Clustering Algorithms

◎ Partitional Clustering
○ K-Means & Variants
◎ Hierarchical Clustering
○ HAC
◎ Density-based Clustering
○ DBSCAN

23
K-Means Clustering: An Example

24
K-Means Clustering

● Main idea: Each point is assigned to the cluster with the closest centroid
● Number of clusters, K, must be specified
● Objective: minimize the Sum of Squared Error (SSE); see the sketch below
● Complexity: O(n * K * I * d)
○ n = number of points, K = number of clusters,
○ I = number of iterations, d = number of attributes
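For illustration, a minimal K-Means run with scikit-learn; the library choice, the synthetic data from make_blobs, and all parameter values are assumptions added here, not taken from the slides:

# Minimal K-Means sketch (K must be specified up front).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic 2-D points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)      # K = 3
labels = kmeans.fit_predict(X)                                 # assign each point to the closest centroid
print(kmeans.inertia_)                                         # SSE of the resulting clustering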

25
Elbow Method for Optimal Value of K

WCSS is the sum of squared distances between each point and the centroid of its cluster.
The curve changes rapidly at a point called the elbow point, which suggests a reasonable value of K; see the sketch below.
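A sketch of the elbow method, assuming scikit-learn and matplotlib; WCSS is read from the fitted model's inertia_ attribute, and the data are synthetic:

# Elbow method sketch: compute WCSS for several K and look for the bend.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()                      # the elbow point suggests a reasonable K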
26
Two different K-means Clusterings

Figure: Original Points, an optimal clustering, and a sub-optimal clustering

27
Importance of Choosing Initial Centroids

28
Solutions to Initial Centroids Problem
● Multiple runs
○ Helps, but probability is not on your side
● Use some strategies to select the k initial centroids and then
select among these initial centroids
○ Select most widely separated, e.g., K-means++
○ Use hierarchical clustering to determine initial centroids
● Bisecting K-Means
○ Not as susceptible to initialization issues

29
K-Means++
1. Choose one center uniformly at random among the data points.
2. For each data point x not chosen yet, compute D(x), the distance
between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a
weighted probability distribution where a point x is chosen with
probability proportional to D(x)².
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed with
standard K-Means clustering (a sketch of this seeding follows below).
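A NumPy-only sketch of the seeding steps above; the function name and data handling are illustrative assumptions:

# K-Means++ seeding sketch following the numbered steps.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # Step 1: first center uniformly at random
    while len(centers) < k:                              # Step 4: repeat until k centers are chosen
        # Step 2: D(x) = distance from each point to its nearest already-chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)                             # Step 5: use these to seed standard K-Means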

30
Bisecting K-Means

It is a variant of K-Means that can produce a partitional or a hierarchical clustering.
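A minimal sketch, assuming a recent scikit-learn (1.1 or newer) that ships BisectingKMeans; data and parameters are illustrative:

# Bisecting K-Means sketch (requires scikit-learn >= 1.1).
from sklearn.cluster import BisectingKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = BisectingKMeans(n_clusters=3, random_state=0).fit_predict(X)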

31
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

32
Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

33
Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

34
Hierarchical Agglomerative Clustering

dendrogram

● Main Idea:
○ Start with the points as individual clusters
○ At each step, merge the closest pair of clusters until only one
cluster (or K clusters) is left
● Key operation is the computation of the proximity of two clusters
○ Worst-case Complexity: O(N³)
35
HAC: Algorithm
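A naive Python sketch of the agglomerative procedure; single-link proximity and Euclidean distance are assumptions here, and real implementations update a proximity matrix instead of rescanning all pairs:

# Naive single-link HAC sketch: repeatedly merge the closest pair of clusters.
import numpy as np

def hac_single_link(X, n_clusters=1):
    clusters = [[i] for i in range(len(X))]              # start: every point is its own cluster
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):                   # scan all cluster pairs for the closest one
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])   # single-link proximity
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])                  # merge the closest pair
        del clusters[b]
    return clusters                                      # lists of point indices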

36
Closest Pair of Clusters
● Many variants for defining the closest pair of clusters (compared in the sketch below)
● Single-link
○ Similarity of the closest elements
● Complete-link
○ Similarity of the “furthest” points
● Average-link
○ Average cosine between pairs of elements
● Ward’s Method
○ The increase in squared error when two clusters are merged
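These variants map onto SciPy's hierarchical clustering API; a sketch assuming Euclidean distances (note that the slide's average-link uses cosine similarity, whereas SciPy's "average" method here averages Euclidean distances):

# Comparing linkage criteria with SciPy (illustrative random data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                        # how cluster-to-cluster proximity is defined
    labels = fcluster(Z, t=3, criterion="maxclust")      # cut the hierarchy into 3 flat clusters
# dendrogram(Z) plots the hierarchy with matplotlib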

37
HAC - Single-link (MIN)
Figure: Nested Clusters (left) and Dendrogram (right) for six example points

38
HAC - Complete-link (MAX)
Figure: Nested Clusters (left) and Dendrogram (right) for six example points

39
HAC - Average-link

Figure: Nested Clusters (left) and Dendrogram (right) for six example points

40
HAC: Limitations
● Once two clusters are combined, the merge cannot be undone
● No global objective function is directly minimized
● Typical Problems:
○ Sensitivity to noise
○ Difficulty handling clusters of different
sizes and non-globular shapes
○ Breaking large clusters

41
Density-based Clustering - DBSCAN

● Main Idea: Clusters are regions of high density that are separated from one
another by regions of low density.
● Density = number of points within
a specified radius (Eps)
○ Core point
○ Border point
○ Noise point

42
DBSCAN: Algorithm
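A naive O(n²) Python sketch of the procedure; this is illustrative only, and library implementations use spatial indexes to find neighborhoods efficiently:

# Naive DBSCAN sketch: label core points, then grow clusters from them.
import numpy as np

def dbscan(X, eps, min_pts):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(len(X))]
    core = np.array([len(nb) >= min_pts for nb in neighbors])       # core-point test
    labels = np.full(len(X), -1)                                    # -1 = noise (until assigned)
    cluster = 0
    for i in range(len(X)):
        if not core[i] or labels[i] != -1:
            continue                                                # start clusters only from unlabeled core points
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:                                                # expand to all density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:
                    queue.extend(neighbors[j])                      # only core points keep expanding
        cluster += 1
    return labels                                                   # cluster ids; -1 marks noise points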

43
How to Determine Points?

MinPts = 7

● Core point: Has at least a specified number of points (MinPts) within Eps
● Border point: not a core point, but is in the neighborhood of a core point
● Noise point: any point that is not a core point or a border point (see the sketch below)
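With scikit-learn, the three point types can be recovered from a fitted model; the data set and the Eps/MinPts values below are illustrative assumptions:

# Recovering core, border and noise points from scikit-learn's DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
db = DBSCAN(eps=0.2, min_samples=7).fit(X)       # Eps and MinPts

core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True             # core points
noise = db.labels_ == -1                         # noise points
border = ~core & ~noise                          # in a cluster, but not core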

44
DBSCAN: Core, Border and Noise Points

Figure: Original Points and point types (core, border, and noise) with Eps = 10, MinPts = 4

45
DBSCAN: How to Determine Eps, MinPts?

Intuition:
● Core point: the k-th nearest
neighbors are at a close distance.
● Noise point: the k-th nearest
neighbors are at a far distance.
Plot the sorted distance of every point to its k-th nearest neighbor (see the sketch below).
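A sketch of this k-distance plot with scikit-learn; the synthetic data and the choice k = MinPts = 4 are assumptions:

# k-distance plot sketch: sorted distance of every point to its k-th nearest neighbor.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
k = 4                                               # e.g., k = MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)     # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
kth = np.sort(distances[:, -1])                     # k-th neighbor distance, sorted ascending

plt.plot(kth)
plt.ylabel("distance to k-th nearest neighbor")
plt.show()                                          # the knee of this curve suggests Eps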

46
DBSCAN: Limitations
● Varying densities
● High-dimensional data

Figure: Original Points and DBSCAN results with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)

47
Which Clustering Algorithm?
● Type of Clustering
● Type of Cluster
○ Prototype vs connected regions vs density-based
● Characteristics of Clusters
● Characteristics of Data Sets and Attributes
● Noise and Outliers
● Number of Data Objects
● Number of Attributes
● Algorithmic Considerations

48
A Comparison of Clustering Algorithms

Source: Text Clustering Algorithms: A Review

49
Summary
◎ General Concepts of Clustering
○ Definition
○ Real-life Applications
○ Types of Clustering
◎ Typical Clustering Algorithms
○ K-Means
○ HAC
○ DBSCAN

50
Dimensionality Reduction
Curse of Dimensionality
The number of training examples required increases exponentially with the
dimensionality d (i.e., as k^d, where k is the number of bins per feature).

We have to choose the right set of features.

Figure: a feature space partitioned into 3^1, 3^2, and 3^3 bins as d grows from 1 to 3
Dimensionality Reduction

What is the objective?

Choose an optimum set of features of lower dimensionality to improve
classification accuracy.

53
Dimensionality Reduction
Feature extraction: Extract features from sample (increase features)
Feature selection: get relevant features from the set of features
Dimensionality Reduction: get a proper mapping to a lower-dimensional space.
The mapping f() could be linear or non-linear.

Feature selection keeps a subset of the original attributes:

x = [x_1, x_2, ..., x_N]^T  ->  [x_{i_1}, x_{i_2}, ..., x_{i_K}]^T,   K << N

Dimensionality reduction maps x to a new K-dimensional representation:

x = [x_1, x_2, ..., x_N]^T  --f(x)-->  y = [y_1, y_2, ..., y_K]^T,   K << N
54
Dimensionality Reduction

Linear combinations are faster and easier to optimize.

Given x ∈ R^N, find an N x K matrix U such that:

y = U^T x ∈ R^K,  where K < N

This is a projection from the N-dimensional space to a K-dimensional space:

x = [x_1, x_2, ..., x_N]^T  --f(x)-->  y = [y_1, y_2, ..., y_K]^T
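A tiny NumPy illustration of such a projection; the matrix U here is an arbitrary orthonormal N x K matrix obtained from a QR factorization, purely for demonstration:

# Projection sketch: y = U^T x maps x in R^N to y in R^K.
import numpy as np

N, K = 10, 3
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(N, K)))   # N x K matrix with orthonormal columns
x = rng.normal(size=N)
y = U.T @ x                                    # K-dimensional representation of x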
55
Principal Component Analysis (PCA)
• Let's recall the linear algebra: a data point x ∈ R^N can be written as a linear combination
of an orthonormal set of N basis vectors <v_1, v_2, ..., v_N> in R^N:

x = \sum_{i=1}^{N} x_i v_i = x_1 v_1 + x_2 v_2 + ... + x_N v_N

where v_i^T v_j = 1 if i = j and 0 otherwise, and x_i = \frac{x^T v_i}{v_i^T v_i} = x^T v_i

• PCA seeks to represent x in a new space of lower dimensionality using only K basis vectors
(K < N):

\hat{x} = \sum_{i=1}^{K} y_i u_i = y_1 u_1 + y_2 u_2 + ... + y_K u_K

such that ||x - \hat{x}|| is minimized for all x ∈ D (i.e., minimize information loss)
56
Principal Component Analysis (PCA)
How should we determine the “optimal” lower dimensional space basis
vectors <u1, u2, …,uK> ?

The optimal space of lower dimensionality can be defined by:

(1) Finding the eigenvectors u_i of the covariance matrix of the data Σ_x:

\Sigma_x u_i = \lambda_i u_i

(2) Choosing the K "largest" eigenvectors u_i (corresponding to the K largest eigenvalues λ_i)

57
PCA - Steps
Suppose we are given M vectors x_1, x_2, ..., x_M, each of size N x 1.

Step 1: compute the sample mean

\bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i

Step 2: subtract the sample mean (i.e., center the data at zero)

\Phi_i = x_i - \bar{x}

Step 3: compute the sample covariance matrix Σ_x

\Sigma_x = \frac{1}{M} \sum_{i=1}^{M} \Phi_i \Phi_i^T = \frac{1}{M} A A^T,  where A = [\Phi_1 \Phi_2 ... \Phi_M] is an N x M matrix

(equivalently, \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t)
58
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σ_x

\Sigma_x u_i = \lambda_i u_i

Assume that \lambda_1 > \lambda_2 > ... > \lambda_N, with corresponding eigenvectors u_1, u_2, ..., u_N

Question: can we say that \lambda_N > 0?

Since Σ_x is symmetric, <u_1, u_2, ..., u_N> form an orthogonal basis in R^N, therefore:

x - \bar{x} = \sum_{i=1}^{N} y_i u_i = y_1 u_1 + y_2 u_2 + ... + y_N u_N

y_i = \frac{(x - \bar{x})^T u_i}{u_i^T u_i} = (x - \bar{x})^T u_i   if ||u_i|| = 1 (normalized)
59
PCA - Steps
Step 5: approximation with the transformation matrix U (using the first K eigenvectors)

x - \bar{x} = \sum_{i=1}^{N} y_i u_i = y_1 u_1 + y_2 u_2 + ... + y_N u_N

\hat{x} - \bar{x} = \sum_{i=1}^{K} y_i u_i = y_1 u_1 + y_2 u_2 + ... + y_K u_K

or  \hat{x} - \bar{x} = U [y_1  y_2  ...  y_K]^T,  where U = [u_1 u_2 ... u_K] is N x K
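The five steps, condensed into a NumPy sketch; the function name and the row-per-sample layout of X are assumptions:

# PCA steps sketch: mean, centering, covariance, eigendecomposition, projection.
import numpy as np

def pca(X, K):
    mean = X.mean(axis=0)                      # Step 1: sample mean
    A = X - mean                               # Step 2: center the data
    cov = A.T @ A / len(X)                     # Step 3: sample covariance matrix (N x N)
    eigvals, eigvecs = np.linalg.eigh(cov)     # Step 4: eigenvalues/eigenvectors (ascending order)
    order = np.argsort(eigvals)[::-1]
    U = eigvecs[:, order[:K]]                  # Step 5: keep the K largest -> transformation matrix U (N x K)
    Y = A @ U                                  # coordinates y_i = u_i^T (x - mean)
    X_hat = Y @ U.T + mean                     # reconstruction x-hat
    return Y, U, eigvals[order], X_hat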
60
Example
Compute the PCA for the dataset

(1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)

Compute the sample covariance matrix:

\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t

The eigenvalues can be computed by finding the roots of the characteristic polynomial:

\det(\hat{\Sigma} - \lambda I) = 0
61
Example
The eigenvectors are the solutions of the systems:

\Sigma_x u_i = \lambda_i u_i

Normalize the eigenvectors to unit length.

Note: if u_i is a solution, then c u_i is also a solution, where c is any non-zero constant.
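The covariance matrix and its eigenpairs for the eight points above can be checked numerically; this sketch uses NumPy's 1/n convention (bias=True) to match the formula:

# Numerical check of the worked example: covariance matrix and its eigenpairs.
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)
cov = np.cov(X, rowvar=False, bias=True)   # 1/n convention, matching the formula above
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues and unit-length eigenvectors (columns)
print(cov)
print(eigvals, eigvecs, sep="\n")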
62
Geometric interpretation of PCA
• PCA chooses the eigenvectors of the covariance matrix corresponding to the largest
eigenvalues.
• The eigenvalues correspond to the variance of the data along the eigenvector
directions.
• Therefore, PCA projects the data along the directions where the data varies most.

u1: direction of max variance


u2: orthogonal to u1

63
How do we choose K ?
• K is typically chosen based on how much information (variance) we want to
preserve:
\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} > T,   where T is a threshold (e.g., 0.9)

• If T = 0.9, for example, we say that we "preserve" 90% of the information (variance) in the data.

• If K = N, then we "preserve" 100% of the information in the data (i.e., just a change of basis). A direct implementation of this rule is sketched below.
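A sketch of the threshold rule, assuming the eigenvalues are available as a NumPy array; the function name is an illustrative assumption:

# Choose the smallest K whose cumulative eigenvalue share exceeds the threshold T.
import numpy as np

def choose_k(eigvals, T=0.9):
    eigvals = np.sort(np.asarray(eigvals))[::-1]      # decreasing order
    ratio = np.cumsum(eigvals) / eigvals.sum()        # preserved fraction of total variance
    return int(np.searchsorted(ratio, T)) + 1         # smallest K with ratio >= T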
64
Approximation Error
• The approximation error (or reconstruction error) can be computed as:

||x - \hat{x}||,   where \hat{x} = \sum_{i=1}^{K} y_i u_i + \bar{x} = y_1 u_1 + y_2 u_2 + ... + y_K u_K + \bar{x}   (reconstruction)

• It can also be shown that the approximation error can be computed as follows:

||x - \hat{x}|| = \frac{1}{2} \sum_{i=K+1}^{N} \lambda_i
65
