V DM Clustering
1
Cluster Analysis: Basic Concepts and Methods
◼ Partitioning Methods
◼ Hierarchical Methods
◼ Density-Based Methods
◼ Grid-Based Methods
◼ Evaluation of Clustering
◼ Summary
2
What is Cluster Analysis?
◼ Cluster: A collection of data objects
◼ similar (or related) to one another within the same group and dissimilar (or unrelated) to the objects in other groups
3
Clustering for Data Understanding and Applications
◼ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
◼ Information retrieval: document clustering
◼ Land use: Identification of areas of similar land use in an earth
observation database
◼ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
◼ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
◼ Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
◼ Climate: understanding the Earth's climate; finding patterns in the
atmosphere and ocean
◼ Economic Science: market research
4
Clustering as a Preprocessing Tool (Utility)
◼ Summarization:
◼ Preprocessing for regression, PCA, classification, and
association analysis
◼ Compression:
◼ Image processing: vector quantization
◼ Finding K-nearest Neighbors
◼ Localizing search to one or a small number of clusters
◼ Outlier detection
◼ Outliers are often viewed as those “far away” from any
cluster
5
Quality: What Is Good Clustering?
6
Measure the Quality of Clustering
◼ Dissimilarity/Similarity metric
◼ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
◼ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
◼ Weights should be associated with different variables
based on applications and data semantics
◼ Quality of clustering:
◼ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
◼ It is hard to define “similar enough” or “good enough”
◼ The answer is typically highly subjective
7
Understanding Data
◼ Nominal
◼ Binary - Symmetric, Asymmetric
◼ Ordinal
◼ Numeric: quantitative
◼ Interval-scaled
◼ Ratio-scaled
9
Categorical Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary
◼ Nominal attribute with only 2 states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome (e.g., HIV
positive)
◼ Ordinal
◼ Values have a meaningful order (ranking) but magnitude between
successive values is not known.
◼ Size = {small, medium, large}, grades, army rankings
10
Numeric Attribute Types
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
Val     log10(Val)
445     2.65
22      1.34
164     2.21
1210    3.08
Dissimilarity Measures – Categorical (Nominal)
◼ d(i, j) = (p − m) / p, where
◼ m = no. of matches
◼ p = total no. of nominal attributes
◼ (p = 1 here)
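A tiny sketch (mine, not from the slides) of the mismatch-ratio dissimilarity d(i, j) = (p − m)/p defined above; the hair-colour values are illustrative.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Mismatch ratio d(i, j) = (p - m) / p over p nominal attributes."""
    p = len(obj_i)                                   # total number of nominal attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of matching attributes
    return (p - m) / p

# Hypothetical single-attribute example (p = 1), e.g. hair colour:
print(nominal_dissimilarity(["black"], ["blond"]))   # 1.0 (mismatch)
print(nominal_dissimilarity(["black"], ["black"]))   # 0.0 (match)
```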
Dissimilarity Measures – Categorical (Ordinal)
◼ e.g., Test2 here

Example with binary attributes (values for three objects):

Attribute   Type              Obj 1   Obj 2   Obj 3
Gender      Bin-Symmetric       1       1       1
Fever       Bin-Asymmetric      1       0       0
Cough       Bin-Asymmetric      0       0       0
Test1       Bin-Asymmetric      1       0       0
Test2       Bin-Asymmetric      0       0       0
Test3       Bin-Asymmetric      1       1       1
Test4       Bin-Asymmetric      0       0       0
20
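A small sketch (my own) of asymmetric binary dissimilarity d = (r + s)/(q + r + s) applied to the table above, where q, r, s count the (1,1), (1,0), (0,1) attribute pairs and (0,0) pairs are ignored; the symmetric attribute (Gender) is left out, and the object labels Obj 1–3 are mine.

```python
def asym_binary_dissimilarity(x, y):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s).
    q = #(1,1), r = #(1,0), s = #(0,1); (0,0) pairs are ignored.
    Assumes at least one attribute is 1 for one of the objects."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return (r + s) / (q + r + s)

# Asymmetric attributes only (Fever, Cough, Test1..Test4) from the table above
obj1 = [1, 0, 1, 0, 1, 0]
obj2 = [0, 0, 0, 0, 1, 0]
obj3 = [0, 0, 0, 0, 1, 0]
print(asym_binary_dissimilarity(obj1, obj2))  # 2/3 ~ 0.67
print(asym_binary_dissimilarity(obj2, obj3))  # 0.0 (identical on these attributes)
```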
Considerations for Cluster Analysis
◼ Partitioning criteria
◼ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
◼ Separation of clusters
◼ Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one
class)
◼ Similarity measure
◼ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
◼ Clustering space
◼ Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
21
Requirements and Challenges
◼ Scalability
◼ Clustering all the data instead of only samples
◼ Constraint-based clustering
◼ User may give inputs on constraints
◼ Use domain knowledge to determine input parameters
◼ Interpretability and usability
◼ Others
◼ Discovery of clusters with arbitrary shape
◼ High dimensionality
22
Major Clustering Approaches (I)
◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects) using some criterion
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Grid-based approach:
◼ based on a multiple-level granularity structure
23
Major Clustering Approaches (II)
◼ Model-based:
◼ A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
◼ Frequent pattern-based:
◼ Based on the analysis of frequent patterns
◼ User-guided or constraint-based:
◼ Clustering by considering user-specified or application-specific
constraints
◼ Typical methods: COD (obstacles), constrained clustering
◼ Link-based clustering:
◼ Objects are often linked together in various ways
24
Illustration – capabilities of different approaches
https://machinelearningmastery.com/clustering-algorithms-with-python/
25
Cluster Analysis: Basic Concepts and Methods
◼ Partitioning Methods
◼ Hierarchical Methods
◼ Density-Based Methods
◼ Grid-Based Methods
◼ Evaluation of Clustering
◼ Summary
26
Partitioning Algorithms: Basic Concept
◼ Partitioning criterion (sum of squared errors): $E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$, where $c_i$ is the center (centroid or medoid) of cluster $C_i$
◼ Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
◼ Global optimal: exhaustively enumerate all partitions
◼ Heuristic methods: k-means and k-medoids algorithms
◼ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
◼ k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
27
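The bullets above define the sum-of-squared-errors criterion that k-means heuristically minimizes. Below is a minimal Lloyd-style sketch in NumPy (not the deck's own code); the toy data, the random seed, and the function name `kmeans` are illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's k-means: minimize E = sum_i sum_{p in C_i} ||p - c_i||^2."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k data points as initial centers
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy 2-D data with two obvious groups (illustrative values)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```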
The K-Means Clustering Method
28
An Example of K-Means Clustering
K=2
30
Determining K: Cost Function, Elbow Method
31
Comments on the K-Means Method
◼ Strength: relatively efficient — the dissimilarity calculations take O(tkn) time for t iterations, k clusters, and n objects
◼ Weaknesses: k must be specified in advance; often terminates at a local optimum; sensitive to noisy data and outliers
33
What Is the Problem of the K-Means Method?
◼ The k-means method is sensitive to outliers, since an object with an extremely large value can substantially distort the mean of a cluster; k-medoids instead uses an actual object (the medoid, the most centrally located object in the cluster) as the representative
[Figure: two example plots on a 0–10 × 0–10 grid]
34
A Typical K-Medoids Algorithm (PAM)
Total cost = 20
◼ Arbitrarily choose k objects as the initial medoids
◼ Assign each remaining object to the nearest medoid
◼ Do loop, until no change:
◼ Compute the total cost of swapping a medoid O with a randomly selected non-medoid O_random
◼ Swap O and O_random if the quality is improved
[Figure: swap-cost example on a 0–10 × 0–10 grid with points K, O, R, A, P; the contribution of point A to the swap cost is C_A = d(A, P) − d(A, R)]
35
Cost of swapping
◼ K = 2, initial seeds = A, C
◼ Distance matrix:

        A   B   C   D
   A    0   5   3   4
   B    5   0   7   4
   C    3   7   0   2
   D    4   4   2   0

◼ Iteration 1
◼ Clusters – {A, B}, {C, D}
◼ Swap –
36
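A small sketch (mine, not from the slides) that evaluates every medoid/non-medoid swap for the distance matrix above, starting from the initial seeds {A, C}; `total_cost` simply sums each non-medoid's distance to its nearest medoid.

```python
from itertools import product

D = {"A": {"A": 0, "B": 5, "C": 3, "D": 4},
     "B": {"A": 5, "B": 0, "C": 7, "D": 4},
     "C": {"A": 3, "B": 7, "C": 0, "D": 2},
     "D": {"A": 4, "B": 4, "C": 2, "D": 0}}

def total_cost(medoids):
    """Sum over non-medoids of the distance to their nearest medoid."""
    return sum(min(D[o][m] for m in medoids) for o in D if o not in medoids)

medoids = {"A", "C"}
print("initial cost:", total_cost(medoids))          # 5 + 2 = 7
for m, o in product(medoids, set(D) - medoids):      # try every (medoid, non-medoid) swap
    cand = (medoids - {m}) | {o}
    print(f"swap {m} <-> {o}: medoids {sorted(cand)} cost {total_cost(cand)}")
# Best single swap here: replace A with B, giving medoids {B, C} with total cost 3 + 2 = 5
```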
PAM (Partitioning Around Medoids) (1987)
38
What Is the Problem with PAM?
39
CLARA (Clustering Large Applications) (1990)
40
CLARANS (“Randomized” CLARA) (1994)
◼ CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
◼ Draws a sample of neighbors (Oi) dynamically for each swap
41
The K-Medoid Clustering Method
◼ PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
42
Cluster Analysis: Basic Concepts and Methods
43
Hierarchical Clustering
◼ Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: agglomerative (AGNES) proceeds from Step 0 to Step 4, merging a, b, c, d, e bottom-up — {a, b}, {d, e}, {c, d, e}, then {a, b, c, d, e}; divisive (DIANA) proceeds in the reverse direction, from Step 4 back to Step 0]
44
AGNES (Agglomerative Nesting)
◼ Introduced in Kaufmann and Rousseeuw (1990)
◼ Implemented in statistical packages, e.g., Splus
◼ Use the single-link method and the dissimilarity matrix
◼ Merge nodes that have the least dissimilarity
◼ Go on in a non-descending fashion
◼ Eventually all nodes belong to the same cluster
45
Bottom-Up-AGNES
◼ {2, 3, 5, 8, 10}
◼ Single-link: distance between two groups is the distance between their nearest elements
46
Bottom-Up-AGNES- Single-link
◼ {2, 3, 5, 8, 10}
◼ Single-link: distance between two groups is the distance between their nearest elements
◼ {2}, {3}, {5}, {8}, {10} => distances = 1, 2, 3, 2
◼ {2,3}, {5}, {8,10} => distances = 2, 3
◼ {2,3,5}, {8,10} => distance = 8 − 5 = 3
◼ {2,3,5,8,10}
47
Bottom-Up-AGNES- Complete-link
◼ {2, 3, 5, 8, 10}
◼ Complete-link: distance between two groups is the distance between their farthest elements
◼ {2}, {3}, {5}, {8}, {10} => distances = 1, 2, 3, 2
◼ {2,3}, {5}, {8,10} => distances = max(5−2, 5−3) = 3, max(10−5, 8−5) = 5
◼ {2,3,5}, {8,10} => distance = 10 − 2 = 8
◼ {2,3,5,8,10}
48
Bottom-Up-AGNES- Average-link
◼ {2, 3, 5, 8, 10}
◼ Average-link: distance between two groups is the average of the pairwise distances between elements of the two groups
◼ {2}, {3}, {5}, {8}, {10} => distances = 1, 2, 3, 2
◼ {2,3}, {5}, {8,10} => distances = {avg(5−2, 5−3), avg(8−5, 10−5)} = 2.5, 4
◼ {2,3,5}, {8,10} => distance = avg{8−2, 8−3, 8−5, 10−2, 10−3, 10−5} = 34/6 = 5.67
◼ {2,3,5,8,10}
49
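The three worked examples above can be reproduced with SciPy's hierarchical clustering; a hedged sketch is below (note that SciPy's "average" method is the mean of all pairwise distances, which matches the computation on the average-link slide).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[2.0], [3.0], [5.0], [8.0], [10.0]])  # the 1-D points as column vectors

for method in ("single", "complete", "average"):
    Z = linkage(points, method=method)   # each row: [cluster_i, cluster_j, merge_distance, size]
    print(method)
    print(Z)

# dendrogram(Z) would draw the merge tree shown on the dendrogram slide
```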
Dendrogram: Shows How Clusters are Merged
50
DIANA (Divisive Analysis)
51
Top-Down- DIANA
◼ {2, 3, 5, 8, 10} => distances = 1, 2, 3, 2
    ◼ {2, 3, 5} => distances = 1, 2
        ◼ {2}
        ◼ {3, 5}
            ◼ {3}
            ◼ {5}
    ◼ {8, 10}
        ◼ {8}
        ◼ {10}
52
Distance between Clusters
$D_m = \sqrt{\dfrac{\sum_{i=1}^{N}\sum_{j=1}^{N}\left(t_{ip}-t_{jq}\right)^{2}}{N(N-1)}}$
54
Extensions to Hierarchical Clustering
◼ Major weaknesses of agglomerative clustering methods: they can never undo what was done previously, and they do not scale well (time complexity of at least O(n²), where n is the number of objects)
56
Clustering Feature Vector in BIRCH
◼ CF = (N, LS, SS) — e.g., CF = (5, (16,30), (54,190)) for the five points below
◼ N: number of data points
◼ LS: linear sum of the N points: $\sum_{i=1}^{N} X_i$
◼ SS: square sum of the N points: $\sum_{i=1}^{N} X_i^2$
[Figure: the five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on a 0–10 × 0–10 grid]
57
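The clustering feature above can be recomputed directly from the five plotted points; a short NumPy check (my own sketch) is below.

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])  # the five points from the figure

N  = len(points)                # number of points
LS = points.sum(axis=0)         # linear sum, per dimension
SS = (points ** 2).sum(axis=0)  # square sum, per dimension

print(N, tuple(LS), tuple(SS))  # 5 (16, 30) (54, 190)
```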
CF-Tree in BIRCH
◼ Clustering feature:
◼ Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd statistical moments of the subcluster (N, LS, SS)
58
The CF Tree Structure
[Figure: CF tree — the root and each non-leaf node hold entries CF1, CF2, CF3, …, each paired with a pointer (child1, child2, child3, …) to a child node]
59
The BIRCH Algorithm
◼ Cluster diameter: $\sqrt{\dfrac{1}{n(n-1)}\sum_{i \ne j}\left(x_i - x_j\right)^2}$
◼ Algorithm is O(n)
◼ Concerns
◼ Sensitive to insertion order of data points
◼ Since the size of leaf nodes is fixed, the clusters may not be so natural
◼ Clusters tend to be spherical, given the radius and diameter measures
60
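A brief note on why BIRCH can evaluate the cluster diameter from the CF alone: the double sum in the diameter formula expands into terms that depend only on N, LS, and SS. The derivation below is my own sketch, applied to the example CF = (5, (16,30), (54,190)) from the earlier slide.

```latex
\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i-x_j)^2
  = 2n\sum_{i=1}^{n}x_i^2 - 2\Big(\sum_{i=1}^{n}x_i\Big)^2
  = 2n\,SS - 2\,\|LS\|^2,
\qquad
D = \sqrt{\frac{2n\,SS - 2\,\|LS\|^2}{n(n-1)}}.
% Example: n = 5, SS = 54 + 190 = 244, \|LS\|^2 = 16^2 + 30^2 = 1156
% D = \sqrt{(2\cdot 5\cdot 244 - 2\cdot 1156)/(5\cdot 4)} = \sqrt{128/20} \approx 2.53
```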
Cluster Analysis: Basic Concepts and Methods
61
Density-Based Clustering Methods
◼ Major features:
◼ Discover clusters of arbitrary shape
◼ Handle noise
◼ One scan
◼ Need density parameters as a termination condition
62
Density-Based Clustering: Basic Concepts
◼ Two parameters:
◼ Eps: Maximum radius of the neighbourhood
◼ MinPts: Minimum number of points in an Eps-
neighbourhood of that point
◼ NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
◼ Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
◼ p belongs to NEps(q)
◼ core point condition: |NEps(q)| ≥ MinPts
[Figure: p directly density-reachable from core point q; MinPts = 5]
63
Density-Reachable and Density-Connected
◼ Density-reachable:
◼ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi
◼ Density-connected:
◼ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: a chain q, p1, …, p illustrating density-reachability, and points p, q both density-reachable from o illustrating density-connectivity]
64
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
◼ Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
◼ Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points of a density-based cluster; Eps = 1 cm, MinPts = 5]
65
DBSCAN
[https://towardsdatascience.com/dbscan-make-density-based-clusters-by-hand-2689dc335120]
66
DBSCAN: The Algorithm
◼ Arbitrarily select a point p
◼ Retrieve all points density-reachable from p w.r.t. Eps and MinPts
◼ If p is a core point, a cluster is formed
◼ If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
◼ Continue the process until all of the points have been processed
67
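A hedged usage sketch with scikit-learn's DBSCAN (assuming scikit-learn is available); the data, eps, and min_samples values are illustrative, and label −1 marks noise/outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point (illustrative data)
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0],
              [8, 8], [8.2, 8.1], [7.9, 8.0], [8.1, 7.8],
              [4.5, 4.5]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)   # Eps and MinPts from the slides
print(db.labels_)   # cluster ids per point; -1 marks noise (here the isolated point)
```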
DBSCAN: Sensitive to Parameters
68
[Figure: OPTICS reachability plot — reachability-distance (with some values undefined) plotted against the cluster order of the objects]
69
Density-Based Clustering: OPTICS & Its Applications
70
Cluster Analysis: Basic Concepts and Methods
◼ Partitioning Methods
◼ Hierarchical Methods
◼ Density-Based Methods
◼ Grid-Based Methods
◼ Evaluation of Clustering
◼ Summary
71
Assessing Clustering Tendency
◼ Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
◼ Test spatial randomness by a statistical test: the Hopkins Statistic
◼ Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space
◼ Elbow method
◼ Use the turning point in the curve of the sum of within-cluster variance w.r.t. the number of clusters
◼ Cross-validation: divide the data set into m parts, use m − 1 parts to build a clustering model, and use the remaining part to test the quality of the clustering
◼ E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
◼ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the number of clusters that fits the data best
73
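A minimal sketch of the elbow method described above, using scikit-learn's KMeans (the blob data and the range of k are illustrative choices); the within-cluster SSE is KMeans' inertia_, and the turning point in the curve suggests the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Illustrative data: three Gaussian blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])

sse = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)          # within-cluster sum of squared distances
for k, e in zip(range(1, 9), sse):
    print(k, round(e, 1))            # look for the "elbow" (turning point), here around k = 3
```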
Measuring Clustering Quality
75
Outlier Detection
◼ https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/
◼ https://www.kaggle.com/code/kimchanyoung/simple-anomaly-detection-using-unsupervised-knn
76
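In the spirit of the linked articles, a common unsupervised kNN approach flags points whose distance to their k-th nearest neighbor is unusually large (i.e., "far away" from any cluster); the sketch below is my own, using scikit-learn's NearestNeighbors, and the mean + 3 standard deviations cut-off is an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),    # one dense cluster
               np.array([[6.0, 6.0]])])             # one point far from any cluster

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)      # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
kth_dist = dist[:, -1]                               # distance to the k-th nearest neighbor

threshold = kth_dist.mean() + 3 * kth_dist.std()     # simple cut-off (my choice)
print(np.where(kth_dist > threshold)[0])             # typically flags index 50, the far-away point
```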