0% found this document useful (0 votes)
32 views

Clusters - Density-Based

The document discusses density-based clustering algorithms. It defines density-based clusters as sets of density-connected points that are maximal with respect to density-reachability. A point p is density-reachable from another point q if there is a chain of points connecting them where each subsequent point is directly density-reachable from the previous. Direct density-reachability requires the points to be neighbors and the neighbor point to have sufficient density. DBSCAN is presented as a density-based clustering algorithm that groups together densely connected points and marks outliers as noise. Parameters epsilon and delta control neighborhood size and density.

Uploaded by

Fareed Naouri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views

Clusters - Density-Based

The document discusses density-based clustering algorithms. It defines density-based clusters as sets of density-connected points that are maximal with respect to density-reachability. A point p is density-reachable from another point q if there is a chain of points connecting them where each subsequent point is directly density-reachable from the previous. Direct density-reachability requires the points to be neighbors and the neighbor point to have sufficient density. DBSCAN is presented as a density-based clustering algorithm that groups together densely connected points and marks outliers as noise. Parameters epsilon and delta control neighborhood size and density.

Uploaded by

Fareed Naouri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Non-convex Clusters

Clusters – Density-based 26/34


Neighborhood and Reachability
• -neighborhood of p ∈ D defined as N (p) = {x ∈ D | d(p, x) ≤ }
• p is directly density-reachable from q ∈ D w.r.t. some  and δ if
• p ∈ N (q)
• |N (q)| ≥ δ, i.e. is a core point

• p is density-reachable from q w.r.t. some  and δ if


• ∃p1 , . . . , pn ∈ D such that p1 = q, pn = p, and
• pi+1 is directly density-reachable from pi for 2 ≤ i ≤ n

• p is density-connected to q w.r.t. some  and δ if


• ∃o ∈ D such that both p and q are density-reachable from o

• C ⊆ D (C 6= ∅) is a cluster w.r.t. some  and δ if


• ∀p, q ∈ D: if p ∈ C and q is density-reachable from p then q ∈ C
• ∀p, q ∈ C: p is density-connected to q

• noise = {p ∈ D : | : p ∈ / C1 ∪ · · · ∪ Ck } where
• C1 , . . . , Ck ⊆ D are clusters

Clusters – Density-based 27/34


Neighborhood and Reachability

Clusters – Density-based 28/34


DBSCAN

1: procedure DBSCAN(D, , δ)
2: for all x ∈ D do
3: p(x) ← −1 . mark points as unclastered
4: i←1 . the noise cluster have id 0
5: for all p ∈ D do
6: if p(p) = −1 then
7: if ExpandCluster(D, p, i, , δ) then
8: i←i+1

Clusters – Density-based 29/34


DBSCAN
1: function ExpandCluster(D, p, i, , δ)
2: if |N (p)| < δ then
3: p(p) ← 0 . mark p as noise
4: return false
5: else
6: for all x ∈ N (p) do
7: p(x) ← i . assign all x to cluster i
8: S ← N (p) \ {p}
9: while S 6= ∅ do
10: s ← S1 . Get the first point from S
11: if |N (s)| ≥ δ then
12: for all x ∈ N (s) do
13: if p(x) ≤ 0 then
14: if p(x) = −1 then
15: S ← S ∪ {x}
16: p(x) ← i
17: S ← S \ {s}
18: return true
Clusters – Density-based 30/34
How to guess  and δ?
k-distance
• k-dist: D → R
• k-dist(x) is the distance of x to its k-th nearest neighbor

Clusters – Density-based 31/34


DBSCAN – “good to know”

Pros
• Clusters of an arbitrary shape
• Robust to outliers

Cons
• Computationally complex
• Hard to set the parameters

Clusters – Density-based 32/34


Final remarks

• domain knowledge might help in choosing the right similarity


measure
• be aware of the range of values of the attributes
• e.g. similarities between x = (3.2, 178) and y = (3.1, 170) affected
more by the second co-ordinate

• there are various other approaches to similarity computation


• Janos Podani (2000). Introduction to the Exploration of Multivariate
Biological Data. Chapter 3: Distance, similarity, correlation...
Backhuys Publishers, Leiden, The Netherlands, ISBN 90-5782-067-6.

Clusters – Density-based 33/34


Thanks for your attention
References

• Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis (2001). On


Clustering Validation Techniques. Journal on Intelligent Information
Systems 17, 2-3.

• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar(2005).


Introduction to Data Mining, (First Edition). Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.

• Chris Ding and Xiaofeng He (2004). K-means clustering via principal


component analysis. In Proceedings of the twenty-first international
conference on Machine learning (ICML ’04). ACM, New York, NY, USA.

• Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). A


density-based algorithm for discovering clusters in large spatial databases
with noise. Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, AAAI Press.

Clusters – Density-based 34/34


Homework

• Download a clustering dataset from the UCI Machine Learning


Repository

• Cluster the dataset using


• Agglomerative clustering
• k-means method
• DBSCAN method

• Justify the choice of the values for the hyper-parameters


• similarity, linkage, k, δ, , . . .

Clusters – Density-based 34/34


Questions?

tomas.horvath@inf.elte.hu

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy