Clarans Clustering

This document discusses efficient clustering methods for spatial data mining. It introduces CLARANS, an effective clustering algorithm that improves upon PAM and CLARA. CLARANS uses a randomized search of a graph to find high quality clusters faster than exhaustive searches. The document also presents the SD and NSD approaches which apply either spatial or non-spatial clustering first before generalizing the other attributes, and observes that quality is dependent on the initial dataset and query.


Efficient and Effective Clustering Methods for Spatial Data Mining
Raymond T. Ng, Jiawei Han

Overview
• Spatial Data Mining
• Clustering techniques
• CLARANS
• Spatial and Non-Spatial dominant CLARANS
• Observations
• Summary


Spatial Data Mining
• Identifying interesting relationships and characteristics that may exist implicitly in spatial databases
• Different from relational databases
• Spatial objects store both spatial and non-spatial attributes
• Queries (e.g., “All Walmart stores within 10 miles of UH”)
• Spatial joins, which work on spatial indexes (R-tree)
• Huge sizes (terabytes)
• GIS is a classic example


Partitioning Methods
Given K, the number of partitions to create, a partitioning method constructs initial partitions. It then iteratively refines the quality of these clusters so as to maximize intra-cluster similarity and inter-cluster dissimilarity.
[Quality of Clustering]: Average dissimilarity of objects from their cluster centers (medoids)
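
The quality measure above can be sketched in a few lines. This is only an illustration: it assumes objects are 2-D points and that dissimilarity is Euclidean distance, neither of which the slides fix, and the name `average_dissimilarity` is hypothetical.

```python
from math import dist

def average_dissimilarity(points, medoids):
    """Mean distance from each point to its nearest medoid (lower is better).

    Assumption: Euclidean distance stands in for the dissimilarity measure.
    """
    total = sum(min(dist(p, m) for m in medoids) for p in points)
    return total / len(points)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(average_dissimilarity(points, [(0, 0), (10, 0)]))  # 1.0
```

Each point is charged the distance to its closest medoid, so choosing one medoid per group gives a much lower value than placing a single medoid for all four points.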

Selected algorithms:
1. K-medoids
2. PAM
3. CLARA
4. CLARANS

K-Medoids
• Partition-based clustering (K partitions)
• Effective, why?
  • Resistant to outliers
  • Does not depend on the order in which data points are examined
• Cluster center is part of the dataset, unlike k-means, where the cluster center is gravity based
• Experiments show that large data sets are handled efficiently
[Figure: two scatter plots on 0–10 axes, one labeled K-medoids and one labeled K-means]

PAM (Partitioning Around Medoids)
• [Goal]: Find K representative objects of the data set. Each of the K objects is called a medoid, the most centrally located object within a cluster.

PAM (2)
• Start with K data points designated as medoids. Create a cluster around each medoid by assigning every data point to its closest medoid:
  $O_j$ belongs to $O_i$ if $d(O_j, O_i) = \min_{O_e} d(O_j, O_e)$, where the minimum is taken over all medoids $O_e$
• Iteratively replace $O_i$ with $O_h$ if the quality of the clustering improves.
• A swapping cost, $C_{ijh}$, is associated with replacing a selected object $O_i$ with a non-selected object $O_h$
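
The assignment rule and the swap step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes Euclidean distance on 2-D points, and `swap_cost` returns the total change in cost summed over all objects, recomputed from scratch instead of accumulated from per-object terms. All function names are hypothetical.

```python
from math import dist

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def assign(points, medoids):
    """Map each medoid to the points whose nearest medoid it is."""
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

def swap_cost(points, medoids, o_i, o_h):
    """Change in total cost if medoid o_i is replaced by non-medoid o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(points, swapped) - total_cost(points, medoids)

def pam(points, medoids):
    """Apply the best improving swap until no swap lowers the cost."""
    medoids = list(medoids)
    while True:
        best = min(
            ((swap_cost(points, medoids, i, h), i, h)
             for i in medoids for h in points if h not in medoids),
            default=(0, None, None))
        if best[0] >= 0:          # no swap improves the clustering
            return medoids
        _, i, h = best
        medoids[medoids.index(i)] = h

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
result = pam(points, [(0, 0), (1, 0)])
print(sorted(result))             # [(1, 0), (11, 0)]
```

Starting with both medoids in the left group, the first swap moves one medoid to the centre of the right group, after which no further swap reduces the cost.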

PAM (3)
* $O(k(n-k)^2)$ for each iteration
* Good for small data sets (n=100, k=5)
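
As a rough worked example of the per-iteration bound: there are $k(n-k)$ (medoid, non-medoid) pairs to consider, and evaluating one swap examines the $n-k$ non-selected objects. Plugging in the slide's small case, and for comparison the size that the CLARA slide calls large:

$$k(n-k)^2 = 5 \cdot 95^2 = 45{,}125 \quad (n = 100,\ k = 5)$$

$$k(n-k)^2 = 10 \cdot 990^2 = 9{,}801{,}000 \quad (n = 1000,\ k = 10)$$

A tenfold increase in $n$ raises the work per iteration by a factor of more than 200 in this example.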

CLARA (Clustering LARge Applications)
• Improvement over PAM
• Finds medoids in a sample from the dataset
• [Idea]: If the samples are sufficiently random, the medoids of the sample approximate the medoids of the dataset
• [Heuristics]: 5 samples of size 40+2k give satisfactory results
• Works well for large datasets (n=1000, k=10)

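
The sampling idea can be sketched as below, under stated assumptions: Euclidean distance on 2-D points, a simplified greedy swap search standing in for PAM, and each sample's medoids scored against the full dataset so that the best of the 5 samples is kept. The function names and the `seed` parameter are hypothetical.

```python
import random
from math import dist

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Greedy swap search: a simplified stand-in for PAM."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return medoids

def clara(points, k, num_samples=5, seed=0):
    """Run the medoid search on random samples; keep the best medoid set."""
    rng = random.Random(seed)
    size = min(len(points), 40 + 2 * k)      # heuristic sample size from the slide
    best = None
    for _ in range(num_samples):
        sample = rng.sample(points, size)
        medoids = pam(sample, k)
        cost = total_cost(points, medoids)   # judged on the full dataset
        if best is None or cost < best[0]:
            best = (cost, medoids)
    return best[1]

# Two well-separated groups of 50 points each.
points = [(x, 0) for x in range(50)] + [(x, 0) for x in range(1000, 1050)]
print(clara(points, 2))
```

The expensive swap search only ever sees 40+2k objects, which is where the saving over running PAM on the full dataset comes from.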

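CLARANS (Clustering Large Applications based on RANdomized Search)
• A graph abstraction, $G_{n,k}$
• Each vertex is a collection of k medoids, $\{O_{m_1}, \ldots, O_{m_k}\}$
• Two nodes $S_1$ and $S_2$ are neighbors if $|S_1 \cap S_2| = k - 1$
• Each node has k(n-k) neighbors
• Cost of each node is the total dissimilarity of objects to their medoids
• PAM searches the whole graph
• CLARA searches a subgraph
[Figure: graph whose nodes are the medoid sets $\{O_{a_1}, \ldots, O_{a_k}\}$, $\{O_{b_1}, \ldots, O_{b_k}\}$, $\{O_{c_1}, \ldots, O_{c_k}\}$ and $\{O_{d_1}, \ldots, O_{d_k}\}$, with nodes labeled $S_1$ and $S_2$]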
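
The neighbor relation and the k(n-k) count can be checked with a short sketch. Objects are represented by integer indices here, and the helper name `neighbors` is hypothetical.

```python
def neighbors(node, objects):
    """All nodes that differ from `node` in exactly one medoid."""
    outside = [o for o in objects if o not in node]
    return [(node - {m}) | {o} for m in node for o in outside]

objects = range(6)            # n = 6 objects, identified by index
node = frozenset({0, 1})      # k = 2 medoids
nbrs = neighbors(node, objects)
print(len(nbrs))              # k * (n - k) = 2 * 4 = 8
```

Each neighbor is built by dropping one of the k medoids and adding one of the n-k other objects, so it shares exactly k-1 medoids with the original node.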

CLARANS (2)

Experimental values

• numLocal = 2
• maxNeighbors = max(1.25% of k(n-k), 250)
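
A minimal sketch of how the two parameters could drive the randomized search, assuming Euclidean distance, distinct 2-D points, and k < n. `num_local` is the number of restarts from a random node, and `max_neighbors` is the number of consecutive non-improving random neighbors tolerated before the current node is accepted as a local minimum. The function names and the `seed` parameter are hypothetical.

```python
import random
from math import dist

def total_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

def clarans(points, k, num_local=2, max_neighbors=None, seed=0):
    rng = random.Random(seed)
    n = len(points)
    if max_neighbors is None:
        # Experimental value from the slide: max(1.25% of k(n-k), 250).
        max_neighbors = max(round(0.0125 * k * (n - k)), 250)
    best_cost, best_node = float("inf"), None
    for _ in range(num_local):                     # numLocal restarts
        current = rng.sample(points, k)            # random starting node
        current_cost = total_cost(points, current)
        failures = 0
        while failures < max_neighbors:
            # Random neighbor: swap one medoid for one non-medoid.
            i = rng.randrange(k)
            h = rng.choice(points)
            if h in current:
                continue
            neighbor = current[:i] + [h] + current[i + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:                # move and reset the counter
                current, current_cost, failures = neighbor, cost, 0
            else:
                failures += 1
        if current_cost < best_cost:
            best_cost, best_node = current_cost, current
    return best_node, best_cost

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
node, cost = clarans(points, 2)
print(sorted(node), cost)
```

Unlike the exhaustive swap search, only a random subset of the k(n-k) neighbors is examined at each node, which is why the result is a local minimum rather than a guaranteed optimum.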

CLARANS (3)
• Outperforms PAM and CLARA in terms of running time and quality of clustering
• $O(n^2)$ for each iteration
[Figures: CLARANS vs CLARA; CLARANS vs PAM]

Generalization
• Useful for mining non-spatial attributes
• Process of merging tuples based on a concept hierarchy
• DBLearn: takes an SQL query, a generalization hierarchy, and a threshold
[Figure: Sphere(color, diameter) relation, shown as an initial relation and a generalized relation]
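
Merging tuples along a concept hierarchy until a threshold is met can be illustrated roughly as follows. This is a simplified sketch and not DBLearn itself: the hierarchy, the sphere tuples, and the function names are all hypothetical, and only one column is generalized.

```python
from collections import Counter

# Hypothetical concept hierarchy: child value -> parent concept.
HIERARCHY = {
    "red": "warm", "orange": "warm", "blue": "cool", "green": "cool",
    "warm": "ANY", "cool": "ANY",
}

def generalize(tuples, column, hierarchy):
    """Climb one level in the hierarchy for one column, then merge duplicates."""
    counts = Counter()
    for row, count in tuples.items():
        row = list(row)
        row[column] = hierarchy.get(row[column], row[column])
        counts[tuple(row)] += count
    return counts

def generalize_until(tuples, column, hierarchy, threshold):
    """Keep climbing until at most `threshold` distinct tuples remain."""
    current = Counter(tuples)
    while len(current) > threshold:
        nxt = generalize(current, column, hierarchy)
        if nxt == current:        # top of the hierarchy reached
            break
        current = nxt
    return current

# Sphere(color, diameter) tuples, each seen once.
spheres = Counter({("red", "small"): 1, ("orange", "small"): 1,
                   ("blue", "large"): 1, ("green", "large"): 1})
result = generalize_until(spheres, 0, HIERARCHY, threshold=2)
print(dict(result))   # {('warm', 'small'): 2, ('cool', 'large'): 2}
```

The counts record how many original tuples each generalized tuple covers, which is the information a generalized relation would typically carry.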

Silhouette
Silhouette of object $O_j$
• Determines how much $O_j$ belongs to its cluster
• Between -1 and 1
• 1 indicates high degree of membership

Silhouette width of cluster
• Average silhouette of all objects in cluster

Silhouette coefficient
• Average of the silhouette widths of the k clusters

Silhouette width    Interpretation
0.71 – 1            Strong cluster
0.51 – 0.7          Reasonable cluster
0.26 – 0.5          Weak or artificial cluster
≤ 0.25              No cluster found

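
The slide does not spell out the silhouette formula. The sketch below assumes the standard definition, s = (b - a) / max(a, b), where a is the average dissimilarity of an object to the other members of its own cluster and b is the smallest average dissimilarity to any other cluster. Euclidean distance and the function names are further assumptions.

```python
from math import dist
from statistics import mean

def silhouette(p, own, others):
    """s = (b - a) / max(a, b) for point p in cluster `own`."""
    rest = [q for q in own if q is not p]
    if not rest:                      # singleton cluster: s is taken as 0
        return 0.0
    a = mean(dist(p, q) for q in rest)
    b = min(mean(dist(p, q) for q in c) for c in others)
    return (b - a) / max(a, b)

def silhouette_widths(clusters):
    """Average silhouette of the objects in each cluster."""
    return [mean(silhouette(p, c, [o for o in clusters if o is not c])
                 for p in c) for c in clusters]

def silhouette_coefficient(clusters):
    """Average of the silhouette widths of the k clusters."""
    return mean(silhouette_widths(clusters))

clusters = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
coefficient = silhouette_coefficient(clusters)
print(round(coefficient, 2))          # about 0.9, a "strong cluster" in the table
```

Two tight, well-separated groups score close to 1, while overlapping groups would push the value toward 0 or below.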
SD and NSD approach
• SD: Spatial Dominant
• NSD: Non-Spatial Dominant
• Clustering for spatial attributes / Generalization for non-spatial attributes
• Dominance is decided by what is carried out first (clustering/generalization)
• Second phase works on tuples from previous stage

SD(CLARANS)
[Figure: flow diagram. Specify learning request in the form of an SQL query → data tuples → CLARANS on spatial attributes → Knat clusters (objects labeled Oi, Oh, Oj) → for every cluster, collect non-spatial components → apply DBLearn]
• Finds non-spatial generalizations from spatial clustering
• Value for Knat is determined through heuristics using the silhouette coefficients
• Clustering phase can be treated as finding spatial generalization hierarchy

NSD(CLARANS)
• Finds spatial clusters from non-spatial generalizations
• Clusters may overlap


Observations
• In all previous methods, quality of mining depends on the SQL query
• CLARANS assumes that the entire dataset is in memory, which is not always the case for large data sets
• Quality of results cannot be guaranteed when N is very large, because the search is randomized

Observations (2)
• Other clustering algorithms proposed for Spatial Data Mining
  • Hierarchical: BIRCH
  • Density based: DBSCAN, GDBSCAN, DBRS
  • Grid based: STING

Summary
• A seminal paper on use of clustering for spatial data mining
• CLARANS is an effective clustering technique for large datasets
• SD(CLARANS)/NSD(CLARANS) are effective spatial data mining algorithms

References
• Primary
  • Efficient and Effective Clustering Methods for Spatial Data Mining (1994) - Raymond T. Ng, Jiawei Han
• Secondary
  • CLARANS: A Method for Clustering Objects for Spatial Data Mining - Raymond T. Ng, Jiawei Han
  • Clustering for Mining in Large Spatial Databases - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
  • An Introduction to Spatial Database Systems - Ralf Hartmut Güting
