Cluster Analysis: Prepared by Navin Ninama

This document provides an overview of cluster analysis techniques. It defines cluster analysis as the process of grouping similar data objects into clusters, then categorizes and describes the major clustering methods: partitioning, hierarchical, density-based, grid-based, and model-based. Example applications are given for areas such as marketing, insurance, and earthquake studies, and quality measures and requirements for good clustering are discussed.


Cluster Analysis

Prepared by
Navin Ninama
140160702007

Cluster Analysis
1. What is Cluster Analysis?
2. A Categorization of Major Clustering Methods
3. Partitioning Methods
4. Hierarchical Methods
5. Density-Based Methods
6. Grid-Based Methods
7. Model-Based Methods

What is Cluster Analysis?

- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications:
  - As a stand-alone tool to gain insight into data distribution
  - As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts

- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or support other spatial mining tasks
- Image processing
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an Earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults

Quality: What Is Good Clustering?

- A good clustering method produces high-quality clusters with:
  - High intra-class similarity (objects within a cluster are close to one another)
  - Low inter-class similarity (objects in different clusters are far apart)
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
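The intra-class vs. inter-class criterion can be made concrete with a small sketch. The helper `avg_pairwise_dist2` is hypothetical (not from any library) and assumes squared Euclidean distance:

```python
def avg_pairwise_dist2(A, B):
    """Average squared Euclidean distance between points of A and points of B.
    Called with A and B being the same cluster, it measures intra-cluster
    scatter (self-pairs are skipped); with two different clusters, it
    measures inter-cluster separation. Assumes at least one valid pair."""
    pairs = [(p, q) for p in A for q in B if p is not q]
    return sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs) / len(pairs)

c1 = [(0.0, 0.0), (0.0, 1.0)]
c2 = [(10.0, 10.0), (10.0, 11.0)]

intra = avg_pairwise_dist2(c1, c1)  # small: members of c1 are close together
inter = avg_pairwise_dist2(c1, c2)  # large: c1 and c2 are far apart
```

A good clustering drives `intra` down and `inter` up; real evaluations use normalized indices (e.g., silhouette) built on the same idea.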

Measure the Quality of Clustering

- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
- A separate quality function measures the "goodness" of a cluster
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
- Weights should be associated with different variables based on the application and data semantics
- It is hard to define "similar enough" or "good enough": the answer is typically highly subjective
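As a minimal illustration of how the distance function changes with variable type, here are two sketches: a weighted Euclidean distance for interval-scaled variables and a simple mismatch ratio for boolean/categorical variables. Both function names are illustrative, not from a library:

```python
import math

def weighted_euclidean(x, y, weights=None):
    """d(i, j) for interval-scaled (numeric) vectors; the optional per-variable
    weights reflect application and data semantics, as noted above."""
    w = weights or [1.0] * len(x)
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def mismatch_ratio(x, y):
    """A simple dissimilarity for boolean/categorical vectors:
    the fraction of positions where the two objects disagree."""
    return sum(xi != yi for xi, yi in zip(x, y)) / len(x)
```

Setting a variable's weight to zero removes it from the distance entirely, which is one concrete way "data semantics" enters the metric.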

Requirements of Clustering in Data Mining

- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Major Clustering Approaches (I)

- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
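As a sketch of the partitioning approach, here is a minimal pure-Python k-means. The explicit `init` parameter and the empty-cluster handling are simplifications for illustration; production code would use a library implementation:

```python
import random

def kmeans(points, k, iters=20, init=None, seed=0):
    """Minimal k-means: alternately assign each point to its nearest center,
    then recompute each center as the mean of its assigned cluster."""
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest center by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # recompute centers; keep the old center if a cluster emptied out
        centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

Each iteration can only reduce the sum of squared errors, which is the evaluation criterion named above for partitioning methods.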

Major Clustering Approaches (II)

- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
- Model-based approach:
  - A model is hypothesized for each of the clusters; the idea is to find the best fit of the data to the given model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based approach:
  - Based on the analysis of frequent patterns
  - Typical methods: pCluster
- User-guided or constraint-based approach:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering

Recent Hierarchical Clustering Methods

- Major weaknesses of agglomerative clustering methods:
  - They do not scale well: time complexity of at least O(n^2), where n is the total number of objects
  - They can never undo what was done previously
- Integration of hierarchical clustering with distance-based clustering:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - ROCK (1999): clusters categorical data by neighbor and link analysis
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling
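The agglomerative behavior described above, including the "never undo a merge" property and the quadratic pairwise search, can be sketched with a naive single-link implementation (illustrative only; real systems use the optimized methods listed above):

```python
def agglomerative(points, target_k):
    """Naive agglomerative clustering with single-link distance.
    Every object starts in its own cluster; the two closest clusters are
    merged repeatedly, and a merge is never undone. The exhaustive pairwise
    search is what makes plain agglomerative clustering scale poorly."""
    clusters = [[p] for p in points]

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def link(c1, c2):  # single link: distance between the two closest members
        return min(dist2(p, q) for p in c1 for q in c2)

    while len(clusters) > target_k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters
```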

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

- CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar (1999)
- Measures similarity based on a dynamic model:
  - Two clusters are merged only if the interconnectivity and closeness (proximity) between them are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
  - By contrast, CURE ignores information about the interconnectivity of the objects, and ROCK ignores information about the closeness of two clusters
- A two-phase algorithm:
  1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters

Density-Based Clustering Methods

- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - Need only one scan
  - Need density parameters as a termination condition
- Several interesting studies:
  - DBSCAN: Ester, et al. (KDD'96)
  - OPTICS: Ankerst, et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & Keim (KDD'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)

Density-Based Clustering: Basic Concepts

- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in an Eps-neighbourhood of that point
- N_Eps(p) = {q in D | dist(p, q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if:
  - p belongs to N_Eps(q), and
  - q satisfies the core point condition: |N_Eps(q)| >= MinPts
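These definitions translate almost directly into code; a minimal sketch with illustrative helper names (Euclidean distance assumed):

```python
def eps_neighborhood(D, p, eps):
    """N_Eps(p) = {q in D | dist(p, q) <= Eps}.
    Note p is always in its own neighborhood."""
    return [q for q in D if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

def directly_density_reachable(D, p, q, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q)
    and q is a core point, i.e. |N_Eps(q)| >= MinPts."""
    n_q = eps_neighborhood(D, q, eps)
    return p in n_q and len(n_q) >= min_pts
```

Note the asymmetry: a border point can be directly density-reachable from a core point, but not the other way around, since the border point fails the core point condition.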

Grid-Based Clustering Method

- Uses a multi-resolution grid data structure
- Several interesting methods:
  - STING (a STatistical INformation Grid approach) by Wang, Yang, and Muntz (1997)
  - WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
    - A multi-resolution clustering approach using the wavelet method
  - CLIQUE: Agrawal, et al. (SIGMOD'98)
    - Works on high-dimensional data (thus often treated under clustering of high-dimensional data)

The STING Clustering Method

- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical information about each cell is calculated and stored beforehand, and is used to answer queries
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
  - count, mean, s (standard deviation), min, max
  - type of distribution: normal, uniform, etc.
- Use a top-down approach to answer spatial data queries:
  - Start from a pre-selected layer, typically one with a small number of cells
  - For each cell in the current level, compute the confidence interval

Comments on STING

- Remove the irrelevant cells from further consideration
- When finished examining the current layer, proceed to the next lower level
- Repeat this process until the bottom layer is reached
- Advantages:
  - Query-independent, easy to parallelize, supports incremental updates
  - O(K), where K is the number of grid cells at the lowest level
- Disadvantages:
  - All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
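The key STING property, that higher-level cell parameters are derived from child cells without revisiting the raw data, can be sketched for a single numeric attribute. This is a simplification of the full method (it omits standard deviation and distribution type), with illustrative function names:

```python
def cell_stats(values):
    """Bottom-level cell: statistical parameters computed once from the raw data."""
    return {"count": len(values), "sum": sum(values),
            "min": min(values), "max": max(values)}

def merge_cells(children):
    """A higher-level cell's parameters derived purely from its child cells'
    parameters, never from the raw data -- the property STING relies on
    for fast, query-independent processing. Empty child cells are skipped."""
    kept = [c for c in children if c["count"] > 0]
    return {"count": sum(c["count"] for c in kept),
            "sum": sum(c["sum"] for c in kept),
            "min": min(c["min"] for c in kept),
            "max": max(c["max"] for c in kept)}
```

Storing `sum` rather than `mean` makes the merge exact: the parent mean is simply `sum / count`, with no loss from averaging averages.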

Model-Based Clustering

What is model-based clustering?
- Attempts to optimize the fit between the given data and some mathematical model
- Based on the assumption that the data are generated by a mixture of underlying probability distributions
- Typical methods:
  - Statistical approach: EM (Expectation Maximization), AutoClass
  - Machine learning approach: COBWEB, CLASSIT
  - Neural network approach: SOM (Self-Organizing Feature Map)

EM: Expectation Maximization

- EM is a popular iterative refinement algorithm
  - An extension of k-means: each object is assigned to a cluster according to a weight (a probability distribution), and new means are computed from these weighted measures
- General idea:
  - Start with an initial estimate of the parameter vector
  - Iteratively rescore the patterns against the mixture density produced by the parameter vector
  - Use the rescored patterns to update the parameter estimates
  - Patterns belong to the same cluster if they are placed by their scores in the same component
- The algorithm converges fast, but may not reach the global optimum
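A minimal sketch of this idea for a two-component, one-dimensional Gaussian mixture. Initializing the means from the data extremes and fixing the component count at two are simplifying assumptions for illustration:

```python
import math

def em_1d(data, iters=50):
    """Minimal EM for a two-component 1-d Gaussian mixture (a sketch).
    E-step: compute each component's responsibility (weight) for each point.
    M-step: re-estimate means, variances, and mixing weights from those
    responsibilities -- the 'weighted measures' described above."""
    mu = [min(data), max(data)]   # crude initial estimate of the parameter vector
    var = [1.0, 1.0]
    pi = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: rescore each pattern against the current mixture density
        resp = []
        for x in data:
            p = [pi[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: weighted re-estimation of the parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
            pi[k] = nk / len(data)
    return mu, var, pi
```

The soft (probabilistic) assignments are what distinguish this from k-means; the final parameters depend on the initial estimate, which is why EM may settle in a local rather than global optimum.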

Thank You
