Cluster Analysis

Cluster analysis is an unsupervised machine learning technique used to group unlabeled data points into clusters. The goal is to maximize the similarity of data points within each cluster while maximizing the dissimilarity between clusters. Good clustering produces high intra-cluster similarity and low inter-cluster similarity. Cluster analysis is commonly used for pattern recognition, image processing, bioinformatics, and document classification. Common requirements for clustering in data mining include scalability, ability to handle different data types, and discovery of clusters with arbitrary shapes.


Cluster Analysis

What is Cluster Analysis?


 Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
[Figure: two groups of points, annotated to show that intra-cluster distances are minimized while inter-cluster distances are maximized]
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined
classes
 Clustering is used:
 As a stand-alone tool to get insight into data distribution
 Visualization of clusters may unveil important information
 As a preprocessing step for other algorithms
 Efficient indexing or compression often relies on clustering
Some Applications of
Clustering
 Pattern Recognition
 Image Processing
 cluster images based on their visual content
 Bio-informatics
 WWW and IR
 document classification
 cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
 The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Requirements of Clustering in
Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Usability
Outliers
 Outliers are objects that do not belong to any cluster
or form clusters of very small cardinality

[Figure: a cloud of points forming a cluster, with a few isolated points labeled as outliers]

 In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Data Structures
 Data matrix (two modes): n tuples/objects described by p attributes/dimensions

       [ x_11  ...  x_1f  ...  x_1p ]
       [  ...  ...   ...  ...   ... ]
       [ x_i1  ...  x_if  ...  x_ip ]
       [  ...  ...   ...  ...   ... ]
       [ x_n1  ...  x_nf  ...  x_np ]

 Dissimilarity or distance matrix (one mode): an n-by-n table of the pairwise distances d(i, j) between objects

       [    0                              ]
       [ d(2,1)     0                      ]
       [ d(3,1)  d(3,2)     0              ]
       [    :       :       :              ]
       [ d(n,1)  d(n,2)    ...   ...   0   ]

 Assuming a symmetric distance: d(i, j) = d(j, i)
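As a rough illustration of how the two structures relate, here is a minimal Python sketch (not from the slides; the function name and the choice of NumPy and Euclidean distance are my own) that builds the n-by-n dissimilarity matrix from an n-by-p data matrix:

import numpy as np

def dissimilarity_matrix(X):
    """Build the n x n matrix of pairwise Euclidean distances
    from an n x p data matrix X (one row per object)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):                      # lower triangle only
            D[i, j] = np.linalg.norm(X[i] - X[j])
            D[j, i] = D[i, j]                   # symmetric: d(i, j) = d(j, i)
    return D

# Example: three objects with two attributes each
print(dissimilarity_matrix([[1, 2], [4, 6], [1, 3]]))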
Measuring Similarity in
Clustering
 Dissimilarity/Similarity metric:

 The dissimilarity d(i, j) between two objects i and j is expressed in


terms of a distance function, which is typically a metric:
 d(i, j) ≥ 0 (non-negativity)
 d(i, i)=0 (isolation)
 d(i, j)= d(j, i) (symmetry)
 d(i, j) ≤ d(i, h)+d(h, j) (triangular inequality)

 The definitions of distance functions are usually different


for interval-scaled, boolean, categorical, ordinal and ratio-
scaled variables.

 Weights may be associated with different variables based


on applications and data semantics.
Type of data in cluster
analysis
 Interval-scaled variables
 e.g., salary, height

 Binary variables
 e.g., gender (M/F), has_cancer(T/F)

 Nominal (categorical) variables


 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

 Ordinal variables
 e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

 Ratio-scaled variables
 e.g., population growth (1, 10, 100, 1000, ...)

 Variables of mixed types


 multiple attributes with various types
Similarity and Dissimilarity Between
Objects
 Distance metrics are normally used to measure the
similarity or dissimilarity between two data objects
 The most popular distance measures are special cases of the Minkowski distance:

       L_p(i, j) = ( |x_i1 - x_j1|^p + |x_i2 - x_j2|^p + ... + |x_in - x_jn|^p )^(1/p)

   where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects, and p is a positive integer

 If p = 1, L1 is the Manhattan (or city block) distance:

       L_1(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|
Similarity and Dissimilarity
Between Objects (Cont.)
 If p = 2, L2 is the Euclidean distance:

       d(i, j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_in - x_jn|^2 )

 Properties
 d(i, j) ≥ 0
 d(i, i) = 0
 d(i, j) = d(j, i)
 d(i, j) ≤ d(i, k) + d(k, j)
 Also one can use a weighted distance:

       d(i, j) = sqrt( w_1 |x_i1 - x_j1|^2 + w_2 |x_i2 - x_j2|^2 + ... + w_n |x_in - x_jn|^2 )
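The sketch below expresses these formulas in Python (illustrative code, not from the slides; the function names are mine):

import math

def minkowski(x, y, p):
    """L_p distance between two equal-length numeric sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L_1 (city block) distance: sum of absolute differences."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """L_2 distance: square root of the sum of squared differences."""
    return minkowski(x, y, 2)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one weight w_f per attribute."""
    return math.sqrt(sum(wf * (a - b) ** 2 for wf, a, b in zip(w, x, y)))

# Example with two 2-dimensional objects
i, j = (30, 5), (50, 15)
print(manhattan(i, j))   # 30
print(euclidean(i, j))   # about 22.36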
Binary Variables
 A binary variable has two states: 0 (absent), 1 (present)
 A contingency table for binary data, e.g. i = (0011101001) and j = (1001100110):

                        object j
                       1      0     sum
         object i  1   a      b     a+b
                   0   c      d     c+d
                 sum  a+c    b+d     p

 Simple matching coefficient distance:

       d(i, j) = (b + c) / (a + b + c + d)

 Jaccard coefficient distance:

       d(i, j) = (b + c) / (a + b + c)
Binary Variables
 Another approach is to define the similarity of two
objects and not their distance.
 In that case we have the following:
 Simple matching coefficient similarity:

       s(i, j) = (a + d) / (a + b + c + d)

 Jaccard coefficient similarity:

       s(i, j) = a / (a + b + c)

Note that in both cases: s(i, j) = 1 - d(i, j)


Dissimilarity between Binary
Variables
 Example (Jaccard coefficient)

 all attributes are asymmetric binary


 1 denotes presence or positive test
 0 denotes absence or negative test
0 1
d( jack , mary )   0 . 33
2  0 1
1 1
d( jack , jim )   0 . 67
1 1 1
12
d( jim , mary )   0 . 75
1 1  2
A simpler definition
 Each variable is mapped to a bitmap (binary vector)

 Jack: 101000
 Mary: 101010
 Jim: 110000
 Simple match distance:

       d(i, j) = (number of non-common bit positions) / (total number of bits)

 Jaccard coefficient:

       d(i, j) = 1 - (number of 1's in i AND j) / (number of 1's in i OR j)
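These two definitions can be written directly against the bitmaps above. The following sketch (illustrative code, not from the slides) reproduces the Jaccard distances of the Jack/Mary/Jim example:

def simple_match_distance(x, y):
    """Fraction of bit positions in which the two bitmaps differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jaccard_distance(x, y):
    """1 - (number of 1's in i AND j) / (number of 1's in i OR j)."""
    both = sum(a and b for a, b in zip(x, y))
    either = sum(a or b for a, b in zip(x, y))
    return 1 - both / either

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_distance(jack, mary), 2))   # 0.33
print(round(jaccard_distance(jack, jim), 2))    # 0.67
print(round(jaccard_distance(jim, mary), 2))    # 0.75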
Variables of Mixed Types
 A database may contain all the six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio-scaled.
 One may use a weighted formula to combine their effects.
Major Clustering Approaches
 Partitioning algorithms: Construct random partitions and then
iteratively refine them by some criterion
 Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
K-means Clustering

 Partitional clustering approach


 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest
centroid
 Number of clusters, K, must be specified
K-means Clustering Algorithm

Input: a database D of m records, r1, ..., rm, and a desired number of clusters k
Output: a set of k clusters that minimizes the squared-error criterion

Begin
  randomly choose k records as the centroids for the k clusters;
  repeat
    assign each record, ri, to a cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters;
    recalculate the centroid (mean) for each cluster based on the records assigned to the cluster;
  until no change;
End;
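A minimal NumPy sketch of this loop (illustrative only; the function name, the seeded random choice of initial centroids, and the stopping test are my own choices, not the slides' code):

import numpy as np

def k_means(records, k, seed=0):
    """Minimal k-means: choose k records as initial centroids, assign each
    record to the nearest centroid, recompute the means, repeat until no change."""
    X = np.asarray(records, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    while True:
        # distance from every record to every centroid, then pick the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # "until no change"
            return centroids, labels
        labels = new_labels
        for c in range(k):                          # recompute each centroid
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)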
K-means Clustering Example

Sample 2-dimensional records for clustering example


RID Age Years_of_service
1 30 5
2 50 25
3 50 15 C1
4 25 5
5 30 10
6 55 25 C2

Assume that the number of desired clusters k is 2.


 Let the algorithm choose records with RID 3 for cluster C1 and RID 6
for cluster C2 as the initial cluster centroids
The remaining records will be assigned to one of those clusters
during the first iteration of the repeat loop
K-means Clustering Example
The Euclidean distance between records rj and rk in n-dimensional space is calculated as:

       d(rj, rk) = sqrt( (rj1 - rk1)^2 + (rj2 - rk2)^2 + ... + (rjn - rkn)^2 )

 Here rj is the record being assigned and rk is the centroid it is compared against (C1 or C2 in the current example)

Record   distance from C1   distance from C2   it joins cluster
  1            22.4               32.0               C1
  2            10.0                5.0               C2
  4            26.9               36.1               C1
  5            20.6               29.2               C1
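For example, for record 1 = (30, 5): distance from C1 = sqrt((30 - 50)^2 + (5 - 15)^2) = sqrt(500) ≈ 22.4 and distance from C2 = sqrt((30 - 55)^2 + (5 - 25)^2) = sqrt(1025) ≈ 32.0, so record 1 joins C1.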
K-means Clustering Example

 Now, the new means (centroids) for the two clusters are computed. The mean of a cluster Ci with n records of m dimensions is the vector whose f-th component is the average of attribute f over the records in Ci:

       mean(Ci) = ( (1/n) * sum of attribute 1, ..., (1/n) * sum of attribute m )

 In our example, records (1, 3, 4, 5) belong to C1 and records (2, 6) belong to C2, so:

       C1(new) = ( 1/4 (30 + 50 + 25 + 30), 1/4 (5 + 15 + 5 + 10) ) = (33.75, 8.75)
       C2(new) = ( 1/2 (50 + 55), 1/2 (25 + 25) ) = (52.5, 25)
K-means Clustering Example

 A second iteration proceeds to get the distance of each record with new
centroids
 In the following table: calculate the distance of each record from the new
C1 and C2, and assign each to the suitable cluster

Record distance from C1 distance from C2 it joins cluster

1
2
3
4
5
6
Then calculate the new C1 and C2.
Tip: C1 will be (28.3, 6.7) and C2 will be (51.7, 21.7)
K-means Clustering Example

 Move to the next iteration and repeat the steps from the previous slide
 Stop when you get the same results (no record changes its cluster)
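Running the k_means sketch from the algorithm slide on this six-record example (the call below is illustrative; results may vary with the random initial centroids) should converge to roughly the centroids given in the tip:

records = [[30, 5], [50, 25], [50, 15], [25, 5], [30, 10], [55, 25]]
centroids, labels = k_means(records, k=2)
print(centroids)   # expected to be close to (28.3, 6.7) and (51.7, 21.7)
print(labels)      # records 1, 4, 5 in one cluster; records 2, 3, 6 in the other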
K-means Clustering – Details

 Initial centroids are often chosen randomly.


 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to ‘Until relatively few
points change clusters’
 Complexity is O( n * K * I * d )
 n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
 The terminating condition is usually the squared-error criterion. For clusters C1, ..., Ck with means m1, ..., mk, the error is defined as:

       E = sum over i = 1..k of ( sum over rj in Ci of ||rj - mi||^2 )

 where rj is a data point in cluster Ci and mi is the corresponding mean of the cluster
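A short Python sketch of this criterion (illustrative; sse is my own name), written to work with the labels and centroids returned by the k_means sketch above:

import numpy as np

def sse(records, labels, centroids):
    """Squared-error criterion: sum over clusters of the squared distances
    between each record and the mean of the cluster it is assigned to.
    labels is a NumPy integer array of cluster indices, as returned by k_means."""
    X = np.asarray(records, dtype=float)
    return sum(np.sum((X[labels == c] - centroids[c]) ** 2)
               for c in range(len(centroids)))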
Solutions to Initial Centroids
Problem
 Multiple runs
 Helps, but probability is not on your side
 Sample and use hierarchical clustering to
determine initial centroids
 Select more than k initial centroids and then
select among these initial centroids
 Select most widely separated
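A hedged sketch of the "multiple runs" idea, reusing the k_means and sse sketches above: run the algorithm with several random seeds and keep the clustering with the lowest squared error.

best = None
for seed in range(10):                       # several random restarts
    centroids, labels = k_means(records, k=2, seed=seed)
    err = sse(records, labels, centroids)
    if best is None or err < best[0]:
        best = (err, centroids, labels)      # keep the lowest-error run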
