Cluster Analysis

Cluster analysis is an unsupervised machine learning technique used to group unlabeled data points into clusters. The goal is to maximize the similarity of data points within each cluster while maximizing the dissimilarity between clusters. Good clustering produces high intra-cluster similarity and low inter-cluster similarity. Cluster analysis is commonly used for pattern recognition, image processing, bioinformatics, and document classification. Common requirements for clustering in data mining include scalability, ability to handle different data types, and discovery of clusters with arbitrary shapes.


Cluster Analysis

What is Cluster Analysis?


 Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
[Figure: two groups of points, annotated to show that intra-cluster distances are minimized while inter-cluster distances are maximized]
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined
classes
 Clustering is used:
 As a stand-alone tool to get insight into data distribution
 Visualization of clusters may unveil important information
 As a preprocessing step for other algorithms
 Efficient indexing or compression often relies on clustering
Some Applications of
Clustering
 Pattern Recognition
 Image Processing
 cluster images based on their visual content
 Bio-informatics
 WWW and IR
 document classification
 cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
 The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Requirements of Clustering in
Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Usability
Outliers
 Outliers are objects that do not belong to any cluster
or form clusters of very small cardinality

[Figure: a cloud of points forming a cluster, with a few isolated points labeled as outliers]

 In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Data Structures
 Data matrix (two modes): n tuples/objects described by p attributes/dimensions

       [ x_11  ...  x_1f  ...  x_1p ]
       [  ...  ...   ...  ...   ... ]
       [ x_i1  ...  x_if  ...  x_ip ]
       [  ...  ...   ...  ...   ... ]
       [ x_n1  ...  x_nf  ...  x_np ]

 Dissimilarity or distance matrix (one mode): an n-by-n table of the pairwise distances d(i, j) between objects

       [    0                              ]
       [ d(2,1)     0                      ]
       [ d(3,1)  d(3,2)     0              ]
       [    :       :       :              ]
       [ d(n,1)  d(n,2)    ...   ...   0   ]

 Assuming a symmetric distance: d(i, j) = d(j, i)
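As a rough illustration of how the two structures relate, here is a minimal Python sketch (not from the slides; the function name and the choice of NumPy and Euclidean distance are my own) that builds the n-by-n dissimilarity matrix from an n-by-p data matrix:

import numpy as np

def dissimilarity_matrix(X):
    """Build the n x n matrix of pairwise Euclidean distances
    from an n x p data matrix X (one row per object)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):                      # lower triangle only
            D[i, j] = np.linalg.norm(X[i] - X[j])
            D[j, i] = D[i, j]                   # symmetric: d(i, j) = d(j, i)
    return D

# Example: three objects with two attributes each
print(dissimilarity_matrix([[1, 2], [4, 6], [1, 3]]))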
Measuring Similarity in
Clustering
 Dissimilarity/Similarity metric:

 The dissimilarity d(i, j) between two objects i and j is expressed in


terms of a distance function, which is typically a metric:
 d(i, j) ≥ 0 (non-negativity)
 d(i, i)=0 (isolation)
 d(i, j)= d(j, i) (symmetry)
 d(i, j) ≤ d(i, h)+d(h, j) (triangular inequality)

 The definitions of distance functions are usually different


for interval-scaled, boolean, categorical, ordinal and ratio-
scaled variables.

 Weights may be associated with different variables based


on applications and data semantics.
Type of data in cluster
analysis
 Interval-scaled variables
 e.g., salary, height

 Binary variables
 e.g., gender (M/F), has_cancer(T/F)

 Nominal (categorical) variables


 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

 Ordinal variables
 e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

 Ratio-scaled variables
 e.g., population growth (1, 10, 100, 1000, ...)

 Variables of mixed types


 multiple attributes with various types
Similarity and Dissimilarity Between
Objects
 Distance metrics are normally used to measure the
similarity or dissimilarity between two data objects
 The most popular distance measures are special cases of the Minkowski distance:

       L_p(i, j) = ( |x_i1 - x_j1|^p + |x_i2 - x_j2|^p + ... + |x_in - x_jn|^p )^(1/p)

   where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects, and p is a positive integer

 If p = 1, L1 is the Manhattan (or city block) distance:

       L_1(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|
Similarity and Dissimilarity
Between Objects (Cont.)
 If p = 2, L2 is the Euclidean distance:

       d(i, j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_in - x_jn|^2 )

 Properties
 d(i, j) ≥ 0
 d(i, i) = 0
 d(i, j) = d(j, i)
 d(i, j) ≤ d(i, k) + d(k, j)
 Also one can use a weighted distance:

       d(i, j) = sqrt( w_1 |x_i1 - x_j1|^2 + w_2 |x_i2 - x_j2|^2 + ... + w_n |x_in - x_jn|^2 )
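The sketch below expresses these formulas in Python (illustrative code, not from the slides; the function names are mine):

import math

def minkowski(x, y, p):
    """L_p distance between two equal-length numeric sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L_1 (city block) distance: sum of absolute differences."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """L_2 distance: square root of the sum of squared differences."""
    return minkowski(x, y, 2)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one weight w_f per attribute."""
    return math.sqrt(sum(wf * (a - b) ** 2 for wf, a, b in zip(w, x, y)))

# Example with two 2-dimensional objects
i, j = (30, 5), (50, 15)
print(manhattan(i, j))   # 30
print(euclidean(i, j))   # about 22.36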
Binary Variables
 A binary variable has two states: 0 (absent), 1 (present)
 A contingency table for binary data, e.g. i = (0011101001) and j = (1001100110):

                        object j
                       1      0     sum
         object i  1   a      b     a+b
                   0   c      d     c+d
                 sum  a+c    b+d     p

 Simple matching coefficient distance:

       d(i, j) = (b + c) / (a + b + c + d)

 Jaccard coefficient distance:

       d(i, j) = (b + c) / (a + b + c)
Binary Variables
 Another approach is to define the similarity of two
objects and not their distance.
 In that case we have the following:
 Simple matching coefficient similarity:

       s(i, j) = (a + d) / (a + b + c + d)

 Jaccard coefficient similarity:

       s(i, j) = a / (a + b + c)

Note that in both cases: s(i, j) = 1 - d(i, j)


Dissimilarity between Binary
Variables
 Example (Jaccard coefficient)

 all attributes are asymmetric binary


 1 denotes presence or positive test
 0 denotes absence or negative test
0 1
d( jack , mary )   0 . 33
2  0 1
1 1
d( jack , jim )   0 . 67
1 1 1
12
d( jim , mary )   0 . 75
1 1  2
A simpler definition
 Each variable is mapped to a bitmap (binary vector)

 Jack: 101000
 Mary: 101010
 Jim: 110000
 Simple match distance:

       d(i, j) = (number of non-common bit positions) / (total number of bits)

 Jaccard coefficient:

       d(i, j) = 1 - (number of 1's in i AND j) / (number of 1's in i OR j)
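These two definitions can be written directly against the bitmaps above. The following sketch (illustrative code, not from the slides) reproduces the Jaccard distances of the Jack/Mary/Jim example:

def simple_match_distance(x, y):
    """Fraction of bit positions in which the two bitmaps differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jaccard_distance(x, y):
    """1 - (number of 1's in i AND j) / (number of 1's in i OR j)."""
    both = sum(a and b for a, b in zip(x, y))
    either = sum(a or b for a, b in zip(x, y))
    return 1 - both / either

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_distance(jack, mary), 2))   # 0.33
print(round(jaccard_distance(jack, jim), 2))    # 0.67
print(round(jaccard_distance(jim, mary), 2))    # 0.75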
Variables of Mixed Types
 A database may contain all the six types of variables
 symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio-scaled.
 One may use a weighted formula to combine their effects.
Major Clustering Approaches
 Partitioning algorithms: Construct random partitions and then
iteratively refine them by some criterion
 Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
K-means Clustering

 Partitional clustering approach


 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest
centroid
 Number of clusters, K, must be specified
K-means Clustering Algorithm

Input: a database D of m records, r1, ..., rm, and a desired number of clusters k
Output: a set of k clusters that minimizes the squared-error criterion

Begin
  randomly choose k records as the centroids for the k clusters;
  repeat
    assign each record, ri, to a cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters;
    recalculate the centroid (mean) for each cluster based on the records assigned to the cluster;
  until no change;
End;
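A minimal NumPy sketch of this loop (illustrative only; the function name, the seeded random choice of initial centroids, and the stopping test are my own choices, not the slides' code):

import numpy as np

def k_means(records, k, seed=0):
    """Minimal k-means: choose k records as initial centroids, assign each
    record to the nearest centroid, recompute the means, repeat until no change."""
    X = np.asarray(records, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    while True:
        # distance from every record to every centroid, then pick the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # "until no change"
            return centroids, labels
        labels = new_labels
        for c in range(k):                          # recompute each centroid
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)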
K-means Clustering Example

Sample 2-dimensional records for clustering example


RID Age Years_of_service
1 30 5
2 50 25
3 50 15 C1
4 25 5
5 30 10
6 55 25 C2

Assume that the number of desired clusters k is 2.


 Let the algorithm choose records with RID 3 for cluster C1 and RID 6
for cluster C2 as the initial cluster centroids
The remaining records will be assigned to one of those clusters
during the first iteration of the repeat loop
K-means Clustering Example
The Euclidean distance between records rj and rk in n-dimensional space is calculated as:

       d(rj, rk) = sqrt( (rj1 - rk1)^2 + (rj2 - rk2)^2 + ... + (rjn - rkn)^2 )

 Here rj is the record being assigned and rk is the centroid it is compared against (C1 or C2 in the current example)

Record   distance from C1   distance from C2   it joins cluster
  1            22.4               32.0               C1
  2            10.0                5.0               C2
  4            26.9               36.1               C1
  5            20.6               29.2               C1
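For example, for record 1 = (30, 5): distance from C1 = sqrt((30 - 50)^2 + (5 - 15)^2) = sqrt(500) ≈ 22.4 and distance from C2 = sqrt((30 - 55)^2 + (5 - 25)^2) = sqrt(1025) ≈ 32.0, so record 1 joins C1.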
K-means Clustering Example

 Now, the new means (centroids) for the two clusters are computed. The mean of a cluster Ci with n records of m dimensions is the vector whose f-th component is the average of attribute f over the records in Ci:

       mean(Ci) = ( (1/n) * sum of attribute 1, ..., (1/n) * sum of attribute m )

 In our example, records (1, 3, 4, 5) belong to C1 and records (2, 6) belong to C2, so:

       C1(new) = ( 1/4 (30 + 50 + 25 + 30), 1/4 (5 + 15 + 5 + 10) ) = (33.75, 8.75)
       C2(new) = ( 1/2 (50 + 55), 1/2 (25 + 25) ) = (52.5, 25)
K-means Clustering Example

 A second iteration proceeds to get the distance of each record with new
centroids
 In the following table: calculate the distance of each record from the new
C1 and C2, and assign each to the suitable cluster

Record distance from C1 distance from C2 it joins cluster

1
2
3
4
5
6
Then calculate the new C1 and C2.
Tip: C1 will be (28.3, 6.7) and C2 will be (51.7, 21.7)
K-means Clustering Example

 Move to the next iteration and repeat the steps from the previous slide
 Stop when you get the same results (no record changes its cluster)
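Running the k_means sketch from the algorithm slide on this six-record example (the call below is illustrative; results may vary with the random initial centroids) should converge to roughly the centroids given in the tip:

records = [[30, 5], [50, 25], [50, 15], [25, 5], [30, 10], [55, 25]]
centroids, labels = k_means(records, k=2)
print(centroids)   # expected to be close to (28.3, 6.7) and (51.7, 21.7)
print(labels)      # records 1, 4, 5 in one cluster; records 2, 3, 6 in the other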
K-means Clustering – Details

 Initial centroids are often chosen randomly.


 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to ‘Until relatively few
points change clusters’
 Complexity is O( n * K * I * d )
 n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
 The terminating condition is usually the squared-error criterion. For clusters C1, ..., Ck with means m1, ..., mk, the error is defined as:

       E = sum over i = 1..k of ( sum over rj in Ci of ||rj - mi||^2 )

 where rj is a data point in cluster Ci and mi is the corresponding mean of the cluster
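A short Python sketch of this criterion (illustrative; sse is my own name), written to work with the labels and centroids returned by the k_means sketch above:

import numpy as np

def sse(records, labels, centroids):
    """Squared-error criterion: sum over clusters of the squared distances
    between each record and the mean of the cluster it is assigned to.
    labels is a NumPy integer array of cluster indices, as returned by k_means."""
    X = np.asarray(records, dtype=float)
    return sum(np.sum((X[labels == c] - centroids[c]) ** 2)
               for c in range(len(centroids)))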
Solutions to Initial Centroids
Problem
 Multiple runs
 Helps, but probability is not on your side
 Sample and use hierarchical clustering to
determine initial centroids
 Select more than k initial centroids and then
select among these initial centroids
 Select most widely separated
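A hedged sketch of the "multiple runs" idea, reusing the k_means and sse sketches above: run the algorithm with several random seeds and keep the clustering with the lowest squared error.

best = None
for seed in range(10):                       # several random restarts
    centroids, labels = k_means(records, k=2, seed=seed)
    err = sse(records, labels, centroids)
    if best is None or err < best[0]:
        best = (err, centroids, labels)      # keep the lowest-error run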
