
Module 4 : Data Mining

Clustering Algorithms

Priya R L
Faculty Incharge for CSC 504
Department of Computer Engineering
VES Institute of Technology, Mumbai



Agenda
● Introduction to Clustering
● What is Cluster Analysis?
● Applications of Clustering
● Hard Clustering Vs. Soft Clustering
● Stages in Clustering
● Clustering Algorithms
● K-Means – Examples
● K-Medoids
● Hierarchical Clustering
● Q&A



Clustering : Definition

• Clustering is “the process of organizing objects into groups whose members are similar in some way”.

• A cluster is therefore a collection of objects which are “similar” between them and “dissimilar” to the objects belonging to other clusters.



What is Cluster Analysis?
Cluster: a collection of data objects
● Similar to one another within the same cluster
● Dissimilar to the objects in other clusters
Cluster analysis
● Finding similarities between data according to the characteristics found
in the data and grouping similar data objects into clusters
Cluster analysis (or clustering, data segmentation)
● Given a set of data points, partition them into a set of groups (i.e.,
clusters) whose members are as similar as possible.



What is Cluster Analysis?

● Unsupervised learning: no predefined classes


● Typical ways to use/apply cluster analysis

○ As a stand-alone tool to get insight into data distribution

○ As a preprocessing step for other algorithms




Applications of Clustering
● Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
● Information retrieval: document clustering
● Land use: Identification of areas of similar land use in an earth observation database
● Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
● City-planning: Identifying groups of houses according to their house type, value, and
geographical location
● Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
● Climate: Understanding the earth's climate, finding patterns in atmospheric and ocean data
● Economic science: Market research
Clustering Examples
● Image Segmentation – Goal: Break up the image into
meaningful or perceptually similar regions.



Clustering as a Preprocessing Tool
Summarization:
● Preprocessing for regression, PCA, classification, and association analysis
Compression:
● Image processing: vector quantization
Finding K-nearest Neighbors
● Localizing search to one or a small number of clusters
Outlier detection
● Outliers are often viewed as those “far away” from any cluster



What is Good Clustering?
● A good clustering method will produce high quality clusters

○ high intra-class similarity: cohesive within clusters

○ low inter-class similarity: distinctive between clusters


● The quality of a clustering method depends on

○ the similarity measure used by the method

○ its implementation, and

○ its ability to discover some or all of the hidden patterns



Hard Clustering Vs. Soft Clustering

Hard clustering: Each document belongs to exactly one cluster


● More common and easier to do
Soft clustering: A document can belong to more than one cluster.
● Makes more sense for applications like creating browsable hierarchies
● You may want to put a pair of sneakers in two clusters: (i) sports apparel
and (ii) shoes
● You can only do that with a soft clustering approach.



Considerations for Cluster Analysis
Partitioning criteria
● Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
Separation of clusters
● Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
Similarity measure
● Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g.,
density or contiguity)
Clustering space
● Full space (often when low dimensional) vs. subspaces (often in high-dimensional
clustering)



Stages in Clustering



Types of Data
❖ Common forms of data variables:
✧ Interval-scaled (real-valued, linear).
✧ Ratio-scaled (real-valued, nonlinear).
✧ Binary (symmetric & asymmetric).
✧ Categorical (generalization of binary).
✧ Ordinal (ordered categories).

Object ID   Length   Rarity     Selected   Colour   Quality
1           57.0     0.02       0          blue     poor
2           48.6     0.000001   1          blue     excellent
3           51.9     0.0005     1          green    fair



Common Similarity Measures
Interval-scaled vectors:

• Euclidean distance.

• Manhattan (L1) distance.

• Weighted Minkowski (Lp) distance.
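As a rough illustration (not from the slides; the function names and sample points are our own), a minimal Python sketch of these three measures:

```python
import math

def euclidean(x, y):
    # L2 norm: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def weighted_minkowski(x, y, w, p=2):
    # Weighted Lp norm; p=1 gives weighted Manhattan, p=2 weighted Euclidean
    return sum(wi * abs(a - b) ** p for wi, a, b in zip(w, x, y)) ** (1.0 / p)

print(euclidean((1, 1), (4, 5)))    # 5.0
print(manhattan((1, 1), (4, 5)))    # 7
```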



DM : Clustering Techniques



Clustering Algorithms



Partitioning Algorithms: Basic Concepts
● Partitioning method: Construct a partition of a database D of n objects into a set of k
clusters
● Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

○ Global optimal: exhaustively enumerate all partitions

○ Heuristic methods: k-means and k-medoids algorithms

○ k-means (MacQueen’67): Each cluster is represented by the center of the cluster

○ k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster



K-Means Algorithms: Basic concepts
● Given a K, find a partition of K clusters to optimize the chosen
partitioning criterion (cost function)

o Global optimum: exhaustively search all partitions

● The K-means algorithm: a heuristic method

o K-means algorithm (MacQueen’67): each cluster is represented by the center of the
cluster, and the algorithm converges to stable centroids of clusters.

o The K-means algorithm is the simplest partitioning method for
clustering analysis and is widely used in data mining applications.
K-Means Algorithm
● First, it selects k objects at random from the set of n objects. These k
objects are treated as the centroids (centers of gravity) of k clusters.
● Each of the remaining objects is assigned to the closest centroid. The
collection of objects assigned to a centroid is called a cluster.
● Next, the centroid of each cluster is updated (by calculating the mean
values of the attributes of the objects in the cluster).
● The assignment and update procedure is repeated until some stopping
criterion is reached (such as a maximum number of iterations, centroids
remaining unchanged, or no reassignment of objects).



K-Means Algorithm



K-Means Algorithm
Input: D, a dataset containing n objects; k, the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D:
   • Compute the distance between the current object and the k cluster centroids.
   • Assign the current object to the cluster whose centroid is closest.
3. Compute the “cluster centers” of each cluster. These become the new cluster
centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop.
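A minimal pure-Python sketch of these steps (our own illustrative code, not part of the course material); it assumes each object is a tuple of numeric attributes and uses squared Euclidean distance for the assignment step:

```python
import random

def kmeans(points, k, init=None, max_iter=100):
    # Step 1: choose k initial centroids (randomly, unless explicit seeds are given)
    centroids = list(init) if init is not None else random.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every object to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster members
        new_centroids = [
            tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change (convergence criterion)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```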



K-Means Algorithm: Notes



K-Means Algorithm: Example
Consider K=2.



K-Means Algorithm: Example
• Step 1:
• Initialization: We randomly choose the following two centroids (k=2) for the two
clusters.
• In this case the 2 centroids are: m1=(1.0, 1.0) and m2=(5.0, 7.0).



K-Means Algorithm: Example

Step 2:
● Thus, we obtain two clusters
containing: {1,2,3} and {4,5,6,7}.
● Their new centroids are:



K-Means Algorithm: Example

Step 3:

● Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.

● Therefore, the new clusters are: {1,2} and {3,4,5,6,7}.

● The next centroids are: m1=(1.25, 1.5) and m2=(3.9, 5.1).



K-Means Algorithm: Example
● Step 4:

The clusters obtained are: {1,2} and {3,4,5,6,7}.

● Therefore, there is no change in the clusters.

● Thus, the algorithm halts here, and the final result consists of the 2 clusters
{1,2} and {3,4,5,6,7}.



K-Means - Example: Output Plot



K-Means Algorithm: Example
• Suppose we have 4 types of medicines and each has two attributes (pH and
weight index). Our goal is to group these objects into K=2 groups of medicine.

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4
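For illustration, the kmeans sketch from the earlier slide can be run on this table. The initial seeds used below (medicines A and B) are an assumption made here for reproducibility; the slide chooses its seeds in the accompanying figure.

```python
# Medicine data from the table above: (weight, pH-index)
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# Assumed initial seeds (not stated in the text): medicines A and B
centroids, clusters = kmeans(list(medicines.values()), k=2,
                             init=[medicines["A"], medicines["B"]])
print(centroids)   # (1.5, 1.0) and (4.5, 3.5) for these seeds
print(clusters)    # A, B end up in one cluster and C, D in the other
```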
K-Means Algorithm: Example
● Step 1: Use initial seed points for partitioning. Compute the Euclidean distance
from each object to each seed point, and assign each object to the cluster with the
nearest seed point.
K-Means Algorithm: Example
● Step 2: Compute new centroids of the current partition. Knowing the members of
each cluster, we compute the new centroid of each group based on these new
memberships.



K-Means Algorithm: Example
● Step 2 (continued): Renew membership based on the new centroids. Compute the
distance of all objects to the new centroids, and assign each object to the nearest
centroid.



K-Means Algorithm: Example
● Step 3: Repeat the first two steps until convergence. Knowing the members of each
cluster, we compute the new centroid of each group based on these new memberships.



K-Means Algorithm: Example
● Step 3 (continued): Compute the distance of all objects to the new centroids.

● Stop, since there is no new assignment: the membership of each cluster no longer changes.



Comments on k-Means algorithm
Distance Measurement:
• To assign a point to the closest centroid, we need a proximity measure that
quantifies the notion of “closest” for the objects under clustering.

• Usually Euclidean distance (L2 norm) is the best measure when object points are
defined in n-dimensional Euclidean space.

• Another measure, cosine similarity, is more appropriate when the objects are
documents.

• Further, there may be other types of proximity measures that are appropriate in the
context of particular applications.

• For example, Manhattan distance (L1 norm), the Jaccard measure, etc.





Comments on k-Means algorithm
Distance with document objects
Suppose a set of n document objects is represented as a document-term matrix (DTM),
with one row per document (D1, D2, ..., Dn) and one column per term (t1, t2, ..., tn).
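A small sketch of cosine similarity between two rows of such a document-term matrix (our own code; the term frequencies below are made up purely for illustration):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||); values near 1 indicate similar documents
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Illustrative DTM rows: frequencies of terms t1..t4 in documents D1 and D2
d1 = [2, 0, 1, 3]
d2 = [1, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 2))   # 0.87
```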



Comments on k-Means algorithm

Note: The criteria of the objective function with different proximity measures:

1. SSE (using the L2 norm): To minimize the SSE.

2. SAE (using the L1 norm): To minimize the SAE.

3. TC (using cosine similarity): To maximize the TC.
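For reference, the usual forms of these objectives (our notation, following standard treatments such as Tan, Steinbach & Kumar), with clusters C_1, ..., C_K and centroids c_1, ..., c_K:

```latex
% Sum of Squared Error (L2 norm) -- to be minimized
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert_2^{2}

% Sum of Absolute Error (L1 norm) -- to be minimized
SAE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert_1

% Total Cohesion (cosine similarity) -- to be maximized
TC = \sum_{i=1}^{K} \sum_{x \in C_i} \cos(x, c_i)
```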




K-Means Algorithm: Pros and Cons
● Strength: Relatively efficient: O(tkn), where n is the number of objects, k is the number of
clusters, and t is the number of iterations. Normally, k, t << n.

Compare: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))

● Weakness
○ Applicable only when the mean is defined; what about categorical data?

○ Need to specify k, the number of clusters, in advance

○ Unable to handle noisy data and outliers

○ Not suitable for discovering clusters with non-convex shapes

○ k-means finds a local optimum and may fail to reach the global optimum



Comments on k-Means algorithm : Disadvantages
• k-means has trouble clustering data that contains outliers. When the SSE is
used as the objective function, outliers can unduly influence the clusters that are
produced. More precisely, in the presence of outliers, the cluster centroids are
not truly as representative as they would be otherwise, and it also influences
the SSE measure.

• The k-means algorithm cannot handle non-globular clusters, or clusters of
different sizes and densities (see Fig 16.6 on the next slide).

• The k-means algorithm is not really beyond scalability issues (and is not so
practical for very large databases).



Comments on k-Means algorithm

Fig 16.6: Some failure instances of the k-means algorithm: clusters with different sizes,
clusters with different densities, and non-convex shaped clusters.
What is the problem of k-Means Method?
● The k-means algorithm is sensitive to outliers !

○ Since an object with an extremely large value may substantially distort the
distribution of the data.
● K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located
object in a cluster.




How can we modify the k-means algorithm to diminish
such sensitivity to outliers?
● We can pick actual objects to represent the clusters, using one representative
object per cluster
● The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its
corresponding representative object
● That is, an absolute-error criterion is used, defined as
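The slide's formula is not reproduced in this copy; a standard way to write this absolute-error criterion (e.g., as in Han & Kamber), with representative objects o_1, ..., o_k, is:

```latex
% Sum of the dissimilarities between each object p and the representative
% object o_j of its cluster C_j -- to be minimized
E = \sum_{j=1}^{k} \sum_{p \in C_j} \operatorname{dist}(p, o_j)
```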



Four Cases of the Cost Function for K-medoids Clustering



K-medoids Clustering : Algorithm



A Typical K-Medoids Algorithm (PAM)

K = 2. Total cost of the initial configuration = 20.
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object, O_random.
4. Compute the total cost of swapping a medoid O with O_random (here the swapped
   configuration has total cost 26).
5. Swap O and O_random if the quality (total cost) is improved; otherwise keep the
   current medoids. Repeat steps 3-5 until there is no change.
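A simplified PAM-style sketch in Python following the steps above (our own illustration, not the course's code; it assumes the objects are numeric tuples, uses Manhattan distance as in the worked example below, and evaluates every medoid/non-medoid swap per iteration):

```python
import random
from itertools import product

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # Sum of dissimilarities between each object and its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, max_iter=100):
    # Arbitrarily choose k objects as the initial medoids
    medoids = random.sample(points, k)
    cost = total_cost(points, medoids)
    for _ in range(max_iter):
        best_swap, best_cost = None, cost
        # Try swapping each medoid with each non-medoid object
        non_medoids = [p for p in points if p not in medoids]
        for i, o in product(range(k), non_medoids):
            candidate = medoids[:i] + [o] + medoids[i + 1:]
            c = total_cost(points, candidate)
            if c < best_cost:
                best_swap, best_cost = candidate, c
        if best_swap is None:        # no swap improves quality: stop
            break
        medoids, cost = best_swap, best_cost
    # Assign each object to its nearest medoid to form the final clusters
    clusters = {m: [p for p in points
                    if min(medoids, key=lambda x: manhattan(p, x)) == m]
                for m in medoids}
    return medoids, cost, clusters
```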
K-Medoids : Example
● The dissimilarity between a medoid (Ci) and an object (Pi) is calculated using E = |Pi - Ci|.
● Given a dataset D of n (n = 10) objects:



K-Medoids : Example
● If a graph is drawn using the above data points, we obtain the
following:



K-Medoids : Example
● Step 1:
Randomly select 2 medoids: let k = 2, and let C1 = (4, 5) and C2 = (8, 5) be the two medoids.
● Step 2: Calculate the cost.
The dissimilarity of each non-medoid point with the medoids is calculated and tabulated:



K-Medoids : Example
• Each point is assigned to the cluster of the medoid whose dissimilarity is smallest.
The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
The cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20.
• Step 3: Randomly select one non-medoid point and recalculate the cost.
Let the randomly selected point be (8, 4). The dissimilarity of each non-medoid point
with the medoids C1 (4, 5) and C2 (8, 4) is calculated and tabulated.



K-Medoids : Example
• Each point is assigned to the cluster whose medoid's dissimilarity is smallest. So, the points 1, 2,
5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
The new cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22.
Swap cost = new cost – previous cost = 22 – 20 = 2, and 2 > 0.
• As the swap cost is not less than zero, we undo the swap. Hence the previous medoids
are the final medoids, and the clustering would be formed accordingly.
K-Medoids : Pros & Cons
The time complexity is O(k(n-k)^2).
Advantages:
• It is simple to understand and easy to implement.
• The K-Medoids algorithm is fast and converges in a fixed number of steps.
• PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages:
• Not suitable for clustering non-spherical (arbitrarily shaped) groups of objects. This is
because it relies on minimizing the distances between the non-medoid objects and the
medoid (the cluster center) – briefly, it uses compactness as the clustering criterion instead of
connectivity.
• It may obtain different results for different runs on the same dataset, because the first k
medoids are chosen randomly.
Exercise

● Given: {2,4,10,12,3,20,30,11,25}, k=2


● Randomly assign means: m1=3,m2=4
● Solve for the rest ….
● Similarly try for k-medoids



Comparison between K-Means & K-Medoids
   Mean Values                                 Medoids
1. Suited only for continuous domains       1. Suited for either continuous or discrete domains
2. Algorithms using means are sensitive     2. Algorithms using medoids are less sensitive
   to outliers                                 to outliers
3. The mean has a clear geometrical         3. The medoid does not have a clear geometrical
   and statistical meaning                     meaning
4. Algorithms using means are not           4. Algorithms using medoids are more
   computationally demanding                   computationally demanding


Taxonomy of Clustering Approaches



Hierarchical Clustering
● Clusters are created in levels, actually creating sets of clusters at each level.

● Agglomerative
○ Initially each item in its own cluster
○ Iteratively clusters are merged together
○ Bottom Up
● Divisive
○ Initially all items in one cluster
○ Large clusters are successively divided
○ Top Down



Hierarchical Clustering
● Use distance matrix as clustering criteria. This method does not
require the number of clusters k as an input, but needs a termination
condition
(Figure: AGNES, agglomerative, proceeds bottom-up from step 0 to step 4, merging a and b
into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde.
DIANA, divisive, performs the same steps top-down, from step 4 back to step 0.)
Hierarchical Clustering: Dendrogram
● Dendrogram: a tree data structure which
illustrates hierarchical clustering techniques.
● Each level shows clusters for that level.

○ Leaf – individual clusters

○ Root – one cluster


● A cluster at level i is the union of its children
clusters at level i+1.



Dendrogram
• A tree structure describing the merge / split history.
• In this example, clusters are split / merged according to the closest pair of cluster
  members: the “single-linkage” strategy.

Distance matrix d(*,*):
       A      B      C      D      E
A      0     0.1    0.8    0.7    1.0
B     0.1     0     0.5    0.6    0.9
C     0.8    0.5     0     0.3    0.4
D     0.7    0.6    0.3     0     0.2
E     1.0    0.9    0.4    0.2     0

Agglomerative steps (AGNES): merge {A, B} at 0.1, {D, E} at 0.2, {C} with {D, E} at 0.3,
and finally {A, B} with {C, D, E} at 0.5. Divisive steps (DIANA) traverse the same tree in reverse.
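A naive Python sketch of this single-linkage procedure (our own code), run on the distance matrix above; it reproduces the merge heights 0.1, 0.2, 0.3 and 0.5:

```python
def single_link(labels, dist):
    # Naive single-linkage agglomerative clustering from a pairwise distance table.
    d = lambda a, b: dist[(a, b)] if (a, b) in dist else dist[(b, a)]
    clusters = [{x} for x in labels]
    merges = []
    while len(clusters) > 1:
        # cluster distance = smallest distance between any pair of members (single link)
        dmin, i, j = min((min(d(a, b) for a in clusters[i] for b in clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        merges.append((clusters[i], clusters[j], dmin))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
                   + [clusters[i] | clusters[j]]
    return merges

dist = {("A", "B"): 0.1, ("A", "C"): 0.8, ("A", "D"): 0.7, ("A", "E"): 1.0,
        ("B", "C"): 0.5, ("B", "D"): 0.6, ("B", "E"): 0.9,
        ("C", "D"): 0.3, ("C", "E"): 0.4, ("D", "E"): 0.2}
for ci, cj, h in single_link(list("ABCDE"), dist):
    print(ci, cj, h)   # merges at heights 0.1, 0.2, 0.3, 0.5
```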
Hierarchical Algorithms

● Single Link
● Complete Link
● Average Link



Distance Between Clusters

● Single Link: smallest distance between points


● Complete Link: largest distance between points
● Average Link: average distance between points
● Centroid: distance between centroids




Inter-Cluster Distance
• Common measures:

• Minimum distance (single linkage).

• Maximum distance (complete linkage).

• Average distance.

• Mean distance.



Agglomerative Clustering : Single Link

Single link

• In single-link hierarchical clustering, we merge in each step the two clusters whose
two closest members have the smallest distance.



Agglomerative Clustering : Complete Link

Complete link

• In complete-link hierarchical clustering, we merge in each step the two
clusters whose merger has the smallest diameter.



Agglomerative Clustering



Agglomerative Clustering : Example

• Consider the following matrix. Apply Single Link, Complete Link and
Average Link

Items   A   B   C   D   E
A       0   1   2   2   3
B       1   0   2   4   3
C       2   2   0   1   5
D       2   4   1   0   3
E       3   3   5   3   0



Agglomerative Clustering : Example (single link)
Step 1: The smallest entry is 1, for (A, B) and (C, D). Merge A and B:

       A,B   C   D   E
A,B     0    2   2   3
C       2    0   1   5
D       2    1   0   3
E       3    5   3   0

Step 2: The smallest entry is now 1, for (C, D). Merge C and D:

       A,B   C,D   E
A,B     0     2    3
C,D     2     0    3
E       3     3    0

Step 3: Merge {A, B} with {C, D} at distance 2:

          A,B,C,D   E
A,B,C,D      0      3
E            3      0

Step 4: Finally, {A, B, C, D} and {E} merge at distance 3.
Use single link technique to find clusters in
the given data

Object X Y
A 2 2
B 3 2
C 1 1
D 3 1
E 1.5 0.5



Agglomerative Clustering : Example
○ Construct the distance matrix.

○ Use the Euclidean distance to determine the minimum distance:

d(A,B) = √((x2 - x1)² + (y2 - y1)²)

      A      B      C      D      E
A     0
B     1      0
C     1.41   2.24   0
D     1.41   1      2      0
E     1.58   2.12   0.71   1.58   0
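A quick NumPy check of this matrix (our own snippet; it computes all pairwise Euclidean distances by broadcasting):

```python
import numpy as np

# Points A, B, C, D, E from the table above
pts = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])

# d[i, j] = ||pts[i] - pts[j]||
diff = pts[:, None, :] - pts[None, :, :]
dmat = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dmat, 2))
```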
Agglomerative Clustering : Example
● SINGLE LINK:
● Step 1:
● Since d(C, E) = 0.71 is the minimum, we combine C and E into one cluster.

A B C,E D
A 0
B 1 0
C,E 1.41 2.12 0
D 1.41 1 1.58 0



Agglomerative Clustering : Example
● Step 2:
● d(A, B) = 1 is the minimum value, therefore we merge these two clusters.

A,B C,E D
A,B 0
C,E 1.41 0
D 1 1.58 0



Agglomerative Clustering : Example
● Step 3:
● d({A, B}, D) = 1 is the minimum value, therefore we merge these two clusters.

A,B,D C,E
A,B,D 0
C,E 1.41 0



Agglomerative Clustering : Example

Step 4:
● We have two clusters to be combined
● Construct a dendrogram for the same
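The same result can be checked with SciPy (a usage sketch, assuming scipy and matplotlib are installed): single-linkage clustering of the five points and a dendrogram plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

pts = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])   # A, B, C, D, E
Z = linkage(pdist(pts), method="single")   # single-link agglomerative clustering
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()   # C and E merge at 0.71; A, B and D merge at 1; all merge at 1.41
```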



Density-Based Methods

● To find clusters of arbitrary shape, we can model clusters as dense regions in the
data space, separated by sparse regions.
● This is the main strategy behind density-based clustering methods, which can
discover clusters of non-spherical shape.
● Density = number of points within a specified radius (Eps)



DBSCAN: Density-Based Clustering Based on
Connected Regions with High Density
● DBSCAN(Density-Based Spatial Clustering of
Applications with Noise) finds core objects, that is,
objects that have dense neighbourhoods
● It connects core objects and their neighbourhoods to
form dense regions as clusters



DBSCAN: Density-Based Clustering Based on
Connected Regions with High Density
● The density of an object o can be measured by the number of objects
close to o.
● A user-specified parameter Eps > 0 is used to specify the radius of the
neighborhood we consider for every object, i.e., the Eps-neighborhood
of an object o is the space within a radius Eps centered at o.
● To determine whether a neighborhood is dense or not, DBSCAN uses
another user-specified parameter, MinPts, which specifies the density
threshold of dense regions.
● An object is a core object if the Eps-neighborhood of the object
contains at least MinPts objects.



DBSCAN: Density-Based Clustering Based on
Connected Regions with High Density
● Directly density-reachable: For a core object q and an object p, we say p is
directly density-reachable from q w.r.t. Eps and MinPts if p is within the
Eps-neighborhood of q.

(Figure: point p lies within the Eps-neighborhood of the core object q; MinPts = 5, Eps = 1 cm.)
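A minimal sketch of these definitions in Python (our own code; eps and min_pts correspond to the Eps and MinPts parameters above):

```python
import numpy as np

def eps_neighborhood(points, i, eps):
    # Indices of all points within radius eps of point i (including i itself)
    d = np.linalg.norm(points - points[i], axis=1)
    return np.where(d <= eps)[0]

def is_core_object(points, i, eps, min_pts):
    # A point is a core object if its eps-neighborhood contains at least min_pts objects
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    # p is directly density-reachable from q if q is a core object
    # and p lies within the eps-neighborhood of q
    return is_core_object(points, q, eps, min_pts) and p in eps_neighborhood(points, q, eps)
```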

