M4 DM Clustering Part I
Clustering Algorithms
Priya R L
Faculty Incharge for CSC 504
Department of Computer Engineering
VES Institute of Technology, Mumbai
● Step 1: Assign each object to the nearest centroid, using Euclidean distance as the proximity measure.
● Step 2: Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}; their new centroids are recomputed as the means of these clusters.
● Step 3: Reassign the objects to the new centroids and repeat until cluster membership no longer changes.
[Figure/table: sample points plotted in 2-D; recoverable coordinates: B (2, 1), C (4, 3), D (5, 4); A's coordinates are not recoverable]
K-Means Algorithm: Example
● Step 1: Use initial seed points for partitioning; each object (A, B, C, D) is assigned to its nearest seed by Euclidean distance.
[Figure: points A, B, C, D partitioned around the initial seeds]
● Step 2: Compute new centroids of the current partition.
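A minimal NumPy sketch of the two-step loop just described (assign each object to the nearest centroid, then recompute centroids); the function name and the sample points are illustrative, not the slide's dataset:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: use k randomly chosen objects as the initial seed points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its current cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # membership has stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D objects; two groups are visually apparent.
X = np.array([[1.0, 2.0], [2.0, 2.0], [1.5, 1.8],
              [6.0, 5.0], [7.0, 6.0], [6.5, 5.5]])
print(kmeans(X, k=2))
```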
• Usually Euclidean distance (the L2 norm) is the best measure when the object points are defined in n-dimensional Euclidean space.
• Another measure, cosine similarity, is more appropriate when the objects are documents.
• Further, other types of proximity measures may be appropriate in the context of particular applications.
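A small sketch contrasting the two measures; the two "document" vectors are hypothetical term-frequency vectors, chosen so the effect is visible:

```python
import numpy as np

def euclidean(x, y):
    # L2 norm of the difference: natural for points in n-dimensional space.
    return np.linalg.norm(x - y)

def cosine_similarity(x, y):
    # Angle between the vectors: natural for documents, where orientation
    # (relative term frequencies) matters more than magnitude (length).
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

doc1 = np.array([3.0, 0.0, 2.0])  # hypothetical term frequencies
doc2 = np.array([6.0, 0.0, 4.0])  # same topic mix, document twice as long
print(euclidean(doc1, doc2))          # ~3.61: "far apart" under L2
print(cosine_similarity(doc1, doc2))  # 1.0: identical orientation
```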
● Weakness
○ Applicable only when the mean is defined; then what about categorical data?
○ k-Means finds a local optimum and may fail to reach the global optimum.
○ The k-Means algorithm does not really overcome the scalability issue (and is not so practical for large databases).
Fig 16.6: Some failure instances of the k-Means algorithm, e.g., non-convex shaped clusters.
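To reproduce such a failure, the sketch below (assuming scikit-learn is available; the two-moons data is an illustrative stand-in for the figure's non-convex clusters) runs k-Means on two interleaved crescents:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clusters that are not convex.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-Means separates the plane with a straight boundary between the two
# centroids, cutting each crescent in half, so agreement with the true
# clusters is poor (well below a perfect score of 1.0).
print(adjusted_rand_score(y_true, labels))
```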
What is the problem with the k-Means method?
● The k-means algorithm is sensitive to outliers!
○ An object with an extremely large value may substantially distort the distribution of the data.
● K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
[Figure: the K-Medoids swap procedure, K = 2]
1. Arbitrarily choose k objects as the initial medoids (Total Cost = 20).
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object, O_random.
4. Compute the total cost of swapping a medoid O with O_random (Total Cost = 26).
5. If the quality is improved, swap O and O_random; repeat until there is no change.
K-Medoids: Example
● The dissimilarity between an object Pi and its medoid Ci is calculated as E = |Pi - Ci|.
● Given a dataset D of n (n = 10) objects.
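Since the slide's ten objects are not shown, here is a sketch with a hypothetical one-dimensional dataset, illustrating how E = |Pi - Ci| accumulates into the total cost that K-Medoids tries to minimize:

```python
# Hypothetical 1-D dataset of n = 10 objects (not the slide's actual data).
D = [2, 3, 4, 6, 7, 8, 9, 12, 15, 25]

def total_cost(data, medoids):
    # Sum of E = |Pi - Ci| over all objects, where Ci is Pi's nearest medoid.
    return sum(min(abs(p - c) for c in medoids) for p in data)

print(total_cost(D, medoids=[4, 9]))   # cost of one candidate medoid pair
print(total_cost(D, medoids=[4, 12]))  # a cheaper pair; K-Medoids keeps it
```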
Contrast: k-Means vs. k-Medoids
● k-Means: algorithms using means are not computationally demanding.
● k-Medoids: algorithms using medoids are more computationally demanding.
● Agglomerative
○ Initially each item in its own cluster
○ Iteratively clusters are merged together
○ Bottom Up
● Divisive
○ Initially all items in one cluster
○ Large clusters are successively divided
○ Top Down
d(*,*)  A    B    C    D    E
A       0    0.1  0.8  0.7  1.0
B       0.1  0    0.5  0.6  0.9
C       0.8  0.5  0    0.3  0.4
D       0.7  0.6  0.3  0    0.2
E       1.0  0.9  0.4  0.2  0
[Figure: dendrogram of the Divisive Steps (DIANA); D and E merge at height 0.2, C joins at 0.3]
Hierarchical Algorithms
● Single Link
● Complete Link
● Average Link
Inter-Cluster Distance
• Common measures:
• Minimum distance (single link).
• Maximum distance (complete link).
• Average distance.
• Mean distance.
Single link
• In single-link hierarchical clustering, we merge at each step the two clusters whose two closest members have the smallest distance.
Complete link
• In complete-link hierarchical clustering, we merge at each step the two clusters whose two farthest members have the smallest distance.
• Consider the following distance matrix. Apply Single Link, Complete Link and Average Link.

Items  A  B  C  D  E
A      0  1  2  2  3
B      1  0  2  4  3
C      2  2  0  1  5
D      2  4  1  0  3
E      3  3  5  3  0
[Figure: the intermediate reduced distance matrices and merge steps for the single-link, complete-link and average-link solutions]
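The exercise can be checked mechanically with SciPy (assuming it is available): `linkage` accepts the condensed form of the matrix above and supports all three criteria:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The exercise's distance matrix over items A..E.
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
], dtype=float)

condensed = squareform(D)  # linkage() expects the condensed (vector) form
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    print(method)
    print(np.round(Z, 2))  # row: (cluster i, cluster j, merge distance, new size)
```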
Use the single-link technique to find clusters in the given data.

Object  X    Y
A       2    2
B       3    2
C       1    1
D       3    1
E       1.5  0.5

Euclidean distance matrix:

   A     B     C     D     E
A  0
B  1     0
C  1.41  2.24  0
D  1.41  1     2     0
E  1.58  2.12  0.71  1.58  0
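The matrix can be reproduced from the coordinates; a short SciPy check (assuming SciPy is available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # A, B, C, D, E

# Pairwise Euclidean distances, matching the matrix above
# (e.g. d(C, E) = sqrt(0.5^2 + 0.5^2) ≈ 0.71).
D = squareform(pdist(pts, metric="euclidean"))
print(np.round(D, 2))
```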
Agglomerative Clustering: Example
● SINGLE LINK
● Step 1: Since d(C, E) = 0.71 is the minimum, we combine C and E into one cluster.

     A     B     C,E   D
A    0
B    1     0
C,E  1.41  2.12  0
D    1.41  1     1.58  0
● Step 2: Now d(A, B) = 1 is the minimum, so we combine A and B.

     A,B   C,E   D
A,B  0
C,E  1.41  0
D    1     1.58  0
● Step 3: Now d({A,B}, D) = 1 is the minimum, so D joins the cluster {A, B}.

       A,B,D  C,E
A,B,D  0
C,E    1.41   0
● Step 4: We are left with two clusters, {A, B, D} and {C, E}, which are combined at distance 1.41.
● Construct a dendrogram for the same: C and E merge at height 0.71, A and B at 1, D joins {A, B} at 1, and the final two clusters join at 1.41.
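The merge heights found above (0.71, 1, 1, 1.41) can be confirmed with SciPy's single-link agglomeration, and the same linkage matrix can be fed to `dendrogram` for plotting (assuming SciPy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

pts = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # A..E

# Single-link agglomeration; rows of Z are (cluster i, cluster j, height, size).
Z = linkage(pdist(pts), method="single")
print(np.round(Z, 2))  # heights: 0.71 (C,E), 1.0 (A,B), 1.0 (+D), 1.41 (final)

dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()
```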