Partitioning Methods & Hierarchical Methods
k-means and k-medoids
■ E is the sum of the squared error for all objects in the data set;
■ p is the point in space representing a given object; and
■ ci is the centroid of cluster Ci (both p and ci are multidimensional).
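■ For reference, the objective these symbols define is the within-cluster sum of squared errors; in LaTeX, using squared Euclidean distance as is standard for k-means:

    E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^{2}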
The K-Means Centroid-Based Technique
■ Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input: k, the number of clusters; D, a data set containing n objects.
Output: A set of k clusters.
Method:
■ (1) Arbitrarily partition the objects into k nonempty subsets
■ (2) Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
■ (3) Assign each object to the cluster with the nearest seed point
■ (4) Go back to step (2); stop when the assignment no longer changes
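■ A minimal NumPy sketch of steps (1)–(4) is shown below; it is illustrative only, and the random seeding and the lack of empty-cluster handling are simplifications of my own, not part of the slide.

    import numpy as np

    def kmeans(D, k, max_iter=100, seed=0):
        # D: (n, d) array of objects; k: number of clusters.
        rng = np.random.default_rng(seed)
        # Step (1): arbitrarily pick k objects as the initial seed points
        centroids = D[rng.choice(len(D), size=k, replace=False)]
        for _ in range(max_iter):
            # Step (3): assign each object to the cluster with the nearest seed point
            dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step (2): recompute seed points as the mean point of each cluster
            # (note: empty clusters are not handled in this sketch)
            new_centroids = np.array([D[labels == j].mean(axis=0) for j in range(k)])
            # Step (4): stop when the assignment (and hence the centroids) no longer changes
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids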
An Example of K-Means Clustering
[Figure: with K = 2, the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated and objects are reassigned, looping as needed until the assignments stabilize.]
■ Partition objects into k nonempty subsets
■ Repeat
■ Compute centroid (i.e., mean point) for each partition
■ Assign each object to the cluster of its nearest centroid
■ Until no change
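■ As a usage sketch, an off-the-shelf implementation such as scikit-learn's KMeans runs the same loop; the array X below is invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
                  [9.0, 8.5], [1.0, 0.5], [8.5, 9.0]])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster index of each object
    print(km.cluster_centers_)  # the two centroids
    print(km.inertia_)          # E: within-cluster sum of squared errors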
Comments on the K-Means Method
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2 on points plotted over 0–10 axes, illustrating the steps below.]
■ Arbitrarily choose k objects as the initial medoids (in the example, total cost = 20)
■ Assign each remaining object to the nearest medoid
■ Randomly select a nonmedoid object, O_random
■ Compute the total cost of swapping a current medoid O with O_random (in the example, total cost = 26)
■ Swap O and O_random if the clustering quality is improved
■ Repeat the loop until no change
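■ A compact NumPy sketch of this swap-based loop is given below; it follows the usual PAM formulation, and the function and variable names are my own rather than anything prescribed by the slide.

    import numpy as np

    def pam(D, k, max_iter=100, seed=0):
        # D: (n, d) array of objects; returns the medoid indices and object labels.
        rng = np.random.default_rng(seed)
        dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)  # pairwise distances
        medoids = list(rng.choice(len(D), size=k, replace=False))     # arbitrary initial medoids

        def total_cost(meds):
            # total cost = sum of distances from each object to its nearest medoid
            return dist[:, meds].min(axis=1).sum()

        best = total_cost(medoids)
        for _ in range(max_iter):
            improved = False
            for i in range(k):                      # current medoid O = medoids[i]
                for o in range(len(D)):             # candidate nonmedoid object O_random
                    if o in medoids:
                        continue
                    candidate = medoids[:i] + [o] + medoids[i + 1:]
                    cost = total_cost(candidate)    # total cost of swapping O and O_random
                    if cost < best:                 # swap only if the quality improves
                        medoids, best, improved = candidate, cost, True
            if not improved:                        # until no change
                break
        return medoids, dist[:, medoids].argmin(axis=1)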
Chapter 10. Cluster Analysis: Basic Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
Hierarchical Clustering
■ A hierarchical clustering method works by grouping data
objects into a hierarchy or “tree” of clusters.
■ Representing data objects in the form of a hierarchy is
useful for data summarization and visualization.
■ For example, as the manager of human resources at a company, you may organize the employees into major groups such as executives, managers, and staff.
■ These groups can be further partitioned into smaller subgroups. For instance, the general group of staff can be further divided into subgroups of senior officers, officers, and trainees. All these groups form a hierarchy.
■ Data organized into such a hierarchy can then be summarized or characterized, for example to find the average salary of managers and of officers.
Hierarchical Clustering
■ Agglomerative methods start with individual objects as clusters,
which are iteratively merged to form larger clusters.
■ Conversely, divisive methods initially let all the given objects form
one cluster, which they iteratively split into smaller clusters.
■ Hierarchical clustering methods can encounter difficulties regarding the selection of merge or split points. Such a decision is critical, because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.
■ The method will neither undo what was done previously nor perform object swapping between clusters. Merge or split decisions, if not well chosen, may therefore lead to low-quality clusters. One way to improve the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques, resulting in multiple-phase (or multiphase) clustering.
■ Two such methods are BIRCH and Chameleon. BIRCH begins by partitioning objects hierarchically using tree structures, where the leaf or low-level nonleaf nodes can be viewed as “microclusters” depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters.
■ Chameleon explores dynamic modeling in hierarchical clustering.
Hierarchical Clustering
■There are several orthogonal ways to categorize hierarchical
clustering methods.
■Agglomerative, divisive, and multiphase methods are
algorithmic, meaning they consider data objects as
deterministic and compute clusters according to the
deterministic distances between objects.
■Probabilistic methods use probabilistic models to capture
clusters and measure the quality of clusters by the fitness of
models. Bayesian methods compute a distribution of possible
clusterings.
■ That is, instead of outputting a single deterministic clustering over a data set, they return a group of clustering structures and their probabilities, conditional on the given data.
Agglomerative versus Divisive
Hierarchical Clustering
■An agglomerative hierarchical clustering method uses a bottom-up
strategy.
■It starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters, until all the objects are in a single cluster or certain termination conditions are satisfied.
■The single cluster becomes the hierarchy’s root.
■For the merging step, it finds the two clusters that are closest to each other and combines them to form one cluster; since each merge reduces the number of clusters by one, an agglomerative method requires at most n iterations.
■A divisive hierarchical clustering method employs a top-down strategy.
It starts by placing all objects in one cluster, which is the hierarchy’s
root. It then divides the root cluster into several smaller subclusters,
and recursively partitions those clusters into smaller ones.
■ In either agglomerative or divisive hierarchical clustering, a user can
specify the desired number of clusters as a termination condition.
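■ For instance, a bottom-up run that terminates once the user-specified number of clusters is reached can be sketched with scikit-learn's AgglomerativeClustering; the toy array X is invented for illustration.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X = np.array([[1, 1], [2, 1], [8, 8], [9, 9], [1, 2], [8, 9]], dtype=float)
    # Bottom-up (agglomerative): every object starts in its own cluster and the two
    # closest clusters are merged at each step; stop when n_clusters remain.
    agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
    print(agg.labels_)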
Hierarchical Clustering
■ The diagram shows the application of AGNES (AGglomerative
NESting), an agglomerative hierarchical clustering method, and
DIANA (DIvisive ANAlysis), a divisive hierarchical clustering
method, on a data set of five objects, {a, b, c, d, e}.
■ Initially, AGNES, the agglomerative method, places each object into
a cluster of its own.
[Figure: AGNES proceeds agglomeratively from Step 0 to Step 4, merging a and b into {a, b}, d and e into {d, e}, then forming {c, d, e}, and finally the single cluster {a, b, c, d, e}; DIANA proceeds divisively through the same tree in the reverse direction, from Step 4 back to Step 0.]
Hierarchical Clustering
■ The clusters are then merged step-by-step according to some
criterion. For example, clusters C1 and C2 may be merged if an
object in C1 and an object in C2 form the minimum Euclidean
distance between any two objects from different clusters.
■ This is a single-linkage approach, in that each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters (see the sketch after this list).
■ DIANA, the divisive method, proceeds in the contrasting way.
■ All the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
■ The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.
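■ A small sketch of the single-linkage (closest-pair) criterion described above; the two cluster contents are made-up examples.

    import numpy as np

    def single_link(C1, C2):
        # Minimum Euclidean distance between any object in C1 and any object in C2.
        d = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
        return d.min()

    C1 = np.array([[1.0, 1.0], [2.0, 1.5]])
    C2 = np.array([[4.0, 4.0], [5.0, 5.0]])
    # Merge C1 and C2 if this is the smallest such distance over all cluster pairs.
    print(single_link(C1, C2))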
Dendrogram: Shows How Clusters are Merged
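■ Cutting the dendrogram at a desired level yields one flat clustering. A minimal sketch using SciPy's hierarchy module follows; the sample points and labels are invented.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    import matplotlib.pyplot as plt

    X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.5], [9, 9]], dtype=float)
    Z = linkage(X, method="single")       # records which clusters merge, and at what distance
    dendrogram(Z, labels=["a", "b", "c", "d", "e"])
    plt.show()
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(labels)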
AGNES (Agglomerative Nesting)
■ Introduced in Kaufmann and Rousseeuw (1990)
■ Implemented in statistical packages, e.g., Splus
■ Use the single-link method and the dissimilarity
matrix
■ Merge nodes that have the least dissimilarity
■ Go on in a non-descending fashion
■ Eventually all nodes belong to the same cluster
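■ A sketch of those steps with SciPy, using a precomputed dissimilarity matrix and the single-link method; the matrix values below are made up.

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    # Symmetric dissimilarity matrix for 4 objects (invented values).
    D = np.array([[ 0.0, 2.0, 6.0, 10.0],
                  [ 2.0, 0.0, 5.0,  9.0],
                  [ 6.0, 5.0, 0.0,  4.0],
                  [10.0, 9.0, 4.0,  0.0]])
    Z = linkage(squareform(D), method="single")  # merge least-dissimilar nodes first
    print(Z)  # each row: the two clusters merged, their dissimilarity, new cluster size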
DIANA (Divisive Analysis)
Distance between Clusters
■ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
■ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
■ Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
■ Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
■ Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj), where Mi and Mj are the medoids of Ki and Kj
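■ A hedged sketch of these measures for two small, made-up clusters Ki and Kj; the medoid here is taken as the object with the smallest total distance to the rest of its own cluster.

    import numpy as np

    Ki = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 2.0]])
    Kj = np.array([[6.0, 6.0], [7.0, 8.0]])

    pairwise = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

    single   = pairwise.min()    # smallest element-to-element distance
    complete = pairwise.max()    # largest element-to-element distance
    average  = pairwise.mean()   # average element-to-element distance
    centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))

    def medoid(K):
        # object with the smallest total distance to the other objects in its cluster
        d = np.linalg.norm(K[:, None, :] - K[None, :, :], axis=2).sum(axis=1)
        return K[d.argmin()]

    medoid_dist = np.linalg.norm(medoid(Ki) - medoid(Kj))
    print(single, complete, average, centroid, medoid_dist)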
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
■ Centroid: the “middle” of a cluster
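■ The radius and diameter that usually accompany the centroid are commonly defined as the root-mean-square distance of the members to the centroid and the root-mean-square pairwise distance within the cluster, respectively; the sketch below uses those standard definitions on an invented cluster X.

    import numpy as np

    X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])  # one cluster
    N = len(X)

    centroid = X.mean(axis=0)                          # the "middle" of the cluster
    radius = np.sqrt(((X - centroid) ** 2).sum() / N)  # average spread around the centroid
    pair_sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    diameter = np.sqrt(pair_sq.sum() / (N * (N - 1)))  # average pairwise spread
    print(centroid, radius, diameter)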