
INSTITUTE OF INFORMATION TECHNOLOGY & MANAGEMENT

Accredited ‘A’ Grade by NAAC & Recognised U/s 2(f) of UGC Act

Rated Category ‘A+’ by SFRC & ‘A’ by JAC, Govt. of NCT of Delhi
Approved by AICTE & Affiliated to GGS Indraprastha University, New Delhi

Machine Learning with Python

Programme : BCA
Semester : V
Subject Code : BCAT311
Subject : Machine Learning with Python
Topic : Clustering
Faculty : Ms. Shilpi Bansal
© Institute of Information Technology and Management, D-29, Institutional Area, Janakpuri, New Delhi-110058
List of Topics

• Introduction to clustering
• K-means clustering
• Hierarchical clustering

Examples of Clustering Applications

• Marketing: discover distinct groups in a customer base and use this knowledge to develop targeted marketing programmes
• Land use: identification of areas of similar land use in an earth-observation database
• Insurance: identifying groups of motor-insurance policy holders with a high average claim cost
• Urban planning: identifying groups of houses according to their type, value, and geographical location
• Seismology: observed earthquake epicenters should be clustered along continental faults
What Is a Good Clustering?

• A good clustering method produces clusters with
  • high intra-class (within-cluster) similarity
  • low inter-class (between-cluster) similarity
• A precise definition of clustering quality is difficult
  • application-dependent
  • ultimately subjective
Requirements for Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal domain knowledge required to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Robustness with respect to high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Similarity and Dissimilarity Between Objects

• Same measures as we used for instance-based learning (IBL), e.g., the Lp norm
• Euclidean distance (p = 2):

  d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

• Properties of a metric d(i, j):
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)
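As a minimal illustration of the Lp-norm distances above, here is a small NumPy sketch (the function name and example vectors are only illustrative, not part of the slides):

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Lp (Minkowski) distance between two feature vectors.

    p = 2 gives the Euclidean distance shown on this slide;
    p = 1 gives the Manhattan (city-block) distance.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# Example: two 3-dimensional objects i and j
i = [1.0, 2.0, 3.0]
j = [4.0, 6.0, 3.0]
print(minkowski_distance(i, j))        # Euclidean distance: 5.0
print(minkowski_distance(i, j, p=1))   # Manhattan distance: 7.0
```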
Major Clustering Approaches

• Partitioning: construct various partitions and then evaluate them by some criterion
• Hierarchical: create a hierarchical decomposition of the set of objects using some criterion
• Model-based: hypothesize a model for each cluster and find the best fit of the models to the data
• Density-based: guided by connectivity and density functions
Partitioning Algorithms

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters
• Given k, find the partition into k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods: the k-means and k-medoids algorithms (see the sketch below)
• k-means (MacQueen, 1967): each cluster is represented by the center (mean) of the cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
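Both heuristics are available as library implementations; the sketch below contrasts them, assuming scikit-learn for k-means and the separate scikit-learn-extra add-on package for the PAM-style KMedoids estimator (the toy data are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids  # provided by the scikit-learn-extra package

# Toy 2-D dataset with two obvious groups (illustrative only)
X = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9], [8.5, 9.5]])

# k-means: each cluster is represented by the mean of its members
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-medoids (PAM): each cluster is represented by one of its own objects
kmed_labels = KMedoids(n_clusters=2, random_state=0).fit_predict(X)

print(km_labels)
print(kmed_labels)
```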
K-Means Clustering

• Given k, the k-means algorithm consists of four steps:
  1. Select k initial centroids at random.
  2. Assign each object to the cluster with the nearest centroid.
  3. Recompute each centroid as the mean of the objects assigned to it.
  4. Repeat steps 2 and 3 until the assignments no longer change.
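A minimal NumPy sketch of these four steps (the function name, variable names, and toy data are illustrative; in practice scikit-learn's sklearn.cluster.KMeans implements the same algorithm with smarter k-means++ initialisation):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain NumPy k-means following the four steps on this slide."""
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 4: stop once the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage on a small 2-D dataset (illustrative only)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```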
K-Means Clustering (contd.)

• Example: [figure omitted — four 2-D scatter plots (axes 0–10) illustrating successive k-means iterations]
Comments on the K-Means Method

• Strengths
  • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
  • Often terminates at a local optimum; the global optimum may be found using techniques such as simulated annealing or genetic algorithms
• Weaknesses
  • Applicable only when a mean is defined (what about categorical data?)
  • Need to specify k, the number of clusters, in advance
  • Trouble with noisy data and outliers
  • Not suitable for discovering clusters with non-convex shapes
Hierarchical Clustering

• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition

[Figure: agglomerative clustering (AGNES) proceeds left to right over steps 0–4, merging objects a–e: {a, b} → ab, {d, e} → de, {c, de} → cde, {ab, cde} → abcde; divisive clustering (DIANA) proceeds in the reverse order, from step 4 back to step 0]
AGNES (Agglomerative Nesting)

• Produces a tree of clusters (nodes)
  • Initially, each object is its own cluster (a leaf)
  • Recursively merges the pair of nodes with the least dissimilarity
  • Merge criteria: minimum distance, maximum distance, average distance, center distance
  • Eventually all objects belong to the same cluster (the root)
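As an illustration, SciPy's hierarchical-clustering routines perform this agglomerative merging; the sketch below (with illustrative toy data) maps the merge criteria above onto SciPy's linkage methods:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D objects (coordinates are illustrative only)
X = np.array([[1, 1], [1.5, 1.5], [5, 5], [5.5, 5], [6, 6]])

# AGNES-style agglomerative merging; the linkage method corresponds to the
# merge criterion on this slide:
#   'single'   -> minimum distance      'complete' -> maximum distance
#   'average'  -> average distance      'centroid' -> center distance
Z = linkage(X, method='average')
print(Z)  # each row: (cluster i, cluster j, merge distance, size of merged cluster)
```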

[Figure: three 2-D scatter plots (axes 0–10) illustrating AGNES merging nearby objects into progressively larger clusters]
A Dendrogram Shows How the Clusters Are Merged Hierarchically

• Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
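Continuing the SciPy sketch above (the cut height of 2.0 is an arbitrary, illustrative choice), the dendrogram can be drawn and cut as follows:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster

# Z is the linkage matrix computed in the AGNES sketch above
dendrogram(Z)      # draw the tree of merges
plt.show()

# "Cut" the dendrogram: keep only merges whose distance is below 2.0,
# so each remaining connected component becomes one cluster
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)      # flat cluster label for each object
```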

DIANA (Divisive Analysis)

• Works in the inverse order of AGNES
  • Start with a root cluster containing all objects
  • Recursively divide clusters into subclusters
  • Eventually each cluster contains a single object
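Neither SciPy nor scikit-learn ships DIANA itself; as a rough illustration of the top-down idea only, the sketch below recursively bisects the data with 2-means (the real DIANA algorithm instead splits off the object with the largest average dissimilarity):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, depth=0, max_depth=2, min_size=2):
    """Rough divisive (top-down) sketch: recursively bisect with 2-means.

    Illustrative only; real DIANA splits on the object with the largest
    average dissimilarity rather than using k-means.
    """
    if depth == max_depth or len(X) < min_size:
        return [X]
    halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    left, right = X[halves == 0], X[halves == 1]
    return (divisive_split(left, depth + 1, max_depth, min_size)
            + divisive_split(right, depth + 1, max_depth, min_size))

# Toy data (illustrative only): start from one root cluster and keep splitting
X = np.array([[1, 1], [1.5, 1.5], [5, 5], [5.5, 5], [9, 9], [9.5, 9.5]])
for cluster in divisive_split(X):
    print(cluster)
```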

[Figure: three 2-D scatter plots (axes 0–10) illustrating DIANA splitting the data into progressively smaller clusters]