Unit 4


INTRODUCTION:

Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together.
The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each
group are more similar to each other than to data points in other groups. This process is often used for
exploratory data analysis and can help identify patterns or relationships within the data that may not be
immediately obvious. There are many different algorithms used for cluster analysis, such as k-means,
hierarchical clustering, and density-based clustering. The choice of algorithm will depend on the specific
requirements of the analysis and the nature of the data being analyzed.
Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine-learning technique that acts on unlabelled data: the given data is divided into groups by combining similar objects, and such a group of similar data points is called a cluster.
For example, consider a dataset that contains information about different vehicles such as cars, buses, and bicycles. As this is unsupervised learning, there are no class labels like Car or Bike for the vehicles; all the data is mixed together and not in a structured manner.
The task is to convert this unlabelled data into labelled data, and it can be done using clusters.
The main idea of cluster analysis is that it arranges all the data points into clusters, such as a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on.
Simply put, it is the partitioning of similar objects, applied to unlabelled data.

Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering algorithms must deal with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable; if it is not, it cannot process the full data and may produce wrong results.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data, not just small, low-dimensional datasets.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with clustering algorithms. An algorithm should be capable of dealing with different types of data, such as discrete, categorical, interval-based (numeric), and binary data.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters, so it should be able to handle unstructured data and give it some structure by organising it into groups of similar data objects. This makes it easier for the data expert to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It is used to partition the data in order to form clusters. If "n" partitions are made on "p" objects of the database, then each partition is represented by a cluster, with n ≤ p. Two conditions must be satisfied by this partitioning clustering method:
• Each object must belong to exactly one group.
• Each group must contain at least one object.
In the partitioning method there is a technique called iterative relocation, in which an object is moved from one group to another if doing so improves the partitioning.
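The idea of iterative relocation can be sketched in a few lines of Python. This is an illustrative sketch, not an algorithm from the text: it assumes 1-D numeric data, uses the total squared error to each group's mean as the quality criterion, and the starting partition and data values are made up for the example.

```python
def sq_error(groups):
    """Total squared error: sum of squared distances of each point
    to the mean of its group (lower means a better partitioning)."""
    total = 0.0
    for g in groups:
        m = sum(g) / len(g)
        total += sum((x - m) ** 2 for x in g)
    return total

# Arbitrary starting 2-way partition of six 1-D points.
groups = [[2, 3, 25], [4, 24, 26]]

moved = True
while moved:
    moved = False
    for i in range(len(groups)):
        for x in list(groups[i]):
            if len(groups[i]) == 1:
                break                      # never empty a group
            for j in range(len(groups)):
                if j == i:
                    continue
                before = sq_error(groups)
                groups[i].remove(x)        # tentatively relocate x ...
                groups[j].append(x)
                if sq_error(groups) < before:
                    moved = True           # ... keep the improving move
                    break
                groups[j].remove(x)        # ... otherwise undo it
                groups[i].append(x)

print([sorted(g) for g in groups])         # → [[2, 3, 4], [24, 25, 26]]
```

Starting from a deliberately bad split, relocation moves 25 and 4 across groups until the two natural clusters emerge and no further move lowers the error.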
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed. There are two approaches for creating the decomposition:
• Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each object forms its own separate group. The method then keeps merging the objects or groups that are close to one another, i.e. that exhibit similar properties. This merging process continues until the termination condition holds.
• Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we start with all the data objects in the same cluster. The cluster is then divided into smaller clusters by continuous iteration, which continues until the termination condition is met or each cluster contains a single object.
Once a group is split or merged, the step can never be undone; this rigidity makes hierarchical methods inflexible. Two approaches can be used to improve the quality of hierarchical clustering in data mining:
• Carefully analyse the linkages between objects at every split or merge of the hierarchy.
• Integrate hierarchical agglomeration with another clustering method: first group the data objects into micro-clusters, then perform macro-clustering on the micro-clusters.
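The agglomerative (bottom-up) approach described above can be sketched in Python. This is a minimal illustration, assuming 1-D numeric data and single-linkage distance (the smallest gap between any pair of members of two clusters); the data values and function names are hypothetical.

```python
def single_linkage_dist(a, b):
    """Smallest distance between any member of cluster a and any of b."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, k):
    clusters = [[p] for p in points]          # bottom-up: singletons first
    while len(clusters) > k:                  # termination condition: k clusters
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge; a merge is never undone
    return [sorted(c) for c in clusters]

print(agglomerate([16, 17, 20, 41, 43, 62], 2))   # → [[16, 17, 20], [41, 43, 62]]
```

Note how the rigidity of the method shows up in the code: `clusters.pop(j)` commits each merge permanently, which is exactly why careful linkage analysis at every step matters.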
Partitioning Method (K-Mean) in Data Mining
Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters to be generated. Given a database D containing N objects, the partitioning method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). In this article we will see the working of the K-Means algorithm in detail.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among the data objects inside a cluster (intra-cluster similarity) is high, while the similarity to data objects outside the cluster (inter-cluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of the cluster; K-Means is a squared-error type of algorithm. At the start, K objects are chosen randomly from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean, and the new mean of each cluster is then calculated with the added data objects.
Algorithm: K-Means
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters
Method:
1. Randomly select K objects from the dataset (D) as initial cluster centres (C).
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated memberships.
4. Repeat Steps 2 and 3 until no change occurs.
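The four steps above can be sketched directly in Python. This is a minimal illustration for 1-D data: the random choice in Step 1 is replaced by passing the initial centres in explicitly so the run is reproducible, and it is applied here to the website-visitor ages from the worked example in this section, with the same two starting centres (16 and 22).

```python
def k_means(data, centres):
    """Steps 2-4 of the method above, for 1-D data; `centres` plays the
    role of the K randomly chosen initial objects in Step 1."""
    while True:
        # Step 2: (re)assign each object to the nearest cluster mean.
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Step 3: recalculate the mean of each cluster.
        new_centres = [sum(c) / len(c) for c in clusters]
        # Step 4: repeat until no change occurs.
        if new_centres == centres:
            return centres, clusters
        centres = new_centres

ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36,
        41, 42, 43, 44, 45, 61, 62, 66]
centres, clusters = k_means(ages, [16, 22])
print([round(c, 2) for c in centres])   # → [20.5, 48.89]
```

The run converges in four iterations, matching the hand-computed trace: the final means are 20.5 for the younger group (16-29) and 48.89 for the older group (36-66).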

Figure: K-means clustering flowchart.

Example: Suppose we want to group the visitors to a website using just their age, as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: these two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61,
62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iterations 3 and 4, so we stop. Therefore the K-Means algorithm gives us the two clusters (16-29) and (36-66).
