
Unit -3

Topic: Unsupervised methods

1.Introduction:
■ Unsupervised methods - discovering hidden relationships in data.
■ No specific outcome or prediction is involved.
■ Focus is on finding patterns or groupings within the data.
■ Examples include:
  ○ Grouping customers with similar purchase behaviors.
  ○ Identifying correlations between population movement and socioeconomic factors.
■ Used to explore and understand data structure rather than making predictions.

Two classes of unsupervised methods:

★ Cluster analysis - finds groups with similar characteristics.


★ Association rule mining - finds elements or properties in the data that tend to occur
together.

Cluster Analysis:

Cluster analysis groups observations into clusters where each datum is more similar to others
in the same cluster than to those in different clusters.

Example: A tour company could cluster clients based on:

● Preferred destinations (countries they like to visit).


● Tour preferences (adventure, luxury, or educational tours).
● Types of activities clients engage in.

This clustering would help the company design appealing travel packages and better target specific client segments.

Two approaches to clustering:

1. K-means clustering: A fast and popular method for identifying clusters in quantitative
data.
2. Hierarchical clustering: Finds nested groups of clusters, similar to plant taxonomy
(family, then genus, then species).

1.1 Distances

Different notions of distance:


● Euclidean distance
● Hamming distance
● Manhattan (city block) distance
● Cosine similarity

EUCLIDEAN DISTANCE :

● Euclidean distance is a good choice for clustering when measurements are numerical
and continuous.
● K-means clustering is based on optimizing squared Euclidean distance.
● For categorical data, especially binary, other distance metrics should be used.
● Formula for Euclidean distance between two vectors x = (x1, ..., xn) and y = (y1, ..., yn):

  d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)

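A minimal sketch in R (hypothetical vectors, base functions only):

# Illustrative sketch: Euclidean distance between two numeric vectors
x <- c(1, 2, 3)
y <- c(4, 6, 3)

sqrt(sum((x - y)^2))                      # computed directly from the formula
dist(rbind(x, y), method = "euclidean")   # same result via the built-in dist()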
HAMMING DISTANCE

When all variables are categorical, use Hamming distance, which counts mismatches.

For categorical variables (e.g., recipe ingredients, gender, size), you can define the distance as:

● 0 if two points are in the same category.


● 1 if they are in different categories.

Hamming Distance (for unordered categorical variables):

Let’s say we have:

● Recipe 1: chicken, spicy, medium


● Recipe 2: beef, mild, medium

Now, compare the categories:



● Ingredient: chicken ≠ beef → 1


● Spice level: spicy ≠ mild → 1
● Serving size: medium = medium → 0

The Hamming distance is: 1 + 1 + 0 = 2.

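A minimal sketch in R, encoding the two recipes above as character vectors (illustrative only):

# Illustrative sketch: Hamming distance between two categorical records
recipe1 <- c(ingredient = "chicken", spice = "spicy", size = "medium")
recipe2 <- c(ingredient = "beef",    spice = "mild",  size = "medium")

sum(recipe1 != recipe2)   # counts the mismatching positions: 2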
MANHATTAN (CITY BLOCK) DISTANCE

Manhattan distance measures distance in the number of horizontal and vertical units it takes to
get from one (real-valued) point to the other (no diagonal moves). This is also known as L1
distance.

COSINE SIMILARITY

Cosine similarity is a common similarity metric in text analysis. It measures the cosine of the (smallest) angle between two vectors; vectors pointing in similar directions have a cosine similarity near 1:

  cos(θ) = (x · y) / (‖x‖ ‖y‖)
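A minimal sketch in R (hypothetical vectors, base functions only):

# Illustrative sketch: cosine similarity between two numeric vectors
x <- c(1, 2, 3)
y <- c(2, 4, 6)

sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))   # 1, since y points in the same direction as x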

K - Means Clustering:

K-Means is a partition-based clustering algorithm used to divide a dataset into K distinct, non-overlapping clusters based on feature similarity. It is an iterative algorithm that minimizes the within-cluster sum of squares (WCSS), a measure of how close the data points in a cluster are to the cluster centroid.

Steps of the K-Means Algorithm:

1. Initialize:
○ Choose the number of clusters K.
○ Randomly initialize K cluster centroids from the data points.
2. Assign:
○ For each data point, assign it to the cluster whose centroid is closest to it.
3. Update:
○ Recalculate the centroid of each cluster by taking the mean of all the
points assigned to that cluster.
4. Repeat:
○ Repeat the Assign and Update steps until the centroids no longer change
significantly or the algorithm reaches the maximum number of iterations.
5. Terminate:
○ The algorithm terminates when the centroids stabilize (i.e., do not change
between iterations) or the maximum number of iterations is reached.

Numerical Example with Manhattan distance:



K-Means Clustering in R
Step 1: Load the Necessary Packages

library(factoextra)
library(cluster)
Step 2: Load and Prep the Data
● Load the dataset
● Remove any rows with missing values
● Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1

#load data
df <- USArrests

#remove rows with missing values
df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1
df <- scale(df)

#view first six rows of dataset
head(df)

Step 3: Find the Optimal Number of Clusters

To perform k-means clustering in R we can use the built-in kmeans() function, which uses the following syntax:

kmeans(data, centers, nstart, iter.max)

where:
● data: Name of the dataset.
● centers: The number of clusters, denoted k.
● nstart: The number of times to run the algorithm with different random initial cluster centers; kmeans() keeps the best solution (lowest total within-cluster sum of squares).
● iter.max: The maximum number of iterations allowed for each configuration to converge.
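For example, a call using all four arguments might look like this (the values here are placeholders):

km <- kmeans(df, centers = 3, nstart = 25, iter.max = 100)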

Method 1: Number of Clusters vs. the Total Within Sum of Squares (Elbow Method)

The Within-Cluster Sum of Squares (WSS) is a metric used in k-means clustering to measure the compactness of the clusters. It calculates the sum of the squared distances between each data point and the centroid of the cluster it belongs to. Lower WSS values indicate that the data points within a cluster are close to each other, implying well-defined clusters.
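As a rough sketch of what the elbow plot is based on (illustrative; the fviz_nbclust() call shown later produces this plot automatically):

# Total within-cluster sum of squares for k = 1..10, computed by hand
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")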

Method 2: Average Silhouette Method

The average silhouette is a method used to evaluate the quality of clustering results. It helps determine how well data points are clustered and provides a way to assess the optimal number of clusters (commonly used with k-means or hierarchical clustering).

fviz_nbclust(df, kmeans, method = "wss")   # elbow method plot (total WSS vs. number of clusters)
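For the average silhouette method, the analogous factoextra call (an assumption based on the same interface) is:

fviz_nbclust(df, kmeans, method = "silhouette")   # average silhouette width across values of k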

Step 4: Perform K-Means Clustering with Optimal K

#make this example reproducible
set.seed(1)

#perform k-means clustering with k = 4 clusters
km <- kmeans(df, centers = 4, nstart = 25)

# View the size of each cluster
km$size

We can visualize the clusters on a scatterplot that displays the first two principal components on the axes using the fviz_cluster() function:

fviz_cluster(km, data = df)



Hierarchical Clustering:

Hierarchical clustering is a method of clustering that builds a hierarchy of clusters, either by starting with individual points and merging them (agglomerative) or by starting with all points in one cluster and splitting them (divisive).

Example points: A: (2, 3), B: (3, 3), C: (6, 6), D: (8, 8), E: (8, 9)


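A minimal sketch in R, clustering the five example points above with the built-in hclust() (complete linkage is the default):

# Illustrative sketch: agglomerative hierarchical clustering of the five points
pts <- matrix(c(2, 3,
                3, 3,
                6, 6,
                8, 8,
                8, 9),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D", "E"), c("x", "y")))

d  <- dist(pts)   # pairwise Euclidean distances
hc <- hclust(d)   # agglomerative clustering, default method = "complete"
plot(hc)          # dendrogram of the nested merges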

Methods to Merge Clusters:

Single Linkage:
For two clusters R and S, the single linkage returns the minimum distance between
two points i and j such that i belongs to R and j belongs to S.

Complete Linkage:

For two clusters R and S, the complete linkage returns the maximum distance
between two points i and j such that i belongs to R and j belongs to S.

Ward's Method:

Ward's method minimizes the total within-cluster variance (also known as the error sum of squares, or ESS). When merging two clusters, the method selects the pair of clusters whose merging results in the smallest increase in the total within-cluster variance.
Steps:

● Calculate the initial distance between every pair of points.
● For each merge, select the pair of clusters whose union leads to the smallest increase in the total variance.
● Recalculate distances after every merge.
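A small sketch comparing the merge rules on the same distance matrix (assuming the d object from the earlier five-point example; "ward.D2" is the hclust() method implementing Ward's criterion for Euclidean distances):

# Illustrative sketch: same points, different linkage methods
hc_single   <- hclust(d, method = "single")     # merge by minimum inter-cluster distance
hc_complete <- hclust(d, method = "complete")   # merge by maximum inter-cluster distance
hc_ward     <- hclust(d, method = "ward.D2")    # merge by smallest increase in within-cluster variance

# cut each dendrogram into 2 clusters and compare the assignments
cutree(hc_single,   k = 2)
cutree(hc_complete, k = 2)
cutree(hc_ward,     k = 2)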

Bootstrap Evaluation

Purpose of Evaluation:

● Assess whether a cluster genuinely represents a structure in the data or if it's merely an artifact of the clustering algorithm.
● Particularly relevant for clustering algorithms like k-means, where the number of clusters must be specified beforehand.
Cluster Characteristics:

● Clusters often reveal actual relationships in the data.
● Clusters of "other" or "miscellaneous" are composed of data points with no real relationship, merely fitting into an arbitrary category.
Assessment Method:

● The fpc package provides the clusterboot() function for evaluating cluster stability using bootstrap resampling.
● This function integrates clustering and evaluation for various algorithms, including hclust and kmeans.
Jaccard Coefficient:

● A similarity measure between two sets A and B, defined as:

  J(A, B) = |A ∩ B| / |A ∪ B|

● It quantifies the similarity of two clusters by comparing the intersection and union of their members.
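A tiny illustrative helper (hypothetical, not part of the fpc package):

# Jaccard coefficient of two clusters, given as vectors of member IDs
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(c(1, 2, 3, 4), c(3, 4, 5))   # 2 shared members / 5 distinct members = 0.4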
Evaluation Strategy:

● Step 1: Perform the initial clustering on the original dataset.
● Step 2: Create a new dataset through bootstrap resampling (sampling with replacement) and cluster this new dataset.
● Step 3: For each original cluster, find the most similar cluster in the new clustering using the maximum Jaccard coefficient.
  ○ If this coefficient is less than 0.5, the original cluster is considered "dissolved," indicating instability.
● Step 4: Repeat steps 2 and 3 multiple times to obtain a comprehensive assessment of cluster stability.
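A minimal sketch of this procedure with fpc::clusterboot() (illustrative; kmeansCBI, krange = 4, and B = 100 are assumptions carried over from the earlier k-means example):

library(fpc)

set.seed(1)
cboot <- clusterboot(df,                        # scaled data from the k-means example
                     B = 100,                   # number of bootstrap resamples
                     clustermethod = kmeansCBI, # run kmeans on each resample
                     krange = 4)                # number of clusters

cboot$bootmean   # mean Jaccard coefficient per cluster (stability)
cboot$bootbrd    # how often each cluster was dissolved (Jaccard < 0.5)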
Interpretation of Results:

● Clusters that frequently dissolve (low Jaccard coefficients) are likely not representative of true structure in the data and should be treated with caution.
● High stability (high Jaccard coefficients) indicates that the cluster is more likely to reflect genuine patterns in the data.
