
Unit -3

Topic: Unsupervised methods

1.Introduction:
■ Unsupervised methods - discovering hidden relationships in data.
■ No specific outcome or prediction is involved.
■ Focus is on finding patterns or groupings within the data.
■ Examples include:
  ○ Grouping customers with similar purchase behaviors.
  ○ Identifying correlations between population movement and socioeconomic factors.
■ Used to explore and understand data structure rather than making predictions.

Two classes of unsupervised methods:

★ Cluster analysis - finds groups with similar characteristics.


★ Association rule mining - finds elements or properties in the data that tend to occur
together.

Cluster Analysis:

Cluster analysis groups observations into clusters where each datum is more similar to others
in the same cluster than to those in different clusters.

Example: A tour company could cluster clients based on:

● Preferred destinations (countries they like to visit).


● Tour preferences (adventure, luxury, or educational tours).
● Types of activities clients engage in.

This clustering would help the company design appealing travel packages and better target specific client segments.

Two approaches to clustering:

1. K-means clustering: A fast and popular method for identifying clusters in quantitative
data.
2. Hierarchical clustering: Finds nested groups of clusters, similar to plant taxonomy
(family, then genus, then species).

1.1 Distances

Different notions of distance:


● Euclidean distance
● Hamming distance
● Manhattan (city block) distance
● Cosine similarity

EUCLIDEAN DISTANCE :

● Euclidean distance is a good choice for clustering when measurements are numerical
and continuous.
● K-means clustering is based on optimizing squared Euclidean distance.
● For categorical data, especially binary, other distance metrics should be used.
● Formula for Euclidean distance between two vectors x = (x1, ..., xn) and y = (y1, ..., yn):

  d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)

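A minimal sketch in R (hypothetical vectors, base functions only):

# Illustrative sketch: Euclidean distance between two numeric vectors
x <- c(1, 2, 3)
y <- c(4, 6, 3)

sqrt(sum((x - y)^2))                      # computed directly from the formula
dist(rbind(x, y), method = "euclidean")   # same result via the built-in dist()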
HAMMING DISTANCE

When all variables are categorical, use Hamming distance, which counts mismatches.

For categorical variables (e.g., recipe ingredients, gender, size), you can define the distance as:

● 0 if two points are in the same category.


● 1 if they are in different categories.

Hamming Distance (for unordered categorical variables):

Let’s say we have:

● Recipe 1: chicken, spicy, medium


● Recipe 2: beef, mild, medium

Now, compare the categories:



● Ingredient: chicken ≠ beef → 1


● Spice level: spicy ≠ mild → 1
● Serving size: medium = medium → 0

The Hamming distance is: 1 + 1 + 0 = 2.

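A minimal sketch in R, encoding the two recipes above as character vectors (illustrative only):

# Illustrative sketch: Hamming distance between two categorical records
recipe1 <- c(ingredient = "chicken", spice = "spicy", size = "medium")
recipe2 <- c(ingredient = "beef",    spice = "mild",  size = "medium")

sum(recipe1 != recipe2)   # counts the mismatching positions: 2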
MANHATTAN (CITY BLOCK) DISTANCE

Manhattan distance measures distance in the number of horizontal and vertical units it takes to
get from one (real-valued) point to the other (no diagonal moves). This is also known as L1
distance.

COSINE SIMILARITY

Cosine similarity is a common similarity metric in text analysis. It measures the cosine of the (smallest) angle between two vectors; vectors pointing in similar directions have a cosine similarity near 1:

  cos(θ) = (x · y) / (‖x‖ ‖y‖)
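A minimal sketch in R (hypothetical vectors, base functions only):

# Illustrative sketch: cosine similarity between two numeric vectors
x <- c(1, 2, 3)
y <- c(2, 4, 6)

sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))   # 1, since y points in the same direction as x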

K - Means Clustering:

K-Means is a partition-based clustering algorithm used to divide a dataset into K distinct, non-overlapping clusters based on feature similarity. It is an iterative algorithm that minimizes the within-cluster sum of squares (WCSS), a measure of how close the data points in a cluster are to the cluster centroid.

Steps of the K-Means Algorithm:

1. Initialize:
○ Choose the number of clusters K.
○ Randomly initialize K cluster centroids from the data points.
2. Assign:
○ For each data point, assign it to the cluster whose centroid is closest to it.
3. Update:
○ Recalculate the centroid of each cluster by taking the mean of all the
points assigned to that cluster.
4. Repeat:
○ Repeat the Assign and Update steps until the centroids no longer change
significantly or the algorithm reaches the maximum number of iterations.
5. Terminate:
○ The algorithm terminates when the centroids stabilize (i.e., do not change
between iterations) or the maximum number of iterations is reached.

Numerical Example with Manhattan distance:



K-Means Clustering in R
Step 1: Load the Necessary Packages

library(factoextra)
library(cluster)
Step 2: Load and Prep the Data
● Load the dataset
● Remove any rows with missing values
● Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1

#load data
df <- USArrests

#remove rows with missing values
df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1
df <- scale(df)

#view first six rows of dataset
head(df)

Step 3: Find the Optimal Number of Clusters

To perform k-means clustering in R we can use the built-in kmeans() function, which uses the following syntax:

kmeans(data, centers, nstart, iter.max)

where:
● data: Name of the dataset.
● centers: The number of clusters, denoted k.
● nstart: The number of times to run the algorithm with different random initial cluster centers; kmeans() keeps the best solution (lowest total within-cluster sum of squares).
● iter.max: The maximum number of iterations allowed for each configuration to converge.
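For example, a call using all four arguments might look like this (the values here are placeholders):

km <- kmeans(df, centers = 3, nstart = 25, iter.max = 100)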

Method 1: Number of Clusters vs. the Total Within Sum of Squares (Elbow Method)

The Within-Cluster Sum of Squares (WSS) is a metric used in k-means clustering to measure the compactness of the clusters. It calculates the sum of the squared distances between each data point and the centroid of the cluster it belongs to. Lower WSS values indicate that the data points within a cluster are close to each other, implying well-defined clusters.
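As a rough sketch of what the elbow plot is based on (illustrative; the fviz_nbclust() call shown later produces this plot automatically):

# Total within-cluster sum of squares for k = 1..10, computed by hand
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")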

Method 2: Average Silhouette Method

The average silhouette is a method used to evaluate the quality of clustering results. It helps determine how well data points are clustered and provides a way to assess the optimal number of clusters (commonly used with k-means or hierarchical clustering).

fviz_nbclust(df, kmeans, method = "wss")   # elbow method plot (total WSS vs. number of clusters)
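For the average silhouette method, the analogous factoextra call (an assumption based on the same interface) is:

fviz_nbclust(df, kmeans, method = "silhouette")   # average silhouette width across values of k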

Step 4: Perform K-Means Clustering with Optimal K

#make this example reproducible
set.seed(1)

#perform k-means clustering with k = 4 clusters
km <- kmeans(df, centers = 4, nstart = 25)

# View the size of each cluster
km$size

We can visualize the clusters on a scatterplot that displays the first two principal components on the axes using the fviz_cluster() function:

fviz_cluster(km, data = df)



Hierarchical Clustering:

Hierarchical clustering is a method of clustering that builds a hierarchy of clusters, either by starting with individual points and merging them (agglomerative) or by starting with all points in one cluster and splitting them (divisive).

Example points: A: (2, 3), B: (3, 3), C: (6, 6), D: (8, 8), E: (8, 9)


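A minimal sketch in R, clustering the five example points above with the built-in hclust() (complete linkage is the default):

# Illustrative sketch: agglomerative hierarchical clustering of the five points
pts <- matrix(c(2, 3,
                3, 3,
                6, 6,
                8, 8,
                8, 9),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D", "E"), c("x", "y")))

d  <- dist(pts)   # pairwise Euclidean distances
hc <- hclust(d)   # agglomerative clustering, default method = "complete"
plot(hc)          # dendrogram of the nested merges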

Methods to Merge Clusters:

Single Linkage:
For two clusters R and S, the single linkage returns the minimum distance between
two points i and j such that i belongs to R and j belongs to S.

Complete Linkage:

For two clusters R and S, the complete linkage returns the maximum distance
between two points i and j such that i belongs to R and j belongs to S.

Ward's Method:

Ward's method minimizes the total within-cluster variance (also known as the error sum of squares, or ESS). When merging two clusters, the method selects the pair of clusters whose merging results in the smallest increase in the total within-cluster variance.
Steps:

● Calculate the initial distance between every pair of points.
● For each merge, select the pair of clusters whose union leads to the smallest increase in the total variance.
● Recalculate distances after every merge.
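A small sketch comparing the merge rules on the same distance matrix (assuming the d object from the earlier five-point example; "ward.D2" is the hclust() method implementing Ward's criterion for Euclidean distances):

# Illustrative sketch: same points, different linkage methods
hc_single   <- hclust(d, method = "single")     # merge by minimum inter-cluster distance
hc_complete <- hclust(d, method = "complete")   # merge by maximum inter-cluster distance
hc_ward     <- hclust(d, method = "ward.D2")    # merge by smallest increase in within-cluster variance

# cut each dendrogram into 2 clusters and compare the assignments
cutree(hc_single,   k = 2)
cutree(hc_complete, k = 2)
cutree(hc_ward,     k = 2)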

Bootstrap Evaluation

Purpose of Evaluation:

● Assess whether a cluster genuinely represents a structure in the data or if it's merely an artifact of the clustering algorithm.
● Particularly relevant for clustering algorithms like k-means, where the number of clusters must be specified beforehand.
Cluster Characteristics:

● Clusters often reveal actual relationships in the data.
● Clusters of "other" or "miscellaneous" are composed of data points with no real relationship, merely fitting into an arbitrary category.
Assessment Method:

● The fpc package provides the clusterboot() function for evaluating cluster stability using bootstrap resampling.
● This function integrates clustering and evaluation for various algorithms, including hclust and kmeans.
Jaccard Coefficient:

● A similarity measure between two sets A and B, defined as:

  J(A, B) = |A ∩ B| / |A ∪ B|

● It quantifies the similarity of two clusters by comparing the intersection and union of their members.
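A tiny illustrative helper (hypothetical, not part of the fpc package):

# Jaccard coefficient of two clusters, given as vectors of member IDs
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(c(1, 2, 3, 4), c(3, 4, 5))   # 2 shared members / 5 distinct members = 0.4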
Evaluation Strategy:

● Step 1: Perform the initial clustering on the original dataset.
● Step 2: Create a new dataset through bootstrap resampling (sampling with replacement) and cluster this new dataset.
● Step 3: For each original cluster, find the most similar cluster in the new clustering using the maximum Jaccard coefficient.
  ○ If this coefficient is less than 0.5, the original cluster is considered "dissolved," indicating instability.
● Step 4: Repeat steps 2 and 3 multiple times to obtain a comprehensive assessment of cluster stability.
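A minimal sketch of this procedure with fpc::clusterboot() (illustrative; kmeansCBI, krange = 4, and B = 100 are assumptions carried over from the earlier k-means example):

library(fpc)

set.seed(1)
cboot <- clusterboot(df,                        # scaled data from the k-means example
                     B = 100,                   # number of bootstrap resamples
                     clustermethod = kmeansCBI, # run kmeans on each resample
                     krange = 4)                # number of clusters

cboot$bootmean   # mean Jaccard coefficient per cluster (stability)
cboot$bootbrd    # how often each cluster was dissolved (Jaccard < 0.5)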
Interpretation of Results:

● Clusters that frequently dissolve (low Jaccard coefficients) are likely not representative of true structure in the data and should be treated with caution.
● High stability (high Jaccard coefficients) indicates that the cluster is more likely to reflect genuine patterns in the data.
