IDS Unit-3 L2
1. Introduction:
■ Unsupervised methods - discovering hidden relationships in data.
■ No specific outcome or prediction is involved.
■ Focus is on finding patterns or groupings within the data.
■ Examples include:
■ Grouping customers with similar purchase behaviors.
■ Identifying correlations between population movement and socioeconomic factors.
■ Used to explore and understand data structure rather than making predictions.
Cluster Analysis:
Cluster analysis groups observations into clusters so that each datum is more similar to the others in its own cluster than to those in different clusters.
For example, a travel company might cluster its clients by their spending and destination preferences, to help the company design appealing travel packages and better target specific client segments.
1. K-means clustering: A fast and popular method for identifying clusters in quantitative
data.
2. Hierarchical clustering: Finds nested groups of clusters, similar to plant taxonomy
(family, then genus, then species).
1.1 Distances
EUCLIDEAN DISTANCE:
● Euclidean distance is a good choice for clustering when measurements are numerical
and continuous.
● K-means clustering is based on optimizing squared Euclidean distance.
● For categorical data, especially binary, other distance metrics should be used.
● Formula for Euclidean distance between two vectors x and y:
d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }
HAMMING DISTANCE
When all variables are categorical, use Hamming distance, which counts mismatches.
For categorical variables (e.g., recipe ingredients, gender, size), you can define the distance as:
d(x, y) = \sum_{i=1}^{n} \mathbf{1}(x_i \neq y_i)
i.e., the number of positions at which the two vectors differ.
MANHATTAN DISTANCE
Manhattan distance measures distance as the number of horizontal and vertical units it takes to get from one (real-valued) point to the other (no diagonal moves). This is also known as L1 distance:
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
COSINE SIMILARITY
Cosine similarity is a common similarity metric in text analysis. It measures the smallest angle between two vectors; the cosine of that angle is 1 when the vectors point in the same direction and 0 when they are orthogonal:
\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
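As a quick check of these definitions, here is a base-R sketch computing each metric for two small made-up vectors (x and y are numeric; a and b are categorical):

#two numeric vectors for the continuous metrics
x <- c(1, 2, 3)
y <- c(4, 0, 3)
sqrt(sum((x - y)^2))                              #Euclidean (L2) distance
sum(abs(x - y))                                   #Manhattan (L1) distance
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))    #cosine similarity

#two categorical vectors for Hamming distance (count of mismatches)
a <- c("red", "small", "metal")
b <- c("red", "large", "wood")
sum(a != b)                                       #Hamming distance = 2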
K-Means Clustering:
1. Initialize:
○ Choose the number of clusters K.
○ Randomly initialize K cluster centroids from the data points.
2. Assign:
○ For each data point, assign it to the cluster whose centroid is closest to it.
3. Update:
○ Recalculate the centroid of each cluster by taking the mean of all the
points assigned to that cluster.
4. Repeat:
○ Repeat the Assign and Update steps until the centroids no longer change
significantly or the algorithm reaches the maximum number of iterations.
5. Terminate:
○ The algorithm terminates when the centroids stabilize (i.e., do not change
between iterations) or the maximum number of iterations is reached.
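Before turning to R's built-in implementation, here is a minimal from-scratch sketch of these five steps (the name simple_kmeans and the convergence tolerance are illustrative, and the sketch ignores the empty-cluster edge case; in practice use the kmeans() function shown in the next section):

simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # 1. Initialize: pick k random data points as the starting centroids
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # 2. Assign: squared Euclidean distance from every point to every centroid
    d2 <- sapply(1:k, function(j) colSums((t(data) - centroids[j, ])^2))
    cluster <- max.col(-d2)   # index of the nearest centroid for each point
    # 3. Update: each centroid becomes the mean of its assigned points
    new_centroids <- t(sapply(1:k, function(j)
      colMeans(data[cluster == j, , drop = FALSE])))
    # 4./5. Repeat until the centroids stop moving (convergence)
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}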
K-Means Clustering in R
Step 1: Load the Necessary Packages
library(factoextra)
library(cluster)
Step 2: Load and Prep the Data
● Load the dataset
● Remove any rows with missing values
● Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1
#load data
df <- USArrests

#remove rows with missing values and standardize each column
df <- scale(na.omit(df))
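The clustering itself is produced by the kmeans() function. A sketch of the call (the choices centers = 4 and nstart = 25 here are illustrative, not prescribed by these notes):

#run k-means with k = 4 clusters, trying 25 random starts
km <- kmeans(df, centers = 4, nstart = 25)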
where:
● data: Name of the dataset.
● centers: The number of clusters, denoted k.
● nstart: Tells kmeans() to run the algorithm that many times with different random initial cluster centers, keeping the best solution (lowest total within-cluster sum of squares).
● iter.max: The maximum number of iterations each configuration is allowed to converge within.
Method 2:
We can visualize the clusters on a scatterplot that displays the first two principal components on the axes using the fviz_cluster() function:
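A sketch, assuming km holds the kmeans() result and df the scaled data from the steps above:

#plot the clusters on the first two principal components
fviz_cluster(km, data = df)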
Hierarchical Clustering:
Single Linkage:
For two clusters R and S, the single linkage returns the minimum distance between
two points i and j such that i belongs to R and j belongs to S.
Complete Linkage:
For two clusters R and S, the complete linkage returns the maximum distance
between two points i and j such that i belongs to R and j belongs to S.
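In symbols, writing d(i, j) for the distance between points i and j (a standard formulation):

d_{\text{single}}(R, S) = \min_{i \in R,\, j \in S} d(i, j)
\qquad
d_{\text{complete}}(R, S) = \max_{i \in R,\, j \in S} d(i, j)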
20
Ward's Method:
At each step, Ward's method merges the pair of clusters whose union gives the smallest increase in the total within-cluster sum of squares (variance).
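A sketch of all three linkages using R's built-in hclust(), assuming df is the scaled data from the k-means example:

#pairwise Euclidean distances between observations
d <- dist(df)
hc_single <- hclust(d, method = "single")       #single linkage
hc_complete <- hclust(d, method = "complete")   #complete linkage
hc_ward <- hclust(d, method = "ward.D2")        #Ward's method
plot(hc_ward)                                   #dendrogram of the nested clusters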
Bootstrap Evaluation
Purpose of Evaluation: to check whether the discovered clusters are stable, that is, whether they reflect genuine structure in the data or are artifacts of the particular sample.
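One common way to do this in R (an assumption here, not spelled out in these notes) is clusterboot() from the fpc package, which reclusters bootstrap resamples of the data and reports each cluster's mean Jaccard stability:

library(fpc)
#recluster 100 bootstrap resamples of df; k = 4 is illustrative
cboot <- clusterboot(df, B = 100, clustermethod = kmeansCBI, k = 4, seed = 1)
cboot$bootmean   #mean Jaccard similarity per cluster (values near 1 = stable)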