
Experiment No. 10

Unsupervised Classification using K-Means Clustering: Implementation and Analysis in Python

Objectives:

In this lab, you will:

 Learn the fundamentals of unsupervised learning and its application in data clustering.
 Implement the K-Means clustering algorithm from scratch using Python and popular libraries such as NumPy and Pandas.
 Analyze the performance of the K-Means algorithm on different datasets and evaluate the clustering results using metrics such as within-cluster variance.
 Understand the impact of key parameters (such as the number of clusters) on the quality of the clustering results.
 Gain hands-on experience in preparing and preprocessing data for clustering, including handling missing data and scaling features.
 Visualize the clusters formed by the algorithm and interpret the findings in the context of the given data.
Prerequisites
 Familiarity with Python programming.
 Basic understanding of classifier performance metrics.

Unsupervised Classification (Clustering)


Unsupervised classification, also known as unsupervised learning, is a type of machine learning
where the model is trained on data that does not have labeled outputs. In contrast to supervised
learning, where the data includes input-output pairs (features and labels), unsupervised learning
involves finding patterns, relationships, or groupings within the data without any predefined
labels.
The goal of unsupervised classification is to discover inherent structures or groupings in the
data, which can be used for various tasks like data summarization, anomaly detection, or feature
extraction.
Common types of unsupervised learning algorithms include:
 Clustering: Grouping data points into clusters based on similarity.
 Dimensionality Reduction: Reducing the number of features while retaining important information (e.g., PCA).
K-Means Clustering
K-Means is one of the most popular and widely used clustering algorithms in unsupervised learning. It partitions a set of data points into K clusters, where K is specified in advance. The goal is to minimize the variance within each cluster while maximizing the variance between clusters.
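Formally, K-Means minimizes the within-cluster sum of squared distances (the quantity scikit-learn later reports as inertia_):

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

where C_j is the set of points assigned to cluster j and \mu_j is its centroid.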
How K-Means Works:
1. Initialization: Choose the number of clusters, K, and initialize K cluster centroids
(typically, these are selected randomly from the data points).
2. Assignment Step: Assign each data point to the closest centroid based on a distance
metric, typically Euclidean distance. This creates K groups of data points.
3. Update Step: Recalculate the centroids of the clusters by taking the mean of all the
points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids no longer change significantly or the
algorithm converges. The clusters are then considered stable.
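The four steps above map directly onto a few lines of NumPy. The sketch below is one minimal from-scratch implementation in the spirit of this lab's objectives; the function name and parameters are illustrative choices, and for simplicity it assumes no cluster ever becomes empty:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n_samples, n_features) array of data points.
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2 (Assignment): label each point with its nearest centroid,
        # using Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (Update): move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4 (Repeat): stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids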

Key Points in K-Means Clustering:


 K value: The number of clusters (K) must be specified before the algorithm runs. This
can be chosen based on domain knowledge, trial and error, or methods like the elbow
method.
 Distance Metric: The algorithm uses a distance metric (typically Euclidean distance) to
calculate the similarity between data points and centroids.
 Convergence: The algorithm iterates until the centroids do not change much between
iterations, signaling convergence.
 Sensitivity to Initialization: The outcome of K-Means can depend on the initial placement
of centroids. To address this, techniques like K-Means++ initialization are used to
improve results (see the scikit-learn example below).
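For illustration, here is how these choices surface in scikit-learn's KMeans constructor. init='k-means++' is scikit-learn's default initialization, and n_init controls how many times the algorithm is restarted from different initial centroids; the values shown are illustrative:

from sklearn.cluster import KMeans

# k-means++ spreads the initial centroids apart; n_init repeats the whole
# algorithm from 10 different initializations and keeps the run with the
# lowest within-cluster sum of squares; random_state makes it reproducible.
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)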

Advantages of K-Means:
 Scalability: K-Means works well for large datasets because it is computationally
efficient.
 Simplicity: The algorithm is easy to implement and understand.
 Versatility: It can be applied to many types of clustering problems (e.g., customer
segmentation, image compression).

Disadvantages:
 Choosing K: The algorithm requires you to specify the number of clusters in advance,
which may not always be obvious.
 Sensitivity to Outliers: K-Means can be affected by outliers, as they can significantly
alter the mean of a cluster.
 Non-Spherical Clusters: K-Means assumes that clusters are spherical and equally sized,
which may not always be the case in real-world data.
Clustering with K-Means in Python

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline

# Load the dataset and inspect the first few rows
df = pd.read_csv("income.csv")
df.head()

# Scatter plot of the raw (unscaled) data
plt.scatter(df.Age, df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')

# Fit K-Means with K=3 and obtain a cluster label for each row
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
y_predicted

# Attach the labels to the DataFrame as a new column
df['cluster'] = y_predicted
df.head()

# Coordinates of the three learned centroids
km.cluster_centers_

# Plot each cluster in its own color, with the centroids as stars
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1],
            color='purple', marker='*', label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
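Because Age and Income($) are on very different scales, the Euclidean distance is dominated by the larger-valued feature (income), so the unscaled clusters above tend to split along the income axis alone. Rescaling both features to a common range, as done next, gives them equal weight.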

Preprocessing using MinMaxScaler

MinMaxScaler rescales each feature to the range [0, 1] using x' = (x - x_min) / (x_max - x_min).
scaler = MinMaxScaler()

# Rescale Income($) to [0, 1]
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])

# Rescale Age to [0, 1]
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
df.head()

# Both axes now span 0 to 1
plt.scatter(df.Age, df['Income($)'])

# Re-run K-Means on the scaled features
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
y_predicted
df['cluster'] = y_predicted
df.head()

km.cluster_centers_

# Plot the clusters and centroids found on the scaled data
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1],
            color='purple', marker='*', label='centroid')
plt.legend()

Elbow Plot

# Fit K-Means for each candidate K and record the sum of squared
# errors (scikit-learn's inertia_) for that clustering
sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age','Income($)']])
    sse.append(km.inertia_)

plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng, sse)
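The "elbow" of the resulting curve, i.e., the value of K beyond which the SSE stops dropping sharply, is a reasonable choice for the number of clusters.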
Activity Name: ____________________
Group No.: ____________________
Student Roll No.: ____________________

| No. | CLO | PLO | Domain + Taxonomy | Criteria                                      | Awarded Score (out of 4 for each cell) |
|-----|-----|-----|-------------------|-----------------------------------------------|----------------------------------------|
| 1   | 5   | 5   | P3                | Operational Skills for Anaconda/Python/Spyder |                                        |
