Cluster Analysis in Python: Chapter 1


Unsupervised learning: basics
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst
Everyday example: Google News
How does Google News classify articles?

Unsupervised Learning Algorithm: Clustering

Match frequent terms in articles to find similarity



Labeled and unlabeled data
Data with no labels:

Point 1: (1, 2)

Point 2: (2, 2)

Point 3: (3, 1)

Data with labels:

Point 1: (1, 2), Label: Danger Zone

Point 2: (2, 2), Label: Normal Zone

Point 3: (3, 1), Label: Normal Zone



What is unsupervised learning?
A group of machine learning algorithms that find patterns in data

Data for algorithms has not been labeled, classified or characterized

The objective of the algorithm is to interpret any structure in the data

Common unsupervised learning algorithms: clustering, neural networks, anomaly detection



What is clustering?
The process of grouping items with similar characteristics

Items in a group are more similar to each other than to items in other groups

Example: distance between points on a 2D plane
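
As a small illustration of using distance as the measure of similarity, the sketch below computes the Euclidean distance between each pair of the three unlabeled points shown earlier in these slides. The dictionary layout and variable names are assumptions made for this example only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# The three unlabeled example points from the earlier slide
points = {"Point 1": (1, 2), "Point 2": (2, 2), "Point 3": (3, 1)}

names = list(points)
distances = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        # A smaller distance means the two points are more similar
        distances[(a, b)] = dist(points[a], points[b])
        print(f"{a} - {b}: {distances[(a, b)]:.3f}")
```

Point 1 and Point 2 are the closest pair here, so a distance-based clustering method would tend to group them first.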



Plotting data for clustering - Pokemon sightings
from matplotlib import pyplot as plt

x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]

plt.scatter(x_coordinates, y_coordinates)
plt.show()

Up next - some practice

Basics of cluster analysis

What is a cluster?
A group of items with similar characteristics

Google News: articles where similar words and word associations appear together

Customer Segments



Clustering algorithms
Hierarchical clustering

K-means clustering

Other clustering algorithms: DBSCAN, Gaussian methods (e.g. Gaussian mixture models)

Hierarchical clustering in SciPy
from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates,
                   'y_coordinate': y_coordinates})

# Build the linkage matrix with Ward's method, then cut it into 3 flat clusters
Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust')

# Color each point by its assigned cluster
sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()
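
To check the result without a plot, the sketch below repeats the same `linkage` and `fcluster` calls on a plain NumPy array and counts the points assigned to each cluster. The array name `points` and the counting step are assumptions added for illustration; they are not part of the original slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same 15 sightings as above, one (x, y) pair per row
points = np.array([
    [80.1, 87.2], [93.1, 96.1], [86.6, 95.6], [98.5, 92.4], [86.4, 92.4],
    [9.5, 57.7], [15.2, 49.4], [3.4, 47.3], [10.4, 59.1], [20.3, 55.5],
    [44.2, 25.6], [56.8, 2.1], [49.2, 10.9], [62.5, 24.1], [44.0, 10.3],
])

# Each row of Z records one merge, so n points give n - 1 rows
Z = linkage(points, method="ward")

# Ask for at most 3 flat clusters; fcluster numbers them starting at 1
labels = fcluster(Z, t=3, criterion="maxclust")

cluster_ids, counts = np.unique(labels, return_counts=True)
print(dict(zip(cluster_ids.tolist(), counts.tolist())))
```

Because the three groups in this data are well separated, each cluster should receive five points.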

K-means clustering in SciPy
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

# k-means starts from randomly chosen centroids. By default SciPy draws them
# from NumPy's global random state, so seed NumPy rather than Python's
# built-in random module
from numpy import random
random.seed((1000, 2000))

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates,
                   'y_coordinate': y_coordinates})

# kmeans returns (centroids, distortion); vq returns (labels, distances)
centroids, _ = kmeans(df, 3)
df['cluster_labels'], _ = vq(df, centroids)

sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()
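
The slide code discards the second return value of both `kmeans` and `vq` with `_`. The sketch below keeps them, to show what each call returns. It uses a plain NumPy array named `points` and seeds NumPy's global random state; both are choices made for this illustration, and newer SciPy releases also accept an explicit seed argument on `kmeans` instead.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Same 15 sightings as above, one (x, y) pair per row
points = np.array([
    [80.1, 87.2], [93.1, 96.1], [86.6, 95.6], [98.5, 92.4], [86.4, 92.4],
    [9.5, 57.7], [15.2, 49.4], [3.4, 47.3], [10.4, 59.1], [20.3, 55.5],
    [44.2, 25.6], [56.8, 2.1], [49.2, 10.9], [62.5, 24.1], [44.0, 10.3],
])

# Seed NumPy so repeated runs pick the same starting centroids
np.random.seed((1000, 2000))

# centroids: one row per cluster center
# distortion: mean distance between the points and their nearest centroid
centroids, distortion = kmeans(points, 3)

# labels: 0-based index of the nearest centroid for every point
# distances: distance from every point to that centroid
labels, distances = vq(points, centroids)

print("centroids:\n", centroids.round(1))
print("distortion:", round(float(distortion), 2))
print("labels:", labels)
print("largest distance to own centroid:", round(float(distances.max()), 2))
```

Unlike `fcluster`, which numbers clusters from 1, the labels from `vq` are indices into `centroids` and start at 0.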

Next up: hands-on exercises

Data preparation for cluster analysis

Why do we need to prepare data for clustering?
Variables have incomparable units (product dimensions in cm, price in $)

Variables with same units have vastly different scales and variances (expenditures on cereals, travel)

Data in raw form may lead to bias in clustering

Clusters may be heavily dependent on one variable

Solution: normalization of individual variables
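
A minimal sketch of the problem, using made-up product data (length in cm, price in $): on the raw values, the nearest neighbor of the first product is decided almost entirely by price, while after rescaling each column by its standard deviation both variables carry comparable weight. The data values and the helper `nearest_to_first` are hypothetical and exist only for this example.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical products: [length in cm, price in $]
products = np.array([
    [10.0, 1000.0],
    [11.0, 1300.0],
    [30.0, 1100.0],
    [31.0, 1400.0],
])

def nearest_to_first(data):
    """Return the index of the row closest (Euclidean) to row 0, excluding row 0."""
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(d)) + 1

# Raw data: price differences dwarf length differences
print("std per column (raw):", products.std(axis=0).round(2))
raw_nearest = nearest_to_first(products)

# Rescaled data: every column has a standard deviation of 1
scaled = whiten(products)
print("std per column (scaled):", scaled.std(axis=0).round(2))
scaled_nearest = nearest_to_first(scaled)

print("nearest to product 0 (raw):   ", raw_nearest)
print("nearest to product 0 (scaled):", scaled_nearest)
```

On the raw values product 0 is matched with product 2, which has a similar price but is three times as long. After rescaling it is matched with product 1, which is almost the same length.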



Normalization of data
Normalization: process of rescaling data to a standard deviation of 1

x_new = x / std_dev(x)

from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]

scaled_data = whiten(data)
print(scaled_data)

[2.73, 0.55, 1.64, 1.64, 1.09, 1.64, 1.64, 4.36, 0.55, 1.09, 1.09, 1.64, 2.73]
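
The formula above can be checked by hand. The sketch below divides the same data by its standard deviation with NumPy and compares the result with `whiten`. Reshaping the data into a single column before calling `whiten` is a precaution taken in this example (the function is documented for arrays with one observation per row), not something the slides do.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# x_new = x / std_dev(x), using the population standard deviation (ddof=0)
manual = data / data.std()

# whiten applies the same rescaling to each column of a 2-D array
scaled = whiten(data.reshape(-1, 1)).ravel()

print(manual.round(2))
print("matches whiten:", np.allclose(manual, scaled))
print("std after scaling:", round(float(manual.std()), 2))
print("mean after scaling:", round(float(manual.mean()), 2))
```

Only the spread changes: the values are divided by the standard deviation but not centered, so the mean of the scaled data is not zero.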



Illustration: normalization of data
# Import plotting library
from matplotlib import pyplot as plt

# Initialize original, scaled data
plt.plot(data, label="original")
plt.plot(scaled_data, label="scaled")

# Show legend and display plot
plt.legend()
plt.show()



Next up: some DIY exercises
