Cluster Analysis in Python Chapter1 PDF
learning: basics
CLUSTER ANALYSIS IN PYTHON
Shaumik Daityari
Business Analyst
Everyday example: Google news
How does Google News classify articles?
Point 2: (2, 2)
Point 3: (3, 1)
from matplotlib import pyplot as plt
x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]
# Plot the raw points before any clustering
plt.scatter(x_coordinates, y_coordinates)
plt.show()
What is a cluster?
A group of items with similar characteristics
Customer Segments
K means clustering
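import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster
df = pd.DataFrame({'x_coordinate': x_coordinates,
                   'y_coordinate': y_coordinates})
# Hierarchical clustering with Ward linkage, cut into 3 clusters
Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust')
sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()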
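To see what `linkage` and `fcluster` return without plotting, a minimal sketch on the same coordinates might look like the following. The names `points`, `labels`, `unique_labels`, and `counts` are illustrative and do not come from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]

# Stack the two lists into an (n_points, 2) float array (illustrative name)
points = np.column_stack([x_coordinates, y_coordinates]).astype(float)

# Each row of Z records one merge: the two cluster indices that were joined,
# the distance between them, and the size of the newly formed cluster
Z = linkage(points, 'ward')
print(Z.shape)  # one row per merge, i.e. n_points - 1 rows and 4 columns

# Cut the tree so that at most 3 flat clusters remain
labels = fcluster(Z, 3, criterion='maxclust')
unique_labels, counts = np.unique(labels, return_counts=True)
print(unique_labels, counts)
```

The labels returned by `fcluster` start at 1, which is why the legend in the scatter plot shows 1, 2, and 3.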
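import seaborn as sns
from matplotlib import pyplot as plt
from numpy import random
from scipy.cluster.vq import kmeans, vq
# kmeans draws its initial centroids from NumPy's random state, so seed NumPy
random.seed([1000, 2000])
# Cluster on the two coordinate columns only, as floats
points = df[['x_coordinate', 'y_coordinate']].astype(float)
centroids, _ = kmeans(points, 3)
df['cluster_labels'], _ = vq(points, centroids)
sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()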
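The two return values of `kmeans` and `vq` can also be inspected directly. The sketch below is one way to do that on the same coordinates; the names `points`, `distortion`, `kmeans_labels`, and `distances` are illustrative, and seeding through `np.random.seed` assumes that `kmeans` falls back to NumPy's global random state when no seed is passed.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]

# kmeans works on floating-point observations, one row per point
points = np.column_stack([x_coordinates, y_coordinates]).astype(float)

# Assumption: seeding NumPy's global state makes the run repeatable
np.random.seed([1000, 2000])

# centroids: one row per cluster centre; distortion: mean distance
# between each observation and its nearest centroid
centroids, distortion = kmeans(points, 3)
print(centroids)
print(distortion)

# vq assigns each point to its nearest centroid and returns that distance
kmeans_labels, distances = vq(points, centroids)
print(kmeans_labels)
```

Unlike `fcluster`, `vq` numbers clusters from 0, so the same grouping can appear under different label values in the two plots.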
Why do we need to prepare data for clustering?
Variables have incomparable units (product dimensions in cm, price in $)
Variables with the same units can have vastly different scales and variances (expenditures on cereals vs. travel)
x_new = x / std_dev(x)
from scipy.cluster.vq import whiten
data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]
scaled_data = whiten(data)
print(scaled_data)
[2.73, 0.55, 1.64, 1.64, 1.09, 1.64, 1.64, 4.36, 0.55, 1.09, 1.09, 1.64, 2.73]
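The formula `x_new = x / std_dev(x)` can be checked by hand against `whiten`. The sketch below does this for the list above, then applies `whiten` to a small two-column array; the `features` values are made up purely for illustration and stand in for two variables on very different scales.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# Divide by the population standard deviation (NumPy's default, ddof=0)
manual = data / data.std()
print(np.round(manual, 2))

# The manual result should agree with scipy's whiten
print(np.allclose(manual, whiten(data)))

# Hypothetical example: two columns on very different scales
features = np.array([[10.0, 2000.0],
                     [12.0, 3500.0],
                     [11.0, 1500.0],
                     [15.0, 4000.0]])

# whiten scales each column separately by its own standard deviation
scaled = whiten(features)
print(scaled.std(axis=0))
```

After whitening, every column has a standard deviation of 1, so no single variable dominates the distance calculations used in clustering. Note that `whiten` only rescales; it does not subtract the mean.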