K Means
com/ambarish/ml-kaggler-types-using-kmeans-and-pca
Therefore, we scale our data before employing a distance-based algorithm so that all the features contribute equally to the result.
https://towardsdatascience.com/segmenting-customers-using-k-means-and-transaction-records-76f4055d856a
https://www.quora.com/Should-you-standardize-binary-categorical-and-indicator-primary-key-variables-before-performing-K-means-clustering
https://github.com/adelweiss/RFM_Kmeans
https://medium.com/@16611050/k-means-clustering-8476c74ad462
https://heartbeat.fritz.ai/understanding-the-mathematics-behind-k-means-clustering-40e1d55e2f4c
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py
https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
https://www.guru99.com/r-k-means-clustering.html ( R )
https://www.geeksforgeeks.org/k-means-clustering-introduction/
https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
https://www.slideshare.net/kasunrangawijeweera/kmeans-example
http://people.csail.mit.edu/dsontag/courses/ml13/slides/lecture14.pdf
Seraj k-means
https://www.youtube.com/watch?edufilter=NULL&v=9991JlKnFmk
https://www.kaggle.com/isaikumar/credit-card-fraud-detection-using-k-means-and-knn
An Improved Credit Card Fraud Detection Using K-Means Clustering Algorithm (paper)
https://www.krishisanskriti.org/vol_image/03Jul201510071210.pdf
A Cluster Based Approach for Credit Card Fraud Detection System using HMM with the Implementation of Big Data Technology
https://www.ripublication.com/ijaer19/ijaerv14n2_08.pdf
https://towardsdatascience.com/clustering-machine-learning-combination-in-sales-prediction-330a7a205102
Sales Prediction using Clustering & Machine Learning (ARIMA & Holt’s Winter Approach) (R-programming)
https://www.slideshare.net/annafensel/kmeans-clustering-122651195
unique_vals = data['cluster'].unique() # [0, 1, 2]
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')
K-Means clustering.
Parameters
n_clusters : int, default=8
The number of clusters to form as well as the number of centroids to generate.
init : {‘k-means++’, ‘random’} or ndarray, default=’k-means++’
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
‘random’: choose k observations (rows) at random from data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
n_init : int, default=10
Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
max_iter : int, default=300
Maximum number of iterations of the k-means algorithm for a single run.
tol : float, default=1e-4
Relative tolerance with regards to inertia to declare convergence.
verbose : int, default=0
Verbosity mode.
random_state : int, RandomState instance, default=None
Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
copy_x : bool, default=True
When pre-computing distances it is more numerically accurate to center the data first. If
copy_x is True (default), then the original data is not modified, ensuring X is C-
contiguous. If False, the original data is modified, and put back before the function
returns, but small numerical differences may be introduced by subtracting and then
adding the data mean, in this case it will also not ensure that data is C-contiguous which
may cause a significant slowdown.
n_jobs : int, default=None
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
algorithm : {'auto', 'full', 'elkan'}, default='auto'
K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.
Attributes
inertia_ : float
Sum of squared distances of samples to their closest cluster center.
n_iter_ : int
Number of iterations run.
See also
MiniBatchKMeans
Notes
The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations.
In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms
available), but it falls in local minima. That’s why it can be useful to restart it several
times.
Examples
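A minimal usage example along the lines of the one in the scikit-learn documentation (the original doctest did not survive the copy, so this is a reconstruction):

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)                      # cluster index for each training sample
print(kmeans.predict([[0, 0], [12, 3]]))   # assign new points to the nearest centroid
print(kmeans.cluster_centers_)             # the two learned centroids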
Methods
fit(self, X, y=None, sample_weight=None)
Compute k-means clustering.
Parameters
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
y : Ignored
Not used, present here for API consistency by convention.
sample_weight : array-like, shape (n_samples,), optional
The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns
self
Fitted estimator.
fit_predict(self, X, y=None, sample_weight=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
Parameters
X : array-like or sparse matrix, shape=(n_samples, n_features)
New data to transform.
y : Ignored
Not used, present here for API consistency by convention.
sample_weight : array-like, shape (n_samples,), optional
The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns
labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
fit_transform(self, X, y=None, sample_weight=None)
Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
Parameters
X : array-like or sparse matrix, shape=(n_samples, n_features)
New data to transform.
y : Ignored
Not used, present here for API consistency by convention.
sample_weight : array-like, shape (n_samples,), optional
The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns
X_new : array, shape [n_samples, k]
X transformed in the new space.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
params : mapping of string to any
Parameter names mapped to their values.
predict(self, X, sample_weight=None)
Predict the closest cluster each sample in X belongs to.
Parameters
X : array-like or sparse matrix, shape=(n_samples, n_features)
New data to predict.
sample_weight : array-like, shape (n_samples,), optional
The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns
labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
score(self, X, y=None, sample_weight=None)
Opposite of the value of X on the K-means objective.
Parameters
X : array-like or sparse matrix, shape=(n_samples, n_features)
New data.
y : Ignored
Not used, present here for API consistency by convention.
sample_weight : array-like, shape (n_samples,), optional
The weights for each observation in X. If None, all observations are assigned equal weight (default: None).
Returns
score : float
Opposite of the value of X on the K-means objective.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines).
The latter have parameters of the form <component>__<parameter> so that it’s possible
to update each component of a nested object.
Parameters
**params : dict
Estimator parameters.
Returns
self : object
Estimator instance.
transform(self, X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
Parameters
X : array-like or sparse matrix, shape=(n_samples, n_features)
New data to transform.
Returns
X_new : array, shape [n_samples, k]
X transformed in the new space.
Imad Dabbura
Clustering is one of the most common exploratory data analysis techniques used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups in
the data such that data points in the same subgroup (cluster) are very similar while data points in
different clusters are very different. In other words, we try to find homogeneous subgroups
within the data such that data points in each cluster are as similar as possible according to a
similarity measure such as euclidean-based distance or correlation-based distance. The decision
of which similarity measure to use is application-specific.
Clustering analysis can be done on the basis of features where we try to find subgroups of
samples based on features or on the basis of samples where we try to find subgroups of features
based on samples. We’ll cover here clustering based on features. Clustering is used in market
segmentation, where we try to find customers that are similar to each other, whether in terms of
behaviors or attributes, image segmentation/compression; where we try to group similar regions
together, document clustering based on topics, etc.
In this post, we will cover only Kmeans which is considered as one of the most used clustering
algorithms due to its simplicity.
Kmeans Algorithm
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn’t changing:
Compute the sum of the squared distance between data points and all centroids.
Assign each data point to the closest cluster (centroid).
Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
where wik=1 for data point xi if it belongs to cluster k; otherwise, wik=0. Also, μk is the centroid
of xi’s cluster.
It’s a minimization problem of two parts. We first minimize J w.r.t. wik while treating μk as fixed. Then
we minimize J w.r.t. μk while treating wik as fixed. Technically speaking, we differentiate J w.r.t. wik
first and update the cluster assignments (E-step). Then we differentiate J w.r.t. μk and recompute the
centroids after the cluster assignments from the previous step (M-step). Therefore, the E-step is:
In other words, assign the data point xi to the closest cluster judged by its sum of squared
distance from cluster’s centroid.
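The formulas referenced above were images in the original article and did not carry over; using the same wik and μk notation, the standard forms are:

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \lVert x_i - \mu_k \rVert^2

E-step (update assignments, centroids fixed):
w_{ik} = 1 \text{ if } k = \arg\min_j \lVert x_i - \mu_j \rVert^2, \quad w_{ik} = 0 \text{ otherwise}

M-step (update centroids, assignments fixed):
\mu_k = \frac{\sum_{i=1}^{m} w_{ik} x_i}{\sum_{i=1}^{m} w_{ik}}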
Given kmeans’ iterative nature and the random initialization of centroids at the start of
the algorithm, different initializations may lead to different clusters, since the algorithm
may get stuck in a local optimum and never converge to the global optimum. Therefore, it’s
recommended to run the algorithm using different initializations of centroids and pick the
results of the run that yielded the lowest sum of squared distance.
Implementation
We’ll use a simple implementation of kmeans here just to illustrate some concepts. Then we will
use the sklearn implementation, which is more efficient and takes care of many things for us.
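The article’s own implementation did not carry over, so below is a minimal NumPy sketch of the algorithm described above (random initialization without replacement, then alternating assignment and centroid updates); the function and variable names are my own.

import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Plain k-means: returns (centroids, labels, sse). Assumes no cluster goes empty."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # E-step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, sse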
Applications
kmeans algorithm is very popular and used in a variety of applications such as market
segmentation, document clustering, image segmentation and image compression, etc. The goal
usually when we undergo a cluster analysis is either:
1. Get a meaningful intuition of the structure of the data we’re dealing with.
Image compression.
We’ll first implement the kmeans algorithm on 2D dataset and see how it works. The dataset has
272 observations and 2 features. The data covers the waiting time between eruptions and the
duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming,
USA. We will try to find K subgroups within the data points and group them accordingly. The two features are: eruptions (eruption duration in minutes) and waiting (waiting time to the next eruption in minutes).
Next, we’ll show that different initializations of centroids may yield to different results. I’ll use 9
different random_state to change the initialization of the centroids and plot the results. The
title of each plot will be the sum of squared distance of each initialization.
As a side note, this dataset is considered very easy and converges in less than 10 iterations.
Therefore, to see the effect of random initialization on convergence, I am going to go with 3
iterations to illustrate the concept. However, in real world applications, datasets are not at all
that clean and nice!
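A sketch of that experiment using the kmeans function above, assuming X holds the two geyser features as a NumPy array; the seed plays the role of random_state:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3, figsize=(12, 12))
for seed, ax in enumerate(axes.ravel()):
    # Only 3 iterations so the effect of the initialization is still visible
    centroids, labels, sse = kmeans(X, k=2, max_iter=3, seed=seed)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=10)
    ax.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=100)
    ax.set_title(f"SSE = {sse:.1f}")
plt.show()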
As the graph above shows, we ended up with only two different clusterings across the different
initializations. We would pick the one with the lowest sum of squared distance.
In this part, we’ll implement kmeans to compress an image. The image that we’ll be working on is
396 x 396 x 3. Therefore, for each pixel location we have three 8-bit integers that specify the
red, green, and blue intensity values. Our goal is to reduce the number of colors to 30 and
represent (compress) the photo using those 30 colors only. To pick which colors to use, we’ll run the
kmeans algorithm on the image and treat every pixel as a data point. That means reshaping the
image from height x width x channels to (height * width) x channels, i.e. we would have 396 x 396
= 156,816 data points in 3-dimensional space, which are the RGB intensities. Doing so allows
us to represent each pixel by one of the 30 centroids, which reduces the
size of the image by roughly a factor of 5. The original image size was 396 x 396 x 24 = 3,763,584 bits,
whereas the compressed image would be about 30 x 24 + 396 x 396 x 5 = 784,800 bits. The
difference comes from the fact that we’ll be using the centroids as a lookup table for the pixels’ colors,
which reduces each pixel location to a 5-bit index (enough to address 30 colors) instead of 24 bits of color.
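A sketch of the compression step with scikit-learn; the file name 'image.png' and the use of matplotlib for loading are my additions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

img = plt.imread('image.png')        # shape: (height, width, 3); for a PNG the values are floats in [0, 1]
pixels = img.reshape(-1, 3)          # every pixel becomes a 3-D data point

kmeans = KMeans(n_clusters=30, random_state=0).fit(pixels)
# Replace every pixel by its nearest centroid (the 30-color palette)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)

plt.imshow(np.clip(compressed, 0, 1))
plt.axis('off')
plt.show()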
n_init is the number of times kmeans is run with different centroid initializations. The
result of the best run will be reported.
Evaluation Methods
Contrary to supervised learning where we have the ground truth to evaluate the model’s
performance, clustering analysis doesn’t have a solid evaluation metric that we can use to
evaluate the outcome of different clustering algorithms. Moreover, since kmeans requires k as an
input and doesn’t learn it from data, there is no right answer in terms of the number of clusters
that we should have in any problem. Sometimes domain knowledge and intuition may help but
usually that is not the case. In the cluster-predict methodology, we can evaluate how well the
models are performing based on different K clusters since clusters are used in the downstream
modeling.
In this post we’ll cover two metrics that may give us some intuition about k:
Elbow method
Silhouette analysis
Elbow Method
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum
of squared distance (SSE) between data points and their assigned clusters’ centroids. We pick k at
the spot where the SSE starts to flatten out and form an elbow. We’ll use the geyser dataset,
evaluate SSE for different values of k, and see where the curve might form an elbow and flatten
out.
The graph above shows that k=2 is not a bad choice. Sometimes it’s still hard to figure out a good
number of clusters to use because the curve is monotonically decreasing and may not show any
elbow or any obvious point where the curve starts flattening out.
Silhouette Analysis
Silhouette analysis can be used to determine the degree of separation between clusters. For
each sample:
Compute the average distance from all data points in the same cluster (ai).
Compute the average distance from all data points in the closest other cluster (bi).
Compute the silhouette coefficient si = (bi - ai) / max(ai, bi).
Therefore, we want the coefficients to be as big as possible and close to 1 to have good clusters.
We’ll use the geyser dataset again because it’s cheaper to run the silhouette analysis on it, and it is
actually obvious that there are most likely only two groups of data points.
As the above plots show, n_clusters=2 has the best average silhouette score of around 0.75,
and all clusters being above the average shows that it is actually a good choice. Also, the thickness
of the silhouette plot gives an indication of how big each cluster is. The plot shows that cluster 1
has almost double the samples of cluster 2. However, as we increased n_clusters to 3 and 4,
the average silhouette score decreased dramatically to around 0.48 and 0.39 respectively.
Moreover, the thickness of the silhouette plot started showing wide fluctuations. The bottom line is:
a good n_clusters will have an average silhouette score well above 0.5, and all of the clusters
will score above that average.
Drawbacks
The kmeans algorithm is good at capturing the structure of the data if the clusters have a spherical-like
shape. It always tries to construct a nice spherical shape around the centroid. That means that the
minute the clusters have complicated geometric shapes, kmeans does a poor job of clustering
the data. We’ll illustrate three cases where kmeans will not perform well.
First, kmeans algorithm doesn’t let data points that are far-away from each other share the same
cluster even though they obviously belong to the same cluster. Below is an example of data points
on two different horizontal lines that illustrates how kmeans tries to group half of the data points
of each horizontal lines together.
Kmeans considers point ‘B’ closer to point ‘A’ than to point ‘C’ because the clusters have a non-spherical
shape. Therefore, points ‘A’ and ‘B’ will be in the same cluster while point ‘C’ will be in a different
cluster. (Note that the Single Linkage hierarchical clustering method gets this right because it
doesn’t separate similar points.)
Second, we’ll generate data from multivariate normal distributions with different means and
standard deviations. So we would have 3 groups of data where each group was generated from
different multivariate normal distribution (different mean/standard deviation). One group will
have a lot more data points than the other two combined. Next, we’ll run kmeans on the data with
K=3 and see if it will be able to cluster the data correctly. To make the comparison easier, I am
going to plot first the data colored based on the distribution it came from. Then I will plot the
same data but now colored based on the clusters they have been assigned to.
Looks like kmeans couldn’t figure out the clusters correctly. Since it tries to minimize the within-
cluster variation, it gives more weight to bigger clusters than smaller ones. In other words, data
points in smaller clusters may be left away from the centroid in order to focus more on the larger
cluster.
Last, we’ll generate data that have complicated geometric shapes such as moons and circles
within each other and test kmeans on both of the datasets.
As expected, kmeans couldn’t figure out the correct clusters for both datasets. However, we can
help kmeans perfectly cluster these kinds of datasets if we use kernel methods. The idea is that we
transform the data to a higher-dimensional representation that makes it linearly separable (the same
idea that we use in SVMs). Different kinds of algorithms work very well in such scenarios, such
as SpectralClustering; see below:
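A sketch of that comparison on the two-moons data; the make_moons parameters are my own choice:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
# affinity='nearest_neighbors' builds a graph that follows the moon shapes
sc_labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               random_state=0).fit_predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=km_labels, s=10)
ax1.set_title('KMeans')
ax2.scatter(X[:, 0], X[:, 1], c=sc_labels, s=10)
ax2.set_title('SpectralClustering')
plt.show()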
Conclusion
Kmeans clustering is one of the most popular clustering algorithms and usually the first thing
practitioners apply when solving clustering tasks to get an idea of the structure of the dataset.
The goal of kmeans is to group data points into distinct, non-overlapping subgroups. It does a
very good job when the clusters have roughly spherical shapes. However, it suffers as the
geometric shapes of the clusters deviate from spherical. Moreover, it doesn’t learn the
number of clusters from the data and requires it to be pre-defined. To be a good practitioner, it’s
good to know the assumptions behind algorithms/methods so that you have a pretty good
idea about the strengths and weaknesses of each method. This will help you decide when to use each
method and under what circumstances. In this post, we covered the strengths, weaknesses, and
some evaluation methods related to kmeans.
Elbow method in selecting number of clusters doesn’t usually work because the error
function is monotonically decreasing for all ks.
Kmeans assumes spherical shapes of clusters (with radius equal to the distance between
the centroid and the furthest data point) and doesn’t work well when clusters are in different
shapes such as elliptical clusters.
If there is overlap between clusters, kmeans doesn’t have an intrinsic measure of
uncertainty for the examples that belong to the overlapping region, so it can’t determine
which cluster to assign each such data point to.
Kmeans may still cluster the data even if it can’t be clustered such as data that comes
from uniform distributions.
https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-
f2ad05ed5203
customer profiling
market segmentation
computer vision
search engines
astronomy
How it works
1. Select K (i.e. 2) random points as cluster centers called centroids
2. Assign each data point to the closest cluster by calculating its distance with respect to each
centroid
3. Determine the new cluster center by computing the average of the assigned points
4. Repeat steps 2 and 3 until none of the cluster assignments change
Choosing the right number of clusters
Often times the data you’ll be working with will have multiple dimensions, making it difficult to
visualize. As a consequence, the optimum number of clusters is no longer obvious. Fortunately, we
have a way of determining this mathematically.
We graph the relationship between the number of clusters and Within Cluster Sum of Squares
(WCSS) then we select the number of clusters where the change in WCSS begins to level off
(elbow method).
WCSS is defined as the sum of the squared distance between each member of the cluster and its
centroid.
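Written out (a standard formulation, not from the original article), with μ_k the centroid of cluster C_k:

WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2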
For example, the computed WCSS for figure 1 would be greater than the WCSS calculated
for figure 2.
Figure 1
Figure 2
Code
Let’s take a look at how we could go about classifying data using the K-Means algorithm with
python. As always, we need to start by importing the required libraries.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn versions
from sklearn.cluster import KMeans
In this tutorial, we’ll generate our own data using the make_blobs function from
the sklearn.datasets module. The centers parameter specifies the number of clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
Even though we already know the optimal number of clusters, I figured we could still benefit
from determining it using the elbow method. To get the values used in the graph, we train
multiple models using a different number of clusters and store the value of
the inertia_ property (WCSS) every time.
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Next, we’ll categorize the data using the optimum number of clusters (4) we determined in the
last step. k-means++ ensures that you don’t fall into the random initialization trap.
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10,
                random_state=0)
pred_y = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red')
plt.show()
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Overview
Introduction
What truly fascinates me about these systems is how we can group similar items, products, and
users together. This grouping, or segmenting, works across industries. And that’s what makes
the concept of clustering such an important one in data science.
Clustering helps us understand our data in a unique way – by grouping things together into –
you guessed it – clusters.
In this article, we will cover k-means clustering and it’s components comprehensively. We’ll look
at clustering, why it matters, its applications and then deep dive into k-means clustering
(including how to perform it in Python on a real-world dataset).
Table of Contents
1. What is Clustering?
2. How is Clustering an Unsupervised Learning Problem?
3. Properties of Clusters
4. Applications of Clustering in Real-World Scenarios
5. Understanding the Different Evaluation Metrics for Clustering
6. What is K-Means Clustering?
7. Implementing K-Means Clustering from scratch in Python
8. Challenges with K-Means Algorithm
9. K-Means ++ to choose initial cluster centroids for K-Means Clustering
10. How to choose the Right Number of Clusters in K-Means?
11. Implementing K-Means Clustering in Python
What is Clustering?
Let’s kick things off with a simple example. A bank wants to give credit card offers to its
customers. Currently, they look at the details of each customer and based on this information,
decide which offer should be given to which customer.
Now, the bank can potentially have millions of customers. Does it make sense to look at the
details of each customer separately and then make a decision? Certainly not! It is a manual
process and will take a huge amount of time.
So what can the bank do? One option is to segment its customers into different groups. For
instance, the bank can group the customers based on their income:
Can you see where I’m going with this? The bank can now make three different strategies or
offers, one for each group. Here, instead of creating different strategies for individual customers,
they only have to make 3 strategies. This will reduce the effort as well as the time.
The groups I have shown above are known as clusters and the process of creating these
groups is known as clustering. Formally, we can say that:
Clustering is the process of dividing the entire data into groups (also known as clusters) based
on the patterns in the data.
Can you guess which type of learning problem clustering is? Is it a supervised or unsupervised
learning problem?
Think about it for a moment and make use of the example we just saw. Got it? Clustering is an
unsupervised learning problem!
Let’s say you are working on a project where you need to predict the sales of a big mart:
Or, a project where your task is to predict whether a loan will be approved or not:
We have a fixed target to predict in both of these situations. In the sales prediction problem, we
have to predict the Item_Outlet_Sales based on outlet_size, outlet_location_type, etc. and in the
loan approval problem, we have to predict the Loan_Status depending on the Gender, marital
status, the income of the customers, etc.
So, when we have a target variable to predict based on a given set of predictors or
independent variables, such problems are called supervised learning problems.
Now, there might be situations where we do not have any target variable to predict.
Such problems, without any fixed target variable, are known as unsupervised learning
problems. In these problems, we only have the independent variables and no target/dependent
variable.
In clustering, we do not have a target to predict. We look at the data and then try to club
similar observations and form different groups. Hence it is an unsupervised learning
problem.
We now know what clusters are and the concept of clustering. Next, let’s look at the properties
of these clusters, which we must consider while forming the clusters.
Properties of Clusters
How about another example? We’ll take the same bank as before who wants to segment its
customers. For simplicity purposes, let’s say the bank only wants to use the income and debt to
make the segmentation. They collected the customer data and used a scatter plot to visualize it:
On the X-axis, we have the income of the customer and the y-axis represents the amount of
debt. Here, we can clearly visualize that these customers can be segmented into 4 different
clusters as shown below:
This is how clustering helps to create segments (clusters) from the data. The bank can further
use these clusters to make strategies and offer discounts to its customers. So let’s look at the
properties of these clusters.
Property 1
All the data points in a cluster should be similar to each other. Let me illustrate it using the
above example:
If the customers in a particular cluster are not similar to each other, then their requirements
might vary, right? If the bank gives them the same offer, they might not like it and their interest
in the bank might reduce. Not ideal.
Having similar data points within the same cluster helps the bank to use targeted marketing.
You can think of similar examples from your everyday life and think about how clustering will (or
already does) impact the business strategy.
Property 2
The data points from different clusters should be as different as possible. This will
intuitively make sense if you grasped the above property. Let’s again take the same example to
understand this property:
Which of these cases do you think will give us the better clusters? If you look at case I:
Customers in the red and blue clusters are quite similar to each other. The top four points in the
red cluster share similar properties with the top two customers in the blue cluster. They
have high income and high debt value. Here, we have clustered them differently. Whereas, if
you look at case II:
Points in the red cluster are completely different from the customers in the blue cluster. All the
customers in the red cluster have high income and high debt and customers in the blue cluster
have high income and low debt value. Clearly we have a better clustering of customers in this
case.
Hence, data points from different clusters should be as different from each other as possible to
have more meaningful clusters.
So far, we have understood what clustering is and the different properties of clusters. But why
do we even need clustering? Let’s clear this doubt in the next section and look at some
applications of clustering.
Clustering is a widely used technique in the industry. It is actually being used in almost every
domain, ranging from banking to recommendation engines, document clustering to image
segmentation.
Customer Segmentation
We covered this earlier – one of the most common applications of clustering is customer
segmentation. And it isn’t just limited to banking. This strategy is across functions, including
telecom, e-commerce, sports, advertising, sales, etc.
Document Clustering
This is another common application of clustering. Let’s say you have multiple documents and
you need to cluster similar documents together. Clustering helps us group these documents
such that similar documents are in the same clusters.
Image Segmentation
We can also use clustering to perform image segmentation. Here, we try to club similar pixels in
the image together. We can apply clustering to create clusters having similar pixels in the same
group.
You can refer to this article to see how we can make use of clustering for image segmentation
tasks.
Recommendation Engines
Clustering can also be used in recommendation engines. Let’s say you want to recommend
songs to your friends. You can look at the songs liked by that person and then use clustering to
find similar songs and finally recommend the most similar songs.
There are many more applications which I’m sure you have already thought of. You can share
these applications in the comments section below. Next, let’s look at how we can evaluate our
clusters.
The primary aim of clustering is not just to make clusters, but to make good and meaningful
ones. We saw this in the below example:
Here, we used only two features and hence it was easy for us to visualize and decide which of
these clusters is better.
Unfortunately, that’s not how real-world scenarios work. We will have a ton of features to work
with. Let’s take the customer segmentation example again – we will have features like
customer’s income, occupation, gender, age, and many more. Visualizing all these features
together and deciding better and meaningful clusters would not be possible for us.
This is where we can make use of evaluation metrics. Let’s discuss a few of them and
understand how we can use them to evaluate the quality of our clusters.
Inertia
Recall the first property of clusters we covered above. This is what inertia evaluates. It tells us
how far the points within a cluster are. So, inertia actually calculates the sum of distances of
all the points within a cluster from the centroid of that cluster.
We calculate this for all the clusters and the final inertia value is the sum of all these distances.
This distance within the clusters is known as intracluster distance. So, inertia gives us the sum
of intracluster distances:
Now, what do you think should be the value of inertia for a good cluster? Is a small inertia value
good or do we need a larger value? We want the points within the same cluster to be similar to
each other, right? Hence, the distance between them should be as low as possible.
Keeping this in mind, we can say that the lesser the inertia value, the better our clusters are.
Dunn Index
We now know that inertia tries to minimize the intracluster distance. It is trying to make more
compact clusters.
Let me put it this way – if the distance between the centroid of a cluster and the points in that
cluster is small, it means that the points are closer to each other. So, inertia makes sure that the
first property of clusters is satisfied. But it does not care about the second property – that
different clusters should be as different from each other as possible.
Along with the distance between the centroid and points, the Dunn index also takes into
account the distance between two clusters. This distance between the centroids of two
different clusters is known as inter-cluster distance. Let’s look at the formula of the Dunn
index:
Dunn index is the ratio of the minimum of inter-cluster distances and maximum of intracluster
distances.
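The formula image did not carry over; in symbols (a standard statement of it), with δ(Ci, Cj) the inter-cluster distance and Δ(Ck) the intra-cluster distance:

D = \frac{\min_{1 \le i < j \le K} \delta(C_i, C_j)}{\max_{1 \le k \le K} \Delta(C_k)}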
We want to maximize the Dunn index. The more the value of the Dunn index, the better will be
the clusters. Let’s understand the intuition behind Dunn index:
In order to maximize the value of the Dunn index, the numerator should be maximum. Here, we
are taking the minimum of the inter-cluster distances. So, the distance between even the closest
clusters should be more which will eventually make sure that the clusters are far away from
each other.
Also, the denominator should be minimum to maximize the Dunn index. Here, we are taking the
maximum of intracluster distances. Again, the intuition is the same here. The maximum distance
between the cluster centroids and the points should be minimum which will eventually make
sure that the clusters are compact.
Introduction to K-Means Clustering
Recall the first property of clusters – it states that the points within a cluster should be similar to
each other. So, our aim here is to minimize the distance between the points within a
cluster.
There is an algorithm that tries to minimize the distance of the points in a cluster with their
centroid – the k-means clustering technique.
K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the
distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances
between the points and their respective cluster centroid.
Let’s now take an example to understand how K-Means actually works:
We have these 8 points and we want to apply k-means to create clusters for these points.
Here’s how we can do it.
Next, we randomly select the centroid for each cluster. Let’s say we want to have 2 clusters, so
k is equal to 2 here. We then randomly select the centroid:
Here, the red and green circles represent the centroid for these clusters.
Once we have initialized the centroids, we assign each point to the closest cluster centroid:
Here you can see that the points which are closer to the red point are assigned to the red
cluster whereas the points which are closer to the green point are assigned to the green cluster.
Now, once we have assigned all of the points to either cluster, the next step is to compute the
centroids of newly formed clusters:
Here, the red and green crosses are the new centroids.
Stopping Criteria for K-Means Clustering
There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:
We can stop the algorithm if the centroids of newly formed clusters are not changing. Even after
multiple iterations, if we are getting the same centroids for all the clusters, we can say that the
algorithm is not learning any new pattern and it is a sign to stop the training.
Another clear sign that we should stop the training process is if the points remain in the same
cluster even after training the algorithm for multiple iterations.
Finally, we can stop the training if the maximum number of iterations is reached. Suppose we
have set the number of iterations to 100. The process will repeat for 100 iterations before
stopping.
Implementing K-Means Clustering in Python from Scratch
Time to fire up our Jupyter notebooks (or whichever IDE you use) and get our hands dirty in
Python!
We will be working on the loan prediction dataset that you can download here. I encourage you
to read more about the dataset and the problem statement here. This will help you visualize
what we are working on (and why we are doing this). Two pretty important questions in any data
science project.
Now, we will read the CSV file and look at the first five rows of the data:
For this article, we will be taking only two variables from the data – “LoanAmount” and
“ApplicantIncome”. This will make it easy to visualize the steps as well. Let’s pick these two
variables and visualize the data points:
Steps 1 and 2 of K-Means were about choosing the number of clusters (k) and selecting random
centroids for each cluster. We will pick 3 clusters and then select random observations from the
data as the centroids:
Here, the red dots represent the 3 centroids for each cluster. Note that we have chosen these
points randomly and hence every time you run this code, you might get different centroids.
Next, we will define some conditions to implement the K-Means Clustering algorithm. Let’s first
look at the code:
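The code in the original article was embedded as images; the sketch below follows the steps it describes (read the CSV, keep 'LoanAmount' and 'ApplicantIncome', pick 3 random rows as centroids, and loop until the centroids stop moving). The file name 'loan_data.csv' is a placeholder.

import numpy as np
import pandas as pd

data = pd.read_csv('loan_data.csv')                       # placeholder file name
X = data[['LoanAmount', 'ApplicantIncome']].dropna().values

k = 3
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random observations as centroids

diff = 1
while diff != 0:
    # Assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute centroids and measure how much they moved (assumes no cluster goes empty)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    diff = np.abs(new_centroids - centroids).sum()
    centroids = new_centroids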
These values might vary every time we run this. Here, we are stopping the training when the
centroids are not changing after two iterations. We have initially defined the diff as 1 and inside
the while loop, we are calculating this diff as the difference between the centroids in the
previous iteration and the current iteration.
When this difference is 0, we are stopping the training. Let’s now visualize the clusters we have
got:
Awesome! Here, we can clearly visualize three clusters. The red dots represent the centroid of
each cluster. I hope you now have a clear understanding of how K-Means work.
However, there are certain situations where this algorithm might not perform as well. Let’s look
at some challenges which you can face while working with k-means.
One of the common challenges we face while working with K-Means is that the size of
clusters is different. Let’s say we have the below points:
The left and the rightmost clusters are of smaller size compared to the central cluster. Now, if
we apply k-means clustering on these points, the results will be something like this:
Another challenge with k-means is when the densities of the original points are
different. Let’s say these are the original points:
Here, the points in the red cluster are spread out whereas the points in the remaining clusters
are closely packed together. Now, if we apply k-means on these points, we will get clusters like
this:
We can see that the compact points have been assigned to a single cluster, whereas the points
that are spread out loosely but were in the same cluster have been assigned to different clusters.
Not ideal. So what can we do about this?
One of the solutions is to use a higher number of clusters. So, in all the above scenarios,
instead of using 3 clusters, we can have a bigger number. Perhaps setting k=10 might lead to
more meaningful clusters.
Remember how we randomly initialize the centroids in k-means clustering? Well, this is also
potentially problematic because we might get different clusters every time. So, to solve this
problem of random initialization, there is an algorithm called K-Means++ that can be used to
choose the initial values, or the initial cluster centroids, for K-Means.
In some cases, if the initialization of clusters is not appropriate, K-Means can result in arbitrarily
bad clusters. This is where K-Means++ helps. It specifies a procedure to initialize the cluster
centers before moving forward with the standard k-means clustering algorithm.
Using the K-Means++ algorithm, we optimize the step where we randomly pick the cluster
centroid. We are more likely to find a solution that is competitive to the optimal K-Means solution
while using the K-Means++ initialization.
1. The first cluster is chosen uniformly at random from the data points that we want to
cluster. This is similar to what we do in K-Means, but instead of randomly picking all the
centroids, we just pick one centroid here
2. Next, we compute the distance (D(x)) of each data point (x) from the cluster center that
has already been chosen
3. Then, choose the new cluster center from the data points with the probability of x being
proportional to (D(x))^2
4. We then repeat steps 2 and 3 until k clusters have been chosen
Let’s take an example to understand this more clearly. Let’s say we have the following points
and we want to make 3 clusters here:
Now, the first step is to randomly pick a data point as a cluster centroid:
Let’s say we pick the green point as the initial centroid. Now, we will calculate the distance
(D(x)) of each data point with this centroid:
The next centroid will be the one whose squared distance (D(x))^2 from the current
centroid is the largest:
In this case, the red point will be selected as the next centroid. Now, to select the last centroid,
we will take the distance of each point from its closest centroid and the point having the largest
squared distance will be selected as the next centroid:
We will select the last centroid as:
We can continue with the K-Means algorithm after initializing the centroids. Using K-Means++ to
initialize the centroids tends to improve the clusters. Although it is computationally costly relative
to random initialization, subsequent K-Means often converge more rapidly.
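A minimal NumPy sketch of the seeding procedure described above (each new center is sampled with probability proportional to the squared distance to the nearest already-chosen center); in practice sklearn's KMeans(init='k-means++') does this for you:

import numpy as np

def kmeans_pp_init(X, k, seed=None):
    rng = np.random.default_rng(seed)
    # 1. First center: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # 2. Squared distance of every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # 3. Sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)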
I’m sure there’s one question which you’ve been wondering about since the start of this article –
how many clusters should we make? Aka, what should be the optimum number of clusters to
have while performing K-Means?
One of the most common doubts everyone has while working with K-Means is selecting the right
number of clusters.
So, let’s look at a technique that will help us choose the right value of clusters for the K-Means
algorithm. Let’s take the customer segmentation example which we saw earlier. To recap, the
bank wants to segment its customers based on their income and amount of debt:
Here, we can have two clusters which will separate the customers as shown below:
All the customers with low income are in one cluster whereas the customers with high income
are in the second cluster. We can also have 4 clusters:
Here, one cluster might represent customers who have low income and low debt, other cluster
is where customers have high income and high debt, and so on. There can be 8 clusters as
well:
Honestly, we can have any number of clusters. Can you guess what the maximum
number of possible clusters would be? One thing we can do is assign each point to its own
cluster. Hence, in this case, the number of clusters will be equal to the number of points or
observations. So, the maximum possible number of clusters is equal to the total number of observations in the dataset.
Next, we will start with a small cluster value, let’s say 2. Train the model using 2 clusters,
calculate the inertia for that model, and finally plot it in the above graph. Let’s say we got an
inertia value of around 1000:
Now, we will increase the number of clusters, train the model again, and plot the inertia value.
This is the plot we get:
When we changed the cluster value from 2 to 4, the inertia value reduced very sharply. This
decrease in the inertia value reduces and eventually becomes constant as we increase the
number of clusters further.
So, the cluster value where this decrease in inertia value becomes constant can be
chosen as the right cluster value for our data.
Here, we can choose any number of clusters between 6 and 10. We can have 7, 8, or even 9
clusters. You must also look at the computation cost while deciding the number of
clusters. If we increase the number of clusters, the computation cost will also increase. So, if
you do not have high computational resources, my advice is to choose a lesser number of
clusters.
Let’s now implement the K-Means Clustering algorithm in Python. We will also see how to use
K-Means++ to initialize the centroids and will also plot this elbow curve to decide what should be
the right number of clusters for our dataset.
We will be working on a wholesale customer segmentation problem. You can download the
dataset using this link. The data is hosted on the UCI Machine Learning repository.
The aim of this problem is to segment the clients of a wholesale distributor based on
their annual spending on diverse product categories, like milk, grocery, region, etc. So,
let’s start coding!
Next, let’s read the data and look at the first five rows:
We have the spending details of customers on different products like Milk, Grocery, Frozen,
Detergents, etc. Now, we have to segment the customers based on the provided details. Before
doing that, let’s pull out some statistics related to the data:
Here, we see that there is a lot of variation in the magnitude of the data. Variables like Channel
and Region have low magnitude whereas variables like Fresh, Milk, Grocery, etc. have a higher
magnitude.
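The original scales the data at this point (the code was an image); a sketch with StandardScaler, assuming the file is the UCI 'Wholesale customers data.csv' linked above:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('Wholesale customers data.csv')   # file name as hosted on UCI (assumption)
scaler = StandardScaler()
# Standardize every column to zero mean and unit variance
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(data_scaled.describe())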
The magnitude looks similar now. Next, let’s create a kmeans function and fit it on the data:
We have initialized two clusters and pay attention – the initialization is not random here. We
have used the k-means++ initialization which generally produces better results as we have
discussed in the previous section as well.
Let’s evaluate how well the formed clusters are. To do that, we will calculate the inertia of the
clusters:
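A sketch of the two-cluster fit and the inertia check described above, continuing from data_scaled:

from sklearn.cluster import KMeans

# k-means++ initialization (not random), with 2 clusters to start
kmeans = KMeans(n_clusters=2, init='k-means++', random_state=0)
kmeans.fit(data_scaled)
print(kmeans.inertia_)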
Output: 2599.38555935614
We got an inertia value of almost 2600. Now, let’s see how we can use the elbow curve to
determine the optimum number of clusters in Python.
We will first fit multiple k-means models and in each successive model, we will increase the
number of clusters. We will store the inertia value of each model and then plot it to visualize the
result:
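A sketch of that loop, continuing from data_scaled (the cluster range is an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

SSE = []
for k in range(1, 20):
    model = KMeans(n_clusters=k, init='k-means++', random_state=0).fit(data_scaled)
    SSE.append(model.inertia_)

plt.plot(range(1, 20), SSE, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()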
Can you tell the optimum cluster value from this plot? Looking at the above elbow curve, we
can choose any number of clusters between 5 and 8. Let’s set the number of clusters to 6 and
fit the model:
Finally, let’s look at the value count of points in each of the above-formed clusters:
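And a sketch of the final model and the cluster counts, continuing from data_scaled (the 'cluster' column name is my own):

import pandas as pd
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=6, init='k-means++', random_state=0)
pred = kmeans.fit_predict(data_scaled)

frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred
print(frame['cluster'].value_counts())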
So, there are 234 data points belonging to cluster 4 (index 3), then 125 points in cluster 2 (index
1), and so on. This is how we can implement K-Means Clustering in Python.
End Notes
In this article, we discussed one of the most famous clustering algorithms – K-Means. We
implemented it from scratch and looked at its step-by-step implementation. We looked at the
challenges which we might face while working with K-Means and also saw how K-Means++ can
be helpful when initializing the cluster centroids.
Finally, we implemented k-means and looked at the elbow curve which helps to find the
optimum number of clusters in the K-Means algorithm.
K-means Clustering
The plots display firstly what a K-means algorithm would yield using three clusters. It is then
shown what the effect of a bad initialization is on the classification process: By setting n_init to
only 1 (default is 10), the amount of times that the algorithm will be run with different centroid
seeds is reduced. The next plot displays what using eight clusters would deliver and finally the
ground truth.
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D

from sklearn import datasets
from sklearn.cluster import KMeans

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

# The three fitted models compared in the plots (this list did not survive the
# copy; it is restored from the scikit-learn example this excerpt comes from)
estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
              ('k_means_iris_3', KMeans(n_clusters=3)),
              ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                               init='random'))]

fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float), edgecolor='k')

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(titles[fignum - 1])
    ax.dist = 12
    fignum = fignum + 1

# Plot the ground truth (true species labels) in a separate figure
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12

fig.show()
Selecting the number of clusters with silhouette analysis on KMeans clustering
Silhouette analysis can be used to study the separation distance between the resulting clusters.
The silhouette plot displays a measure of how close each point in one cluster is to points in the
neighboring clusters and thus provides a way to assess parameters like number of clusters
visually. This measure has a range of [-1, 1].
Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far
away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to
the decision boundary between two neighboring clusters, and negative values indicate that
those samples might have been assigned to the wrong cluster.
In this example the silhouette analysis is used to choose an optimal value for n_clusters. The
silhouette plot shows that the n_clusters value of 3, 5 and 6 are a bad pick for the given data
due to the presence of clusters with below average silhouette scores and also due to wide
fluctuations in the size of the silhouette plots. Silhouette analysis is more ambivalent in deciding
between 2 and 4.
Also from the thickness of the silhouette plot the cluster size can be visualized. The silhouette
plot for cluster 0 when n_clusters is equal to 2, is bigger in size owing to the grouping of the
3 sub clusters into one big cluster. However when the n_clusters is equal to 4, all the plots are
more or less of similar thickness and hence are of similar sizes as can be also verified from the
labelled scatter plot on the right.
Out:
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
For n_clusters = 5 The average silhouette_score is : 0.5745566973301872
For n_clusters = 6 The average silhouette_score is : 0.43902711183132426
import numpy as np
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

print(__doc__)

# Sample data: 4 well-separated blobs (restored; this part was lost in the copy)
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1,
                  center_box=(-10.0, 10.0), shuffle=True, random_state=1)

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Silhouette plot on the left, clustered data on the right
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))

    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Silhouette score of every individual sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0,
                          ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10  # gap between the cluster silhouettes

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    # Right-hand plot: the data colored by cluster, with numbered centers
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    centers = clusterer.cluster_centers_
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

plt.show()
https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter
notebooks are available on GitHub.
https://heartbeat.fritz.ai/k-means-clustering-using-sklearn-and-python-4a054d67b187
Did you know that 60% of newly-launched products may not perform well because they fail to
represent or actually offer something their customers really want?
This is the era of personalization. Using personalization, you can efficiently attract new customers
and retain existing ones. These days, a one-size-fits-all approach generally doesn’t work.
In the realm of machine learning, k-means clustering can be used to segment customers (or other
data) efficiently.
K-means clustering is one of the simplest unsupervised machine learning algorithms. Here,
we’ll explore what it can do and work through a simple implementation in Python.
2. The computational cost of the k-means algorithm is O(k*n*d), where n is the number of
data points, k the number of clusters, and d the number of attributes.
3. Compared to other clustering methods, the k-means clustering technique is fast and
efficient in terms of its computational cost.
4. It’s difficult to predict the optimal number of clusters or the value of k. To find the
number of clusters, we need to run the k-means clustering algorithm for a range of k values
and compare the results.
Example Implementation
Let’s implement k-means clustering using a famous dataset: the Iris dataset. This dataset
contains 3 classes of 50 instances each and each class refers to a type of iris plant. The dataset has
four features: sepal length, sepal width, petal length, and petal width. The fifth column is for
species, which holds the value for these types of plants. For example, one of the types is
a setosa, as shown in the image below.
iris dataset for k-means clustering
To start Python coding for k-means clustering, let’s start by importing the required libraries.
Apart from NumPy, Pandas, and Matplotlib, we’re also importing KMeans from sklearn.cluster,
as shown below.
k-means clustering with python
We’re reading the Iris dataset using the read_csv Pandas method and storing the data in a data
frame df. After populating the data frame df, we use the head() method on the dataset to see
its first 10 records.
read iris dataset using pandas
Now we select all four features (sepal length, sepal width, petal length, and petal width) of the
dataset in a variable called x so that we can train our model with these features. For this, we use
the iloc function on df, and the column index (0,1,2,3) for the above four columns are used, as
shown below:
select iris dataset features into variable x
To start, let’s arbitrarily assign the value of k as 5. We will implement k-means clustering
using k=5. For this we will instantiate the KMeans class and assign it to the variable kmeans5:
k-means clustering with k = 5
Below, you can see the output of the k-means clustering model with k=5. Note that we can find
the centers of 5 clusters formed from the data:
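The article's code was embedded as screenshots; a sketch of the same steps, using scikit-learn's built-in copy of the Iris data instead of the CSV the author reads:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head(10))

# All four features: sepal length, sepal width, petal length, petal width
x = df.iloc[:, [0, 1, 2, 3]].values

# Arbitrarily start with k = 5 and inspect the fitted centers
kmeans5 = KMeans(n_clusters=5, random_state=0)
y_kmeans5 = kmeans5.fit_predict(x)
print(kmeans5.cluster_centers_)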
There’s a method called the Elbow method, which is designed to help find the optimal number of
clusters in a dataset. So let’s use this method to calculate the optimum value of k. To implement
the Elbow method, we need to create some Python code (shown below), and we’ll plot a graph
between the number of clusters and the corresponding error value.
This graph generally ends up shaped like an elbow, hence its name:
elbow method to calculate the optimum value of k
The output graph of the Elbow method is shown below. Note that the shape of elbow is
approximately formed at k=3.
As you can see, the optimal value of k is between 2 and 4, as the elbow-like shape is formed
at k=3 in the above graph.
Closing comments
I hope you learned how to implement k-means clustering using sklearn and Python. Finding the
optimal k value is an important step here. In case the Elbow method doesn’t work, there
are several other methods that can be used to find optimal value of k.
https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-
methods/
https://heartbeat.fritz.ai/understanding-the-mathematics-behind-k-means-clustering-40e1d55e2f4c
In this post, we’re going to dive deep into one of the most influential unsupervised learning
algorithms—k-means clustering. K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms, and we’ll be discussing how the algorithm works,
distance and accuracy metrics, and a lot more.
What is meant by unsupervised learning?
What is Clustering?
Clustering is the process of dividing the data space or data points into a number of groups, such
that data points in the same group are more similar to each other than to data points in
other groups.
Clustering Objectives
The major objective of clustering is to find patterns (i.e. similarities within data points) in an
unlabeled dataset and group similar points together. But how do we decide what constitutes a good
clustering? There isn’t a definitive best way of clustering that is independent of the final
aim of the clustering. The end result usually depends on the user and the parameters they select,
in particular which features are treated as most important for clustering.
Vector quantization
K-means originates from signal processing, but it’s also used for vector quantization. For
example, color quantization is the task of reducing the color palette of an image to a fixed
number of colors k. The k-means algorithm can easily be used for this task.
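As a rough illustration of color quantization with scikit-learn, here is a sketch in which the function name and the image array img of shape (height, width, 3) are assumptions:

# Color quantization: reduce an RGB image to k representative colors
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(img, k=16):
    pixels = img.reshape(-1, 3).astype(float)      # one row per pixel
    km = KMeans(n_clusters=k).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]    # replace each pixel by its centroid color
    return quantized.reshape(img.shape).astype(img.dtype)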
An illness or condition frequently has a number of variations, and cluster analysis can be used to
identify these different subcategories. For example, clustering has been used to identify different
types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal
distribution of a disease.
Recommender Systems
Clustering can also be used in recommendation engines. In the case of recommending movies to
someone, you can look at the movies enjoyed by a user and then use clustering to find similar
movies.
Document Clustering
This is another common application of clustering. Let’s say you have multiple documents and you
need to cluster similar documents together. Clustering helps us group these documents such that
similar documents are in the same clusters.
Image Segmentation
Image segmentation is a wide-spread application of clustering. Similar pixels in the image are
grouped together. We can apply this technique to create clusters having similar pixels in the same
group.
The k-means clustering algorithm
Procedure
We first choose k initial centroids, where k is a user-specified parameter; namely, the number of
clusters desired. Each point is then assigned to the closest centroid, and each collection of points
assigned to a centroid is called a cluster. The centroid of each cluster is then updated based on the
points assigned to the cluster. We repeat the assignment and update steps until no point changes
clusters, or similarly, until the centroids remain the same.
Source: https://www.researchgate.net/figure/The-pseudo-code-for-K-means-clustering-
algorithm_fig2_273063437
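The assignment and update steps can also be written out directly in code. The following is a compact NumPy sketch of the procedure (not the pseudo-code from the source above, and not the scikit-learn implementation), using random initial centroids:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k distinct points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, labels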
Proximity Measures
For clustering, we need to define a proximity measure for two data points. Proximity here means
how similar/dissimilar the samples are with respect to each other.
Consider data whose proximity measure is Euclidean distance. For our objective function,
which measures the quality of a clustering, we use the sum of the squared error (SSE), which
is also known as scatter.
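Concretely, if C_1, ..., C_K are the clusters and c_i is the centroid of cluster C_i, the objective can be written as

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2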
In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest
centroid, and then compute the total sum of the squared errors. Given two different sets of
clusters that are produced by two different runs of K-means, we prefer the one with the smallest
squared error, since this means that the prototypes (centroids) of this clustering are a better
representation of the points in their cluster.
Document Data
To illustrate that K-means is not restricted to data in Euclidean space, we consider document
data and the cosine similarity measure:
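For two document vectors d_1 and d_2 (for example, term-frequency vectors), the cosine similarity is

\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}

and the corresponding objective maximizes the total cosine similarity of documents to their cluster centroids (the total cohesion) instead of minimizing SSE.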
Implementation in scikit-learn
It merely takes a few lines to apply the algorithm in Python with sklearn: import the estimator,
create an instance, fit it to the data, and predict the cluster of new points:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10., 2.],
[ 1., 2.]])
The space requirements for k-means clustering are modest, because only the data points and
centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of
points and n is the number of attributes. The time requirements for k-means are also modest —
basically linear in terms of the number of data points. In particular, the time required is
O(I∗K∗m∗n), where I is the number of iterations required for convergence.
When random initialization of centroids is used, different runs of K-means typically produce
different total SSEs. Choosing the proper initial centroids is the key step of the basic K-means
procedure. A common approach is to choose the initial centroids randomly, but the resulting
clusters are often poor.
Another technique that’s commonly used to address the problem of choosing initial centroids is
to perform multiple runs, each with a different set of randomly-chosen initial centroids, and then
select the set of clusters with the minimum SSE.
But often, random initialization leads to sub-optimal results, and may not work well in cases with
clusters of different shapes and densities, or with centroids located too far from or too close to each
other. This can result in overlapping clusters of different classes, or in observations from the same
class being split across several clusters.
There are various methods to determine the optimal value of k for convergence of the
algorithm and to make a clear distinction between the clusters or different classes in a dataset.
Elbow Method
The elbow method is a popular way to determine the optimal value of k for clustering. The basic
idea is to plot the cost (distortion) for a range of k values. The point where the rate of decline
in distortion drops sharply is the elbow point, which is taken as the optimal value of k.
Silhouette Method
In the silhouette method, we assume that the data has already been clustered into k clusters by k-
means clustering. Then, for each data point i, we define the following:
|C(i)|: the number of data points in the cluster assigned to the ith data point
a(i): a measure of how well the ith data point is assigned to its cluster (its average distance to the other points in the same cluster)
b(i): the average dissimilarity of the ith data point to the closest cluster that is not its own
s(i): the silhouette, defined as s(i) = (b(i) − a(i)) / max(a(i), b(i))
We determine the average silhouette for each value of k, and the value of k that maximizes
the average s(i) is considered the optimal number of clusters for the unsupervised
learning algorithm.
The Curse of Dimensionality
The common theme of these problems is that when the dimensionality increases, the volume of
the space increases so fast that the available data becomes sparse. This sparsity is problematic for
any method that requires statistical significance.
In order to obtain a statistically sound and reliable result, the amount of data needed to support
the result often grows exponentially with dimensionality. Also, organizing and searching data
often relies on detecting areas where objects form groups with similar properties; in high-
dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which
prevents common data organization strategies from being efficient.
In the case of k-means clustering, the curse of dimensionality makes it harder to cluster data
because the data space is so vast. For example, with Euclidean distance as the proximity measure,
two data points that are actually very dissimilar can end up grouped together simply because,
across many dimensions, their distances to the centroid turn out to be comparable.
Advantages
1. The algorithm is fast and efficient in terms of computational cost, which is typically
O(K*n*d).
2. The algorithm provides its best results when the data points are well separated from each
other; thus, we must ensure that all the data points are as similar as possible to their own centroid
and as different as possible from the other centroids. Several iterations are usually required for
convergence, and we can also use techniques like splitting clusters, or choosing one centroid
at random and placing each subsequent centroid as far as possible from the ones already chosen.
All of these techniques can help reduce the overall SSE.
Limitations
1. Clustering data of varying sizes and density. K-means doesn’t perform well with
clusters of different sizes, shapes, and density. To cluster such data, you need to generalize k-
means.
2. Clustering outliers. Outliers must be removed before clustering, or they may affect the
position of the centroids or form a cluster of their own.
Here are a few sources which will help you to implement k-means on your dataset:
K-Means Clustering + PCA: Kaggle notebook using the Simplified Human Activity Recognition dataset (www.kaggle.com)
Conclusion
In this post, we read about k-means clustering in detail and gained insights about the
mathematics behind it. Despite being widely used and strongly supported, it has its share of
advantages and disadvantages.
Let me know if you liked the article and how I can improve it. All feedback is welcome. Check out
my other articles in the series: Understanding the mathematics behind Naive
Bayes, Support Vector Machines and Principal Component Analysis.
I’ll be exploring the mathematics involved in other foundational machine learning algorithms in
future posts, so stay tuned.
https://www.geeksforgeeks.org/k-means-clustering-introduction/
We are given a data set of items, with certain features and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the
k-means algorithm, an unsupervised learning algorithm.
Overview
(It will help if you think of items as points in an n-dimensional space.) The algorithm will
categorize the items into k groups of similarity. To calculate that similarity, we will use the
Euclidean distance as the measure.
# Imports used by the helper functions in this section
import sys
import math
from random import shuffle, uniform

def ReadData(fileName):
    # Read the file, splitting by lines
    f = open(fileName, 'r')
    lines = f.read().splitlines()
    f.close()

    items = []
    for i in range(1, len(lines)):
        line = lines[i].split(',')
        itemFeatures = []
        for j in range(len(line) - 1):
            v = float(line[j])       # Convert feature value to float
            itemFeatures.append(v)   # Add feature value to the feature list
        items.append(itemFeatures)

    shuffle(items)
    return items
Initialize Means
We want to initialize each mean’s values in the range of the feature values of the items. For
that, we need to find the min and max for each feature. We accomplish that with the following
function:
def FindColMinMax(items):
    n = len(items[0])
    minima = [sys.maxsize for i in range(n)]
    maxima = [-sys.maxsize - 1 for i in range(n)]

    for item in items:
        for f in range(len(item)):
            if item[f] < minima[f]:
                minima[f] = item[f]
            if item[f] > maxima[f]:
                maxima[f] = item[f]

    return minima, maxima
The variables minima and maxima are lists containing the minimum and maximum values of the items,
respectively. We initialize each mean’s feature values randomly between the corresponding
minimum and maximum from these two lists:
def InitializeMeans(items, k, cMin, cMax):
    # Initialize means to random numbers between
    # the min and max of each column/feature
    f = len(items[0])  # number of features
    means = [[0 for i in range(f)] for j in range(k)]

    for mean in means:
        for i in range(len(mean)):
            # Set value to a random float
            # (adding +-1 to avoid a wide placement of a mean)
            mean[i] = uniform(cMin[i] + 1, cMax[i] - 1)

    return means
Euclidean Distance
We will be using the euclidean distance as a metric of similarity for our data set (note:
depending on your items, you can use another similarity metric).
def EuclideanDistance(x, y):
    S = 0  # The sum of the squared differences of the elements
    for i in range(len(x)):
        S += math.pow(x[i] - y[i], 2)

    return math.sqrt(S)  # The square root of the sum
Update Means
To update a mean, we need to find the average value for its feature, for all the items in the
mean/cluster. We can do this by adding all the values and then dividing by the number of items,
or we can use a more elegant solution. We will calculate the new average without having to re-
add all the values, by doing the following:
m = (m*(n-1)+x)/n
where m is the mean value for a feature, n is the number of items in the cluster and x is the
feature value for the added item. We do the above for each feature to get the new mean.
def UpdateMean(n, mean, item):
    for i in range(len(mean)):
        m = mean[i]
        m = (m * (n - 1) + item[i]) / float(n)
        mean[i] = round(m, 3)

    return mean
Classify Items
Now we need to write a function to classify an item to a group/cluster. For the given item, we will
find its similarity to each mean, and we will classify the item to the closest one.
def Classify(means, item):
    # Classify item to the mean with minimum distance
    minimum = sys.maxsize
    index = -1

    for i in range(len(means)):
        # Find distance from item to mean
        dis = EuclideanDistance(item, means[i])
        if dis < minimum:
            minimum = dis
            index = i

    return index
Find Means
To actually find the means, we will loop through all the items, classify them to their nearest
cluster, and update the cluster’s mean. We will repeat the process for some fixed number of
iterations. If between two iterations no item changes classification, we stop the process, as the
algorithm has converged.
The below function takes as input k (the number of desired clusters), the items and the number
of maximum iterations, and returns the means and the clusters. The classification of an item is
stored in the array belongsTo and the number of items in a cluster is stored in clusterSizes.
def CalculateMeans(k, items, maxIterations=100000):
    # Find the minima and maxima for columns
    cMin, cMax = FindColMinMax(items)

    # Initialize means at random points
    means = InitializeMeans(items, k, cMin, cMax)

    # Initialize clusterSizes, the array to hold
    # the number of items in a cluster
    clusterSizes = [0 for i in range(len(means))]

    # An array to hold the cluster an item is in
    belongsTo = [0 for i in range(len(items))]

    # Calculate means
    for e in range(maxIterations):
        # If no change of cluster occurs, halt
        noChange = True
        for i in range(len(items)):
            item = items[i]

            # Classify item into a cluster and update the
            # corresponding means.
            index = Classify(means, item)

            clusterSizes[index] += 1
            cSize = clusterSizes[index]
            means[index] = UpdateMean(cSize, means[index], item)

            # Item changed cluster
            if index != belongsTo[i]:
                noChange = False

            belongsTo[i] = index

        # Nothing changed, return
        if noChange:
            break

    return means
Find Clusters
Finally we want to find the clusters, given the means. We will iterate through all the items and
we will classify each item to its closest cluster.
def FindClusters(means, items):
    clusters = [[] for i in range(len(means))]  # Init clusters

    for item in items:
        # Classify item into a cluster
        index = Classify(means, item)

        # Add item to cluster
        clusters[index].append(item)

    return clusters
The other popularly used similarity measures are:
1. Cosine distance: determines the cosine of the angle between the point vectors of the two
points in n-dimensional space.
2. Manhattan distance: computes the sum of the absolute differences between the
coordinates of the two data points.
3. Minkowski distance: also known as the generalised distance metric, it can be used for
both ordinal and quantitative variables.
You can find the entire code on my GitHub, along with a sample data set and a plotting function.
Thanks for reading.
This article is contributed by Antonis Maronikolakis.
The objective of clustering is to identify distinct groups in a dataset such that the observations within a
group are similar to each other but different from observations in other groups. In k-means clustering, we
specify the number of desired clusters k, and the algorithm will assign each observation to exactly one of
these k clusters. The algorithm optimizes the groups by minimizing the within-cluster variation (also
known as inertia) such that the sum of the within-cluster variations across all k clusters is as small as
possible.
Different runs of k-means result in slightly different cluster assignments because k-means randomly
assigns each observation to one of the k clusters to kick off the clustering process; this random
initialization speeds the process up. After it, k-means reassigns the observations to different
clusters as it attempts to minimize the Euclidean distance between each observation and its
cluster’s center point, or centroid. Because of this randomness, the assignments can differ slightly
from one k-means run to another. Typically, the k-means algorithm does several runs and chooses
the run that has the best separation, defined as the lowest total sum of within-cluster variations
across all k clusters.
k-Means Inertia
Let’s introduce the algorithm. We need to set the number of clusters we would like (n_clusters), the
number of initializations we would like to perform (n_init), the maximum number of iterations the
algorithm will run to reassign observations to minimize inertia (max_iter), and the tolerance to declare
convergence (tol).
We will keep the default values for number of initializations (10), maximum number of iterations (300),
and tolerance (0.0001). Also, for now, we will use the first 100 principal components from PCA
(cutoff). To test
how the number of clusters we designate affects the inertia measure, let’s run k-means for cluster sizes 2
through 20 and record the inertia for each.
n_clusters = 10
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 2018
n_jobs = 2

kMeans_inertia = pd.DataFrame(data=[], index=range(2, 21),
                              columns=['inertia'])

for n_clusters in range(2, 21):
    kmeans = KMeans(n_clusters=n_clusters, n_init=n_init,
                    max_iter=max_iter, tol=tol,
                    random_state=random_state, n_jobs=n_jobs)

    cutoff = 99
    kmeans.fit(X_train_PCA.loc[:, 0:cutoff])
    kMeans_inertia.loc[n_clusters] = kmeans.inertia_
As Figure 5-1 shows, the inertia decreases as the number of clusters increases. This makes sense. The
more clusters we have, the greater the homogeneity among observations within each cluster. However,
fewer clusters are easier to work with than more, so finding the right number of clusters to generate is an
important consideration when running k-means.
Figure 5-1. k-means inertia for cluster sizes 2 through 20
Evaluating the Clustering Results
To demonstrate how k-means works and how increasing the number of clusters results in more
homogeneous clusters, let’s define a function to analyze the results of each experiment we do. The cluster
assignments—generated by the clustering algorithm—will be stored in a Pandas DataFrame called
clusterDF.
Let’s count the number of observations in each cluster and store these in a Pandas DataFrame called
countByCluster:
countByCluster = \
pd.DataFrame(data=clusterDF['cluster'].value_counts())
countByCluster.reset_index(inplace=True,drop=False)
countByCluster.columns = ['cluster','clusterCount']
Next, let’s join the clusterDF with the true labels array, which we will call labelsDF. The joined
DataFrame (the column-wise concatenation of labelsDF and clusterDF) is stored as preds, and its
columns are renamed:
preds.columns = ['trueLabel','cluster']
Let’s also count the number of observations for each true label in the training set (this won’t change but is
good for us to know):
countByLabel = pd.DataFrame(data=preds.groupby('trueLabel').count())
Now, for each cluster, we will count the number of observations for each distinct label within a cluster.
For example, if a given cluster has three thousand observations, two thousand may represent the number
two, five hundred may represent the number one, three hundred may represent the number zero, and the
remaining two hundred may represent the number nine.
Once we calculate these, we will store the count for the most frequently occurring number for each
cluster. In the example above, we would store a count of two thousand for this cluster:
countMostFreq = \
pd.DataFrame(data=preds.groupby('cluster').agg( \
lambda x:x.value_counts().iloc[0]))
countMostFreq.reset_index(inplace=True,drop=False)
countMostFreq.columns = ['cluster','countMostFrequent']
Finally, we will judge the success of each clustering run based on how tightly grouped the observations
are within each cluster. For example, in the example above, the cluster has two thousand observations that
have the same label out of a total of three thousand observations in the cluster.
This cluster is not great since we ideally want to group similar observations together in the same cluster
and exclude dissimilar ones.
Let’s define the overall accuracy of the clustering as the sum of the counts of the most frequently
occurring observations across all the clusters divided by the total number of observations in the
training set (i.e., 50,000):
accuracyDF = countMostFreq.merge(countByCluster, \
left_on="cluster",right_on="cluster")
overallAccuracy = accuracyDF.countMostFrequent.sum()/ \
accuracyDF.clusterCount.sum()
accuracyByLabel = accuracyDF.countMostFrequent/ \
accuracyDF.clusterCount
For the sake of conciseness, we have all this code in a single function, available on GitHub.
k-Means Accuracy
Let’s now perform the experiments we did earlier, but instead of calculating inertia, we will calculate the
overall homogeneity of the clusters based on the accuracy measure we’ve defined for this MNIST digits
dataset:
n_clusters = 5
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 2018
n_jobs = 2
cutoff = 99

kMeans_inertia = \
    pd.DataFrame(data=[], index=range(2, 21), columns=['inertia'])
overallAccuracy_kMeansDF = \
    pd.DataFrame(data=[], index=range(2, 21), columns=['overallAccuracy'])

for n_clusters in range(2, 21):
    kmeans = KMeans(n_clusters=n_clusters, n_init=n_init,
                    max_iter=max_iter, tol=tol,
                    random_state=random_state, n_jobs=n_jobs)

    kmeans.fit(X_train_PCA.loc[:, 0:cutoff])
    kMeans_inertia.loc[n_clusters] = kmeans.inertia_

    X_train_kmeansClustered = kmeans.predict(X_train_PCA.loc[:, 0:cutoff])
    X_train_kmeansClustered = \
        pd.DataFrame(data=X_train_kmeansClustered,
                     index=X_train.index, columns=['cluster'])

    countByCluster_kMeans, countByLabel_kMeans, countMostFreq_kMeans, \
        accuracyDF_kMeans, overallAccuracy_kMeans, accuracyByLabel_kMeans \
        = analyzeCluster(X_train_kmeansClustered, y_train)

    overallAccuracy_kMeansDF.loc[n_clusters] = overallAccuracy_kMeans
Figure 5-2 shows the plot of the overall accuracy for different cluster sizes.
https://medium.com/rahasak/k-means-clustering-with-apache-spark-cab44aef0a16
Happy ML
This is the first part of my Happy ML blog series. In this post I will discuss machine
learning basics and the K-Means unsupervised machine learning algorithm with an example. The
second part of this blog series, which discusses the Logistic Regression algorithm, can be
found here.
Machine learning uses algorithms to find patterns in data. It first builds a model based on the
patterns in existing/historical data, then uses this model to make predictions on newly generated
live data. In general, machine learning can be categorized into three main
categories: Supervised, Unsupervised and Reinforcement learning.
In this post I’m going to use the K-Means algorithm to build a machine learning model with Apache
Spark (if you are new to Apache Spark, you can find more information here). The K-Means
model clusters the uber trip data based on the trip attributes. This model can then be used to do
real-time analysis of new uber trips. All the source code and the dataset related to this post are
available on the gitlab repo. Please clone the repo and continue with the post.
About K-Means
K-Means clustering is one of the simplest and most popular unsupervised machine learning
algorithms. The goal of this algorithm is to find groups in the data, with the number of
groups/clusters represented by the variable K. The K-Means algorithm iteratively allocates every data
point to the nearest cluster based on its features. In every iteration of the algorithm, each data
point is assigned to its nearest cluster based on some distance metric, which is
usually the Euclidean distance. The outputs of the K-Means clustering algorithm are the
centroids of the K clusters and the labels of the training data. Once the algorithm has run and
identified the groups in a data set, any new data can easily be assigned to a group.
The K-Means algorithm can be used to identify unknown groups in complex and unlabeled data
sets. Following are some business use cases of K-Means clustering.
As mentioned previously, I’m going to use K-Means to build a model from uber trip data. This model
clusters the uber trips based on trip attributes/features (lat, lon etc). The uber trip data
set exists in the gitlab repo as a .CSV file. Following is the structure/schema of a single uber trip
record.
To build the K-Means model from this data set, first we need to load the data set into a
Spark DataFrame. Following is the way to do that. It loads the data into a DataFrame
from the .CSV file based on the schema.
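The loading code itself is only linked from the original post; a rough PySpark sketch is given below, where the schema fields (dt, lat, lon, base) and the file name uber.csv are assumptions based on the trip attributes mentioned above:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("uber-kmeans").getOrCreate()

# Assumed schema for a single uber trip record (dt, lat, lon, base)
schema = StructType([
    StructField("dt", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lon", DoubleType(), True),
    StructField("base", StringType(), True),
])

# Load the trip records from the .CSV file into a Spark DataFrame
df = spark.read.option("header", "true").schema(schema).csv("uber.csv")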
Next we can build the K-Means model by defining the number of clusters, the feature column and the
output prediction column. In order to train and test the K-Means model, the data set needs to be
split into a training data set and a test data set. 70% of the data is used to train the model, and
30% is used for testing.
Finally, the K-Means model can be used to detect the clusters/categories of new data (e.g. real-time
uber trip data). The following example shows the detection of clusters for sample records in a DataFrame.
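Again, the original code lives in the gitlab repo; the following PySpark sketch illustrates the build/train/predict flow described above, with the feature columns, the seed and the cluster count k=8 chosen purely for illustration:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Assemble the numeric trip attributes into a single feature vector
assembler = VectorAssembler(inputCols=["lat", "lon"], outputCol="features")
features_df = assembler.transform(df)

# 70/30 train/test split
train_df, test_df = features_df.randomSplit([0.7, 0.3], seed=42)

# Build the K-Means model with an assumed cluster count of 8
kmeans = KMeans(k=8, featuresCol="features", predictionCol="prediction")
model = kmeans.fit(train_df)

# Detect the cluster of new/test trip records
predictions = model.transform(test_df)
predictions.select("lat", "lon", "prediction").show(5)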
Reference
1. https://www.quora.com/What-is-machine-learning-in-laymans-terms-1
2. https://www.goodworklabs.com/machine-learning-algorithm/
3. https://mapr.com/blog/apache-spark-machine-learning-tutorial/
4. https://mapr.com/blog/fast-data-processing-pipeline-predicting-flight-delays-using-
apache-apis-pt-1/
5. https://www.datascience.com/blog/k-means-clustering
6. https://medium.com/rahasak/hacking-with-apache-spark-f6b0cabf0703
7. https://medium.com/rahasak/hacking-with-spark-dataframe-d717404c5812
https://www.kaggle.com/xvivancos/tutorial-clustering-wines-with-k-means ( R Analysis )
1 Introduction
2 Loading data
3 Data analysis
4 Data preparation
5 k-means execution
6 How many clusters?
7 Results
8 Summary
9 Citations for used packages
1 Introduction
k-means is an unsupervised machine learning algorithm used to find groups of observations
(clusters) that share similar characteristics. What is the meaning of unsupervised learning? It
means that the observations given in the data set are unlabeled, there is no outcome to be
predicted. We are going to use a Wine data set to cluster different types of wines. This data set
contains the results of a chemical analysis of wines grown in a specific area of Italy.
2 Loading data
First we need to load some libraries and read the data set.
# Load libraries
library(tidyverse)
library(corrplot)
library(gridExtra)
library(GGally)
library(knitr)
2.1 First rows
2.2 Last rows
2.3 Summary
2.4 Structure
# First rows
kable(head(wines))
Alcohol  Malic_Acid  Ash   Ash_Alcanity  Magnesium  Total_Phenols  Flavanoids  Nonflavanoid_Phenols  Proanthocyanins  Color_Intensity  Hue   OD280
14.23    1.71        2.43  15.6          127        2.80           3.06        0.28                  2.29             5.64             1.04  3.92
13.20    1.78        2.14  11.2          100        2.65           2.76        0.26                  1.28             4.38             1.05  3.40
13.16    2.36        2.67  18.6          101        2.80           3.24        0.30                  2.81             5.68             1.03  3.17
14.37    1.95        2.50  16.8          113        3.85           3.49        0.24                  2.18             7.80             0.86  3.45
13.24    2.59        2.87  21.0          118        2.80           2.69        0.39                  1.82             4.32             1.04  2.93
14.20    1.76        2.45  15.2          112        3.27           3.39        0.34                  1.97             6.75             1.05  2.85
3 Data analysis
First we have to explore and visualize the data.
# Histogram for each attribute
wines %>%
  gather(Attributes, value, 1:13) %>%   # reshape to long format (reconstructed line)
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency") +
  theme_bw()

# Density plot for each attribute
wines %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_density(show.legend=FALSE) +     # density geom (reconstructed line)
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Density") +
  theme_bw()

# Boxplot for each attribute
wines %>%
  gather(Attributes, value, 1:13) %>%
  ggplot(aes(x=Attributes, y=value, fill=Attributes)) +   # (reconstructed aesthetics)
  geom_boxplot(show.legend=FALSE) +
  theme_bw() +
  theme(axis.title.y=element_blank(),
        axis.title.x=element_blank()) +
  ylim(0, 35) +
  coord_flip()
We haven’t included magnesium and proline, since their values are very high and worsen the
visualization.
What is the relationship between the different attributes? We can use the corrplot() function
to create a graphical display of a correlation matrix.
# Correlation matrix
corrplot(cor(wines), type="upper", method="ellipse", tl.cex=0.9)

# Some attributes show a strong linear relation; scatter plot for one
# such pair (attribute pair chosen for illustration)
ggplot(wines, aes(x=Total_Phenols, y=Flavanoids)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="Wines Attributes") +
  theme_bw()
Now that we have done a exploratory data analysis, we can prepare the data in order to execute
the k-means algorithm.
4 Data preparation
We have to normalize the variables to express them in the same range of values. In other
words, normalization means adjusting values measured on different scales to a common scale.
# Normalization
winesNorm <- as.data.frame(scale(wines))

# Original data (Alcohol vs Malic_Acid chosen as an illustrative pair)
p1 <- ggplot(wines, aes(x=Alcohol, y=Malic_Acid)) +
  geom_point() +
  labs(title="Original data") +
  theme_bw()

# Normalized data
p2 <- ggplot(winesNorm, aes(x=Alcohol, y=Malic_Acid)) +
  geom_point() +
  labs(title="Normalized data") +
  theme_bw()

# Subplot
grid.arrange(p1, p2, ncol=2)
The points in the normalized data are the same as in the original data. The only thing that changes
is the scale of the axes.
5 k-means execution
In this section we are going to execute the k-means algorithm and analyze the main
components that the function returns.
# Execution of k-means with k=2
set.seed(1234)
wines_k2 <- kmeans(winesNorm, centers=2)
The kmeans() function returns an object of class “kmeans” with information about the partition:
cluster. A vector of integers indicating the cluster to which each point is allocated.
centers. A matrix of cluster centers.
size. The number of points in each cluster.
# Cluster to which each point is allocated
wines_k2$cluster
## [1]   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36]  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 1
## [71]  2 1 2 1 1 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2 1
## [106] 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2
# Cluster centers
wines_k2$centers
# Cluster size
wines_k2$size
## [1] 87 91
Additionally, the kmeans() function returns some ratios that tell us how compact a cluster is
and how different the clusters are from one another.
wines_k2$betweenss
## [1] 651.56
wines_k2$withinss
wines_k2$tot.withinss
## [1] 1649.44
wines_k2$totss
## [1] 2301
6 How many clusters?
# Compute k-means for k = 1 to 10 and compare the within-cluster and
# between-cluster sum of squares (condensed reconstruction of this chunk)
set.seed(1234)
bss <- numeric(10)
wss <- numeric(10)
for(i in 1:10){
  wines_k <- kmeans(winesNorm, centers=i)
  bss[i] <- wines_k$betweenss
  wss[i] <- wines_k$tot.withinss
}
# Subplot of both curves against the number of clusters
Which is the optimal value for k? One should choose a number of clusters so that adding
another cluster doesn’t give much better partition of the data. At some point the gain will drop,
giving an angle in the graph (elbow criterion). The number of clusters is chosen at this point. In
our case, it is clear that 3 is the appropriate value for k.
7 Results
# Execution of k-means with k=3
set.seed(1234)
wines_k3 <- kmeans(winesNorm, centers=3)
# Clustering
ggpairs(cbind(wines, Cluster=as.factor(wines_k3$cluster)),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both") +
theme_bw()
8 Summary
In this entry we have learned about the k-means algorithm, including the data normalization
before we execute it, the choice of the optimal number of clusters (elbow criterion) and the
visualization of the clustering.
It has been a pleasure to make this post, I have learned a lot! Thank you for reading and if you
like it, please upvote it.
https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/?ref=rp
Now the question to be raised is: why should we use DBSCAN when K-Means is the
widely used method in clustering analysis?
Disadvantages of K-Means:
1. K-Means forms spherical clusters only. The algorithm fails when the data is not spherical
(i.e. does not have the same variance in all directions).
2. The K-Means algorithm is sensitive to outliers. Outliers can skew the clusters in K-
Means to a very large extent.
3. The K-Means algorithm requires one to specify the number of clusters a priori, etc.
Basically, the DBSCAN algorithm overcomes all of the above-mentioned drawbacks of K-Means.
DBSCAN identifies dense regions by grouping together data points that
are close to each other based on a distance measurement.
Python implementation of above algorithm without using the sklearn library can be found
here dbscan_in_python.
References :
https://en.wikipedia.org/wiki/DBSCAN
https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/?ref=rp
ML | Determine the optimal value of K in K-Means Clustering
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs   # sklearn.datasets.samples_generator is deprecated

style.use("fivethirtyeight")

# make_blobs() is used to generate sample points
# around c centers (randomly chosen)
X, y = make_blobs(n_samples=100, centers=4,
                  cluster_std=1, n_features=2)

plt.scatter(X[:, 0], X[:, 1], s=30, color='b')

# label the axes
plt.xlabel('X')
plt.ylabel('Y')

plt.show()
plt.clf()  # clear the figure
Output:
cost = []
for i in range(1, 11):
    KM = KMeans(n_clusters=i, max_iter=500)
    KM.fit(X)

    # calculates squared error
    # for the clustered points
    cost.append(KM.inertia_)

# plot the cost against K values
plt.plot(range(1, 11), cost, color='g', linewidth='3')
plt.xlabel("Value of K")
plt.ylabel("Squared Error (Cost)")
plt.show()  # display the plot

# the point of the elbow is the
# most optimal value for choosing k
Output:
Intro
Clustering was always a subject I tried to avoid (for no reason). In this project I will finally use
my knowledge of clustering and PCA algorithms to explore the Human Activity Recognition
dataset.
In [1]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from IPython.display import display
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn.metrics import homogeneity_score, completeness_score, \
    v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, \
    silhouette_score

%matplotlib inline
np.random.seed(123)
In [2]:
Data = pd.read_csv('../input/train.csv')
In [3]:
Data.sample(5)
Out[3]:
(output omitted: a 5-row sample of the 563-column data frame, whose columns are 'rn', 'activity', and 561 numeric sensor features such as tBodyAcc.mean.X/Y/Z, tBodyAcc.std.X/Y/Z, tBodyAcc.mad.*, tBodyAcc.energy.*, tBodyAcc.entropy.*, tBodyAcc.arCoeff.*, tBodyAcc.correlation.X.Y, fBodyAcc.* and similar)
In [4]:
print('Shape of the data set: ' + str(Data.shape))
Shape of the data set: (3609, 563)
In [5]:
#save labels as string
Labels = Data['activity']
Data = Data.drop(['rn', 'activity'], axis = 1)
Labels_keys = Labels.unique().tolist()
Labels = np.array(Labels)
print('Activity labels: ' + str(Labels_keys))
Activity labels: ['STANDING', 'SITTING', 'LAYING', 'WALKING',
'WALKING_DOWNSTAIRS', 'WALKING_UPSTAIRS']
In [6]:
#check for missing values
Temp = pd.DataFrame(Data.isnull().sum())
Temp.columns = ['Sum']
print('Amount of rows with missing values: '
      + str(len(Temp.index[Temp['Sum'] > 0])))
Amount of rows with missing values: 0
In [7]:
#normalize the dataset
scaler = StandardScaler()
Data = scaler.fit_transform(Data)
In [8]:
#check the optimal k value
ks = range(1, 10)
inertias = []
for k in ks:
model = KMeans(n_clusters=k)
model.fit(Data)
inertias.append(model.inertia_)
plt.figure(figsize=(8,5))
plt.style.use('bmh')
plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks)
plt.show()
Looks like the best value ("elbow" of the line) for k is 2 (two clusters).
In [9]:
def k_means(n_clust, data_frame, true_labels):
    """
    Function k_means applies the k-means clustering algorithm on the dataset
    and prints the crosstab of cluster and actual labels
    and clustering performance parameters.
    Input:
    n_clust - number of clusters (k value)
    data_frame - dataset we want to cluster
    true_labels - original labels
    Output:
    1 - crosstab of cluster and actual labels
    2 - performance table
    """
    k_means = KMeans(n_clusters=n_clust, random_state=123, n_init=30)
    k_means.fit(data_frame)
    c_labels = k_means.labels_

    df = pd.DataFrame({'clust_label': c_labels,
                       'orig_label': true_labels.tolist()})
    ct = pd.crosstab(df['clust_label'], df['orig_label'])
    y_clust = k_means.predict(data_frame)
    display(ct)

    print('% 9s' % 'inertia  homo  compl  v-meas  ARI  AMI  silhouette')
    print('%i  %.3f  %.3f  %.3f  %.3f  %.3f  %.3f'
          % (k_means.inertia_,
             homogeneity_score(true_labels, y_clust),
             completeness_score(true_labels, y_clust),
             v_measure_score(true_labels, y_clust),
             adjusted_rand_score(true_labels, y_clust),
             adjusted_mutual_info_score(true_labels, y_clust),
             silhouette_score(data_frame, y_clust, metric='euclidean')))
More on clustering metrics can be found in DataCamp Tutorial.
In [10]:
k_means(n_clust=2, data_frame=Data, true_labels=Labels)
(output omitted: crosstab of the 2 cluster labels against the 6 activity labels LAYING, SITTING, STANDING, WALKING, WALKING_DOWNSTAIRS, WALKING_UPSTAIRS, followed by the clustering performance metrics printed by k_means)
In [11]:
k_means(n_clust=6, data_frame=Data, true_labels=Labels)
(output omitted: crosstab of the 6 cluster labels against the 6 activity labels, followed by the clustering performance metrics printed by k_means)
In [12]:
#change labels into binary: 0 - not moving, 1 - moving
Labels_binary = Labels.copy()
for i in range(len(Labels_binary)):
if (Labels_binary[i] == 'STANDING' or Labels_binary[i] == 'SITTING' or
Labels_binary[i] == 'LAYING'):
Labels_binary[i] = 0
else:
Labels_binary[i] = 1
Labels_binary = np.array(Labels_binary.astype(int))
In [13]:
k_means(n_clust=2, data_frame=Data, true_labels=Labels_binary)
orig_label      0     1
clust_label
0            1970     6
1               2  1631
In [14]:
#check for optimal number of features
pca = PCA(random_state=123)
pca.fit(Data)
features = range(pca.n_components_)
plt.figure(figsize=(8,4))
plt.bar(features[:15], pca.explained_variance_[:15], color='lightskyblue')
plt.xlabel('PCA feature')
plt.ylabel('Variance')
plt.xticks(features[:15])
plt.show()
In [15]:
def pca_transform(n_comp):
pca = PCA(n_components=n_comp, random_state=123)
global Data_reduced
Data_reduced = pca.fit_transform(Data)
print('Shape of the new Data df: ' + str(Data_reduced.shape))
In [16]:
# pca_transform(n_comp=3)
# k_means(n_clust=2, data_frame=Data_reduced, true_labels=Labels)
In [18]:
pca_transform(n_comp=1)
k_means(n_clust=2, data_frame=Data_reduced, true_labels=Labels_binary)
Shape of the new Data df: (3609, 1)
orig_label      0     1
clust_label
0            1971     8
1               1  1629
In [19]:
pca_transform(n_comp=2)
k_means(n_clust=2, data_frame=Data_reduced, true_labels=Labels_binary)
Shape of the new Data df: (3609, 2)
orig_label      0     1
clust_label
0            1969     6
1               3  1631
If you know any interesting dataset to practice clustering on (not Iris dataset, haha),
please suggest!
https://mubaris.com/posts/kmeans-clustering/
All Articles
K-Means Clustering
Use Cases
Image Segmentation
Clustering Gene Segementation Data
News Article Clustering
Clustering Languages
Species Clustering
Anomaly Detection
Algorithm
Our algorithm works as follows, assuming we have inputs x_1, x_2, x_3, ..., x_n and a value of K.
Step 1
In this step, we randomly initialize K points as the initial cluster centroids (means).
Step 2
In this step we assign each input value to the closest center. This is done by
calculating the Euclidean (L2) distance between the point and each centroid.
Step 3
In this step, we find the new centroid by taking the average of all the points
assigned to that cluster.
Step 4
In this step, we repeat step 2 and 3 until none of the cluster assignments
change. That means until our clusters remain stable, we repeat the
algorithm.
We often know the value of K. In that case we use it directly. Otherwise, we
use the Elbow Method: we run the algorithm for different values of K (say K = 1 to 10),
plot the K values against the SSE (Sum of Squared Errors), and select the value of K at the
elbow point, as shown in the figure.
The dataset we are gonna use has 3000 entries with 3 clusters. So we
already know the value of K.
%matplotlib inline
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('xclara.csv')
print(data.shape)
data.head()
(3000, 2)
V1 V2
0 2.072345 -3.241693
1 17.936710 15.784810
2 1.083576 7.319176
3 11.120670 14.406780
4 23.711550 2.557729
If you run K-Means with wrong values of K, you will get completely
misleading clusters. For example, if you run K-Means on this with values 2, 4,
5 and 6, you will get the following clusters.
from sklearn.cluster import KMeans

# X is the feature matrix built from the two columns (V1, V2) shown above;
# this line is reconstructed, since the post's from-scratch section is omitted here
X = data[['V1', 'V2']].values

# Number of clusters
kmeans = KMeans(n_clusters=3)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_

# Comparing with scikit-learn centroids
# (C holds the centroids from the post's from-scratch implementation, not shown here)
print(C)          # From Scratch
print(centroids)  # From sci-kit learn
[[ 9.47804546 10.68605232]
[ 40.68362808 59.71589279]
[ 69.92418671 -10.1196413 ]]
[[ 9.4780459 10.686052 ]
[ 69.92418447 -10.11964119]
[ 40.68362784 59.71589274]]
You can see that the centroid values are equal, but in different order.
Example 2
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

plt.rcParams['figure.figsize'] = (16, 9)

# Generate sample 3-D data (the exact make_blobs parameters are assumed,
# since the data-generation line was omitted from the excerpt)
X, y = make_blobs(n_samples=800, n_features=3, centers=4)

# Initializing KMeans
kmeans = KMeans(n_clusters=4)
# Fitting with inputs
kmeans = kmeans.fit(X)
# Predicting the clusters
labels = kmeans.predict(X)
# Getting the cluster centers
C = kmeans.cluster_centers_

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)
ax.scatter(C[:, 0], C[:, 1], C[:, 2], marker='*', c='#050505', s=1000)
In the above image, you can see 4 clusters and their centroids as stars.
scikit-learn approach is very simple and concise.
More Resources
Conclusion
Even though it works very well, K-Means clustering has its own issues.
K-Means Objective
The objective of k-means is to minimize the total sum of the squared distance of every
point to its corresponding cluster centroid. Given a set of observations (x1, x2, …, xn), where
each observation is a d-dimensional real vector, k-means clustering aims to partition the n
observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of
squares where µi is the mean of points in Si.
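Written out, with \mu_i denoting the mean of the points in S_i, the objective is

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

i.e. the total within-cluster sum of squares over all k clusters.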
The k-means algorithm is only guaranteed to converge to a local optimum.
Business Uses
This is a versatile algorithm that can be used for any type of grouping. Some examples of use
cases are:
Behavioral segmentation: segment by purchase history, or by activities on an application, website, or platform.
Inventory categorization: group inventory by sales activity.
Sorting sensor measurements: detect activity types in motion sensors, group images.
Detecting bots or anomalies: separate valid activity groups from bots.
K-Means Clustering Algorithm
Step 1: Choose the number K of clusters.
Step 2: Select at random K points, the centroids.(not necessarily from your dataset)
Step 3: Assign each data point to the closest centroid -> That forms K clusters.
Step 4: Compute and place the new centroid of each cluster.
Step 5: Reassign each data point to the new closest centroid. If any reassignment took
place, go to Step 4, otherwise go to FIN.
Example: Applying K-Means Clustering to Customer Expenses and Invoices Data in
Python.
For Python I am using the Spyder editor. As an example, we’ll show how the K-means algorithm
works with Customer Expenses and Invoices data. We have data for 500 customers and we’ll be looking
at two customer features: Customer Invoices and Customer Expenses. In general, this algorithm
can be used for any number of features, so long as the number of data samples is much greater
than the number of features.
Step 1: Clean and Transform Your Data
For this example, we’ve already cleaned and completed some simple data transformations. A
sample of the data as a pandas DataFrame is shown below. Import the libraries in Python
(as sketched after this list), i.e.:
1. numpy as a mathematical tool, to include any kind of mathematics in our code.
2. matplotlib.pyplot to help plot nice charts.
3. pandas to import and manage the dataset.
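A minimal sketch of this step; the CSV file name is an assumption about how the cleaned data is stored:

# Import the three libraries described above
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the customer data (file name is illustrative)
dataset = pd.read_csv('customer_invoices_expenses.csv')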
Step 2: We want to apply clustering on Total Expenses and Total Invoices, so we select the
required columns into X (see the sketch below).
The chart below shows the dataset for 500 customers, with the Total Invoices on the x-axis and
Total Expenses on the y-axis.
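A sketch of the column selection; the exact column labels Total_Invoices and Total_Expenses are assumptions about how the cleaned DataFrame is labelled:

# Select the Total Invoices and Total Expenses columns as the clustering features
X = dataset[['Total_Invoices', 'Total_Expenses']].values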
Step 3: Choose K and Run the Algorithm
Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen
K. To find the number of clusters in the data, the user needs to run the K-means clustering
algorithm for a range of K values and compare the results. In general, there is no method for
determining exact value of K, but an accurate estimate can be obtained using the following
techniques.
One of the metrics that is commonly used to compare results across different values of K is the
mean distance between data points and their cluster centroid. Since increasing the number of
clusters will always reduce the distance to data points, increasing K will always decrease this
metric, to the extreme of reaching zero when K is the same as the number of data points. Thus,
this metric cannot be used as the sole target. Instead, mean distance to the centroid as function
of K is plotted and the “elbow point,” where the rate of decrease sharply shifts, can be used to
roughly determine K.
Using the elbow method we find the optimal number of clusters, i.e. K=3. For this example, we use
the Python package scikit-learn for the computations, as shown below:
# K-Means Clustering
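The computation itself is not reproduced in this excerpt; the sketch below shows one way it might look with scikit-learn, reusing the X defined earlier and the K=3 suggested by the elbow method:

# Elbow method on the customer data (a sketch, not the author's exact script)
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares')
plt.show()

# Fit the final model with the K suggested by the elbow (K = 3 here)
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)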
Adeline Ong
In this article, I will walk you through how I applied K-means and RFM segmentation to
cluster online gift shop customers based on their transaction records.
Introduction
When I was in college, I started a simple e-store selling pet products. Back then, I only collected
enough customer information to make the sale, and get my products to them. Simply put, I only
had their transaction records and addresses.
Back then, I didn’t think I had enough information to perform any useful segmentation.
However, I recently came across an intuitive segmentation approach
called RFM (Recency Frequency Monetary Value), which can be easily applied to basic customer
transaction records.
Recency: How long has it been since the customer last purchased from you (e.g. in days,
in months)?
Frequency: How many times has the customer purchased from you within a fixed period
(e.g. past 3 months, past year)?
Monetary Value: How much has the customer spent at your store within a fixed period
(which should be the same period set for Frequency)?
We can group customers, and come up with business recommendations based on RFM scores.
For example, you could offer promotions to reengage customers who have not bought from your
store recently. You could further prioritize your promotional strategy by focusing on customers
who used to buy frequently and spend at least average monetary value.
The traditional RFM approach requires you to manually rank customers from 1 to 5 on each of
their RFM features. Two ways to define ranks would be to create groups of equal intervals (e.g.
range/5), or categorize them based on percentiles (those up to 20th percentile would form a
rank).
Since we are data scientists, why not use an unsupervised learning model to do the job? In fact,
our model might perform better than the traditional approach since it groups customers based on
their RFM values, instead of their ranking.
The Data
The dataset was from the UCI Machine Learning Repository. The file contained 1 million customer
transaction records for a UK-based online gift store for the period between 2009 to 2010, and
2010 to 2011. There were two sheets in the excel file (one for each year), and each sheet had the
same 8 features:
Customer ID
Country (I didn’t really look at this since most customers were UK-based as well)
Invoice Code
Invoice Date
Stock Code
Stock Description
Unit Price
Unit Quantity
Data Cleaning
Since both datasets contained the same features, I appended one to the other. Following this, I
dropped rows that had:
Missing Customer ID
Abnormal Stock Codes that did not conform to the expected format, such as Stock Codes
that started with letters, and had less than 5 digits. These tended to be from manual entries
(Stock Code ‘M’), postage costs (Stock Code ‘DOT’) and cancelled orders (Stock Codes
starting with ‘C’). However, I retained Stock Codes that ended with letters, as these tended to
indicate product variations (e.g. pattern, color).
After creating RFM features for each customer (see Feature Engineering), I also removed
extreme outliers that were more than 4 standard deviations away from the mean. Removing
extreme outliers is important because they can skew unsupervised learning models that use
distance-based measures.
Feature Engineering
To derive a customer’s Recency, I calculated the time difference (in days) between the latest
purchase in the combined dataset, and the customer’s last purchase. Lower scores indicate a
more recent purchase, which is better for the store.
I created features that corresponded to each customer’s frequency of purchase (over the 2 year
period) and total spend (Monetary Value) through aggregation:
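The aggregation code is not shown in this excerpt; a pandas sketch under assumed column names (matching the feature list above, with 'Invoice Date' already parsed as datetime) could look like this:

# Recency / Frequency / Monetary Value per customer (column names are assumptions)
df['Revenue'] = df['Unit Price'] * df['Unit Quantity']
latest_date = df['Invoice Date'].max()

rfm = df.groupby('Customer ID').agg(
    Recency=('Invoice Date', lambda d: (latest_date - d.max()).days),
    Frequency=('Invoice Code', 'nunique'),
    MonetaryValue=('Revenue', 'sum'),
).reset_index()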
I also created other features, which I thought would be useful cluster descriptors:
Silhouette score can be used to evaluate the quality of unsupervised learning models where the
ground truth is unknown. Silhouette score measures how similar an observation is to its own
cluster, as compared to other clusters.
Values closer to 1 indicate better cluster separation, while values near 0 indicate overlapping
clusters. Avoid values that are negative.
I applied 3 unsupervised learning models to the data, and chose to go with K-Means because it had
the best silhouette scores regardless of the number of clusters.
Silhouette scores of unsupervised learning models by number of clusters
To choose the number of clusters (n_clusters), I took into account each cluster’s silhouette score.
Optimally, every cluster’s coefficient value should be higher than the mean silhouette score (in
the graph, each cluster’s peak should exceed the red dotted line). I also took into account the
RFM values of each cluster.
I varied the number of K-Means clusters and examined the RFM values and silhouette scores of
the models. I decided to go with n_clusters = 5 instead of anything less, despite a lower silhouette
score, because an important customer segment with good RFM values only appeared when
n_clusters = 5. Clusters that appeared beyond n_clusters = 5 were less critical because they had
poorer RFM scores.
Table depicting silhouette scores across n number of clusters, and whether each cluster’s coefficient
value was higher than the mean silhouette score within each model
Having chosen an unsupervised learning model and a suitable number of clusters, I visualized the
clusters using a 3D plot.
3D plot depicting customer segments derived using RFM segmentation and K-Means
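A hedged sketch of how such a 3D scatter might be drawn with matplotlib (assuming rfm has Recency, Frequency, Monetary and a cluster column):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(rfm['Recency'], rfm['Frequency'], rfm['Monetary'], c=rfm['cluster'], cmap='tab10', s=20)
ax.set_xlabel('Recency (days)')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')
plt.show()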
Clusters 4 and 2 have better RFM scores and represent the store’s core customers. The other 3
clusters appear to be more casual customers who purchase less frequently.
Core Customers
Based on this dataset, 18% of customers are core customers, and they contributed 62% of
revenue over the past two years. They spend a lot, purchase frequently (every one or two months),
and are still engaged with the store. As the typical price of the online store’s products tends to be
low, the clusters’ average spend suggests that they are purchasing in large quantities, so they are
probably wholesalers and smaller shops that resell the store’s goods.
Table describing key features of core customers. Non-percentage figures represent averages.
Casual Customers
As for casual customers, I‘d like to highlight Cluster 0 (which I’ve called Gift Hunters) as they are
most critical to the store. They contributed to about a quarter of revenue, which is a lot more than
the other casual clusters. They tended to purchase from the store once every quarter in small
amounts, which suggests that they are individuals buying for special occasions.
Table describing the key features of casual customers. Non-percentage figures represent averages.
Given the features of the clusters, I propose the following promotional strategies for key groups:
Wholesalers: Given their small numbers, it might make sense to engage them directly
to build goodwill and loyalty. It would be best to lock them in with a custom solution.
Small Shops: Explore cashback discounts that can be used during subsequent
purchases. This will also lower their cost and encourage them to spend more.
Gift Hunters: Engage them just before special occasions and encourage them to spend
more by giving them free gifts for a minimum spend that is higher than their current mean
spend of 347 pounds.
To End Off…
I think RFM segmentation pairs very well with unsupervised learning models, as they remove the
need for marketers to manually segment their customer records. I hope I’ve illustrated how
meaningful customer segments can be created from very basic customer information. For more
details, you can look at my notebook. It contains code and details about the other models that I
explored.
K-Means Clustering
Follow
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled
data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups
in the data, with the number of groups represented by the variable K. The algorithm works
iteratively to assign each data point to one of K groups based on the features that are provided.
Data points are clustered based on feature similarity. The results of the K-means clustering
algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze
the groups that have formed organically. The “Choosing K” section below describes how the
number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups.
Examining the centroid feature weights can be used to qualitatively interpret what kind of group
each cluster represents.
Business Uses
The K-means clustering algorithm is used to find groups which have not been explicitly labeled in
the data. This can be used to confirm business assumptions about what types of groups exist or to
identify unknown groups in complex data sets. Once the algorithm has been run and the groups
are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use
cases are:
Behavioral segmentation:
Inventory categorization:
Group images
Separate audio
In addition, monitoring if a tracked data point switches between groups over time can be used to
detect meaningful changes in the data.
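In scikit-learn terms, assigning new data to the existing groups is just a predict call on a fitted model; a minimal sketch (X_history and X_new are assumed placeholders):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0).fit(X_history)
new_labels = kmeans.predict(X_new)   # each new point is assigned to its nearest learned centroid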
Algorithm
The Κ-means clustering algorithm uses iterative refinement to produce a final result. The
algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of
features for each data point. The algorithm starts with initial estimates for the Κ centroids,
which can either be randomly generated or randomly selected from the data set. The algorithm
then iterates between two steps:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, based on the squared Euclidean distance. More formally, if c_i is the collection of
centroids in set C, then each data point x is assigned to a cluster based on

argmin_{c_i ∈ C} dist(c_i, x)^2

where dist( · ) is the standard (L2) Euclidean distance. Let the set of data point assignments for
each ith cluster centroid be S_i.
In this step, the centroids are recomputed. This is done by taking the mean of all data points
assigned to that centroid’s cluster:

c_i = (1 / |S_i|) * Σ_{x_i ∈ S_i} x_i
The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data
points change clusters, the sum of the distances is minimized, or some maximum number of
iterations is reached).
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not
necessarily the best possible outcome), meaning that assessing more than one run of the
algorithm with randomized starting centroids may give a better outcome.
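A minimal NumPy sketch of the two-step iteration described above (random initial centroids drawn from the data; a teaching sketch, not a production implementation):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                              # stopping criterion: no point changed cluster
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points
        for i in range(k):
            if np.any(labels == i):            # keep the old centroid if a cluster empties out
                centroids[i] = X[labels == i].mean(axis=0)
    return centroids, labels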
Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen K.
To find the number of clusters in the data, the user needs to run the K-means clustering
algorithm for a range of K values and compare the results. In general, there is no method for
determining the exact value of K, but an accurate estimate can be obtained using the following
techniques.
One of the metrics that is commonly used to compare results across different values of K is the
mean distance between data points and their cluster centroid. Since increasing the number of
clusters will always reduce the distance to data points, increasing K will always decrease this
metric, to the extreme of reaching zero when K is the same as the number of data points. Thus,
this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function
of K is plotted and the “elbow point,” where the rate of decrease sharply shifts, can be used to
roughly determine K.
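A sketch of that elbow plot using scikit-learn's inertia_ (the within-cluster sum of squared distances) as the distance metric; X is an assumed feature matrix:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow method')
plt.show()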
Example
This write-up uses secondary data obtained from the website https://archive.ics.uci.edu.
This dataset contains information about wart treatment results of 90 patients using cryotherapy.
Read data
The attributes in the file use integer and float data types and are all non-null, indicating that none
of the 90 records has missing data.
Plot distribution
The plot drawn here is the distribution of the age variable broken down by the sex variable.
The largest group of patients is under 20 years old, while the smallest is in the 51–60 age range.
K-Means Cluster
Use the "sklearn" (scikit-learn) package for clustering by importing KMeans, and use the same
package for preprocessing by importing MinMaxScaler.
The first thing to do for K-Means clustering is to convert the data frame to an array. The array is
then standardized using MinMaxScaler(). This standardization scales each feature to lie between a
given minimum and maximum value, often zero and one, or so that the maximum absolute value of
each feature is scaled to unit size.
To build the K-Means model, the first step is to fit the K-Means function on all variables with the
"n_clusters" parameter set to 5 and the "random_state" parameter set to 123, which fixes the
random seed.
The final step is to display the cluster results, add them to the data frame in a column named
"cluster", and then visualize them in a scatterplot.
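A hedged sketch of the pipeline described above (file path and plotted column names are assumptions, not the author's exact code):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

df = pd.read_excel('Cryotherapy.xlsx')            # path/format assumed

X = MinMaxScaler().fit_transform(df.values)       # scale every feature to [0, 1]
model = KMeans(n_clusters=5, random_state=123).fit(X)

df['cluster'] = model.labels_                     # put the cluster labels back into the frame
df.plot.scatter(x='age', y='Time', c='cluster', cmap='viridis')   # column names assumed
plt.show()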
Reference :
https://archive.ics.uci.edu/ml/datasets/Cryotherapy+Dataset+#
Feature Engineering
Feature engineering is the process of using domain knowledge to choose which data metrics to
input as features into a machine learning algorithm. Feature engineering plays a key role in K-
means clustering; using meaningful features that capture the variability of the data is essential
for the algorithm to find all of the naturally-occurring groups.
Categorical data (i.e., category labels such as gender, country, browser type) needs to be
encoded or separated in a way that can still work with the algorithm.
Feature transformations, particularly to represent rates rather than measurements, can help to
normalize the data. For example, in the delivery fleet example above, if total distance driven had
been used rather than mean distance per day, then drivers would have been grouped by how
long they had been driving for the company rather than rural vs. urban.
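As a one-line illustration of that transformation (column names assumed):

# Per-day rate captures driving behaviour; the raw total mostly reflects tenure
df['mean_distance_per_day'] = df['total_distance'] / df['days_active']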
Alternatives
Example:
We are using weight + height for that, and in our training set let's say we have two people already
in clusters:
So, we need to scale it. Scikit Learn provides many functions for scaling. One you can use
is sklearn.preprocessing.MinMaxScaler.
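A small sketch with made-up height/weight values to show the effect:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# [height in m, weight in kg]: unscaled, weight differences dominate the Euclidean distance
people = np.array([[1.62, 55.0],
                   [1.85, 90.0],
                   [1.70, 72.0]])

scaled = MinMaxScaler().fit_transform(people)     # both columns now lie in [0, 1]
print(scaled)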
Yes. Clustering algorithms such as K-means do need feature scaling before they are fed to the
algorithm. Since clustering techniques use Euclidean distance to form the cohorts, it is wise, for
example, to scale variables measured in metres (heights) and kilograms (weights) before calculating
the distance.
If your variables are of incomparable units (e.g. height in cm and weight in kg) then you should
standardize variables, of course. Even if variables are of the same units but show quite different
variances it is still a good idea to standardize before K-means. You see, K-means clustering is
"isotropic" in all directions of space and therefore tends to produce more or less round (rather
than elongated) clusters. In this situation leaving variances unequal is equivalent to putting more
weight on variables with smaller variance, so clusters will tend to be separated along variables
with greater variance.
Another thing worth remembering is that K-means clustering results are potentially sensitive to
the order of objects in the data set [1]. A justified practice would be to run the analysis several
times, randomizing the object order; then average the cluster centres of those runs and input the
centres as initial ones for one final run of the analysis.
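A sketch of that practice with scikit-learn (X assumed; note the averaging step quietly assumes cluster labels line up across runs, which is not guaranteed):

import numpy as np
from sklearn.cluster import KMeans

runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(10)]
init_centres = np.mean([r.cluster_centers_ for r in runs], axis=0)   # assumes matching label order

final = KMeans(n_clusters=3, init=init_centres, n_init=1).fit(X)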
Here is some general reasoning about the issue of standardizing features in cluster or other
multivariate analysis.
[1] Specifically, (1) some methods of centre initialization are sensitive to case order; (2) even
when the initialization method isn't sensitive, results might sometimes depend on the order in which
the initial centres are introduced to the program (in particular, when there are tied, equal
distances within the data); (3) the so-called running-means version of the k-means algorithm is
naturally sensitive to case order (in this version - which is not often used apart from maybe online
clustering - recalculation of the centroids takes place after each individual case is re-assigned to
another cluster).
As explained in this paper, the k-means minimizes the error function using the Newton
algorithm, i.e. a gradient-based optimization algorithm. Normalizing the data improves
convergence of such algorithms. See here for some details on it.
The idea is that if different components of data (features) have different scales, then derivatives
tend to align along directions with higher variance, which leads to poorer/slower convergence.
If you have mixed numerical data, where each attribute is something entirely different (say, shoe
size and weight), has different units attached (lb, tons, m, kg ...) then these values aren't really
comparable anyway; z-standardizing them is a best practice to give them equal weight.
If you have binary values, discrete attributes or categorical attributes, stay away from k-means.
K-means needs to compute means, and the mean value is not meaningful on this kind of data.
The issue is what represents a good measure of distance between cases.
If you have two features, one where the differences between cases is large and the other small,
are you prepared to have the former as almost the only driver of distance?
So for example if you clustered people on their weights in kilograms and heights in metres, is a
1kg difference as significant as a 1m difference in height? Does it matter that you would get
different clusterings on weights in kilograms and heights in centimetres? If your answers are
"no" and "yes" respectively then you should probably scale.
On the other hand, if you were clustering Canadian cities based on distances east/west and
distances north/south then, although there will typically be much bigger differences east/west,
you may be happy just to use unscaled distances in either kilometres or miles (though you might
want to adjust degrees of longitude and latitude for the curvature of the earth).
I think standard scaling mostly depends on the model being used, and normalizing depend on
how the data is originated
Most of distance based models e.g. k-means need standard scaling so that large-scaled
features don't dominate the variation. Same goes to PCA.
About the normalization, it mostly depends on the data. For example, if you have sensor data
(each time step being a variable) with different scaling, you need to L2 normalize the data to
bring them into the same scale. Or if you are working on customer recommendation and your
entry are the number of times they bought each item (items being variables), you might need to
L2 normalize the items if you don't want people who buy a lot to skew the feature.
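A sketch of that row-wise L2 normalization (purchases is an assumed customers-by-items count matrix):

from sklearn.preprocessing import normalize

# Each customer row is scaled to unit length, so heavy and light buyers become comparable
purchases_l2 = normalize(purchases, norm='l2', axis=1)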
Personally, I think that if the variables are well-defined, taking their log might result in losing
interpretability. So if you get good-looking clusters without the log transform, I'd stick with that.
It's simply a case of getting all your data on the same scale: if the scales for different features
are wildly different, this can have a knock-on effect on your ability to learn (depending on what
methods you're using to do it). Ensuring standardised feature values implicitly weights all
features equally in their representation.
https://www.quora.com/Should-you-standardize-binary-categorical-and-indicator-primary-key-
variables-before-performing-K-means-clustering
Should you standardize binary, categorical and indicator (primary key) variables before
performing K-means clustering?
Yes, standardizing (normalizing) the input features is an important preprocessing step for
using k-means. This is done to make all the features in the same scale and give equal
importance to all features during learning. You can either use min-max normalization or mean-
SD normalization.
Are mean normalization and feature scaling needed for k-means clustering?
Having said that, the standard k-means technique preferably should not be directly applied to
categorical data, for various reasons. This is because the sample space for categorical data is
discrete, and doesn't have a natural origin. A Euclidean distance function on such a space isn't
really meaningful.
k-modes (for only categorical data) and k-prototypes (for data with mixed data types) are more
appropriate.
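A sketch following the nicodv/kmodes package linked below (API as I recall it from the project's README):

import numpy as np
from kmodes.kmodes import KModes

data = np.random.choice(20, (100, 10))            # toy categorical data: 100 rows, 10 attributes
km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=0)
clusters = km.fit_predict(data)
print(km.cluster_centroids_)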
http://www.cs.ust.hk/~qyang/Teac...
Why does K-means clustering perform poorly on categorical data? The weakness of the K-
means method is that it is applicable only when the mean is defined, one needs to specify K in
advance, and it is unable to handle noisy data and outliers.
nicodv/kmodes (Source code)
Related Questions
What are the most typical applications of K-means clustering algorithm or its variants?
How do I understand the characteristics of each cluster when doing a K-Means clustering
algorithm?
How should I run K-means clustering when I cannot choose which variable to run on?
How can I choose variables using principal component analysis for K-means clustering?
How do we apply k-means clustering algorithm for mixed data-numeric and categorical?
What happens when you try clustering data with higher dimensions using k-means? For
example, if the dimensionality of the data set is 1000, nu...
What is the intuition behind distance and clustering in a space formed by categorical
variables?
http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf ( Paper )
Why does K-means clustering perform poorly on categorical data? The weakness of the K-
means method is that it is applicable only when the mean is defined, one needs to specify
K in advance, and it is unable to handle noisy data and outliers.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
The K-means algorithm defines a cost function that computes Euclidean distance (or it can be
anything similar) between two numeric values. However, it is not possible to define such
distance between categorical values. For example, if the Euclidean distance between numeric points A and
B is 25 and between A and C is 10, we know A is closer to C than to B. However, as Sean Owen and User-
9806452280263883043 suggested, categorical values are not numbers but enumerations
such as 'banana', 'apple' and 'orange'. Euclidean distance is not meant to handle such information,
so we cannot say whether apple is closer to orange or to banana. Therefore, we need to change the
cost function. In his paper, Huang proposed two things to handle this situation:
HTH
There are lots of good answers given. Definitely, Euclidean distance between two points that
have a categorical dimension does not make sense when you want to compute the mean of a
possible centroid for a cluster.
As an alternative to all the suggestions, if you convert your categorical data to numeric values
and if you scale your *actual* numeric features to the range of the numeric values you derived from
the converted categorical features, then you could probably run k-means (I haven't tested this
myself) using something like cosine similarity etc. to identify the centroids.
If you google for k-means on categorical data, there are many papers that list various different
approaches; one such approach, as everybody mentioned, is http://arxiv.org/ftp/cs/papers/0...
https://github.com/nicodv/kmodes
https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-
categorical-data
A Google search for "k-means mix of categorical data" turns up quite a few more recent papers
on various algorithms for k-means-like clustering with a mix of categorical and numeric data. (I
haven't yet read them, so I can't comment on their merits.)
Actually, what you suggest (converting categorical attributes to binary values, and then doing k-
means as if these were numeric values) is another approach that has been tried before
(predating k-modes). (See Ralambondrainy, H. 1995. A conceptual version of the k-means
algorithm. Pattern Recognition Letters, 16:1147–1157.) But I believe the k-modes approach is
preferred for the reasons I indicated above.
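A sketch of that binary-encoding approach with pandas and scikit-learn (df and the 'browser' column are assumptions for illustration):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

encoded = pd.get_dummies(df, columns=['browser'])   # one-hot encode the categorical column
X = StandardScaler().fit_transform(encoded)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)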
Euclidean distance is not defined for categorical data; therefore, K-means cannot be used
directly. You may like to read more here
Shehroz Khan's answer to Why does K-means clustering perform poorly on categorical data?
The weakness of the K-means method is that it is applicable only when the mean is defined, one
needs to specify K in advance, and it is unable to handle noisy data and outliers.
Shehroz Khan's answer to How do we apply k-means clustering algorithm for mixed data-
numeric and categorical?
Shehroz Khan's answer to About converting a categorical variable into a numeric variable: When
is it better to use dummy variables instead of a single numerical variable?
You just apply PCA and choose the principal components with the largest eigenvalues (usually 2
or 3 for visualization purposes). The issue, though, is that you won't gain insight from the
clustering about the features and how they cluster the data points, because you are not using
the original data features but the principal components.
One standard approach is to compute a distance or dissimilarity matrix from the data and then
cluster it using hierarchical clustering, PAM etc.
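A sketch of the PCA route (X assumed to be the scaled numeric matrix); note the caveat above that the components, not the original features, drive the clusters:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

pcs = PCA(n_components=2).fit_transform(X)          # keep the 2 components with largest eigenvalues
labels = KMeans(n_clusters=3, random_state=0).fit_predict(pcs)

plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap='tab10', s=15)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()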
https://www.youtube.com/watch?edufilter=NULL&v=9991JlKnFmk
https://github.com/llSourcell/k_means_clustering
Let's detect the intruder trying to break into our security system using a very popular ML
technique called K-Means Clustering! This is an example of learning from data that has no
labels (unsupervised) and we'll use some concepts that we've already learned about like
computing the Euclidean distance and a loss function to do this. Code for this video:
https://github.com/llSourcell/k_means... More learning resources: http://www.kdnuggets.com/2016/12/data...
http://opencv-python-tutroals.readthe... http://people.revoledu.com/kardi/tuto...
https://home.deib.polimi.it/matteucc/... http://mnemstudio.org/clustering-k-me...
https://www.dezyre.com/data-science-i... http://scikit-learn.org/stable/tutori...
https://medium.com/search?q=K-mean
https://www.kaggle.com/patneshubham123/k-means-clustering-and-cluster-profiling
# Clustering
In [27]:
from sklearn.cluster import KMeans
In [28]:
km_3=KMeans(n_clusters=3,random_state=123)
km_3.fit(train_num)
km_3.cluster_centers_
km_3.labels_
pd.Series(km_3.labels_).value_counts()
km_4=KMeans(n_clusters=4,random_state=123).fit(train_num)
#km_4.labels_
km_5=KMeans(n_clusters=5,random_state=123).fit(train_num)
#km_5.labels_
km_6=KMeans(n_clusters=6,random_state=123).fit(train_num)
#km_6.labels_
km_7=KMeans(n_clusters=7,random_state=123).fit(train_num)
#km_7.labels_
km_8=KMeans(n_clusters=8,random_state=123).fit(train_num)
#km_8.labels_
# save the cluster labels and sort by cluster
train_num['cluster_3'] = km_3.labels_
train_num['cluster_4'] = km_4.labels_
train_num['cluster_5'] = km_5.labels_
train_num['cluster_6'] = km_6.labels_
train_num['cluster_7'] = km_7.labels_
train_num['cluster_8'] = km_8.labels_
train_num.head()
pd.Series.sort_index(train_num.cluster_3.value_counts())
pd.Series(train_num.cluster_3.size)
size=pd.concat([pd.Series(train_num.cluster_3.size),
pd.Series.sort_index(train_num.cluster_3.value_counts()),
pd.Series.sort_index(train_num.cluster_4.value_counts()),
pd.Series.sort_index(train_num.cluster_5.value_counts()),
pd.Series.sort_index(train_num.cluster_6.value_counts()),
pd.Series.sort_index(train_num.cluster_7.value_counts()),
pd.Series.sort_index(train_num.cluster_8.value_counts())])
In [38]:
Seg_size=pd.DataFrame(size, columns=['Seg_size'])
Seg_Pct = pd.DataFrame(size/train_num.cluster_3.size, columns=['Seg_Pct'])
Seg_size.T
Seg_Pct.T
clusters_df[0:10]
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker="o")
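clusters_df is used above without being built in this excerpt; a plausible construction, with the column names taken from the plot call (an assumption, not the notebook's exact cell):

import pandas as pd
from sklearn.cluster import KMeans

cluster_range = range(1, 11)
cluster_errors = [KMeans(n_clusters=k, random_state=123).fit(train_num).inertia_
                  for k in cluster_range]
clusters_df = pd.DataFrame({'num_clusters': list(cluster_range),
                            'cluster_errors': cluster_errors})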
Note:
The elbow diagram shows that the gain in explained variance reduces significantly
after k=2. So, the optimal number of clusters is 2.
Silhouette Coefficient
In [46]:
from sklearn import metrics
# calculate the silhouette coefficient for K=2 through K=11
k_range = range(2, 12)
scores = []
for k in k_range:
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(train_num)
    scores.append(metrics.silhouette_score(train_num, km.labels_))
In [47]:
scores
# The silhouette coefficient is maximum for k=2, so we select 2 as our optimum number of clusters
# plot the results
plt.plot(k_range, scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.grid(True)
Note:
The SC plot shows that the Silhouette Coefficient is maximum at k=2, so the optimal
number of clusters is 2.
https://www.kaggle.com/karthickaravindan/k-means-clustering-project
In [14]:
from sklearn.cluster import KMeans
Create an instance of a K Means model with 2 clusters.
In [15]:
kmeans=KMeans(n_clusters=2)
Fit the model to all the data except for the Private label.
In [16]:
kmeans.fit(df.drop('Private',axis=1))
In [17]:
kmeans.cluster_centers_
def converter(cluster):
    if cluster=='Yes':
        return 1
    else:
        return 0
In [19]:
df['Cluster'] = df['Private'].apply(converter)
In [20]:
df.head()
Create a confusion matrix and classification report to see how well the Kmeans
clustering worked without being given any labels.
In [21]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(df['Cluster'],kmeans.labels_))
print(classification_report(df['Cluster'],kmeans.labels_))
https://www.kaggle.com/sirpunch/k-means-clustering (very important)
K-Means Clustering
Python notebook using data from The Movies Dataset · 6,575 views · 2y ago
import numpy as np
import pandas as pd
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Read the movies metadata csv file
In [2]:
df = pd.read_csv("../input/movies_metadata.csv")
Only keep the numeric columns for our analysis. However, we'll keep titles also to interpret the
results at the end of clustering. Note that this title column will not be used in the analysis.
In [3]:
df.drop(df.index[19730],inplace=True)
df.drop(df.index[29502],inplace=True)
df.drop(df.index[35585],inplace=True)
In [4]:
df_numeric = df[['budget','popularity','revenue','runtime','vote_average','vote_count','title']]
In [5]:
df_numeric.head()
In [6]:
df_numeric.isnull().sum()
Drop all the rows with null values
In [7]:
df_numeric.dropna(inplace=True)
Normalize data
Normalize the data with MinMax scaling provided by sklearn
In [12]:
from sklearn import preprocessing
In [13]:
minmax_processed = preprocessing.MinMaxScaler().fit_transform(df_numeric.drop('title',axis=1))
In [14]:
df_numeric_scaled = pd.DataFrame(minmax_processed, index=df_numeric.index,
columns=df_numeric.columns[:-1])
In [15]:
df_numeric_scaled.head()
In [16]:
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
In [17]:
score = [kmeans[i].fit(df_numeric_scaled).score(df_numeric_scaled) for i in
range(len(kmeans))]
These score values signify how far our observations are from the cluster centres: the score here is
the negative of the within-cluster sum of squared distances, so we want it close to 0. A large
negative value indicates that the cluster centres are far from the observations.
Based on these scores value, we plot an Elbow curve to decide which cluster size is optimal.
Note that we are dealing with a tradeoff between the number of clusters (and hence the
computation required) and the relative accuracy.
In [18]:
pl.plot(Nc,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
Our elbow point is around a cluster size of 5. We will use k=5 to further interpret our clustering
result. I prefer this number for ease of interpretation in this demo; we could also pick a higher
number like 9.
Fit K-Means clustering for k=5
In [19]:
kmeans = KMeans(n_clusters=5)
kmeans.fit(df_numeric_scaled)
Out[19]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
As a result of clustering, we have the clustering label. Let's put these labels back into the
original numeric data frame.
In [20]:
len(kmeans.labels_)
Out[20]:
12178
In [21]:
df_numeric['cluster'] = kmeans.labels_
In [22]:
df_numeric.head()
In [23]:
plt.figure(figsize=(12,7))
axis = sns.barplot(x=np.arange(0,5,1), y=df_numeric.groupby(['cluster']).count()['budget'].values)
x=axis.set_xlabel("Cluster Number")
x=axis.set_ylabel("Number of movies")
We clearly see that one cluster is the largest and one cluster has the fewest number of movies.
Let's look at the cluster statistics.
In [24]:
df_numeric.groupby(['cluster']).mean()
size_array = list(df_numeric.groupby(['cluster']).count()['budget'].values)
In [26]:
size_array
Out[26]:
[73, 253, 3744, 4801, 1107]
df_numeric[df_numeric['cluster']==size_array.index(sorted(size_array)[0])].sample(5)
We see many big movie names in this cluster, so the results are intuitive.
The second smallest cluster in the results has the 2nd highest vote
count and the most highly rated movies. The runtime for these movies is on the higher
end and the popularity score is also good. Let's see some of the movie names from this
cluster.
df_numeric[df_numeric['cluster']==size_array.index(sorted(size_array)[1])].sample(5)
Lastly, let's take a look at the least successful movies. This cluster represents the
movies that received the fewest votes and also have the smallest runtime,
revenue and popularity scores.
In [29]:
df_numeric[df_numeric['cluster']==size_array.index(sorted(size_array)[-1])].sample(5)
As we can see, this cluster also includes the movies for which our dataset has no information
about the budget and revenue, hence their corresponding fields have a value of 0. This pulls
down the net revenue of the whole cluster. If we made the number of clusters slightly larger, we
might see these movies clustered separately.
https://www.kaggle.com/rounakbanik/the-movies-dataset
https://www.kaggle.com/vjchoudhary7/kmeans-clustering-in-customer-segmentation
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
['Mall_Customers.csv']
In [2]:
#Import the dataset
dataset = pd.read_csv('../input/Mall_Customers.csv')
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
5           6  Female   22                  17                      76
6           7  Female   35                  18                       6
7           8  Female   23                  18                      94
8           9    Male   64                  19                       3
9          10  Female   30                  19                      72
In [3]:
#total rows and colums in the dataset
dataset.shape
Out[3]:
(200, 5)
In [4]:
dataset.info()  # there are no missing values, as all the columns have 200 entries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
CustomerID 200 non-null int64
Gender 200 non-null object
Age 200 non-null int64
Annual Income (k$) 200 non-null int64
Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
In [5]:
#Missing values computation
dataset.isnull().sum()
Out[5]:
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
In [6]:
### Feature selection for the model
# Considering only 2 features (Annual Income and Spending Score) and no label available
X = dataset.iloc[:, [3, 4]].values
In [7]:
#Building the Model
#KMeans algorithm to decide the optimum cluster number: k-means++ initialization with the Elbow Method
#To figure out K for KMeans, I will use the Elbow Method on the k-means++ calculation
from sklearn.cluster import KMeans
wcss=[]
for i in range(1,11):
    kmeans = KMeans(n_clusters= i, init='k-means++', random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
#inertia_ is the within-cluster sum of squared distances used to evaluate the clustering
In [8]:
#Visualizing the ELBOW method to get the optimal value of K
plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('no of clusters')
plt.ylabel('wcss')
plt.show()
In [9]:
#If you zoom out on this curve you will see that the last elbow comes at k=5
#No matter what range we select, e.g. (1,21), we will see the same behaviour,
#but if we choose a higher range it is a little difficult to visualize the elbow;
#that is why we usually prefer the range (1,11)
##Finally we got k=5
#Model Build
kmeansmodel = KMeans(n_clusters= 5, init='k-means++', random_state=0)
y_kmeans= kmeansmodel.fit_predict(X)
In [11]:
###Model Interpretation
#Cluster 1 (Red Color) -> earning high but spending less
#cluster 2 (Blue Colr) -> average in terms of earning and spending
#cluster 3 (Green Color) -> earning high and also spending high [TARGET SET]
#cluster 4 (cyan Color) -> earning less but spending more
#Cluster 5 (magenta Color) -> Earning less , spending less
######We can put Cluster 3 into some alerting system where emails can be sent to them on a daily
basis, as these are easy to convert ######
#whereas for the others we can set it to once a week or once a month
# K-Means Clustering
X = dataset.iloc[:,:-1].values
X = pd.DataFrame(X)
X = X.apply(pd.to_numeric, errors='coerce')  # convert_objects() was removed from pandas; coerce non-numeric entries to NaN
X.columns = ['mpg', ' cylinders', ' cubicinches', ' hp', ' weightlbs',
' time-to-60', 'year']
https://www.kaggle.com/roshansharma/mall-customers-clustering-analysis
Clustering Analysis
In [17]:
x = data.iloc[:, [3, 4]].values
In [18]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    km.fit(x)
    wcss.append(km.inertia_)
plt.style.use('fivethirtyeight')
plt.title('K Means Clustering', fontsize = 20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
This Clustering Analysis gives us a very clear insight about the different segments of the
customers in the Mall. There are clearly Five segments of Customers namely Miser, General,
Target, Spendthrift, Careful based on their Annual Income and Spending Score which are
reportedly the best factors/attributes to determine the segments of a customer in a Mall.
Hierarchical Clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster
is distinct from each other cluster, and the objects within each cluster are broadly similar to each
other
Using Dendrograms to find the no. of Optimal Clusters
plt.style.use('fivethirtyeight')
plt.title('Hierarchical Clustering', fontsize = 20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
Clusters of Customers Based on their Ages
In [22]:
x = data.iloc[:, [2, 4]].values
x.shape
K-means Algorithm
In [23]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
plt.rcParams['figure.figsize'] = (15, 5)
plt.plot(range(1, 11), wcss)
plt.title('K-Means Clustering (The Elbow Method)', fontsize = 20)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.grid()
plt.show()
plt.style.use('fivethirtyeight')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid()
plt.show()
According to my own intuition by looking at the above clustering plot between the age of
the customers and their corresponding spending scores, I have aggregated them into 4 different
categories namely Usual Customers, Priority Customers, Senior Citizen Target Customers,
Young Target Customers. Then after getting the results we can accordingly make different
marketing strategies and policies to optimize the spending scores of the customer in the Mall.
layout = go.Layout(
title = 'Age vs Spending Score vs Annual Income',
margin=dict(
l=0,
r=0,
b=0,
t=0
),
scene = dict(
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Spending Score'),
zaxis = dict(title = 'Annual Income')
)
)
Yousef-Project-1
%matplotlib inline
import numpy as np
import pandas as pd
corr = data.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr)  # draw the correlation matrix so the colorbar has something to map
fig.colorbar(cax)
ticks = np.arange(0,len(data.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(data.columns)
ax.set_yticklabels(data.columns)
plt.show()
data.describe()
[data.describe() output: count, mean, std, min, 25%, 50% and 75% statistics for the 12 numeric columns:
Path Loss Diff. > 0 dB (%), Power Red. BS = 0 dB (%), Path Loss DL > 150 dB (%), RXLEV DL > -95 dBm (%),
RXLEV UL > -95 dBm (%), RXQUAL DL > 4 GSM (%), RXQUAL UL > 4 GSM (%), RXQUAL UL Average (GSM),
RXQUAL DL Average (GSM), RXLEV UL Average (dBm), RXLEV DL Average (dBm) and Traffic Level Average (E).
The count is 662 for every column except Traffic Level Average (E), which has 680.]
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 683 entries, EBSC05/KC2001A to TOTAL
Data columns (total 13 columns):
Path Loss Diff. > 0 dB (%) 662 non-null float64
Power Red. BS = 0 dB (%) 662 non-null float64
Path Loss DL > 150 dB (%) 662 non-null float64
RXLEV DL > -95 dBm (%) 662 non-null float64
RXLEV UL > -95 dBm (%) 662 non-null float64
RXQUAL DL > 4 GSM (%) 662 non-null float64
RXQUAL UL > 4 GSM (%) 662 non-null float64
Channel Group 683 non-null object
RXQUAL UL Average (GSM) 662 non-null float64
RXQUAL DL Average (GSM) 662 non-null float64
RXLEV UL Average (dBm) 662 non-null float64
RXLEV DL Average (dBm) 662 non-null float64
Traffic Level Average (E) 680 non-null float64
dtypes: float64(12), object(1)
memory usage: 94.7+ KB
data.dtypes
data.shape
print(data.isnull().sum() )
Path Loss Diff. > 0 dB (%) 21
Power Red. BS = 0 dB (%) 21
Path Loss DL > 150 dB (%) 21
RXLEV DL > -95 dBm (%) 21
RXLEV UL > -95 dBm (%) 21
RXQUAL DL > 4 GSM (%) 21
RXQUAL UL > 4 GSM (%) 21
Channel Group 0
RXQUAL UL Average (GSM) 21
RXQUAL DL Average (GSM) 21
RXLEV UL Average (dBm) 21
RXLEV DL Average (dBm) 21
Traffic Level Average (E) 3
dtype: int64
sns.set(style="ticks", color_codes=True)
sns.pairplot(data);
[data.corr() output: the 12 x 12 Pearson correlation matrix of the numeric columns listed above.]
# sns.heatmap(corr,annot=True)
sns.heatmap(corr,annot=True,cmap="YlGnBu")
plt.figure(figsize=(30,10))
sns.heatmap(corr,annot=True,cmap="YlGnBu",fmt='.1g',square=True)
# plt.figure(figsize=(15,4))
# plt.savefig('medals.svg')
#very important Subplot Seaborn
# result['diagnosis'] = data.iloc[:,0]
j = 0
for i in data.columns:
    plt.subplot(6, 4, j+1)
    j += 1
    plt.legend(loc='best')
fig.suptitle('KV2001_MRR')
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()
g = sns.pairplot(data, vars=["RXLEV DL > -95 dBm (%)", "RXQUAL DL > 4 GSM (%)"])
g = sns.pairplot(data, x_vars=["Path Loss Diff. > 0 dB (%)", "Power Red. BS = 0 dB (%)"], size = 4)
size = 4);
# Title
size = 28);
'log_gdp_per_cap'], size = 4)
edgecolor = 'k')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

sns.set(style="white")

# Synthetic correlated data for the PairGrid demo
mean = np.zeros(3)
cov = np.random.uniform(.2, .4, (3, 3))
cov += cov.T
cov[np.diag_indices(3)] = 1
df = pd.DataFrame(np.random.multivariate_normal(mean, cov, 100), columns=["x", "y", "z"])

def corrfunc(x, y, **kws):
    # Annotate each lower-triangle panel with the Pearson correlation
    r, _ = stats.pearsonr(x, y)
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.1, .9), xycoords=ax.transAxes)

g = sns.PairGrid(df, palette=["red"])
g.map_upper(plt.scatter, s=10)
g.map_diag(sns.distplot, kde=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_lower(corrfunc)
#K-means Clustering
#MRR-3-Feb
dataF.head()
fig = plt.figure()
ylim=(-1.2, 1.2))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
ax = fig.add_subplot(2, 3, i)
alpha=0.4, edgecolors='w')
alpha=0.4, edgecolors='w')
fig.subplots_adjust(top=0.85, wspace=0.3)
ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Density")
ax2 = fig.add_subplot(1,2,2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Density")
# xx , yy = enumerate(data.columns)
# sns.distplot(data[column],ax=axes[i//3,i%3])
fig, axes = plt.subplots(5, 3, figsize=(15, 18))            # grid size assumed
i = 0
for columnName in data.select_dtypes('number').columns:    # numeric columns only
    print(columnName)
    sns.distplot(data[columnName], ax=axes[i//3, i%3])
    i = i + 1
plt.show()
edgecolors=edge_colors,
alpha=0.4)
plt.xlabel('Fixed Acidity')
plt.ylabel('Alcohol')
aspect=1.2,
size=3.5)
g.map(plt.scatter,
alpha=0.5,
edgecolor='white',
linewidth=0.5,
fig = g.fig
fig.subplots_adjust(top=0.8, wspace=0.3)
sns.boxplot(data=data,
ax=ax)
ax.set_xlabel("Wine Quality",size=12,alpha=0.8)
# Pre-format DataFrame
sns.boxplot(data=stats_df)
#Groupby # Merge
# data.head()
data.isnull().sum()
# #clean
data.dropna(how="any", inplace=True)
data.isnull().sum()
data.head()
#Scaling :
# minmax_processed = preprocessing.MinMaxScaler().fit_transform(data)
# print (data.columns)
# data.head()
minmax_processed = preprocessing.MinMaxScaler().fit_transform(data)
df_numeric_scaled = pd.DataFrame(minmax_processed)
df_numeric_scaled.head()
#Elbow
#K-means
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_numeric_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
kmeans = KMeans(n_clusters=10, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df_numeric_scaled)
# pred_y = kmeans.fit_predict(X)
# plt.scatter(X[:,0], X[:,1])
# plt.show()
kmeans.cluster_centers_
kmeans.labels_
len(kmeans.labels_)
data['cluster'] = kmeans.labels_
data.head()
# to Excel Save
plt.figure(figsize=(12,7))
x=axis.set_xlabel("Cluster Number")
data.groupby(['cluster']).mean()
# tips = sns.load_dataset("tips")
data.groupby(['cluster']).median()
data[data['cluster']==size_array.index(sorted(size_array)[0])].sample(5)
data[data['cluster']==size_array.index(sorted(size_array)[1])].sample(5)
DF_Cluster = data.groupby(['cluster']).mean()
DF_Cluster.head()
x=ax.set_ylabel("Cluster Number")
#Hue Clusters with Them
import numpy as np
x = np.random.standard_normal(1000)
yy = stats.norm.pdf(xx)
# and plot on the same axes that seaborn put the histogram
target_0 = data.loc[data['cluster'] == 0]
target_1 = data.loc[data['cluster'] == 1]
target_2 = data.loc[data['cluster'] == 2]
target_6 = data.loc[data['cluster'] == 6]
# sns.plt.show()
unique_vals = data['cluster'].unique() # [0, 1, 2]
# fig.legend(labels=['test_label1','test_label2'])
Subplot Data Columns >> For Loop Clusters Displot
I used the following code to create a synthetic dataset which appears to match yours:
import pandas
import numpy
import seaborn as sns
import matplotlib.pyplot as plt
dfs = []
for A0 in A0s:
    V_w_dr = numpy.sin(A0*omega)
    V_w_tr = numpy.cos(A0*omega)
    dfs.append(pandas.DataFrame({'omega': omega,
                                 'V_w_dr': V_w_dr,
                                 'V_w_tr': V_w_tr,
                                 'A0': A0}))
dataframe = pandas.concat(dfs, axis=0)
Then you can do what you want (thanks to @mwaskom in the comments for sharey='row',
margin_titles=True):
melted = dataframe.melt(id_vars=['A0', 'omega'], value_vars=['V_w_dr', 'V_w_tr'])
g = sns.FacetGrid(melted, col='A0', hue='A0', row='variable', sharey='row',
margin_titles=True)
g.map(plt.plot, 'omega', 'value')
You need melt for reshape with seaborn.factorplot:
df = df.melt('X_Axis', var_name='cols', value_name='vals')
#alternative for pandas < 0.20.0
#df = pd.melt(df, 'X_Axis', var_name='cols', value_name='vals')
g = sns.factorplot(x="X_Axis", y="vals", hue='cols', data=df)
Sample:
df = pd.DataFrame({'X_Axis':[1,3,5,7,10,20],
'col_2':[.4,.5,.4,.5,.5,.4],
'col_3':[.7,.8,.9,.4,.2,.3],
'col_4':[.1,.3,.5,.7,.1,.0],
'col_5':[.5,.3,.6,.9,.2,.4]})
print (df)
X_Axis col_2 col_3 col_4 col_5
0 1 0.4 0.7 0.1 0.5
1 3 0.5 0.8 0.3 0.3
2 5 0.4 0.9 0.5 0.6
3 7 0.5 0.4 0.7 0.9
4 10 0.5 0.2 0.1 0.2
5 20 0.4 0.3 0.0 0.4
import pandas
import seaborn
from matplotlib import pyplot

df = pandas.DataFrame({
    'Factor': ['Growth', 'Value'],
    'Weight': [0.10, 0.20],
    'Variance': [0.15, 0.35]
})
fig, ax1 = pyplot.subplots(figsize=(10, 10))
tidy = df.melt(id_vars='Factor').rename(columns=str.title)
seaborn.barplot(x='Factor', y='Value', hue='Variable', data=tidy, ax=ax1)
seaborn.despine(fig)
Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from
its "wide format" (one column per measurement type) into long format (one column for all
measurement values, one column to indicate the type) is pandas.melt. Given
a data_preproc structured like yours, filled with random values:
num_rows = 20
years = list(range(1990, 1990 + num_rows))
data_preproc = pd.DataFrame({
'Year': years,
'A': np.random.randn(num_rows).cumsum(),
'B': np.random.randn(num_rows).cumsum(),
'C': np.random.randn(num_rows).cumsum(),
'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with
sns.lineplot(x='Year', y='value', hue='variable', data=pd.melt(data_preproc, ['Year']))
print(df)
sns.set(style='ticks', color_codes=True)
g = sns.FacetGrid(df, col="PORTFOLIO", col_wrap=4, height=4)
g = g.map(plt.plot, 'DATE', 'IRR', color='#FFAA11')
g = g.map(plt.plot, 'DATE', 'TWR', color='#22AA11')
plt.show()
Subplots
https://www.kaggle.com/sohailkhan/pandas-plotting-and-visualization
f, axes = plt.subplots(1, 2)
Where axes is an array with each subplot.
Then we tell each plot in which subplot we want them with the argument ax.
sns.boxplot( y="b", x= "a", data=df, orient='v' , ax=axes[0])
sns.boxplot( y="c", x= "a", data=df, orient='v' , ax=axes[1])
And the result is:
https://seaborn.pydata.org/examples/distplot_options.html
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.setp(axes, yticks=[])
plt.tight_layout()
target_0 = data.loc[data['cluster'] == 0]
target_1 = data.loc[data['cluster'] == 1]
target_2 = data.loc[data['cluster'] == 2]
target_6 = data.loc[data['cluster'] == 6]
f, axes = plt.subplots(2, 3)
# sns.plt.show()
Difficult solutions:
f, axes = plt.subplots(1, 3)
j=0
data_t1.head()
columnSeriesObj = data_t1[column]
print(target[[column]])
# print(target.column)
sns.distplot(target[[column]],ax=axes[0,j],hist=False,rug=True,label="Cluster" + str(i))
j += 1
# data_t1 = data.loc[:, ['RXLEV DL > -95 dBm (%)','RXQUAL DL > 4 GSM (%)','Traffic Level Average (E)']]
f,axes = plt.subplots(1, 2)
for ix, cx in enumerate(c):
    sns.distplot(target[[cx]], hist=False, rug=True, label="Cluster" + str(i), ax=axes[ix])