Lecture - 48
K-means Clustering
So, we are into the last theory lecture of this course. I am going to
talk about K-means clustering today. Following this lecture there will
be a demonstration of this technique on a case study by the TAs of
this course. So, what is K-means clustering? K-means clustering is
a technique that you can use to cluster, or partition, a certain number of
observations, let us say N observations, into K clusters.
Contrast this with classification. In classification the data points are
labelled, and the classification algorithm's job is to find decision
boundaries between the different classes; those are supervised
algorithms. An example would be the K nearest neighbour algorithm that
we saw before, where we have labelled data and, when you get a test data
point, you bin it into the most likely class that it belongs to. Many
clustering algorithms, however, are unsupervised algorithms, in the
sense that you have N observations, as we mentioned here, but they are
not labelled into different classes.
So, to give a very simple example, let us say you have observations
x1, x2, all the way up to xN, and just for the sake of argument let us
say that we are going to partition this data into two clusters. So,
there is going to be one set for cluster 1 and another set for
cluster 2. The job of the K-means clustering algorithm is to put these
data points into these two bins: some of them, say x3 or x4, could go
into one bin, others, like xN, could go into the other bin, and so on.
So, all you are doing is taking these data points and putting them
into two bins, and while you do it, what you are really looking for in a
clustering algorithm is that all the data points within a bin are alike.
If I take two data points from the same bin, they will be more like each
other, but if I take one data point from each bin, they will be in some
sense unlike each other. If you cluster the data this way, then we can
say that all the data points in a group share certain common
characteristics, and we can make some judgments based on what those
characteristics are.
Let us now work through a small numerical example with 8 data points.
If you simply plot these points, you would see that, notionally, there
is a very clear separation of this data into two clusters.
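As a quick sketch of this, the snippet below plots eight illustrative
two-dimensional points (hypothetical values chosen for this note, not the
exact numbers on the lecture slide); the scatter plot makes the two groups
visually obvious.

```python
import numpy as np
import matplotlib.pyplot as plt

# Eight illustrative points: four sit near (1, 1) and four near (3, 3).
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.2], [1.2, 1.1],
                   [3.0, 3.0], [4.2, 3.08], [4.3, 3.2], [4.3, 3.3]])

plt.scatter(points[:, 0], points[:, 1])
plt.xlabel("x")
plt.ylabel("y")
plt.title("Two visually separated groups")
plt.show()
```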
To start the algorithm, you pick two cluster centres; here the points
(1, 1) and (3, 3) from the data itself are used. You could do this, or,
like I mentioned before, you could pick a point which is not there in
the data at all, and we will see the impact of choosing the cluster
centres later in this lecture.
Now that we have two cluster centres, what the algorithm does is the
following. For every data point in our data set, it first finds the
distance of that data point from each one of these cluster centres. So,
in this table we have distance one, which is the distance from the point
(1, 1), and distance two, which is the distance from the point (3, 3).
Now, if you notice, since the first point in the data itself is (1, 1),
the distance of (1, 1) from (1, 1) is 0. So, you see that distance one
is 0 and distance two is the distance of the point (1, 1) from (3, 3).
Now, since we want all the points that are like (1, 1) to be in one
group and all the points that are like (3, 3) to be in the other group,
we use distance as a metric for likeness. So, if a point is closer to
(1, 1), then it is more like (1, 1) than (3, 3), and similarly, if a
point is closer to (3, 3), it is more like (3, 3) than (1, 1). So, we
are using a very simple logic here.
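To make the distance-and-assignment step concrete, here is a minimal
sketch in Python using NumPy. The eight points are illustrative values
(chosen so that, later on, the group means come out close to the numbers
quoted in the lecture); they are not the exact data from the slide.

```python
import numpy as np

# Illustrative stand-ins for the 8 data points on the slide.
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.2], [1.2, 1.1],
                   [3.0, 3.0], [4.2, 3.08], [4.3, 3.2], [4.3, 3.3]])

# The two initial cluster centres chosen from the data.
centres = np.array([[1.0, 1.0], [3.0, 3.0]])

# Distance of every point from every centre: an 8 x 2 table
# (column 0 is "distance one", column 1 is "distance two").
distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)

# Each point goes into the group whose centre it is closest to.
groups = np.argmin(distances, axis=1)   # 0 means group 1, 1 means group 2
print(np.round(distances, 3))
print(groups)                           # expect [0 0 0 0 1 1 1 1]
```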
Now, the centres of these groups were chosen randomly to begin with,
but at this point we have more information with which to update them,
because we know which data points belong to group 1 and which belong to
group 2. So, while the initial representation for group 1 was (1, 1), a
better representation would be the mean of the 4 samples assigned to it;
similarly, the initial representation for group 2 was (3, 3), but a
better representation would be the mean of all the points assigned to
it. So, that is step 3.
So, we compute the new means. Because group 1 has points 1, 2, 3 and 4,
we take the mean of those points: the x coordinate is 1.075 and the y
coordinate is 1.05. Similarly, for group 2 we take the mean of data
points 5, 6, 7 and 8, and you will see that mean here. So, notice how
the means have been updated from (1, 1) and (3, 3). Because we chose a
very simple example, the update is only slight; the centres move a
little bit.
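A small sketch of this mean-update step, using the same illustrative
points as before (split according to the assignment from the previous
snippet):

```python
import numpy as np

# Points assigned to each group after the first assignment step.
group1 = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.2], [1.2, 1.1]])
group2 = np.array([[3.0, 3.0], [4.2, 3.08], [4.3, 3.2], [4.3, 3.3]])

# Step 3: replace each initial centre with the mean of its group's points.
new_centre_1 = group1.mean(axis=0)   # -> [1.075, 1.05], replacing (1, 1)
new_centre_2 = group2.mean(axis=0)   # -> [3.95, 3.145], replacing (3, 3)
print(new_centre_1, new_centre_2)
```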
Now, you redo the same computation. We still have the same eight
observations, but now group 1 is represented not by (1, 1) but by
(1.075, 1.05), and group 2 is represented by (3.95, 3.145) instead of
(3, 3). So, for each of these points you can again calculate distance
one and distance two. Notice that previously the first distance was 0
because the representative point was (1, 1); now that the representative
point has become (1.075, 1.05), this distance is no longer 0, but it is
still a small number. So, for each of these data points you calculate
the distances to the new means, again use the same logic to see whether
distance one or distance two is smaller and, depending on that, you
assign the groups.
Now, if it were the case that, because of the new means, some point
moved from group 1 into group 2, or some points moved from group 2 into
group 1, then correspondingly you collect all the points now in group 1
and calculate a new mean, collect all the points now in group 2 and
calculate a new mean, and repeat the process. You keep doing this until
the change in the means is negligible or there is no reassignment of
points; when one of these two things happens, you can stop. So, this is
a very simple technique.
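Putting the two steps together, here is a minimal sketch of the whole
loop, written from scratch under the same assumptions as the earlier
snippets (illustrative data, Euclidean distance, stopping when the
centres stop moving):

```python
import numpy as np

def kmeans(points, centres, tol=1e-6, max_iter=100):
    """Plain K-means: assign points to nearest centre, update means, repeat."""
    for _ in range(max_iter):
        # Distance of every point from every centre.
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        # Assign each point to its nearest centre.
        groups = np.argmin(distances, axis=1)
        # Recompute each centre as the mean of the points assigned to it
        # (a real implementation would also guard against empty groups).
        new_centres = np.array([points[groups == k].mean(axis=0)
                                for k in range(len(centres))])
        # Stop once the centres have essentially stopped moving.
        if np.linalg.norm(new_centres - centres) < tol:
            return new_centres, groups
        centres = new_centres
    return centres, groups

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.2], [1.2, 1.1],
                   [3.0, 3.0], [4.2, 3.08], [4.3, 3.2], [4.3, 3.3]])
centres, groups = kmeans(points, centres=np.array([[1.0, 1.0], [3.0, 3.0]]))
print(centres)   # close to [[1.075, 1.05], [3.95, 3.145]]
print(groups)    # [0 0 0 0 1 1 1 1]
```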
So, in that sense, an algorithm like this allows you to work with just
raw data, without any annotation. What I mean by annotation is any
additional information in the form of labels for the data. You can take
multi-dimensional data and start making judgments about how it might be
organized, for example how many clusters there are; maybe there are 5
groups in this multi-dimensional data, which would be impossible to find
just by looking at an Excel sheet. But once you run an algorithm like
this and it organizes the data into, say, 5 or 6 or 7 groups, that
allows you to go and probe what these groups might mean in terms of
process operations and so on. So, it is an important algorithm in that
sense.
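In practice you would usually not code the loop yourself. As one
possible option (my choice of library, not something specified in the
course), scikit-learn's KMeans runs the same iteration on
multi-dimensional data; the array X below is random placeholder data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))       # placeholder: 200 observations, 6 variables

# Ask for 5 clusters; n_init repeats the algorithm from several random starts.
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

print(model.labels_[:10])            # cluster index assigned to the first 10 rows
print(model.cluster_centers_.shape)  # (5, 6): one centre per cluster, per variable
```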
Now, we kept talking about the number of clusters. Till now I said you
let the algorithm know how many clusters to look for, but there is
something called the elbow method, where you look at the results of the
clustering for different numbers of clusters and use a graph to find an
optimum number of clusters. You will see a demonstration of this in the
case study.
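A sketch of the elbow method, assuming scikit-learn and matplotlib are
available: run K-means for several values of k and plot the total
within-cluster sum of squares (the inertia_ attribute) against k; the
bend, or "elbow", in the curve suggests a reasonable number of clusters.
The data here is synthetic, drawn around three made-up centres.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic data with three loose groups (made-up centres, for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.title("Elbow plot: the bend suggests the number of clusters")
plt.show()
```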
Consider first a good choice of initial centres, one near each natural
group. All the data points in one region are closer to one centre and
will be assigned to that group, and all the remaining points are closer
to the other centre and will be assigned to that group. You then
calculate the means, and after that no reassignment is possible; you
clearly recover the two clusters, very well separated.
Now consider a poor choice, say with one centre somewhere in the middle
of all the data and the other far away from every point. After the first
round of the clustering calculations, you will see that the first centre
might not even move, because the mean of all the points assigned to it
is already somewhere near the centre. So, that centre will never move,
the algorithm will say that all of the data points belong to it, and the
other centre will never have any data points assigned to it.
So, this is a very trivial case, but it still makes the point that, with
the same algorithm, depending on how you start your cluster centres you
can get different results. You would like to avoid situations like this,
and in a simple case like this they are actually easy to avoid, but when
you have lots of multi-dimensional data which you cannot visualize like
I showed you here, it turns out it is not really that obvious how you
should initialize the centres. There are ways of solving this problem,
but it is something to keep in mind: every time you run the algorithm,
if the initial guesses are different, you are likely to get at least
minor differences in your results.
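To see the effect of the starting point, one can run the same data with
purely random initial centres and different seeds; the final
within-cluster sum of squares may differ between runs. This is a sketch
using scikit-learn on synthetic data; its repeated restarts (n_init) and
smarter seeding (init="k-means++", the default) are examples of the
remedies referred to above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three groups (made-up centres, purely for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in [(0, 0), (3, 0), (1.5, 2.5)]])

# Purely random initial centres, a single run per seed: results may differ.
for seed in (0, 1, 2):
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(model.inertia_, 2))
```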
And the other important thing to notice is how I have been plotting this
data. Typically I have been drawing the data so that the clusters are,
in general, spherical. But suppose I have data where the groups are
elongated, with one stretched region belonging to one class and another
stretched region belonging to the other class. K-means clustering can
have difficulty with this kind of data, simply because two points within
the same class can be quite far from each other, while a point in one
class and a point in the other class might actually be closer together.
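The snippet below illustrates this limitation on the "two moons" toy
data set from scikit-learn (my choice of example, not one from the
lecture): the two classes are elongated and interleaved, and K-means,
which only looks at distances to the centres, tends to cut across them
rather than recover them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two elongated, interleaved groups: hard for a distance-to-centre method.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fraction of points put in the "wrong" moon (allowing for the arbitrary
# ordering of cluster labels); a sizeable value shows the clusters found
# do not match the true groups.
mismatch = (pred != true_labels).mean()
print(min(mismatch, 1 - mismatch))
```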
So, with this we come to the conclusion of the theory part of this
course on data science for engineers. There will be one more lecture,
which will demonstrate the use of K-means on a case study. With that,
all the material that we intended to cover will have been covered. I
hope this was a useful course for you.
This course has hopefully given you a good feel for the kind of thinking
and aptitude that you will need to follow up on data science, and also
the mathematical foundations that are going to be quite important as you
try to learn more advanced and complicated machine learning techniques,
either on your own or through other courses such as this that are
available.
Thanks again.