What is Clustering?
Clustering is not a difficult concept to grasp. Basically, when you cluster a group of objects, you’re dividing them into a set of subgroups wherein each subgroup’s members are close to one another according to some measure.
One of the most basic ways we can perform clustering is the K Means Clustering algorithm.
K Means Clustering Algorithm Explained
Imagine you’ve got a set of points, and you want to cluster them into K=2 groups.
The first thing you’ve gotta do is to pick K=2 points at random. These two points will serve as the initial centroids for the two clusters. More on centroids later. Let’s represent our two clusters by green triangle and yellow diamonds. Suppose we pick the following two points as the initial centroids of our clusters.
Next, we go over every point and assign each to the cluster of the centroid it is nearest to.
As we can see, every point now belongs to a cluster. The next thing we do is to calculate the centroid of each cluster. (That’s why it’s called the K means algorithm.) Let’s label the succeeding cluster centroids by x marks.
As before, we go over each point and assign each to the cluster whose centroid it is nearest to.
We see here that one of the green triangles becomes a yellow diamond since it’s nearer to the yellow diamonds’ centroid than to the green triangles’.
We then repeat the process of
- calculating the centroid of each cluster, and
- assigning each point to the cluster whose centroid it is nearest to,
until a stopping condition is satisfied. We could check if no point changes cluster in the next iteration or if the average inter-cluster error changes only very slightly.
In a nutshell, that’s the K Means Clustering algorithm.
If we continue our example and compute the next centroids,
Going over each point and assigning clusters,
Again, a green triangle converts to a yellow diamond. Now, calculating the centroids,
We see that no point transfers cluster in this iteration. Thus, we halt the K means clustering algorithm and consider the resulting clusters as final.
An interesting application of the K Means Clustering algorithm is image compression. Suppose that you have an image but the size is a bit too big. In other words, you want to reduce the file size of the image. How can you do it?
One way to do this is to reduce the set of colors in the image. A color of a pixel is usually stored as a tuple of 3 integers – RGB – which describes the pixel’s relative levels of Red, Green, and Blue. Levels can be go from 0 to 255.
Since each pixel stores 3 integers to describe color, there could possibly be a large set of colors in a single image. In theory, there are actually 16,777,216 possible color choices for a pixel. Thus, a highly colorful image holds a ton of data.
By applying the K Means Clustering algorithm on the RGB color space of an image, we can reduce its size by picking a set of K representative colors (a.k.a. the centroids) to describe the color space of the whole image. Specifically, we replace each pixel’s color by the centroid that it is closest to. By reducing the set of colors to K, we are, in effect, introducing error in our compression, and so color compression is not a lossless type of compression.
As an example, we can apply K Means Clustering algorithm to the following image of Copenhagen:
If we choose 5 colors (K=5),
And finally, 40 colors,
The larger the K we select, the closer the colors of our compressed image are to those in the original image. On the other hand, the smaller K is, the smaller the image’s file size. For K=5, the image’s size is 679 KB, while for K=40, it is 2.1 MB. So we can see that there’s a trade-off between color accuracy and file size.
The K Means Clustering algorithm is a ridiculously simple but powerful machine learning tool that has found great application in a myriad of domains. As we saw in our example, clustering can be used to reduce the color space of an image to a set of K colors.
I used the K Means Clustering algorithm in my La La Land Color Analysis to reduce the color space of each movie frame to a set of 50 colors. Check out how I did it here. The final product is a spectrum of the movie over its runtime.