Author: Sara McCarney, Data Scientist

K-means Clustering is one of several available clustering algorithms and can be traced back to Hugo Steinhaus in 1956. K-means is a non-supervised Machine Learning algorithm, which aims to organize data points into K clusters of equal variance. It is a centroid-based clustering technique. K-means is one of the fastest clustering algorithms available.

Unsupervised Machine Learning

Machine Learning (ML) algorithms may be categorized into two general groups based on their learning approach: supervised and unsupervised. Supervised learning requires labelled data as input, with the model attempting to learn how the data corresponds to its label. Decision tree models and Multi-Layer Perceptron Neural Networks are examples of supervised machine learning models.

On the other hand, unsupervised learning requires no labelling of the input data. Instead, the model groups or detects similarities, patterns or anomalies within the unlabelled data. K-means is one such unsupervised learning method, aiming to group similar data points into clusters. t-SNE, a dimensionality reduction algorithm, is another example of unsupervised learning.

Algorithm summary


Figure 1. An example of K-means clustering by Kevin Arvai, with k clusters iterating through Steps 1-3

  1. Initialize centroids
  2. For each sample in the set of samples:
    • Assign the sample to its closest centroid based on the Euclidean distance
  3. For each cluster in the set of clusters:
    • Recalculate the cluster centroid based on the mean of the samples in the cluster
  4. Repeat steps 2-3 until a stopping criterion is fulfilled
  5. Determine the variance of each cluster and sum these to give the total SSE
  6. Repeat steps 1-5 with different initialization of the cluster centroid positions n_init times
  7. The initialization with the lowest SSE is selected as the final result
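
As a concrete illustration of Steps 1-5, here is a minimal NumPy sketch of a single K-means run with random initialization. The function and its defaults are illustrative only; in practice a library implementation such as scikit-learn's KMeans should be preferred, as it adds k-means++ seeding, multiple restarts and other optimisations.

```python
import numpy as np

def assign(X, centroids):
    # distance of every sample to every centroid, shape (n_samples, k)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), d

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct samples at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each sample to its closest centroid (Euclidean distance)
        labels, _ = assign(X, centroids)
        # Step 3: recalculate each centroid as the mean of its assigned samples
        # (this sketch assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 5: SSE (inertia) = sum of squared distances to the assigned centroid
    labels, d = assign(X, centroids)
    sse = (d[np.arange(len(X)), labels] ** 2).sum()
    return labels, centroids, sse
```

Steps 6-7 then amount to calling this function with several different seeds and keeping the result with the lowest SSE.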


How it works – more detail

Before the algorithm starts, the number of clusters k must be specified by the user. Once specified, the K-means algorithm works by initializing the positions of the k cluster centroids (cluster centers). These centroids can be initialized randomly, or a specialised algorithm such as k-means++ can be used to speed up convergence.

Following the initialization of the centroid positions, each sample in the data set is assigned to a cluster. This assignment is based on the Euclidean distance between the cluster centroid and the data point, with each data point assigned to its closest centroid. Once all the data points are assigned to a cluster, each cluster centroid is recalculated as the mean of its assigned data points.

The process now repeats, redistributing data points to their closest cluster based on the new centroid positions. Over the set of samples, this amounts to minimising the inertia, or within-cluster sum-of-squares criterion (SSE):

$$\mathrm{SSE} = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$

where n is the number of samples, x_i is the i-th sample, μ_j is a cluster centroid (the mean of its samples), and C is the set of centroids. The number of iterations is determined by a stopping criterion. For example, the algorithm might stop when the samples no longer redistribute to new clusters, i.e. the centroid positions no longer change.

At this point, the SSE of the model is recorded and the process can be terminated. However, because K-means can converge to a local minimum, it is recommended to restart it several times with different centroid initializations and select the best-performing model.
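
In scikit-learn this restart loop is built in: the n_init parameter controls how many initializations are tried, and the run with the lowest inertia (scikit-learn's name for the SSE) is kept. A short sketch on synthetic stand-in data, with illustrative parameter values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data

model = KMeans(
    n_clusters=4,      # k, chosen by the user
    init="k-means++",  # smarter seeding to speed up convergence
    n_init=10,         # restarts; the run with the lowest SSE is kept
    random_state=42,
).fit(X)

print(model.inertia_)  # SSE of the best of the 10 runs
print(model.labels_)   # final cluster assignment of each sample
```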

Advantages & Limitations of K-means clustering

When should you use K-means clustering over another clustering algorithm? K-means is the most commonly used clustering method in data analytics, largely because it is computationally inexpensive and straightforward to implement and understand. It also scales well to large data sets and is guaranteed to converge, although possibly to a local minimum.

K-means clustering can also be used as a pre- or post-processing step for other algorithms in a data analysis pipeline. For example, Principal Component Analysis (PCA) can be applied before K-means as a feature extraction step to help reveal the clusters. However, it is worth considering that other clustering methods, such as hierarchical clustering or expectation-maximization techniques, might be better suited to your data set.
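
As a sketch of that idea, PCA can be placed before K-means in a scikit-learn Pipeline. The stand-in data, component count and cluster count below are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, n_features=10, centers=3, random_state=0)  # stand-in data

pipeline = Pipeline([
    ("scale", StandardScaler()),   # K-means is distance-based, so put features on one scale
    ("pca", PCA(n_components=2)),  # feature extraction before clustering
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
labels = pipeline.fit_predict(X)   # cluster assignment per sample
```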

Overcoming the limitations of K-means clustering

Limitations of K-means clustering | Resolved by
Number of clusters must be chosen manually | Find an optimal number of clusters, e.g. with Elbow or Silhouette analysis
Outliers can have a strong influence on results | Remove outliers before clustering
Initial centroid positions have a strong influence on results | Restart the clustering multiple times to find an optimal SSE
Does not cluster well when clusters have varying sizes and densities | Consider another clustering algorithm, such as Gaussian mixture models
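
As a sketch of the first row's resolutions (the Elbow method and Silhouette analysis), the loop below fits K-means for a range of candidate k values on stand-in data, recording the SSE (plot it against k and look for the "elbow") and the average silhouette score (in [-1, 1]; higher is better) for each:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # stand-in data

sse, silhouette = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                             # within-cluster SSE, for the elbow plot
    silhouette[k] = silhouette_score(X, km.labels_)  # average silhouette score

best_k = max(silhouette, key=silhouette.get)         # k with the highest silhouette score
print(best_k)
```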


K-means Clustering as Data Mining

Clustering techniques are used ubiquitously in the data science arena and across a wide range of domains. Clustering helps us gain insights into how the features in our data relate to each other. Gaining new insights into our existing data is known as Data Mining.

K-means Clustering in Precision Medicine: A Case Study


Figure 2. K-Means assigns data points to one of two clusters on a 2D plot

Precision Medicine aims to find alternatives to a “one size fits all” approach to healthcare by dividing target groups into smaller sub-groups, between which treatment efficacy and patient outcomes may vary significantly.

Precision medicine allows patients to be treated according to their subgroup, thereby improving overall patient outcomes. Going beyond the traditional subtyping approaches, we can employ unsupervised machine learning techniques to identify subgroups of patients based on their molecular data.

In gastric adenocarcinoma, there is increasing evidence to suggest that dysregulated stem cell biology is strongly associated with disease recurrence and treatment resistance. A recent study characterized the mRNA expression patterns of stem-like genes in gastric cancer samples.

We further investigated two of the key genes reported in this study, CTNNB1 and NOTCH1, applying the K-means clustering algorithm to the TCGA STAD RNA-seq dataset to assign each patient to a cluster. From Figure 2, we can see that patients in the navy cluster (cluster 1) express both genes more highly than those in the green cluster (cluster 0). Comparing the two clusters on a Kaplan-Meier curve, we observed a statistically significant difference in overall survival between the clusters (Figure 3).
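
For illustration only, a sketch of this kind of two-gene analysis might look like the following. The expression values below are synthetic stand-ins, not the TCGA STAD data, and the actual study code is not reproduced here:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = pd.DataFrame({
    "CTNNB1": rng.normal(loc=[5] * 50 + [8] * 50, scale=1.0),
    "NOTCH1": rng.normal(loc=[4] * 50 + [7] * 50, scale=1.0),
})  # synthetic stand-in expression values for 100 patients

# cluster patients on the two stem-like genes, as in Figure 2
expr["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    expr[["CTNNB1", "NOTCH1"]]
)

# The clusters can then be compared on a Kaplan-Meier curve, e.g. with the
# lifelines package (KaplanMeierFitter, logrank_test), given survival data.
```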

Figure 3. Kaplan-Meier curve showing the statistically significant difference in overall survival between the CTNNB1/NOTCH1 clusters identified by K-means (log-rank test: p = 0.011)

This finding builds upon the study by Jafari et al. and supports other findings in the literature regarding the prognostic value of stem-like genes. It also demonstrates the utility of unsupervised clustering techniques such as K-means in precision medicine for identifying ‘molecular subtypes’ that may yield prognostic or predictive value.

