Author: Sara McCarney, Data Scientist
K-means clustering is one of several available clustering algorithms and can be traced back to Hugo Steinhaus in 1956. K-means is an unsupervised machine learning algorithm that aims to organize data points into K clusters of equal variance; it is a centroid-based clustering technique.
K-means is one of the fastest clustering algorithms available and continues to be widely used in modern data science and bioinformatics applications. Further demonstrating its popularity, a search of k-means in the NCBI online repository GEO returns 798 datasets and 5,648 publication results where the algorithm has been applied.
Machine Learning (ML) algorithms may be categorized into two general groups based on their learning approach: supervised and unsupervised. Supervised learning requires labelled data as input, with the model attempting to learn how the data corresponds to its label. Decision tree models and Multi-Layer Perceptron Neural Networks are examples of supervised machine learning models.
On the other hand, unsupervised learning requires no labelling of the input data. Instead, the model groups the data or detects similarities, patterns or anomalies within it. K-means is one such unsupervised learning method, aiming to group similar data points into clusters. t-SNE, a dimensionality reduction algorithm, is another example of unsupervised learning.
Figure 1. An example of K-means clustering by Keven Arvai, showing the k cluster centroids iterating through Steps 1-3.
Before the algorithm starts, the number of clusters, k, must be specified by the user. Once specified, the K-means algorithm begins by initializing the positions of the k cluster centroids (cluster centers). The centroids can be initialized randomly, or a dedicated algorithm can be used to speed up convergence. One such suitable algorithm is k-means++.
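As a concrete illustration, specifying k and the initialization strategy might look like the following sketch using scikit-learn (the library choice, toy data, and parameter values are my own assumptions, not from the article):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points generated around 3 true centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k (n_clusters) must be chosen up front. init="k-means++" spreads the
# initial centroids apart to speed up convergence; init="random" would
# use purely random initialization instead.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

After fitting, `kmeans.cluster_centers_` holds the final centroid positions and `labels` gives the cluster index assigned to each point.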
Following the initialization of the centroid positions, each data point in the data set is assigned to its closest centroid. This assignment is based on computing the distance between the data point and each cluster centroid.
Once all the data points are assigned to a cluster, the cluster centroids are recomputed by taking the mean of each cluster's data points. The process then repeats, and the data points are reassigned to their closest cluster based on the new centroid positions.
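The initialize / assign / recompute loop described above can be sketched from scratch with NumPy (the function and variable names are my own, and this is illustrative rather than production-ready):

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keeping the old position if a cluster ends up empty).
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)]
        )
        # Stop when the assignments no longer move the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two tight, well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.2, (25, 2)), rng.normal(6, 0.2, (25, 2))])
labels, centroids = kmeans_lloyd(X, k=2)
```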
Over the set of samples, this translates to minimizing the inertia, or within-cluster sum-of-squares (SSE) criterion:

$$\mathrm{SSE} = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^{2}$$

where n is the number of samples, μ<sub>j</sub> is the mean of centroid j, and C is the set of centroids. The number of iterations required is determined by a stopping criterion. For example, the algorithm might stop when the samples no longer move between clusters (the centroid positions stop changing).
At this point, the SSE of the model is collected, and the process can be terminated. However, it is recommended to restart the algorithm several times with different centroid initializations and select the best-performing model, because K-means can fall into local minima.
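A minimal sketch of this restart strategy, assuming scikit-learn (whose `n_init` parameter performs the same restarts internally; the data and number of restarts here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=7)

# Run K-means 10 times with different random centroid initializations and
# keep the model with the lowest SSE (exposed by scikit-learn as inertia_).
best = None
for seed in range(10):
    model = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    if best is None or model.inertia_ < best.inertia_:
        best = model
```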
When should you use K-means clustering over another clustering algorithm? K-means is the most commonly used clustering method in data analytics, largely because it is computationally inexpensive and straightforward to implement and understand. It also scales well to large data sets and will always converge.
K-means clustering can also be used as a pre- or post-processing step for other machine learning algorithms. For example, Principal Component Analysis (PCA) can be applied before K-means as a feature-extraction step to help reveal the clusters.
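A PCA-then-K-means pipeline might look like this sketch (scikit-learn and the Iris data set are my own choices for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X = load_iris().data  # 150 samples x 4 features

# Reduce the 4 features to 2 principal components, then cluster
# the points in the reduced space.
pipeline = make_pipeline(
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
```

Clustering in the reduced space can both speed up K-means and make the resulting clusters easier to visualize on a 2D plot.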
However, it is worth considering that other clustering methods might be better suited to your data set. Examples of other clustering techniques include hierarchical clustering or expectation maximization techniques.
Overcoming the limitations of K-Means Clustering
| Limitation of K-means clustering | Resolved by |
| --- | --- |
| Number of clusters must be chosen manually | Find an optimal number of clusters, e.g. Elbow and Silhouette analysis |
| Outliers can have a strong influence on results | Remove outliers before clustering |
| Initial centroid positions have a strong influence on results | Restart the clustering multiple times to find an optimal SSE |
| Does not cluster well when clusters vary in size and density | Consider another clustering algorithm, such as Gaussian mixture models |
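The Elbow and Silhouette analyses mentioned above can be sketched as follows, assuming scikit-learn (the data and the range of k values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: plot inertia (SSE) against k and look for the "bend".
    # Silhouette score: ranges from -1 to 1; higher means better-separated clusters.
    print(k, round(model.inertia_, 1), round(silhouette_score(X, model.labels_), 3))
```

In practice you would plot these values: the elbow is the k after which inertia stops dropping sharply, while the best silhouette score directly suggests a candidate k.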
Clustering techniques are commonly adopted by data science teams across a wide range of domains. Clustering helps us gain insight into how the features in our data relate to each other, and gaining new insights from existing data in this way is known as data mining. Using the clustering result, data mining can uncover patterns and trends in the data.
Data mining is applied across many industries, for example in fraud detection. In precision medicine, it can help determine which patients will respond to a treatment. Clustering algorithms such as k-means are just one way to perform data mining.
Figure 2. K-Means assigns data points to one of two clusters on a 2D plot
Precision medicine aims to find alternatives to a "one size fits all" approach to healthcare. This typically happens by dividing target groups into smaller, more homogeneous sub-groups. Treatment efficacy and patient outcomes may vary significantly between these sub-groups.
Precision medicine allows patients to be treated according to their subgroup, thereby improving overall patient outcomes. We can employ unsupervised machine learning techniques to identify subgroups of patients based on their molecular data.
There has been increasing evidence to suggest that dysregulated stem cell biology is strongly associated with disease recurrence and treatment resistance. A recent study characterized the mRNA expression patterns of stem-like genes in gastric cancer samples.
We have further investigated two of the key genes reported in this study, CTNNB1 and NOTCH1. Applying the K-means clustering algorithm to the TCGA STAD RNA-seq dataset, we assigned each patient to a cluster and compared the two clusters on a Kaplan-Meier curve (figure 2). There is a statistically significant difference in overall survival between the two clusters of patients.
Kaplan-Meier curve showing survival analysis of the gene clusters identified by k-means (log-rank p = 0.011)
This finding builds upon the study by Jafari et al. and supports other findings in the literature. It also demonstrates the utility of unsupervised clustering techniques such as K-means in precision medicine: the clusters can identify 'molecular subtypes' that may yield prognostic or predictive value (data mining).
Copyright © 2020 Sonrai Analytics Ltd