Author: Sara McCarney, Senior Software Engineer | Reading time: 5 minutes
K-means Clustering is one of several available clustering algorithms and can be traced back to Hugo Steinhaus in 1956. K-means is a non-supervised Machine Learning algorithm, which aims to organize data points into K clusters of equal variance. It is a centroid-based technique. K-means is one of the fastest clustering algorithms available.
Unsupervised Machine Learning
Machine Learning (ML) algorithms may be categorized into two general groups based on their learning approach: supervised and unsupervised. Supervised learning requires labelled data as input, with the model attempting to learn how the data corresponds to its label. Decision tree models and Multi-Layer Perceptron Neural Networks are examples of supervised machine learning models.
On the other hand, unsupervised learning requires no labelling of the input data. Instead, the model groups or detects similarities, patterns or anomalies within the unlabelled data. K-means is one such unsupervised learning method that aims to group similar data points in clusters. tSNE, a dimensionality reduction algorithm, is another example of unsupervised learning.
How it works - more detail
Before the algorithm starts, the number of clusters k should be specified by the user. Once specified, the K-means algorithm works by initializing the positions of the k cluster centroids (cluster centers). These centroid positions can have random initialization, or a special algorithm can be used to speed up the convergence. One such suitable algorithm is the kmeans++ algorithm.
Following the initialization of the cluster positions, each sample in your data set will be assigned to a cluster. Each data point is assigned to its closest centroid. This assignment is based on computing the distances between the cluster centroid and the data point.
Once all the data points are assigned to a cluster, the cluster centroids are recomputed. Cluster centroids are calculated by taking the mean of the cluster’s data points. The process now repeats, and the data points are assigned to their closest cluster based on the new cluster positions.
Over the set of samples, this translates to minimizing the inertia or within-cluster sum-of-squares criterion (SSE).
Where n is the number of samples, μ is the centroid mean, and C is the number of centroids. The number of iterations required is specified by a stopping criterion. For example, the algorithm might stop when the samples do not redistribute to new clusters, i.e. the centroid positions do not change.
At this point, the SSE of the model is collected, and the process can be terminated. However, it is recommended to restart it several times with different centroid initializations and select the best-performing model. This is because K-means can fall in local minima.
Advantages & Limitations of K-means clustering
When should you use K-means clustering over another algorithm? K-means is the most commonly used method in data analytics. This is largely due to its inexpensive compute, as well as being straightforward to implement and understand. It can also scale well with large data and will always converge.
K-means clustering can also be used as a pre or post-processing step for other machine-learning algorithms. For example, PCA Analysis can be used prior to K-means as a feature extraction step to reveal the clusters.
However, it is worth considering that other methods might be better suited to your data set. Examples of other techniques include hierarchical clustering or expectation maximization techniques.
Overcoming the limitations of K-Means Clustering
Limitations of K-Means Clustering
Number of clusters must be chosen manually
Outliers can have strong influence on results
Initial centroid positions have strong influence on results
Does not cluster well with clusters of varying sizes and densities
Remove outliers before clustering
Restart the clustering multiple times to find an optimal SSE
K-means Clustering as Data Mining
Clustering techniques are commonly adopted by data science teams and across a wide range of domains. They help us gain insights into how the features in our data relate to each other. Gaining new insights into our existing data is known as Data Mining. Using the clustering result, data mining can uncover patterns and trends existing in the data.
Data mining can be used across many industries such as fraud detection. In precision medicine, data mining can help determine which patients will respond to treatments.
Algorithms like the kmeans function is just one way we can perform data mining.
K-means Clustering in Precision Medicine: A Case Study
K-Means assigns data points to one of two clusters on a 2D plot
Precision Medicine aims to find alternatives to a “one size fits all” approach to healthcare by dividing target groups into smaller semantic sub-groups, where treatment efficacy and patient outcomes may vary significantly.
Precision medicine allows patients to be treated according to their subgroup, thereby improving overall patient outcomes. Going beyond the traditional subtyping approaches, we can employ unsupervised machine-learning techniques to identify subgroups of patients based on their molecular data.
In Gastric adenocarcinoma, there is increasing evidence to suggest that dysregulated stem cell biology is strongly associated with disease recurrence and treatment resistance. A recent study characterized the mRNA expression patterns of stem-like genes in gastric cancer samples.
We have further investigated two of the key genes reported in this study, CTNNB1 and NOTCH1. Using the K-means algorithm on the TCGA STAD RNA-seq dataset, the algorithm assigned each patient to a cluster. Comparing the two clusters on a Kaplan-Meier curve, we observed that there is a statistically significant difference in overall survival in the patients between clusters (figure 2).
Kaplan-Meier Curve showing survival analysis of gene clusters identified by k-means (Log-rank : p = 0.011)
This finding builds upon the study by Jafari et al and supports other findings in the literature. Our finding also demonstrates the utility of unsupervised clustering techniques such as K-means in precision medicine. The clusters can identify ‘molecular subtypes’ that may yield prognostic or predictive value (data mining).