What is tSNE and when should I use it?
Author: Matt Lee, Senior Data Scientist
T-distributed Stochastic Neighbourhood Embedding (tSNE) is an unsupervised Machine Learning algorithm developed in 2008 by Laurens van der Maaten and Geoffery Hinton. It has become widely used in bioinformatics and more generally in data science to visualise the structure of high dimensional data in 2 or 3 dimensions. It is the best known of a group of algorithms called Manifold Learning which are used for non-linear dimensionality reduction. It is sometimes contrasted with Principal Component Analysis (PCA).
What Data can I use it for?
It is incredibly versatile - this Google techtalk demonstrated excellent results on the ubiquitous MNIST handwritten digit classification dataset, more complex images in the CIFAR-10 dataset, audio of speech and word clusters generated from natural language processing (NLP). The technique has since been used to visualise structure in a range of data sources in bioinformatics including: Single-Cell Transcriptomics, RNA/DNA-Sequencing, DNA methylation and Proteomics to name just a few. Visualising high-dimensional data such as these is impossible without first reducing the dimensions.
T-SNE can be used to give visual feedback on whether there is sufficient signal in the features available to train an ML model before embarking on the training cycle. This can be seen in our case study on biomarker discovery where it was used to determine if the bladder cancer patients were distinguishable from a control group prior to training with XGBoost.
T-SNE can also be used with imaging data; this works best when features are extracted with Convolutional Neural Networks such as ResNet and used as the input to T-SNE. Ultimately this allows images to be clustered based on their similarity with useful insights. This can be seen in our case study on grouping compounds with a similar mechanism of action for high content screening.
Curse of Dimensionality
A phenomenon well known within the Data Science community is the so-called 'Curse of Dimensionality'. Dimensionality here refers to the number of features for a given observation. This observation particularly pertains to bioinformatics data where features (e.g. variants in a VCF file, methylation scores in DNA Methylation Profile) can number in the hundreds of thousands whilst the observations typically number in the tens or hundreds.
Visualising this data is near impossible. Furthermore training machine learning models on this high dimensional data is prone to a phenomenon known as 'Overfitting' - the model performs well on the training data but does not generalise well to new data.
Dimensionality Reduction Techniques
Dimensionality reduction techniques reduce the effects of the Curse of Dimensionality. There are a number of ways to reduce the dimensionality of a dataset, including Isomap, Multi-Dimensional Scaling (MDS), Locally Linear Embedding, Spectral Embedding and t-Distributed Stochastic Neighbour Embedding (tSNE), which is the focus of this article.
These algorithms belong to an overarching class of algorithms known as Manifold Learning, which are predominantly stochastic algorithms. This means the same algorithm and data will result in different embeddings given different initialisations.
Matrix decomposition is another group of dimensionality reduction algorithms with PCA being the best known. These are deterministic- they produce the same result given the same data. They also have few if any parameters to tune and they often process in a much shorter time.
It allows the visualisation of the underlying local structure of high-dimensional data in 2 or 3 dimensions, allowing the results to be plotted in a simple scatter plot. It also builds upon SNE which was introduced in 2002. The ‘t’ prefix was introduced to distinguish the algorithm using the T-distribution as opposed to the Gaussian distribution to compute the similarity between points in the low-dimensional space.
The T-distribution has longer tails than the Gaussian distribution. This alleviates the ‘crowding problem’ present in the original algorithm. The crowding problem is when the euclidean distance between clusters is large compared to the distance between intra-cluster points. This results in losing distinguishing detail between intra-cluster points.
By measuring the pairwise distances in the high-dimensional and low-dimensional spaces using different probability distributions, the intra-cluster detail is more optimally visualised.
tSNE optimises over a set number of iterations, using gradient descent with Kullback-Leibler divergence as the cost function. The algorithm is stochastic, therefore multiple executions with different random seeds will yield different results. It is acceptable to run the algorithm several times and pick the embedding with the lowest KL divergence.
The K-L divergence is defined as:
Where pij and qij are the pairwise probabilities in the high dimensional space and low dimensional space respectively. The reason that it is more accurate in resolving local structure is that the cost function is proportional to the pairwise probability in the high dimensional space. Points that are relatively distant have a much lower impact on the cost function.
Example: Breast Cancer Wisconsin (Diagnostic) Data Set.
This dataset comprises features computed from cell nuclei in images from fine needle aspirates of breast masses. From the embedding it can be seen that there are significant differences between benign (blue) and malignant masses (orange). PCA was used prior to tSNE embedding.
How to use tSNE
To get the best out of tSNE there are a number of tuneable hyperparameters, the optimum values of which are data-dependent. Understanding these at a high level will help you make informed choices when experimenting with values.
Arguably the most significant hyperparameter, this roughly translates to the number of nearest neighbours in other manifold techniques and effectively changes the width of the Gaussian for calculating pairwise probabilities. Higher perplexities will increase the weighting of more distant data in the calculated embedding. Consider the number of clusters that are expected and the number of observations when setting this value.The original paper suggests a perplexity between 5-50.
Generally a high number of iterations will allow the algorithm to converge on an embedding with a lower loss. As with all ML algorithms, there are diminishing returns on extra iterations.
An exact implementation of the original algorithm is possible but can take orders of magnitude more time to complete with larger data. The Barnes-Hut approximation speeds computation immensely and should be used in the vast majority of cases.
This parameter is used in many ML models so Data Scientists will be familiar with it. The learning rate is a scalar that affects the scale of the updates to the embedded values in each iteration. A higher learning rate will generally converge to a solution faster, too high however and the embedding may not converge, manifesting as a ball of equidistant points.
The embeddings produced by tSNE are useful for exploratory data analysis and also as an indication of whether there is a sufficient signal in the features of a dataset for supervised methods to make successful predictions. Because it is non-linear, it may show class separation when linear models fail to make accurate predictions. In this case, non-linear supervised models should be considered as an alternative.
The embeddings produced by tSNE can be used for downstream analysis and model training but should be used with caution; for additional data cannot easily be added to an embedding without recalculating. Therefore if using a train/test split to evaluate a model, it is not possible to ‘calculate’ the embeddings of the test data.
Another useful way of interpreting it is by identifying outliers in the embedding and analysing those points to identify potential causes.
Comparison with PCA
Global structure preserved
Focusses more on local structure preserving distances
Application to new data
Not possible with the original algorithm
Not possible, missing data should first be imputed
Not possible, however probabilistic PCA supports missing data
Represents non-linear structure
It is an extremely versatile and robust method of visualising high dimensional data. It is particularly useful for data with a non-linear relationship between observed features and the target class that are poorly represented in other dimensionality reduction techniques. Principal Component Analysis should be considered before tSNE where the new reduced features feed into a ML pipeline as this can be readily applied to unseen data.
 Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases, Philipp Jurmeister, Michael Bockmayr, Philipp Seegerer, Teresa Bockmayr, Denise Treue, Grégoire Montavon, Claudia Vollbrecht, Alexander Arnold, Daniel Teichmann, Keno Bressem, Ulrich Schüller, Maximilian von Laffert, Klaus-Robert Müller, David Capper, Frederick Klauschen Science Translational Medicine 11 Sep 2019
 Advanced Cell Mapping Visualizations for Single Cell Functional Proteomics Enabling Patient Stratification, Nick Bowman, Dong Liu, Patrick Paczkowski, Jon Chen, John Rossi , Sean Mackay, Adrian Bot , Jing Zhou Proteomics 2020 Jul;20(13):e1900270.doi: 10.1002/pmic.201900270
 Wisconsin Breast Cancer Diagnostics Set Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.