What Is T-SNE And When Should I Use It?

Author: Matt Lee, Senior Data Scientist

Updated: 12/02/2024


Matt has a background in Medical Physics, previously working in diagnostic radiology and nuclear medicine as a Clinical Scientist in the NHS. Since then, Matt has developed experience in machine learning at Allstate NI, particularly in the field of computer vision, where he developed an application to read VINs from images. Matt has combined his experience of medical imaging and machine learning at Sonrai, where he specialises in Computational Pathology development.

t-Distributed Stochastic Neighbour Embedding (t-SNE) is an unsupervised Machine Learning algorithm developed in 2008 by Laurens van der Maaten and Geoffrey Hinton. It has become widely used in bioinformatics and more generally in data science to visualize the structure of high-dimensional data in 2 or 3 dimensions. It is the best known of a group of algorithms called Manifold Learning, which are used for non-linear dimensionality reduction, and it is sometimes contrasted with Principal Component Analysis (PCA).

What Data Can I Use It For?

t-SNE is incredibly versatile - this Google Tech Talk demonstrated excellent results on the ubiquitous MNIST handwritten digit classification dataset, the more complex images of the CIFAR-10 dataset, speech audio and word clusters generated from natural language processing (NLP). The technique has since been used to visualize structure in a range of bioinformatics data sources, including single-cell transcriptomics[1], RNA/DNA sequencing[2], DNA methylation[3] and proteomics[4], to name just a few. Visualizing high-dimensional data such as these is impossible without first reducing the dimensions.

 

T-SNE can be used to give visual feedback on whether there is sufficient signal in the features available to train an ML model before embarking on the training cycle. This can be seen in our case study on biomarker discovery where it was used to determine if the bladder cancer patients were distinguishable from a control group prior to training with XGBoost.

T-SNE can also be used with imaging data; this works best when features are extracted with a Convolutional Neural Network such as ResNet and then used as the input to t-SNE. Ultimately this allows images to be clustered based on their similarity, which can yield useful insights. This can be seen in our case study on grouping compounds with a similar mechanism of action for high content screening.
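As a rough illustration of this approach (not the exact pipeline used in the case study), the sketch below assumes PyTorch, a recent torchvision and scikit-learn are installed: a pretrained ResNet-50 acts as a feature extractor, and the resulting feature vectors are embedded with t-SNE. The placeholder images stand in for your own data.

```python
import torch
from PIL import Image
from sklearn.manifold import TSNE
from torchvision import models, transforms

# Placeholder images; in practice these would be your own microscopy or screening images.
images = [Image.new("RGB", (256, 256), color=(i * 12, 50, 100)) for i in range(20)]

# Pretrained ResNet-50 with the classification head removed, leaving a 2048-d feature vector per image.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

with torch.no_grad():
    features = torch.stack([resnet(preprocess(img).unsqueeze(0)).squeeze(0) for img in images])

# Perplexity must be smaller than the number of images.
embedding = TSNE(n_components=2, perplexity=5).fit_transform(features.numpy())
```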

Curse of Dimensionality

A phenomenon well known within the Data Science community is the so-called 'Curse of Dimensionality'. Dimensionality here refers to the number of features for a given observation. The problem particularly pertains to bioinformatics data, where features (e.g. variants in a VCF file, methylation scores in a DNA methylation profile) can number in the hundreds of thousands whilst the observations typically number in the tens or hundreds.

 

Visualizing such data directly is near impossible. Furthermore, training machine learning models on high-dimensional data is prone to a phenomenon known as 'overfitting': the model performs well on the training data but does not generalize well to new data.

Dimensionality Reduction Techniques

Dimensionality reduction techniques reduce the effects of the Curse of Dimensionality. There are a number of ways to reduce the dimensionality of a dataset, including Isomap, Multi-Dimensional Scaling (MDS), Locally Linear Embedding, Spectral Embedding and t-Distributed Stochastic Neighbour Embedding (t-SNE), which is the focus of this article. 
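For orientation, all of the techniques listed above are implemented in scikit-learn's manifold module (PCA lives in sklearn.decomposition). The snippet below is only a sketch; the random matrix stands in for a real (n_samples, n_features) feature matrix.

```python
import numpy as np
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding, TSNE

X = np.random.default_rng(0).normal(size=(200, 50))  # placeholder feature matrix

reducers = {
    "Isomap": Isomap(n_components=2),
    "MDS": MDS(n_components=2),
    "LLE": LocallyLinearEmbedding(n_components=2),
    "Spectral": SpectralEmbedding(n_components=2),
    "t-SNE": TSNE(n_components=2, perplexity=30),
}
embeddings = {name: reducer.fit_transform(X) for name, reducer in reducers.items()}
```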

 

These algorithms belong to an overarching class of algorithms known as Manifold Learning, which are predominantly stochastic algorithms. This means the same algorithm and data will result in different embeddings given different initialisations. 

 

Matrix decomposition is another group of dimensionality reduction algorithms, with PCA being the best known. These are deterministic: they produce the same result given the same data. They also have few, if any, parameters to tune and typically run in much less time.

Description

t-SNE allows the visualization of the underlying local structure of high-dimensional data in 2 or 3 dimensions, so the results can be plotted in a simple scatter plot. It builds upon SNE, which was introduced in 2002. The 't' prefix was introduced to distinguish the algorithm because it uses the Student's t-distribution, rather than the Gaussian distribution, to compute the similarity between points in the low-dimensional space.

 

The t-distribution has longer tails than the Gaussian distribution, which alleviates the 'crowding problem' present in the original algorithm. The crowding problem occurs when the Euclidean distances between clusters are large compared with the distances between points within a cluster, so that the low-dimensional map loses the detail that distinguishes intra-cluster points.

 

By measuring the pairwise distances in the high-dimensional and low-dimensional spaces using different probability distributions, the intra-cluster detail is more optimally visualized.

 

T-SNE optimizes over a set number of iterations, using gradient descent with the Kullback-Leibler (KL) divergence as the cost function. The algorithm is stochastic, so multiple executions with different random seeds will yield different results. It is acceptable to run the algorithm several times and pick the embedding with the lowest KL divergence.
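A minimal sketch of this "run several times, keep the best" approach using scikit-learn's implementation, which exposes the final cost of a run as the `kl_divergence_` attribute (the helper name and seed values are illustrative):

```python
from sklearn.manifold import TSNE

def best_tsne(X, seeds=(0, 1, 2, 3, 4), **tsne_kwargs):
    """Run t-SNE once per seed and return the embedding with the lowest KL divergence."""
    best = None
    for seed in seeds:
        tsne = TSNE(n_components=2, random_state=seed, **tsne_kwargs)
        embedding = tsne.fit_transform(X)
        # scikit-learn stores the final cost of the run in `kl_divergence_`.
        if best is None or tsne.kl_divergence_ < best[0]:
            best = (tsne.kl_divergence_, embedding)
    return best[1]
```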

 

The KL divergence is defined as:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $p_{ij}$ and $q_{ij}$ are the pairwise probabilities in the high-dimensional space and the low-dimensional space respectively. The reason t-SNE is more accurate in resolving local structure is that each term in the cost function is proportional to the pairwise probability in the high-dimensional space, so points that are relatively distant have a much lower impact on the cost.
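For intuition, the cost can be computed directly with NumPy given matrices P and Q of pairwise probabilities (a sketch for illustration, not how t-SNE implementations evaluate it internally):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """Kullback-Leibler divergence between two pairwise-probability matrices."""
    mask = P > 0                      # terms with p_ij = 0 contribute nothing
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
```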

Example: Breast Cancer Wisconsin (Diagnostic) Data Set.

 

This dataset[5] comprises features computed from cell nuclei in images of fine needle aspirates of breast masses. From the embedding it can be seen that there are significant differences between benign (blue) and malignant (orange) masses. PCA was applied prior to the t-SNE embedding.
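The exact preprocessing and parameters behind the figure are not listed here, but a comparable pipeline can be sketched with scikit-learn's copy of the dataset: standardize the features, reduce with PCA, then embed with t-SNE and plot.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                    # 569 samples, 30 features

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=10).fit_transform(X_scaled)          # PCA before t-SNE
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="coolwarm", s=10)
plt.title("t-SNE embedding of the Wisconsin Diagnostic dataset")
plt.show()
```

Varying the perplexity and the number of PCA components will change the appearance of the embedding, so the values above should be treated as starting points rather than recommendations.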


How To Use T-SNE

To get the best out of t-SNE there are a number of tuneable hyperparameters, the optimum values of which are data-dependent. Understanding these at a high level will help you make informed choices when experimenting with values.

Perplexity

Arguably the most significant hyperparameter, this roughly translates to the number of nearest neighbours in other manifold techniques and effectively changes the width of the Gaussian used to calculate pairwise probabilities. Higher perplexities increase the weighting of more distant data points in the calculated embedding. Consider the number of clusters that are expected and the number of observations when setting this value. The original paper suggests a perplexity between 5 and 50.

Iterations

Generally a high number of iterations will allow the algorithm to converge on an embedding with a lower loss. As with all ML algorithms, there are diminishing returns on extra iterations.

Method

An exact implementation of the original algorithm is possible but can take orders of magnitude more time to complete with larger data. The Barnes-Hut approximation speeds computation immensely and should be used in the vast majority of cases.

Learning Rate

This parameter is used in many ML models, so Data Scientists will be familiar with it. The learning rate is a scalar that affects the scale of the updates to the embedded values in each iteration. A higher learning rate will generally converge to a solution faster; if it is too high, however, the embedding may not converge, manifesting as a ball of roughly equidistant points.
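Bringing these hyperparameters together, the sketch below uses scikit-learn's TSNE with illustrative values, not recommendations for every dataset; note that the iterations argument is named n_iter in scikit-learn versions before 1.5 and max_iter afterwards, and the random matrix stands in for your own feature matrix.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 100))  # placeholder (n_samples, n_features) matrix

tsne = TSNE(
    n_components=2,
    perplexity=30,        # roughly the effective number of neighbours; 5-50 is typical
    max_iter=1000,        # optimization iterations (called n_iter in scikit-learn < 1.5)
    method="barnes_hut",  # fast approximation; use "exact" only for small datasets
    learning_rate=200.0,  # too high and the points may collapse into a ball and fail to converge
    random_state=0,
)
embedding = tsne.fit_transform(X)
```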

Interpreting T-SNE

The embeddings produced by t-SNE are useful for exploratory data analysis and also as an indication of whether there is sufficient signal in the features of a dataset for supervised methods to make successful predictions. Because it is non-linear, t-SNE may show class separation even when linear models fail to make accurate predictions; in this case, non-linear supervised models should be considered as an alternative.

 

The embeddings produced by t-SNE can be used for downstream analysis and model training, but should be used with caution, since additional data cannot easily be added to an existing embedding without recalculating it. Therefore, if using a train/test split to evaluate a model, it is not possible to project the test data into the embedding learned from the training data.
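A small sketch of this caveat with scikit-learn and random data: a fitted PCA can project unseen data with transform, whereas TSNE offers only fit_transform, so the whole embedding must be recomputed to include new points.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 50)), rng.normal(size=(20, 50))

pca = PCA(n_components=2).fit(X_train)
X_test_pca = pca.transform(X_test)           # fine: reuses the fitted components

tsne = TSNE(n_components=2, perplexity=10)
X_train_tsne = tsne.fit_transform(X_train)   # embeds the training data only
# tsne.transform(X_test)                     # not available: the embedding must be recomputed
```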

 

Another useful way of interpreting an embedding is to identify outliers within it and analyze those points to identify potential causes.
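One simple (and by no means the only) way to flag such outliers is to look at each embedded point's distance to its k-th nearest neighbour; the helper below is a hypothetical sketch using scikit-learn and NumPy.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embedding_outliers(embedding, k=5, z_threshold=3.0):
    """Return indices of points whose k-NN distance is more than z_threshold std devs above the mean."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)   # +1 because each point is its own nearest neighbour
    dists, _ = nn.kneighbors(embedding)
    kth = dists[:, -1]                                        # distance to the k-th true neighbour
    z = (kth - kth.mean()) / kth.std()
    return np.where(z > z_threshold)[0]
```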

Comparison With PCA

| | t-SNE | PCA |
| --- | --- | --- |
| Method | Stochastic | Deterministic |
| Global structure preserved | No; focuses more on preserving local structure and distances | Yes |
| Application to new data | Not possible with the original algorithm | Yes |
| Incomplete data | Not possible; missing data should first be imputed | Not possible; however, probabilistic PCA supports missing data |
| Speed | Relatively slow | Relatively fast |
| Represents non-linear structure | Yes | No |

Summary

t-SNE is an extremely versatile and robust method for visualizing high-dimensional data. It is particularly useful for data with a non-linear relationship between the observed features and the target class, which is poorly represented by other dimensionality reduction techniques. Where the reduced features feed into an ML pipeline, Principal Component Analysis should be considered before t-SNE, as PCA can be readily applied to unseen data.

References

[1] Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nature Communications 10, 5416 (2019).

[2] Visualizing Single-Cell RNA-Seq Data with t-SNE: Researcher Interview with Dmitry Kobak and Philipp Berens. National Cancer Institute, September 2020.

[3] Jurmeister, P., Bockmayr, M., Seegerer, P., Bockmayr, T., Treue, D., Montavon, G., Vollbrecht, C., Arnold, A., Teichmann, D., Bressem, K., Schüller, U., von Laffert, M., Müller, K.-R., Capper, D. & Klauschen, F. Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases. Science Translational Medicine, 11 September 2019.

[4] Bowman, N., Liu, D., Paczkowski, P., Chen, J., Rossi, J., Mackay, S., Bot, A. & Zhou, J. Advanced Cell Mapping Visualizations for Single Cell Functional Proteomics Enabling Patient Stratification. Proteomics 2020 Jul;20(13):e1900270. doi: 10.1002/pmic.201900270.

[5] Dua, D. & Graff, C. (2019). Breast Cancer Wisconsin (Diagnostic) Data Set. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
