Biomarker Discovery for High-Dimensional Data

Author: Dr Matt Alderdice, Head of Data Science

In precision medicine, high-dimensional molecular datasets are typically explored to identify novel biomarkers, predict novel drug targets and advance our understanding of disease. Working with data at this scale is challenging, because such datasets often contain far more measured features than samples.

Transcriptional Analysis Workflow

Figure 1: Transcriptional analysis workflow, PCA plot.

A key use case for the analysis of high-dimensional datasets can be seen in the transcriptional analysis workflow below. The workflow is captured in a custom interactive report and was constructed using four applications from the Sonrai App store, which offers over 5,000 different combinations. In this workflow, we demonstrate how quality control using unsupervised machine learning, differential gene expression analysis and classic statistical analysis can be performed in under 5 minutes without the need for a bioinformatician!

The data used for this analysis consist of clinical metadata (n=244) and RNA-seq transcriptional profiles (more than 20,000 genes). Sonrai’s Data Transformation Application integrated these data, making them ready for real-time analysis in the following steps.

Step 1

Using the machine learning technique Principal Component Analysis (PCA), we can quickly visualize the variation in our data to help identify biologically meaningful clusters and detect outliers and batch effects. We see from Figure 1 that TCGA-3947 is an outlier, and we may decide to investigate this sample or remove it from our analyses.
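
For readers who want to see the idea in code, here is a minimal sketch of PCA-based outlier screening. It is not the Sonrai application itself: the data are synthetic, the function names `pca_scores` and `flag_outliers` are hypothetical, and the robust z-score cutoff of 3.5 is an assumed rule of thumb rather than a setting taken from the workflow.

```python
import numpy as np

def pca_scores(expr, n_components=2):
    """Project samples (rows) onto the top principal components via SVD."""
    centered = expr - expr.mean(axis=0)              # centre each gene (column)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # sample coordinates on the PCs
    explained = (s ** 2) / np.sum(s ** 2)            # fraction of variance per PC
    return scores, explained[:n_components]

def flag_outliers(scores, z_cutoff=3.5):
    """Flag samples unusually far from the PCA centroid (robust z-score via MAD)."""
    dist = np.linalg.norm(scores - np.median(scores, axis=0), axis=1)
    mad = np.median(np.abs(dist - np.median(dist)))
    robust_z = 0.6745 * (dist - np.median(dist)) / mad
    return np.where(robust_z > z_cutoff)[0]

# Synthetic stand-in for an expression matrix: 50 samples x 200 genes.
rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 200))
expr[7] += 6.0                                       # plant one aberrant sample

scores, explained = pca_scores(expr)
outliers = flag_outliers(scores)
print("Candidate outlier sample indices:", outliers)
```

A flagged sample is only a candidate for review; whether to investigate or remove it remains an analyst's decision, as described above.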

Step 2

To identify key biomarkers from this gene expression dataset, we use the volcano plot application, which performs differential expression analysis. The higher and further from the centre a gene appears in the resulting eruption, the more significantly and strongly it is differentially expressed between microsatellite instability-high (MSI-H) and microsatellite stable (MSS) samples. We identify 45 key genes using our default settings.
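
The two axes of a volcano plot can be sketched as follows. This is an illustration under stated assumptions, not the method used by the volcano plot application: we assume a per-gene Welch t-test on log2 expression, Benjamini-Hochberg adjustment, and cutoffs of |log2 fold change| > 1 and adjusted p < 0.05. The data are synthetic and the function names are hypothetical.

```python
import numpy as np
from scipy import stats

def volcano_table(log_expr, is_msi_h):
    """Per-gene log2 fold change and Welch t-test p-value (MSI-H vs MSS)."""
    a, b = log_expr[is_msi_h], log_expr[~is_msi_h]
    log2_fc = a.mean(axis=0) - b.mean(axis=0)   # input is already on the log2 scale
    _, p = stats.ttest_ind(a, b, axis=0, equal_var=False)
    return log2_fc, p

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values."""
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# Synthetic data: 60 samples (20 MSI-H, 40 MSS) x 500 genes.
rng = np.random.default_rng(1)
is_msi_h = np.arange(60) < 20
log_expr = rng.normal(size=(60, 500))
log_expr[:20, :10] += 3.0                       # plant 10 up-regulated genes

log2_fc, p = volcano_table(log_expr, is_msi_h)
p_adj = benjamini_hochberg(p)
# x-axis of the plot: log2_fc; y-axis: -log10(p). Hits sit high and wide.
hits = np.where((np.abs(log2_fc) > 1.0) & (p_adj < 0.05))[0]
print("Genes passing both cutoffs:", hits)
```

Real RNA-seq count data would normally go through a dedicated differential expression method; the t-test here only keeps the sketch short.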

Step 3

We then visualize our data using the heatmap application. Heatmaps allow patterns of expression to be quickly visualized and are a widely used and important bioinformatics tool. The dendrogram shows robust clustering of the patients into two groups, which consist predominantly of MSI-H and MSS patients respectively.
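
The clustering behind a heatmap dendrogram can be sketched like this. The choices of per-gene z-scoring, Euclidean distance and Ward linkage are assumptions made for illustration; the source does not state which settings the heatmap application uses. The data and the function name `two_group_clusters` are likewise hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def two_group_clusters(signature_expr):
    """Z-score each gene, then cut a Ward dendrogram of the samples into two clusters."""
    z = (signature_expr - signature_expr.mean(axis=0)) / signature_expr.std(axis=0)
    tree = linkage(z, method="ward")            # Euclidean distance between samples
    return fcluster(tree, t=2, criterion="maxclust")

# Synthetic data: 40 samples x 45 signature genes, with a group-level shift.
rng = np.random.default_rng(2)
labels = np.array(["MSI-H"] * 15 + ["MSS"] * 25)
expr = rng.normal(size=(40, 45))
expr[labels == "MSI-H"] += 2.0

clusters = two_group_clusters(expr)
# Cross-tabulate cluster membership against the known status labels.
for c in np.unique(clusters):
    members = labels[clusters == c]
    print(f"cluster {c}: {np.sum(members == 'MSI-H')} MSI-H, {np.sum(members == 'MSS')} MSS")
```

Comparing cluster membership with the status labels, as in the final loop, is one simple way to check how well the two dendrogram branches line up with MSI-H and MSS.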

Step 4

Finally, we identify IDO1 as our key biomarker of interest from our 45-gene signature. We use the box and dot plot app to compare mean IDO1 expression between the groups using ANOVA, which yields a p-value that demonstrates statistical significance.
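
As a rough illustration of this comparison, the sketch below runs a one-way ANOVA on a single gene with `scipy.stats.f_oneway`. The expression values are synthetic and are not the IDO1 measurements from the report; the group sizes and means are assumptions chosen only to make the example run.

```python
import numpy as np
from scipy import stats

# Synthetic log2 expression values for one gene in two groups.
rng = np.random.default_rng(3)
msi_h = rng.normal(loc=8.0, scale=1.0, size=30)
mss = rng.normal(loc=6.0, scale=1.0, size=60)

# One-way ANOVA: does mean expression differ between the groups?
f_stat, p_value = stats.f_oneway(msi_h, mss)

# With exactly two groups, ANOVA matches the equal-variance t-test (F = t^2).
t_stat, t_p = stats.ttest_ind(msi_h, mss)

print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
print(f"t^2 = {t_stat ** 2:.2f}")
```

With only two groups the ANOVA and the classic t-test give the same p-value, so ANOVA mainly pays off when three or more groups are compared.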


This workflow depicts one of many possible combinations that are ready to be used out of the box to address bottlenecks in multi-omic analysis. The workflow showcases how our intuitive no-code machine learning, bioinformatics and data visualization applications can be used to detect outliers, generate gene signatures and validate key biomarkers from high-dimensional data such as transcriptomics.

Sonrai’s technology is at the core of our high-profile partnerships where we are driving innovative data-driven healthcare solutions in diseases such as colorectal cancer. Sonrai’s technology can help pioneer the democratization of complex bioinformatics and data science workflows for precision medicine.
