Author: Joy Kavanagh

Advances in molecular technologies and high throughput instrumentation have opened a vast array of possibilities for life scientists to pursue a deeper understanding of human disease biology and the discovery of safer and more effective treatments.

At the same time, such advances sparked a data revolution, permanently changing the requirements for biological data storage, management, analysis and interpretation. While technology has raced ahead, the training of enough skilled data scientists to meet global demand has lagged behind, resulting in a widening skills gap among life scientists [1], [2].

Observing this growing trend, several groups embarked on surveys aimed at better understanding the widening skills gap among research scientists and at finding viable solutions to the problem. All found that respondents commonly lacked confidence in their ability to use bioinformatics tools and statistical methods [3].

The resounding opinion, when faced with this evidence, was that meeting the ever-increasing demand for STEM researchers with sufficient depth of knowledge and skill required a fundamental shift in training, starting at bachelor level (or earlier) and continuing upward, so that such skills develop in tandem with learning the underlying biology. Other high-profile articles highlighted the issue, stressing that, to prevent research from stalling, biologists needed to learn bioinformatics skills, thus widening the pool of applied bioinformaticians [4].

Five years on, a shortfall of ~250,000 data scientists [5] means almost 50% of biomedical data scientist positions go unfilled in academia and industry. The reality is that creating or adapting training programs that integrate data science with biomedical sciences to narrow that skills gap is a long-term strategy that has yet to bear fruit [6].

So, where does that leave researchers today?

The reality for many, who seek to break new ground in the understanding or treatment of human diseases, is a requirement for complex data analysis that is beyond their skill set or available resources. Sadly for some, this barrier can feel insurmountable, while for others with time and funding on their side, the hurdle can be overcome. The scenarios below will be all too familiar:

  1. A data scientist colleague is available to collaborate on the project; however, translating your research objectives into tangible outputs requires significant time investment on both sides and, in some cases, many iterations of analysis methods to obtain the crucial results.
  2. The data scientist in your group is already overwhelmed with projects and unable to assist (and would most likely face challenge 1 above).
  3. You decide to embark on the time-consuming and expensive task of recruiting a qualified, applied data scientist who can fill the gap (the average salary for an experienced data scientist in the UK is £70,000 [7]) or to outsource each piece of analysis at a premium rate.
  4. Alternatively, the budget required to fund the appropriate support is not available and the research is shelved.

How can we address the immediate problem? What if more life scientists were empowered to meet their own data analysis needs independently? Examples of the types of multimodal data processing and analysis you might need to perform include:

  • Secure clinical data storage
  • Data cleaning, data wrangling, data visualization
  • Analysis of processed data

Not surprisingly, many data analytics tools have emerged that automate one or more of these analyses, helping researchers to navigate data analysis independently. The tables below provide an overview of some open-source tools and their main capabilities.

Sonrai Analytics Can Help

If your team does not have the time or expertise to use the tools in the tables below – get in touch with Sonrai; we can perform all the functions below in one easy-to-use cloud-based Biomarker Discovery Platform.

Researchers: With access to such powerful tools, a large and multifaceted raw data set can be transformed into presentation-ready outputs in minutes to hours. This fast turnaround could make the difference in meeting an abstract or grant deadline that helps fund the next stage of the research, or in bringing an important discovery one step closer to the clinic.

Academic and industry directors: Access to powerful, user-friendly and secure data analysis tools removes barriers and enables groundbreaking research. Imagine the impact that fast, cost-effective multi-omics data analysis would have on your department’s outputs. 


Selected open-source bioinformatics tools and their main capabilities.

Data cleaning & Visualization

| Tool | Summary | Skill Requirement |
|---|---|---|
| OpenRefine | Previously Google Refine, OpenRefine is a well-known open-source data tool. It lets you transform data between different formats and ensure that data is cleanly structured. | While OpenRefine streamlines many complex tasks (e.g. using clustering algorithms), it does require a little technical know-how. |
| Tidyverse | The tidyverse is a comprehensive collection of R packages that enables the generation of tidy data. | A powerful toolkit, but it requires programming skills. |
| Pandas | Like the R tidyverse, this Python package is widely used for data cleaning and transformation tasks. | Requires knowledge of Python. |
| matplotlib | Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. | Technical users only. |
| R Shiny | A web application framework for creating interactive dashboards and data visualizations. | Programming skills needed. |
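
To give a feel for the skill level the table refers to, here is a minimal sketch of a data cleaning step with pandas. The table, its column names (`Patient ID`, `Sex`, `Age`) and the function `clean_clinical_table` are hypothetical, invented only for illustration; real clinical data would need rules agreed with the people who collected it.

```python
import pandas as pd


def clean_clinical_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Tidy a small, hypothetical clinical table."""
    df = raw.copy()
    # Normalise column names: "Patient ID" -> "patient_id"
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Trim stray whitespace and unify the case of text fields
    df["patient_id"] = df["patient_id"].astype(str).str.strip()
    df["sex"] = df["sex"].str.strip().str.upper()
    # Turn ages into numbers; anything unparseable becomes NaN
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    # Drop rows without a usable age, then duplicate patients
    df = df.dropna(subset=["age"])
    df = df.drop_duplicates(subset="patient_id", keep="first")
    return df.reset_index(drop=True)


# Hypothetical messy input, as it might arrive from a spreadsheet export
raw = pd.DataFrame(
    {
        "Patient ID": [" P001", "P002 ", "P002", "P003"],
        "Sex": ["f", " M", "M", "F "],
        "Age": ["54", "61", "61", "unknown"],
    }
)
tidy = clean_clinical_table(raw)
print(tidy)
```

Dropping rows with a missing age is only one possible choice; depending on the study, imputing or flagging them may be more appropriate.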

Bioinformatics & Machine Learning

| Analysis Type | Tool | Summary | Skill Requirement |
|---|---|---|---|
| Differential gene expression | DESeq2 | This tool enables users to perform differential gene expression analysis on count data from high-throughput sequencing assays such as RNA-seq. | A hugely popular tool, but it requires basic programming skills. |
| Survival Analysis | Log Rank Test | Compare survival in different groups of patients. The Log-Rank test is automatically performed when you compare groups. The p-value and the survival of each group are calculated and provided for inclusion in your article. | Easy-to-use web interface, but advanced survival analysis alongside molecular data requires technical skills. |
| Pathway Analysis | GSEA | Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. | Another hugely popular tool with a basic user interface. R and Python packages are available for technical users. |
| Machine Learning | scikit-learn | A Python toolkit for classification, regression and dimensionality reduction. | Technical users only. |
| Data Mining | Orange | Open-source software for data mining and machine learning workflows. | Programming skills needed. |
| Deep Learning | PyTorch | PyTorch is an open-source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook’s AI Research lab. | Requires in-depth knowledge of Python and concepts such as neural networks. |
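
As an illustration of what sits behind the Log-Rank test row above, the following is a minimal sketch of a two-group log-rank test using only the Python standard library. It is not the code of any tool in the table. The function name `logrank_test` and the follow-up times are hypothetical, and the sketch assumes right-censored data where an event flag of 1 means the event was observed and 0 means the patient was censored.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    # Pool both groups as (time, event flag, group label) records
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    # Only time points with at least one observed event contribute
    event_times = sorted({t for t, e, _ in samples if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for s, _, g in samples if s >= t and g == 0)
        at_risk_b = sum(1 for s, _, g in samples if s >= t and g == 1)
        events_at_t_a = sum(1 for s, e, g in samples if s == t and e and g == 0)
        events_at_t_b = sum(1 for s, e, g in samples if s == t and e and g == 1)
        n = at_risk_a + at_risk_b
        d = events_at_t_a + events_at_t_b
        # Events expected in group A if both groups shared one hazard
        observed_minus_expected += events_at_t_a - d * at_risk_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count
            variance += at_risk_a * at_risk_b * d * (n - d) / (n * n * (n - 1))

    if variance == 0:
        return 0.0, 1.0
    statistic = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical follow-up times in months; 1 = event observed, 0 = censored
statistic, p_value = logrank_test(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
    [6, 7, 8, 9, 10], [1, 1, 1, 1, 0],
)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

For real analyses, an established survival analysis package is a safer choice than a hand-written test, since it also handles stratification, more than two groups and plotting.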


| Tool | Summary | Skill Requirement |
|---|---|---|
| cBioPortal | Enables interactive exploration of multidimensional cancer genomics data sets. The goal of cBioPortal is to significantly lower the barriers between complex genomic data and cancer researchers by providing rapid, intuitive, and high-quality access to molecular profiles and clinical attributes from large-scale cancer genomics projects, and therefore to empower researchers to translate these rich data sets into biologic insights and clinical applications. | Easy-to-use graphical interface; however, API use requires technical skills for advanced queries. |
| FireBrowse | Developed at the Broad Institute of MIT and Harvard, FireBrowse was created to cull and analyze data generated by The Cancer Genome Atlas (TCGA), which characterizes and identifies genomic patterns in human cancer models. FireBrowse provides access to a variety of cancer genomics data, such as clinical annotations, DNA copy number, miR, miRseq, mRNA and mRNAseq, as well as a comprehensive suite of more than 100 interdependent analyses of those data, including correlations, clustering, GISTIC and MutSigCV. | Technical skills are required to make the most of the data on this platform. |

Digital and Computational Pathology

| Tool | Summary | Skill Requirement |
|---|---|---|
| QuPath | Designed by Pete Bankhead at Queen’s University Belfast, QuPath is built specifically to analyze whole slide images (WSI). Its primary use is biomarker analysis and IHC quantification (whole slides and tissue microarrays), but it has also been used for tumour analysis on H&E. | Easy to use for image viewing; however, computational pathology requires technical skills. |
| ImageJ | ImageJ is probably the best known and longest-lived open-source software for biomedical image analysis. Even though the program is so widely used, ImageJ is an experimental system and NIH does not assume any responsibility for its use by other parties. | For use in a regulated environment, separate validation and verification would be required by the end user. |
| ilastik | ilastik helps users perform segmentation and classification of 2D, 3D and 4D images in a unified way. Through a random forest classifier, ilastik learns from labels provided by the user through a convenient GUI. Based on these labels, ilastik applies a problem-specific segmentation. | Easy to use. Created for users without expertise in image processing. |


[1] BBSRC and MRC Review of vulnerable skills and capabilities, 2017.

[2] ‘Bridging the skills gap in the biopharmaceutical industry’, ABPI, 2015.

[3] Attwood, T.K. et al., Briefings in Bioinformatics, 20(2), 2019, 398-404.

[4] Chang, J. Core services: reward bioinformaticians. Nature 2015;520:151–2.

[5] The Data Scientist Shortage in 2020

[6] Bourne, P.E. PLoS Biol. 2021 Mar; 19(3): e3001165. 

[7] National Careers Service Data Scientist


Cloud and data technology startup conceptualising raw data into actionable insights.

Copyright © 2020 Sonrai Analytics Ltd

Contact Us

Address:  Whitla Medical Building
Health Sciences Campus
Lisburn Road, Belfast, BT9 7BL

Email: info@sonraianalytics.com
Phone: + (00) (44) 028 9097 2629