Author: Joy Kavanagh | Reading time: 8 Minutes
In this article, we will explore:
The widening skills gap among research scientists
The reality for research scientists today
Open-source bioinformatics tools list
The Widening Skills Gap Among Research Scientists
Advances in molecular technologies and high throughput instrumentation have opened a vast array of possibilities for life scientists to generate large and complex data sets as they pursue a deeper understanding of human disease biology and the discovery of safer and more effective treatments.
At the same time, such advances sparked a data revolution, inexorably changing the requirements for biological data storage, management, analysis and interpretation. While technology has raced ahead, the training of enough skilled data scientists worldwide has lagged behind demand, resulting in a widening skills gap among life scientists.
Observing this growing trend, several groups embarked on surveys aimed at better understanding the widening skills gap among research scientists and finding viable solutions to the problem. All found that respondents commonly lacked confidence in their ability to use bioinformatics tools and statistical methods.
The resounding opinion, when faced with this evidence, was that meeting the ever-increasing demand for STEM researchers with sufficient depth of knowledge and skill required a fundamental shift in training, starting at bachelor level (or earlier), so that such skills develop in tandem with learning in the underlying biology. Other high-profile articles highlighted the issue, stressing that, to prevent research from stalling, biologists needed to learn bioinformatics skills, thus widening the pool of applied bioinformaticians.
Five years on, a shortfall of ~250,000 data scientists means almost 50% of biomedical data scientist positions go unfilled in academia and industry. The reality is that creating or adapting training programs that integrate data science with biomedical sciences to narrow the skills gap is a long-term strategy that has yet to bear fruit.
So, where does that leave researchers today?
The reality for many, who seek to break new ground in the understanding or treatment of human diseases, is a requirement for complex data analysis that is beyond their skill set or available resources. Sadly for some, this barrier can feel insurmountable, while for others with time and funding on their side, the hurdle can be overcome. The scenarios below will be all too familiar:
How can we address the immediate problem? What if more life scientists were empowered to meet their own data analysis needs independently? Examples of the types of multimodal data processing and analysis you might need to perform:
Not surprisingly, many data analytics tools have emerged that automate one or many of these analyses, helping researchers to navigate data analysis independently. Each table below provides an overview of some open-source tools and their main capabilities.
Researchers: With access to such powerful tools, a large and multifaceted raw data set can be transformed into presentation-ready outputs in minutes to hours. This fast turnaround could make the difference in meeting an abstract or grant deadline that helps fund the next stage of the research or brings an important discovery one step closer to the clinic.
Academic and industry directors: Access to powerful, user-friendly, and secure data analysis tools removes barriers and enables groundbreaking research. Imagine the impact that fast, cost-effective multi-omics data analysis would have on your department’s outputs.
Sonrai Can Help
If your team does not have the time or expertise to use the tools in the tables below, get in touch with Sonrai; we can perform all the functions in one easy-to-use cloud-based Biomarker Discovery Platform.
Selected open-source bioinformatics tools and their main capabilities
Data cleaning & Visualization
|Tool||Summary||Skill Requirement|
|OpenRefine||Previously Google Refine, OpenRefine is a well-known open-source data tool. Its main benefit is being open source. It lets you transform data between different formats and ensure that data is cleanly structured.||While OpenRefine streamlines many complex tasks (e.g. using clustering algorithms), it does require a little technical know-how.|
|Tidyverse||The tidyverse is a comprehensive collection of R packages that enables the generation of tidy data.||A powerful toolkit, but it requires programming skills|
|Pandas||Similar to the R tidyverse, this Python package is a mainstay for data cleaning and transformation tasks.||Requires knowledge of Python|
|matplotlib||Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.||Technical users only|
|R Shiny||A web application framework for creating interactive dashboards and data visualizations.||Programming skills needed|
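To give a sense of what "data cleaning" with one of these tools involves, here is a minimal pandas sketch. The sample sheet, its column names and the `clean_samples` helper are hypothetical and exist only for illustration; a real data set would need its own rules.

```python
import pandas as pd

# Hypothetical raw sample sheet; column names and values are illustrative only.
raw = pd.DataFrame(
    {
        "Sample ID": [" S1", "S2 ", "S2 ", "S3"],
        "Group": ["Tumour", "tumour", "tumour", "NORMAL"],
        "Expression": ["12.5", "8.1", "8.1", "n/a"],
    }
)


def clean_samples(df: pd.DataFrame) -> pd.DataFrame:
    """Return a tidied copy of a messy sample sheet (illustrative helper)."""
    out = df.copy()
    # Normalise column names, e.g. "Sample ID" -> "sample_id"
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    # Trim stray whitespace and harmonise the case of text columns
    out["sample_id"] = out["sample_id"].str.strip()
    out["group"] = out["group"].str.strip().str.lower()
    # Convert measurements to numbers; unparseable entries become NaN
    out["expression"] = pd.to_numeric(out["expression"], errors="coerce")
    # Drop exact duplicate rows, then rows without a usable measurement
    out = out.drop_duplicates().dropna(subset=["expression"])
    return out.reset_index(drop=True)


tidy = clean_samples(raw)
print(tidy)
```

Even this small example shows why the "Skill Requirement" column matters: each cleaning decision (what counts as a duplicate, how missing values are handled) has to be written down explicitly in code.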
Bioinformatics & Machine Learning
|Analysis Type||Tool||Summary||Skill Requirement|
|Differential gene expression||DESeq2||This tool enables users to perform differential gene expression analysis on count data from high-throughput sequencing assays such as RNA-seq.||A hugely popular tool, but it requires basic programming skills to use|
|Survival Analysis||Log-Rank Test||Compare survival in different groups of patients. The log-rank test is automatically performed when you compare groups. The p-value and the survival of each group are calculated and provided for inclusion in your article.||Easy-to-use web interface, but advanced survival analysis alongside molecular data requires technical skills|
|Pathway Analysis||GSEA||Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states.||Another hugely popular tool with a basic user interface. R and Python packages are available for technical users.|
|Machine Learning||scikit-learn||A Python toolkit for classification, regression and dimensionality reduction.||Technical users only|
||Open-source software for data mining and machine learning workflows.||Programming skills needed|
|Deep Learning||PyTorch||PyTorch is an open-source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook’s AI Research lab.||Requires in-depth knowledge of Python and concepts such as neural networks.|
|cBioPortal||Enables interactive exploration of multidimensional cancer genomics data sets. The goal of cBioPortal is to significantly lower the barriers between complex genomic data and cancer researchers by providing rapid, intuitive, and high-quality access to molecular profiles and clinical attributes from large-scale cancer genomics projects, and therefore to empower researchers to translate these rich data sets into biologic insights and clinical applications.||Easy-to-use graphical interface; however, advanced queries through the API require technical skills|
|FireBrowse||Developed at the Broad Institute of MIT and Harvard, FireBrowse was created to cull and analyze data generated by The Cancer Genome Atlas (TCGA), which characterizes and identifies genomic patterns in human cancer models. FireBrowse provides access to a variety of cancer genomics data, such as clinical annotations, DNA copy number, miR, miRseq, mRNA and mRNAseq, as well as a comprehensive suite of more than 100 interdependent analyses of those data, including correlations, clustering, GISTIC and MutSigCV.||Technical skills are required to make the most of the data on this platform|
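For readers curious about what the log-rank test in the table computes, the following is a minimal sketch of the standard two-group statistic using only the Python standard library. It is not the implementation used by any tool listed above; the function name `logrank_test` and the example follow-up times are assumptions made for illustration.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test (illustrative sketch).

    times_*  : follow-up time for each patient
    events_* : 1 if the event was observed, 0 if the patient was censored
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # Events occurring exactly at time t, per group
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # events expected in group A under H0
        if n > 1:  # hypergeometric variance; undefined when one patient remains
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up follow-up times in months; every patient has an observed event here.
late = [10, 12, 14, 16, 18, 20]
early = [1, 2, 3, 4, 5, 6]
chi2, p = logrank_test(late, [1] * 6, early, [1] * 6)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Web tools wrap this calculation behind a form, which is why basic group comparisons are accessible to non-programmers while analyses that combine survival with molecular data usually call for scripting.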
Digital and Computational Pathology
|QuPath||Designed by Pete Bankhead at Queen’s University Belfast, QuPath is built specifically to analyze whole slide images (WSI). Its primary use is biomarker analysis and IHC quantification (whole slides and tissue microarrays), but it has also been used for tumour analysis on H&E.||Easy to use for image viewing; however, computational pathology requires technical skills|
|ImageJ||ImageJ is probably the best-known and longest-lived open-source software for biomedical image analysis. Even though the program is so widely used, ImageJ is an experimental system and NIH does not assume any responsibility for its use by other parties.||For use in a regulated environment, separate validation and verification would be required by the end user.|
|ilastik||Ilastik helps users perform segmentation and classification of 2, 3 and 4D images in a unified way. Through a random forest classifier, ilastik learns from labels provided by the user through a convenient GUI. Based on these labels, ilastik applies problem-specific segmentation.||Easy-to-use. Created for users without expertise in image processing|
 BBSRC and MRC Review of vulnerable skills and capabilities, 2017.
 ‘Bridging the skills gap in the biopharmaceutical industry’, ABPI, 2015.
 Attwood, T.K. et al, Briefings in bioinformatics, 20(2), 2019, 398-404.
 Chang, J. Core services: reward bioinformaticians. Nature 2015;520:151–2.
 The Data Scientist Shortage in 2020
 Bourne, P.E. PLoS Biol. 2021 Mar; 19(3): e3001165.
 National Careers Service Data Scientist