Author: Dr Matthew Alderdice, Data Scientist
Precision Medicine (PM) describes the process of combining clinical and molecular data to identify subgroups of patients who respond better to particular treatments and clinical interventions. These clinically relevant subgroups are identified from big data using advanced analytics such as Machine Learning (ML) and Artificial Intelligence (AI), and are defined by the presence of distinguishing molecular profiles. The Precision Medicine Initiative was launched by United States President Barack Obama in 2015 with the aim of performing deep phenotyping on over one million volunteers. The findings of the initiative will inform clinical decision making, identify new targeted therapies and ultimately improve patient care.
The term ‘Precision Medicine’ is often used interchangeably with ‘Personalized Medicine’ and ‘Stratified Medicine’ and very few people agree on a concrete definition. Regardless of the perceived differences between the terms, they all revolve around the goal of identifying novel biomarkers, molecular subtypes and patient stratification tools. This article aims to shed some light on the key scientific terms within the precision medicine framework.
Biomarkers are one of the core concepts of precision medicine. The FDA describes a biomarker as “a defined characteristic that is measured as an indicator of normal biological processes, pathogenic processes, or response to an exposure or intervention, including therapeutic interventions”. In other words, it is a very broad term and may refer to a single quantitative measurement such as blood glucose level or a multitude of clinical and molecular data measurements with complex interdependencies. Applications of biomarkers include diagnosis of patients with disease, staging of disease, prognosis, and prediction and monitoring of treatment response.
Advances in molecular techniques in the 1990s coincided with the start of the Human Genome Project, which paved the way for a myriad of discoveries linking mutational status with disease and response to clinical intervention. Perhaps one of the most significant early discoveries, led by Dr Mary-Claire King, demonstrated the link between mutations in the BRCA1/2 genes and the diagnosis of early-onset familial breast cancer. More recently, cancer treatment has been transformed by the development of targeted treatments alongside companion diagnostics: imatinib and BCR-ABL, trastuzumab and HER2, and vemurafenib and BRAF V600E. All of these advances have only been possible thanks to the ability to perform deep phenotyping experiments, generating large multi-omic data lakes in the process.
At the beginning of the century, it cost approximately $100 million to sequence a single human genome. From 2008 onward, the cost of DNA sequencing and other technologies has declined rapidly, and we can now routinely generate multi-omic molecular profiles (e.g. genomics, transcriptomics and proteomics) for every patient. It is no coincidence that the emergence of precision medicine has tracked our ability to perform large-scale molecular profiling. As the data deluge continues, it becomes more challenging to generate insights, and we must apply advanced analytics to identify the most clinically and biologically relevant biomarkers.
The curse of dimensionality is a term often used to describe the situation where the number of features (e.g. genes) greatly outweighs the number of observations (e.g. patients). This is a very prominent characteristic of multi-omic data, and it makes finding the next ground-breaking biomarker, or any meaningful pattern, with simple statistics like looking for a needle in a haystack. There is a rich ecosystem of bioinformatics tools routinely used for exploratory data analysis in precision medicine to ease the discovery of novel biomarkers; we will explore a number of the key ones. Differential expression analysis and feature selection are core techniques used in the generation of molecular assays known as gene signatures.
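The p ≫ n regime and a simple differential expression screen can be sketched with synthetic data. Everything here is illustrative: the gene count, group sizes and the single spiked-in gene are assumptions for the sake of the example, not real cohort data, and the per-gene Welch t-statistic stands in for the more sophisticated screens (e.g. limma, DESeq2) used in practice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy expression matrix: 1,000 genes (features) x 20 patients (observations),
# illustrating the features >> observations regime typical of multi-omic data.
n_genes, n_per_group = 1000, 10
expr = rng.normal(loc=5.0, scale=1.0, size=(n_genes, 2 * n_per_group))

# Spike a genuine expression difference into one gene for the second group.
expr[0, n_per_group:] += 3.0

group_a = expr[:, :n_per_group]   # e.g. responders
group_b = expr[:, n_per_group:]   # e.g. non-responders

# Per-gene Welch t-statistic: a minimal differential expression screen.
mean_diff = group_b.mean(axis=1) - group_a.mean(axis=1)
se = np.sqrt(group_a.var(axis=1, ddof=1) / n_per_group
             + group_b.var(axis=1, ddof=1) / n_per_group)
t_stats = mean_diff / se

# Rank genes by absolute t-statistic; the spiked gene should rank highly.
top_genes = np.argsort(-np.abs(t_stats))
print(top_genes[:5])
```

With 999 null genes, some will show large t-statistics purely by chance, which is exactly why multiple-testing correction and validation cohorts are essential before calling a gene a biomarker.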
Gene signatures are lists of genes with unique expression patterns that characterise a clinical or biological phenotype. Gene signatures can provide insights into the biology that underpins disease using techniques such as pathway analysis. Most gene signatures are not clinically viable; however, there are a number of noteworthy commercially available gene signatures (e.g. MammaPrint, DDRD, OncotypeDX and Prosigna) that are used for prognostication and prediction. Prosigna, the commercial implementation of the PAM50 gene signature, assigns patients with early-stage breast cancer a prognostic score indicative of the risk of cancer recurrence. Gene signatures such as Prosigna are the culmination of years of exploratory data analysis, development and validation: the molecular subtypes that underpin PAM50 were first described in Nature in 2000, where the authors used a technique called hierarchical clustering to identify four ‘molecular subtypes’ associated with complex biological pathways. Molecular subtyping is a term that has become synonymous with precision medicine, particularly in oncology, but what does it mean and how is it impacting health care?
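Hierarchical clustering of the kind used to discover those breast cancer subtypes can be sketched in a few lines. The cohort below is synthetic, with two subtypes built in by construction; the sample and gene counts are arbitrary assumptions, and real subtyping studies involve far more samples, careful normalisation and cluster-stability analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy cohort: 30 tumour samples x 50 genes, drawn from two hypothetical
# expression programmes so that two "subtypes" exist by construction.
subtype_1 = rng.normal(loc=0.0, scale=1.0, size=(15, 50))
subtype_2 = rng.normal(loc=4.0, scale=1.0, size=(15, 50))
samples = np.vstack([subtype_1, subtype_2])

# Agglomerative (hierarchical) clustering with average linkage on
# Euclidean distances, then cut the tree into two clusters.
tree = linkage(samples, method="average", metric="euclidean")
labels = fcluster(tree, t=2, criterion="maxclust")

print(labels)
```

Because the two synthetic programmes are well separated, cutting the dendrogram at two clusters recovers the built-in subtypes; on real expression data the number of clusters is itself a modelling decision.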
Molecular subtyping refers to the use of multi-omic data to find clusters of samples that share biological traits. Molecular subtypes are often discovered using unsupervised machine learning, such as clustering and dimensionality reduction techniques (e.g. t-SNE), which are powerful approaches for cluster visualization. Breast cancer is undoubtedly the cancer type that has seen the most research in this area; however, there is now substantial evidence that molecular subtypes exist across most, if not all, tumour types. The Consensus Molecular Subtypes (CMS) of colorectal cancer were established more recently, in 2015, by Guinney et al, who brought together the findings of six independent subtyping studies. Using Markov clustering on thousands of colorectal cancer (CRC) transcriptome profiles, they identified four molecular subtypes with clear molecular associations defined not only by gene expression but also by mutational, methylation, miRNA and histopathology profiles. The paper has been cited over 2,000 times since 2015 and is now helping shape future clinical stratification strategies. The original CMS classifier was published as a Random Forest, a classical machine learning algorithm that forms part of the wider Artificial Intelligence and ML ecosystem. ML techniques have long been used by the data science and bioinformatics communities in precision medicine; over the last decade, however, AI has seen a resurgence.
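The Random Forest approach behind such subtype classifiers can be illustrated on synthetic data. This is a minimal sketch, not the published CMS classifier: the profiles, the two stand-in subtypes and the expression shift in the first ten genes are all invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Toy stand-in for subtype classification: 200 tumour transcriptome
# profiles (100 genes each) belonging to two synthetic subtypes.
X = rng.normal(size=(200, 100))
y = np.repeat([0, 1], 100)
X[y == 1, :10] += 2.0   # subtype-specific expression shift

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# A Random Forest learns to assign each profile to a subtype.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(round(accuracy, 2))
```

A held-out test split, as above, is the bare minimum for honest evaluation; published subtype classifiers are additionally validated on fully independent cohorts.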
The renewed excitement around the application of AI in precision medicine is hard to ignore, but what is AI and how is it routinely used in precision medicine? The term AI, or ‘narrow AI’, describes a collection of machine learning and deep learning algorithms, trained on vast amounts of data, that mimic aspects of human decision making and behaviour. The precision medicine sector has been slower than others in its adoption of these algorithms. However, a recent review by Benjamens et al showed that the number of publications using AI/ML in the life sciences rose from 596 in 2010 to 12,422 in 2019. Similarly, the number of clinical trials published on NCBI involving AI has grown dramatically over the last decade (see figure). The same review also showed that medical imaging is the area where AI has had the biggest impact in precision medicine, with 46% of the 29 officially FDA-approved algorithms being applied to radiology modalities such as CT, MRI and mammograms. The success of AI in medical imaging is partly due to the development of deep learning frameworks such as TensorFlow and PyTorch. These frameworks enable data scientists to train algorithms known as convolutional neural networks (CNNs), which are highly specialised at classifying images.
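The core operation a CNN stacks in its layers is the 2D convolution. Rather than assume a TensorFlow or PyTorch installation, the sketch below implements a valid-mode convolution from scratch in NumPy on a tiny synthetic image; the image and the edge-detecting kernel are invented for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as in
    most deep learning frameworks) on a single-channel image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product of the kernel with one image patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to a tiny synthetic "image" whose
# left half is dark and right half is bright.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 1.0]])  # responds to left-to-right steps

feature_map = conv2d(image, edge_kernel)
print(feature_map.shape)  # (6, 5)
```

In a trained CNN, such kernels are not hand-designed but learned from data, and many of them are stacked with non-linearities and pooling to build up from edges to image-level features.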
Over the next decade we should expect the same pattern of adoption to emerge in digital pathology, where AI applied to large gigapixel histopathology images such as H&E slides will be used routinely in clinical decision making. Large consortia such as PathLAKE, which provide researchers with access to huge data lakes and computing infrastructure, are accelerating the adoption of AI in digital pathology. The next big step will be applying AI to molecular data; however, there needs to be a shift in our data management practices before this can be realised.
The adoption of data best practices is one of the most important challenges to address if the true value of precision medicine is to be delivered. Multi-omic data is inherently diverse: there are many file types, bioinformatics pipelines and preferred analysis methods. Adopting a data-driven mentality will be key if AI is to be applied successfully to multi-omic data. The power of precision medicine is inextricably linked to data and our ability to store and analyse it. Organisations that undertake a digital transformation now will be the ones that make the most impact.
Copyright © 2020 Sonrai Analytics Ltd