AI Biomarker Discovery Workflow
Author: Dr Craig Davison
End-to-End Biomarker Discovery
In the following use case, we'll delve into an illustrative workflow employing Sonrai Discovery for comprehensive biomarker discovery. This particular segment is dedicated to the analysis phase. However, for a complete overview, encompassing data management, secure collaboration, and reporting findings, please don't hesitate to contact us. We'd be delighted to provide you with the full picture.
Build Your Storyboard
In this instance, our objective is to establish a classifier for identifying bladder cancer disease status using microRNA data. To achieve this, we will develop a workflow tailored to this use case. The workflow involves configuring custom parameters, including inclusion / exclusion criteria, fine-tuning parameters for various applications, and training a machine learning model.
After uploading, curating, and merging our data from various sources into an AI-ready format, our initial step involves constructing a storyboard. Within this process:
1. We choose pie charts and histograms to visualize the data and select the correct subcohort.
2. A t-SNE (t-distributed stochastic neighbor embedding) is selected to visualize complex data as a scatterplot.
3. Subsequently, we will train an XGBoost model and utilize a boxplot to validate the features identified by the XGBoost.
These applications can function independently, but they can also be used in a sequence, with data passed seamlessly from one to the next. The order can be easily modified before starting the analysis.
Launch The Analysis
The user interface is easy to navigate, so we can quickly customise the applications we have selected. We select the variables we want to look at (disease status, age and sex) to see our selected graphs.
Filter for Bladder Cancer and Non-Cancer Controls
As we can see below, we're looking at more than just bladder cancer and non-cancer controls; there are other cancers within this cohort. So, let's create a filter to ensure that we're only looking at bladder cancer and non-cancer controls.
Instantly Updated With Our New Filter Applied
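For readers who like to see the logic spelled out, the filter amounts to keeping only the records whose disease status is in an allowed set. The sketch below is a minimal illustration in plain Python, not how Sonrai Discovery implements it; the record fields, IDs and status labels are hypothetical.

```python
# Hypothetical patient records; field names and values are illustrative only.
patients = [
    {"id": "P001", "disease_status": "Bladder Cancer", "age": 67, "sex": "M"},
    {"id": "P002", "disease_status": "Non-Cancer Control", "age": 54, "sex": "F"},
    {"id": "P003", "disease_status": "Lung Cancer", "age": 71, "sex": "M"},
    {"id": "P004", "disease_status": "Bladder Cancer", "age": 59, "sex": "F"},
]

# The two groups we want to keep (assumed label spellings).
KEEP = {"Bladder Cancer", "Non-Cancer Control"}


def filter_cohort(records, keep=KEEP, key="disease_status"):
    """Return only the records whose status is in the allowed set."""
    return [record for record in records if record[key] in keep]


subcohort = filter_cohort(patients)
print([record["id"] for record in subcohort])
```

Because the filter is defined once on the cohort, every downstream step works on the same subcohort.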
t-SNE: Setting Parameters
As we have added our apps into a sequence, we can now hit 'Next App' to go to our t-SNE app. We pass our data through the app to visualize this complex data in a scatter plot, with each point representing a patient in this case.
No filters are necessary, as they have already been carried over from the previous app. We can now set the parameters; today we're interested in disease status. First we tap 'Calculate PCA': reducing the dimensions with PCA before running the t-SNE is a peer-reviewed approach to speeding it up.
The PCA output illustrates the number of principal components that account for the variability between the disease status groups, in this instance bladder cancer and non-cancer controls. These findings are useful for configuring the parameters of the t-SNE application. Once satisfied with the settings, we simply click 'Apply' to access the interactive t-SNE graph below.
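As a rough illustration of how the PCA step can inform the t-SNE settings, the sketch below picks the smallest number of principal components whose cumulative explained variance reaches a chosen target. The per-component variances and the 90% target are hypothetical, and the function name is ours rather than anything from Sonrai Discovery.

```python
def components_for_variance(variances, target=0.90):
    """Return (k, covered): the smallest number of components whose
    cumulative explained-variance ratio reaches `target`."""
    total = sum(variances)
    cumulative = 0.0
    # Principal components are ordered from largest to smallest variance.
    for k, variance in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += variance / total
        if cumulative >= target:
            return k, cumulative
    return len(variances), cumulative


# Hypothetical variances for the first five principal components.
variances = [50.0, 25.0, 17.0, 5.0, 3.0]
k, covered = components_for_variance(variances, target=0.90)
print(f"{k} components cover {covered:.0%} of the variance")
```

Feeding only those leading components into the t-SNE keeps most of the signal while reducing the amount of computation.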
Interactive t-SNE Graph
Changes in the Explained Variance and KL Divergence Scores
Modifying the parameters in the t-SNE application is easy, and as you make updates, you'll notice changes in both the explained variance and the KL divergence score. Sonrai simplifies the process of parameter exploration, allowing you to determine what works best for your needs. As an illustration, you have the option to generate an interactive 3D visualization (as seen below).
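For context on the KL divergence score: t-SNE works by minimising the Kullback-Leibler divergence between the pairwise similarities in the original space (P) and those in the embedding (Q), so a lower score means the embedding preserves the neighbourhood structure more faithfully. The sketch below computes the divergence for small, hypothetical distributions; it is a worked example of the formula, not the application's internal code.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical similarity distributions.
p = [0.5, 0.3, 0.2]           # similarities in the original space
q_close = [0.45, 0.35, 0.20]  # an embedding that preserves them well
q_far = [0.10, 0.10, 0.80]    # an embedding that distorts them

print(kl_divergence(p, q_close))
print(kl_divergence(p, q_far))
```

The first value is much smaller than the second, which is the pattern to look for when comparing parameter settings.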
Rendered as 3D
The t-SNE analysis clearly reveals two distinct clusters, one for bladder cancer and the other for non-cancer control participants. This finding gives us confidence that this data is well-suited for training a robust classifier. So we will now move to the next application - the XGBoost model.
XGBoost Classification - Exclude Unique IDs
Next, we'll proceed to train a classifier designed to differentiate between bladder cancer and non-cancer controls. Setting the parameters is simple; we can focus on disease status as our target variable, and decide whether to include or exclude features. For instance, here, we opt to exclude unique IDs from the model, as they don't contribute to the analysis. Adjusting the testing / training data split is also possible, with a default split of 60/40. Once the parameters are configured to our satisfaction, we can initiate the model training process.
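To make the two settings concrete, the sketch below drops an identifier column from each record and then shuffles and splits the records 60/40. We assume here that the 60% portion is the training set; the text above does not say which side is which. Column names and values are hypothetical.

```python
import random


def drop_columns(row, excluded):
    """Return a copy of the row without the excluded columns (e.g. unique IDs)."""
    return {key: value for key, value in row.items() if key not in excluded}


def split_rows(rows, train_fraction=0.6, seed=0):
    """Shuffle reproducibly, then split into (train, test).

    Assumption: the 60% share is used for training.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: one made-up feature plus the target variable.
rows = [
    {
        "id": f"P{i:03d}",
        "mir_a": i * 0.1,
        "disease_status": "Bladder Cancer" if i % 2 else "Non-Cancer Control",
    }
    for i in range(10)
]

features = [drop_columns(row, {"id"}) for row in rows]
train, test = split_rows(features)
print(len(train), len(test))
```

A unique ID carries no biological signal, and leaving it in can let a model memorise individual samples, which is why it is excluded before training.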
As you can see above, training ran for 100 rounds, but the blue testing line reached 0 after only 13. The model doesn't need that many rounds, so we'll adjust the parameters and retrain accordingly.
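One simple way to decide how many rounds are enough is to find the first round at which the test error reaches its minimum. The sketch below does this for a hypothetical error curve shaped like the one described above (it hits 0 at round 13 of 100). The curve values are invented for illustration; the open-source XGBoost library also offers early stopping for this purpose.

```python
def first_best_round(test_errors):
    """Return the 1-based round at which the test error first hits its minimum."""
    best = min(test_errors)
    return test_errors.index(best) + 1


# Hypothetical curve: misclassified test samples after each of 100 rounds.
test_errors = [max(0, 12 - i) for i in range(100)]

rounds_needed = first_best_round(test_errors)
print(rounds_needed)
```

Training beyond that point adds computation without improving the test result, so the number of rounds can be reduced accordingly.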
Confusion Matrix with Results on Sensitivity and Specificity
Now, let's examine the confusion matrix to evaluate the model's performance in both training and testing rounds. As you can see below, the model demonstrates exceptional sensitivity and specificity, both at a perfect score of 1. You would be correct to be skeptical of a model that performed so well. However, in this particular case, these are genuine outcomes from the dataset, corroborated by an associated publication.
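As a reminder of how these two scores are derived from a confusion matrix, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The sketch below computes both from hypothetical labels that include one missed cancer case, so the result is deliberately not the perfect score reported above.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fn, fp, tn) for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def sensitivity_specificity(y_true, y_pred, positive):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: one bladder cancer case is misclassified as a control.
y_true = ["cancer"] * 4 + ["control"] * 4
y_pred = ["cancer", "cancer", "cancer", "control"] + ["control"] * 4

sensitivity, specificity = sensitivity_specificity(y_true, y_pred, positive="cancer")
print(sensitivity, specificity)
```

A score of 1 for both, as in this use case, means there are no false negatives and no false positives in the evaluated set.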
Now that we know the model performs well, we want to see which features it is using. With the XGBoost model we can check which features are the most important, making this a grey box machine learning tool rather than a black box. This is useful because it lets us verify these markers individually, which brings us to the next application.
Boxplot - Individually Verify the Features
Let's configure our boxplot. We'll set our X-axis variable to "disease status," opt to color by disease status, and for the Y-axis, we'll select the specific features. For instance, let's choose v5582, the top feature, to instantly generate a boxplot that provides statistics comparing bladder cancer and non-cancer controls.
The visualization below makes it obvious why v5582 emerged as the top feature in the XGBoost application and why the model relied on it to differentiate between participants with bladder cancer and those without cancer. The data shows a significant difference in expression levels, with bladder cancer participants exhibiting notably lower expression compared to non-cancer controls.
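A boxplot is a picture of each group's five-number summary, so the comparison can also be expressed numerically. The sketch below computes that summary for two groups using Python's standard library. The expression values are invented for illustration and are not the v5582 measurements from this dataset.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max: the numbers a boxplot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical expression values for a single feature.
bladder_cancer = [2.1, 2.4, 2.6, 3.0, 3.3]
non_cancer_control = [5.8, 6.1, 6.4, 6.9, 7.2]

cancer_summary = five_number_summary(bladder_cancer)
control_summary = five_number_summary(non_cancer_control)
print(cancer_summary)
print(control_summary)
```

When the highest value in one group sits below the lowest value in the other, as in this made-up example, a single feature is enough to separate the groups, which is the kind of pattern that makes a feature rank highly in a classifier.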
As demonstrated, crafting a personalized workflow, fine-tuning parameters for various applications, and training machine learning models are seamless tasks in Sonrai Discovery. This adaptable workflow concept extends to a wide array of applications in biomarker and drug discovery. For specific use cases, feel free to connect with the Sonrai team. We're here to help.