Biomarker Discovery using Flow Cytometry Data

Data types in this use case: Flow Cytometry

Use Case Highlight

The quest for reliable biomarkers in disease diagnosis, prognosis, and treatment is critical in biomedical research and in developing accurate diagnostics and effective therapies. In this use case, we showcase how Sonrai Discovery can be used to analyze flow cytometry data to discover potential biomarkers. Flow cytometry, a powerful tool for analyzing cells’ physical and chemical properties in a heterogeneous mixture, generates vast and complex datasets. Using AI and machine learning algorithms, Sonrai Discovery can identify patterns and correlations in data that might elude traditional analysis methods. By automating parts of the data analysis process, AI reduces the likelihood of human error and accelerates the identification of potential biomarkers.

Challenges of Flow Cytometry Datasets

Flow cytometry datasets have become larger in recent years with the progress of spectral flow cytometry and mass cytometry. Several recent flow cytometer models allow the design of high-throughput panels of 30 fluorescent markers, offering the opportunity to identify more cell populations of interest. This can be extremely valuable in immuno-oncology clinical trials, where more cell types can be characterized in a single panel. This recent technological progress has reinforced the importance of fluorescence-activated cell sorting (FACS) in biomarker research.

These datasets pose several challenges due to their size and complexity, and make traditional data analysis methods inefficient and prone to errors. Flow cytometry data often involves multiple parameters per cell, resulting in high-dimensional datasets. Analyzing such data can be overwhelming and complex due to the sheer volume of information and the intricate relationships between different data points.

A classical workflow for flow cytometry-based biomarker discovery involves:

  1. Manually entering immune cell expression values into statistical analysis software
  2. Performing statistical analysis
  3. Identifying immune cell expression differences between groups

This approach is error-prone and slow for today’s larger datasets. Statistical analysis performed this way may also omit correction for multiple testing, which inflates the number of false-positive findings when many cell types are compared.
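To illustrate the multiple-testing point, here is a minimal sketch of the Benjamini-Hochberg false discovery rate adjustment in plain Python. The p-values are hypothetical placeholders (one per cell type), and the 0.05 threshold is an assumption chosen for illustration, not a setting taken from Sonrai Discovery.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per immune cell type (illustrative only).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
adj_p = benjamini_hochberg(raw_p)

# Assumed significance threshold of 0.05.
significant = [p <= 0.05 for p in adj_p]
print(list(zip(raw_p, adj_p, significant)))
```

In this toy example, four raw p-values fall below 0.05, but only two remain below the threshold after adjustment.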

In this use case we present how AI can assist with biomarker discovery using data generated by flow cytometry to enable meaningful, data-driven insights. 

Example Flow Cytometry Data Analysis Scenario

We use the following scenario to help illustrate this. 

  • Dataset (simulated):
    • N = 250: 125 controls and 125 treated patients.
    • 12 immune cell types.
    • Flow cytometry data set, cell proportions expressed as % of live cells.
    • Treated patients received checkpoint inhibitor therapy. 
  • Identify predictive biomarkers of interest from this flow cytometry data set.
  • Identify the mechanism of action. 
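For readers who want to follow along in a notebook, a dataset with the same shape as this scenario could be simulated roughly as follows. This is only a sketch: the cell type names, the gamma-distributed weights, the choice to make the 12 proportions sum to 100%, and the treatment effect on two cell types are all assumptions made for illustration and do not describe how the original simulated dataset was built.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical labels: the scenario does not name the 12 cell types.
CELL_TYPES = [f"cell_type_{i + 1}" for i in range(12)]


def simulate_patient(group):
    """Return one patient record with cell proportions as % of live cells."""
    # Positive random weight per cell type (assumed gamma distribution).
    weights = [random.gammavariate(2.0, 1.0) for _ in CELL_TYPES]
    if group == "treated":
        # Assumed treatment effect on two cell types (illustrative only).
        weights[0] *= 2.0
        weights[1] *= 0.5
    total = sum(weights)
    # Simplifying assumption: the 12 proportions sum to 100% of live cells.
    record = {name: 100.0 * w / total for name, w in zip(CELL_TYPES, weights)}
    record["group"] = group
    return record


patients = [simulate_patient("control") for _ in range(125)]
patients += [simulate_patient("treated") for _ in range(125)]

print(len(patients), "patients,", len(CELL_TYPES), "cell types")
```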

Given the size of the data set, how can we quickly identify biomarkers of interest?

Applying AI and ML to Accelerate Analysis

Using our code or no-code platform, users can quickly identify potential biomarkers of interest using machine learning, compare cell type proportions between patients, and generate reports for stakeholders.

Thanks to seamless integration with R and Python notebooks, users can develop custom pipelines for data cleaning and visualization, either in-house or with the support of our software engineers.

Step 1: Add all of your flow cytometry data into Sonrai Discovery for further processing and analysis

Image 1. Visualize Dataset with Table Viewer

Step 2: Within Sonrai Discovery, identify and handle missing values with our data imputation no-code application

Image 2. Data preprocessing in Sonrai Discovery

Detecting and addressing missing values is a key step in data preprocessing. Our data imputation application offers an interface to detect and handle missing values without requiring the user to code. Users who prefer to implement their own pipeline can run custom scripts in our managed R and Python notebooks.
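As an illustration of what a custom imputation script might look like, the sketch below counts missing values per column and replaces them with the column median. Median imputation is just one simple strategy chosen here as an assumption; the column names and values are hypothetical, and this is not a description of the method used by the no-code application.

```python
import statistics


def impute_median(rows, columns):
    """Replace None entries with the per-column median of observed values."""
    medians = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        medians[col] = statistics.median(observed)
    imputed = [
        {col: (row[col] if row[col] is not None else medians[col]) for col in columns}
        for row in rows
    ]
    # Report how many values were missing in each column.
    missing_counts = {col: sum(row[col] is None for row in rows) for col in columns}
    return imputed, missing_counts


# Hypothetical measurements (% of live cells); None marks a missing value.
rows = [
    {"cd4_t": 30.0, "cd8_t": 12.0},
    {"cd4_t": None, "cd8_t": 15.0},
    {"cd4_t": 34.0, "cd8_t": None},
    {"cd4_t": 28.0, "cd8_t": 9.0},
]
clean, n_missing = impute_median(rows, ["cd4_t", "cd8_t"])
print(n_missing)
print(clean)
```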

Step 3: Apply dimension reduction methods: Principal Component Analysis (PCA)

Thanks to recent advances in flow cytometry, scientists can routinely design panels encompassing 18 to 30 markers and study the expression of a large number of immune cell subsets. This high-dimensional data can be challenging to analyze. Sonrai Discovery uses PCA and t-SNE plots to represent high-dimensional data in a three-dimensional space, making it easier for researchers to identify distinct cell populations and potential biomarkers. We offer no-code applications to perform dimension reduction using PCA or t-SNE. Using our managed R and Python notebooks, users can also implement their own custom dimension reduction pipelines, for example with UMAP.

Image 3. High-dimensional data can be reduced to identify patterns.

By coloring observations by treatment, we can see a clear difference between the control and treated patients (Image 3). This suggests that we can train a classification model to classify patients based on immune cell expression data. Using this model, we will be able to explore the key biomarkers that differentiate the control and treated patients.
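To make the projection step more concrete, the following sketch implements a small PCA in plain Python using power iteration on the covariance matrix. It is a teaching example under stated assumptions, not the platform's implementation: the three synthetic features, group means, and iteration count are hypothetical, and in practice a numerical library would normally be used instead.

```python
import random


def center(data):
    """Subtract the column mean from every feature."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, rng, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of cov."""
    d = len(cov)
    vec = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    # Rayleigh quotient gives the eigenvalue estimate for the unit vector.
    value = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d))
    return value, vec


def pca(data, n_components=2, seed=0):
    rng = random.Random(seed)
    centered = center(data)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        value, vec = top_component(cov, rng)
        components.append(vec)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    # Project each centered observation onto the components.
    scores = [[sum(r[j] * comp[j] for j in range(d)) for comp in components]
              for r in centered]
    return components, scores


# Synthetic data: only the first feature differs between groups (assumption).
data_rng = random.Random(0)
control = [[data_rng.gauss(10, 1), data_rng.gauss(5, 1), data_rng.gauss(20, 1)] for _ in range(50)]
treated = [[data_rng.gauss(16, 1), data_rng.gauss(5, 1), data_rng.gauss(20, 1)] for _ in range(50)]
components, scores = pca(control + treated, n_components=2)

pc1_control = sum(s[0] for s in scores[:50]) / 50
pc1_treated = sum(s[0] for s in scores[50:]) / 50
print("Mean PC1 score, control vs treated:", pc1_control, pc1_treated)
```

Plotting the two score columns and coloring points by group would reproduce the kind of separation described above.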

Step 4: Build a classifier model using XGBoost

XGBoost (eXtreme Gradient Boosting) is a powerful machine learning algorithm that combines sequential weak learners to train robust models that predict continuous variables (regression) or categorical variables (classification). It often produces more accurate models than simpler linear or single-tree algorithms while remaining more explainable than neural network models. Although tuning its hyperparameters can be challenging, the algorithm offers a good balance between explainability and model performance, which makes it an excellent choice for biomarker discovery.

Using our no-code application, users can easily train an XGBoost classifier.
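The idea of combining sequential weak learners can be shown with a deliberately simplified sketch: gradient boosting with one-split decision stumps and squared-error loss, written in plain Python. This is not XGBoost itself, which adds regularization, second-order gradient information, and deeper trees, among other refinements. The feature values, labels, number of rounds, and learning rate below are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single feature/threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, thr, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, row):
    j, thr, left_mean, right_mean = stump
    return left_mean if row[j] <= thr else right_mean


def fit_boosting(X, y, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, lr = model
    score = base + sum(lr * predict_stump(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0  # 1 = treated, 0 = control


# Hypothetical proportions of two cell types (% of live cells).
X = [[2.0, 7.0], [3.0, 6.5], [2.5, 8.0], [3.5, 7.5],
     [6.0, 3.0], [7.0, 2.5], [6.5, 4.0], [7.5, 3.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_boosting(X, y)
train_predictions = [predict(model, row) for row in X]
print(train_predictions)
```

Each stump on its own is a weak learner. The ensemble improves because every new stump is fitted to the errors left by the previous ones.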

Step 5: Evaluate model performance

It is crucial to evaluate model performance to ensure that the model is not overfitting but learning genuine biological patterns in cell expressions across patient groups. 

Examining the confusion matrix below, we can see that the model produces a high proportion of true positives and true negatives on the test dataset, which was not used during training. This indicates that the model is robust and is learning to detect immune cell expression differences between patient groups, as further confirmed by its high sensitivity and specificity scores.

The Sonrai Discovery platform includes pre-set parameters to avoid and detect overfitting.
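One common way to check for overfitting in a custom notebook is to hold out a test set that the model never sees during training. The sketch below shows a stratified hold-out split in plain Python. The 20% test fraction and the seed are assumptions, and this is not a description of the platform's pre-set parameters.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Group labels matching the scenario: 125 controls and 125 treated patients.
labels = ["control"] * 125 + ["treated"] * 125
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), "training samples,", len(test_idx), "test samples")
```

A large gap between training and test performance would suggest that the model is memorizing the training data rather than learning patterns that generalize.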

Image 4. Confusion matrix on test data

Table 1: Model metrics on test data (recovered column header: Patients number)

Sensitivity (also known as Recall or True Positive Rate) is defined as: 

Sensitivity = True Positives / (True Positives + False Negatives)

Specificity (also known as True Negative Rate) is defined as:

Specificity = True Negatives / (True Negatives + False Positives)

The metrics table allows us to evaluate our model's performance. Here, the model achieves high scores for both sensitivity and specificity, indicating that it is learning to differentiate between patient groups based on their immunological profiles.
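The two definitions above translate directly into code. The following sketch computes both metrics from true and predicted labels. The labels are hypothetical and do not reproduce the results in Table 1, and the functions assume that both classes are present in the test set, since otherwise a denominator would be zero.

```python
def confusion_counts(y_true, y_pred, positive="treated"):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    # True positive rate: share of positive cases correctly identified.
    return tp / (tp + fn)


def specificity(tn, fp):
    # True negative rate: share of negative cases correctly identified.
    return tn / (tn + fp)


# Hypothetical test-set labels and predictions (illustrative only).
y_true = ["treated"] * 5 + ["control"] * 5
y_pred = ["treated"] * 4 + ["control"] + ["control"] * 4 + ["treated"]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print("Sensitivity:", sens, "Specificity:", spec)
```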

Step 6: Identify biomarkers using the Feature Importances plot

Image 5. The Feature Importances plot makes it possible to quickly identify the important biomarkers that differentiate patient groups.
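Feature importance can be estimated in several ways. The sketch below uses permutation importance, a model-agnostic approach that measures how much accuracy drops when one feature's values are shuffled. It is shown here as an assumption for illustration and is not necessarily how the Feature Importances plot is computed. To keep the example self-contained, it uses a simple nearest-centroid classifier and synthetic data, and it evaluates on the training data for brevity. In practice a held-out set would be preferable.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid model: the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [row for row, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, row):
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    return sum(predict(centroids, row) == t for row, t in zip(X, y)) / len(y)


def permutation_importance(centroids, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data: feature 0 differs between groups, feature 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(10, 1), data_rng.gauss(5, 1)] for _ in range(40)]
X += [[data_rng.gauss(16, 1), data_rng.gauss(5, 1)] for _ in range(40)]
y = [0] * 40 + [1] * 40

model = fit_centroids(X, y)
importances = permutation_importance(model, X, y)
for name, score in zip(["informative_marker", "noise_marker"], importances):
    print(name, round(score, 3))
```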

Step 7: Explore cell types identified using XGBoost

Now that we have identified the immune cell types of interest that differentiate patient groups, we can explore the differences in cell expression using our no-code Boxplot application.

Image 6. Cell types identified using XGBoost
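The numbers behind a boxplot can also be computed directly in a notebook. This sketch derives the five-number summary per patient group with Python's standard `statistics` module. The cell type name and values are hypothetical, and the "inclusive" quantile method is one of several conventions.

```python
import statistics


def box_summary(values):
    """Five-number summary of the kind a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


# Hypothetical proportions of one cell type (% of live cells) per group.
proportions = {
    "control": [10.0, 12.0, 11.0, 13.0, 9.0],
    "treated": [18.0, 20.0, 19.0, 22.0, 21.0],
}

summary = {group: box_summary(values) for group, values in proportions.items()}
for group, stats in summary.items():
    print(group, stats)
```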

Step 8: Create custom analyses using Python or R notebooks

Using our R and Python notebooks, users can develop custom pipelines for data cleaning and visualization, either in-house or with the support of our software engineers and bioinformaticians.

Image 7. Sonrai Discovery offers code and no-code discovery

Example of a publication-ready graph using R

Here, coding users can also train custom machine learning models and/or generate custom figures using their preferred tools with our managed R or Python notebooks.  

Image 8. Example of publication-ready figures using R, including statistical analysis.


By leveraging AI, researchers can overcome many of the inherent challenges of flow cytometry data analysis, leading to more efficient, accurate, and insightful outcomes. Our platform facilitates biomarker discovery from flow cytometry and/or molecular biology datasets.

We support both technical bioinformaticians with our code solutions, as well as other researchers with our no-code offering, to help researchers get the most out of their data, regardless of programming experience.
