USE CASE

End-to-end Bulk RNA-Seq Analysis Workflow

Author: Kai Lawson-McDowall, Bioinformatician at Sonrai Analytics

Updated: 12/03/2024

Kai Lawson-McDowall

Kai Lawson-McDowall

Bioinformatician at Sonrai

Kai Joined Sonrai Analytics in March of 2023, and possess a strong background in pre-clinical screening of gene therapies, big data in oncology, and cloud solutions for bioinformatics. Currently, Kai focuses on the pipeline development and scaling at Sonrai Discovery, as well as aiding clients in particular with the processing of proteomics and transcriptomics data. Prior to joining Sonrai Analytics, Kai has previously served as a bioinformatician at Cancer Research UK, GlaxoSmithKline, and Broken Strings Biosciences. He earned his undergraduate in Biochemistry from the University of Glasgow, and completed his masters at the University of Leeds in Genomics and Data Analytics.

Data types in this use case: RNA-Seq

Objective

The purpose of this document is to describe a workflow detailing the end-to-end analysis of bulk RNA-Seq data from raw data (FASTQ files) to interpretable results on the Sonrai Discovery platform. This guide will first look at the analysis via the NextFlow RNA-Seq pipeline to process Raw Data, followed by a no-code exploration and visualization of this data via the Sonrai Discovery Storyboards.

Introduction

Bulk RNA-Seq can be an incredibly powerful tool to help understand gene expression patterns, discover new genes, transcripts and variants, in an incredible variety of cells, tissues and organisms. Despite well established laboratory protocols for sequencing this data, there is often a lack of gold standard for bulk RNA-Seq data analysis [1]Historically, both academia and industry have had to produce and maintain their own bespoke pipelines which, despite yielding results, may be esoteric, unreproducible when using another pipeline, become unreproducible if versions of software are changed internally, or become outdated with lack of maintenance.

The use of the NextFlow-Core RNA-Seq pipeline [2], and nf-core community pipelines broadly, have been a  community-wide effort to tackle the issues stated above by producing easy-to-use, gold-standard pipelines, that are version-controlled, highly-documented, and flexible with their input and analysis depending on the exact outputs required (Fig. 1). In addition, the nf-core pipelines can leverage the power of cloud computing, allowing us to split our workloads across multiple computers, improving the speed at which pipelines run from days on a single computer to hours or minutes using cloud technology.

Our analysis today will leverage the green route shown on the nf-core RNA-Seq pipeline outlined in Fig.1 – aligning our reads to the most recent version of the human reference genome at the time of writing (the T2T-CHM13v2.0 [3] ) using the STAR aligner and Salmon Quantifier [4] . Moreover, Sonrai’s nf-core pipeline will leverage Amazon Web Services (AWS) HealthOmics [5] , Amazon’s platform for hosting pipelines to optimize pipeline execution for both cost and performance. 

Figure 1. An adapted version of the nf-core RNAseq “tube map” that provides a visual representation of the different steps within the pipeline, as well as categories these steps belong to in gray boxes (e.g. Pre-processing, alignment, quantification) and different routes that can be taken, which begin to differ based on the choice of either full or pseudo-alignment and quantification. This demonstration will take the green route (Aligner: STAR, Quantification: Salmon). In our workflow, we will take the gene counts produced by running the RNAseq pipeline on Sonrai’s dedicated cloud infrastructure, using the AWS HealthOmics service, and load these into Sonrai Discovery to perform downstream analysis to produce the visualizations seen on the right hand side.

Figure 1. An adapted version of the nf-core RNA-Seq “tube map” that provides a visual representation of the different steps within the pipeline, as well as categories these steps belong to in gray boxes (e.g. pre-processing, alignment, quantification) and different routes that can be taken, which begin to differ based on the choice of either full or pseudo-alignment and quantification. This demonstration will take the green route (Aligner: STAR, Quantification: Salmon). In our workflow, we will take the gene counts produced by running the RNA-Seq pipeline on Sonrai’s dedicated cloud infrastructure, using the AWS HealthOmics service, and load these into Sonrai Discovery to perform downstream analysis to produce the visualizations seen on the right hand side.

Although the nextflow pipeline produces high-quality outputs containing the raw counts of different genes for different samples, and a detailed multiQC report for understanding the quality of sequencing data, it does not produce immediate and effective visualizations that help extract biological meaning from this data. This is where the Sonrai Discovery Platform [6] and its large variety of statistical and bioinformatic tools can be used to ingest and analyze the nf-core RNA-Seq outputs. For this workflow, will:

  1. Understand the inherent quality and structure of the data using principal component analysis (PCA).
  2. Perform differential gene expression analysis using a volcano plot to select biomarkers. 
  3. Visualize the pattern of expression via a heatmap to further refine our  biomarker search. 
  4. Perform classical statistical analysis via a boxplot and confirm our discovery by revisiting the principal components analysis.

This workflow requires no coding or bioinformatician, and can be done by a researcher or non-technical person in less than 20 minutes.

Dataset Analyzed

The data used in this tutorial is taken from the paper “Functional Analysis and Transcriptomic Profiling of iPSC-Derived Macrophages and Their Application in Modeling Mendelian Disease” by Zhang et al., 2015 [7].

We will be using 6 samples of macrophages derived from human monocytes (HMDMs). Three samples have been treated with an endotoxin and interferon-gamma for 18-20 hours to induce an inflammatory response (M1), whereas the other 3 have been left untreated (M0). We can use our RNA-Seq pipeline followed by Sonrai Discovery to interrogate the quality of our sequencing data and investigate differences in gene expression between these two treatment groups.

sample 

unique_id 

description

A1BG

A2M

HMDM_MAC_Rep1

SRR1182374

M0-HMDM

95.598319

14681.575644

HMDM_MAC_Rep2

SRR1182375

M0-HMDM

61.626786

9171.506656

HMDM_MAC_Rep3

SRR2910663

M0-HMDM

229.812337

26119.492960

HMDM_M1_Rep1

SRR1182376

M1-HMDM

60.190322

6740.259626

HMDM_M1_Rep2

SRR1182377

M1-HMDM

48.030779

5382.382654

HMDM_M1_Rep3

SRR2910664

M1-HMDM

115.511736

23485.441039

Table 1. A summary of the metadata used for this experiment. Many of the comparisons will be done using the description column, as it presents a binary (i.e. M0 -HMDM vs M1-HMDM). The additional columns (A1BG and A2M as illustrative examples) will use bias-corrected count values for each of our genes, of which there are over 42,000.

If you'd like to talk to Kai or another expert from our team, reach out to us

Contact our friendly team for expert guidance and transformative insights.

Quality Control - MultiQC

A key output of the pipeline prior to analysis on Sonrai Discovery is its interactive multiQC report [8] , which compiles analysis logs from numerous tools throughout the pipeline with over 40 different metrics from 10 different tools. Despite the large amount of data, the multiQC report also provides overall summaries coalescing these data into summary tables, and will throw warnings or failure messages if a particular metric has not met the threshold, meaning an in-depth knowledge of every plot and table is not needed. Additionally, if a metric has failed, it does not necessarily mean anything is wrong with the run, but that the specific nature of the data does not operate with that tool.

These are fully self-contained, downloadable (including each plot within the report), customisable prior to running the pipeline and afterwards, and provide the version of each and every tool. Most importantly, they communicate the quality of our data from the sequencing to the final, analysis-ready tables.

Figure 2. Initial opening view of the multiQC report, the full video for which can be found here. This initial table shows the summary of sequencing, including duplication, number of reads mapped, GC content, strand bias, and many more. The full video of the report and browsing it can be found here:  ipsc_multiqc_browsing.webm

From an initial scan of the multiQC report, the majority of metrics throw no errors, and the measurements which do are due to particular modules that anticipate different data modalities [9] (e.g. whole genome sequencing as opposed to RNA-Seq) and so will inevitably fail. Given the overwhelming number of metrics that have passed, this RNA-Seq data is amenable to downstream analysis

Analysis Using The Sonrai Discovery Platform

Principal Components Analysis (PCA)

Our first port-of-call for a high-dimensional dataset (where each dimension equals a gene or transcript) may be dimensionality reduction, which allows us to detect biologically meaningful clusters and establish outliers in our dataset by collapsing all of our data into 2 or 3 visual dimensions. A well-established and commonly used technique, particularly for -omics data is principal components analysis [10]. We can do this easily through our inbuilt PCA functionality by selecting the number of principal components we wish to calculate (up to 100), as well as what we wish to color our samples by. In this case, we are interested if the structure of the data reveals differences between our untreated (M0) and inflamed (M1) HMDMs, and so color the points accordingly. To the right hand side, we can also see the portion of total variation that each principal component captures.

Figure 3. Principal component analysis of our RNAseq expression data, Left-hand side) a 3D representation of the first 3 principal components, these are coloured by if they are M0 or M1 macrophages, and the labels provide their PCA coordinates, as well as their sample name and unique id. Right-Hand Side) The Scree plot, showing the cumulative variation explained for each additional principal component. In our case, we can explain all the variation in our dataset via the first 5 principal components. Overall, this reveals that there are inherent differences between M0 and M1 Macrophages explained by the data. @ipsc_pca.webm

Figure 3. Principal component analysis of our RNA-Seq expression data, Left-hand side) a 3D representation of the first 3 principal components, these are coloured by if they are M0 or M1 macrophages, and the labels provide their PCA coordinates, as well as their sample name and unique id. Right-Hand Side) The Scree plot, showing the cumulative variation explained for each additional principal component. In our case, we can explain all the variation in our dataset via the first 5 principal components. Overall, this reveals that there are inherent differences between M0 and M1 Macrophages explained by the data. For a video of this app in action: @ipsc_pca.webm

As we can see from the PCA analysis, there is a distinct separation (if not tight clustering) of the control (M0) and treated (M1) groups, indicating inherently different structures in the gene expression data between the control and inflamed macrophages. Although PCA does not provide immediate insight into specific biological differences, it tells us there are underlying differences in our data that can be explored later in our workflow.

Volcano Plot

Having visualized the underlying structure of this data, we want to try and uncover what genes in particular are significantly differentially expressed. To this end, we can deploy a volcano plot [11] using the python implementation of the deseq2 algorithm [12] to understand both fold change (how much a gene is up or down regulated), and how significant this change is.

In our plot, we choose which two comparisons we would like to make (in this case, between M0 and M1), and produce our volcano as simply as that. Given the interactive nature of this plot, we can immediately pick out which genes have been both significantly, and most greatly up and down regulated. As we input M1 and M0 as “Contrast variable 1” and “Contrast Variable 2” respectively, we can read the log fold change as “compared to control”, i.e if we see a log-fold change of 9.6 for the gene SNORD22, it means that gene is in reality is around 766x more upregulated in M0 and than M1 (as log2(776) ~ 9.6), the converse is also true (negative Log2FoldChange means larger expression in the inflamed cells).

Figure 4. Volcano plot showing the upregulation (green) and downregulation (red) of 42,000+ genes in M0 with respect to M1 via the deseq2 algorithm. The upregulation can also be thought of as genes which are less expressed in M1, and the downregulation can be thought of as genes which are more expressed in M1. This volcano plot uses a cut-off for the log 2 fold changes of -2 and 2 respectively (i.e. something must 4x up or downregulated) and a p-value of <0.05. However, these can be modified to the user’s preferences. With this particular set of filters, we find 3084 genes which are significantly differently regulated between M0 and M1. The highlighted gene, H2BC5, has a log fold change of -5.48, meaning it is upregulated ~47x more in the inflamed macrophages compared to control, and a significance 10^-255. For a video of this app in action: @ipsc_volcanoplot_browsing.webm

From our Volcano plot table viewer tab, we can interrogate these up and downregulated genes more closely, filtering for those that were significant, and then sorting by log2FoldChange to get the most upregulated genes (Fig.5 – top) and the most downregulated genes (Fig.5 – bottom), as well their corresponding metrics, such as their mean expression, log-fold change standard errors, p-values, and adjusted p-values.

Figure 5. Top) The top 5 most differentially expressed genes in the M1 (inflamed) Macrophages as compared to M0 (control macrophages) As it is all relative to M0, these are expressed as how much M0 is downregulated with respect to M1. Bottom) the top most differentially expressed genes in M0 compared to M1. From right to left there is the name of the gene, its mean expression across samples, its log 2 fold change, the standard error of the log 2 fold change. The T-statistic, p-value, adjusted p-value via benjamini-hochberg correction, and if it is upregulated, downregulated, or not significant.

Figure 5. Top) The top 5 most differentially expressed genes in the M1 (inflamed) Macrophages as compared to M0 (control macrophages) as it is all relative to M0, these are expressed as how much M0 is downregulated with respect to M1.

Figure 5. Bottom) The top most differentially expressed genes in M0 compared to M1. From left to right the column names are the feature (the gene name), its mean expression across samples, its log 2 fold change, the standard error of the log 2 fold change. The T-statistic, p-value, adjusted p-value via benjamini-hochberg correction, and if it is upregulated, downregulated, or not significant.

Heatmap

We can begin to break down expressions for these 10 differentially expressed genes by leveraging a heatmap, which will not only show us the expression for each sample for all 10 genes, but how they cluster together. Once again helping us understand if these genes can be used to explain the difference between the M0 and M1 groups.

Figure 8. A heatmap of the 10 most differentially expressed genes identified in the volcano plot differential expression analysis. This heat map uses the default distance metric to calculate the similarities between rows (genes) and columns (samples) respectively. This heat map reveals that there is similarity between the expression profiles for the M0 and M1 macrophages based on these genes as well as similarities between expression profiles for particular genes (e.g. LINC01961 and LINC03025). Lastly, The heatmap also reveals variability of gene expression between samples (e.g. LOC124904143 in M0).

Figure 8. A heatmap of the 10 most differentially expressed genes identified in the volcano plot differential expression analysis. This heat map uses the default distance metric to calculate the similarities between rows (genes) and columns (samples) respectively. This heat map reveals that there is similarity between the expression profiles for the M0 and M1 macrophages based on these genes as well as similarities between expression profiles for particular genes (e.g. LINC01961 and LINC03025). Lastly, The heatmap also reveals variability of gene expression between samples (e.g. LOC124904143 in M0) which shows very high expression in only one sample, suggesting the high expression we see for this may be anomalous rather than biologically meaningful.

The heatmap reveals several interesting features within the data. Firstly, it shows that these samples will cluster according to the values of these 10 genes, indicating that M0 and M1 macrophage populations can be distinguished from each other based on the expression patterns generated by this set of genes. Interestingly, it also shows that there is a degree of variation in the expression within these samples, for instance,  LINC02930 and LOC124904143 have a high amount of expression in HMDM_MAC_Rep3 (an M0 sample) but not in the other two M0 Samples, suggesting it could be anomalous. Conversely we see a consistent increase in expression across all M1 samples for CSAG3 and LOC101929800, suggesting these may be a more robust biomarker for inflammation.

Boxplot

Digging deeper into these 10 differentially expressed genes, we can now perform some classical statistics to compare their variation more directly. We can select the Boxplot tab, click “Add graph” to have two total graphs, and then select the genes of interest (5 most up and down regulated in each respective plot) and produce these boxplots to compare the raw expression between M0 and M1, as well as understand the variation in expression which can be obtained immediately from the plots interquartile ranges. The overlap between the boxplots can also tell us visually if our gene expressions are significantly different between M0 and M1 Groups (overlap – not significantly different, no overlap – significantly different).

Figure 9. Left) Genes downregulated in the M1 macrophages compared to M0 macrophages. Right) Genes upregulated in the M1 macrophages compared to the M0 macrophages. The greater variation of upregulated M0 genes on the LHS indicates that although genes appear downregulated inflamed macrophages, it may in part appear this way because there are anomalous samples in M0 with sporadically high levels of expression. Conversely, we have more both much larger and more consistent upregulated in the M1 macrophages on the LHS (note the larger y-axis on the RHS), Indicating significant and large differences in the expression of these genes in M1 macrophages as compared to M0.s

Figure 9. Left) Genes downregulated in the M1 macrophages compared to M0 macrophages.  Right) Genes upregulated in the M1 macrophages compared to the M0 macrophages. The greater variation of upregulated M0 genes on the LHS indicates that although genes appear downregulated in inflamed macrophages, it may appear this way because there are anomalous samples in M0 with sporadically high levels of expression. Conversely, we have both much larger and more consistent (smaller error bars) upregulation in the M1 macrophages on the LHS (note the larger y-axis on the RHS), Indicating significant and large differences in the expression of these genes in M1 macrophages as compared to M0.

The left plot shows that the genes upregulated in the uninflamed M0 samples are much more variable, and could indicate that this upregulation could be partially by chance, with 1 sample in particular having an abnormally high expression. Conversely, we see three highly significantly upregulated genes in the M1 macrophages (LINC03025, LOC101929800, CSAG3)  and with the exception of LINC03025, we see much tighter distributions for upregulated genes in M1, notably for LOC101929800 and CSAG3, confirming what we saw in our heatmap and reaffirming our suspicion these are candidate biomarkers for inflammation.

As we are primarily focused on understanding genes upregulated during the inflammatory response, we might be particularly interested in CSAG3, as it is not only significantly different in M1, but also consistently upregulated. As an additional confirmatory step, we can recolour our initial PCA by the expression of one of these genes and see that when we recolor our samples by their CSAG3 expression, we can  reproduce the distinction between the M0 and M1 groups in the same way coloring by the sample in the first place might do (Fig. 10).

Figure 10. A comparison between the initial PCA plot coloured by Sample and then coloured by the expression of CSAG3 in this dataset. As can be seen, the M0 and M1 groups, when coloured by their expression of CSAG3, show the same separation as when they’re coloured by their M0 or M1 status.

Figure 10. A comparison between the initial PCA plot coloured by Sample and then coloured by the expression of CSAG3 in this dataset. As can be seen, the M0 and M1 groups, when coloured by their expression of CSAG3, show the same separation as when they’re coloured by their M0 or M1 status.

Summary

To summarize, we have taken the outputs of the nf-core RNA-Seq pipeline, used the generated multiQC to interrogate and confirm the quality of our sequencing prior to downstream analysis. Through our platform, We have used principal component analysis to reduce the dimensionality of our data and show the data has patterns distinguishing M0 and M1 macrophages. From here, we used a volcano plot to help extract statistically significant, differentially expressed genes, and filter down a selection of 10 genes  to look for potential biomarkers of inflammation. We built upon our volcano plot app with a heatmap, which allowed us to show how the individual samples expression was similar to one another (reflected in the clustering of these samples) as well as initial understanding of variation between samples’ gene expression. We then confirmed the results of the box plot, separating our data into two boxplots which allowed us to understand both the spread and the statistical significance of the genes up and down regulated in M1 as compared to M0. This led us to select a particular gene of interest (CSAG3). Returning to the PCA, recolouring the sample by CSAG3 expression showed that it could reflect the differences in samples (i.e. if it was M0 or M1) nearly as well as the original samples labels, strongly suggesting that it could be a biomarker of inflammation.

Lastly, this demonstrates the power of the nf-core RNA-Seq pipeline in combination with the Sonrai Discovery platform to quickly and accurately convert raw data into actionable insights for biomarker discovery.

References

  1. RNA-Seq pipelines can be complex when coming from a non-code or non-bioinformatics standpoint. This is a detailed webinar intended for beginners to understand RNA-Seq analysis. Alternatively, A written guide to the analysis of RNA-Seq data  written by Matthew N. Bernstein also provides a good initial overview, as well as more technical detail if desired. 
  2. The NextFlow-Core RNAseq pipeline contains multiple tabs which provide an introduction to the pipeline, usage, parameters, outputs and results, which are incredibly well documented and detailed. 
  3. The link to the files used for the RNA-Seq alignment – i.e. the T2T-CHM13v2.0 genome.
  4. Explanations of both the STAR aligner and Salmon Quantifier used for aligning and quantifying our RNA-Seq reads when aligning to our genome 
  5. A quick introduction to Amazon Web Services (AWS) HealthOmics, Amazon’s platform for hosting pipelines to reduce cost and increase execution time. 
  6. The AI-driven, user-friendly Sonrai Discovery Platform used to analyze the raw RNA-Seq data.
  7. The original paper – “Functional Analysis and Transcriptomic Profiling of iPSC-Derived Macrophages and Their Application in Modeling Mendelian Disease”, From which the IPSC Macrophage used in this paper was derived.
  8. A quick introduction to the ethos and structure behind multiQC reports. 
  9. A Bioinformatics StackExchange thread explaining why particular modules may fail in multiqc due with different data modalities (e.g. module expects DNA instead of RNA) irrespective of the quality of the data.
  10. An introduction to the dimensionality reduction technique – principal components analysis
  11. An explanation of volcano plots for visualizing differential gene expression.
  12. An explanation of the deseq2 algorithm used to calculate differential expression in RNA- Seq data

Discover how companies worldwide grow with Sonrai. Explore all our case studies.

AI Biomarker Discovery Workflow

Explore an end-to-end AI biomarker discovery workflow using Sonrai Discovery. Discover just how easy it can be.

AI-Powered Single-cell Analysis

AI single-cell analysis boosts accuracy, revealing subtle patterns, expediting large dataset processing and enabling personalized medicine.

Get in touch

Like What You See? Let's Talk

We Listen to your problems
We give you confidence to make your decision