USE CASE

Automatic Public Data Pulling and Processing using Bioinformatic Pipelines

Author: Kai Lawson-McDowall, Bioinformatician at Sonrai Analytics


Kai joined Sonrai Analytics in March 2023 and has a strong background in pre-clinical screening of gene therapies, big data in oncology, and cloud solutions for bioinformatics. Kai currently focuses on pipeline development and scaling at Sonrai Discovery, and on helping clients process proteomics and transcriptomics data. Before joining Sonrai Analytics, Kai worked as a bioinformatician at Cancer Research UK, GlaxoSmithKline, and Broken Strings Biosciences. He earned his undergraduate degree in Biochemistry from the University of Glasgow and completed his master’s in Genomics and Data Analytics at the University of Leeds.

Data types in this use case: RNA-Seq

Objective

The purpose of this use case is to demonstrate how Sonrai enables companies to rapidly identify and download high-quality, publicly available data to enrich and cross-reference their own findings. We do this by using proprietary algorithms to automatically identify relevant data on the web, such as RNA-Seq data, and by enabling companies to easily download this data for further processing and analysis. Ultimately, researchers can save hours of manual work by using Sonrai’s integrated bioinformatic pipelines to find and access high-quality public data.

Introduction

Public data offers a wealth of resources for researchers, enabling them to delve into diverse studies to cross-reference their own findings or avoid duplication of efforts, such as primary data collection. Utilizing public datasets, researchers can access a broad spectrum of information, enhancing the robustness and reach of their investigations. The benefits of using public data are substantial, including increased transparency, reproducibility of results, and the facilitation of interdisciplinary collaboration. However, these advantages come with significant challenges. 

Challenges of Using Public Data

Locating relevant datasets often requires navigating fragmented and inconsistent data repositories, while downloading and processing large datasets can be technically demanding and time-consuming. Not all websites offer an easy way to download data in the right format, so this may have to be done manually, which is error-prone. In addition, researchers and small companies often don’t realize how much sequencing data is publicly available; when they do, the files are often large and unwieldy and almost always require tailored analysis. These hurdles underscore the need for more streamlined and user-friendly data access systems to fully leverage the potential of public data in academic research. Table 1 presents some key sequencing and proteomic repositories.


| Public Data Repository | Data Types | Supported by Sonrai |
|---|---|---|
| European Nucleotide Archive (ENA) | Sequencing information, including raw sequencing data, sequence assembly information and functional annotation | |
| Sequence Read Archive (SRA) | High throughput sequencing data | |
| Gene Expression Omnibus (GEO) | Functional genomics data | |
| DNA Data Bank of Japan (DDBJ) | Nucleotide sequence data | |
| neXtProt | Protein data | |
| ProteomicsDB | Multi-omics and multi-organism data including proteomics, transcriptomics, and phenomics data for human, animal and plant species | |

Table 1. Sequencing data repositories
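As an illustration of what programmatic access to one of these repositories can look like, the sketch below builds a file-report request for ENA and parses the tab-separated response into download URLs and total sizes. It is a minimal sketch, not Sonrai’s implementation: the function names are hypothetical, the endpoint and field names (`fastq_ftp`, `fastq_md5`, `fastq_bytes`) are assumed to match ENA’s public Portal API and should be checked against its current documentation, and no network call is made here.

```python
import csv
import io
from urllib.parse import urlencode

# Assumption: ENA's Portal API "filereport" endpoint; verify before relying on it.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession: str) -> str:
    """Build a request URL asking ENA for the FASTQ files behind an accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_md5,fastq_bytes",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


def parse_filereport(tsv_text: str) -> list:
    """Turn a TSV file report into one record per sequencing run.

    Paired-end runs are assumed to list their files separated by ';'.
    """
    runs = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        urls = [u for u in row["fastq_ftp"].split(";") if u]
        sizes = [int(s) for s in row["fastq_bytes"].split(";") if s]
        runs.append(
            {"run": row["run_accession"], "files": urls, "total_bytes": sum(sizes)}
        )
    return runs
```

Summing `total_bytes` across runs gives an estimate of the download size before any data is transferred, which helps when planning storage for a large study.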

Sonrai Enables Companies to Maximize Usage of Public Data

Sonrai solves this challenge by creating algorithms that can easily identify relevant data in public data repositories.

For example, we can leverage the nf-core fetchngs pipeline, built on Nextflow, to obtain raw NGS data and metadata from large public databases such as the Sequence Read Archive (SRA), Gene Expression Omnibus (GEO), European Nucleotide Archive (ENA), and DNA Data Bank of Japan (DDBJ). Sonrai provides intuitive APIs and algorithms that allow companies to easily download high-quality, well-curated, analysable data from reputable sources and to perform the analysis needed to enhance and complement existing research. This has included downloading hundreds of proteomics data files from databases such as neXtProt and ProteomicsDB, and the approach can be extended to virtually any journal, database or website that holds relevant information.
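Outside Sonrai Discovery, launching the public fetchngs pipeline by hand can be sketched roughly as follows. This is an illustrative sketch, not a description of Sonrai’s integration: the helper names are hypothetical, the accession pattern is a simplified assumption covering common SRA, ENA, DDBJ and GEO identifier prefixes, and the command assumes the pipeline’s documented `--input` and `--outdir` parameters together with a Docker profile.

```python
import re
from pathlib import Path

# Simplified assumption: common run/experiment/sample/study, BioProject,
# BioSample and GEO identifier prefixes. Not an exhaustive list.
ACCESSION_RE = re.compile(
    r"^(?:[SED]R[RXSP]\d+|PRJ(?:NA|EB|DB)\d+|SAM(?:N|EA|D)\d+|GS[EM]\d+)$"
)


def validate_accessions(accessions):
    """Reject malformed IDs and drop duplicates while keeping order."""
    bad = [a for a in accessions if not ACCESSION_RE.match(a)]
    if bad:
        raise ValueError(f"Unrecognised accession(s): {bad}")
    return list(dict.fromkeys(accessions))


def write_id_list(accessions, path):
    """Write one accession per line, the input layout fetchngs expects."""
    ids = validate_accessions(accessions)
    Path(path).write_text("\n".join(ids) + "\n")
    return ids


def fetchngs_command(id_file, outdir, profile="docker"):
    """Assemble the Nextflow command as an argument list (nothing is run here)."""
    return [
        "nextflow", "run", "nf-core/fetchngs",
        "--input", str(id_file),
        "--outdir", str(outdir),
        "-profile", profile,
    ]
```

The returned list can be passed to `subprocess.run` on a machine where Nextflow and a container engine are installed. Validating IDs up front avoids starting a long-running job that fails on a mistyped accession.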

Data Analyzed

Downloading data:

A drug development company wanted to pull 100 GB of raw NGS data (~60 samples) from a public repository such as the Sequence Read Archive in order to process it. Assuming they had already found the data, the download would take approximately two days, whereas with the fetchngs pipeline integrated within Sonrai Discovery we could do this in two hours.
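To put these figures in perspective, the short calculation below converts them into average transfer rates. It only restates the numbers quoted above (100 GB, about two days, two hours, ~60 samples) and assumes decimal units (1 GB = 1000 MB); the function name is hypothetical.

```python
def mean_throughput_mb_s(size_gb: float, hours: float) -> float:
    """Average transfer rate in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (hours * 3600)


SIZE_GB = 100        # total raw NGS data quoted in the use case
SAMPLES = 60         # approximate sample count quoted in the use case
MANUAL_HOURS = 48    # "approximately two days"
PIPELINE_HOURS = 2   # "two hours"

manual_rate = mean_throughput_mb_s(SIZE_GB, MANUAL_HOURS)
pipeline_rate = mean_throughput_mb_s(SIZE_GB, PIPELINE_HOURS)
speedup = MANUAL_HOURS / PIPELINE_HOURS
gb_per_sample = SIZE_GB / SAMPLES

print(f"Manual download:   {manual_rate:.2f} MB/s")
print(f"Pipeline download: {pipeline_rate:.2f} MB/s")
print(f"Speed-up:          {speedup:.0f}x")
print(f"Average per sample: {gb_per_sample:.2f} GB")
```

Under these assumptions the manual route averages roughly 0.6 MB/s and the pipeline roughly 14 MB/s, a 24-fold difference in download time.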

Analyzing data:

If the drug development company wanted to use a pipeline for analysis, they would need:

1) The know-how: it could take them weeks to figure out how to set up and run a pipeline.

2) Large-scale computing: even if they know how to run a pipeline, without large-scale computing it will likely take days to run. It may also exhaust their computer’s RAM and CPU and will almost certainly consume hundreds of gigabytes or even terabytes of storage while executing.

Sonrai Discovery is cloud-based and eliminates the resource problem, enabling the company to perform the analysis on the downloaded data within a few hours.

Results

Without Sonrai’s Automation

Best case scenario:

  • >2 days to download data
  • Multiple days to process data
  • Very high resource usage on the company’s side.

Worst case (more likely) scenario:

  • The whole process can take weeks as the company doesn’t have the know-how to download data, run or configure the pipeline in-house.

With Sonrai’s Automation

  • Save days locating relevant datasets and downloading large data files.
  • Reduce end-to-end time from weeks to approximately 10 to 15 hours.
  • No real resource usage on the company’s side.
  • Less error-prone than manual data handling.
  • Can generate insights more quickly.

Table 2. Sonrai’s automated public data pulling can save weeks for companies using public data for their research.

Using Sonrai’s automated bioinformatic pipelines, companies can save days and even weeks finding appropriate high-quality data to complement their research, downloading large raw data files which require extensive computing power, processing that data to prepare for analysis and then analyzing the data for meaningful insights. Automating this process reduces errors associated with manual data handling, ensures no relevant data is missed, and frees up researchers’ time, increasing efficiency and maximizing data outputs for faster discoveries.
