Author: Dr Matthew Alderdice, Head of Data Science
Data management systems are now a core part of any modern business. Data management covers topics including storage, automation, processing, analysis, and sharing. Here we summarise some of the fundamental data-driven business processes that an organization can adopt to maximize the value of its data assets.
Data Storage - moving your data lake to the cloud
In Precision Medicine, organizations such as CROs, Biotech, and Pharma generate vast amounts of molecular data using laboratory instruments such as next-generation sequencers, flow cytometers, and digital pathology image scanners. Organizations need to create expansive data lakes to store the raw data from these instruments. Data lakes allow an organization to store its data as-is rather than in an "analysis-ready" form. A data lake should be scalable, secure, and accessible, which is now best achieved using cloud technologies that offer affordable, effectively limitless, on-demand storage services.
In 2020, the global pandemic helped drive a surge in the adoption of cloud technologies, with over 50% of organizations making a move to the cloud. Cloud providers offer cheap and scalable storage and provide access to unrivalled computational power, which makes it practical to process raw data into an analysis-friendly form. This analysis-friendly data is traditionally stored in a data warehouse where downstream reporting occurs; however, it is increasingly crucial for data management systems to fulfil both data lake and data warehouse roles. We need data pipelines to take raw data from a data lake, transform it to be analysis-ready, and then store it in a data warehouse for use.
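The lake-to-warehouse flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the raw export, its column names, and the in-memory list standing in for a warehouse table are all hypothetical.

```python
import csv
import io

# Hypothetical raw instrument export as it might sit in a data lake:
# inconsistent casing, stray whitespace, and numbers stored as text.
RAW_EXPORT = """Sample_ID, Gene ,Count
s1, TP53 ,120
s2,kras,  45
"""

def extract(raw_text):
    """Read the raw file as-is into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Normalise column names, trim whitespace, and cast counts to integers."""
    cleaned = []
    for row in rows:
        record = {key.strip().lower(): value.strip() for key, value in row.items()}
        record["gene"] = record["gene"].upper()
        record["count"] = int(record["count"])
        cleaned.append(record)
    return cleaned

def load(rows, warehouse):
    """Append analysis-ready rows to an in-memory stand-in for a warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse_table = load(transform(extract(RAW_EXPORT)), [])
print(warehouse_table)
```

Keeping extract, transform, and load as separate functions means each stage can be tested and replaced on its own, for example by swapping the in-memory list for a real database client.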
What is a data pipeline?
Data pipelines are essential for transporting, transforming, and analyzing large volumes of data. It is no surprise that data engineering was the fastest-growing tech role in 2020.
The term "data pipeline" may refer to niche bioinformatics tasks such as variant calling or gene expression counting for the multi-modal data types frequently found in Precision Medicine. It might refer to more standard data cleaning tasks, such as identifying and removing duplicates, inaccuracies, and missing values in a dataset. It may also simply refer to transporting data from A to B.
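As a rough sketch of the data cleaning case, the function below drops exact duplicate rows and sets aside rows with missing required values. The field names and records are hypothetical, and a real pipeline would likely need domain-specific rules for what counts as a duplicate or an inaccuracy.

```python
def clean_records(records, required_fields):
    """Drop exact duplicates and separate out rows with missing required values."""
    seen = set()
    complete, incomplete = [], []
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        missing = [f for f in required_fields if record.get(f) in (None, "")]
        (incomplete if missing else complete).append(record)
    return complete, incomplete

# Hypothetical study records for illustration only.
records = [
    {"sample_id": "s1", "age": 54, "volume_mm3": 110.5},
    {"sample_id": "s1", "age": 54, "volume_mm3": 110.5},   # duplicate row
    {"sample_id": "s2", "age": None, "volume_mm3": 98.0},  # missing age
    {"sample_id": "s3", "age": 61, "volume_mm3": 130.2},
]

complete, incomplete = clean_records(records, ["sample_id", "age", "volume_mm3"])
print(len(complete), "complete rows;", len(incomplete), "rows need review")
```

Incomplete rows are returned rather than silently discarded, so a person can decide how to handle them.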
In short, you can use data pipelines for many things, but the main advantages they offer are consistency, efficiency, and reproducibility. A well-built pipeline will produce the same result no matter how many times it is run on the same data. This ability to consistently perform challenging, repetitive, and error-prone tasks ensures reproducibility. Precision Medicine is currently going through a 'reproducibility crisis' as outlined by Stupple et al. We believe adopting software-based data pipelines is essential for driving reproducible discoveries in Precision Medicine.
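One simple way to check that a pipeline step is reproducible is to fingerprint its output and compare runs. The sketch below assumes a hypothetical normalisation step and uses a SHA-256 hash of the JSON-serialised result; it is an illustration of the idea rather than a complete validation strategy.

```python
import hashlib
import json

def normalise(values):
    """Hypothetical pipeline step: scale values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def fingerprint(result):
    """Hash a JSON serialisation of the result so that runs can be compared."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

raw_counts = [120, 45, 35]
first_run = fingerprint(normalise(raw_counts))
second_run = fingerprint(normalise(raw_counts))

# Identical input through a deterministic step should give identical fingerprints.
print("reproducible:", first_run == second_run)
```

Storing the fingerprint alongside each output makes it easy to detect when a change to the code or the input data has altered a result.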
As your organization scales, if you are still manually curating, transforming, and analyzing your data in silos, you will likely run into issues with reproducibility sooner rather than later. Therefore, identifying processes that you can pipeline should be a top priority for your organization's leadership team.
Automate your Data Processes
Are you still manually copying and pasting large volumes of data? Are you manually generating reports with the same plots? If the answer is yes, then your organization could benefit from automation. Repetitive and manual processes are major sources of errors and inefficiencies.
An article by McKinsey highlights that companies that have prioritized an automation strategy have already been able to automate 50 to 70 per cent of tasks, with return on investment generally in triple-digit percentages. The key automation technologies it identifies include machine learning, intelligent workflows, and process automation technologies that automate routine tasks such as data extraction and cleaning.
Organizations such as CROs run vast quantities of assays for their clients, including in vitro assays, in vivo tumour volume measurements, and FACS-based molecular assays. These projects are time-consuming and usually require repetitive curation, analysis, and reporting. While these processes should continue to have a human-in-the-loop, CROs have a tremendous opportunity to dramatically reduce the time spent and the errors made while generating and reporting data for their clients. Better processes will ultimately enable CROs to take on more projects and drive business outputs.
Data Preparation - keep it tidy!
A 2018 MIT Sloan Management Review report, Seizing Opportunity in Data Quality, indicated that messy data costs the average business an astonishing 15% to 25% of revenue. It is often reported that data scientists spend nearly 80% of their time cleaning and wrangling data rather than training AI/ML models to generate insights and predictions. On this basis, organizations must act early to identify sources of messy data, clean their legacy data, and ensure that prospective datasets are in what is known as a tidy format.
The term "tidy data" was coined by world-renowned data scientist Hadley Wickham in 2014 and provides a "standardized approach to linking the structure of a dataset (its physical layout) with its semantics (its meaning)". Adhering to tidy data principles reduces the time it takes to clean and analyze datasets and enables scientists and analysts to spend more of their time generating key insights to drive business outputs.
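In a tidy dataset, each variable is a column and each observation is a row. The sketch below reshapes a hypothetical "wide" tumour volume table, where each measurement day is its own column, into that layout. The column names and values are invented for illustration.

```python
# Hypothetical wide-format table: one row per mouse, one column per measurement day.
wide_rows = [
    {"mouse_id": "m1", "day_0": 100.0, "day_7": 150.0},
    {"mouse_id": "m2", "day_0": 95.0, "day_7": 120.0},
]

def to_tidy(rows, id_field, prefix="day_"):
    """Turn each day column into its own row: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                tidy.append({
                    id_field: row[id_field],
                    "day": int(column[len(prefix):]),
                    "volume_mm3": value,
                })
    return tidy

tidy_rows = to_tidy(wide_rows, "mouse_id")
for row in tidy_rows:
    print(row)
```

In the tidy layout, the day becomes a variable that can be filtered, grouped, or plotted directly, rather than information hidden in column names.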
In Precision Medicine, metadata associated with preclinical or clinical studies (e.g., gender, age, and intervention dates) are frequently the largest source of messy data. Organizations can overcome incomplete, inconsistent, and inaccurate data collection by using synoptic reporting. Synoptic reporting of metadata has been widely adopted in routine pathology to improve the completeness and accuracy of reporting and the quality of life for the data collector. Synoptic reporting refers to a standardized way of collecting data using defined checklists, validation of inputs, and elimination of free text. Moreover, the concepts of synoptic reporting are synergistic with tidy data principles and will help ensure analysts spend the least amount of time possible cleaning and wrangling their datasets. Investing time and resources at the data collection stage of your data management processes will ensure data quality, enable data integration, and enhance downstream analysis.
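The checklist and input validation ideas behind synoptic reporting can be sketched as a small validation function applied at the point of data entry. The field names, allowed values, and age range below are hypothetical assumptions; a real checklist would be defined by the study protocol.

```python
from datetime import date

# Hypothetical checklist: a fixed set of allowed values replaces free text.
ALLOWED_GENDER = {"female", "male", "other", "unknown"}

def validate_entry(entry):
    """Return a list of validation errors; an empty list means the entry is accepted."""
    errors = []
    if entry.get("gender") not in ALLOWED_GENDER:
        errors.append("gender must be one of: " + ", ".join(sorted(ALLOWED_GENDER)))
    age = entry.get("age_years")
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age_years must be a whole number between 0 and 120")
    try:
        date.fromisoformat(str(entry.get("intervention_date")))
    except ValueError:
        errors.append("intervention_date must be an ISO date (YYYY-MM-DD)")
    return errors

good_entry = {"gender": "female", "age_years": 54, "intervention_date": "2021-03-01"}
bad_entry = {"gender": "F", "age_years": "54", "intervention_date": "01/03/2021"}

print(validate_entry(good_entry))
for problem in validate_entry(bad_entry):
    print("rejected:", problem)
```

Rejecting an entry at collection time, with a message that says how to fix it, is usually cheaper than cleaning the same value out of a dataset later.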
Smarter Data Reporting
CROs will periodically update stakeholders during a client project by sharing raw data files, documents, and reports with their clients. Whether the project is a tumour volume study, cell viability experiment, or FACS analysis, the reports will likely consist of many data visualizations, including line charts, box plots, and scatter plots. Generating these plots is very time-consuming when it requires manual copying and pasting from spreadsheets.
Organizations that invest in custom visualization and reporting tools will see dramatic increases in productivity and reductions in errors and failures. Custom dashboards and software applications are increasingly helping to reduce the time it takes to create a report. They also enhance the client experience with interactive features that are more engaging and much more valuable than static documents or slides. The quality, impact, and accessibility of a CRO's report will directly influence whether its client returns in the future.
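As a small illustration of replacing manual spreadsheet work, the sketch below computes the mean tumour volume per treatment group and day, which is the kind of summary that would sit behind a line chart in a report. The measurements and group names are hypothetical, and a real reporting tool would pass this summary to a plotting or dashboard library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tidy measurements: one row per observation.
measurements = [
    {"group": "control", "day": 0, "volume_mm3": 100.0},
    {"group": "control", "day": 0, "volume_mm3": 110.0},
    {"group": "control", "day": 7, "volume_mm3": 200.0},
    {"group": "treated", "day": 0, "volume_mm3": 105.0},
    {"group": "treated", "day": 7, "volume_mm3": 120.0},
    {"group": "treated", "day": 7, "volume_mm3": 130.0},
]

def summarise(rows):
    """Return the mean volume for each (group, day) pair, sorted by group then day."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["group"], row["day"])].append(row["volume_mm3"])
    return {key: mean(values) for key, values in sorted(grouped.items())}

summary = summarise(measurements)
for (group, day), average in summary.items():
    print(f"{group:<8} day {day:>2}: {average:.1f} mm3")
```

Because the summary is generated from the data each time, the same script can regenerate every report update without copying values by hand.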
Summary of Core Data Management Principles
- Move your data to the cloud.
- Overcome the reproducibility crisis with data pipelines.
- Maximise efficiency by automating your data processes.
- Keep your data tidy.
- Communicate the value of your data with smarter reporting.
References
- Moving to the cloud stats, Techjury.net.
- Fastest growing tech roles, Datanami.
- The prevention and handling of the missing data, The National Center for Biotechnology Information.
- Automation strategy, Driving impact at scale from automation and AI, McKinsey.
- Seizing Opportunity in Data Quality, Sloan Review.
- Tidy Data, Journal of Statistical Software, Hadley Wickham.
- Synoptic Reporting: Evidence-Based Review and Future Directions, JCO Clinical Cancer Informatics.