Data quality

Author: Dr Matthew Alderdice, Head of Data Science

How much does ensuring data quality cost your organisation? Most CROs would find it challenging to translate quality management into a monetary value. However, research shows that failed or heavily delayed trials and studies can significantly impact both the sponsor and the CRO, leading to reputational damage, job losses, share price devaluation, potentially huge costs of rework and, in extreme cases, ‘rescue studies’. One report estimates that delays to a trial can cost between $600,000 and $8 million per day, whilst FDA research highlights that companies reporting quality issues suffered an average share price drop of nearly 17%.

Many reports highlight poor data quality and missing data as significant reasons for unsuccessful trials, preclinical projects and studies. Such errors are typically driven by the accelerating volume and proliferating variety of data in modern healthcare, which have made data integration and analysis considerably more complex. More stringent regulations and the accumulation of vast quantities of data across multiple siloed systems can quickly introduce errors into established processes.

A more positive question to ask is, how much can data quality earn for your company? Better data quality reduces costs and increases sales, brand equity, and productivity. Most importantly, better data helps accelerate the development of new and more effective drugs for those who urgently need them.

Myths about Data Quality


As data experts, we hear many myths about data quality. Here’s the truth behind four of the most common ones:

Quality Myth #1:

Data quality can only be implemented by a team of skilled data scientists and engineers.

The truth: Until now, many systems for managing data quality were so complex that only staff with years of experience could operate them successfully. However, new platforms are emerging that have been explicitly designed with the user experience in mind and have a much lower barrier to entry – meaning anyone with the appropriate training can operate them.

Quality Myth #2:

The cost of establishing, maintaining, and re-evaluating data quality is usually very high.

The truth: If you are relying on manual processes and teams of people to verify data, then yes, the costs can be very high. However, machine learning and analytics can automatically monitor data integrity and automate time-consuming and laborious tasks, reducing both errors and expenditure.

Quality Myth #3:

A data quality management platform will solve all of your quality issues.

The truth: A data quality management platform is an essential part of maintaining the quality of your data. However, you also need to ensure that data quality is seen as a priority across the organisation, so buy-in from your leadership team is just as important.

Quality Myth #4:

Sponsors, not CROs, are solely responsible for the quality of their data.

The truth: Whilst ultimate responsibility for data quality rests with the sponsor, CROs need to take every precaution they can to ensure data quality. We know that accurate data increases the chances of a successful trial, preclinical project or study, generating significant benefits for CROs, including happy sponsors who will return for repeat business. Other benefits could include case studies, white papers and other marketing collateral that can drive future sales.

What are the causes of poor data quality?


1. A lack of quality control over the collection and recording of study data. 

It’s often not practical or possible for teams to manually review each other’s study data. However, accidental duplicates, unintended missingness and typographical errors can ultimately derail a project. Unfortunately, improving how data is collected and reviewed as part of a broader data quality initiative is often not seen as a top priority by organisations.

2. Differing interpretations and implementation of GCPs, SOPs and protocols. 

There can be multiple reasons for this, including the sudden departure of essential staff or the introduction of new staff mid-study, short timeframes, insufficient communication, or changes to the study. These factors can result in accidental deviations from even the most carefully crafted protocols.

3. Poor or absent management supervision and quality control of task completion during the study.

Implementing an oversight plan is part of quality by design. When managing subcontractors, oversight plans are a must.

4. Loss of key staff.

CROs compete with biotech and pharma for highly skilled professionals, and job turnover is increasing, hitting 30% in 2018. The sudden departure of key staff during a trial or preclinical project, with the accompanying loss of know-how and the delays whilst replacements are recruited, can cause serious problems. In particular, data quality can be severely impacted when new staff struggle to get up to speed with systems, processes, the accelerating volume of data and the accompanying protocols, leading to errors.

5. Poor communication between sponsor and CRO.

Trials, preclinical projects and studies are complex and often operate on tight timeframes, with many people involved from multiple organisations. Poor communication (and planning) can lead to a lack of clarity over who is responsible for which task or how specific data should be recorded, leading to errors and oversights.

What is the most effective way to perform quality control checks on data?


From a risk mitigation perspective, integrating quality management into daily clinical practice helps sponsors and CROs better understand data trends and patient outcomes while identifying unanticipated data anomalies. Unknown or unexpected data and events can significantly impact data quality and study results, and data issues are more common as trial or preclinical project complexity increases.

CROs have opportunities to grow their business through the power of machine learning and real-time insights for prediction, remediation, and ongoing performance improvement. The ability to detect issues proactively, including inclusion/exclusion criteria violations, protocol violations, and other compliance or performance red flags, will keep your trials or preclinical projects on track.

Data Quality Control in the precision medicine sector frequently refers to applying defined rules or complex batch normalisation techniques. However, a few simple yet vital routine checks can be applied to any dataset to help ensure it is ready for downstream analysis.

Missingness

Incomplete or missing data costs businesses millions – MIT put the cost at 15-25% of revenue. But how many people report on the missingness of their data? The unknown cost could be much larger!

Have you ever realised at the end of the study that critical data is missing when it’s too late to fix it? What impact did this have on the overall quality of the study? Did you have to repeat the study, or did you attempt to painstakingly identify the source of the missingness?

You can use Sonrai’s advanced filtering to enrich your dataset for samples with more complete profiles or use our imputation app to visualise the fraction of missingness and replace missing data with substituted values. Moreover, to reduce levels of missingness in your data, use our synoptic data collection tool to prevent incomplete data capture.
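If you want to gauge the scale of the problem yourself, a basic missingness report takes only a few lines of pandas. The sketch below is an illustration rather than Sonrai’s implementation; the file name, column layout and 20% completeness threshold are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical study export; in practice this would come from your EDC or LIMS.
df = pd.read_csv("study_samples.csv")

# Fraction of missing values per feature, worst first.
missing_fraction = df.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Keep samples with reasonably complete profiles (threshold is arbitrary).
complete_enough = df[df.isna().mean(axis=1) <= 0.20]

# Naive imputation: substitute missing numeric values with the column median.
numeric_cols = complete_enough.select_dtypes("number").columns
imputed = complete_enough.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
```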

Data Validations

Collecting, storing and analysing dates can be a nightmare, especially when working across different time zones. Proactively checking that the features in your dataset are the correct data types (e.g. date instead of character) and that date formats are consistent (e.g. MM/DD/YYYY vs DD/MM/YYYY) will save your analysts and clients a lot of time and frustration in the long run. So why not put processes in place to prevent these issues from arising in the first place?
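One lightweight preventive check is to parse every date column against the single format your protocol specifies and flag anything that fails. A minimal pandas sketch; the file and the collection_date column are hypothetical:

```python
import pandas as pd

df = pd.read_csv("study_samples.csv")  # hypothetical export

# Parse against the one format the protocol specifies;
# anything inconsistent becomes NaT rather than a silent mis-parse.
parsed = pd.to_datetime(df["collection_date"], format="%d/%m/%Y", errors="coerce")

# Records that failed to parse are candidates for manual review.
bad_dates = df.loc[parsed.isna() & df["collection_date"].notna(), "collection_date"]
print(f"{len(bad_dates)} records with inconsistent date formats")
```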

Another simple yet overlooked check is profiling your data records for uniqueness. Unique IDs are crucial for record linkage and are often used to register samples and integrate data from different sources. Identifying erroneous duplicates early in the data collection process will enhance your and your sponsor’s ability to perform data integration and prevent information loss.
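A duplicate-key check is just as brief. Another minimal sketch, assuming a hypothetical sample_id column serves as the record key:

```python
import pandas as pd

df = pd.read_csv("study_samples.csv")  # hypothetical export

# keep=False flags every occurrence of a duplicated key,
# not just the second and later ones.
dupes = df[df["sample_id"].duplicated(keep=False)]
if not dupes.empty:
    print(f"{dupes['sample_id'].nunique()} sample IDs appear more than once")
    print(dupes.sort_values("sample_id"))
```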

Batch Effect and Outlier Detection

Batch effects (technical variation) and outliers plague molecular biology experiments. They can be caused by many different factors, such as differences in laboratory conditions, choice of reagent, changes in staff, time of day, atmospheric conditions and laboratory instruments. Unnoticed batch effects and undetected outliers can skew results and undermine the validity of your findings. So what if you could effortlessly highlight batch effects and identify outliers in your datasets using cutting-edge data visualisations, machine learning and advanced filtering techniques?

Determining the distribution of your dataset can help you choose appropriate statistical tests for your downstream analysis and help identify outliers in your data.

The most common way to visualise a distribution is with a histogram. However, box plots and dot plots enable quick comparison of distributions and show whether they are skewed or asymmetric. The box and dot plot below clearly show that the groups of samples have similar distributions but that there is one distinct outlier.

See the interactive report

[Figure: box and dot plots comparing sample groups, with one distinct outlier]
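If you want to reproduce this check programmatically, the fence rule a box plot uses to draw individual outlier points (1.5 times the interquartile range beyond the quartiles) doubles as a quick numeric flag. A minimal sketch; the batch and expression columns are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("study_samples.csv")  # hypothetical export

# Side-by-side box plots to compare distributions across batches or groups.
df.boxplot(column="expression", by="batch")
plt.savefig("boxplot_by_batch.png")

# Tukey's fences: flag values more than 1.5 * IQR beyond the quartiles,
# the same rule a box plot uses to mark outlier points.
q1, q3 = df["expression"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["expression"] < q1 - 1.5 * iqr) | (df["expression"] > q3 + 1.5 * iqr)]
print(outliers)
```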

Use Case: Identifying outliers in a high-dimensional dataset using Principal Components Analysis

Interactive data visualisations and machine learning are often used to identify biologically meaningful clusters. But did you know they can help you quickly identify outliers too? Principal Components Analysis (PCA) is a classic dimensionality reduction technique that allows users to look at the overall variation and spot patterns in their dataset. Other similar yet more advanced techniques, such as t-SNE and UMAP, could also be used. The PCA plot in our report below shows a striking example of an outlier in a FACS dataset.

See the interactive report

[Figure: PCA plot of the FACS dataset highlighting an outlier]
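A comparable first-pass PCA check takes only a few lines with scikit-learn. This is a hedged sketch rather than the platform’s implementation; the facs_features.csv file and sample_id index are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix: samples as rows, markers as columns.
df = pd.read_csv("facs_features.csv", index_col="sample_id").dropna()

# Standardise so no single feature dominates the components.
scaled = StandardScaler().fit_transform(df)

# Project onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

# Samples sitting far from the main cloud are outlier candidates.
plt.scatter(coords[:, 0], coords[:, 1])
for label, (x, y) in zip(df.index, coords):
    plt.annotate(label, (x, y), fontsize=7)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%} variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%} variance)")
plt.savefig("pca_outliers.png")
```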

The PCA plot below shows the dataset with the outlier removed (sample 45) and enriched for features with complete profiles. By performing and reporting on some of the routine checks described in this article, not only will you impress your clients with granular, custom quality control reports, but you will also have increased confidence in the insights distilled during downstream analysis.

See the interactive report

[Figure: PCA plot with the outlier (sample 45) removed]
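To reproduce that filtered view programmatically, drop the flagged sample and keep only fully observed features before re-projecting. Another minimal sketch; 'sample_45' is a hypothetical label for the sample 45 mentioned above:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("facs_features.csv", index_col="sample_id")  # hypothetical export

# Drop the flagged outlier and keep only features observed
# for every remaining sample (complete profiles).
clean = df.drop(index="sample_45", errors="ignore").dropna(axis=1)

# Re-project the cleaned matrix onto the first two components.
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(clean))
```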

Summary

Performing quality control checks on your data will help mitigate reputational damage and rework. Sonrai’s platform for CROs can help detect issues proactively, scale your business and give your team confidence in your data as the clinical trial, preclinical project or study progresses. If you have any questions about this article, or if you’d like to discuss your challenges, we’re available to chat.
