Author: Matthew Alderdice, Head of Data Science
What is data integration?
Data integration is the process of combining data from disparate sources to create a unified view for downstream business intelligence, advanced analytics and data science applications.
In Precision Medicine, the ability to generate multi-omic patient profiles by integrating data from different clinical and molecular sources provides unrivalled insights into disease. Regardless of the sector, once an organisation harnesses a data integration tool it will quickly realise that integrated data is more than the sum of its parts. Whilst data integration can provide powerful insights, overcoming data silos represents a major challenge.
What is a data silo?
Data silos (also known as information silos) are isolated and heterogeneous data systems which are not compatible with other data systems. Data silos can cause many problems for organisations, such as:
- Incomplete and potentially inaccurate views of their data
- Breakdowns in communication
- Duplication of effort
- Missed opportunities
What causes data silos?
As an organisation grows, a silo mentality may develop, leading to poor communication between teams, departments and companies in which the importance of data system compatibility is ignored or not recognised. This is a ‘vicious circle’: silos cause communication problems, which in turn further strengthen those silos. This situation can be avoided if leadership teams instead adopt a data-driven mentality.
Another common cause of data silos is a lack of standardisation in data collection, processing and storage practices. In Precision Medicine, one example of how a data silo could arise is when two collaborating laboratories generate inter-dependent data (e.g. patient-matched proteomics and gene expression profiles) but have used different patient IDs, different aliases for gene names and different data processing pipelines.
When this occurs, the two data systems are incompatible, massively reducing their value. Resolving these issues retrospectively is very challenging, so establishing a data model up-front will help reduce the amount of data wrangling required prior to integration.
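An agreed data model of this kind can be as simple as shared lookup tables. A minimal sketch in Python, where the patient-ID and gene-alias mappings are entirely hypothetical illustrations:

```python
# Harmonise one lab's records onto a shared data model before integration.
# The mapping tables and record shapes below are hypothetical examples.

# Lab A uses local patient IDs and gene aliases; the shared model defines
# canonical identifiers agreed up-front by both laboratories.
PATIENT_ID_MAP = {"A-001": "PT001", "A-002": "PT002"}
GENE_ALIAS_MAP = {"HER2": "ERBB2", "p53": "TP53"}

def harmonise(record: dict) -> dict:
    """Rewrite a lab-local record using the canonical identifiers."""
    return {
        "patient_id": PATIENT_ID_MAP[record["patient_id"]],
        "gene": GENE_ALIAS_MAP.get(record["gene"], record["gene"]),
        "value": record["value"],
    }

lab_a = [{"patient_id": "A-001", "gene": "HER2", "value": 8.1}]
print([harmonise(r) for r in lab_a])
```

With both laboratories mapping onto the same identifiers at collection time, their data systems remain compatible and no retrospective wrangling is needed.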
Regulatory compliance often restricts data access due to the sensitive nature of the data. This is particularly common in Precision Medicine with patient-identifiable information such as clinical and genomic data. However, if such data cannot be accessed then it cannot be integrated, and the data system becomes siloed.
Whilst the importance of data security should not be understated, data should also be accessible to those who need it. An organisation should ask itself: if its data cannot be accessed, why is it being stored? Again, it is preferable to preempt this problem with simple, logical and clear data access policies, enforced by a data integration platform with appropriate role-based access control and auditing of data access.
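One way such a policy could be enforced is sketched below, with hypothetical roles and dataset names: every read is checked against a role table, and both granted and denied attempts are written to an audit log.

```python
# Hypothetical sketch of role-based access control with auditing.
# Roles, users and dataset names are illustrative, not a real policy.
from datetime import datetime, timezone

ROLE_PERMISSIONS = {
    "clinician": {"clinical"},
    "bioinformatician": {"clinical", "genomic"},
}
audit_log = []  # every access attempt is recorded here

def read_dataset(user: str, role: str, dataset: str) -> str:
    """Return dataset contents if the role permits it; audit either way."""
    allowed = dataset in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "user": user,
        "dataset": dataset,
        "granted": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not allowed:
        raise PermissionError(f"role '{role}' may not read '{dataset}'")
    return f"<contents of {dataset}>"
```

The key design point is that the check and the audit entry live in one place, the platform, rather than being re-implemented by every consumer of the data.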
Why is data integration important?
By consolidating its data, an organisation will not only have a more complete and accurate view of its resources but will also see enhanced collaboration and improved data-driven decision-making.
A key sector that is yet to reap the full benefits of data integration is Precision Medicine. It is now commonplace to profile a patient's DNA, RNA and protein sequences using a multitude of fast-evolving technologies. The result is a high volume of multi-omic big data for which integration remains challenging.
Revealing the complex interactions between biomolecules such as RNA, DNA and protein will accelerate drug development, biomarker discovery and patient stratification strategies. Healthcare leaders and executives have the opportunity and responsibility to future-proof their organisations by adopting data integration systems that ensure the true value of their data lakes is unlocked.
How is data integrated?
In Precision Medicine, unique identifiers such as patient IDs, gene names or biological terms and ontologies are often required to integrate data sources. Data integration requires the data to be cleaned, transformed and stored prior to unification. This process is notoriously error-prone, tedious and time-consuming, which necessitates data integration systems.
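Once identifiers are harmonised, unification can be as simple as a join on the shared key. An illustrative sketch, with made-up clinical and gene expression records keyed by patient ID:

```python
# Join two sources on a shared patient ID (all values are invented).
clinical = {
    "PT001": {"age": 54, "stage": "II"},
    "PT002": {"age": 61, "stage": "III"},
}
expression = {
    "PT001": {"TP53": 2.4},
    "PT002": {"TP53": 0.9},
}

# Inner join: keep only patients present in both sources, then merge
# their clinical and molecular attributes into one unified record.
unified = {
    pid: {**clinical[pid], **expression[pid]}
    for pid in clinical.keys() & expression.keys()
}
print(unified["PT001"])  # {'age': 54, 'stage': 'II', 'TP53': 2.4}
```

In practice the join logic is the easy part; the cleaning and transformation that make the keys line up are where the effort, and the errors, concentrate.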
The first data integration systems, known as data warehouses, were developed in the 1980s. Enterprise data warehouses are among the most commonly adopted systems and have historically utilised the extract, transform and load (ETL) or extract, load and transform (ELT) data integration processes (see figure). ETL/ELT curates and aggregates data from different sources to improve the performance and quality of downstream BI and reporting.
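The ETL pattern can be sketched in a few lines: extract rows from a source, transform them to the warehouse schema, and load them into a table. All names and values here are illustrative, with SQLite standing in for the warehouse:

```python
# Minimal ETL sketch: extract -> transform -> load.
# Source rows, schema and mappings are illustrative only.
import sqlite3

# Extract: rows as they arrive from an upstream source.
raw_rows = [("a-001", "HER2", "8.1"), ("a-002", "p53", "3.2")]

def transform(row):
    """Normalise IDs, map gene aliases, and coerce types."""
    pid, gene, value = row
    canonical = {"HER2": "ERBB2", "p53": "TP53"}
    return pid.upper(), canonical.get(gene, gene), float(value)

# Load: write transformed rows into the warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE expression (patient_id TEXT, gene TEXT, value REAL)")
db.executemany("INSERT INTO expression VALUES (?, ?, ?)",
               (transform(r) for r in raw_rows))
print(db.execute("SELECT * FROM expression").fetchall())
```

In ELT the same steps run in a different order: raw rows are loaded first and transformed inside the warehouse, which suits platforms where compute sits next to the stored data.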
What are the pitfalls of traditional data integration?
Whilst enterprise data warehouses have seen a lot of success over the years, some of their pitfalls are now beginning to surface in the era of modern big data. For example, many traditional data warehouses offer on-premise solutions (as opposed to cloud), which means that neither their storage nor their compute capabilities can efficiently scale with the data.
There is often a reliance on proprietary data formats rather than open file formats such as Parquet. Proprietary formats may not be designed for data science applications such as AI and machine learning, and often act as a barrier to mining data with these techniques.
Data lakes are distinct from data warehouses in that they store an organisation's raw data. However, querying and processing raw data comes with added complexity. Being able to access both the raw and the integrated data is becoming increasingly important, and few solutions currently exist that can do both.
Precision Medicine adds further challenges, as very few traditional data warehouses or lakes can accommodate the diverse molecular data types and data processing pipelines required by multi-omics, which often consists of both structured tabular data and unstructured high-resolution radiology and digital pathology images.
What is the future of data integration?
The shortcomings of the enterprise data warehouse and data lakes have given rise to the next generation of data integration tools such as lakehouses, feature stores and data meshes. These systems are often cloud-native and use storage systems such as AWS S3 as a cost-effective data lake and serverless compute such as AWS Fargate to scale compute to meet demand.
These modern data integration tools not only accommodate a diversity of raw data types but can also transform and store data in open formats which meet performance requirements and seamlessly integrate with open-source data science tools such as R and Python.
As data integration systems become more prevalent and powerful, they will enable the seamless use of machine learning frameworks such as TensorFlow, PyTorch and XGBoost on integrated data. This unlocks the potential to apply techniques such as representation learning and dimensionality reduction to mine the ever-increasing flood of data and derive novel, powerful and actionable Precision Medicine insights to benefit clinicians and patients.
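As a toy illustration of dimensionality reduction on integrated data, a patients-by-features matrix can be projected onto its top principal components via a singular value decomposition. The matrix below is random, purely for demonstration:

```python
# Toy PCA via SVD on a synthetic multi-omic feature matrix.
# The data is random noise, standing in for integrated patient profiles.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))   # 10 patients x 50 omic features
Xc = X - X.mean(axis=0)         # centre each feature column

# SVD of the centred matrix; rows of Vt are principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
embedding = Xc @ Vt[:2].T       # project patients into 2 dimensions
print(embedding.shape)          # (10, 2)
```

On real multi-omic data, low-dimensional embeddings like this are a common first step before clustering patients or visualising cohort structure.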