Data quality-aware genomic data integration

Abstract: Genomic data are growing at an unprecedented pace, along with the new protocols, update policies, formats, guidelines, terminologies and ontologies that data providers make available every day. In this continuously evolving universe, enforcing quality on data and metadata is increasingly critical. While many aspects of data quality are addressed at each individual source, we focus on the need for a systematic approach when data from several sources are integrated, as such integration is essential to modern genomic data analysis. Data quality must be assessed from many perspectives, including accessibility, currency, representational consistency, specificity, and reliability. In this article we review the relevant literature and, based on the analysis of many datasets and platforms, we report on methods used for guaranteeing data quality while integrating heterogeneous data sources. We explore several real-world cases that exemplify more general underlying data quality problems, and we illustrate how they can be resolved with a structured method that is also applicable to other biomedical domains. The methods reviewed are implemented in a large framework for the integration of processed genomic data, which is made available to the research community to support tertiary data analysis over Next Generation Sequencing datasets. These datasets are continuously loaded from many open data sources, bringing considerable added value to biological knowledge discovery.
