Data Quality Problems When Integrating Genomic Information

Because genomic information is complex and a large volume of data is produced every day, the genomic information accessible on the web has become very difficult to integrate, which hinders the research process. Drawing on knowledge from the Data Quality field, and after a specific study of a set of genomic databases, we have found problems related to six Data Quality dimensions. The aim of this paper is to highlight the problems that bioinformaticians face when they integrate information from different genomic databases. Its contribution is to identify and characterize those problems in order to understand which ones hinder the research process and increase the time researchers lose on this task.
