Data Quality in Genome Databases

Genome databases store data about molecular biological entities such as genes, proteins, diseases, etc. The main purpose of creating and maintaining such databases in commercial organizations is their importance in the process of drug discovery. Genome data is analyzed and interpreted to gain so-called leads, i.e., promising structures for new drugs. Following a lead through the process of drug development, testing, and finally several stages of clinical trials is extremely expensive. Thus, an underlying high quality database is of utmost importance. Due to the exploratory nature of genome databases, commercial and public, they are inaccurate, incomplete, outdated and in an overall poor state. This paper highlights the important challenges of determining and improving data quality for databases storing molecular biological data. We examine the production process for genome data in detail and show that producing incorrect data is intrinsic to the process at the same time highlight common types of data errors. We compare these error classes with existing solutions for data cleansing and come to the conclusion that traditional and proven data cleansing techniques of other application domains do not suffice for the particular needs and problem types of genomic databases.

[1]  Timos K. Sellis,et al.  ARKTOS: towards the modeling, design, control and execution of ETL processes , 2001, Inf. Syst..

[2]  R. Doolittle Of urfs and orfs : a primer on how to analyze devised amino acid sequences , 1986 .

[3]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[4]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[5]  Heiko Mueller,et al.  Problems , Methods , and Challenges in Comprehensive Data Cleansing , 2005 .

[6]  William E. Winkler,et al.  Methods for evaluating and creating data quality , 2004, Inf. Syst..

[7]  Erhard Rahm,et al.  Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[8]  Charles Schroeder,et al.  DataBryte: A Proposed Data Warehouse Cleansing Framework , 1998, IQ.

[9]  Richard Y. Wang,et al.  IP-MAP: Representing the Manufacture of an Information Product , 2000, IQ.

[10]  R. Giegerich,et al.  GenDB--an open source genome annotation system for prokaryote genomes. , 2003, Nucleic acids research.

[11]  Joseph M. Hellerstein,et al.  Potter''s Wheel: An Interactive Framework for Data Transformation and Cleaning , 2001, VLDB 2001.

[12]  Vincent Lombard,et al.  The EMBL Nucleotide Sequence Database: major new developments , 2003, Nucleic Acids Res..

[13]  Surajit Chaudhuri,et al.  Eliminating Fuzzy Duplicates in Data Warehouses , 2002, VLDB.

[14]  P. Richterich,et al.  Estimation of errors in "raw" DNA sequences: a validation study. , 1998, Genome research.

[15]  Tok Wang Ling,et al.  IntelliClean: a knowledge-based intelligent data cleaner , 2000, KDD '00.

[16]  Miguel A. Andrade-Navarro,et al.  Evaluation of annotation strategies using an entire genome sequence , 2003, Bioinform..

[17]  A. Valencia,et al.  Intrinsic errors in genome annotation. , 2001, Trends in genetics : TIG.

[18]  Laura M. Haas,et al.  DiscoveryLink: A system for integrated access to life sciences data sources , 2001, IBM Syst. J..

[19]  Philip Lijnzaad,et al.  The Ensembl genome database project , 2002, Nucleic Acids Res..

[20]  Peter D Karp,et al.  The past, present and future of genome-wide re-annotation , 2002, Genome Biology.

[21]  J. R. MacDonald,et al.  Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence , 2003, Genome Biology.

[22]  A Bairoch,et al.  Go hunting in sequence databases but watch out for the traps. , 1996, Trends in genetics : TIG.

[23]  Antonio Sassano,et al.  Errors Detection and Correction in Large Scale Data Collecting , 2001, IDA.

[24]  S. Brenner Errors in genome annotation. , 1999, Trends in genetics : TIG.

[25]  Wei Zhang,et al.  A Framework for Corporate Householding , 2002, ICIQ.

[26]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[27]  Frank Piontek,et al.  Healthcare Informatics: Data Quality, Warehousing and Mining Applications , 2002, ICIQ.

[28]  R. Guigó,et al.  An assessment of gene prediction accuracy in large DNA sequences. , 2000, Genome research.