BIO-AJAX: an extensible framework for biological data cleaning

As databases become more pervasive throughout the biological sciences, data quality issues arise concerning legacy data, data uniformity, and data duplication. Due to the nature of biological data, each of these problems is non-trivial. Correcting and standardizing such data requires new methods and frameworks. This paper proposes one such framework, BIO-AJAX, which applies principles from data cleaning to improve data quality in biological information systems, specifically in TreeBASE.
