Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature

We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatically detecting data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). We then use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious”. Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records with our algorithms is an effective mechanism for assisting curators in data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we expect that GenBank contains many more faulty records that have not yet been identified. By automated comparison with the literature, such records can be identified with a precision of up to 10% and a recall of up to 30%, strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the highly imbalanced nature of the data and the limited amount of explicitly labelled data available. Overall, the results show promise for a new kind of approach to detecting low-quality and suspicious sequence records, based on literature analysis and consistency. From a practical point of view, this approach will help curators of large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
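
To make the idea of an IR-based quality indicator concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the record's descriptive text is treated as a query and scored against the article it cites using BM25, a commonly used relevance function. The abstract does not state which indicators are used, so the choice of BM25, the parameter values, and all names here (`tokenize`, `bm25_score`, `consistency_indicators`, the toy articles) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """BM25 relevance of one document to a query (assumed indicator)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        # Smoothed inverse document frequency, always non-negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturation with document length normalisation.
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score


def consistency_indicators(record_text, articles, cited_id):
    """Score the cited article against the record and rank it in the collection.

    A low score or a poor rank suggests that the record and its cited
    article may be inconsistent (hypothetical indicators).
    """
    docs = {aid: tokenize(text) for aid, text in articles.items()}
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    doc_freq = Counter(term for d in docs.values() for term in set(d))
    query = tokenize(record_text)
    scores = {
        aid: bm25_score(query, d, doc_freq, n_docs, avg_len)
        for aid, d in docs.items()
    }
    cited_score = scores[cited_id]
    rank = 1 + sum(1 for s in scores.values() if s > cited_score)
    return {"score": cited_score, "rank": rank}


# Toy collection standing in for the literature (illustrative only).
articles = {
    "A": "Complete nucleotide sequence of the dengue virus genome RNA",
    "B": "Support vector networks for classification of patterns",
    "C": "Clustering large sets of protein sequences",
}
record = "dengue virus type 3 genome complete sequence"

print(consistency_indicators(record, articles, cited_id="A"))
print(consistency_indicators(record, articles, cited_id="B"))
```

In a fuller pipeline, indicator values such as these would form the feature vector passed to the anomaly detection step. Because very few records are labelled as faulty, precision and recall on the "suspicious" class are more informative than overall accuracy.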
