Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

The volume of biological database records is growing rapidly, populated by complex records drawn from heterogeneous sources. A specific challenge is duplication, that is, the presence of redundancy (records with high similarity) or inconsistency (dissimilar records that correspond to the same entity). The characteristics (which records are duplicates), impact (why duplicates are significant), and solutions (how to address duplication) are not well understood, and studies on the topic are neither recent nor comprehensive. Other data quality issues, such as inconsistencies and inaccuracies, are also of concern in biological databases. The primary focus of this paper is to present and consolidate the opinions of over 20 experts and practitioners on duplication in biological sequence databases. The results reveal that survey participants believe that duplicate records are diverse; that the negative impacts of duplicates are severe, while positive impacts depend on correct identification of duplicates; and that duplicate detection methods need to be more precise, scalable, and robust. A secondary focus is to consider other quality issues. We observe that biocuration is the key mechanism used to ensure the quality of this data, and we explore the issues through a case study of curation in UniProtKB/Swiss-Prot as well as an interview with an experienced biocurator. While biocuration is vital for handling data quality issues, a broader community effort is needed to provide adequate support for thorough biocuration in the face of widespread quality concerns.
