Supervised Learning for Detection of Duplicates in Genomic Sequence Databases

Motivation: First identified as an issue in 1996, duplication in biological databases introduces redundancy and can lead to inconsistency when contradictory information appears. The volume of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address these problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.

Results: We developed and evaluated a supervised duplicate detection method based on an expert-curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieved promising performance: under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintained high accuracy and was more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.
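
To make the evaluation procedure concrete, the following is a minimal sketch of a cross-validated feature-group ablation for a binary duplicate/distinct classifier. It is not the method used in the study: the data are synthetic, the nearest-centroid classifier is a stand-in chosen to keep the example dependency-free, and the group names and column indices in `FEATURE_GROUPS` are hypothetical placeholders rather than the 22 features described above.

```python
import random
from statistics import mean

# Hypothetical feature groups (illustrative only, not the study's 22 features).
FEATURE_GROUPS = {
    "metadata": [0],
    "sequence_identity": [1],
    "alignment_quality": [2],
    "uninformative": [3],
}


def make_pairs(n, seed=0):
    """Generate synthetic record pairs; label 1 = duplicate, 0 = distinct."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 2.0 * label  # informative columns are shifted for duplicates
        features = [rng.gauss(shift, 1.0) for _ in range(3)]
        features.append(rng.gauss(0.0, 1.0))  # column 3 carries no signal
        rows.append((features, label))
    return rows


def fit_centroids(train, cols):
    """Compute the per-class mean of the selected columns."""
    centroids = {}
    for label in (0, 1):
        members = [f for f, y in train if y == label]
        centroids[label] = [mean(f[c] for f in members) for c in cols]
    return centroids


def predict(centroids, x, cols):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist(label):
        return sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[label]))
    return min(centroids, key=dist)


def cross_val_accuracy(rows, cols, k=5):
    """Mean accuracy over k folds, using only the selected columns."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        centroids = fit_centroids(train, cols)
        correct = sum(predict(centroids, f, cols) == y for f, y in test)
        scores.append(correct / len(test))
    return mean(scores)


def ablation(rows, k=5):
    """Remove one feature group at a time and record the accuracy drop."""
    all_cols = sorted(c for cols in FEATURE_GROUPS.values() for c in cols)
    baseline = cross_val_accuracy(rows, all_cols, k)
    drops = {}
    for name, removed in FEATURE_GROUPS.items():
        kept = [c for c in all_cols if c not in removed]
        drops[name] = baseline - cross_val_accuracy(rows, kept, k)
    return baseline, drops


if __name__ == "__main__":
    baseline, drops = ablation(make_pairs(1000))
    print(f"baseline accuracy: {baseline:.3f}")
    for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
        print(f"without {name}: accuracy drop {drop:+.3f}")
```

A larger accuracy drop when a group is removed suggests the model relies more heavily on that group; in this toy setup the uninformative column should matter least, though the exact numbers depend on the random seed.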
