Identifying Relevant Data for a Biological Database: Handcrafted Rules versus Machine Learning

With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records for incorporation into TCDB, a specialized biological database of transport proteins. We show that both learning approaches outperform rules created by hand by a human expert. Because this is one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.
