Finding Transport Proteins in a General Protein Database

The number of specialized databases in molecular biology is growing fast, as is the volume of available molecular data. These trends call for automatic methods that find relevant information to include in specialized databases. We show how to use a comprehensive database (SwissProt) as a source of new entries for a specialized database (TCDB, the Transporter Classification Database). Even carefully constructed keyword-based queries perform poorly at identifying which SwissProt records are relevant to TCDB, whereas a machine learning approach performs well. We describe a maximum-entropy classifier, trained on SwissProt records, that achieves high precision and recall in cross-validation experiments. This classifier has been deployed as part of a pipeline for updating TCDB, so that a human expert needs to examine only about 2% of SwissProt records for potential inclusion in TCDB. The methods we describe are flexible and general, so they can be applied easily to other specialized databases.
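To make the classification step concrete, here is a minimal sketch of a binary maximum-entropy classifier (equivalent to logistic regression) over bag-of-words features, trained by stochastic gradient ascent. It is an illustration under stated assumptions, not the authors' implementation: the record texts are invented toy strings rather than real SwissProt entries, and the tokenizer, learning rate, regularization strength, and function names are all hypothetical choices.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization of lowercased record text.
    return text.lower().split()


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(records, labels, epochs=200, lr=0.5, l2=0.01):
    """Fit word weights and a bias by stochastic gradient ascent
    on the L2-regularized log-likelihood."""
    weights = Counter()
    bias = 0.0
    feats = [Counter(tokenize(r)) for r in records]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            score = bias + sum(weights[w] * c for w, c in x.items())
            error = y - sigmoid(score)  # gradient of the log-likelihood
            bias += lr * error
            for w, c in x.items():
                weights[w] += lr * (error * c - l2 * weights[w])
    return weights, bias


def predict_proba(model, text):
    """Estimated probability that a record is relevant (label 1)."""
    weights, bias = model
    counts = Counter(tokenize(text))
    score = bias + sum(weights.get(w, 0.0) * c for w, c in counts.items())
    return sigmoid(score)


# Toy training data (invented for illustration; 1 = transport-related).
records = [
    "sodium potassium transporter membrane permease",
    "abc transporter atp binding membrane",
    "ion channel membrane transport",
    "dna polymerase replication nucleus",
    "ribosomal protein translation cytoplasm",
    "kinase phosphorylation signaling cytoplasm",
]
labels = [1, 1, 1, 0, 0, 0]

model = train_maxent(records, labels)
print(predict_proba(model, "sugar transporter membrane"))
print(predict_proba(model, "ribosomal protein translation"))
```

In a curation pipeline of the kind described, such probabilities could be used to rank records so that an expert reviews only the highest-scoring fraction; the cutoff would have to be chosen from cross-validated precision and recall, not from this toy example.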
