Boosting Precision and Recall of Dictionary-Based Protein Name Recognition

Dictionary-based protein name recognition is the first step for practical information extraction from biomedical documents because it provides ID information of recognized terms unlike machine learning based approaches. However, dictionary based approaches have two serious problems: (1) a large number of false recognitions mainly caused by short names. (2) low recall due to spelling variation. In this paper, we tackle the former problem by using a machine learning method to filter out false positives. We also present an approximate string searching method to alleviate the latter problem. Experimental results using the GE-NIA corpus show that the filtering using a naive Bayes classifier greatly improves precision with slight loss of recall, resulting in a much better F-score.

[1]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[2]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, CACM.

[3]  Ioannis Xenarios,et al.  Mining literature for protein-protein interactions , 2001, Bioinform..

[4]  Lluís Màrquez i Villodre,et al.  Naive Bayes and Exemplar-based Approaches to Word Sense Disambiguation Revisited , 2000, ECAI.

[5]  Rayid Ghani,et al.  Analyzing the effectiveness and applicability of co-training , 2000, CIKM '00.

[6]  Jin-Dong Kim,et al.  Corpus-Based Approach to Biological Entity Recognition , 2004 .

[7]  Ted Pedersen,et al.  A Simple Approach to Building Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation , 2000, ANLP.

[8]  Nigel Collier,et al.  Automatic acquisition and classification of terminology using a tagged corpus in the molecular biology domain , 2001 .

[9]  C. Friedman,et al.  Using BLAST for identifying gene and protein names in journal articles. , 2000, Gene.

[10]  Jun'ichi Tsujii,et al.  Tuning support vector machines for biomedical named entity recognition , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[11]  Toshihisa Takagi,et al.  Automated extraction of information on protein-protein interactions from the biological literature , 2001, Bioinform..

[12]  Jin-Dong Kim,et al.  The GENIA corpus: an annotated research abstract corpus in molecular biology domain , 2002 .

[13]  Ricardo A. Baeza-Yates,et al.  Matchsimile: A Flexible Approximate Matching Tool for Personal Names Searching , 2001, SBBD.

[14]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[15]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[16]  Nigel Collier,et al.  Use of Support Vector Machines in Extended Named Entity Recognition , 2002, CoNLL.

[17]  C. Ouzounis,et al.  Automatic extraction of protein interactions from scientific abstracts. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.