High-recall protein entity recognition using a dictionary

SUMMARY Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently-proposed extension to conditional random fields (CRFs) that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches with that of Maximum Entropy (MaxEnt) and normal CRFs on three datasets, and improvement was obtained for all four methods over the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely-used F-measure, while the dictionary HMMs performed the best at finding entities that actually appear in the dictionary-the measure of most interest in our intended application. AVAILABILITY Dictionary HMMs were implemented in Java. Algorithms are available through an information extraction package MINORTHIRD on http://minorthird.sourceforge.net

[1]  Raymond J. Mooney,et al.  Relational Learning of Pattern-Match Rules for Information Extraction , 1999, CoNLL.

[2]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[3]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[4]  Fernando Pereira,et al.  Identifying gene and protein mentions in text using conditional random fields , 2005, BMC Bioinformatics.

[5]  Robert J. Gaizauskas,et al.  Automatically Augmenting Terminological Lexicons from Untagged Text , 2000, LREC.

[6]  Jun'ichi Tsujii,et al.  Tuning support vector machines for biomedical named entity recognition , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[7]  William A. Gale,et al.  Good-Turing Smoothing Without Tears , 2001 .

[8]  Rohit J. Kate,et al.  Comparative experiments on learning information extractors for proteins and their interactions , 2005, Artif. Intell. Medicine.

[9]  Robert F. Murphy,et al.  EXTRACTING AND STRUCTURING SUBCELLULAR LOCATION INFORMATION FROM ON-LINE JOURNAL ARTICLES: THE SUBCELLULAR LOCATION IMAGE FINDER , 2004 .

[10]  William W. Cohen,et al.  A Comparison of String Metrics for Matching Names and Records , 2003 .

[11]  Javed Mostafa,et al.  An approach to protein name extraction using heuristics and a dictionary , 2005, ASIST.

[12]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[13]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[14]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[15]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[16]  William W. Cohen,et al.  Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods , 2004, KDD.

[17]  Fredrik Olsson,et al.  Protein names and how to find them , 2002, Int. J. Medical Informatics.

[18]  Richard M. Schwartz,et al.  Nymble: a High-Performance Learning Name-finder , 1997, ANLP.

[19]  Wendy G. Lehnert,et al.  Wrap-Up: a Trainable Discourse Module for Information Extraction , 1994, J. Artif. Intell. Res..

[20]  G Demetriou,et al.  Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[21]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.