Recognizing names in biomedical texts using hidden Markov model and SVM plus sigmoid

In this paper, we present a named entity recognition system in the biomedical domain, called PowerBioNE. In order to deal with the special phenomena in the biomedical domain, various evidential features are proposed and integrated through a Hidden Markov Model (HMM). In addition, a Support Vector Machine (SVM) plus sigmoid is proposed to resolve the data sparseness problem in our system. Finally, we present two post-processing modules to deal with the cascaded entity name and abbreviation phenomena. Evaluation shows that our system achieves the F-measure of 69.1 and 71.2 on the 23 classes of GENIA V1.1 and V3.0 respectively. In particular, our system achieves the F-measure of 77.8 on the "protein" class of GENIA V3.0. It shows that our system outperforms the best published system on GENIA V1.1 and V3.0.

[1]  Jin-Dong Kim,et al.  The GENIA corpus: an annotated research abstract corpus in molecular biology domain , 2002 .

[2]  Jian Su,et al.  Effective Adaptation of Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain , 2003, BioNLP@ACL.

[3]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[4]  Proux,et al.  Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. , 1998, Genome informatics. Workshop on Genome Informatics.

[5]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[6]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[7]  Hae-Chang Rim,et al.  Two-Phase Biomedical NE Recognition based on SVMs , 2003, BioNLP@ACL.

[8]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[9]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[10]  Christian Jacquemin,et al.  Spotting and Discovering Terms through Natural Language Processing , 1997 .

[11]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[12]  Jun'ichi Tsujii,et al.  Tuning support vector machines for biomedical named entity recognition , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[13]  Jun'ichi Tsujii,et al.  Boosting Precision and Recall of Dictionary-Based Protein Name Recognition , 2003, BioNLP@ACL.

[14]  Jian Su,et al.  Named Entity Recognition using an HMM-based Chunk Tagger , 2002, ACL.