Paper Information - Corpus-Based Approach to Biological Entity Recognition

Corpus-Based Approach to Biological Entity Recognition

We are witnessing a revolution in bioinformatics research, driven by the advent of large-scale machine-readable resources such as MEDLINE, UMLS, and GO. The GENIA corpus, an annotated corpus in the biomedical domain, is another valuable resource, with rich semantic information encoded by human experts. To exploit the availability of such a corpus, we have been applying various machine learning techniques to induce models for biological entity recognition. This paper reports one of the results of those efforts. We used our original machine learning method, the Self-Organizing Hidden Markov Model (hereafter SOHMM), with a simple feature set. For the evaluation, we used hard and soft matching criteria to cope with the inconsistency inherent in manually annotated corpora. With the soft matching criterion on left boundaries, the experimental results show performance of about 68% for biological source recognition and about 74% for biological substance recognition.
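To make the difference between the two evaluation criteria concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes that "soft matching on left boundaries" means the right boundary and entity class must agree while the left boundary may differ; the abstract does not spell this out. The `Entity` type, the function names, and the sample spans are all hypothetical and purely illustrative.

```python
from typing import Callable, List, NamedTuple, Tuple


class Entity(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)
    label: str  # e.g. "source" or "substance"


def hard_match(pred: Entity, gold: Entity) -> bool:
    """Both boundaries and the label must agree exactly."""
    return pred == gold


def soft_left_match(pred: Entity, gold: Entity) -> bool:
    """Assumed reading of the soft criterion: the right boundary and the
    label must agree, while the left boundary is allowed to differ."""
    return pred.end == gold.end and pred.label == gold.label


def precision_recall_f1(
    preds: List[Entity],
    golds: List[Entity],
    match: Callable[[Entity, Entity], bool],
) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under a given match criterion."""
    matched_preds = sum(any(match(p, g) for g in golds) for p in preds)
    matched_golds = sum(any(match(p, g) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only: the first prediction misses the leading modifier
# of the gold entity, so it fails the hard criterion but passes the soft one.
gold = [Entity(0, 2, "substance"), Entity(5, 6, "source")]
pred = [Entity(1, 2, "substance"), Entity(5, 6, "source")]

hard_scores = precision_recall_f1(pred, gold, hard_match)
soft_scores = precision_recall_f1(pred, gold, soft_left_match)
print("hard:", hard_scores)
print("soft:", soft_scores)
```

Under this reading, a prediction that disagrees with the annotators only on whether a leading modifier belongs to the entity still counts as correct, which is one plausible way to absorb left-boundary inconsistency in a manually annotated corpus.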

Jin-Dong Kim | Tomoko Ohta
