Corpus-Based Approach to Biological Entity Recognition

Bioinformatics research is currently undergoing a revolution driven by the advent of large-scale machine-readable resources such as MEDLINE, UMLS, and GO. The GENIA corpus, an annotated corpus in the biomedical domain, is another valuable resource, with rich semantic information encoded by human experts. To exploit the availability of such a corpus, we have been applying various machine learning techniques to induce models for biological entity recognition. This paper reports one of the results of those efforts. We used our original machine learning method, the Self-Organizing Hidden Markov Model (hereafter SOHMM), with a simple feature set. For the evaluation, we used both hard and soft matching criteria to cope with the inconsistency inherent in manually annotated corpora. With the soft matching criterion on left boundaries, the experimental results show performance of about 68% and 74% for biological source and substance recognition, respectively.
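To illustrate the difference between the two matching criteria, the following is a minimal sketch and not the evaluation code used in this work. It assumes that a hard match requires the label and both boundaries to agree, and that a soft match on left boundaries requires only the label and the right boundary to agree. That reading of the soft criterion is an assumption. The names Entity, hard_match, soft_left_match, and score are hypothetical and introduced only for illustration.

```python
from typing import Callable, List, NamedTuple, Tuple


class Entity(NamedTuple):
    """A labeled token span; `end` is exclusive. Hypothetical helper type."""
    start: int
    end: int
    label: str


def hard_match(pred: Entity, gold: Entity) -> bool:
    # Hard criterion: label and both boundaries must be identical.
    return pred == gold


def soft_left_match(pred: Entity, gold: Entity) -> bool:
    # ASSUMPTION: "soft on left boundaries" is read here as
    # "label and right boundary agree; the left boundary may differ".
    return pred.label == gold.label and pred.end == gold.end


def score(
    preds: List[Entity],
    golds: List[Entity],
    match: Callable[[Entity, Entity], bool],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the given matching criterion."""
    unmatched = list(golds)  # each gold entity may be matched at most once
    true_positives = 0
    for pred in preds:
        for gold in unmatched:
            if match(pred, gold):
                unmatched.remove(gold)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (invented for illustration): the first prediction misses
# the leftmost token of the gold span but ends at the same position.
gold_entities = [Entity(0, 3, "substance"), Entity(5, 7, "source")]
pred_entities = [Entity(1, 3, "substance"), Entity(5, 7, "source")]

hard_scores = score(pred_entities, gold_entities, hard_match)
soft_scores = score(pred_entities, gold_entities, soft_left_match)
```

On this toy data, the prediction with the shifted left boundary is counted as an error under the hard criterion but as correct under the soft one. This is the kind of boundary disagreement, common among human annotators, that a soft criterion is meant to tolerate.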