Selection of Effective Sentences from a Corpus to Improve the Accuracy of Identification of Protein Names

As the number of documents on protein structural analysis grows, a method for automatically identifying protein names in them is needed. However, identification accuracy is low when the training data set is not large enough. We consider a method, based on machine learning, that extends the training data set using an available corpus. Such a corpus usually consists of documents about a particular organism species, and documents about different species tend to use different vocabularies. Therefore, depending on the target document or corpus, simply using a whole corpus as training data is not effective for accurate identification. To improve accuracy, we propose a method that selects sentences that have a positive effect on identification and extends the training data set with the selected sentences. In the proposed method, a portion of a set of tagged sentences is used as a validation set, and the sentence selection process is iterated using the result of identifying protein names in the validation set as feedback. In the experiment, the proposed method achieved higher accuracy than each of the baseline methods: one using no corpus, one using the whole corpus, and one using a randomly chosen part of the corpus. This confirms that the proposed method selects effective sentences.
