A Method to Extract Sentences Containing Protein Function Information with Training Data Extension Based on User's Feedback (IPSJ Transactions on Bioinformatics Vol.3)

A protein expresses various functions by interacting with chemical compounds. Protein function is clarified by protein structure analysis and the obtained knowledge has been stated in a number of documents. Extracting the function information and constructing the database are useful for various application fields such as drug discovery, understanding of life phenomenon, and so on. However, it is impractical to extract the function information manually from a number of documents for constructing the database, which strongly provide motivation to study automatic extraction of the function information. Extraction of protein function information is considered as a classification problem, namely, whether each sentence from the target document includes the function information or not is determined. Typically, in the case of addressing such a classification problem, a classifier is learned using the training data previously given. However, the accuracy is not high when the training data is not large enough. In such a case, we attempt to improve the accuracy of classification by extending the training data. Effective sentences for getting high accuracy are selected from the reference data aside from the training data set, and added to the training data. In order to select such effective sentences, we introduce the reliability of temporary labels assigned to sentences in the reference data. Sentences with low reliability temporary labels are presented to users, assigned true labels as users' feedback, and added to the training data. Additionally, a classifier is learned by the training data with sentences with high reliability temporary labels. By iterating this process, we attempt to improve the accuracy steadily. In the experiment, compared with the related approach, the accuracy is higher when the iteration steps of feedbacks and the number of sentences returned by users' feedback are small. Thus, it is confirmed that the training data is appropriately extended based on users' feedback by the proposed method. In addition, this result serves a purpose of reducing users' load.

[1]  Robert S. Ledley,et al.  The Protein Information Resource , 2003, Nucleic Acids Res..

[2]  Rohit J. Kate,et al.  Learning to Extract Proteins and their Interactions from Medline Abstracts , 2003 .

[3]  Manabu Sassano,et al.  An Empirical Study of Active Learning with Support Vector Machines for Japanese Word Segmentation , 2002, ACL.

[4]  Yi Guan,et al.  Biomedical Named Entities Recognition Using Conditional Random Fields Model , 2006, FSKD.

[5]  Wen-Lian Hsu,et al.  NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition , 2006, BMC Bioinformatics.

[6]  Udo Hahn,et al.  Semi-Supervised Active Learning for Sequence Labeling , 2009, ACL.

[7]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2000, J. Mach. Learn. Res..

[8]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[9]  Mark Craven,et al.  An Analysis of Active Learning Strategies for Sequence Labeling Tasks , 2008, EMNLP.

[10]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[11]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[12]  Craig A. Knoblock,et al.  Selective Sampling with Redundant Views , 2000, AAAI/IAAI.

[13]  Zhi-Hua Zhou,et al.  Exploiting Unlabeled Data in Content-Based Image Retrieval , 2004, ECML.

[14]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[15]  Tomer Hertz,et al.  Learning Distance Functions using Equivalence Relations , 2003, ICML.

[16]  Takenao Ohkawa,et al.  A Method to Extract Sentences with Protein Functional Information from Literature by Iterative Learning of the Corpus , 2006, Inf. Media Technol..

[17]  Greg Schohn,et al.  Less is More: Active Learning with Support Vector Machines , 2000, ICML.

[18]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[19]  J. Lafferty,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[20]  Daphne Koller,et al.  Support Vector Machine Active Learning with Application sto Text Classification , 2000, ICML.

[21]  Jinbo Bi,et al.  Active learning via transductive experimental design , 2006, ICML.

[22]  P E Bourne,et al.  The Protein Data Bank. , 2002, Nucleic acids research.

[23]  James W. Cooper,et al.  Discovery of protein-protein interactions using a combination of linguistic, statistical and graphical information , 2005, BMC Bioinformatics.

[24]  Hao Yu,et al.  Discovering patterns to extract protein-protein interactions from the literature: Part II , 2005, Bioinform..

[25]  Lorraine K. Tanabe,et al.  Tagging gene and protein names in biomedical text , 2002, Bioinform..