Feature generation and representations for protein-protein interaction classification

Automatically detecting articles relevant to protein-protein interaction (PPI) is a crucial step in large-scale biological database curation. Previous work adopted POS tagging, shallow parsing and sentence splitting techniques, but achieved worse performance than the simple bag-of-words representation. In this paper, we generate and investigate multiple types of feature representations in order to further improve the performance of the PPI text classification task. Besides the traditional domain-independent bag-of-words approach and term weighting methods, we also explore domain-dependent features, i.e. protein-protein interaction trigger keywords, protein named entities, and more advanced ways of incorporating Natural Language Processing (NLP) output. The integration of these multiple features has been evaluated on the BioCreAtIvE II corpus. The experimental results show that both the advanced way of using NLP output and the integration of bag-of-words and NLP output improve text classification performance. Specifically, in comparison with the best performance achieved in the BioCreAtIvE II IAS, the feature-level and classifier-level integration of multiple features improve classification performance by 2.71% and 3.95%, respectively.
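The difference between feature-level and classifier-level integration can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the trigger keyword list, the whitespace tokenizer, the linear scoring function, and the weighted-average combination rule are all assumptions chosen for illustration, and every function name is hypothetical.

```python
from collections import Counter

# Hypothetical trigger keyword list, for illustration only.
TRIGGERS = {"interact", "interacts", "bind", "binds", "phosphorylates"}


def bow_features(text):
    """Bag-of-words term counts from a naive lowercase whitespace tokenizer."""
    counts = Counter(text.lower().split())
    return {f"bow={tok}": float(n) for tok, n in counts.items()}


def trigger_features(text):
    """Counts of the trigger keywords that occur in the text."""
    counts = Counter(tok for tok in text.lower().split() if tok in TRIGGERS)
    return {f"trig={tok}": float(n) for tok, n in counts.items()}


def feature_level_integration(*feature_maps):
    """Merge several feature maps into one vector for a single classifier.

    The name prefixes ("bow=", "trig=") keep the feature spaces disjoint.
    """
    merged = {}
    for fmap in feature_maps:
        merged.update(fmap)
    return merged


def linear_score(features, weights):
    """Decision score of a linear classifier; unseen features weigh 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def classifier_level_integration(scores, classifier_weights=None):
    """Combine scores of separately trained classifiers by weighted average.

    Assumption: a weighted average is only one of several possible
    combination rules.
    """
    if classifier_weights is None:
        classifier_weights = [1.0] * len(scores)
    total = sum(classifier_weights)
    return sum(w * s for w, s in zip(classifier_weights, scores)) / total


text = "ProtA binds ProtB"
merged = feature_level_integration(bow_features(text), trigger_features(text))
combined = classifier_level_integration([1.0, -0.5])
```

In the feature-level variant a single classifier sees the merged vector, while in the classifier-level variant each representation has its own classifier and only the decision scores are combined.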
