Improving protein-protein interaction article classification using biological domain knowledge

Interaction Article Classification (IAC) is a text classification task in the biological domain that identifies which articles describe Protein-Protein Interactions (PPIs), so that PPIs can be extracted from the biological literature more efficiently. However, the text representation and feature weighting schemes commonly used for text classification are not well suited to IAC. To address this problem, we capture and utilise biological domain knowledge, namely gene mentions (the protein or gene names that appear in the articles). We propose a new gene mention order-based approach to text representation that highlights the important role of gene mentions. We also incorporate information about gene mentions into a novel feature weighting scheme called Gene Mention-based Term Frequency (GMTF). Our experiments show that, with the proposed representation and weighting schemes, our Interaction Article Classifier (IACer) outperforms other current leading systems.
