An ensemble self-training protein interaction article classifier.

Knowledge of protein-protein interactions (PPIs) is essential for understanding the fundamental processes governing cell biology, and the mining and curation of PPI knowledge are critical for analyzing proteomics data. It is therefore desirable to classify articles automatically as PPI-related or not. Building such an interaction article classification system requires an annotated corpus. In practice, however, only a small number of articles can be labeled manually, while a large number of unlabeled articles are available. By combining ensemble learning with semi-supervised self-training, an ensemble self-training interaction classifier called EST_IACer is designed to classify PPI-related articles using a small number of labeled articles and a large number of unlabeled articles. A feature weighting strategy based on biological background knowledge is extended to use category information from both labeled and unlabeled data. Moreover, a heuristic constraint is proposed for selecting optimal instances from the unlabeled data to improve performance further. Experimental results show that EST_IACer classifies PPI-related articles effectively and efficiently.
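The general idea of combining an ensemble with self-training can be illustrated with a minimal sketch. This is not the EST_IACer implementation: the abstract does not specify the base learners, the feature weighting scheme, or the heuristic constraint. Everything below is an assumption made for illustration, including the nearest-centroid base learner, the bootstrap-sampled ensemble, the `min_agreement` vote threshold (a simple stand-in for the instance-selection constraint), and the toy sentences.

```python
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_centroid(docs, labels):
    """Hypothetical base learner: per-class average word counts."""
    sums, counts = {0: Counter(), 1: Counter()}, {0: 0, 1: 0}
    for doc, y in zip(docs, labels):
        sums[y].update(tokenize(doc))
        counts[y] += 1
    return {y: {w: c / counts[y] for w, c in sums[y].items()}
            for y in sums if counts[y]}

def predict_centroid(model, doc):
    """Pick the class whose centroid overlaps most with the document."""
    words = Counter(tokenize(doc))
    scores = {y: sum(cent.get(w, 0.0) * n for w, n in words.items())
              for y, cent in model.items()}
    return max(sorted(scores), key=lambda y: scores[y])

def train_ensemble(docs, labels, n_members, rng):
    """Train each member on a bootstrap sample of the labeled set."""
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(len(docs)) for _ in docs]
        members.append(train_centroid([docs[i] for i in idx],
                                      [labels[i] for i in idx]))
    return members

def vote(members, doc):
    """Return the majority label and the fraction of members agreeing."""
    votes = Counter(predict_centroid(m, doc) for m in members)
    label, n = votes.most_common(1)[0]
    return label, n / len(members)

def ensemble_self_train(labeled, unlabeled, n_members=7,
                        min_agreement=0.9, max_rounds=5, seed=0):
    """Self-training loop: move confidently voted unlabeled docs into
    the training set, then retrain the ensemble."""
    rng = random.Random(seed)
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        members = train_ensemble(docs, labels, n_members, rng)
        remaining, added = [], 0
        for doc in pool:
            label, agreement = vote(members, doc)
            if agreement >= min_agreement:  # assumed selection rule
                docs.append(doc)
                labels.append(label)
                added += 1
            else:
                remaining.append(doc)
        pool = remaining
        if added == 0 or not pool:
            break
    return train_ensemble(docs, labels, n_members, rng), docs, labels

# Toy data (invented for illustration): 1 = PPI-related, 0 = not.
LABELED = [
    ("protein kinase interacts with receptor complex", 1),
    ("yeast protein binding partner interacts in vivo", 1),
    ("protein interacts with membrane scaffold", 1),
    ("two hybrid screen shows protein interacts", 1),
    ("patient survey on hospital costs", 0),
    ("clinical patient survey of sleep quality", 0),
    ("patient survey about diet habits", 0),
    ("national patient survey response rates", 0),
]
UNLABELED = [
    "novel protein interacts with chaperone",
    "patient survey of exercise",
    "protein interacts in mitochondria",
    "regional patient survey",
]

members, final_docs, final_labels = ensemble_self_train(LABELED, UNLABELED)
ppi_label, ppi_agreement = vote(members, "protein interacts with ligand")
other_label, other_agreement = vote(members, "patient survey results")
```

Requiring near-unanimous agreement before an unlabeled article is pseudo-labeled is one common way to limit the label noise that self-training can accumulate; a looser threshold adds more data per round at the cost of more mislabeled instances.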
