The Positive Effects of Negative Information: Extending One-Class Classification Models in Binary Proteomic Sequence Classification

Profile Hidden Markov Models (PHMMs) are widely used to model multiple sequence alignments. By nature, they are generative one-class classifiers, trained only on sequences from the target class they represent. Nevertheless, they are often used to discriminate between classes. In this paper, we investigate the beneficial effects of information from non-target classes in discriminative tasks. First, we extend the traditional PHMM to a new binary classifier. Second, we propose propositional representations of the original PHMM that capture information from both target and non-target sequences and can be used with standard binary classifiers. Because PHMM training is time-intensive, we also investigate whether our approach allows PHMM training to be stopped before full convergence without loss of predictive power.
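The contrast between one-class scoring and scoring that also uses non-target sequences can be illustrated with a minimal sketch. It is not the method of this paper: a position-independent residue-frequency model stands in for a PHMM, and the function names (`fit_residue_model`, `one_class_score`, `binary_score`), the toy sequences, and the two-element feature vector are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def fit_residue_model(seqs, pseudocount=1.0):
    """Estimate smoothed per-residue log-probabilities.

    Assumption: this frequency model is only a stand-in for a trained PHMM.
    """
    counts = Counter(ch for s in seqs for ch in s)
    total = sum(counts[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    return {a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET}


def log_likelihood(model, seq):
    # Residues outside ALPHABET are skipped rather than scored.
    return sum(model[ch] for ch in seq if ch in model)


def one_class_score(target_model, seq):
    """Length-normalised log-likelihood under the target model only."""
    return log_likelihood(target_model, seq) / len(seq)


def binary_score(target_model, nontarget_model, seq):
    """Length-normalised log-likelihood ratio, target versus non-target."""
    diff = log_likelihood(target_model, seq) - log_likelihood(nontarget_model, seq)
    return diff / len(seq)


# Toy data (hypothetical): target sequences are A-rich, non-target are C-rich.
target_model = fit_residue_model(["AAAAAC", "AAACAA"])
nontarget_model = fit_residue_model(["CCCCCA", "CCACCC"])

query = "AAAC"
# A fixed-length feature vector of this kind could be passed to any
# standard binary classifier; the choice of features here is an assumption.
features = [
    one_class_score(target_model, query),
    binary_score(target_model, nontarget_model, query),
]
print(features)
```

The one-class score only says how well a sequence fits the target model, so a threshold on it has to be chosen without reference to other classes. The ratio is positive when the target model explains the sequence better than the non-target model, which gives a natural decision boundary at zero.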
