Protein Fold Prediction Problem Using Ensemble of Classifiers

Predicting the tertiary structure of a protein from its primary structure (its amino acid sequence) without relying on sequence similarity is a challenging task in bioinformatics and the biological sciences. Protein fold prediction can be framed as a classification problem and solved with machine learning techniques. In this paper, a new method based on an ensemble of five classifiers (Naive Bayes, Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), LogitBoost, and AdaBoost.M1) is proposed for the protein fold prediction problem. The experiments use the standard dataset provided by Ding and Dubchak [4]. Experimental results show that the proposed method raises the prediction accuracy to as much as 64% on an independent test dataset, which is the highest prediction accuracy among the compared methods from the literature.
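
The abstract does not say how the outputs of the five classifiers are combined. As a minimal sketch only, the following Python code assumes plurality (majority) voting over the fold labels predicted by each base classifier. The function names, the tie-breaking rule, and the fold labels in the usage example are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# The five base classifiers named in the abstract. How each one is trained
# is outside the scope of this sketch; only their predicted labels are used.
BASE_CLASSIFIERS = ["NaiveBayes", "MLP", "SVM", "LogitBoost", "AdaBoost.M1"]


def majority_vote(predictions):
    """Return the most frequent label among one protein's predictions.

    Assumption: ties are broken in favour of the label that appears first
    in `predictions` (an arbitrary, illustrative rule).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_predictions):
    """Combine predictions given as {classifier name: [label per protein]}."""
    columns = [per_classifier_predictions[name] for name in BASE_CLASSIFIERS]
    # zip(*columns) yields, for each protein, the five labels voted for it.
    return [majority_vote(list(votes)) for votes in zip(*columns)]


def accuracy(predicted, actual):
    """Fraction of proteins whose predicted fold equals the true fold."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Usage with made-up labels for four proteins (not real data).
example_predictions = {
    "NaiveBayes":  ["globin", "tim",    "tim", "ig"],
    "MLP":         ["globin", "globin", "tim", "ig"],
    "SVM":         ["tim",    "tim",    "tim", "globin"],
    "LogitBoost":  ["globin", "tim",    "ig",  "globin"],
    "AdaBoost.M1": ["ig",     "ig",     "ig",  "tim"],
}
example_truth = ["globin", "tim", "ig", "ig"]

combined = ensemble_predict(example_predictions)
print(combined)
print(accuracy(combined, example_truth))
```

Voting on hard labels is only one possible combination rule. Averaging class probabilities or weighting each classifier by its validation accuracy are common alternatives, and the paper may use a different scheme.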

[1] U. Hobohm et al. Selection of representative protein data sets, 1992, Protein Science: a publication of the Protein Society.

[2] J. Ross Quinlan et al. C4.5: Programs for Machine Learning, 1992.

[3] Ian Witten et al. Data Mining, 2000.

[4] Chris H. Q. Ding et al. Multi-class protein fold recognition using support vector machines and neural networks, 2001, Bioinform.

[5] John C. Platt et al. Fast training of support vector machines using sequential minimal optimization, in Advances in Kernel Methods, 1999.

[6] J. Friedman. Special Invited Paper. Additive logistic regression: A statistical view of boosting, 2000.

[7] M.C.P. de Souto et al. An empirical comparison of individual machine learning techniques and ensemble approaches in protein structural class prediction, 2005, Proceedings of the 2005 IEEE International Joint Conference on Neural Networks.

[8] Simon Haykin et al. Neural Networks: A Comprehensive Foundation, 1998.

[9] David J. C. MacKay et al. Information Theory, Inference, and Learning Algorithms, 2004, IEEE Transactions on Information Theory.

[10] Subhash C. Bagui et al. Combining Pattern Classifiers: Methods and Algorithms, 2005, Technometrics.

[11] Tim J. P. Hubbard et al. SCOP: a structural classification of proteins database, 1998, Nucleic Acids Res.

[12] Guido Bologna et al. A comparison study on protein fold recognition, 2002, Proceedings of the 9th International Conference on Neural Information Processing (ICONIP '02).

[13] David G. Stork et al. Pattern Classification (2nd ed.), 1999.

[14] Chuen-Der Huang et al. Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification, 2003, IEEE Transactions on NanoBioscience.

[15] Chandan K. Reddy et al. Boosting Methods for Protein Fold Recognition: An Empirical Comparison, 2008, 2008 IEEE International Conference on Bioinformatics and Biomedicine.

[16] David J. Miller et al. Transductive Methods for the Distributed Ensemble Classification Problem, 2007, Neural Computation.

[17] Nir Friedman et al. Learning Bayesian Networks with Local Structure, 1996, UAI.

[18] Ludmila I. Kuncheva et al. Combining Pattern Classifiers: Methods and Algorithms, 2004.

[19] R. Schapire. The Strength of Weak Learnability, 1990, Machine Learning.

[20] Kalyanmoy Deb et al. Multiclass protein fold recognition using multiobjective evolutionary algorithms, 2004, 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology.

[21] Rehab Duwairi et al. A framework for predicting proteins 3D structures, 2008, 2008 IEEE/ACS International Conference on Computer Systems and Applications.

[23] Leo Breiman et al. Bagging Predictors, 1996, Machine Learning.

[24] Paul A. Viola et al. Rapid object detection using a boosted cascade of simple features, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001).

[25] Ian H. Witten et al. Data mining: practical machine learning tools and techniques, 3rd Edition, 1999.