An empirical comparison of individual machine learning techniques and ensemble approaches in protein structural class prediction

Protein fold recognition is an important approach to structure discovery without relying on sequence similarity. In this context, computer-based tools, mainly the techniques from machine learning (ML), have become essential considering the large volume of data. We present an empirical comparison of individual machine learning techniques (k-nearest neighbor, naive Bayes, decision trees, support vector machines and neural networks) and ensemble approaches (bagging and boosting) to the task of protein structural class prediction.

[1]  Patrice Koehl,et al.  ASTRAL compendium enhancements , 2002, Nucleic Acids Res..

[2]  Chuen-Der Huang,et al.  Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification , 2003, IEEE Transactions on NanoBioscience.

[3]  Edward C. Uberbacher,et al.  Predicting Protein Folding Classes without Overly Relying on Homology , 1995, ISMB.

[4]  Inna Dubchak,et al.  Protein Folding Class Predictor for SCOP: Approach Based on Global Descriptors , 1997, ISMB.

[5]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[6]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[7]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..

[8]  Yves Deville,et al.  Multi-class protein fold classification using a new ensemble machine learning approach. , 2003, Genome informatics. International Conference on Genome Informatics.

[9]  Infotech Oulu,et al.  Protein Fold Recognition with K-Local Hyperplane Distance Nearest Neighbor Algorithm , 2004 .

[10]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[11]  U. Hobohm,et al.  Enlarged representative set of protein structures , 1994, Protein science : a publication of the Protein Society.

[12]  I. Muchnik,et al.  Prediction of protein folding class using global description of amino acid sequence. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[13]  James E. Bray,et al.  Assigning genomic sequences to CATH , 2000, Nucleic Acids Res..

[14]  Guido Bologna,et al.  A comparison study on protein fold recognition , 2002, Proceedings of the 9th International Conference on Neural Information Processing, 2002. ICONIP '02..

[15]  Tim J. P. Hubbard,et al.  SCOP: a Structural Classification of Proteins database , 1999, Nucleic Acids Res..

[16]  Tim J. P. Hubbard,et al.  SCOP database in 2002: refinements accommodate structural genomics , 2002, Nucleic Acids Res..

[17]  A Chinnasamy,et al.  Protein structure and fold prediction using tree-augmented naive Bayesian classifier. , 2004, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.