A comparative study of multi-classification methods for protein fold recognition

Fold recognition based on sequence-derived features is a complex multi-class classification problem. In this study, we compare five classification techniques for fold recognition: multilayer perceptron neural networks, probabilistic neural networks, nearest neighbour classifiers, multi-class support vector machines and classification trees. The comparison uses a reference set of proteins organised in 27 folds, each protein described by a 125-dimensional vector of sequence-derived features. We evaluate all classifiers by total accuracy, mutual information coefficient, sensitivity and specificity, using ten-fold cross-validation. A polynomial support vector machine and a multilayer perceptron with one hidden layer of 88 nodes outperformed the other classifiers, achieving multi-class classification accuracies of 42.8% and 42.1%, respectively. We consider these accuracies satisfactory given the complexity of the problem and the comparable performance reported by other researchers.
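The evaluation protocol described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `k_fold_indices` and `per_class_metrics` are hypothetical, the fold split is a simple contiguous (unstratified) partition, and sensitivity and specificity are computed one-versus-rest per fold class, which is an assumption about how the multi-class measurements were defined. The mutual information coefficient is omitted.

```python
def k_fold_indices(n_samples, k=10):
    """Partition range(n_samples) into k contiguous test folds of near-equal size.

    Assumption: a plain, unstratified split; the study may have used a
    different partitioning scheme.
    """
    folds = []
    start = 0
    for i in range(k):
        # The first (n_samples % k) folds receive one extra sample.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def per_class_metrics(y_true, y_pred, labels):
    """Return total accuracy and one-versus-rest (sensitivity, specificity) per class."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    total_accuracy = sum(t == p for t, p in pairs) / n

    metrics = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        tn = n - tp - fn - fp                          # true negatives
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        metrics[c] = (sensitivity, specificity)
    return total_accuracy, metrics


if __name__ == "__main__":
    # Toy labels standing in for fold classes; not data from the study.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    accuracy, metrics = per_class_metrics(y_true, y_pred, labels=[0, 1, 2])
    print(accuracy, metrics)
```

In a full ten-fold run, each fold returned by `k_fold_indices` would serve once as the test set while the remaining nine folds train the classifier, and the metrics would be aggregated over the ten test sets.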
