Prediction of Protein Folds: Extraction of New Features, Dimensionality Reduction, and Fusion of Heterogeneous Classifiers

We consider a two-level protein fold determination problem, with four structural classes at level 1 and 27 folds at level 2. We propose several new features and use some existing features, including frequencies of adjacent residues, frequencies of residues separated by one residue, and triplet (trio) amino acid compositions (AACs). The dimensionality of the trio AAC features is drastically reduced using a novel neural-network-based online feature selection scheme. We also propose a new set of features, called trio potential, computed from hydrophobicity values while considering only the selected trio AACs. We demonstrate that the proposed features, including the selected trio AACs and the trio potential, have good discriminating power for protein fold determination. As machine learning tools, we use a multilayer perceptron network, a radial basis function network, and a support vector machine. To further improve recognition accuracy, we fuse different classifiers, both classifiers trained on the same feature set and classifiers trained on different feature sets. The effectiveness of our schemes is demonstrated on a benchmark Structural Classification of Proteins (SCOP) dataset. Our system achieves 84.9% test accuracy for SCOP structural class determination (four classes) and 68.6% test accuracy for fold recognition with 27 folds. To demonstrate the consistency of the feature sets and fusion schemes, we also perform fivefold cross-validation experiments.
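The sequence-derived features named above (adjacent-residue frequencies, frequencies of residues separated by one residue, and trio AACs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `pair_frequencies` and `trio_frequencies`, the use of ordered pairs and triplets, the normalization to relative frequencies, and the skipping of non-standard residues are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_STANDARD = set(AMINO_ACIDS)


def pair_frequencies(seq, gap=0):
    """Relative frequencies of ordered residue pairs (400 values).

    gap=0 counts adjacent residues; gap=1 counts residues separated
    by exactly one residue. Pairs containing a non-standard symbol
    are skipped (an assumption of this sketch).
    """
    step = gap + 1
    counts = Counter(
        (seq[i], seq[i + step])
        for i in range(len(seq) - step)
        if seq[i] in _STANDARD and seq[i + step] in _STANDARD
    )
    total = sum(counts.values())
    return [
        counts[pair] / total if total else 0.0
        for pair in product(AMINO_ACIDS, repeat=2)
    ]


def trio_frequencies(seq):
    """Relative frequencies of ordered residue triplets (8000 values)."""
    counts = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if set(seq[i:i + 3]) <= _STANDARD
    )
    total = sum(counts.values())
    return [
        counts["".join(trio)] / total if total else 0.0
        for trio in product(AMINO_ACIDS, repeat=3)
    ]


if __name__ == "__main__":
    toy = "ACACA"  # toy sequence, not taken from the SCOP dataset
    adjacent = pair_frequencies(toy, gap=0)
    one_apart = pair_frequencies(toy, gap=1)
    trios = trio_frequencies(toy)
    print(len(adjacent), len(one_apart), len(trios))
```

Under these assumptions the trio vector has 20^3 = 8000 entries against 400 for each pair vector, which suggests why the abstract applies feature selection to the trio AACs before using them.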
