Profiles and Majority Voting-Based Ensemble Method for Protein Secondary Structure Prediction

Machine learning techniques have been widely applied to solve the problem of predicting protein secondary structure from the amino acid sequence. They have gained substantial success in this research area. Many methods have been used including k-Nearest Neighbors (k-NNs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs), which have attracted attention recently. Today, the main goal remains to improve the prediction quality of the secondary structure elements. The prediction accuracy has been continuously improved over the years, especially by using hybrid or ensemble methods and incorporating evolutionary information in the form of profiles extracted from alignments of multiple homologous sequences. In this paper, we investigate how best to combine k-NNs, ANNs and Multi-class SVMs (M-SVMs) to improve secondary structure prediction of globular proteins. An ensemble method which combines the outputs of two feed-forward ANNs, k-NN and three M-SVM classifiers has been applied. Ensemble members are combined using two variants of majority voting rule. An heuristic based filter has also been applied to refine the prediction. To investigate how much improvement the general ensemble method can give rather than the individual classifiers that make up the ensemble, we have experimented with the proposed system on the two widely used benchmark datasets RS 126 and CB513 using cross-validation tests by including PSI-BLAST position-specific scoring matrix (PSSM) profiles as inputs. The experimental results reveal that the proposed system yields significant performance gains when compared with the best individual classifier.

[1]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[2]  Chih-Jen Lin,et al.  A comparison of methods for multiclass support vector machines , 2002, IEEE Trans. Neural Networks.

[3]  O. Lund,et al.  Prediction of protein secondary structure at 80% accuracy , 2000, Proteins.

[4]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[5]  V. Lim Structural principles of the globular organization of protein chains. A stereochemical theory of globular protein secondary structure. , 1974, Journal of molecular biology.

[6]  T. Sejnowski,et al.  Predicting the secondary structure of globular proteins using neural network models. , 1988, Journal of molecular biology.

[7]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[8]  P. Y. Chou,et al.  Empirical predictions of protein conformation. , 1978, Annual review of biochemistry.

[9]  Pierre Baldi,et al.  SCRATCH: a protein structure and structural feature prediction server , 2005, Nucleic Acids Res..

[10]  V. Vapnik Estimation of Dependences Based on Empirical Data , 2006 .

[11]  Hyunsoo Kim,et al.  Protein secondary structure prediction based on an improved support vector machines approach. , 2003, Protein engineering.

[12]  F. Richards,et al.  Identification of structural motifs from protein coordinate data: Secondary structure and first‐level supersecondary structure * , 1988, Proteins.

[13]  Thomas G. Dietterich,et al.  Error-Correcting Output Codes: A General Method for Improving Multiclass Inductive Learning Programs , 1991, AAAI.

[14]  J. Gibrat,et al.  Protein secondary structure assignment revisited: a detailed analysis of different assignment methods , 2005, BMC Structural Biology.

[15]  P. Argos,et al.  Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. , 1996, Protein engineering.

[16]  C. Anfinsen Principles that govern the folding of protein chains. , 1973, Science.

[18]  Nello Cristianini,et al.  Large Margin DAGs for Multiclass Classification , 1999, NIPS.

[19]  S. Hua,et al.  A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. , 2001, Journal of molecular biology.

[20]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[21]  Jason Weston,et al.  Multi-Class Support Vector Machines , 1998 .

[22]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[23]  J. Gibrat,et al.  Secondary structure prediction: combination of three different methods. , 1988, Protein engineering.

[24]  B. Rost,et al.  A modified definition of Sov, a segment‐based measure for protein secondary structure prediction assessment , 1999, Proteins.

[25]  A A Salamov,et al.  Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. , 1995, Journal of molecular biology.

[26]  B. Rost,et al.  Combining evolutionary information and neural networks to predict protein secondary structure , 1994, Proteins.

[27]  Hu Chen,et al.  A novel method for protein secondary structure prediction using dual‐layer SVM and profiles , 2004, Proteins.

[28]  R. King,et al.  Identification and application of the concepts important for accurate and reliable protein secondary structure prediction , 1996, Protein science : a publication of the Protein Society.

[29]  J. Gibrat,et al.  Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. , 1987, Journal of molecular biology.

[30]  Enrique Vidal,et al.  Learning weighted metrics to minimize nearest-neighbor classification error , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  F. Girosi,et al.  Networks for approximation and learning , 1990, Proc. IEEE.

[32]  Jagath C. Rajapakse,et al.  Two-Stage Multi-Class Support Vector Machines to Protein Secondary Structure Prediction , 2004, Pacific Symposium on Biocomputing.

[33]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[34]  S. K. Riis,et al.  Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. , 1996, Journal of computational biology : a journal of computational molecular cell biology.

[35]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[36]  J. Garnier,et al.  Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. , 1978, Journal of molecular biology.

[37]  Yen-Jen Oyang,et al.  A novel radial basis function network classifier with centers set by hierarchical clustering , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[38]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[39]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[40]  M. Aizerman,et al.  Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[41]  G J Barton,et al.  Evaluation and improvement of multiple sequence methods for protein secondary structure prediction , 1999, Proteins.

[42]  John Moody,et al.  Fast Learning in Networks of Locally-Tuned Processing Units , 1989, Neural Computation.

[43]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[44]  Jagath C Rajapakse,et al.  Multi-class support vector machines for protein secondary structure prediction. , 2003, Genome informatics. International Conference on Genome Informatics.

[45]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[46]  Bernard F. Buxton,et al.  Secondary structure prediction with support vector machines , 2003, Bioinform..

[47]  Aoife McLysaght,et al.  Porter: a new, accurate server for protein secondary structure prediction , 2005, Bioinform..

[48]  Sridhar Narayan,et al.  The Generalized Sigmoid Activation Function: Competetive Supervised Learning , 1997, Inf. Sci..

[49]  C Geourjon,et al.  SOPM: a self-optimized method for protein secondary structure prediction. , 1994, Protein engineering.

[50]  P. Argos,et al.  Knowledge‐based protein secondary structure assignment , 1995, Proteins.

[51]  Dong Xu,et al.  MUPRED: A tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction , 2006, Proteins.

[52]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[53]  Ryan M. Rifkin,et al.  In Defense of One-Vs-All Classification , 2004, J. Mach. Learn. Res..