Protein Secondary Structure Prediction Using Support Vector Machines and a New Feature Representation

Knowledge of the secondary structure and solvent accessibility of a protein plays a vital role in predicting its fold and, eventually, its tertiary structure. This work addresses the challenging problem of predicting protein secondary structure from sequence alone. Support vector machines (SVMs) are employed for classification, and the SVM outputs are converted to posterior probabilities for multi-class classification. The effect of using Chou–Fasman parameters and physico-chemical parameters along with evolutionary information in the form of a position-specific scoring matrix (PSSM) is analyzed. The proposed methods are tested on the RS126 and CB513 datasets. A new dataset, PSS504, is curated using a recent release of CATH. On the CB513 dataset, a sevenfold cross-validation accuracy of 77.9% is obtained using the proposed encoding method. A new method of calculating the reliability index, based on the number of votes and the SVM decision value, is also proposed. A blind test on the EVA dataset gives an average Q3 accuracy of 74.5% and ranks among the top five protein structure prediction methods. Supplementary material, including datasets, is available on .
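
The pipeline summarized above (windowed PSSM features, SVM decision values mapped to posterior probabilities, and Q3 scoring) can be illustrated with a minimal sketch. It is not the authors' implementation. The window half-width, the zero padding at the termini, the one-vs-rest arrangement of three classifiers, the Platt-style sigmoid, and its parameters `a` and `b` are all assumptions chosen for illustration, and every function name is hypothetical. Only the definition of Q3 as the percentage of residues assigned the correct one of three states is standard.

```python
import math

CLASSES = ("H", "E", "C")  # helix, strand, coil


def window_features(pssm, center, half_width=7):
    """Flatten the PSSM rows in a window centred on one residue.

    Positions beyond the sequence termini are zero-padded (an assumption).
    """
    n_cols = len(pssm[0])
    feats = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            feats.extend(pssm[pos])
        else:
            feats.extend([0.0] * n_cols)
    return feats


def sigmoid_posterior(decision_value, a=-1.0, b=0.0):
    """Platt-style mapping 1 / (1 + exp(a*f + b)), written in a stable form.

    a and b are placeholder values; in practice they would be fitted
    on held-out data.
    """
    z = a * decision_value + b
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def predict_state(decision_values):
    """Pick the most probable state from one-vs-rest SVM decision values.

    decision_values maps each class label to that classifier's output.
    Probabilities are renormalised so that they sum to one.
    """
    probs = {c: sigmoid_posterior(decision_values[c]) for c in CLASSES}
    total = sum(probs.values())
    probs = {c: p / total for c, p in probs.items()}
    best = max(CLASSES, key=lambda c: probs[c])
    return best, probs


def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, `q3_accuracy("HHEC", "HHCC")` scores three of four residues as correct. In a real setting the decision values would come from trained SVMs rather than being supplied by hand.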
