Improved residue contact prediction using support vector machines and a large feature set

Background: Predicting protein residue-residue contacts is an important 2D prediction task. It is useful for ab initio structure prediction and for understanding protein folding. In spite of steady progress over the past decade, contact prediction remains largely unsolved.

Results: Here we develop a new contact map predictor (SVMcon) that uses support vector machines to predict medium- and long-range contacts. SVMcon integrates profiles, secondary structure, relative solvent accessibility, contact potentials, and other useful features. On the same test data set, SVMcon's accuracy is 4% higher than that of the latest version of the CMAPpro contact map predictor. SVMcon recently participated in the seventh edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7) experiment and was evaluated along with seven other contact map predictors. SVMcon was ranked as one of the top predictors, yielding the second best coverage and accuracy for contacts with sequence separation >= 12 on 13 de novo domains.

Conclusion: We describe SVMcon, a new contact map predictor that uses SVMs and a large set of informative features. SVMcon yields good performance on medium- to long-range contact predictions and can be modularly incorporated into a structure prediction pipeline.

Background

Predicting protein inter-residue contacts is an important 2D structure prediction problem [1]. Contact prediction can help improve analogous fold recognition [2,3] and ab initio 3D structure prediction [4]. Several algorithms for reconstructing 3D structure from contacts have been developed in both the structure prediction and determination (NMR) literature [5-8]. Contact map prediction is also useful for inferring protein folding rates and pathways [9,10]. Due to its importance, contact prediction has received considerable attention over the last decade.
For instance, contact prediction methods have been evaluated in the fifth, sixth, and seventh editions of the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment [11-15]. A number of different methods for predicting contacts have been developed. These methods can be classified roughly into two non-exclusive categories: (1) statistical correlated mutations approaches [16-22]; and (2) machine learning approaches [23-34]. The former uses correlated mutations of residues to predict contacts. The latter uses machine learning methods such

Published: 2 April 2007
BMC Bioinformatics 2007, 8:113 doi:10.1186/1471-2105-8-113
Received: 28 December 2006
Accepted: 2 April 2007
This article is available from: http://www.biomedcentral.com/1471-2105/8/113
© 2007 Cheng and Baldi; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
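
The abstract reports coverage and accuracy for contacts with sequence separation >= 12. As a rough illustration of what such an evaluation might look like, the sketch below derives a contact map from 3D coordinates and scores a predicted set of contacts against it. The 8 Å distance threshold, the use of one coordinate per residue, the function names, and the exact definitions of accuracy (fraction of predicted contacts that are true) and coverage (fraction of true contacts that are predicted) are assumptions made for this example and are not taken from the text above.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return the set of residue pairs (i, j), i < j, closer than `threshold`.

    `coords` holds one (x, y, z) point per residue; the 8.0 default is an
    assumed cutoff in angstroms, not a value stated in the text.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts


def filter_by_separation(pairs, min_sep=12):
    """Keep only pairs whose sequence separation |i - j| is at least `min_sep`."""
    return {(i, j) for i, j in pairs if abs(i - j) >= min_sep}


def accuracy_and_coverage(predicted, true, min_sep=12):
    """Score predicted contacts against true contacts at a given separation.

    Assumed definitions: accuracy = hits / predicted, coverage = hits / true,
    both computed after filtering by sequence separation.
    """
    pred = filter_by_separation(predicted, min_sep)
    real = filter_by_separation(true, min_sep)
    hits = len(pred & real)
    accuracy = hits / len(pred) if pred else 0.0
    coverage = hits / len(real) if real else 0.0
    return accuracy, coverage


# Toy data: three residues on a line, then hand-written contact sets.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_contacts = contact_map(toy_coords)

true_contacts = {(0, 12), (0, 13), (1, 5), (2, 20)}
predicted_contacts = {(0, 12), (3, 30), (1, 5)}
acc, cov = accuracy_and_coverage(predicted_contacts, true_contacts)
```

In this toy case the short-range pair (1, 5) is dropped from both sets by the separation filter, so it affects neither score.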

[1]  Yanay Ofran,et al.  Prediction of Protein Structure Through Evolution , 2008 .

[2]  Emil Alexov,et al.  Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives , 2006, BMC Bioinformatics.

[3]  Alessandro Vullo,et al.  Distill: a suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins , 2006, BMC Bioinformatics.

[4]  Pierre Baldi,et al.  A machine learning information retrieval approach to protein fold recognition. , 2006, Bioinformatics.

[5]  H. Wolfson,et al.  Correlated mutations: Advances and limitations. A study on fusion proteins and on the Cohesin‐Dockerin families , 2006, Proteins.

[6]  Pierre Baldi,et al.  Large‐scale prediction of disulphide bridges using kernel methods, two‐dimensional recursive neural networks, and weighted graph matching , 2005, Proteins.

[7]  Alessandro Vullo,et al.  A two-stage approach for improved prediction of residue contact maps , 2006, BMC Bioinformatics.

[8]  Burkhard Rost,et al.  PROFcon: novel prediction of long-range contacts , 2005, Bioinform..

[9]  Pierre Baldi,et al.  SCRATCH: a protein structure and structural feature prediction server , 2005, Nucleic Acids Res..

[10]  Burkhard Rost,et al.  EVAcon: a protein contact prediction evaluation service , 2005, Nucleic Acids Res..

[11]  Marco Punta,et al.  Protein folding rates estimated from contact predictions. , 2005, Journal of molecular biology.

[12]  Pierre Baldi,et al.  Three-stage prediction of protein β-sheets by neural networks, alignments and graph algorithms , 2005, ISMB.

[13]  Adam Zemla,et al.  Critical assessment of methods of protein structure prediction (CASP)‐round V , 2005, Proteins.

[14]  Jens Meiler,et al.  CASP6 assessment of contact prediction , 2005, Proteins.

[15]  Pierre Baldi,et al.  Large-Scale Prediction of Disulphide Bond Connectivity , 2004, NIPS.

[16]  K. Burrage,et al.  Protein contact prediction using patterns of correlation , 2004, Proteins.

[17]  J. Skolnick,et al.  Automated structure prediction of weakly homologous proteins on a genomic scale. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[18]  Bernhard Schölkopf,et al.  A Primer on Kernel Methods , 2004 .

[19]  Robert M. MacCallum,et al.  Striped sheets and protein contact prediction , 2004, ISMB/ECCB.

[20]  J. Skolnick,et al.  TOUCHSTONE II: a new approach to ab initio protein structure prediction. , 2003, Biophysical journal.

[21]  Christopher K. I. Williams,et al.  Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond , 2001 .

[22]  George Karypis,et al.  Prediction of contact maps using support vector machines , 2003, Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings..

[23]  Christopher Bystroff,et al.  Predicting interresidue contacts using templates and pathways , 2003, Proteins.

[24]  Richard Bonneau,et al.  Contact order and ab initio protein structure prediction , 2002, Protein science : a publication of the Protein Society.

[25]  Pierre Baldi,et al.  Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners , 2002, ISMB.

[26]  A. Valencia,et al.  Computational methods for the prediction of protein interactions. , 2002, Current opinion in structural biology.

[27]  Pierre Baldi,et al.  Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles , 2002, Proteins.

[28]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[29]  T. Joachims Support Vector Machines , 2002 .

[30]  P Fariselli,et al.  Prediction of contact maps with neural networks and correlated mutations. , 2001, Protein engineering.

[31]  Piero Fariselli,et al.  Improved prediction of the number of residue contacts in proteins by recurrent neural networks , 2001, ISMB.

[32]  A. Lesk,et al.  Assessment of novel fold targets in CASP4: Predictions of three‐dimensional structures, secondary structures, and interresidue contacts , 2001, Proteins.

[33]  P Fariselli,et al.  Progress in predicting inter‐residue contacts of proteins with neural networks and correlated mutations , 2001, Proteins.

[34]  Roland L. Dunbrack,et al.  CAFASP2: The second critical assessment of fully automated structure prediction methods , 2001, Proteins.

[35]  Volker A. Eyrich,et al.  EVA: Large‐scale analysis of secondary structure prediction , 2001, Proteins.

[36]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[37]  W. Braun,et al.  Sequence specificity, statistical potentials, and three‐dimensional structure prediction with self‐correcting distance geometry calculations of β‐sheet formation in proteins , 2008 .

[38]  B. Rost,et al.  Effective use of sequence correlation and conservation in fold recognition. , 1999, Journal of molecular biology.

[39]  R. Jernigan,et al.  An empirical energy potential with a reference state for protein fold and sequence recognition , 1999, Proteins.

[40]  J. Skolnick,et al.  Ab initio folding of proteins using restraints derived from evolutionary information , 1999, Proteins.

[41]  D Fischer,et al.  CAFASP‐1: Critical assessment of fully automated structure prediction methods , 1999, Proteins.

[42]  R. Casadio,et al.  A neural network based predictor of residue contacts in proteins. , 1999, Protein engineering.

[43]  D. Baker,et al.  Contact order, transition state placement and the refolding rates of single domain proteins. , 1998, Journal of molecular biology.

[44]  J. Skolnick,et al.  Fold assembly of small proteins using monte carlo simulations driven by restraints derived from multiple sequence alignments. , 1998, Journal of molecular biology.

[45]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[46]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[47]  O. Lund,et al.  Protein distance constraints predicted by neural networks and probability density functions. , 1997, Protein engineering.

[48]  A. Valencia,et al.  Improving contact predictions by the combination of correlated mutations and other sources of sequence information. , 1997, Folding & design.

[49]  M Vendruscolo,et al.  Recovery of protein structure from contact maps. , 1997, Folding & design.

[50]  J. Skolnick,et al.  MONSSTER: a method for folding globular proteins with a small number of distance restraints. , 1997, Journal of molecular biology.

[51]  T. Hubbard,et al.  Critical assessment of methods of protein structure prediction (CASP): Round III , 1999, Proteins.

[52]  S. Bryant,et al.  Critical assessment of methods of protein structure prediction (CASP): Round II , 1997, Proteins.

[53]  Alexander J. Smola,et al.  Support Vector Regression Machines , 1996, NIPS.

[54]  M. Levitt,et al.  Using a hydrophobic contact potential to evaluate native and near-native folds generated by molecular dynamics simulations. , 1996, Journal of molecular biology.

[55]  W. Taylor,et al.  Global fold determination from a small number of distance restraints. , 1995, Journal of molecular biology.

[56]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[57]  C. Sander,et al.  Correlated mutations and residue contacts in proteins , 1994, Proteins.

[58]  C. Sander,et al.  Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? , 1994, Protein engineering.

[59]  C. Sander,et al.  Correlated Mutations and Residue Contacts , 1994 .

[60]  P. Kraulis A program to produce both detailed and schematic plots of protein structures , 1991 .