A two-stage approach for improved prediction of residue contact maps

Background
Protein topology representations such as residue contact maps are an important intermediate step towards ab initio prediction of protein structure. Although there have been improvements in recent years, the problem of accurately predicting residue contact maps from primary sequences is still largely unsolved. Among the reasons for this are the unbalanced nature of the problem (with far fewer examples of contacts than non-contacts), the formidable challenge of capturing long-range interactions in the maps, and the intrinsic difficulty of mapping one-dimensional input sequences into two-dimensional output maps.

To alleviate these problems and achieve improved contact map predictions, in this paper we split the task into two stages: first, the prediction of a map's principal eigenvector (PE) from the primary sequence; second, the reconstruction of the contact map from the PE and the primary sequence. Predicting the PE from the primary sequence amounts to mapping a vector into a vector. This task is less complex than mapping vectors directly into two-dimensional matrices, since the size of the problem is drastically reduced and so is the length scale of the interactions that need to be learned.

Results
We develop architectures composed of ensembles of two-layered bidirectional recurrent neural networks to classify the components of the PE into 2, 3 and 4 classes from protein primary sequence, predicted secondary structure, and hydrophobicity interaction scales. Our predictor, tested on a non-redundant set of 2171 proteins, achieves classification performances of up to 72.6%, 16% above a baseline statistical predictor.

We design a system for the prediction of contact maps from the predicted PE. Our results show that predicting maps through the PE yields sizeable gains, especially for long-range contacts, which are particularly critical for accurate protein 3D reconstruction.
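
To make the first-stage target concrete, the sketch below computes a contact map from residue coordinates, extracts its principal eigenvector, and bins the components into classes. It is only an illustration under stated assumptions: the 8 Å contact threshold, the quantile-based class boundaries, and the function names `contact_map`, `principal_eigenvector` and `discretise` are hypothetical choices, not details given in the abstract.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Binary contact map from an (L, 3) array of residue coordinates.

    The 8 angstrom threshold is an assumption made for this sketch.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    cmap = (dist < threshold).astype(float)
    np.fill_diagonal(cmap, 0.0)  # a residue is not in contact with itself
    return cmap

def principal_eigenvector(cmap):
    """Largest eigenvalue and its eigenvector for a symmetric contact map."""
    eigvals, eigvecs = np.linalg.eigh(cmap)  # eigenvalues in ascending order
    pe = eigvecs[:, -1]
    if pe.sum() < 0:  # the sign of an eigenvector is arbitrary; make it positive
        pe = -pe
    return eigvals[-1], pe

def discretise(pe, n_classes=2):
    """Bin PE components into classes using quantile boundaries (assumed)."""
    edges = np.quantile(pe, np.linspace(0.0, 1.0, n_classes + 1)[1:-1])
    return np.digitize(pe, edges)

# Toy example: ten residues on a straight line, 3.8 angstroms apart.
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(10)])
cmap = contact_map(coords)
lam, pe = principal_eigenvector(cmap)
labels = discretise(pe, n_classes=2)
```

In this toy chain the residues near the centre have more neighbours, so their PE components are larger than those at the ends. A sequence-based predictor of the kind described above would be trained to output the class labels directly from the sequence.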
The final predictor's accuracy on a non-redundant set of 327 targets is 35.4% and 19.8% for minimum contact separations of 12 and 24, respectively, when the top length/5 contacts are selected. On the 11 CASP6 Novel Fold targets we achieve similar accuracies (36.5% and 19.7%). This compares favourably with the best automated predictors at CASP6.

Conclusion
Our final system for contact map prediction achieves state-of-the-art performance and may provide valuable constraints for improved ab initio prediction of protein structures. A suite of predictors of structural features, including the PE and PE-based contact maps, is available at http://distill.ucd.ie.
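
The accuracy figures quoted in the Results follow a common protocol: rank residue pairs by predicted contact score, keep the top length/5 pairs whose sequence separation meets a minimum, and report the fraction that are true contacts. The sketch below is one possible reading of that protocol, not the authors' evaluation code. The function name `top_contacts_accuracy` is hypothetical, and treating "minimum separation of 12" as |i - j| >= 12 is an assumption.

```python
import numpy as np

def top_contacts_accuracy(scores, true_map, min_sep=12, fraction=5):
    """Fraction of true contacts among the top L/fraction predicted pairs.

    scores   : (L, L) array of predicted contact scores.
    true_map : (L, L) binary array of observed contacts.
    min_sep  : pairs with j - i >= min_sep are considered (assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    true_map = np.asarray(true_map)
    length = scores.shape[0]
    # Upper-triangle index pairs (i, j) with j - i >= min_sep.
    i, j = np.triu_indices(length, k=min_sep)
    n_pick = max(1, length // fraction)
    # Indices of the n_pick highest-scoring eligible pairs.
    order = np.argsort(scores[i, j])[::-1][:n_pick]
    return float(true_map[i[order], j[order]].mean())

# Toy example with L = 30, so the top 30 // 5 = 6 pairs are evaluated.
L = 30
true_map = np.zeros((L, L), dtype=int)
scores = np.zeros((L, L))
for a, b in [(0, 15), (1, 20), (2, 25), (0, 5)]:
    true_map[a, b] = true_map[b, a] = 1
ranked = [(0, 15), (1, 20), (2, 25), (3, 28), (4, 29), (5, 27)]
for rank, (a, b) in enumerate(ranked):
    scores[a, b] = scores[b, a] = 0.9 - 0.1 * rank
scores[0, 5] = scores[5, 0] = 10.0  # short-range pair, excluded by min_sep

acc12 = top_contacts_accuracy(scores, true_map, min_sep=12)
acc24 = top_contacts_accuracy(scores, true_map, min_sep=24)
```

Raising the minimum separation removes the easier short-range pairs from the ranking, which is why accuracies at separation 24 are typically lower than at separation 12.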
