Beyond the Twilight Zone: Automated prediction of structural properties of proteins by recursive neural networks and remote homology information

The prediction of 1D structural properties of proteins is an important step toward the prediction of protein structure and function, not only in the ab initio case but also when homology information to known structures is available. Despite this the vast majority of 1D predictors do not incorporate homology information into the prediction process. We develop a novel structural alignment method, SAMD, which we use to build alignments of putative remote homologues that we compress into templates of structural frequency profiles. We use these templates as additional input to ensembles of recursive neural networks, which we specialise for the prediction of query sequences that show only remote homology to any Protein Data Bank structure. We predict four 1D structural properties – secondary structure, relative solvent accessibility, backbone structural motifs, and contact density. Secondary structure prediction accuracy, tested by five‐fold cross‐validation on a large set of proteins allowing less than 25% sequence identity between training and test set and query sequences and templates, exceeds 82%, outperforming its ab initio counterpart, other state‐of‐the‐art secondary structure predictors (Jpred 3 and PSIPRED) and two other systems based on PSI‐BLAST and COMPASS templates. We show that structural information from homologues improves prediction accuracy well beyond the Twilight Zone of sequence similarity, even below 5% sequence identity, for all four structural properties. Significant improvement over the extraction of structural information directly from PDB templates suggests that the combination of sequence and template information is more informative than templates alone. Proteins 2009. © 2009 Wiley‐Liss, Inc.

[1]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[2]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[3]  Pierre Baldi,et al.  SCRATCH: a protein structure and structural feature prediction server , 2005, Nucleic Acids Res..

[4]  R. Russell,et al.  Protein complexes: structure prediction challenges for the 21st century. , 2005, Current opinion in structural biology.

[5]  Burkhard Rost,et al.  Improving fold recognition without folds. , 2004, Journal of molecular biology.

[6]  Haipeng Gong,et al.  Does secondary structure determine tertiary structure in proteins? , 2005, Proteins.

[7]  Pierre Baldi,et al.  The Principled Design of Large-Scale Recursive Neural Network Architectures--DAG-RNNs and the Protein Structure Prediction Problem , 2003, J. Mach. Learn. Res..

[8]  Christian Cole,et al.  The Jpred 3 secondary structure prediction server , 2008, Nucleic Acids Res..

[9]  Marcin von Grotthuss,et al.  ORFeus: detection of distant homology using sequence profiles and predicted secondary structure , 2003, Nucleic Acids Res..

[10]  Chris Sander,et al.  Completeness in structural genomics , 2001, Nature Structural Biology.

[11]  U. Hobohm,et al.  Selection of representative protein data sets , 1992, Protein science : a publication of the Protein Society.

[12]  Kevin Karplus,et al.  Evaluation of local structure alphabets based on residue burial , 2004, Proteins.

[13]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[14]  Aleksey A. Porollo,et al.  Accurate prediction of solvent accessibility using neural networks–based regression , 2004, Proteins.

[15]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[16]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[17]  S. Henikoff,et al.  Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[18]  Steven E Brenner,et al.  The Impact of Structural Genomics: Expectations and Outcomes , 2005, Science.

[19]  H Naderi-Manesh,et al.  Prediction of protein surface accessibility with information theory. , 2000, Proteins.

[20]  W. Pearson,et al.  The limits of protein sequence comparison? , 2005, Current opinion in structural biology.

[21]  G J Barton,et al.  Application of multiple sequence alignment profiles to improve protein secondary structure prediction , 2000, Proteins.

[22]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[23]  Pierre Tufféry,et al.  PredAcc: prediction of solvent accessibility , 1999, Bioinform..

[24]  Ron Elber,et al.  SSALN: An alignment algorithm using structure‐dependent substitution matrices and gap penalties learned from structurally aligned protein pairs , 2005, Proteins.

[25]  N. Grishin,et al.  COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. , 2003, Journal of molecular biology.

[26]  Martin Vingron,et al.  The SYSTERS protein sequence cluster set , 2000, Nucleic Acids Res..

[27]  Shandar Ahmad,et al.  NETASA: neural network based prediction of solvent accessibility , 2002, Bioinform..

[28]  Ralf Bundschuh,et al.  Rapid significance estimation in local sequence alignment with gaps , 2001, RECOMB.

[29]  Sung-Hou Kim,et al.  Protein conformational space in higher order-maps , 2005 .

[30]  Yaoqi Zhou,et al.  SPARKS 2 and SP3 servers in CASP6 , 2005, Proteins.

[31]  Aoife McLysaght,et al.  Porter: a new, accurate server for protein secondary structure prediction , 2005, Bioinform..

[32]  N Linial,et al.  ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space , 1999, Proteins.

[33]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[34]  Leszek Rychlewski,et al.  LiveBench‐8: The large‐scale, continuous assessment of automated protein structure prediction , 2005, Protein science : a publication of the Protein Society.

[35]  Golan Yona,et al.  Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. , 2002, Journal of molecular biology.

[36]  Richard Hughey,et al.  Calibrating E-values for hidden Markov models using reverse-sequence null models , 2005, Bioinform..

[37]  Alessandro Vullo,et al.  A two-stage approach for improved prediction of residue contact maps , 2006, BMC Bioinformatics.

[38]  Jagath C Rajapakse,et al.  Prediction of protein relative solvent accessibility with a two‐stage SVM approach , 2005, Proteins.

[39]  Michael Gribskov,et al.  Estimating and Evaluating the Statistics of Gapped Local-Alignment Scores , 2003, J. Comput. Biol..

[40]  Gregory E Sims,et al.  Protein conformational space in higher order phi-Psi maps. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[41]  Alessandro Vullo,et al.  Protein Structural Motif Prediction in Multidimensional ø-Psi Space Leads to Improved Secondary Structure Prediction , 2006, J. Comput. Biol..

[42]  Alessandro Vullo,et al.  Distill: a suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins , 2006, BMC Bioinformatics.

[43]  Liam J. McGuffin,et al.  Improvement of the GenTHREADER Method for Genomic Fold Recognition , 2003, Bioinform..

[44]  S. Pascarella,et al.  Improvement in prediction of solvent accessibility by probability profiles. , 2003, Protein engineering.

[45]  Andras Fiser,et al.  Comparative protein structure modeling of genes and genomes. , 2000, Annual review of biophysics and biomolecular structure.

[46]  T L Blundell,et al.  FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. , 2001, Journal of molecular biology.

[47]  B. Honig,et al.  On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. , 2003, Journal of molecular biology.

[48]  Pierre Baldi,et al.  A machine learning information retrieval approach to protein fold recognition. , 2006, Bioinformatics.

[49]  P. Baldi,et al.  Prediction of coordination number and relative solvent accessibility in proteins , 2002, Proteins.

[50]  Christopher Bystroff,et al.  Improved pairwise alignment of proteins in the Twilight Zone using local structure predictions , 2005, 2005 IEEE Computational Systems Bioinformatics Conference - Workshops (CSBW'05).

[51]  S. Altschul,et al.  The estimation of statistical parameters for local alignment score distributions. , 2001, Nucleic acids research.

[52]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[53]  A. Lesk,et al.  The relation between the divergence of sequence and structure in proteins. , 1986, The EMBO journal.

[54]  K. Karplus,et al.  Hidden Markov models that use predicted local structure for fold recognition: Alphabets of backbone geometry , 2003, Proteins.

[55]  SödingJohannes Protein homology detection by HMM--HMM comparison , 2005 .

[56]  Anthony C. Davison,et al.  Statistics of Extremes , 2015, International Encyclopedia of Statistical Science.

[57]  M. Sippl,et al.  Structure-derived substitution matrices for alignment of distantly related sequences. , 2000, Protein engineering.

[58]  David S. Wishart,et al.  Improving the accuracy of protein secondary structure prediction using structural alignment , 2006, BMC Bioinformatics.

[59]  Jonathan Casper,et al.  Combining local‐structure, fold‐recognition, and new fold methods for protein structure prediction , 2003, Proteins.

[60]  Arne Elofsson,et al.  A study on protein sequence alignment quality , 2002, Proteins.

[61]  Ori Sasson,et al.  ProtoNet: hierarchical classification of the protein space , 2003, Nucleic Acids Res..

[62]  Alessandro Vullo,et al.  Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information , 2007, BMC Bioinformatics.

[63]  F E Cohen,et al.  Pairwise sequence alignment below the twilight zone. , 2001, Journal of molecular biology.

[64]  John Moult,et al.  A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. , 2005, Current opinion in structural biology.

[65]  Marc A. Martí-Renom,et al.  EVA: evaluation of protein structure prediction servers , 2003, Nucleic Acids Res..

[66]  M. O. Dayhoff,et al.  22 A Model of Evolutionary Change in Proteins , 1978 .

[67]  T. Salakoski,et al.  Selection of a representative set of structures from brookhaven protein data bank , 1992, Proteins.