Capturing non‐local interactions by long short‐term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility

Motivation: The accuracy of predicting protein local and global structural properties, such as secondary structure and solvent accessible surface area, has been stagnant for many years because of the challenge of accounting for non‐local interactions between amino acid residues that are close in three‐dimensional structural space but far from each other in sequence position. All existing machine‐learning techniques relied on a sliding window of 10–20 amino acid residues to capture some ‘short to intermediate’ non‐local interactions. Here, we employed Long Short‐Term Memory (LSTM) Bidirectional Recurrent Neural Networks (BRNNs), which are capable of capturing long‐range interactions without using a window.

Results: We showed that the application of LSTM‐BRNN to the prediction of protein structural properties yields the largest improvement, over the previous window‐based deep‐learning method SPIDER2, for residues with the most long‐range contacts (|i − j| > 19). Capturing long‐range interactions allows the accuracy of three‐state secondary structure prediction to reach 84% and the correlation coefficient between predicted and actual solvent accessible surface areas to reach 0.80, along with reductions of 5%, 10%, 5% and 10% in the mean absolute error for the backbone φ, ψ, θ and τ angles, respectively, relative to SPIDER2. More significantly, 27% of 182,724 40‐residue models constructed directly from predicted Cα atom‐based θ and τ angles have structures similar to their corresponding native structures (RMSD of 6 Å or less), which is 3% better than models built from φ and ψ angles. We expect the method to be useful for assisting protein structure and function prediction.

Availability and implementation: The method is available as a SPIDER3 server and standalone package at http://sparks-lab.org.

Contact: yaoqi.zhou@griffith.edu.au or yuedong.yang@griffith.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.
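
Two quantities in the Results can be made concrete with a short sketch: the mean absolute error of a periodic backbone angle, and the number of long‐range contacts (|i − j| > 19) per residue. The code below is an illustration only, not the SPIDER3 implementation. The function names are hypothetical, and the 8 Å Cα–Cα distance cutoff used to define a contact is an assumption that the abstract does not state.

```python
import math


def angular_abs_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    # Angles are periodic, so 179 and -179 differ by 2 degrees, not 358.
    return min(diff, 360.0 - diff)


def angle_mae(pred, true):
    """Mean absolute error over paired angle lists, respecting periodicity."""
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(angular_abs_error(p, t) for p, t in zip(pred, true)) / len(pred)


def long_range_contacts(ca_coords, min_separation=20, cutoff=8.0):
    """Count, for each residue, contacts with sequence separation |i - j| > 19.

    ca_coords: list of (x, y, z) C-alpha coordinates in angstroms.
    cutoff: ASSUMED contact distance in angstroms (not given in the abstract).
    """
    n = len(ca_coords)
    counts = [0] * n
    for i in range(n):
        # Starting j at i + 20 enforces |i - j| > 19.
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


if __name__ == "__main__":
    # Toy data: an extended 24-residue chain plus one residue folded back
    # next to the start of the chain.
    coords = [(3.8 * i, 0.0, 0.0) for i in range(24)] + [(0.0, 5.0, 0.0)]
    print(long_range_contacts(coords))
    print(angle_mae([179.0, 10.0], [-179.0, 20.0]))
```

Wrapping the difference into the 0 to 180 degree range matters for angles such as ψ and τ, which frequently sit near ±180°; a naive subtraction would overstate those errors.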
