High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features

Motivation In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation.
Results Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions.
Availability and implementation DeepCov is freely available at https://github.com/psipred/DeepCov.
Supplementary information Supplementary data are available at Bioinformatics online.
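
As an illustration of the "pairwise covariance data calculated from raw alignments" mentioned above, the following is a minimal sketch of how such an input tensor could be computed. It is not taken from the DeepCov code: the function names, the 21-letter alphabet ordering, the mapping of unknown characters to the gap symbol, and the absence of sequence weighting or pseudocounts are all assumptions made for this example. For each column pair (i, j) and residue pair (a, b) it evaluates f_ij(a, b) - f_i(a) f_j(b), giving 21 x 21 = 441 values per column pair.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus a gap symbol (ordering is arbitrary).
ALPHABET = "ARNDCQEGHILKMFPSTWYV-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_alignment(seqs):
    """Encode equal-length aligned sequences as an (N, L, 21) one-hot array.

    Characters outside ALPHABET are treated as gaps (an assumption of this sketch).
    """
    n, length = len(seqs), len(seqs[0])
    x = np.zeros((n, length, len(ALPHABET)))
    for s, seq in enumerate(seqs):
        for i, aa in enumerate(seq):
            x[s, i, INDEX.get(aa, INDEX["-"])] = 1.0
    return x


def pair_covariance(seqs):
    """Return an (L, L, 441) array of unweighted pairwise covariances."""
    x = one_hot_alignment(seqs)
    n, length, q = x.shape
    f_i = x.mean(axis=0)                          # single-column frequencies, (L, q)
    f_ij = np.einsum("sia,sjb->ijab", x, x) / n   # pair frequencies, (L, L, q, q)
    cov = f_ij - f_i[:, None, :, None] * f_i[None, :, None, :]
    return cov.reshape(length, length, q * q)


if __name__ == "__main__":
    toy_alignment = ["AC", "AC", "GD", "GD"]      # hypothetical toy alignment
    cov = pair_covariance(toy_alignment)
    print(cov.shape)
```

An array of this shape can be treated as an L x L "image" with 441 channels, which is the kind of input a fully convolutional network can process for proteins of any length.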
