Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterised proteins

Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologues of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models with accuracy comparable to or greater than that of our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale 3-D modelling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterised regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper.

Significance Statement

We present a deep learning-based predictor of protein tertiary structure that uses only a multiple sequence alignment (MSA) as input. To date, most emphasis has been on the accuracy of such deep learning methods, but here we show that accurate structure prediction is also possible in very short timeframes (a few hundred milliseconds). In our method, the backbone coordinates of the target protein are output directly from the neural network, which makes the predictor extremely fast. As a demonstration, we generated over 1.3 million models of uncharacterised proteins in the BFD, a large sequence database including many metagenomic sequences. Our results showcase the utility of ultrafast and accurate tertiary structure prediction in rapidly exploring the “dark space” of proteins.
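
The two inputs named in the abstract, a one-hot encoding of the MSA and an approximate precision matrix, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how its approximate precision matrix is computed, so the version below assumes one common approximation, namely inverting a covariance matrix that has been shrunk towards the identity. The 21-letter alphabet, the function names, the `shrinkage` value and the toy alignment are all hypothetical choices made for illustration, and the code is written in plain Python so that it runs without third-party packages.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumed: 20 amino acids plus a gap symbol
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
Q = len(ALPHABET)


def one_hot_msa(msa):
    """Encode each aligned sequence as a flat 0/1 vector of length L * Q."""
    length = len(msa[0])
    rows = []
    for seq in msa:
        if len(seq) != length:
            raise ValueError("all sequences in an MSA must have equal length")
        row = [0.0] * (length * Q)
        for pos, aa in enumerate(seq):
            # Unknown symbols (e.g. X, B, Z) are treated as gaps in this sketch.
            row[pos * Q + INDEX.get(aa, Q - 1)] = 1.0
        rows.append(row)
    return rows


def shrunk_covariance(rows, shrinkage=0.5):
    """Empirical covariance of the encoded columns, shrunk towards the identity.

    Returns (1 - shrinkage) * C + shrinkage * I, which is positive definite
    (and therefore invertible) for any shrinkage > 0.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            c = sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
            cov[i][j] = cov[j][i] = (1.0 - shrinkage) * c
        cov[i][i] += shrinkage
    return cov


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    d = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d):
            factor = aug[r][col]
            if r != col and factor != 0.0:
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


# Toy alignment (hypothetical): four aligned sequences of length five.
msa = ["MKVLA", "MKILA", "MRVLG", "M-VLA"]
encoded = one_hot_msa(msa)          # 4 rows, each of length 5 * 21
cov = shrunk_covariance(encoded)    # (5 * 21) x (5 * 21) matrix
precision = invert(cov)             # approximate precision matrix
```

In a sketch like this, the `Q` x `Q` block of `precision` for a pair of alignment columns summarises their coupling once the other columns are accounted for, which is the kind of pairwise signal that can supplement a learned representation of the sequences. A practical version would use an array library and sequence weighting; both are omitted here for brevity.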
