Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints

Amino acid covariation methods cannot be applied to small protein families, which has limited their use for structural annotation of whole genomes. Recently, deep learning has shown promise in allowing accurate residue-residue contact prediction even for shallow sequence alignments. Here we introduce DMPfold, which uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, and then uses these predictions as constraints to build models iteratively. DMPfold produces more accurate models than two popular methods on a test set of CASP12 domains, and works just as well for transmembrane proteins. Applied to all Pfam domains without known structures, it produced confident models for 25% of these so-called dark families in under a week on a small 200-core cluster. DMPfold provides models for 16% of human proteome UniProt entries without structures, generates accurate models with fewer than 100 sequences in some cases, and is freely available.

Prediction of protein structures on the scale of genomes remains a challenge. Here the authors introduce a protein structure prediction method that uses deep learning to predict inter-atomic distances, torsion angles and hydrogen bonds, and apply it to predict the structures of 1475 Pfam domains.
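To make the idea of a predicted distance bound concrete, the sketch below turns a per-residue-pair probability distribution over distance bins into a lower and upper bound. It is an illustration only and is not taken from DMPfold: the function name `distance_bounds`, the bin edges, and the `mass` threshold are all assumptions introduced here, and the abstract does not say how the method actually derives its bounds.

```python
from typing import List, Tuple


def distance_bounds(
    bin_edges: List[float], probs: List[float], mass: float = 0.8
) -> Tuple[float, float]:
    """Return (lower, upper) bounds in the units of bin_edges.

    Hypothetical helper, not DMPfold's procedure: start at the most
    probable bin and greedily widen the interval towards the more
    probable neighbour until it covers at least `mass` of the
    probability.
    """
    if len(bin_edges) != len(probs) + 1:
        raise ValueError("need exactly one more edge than bins")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    p = [x / total for x in probs]  # normalise defensively

    # Index of the modal bin; the interval [lo, hi] grows from here.
    lo = hi = max(range(len(p)), key=p.__getitem__)
    covered = p[lo]

    while covered < mass and (lo > 0 or hi < len(p) - 1):
        left = p[lo - 1] if lo > 0 else -1.0
        right = p[hi + 1] if hi < len(p) - 1 else -1.0
        if left >= right:
            lo -= 1
            covered += left
        else:
            hi += 1
            covered += right

    return bin_edges[lo], bin_edges[hi + 1]


if __name__ == "__main__":
    # Four made-up bins spanning 4 to 12 (e.g. angstroms).
    edges = [4.0, 6.0, 8.0, 10.0, 12.0]
    probs = [0.1, 0.6, 0.25, 0.05]
    print(distance_bounds(edges, probs, mass=0.8))
```

Bounds of this kind, rather than a single binary contact call, give a model-building step a range to satisfy for each residue pair; a sharper distribution yields a narrower range.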
