Highly accurate protein structure prediction with AlphaFold

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1–4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10–14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

[1]  T. Blundell,et al.  Comparative protein modelling by satisfaction of spatial restraints. , 1993, Journal of molecular biology.

[2]  Jie Hou,et al.  Analysis of several key factors influencing deep learning-based inter-residue contact prediction , 2019, Bioinform..

[3]  Jaime Fernández del Río,et al.  Array programming with NumPy , 2020, Nature.

[4]  K. Wüthrich.  The way to NMR structures of proteins , 2001, Nature Structural Biology.

[5]  R. Stein,et al.  An embedded lipid in the multidrug transporter LmrP suggests a mechanism for polyspecificity , 2020, Nature Structural & Molecular Biology.

[6]  Zhuowen Tu,et al.  Auto-Context and Its Application to High-Level Vision Tasks and 3D Brain Image Segmentation , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Burkhard Rost,et al.  Modeling aspects of the language of life through transfer-learning protein sequences , 2019, BMC Bioinformatics.

[8]  A. Lesk,et al.  Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. , 1987, Journal of molecular biology.

[9]  T. Cech,et al.  The structure of human CST reveals a decameric assembly bound to telomeric DNA , 2020, Science.

[10]  S. Scheres,et al.  How cryo-EM is revolutionizing structural biology. , 2015, Trends in biochemical sciences.

[11]  Myle Ott,et al.  Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences , 2019, Proceedings of the National Academy of Sciences.

[12]  S. Deorowicz,et al.  FAMSA: Fast and accurate multiple sequence alignment of huge protein families , 2016, Scientific Reports.

[13]  A. Plückthun,et al.  An Interface-Driven Design Strategy Yields a Novel, Corrugated Protein Architecture. , 2018, ACS synthetic biology.

[14]  K. Kavukcuoglu,et al.  Highly accurate protein structure prediction for the human proteome , 2021, Nature.

[15]  Lukasz Kaiser,et al.  Attention Is All You Need , 2017, NIPS.

[16]  Quoc V. Le,et al.  Self-Training With Noisy Student Improves ImageNet Classification , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Björn Wallner,et al.  rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments , 2019, PloS one.

[18]  Adam Zemla,et al.  LGA: a method for finding 3D similarities in protein structures , 2003, Nucleic Acids Res..

[19]  Pushmeet Kohli,et al.  Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13) , 2019, Proteins.

[20]  Jianyi Yang,et al.  Improved protein structure prediction using predicted interresidue orientations , 2020, Proceedings of the National Academy of Sciences.

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  S. Knight,et al.  MrpH, a new class of metal-binding adhesin, requires zinc to mediate biofilm formation , 2020, PLoS pathogens.

[23]  Xiaogen Zhou,et al.  Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks , 2020, bioRxiv.

[24]  wwPDB consortium,et al.  Protein Data Bank: the single global archive for 3D macromolecular structure data , 2019, Nucleic Acids Res..

[25]  Johannes Söding,et al.  Clustering huge protein sequence sets in linear time , 2017, Nature Communications.

[26]  C. Anfinsen.  Principles that govern the folding of protein chains. , 1973, Science.

[27]  Demis Hassabis,et al.  Improved protein structure prediction using potentials from deep learning , 2020, Nature.

[28]  A. Lupas,et al.  High‐accuracy protein structure prediction in CASP14 , 2021, Proteins.

[29]  Jin Li,et al.  Universal Transforming Geometric Network , 2019, ArXiv.

[30]  Marco Biasini,et al.  lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests , 2013, Bioinform..

[31]  A. Biegert,et al.  HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment , 2011, Nature Methods.

[32]  Mohammed AlQuraishi.  End-to-end differentiable learning of protein structure , 2018, bioRxiv.

[33]  Brian Kuhlman,et al.  Advances in protein structure prediction and design , 2019, Nature Reviews Molecular Cell Biology.

[34]  Yunchao Wei,et al.  CCNet: Criss-Cross Attention for Semantic Segmentation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Thomas A. Hopf,et al.  Protein 3D Structure Computed from Evolutionary Sequence Variation , 2011, PloS one.

[36]  Debora S. Marks,et al.  Learning Protein Structure with a Differentiable Simulator , 2018, ICLR.

[37]  Robert D. Finn,et al.  MGnify: the microbiome analysis resource in 2020 , 2019, Nucleic Acids Res..

[38]  P Fariselli,et al.  Prediction of contact maps with neural networks and correlated mutations. , 2001, Protein engineering.

[39]  C. Sander,et al.  Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? , 1994, Protein engineering.

[40]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jinbo Xu,et al.  Improved protein structure prediction by deep learning irrespective of co-evolution information , 2020, Nature Machine Intelligence.

[42]  Eric W. Bell,et al.  Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks , 2020, bioRxiv.

[43]  M. Loessner,et al.  The M23 peptidase domain of the Staphylococcal phage 2638A endolysin , 2020 .

[44]  T. Yeates,et al.  Advances in methods for atomic resolution macromolecular structure determination , 2020, F1000Research.

[45]  Matteo Dal Peraro,et al.  A further leap of improvement in tertiary structure prediction in CASP13 prompts new routes for future assessments , 2019, Proteins.

[46]  K Fidelis,et al.  A large‐scale experiment to assess protein structure prediction methods , 1995, Proteins.

[47]  K. Dill,et al.  Protein storytelling through physics , 2020, Science.

[48]  Ekaba Bisong,et al.  Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners , 2019 .

[49]  Peter B. McGarvey,et al.  UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches , 2014, Bioinform..

[50]  Massimiliano Pontil,et al.  PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments , 2012, Bioinform..

[51]  Yang Zhang,et al.  I-TASSER: a unified platform for automated protein structure and function prediction , 2010, Nature Protocols.

[52]  Mohammed AlQuraishi,et al.  End-to-end differentiable learning of protein structure , 2018, bioRxiv.

[53]  Johannes Söding,et al.  Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold , 2018, Nature Methods.

[54]  George M. Church,et al.  Unified rational protein engineering with sequence-based deep representation learning , 2019, Nature Methods.

[55]  Cole H. Christie,et al.  Protein Data Bank: the single global archive for 3D macromolecular structure data , 2018, Nucleic acids research.

[56]  Adam Gudys,et al.  FAMSA: Fast and accurate multiple sequence alignment of huge protein families , 2016, Scientific Reports.

[57]  A. Yuille,et al.  Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation , 2020, ECCV.

[58]  E. Koonin,et al.  Structure and function of virion RNA polymerase of crAss-like phage , 2020, bioRxiv.

[59]  Gwendolyn M. Jang,et al.  CryoEM and AI reveal a structure of SARS-CoV-2 Nsp2, a multifunctional protein involved in key host processes. , 2021, bioRxiv.

[60]  Vijay S. Pande,et al.  OpenMM 7: Rapid development of high performance algorithms for molecular dynamics , 2016, bioRxiv.

[61]  Johannes Söding,et al.  HH-suite3 for fast remote homology detection and deep protein annotation , 2019, BMC Bioinformatics.

[62]  Thomas A. Hopf,et al.  Protein structure prediction from sequence variation , 2012, Nature Biotechnology.

[63]  Tom Sercu,et al.  Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences , 2021, Proceedings of the National Academy of Sciences.

[64]  C. Rapisarda,et al.  Structural basis for loading and inhibition of a bacterial T6SS phospholipase effector by the VgrG spike , 2020, The EMBO journal.

[65]  Johannes Söding,et al.  MMseqs2: sensitive protein sequence searching for the analysis of massive data sets , 2017, bioRxiv.

[66]  Regina Barzilay,et al.  Generative Models for Graph-Based Protein Design , 2019, DGS@ICLR.

[67]  Yang Zhang,et al.  Deep learning techniques have significantly impacted protein structure prediction and protein design. , 2021, Current opinion in structural biology.

[68]  Peter B. McGarvey,et al.  UniProt: the universal protein knowledgebase in 2021 , 2020, Nucleic Acids Res..

[69]  J. Söding,et al.  Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold , 2018, bioRxiv.

[70]  T. Hwa,et al.  Identification of direct residue contacts in protein–protein interaction by message passing , 2009, Proceedings of the National Academy of Sciences.

[71]  Johannes Söding,et al.  Clustering huge protein sequence sets in linear time , 2018 .

[72]  Sean R. Eddy,et al.  Hidden Markov model speed heuristic and iterative HMM search procedure , 2010, BMC Bioinformatics.

[73]  T. Sejnowski,et al.  Predicting the secondary structure of globular proteins using neural network models. , 1988, Journal of molecular biology.

[74]  M. Sippl.  Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. , 1990, Journal of molecular biology.

[75]  Yang Zhang,et al.  Deep‐learning contact‐map guided protein structure prediction in CASP13 , 2019, Proteins.

[76]  M. Jaskólski,et al.  A brief history of macromolecular crystallography, illustrated by a family tree and its Nobel fruits , 2014, The FEBS journal.

[77]  Sean R. Eddy,et al.  Accelerated Profile HMM Searches , 2011, PLoS Comput. Biol..

[78]  Zhen Li,et al.  Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model , 2016, bioRxiv.

[79]  John F. Canny,et al.  MSA Transformer , 2021, bioRxiv.

[80]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[81]  Milot Mirdita,et al.  HH-suite3 for fast remote homology detection and deep protein annotation , 2019, BMC Bioinformatics.

[82]  Maria Jesus Martin,et al.  Uniclust databases of clustered and deeply annotated protein sequences and alignments , 2016, Nucleic Acids Res..

[83]  J. Hurley,et al.  Structure of SARS-CoV-2 ORF8, a rapidly evolving immune evasion protein , 2020, Proceedings of the National Academy of Sciences.

[84]  K. Dill,et al.  The protein folding problem. , 1993, Annual review of biophysics.

[85]  Yang Zhang,et al.  Scoring function for automated assessment of protein structure template quality , 2004, Proteins.

[86]  R. Maya.  Critical assessment of techniques for protein structure prediction , 2014 .

[87]  M. Sippl.  Calculation of conformational ensembles from potentials of mean force , 1990 .

[88]  E. Koonin,et al.  Structure and function of virion RNA polymerase of a crAss-like phage , 2020, Nature.

[89]  V. Hornak,et al.  Comparison of multiple Amber force fields and development of improved protein backbone parameters , 2006, Proteins.

[90]  Torsten Schwede,et al.  Critical assessment of methods of protein structure prediction (CASP)—Round XIII , 2019, Proteins.

[91]  Jinbo Xu,et al.  Improved protein structure prediction by deep learning irrespective of co-evolution information , 2021, Nat. Mach. Intell..