Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds

Proteins power a vast array of functional processes in living cells. The capability to create new proteins with designed structures and functions would thus enable the engineering of cellular behavior and the development of protein-based therapeutics and materials. Structure-based protein design aims to find structures that are designable (can be realized by a protein sequence), novel (have dissimilar geometry from natural proteins), and diverse (span a wide range of geometries). While advances in protein structure prediction have made it possible to predict structures of novel protein sequences, the combinatorially large space of sequences and structures limits the practicality of search-based methods. Generative models provide a compelling alternative by implicitly learning the low-dimensional structure of complex data distributions. Here, we leverage recent advances in denoising diffusion probabilistic models and equivariant neural networks to develop Genie, a generative model of protein structures that performs discrete-time diffusion using a cloud of oriented reference frames in 3D space. Through in silico evaluations, we demonstrate that Genie generates protein backbones that are more designable, novel, and diverse than those generated by existing models. These results indicate that Genie captures key aspects of the distribution of protein structure space and facilitates protein design with high success rates. Code for generating new proteins and training new versions of Genie is available at https://github.com/aqlaboratory/genie.
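To make the phrase "discrete-time diffusion using a cloud of oriented reference frames" more concrete, the following is a minimal sketch of the general idea, not Genie's implementation (which is in the repository linked above). It assumes a standard denoising-diffusion forward process applied to one 3D point per residue, and it assumes that an oriented frame can be derived at each interior point from its neighbors in a discrete Frenet style. The linear variance schedule, its default values, the toy coordinates, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear variance schedule (an assumption, not from the source)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_coords(coords, t, abar, rng):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps (t is 0-based)."""
    scale, sigma = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [tuple(scale * c + sigma * rng.gauss(0.0, 1.0) for c in p) for p in coords]


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(v):
    # Raises ZeroDivisionError for a zero vector (e.g. collinear neighbors).
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def frenet_frames(coords):
    """Return (tangent, normal, binormal) at each interior point of a 3D chain."""
    frames = []
    for i in range(1, len(coords) - 1):
        t_prev = _unit(_sub(coords[i], coords[i - 1]))
        t_next = _unit(_sub(coords[i + 1], coords[i]))
        binormal = _unit(_cross(t_prev, t_next))
        normal = _cross(binormal, t_next)  # unit length: binormal is orthogonal to t_next
        frames.append((t_next, normal, binormal))
    return frames


if __name__ == "__main__":
    rng = random.Random(0)
    abar = alpha_bars(linear_betas(1000))
    # Toy, non-collinear chain standing in for per-residue coordinates.
    chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
    noisy = noise_coords(chain, t=10, abar=abar, rng=rng)
    for frame in frenet_frames(noisy):
        print(frame)
```

In this sketch the noise is applied to coordinates and the frames are recomputed afterwards, so the frames rotate and translate with the point cloud. That is one simple way to obtain orientation information for an equivariant network, chosen here for brevity.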
