Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design

Designing novel protein sequences for a desired 3D topological fold is a fundamental yet nontrivial task in protein engineering. The task is challenging because the sequence-fold relationship is complex and because it is difficult to capture the diversity of sequences (and therefore of structures and functions) within a fold. To overcome these challenges, we propose Fold2Seq, a novel transformer-based generative framework for designing protein sequences conditioned on a specific target fold. To model the complex sequence-structure relationship, Fold2Seq jointly learns a sequence embedding using a transformer and a fold embedding from the density of secondary structural elements in 3D voxels. On test sets with single, high-resolution, and complete structure inputs for individual folds, our experiments demonstrate improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design when compared to existing state-of-the-art methods, which include data-driven deep generative models and the physics-based RosettaDesign. The unique advantages of the fold-based Fold2Seq over a structure-based deep model and RosettaDesign become more evident on three additional real-world challenges that originate from low-quality, incomplete, or ambiguous input structures. Source code and data are available at https://github.com/IBM/fold2seq.
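To make the phrase "density of secondary structural elements in 3D voxels" concrete, the following is a minimal sketch of one way such a fold representation could be computed. It is not taken from the Fold2Seq code base: the function name `voxelize_sse`, the three-class label set, the Gaussian smoothing, and the grid size, box length, and `sigma` values are all assumptions chosen for illustration.

```python
import math

# Assumed label set for illustration: helix, strand, coil.
SSE_CLASSES = ("H", "E", "C")


def voxelize_sse(coords, labels, grid=4, box=8.0, sigma=2.0):
    """Map residues to per-voxel densities, one channel per SSE class.

    coords: list of (x, y, z) residue positions inside [0, box]^3
    labels: list of SSE labels, one per residue, drawn from SSE_CLASSES
    Returns a dict {(i, j, k): [density per SSE class]}.
    All parameter values here are hypothetical defaults.
    """
    step = box / grid
    # Voxel centers along one axis; the grid is cubic.
    centers = [(i + 0.5) * step for i in range(grid)]
    density = {}
    for i, cx in enumerate(centers):
        for j, cy in enumerate(centers):
            for k, cz in enumerate(centers):
                feats = [0.0] * len(SSE_CLASSES)
                for (x, y, z), label in zip(coords, labels):
                    d2 = (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                    # Each residue contributes a Gaussian-weighted amount
                    # to the channel of its own SSE class only.
                    channel = SSE_CLASSES.index(label)
                    feats[channel] += math.exp(-d2 / (2 * sigma ** 2))
                density[(i, j, k)] = feats
    return density


# Toy input: one helix residue and one strand residue.
coords = [(1.0, 1.0, 1.0), (7.0, 7.0, 7.0)]
labels = ["H", "E"]
fold_features = voxelize_sse(coords, labels)
```

In this toy setup the helix residue sits exactly on the center of voxel `(0, 0, 0)`, so that voxel's helix channel peaks there, and the coil channel stays at zero everywhere. A representation of this kind depends only on where secondary structural elements lie in space, not on exact backbone coordinates, which is consistent with the abstract's claim that a fold-based input tolerates low-quality or incomplete structures.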
