Protein language model embeddings for fast, accurate, alignment-free protein structure prediction

All state-of-the-art (SOTA) protein structure prediction methods rely on evolutionary information captured in multiple sequence alignments (MSAs), primarily on evolutionary couplings (co-evolution). Such information is not available for all proteins and is computationally expensive to generate. Prediction models based on Artificial Intelligence (AI) that use only single sequences as input are easier and cheaper to run, but perform so poorly that their speed becomes irrelevant. Here, we describe the first competitive AI solution whose only input is embeddings extracted from single sequences by pre-trained protein Language Models (pLMs), namely by the transformer pLM ProtT5. These embeddings are fed into a relatively shallow (few free parameters) convolutional neural network (CNN) trained on inter-residue distances, i.e. protein structure in 2D. The major advance originated from processing the attention heads learned by ProtT5. Although these models at no point required any MSA, they matched the performance of methods relying on co-evolution. Although not reaching the very top, our lean approach came close at substantially lower cost, thereby speeding up development and each future prediction. By generating protein-specific rather than family-averaged predictions, these new solutions could distinguish structural features that differ between members of the same protein family with similar structure, members that all other top methods predict alike.
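To make the 2D representation concrete, the following minimal sketch shows how per-head attention maps and an inter-residue distance map can be shaped as input and target for a CNN. It is an illustration under stated assumptions, not the pipeline used in this work: the random arrays stand in for real ProtT5 attention heads and real coordinates, and the symmetrization step, the bin edges, and all function names are hypothetical choices made for the example.

```python
import numpy as np

def distance_map(coords):
    """Pairwise Euclidean distances between residues; coords has shape (L, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def symmetrize_attention(attn):
    """Average each (L, L) attention map with its transpose; attn has shape (H, L, L).
    Assumption: symmetrization is a common preprocessing choice, not a step
    confirmed by the text above."""
    return 0.5 * (attn + attn.transpose(0, 2, 1))

def bin_distances(dist, edges):
    """Assign each distance to a bin index so it can serve as a classification target."""
    return np.digitize(dist, edges)

# Toy stand-ins (assumptions): 8 residues, 4 attention heads, random values.
rng = np.random.default_rng(0)
L, H = 8, 4
coords = rng.normal(size=(L, 3)) * 5.0   # placeholder 3D coordinates
attn = rng.random(size=(H, L, L))        # placeholder attention maps

features = symmetrize_attention(attn)    # (H, L, L): one 2D channel per head
target = distance_map(coords)            # (L, L): protein structure in 2D
edges = np.linspace(2.0, 20.0, 10)       # hypothetical distance bin edges
labels = bin_distances(target, edges)    # (L, L): integer bin per residue pair
```

In this layout each attention head acts as one image channel over residue pairs, which is the form a 2D convolutional network consumes, and the binned distance map has the same L x L shape as each channel.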
