Zero-Shot Transfer of Protein Sequence Likelihood Models to Thermostability Prediction

Protein sequence likelihood models (PSLMs) are an emerging class of self-supervised deep learning algorithms that learn distributions over amino acid identities in structural and evolutionary contexts. Recently, PSLMs have demonstrated impressive performance in predicting the relative fitness of variant sequences without any task-specific training. In this work, we comprehensively analyze the capacity of six PSLMs to predict experimental measurements of thermostability for variants of hundreds of heterogeneous proteins. We assess the performance of PSLMs relative to state-of-the-art supervised models, highlight their relative strengths and weaknesses, and examine the complementarity between these models. We focus our analyses on stability engineering applications, assessing which methods and combinations of methods most consistently identify and prioritize mutations for experimental validation. Our results indicate that structure-based PSLMs perform competitively with the best existing supervised methods and can augment the predictions of supervised methods by integrating insights from their disparate training objectives.
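
To make the idea of zero-shot scoring concrete, the sketch below shows one common way a per-position amino acid distribution can be turned into a mutation score: the log-odds of the mutant residue against the wild-type residue. This is an illustrative assumption, not a scoring scheme stated in the abstract; the function names (`log_odds_score`, `rank_mutations`) and the toy distribution are hypothetical, and in practice the probabilities would come from a PSLM conditioned on sequence or structure context.

```python
import math

# The 20 canonical amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_score(position_probs, wild_type, mutant):
    """Hypothetical zero-shot score: log p(mutant) - log p(wild type).

    position_probs maps each amino acid to the probability a model assigns
    to it at one position. A positive score means the model finds the
    mutant residue more likely than the wild type in this context.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


def rank_mutations(position_probs, wild_type):
    """Rank all 19 substitutions at one position from most to least favored."""
    scores = {
        aa: log_odds_score(position_probs, wild_type, aa)
        for aa in AMINO_ACIDS
        if aa != wild_type
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy distribution (made up for illustration; it sums to 1).
toy_probs = {aa: 0.02 for aa in AMINO_ACIDS}
toy_probs["L"] = 0.40
toy_probs["A"] = 0.24

# Wild type is alanine; the toy model prefers leucine at this position.
ranked = rank_mutations(toy_probs, wild_type="A")
best_mutant, best_score = ranked[0]
```

Under this assumed scheme, no stability labels are needed: substitutions are prioritized purely by how the model's likelihood shifts relative to the wild-type residue.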
