TemStaPro: protein thermostability prediction using sequence representations from protein language models

Reliable prediction of protein thermostability from sequence is valuable for both academic and industrial research. This prediction problem can be tackled with machine learning, taking advantage of recent advances in deep learning methods for sequence analysis. We propose applying the principle of transfer learning to predict protein thermostability using embeddings generated by protein language models (pLMs) from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over 2 million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict the thermostability of CRISPR-Cas Class II effector proteins (C2EPs). The predictions indicated sharp differences in thermostability among groups of C2EPs and were largely consistent with both previously published and our newly obtained experimental data. TemStaPro software is freely available from https://github.com/ievapudz/TemStaPro.
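
The transfer-learning idea described above (a frozen pre-trained model supplies embeddings, and only a small predictor is trained on top of them) can be illustrated with a minimal sketch. This is not the TemStaPro implementation: the function `fake_plm_embed` is a hypothetical stand-in that returns random vectors instead of calling a real pLM, the embedding size of 8, the mean-pooling step, the binary label, and the logistic-regression classifier are all assumptions chosen to keep the example small and dependency-free.

```python
import math
import random

EMBED_DIM = 8  # assumption for the toy example; real pLM embeddings are far larger


def mean_pool(per_residue):
    """Average per-residue vectors into one fixed-length per-protein vector."""
    n = len(per_residue)
    return [sum(column) / n for column in zip(*per_residue)]


def fake_plm_embed(length, shift, rng):
    """Hypothetical stand-in for a frozen pLM: one random vector per residue.

    `shift` moves the mean of the vectors so the two toy classes differ.
    """
    return [[rng.gauss(shift, 1.0) for _ in range(EMBED_DIM)] for _ in range(length)]


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that a pooled embedding belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logreg(xs, ys, epochs=200, lr=0.1):
    """Train logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = predict_proba(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def build_toy_dataset(n_per_class=40, seed=0):
    """Create pooled embeddings for two synthetic classes (label 0 and 1)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label, shift in ((0, -0.5), (1, 0.5)):
        for _ in range(n_per_class):
            length = rng.randint(50, 150)  # toy sequence length
            xs.append(mean_pool(fake_plm_embed(length, shift, rng)))
            ys.append(label)
    return xs, ys


def accuracy(weights, bias, xs, ys):
    hits = sum((predict_proba(weights, bias, x) >= 0.5) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)


if __name__ == "__main__":
    xs, ys = build_toy_dataset()
    weights, bias = train_logreg(xs, ys)
    print(f"training accuracy: {accuracy(weights, bias, xs, ys):.2f}")
```

In a real setting the stand-in embedder would be replaced by per-residue embeddings from a pre-trained pLM, and the pooled vectors would be labelled using the annotated growth temperatures of the source organisms. The point of the sketch is only that the expensive model stays frozen while a lightweight classifier is fitted on fixed-length vectors.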
