Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry

Deep learning models are increasingly used to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D convolutional neural networks (CNNs). These two model types have very different architectures and are typically trained on different representations of proteins: LLMs use the transformer architecture and are trained purely on protein sequences, whereas 3D CNNs are trained on voxelized representations of local protein structure. Although comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent they make comparable specific predictions or generalize protein biochemistry in similar ways. Here, we systematically compare two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. Overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. The two structure-based models are better at predicting buried aliphatic and hydrophobic residues, whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths, resulting in significantly improved overall prediction accuracy.
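To make the idea of a combined model concrete, the following is a minimal sketch of one way to take individual model predictions as input. It is not the authors' implementation: the abstract does not specify the combiner's architecture, so a multinomial logistic regression is swapped in as a stand-in, and the four "models" are synthetic, randomly generated probability distributions. All names (`stack_predictions`, `fit_combiner`, `cross_entropy`) are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_AA = len(AMINO_ACIDS)


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def stack_predictions(model_probs):
    """Concatenate per-site probability vectors from several models and append a bias column."""
    features = np.concatenate(model_probs, axis=1)
    bias = np.ones((features.shape[0], 1))
    return np.concatenate([features, bias], axis=1)


def cross_entropy(X, y, W):
    """Mean negative log-likelihood of the true residues under the combiner."""
    p = softmax(X @ W)
    return float(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))


def fit_combiner(X, y, lr=0.2, steps=200):
    """Fit a multinomial logistic regression by plain gradient descent (assumed stand-in combiner)."""
    W = np.zeros((X.shape[1], N_AA))
    onehot = np.eye(N_AA)[y]
    for _ in range(steps):
        grad = X.T @ (softmax(X @ W) - onehot) / len(y)
        W -= lr * grad
    return W


# Synthetic stand-ins for the outputs of two sequence-based and two
# structure-based models: each is an (n_sites, 20) array of probabilities.
# These are random data for illustration only, not real model predictions.
rng = np.random.default_rng(0)
n_sites = 500
true_aa = rng.integers(0, N_AA, size=n_sites)

model_probs = []
for _ in range(4):
    logits = rng.normal(size=(n_sites, N_AA))
    logits[np.arange(n_sites), true_aa] += 1.0  # weak signal toward the true residue
    model_probs.append(softmax(logits))

X = stack_predictions(model_probs)        # shape (n_sites, 4 * 20 + 1)
W = fit_combiner(X, true_aa)
combined_probs = softmax(X @ W)           # combined per-site distribution
combined_pred = combined_probs.argmax(axis=1)
```

The combiner sees only the per-site probability vectors from each model, so whatever it gains must come from learning when to trust which model. A real evaluation would fit the combiner on one set of proteins and report accuracy on held-out proteins.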
