Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention

The established approach to unsupervised protein contact prediction estimates coevolving positions by fitting an undirected graphical model, a Potts model, to a multiple sequence alignment (MSA). Meanwhile, increasingly large Transformers are being pretrained on unlabeled, unaligned protein sequence databases and show competitive performance on protein contact prediction. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce an energy-based attention layer, factored attention, which in a certain limit recovers a Potts model, and use it to contrast Potts models and Transformers. We show that the Transformer leverages hierarchical signal in protein family databases that is not captured by single-layer models. This raises the exciting possibility of developing powerful structured models of protein family databases.

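To make the contrast concrete: both a Potts model and a factored attention layer assign each aligned sequence x of length L an energy of the form E(x) = sum_i h_i(x_i) + sum_{i<j} J_ij(x_i, x_j); the two differ only in how the coupling tensor J is parameterized. The sketch below constructs J for a single factored attention layer. It is a minimal illustration under assumed notation, not the paper's implementation: the head count H, head dimension d, and the names W_Q, W_K, and W_V are our choices, following the standard attention parameterization.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def factored_attention_couplings(W_Q, W_K, W_V):
        """Build the (L, L, A, A) coupling tensor of one factored attention layer.

        W_Q, W_K: (H, L, d)  per-head query/key embeddings of the L positions
        W_V:      (H, A, A)  per-head amino-acid interaction matrices
        The attention map depends only on positions, and the value matrix
        depends only on amino-acid identities; this factorization is what
        distinguishes the layer from a fully parameterized Potts coupling
        tensor, which stores an independent (A, A) block for every (i, j).
        """
        H, L, d = W_Q.shape
        # (H, L, L) position-position attention logits, softmax over positions j.
        logits = np.einsum('hid,hjd->hij', W_Q, W_K) / np.sqrt(d)
        attn = softmax(logits, axis=-1)
        # J[i, j, a, b] = sum_h attn[h, i, j] * W_V[h, a, b].
        # With sufficiently many heads, these head-summed products can recover
        # the couplings of a full Potts model -- the limit the abstract refers to.
        return np.einsum('hij,hab->ijab', attn, W_V)

    # Toy shapes: L positions, A amino-acid states, H heads, head dimension d.
    L, A, H, d = 10, 20, 4, 16
    rng = np.random.default_rng(0)
    J = factored_attention_couplings(
        rng.normal(size=(H, L, d)),
        rng.normal(size=(H, L, d)),
        rng.normal(size=(H, A, A)),
    )
    print(J.shape)  # (10, 10, 20, 20)

Fitting can then proceed as for a Potts model, for example by pseudolikelihood maximization over the MSA, with predicted contacts read off from a norm of each (A, A) block J[i, j].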