Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention

The established approach to unsupervised protein contact prediction estimates coevolving positions by fitting an undirected graphical model, a Potts model, to a multiple sequence alignment (MSA). Meanwhile, increasingly large Transformers are being pretrained on unlabeled, unaligned protein sequence databases and show competitive performance on protein contact prediction. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce an energy-based attention layer, factored attention, which in a certain limit recovers a Potts model, and use it to contrast Potts models and Transformers. We show that the Transformer leverages hierarchical signal in protein family databases that is not captured by single-layer models. This raises the exciting possibility of developing powerful structured models of protein family databases.

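To make the contrast concrete: both a Potts model and a factored attention layer assign each aligned sequence x of length L an energy of the form E(x) = sum_i h_i(x_i) + sum_{i<j} J_ij(x_i, x_j); the two differ only in how the coupling tensor J is parameterized. The sketch below constructs J for a single factored attention layer. It is a minimal illustration under assumed notation, not the paper's implementation: the head count H, head dimension d, and the names W_Q, W_K, and W_V are our choices, following the standard attention parameterization.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def factored_attention_couplings(W_Q, W_K, W_V):
        """Build the (L, L, A, A) coupling tensor of one factored attention layer.

        W_Q, W_K: (H, L, d)  per-head query/key embeddings of the L positions
        W_V:      (H, A, A)  per-head amino-acid interaction matrices
        The attention map depends only on positions, and the value matrix
        depends only on amino-acid identities; this factorization is what
        distinguishes the layer from a fully parameterized Potts coupling
        tensor, which stores an independent (A, A) block for every (i, j).
        """
        H, L, d = W_Q.shape
        # (H, L, L) position-position attention logits, softmax over positions j.
        logits = np.einsum('hid,hjd->hij', W_Q, W_K) / np.sqrt(d)
        attn = softmax(logits, axis=-1)
        # J[i, j, a, b] = sum_h attn[h, i, j] * W_V[h, a, b].
        # With sufficiently many heads, these head-summed products can recover
        # the couplings of a full Potts model -- the limit the abstract refers to.
        return np.einsum('hij,hab->ijab', attn, W_V)

    # Toy shapes: L positions, A amino-acid states, H heads, head dimension d.
    L, A, H, d = 10, 20, 4, 16
    rng = np.random.default_rng(0)
    J = factored_attention_couplings(
        rng.normal(size=(H, L, d)),
        rng.normal(size=(H, L, d)),
        rng.normal(size=(H, A, A)),
    )
    print(J.shape)  # (10, 10, 20, 20)

Fitting can then proceed as for a Potts model, for example by pseudolikelihood maximization over the MSA, with predicted contacts read off from a norm of each (A, A) block J[i, j].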