Evaluating Protein Transfer Learning with TAPE

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
