ProteinBERT: a universal deep-learning model of protein sequence and function

Self-supervised deep language modeling has shown unprecedented success across natural-language tasks and has recently been repurposed for biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model designed specifically for proteins. Our pretraining scheme combines masked language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to very long sequences. The ProteinBERT architecture maintains both local (per-residue) and global (whole-protein) representations, allowing end-to-end processing of both types of inputs and outputs. ProteinBERT obtains state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.
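The abstract describes an architecture with parallel local (per-residue) and global (whole-protein) tracks, pretrained jointly on masked-token recovery and GO annotation prediction. The Keras sketch below illustrates that layout only; it is not the published implementation, and all layer choices, sizes, and names (VOCAB_SIZE, N_GO_TERMS, D_LOCAL, D_GLOBAL, the pooling-based exchange between tracks) are assumptions made for illustration. The linked repository provides the actual model and pretrained weights.

```python
# Minimal sketch (not the actual ProteinBERT implementation) of a model with a
# local, per-residue track and a global, whole-protein track, pretrained jointly
# on masked-token recovery and GO-annotation prediction. All sizes are assumed.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 26      # amino-acid tokens + special tokens (assumed)
N_GO_TERMS = 8943    # number of GO annotations used as global labels (assumed)
SEQ_LEN = 512        # example input length; the real model handles variable lengths
D_LOCAL, D_GLOBAL = 128, 512

# Local track: token embeddings processed by 1D convolutions.
seq_in = layers.Input(shape=(SEQ_LEN,), dtype=tf.int32, name='seq_tokens')
local = layers.Embedding(VOCAB_SIZE, D_LOCAL)(seq_in)
local = layers.Conv1D(D_LOCAL, 9, padding='same', activation='gelu')(local)

# Global track: a binary vector of input GO annotations processed by dense layers.
go_in = layers.Input(shape=(N_GO_TERMS,), name='go_annotations')
glob = layers.Dense(D_GLOBAL, activation='gelu')(go_in)

# Information exchange between tracks: broadcast global -> local and pool local -> global
# (a stand-in for the model's richer exchange blocks, e.g. global attention).
local = layers.Add()([local, layers.RepeatVector(SEQ_LEN)(layers.Dense(D_LOCAL)(glob))])
glob = layers.Add()([glob, layers.Dense(D_GLOBAL)(layers.GlobalAveragePooling1D()(local))])

# Two pretraining heads: recover masked residues (local) and predict GO terms (global).
masked_lm_out = layers.Dense(VOCAB_SIZE, activation='softmax', name='masked_lm')(local)
go_out = layers.Dense(N_GO_TERMS, activation='sigmoid', name='go_pred')(glob)

model = tf.keras.Model(inputs=[seq_in, go_in], outputs=[masked_lm_out, go_out])
model.compile(optimizer='adam',
              loss={'masked_lm': 'sparse_categorical_crossentropy',
                    'go_pred': 'binary_crossentropy'})
model.summary()
```

During pretraining, the sequence tokens and the input annotations would be corrupted (residues masked, annotations partially removed), and the two heads are trained to recover them; for downstream tasks the pretrained tracks are fine-tuned with a task-specific output head.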
