ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112-times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8-states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs.

[1]  C. Anfinsen,et al.  Studies on the reduction and re-formation of protein disulfide bonds. , 1961, The Journal of biological chemistry.

[2]  A. Dillmann Enzyme Nomenclature , 1965, Nature.

[3]  A. Bairoch,et al.  The SWISS-PROT protein sequence data bank. , 1991, Nucleic acids research.

[4]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[5]  E. Webb Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes. , 1992 .

[6]  A. Bairoch,et al.  The SWISS-PROT protein sequence data bank. , 1991, Nucleic acids research.

[7]  Pieter de Haan,et al.  Characteristics of Sentence Length in Running Text , 1993 .

[8]  B. Rost,et al.  Improved prediction of protein secondary structure by use of sequence profiles and neural networks. , 1993, Proceedings of the National Academy of Sciences of the United States of America.

[9]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[10]  B. Rost,et al.  Combining evolutionary information and neural networks to predict protein secondary structure , 1994, Proteins.

[11]  B Rost,et al.  Pitfalls of protein sequence analysis. , 1996, Current opinion in biotechnology.

[12]  B. Rost PHD: predicting one-dimensional protein structure by profile-based neural networks. , 1996, Methods in enzymology.

[13]  Malmqvist,et al.  Epitope Mapping by Label-Free Biomolecular Interaction Analysis , 1996, Methods.

[14]  B Rost,et al.  Bridging the protein sequence-structure gap by structure predictions. , 1996, Annual review of biophysics and biomolecular structure.

[15]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[16]  C. Pabo,et al.  High-resolution structures of variant Zif268-DNA complexes: implications for understanding zinc finger-DNA recognition. , 1998, Structure.

[17]  G J Barton,et al.  Evaluation and improvement of multiple sequence methods for protein secondary structure prediction , 1999, Proteins.

[18]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[19]  Amos Bairoch,et al.  The ENZYME database in 2000 , 2000, Nucleic Acids Res..

[20]  Guoli Wang,et al.  PISCES: a protein sequence culling server , 2003, Bioinform..

[21]  R. Durbin,et al.  Enhanced protein domain discovery by using language modeling techniques from speech recognition , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[22]  Burkhard Rost,et al.  Improving fold recognition without folds. , 2004, Journal of molecular biology.

[23]  P. Radivojac,et al.  Protein flexibility and intrinsic disorder , 2004, Protein science : a publication of the Protein Society.

[24]  David Kirk,et al.  NVIDIA cuda software and gpu parallel computing architecture , 2007, ISMM '07.

[25]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[26]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[27]  David A. Lee,et al.  PSI-2: structural genomics to cover protein domain family space. , 2009, Structure.

[28]  Thomas A. Hopf,et al.  Protein 3D Structure Computed from Evolutionary Sequence Variation , 2011, PloS one.

[29]  Milo M. Lin,et al.  Hydrophobic forces and the length limit of foldable protein domains , 2012, Proceedings of the National Academy of Sciences.

[30]  Thomas A. Hopf,et al.  Three-Dimensional Structures of Membrane Proteins from Genomic Sequencing , 2012, Cell.

[31]  Burkhard Rost,et al.  Supporting online material for : LocTree 2 predicts localization for all domains of life , 2012 .

[32]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[33]  Thorsten Brants,et al.  One billion word benchmark for measuring progress in statistical language modeling , 2013, INTERSPEECH.

[34]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[35]  Christian Cole,et al.  JPred4: a protein secondary structure prediction server , 2015, Nucleic Acids Res..

[36]  B. Rost,et al.  Unexpected features of the dark proteome , 2015, Proceedings of the National Academy of Sciences.

[37]  Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics , 2015, PloS one.

[38]  Peter B. McGarvey,et al.  UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches , 2014, Bioinform..

[39]  Martin Kühn,et al.  Extreme Scale-out SuperMUC Phase 2 - lessons learned , 2015, PARCO.

[40]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[41]  David T. Jones,et al.  Accurate contact predictions using covariation techniques and machine learning , 2015, Proteins.

[42]  B. Rost,et al.  TMSEG: Novel prediction of transmembrane helices , 2016, Proteins.

[43]  Matthijs Douze,et al.  FastText.zip: Compressing text classification models , 2016, ArXiv.

[44]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[45]  Jian Peng,et al.  Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields , 2015, Scientific Reports.

[46]  Jeff Nichols,et al.  Announcing Supercomputer Summit , 2016 .

[47]  Wei Li,et al.  RaptorX-Property: a web server for protein structure property prediction , 2016, Nucleic Acids Res..

[48]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[49]  David A. Patterson,et al.  In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[50]  Ole Winther,et al.  DeepLoc: prediction of protein subcellular localization using deep learning , 2017, Bioinform..

[51]  Johannes Söding,et al.  MMseqs2: sensitive protein sequence searching for the analysis of massive data sets , 2017, bioRxiv.

[52]  Kuldip K. Paliwal,et al.  Capturing non‐local interactions by long short‐term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility , 2017, Bioinform..

[53]  Jianzhi Zhang,et al.  Evolutionary adaptations to new environments generally reverse plastic phenotypic changes , 2018, Nature Communications.

[54]  Sameer Kumar,et al.  PowerAI DDL , 2017, ArXiv.

[55]  Sebastian Ruder,et al.  Universal Language Model Fine-tuning for Text Classification , 2018, ACL.

[56]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[57]  Thomas A. Hopf,et al.  Evolutionary couplings and sequence variation effect predict protein binding sites , 2018, Proteins.

[58]  Mohammed AlQuraishi,et al.  End-to-end differentiable learning of protein structure , 2018, bioRxiv.

[59]  R. Plemper,et al.  Promotion of virus assembly and organization by the measles virus matrix protein , 2018, Nature Communications.

[60]  Seán I O'Donoghue,et al.  Dark Proteins Important for Cellular Function , 2018, Proteomics.

[61]  Johannes Söding,et al.  Clustering huge protein sequence sets in linear time , 2018 .

[62]  Andriy Kryshtafovych,et al.  Assessment of hard target modeling in CASP12 reveals an emerging role of alignment‐based contact prediction methods , 2018, Proteins.

[63]  Alexander Sergeev,et al.  Horovod: fast and easy distributed deep learning in TensorFlow , 2018, ArXiv.

[64]  Kiyokuni Kawachiya,et al.  TFLMS: Large Model Support in TensorFlow by Graph Rewriting , 2018, ArXiv.

[65]  Kuldip K. Paliwal,et al.  Sixty-five years of the long march in protein secondary structure prediction: the final stretch? , 2016, Briefings Bioinform..

[66]  Burkhard Rost,et al.  Modeling aspects of the language of life through transfer-learning protein sequences , 2019, BMC Bioinformatics.

[67]  Jesse Vig,et al.  A Multiscale Visualization of Attention in the Transformer Model , 2019, ACL.

[68]  Ole Winther,et al.  NetSurfP‐2.0: Improved prediction of protein structural features by integrated deep learning , 2019, Proteins.

[69]  Myle Ott,et al.  Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences , 2019, Proceedings of the National Academy of Sciences.

[70]  Mohammed AlQuraishi,et al.  ProteinNet: a standardized data set for machine learning of protein structure , 2019, BMC Bioinformatics.

[71]  Ilya Sutskever,et al.  Generating Long Sequences with Sparse Transformers , 2019, ArXiv.

[72]  The UniProt Consortium,et al.  UniProt: a worldwide hub of protein knowledge , 2018, Nucleic Acids Res..

[73]  Steven E. Brenner,et al.  SCOPe: classification of large macromolecular structures in the structural classification of proteins—extended database , 2018, Nucleic Acids Res..

[74]  Bonnie Berger,et al.  Learning protein sequence embeddings using information from structure , 2019, ICLR.

[75]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[76]  Manaal Faruqui,et al.  Attention Interpretability Across NLP Tasks , 2019, ArXiv.

[77]  Mohammed AlQuraishi End-to-end differentiable learning of protein structure , 2018, bioRxiv.

[78]  Alice C McHardy,et al.  Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX) , 2018, Scientific Reports.

[79]  M. Shoeybi,et al.  Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism , 2019, ArXiv.

[80]  John N. Weinstein,et al.  ElemCor: accurate data analysis and enrichment calculation for high-resolution LC-MS stable isotope labeling experiments , 2019, BMC Bioinformatics.

[81]  Johannes Söding,et al.  Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold , 2018, Nature Methods.

[82]  George M. Church,et al.  Unified rational protein engineering with sequence-based deep representation learning , 2019, Nature Methods.

[83]  C. L. Rees,et al.  Quantitative firing pattern phenotyping of hippocampal neuron types , 2018, Scientific Reports.

[84]  Regina Barzilay,et al.  Generative Models for Graph-Based Protein Design , 2019, DGS@ICLR.

[85]  Yiming Yang,et al.  Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.

[86]  Richard Socher,et al.  A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation , 2018, ICLR.

[87]  D. Frishman,et al.  Pred‐MutHTP: Prediction of disease‐causing and neutral mutations in human transmembrane proteins , 2019, Human mutation.

[88]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[89]  Yiming Yang,et al.  XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.

[90]  John Canny,et al.  Evaluating Protein Transfer Learning with TAPE , 2019, bioRxiv.

[91]  Tom Sercu,et al.  Transformer protein language models are unsupervised structure learners , 2020, bioRxiv.

[92]  M. Zaheer,et al.  Big Bird: Transformers for Longer Sequences , 2020, NeurIPS.

[93]  Lukasz Kaiser,et al.  Reformer: The Efficient Transformer , 2020, ICLR.

[94]  Henrik Nielsen,et al.  Language modelling for biological sequences – curated datasets and baselines , 2020, bioRxiv.

[95]  Jianyi Yang,et al.  Improved protein structure prediction using predicted interresidue orientations , 2020, Proceedings of the National Academy of Sciences.

[96]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[97]  Burkhard Rost,et al.  Embeddings from deep learning transfer GO annotations beyond homology , 2020, Scientific Reports.

[98]  Liyuan Liu,et al.  On the Variance of the Adaptive Learning Rate and Beyond , 2019, ICLR.

[99]  Ananthan Nambiar,et al.  Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks , 2020, bioRxiv.

[100]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[101]  Samyam Rajbhandari,et al.  ZeRO: Memory Optimization Towards Training A Trillion Parameter Models , 2019, ArXiv.

[102]  Quoc V. Le,et al.  ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators , 2020, ICLR.

[103]  Kevin Gimpel,et al.  ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.

[104]  James Demmel,et al.  Large Batch Optimization for Deep Learning: Training BERT in 76 minutes , 2019, ICLR.

[105]  Nikhil Naik,et al.  ProGen: Language Modeling for Protein Generation , 2020, bioRxiv.

[106]  B. Rost,et al.  Light attention predicts protein location from the language of life , 2021, bioRxiv.

[107]  Dong Huang,et al.  Optimal Gradient Checkpoint Search for Arbitrary Computation Graphs , 2018, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[108]  B. Rost,et al.  PredictProtein - Predicting Protein Structure and Function for 29 Years , 2021, bioRxiv.

[109]  Seunghyun Park,et al.  Pre-Training of Deep Bidirectional Protein Sequence Representations With Structural Information , 2019, IEEE Access.

[110]  Tom Sercu,et al.  Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences , 2021, Proceedings of the National Academy of Sciences.

[111]  John F. Canny,et al.  MSA Transformer , 2021, bioRxiv.