The language of proteins: NLP, machine learning & protein sequences

Natural language processing (NLP) is a field of computer science concerned with the automated analysis of text and language. In recent years, following a series of breakthroughs in machine learning and deep learning, NLP methods have made remarkable progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP techniques, covering classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern approaches such as word embeddings, contextualized embeddings, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and open challenges at the intersection of NLP and protein research.
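As a concrete illustration of the classic encodings named above, the following minimal Python sketch tokenizes a protein sequence into overlapping k-mers (the protein analog of n-grams) and builds a bag-of-words count vector from them. The example sequence is arbitrary and the function names are illustrative, not from any specific library discussed in the review.

```python
from collections import Counter

def kmerize(seq, k=3):
    """Split a sequence of amino-acid letters into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bag_of_kmers(seq, k=3):
    """Bag-of-words representation: count each k-mer, discarding order."""
    return Counter(kmerize(seq, k))

# Arbitrary short amino-acid string for illustration
seq = "MKTAYIAKQR"
print(kmerize(seq))       # overlapping 3-mers: ['MKT', 'KTA', 'TAY', ...]
print(bag_of_kmers(seq))  # k-mer counts usable as features for a classifier
```

In practice, such fixed-length count vectors (over all observed k-mers) can be fed directly to standard classifiers, whereas the neural approaches discussed later learn dense representations of the same tokens instead.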
