Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
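
To make the phrase "identified by linear projections" concrete, the following is a minimal sketch of a linear probe: a single linear map fitted on frozen per-residue representations to predict a structural label. Everything here is an assumption for illustration only. The features are synthetic stand-ins rather than outputs of the trained model, the three classes loosely mimic helix, strand, and coil, and the helper names `fit_linear_probe` and `predict` are hypothetical; this is not the paper's evaluation protocol.

```python
import numpy as np


def fit_linear_probe(features, labels, n_classes, ridge=1e-3):
    """Fit a ridge-regularised linear map from features to one-hot labels."""
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Y = np.eye(n_classes)[labels]
    # Closed-form ridge regression: (X^T X + ridge * I) W = X^T Y.
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)


def predict(W, features):
    """Return the class with the highest linear score for each row."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return np.argmax(X @ W, axis=1)


# Synthetic stand-in for per-residue representations (assumption: 32-dim
# features in which each class is shifted along its own random direction).
rng = np.random.default_rng(0)
n_residues, dim, n_classes = 600, 32, 3
labels = rng.integers(0, n_classes, size=n_residues)
class_directions = rng.normal(size=(n_classes, dim))
features = 2.0 * class_directions[labels] + rng.normal(size=(n_residues, dim))

# Fit on the first 400 residues, evaluate on the held-out remainder.
train, test = slice(0, 400), slice(400, None)
W = fit_linear_probe(features[train], labels[train], n_classes)
accuracy = float(np.mean(predict(W, features[test]) == labels[test]))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Because the probe has no hidden layers, high held-out accuracy in a setup like this indicates that the label is linearly decodable from the representation itself rather than computed by the probe.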

[146]  Tom Sercu, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021, Proceedings of the National Academy of Sciences.