Large language models generate functional protein sequences across diverse families
