Deep learning-based codon optimization with large-scale synonymous variant datasets enables generalized tunable protein expression

Increasing recombinant protein expression is of broad interest in industrial biotechnology, synthetic biology, and basic research. Codon optimization is an important step in heterologous gene expression that can have dramatic effects on protein expression level. Several codon optimization strategies have been developed to enhance expression, but these are largely based on bulk usage of highly frequent codons in the host genome, and can produce unreliable results. Here, we develop deep contextual language models that learn the codon usage rules from natural protein coding sequences across members of the Enterobacterales order. We then fine-tune these models with over 150,000 functional expression measurements of synonymous coding sequences from three proteins to predict expression in E. coli. We find that our models recapitulate natural context-specific patterns of codon usage and can accurately predict expression levels across synonymous sequences. Finally, we show that expression predictions can generalize across proteins unseen during training, allowing for in silico design of gene sequences for optimal expression. Our approach provides a novel and reliable method for tuning gene expression with many potential applications in biotechnology and biomanufacturing.
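For contrast with the learned models described above, the frequency-based baseline that the abstract calls unreliable can be sketched in a few lines. This is a minimal illustration and not the authors' method: the function names and the two reference sequences are hypothetical, and the sketch assumes the standard genetic code and in-frame sequences made only of A, C, G, and T.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds):
    """Translate a coding sequence to one-letter amino acids ('*' marks a stop)."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def codon_counts(reference_cds):
    """Count codon occurrences across a collection of reference coding sequences."""
    counts = Counter()
    for cds in reference_cds:
        counts.update(split_codons(cds))
    return counts


def most_frequent_codon_recode(cds, counts):
    """Replace every codon with the most frequent synonymous codon in `counts`.

    Ties keep the codon that comes first in TCAG order. Sequence context is
    ignored entirely, which is the limitation the abstract points to.
    """
    best = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return "".join(best[CODON_TABLE[codon]] for codon in split_codons(cds))


# Hypothetical toy reference set standing in for host genome coding sequences.
reference = ["ATGCTGCTGAAAGAATAA", "ATGCTGAAAGAAGAGTAA"]
counts = codon_counts(reference)

original = "ATGTTAAAGGAGTAA"  # encodes M L K E stop
recoded = most_frequent_codon_recode(original, counts)
print(recoded)
print(translate(original) == translate(recoded))
```

Because every amino acid always maps to the same codon, this baseline produces exactly one sequence per protein and cannot tune expression up or down, whereas a model that scores whole synonymous sequences can rank many candidates.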
