Predicting gene expression level in E. coli from mRNA sequence information

The accurate characterization of the translational mechanism is crucial for enhancing our understanding of the relationship between genotype and phenotype. In particular, predicting the impact of genetic variants on gene expression will make it possible to optimize specific pathways and functions for engineering new biological systems. In this context, the development of accurate methods for predicting translation efficiency and/or protein expression from the nucleotide sequence is a key challenge in computational biology. In this work we present PGExpress, a new regression method for predicting the log2-fold-change of the translation efficiency of an mRNA sequence in E. coli. The PGExpress algorithm takes as input 12 features corresponding to the predicted RNA secondary structure and anti-Shine-Dalgarno hybridization free energies. The method was trained on a set of 1,772 sequence variants (WT-High) of 137 essential E. coli genes. For each gene, we considered 13 sequence variants of the first 33 nucleotides, encoding the same amino acids, followed by the superfolder GFP. Each gene variant is represented by sequence blocks that include the Ribosome Binding Site (RBS), the first 33 nucleotides of the coding region (C33), the remaining part of the coding region (CC), and their combinations. Our gradient-boosting-based tool (PGExpress) was trained using a 10-fold gene-based cross-validation procedure on the WT-High dataset. In this test PGExpress achieved a correlation coefficient of 0.60, with a Root Mean Square Error (RMSE) of 1.3. When the regression task is cast as a classification problem, PGExpress reached an overall accuracy of 0.74, a Matthews correlation coefficient of 0.48, and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.81. In the regression task, PGExpress performs better than RBSCalculator in predicting the log2-fold-change of the translational efficiency and its variation on the WT-High dataset.
Finally, we validated our method by performing in-house experiments on five newly generated mRNA sequence variants. The predictions of the expression level of the new variants are in agreement with our experimental results in E. coli.
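
The gene-based cross-validation and the evaluation measures mentioned above can be illustrated with a minimal, standard-library Python sketch. It is not the PGExpress implementation: the function names, the random fold assignment, and the binarization threshold of 0.0 used to cast the regression as a classification are all assumptions made for illustration. The key idea is that all variants of a gene are kept in the same fold, so that no gene contributes to both training and testing.

```python
import math
import random


def gene_based_folds(genes, n_folds=10, seed=0):
    """Yield (train_idx, test_idx); all variants of a gene share one fold.

    `genes` holds one gene label per sequence variant (hypothetical input).
    """
    unique = sorted(set(genes))
    random.Random(seed).shuffle(unique)
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    for k in range(n_folds):
        test = [i for i, g in enumerate(genes) if fold_of[g] == k]
        train = [i for i, g in enumerate(genes) if fold_of[g] != k]
        yield train, test


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))


def mcc(y_true, y_pred, threshold=0.0):
    """Matthews correlation after binarizing both lists at `threshold`.

    The threshold is an assumption; the abstract does not state the cut-off.
    """
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        tb, pb = t > threshold, p > threshold
        if tb and pb:
            tp += 1
        elif not tb and not pb:
            tn += 1
        elif pb:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels: 20 hypothetical genes with 13 variants each.
    genes = [f"gene{i // 13}" for i in range(20 * 13)]
    for train, test in gene_based_folds(genes):
        assert not {genes[i] for i in train} & {genes[i] for i in test}
```

In practice, a regressor would be fitted on the `train` indices of each fold and its predictions on the `test` indices pooled before computing `pearson`, `rmse`, and `mcc`.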

[27]  Thomas E. Gorochowski,et al.  Trade-offs between tRNA abundance and mRNA secondary structure support smoothing of translation elongation rate , 2015, Nucleic acids research.