Prediction of fine-tuned promoter activity from DNA sequence

The quantitative prediction of the transcriptional activity of genes from promoter sequence is fundamental to engineering biological systems for industrial purposes and to understanding natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity, given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved best-performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. In a test set that was hidden from participants, our method predicted the activity of 20 natural variants of ribosomal protein promoters more accurately (Spearman correlation r = 0.73) than that of 33 laboratory-mutated variants of the promoters (r = 0.57). Notably, our model differed substantially from the rest in two main ways: i) it did not explicitly use transcription factor binding information, implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was based exclusively on features extracted from the 100 bp region upstream of the translational start site, demonstrating that this region encodes much of the overall promoter activity.
The findings from this study have important implications for the engineering of predictable gene expression systems and for the evolution of gene expression in naturally occurring biological systems.

Author Summary

The first step of gene expression is the transcription of information encoded in DNA into RNA. Predicting gene expression from DNA sequence can provide insights into the natural variation of gene expression underlying various phenotypes and can direct the engineering of genes with a desired activity, for example in industrial processes. While several studies show that gene expression is influenced by DNA sequence, its quantitative prediction from DNA sequence alone remains a challenging problem. Unfortunately, studies aimed at developing quantitative models for gene expression prediction are not directly comparable, because most have used distinct data sets for training and evaluation and many of the methods have not been independently verified. Open innovation challenges, in which a problem is posed to a wide community, provide a framework for independently verifying the performance of various computational methods using the same benchmark data sets and statistical procedures. Here, we describe the best-performing computational model in the DREAM6 Gene Expression Prediction challenge, which outperformed those of 20 other teams. We show that a highly predictive gene expression model can be obtained by an unbiased, data-driven approach that makes few assumptions about the role of known mechanisms of gene regulation.
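
As a rough illustration of the kind of sequence features named in the abstract (G frequency, poly-T and poly-TA tract lengths, and tri- and tetranucleotide frequencies within the 100 bp upstream region), the following minimal Python sketch computes them for a single promoter. It is not the model described here: the function names are hypothetical, the input is assumed to be a 5'-to-3' promoter sequence ending immediately before the translational start site, and the DNA deformability feature and the iterative feature search are omitted.

```python
from collections import Counter
from itertools import groupby


def upstream_window(promoter: str, width: int = 100) -> str:
    """Return the last `width` bases (assumes the sequence ends at the start site)."""
    return promoter.upper()[-width:]


def kmer_frequencies(seq: str, k: int) -> dict:
    """Relative frequency of every overlapping k-mer observed in `seq`."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def longest_base_run(seq: str, base: str) -> int:
    """Length of the longest uninterrupted run of one base, e.g. a poly-T tract."""
    return max((len(list(g)) for b, g in groupby(seq) if b == base), default=0)


def longest_repeat_tract(seq: str, unit: str) -> int:
    """Length in bases of the longest tandem repeat of `unit`, e.g. (TA)n."""
    best = 0
    for start in range(len(seq)):
        pos = start
        # Extend the tract for as long as another copy of `unit` follows.
        while seq.startswith(unit, pos):
            pos += len(unit)
        best = max(best, pos - start)
    return best


def sequence_features(promoter: str) -> dict:
    """Collect illustrative features from the 100 bp upstream window."""
    window = upstream_window(promoter)
    features = {
        "freq_G": window.count("G") / len(window) if window else 0.0,
        "max_polyT": float(longest_base_run(window, "T")),
        "max_polyTA": float(longest_repeat_tract(window, "TA")),
    }
    # Trinucleotide and tetranucleotide frequencies (all observed k-mers,
    # not the specific 6 and 12 selected by the published model).
    for k in (3, 4):
        for kmer, freq in kmer_frequencies(window, k).items():
            features[f"freq_{kmer}"] = freq
    return features
```

In a sketch like this, each promoter yields one feature dictionary, and a regression model would then be fitted to the measured activities; which features to retain is exactly what the iterative search described above decides.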
