Genome-wide discovery of cis-elements in promoter sequences using gene expression.

Complete or nearly complete genome sequences, large numbers of 5' expressed sequence tags, and substantial public expression data now allow more accurate identification of cis-elements that regulate gene expression. We have implemented a global approach that uses available expression data, genomic sequences, and transcript information to predict cis-elements associated with specific expression patterns. The key components of our approach are: (1) precise identification of transcription start sites, (2) use of the specific locations of cis-elements relative to the transcription start site, and (3) assessment of statistical significance for all sequence motifs. By applying our method to promoters of Arabidopsis thaliana and Mus musculus, we identified motifs that affect gene expression under specific environmental conditions or in certain tissues. We also found that the presence of the TATA box is associated with increased variability of gene expression. The strong correlation between our results and experimentally determined motifs shows that the method is capable of predicting new, functionally important cis-elements in promoter sequences.
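
The three components listed above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a hypergeometric test as the significance measure, and the function names, the promoter data layout (sequence plus the index of the transcription start site), and the window convention are all hypothetical choices made for the example. It counts how many promoters in a co-expressed group carry a motif within a fixed window relative to the transcription start site and compares that count with the background.

```python
from math import comb


def has_motif(promoter, motif, tss_index, window):
    """Return True if `motif` starts inside `window`.

    `window` is a (start, end) pair of offsets relative to the
    transcription start site (TSS), both inclusive; negative offsets
    are upstream. This convention is an assumption of the sketch.
    """
    lo = max(0, tss_index + window[0])
    hi = min(len(promoter), tss_index + window[1] + len(motif))
    hi = max(lo, hi)  # guard against windows that fall outside the sequence
    return motif in promoter[lo:hi]


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n promoters are drawn from N, K of which carry the motif."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


def motif_enrichment(promoters, group, motif, window):
    """Test whether `motif` is over-represented in `group`.

    `promoters` maps gene id -> (sequence, tss_index) for the background;
    `group` is the subset of gene ids sharing an expression pattern.
    Returns (hits in group, hits in background, p-value).
    """
    carriers = {g for g, (seq, tss) in promoters.items()
                if has_motif(seq, motif, tss, window)}
    k = len(carriers & set(group))
    p = hypergeom_sf(k, len(promoters), len(carriers), len(group))
    return k, len(carriers), p


# Toy data (invented for illustration): 60 bp sequences with the TSS at index 50.
with_tata = "C" * 15 + "TATAAA" + "C" * 29 + "G" * 10   # motif at offset -35
without = "C" * 50 + "G" * 10
promoters = {
    "g1": (with_tata, 50), "g2": (with_tata, 50), "g3": (with_tata, 50),
    "g4": (without, 50), "g5": (without, 50), "g6": (without, 50),
}
hits, background_hits, p_value = motif_enrichment(
    promoters, ["g1", "g2", "g3"], "TATAAA", (-40, -20))
print(hits, background_hits, p_value)
```

Restricting the search to a window relative to the transcription start site is what makes the accuracy of the start-site annotation matter: a shifted start site moves the motif out of the window and the promoter is no longer counted. In a genome-wide scan over many motifs and windows, the resulting p-values would also need a multiple-testing correction, which this sketch omits.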
