Computational analysis of core promoters in the Drosophila genome

Background: The core promoter, a region of about 100 base-pairs flanking the transcription start site (TSS), serves as the recognition site for the basal transcription apparatus. Drosophila TSSs have generally been mapped by individual experiments; the low number of accurately mapped TSSs has limited analysis of promoter sequence motifs and the training of computational prediction tools.

Results: We identified TSS candidates for about 2,000 Drosophila genes by aligning 5' expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome, while applying stringent criteria concerning coverage and 5'-end distribution. Examination of the sequences flanking these TSSs revealed the presence of well-known core promoter motifs such as the TATA box, the initiator and the downstream promoter element (DPE). We also define, and assess the distribution of, several new motifs prevalent in core promoters, including what appears to be a variant DPE motif. Among the prevalent motifs is the DNA-replication-related element DRE, recently shown to be part of the recognition site for the TBP-related factor TRF2. Our TSS set was then used to retrain the computational promoter predictor McPromoter, allowing us to improve the recognition performance to over 50% sensitivity and 40% specificity. We compare these computational results to promoter prediction in vertebrates.

Conclusions: There are relatively few recognizable binding sites for previously known general transcription factors in Drosophila core promoters. However, we identified several new motifs enriched in promoter regions. We were also able to significantly improve the performance of computational TSS prediction in Drosophila.
Published: 20 December 2002
Genome Biology 2002, 3(12):research0087.1–0087.12
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2002/3/12/research/0087
© 2002 Ohler et al., licensee BioMed Central Ltd (Print ISSN 1465-6906; Online ISSN 1465-6914)
Received: 7 October 2002 Revised: 19 November 2002 Accepted: 27 November 2002

Background

Transcription initiation is one of the most important control points in regulating gene expression [1,2]. Recent observations have emphasized the importance of the core promoter, a region of about 100 base-pairs (bp) flanking the transcription start site (TSS), in regulating transcription [3,4]. The core promoter serves as the recognition site for the basal transcription apparatus, which comprises the multisubunit RNA polymerase II and several auxiliary factors. Core promoters show specificity both in their interactions with enhancers and with sets of general transcription factors that control distinct subsets of genes. Although there are no known DNA sequence motifs that are shared by all core promoters, a number of motifs have been identified that are present in a substantial fraction. The most familiar of these motifs is the TATA box, which has been reported to be part of 30-40% of core promoters [5].

Prediction and analysis of core promoters have been active areas of research in computational biology [6], with several recent publications on prediction of human promoters [7-10]. In contrast, prediction of invertebrate promoters has received much less attention and has focused almost exclusively on Drosophila. Reese [11] described the application of time-delay neural networks, and in our previous work [12] we used a combination of a generalized hidden Markov model for sequence features and Gaussian distributions for the predicted structural features of DNA.
Structural features were also examined by Levitsky and Katokhin [13], but they did not present results for promoter prediction in genomic sequences.

As with computational methods for predicting the intron-exon structure of genes [14], the computational prediction of promoters has been greatly aided by cDNA sequence information. However, promoter prediction is complicated by the fact that most cDNA clones do not extend to the TSS. Recent advances in cDNA library construction methods that utilize the 5'-cap structure of mRNAs have allowed the generation of so-called ‘cap-trapped’ libraries with an increased percentage of full-length cDNAs [15,16]. Such libraries have been used to map TSSs in vertebrates by aligning the 5'-end sequences of individual cDNAs to genomic DNA [17,18]. However, it is estimated that even in the best libraries only 50-80% of cDNAs extend to the TSSs [16,19], making it unreliable to base conclusions on individual cDNA alignments.

We describe here a more cautious approach for identifying TSSs that requires the 5' ends of the alignments of multiple, independent cap-selected cDNAs to lie in close proximity. We then examine the regions flanking these putative TSSs, the putative core promoter regions, for conserved DNA sequence motifs. We also use this new set of putative TSSs to retrain and significantly improve our previously described probabilistic promoter prediction method. Finally, we report the results of promoter prediction on whole Drosophila melanogaster chromosomes, and discuss the different challenges of computational promoter recognition in invertebrate and vertebrate genomes.

Results and discussion

Selection of expressed sequence tag (EST) clusters to determine transcription start sites

Stapleton et al. [20] report the results of aligning 237,471 5' EST sequences, including 115,169 obtained from cap-trapped libraries, on the annotated Release 2 sequence of the D. melanogaster genome.
They examined these alignments for alternative splice forms and grouped them into 16,744 clusters with consistent splice sites, overlapping 9,644 known protein-encoding genes. We applied the following set of criteria to select those 5' EST clusters most likely to identify TSSs. Clusters were required to either overlap a known protein-coding gene or have evidence of splicing. One of the three most 5' ESTs in the cluster had to be derived from a cap-trapped library. In some cases, disjoint clusters overlap the annotation of a single gene; here, we only considered the most 5' cluster. We required the distance to the next upstream cluster to be greater than 1 kb. This requirement, together with the selection of only the most 5' cluster, leads to the selection of only one start site per gene. By doing so, we minimize the erroneous inclusion of ESTs that are not full-length, but we also exclude alternative start sites and start sites of genes with overlapping transcripts. Because the 5' ends of ESTs derived from full-length cDNAs are expected to lie in a narrow window at the TSS, we required that the 5' ends of at least three ESTs fall within an 11-bp window of genomic sequence, and that the ESTs whose 5' ends fall within this window comprise at least 30% of the ESTs in the cluster. With a single EST we cannot be sure to have reached the true start site, even if it was generated by a method selecting for the cap site of the mRNA [17,19]; with a cluster of ESTs within a small range, we can be more confident that we have defined the actual TSS. By requiring selected clusters to have at least three ESTs we are, however, introducing a bias against genes with low expression levels. The requirement that 30% or more of the 5' ESTs in a cluster terminate within the 11-bp window was introduced because, for large EST clusters, a simple numerical requirement is insufficiently stringent.
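The selection criteria above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the `Est` and `Cluster` records, their field names, and both function names are hypothetical, and coordinates are assumed to be transcript-oriented so that smaller values lie further 5'. The rule of keeping only the most 5' cluster per gene is not modeled; the thresholds (11-bp window, three ESTs, 30%, 1 kb) are taken from the text.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Est:
    five_prime_end: int  # assumed transcript-oriented: smaller = further 5'
    cap_trapped: bool    # True if the EST comes from a cap-trapped library


@dataclass
class Cluster:
    ests: List[Est]
    overlaps_gene: bool          # overlaps a known protein-coding gene
    spliced: bool                # shows evidence of splicing
    upstream_gap: Optional[int]  # bp to next upstream cluster; None if absent


def max_ends_in_window(ends: List[int], window: int = 11) -> int:
    """Largest number of 5' ends that fit inside any window of `window` bp."""
    ordered = sorted(ends)
    best = 0
    for i, start in enumerate(ordered):
        # A window starting at `start` covers start .. start + window - 1.
        j = bisect_right(ordered, start + window - 1)
        best = max(best, j - i)
    return best


def is_tss_candidate(cluster: Cluster, window: int = 11, min_ests: int = 3,
                     min_fraction: float = 0.30, min_gap: int = 1000) -> bool:
    """Apply the gene/splicing, cap-trap, upstream-gap and window criteria."""
    if not cluster.ests:
        return False
    if not (cluster.overlaps_gene or cluster.spliced):
        return False
    # One of the three most 5' ESTs must come from a cap-trapped library.
    most_5p = sorted(cluster.ests, key=lambda e: e.five_prime_end)[:3]
    if not any(e.cap_trapped for e in most_5p):
        return False
    # The next upstream cluster, if any, must be more than `min_gap` bp away.
    if cluster.upstream_gap is not None and cluster.upstream_gap <= min_gap:
        return False
    n = max_ends_in_window([e.five_prime_end for e in cluster.ests], window)
    return n >= min_ests and n / len(cluster.ests) >= min_fraction
```

Sorting the 5' ends once and locating each window boundary by binary search keeps the check at O(n log n) per cluster. The fraction test is what rejects large clusters in which three ends happen to coincide by chance.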
We identified a total of 1,941 clusters, representing about 14% of annotated genes, which met all of the above criteria. Table 1 shows how the number of selected clusters varies when a single parameter (the required distance to the next upstream cluster, or the size of the window in which the 5' ends of at least three ESTs must fall) is set to a higher or lower value while the other selection requirements are held constant. Not surprisingly, the most sensitive criterion by far is the window size. A large number of clusters show slightly different 5' ends, as was also observed in other large-scale full-length cDNA projects [17,18]. At the moment, it is an open question how much of this variation results from incomplete extension to the 5' end during library construction and how much reflects a larger than expected variation in the transcription initiation process. The most 5' EST of each selected cluster, along with its corresponding genomic location, is presented in Supplementary Table 1 in the additional data files available with this paper online (see Additional data files). We defined the start of the most 5' EST in each of the 1,941 clusters as the predicted TSS and refer to this as position +1 in the analyses reported below. We extracted the genomic sequences from 250 bp upstream to 50 bp downstream of

[1]  S. Smale,et al.  Core promoters: active contributors to combinatorial gene regulation. , 2001, Genes & development.

[2]  Sridhar Hannenhalli,et al.  Promoter prediction in the human genome , 2001, ISMB.

[3]  P. Baldi,et al.  DNA structure in human RNA polymerase II promoters. , 1998, Journal of molecular biology.

[4]  J. T. Kadonaga,et al.  The RNA polymerase II core promoter: a key component in the regulation of gene expression. , 2002, Genes & development.

[5]  A Suyama,et al.  Diverse transcriptional initiation revealed by fine, large‐scale mapping of mRNA start sites , 2001, EMBO reports.

[6]  David G. Stork,et al.  Pattern Classification (2nd ed.) , 1999 .

[7]  Piero Carninci,et al.  Comparative evaluation of 5'-end-sequence quality of clones in CAP trapper and other full-length-cDNA libraries. , 2001, Gene.

[8]  A. Bird DNA methylation patterns and epigenetic memory. , 2002, Genes & development.

[9]  David J. States,et al.  Identification of protein coding regions by database similarity search , 1993, Nature Genetics.

[10]  Philipp Bucher,et al.  The Eukaryotic Promoter Database EPD , 1998, Nucleic Acids Res..

[11]  Melanie E. Goward,et al.  The DNA sequence of human chromosome 22 , 1999, Nature.

[12]  T. Tsunoda,et al.  Identification and characterization of the potential promoter regions of 1031 kinds of human genes. , 2001, Genome research.

[13]  R. Tjian,et al.  Orchestrated response: a symphony of transcription factors for gene control. , 2000, Genes & development.

[14]  Kathleen Marchal,et al.  A Gibbs sampling method to detect over-represented motifs in the upstream regions of co-expressed genes , 2001, RECOMB.

[15]  G M Rubin,et al.  Insertion site preferences of the P transposable element in Drosophila melanogaster. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[16]  G. Rubin,et al.  A computer program for aligning a cDNA sequence with a genomic DNA sequence. , 1998, Genome research.

[17]  C. Verrijzer,et al.  Co‐operative DNA binding by GAGA transcription factor requires the conserved BTB/POZ domain and reorganizes promoter topology , 1999, The EMBO journal.

[18]  Elmar Nöth,et al.  Interpolated markov chains for eukaryotic promoter recognition , 1999, Bioinform..

[19]  T. Hubbard,et al.  Computational detection and location of transcription start sites in mammalian genomic DNA. , 2002, Genome research.

[20]  Piero Carninci,et al.  The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. , 2002, Genome research.

[21]  Heinrich Niemann,et al.  Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition , 2001, ISMB.

[22]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[23]  S. Lewis,et al.  Genome annotation assessment in Drosophila melanogaster. , 2000, Genome research.

[24]  I. Arkhipova,et al.  Promoter elements in Drosophila melanogaster revealed by sequence analysis. , 1995, Genetics.

[25]  Y. Suzuki,et al.  Construction and characterization of a full length-enriched and a 5'-end-enriched cDNA library. , 1997, Gene.

[26]  L. Duret,et al.  Determinants of CpG islands: expression in early embryo and isochore structure. , 2001, Genome research.

[27]  J. T. Kadonaga,et al.  The Downstream Promoter Element DPE Appears To Be as Widely Used as the TATA Box in Drosophila Core Promoters , 2000, Molecular and Cellular Biology.

[28]  J. Fickett,et al.  Eukaryotic promoter recognition. , 1997, Genome research.

[29]  K Frech,et al.  First pass annotation of promoters on human chromosome 22. , 2001, Genome research.

[30]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[31]  D. Latchman Gene Regulation: A Eukaryotic Perspective , 1990 .

[32]  R. Tjian,et al.  TRF2 associates with DREF and directs promoter-selective gene expression in Drosophila , 2002, Nature.

[33]  Charles Elkan,et al.  The Value of Prior Knowledge in Discovering Motifs with MEME , 1995, ISMB.

[34]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 , 2000, Nucleic Acids Res..

[35]  V. G. Levitsky,et al.  Computer Analysis and Recognition of Drosophila melanogaster Gene Promoters , 2001, Molecular Biology.

[36]  Piero Carninci,et al.  Normalization and subtraction of cap-trapper-selected cDNAs to prepare full-length cDNA libraries for rapid discovery of new genes. , 2000, Genome research.

[37]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[38]  J. Blake,et al.  Creating the Gene Ontology Resource: Design and Implementation , 2001 .

[39]  B. Haas,et al.  Full-length messenger RNA sequences greatly improve genome annotation , 2002, Genome Biology.

[40]  Michael Ashburner,et al.  Annotation of the Drosophila melanogaster euchromatic genome: a systematic review , 2002, Genome Biology.