Paper information - Computational analysis of core promoters in the Drosophila genome

Background: The core promoter, a region of about 100 base-pairs flanking the transcription start site (TSS), serves as the recognition site for the basal transcription apparatus. Drosophila TSSs have generally been mapped by individual experiments; the low number of accurately mapped TSSs has limited analysis of promoter sequence motifs and the training of computational prediction tools.

Results: We identified TSS candidates for about 2,000 Drosophila genes by aligning 5′ expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome, while applying stringent criteria concerning coverage and 5′-end distribution. Examination of the sequences flanking these TSSs revealed the presence of well-known core promoter motifs such as the TATA box, the initiator and the downstream promoter element (DPE). We also define, and assess the distribution of, several new motifs prevalent in core promoters, including what appears to be a variant DPE motif. Among the prevalent motifs is the DNA-replication-related element DRE, recently shown to be part of the recognition site for the TBP-related factor TRF2. Our TSS set was then used to retrain the computational promoter predictor McPromoter, allowing us to improve the recognition performance to over 50% sensitivity and 40% specificity. We compare these computational results to promoter prediction in vertebrates.

Conclusions: There are relatively few recognizable binding sites for previously known general transcription factors in Drosophila core promoters. However, we identified several new motifs enriched in promoter regions. We were also able to significantly improve the performance of computational TSS prediction in Drosophila.

Published: 20 December 2002
Genome Biology 2002, 3(12):research0087.1–0087.12
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2002/3/12/research/0087
© 2002 Ohler et al., licensee BioMed Central Ltd (Print ISSN 1465-6906; Online ISSN 1465-6914)
Received: 7 October 2002 Revised: 19 November 2002 Accepted: 27 November 2002

Background

Transcription initiation is one of the most important control points in regulating gene expression [1,2]. Recent observations have emphasized the importance of the core promoter, a region of about 100 base-pairs (bp) flanking the transcription start site (TSS), in regulating transcription [3,4]. The core promoter serves as the recognition site for the basal transcription apparatus, which comprises the multisubunit RNA polymerase II and several auxiliary factors. Core promoters show specificity both in their interactions with enhancers and with sets of general transcription factors that control distinct subsets of genes. Although there are no known DNA sequence motifs that are shared by all core promoters, a number of motifs have been identified that are present in a substantial fraction. The most familiar of these motifs is the TATA box, which has been reported to be part of 30-40% of core promoters [5].

Prediction and analysis of core promoters have been active areas of research in computational biology [6], with several recent publications on prediction of human promoters [7-10]. In contrast, prediction of invertebrate promoters has received much less attention and has focused almost exclusively on Drosophila. Reese [11] described the application of time-delay neural networks, and in our previous work [12] we used a combination of a generalized hidden Markov model for sequence features and Gaussian distributions for the predicted structural features of DNA.
Structural features were also examined by Levitsky and Katokhin [13], but they did not present results for promoter prediction in genomic sequences.

As with computational methods for predicting the intron-exon structure of genes [14], the computational prediction of promoters has been greatly aided by cDNA sequence information. However, promoter prediction is complicated by the fact that most cDNA clones do not extend to the TSS. Recent advances in cDNA library construction methods that utilize the 5′-cap structure of mRNAs have allowed the generation of so-called 'cap-trapped' libraries with an increased percentage of full-length cDNAs [15,16]. Such libraries have been used to map TSSs in vertebrates by aligning the 5′-end sequences of individual cDNAs to genomic DNA [17,18]. However, it is estimated that even in the best libraries only 50-80% of cDNAs extend to the TSSs [16,19], making it unreliable to base conclusions on individual cDNA alignments.

We describe here a more cautious approach for identifying TSSs that requires the 5′ ends of the alignments of multiple, independent cap-selected cDNAs to lie in close proximity. We then examine the regions flanking these putative TSSs, the putative core promoter regions, for conserved DNA sequence motifs. We also use this new set of putative TSSs to retrain and significantly improve our previously described probabilistic promoter prediction method. Finally, we report the results of promoter prediction on whole Drosophila melanogaster chromosomes, and discuss the different challenges of computational promoter recognition in invertebrate and vertebrate genomes.

Results and discussion

Selection of expressed sequence tag (EST) clusters to determine transcription start sites

Stapleton et al. [20] report the results of aligning 237,471 5′ EST sequences, including 115,169 obtained from cap-trapped libraries, to the annotated Release 2 sequence of the D. melanogaster genome.
They examined these alignments for alternative splice forms and grouped them into 16,744 clusters with consistent splice sites, overlapping 9,644 known protein-encoding genes.

We applied the following set of criteria to select those 5′ EST clusters most likely to identify TSSs. Clusters were required to either overlap a known protein-coding gene or have evidence of splicing. One of the three most 5′ ESTs in the cluster had to be derived from a cap-trapped library. In some cases, disjoint clusters overlap the annotation of a single gene; here, we only considered the most 5′ cluster. We required the distance to the next upstream cluster to be greater than 1 kb. This requirement, together with the selection of only the most 5′ cluster, leads to the selection of only one start site per gene. By doing so, we minimize the erroneous inclusion of ESTs that are not full-length, but also exclude alternative start sites and start sites of genes with overlapping transcripts.

Because the 5′ ends of ESTs derived from full-length cDNAs are expected to lie in a narrow window at the TSS, we required that the 5′ ends of at least three ESTs fall within an 11-bp window of genomic sequence, and that the number of ESTs whose 5′ ends fall within this window comprise at least 30% of the ESTs in the cluster. With a single EST we cannot be sure to have reached the true start site, even if it was generated by a method selecting for the cap site of the mRNA [17,19]; with a cluster of ESTs within a small range, we can be more confident that we have defined the actual TSS. By requiring selected clusters to have at least three ESTs we are, however, introducing a bias against genes with low expression levels. The requirement that 30% or more of the 5′ ESTs in a cluster terminate within the 11-bp window was introduced because, for large EST clusters, a simple numerical requirement is insufficiently stringent.
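
The window criterion described above (at least three 5′ ends within an 11-bp window, and at least 30% of the cluster's ESTs in that window) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the representation of a cluster as a plain list of genomic 5′-end coordinates, and the choice to test every possible 11-bp window (the text does not say how the window is positioned) are all assumptions made for illustration.

```python
from bisect import bisect_right


def best_window_count(five_prime_ends, window=11):
    """Largest number of EST 5' ends that fit in any `window`-bp stretch.

    Assumption: the window may start at any observed 5' end; the source
    text does not state how the window is placed.
    """
    ends = sorted(five_prime_ends)
    best = 0
    for i, start in enumerate(ends):
        # A window of `window` bp starting at `start` covers the
        # positions start .. start + window - 1 inclusive.
        j = bisect_right(ends, start + window - 1)
        best = max(best, j - i)
    return best


def passes_window_criterion(five_prime_ends, window=11,
                            min_ests=3, min_fraction=0.30):
    """Apply both parts of the window criterion to one EST cluster.

    `five_prime_ends` is a hypothetical input format: one genomic
    coordinate per EST in the cluster.
    """
    total = len(five_prime_ends)
    if total == 0:
        return False
    in_window = best_window_count(five_prime_ends, window)
    return in_window >= min_ests and in_window / total >= min_fraction


# Hypothetical clusters: a tight one, and a large one in which three
# ESTs share a window but make up well under 30% of the cluster.
tight_cluster = [100, 102, 105, 110, 300]
large_cluster = [0, 1, 2] + [100 * k for k in range(1, 11)]
tight_ok = passes_window_criterion(tight_cluster)
large_ok = passes_window_criterion(large_cluster)
```

The second cluster shows why the fractional requirement matters: it meets the simple count of three ESTs but fails once the 30% threshold is applied.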
We identified a total of 1,941 clusters, representing about 14% of annotated genes, which met all of the above criteria. Table 1 shows how the number of selected clusters varies when a single parameter, in either the requirement for distance to the next upstream cluster or the requirement that the 5′ ends of at least three ESTs fall in a specified window of sequence, is changed to a higher or lower value while the other selection requirements are held constant. Not surprisingly, the most sensitive criterion by far is the window size. A large number of clusters show slightly different 5′ ends, as was also observed in other large-scale full-length cDNA projects [17,18]. At the moment, it is an open question how much of this variation is a result of incomplete extension to the 5′ end during library construction and how much is an indication of a larger than expected variation in the transcription initiation process. The most 5′ EST of each selected cluster, along with its corresponding genomic location, is presented in Supplementary Table 1 in the additional data files available with this paper online (see Additional data files). We defined the start of the most 5′ EST in each of the 1,941 clusters as the predicted TSS and refer to this as position +1 in the analyses reported below. We extracted the genomic sequences from 250 bp upstream to 50 bp downstream of