Consensus promoter identification in the human genome utilizing expressed gene markers and gene modeling.

Deciphering the human genome includes locating the promoters that initiate transcription and identifying the exons of genes. Many promoter prediction programs have been proposed, but when they are applied to extended regions of the genome, most of their predictions are false positives. The extensive collection of gene transcript sequences is an important new source of information that has not previously been used in promoter prediction. Our approach is to enhance the specificity of predictions by restricting the genomic regions that are searched, using gene transcript alignments as anchors in the genome for gene modeling. We developed a consensus promoter prediction method that combines previously developed algorithms with the GENSCAN gene modeling program. Our method, CONPRO (CONsensus PROmoter), identifies promoters with very high confidence, and every predicted promoter is associated with a gene by construction. On our test data set, the method correctly detects promoters for approximately half of all human genes (37%-71%), and most predictions are true promoters (85%-90%). Applying our method to the human genome and human genes from the Unigene data set, we find promoters for 13,744 genes. Of these, 6,440 are genes with a functionally cloned mRNA, and 7,304 are novel genes for which only expressed sequence tags (ESTs) are available. Candidate promoters for many novel genes will be a useful resource in elucidating complex biological response mechanisms.
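
The general idea of an anchored consensus prediction can be illustrated with a minimal sketch. This is not the CONPRO implementation: the abstract does not specify the voting rule, the window size, or the programs' output formats, so the names `Anchor`, `consensus_promoter`, `window`, `tolerance`, and `min_votes`, and all numeric defaults, are assumptions made for illustration only. The sketch assumes a plus-strand gene, keeps only predictions that fall in a window upstream of the 5' end of the gene model, and accepts a position when enough independent programs agree on it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Anchor:
    """Hypothetical anchor: 5' end of a gene model built from a transcript alignment."""
    gene_id: str
    five_prime_end: int  # genomic coordinate, plus strand assumed


def consensus_promoter(
    anchor: Anchor,
    predictions: Dict[str, List[int]],  # program name -> predicted promoter positions
    window: int = 1000,     # assumed search window upstream of the anchor
    tolerance: int = 200,   # assumed distance within which two predictions "agree"
    min_votes: int = 2,     # assumed number of programs required for consensus
) -> Optional[int]:
    """Return a consensus promoter position near the anchor, or None."""
    lo, hi = anchor.five_prime_end - window, anchor.five_prime_end

    # Restrict the search: discard every prediction outside the anchored window.
    candidates = [
        (pos, program)
        for program, positions in predictions.items()
        for pos in positions
        if lo <= pos <= hi
    ]

    best = None
    for pos, _ in candidates:
        # Count distinct programs with a prediction close to this position.
        supporters = {prog for p, prog in candidates if abs(p - pos) <= tolerance}
        if len(supporters) >= min_votes:
            # Prefer more supporting programs, then the position nearest the anchor.
            key = (len(supporters), pos)
            if best is None or key > best:
                best = key
    return None if best is None else best[1]
```

Because a prediction is only reported relative to an anchor, any promoter returned by such a scheme is tied to a specific gene, and predictions far from any transcript alignment are never considered. That is the intuition behind restricting the search space to raise specificity.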
