Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates.

This paper introduces a new computer system for recognition of functional transcription start sites (TSSs) in RNA polymerase II promoter regions of vertebrates. The system can scan complete vertebrate genomes for promoters with a significantly reduced number of false positive predictions. Because it recognizes the 5' end of genes, it can also be used in the context of gene finding. The implemented recognition model combines a composite-hierarchical approach with artificial intelligence, statistical, and signal processing techniques. It also exploits the separation of promoter sequences into those that are C+G-rich and those that are C+G-poor. The system was evaluated on a large and diverse human sequence set and was several times more accurate than several publicly available TSS-finding programs. Results obtained on human chromosome 22 data showed even greater specificity than the results on the evaluation set. The system has been implemented in the Dragon Promoter Finder package, which can be accessed at http://sdmc.krdl.org.sg:8080/promoter/.
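
As an illustration of the C+G-rich versus C+G-poor separation mentioned in the abstract, the following minimal Python sketch computes the C+G fraction of a sequence and assigns it to one of the two groups. The abstract does not state the criterion the system actually uses, so the function names and the 0.5 threshold are assumptions made only for this example and should not be read as the Dragon Promoter Finder method.

```python
def cg_fraction(seq: str) -> float:
    """Return the fraction of C and G among the A/C/G/T bases of seq."""
    s = seq.upper()
    # Count only unambiguous bases so that N or other symbols do not skew the ratio.
    counted = sum(s.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (s.count("C") + s.count("G")) / counted


def classify_by_cg(seq: str, threshold: float = 0.5) -> str:
    """Label a sequence as C+G-rich or C+G-poor.

    The threshold is a hypothetical value chosen for illustration;
    it is not taken from the paper.
    """
    return "CG-rich" if cg_fraction(seq) >= threshold else "CG-poor"


if __name__ == "__main__":
    for example in ("GCGCGCAT", "ATATATGC"):
        print(example, round(cg_fraction(example), 2), classify_by_cg(example))
```

In a two-group design of this kind, each group could then be handled by a recognition model tuned to it, which is one plausible reason for making the split before prediction.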
