A pattern-based nearest neighbor search approach for promoter prediction using DNA structural profiles

MOTIVATION: Identifying core promoters is a key step in understanding gene regulation. However, because promoter sequences are highly diverse, existing approaches predict promoters that are not associated with CpG islands (CGIs) less accurately than CGI-related promoters. This in turn lowers genome-wide promoter prediction accuracy.

RESULTS: In this article, we first systematically analyze the similarities and differences between the two types of promoters (CGI- and non-CGI-related) from a structural perspective, and then devise a unified framework, called PNNP (Pattern-based Nearest Neighbor search for Promoter), to predict both CGI- and non-CGI-related promoters based on their structural features. Our comparative analysis of the structural characteristics of promoters reveals two facts: (i) the structural values of CGI- and non-CGI-related promoters are quite different, but the two types exhibit similar structural patterns; (ii) the structural patterns of promoters are clearly different from those of non-promoter sequences, even though the two kinds of sequences have similar structural values. Extensive experiments demonstrate that the proposed PNNP approach is effective in capturing the structural patterns of promoters and can significantly improve genome-wide promoter prediction performance, especially for non-CGI-related promoters.

AVAILABILITY: The implementation of the program PNNP is available at http://admis.tongji.edu.cn/Projects/pnnp.aspx.
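
The distinction the abstract draws between structural values and structural patterns can be illustrated with a small sketch. This is not the PNNP implementation, and the abstract does not specify PNNP's distance measure. The sketch assumes that a "pattern" means the shape of a profile after its mean level is removed, and it uses a hypothetical dinucleotide scale (`TOY_SCALE`) whose numbers are placeholders rather than measured structural parameters. All function names are illustrative.

```python
import math
from itertools import product

# Hypothetical dinucleotide scale: placeholder values for illustration only,
# NOT measured structural parameters from any published table.
TOY_SCALE = {a + b: float(i) for i, (a, b) in enumerate(product("ACGT", repeat=2))}


def structural_profile(seq, scale=TOY_SCALE, window=3):
    """Map a DNA sequence to a sliding-window average of dinucleotide values."""
    raw = [scale[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return [sum(raw[i:i + window]) / window for i in range(len(raw) - window + 1)]


def value_distance(p, q):
    """Plain Euclidean distance: sensitive to the absolute level of the profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pattern_distance(p, q):
    """Euclidean distance after mean-centring: ignores a constant offset."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    return math.sqrt(sum(((a - mp) - (b - mq)) ** 2 for a, b in zip(p, q)))


def nearest_label(query, labelled_profiles, dist=pattern_distance):
    """Return the label of the reference profile closest to `query` under `dist`."""
    return min(labelled_profiles, key=lambda item: dist(query, item[0]))[1]


# Toy profiles: `shifted` has the same shape as `base` at a different level,
# while `flat` sits at the level of `shifted` but has no shape at all.
base = [1.0, 3.0, 2.0, 5.0, 4.0]
shifted = [v + 10.0 for v in base]
flat = [13.0] * 5
references = [(base, "promoter"), (flat, "non-promoter")]

by_pattern = nearest_label(shifted, references, dist=pattern_distance)
by_value = nearest_label(shifted, references, dist=value_distance)
print(by_pattern, by_value)
```

In this toy setting the pattern-based search matches `shifted` to the reference with the same shape, whereas the value-based search matches it to the reference at the same level. This mirrors, in a simplified way, why a pattern-oriented comparison could treat CGI- and non-CGI-related promoters uniformly.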
