A new modeling strategy for eukaryotic promoter recognition and prediction

In this paper, we present a new modeling strategy for the recognition and prediction of promoter regions. Our model rests on two considerations: (1) a promoter region comprises a number of binding sites (consensus sequences) to which RNA polymerase II can bind to start transcription of a gene, and different promoters can be characterized by different combinations of binding sites; (2) the spacing between these binding sites is not always consistent, and some positions show nucleotide variation across genes and species. Based on these considerations, we first split the promoter region into equal intervals and, using the training sets, calculate for each interval the occurrence probability of each word assumed to be a binding-site sequence. We combine these interval probabilities into one matrix, which we refer to as the Interval Position Weight Matrix (IPWM). We then introduce a new promoter modeling strategy and feature extraction method based on a maximal probability model and the IPWM. Tests on large genomic sequences and comparisons with several widely used algorithms show that our algorithm is efficient and achieves higher sensitivity and specificity.
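The interval-probability idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_ipwm` and `max_probability_features`, the word length, the number of intervals, the rule that a word belongs to the interval containing its first base, and the toy training sequences are all assumptions made for illustration. The abstract does not specify how the maximal probability model is defined, so the second function simply takes, per interval, the highest IPWM probability among the words observed there.

```python
from collections import Counter


def build_ipwm(sequences, word_len=2, n_intervals=2):
    """Estimate per-interval word probabilities from equal-length sequences.

    Returns one dict per interval mapping word -> probability.
    (Hypothetical helper; parameters are illustrative assumptions.)
    """
    seq_len = len(sequences[0])
    interval_len = seq_len // n_intervals
    counts = [Counter() for _ in range(n_intervals)]
    for seq in sequences:
        for start in range(seq_len - word_len + 1):
            # Assumption: a word belongs to the interval holding its first base.
            idx = min(start // interval_len, n_intervals - 1)
            counts[idx][seq[start:start + word_len]] += 1
    ipwm = []
    for c in counts:
        total = sum(c.values())
        ipwm.append({word: n / total for word, n in c.items()})
    return ipwm


def max_probability_features(seq, ipwm, word_len=2):
    """One feature per interval: the highest IPWM probability among its words.

    (Assumed reading of a "maximal probability" feature, for illustration.)
    """
    n_intervals = len(ipwm)
    interval_len = len(seq) // n_intervals
    features = [0.0] * n_intervals
    for start in range(len(seq) - word_len + 1):
        idx = min(start // interval_len, n_intervals - 1)
        p = ipwm[idx].get(seq[start:start + word_len], 0.0)
        features[idx] = max(features[idx], p)
    return features


# Toy training set (made up for the example, not real promoter data).
training = ["TATAGCGC", "TATAGGCC", "TATTGCGC"]
ipwm = build_ipwm(training, word_len=2, n_intervals=2)
features = max_probability_features("TATAGCGC", ipwm)
print(features)
```

Because each interval keeps its own word distribution, a word such as `TA` can score highly near the start of the region and poorly elsewhere, which is one way to tolerate the variable spacing between binding sites mentioned in consideration (2).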
