Predicting Plant Pol-II Promoter Based on Subsequence Increment of Overlap Content Diversity

Promoter identification is the first and most important step in understanding the regulation of gene transcription. In this study, a new information content feature, the subsequence increment of overlapping content diversity (IOCD), is presented for the first time to describe the subsequence content of plant Pol-II promoters. The negative datasets cover five different regions of the complete Arabidopsis thaliana genome: coding regions, introns, intergenic regions, 5' untranslated regions (UTRs) and 3' untranslated regions (UTRs). The prediction capacity of the algorithm is tested by 10-fold cross-validation based on k-mer IOCD. The results show that the IOCD describes the promoter sequence content well. Further, based on the interval distances between the transcription start site (TSS) and the translation initiation site (TIS), the method is applied to search the complete genome of Arabidopsis thaliana, and more than ten thousand probable promoters are found.
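
The abstract does not give the IOCD formula, so the following is only a minimal sketch of the kind of quantity it plausibly builds on: the standard measure of diversity, D = N ln N - sum(n_i ln n_i), computed over overlapping k-mer counts, and the increment of diversity between two count sources, ID(X, Y) = D(X + Y) - D(X) - D(Y). The function names and the choice of natural logarithm are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence (window slides one base at a time)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def diversity(counts):
    """Measure of diversity D = N ln N - sum(n_i ln n_i) for a table of counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        c * math.log(c) for c in counts.values() if c > 0
    )


def increment_of_diversity(counts_x, counts_y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); zero when the two count profiles are proportional."""
    return diversity(counts_x + counts_y) - diversity(counts_x) - diversity(counts_y)


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the paper's datasets.
    query = kmer_counts("TATAAATACGCGTATAAA", 2)
    reference = kmer_counts("TATATATAAATATAAAGG", 2)
    print(increment_of_diversity(query, reference))
```

Under this definition, a small increment means the k-mer profile of a query subsequence resembles the reference profile, which is the sense in which such a score could separate promoter from non-promoter regions.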
